Name: CS584-Assignment 2- The file Groceries.csv contains market basket data Solved

Question 1 (35 points)

The file Groceries.csv contains market basket data. The variables are:

Customer: Customer Identifier
Item: Name of Product Purchased

After you have imported the CSV file, please discover association rules using this dataset. For your information, the observations have been sorted in ascending order by Customer and then by Item. Also, duplicated items for each customer have been removed.
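As a rough illustration of the layout described above, the following sketch groups (Customer, Item) rows into one basket per customer and counts the unique items in each. The `rows` list is made-up toy data, not the contents of Groceries.csv, and the quartile method (linear interpolation between order statistics) is an assumption, since the assignment does not name one.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical toy rows in the same (Customer, Item) layout as Groceries.csv.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"),
    (3, "bread"), (3, "eggs"), (3, "milk"),
    (4, "bread"), (4, "butter"), (4, "eggs"), (4, "milk"),
    (5, "milk"),
]

# Group items into one set per customer; a set also guards against duplicates.
baskets = defaultdict(set)
for customer, item in rows:
    baskets[customer].add(item)

# Number of unique items in each customer's basket.
n_items = {customer: len(items) for customer, items in baskets.items()}

# Quartiles using linear interpolation between order statistics (assumed method).
q1, q2, q3 = quantiles(list(n_items.values()), n=4, method="inclusive")

print(n_items)
print(q1, q2, q3)
```

With the real file, the same counts would feed the histogram requested in part (a).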

a.(5 points) Create a data frame that contains the number of unique items in each customer’s market basket. Draw a histogram of the number of unique items. What are the 25th, 50th, and 75th percentiles of the number of unique items?

b.(10 points) We are only interested in the k-itemsets that can be found in the market baskets of at least seventy-five (75) customers. How many itemsets can we find? Also, what is the largest k value among our itemsets?

c.(10 points) Find the association rules whose Confidence metrics are greater than or equal to 1%. How many association rules can we find? Please be reminded that a rule must have a non-empty antecedent and a non-empty consequent. Please do not display those rules in your answer.

d.(5 points) Plot the Support metrics on the vertical axis against the Confidence metrics on the horizontal axis for the rules you have found in (c). Please use the Lift metrics to indicate the size of the marker.

e.(5 points) List the rules whose Confidence metrics are greater than or equal to 60%. Please include their Support and Lift metrics.
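
The quantities used in parts (b) through (e) can be sketched on a small example. The code below is a brute-force illustration under stated assumptions, not the intended solution: the baskets are hypothetical toy data, `min_count = 2` stands in for the 75-customer threshold, and the exhaustive enumeration is only practical for a handful of items (a real run would use an Apriori-style routine).

```python
from itertools import combinations

# Hypothetical toy baskets; the assignment uses Groceries.csv instead.
baskets = [
    {"bread", "milk"},
    {"bread"},
    {"bread", "eggs", "milk"},
    {"bread", "butter", "eggs", "milk"},
    {"milk"},
]
n = len(baskets)
min_count = 2  # toy stand-in for the 75-customer threshold


def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)


# Brute-force enumeration of frequent k-itemsets (fine for toy data only).
items = sorted(set().union(*baskets))
frequent = {}
for k in range(1, len(items) + 1):
    found = {frozenset(c): count(set(c)) for c in combinations(items, k)}
    found = {s: c for s, c in found.items() if c >= min_count}
    if not found:
        break  # no frequent k-itemset implies no frequent (k+1)-itemset
    frequent.update(found)

# Rules: split each frequent itemset into a non-empty antecedent and consequent.
rules = []
for itemset, cnt in frequent.items():
    for r in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), r):
            ante = frozenset(ante)
            cons = itemset - ante
            support = cnt / n                         # P(antecedent and consequent)
            confidence = cnt / frequent[ante]         # P(consequent | antecedent)
            lift = confidence / (frequent[cons] / n)  # confidence / P(consequent)
            rules.append((set(ante), set(cons), support, confidence, lift))

print(len(frequent), max(len(s) for s in frequent), len(rules))
```

On this toy data the rule {eggs} -> {bread} has support 0.4, confidence 1.0, and lift 1.25, because both baskets containing eggs also contain bread, while bread alone appears in 4 of 5 baskets.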

Question 2 (30 points)

The K-means algorithm works only with interval features. One way to apply the K-means algorithm to categorical features is to transform them into a new interval feature space. However, this approach can be very inefficient, and it does not produce good results.

For clustering categorical features, we should consider the K-modes clustering algorithm, which extends the K-means algorithm by using different dissimilarity measures and a different method for computing cluster centers. For more details, see Huang, Z. (1997). “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.” In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1–8. New York: ACM Press.

Please implement the K-modes clustering method in Python and then apply it to cars.csv. Your input fields are these four categorical features: Type, Origin, DriveTrain, and Cylinders. Please do not remove the missing or blank values in these four features. Instead, consider these values as a separate category.

The cluster centroids are the modes of the input fields. In the case of tied modes, choose the lexically or numerically lowest one.
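
The tie-breaking rule for modes can be sketched as follows. `lowest_mode` is a hypothetical helper name, and the sketch assumes all values of a feature share one comparable type (for example, all strings, with missing values already recoded to a label such as "Missing"). Note that Python compares strings by code point, so the ordering is case-sensitive.

```python
from collections import Counter


def lowest_mode(values):
    """Most frequent value; ties go to the lowest value in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


# "SUV" and "Sedan" are tied with two observations each.
type_mode = lowest_mode(["SUV", "Sedan", "Sedan", "SUV", "Truck"])

# 4 and 6 are tied; the numerically lowest wins.
cyl_mode = lowest_mode([6, 4, 6, 4, 8])

print(type_mode, cyl_mode)
```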

Suppose a categorical feature has observed values 𝑣₁, … , 𝑣_𝑝. Their frequencies (i.e., number of observations) are 𝑓₁, … , 𝑓_𝑝. The distance metric between two values is 𝑑(𝑣_𝑖, 𝑣_𝑗) = 0 if 𝑣_𝑖 = 𝑣_𝑗. Otherwise, 𝑑(𝑣_𝑖, 𝑣_𝑗) = 1/𝑓_𝑖 + 1/𝑓_𝑗. The distance between any two observations is the sum of the distance metric of the four categorical features.
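
Reading the distance between two unequal values as 1/f_i + 1/f_j, a minimal sketch of the metric looks like this. The rows are hypothetical toy data with only two features, and `value_distance` and `observation_distance` are illustrative names; the actual frequencies must come from cars.csv.

```python
from collections import Counter

# Hypothetical toy observations: (Origin, DriveTrain).
rows = [
    ("Asia", "Front"), ("Asia", "Front"), ("Asia", "All"), ("Asia", "Rear"),
    ("Europe", "Rear"), ("Europe", "All"),
    ("USA", "Front"), ("USA", "Front"), ("USA", "Rear"), ("USA", "All"),
]

# One frequency table per feature, computed over all observations.
freqs = [Counter(column) for column in zip(*rows)]


def value_distance(a, b, freq):
    """0 if the values match, otherwise 1/f_a + 1/f_b."""
    return 0.0 if a == b else 1.0 / freq[a] + 1.0 / freq[b]


def observation_distance(x, y, freqs):
    """Sum of the per-feature value distances."""
    return sum(value_distance(a, b, f) for a, b, f in zip(x, y, freqs))


# Asia occurs 4 times and Europe twice in the toy data: 1/4 + 1/2.
d_origin = value_distance("Asia", "Europe", freqs[0])
d_obs = observation_distance(("Asia", "Front"), ("Europe", "Rear"), freqs)

print(d_origin, d_obs)
```

Rare values sit farther from everything else under this metric, since a small frequency makes its reciprocal large.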

a.(5 points) What are the frequencies of the categorical feature Type?

b.(5 points) What are the frequencies of the categorical feature DriveTrain?

c.(5 points) What is the distance between Origin = ‘Asia’ and Origin = ‘Europe’?

d.(5 points) What is the distance between Cylinders = 5 and Cylinders = Missing?

e.(5 points) Apply the K-modes method with three clusters. How many observations are in each of these three clusters? What are the centroids of these three clusters?

f.(5 points) Display the frequency distribution table of the Origin feature in each cluster.
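
Putting the pieces of this question together, one possible K-modes loop is sketched below. Several choices are assumptions rather than anything the assignment specifies: the initial centroids are the first k distinct rows, ties in distance go to the lowest cluster index, and the toy data has two features and two clusters instead of four features and three clusters. The names `kmodes` and `lowest_mode` are illustrative.

```python
from collections import Counter


def lowest_mode(values):
    """Most frequent value; ties go to the lowest value in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


def kmodes(rows, k, max_iter=20):
    """Cluster tuples of categorical values; returns (labels, centroids)."""
    freqs = [Counter(column) for column in zip(*rows)]

    def dist(x, c):
        # Per-feature distance: 0 if equal, else 1/f_a + 1/f_b; summed over features.
        return sum(0.0 if a == b else 1.0 / f[a] + 1.0 / f[b]
                   for a, b, f in zip(x, c, freqs))

    # Assumption: seed with the first k distinct rows (no initializer is specified).
    centroids = list(dict.fromkeys(rows))[:k]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid, lowest index on ties.
        new_labels = [min(range(len(centroids)), key=lambda c: dist(x, centroids[c]))
                      for x in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the per-feature mode of its members.
        for c in range(len(centroids)):
            members = [x for x, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = tuple(lowest_mode(col) for col in zip(*members))
    return labels, centroids


# Hypothetical toy observations: (Type, Origin), with "Missing" kept as a category.
rows = [
    ("Sedan", "Asia"), ("Sedan", "Asia"), ("SUV", "USA"),
    ("SUV", "USA"), ("Sedan", "Asia"), ("SUV", "Missing"),
]
labels, centroids = kmodes(rows, k=2)

# Frequency table of Origin within each cluster, as asked for in part (f).
origin_by_cluster = [Counter(x[1] for x, lab in zip(rows, labels) if lab == c)
                     for c in range(len(centroids))]

print(labels, centroids, origin_by_cluster)
```

Because the result depends on the starting centroids, a different initialization can produce different clusters on real data.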

Question 3 (35 points)

Apply the Spectral Clustering method to FourCircle.csv. Your input fields are x and y. Wherever needed, specify random_state = 60616 in calling the KMeans function.

a.(5 points) Plot y on the vertical axis versus x on the horizontal axis. How many clusters are there based on your visual inspection?

b.(5 points) Apply the K-means algorithm directly, using the number of clusters you identified in (a). Regenerate the scatterplot using the K-means cluster identifiers to control the color scheme. Please comment on this K-means result.

c.(10 points) Apply the nearest neighbor algorithm using the Euclidean distance. We will consider the number of neighbors from 1 to 15. What is the smallest number of neighbors that we should use to discover the clusters correctly? Remember that we may need to try a couple of values first and use the eigenvalue plot to validate our choice.

d.(5 points) Using your choice of the number of neighbors in (c), calculate the Adjacency matrix, the Degree matrix, and finally the Laplacian matrix. How many eigenvalues do you determine are practically zero? Please display their calculated values in scientific notation.

e.(10 points) Apply the K-means algorithm on the eigenvectors that correspond to your “practically” zero eigenvalues. The number of clusters is the number of your “practically” zero eigenvalues.
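
The matrix steps in this question can be sketched with NumPy on a small example. This is one common construction, offered under assumptions: the adjacency matrix is binary and symmetrized (the course may use a weighted variant), the points are hypothetical toy data rather than FourCircle.csv, and the `1e-8` cutoff for "practically zero" is an arbitrary choice. The final K-means step is left out; it would run on `embedding`.

```python
import numpy as np

# Hypothetical toy data: two tight, well-separated groups of points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
n_neighbors = 2

# Pairwise Euclidean distances.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# Adjacency: link each point to its nearest neighbors (column 0 is the point
# itself), then symmetrize so the graph is undirected.
order = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
adjacency = np.zeros_like(dist)
for i, nbrs in enumerate(order):
    adjacency[i, nbrs] = 1.0
adjacency = np.maximum(adjacency, adjacency.T)

# Degree matrix and unnormalized Laplacian L = D - A.
degree = np.diag(adjacency.sum(axis=1))
laplacian = degree - adjacency

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
n_zero = int((np.abs(eigenvalues) < 1e-8).sum())  # assumed cutoff

# Eigenvectors for the near-zero eigenvalues; K-means would be applied to these rows.
embedding = eigenvectors[:, :n_zero]

print([f"{v:.4e}" for v in eigenvalues[:n_zero]])
print(n_zero)
```

The number of near-zero eigenvalues equals the number of connected components in the neighbor graph, which is why too few neighbors can split a true cluster and too many can merge separate ones.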
