Description
Question 1 (35 points)
The file Groceries.csv contains market basket data. The variables are:
- Customer: Customer Identifier
- Item: Name of Product Purchased
After you have imported the CSV file, please discover association rules using this dataset. For your information, the observations have been sorted in ascending order by Customer and then by Item. Also, duplicated items for each customer have been removed.
a.(5 points) Create a data frame that contains the number of unique items in each customer’s market basket. Draw a histogram of the number of unique items. What are the 25th, 50th, and the 75th percentiles of the histogram?
b.(10 points) We are only interested in the k-itemsets that can be found in the market baskets of at least seventy-five (75) customers. How many itemsets can we find? Also, what is the largest k value among our itemsets?
c.(10 points) Find the association rules whose Confidence metrics are greater than or equal to 1%. How many association rules can we find? Please be reminded that a rule must have a non-empty antecedent and a non-empty consequent. Please do not display those rules in your answer.
d.(5 points) Plot the Support metrics on the vertical axis against the Confidence metrics on the horizontal axis for the rules you have found in (c). Please use the Lift metrics to indicate the size of the marker.
e.(5 points) List the rules whose Confidence metrics are greater than or equal to 60%. Please include their Support and Lift metrics.
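As a rough illustration of the Support, Confidence, and Lift metrics used in parts (b) through (e), the sketch below counts itemsets on a small made-up set of baskets and derives the three metrics for every rule. The toy baskets, the helper names `frequent_itemsets` and `rules`, and the brute-force subset enumeration are assumptions for illustration only. The enumeration grows exponentially with basket size, so a proper Apriori implementation would be needed for the full Groceries.csv.

```python
from itertools import combinations

# Hypothetical toy data: customer -> set of unique items (not from Groceries.csv).
baskets = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"butter", "milk"},
    4: {"bread", "butter"},
}

def frequent_itemsets(baskets, min_count):
    """Return {frozenset: count} for every itemset found in >= min_count baskets."""
    counts = {}
    for items in baskets.values():
        # Brute force: enumerate every non-empty subset of each basket.
        for k in range(1, len(items) + 1):
            for combo in combinations(sorted(items), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

def rules(itemsets, n_baskets, min_confidence):
    """Return a list of (antecedent, consequent, support, confidence, lift)."""
    out = []
    for itemset, count in itemsets.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                antecedent = frozenset(ante)
                consequent = itemset - antecedent
                support = count / n_baskets
                # Subsets of a frequent itemset are frequent, so both lookups succeed.
                confidence = count / itemsets[antecedent]
                lift = confidence / (itemsets[consequent] / n_baskets)
                if confidence >= min_confidence:
                    out.append((antecedent, consequent, support, confidence, lift))
    return out

itemsets = frequent_itemsets(baskets, min_count=2)
found = rules(itemsets, len(baskets), min_confidence=0.01)
print(len(itemsets), max(len(s) for s in itemsets), len(found))
```

For the scatterplot in (d), the third, fourth, and fifth entries of each tuple would supply the vertical position, horizontal position, and marker size respectively.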
Question 2 (30 points)
The K-means algorithm works only with interval features. One way to apply the K-means algorithm to categorical features is to transform them into a new interval feature space. However, this approach can be very inefficient, and it does not produce good results.
For clustering categorical features, we should consider the K-modes clustering algorithm which extends the K-means algorithm by using different dissimilarity measures and a different method for computing cluster centers. See this article for more details. Huang, Z. (1997). “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.” In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1–8. New York: ACM Press.
Please implement the K-modes clustering method in Python and then apply the method to cars.csv. Your input fields are these four categorical features: Type, Origin, DriveTrain, and Cylinders. Please do not remove the missing or blank values in these four features. Instead, consider these values as a separate category.
The cluster centroids are the modes of the input fields. In the case of tied modes, choose the lexically or numerically lowest one.
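One possible way to compute a tie-broken mode is sketched below. The function name `lowest_mode` is hypothetical, and the sketch assumes that all values of a feature can be compared with one another (for example, all strings or all numbers), which may require recoding the missing category first.

```python
from collections import Counter

def lowest_mode(values):
    """Return the most frequent value; ties go to the lowest value."""
    counts = Counter(values)
    top = max(counts.values())
    # Among the values tied for the highest count, pick the smallest.
    return min(v for v, c in counts.items() if c == top)

print(lowest_mode(["SUV", "Sedan", "SUV", "Sedan", "Truck"]))
print(lowest_mode([6, 4, 4, 6, 8]))
```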
Suppose a categorical feature has observed values $v_1, \ldots, v_p$. Their frequencies (i.e., numbers of observations) are $f_1, \ldots, f_p$. The distance metric between two values is $d(v_i, v_j) = 0$ if $v_i = v_j$. Otherwise, $d(v_i, v_j) = \frac{1}{f_i} + \frac{1}{f_j}$. The distance between any two observations is the sum of the distance metrics of the four categorical features.
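A minimal sketch of this distance follows, reading the non-matching case as the sum of reciprocal frequencies, 1/f_i + 1/f_j. The function names and the two-feature toy observations are assumptions for illustration and are not taken from cars.csv.

```python
from collections import Counter

def value_distance(a, b, freq):
    """Distance between two values of one feature, given its value frequencies."""
    if a == b:
        return 0.0
    return 1.0 / freq[a] + 1.0 / freq[b]

def row_distance(row_a, row_b, freqs):
    """Sum the per-feature distances across all features of two observations."""
    return sum(value_distance(a, b, f) for a, b, f in zip(row_a, row_b, freqs))

# Hypothetical observations with two features (not taken from cars.csv).
rows = [("Asia", "Front"), ("Asia", "Rear"), ("Europe", "Front"), ("USA", "Front")]
freqs = [Counter(col) for col in zip(*rows)]  # one frequency table per feature

# Asia (frequency 2) vs Europe (frequency 1), Front vs Front: 1/2 + 1/1 + 0
print(row_distance(rows[0], rows[2], freqs))  # 1.5
```

Because rarer values have larger reciprocals, a mismatch involving a rare category counts for more than a mismatch between two common ones.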
a.(5 points) What are the frequencies of the categorical feature Type?
b.(5 points) What are the frequencies of the categorical feature DriveTrain?
c.(5 points) What is the distance between Origin = ‘Asia’ and Origin = ‘Europe’?
d.(5 points) What is the distance between Cylinders = 5 and Cylinders = Missing?
e.(5 points) Apply the K-modes method with three clusters. How many observations are in each of these three clusters? What are the centroids of these three clusters?
f.(5 points) Display the frequency distribution table of the Origin feature in each cluster.
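The overall K-modes loop might be sketched as below, alternating between assigning each observation to its nearest centroid and recomputing each centroid as the tie-broken mode of its members. This is only a sketch under stated assumptions: the function name `k_modes` is hypothetical, the distance is read as 1/f_i + 1/f_j for non-matching values, values within a feature are assumed comparable, and the centroids are seeded with the first k distinct observations because the assignment does not specify an initialization.

```python
from collections import Counter

def k_modes(rows, k, max_iter=100):
    """Cluster tuples of categorical values; returns (labels, centroids)."""
    n_features = len(rows[0])
    freqs = [Counter(r[j] for r in rows) for j in range(n_features)]

    def dist(a, b):
        # Per-feature distance: 0 if equal, else 1/f_a + 1/f_b, summed over features.
        return sum(0.0 if x == y else 1.0 / f[x] + 1.0 / f[y]
                   for x, y, f in zip(a, b, freqs))

    def mode(values):
        counts = Counter(values)
        top = max(counts.values())
        return min(v for v, c in counts.items() if c == top)  # lowest value on ties

    # Assumption: seed the centroids with the first k distinct observations.
    centroids = list(dict.fromkeys(rows))[:k]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)), key=lambda c: dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(mode(col) for col in zip(*members))
    return labels, centroids

# Hypothetical toy observations (not taken from cars.csv).
toy = [("A", "x"), ("A", "x"), ("B", "y"), ("B", "y"), ("A", "x")]
labels, centroids = k_modes(toy, k=2)
print(Counter(labels), centroids)
```

Different seeding choices can lead to different final clusters, so the initialization should be stated alongside any reported cluster counts.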
Question 3 (35 points)
Apply the Spectral Clustering method to FourCircle.csv. Your input fields are x and y. Wherever needed, specify random_state = 60616 in calling the KMeans function.
a.(5 points) Plot y on the vertical axis versus x on the horizontal axis. How many clusters are there based on your visual inspection?
b.(5 points) Apply the K-means algorithm directly using the number of clusters you identified in (a). Regenerate the scatterplot using the K-means cluster identifiers to control the color scheme. Please comment on this K-means result.
c.(10 points) Apply the nearest neighbor algorithm using the Euclidean distance. We will consider the number of neighbors from 1 to 15. What is the smallest number of neighbors that we should use to discover the clusters correctly? Remember that we may need to try a couple of values first and use the eigenvalue plot to validate our choice.
d.(5 points) Using your choice of the number of neighbors in (c), calculate the Adjacency matrix, the Degree matrix, and finally the Laplacian matrix. How many eigenvalues do you determine are practically zero? Please display their calculated values in scientific notation.
e.(10 points) Apply the K-means algorithm on the eigenvectors that correspond to your “practically” zero eigenvalues. The number of clusters is the number of your “practically” zero eigenvalues.
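The matrix steps of spectral clustering can be sketched with NumPy as follows. Several choices here are assumptions rather than requirements of the assignment: the helper name `knn_laplacian`, the symmetrization rule (two points are linked if either one lists the other among its nearest neighbors), the unnormalized Laplacian L = D - A, the 1e-8 cutoff for "practically zero", and the toy points, which are two separated line segments rather than FourCircle.csv.

```python
import numpy as np

def knn_laplacian(points, n_neighbors):
    """Return the unnormalized Laplacian L = D - A of a symmetric kNN graph."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    n = len(points)
    adjacency = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    adjacency[rows, nearest.ravel()] = 1.0
    # Assumption: link i and j if either lists the other as a neighbor.
    adjacency = np.maximum(adjacency, adjacency.T)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

# Hypothetical data: two separated groups of evenly spaced points.
points = np.array([[float(x), 0.0] for x in range(10)]
                  + [[float(x), 0.0] for x in range(100, 110)])

laplacian = knn_laplacian(points, n_neighbors=2)
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
n_zero = int((eigenvalues < 1e-8).sum())               # assumed "practically zero" cutoff
print(n_zero)
print([f"{v:.4e}" for v in eigenvalues[:n_zero + 1]])
```

The number of near-zero eigenvalues equals the number of connected components of the neighbor graph, which is why it serves as the cluster count. The remaining step would pass `eigenvectors[:, :n_zero]` to a K-means routine with `n_zero` clusters and random_state = 60616.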



