[SOLVED] CS373-Homework 5 K-Means

20.99 $

Category:

Description

5/5 - (1 vote)

1.1        Theory

  1. What are the benefits of the k-means clustering algorithm? What are the issues? In which situations should we use k-means? Your answer should compare to other algorithms learned in class (both unsupervised and supervised) in less than 4 sentences.

1.2        Implementation

You need to implement the k-means clustering algorithm for this part. This part could be completed by editing only kmeans.py. You need to follow the description of the models discussed in the lecture slides (link) with the following specifications.

Features: Consider the 4 continuous attributes in yelp.csv for X.

Distance: Use Euclidean distance unless otherwise specified.

Score function: Use within-cluster sum of squared error (where rk is the centroid of cluster Ck, d is the distance function.).

K

wc(C) = XX d(x(i),rk)2                                                                                                       (1)

k=1x(i)∈Ck

Make sure to also implement the following cluster options as described in Section 1.4.

  1. The four original attributes for clustering (corresponding to 3.1).
  2. A log transform to reviewCount and checkins (corresponding to 3.2).
  3. Standardize the 4 attributes for clustering (corresponding to 3.3).
  4. Four original attributes and Manhattan distance for clustering (corresponding to 3.4).
  5. A random sample of the data for clustering (corresponding to 3.5).

Report the results obtained on the given train set in your report.

2         Analysis

You only need to include your plots and discussions in your report. Make sure that the code you submit doesn’t include any changes you don’t want to be included.

  1. Cluster the Yelp data using k-means.
    • Use a random set of examples as the initial centroids.
    • Use values of K = [3,6,9,12,24].
    • Plot the within-cluster sum of squares (wc) as a function of K.
    • Choose an appropriate K from the plot and argue why you choose this particular K.
    • For the chosen value of K, plot the clusters with their centroids in two ways: first using latitude longitude and second using reviewCount, checkins. Discuss whether any patterns are visible.
  2. Do a log transform of reviewCount, checkins. Describe how you expect the transformation to change the clustering results. Then repeat the analysis (1). Discuss any differences in the results.

P.T.O

  1. Transform the four original attributes so that each attribute has mean = 0 and stdev = 1. You can do this with the numpy functions, numpy.mean() and numpy.std() (i.e., subtract mean, divide by stdev). Describe how you expect the transformation to change the clustering results. Then repeat the analysis (1). Discuss any differences in the results.
  2. Use Manhattan distance instead of Euclidean distance in the algorithm. Describe how you expect the change in the clustering results. Then repeat the analysis (1). Discuss any differences in the results.
  3. Take a 6% random sample of the data. Describe how you expect the downsampling to change the clustering results. Then run the analysis (i) five times and report the average performance. Specially, you should use a single random 6% sample of the data. Then run 5 trials where you start k-means from different random choices of the initial centroids. Report the average wc when you plot wc vs. K. For your chosen K, determine which trial had performance closest to the reported average. Plot the centroids from that trial. Discuss any differences in the results and comment on the variability you observe.
  4. Improve the score function. To evaluate the clustering, it is not sufficient to measure only the within-cluster sum of squares (wc) that you used above. It is also desired to have each cluster separate from others as much as possible. To improve the resulting clustering, define your own score function that takes into account not only the compactness of the clusters but also the separation of the clusters. Write a formal mathematical expression of your score function and explain why you think your score function is better than the within-cluster sum of squares. Also, using the best configuration from Questions 1-5, plot the results of your score function for K = [3, 6, 9, 12, 24], and compare the results to the appropriate algorithm from Question 1-5.