15.5 Non-hierarchical clustering

\(K\)-means clustering

  1. Randomly divide the data into \(K\) clusters.
  2. Calculate the centroid of each cluster.
  3. Calculate the distance from each point to each cluster centroid.
  4. Assign each point to the cluster whose centroid it is closest to.
  5. Repeat steps 2–4 until no points change clusters (a minimal R sketch follows this list).
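
Below is a minimal sketch of this procedure using R's built-in `kmeans` function. The data frame `utilities_df` and the choice of \(K = 3\) are hypothetical stand-ins for whatever data and cluster count the notes use.

```r
# `utilities_df` is a hypothetical data frame of numeric variables, one row per company
set.seed(123)                     # k-means starts from a random configuration
scaled_df <- scale(utilities_df)  # standardize so no variable dominates the distances
km <- kmeans(scaled_df, centers = 3)
```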
Figure 16.8

Interpreting the output:

  • The first table shows the mean of each variable within each cluster. Each row is that cluster’s centroid vector.
  • The clustering vector shows the assigned cluster for each record (company). Both pieces live in components of the fitted object, as sketched below.
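
Assuming the fitted object `km` from the sketch above, the two pieces of output correspond to these components:

```r
km$centers  # the first table: per-cluster means of each variable (the centroid vectors)
km$cluster  # the clustering vector: one cluster label per record (company)
```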

The kmeans procedure has an nstart argument that runs the algorithm from several different random starting configurations and keeps the best one. Here I try 10 different configurations and compare the cluster assignments to the first trial, where only one configuration was used.
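
A sketch of that comparison, reusing the hypothetical `scaled_df` from above. `tot.withinss` is the total within-cluster sum of squares that k-means tries to minimize; with `nstart = 10`, kmeans keeps the start that achieves the lowest value.

```r
set.seed(123)
km1  <- kmeans(scaled_df, centers = 3, nstart = 1)   # a single random start
km10 <- kmeans(scaled_df, centers = 3, nstart = 10)  # best of 10 random starts

km1$tot.withinss   # total within-cluster sum of squares from one start
km10$tot.withinss  # never larger: kmeans keeps the best of the 10 starts
```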

Things to note

  • I set a seed. There is a random process going on here (the initial configuration), so setting a seed means I get the same results each time I compile the notes.
  • The cluster labels are arbitrary. There are still 3 clusters, and hum:ami still show up together in one cluster (labeled 2 in the first run, labeled 1 in the second). The labels 1 and 2 have simply been swapped; the cross-tabulation sketched below makes this easy to check.
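
One way to verify that only the labels changed is to cross-tabulate the two assignment vectors (using `km1` and `km10` from the sketch above): if the clusterings agree up to a relabeling, each row has exactly one nonzero cell.

```r
table(one_start = km1$cluster, ten_starts = km10$cluster)
```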

15.5.1 Visualizing k-means

The fviz_cluster function from the factoextra package lets you visualize the clusters in two dimensions.

If you omit the choose.vars argument, the function plots the clusters on the first two principal components instead of on two of the original, standardized variables.
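
A sketch of both calls, reusing `km10` and `scaled_df` from above; the variable names passed to choose.vars are hypothetical.

```r
library(factoextra)

# Plot the clusters on two chosen (standardized) variables:
fviz_cluster(km10, data = scaled_df, choose.vars = c("sales", "fuel_cost"))

# Omit choose.vars: points are projected onto the first two principal components
fviz_cluster(km10, data = scaled_df)
```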