## 15.5 Non-hierarchical clustering

$K$-means clustering works as follows:

1. Divide the data into $K$ initial clusters.
2. Calculate the centroid of each cluster.
3. Calculate the distance from each point to each cluster centroid.
4. Assign each point to the cluster whose centroid is closest.
5. Repeat steps 2-4 until no points change clusters.
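The steps above can be sketched directly in base R. This is illustrative only (no handling of clusters that become empty mid-iteration; use `kmeans()` in practice), and the function name `my_kmeans` and the use of the built-in `iris` data are my own choices for the sketch:

```r
# A minimal k-means sketch following the numbered steps above.
# Assumes x is a numeric matrix/data frame; does not handle empty clusters.
my_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: random initial partition into K clusters
  cl <- sample(rep_len(1:K, nrow(x)))
  for (i in seq_len(max_iter)) {
    # Step 2: centroid (column means) of each cluster
    centers <- apply(x, 2, function(col) tapply(col, cl, mean))
    # Steps 3-4: distance from each point to each centroid; assign nearest
    d <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]
    new_cl <- max.col(-d)  # column index of the smallest distance per row
    # Step 5: stop when no point changes cluster
    if (all(new_cl == cl)) break
    cl <- new_cl
  }
  cl
}

set.seed(1)
table(my_kmeans(scale(iris[, 1:4]), K = 3))
```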
```r
k <- kmeans(cluster.dta, centers=3)
k
## K-means clustering with 3 clusters of sizes 4, 7, 14
##
## Cluster means:
##         ROR5         DE   SALESGR5        EPS5       NPM1         PE
## 1  0.5959748  1.4185646  1.9308011  1.54316849  0.6882745  2.0643489
## 2 -1.3246523 -0.4541930 -0.1171821  0.02674634  0.4116164 -0.1788732
## 3  0.4920477 -0.1782077 -0.4930664 -0.45427845 -0.4024581 -0.5003774
##      PAYOUTR1
## 1  0.94802477
## 2 -0.45122839
## 3 -0.04525003
##
## Clustering vector:
## dia dow stf  dd  uk psm gra hpc mtc acy  cz ald rom rei hum hca nme ami
##   2   2   2   2   3   3   3   3   3   3   3   3   3   3   1   1   1   1
## ahs lks win sgl slc  kr  sa
##   2   2   2   3   3   3   3
##
## Within cluster sum of squares by cluster:
## [1] 14.22540 21.15603 43.58324
##  (between_SS / total_SS =  53.0 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"
```

Interpreting the output:

• The first table shows the mean of each variable within each cluster; each row is that cluster’s ‘centroid’ vector.
• The clustering vector shows the assigned cluster for each record (company).
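The pieces listed under "Available components" can be pulled out of the fitted object with `$`. A quick sketch using the built-in iris data (since `cluster.dta` is specific to these notes; the object name `k.demo` is mine):

```r
# Fit k-means on standardized numeric columns of iris
k.demo <- kmeans(scale(iris[, 1:4]), centers = 3)

k.demo$centers      # centroid matrix: one row per cluster, one column per variable
k.demo$cluster      # assigned cluster for each record
k.demo$size         # number of records in each cluster
k.demo$tot.withinss # total within-cluster sum of squares
```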

The kmeans procedure has an nstart argument that runs the algorithm from multiple random starting configurations and keeps the best result (the one with the lowest total within-cluster sum of squares). Here I try 10 different configurations and compare the cluster assignments to the first trial, where only one configuration was used.

```r
set.seed(4567)
k2 <- kmeans(cluster.dta, centers=3, nstart=10)
rbind(k$cluster, k2$cluster)
##      dia dow stf dd uk psm gra hpc mtc acy cz ald rom rei hum hca nme ami
## [1,]   2   2   2  2  3   3   3   3   3   3  3   3   3   3   1   1   1   1
## [2,]   1   1   1  1  3   3   3   3   3   3  3   3   3   3   2   2   2   2
##      ahs lks win sgl slc kr sa
## [1,]   2   2   2   3   3  3  3
## [2,]   1   1   1   3   3  3  3
```

Things to note

• I set a seed, which means there’s a random process going on here. Setting the seed ensures I get the same results each time I compile the notes.
• The cluster memberships are invariant, but the labels are arbitrary. There are still 3 clusters, with hum:ami showing up together in one cluster (numbered 1 in the first row, 2 in the second row). Basically the cluster labels 1 and 2 have been swapped.
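One way to confirm that only the labels changed, not the memberships, is to cross-tabulate the two clustering vectors: if the solutions agree up to relabeling, each row of the table has a single non-zero cell. A sketch, again on the built-in iris data since `cluster.dta` is specific to these notes:

```r
set.seed(4567)
x <- scale(iris[, 1:4])
a <- kmeans(x, centers = 3)$cluster              # single random start
b <- kmeans(x, centers = 3, nstart = 10)$cluster # best of 10 starts
# Rows: labels from the single-start fit; columns: labels from the
# 10-start fit. Agreement up to relabeling shows as one cell per row.
table(a, b)
```

With the objects from these notes, the same check is `table(k$cluster, k2$cluster)`.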

### 15.5.1 Visualizing k-means clusters

The fviz_cluster function from the factoextra package lets you visualize clusters in two dimensions.

```r
fviz_cluster(object=k2, data=cluster.dta, choose.vars = c("ROR5", "DE")) +
  theme_bw() + scale_colour_viridis_d() + scale_fill_viridis_d()
```

If you omit the choose.vars argument, this function plots the clusters on the first two principal components of the standardized data instead of two of the original variables.

```r
fviz_cluster(object=k2, data=cluster.dta) +
  theme_bw() + scale_colour_viridis_d() + scale_fill_viridis_d()
```