## 15.5 Non-hierarchical clustering

$K$-means clustering works as follows:

1. Divide the data into $K$ initial clusters.
2. Calculate the centroid of each cluster.
3. Calculate the distance from each point to each cluster centroid.
4. Assign each point to the cluster whose centroid is closest.
5. Repeat steps 2-4 until no points change clusters.
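The steps above can be sketched directly in base R. This is illustrative only (no handling of clusters that become empty mid-iteration; use `kmeans()` in practice), and the function name `my_kmeans` and the use of the built-in `iris` data are my own choices for the sketch:

```r
# A minimal k-means sketch following the numbered steps above.
# Assumes x is a numeric matrix/data frame; does not handle empty clusters.
my_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: random initial partition into K clusters
  cl <- sample(rep_len(1:K, nrow(x)))
  for (i in seq_len(max_iter)) {
    # Step 2: centroid (column means) of each cluster
    centers <- apply(x, 2, function(col) tapply(col, cl, mean))
    # Steps 3-4: distance from each point to each centroid; assign nearest
    d <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]
    new_cl <- max.col(-d)  # column index of the smallest distance per row
    # Step 5: stop when no point changes cluster
    if (all(new_cl == cl)) break
    cl <- new_cl
  }
  cl
}

set.seed(1)
table(my_kmeans(scale(iris[, 1:4]), K = 3))
```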
```r
k <- kmeans(cluster.dta, centers=3)
k
## K-means clustering with 3 clusters of sizes 4, 7, 14
##
## Cluster means:
##         ROR5         DE   SALESGR5        EPS5       NPM1         PE
## 1  0.5959748  1.4185646  1.9308011  1.54316849  0.6882745  2.0643489
## 2 -1.3246523 -0.4541930 -0.1171821  0.02674634  0.4116164 -0.1788732
## 3  0.4920477 -0.1782077 -0.4930664 -0.45427845 -0.4024581 -0.5003774
##      PAYOUTR1
## 1  0.94802477
## 2 -0.45122839
## 3 -0.04525003
##
## Clustering vector:
## dia dow stf  dd  uk psm gra hpc mtc acy  cz ald rom rei hum hca nme ami
##   2   2   2   2   3   3   3   3   3   3   3   3   3   3   1   1   1   1
## ahs lks win sgl slc  kr  sa
##   2   2   2   3   3   3   3
##
## Within cluster sum of squares by cluster:
## [1] 14.22540 21.15603 43.58324
##  (between_SS / total_SS =  53.0 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"
```

Interpreting the output:

• The first table shows the mean of each variable within each cluster; each row is that cluster’s ‘centroid’ vector.
• The clustering vector shows the assigned cluster for each record (company).
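The pieces listed under "Available components" can be pulled out of the fitted object with `$`. A quick sketch using the built-in iris data (since `cluster.dta` is specific to these notes; the object name `k.demo` is mine):

```r
# Fit k-means on standardized numeric columns of iris
k.demo <- kmeans(scale(iris[, 1:4]), centers = 3)

k.demo$centers      # centroid matrix: one row per cluster, one column per variable
k.demo$cluster      # assigned cluster for each record
k.demo$size         # number of records in each cluster
k.demo$tot.withinss # total within-cluster sum of squares
```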

The kmeans procedure has an nstart argument that runs the algorithm from multiple random starting configurations and keeps the best result (the one with the lowest total within-cluster sum of squares). Here I try 10 different configurations and compare the cluster assignments to the first trial, where only one configuration was used.

```r
set.seed(4567)
k2 <- kmeans(cluster.dta, centers=3, nstart=10)
rbind(k$cluster, k2$cluster)
##      dia dow stf dd uk psm gra hpc mtc acy cz ald rom rei hum hca nme ami
## [1,]   2   2   2  2  3   3   3   3   3   3  3   3   3   3   1   1   1   1
## [2,]   1   1   1  1  3   3   3   3   3   3  3   3   3   3   2   2   2   2
##      ahs lks win sgl slc kr sa
## [1,]   2   2   2   3   3  3  3
## [2,]   1   1   1   3   3  3  3
```

Things to note

• I set a seed, which means there’s a random process going on here. Setting the seed ensures I get the same results each time I compile the notes.
• The cluster memberships are invariant, but the labels are arbitrary. There are still 3 clusters, with hum:ami showing up together in one cluster (numbered 1 in the first row, 2 in the second row). Basically the cluster labels 1 and 2 have been swapped.
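One way to confirm that only the labels changed, not the memberships, is to cross-tabulate the two clustering vectors: if the solutions agree up to relabeling, each row of the table has a single non-zero cell. A sketch, again on the built-in iris data since `cluster.dta` is specific to these notes:

```r
set.seed(4567)
x <- scale(iris[, 1:4])
a <- kmeans(x, centers = 3)$cluster              # single random start
b <- kmeans(x, centers = 3, nstart = 10)$cluster # best of 10 starts
# Rows: labels from the single-start fit; columns: labels from the
# 10-start fit. Agreement up to relabeling shows as one cell per row.
table(a, b)
```

With the objects from these notes, the same check is `table(k$cluster, k2$cluster)`.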

### 15.5.1 Visualizing k-means clusters

The fviz_cluster function from the factoextra package lets you visualize clusters in two dimensions.

```r
fviz_cluster(object=k2, data=cluster.dta, choose.vars = c("ROR5", "DE")) +
  theme_bw() + scale_colour_viridis_d() + scale_fill_viridis_d()
```

If you omit the choose.vars argument, this function plots the clusters on the first two principal components of the standardized data instead of two of the original variables.

```r
fviz_cluster(object=k2, data=cluster.dta) +
  theme_bw() + scale_colour_viridis_d() + scale_fill_viridis_d()
```