## 15.6 Choosing K

This section is under construction. I’m not overly happy with getting different results with the same gap statistic.

### 15.6.1 Visually

nclust.2 <- kmeans(cluster.dta, centers=2, nstart=10) %>%
fviz_cluster(data=cluster.dta, geom="point") + theme_bw() +
ggtitle("2 clusters")+
scale_colour_viridis_d() + scale_fill_viridis_d()
nclust.3 <- kmeans(cluster.dta, centers=3, nstart=10) %>%
fviz_cluster(data=cluster.dta, geom="point") + theme_bw() +
ggtitle("3 clusters")+
scale_colour_viridis_d() + scale_fill_viridis_d()
nclust.4 <- kmeans(cluster.dta, centers=4, nstart=10) %>%
fviz_cluster(data=cluster.dta, geom="point") + theme_bw() +
ggtitle("4 clusters")+
scale_colour_viridis_d() + scale_fill_viridis_d()
nclust.5 <- kmeans(cluster.dta, centers=5, nstart=10) %>%
fviz_cluster(data=cluster.dta, geom="point") + theme_bw() +
ggtitle("5 clusters")+
scale_colour_viridis_d() + scale_fill_viridis_d()

gridExtra::grid.arrange(nclust.2, nclust.3, nclust.4, nclust.5, nrow=2)

• Three clusters give the best-looking groupings.
• The cluster on the right (high on PC2) stands out on its own regardless of what happens with the other clusters.
• Cluster #2 contains the same points in both the 3- and 4-cluster models.
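
Whether a cluster really holds the same points in two solutions can be checked with a cross-tabulation of the assignments rather than by eye. The sketch below is a minimal illustration, not the analysis above: it assumes the scaled built-in `USArrests` data as a stand-in for `cluster.dta`, and the object names `fit.3`, `fit.4`, and `xtab` are made up for the example.

```r
# Stand-in data (assumption): scaled USArrests replaces the chapter's cluster.dta
dat <- scale(USArrests)

set.seed(12345)  # kmeans uses random starting centers
fit.3 <- kmeans(dat, centers = 3, nstart = 10)
fit.4 <- kmeans(dat, centers = 4, nstart = 10)

# Rows: cluster labels from the 3-cluster fit; columns: labels from the 4-cluster fit
xtab <- table(k3 = fit.3$cluster, k4 = fit.4$cluster)
xtab
```

A cluster is unchanged between the two solutions when its row has a single non-zero cell and that cell is also the only non-zero entry in its column. Cluster numbers are arbitrary labels, so the matching cluster need not carry the same number in both fits.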

### 15.6.2 Elbow method

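Similar to a scree plot: plot the total within-cluster sum of squares against the number of clusters and look for the "elbow", the point after which adding another cluster gives only a small further reduction. The within-cluster variance keeps shrinking as clusters are added, so the goal is the bend in the curve, not the minimum.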

fviz_nbclust(cluster.dta, kmeans, method="wss")

There is no real “elbow”, but $k=7$ is where I’d say the slope changes.
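
To see what the elbow plot is built from, the within-cluster sums of squares can be computed by hand with base R. This is a rough sketch under assumptions: scaled `USArrests` stands in for `cluster.dta`, and `k.values`, `wss`, and `wss.drop` are names invented for the example.

```r
# Stand-in data (assumption): scaled USArrests replaces the chapter's cluster.dta
dat <- scale(USArrests)

set.seed(12345)  # kmeans uses random starts, so fix the seed
k.values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k.values, function(k) {
  kmeans(dat, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS gained by each additional cluster
wss.drop <- -diff(wss)
data.frame(k = k.values[-1], wss = round(wss[-1], 1), drop = round(wss.drop, 1))

# Base-graphics version of the elbow plot
plot(k.values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

Looking at the `drop` column alongside the plot can help when the curve has no sharp bend: the elbow is roughly where the drops stop being large relative to the ones before them.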

### 15.6.3 Gap statistic

• This can be used for both hierarchical and non-hierarchical clustering.
• Compares total intracluster variation with the expected value under a null distribution of no clustering.
• See Tibshirani et al. for more details.

set.seed(12345)
fviz_nbclust(cluster.dta, kmeans, method="gap_stat")
fviz_nbclust(cluster.dta, hcut, method="gap_stat")
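
The gap statistic compares the observed clustering with simulated reference data sets, so its value depends on the state of the random number generator. That may be one source of run-to-run differences. The sketch below calls `cluster::clusGap()` directly and resets the seed immediately before each run. It is an illustration under assumptions, not the chapter's analysis: scaled `USArrests` stands in for `cluster.dta`, `run_gap` is a hypothetical helper, and `K.max = 8`, `B = 50`, and `nstart = 25` are arbitrary choices.

```r
library(cluster)  # provides clusGap() and maxSE()

# Stand-in data (assumption): scaled USArrests replaces the chapter's cluster.dta
dat <- scale(USArrests)

# Hypothetical helper: reset the seed, then compute the gap statistic for k = 1..8
run_gap <- function(seed) {
  set.seed(seed)
  clusGap(dat, FUNcluster = kmeans, nstart = 25,
          K.max = 8, B = 50, verbose = FALSE)
}

gap.a <- run_gap(12345)
gap.b <- run_gap(12345)

# Same seed, same data, same settings: the two tables should match exactly
identical(gap.a$Tab, gap.b$Tab)

# Pick k with the "firstSEmax" rule: the smallest k whose gap is within
# one standard error of the first local maximum
k.hat <- maxSE(gap.a$Tab[, "gap"], gap.a$Tab[, "SE.sim"], method = "firstSEmax")
k.hat
```

With the seed set only once before two consecutive calls, the second call starts from a different generator state than it would if run on its own, so setting the seed before each call keeps each result reproducible in isolation. A larger `B` reduces the simulation noise in the estimate at the cost of run time.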