## 15.8 Exploring clusters

In the example with the companies, there were few enough records that I could see the labels on the visualization itself. When you have lots of records, it can be impossible to identify individual records.

But do you need to?

One of the goals is to identify whether there are clusters of individuals. If there are, you may (or may not) be interested in identifying what makes the members of a cluster similar. If you’re really interested in the characteristics of individuals that are close to each other, you can use the plotting techniques we’ve already used for bivariate plots and for PCA.

Cluster analysis is an exploratory technique. We can use some existing summary tools to explore what the groups, or clusters, are like.
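One such summary tool is a table of cluster-wise means. The `chem` data are not reproduced in this section, so the sketch below uses the built-in `USArrests` data as a stand-in and fits its own k-means solution. The data set, the seed, and the choice of three clusters are assumptions made for illustration only.

```r
# Stand-in data: built-in USArrests, NOT the chem data used in the text.
set.seed(2024)
dat <- as.data.frame(scale(USArrests))   # standardize so variables are comparable

# Assumed clustering for illustration: k-means with 3 centers
dat$pred.clust.kmeans <- kmeans(dat, centers = 3, nstart = 25)$cluster

# Mean of every standardized variable within each cluster
clust.means <- aggregate(. ~ pred.clust.kmeans, data = dat, FUN = mean)
round(clust.means, 2)
```

Because the variables are standardized, a cluster mean far from 0 flags a variable on which that cluster differs from the overall average.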

### 15.8.1 Univariate distribution of clusters

What’s the distribution of clusters? This could be answered with a table or barchart.
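Both options take a line or two in base R. This is a minimal sketch on stand-in data (the built-in `USArrests`, with an assumed three-cluster k-means fit), not on the `chem` data.

```r
# Stand-in data and clustering, assumed for illustration only
set.seed(2024)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)

# Table of cluster sizes
clust.counts <- table(km$cluster)
clust.counts

# The same information as a barchart
barplot(clust.counts, xlab = "Cluster", ylab = "Number of records")
```

The table for the `chem` clusters follows.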

```r
table(chem$pred.clust.kmeans)
## 
##  1  2  3 
##  4 14  7
```

### 15.8.2 Bivariate plots

If you have some variables of interest that you think might be contributing to your clustering, create a bivariate plot of those measures against cluster.

```r
ggplot(chem, aes(x=PAYOUTR1, col=as.factor(pred.clust.kmeans))) +
  geom_density() + theme_bw()
```

If you want to see how the clusters vary across two or more dimensions, you could create scatterplots

```r
ggplot(chem, aes(x=PAYOUTR1, y=SALESGR5, col=as.factor(pred.clust.kmeans))) +
  geom_point() + theme_bw() +
  stat_ellipse(aes(group=pred.clust.kmeans), type="norm")
```

### 15.8.3 Multivariate plots

or scatterplot matrices. This one lets me see that PE does a good job of separating out group 1, and possibly ROR5 pulls out group 3.

```r
caret::featurePlot(x = chem[,c(4:10)], y = as.factor(chem$pred.clust.kmeans),
                   plot = "ellipse", auto.key=list(columns=3))
```
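If `caret` is not available, a base R scatterplot matrix colored by cluster gives a similar view. The sketch below again uses `USArrests` as a stand-in for `chem`. The per-variable R-squared at the end is an added, informal way to rank which variables separate the clusters; it is not a method from this chapter.

```r
# Stand-in data and clustering, assumed for illustration only
set.seed(2024)
X <- scale(USArrests)
clust <- kmeans(X, centers = 3, nstart = 25)$cluster

# Scatterplot matrix with points colored by cluster membership
pairs(X, col = clust, pch = 19, main = "Stand-in data colored by cluster")

# Informal ranking: share of each variable's variance explained by cluster
sep <- sapply(as.data.frame(X), function(v) {
  summary(lm(v ~ factor(clust)))$r.squared
})
sort(sep, decreasing = TRUE)
```

A variable with a high value here is one where the clusters sit apart in the corresponding panels of the scatterplot matrix.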

or heatmaps,

```r
library(dendextend)
# not run, just a reminder where clust.ward came from
# clust.ward <- hclust(d, method="ward.D")

dend.heatmap <- clust.ward %>% as.dendrogram() %>% ladderize %>% color_branches(k=3)

gplots::heatmap.2(as.matrix(d),
                  srtCol = 60,
                  dendrogram = "row",
                  Rowv = dend.heatmap,
                  Colv = "Rowv", # order the columns like the rows
                  trace = "none",
                  margins = c(3,6),
                  denscol = "grey",
                  density.info = "density",
                  col = colorspace::diverge_hcl(10, palette="Green-Brown")
)
```

Note that this plot is almost identical to the one we made when visualizing the distance matrix d with the fviz_dist function in 14.3.2. The colors represent the distances. The difference here is that the rows (and columns) are ordered according to their cluster, with the dendrogram shown on the left.
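The effect of that ordering can be checked directly. The sketch below reuses the names `d` and `clust.ward` but builds them from the stand-in `USArrests` data, so the numbers are not those of the `chem` example. It reorders the distance matrix by the dendrogram's leaf order and confirms that each of the three clusters then occupies one contiguous block of rows.

```r
# Stand-in distance matrix and Ward clustering (assumed data, not chem)
d <- dist(scale(USArrests))
clust.ward <- hclust(d, method = "ward.D")
groups <- cutree(clust.ward, k = 3)

# Leaf order of the dendrogram, applied to rows and columns alike
ord <- clust.ward$order
d.ordered <- as.matrix(d)[ord, ord]

# Cluster labels read off in dendrogram order: one run per cluster
runs <- rle(unname(groups[ord]))
runs$lengths

# Base R heatmap of the distances using the same row/column ordering
heatmap(as.matrix(d), Rowv = as.dendrogram(clust.ward),
        symm = TRUE, scale = "none")
```

Three runs means the low-distance squares along the diagonal of the heatmap line up with the three clusters.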

### 15.8.4 Still too many records / variables

So take a sample.

• Filter on just one cluster and l👀k at your data.
• Select groups of variables that hang together well as shown by PCA and only cluster on those.
• Select groups of variables that mean something scientifically, or that you want to know if they are meaningful contributions to clusters.
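These options can be sketched in a few lines. As before, `USArrests` stands in for your own data, and the sample size, the cluster picked, and the variables kept in `keep.vars` are all hypothetical choices.

```r
# Stand-in data and clustering, assumed for illustration only
set.seed(2024)
dat <- as.data.frame(scale(USArrests))
dat$clust <- kmeans(dat, centers = 3, nstart = 25)$cluster

# Take a random sample of records to plot or inspect
samp <- dat[sample(nrow(dat), size = 20), ]

# Filter on just one cluster and look at it
clust1 <- dat[dat$clust == 1, ]
summary(clust1)

# Cluster on a chosen subset of variables only (hypothetical choice)
keep.vars <- c("Murder", "Assault")
km.sub <- kmeans(dat[, keep.vars], centers = 3, nstart = 25)

# Compare the subset-based clusters with the full-variable clusters
agree <- table(full = dat$clust, subset = km.sub$cluster)
agree
```

If the cross-table is close to one large count per row, the selected variables carry most of the information behind the original clusters. Cluster numbers are arbitrary labels, so compare the pattern, not the diagonal.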