15.9 What to watch out for

  • \(K\)-means clustering requires the number of clusters to be specified up front. Hierarchical clustering does not have this restriction, since the tree can be cut at any level after it is built (see the first sketch after this list).
  • The agglomerative coefficient increases with the number of rows, so you can NOT use it to compare two datasets that are very different in size (see the sketch after this list).
  • Cluster analysis methods are sensitive to outliers.
  • Different results can occur if you change the order of the data (see the sketch after this list).
  • In \(K\)-means, the centroid does not have to be part of the data set. Alternative methods such as k-medians or k-medoids restrict the cluster centers to be an actual record that is ‘closest’ to the calculated mean (see the sketch after this list).
  • The number of clusters depends on the desired level of similarity, i.e., where you choose to cut the dendrogram.
  • Different algorithms can produce different results; this is especially true across software programs. See PMA6 Table 16.4 for an example comparing cluster analysis results on the chemical data set in SAS, R, and Stata.
  • Sample size. You can cluster on datasets with thousands of records, but know that the dendrogram leaves will not be readable. In these cases, consider suppressing the leaf labels and summarizing the resulting clusters numerically instead (see the last sketch after this list).
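
The sketches below illustrate several of the points above in R. They use built-in or simulated data as stand-ins (not the chemical data set from PMA6), and the specific functions, seeds, and values of \(k\) are illustrative assumptions only.

First, the up-front-\(k\) point: `kmeans()` needs the number of clusters before it is fit, whereas a hierarchical tree is fit once and can then be cut at any number of clusters, or at any similarity level, after the fact.

```r
dat <- scale(USArrests)            # built-in data as a stand-in

# k-means: the number of clusters must be specified before fitting
km <- kmeans(dat, centers = 3, nstart = 25)
table(km$cluster)

# hierarchical: fit once, then cut wherever you like afterwards
hc <- hclust(dist(dat), method = "ward.D2")
table(cutree(hc, k = 3))           # cut to get exactly 3 clusters
table(cutree(hc, h = 5))           # or cut at a chosen dissimilarity (height)
```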
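
A sketch of why the agglomerative coefficient should not be compared across very different sample sizes: fitting `cluster::agnes()` to a random subsample and to the full data (same variables, same underlying structure) typically yields a smaller coefficient for the subsample.

```r
library(cluster)

set.seed(2040)                              # arbitrary seed, for reproducibility
dat <- scale(iris[, 1:4])                   # built-in data as a stand-in

ac_small <- agnes(dat[sample(nrow(dat), 30), ])$ac   # AC from 30 rows
ac_full  <- agnes(dat)$ac                            # AC from all 150 rows
c(small = ac_small, full = ac_full)         # the larger data set usually gives the larger AC
```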
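
A sketch of order (and starting-value) sensitivity: the same \(K\)-means model fit to the same rows in a different order can land in a different local solution, so the cluster sizes need not match.

```r
set.seed(2040)                                         # arbitrary seed
dat <- scale(iris[, 1:4])                              # built-in data as a stand-in

km1 <- kmeans(dat, centers = 3)                        # original row order
km2 <- kmeans(dat[sample(nrow(dat)), ], centers = 3)   # same rows, shuffled

sort(table(km1$cluster))                               # cluster sizes can differ between the two fits;
sort(table(km2$cluster))                               # a large nstart makes the result more stable
```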
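
A sketch of the k-medoids idea using `cluster::pam()`: each cluster center (medoid) is forced to be an actual observation, which you can verify by pulling the chosen rows out of the data.

```r
library(cluster)

dat <- scale(USArrests)          # built-in data as a stand-in
fit <- pam(dat, k = 3)           # k chosen arbitrarily for illustration

fit$medoids                      # the cluster centers: actual rows of the data
rownames(dat)[fit$id.med]        # which records were selected as medoids
table(fit$clustering)            # cluster sizes
```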
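
Finally, a sketch of a large data set: with a few thousand simulated records the leaf labels are useless, so they are suppressed and the clusters are summarized numerically after cutting the tree.

```r
set.seed(2040)                                       # arbitrary seed
big <- matrix(rnorm(2000 * 4), ncol = 4)             # 2,000 simulated records as a stand-in

hc <- hclust(dist(big), method = "ward.D2")
plot(hc, labels = FALSE, hang = -1)                  # dendrogram without (unreadable) leaf labels

grp <- cutree(hc, k = 4)                             # cut, then summarize instead of reading leaves
aggregate(as.data.frame(big), by = list(cluster = grp), FUN = mean)
```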