15.3 Distance Measures

Let’s start with 2 dimensions, and the most familiar type of distance: Euclidean distance.

Credit: Wikipedia

Recalling the Pythagorean formula, the Euclidean distance between two points \((p_{1}, p_{2})\) and \((q_{1}, q_{2})\) is

\[ d_{euc} = \sqrt{(q_{1} - p_{1})^{2} + (q_{2} - p_{2})^{2} }\]

This formula generalizes to \(p\) dimensions, so we can calculate the distance between observations on \(p\) numeric variables. The details of calculating a multivariate Euclidean distance are left to a third semester of calculus, but the concept is the same. This measure is the Euclidean norm, or L2 norm, of the difference between the two points.
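To make the generalization concrete, here is a minimal sketch in Python (these notes do not prescribe a language, so the language and the function name `euclidean_distance` are assumptions for illustration). It sums the squared differences over every variable and takes the square root, which for two variables reduces to the formula above.

```python
import math

def euclidean_distance(p, q):
    """Euclidean (L2) distance between two records with the same variables."""
    if len(p) != len(q):
        raise ValueError("records must have the same number of variables")
    # Sum the squared difference on each dimension, then take the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Two dimensions: the classic 3-4-5 right triangle.
d2 = euclidean_distance((0, 0), (3, 4))

# Three dimensions: the same function, one more term inside the sum.
d3 = euclidean_distance((1, 2, 3), (4, 6, 3))
```

Both calls return 5.0, since the third coordinate in the second example contributes a difference of zero.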

Distance measures are not invariant to changes in scale (units of measurement). Differences on a variable recorded in the thousands are numerically much larger than differences on a variable recorded in micrograms, so the large-scale variable dominates the distance regardless of how informative it is. This is why you need to scale (standardize) the data prior to analysis.
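The effect of scale can be seen in a small sketch. The three records below are made up purely for illustration (income in dollars, dose in grams), and the helper names are hypothetical. Standardizing here means subtracting each variable's mean and dividing by its sample standard deviation, which is one common choice of scaling, not the only one.

```python
import math
import statistics

def euclidean_distance(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def standardize(records):
    """Convert each variable (column) to z-scores using the sample sd."""
    columns = list(zip(*records))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(rec, means, sds))
        for rec in records
    ]

# Hypothetical records: (income in dollars, dose in grams).
records = [
    (52000.0, 0.000010),  # record 0
    (52500.0, 0.000090),  # record 1: similar income, very different dose
    (61000.0, 0.000012),  # record 2: different income, similar dose
]

# On the raw data, income swamps dose completely.
raw_01 = euclidean_distance(records[0], records[1])
raw_02 = euclidean_distance(records[0], records[2])

# After standardizing, both variables contribute on a comparable footing.
scaled = standardize(records)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])
```

On the raw data, record 2 looks many times further from record 0 than record 1 does, purely because income is measured in large numbers. After standardizing, the two distances are of similar size.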

15.3.1 Other measures of distance

  • Euclidean distance tends to be the default in most algorithms.
  • Manhattan distance is similar, but is calculated as the sum of the absolute differences along each dimension:

\[ d_{man} = |q_{1} - p_{1}| + |q_{2} - p_{2}| \]

  • Correlation-based distances, using the Pearson, Spearman, or Kendall correlation, to name a few.
    • Widely used for gene expression data.
    • Distance is defined by subtracting the correlation coefficient from 1, so perfectly correlated records have distance 0.
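The alternative measures in this list can be sketched the same way. The function names below are hypothetical, and the correlation-based version uses the Pearson coefficient only; Spearman or Kendall would swap in a different correlation while keeping the "one minus correlation" step.

```python
import math

def manhattan_distance(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length records."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """Correlation-based distance: 1 minus the correlation coefficient."""
    return 1 - pearson_correlation(x, y)

# Manhattan distance for the 3-4-5 triangle is 3 + 4, not 5.
d_man = manhattan_distance((0, 0), (3, 4))

# Same profile shape on very different scales: distance near 0.
d_same = correlation_distance([1, 2, 3, 4], [10, 20, 30, 40])

# Exactly opposite profiles: distance near 2, the largest possible value.
d_opp = correlation_distance([1, 2, 3, 4], [4, 3, 2, 1])
```

Note that a correlation-based distance treats two records as close when their profiles rise and fall together, even if their magnitudes differ greatly, which is a different notion of closeness than Euclidean or Manhattan distance.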

When we talk about records being close, we mean that the distance between them, computed over all \(p\) dimensions under consideration, is small. For this class we will default to using the Euclidean distance unless otherwise specified.

15.3.2 Gower’s dissimilarity measure as a distance measure for binary data

When your data is only 0/1, the concept of distance between records (vectors) is not quite the same. In this case you need a different type of distance, or dissimilarity measure, called the Gower distance. The resulting dissimilarity matrix is used in the same way as any other distance matrix.

I have not read through this material in great detail, but reading through the following document, the information looks credible and there is a reference to an original paper. I’ll trust it.

https://rstudio-pubs-static.s3.amazonaws.com/423873_adfdb38bce8d47579f6dc916dd67ae75.html
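As a rough illustration of the idea for 0/1 data, here is a minimal sketch. It rests on assumptions that these notes do not state: every variable is binary, all variables are weighted equally, and by default a shared 0 counts as a match (symmetric binary variables). Under those assumptions the dissimilarity is simply the proportion of variables on which two records disagree. The `asymmetric` flag, a hypothetical name, instead skips variables where both records are 0, for cases where a shared absence carries no information.

```python
def gower_binary(x, y, asymmetric=False):
    """Dissimilarity between two 0/1 records as a proportion of mismatches.

    Assumes equal weights and binary variables only. With asymmetric=True,
    variables where both records are 0 are left out of the comparison.
    """
    if len(x) != len(y):
        raise ValueError("records must have the same number of variables")
    compared = 0
    mismatches = 0
    for a, b in zip(x, y):
        if asymmetric and a == 0 and b == 0:
            continue  # joint absence is not counted as evidence of similarity
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        return float("nan")  # nothing to compare, so the measure is undefined
    return mismatches / compared

rec_a = [1, 0, 0, 1]
rec_b = [1, 1, 0, 0]

d_sym = gower_binary(rec_a, rec_b)                    # 2 mismatches out of 4
d_asym = gower_binary(rec_a, rec_b, asymmetric=True)  # 2 mismatches out of 3
```

The result always lies between 0 (identical records) and 1 (disagreement on every compared variable).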

15.3.3 Creating the distance matrix

We can visualize these distances using a heatmap. Here I’ve changed the gradient so that the darker the color, the closer the records are to each other. The diagonal is black because each record has 0 distance from itself. You can change these colors.

  • order=TRUE sorts the distances; notice that hum is furthest away from almost all other companies except nme.
  • win and lks seem to be a bit ‘further’ away from others.
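The matrix behind a heatmap like this holds the distance between every pair of records. A minimal sketch of building one is below; the three records are made up for illustration (they are not the company data), the function names are hypothetical, and in practice the variables would be scaled first.

```python
import math

def euclidean_distance(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def distance_matrix(records):
    """Square, symmetric matrix of pairwise Euclidean distances."""
    n = len(records)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(records[i], records[j])
            matrix[i][j] = d  # upper triangle
            matrix[j][i] = d  # mirror into the lower triangle
    return matrix

# Hypothetical records, assumed to be already on comparable scales.
records = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = distance_matrix(records)
```

Each row and column corresponds to one record, the diagonal is 0 because every record has zero distance from itself, and the matrix is symmetric because the distance from record i to record j equals the distance from j to i.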

We will explore two methods of clustering: hierarchical and non-hierarchical.