14.6 Standardizing

Often researchers will standardize the \(x\) variables before conducting a PCA.

Standardizing: Take \(X\) and divide each element by its standard deviation \(\sigma_{X}\).

\[\frac{X}{\sigma_{X}}\]

Normalizing: Centering and standardizing, that is, subtracting the mean \(\bar{X}\) and then dividing by \(\sigma_{X}\).

\[Z = \frac{(X-\bar{X})}{\sigma_{X}}\]

Running a PCA on the normalized data is equivalent to analyzing the correlation matrix (\(\mathbf{R}\)) instead of the covariance matrix (\(\mathbf{\Sigma}\)).
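
The equivalence can be checked numerically. The following is a minimal sketch on simulated data (the seed, sample size, and the object names `sim`, `z`, `cov_z`, and `cor_x` are assumptions made for illustration, not part of the original example): the covariance matrix of the normalized variables should match the correlation matrix of the raw variables.

```r
# Simulated two-variable data; seed, sample size, and coefficients are arbitrary.
set.seed(1)
x1 <- rnorm(200, mean = 50, sd = 10)
x2 <- 0.5 * x1 + rnorm(200, sd = 5)
sim <- cbind(X1 = x1, X2 = x2)

# scale() centers each column and divides it by its sample standard deviation.
z <- scale(sim)

# Covariance matrix of the normalized data vs. correlation matrix of the raw data.
cov_z <- cov(z)
cor_x <- cor(sim)

# Compare the two matrices up to numerical tolerance.
all.equal(cov_z, cor_x, check.attributes = FALSE)
```

Because `scale()` uses the same sample standard deviation (denominator \(n-1\)) as `cov()` and `cor()`, the two matrices agree up to rounding error.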

Using the correlation matrix instead of the covariance matrix will generate different PCs.

This makes sense given the difference between the two matrices:

cov(data) #Covariance Matrix
##           X1       X2
## X1 100.74146 50.29187
## X2  50.29187 48.59528
cor(data) #Correlation Matrix
##           X1        X2
## X1 1.0000000 0.7187811
## X2 0.7187811 1.0000000

Standardizing your data prior to analysis (using \(\mathbf{R}\) instead of \(\mathbf{\Sigma}\)) aids the interpretation of the PCs in a few ways:

The total variance is the number of variables \(P\)
The proportion explained by each PC is the corresponding eigenvalue / \(P\)
The correlation between \(C_{i}\) and standardized variable \(x_{j}\) can be written as \(r_{ij} = a_{ij}SD(C_{i})\)

This last point means that for any given \(C_{i}\) we can quantify the relative degree of dependence of the PC on each of the standardized variables. This quantity is also known as the factor loading (we will return to this key term later).
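
These three properties can be verified directly from an eigendecomposition of the correlation matrix printed above. This is a sketch rather than the approach `princomp` takes internally; the object names `R`, `e`, `P`, `total_var`, `prop_var`, and `r` are chosen here for illustration.

```r
# Correlation matrix as printed in the cor(data) output above.
R <- matrix(c(1,         0.7187811,
              0.7187811, 1),
            nrow = 2, dimnames = list(c("X1", "X2"), c("X1", "X2")))

e <- eigen(R)   # eigenvalues are the PC variances; eigenvector columns hold the a_ij
P <- ncol(R)    # number of variables

total_var <- sum(e$values)   # total variance, equal to P
prop_var  <- e$values / P    # proportion of variance explained by each PC

# r_ij = a_ij * SD(C_i): multiply eigenvector column i by sqrt(eigenvalue i).
# Rows are variables x_j, columns are components C_i.
# The sign of each column is arbitrary, so signs may be flipped.
r <- e$vectors %*% diag(sqrt(e$values))
dimnames(r) <- list(rownames(R), c("C1", "C2"))

total_var
prop_var
sqrt(e$values)   # compare with the "Standard deviation" row of summary(pr_corr)
r
```

For each standardized variable, the squared correlations with all of the PCs sum to 1, which is one way to sanity-check `r`.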

To calculate the principal components from the correlation matrix with princomp, set the cor argument to TRUE.

pr_corr <- princomp(data, cor=TRUE)
summary(pr_corr)
## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.3110229 0.5303008
## Proportion of Variance 0.8593906 0.1406094
## Cumulative Proportion  0.8593906 1.0000000
pr_corr$loadings
## 
## Loadings:
##    Comp.1 Comp.2
## X1  0.707  0.707
## X2  0.707 -0.707
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0

If we use the covariance matrix and change the scale of a variable (e.g., from inches to centimeters), the PCs will change.
Many researchers prefer to use the correlation matrix:
- It compensates for the units of measurements for the different variables.
- Interpretations are made in terms of the standardized variables.

\[ C_{1} = 0.707x_{1} + 0.707x_{2} \\ C_{2} = 0.707x_{1} - 0.707x_{2} \]
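
The sensitivity to units can be illustrated with a small simulation. This is a sketch under stated assumptions: the variables `height` and `weight`, the seed, and the data frames `d_in` and `d_cm` are hypothetical and are not the `data` object used above.

```r
# Hypothetical data: height in inches and a second, correlated variable.
set.seed(2)
height_in <- rnorm(100, mean = 66, sd = 4)
weight    <- 2 * height_in + rnorm(100, sd = 10)
d_in <- data.frame(height = height_in, weight = weight)

# Same data, with height converted from inches to centimeters.
d_cm <- transform(d_in, height = height * 2.54)

# Loadings from the covariance matrix (default) and from the correlation matrix.
load_cov_in <- unclass(loadings(princomp(d_in)))
load_cov_cm <- unclass(loadings(princomp(d_cm)))
load_cor_in <- unclass(loadings(princomp(d_in, cor = TRUE)))
load_cor_cm <- unclass(loadings(princomp(d_cm, cor = TRUE)))

# Compare absolute values because the sign of each component is arbitrary.
diff_cov <- max(abs(abs(load_cov_in) - abs(load_cov_cm)))  # changes with the units
diff_cor <- max(abs(abs(load_cor_in) - abs(load_cor_cm)))  # unaffected by the units
c(covariance = diff_cov, correlation = diff_cor)
```

Rescaling a variable changes its variance and covariances but leaves its correlations untouched, which is why only the covariance-based loadings move.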

To compare the loadings from the covariance-based and correlation-based PCAs side by side, put them in one table.

# pr is assumed to be the covariance-based princomp fit from earlier in the chapter.
data.frame(PC1.cov = loadings(pr)[, 1],
           PC2.cov = loadings(pr)[, 2],
           PC1.cor = loadings(pr_corr)[, 1],
           PC2.cor = loadings(pr_corr)[, 2]) |> knitr::kable(digits = 2)

	PC1.cov	PC2.cov	PC1.cor	PC2.cor
X1	0.85	0.52	0.71	0.71
X2	0.52	-0.85	0.71	-0.71