10.3 Multicollinearity (PMA6 8.9)
- Occurs when some of the X variables are highly intercorrelated.
- Computed estimates of regression coefficients are unstable and have large standard errors.
For example, the squared standard error of the \(i\)th slope coefficient (\([SE(\beta_{i})]^2\)) can be written as:
\[ [SE(\beta_{i})]^2 = \frac{S^{2}}{(N-1)\,S_{i}^{2}} \cdot \frac{1}{1 - R_{i}^{2}} \]
where \(S^{2}\) is the residual mean square, \(S_{i}\) the standard deviation of \(X_{i}\), and \(R_{i}\) the multiple correlation between \(X_{i}\) and all other \(X\)’s.
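When \(R_{i}\) is close to 1, meaning \(X_{i}\) is almost perfectly predicted by the other \(X\)'s, \(1 - (R_{i})^2\) is close to 0, which makes \(\frac{1}{1 - (R_{i})^2}\) very large.
This fraction is called the variance inflation factor (VIF). It measures how much the variance of the \(i\)th slope estimate is inflated relative to the case where \(X_{i}\) is uncorrelated with the other \(X\)'s, and most model diagnostic tools report it.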
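To get a feel for the scale, here is a quick worked calculation. The values of \(R_{i}\) are made up for illustration and do not come from any data set in these notes.
\[ R_{i} = 0.5: \ \frac{1}{1 - 0.25} \approx 1.33 \qquad R_{i} = 0.9: \ \frac{1}{1 - 0.81} \approx 5.26 \qquad R_{i} = 0.99: \ \frac{1}{1 - 0.9801} \approx 50.25 \]
Because this factor multiplies the squared standard error, the standard error itself grows by its square root: roughly 1.15, 2.29, and 7.09 times larger, respectively.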
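# Fit a model with three body-measurement predictors
big.pen.model <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, data = pen)
# Compute the VIF for each predictor and plot it (the plot method needs the see package)
performance::check_collinearity(big.pen.model) |> plot()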
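The same quantity can be computed by hand, which may help connect the diagnostic output to the formula. The sketch below is illustrative only: it uses the built-in `mtcars` data and three of its columns as stand-in predictors (an assumption, chosen so the code runs without extra packages), regresses each predictor on the others to get \(R_{i}^{2}\), and then applies \(1/(1 - R_{i}^{2})\).

```r
# Stand-in predictors from the built-in mtcars data (assumed for illustration)
predictors <- c("disp", "hp", "wt")

vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Auxiliary regression: X_i on all the other X's
  aux <- lm(reformulate(others, response = v), data = mtcars)
  r2 <- summary(aux)$r.squared   # this is R_i^2
  1 / (1 - r2)                   # variance inflation factor
})

vif_by_hand

# Cross-check: the VIFs are also the diagonal of the inverse
# of the predictors' correlation matrix
diag(solve(cor(mtcars[, predictors])))
```

Both printed vectors should agree, since the two calculations are algebraically equivalent.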
- One solution: use variable selection to drop some of the intercorrelated X variables.
- Alternatively, use dimension reduction techniques such as Principal Components (Chapter 14).
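As a rough sketch of the second remedy, the code below replaces a set of correlated predictors with their principal component scores, which are uncorrelated with one another by construction. It again uses the built-in `mtcars` data as a stand-in (an assumption, not the data used above), and keeping two components is an arbitrary choice for illustration.

```r
# Stand-in predictors from the built-in mtcars data (assumed for illustration)
predictors <- c("disp", "hp", "wt")

# Principal components of the standardized predictors
pc <- prcomp(mtcars[, predictors], scale. = TRUE)

# The component scores are mutually uncorrelated (off-diagonals are ~0)
round(cor(pc$x), 3)

# Regress the outcome on the first two components instead of the raw X's
pc_data <- data.frame(mpg = mtcars$mpg, pc$x)
pc_model <- lm(mpg ~ PC1 + PC2, data = pc_data)
summary(pc_model)$coefficients
```

The trade-off is interpretability: each coefficient now describes a weighted combination of the original variables rather than a single X.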