- simplify the description of a set of interrelated variables.
- transform a set of correlated variables into a new set of uncorrelated variables
- dimension reduction: collapse many variables into a small number of variables while retaining as much as possible of the variation present in the data.
- Statistical modeling is all about explaining variance in an outcome based on the variance in predictors.
- The new variables are called principal components, and they are ordered by the amount of variance they contain.
- So the first few principal components may contain much of the variance (information) contained in a much larger set of original variables.
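A minimal sketch of this ordering, using only numpy: the components are the right singular vectors of the centered data, and the squared singular values give the variance each component carries. The simulated data (three variables driven by one latent factor) is an illustrative assumption, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three correlated variables driven largely by one shared latent factor
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Center the data, then take the SVD; rows of Vt are the principal components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Proportion of variance carried by each component, in decreasing order
explained = s**2 / np.sum(s**2)
print(explained)
```

Because the three variables share one factor, the first component should carry nearly all of the variance, illustrating how a single component can stand in for a larger correlated set.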
- multivariate outlier detection
- individual records with extreme scores on the principal components are candidates for outliers or blunders on multiple variables.
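One way to sketch this idea: compute each record's principal component scores, standardize each score column, and flag records with any extreme standardized score. The planted "blunder" row and the |z| > 3 cutoff are illustrative choices, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[0] = [6.0, -6.0, 6.0, -6.0]  # a record with blunders on several variables

# Principal component scores: project centered data onto the components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Standardize each score column and flag records with any |z| > 3
z = scores / scores.std(axis=0)
flagged = np.where(np.any(np.abs(z) > 3, axis=1))[0]
print(flagged)
```

The planted row stands out because its pattern of values is extreme relative to the joint structure of the data, even though no single variable alone need look impossible.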
- as a solution for multicollinearity
- it is often useful to obtain the first few principal components corresponding to a set of highly correlated X variables, and then conduct the regression analysis on the selected components.
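A rough sketch of this principal components regression idea, assuming two nearly collinear predictors (the simulated data and noise levels are illustrative): regress the outcome on the first component instead of on the unstable original pair.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Two nearly collinear predictors
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

# First principal component of the centered predictors
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# Ordinary least squares of y on the single component
A = np.column_stack([np.ones(n), pc1])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ beta
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

The single component captures almost all of the predictive information the two collinear predictors share, while avoiding the inflated standard errors that multicollinearity produces in the original regression.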
- as a step towards factor analysis (next section)
- as an exploratory technique that may be used in gaining a better understanding of the relationships between measures.
Not variable selection
Principal Components Analysis (PCA) differs from variable selection in two ways:
- No dependent variable exists
- Variables are not eliminated; rather, summary variables, i.e., principal components, are computed from all of the original variables.
We are trying to understand a phenomenon by collecting a series of component measurements, but the underlying mechanism is complex and not easily understood by simply looking at each component individually. The data may be redundant, and high levels of multicollinearity may be present.
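Such redundancy can be made visible through the eigenvalues of the correlation matrix: an eigenvalue near zero signals that one variable is nearly a linear combination of the others. A small illustrative sketch (the simulated near-duplicate variable is an assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = a + 0.02 * rng.normal(size=500)   # nearly a duplicate of a: redundant
c = rng.normal(size=500)              # an independent measurement

# Eigenvalues of the correlation matrix, in ascending order
R = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)
eigvals = np.linalg.eigvalsh(R)
print(eigvals)
```

The near-zero smallest eigenvalue reflects the redundant pair, which is exactly the condition under which PCA-based summaries are most useful.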