## 14.8 Use in Multiple Regression

• Choose a small number of principal components to use as predictors in a regression model
• Leads to more stable regression estimates.
• Alternative to variable selection
• Ex: several measures of behavior.
• Use PC$$_{1}$$ or PC$$_{1}$$ and PC$$_{2}$$ as summary measures of all.
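A minimal sketch of the idea in base R, using simulated data (the measures `x1`–`x5`, the latent trait `z`, and the outcome `y` are made up for illustration):

```r
# Simulated data: five correlated "behavior" measures driven by one latent trait.
set.seed(42)
n <- 100
z <- rnorm(n)                                   # hypothetical latent trait
X <- sapply(1:5, function(j) z + rnorm(n, sd = 0.5))
colnames(X) <- paste0("x", 1:5)
y <- z + rnorm(n)                               # outcome related to the trait

pc <- princomp(X, cor = TRUE)                   # PCA on the correlation matrix

# Use the first two PC scores as predictors instead of all five x's.
dat <- data.frame(y = y, pc1 = pc$scores[, 1], pc2 = pc$scores[, 2])
fit <- lm(y ~ pc1 + pc2, data = dat)
summary(fit)
```

Because the PCs are uncorrelated with each other, their estimated coefficients are more stable than those from a regression on five highly correlated `x`'s.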

### 14.8.1 Example: Modeling acute illness

The 20 depression questions C1:C20 were designed to be added together to create the CESD scale directly. While this is a validated measure, what if some components (e.g., had crying spells) contribute more to someone's level of depression than others (e.g., people were unfriendly)? Since the PCs are linear combinations of the $$x$$'s, the coefficients $$a$$ (the loadings) are not all equal, as we have seen. So let's see if the first two PCs (the number chosen from the scree plot) can predict acute illness better than the straight summative score of cesd.

1. Extract PC scores and attach them to the data.

The scores for each PC for each observation are stored in the `scores` list element of the `pc_dep` object.

```r
dim(pc_dep$scores); kable(pc_dep$scores[1:5, 1:5])
## [1] 294  20
```

|    Comp.1|     Comp.2|     Comp.3|     Comp.4|     Comp.5|
|---------:|----------:|----------:|----------:|----------:|
| -2.446342|  0.6236068|  0.1288289| -0.2546597| -0.1624772|
| -1.452116| -0.1763085|  0.5861563| -0.6781969| -0.3225529|
| -1.468211| -0.4350019|  0.2893955| -0.3243790| -0.2513590|
| -1.324852|  1.7766419|  1.0833599|  1.2651869| -1.1339350|
| -1.449606|  2.3576522| -0.7489288|  1.9464680|  1.2229057|

```r
depress$pc1 <- pc_dep$scores[,1]
depress$pc2 <- pc_dep$scores[,2]
```
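What are these scores, exactly? Each column of `scores` is just the centered (and, with `cor = TRUE`, scaled) data multiplied by the corresponding loadings. A small self-contained check with simulated data (not the depress data):

```r
set.seed(1)
X  <- matrix(rnorm(200), nrow = 40, ncol = 5)
pc <- princomp(X, cor = TRUE)

# Reproduce the stored scores by hand: center and scale each column using
# the values princomp saved, then multiply by the loading matrix.
manual <- scale(X, center = pc$center, scale = pc$scale) %*% pc$loadings
all.equal(pc$scores, manual, check.attributes = FALSE)  # TRUE
```

So attaching `pc_dep$scores[, 1]` to the data is equivalent to computing a weighted sum of the standardized question responses, with the loadings as weights.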

2. Fit a model using those PC scores as covariates

Along with any other covariates chosen by other methods.

```r
glm(acuteill~pc1+pc2, data=depress, family='binomial') %>% summary()
## 
## Call:
## glm(formula = acuteill ~ pc1 + pc2, family = "binomial", data = depress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3289  -0.8242  -0.7894   1.4447   1.6898  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.87695    0.12901  -6.798 1.06e-11 ***
## pc1          0.07921    0.04608   1.719   0.0856 .  
## pc2          0.10321    0.10409   0.992   0.3214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 357.13  on 293  degrees of freedom
## Residual deviance: 353.09  on 291  degrees of freedom
## AIC: 359.09
## 
## Number of Fisher Scoring iterations: 4
```
```r
glm(acuteill~cesd, data=depress, family='binomial') %>% summary()
## 
## Call:
## glm(formula = acuteill ~ cesd, family = "binomial", data = depress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1354  -0.8356  -0.7840   1.4622   1.6645  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.09721    0.18479  -5.938 2.89e-09 ***
## cesd         0.02494    0.01392   1.792   0.0731 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 357.13  on 293  degrees of freedom
## Residual deviance: 353.97  on 292  degrees of freedom
## AIC: 357.97
## 
## Number of Fisher Scoring iterations: 4
```

In this example, the model using the PCs and the model using cesd were very similar. However, this is a case where an aggregate measure such as cesd has already been developed and scientifically validated. That is often not the case, especially in exploratory data analysis when you are not sure *how* the measures are correlated.
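Since the two models are not nested (the PCs are not a subset of the predictors in the cesd model), a likelihood-ratio test does not apply, but `AIC()` puts them on one scale, as the summaries already report. A self-contained sketch with simulated data (the scores `s1`, `s2` and the model names are made up for illustration):

```r
set.seed(7)
n  <- 300
s1 <- rnorm(n); s2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.9 + 0.3 * s1))

m_pc  <- glm(y ~ s1 + s2,    family = "binomial")  # analogue of the PC model
m_sum <- glm(y ~ I(s1 + s2), family = "binomial")  # analogue of the summative score
AIC(m_pc, m_sum)   # lower AIC = better trade-off of fit against complexity
```

By this yardstick the fitted models above differ by about one AIC unit (359.09 vs 357.97), consistent with calling them very similar.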