10.5 Comparing between models (PMA6 9.4)
The goal: find the subset of independent variables that optimizes (either minimizes or maximizes) a certain criterion. In other words, the goal is to find the optimal model.
 How do we measure “optimal”?
First we need to look at two quantities:
10.5.1 RSS: Residual Sum of Squares
Recall that the method of least squares introduced in Section 9 minimizes the residual sum of squares (RSS) around the regression plane. This value is central to all of the model comparisons that follow: how “far away” are the model estimates from the observed values?
\[ RSS = \sum(Y - \hat{Y})^{2} = \sum(Y - \bar{Y})^{2}(1-R^{2}) \]
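As a minimal sketch (using the depress data and cesd outcome from the example below), the RSS can be pulled directly from a fitted model in R:
# Fit a candidate model (depress data assumed, as in the example below)
fit <- lm(cesd ~ age, data=depress)
# RSS: sum of squared residuals
sum(residuals(fit)^2)
# Equivalent shortcut for linear models
deviance(fit)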
10.5.2 General F Test
See also Section 10.2.
Two nested models are considered similar if the p-value for the General F test is non-significant at a 0.15 level. Nested: the list of variables in one (reduced) model is a subset of the list of variables in a bigger (full) model. As with all other ANOVA-type comparisons, you are essentially testing the difference in RSS between the nested models.
# Full model
full.employ.model <- lm(cesd ~ age + chronill + employ, data=depress)
# Reduced model
reduced.employ.model <- lm(cesd ~ age, data=depress)
anova(reduced.employ.model, full.employ.model)
## Analysis of Variance Table
## 
## Model 1: cesd ~ age
## Model 2: cesd ~ age + chronill + employ
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    292 22197                                  
## 2    285 20036  7    2161.4 4.3921 0.0001197 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that we use anova() here, not aov().
Other references: https://online.stat.psu.edu/stat501/lesson/6/6.2
10.5.3 Likelihood function
What is the likelihood that we observed the data \(x\), given parameter values \(\theta\)? \[ \mathcal{L}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x) \]
- For mathematical convenience, we tend to work with the log-likelihood (LL).
 
- This works because \(\log\) is a monotonically increasing function, so maximizing the LL is equivalent to maximizing the likelihood function.
 
- We can compare between models using functions based off the LL.
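As a minimal sketch (model objects assumed from the nested-model example above), the LL of a fitted model can be extracted directly in R:
# Log-likelihood (LL) of each fitted model; larger is better on the same data
logLik(reduced.employ.model)
logLik(full.employ.model)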
There are several measures we can use to compare between competing models.
10.5.4 Multiple \(R^{2}\)
If the model explains a large amount of variation in the outcome, that’s good, right? So we could consider using \(R^{2}\) as a selection criterion and try to find the model that maximizes this value.
- Problem: The multiple \(R^{2}\) always increases as predictors are added to the model (see the simulation sketch below).
 
- Problem: \(R^{2} = 1-\frac{RSS}{Total SS}\) is biased: if the population \(R^{2}\) is really zero, then E(\(R^{2}\)) = P/(N-1).
- Ex. 1: N = 100, P = 1, E(\(R^{2}\)) = 1/99 \(\approx\) 0.01
- Ex. 2: N = 21, P = 10, E(\(R^{2}\)) = 10/20 = 0.5
Reference PMA6 Figure 9.1
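As a minimal sketch of the first problem (hypothetical simulated data, not from PMA6), adding a pure-noise predictor still increases the multiple \(R^{2}\):
# Hypothetical simulation: R^2 does not decrease when a noise predictor is added
set.seed(42)
n  <- 100
x1 <- rnorm(n)
y  <- 2 + 0.5*x1 + rnorm(n)
noise <- rnorm(n)                      # unrelated to y by construction
summary(lm(y ~ x1))$r.squared          # baseline R^2
summary(lm(y ~ x1 + noise))$r.squared  # slightly larger, despite adding pure noise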
10.5.5 Adjusted \(R^{2}\)
To alleviate this bias, use mean squares (MS) instead of sums of squares (SS).
\(R^{2}_{adj} = 1-\frac{Residual MS}{Total MS}\)
equivalently,
\(R^{2}_{adj} = R^{2} - \frac{p(1-R^{2})}{n-p-1}\)
Now Adjusted \(R^{2}\) is approximately unbiased and won’t inflate as \(p\) increases.
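As a minimal sketch (model objects assumed from the nested-model example above), both versions of \(R^{2}\) are available from summary():
# Multiple and adjusted R^2 for the reduced and full models
summary(reduced.employ.model)$r.squared
summary(reduced.employ.model)$adj.r.squared
summary(full.employ.model)$r.squared
summary(full.employ.model)$adj.r.squared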
10.5.6 Mallows \(C_{p}\)
\[ C_{p} = (N-P-1)\left(\frac{RMSE}{\hat{\sigma}^{2}} -1 \right) + (P+1) \]
where \(RMSE = \frac{RSS}{N-P-1}\) is the residual mean square for the candidate model with \(P\) predictors, and \(\hat{\sigma}^{2}\) is the residual mean square from the model containing all candidate predictors.
- Smaller is better
- When all variables are chosen, \(P+1\) is at its maximum, but the other part of \(C_{p}\) is zero since \(RMSE = \hat{\sigma}^{2}\). A computational sketch follows.
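As a minimal sketch (model objects assumed from the nested-model example above; an algebraically equivalent form of the formula), \(C_{p}\) for the reduced model can be computed by hand:
# Mallows Cp for the reduced model, using the full model's residual mean square
sigma2.hat <- summary(full.employ.model)$sigma^2      # sigma-hat^2 from the full model
rss.p      <- sum(residuals(reduced.employ.model)^2)  # RSS for the candidate model
P          <- length(coef(reduced.employ.model)) - 1  # number of predictors
N          <- nobs(reduced.employ.model)
rss.p / sigma2.hat - N + 2*(P + 1)                    # equivalent to the formula above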
10.5.7 Akaike Information Criterion (AIC)
- A penalty is applied to the deviance that increases as the number of parameters \(p\) increases.
- Tries to find a parsimonious model that is closer to the “truth”.
 
- Uses an information function, e.g., the log-likelihood \((LL)\).
\[ AIC = -2LL + 2p\]
- Smaller is better
- Can also be written as a function of the residual sum of squares (RSS) (in book)
- Estimates the information in one model relative to other models
- So if all models suck, your AIC will just tell you which one sucks less.
 
- Built-in AIC() function in R
- Rule of thumb: Model 1 and Model 2 are considered to have significantly different fit if the difference in AIC values is greater than 2.
\[\mid AIC_{1} - AIC_{2}\mid > 2\]
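As a minimal sketch (model objects assumed from the nested-model example above), AIC() compares the two fits directly:
# AIC for each model; smaller is better
AIC(reduced.employ.model, full.employ.model)
# Rule-of-thumb check: is the absolute difference greater than 2?
abs(AIC(reduced.employ.model) - AIC(full.employ.model)) > 2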