10.5 Comparing between models (PMA6 9.4)
The goal: Find the subset of independent variables that optimizes (either minimizes or maximizes) a certain criterion. In other words, the goal is to find the optimal model.
How do we measure “optimal”?
First we need to look at two quantities:
10.5.1 RSS: Residual Sum of Squares
Recall that the method of least squares introduced in section 9 minimizes the residual sum of squares around the regression plane. This value is central to all of the model comparisons that follow: how "far away" are the model estimates from the observed values?
\[ RSS = \sum(Y - \hat{Y})^{2} = \sum(Y - \bar{Y})^{2}(1-R^{2}) \]
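As a quick illustrative sketch (using the built-in mtcars data only as a stand-in, not the depress data from this chapter), the RSS of a fitted lm object can be pulled out directly:
# RSS is the sum of squared residuals; deviance() returns the same value for an lm fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
sum(residuals(fit)^2)
deviance(fit)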
10.5.2 General F Test
See also Section 10.2.
Two nested models are considered similar if the p-value for the General F-test is non-significant at the 0.15 level. Nested: the list of variables in one model is a subset of the list of variables in a larger model. As with other ANOVA procedures, you are essentially comparing the difference in RSS between the nested models.
# Full model
full.employ.model <- lm(cesd ~ age + chronill + employ, data=depress)
# Reduced model
reduced.employ.model <- lm(cesd ~ age, data=depress)
anova(reduced.employ.model, full.employ.model)
## Analysis of Variance Table
##
## Model 1: cesd ~ age
## Model 2: cesd ~ age + chronill + employ
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 292 22197
## 2 285 20036 7 2161.4 4.3921 0.0001197 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: use anova(), not aov().
Other references: https://online.stat.psu.edu/stat501/lesson/6/6.2
10.5.3 Likelihood function
What is the likelihood that we observed the data \(x\), given parameter values \(\theta\). \[ \mathcal{L}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x) \]
- For mathematical convenience, we tend to work with the log-likelihood (LL).
- Because \(\log\) is a monotonically increasing function, maximizing the LL is equivalent to maximizing the likelihood function.
- We can compare between models using functions based on the LL, as in the sketch below.
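A minimal sketch using the two depress models fit in the previous subsection (assuming they are still in the workspace): logLik() returns the LL of a fitted lm object.
# Log-likelihood of each fitted model; larger (less negative) is better
logLik(reduced.employ.model)
logLik(full.employ.model)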
There are several measures we can use to compare between competing models.
10.5.4 Multiple \(R^{2}\)
If the model explains a large amount of variation in the outcome, that's good, right? So we could consider using \(R^{2}\) as a selection criterion and try to find the model that maximizes this value.
- Problem: The multiple \(R^{2}\) always increases as predictors are added to the model.
- Problem: \(R^{2} = 1-\frac{Residual SS}{Total SS}\) is biased: if the population \(R^{2}\) is really zero, then \(E(R^{2}) = P/(N-1)\). This is illustrated by the small simulation below.
  - Ex. 1: N = 100, P = 1, E(\(R^{2}\)) = 1/99 \(\approx\) 0.01
  - Ex. 2: N = 21, P = 10, E(\(R^{2}\)) = 10/20 = 0.5
Reference PMA6 Figure 9.1
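A small simulation (not from PMA6) that illustrates the bias: with N = 21 observations and P = 10 pure-noise predictors, the average \(R^{2}\) over repeated samples is close to P/(N-1) = 0.5 even though the population \(R^{2}\) is zero.
# Simulate pure-noise data: the outcome is independent of all 10 predictors
set.seed(42)
N <- 21; P <- 10
r2 <- replicate(2000, {
  dat <- as.data.frame(matrix(rnorm(N * (P + 1)), nrow = N))
  names(dat)[1] <- "y"
  summary(lm(y ~ ., data = dat))$r.squared
})
mean(r2)   # approximately P/(N-1) = 10/20 = 0.5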
10.5.5 Adjusted \(R^{2}\)
To alleviate this bias, use mean squares (MS) instead of sums of squares (SS).
\(R^{2}_{adj} = 1-\frac{Residual MS}{Total MS}\)
equivalently,
\(R^{2}_{adj} = R^{2} - \frac{p(1-R^{2})}{n-p-1}\)
Now Adjusted \(R^{2}\) is approximately unbiased and won’t inflate as \(p\) increases.
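Both the multiple and adjusted \(R^{2}\) are reported by summary() for a fitted lm; using the full depress model fit earlier (assuming it is still in the workspace):
summary(full.employ.model)$r.squared       # multiple R-squared
summary(full.employ.model)$adj.r.squared   # adjusted R-squared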
10.5.6 Mallows \(C_{p}\)
\[ C_{p} = (N-P-1)\left(\frac{RMSE}{\hat{\sigma}^{2}} -1 \right) + (P+1) \]
where \(RMSE = \frac{RSS}{N-P-1}\) is the residual mean square for the candidate model with \(P\) predictors, and \(\hat{\sigma}^{2}\) is the residual mean square from the model containing all candidate variables.
- Smaller is better
- When all variables are chosen, \(P+1\) is at its maximum, but the other part of \(C_{p}\) is zero since \(RMSE = \hat{\sigma}^{2}\). See the sketch below for a direct computation.
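A sketch computing \(C_{p}\) for the reduced depress model directly from the formula above (assumes the depress data and both models fit earlier are in the workspace; \(\hat{\sigma}^{2}\) is taken from the full model):
# Mallows Cp for the reduced model relative to the full model
n      <- nrow(depress)
p      <- length(coef(reduced.employ.model)) - 1   # number of predictors in the candidate model
rss    <- deviance(reduced.employ.model)            # residual sum of squares
rmse   <- rss / (n - p - 1)                         # residual mean square, as defined above
sigma2 <- summary(full.employ.model)$sigma^2        # residual mean square from the full model
(n - p - 1) * (rmse / sigma2 - 1) + (p + 1)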
10.5.7 Akaike Information Criterion (AIC)
- A penalty is applied to the deviance that increases as the number of parameters \(p\) increases.
- Tries to find a parsimonious model that is closer to the “truth”.
- Uses an information function, e.g., the log-likelihood \((LL)\).
\[ AIC = -2LL + 2p\]
- Smaller is better
- Can also be written as a function of the residual sum of squares (RSS) (in book)
- Estimates the information in one model relative to other models
- So if all models suck, your AIC will just tell you which one sucks less.
- Built-in AIC() function in R
- Rule of thumb: Model 1 and Model 2 are considered to have significantly different fit if the difference in AIC values is greater than 2.
\[\mid AIC_{1} - AIC_{2}\mid > 2\]
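Using the models fit earlier, a quick comparison with the built-in AIC() function (smaller is better; a difference greater than 2 suggests a meaningful difference in fit):
# AIC for both nested models; the row with the smaller AIC is preferred
AIC(reduced.employ.model, full.employ.model)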