9.7 Comparing between models

The goal: Find the subset of independent variables that optimizes (either minimizes or maximizes) a chosen criterion. In other words, the goal is to find the optimal model.

How do we measure “optimal”?

First we need to look at two quantities:

9.7.1 RSS: Residual Sum of Squares

Recall that the method of least squares, introduced in section 8, minimizes the residual sum of squares (RSS) around the regression plane. The RSS measures how “far away” the model estimates are from the observed values, and it is central to all of the model comparisons that follow.

\[ RSS = \sum(Y - \hat{Y})^{2} = \sum(Y - \bar{Y})^{2}(1-R^{2}) \]
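As a small illustration, the sketch below fits a one-predictor least squares line to made-up data and checks that the RSS computed directly from the residuals agrees with \(\sum(Y - \bar{Y})^{2}(1-R^{2})\). The data values and variable names are hypothetical, and Python is used only for illustration.

```python
# Hypothetical data: one predictor x and one outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates for a single predictor.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# RSS computed directly from the residuals.
rss_direct = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# R^2 computed separately as the squared correlation between x and y,
# then plugged into sum((Y - Ybar)^2) * (1 - R^2).
r2 = sxy ** 2 / (sxx * syy)
rss_from_r2 = syy * (1 - r2)

print(rss_direct, rss_from_r2)
```

The two printed values should agree up to floating-point rounding.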

9.7.2 Likelihood function

What is the likelihood of observing the data \(x\), given parameter values \(\theta\)?

\[ \mathcal{L}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x) \]

  • Purely for mathematical convenience, we tend to work with the log-likelihood (LL).
  • Because \(\log\) is a monotonically increasing function, maximizing the LL is equivalent to maximizing the likelihood function.
  • We can compare between models using functions based on the LL.
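To see why working on the log scale does not change the answer, here is a minimal sketch using a Bernoulli (coin flip) model. The data, the grid of candidate \(\theta\) values, and the function names are all hypothetical choices for this example.

```python
import math

def likelihood(theta, data):
    """Bernoulli likelihood: product of P(X = x) over independent observations."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else (1 - theta)
    return result

def log_likelihood(theta, data):
    """Log of the same likelihood, written as a sum of log probabilities."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]     # hypothetical: 7 successes in 10 trials
grid = [i / 100 for i in range(1, 100)]   # candidate theta values 0.01, ..., 0.99

# Search the grid for the theta that maximizes each function.
best_by_likelihood = max(grid, key=lambda t: likelihood(t, data))
best_by_log_likelihood = max(grid, key=lambda t: log_likelihood(t, data))

print(best_by_likelihood, best_by_log_likelihood)
```

Both searches pick the same \(\theta\), which here is the sample proportion of successes.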

There are several measures we can use to compare between competing models.

9.7.3 General F Test

Two nested models are considered similar if the p-value for the General F-test is non-significant at the 0.15 level.
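For reference, the test statistic behind this comparison can be written as follows. The subscripts \(R\) and \(F\) are notation introduced here for the reduced model (with \(P_{R}\) predictors) and the full model (with \(P_{F} > P_{R}\) predictors), both fit to the same \(N\) observations:

\[ F = \frac{(RSS_{R} - RSS_{F})/(P_{F} - P_{R})}{RSS_{F}/(N - P_{F} - 1)} \]

Under the usual normal-errors assumptions, and if the extra \(P_{F} - P_{R}\) coefficients are all zero, this statistic follows an F distribution with \(P_{F} - P_{R}\) and \(N - P_{F} - 1\) degrees of freedom.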

9.7.4 Multiple \(R^{2}\)

If the model explains a large amount of variation in the outcome, that’s good, right? So we could consider using \(R^{2}\) as a selection criterion and try to find the model that maximizes this value.

  • Problem: The multiple \(R^{2}\) never decreases as predictors are added to the model, so maximizing it always favors the largest model.
  • Problem: \(R^{2} = 1-\frac{Residual SS}{Total SS}\) is biased: If population \(R^{2}\) is really zero, then E(\(R^{2}\)) = P/(N-1).
    • Ex. 1: N = 100, P = 1, E(\(R^{2}\)) = 0.01
    • Ex. 2: N = 21, P = 10, E(\(R^{2}\)) = 0.5
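The bias formula E(\(R^{2}\)) = P/(N-1) can be checked with a quick simulation. The sketch below is an illustration under assumptions of my own choosing: a single predictor (P = 1) so that \(R^{2}\) is just the squared correlation, N = 21, standard normal data, 5000 replications, and a fixed random seed.

```python
import random

def r_squared(x, y):
    """R^2 for a one-predictor least squares fit (the squared correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

random.seed(1)      # fixed seed so the run is repeatable
N, P = 21, 1        # hypothetical sample size, one predictor
reps = 5000

total = 0.0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [random.gauss(0, 1) for _ in range(N)]   # generated independently of x
    total += r_squared(x, y)

mean_r2 = total / reps
expected = P / (N - 1)

print(mean_r2, expected)
```

Even though x and y are unrelated by construction, the average \(R^{2}\) across replications sits near P/(N-1) = 0.05 rather than zero.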

9.7.5 Adjusted \(R^{2}\)

To alleviate the bias, use mean squares (MS) instead of sums of squares (SS).

\(R^{2}_{adj} = 1-\frac{Residual MS}{Total MS}\)

\(R^{2}_{adj} = R^{2} - \frac{P(1-R^{2})}{N-P-1}\)

Now adjusted \(R^{2}\) is approximately unbiased and won’t automatically inflate as \(P\) increases.
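A minimal sketch of the adjustment, using the numbers from Ex. 2 above (N = 21, P = 10, \(R^{2}\) = 0.5). The function names are hypothetical. The second function writes the same quantity as one minus the ratio of residual to total mean squares, to show that the two forms agree.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from R^2, sample size n, and number of predictors p."""
    return r2 - p * (1 - r2) / (n - p - 1)

def adjusted_r2_from_ms(r2, n, p):
    """Same quantity written as 1 - (residual MS / total MS)."""
    # residual MS / total MS
    #   = [(1 - R^2) * TSS / (n - p - 1)] / [TSS / (n - 1)]
    #   = (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n, p = 21, 10
r2 = 0.5   # the R^2 expected by chance alone when the population R^2 is zero

print(adjusted_r2(r2, n, p))
print(adjusted_r2_from_ms(r2, n, p))
```

With these inputs the adjustment removes the chance-level \(R^{2}\) entirely: both forms give zero.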

9.7.6 Mallows \(C_{p}\)

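\[ C_{p} = (N-P-1)\left(\frac{RMS}{\hat{\sigma}^{2}} -1 \right) + (P+1) \]

where \(RMS = \frac{RSS}{N-P-1}\) is the residual mean square of the candidate model and \(\hat{\sigma}^{2}\) is the residual mean square of the model containing all candidate variables.

  • Smaller is better
  • When all variables are chosen, \(P+1\) is at its maximum but the other part of \(C_{p}\) is zero since the residual mean square equals \(\hat{\sigma}^{2}\)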
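The \(C_{p}\) formula above can be sketched as a small function. This is an illustration only: the RSS values, the sample size, and the function name are made up, and \(\hat{\sigma}^{2}\) is assumed to be the residual mean square of the model with all candidate variables.

```python
def mallows_cp(rss_subset, p_subset, n, sigma2_hat):
    """Mallows' C_p for a candidate model with p_subset predictors."""
    df_resid = n - p_subset - 1
    rms = rss_subset / df_resid    # residual mean square of the candidate model
    return df_resid * (rms / sigma2_hat - 1) + (p_subset + 1)

# Hypothetical setting: N = 50 observations, 4 candidate predictors.
n = 50
rss_full, p_full = 90.0, 4
sigma2_hat = rss_full / (n - p_full - 1)   # residual mean square of the full model

cp_full = mallows_cp(rss_full, p_full, n, sigma2_hat)
cp_subset = mallows_cp(96.0, 2, n, sigma2_hat)   # 2-predictor subset, RSS made up

print(cp_full, cp_subset)
```

For the full model the first term vanishes and \(C_{p}\) equals P + 1 = 5. In this made-up case the two-predictor subset gives up little fit and ends with the smaller \(C_{p}\).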

9.7.7 Akaike Information Criterion (AIC)

  • A penalty that increases with the number of parameters \(p\) is applied to the deviance.
  • Tries to find a parsimonious model that is closer to the “truth”.
  • Uses an information function, e.g., the log-likelihood function \((LL)\).

\[ AIC = -2LL + 2p\]

  • Smaller is better
  • Can also be written as a function of the residual sum of squares (RSS) (in book)
  • Estimates the information in one model relative to other models
    • So if all models suck, your AIC will just tell you which one sucks less.
  • Built-in AIC() function in R
  • Rule of thumb: Model 1 and Model 2 are considered to have meaningfully different fit if their AIC values differ by more than 2.

\[\mid AIC_{1} - AIC_{2}\mid > 2\]
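As a hedged sketch of how this plays out for a linear model with normal errors, the code below gets the maximized log-likelihood from the RSS (using the maximum likelihood variance estimate RSS/N), computes AIC for two fits, and applies the rule of thumb. The RSS values, parameter counts, and function names are hypothetical. Software differs in which parameters it counts in \(p\) (for example, whether the error variance is included); a constant offset of that kind cancels when comparing models fit to the same data.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a normal-errors linear model, given its RSS."""
    sigma2_mle = rss / n   # maximum likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2_mle) + 1)

def aic(log_lik, p):
    """AIC = -2 LL + 2p, with p the number of estimated parameters."""
    return -2 * log_lik + 2 * p

n = 50
# Hypothetical fits: model 2 adds one parameter and lowers the RSS slightly.
aic_1 = aic(gaussian_log_likelihood(96.0, n), p=3)
aic_2 = aic(gaussian_log_likelihood(95.0, n), p=4)

differ_by_rule_of_thumb = abs(aic_1 - aic_2) > 2

print(aic_1, aic_2, differ_by_rule_of_thumb)
```

Here the extra parameter buys only a small drop in RSS, so model 1 has the smaller AIC, and the gap is under 2, so the rule of thumb treats the two fits as not clearly different.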

9.7.8 Bayesian Information Criterion (BIC)

  • Similar to AIC.
  • Built-in BIC() function in R
  • Tries to find a parsimonious model that is more likely to be the “truth”. The smaller BIC, the better.

\[ BIC = -2LL + \ln(N)(P+1)\]

9.7.9 AIC vs BIC

  • Both are “penalized likelihood” functions
  • Each = \(-2LL\) + penalty
  • Penalty per parameter: AIC uses 2, BIC uses \(\ln(N)\)
  • For any N > 7, ln(N) > 2
  • Thus, BIC penalizes larger models more heavily.
  • They often agree.
    • When they disagree, AIC chooses a larger model than BIC.
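The comparison above can be sketched in a few lines. The log-likelihood values and parameter counts below are made up to show a case where the two criteria disagree, and `k` is a name introduced here for the number of penalized parameters.

```python
import math

def aic(log_lik, k):
    """AIC with k penalized parameters: penalty of 2 per parameter."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """BIC with k penalized parameters: penalty of ln(n) per parameter."""
    return -2 * log_lik + math.log(n) * k

# Smallest integer sample size at which BIC's per-parameter penalty exceeds AIC's.
crossover = next(n for n in range(1, 100) if math.log(n) > 2)
print(crossover)

# Hypothetical log-likelihoods for a small and a large model, with N = 100.
n = 100
small = {"log_lik": -150.0, "k": 3}
large = {"log_lik": -147.5, "k": 5}

prefers_large_aic = aic(**large) < aic(**small)
prefers_large_bic = bic(n=n, **large) < bic(n=n, **small)

print(prefers_large_aic, prefers_large_bic)
```

In this made-up case the improvement in log-likelihood is enough to offset AIC's penalty for two extra parameters but not BIC's, so AIC picks the larger model and BIC picks the smaller one.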