## 9.6 Variable Selection Process

Ref: PMA6 CH 9

Variable selection methods such as the ones described in this section are most often used when performing an *exploratory* analysis, where many independent variables have been measured but a final model to explain the variability of the dependent variable has not yet been determined.

When building a model, we want to choose a set of independent variables that yields good predictions while using as few variables as possible (*parsimony*). We also need to consider controlling for moderators and confounders. In many situations where regression is used, the investigator has strong justification for including certain variables in the model based on:

- previous studies
- accepted theory

The investigator may have prior justification for using certain variables but may be open to suggestions for the remaining variables.

The set of independent variables can be broken down into logical subsets:

- **Factors of primary interest** (such as an exposure or treatment).
- **Potential confounders**. Measures that could be associated with both the response and explanatory variables, and which could *explain* the relationship between the primary factor of interest and the outcome. These are typically a set of demographics such as age, gender, and ethnicity, and tend to be factors found to be important in prior studies.
- **Effect modifiers (moderators)**. A set of variables that other studies have shown to change or affect the relationship between the explanatory and response variables.
- **Precision variables (covariates)**. Variables associated with the dependent variable, but not with the primary factor of interest.

How variables are chosen for inclusion into a model is heavily driven by the purpose of the model:

- descriptive
- predictive

### 9.6.1 Automated selection procedures

*Forward selection*: Variables are added one at a time until an optimal model is reached.

- Choose the variable with the highest absolute correlation \(\mid r \mid\) with the outcome.
- Choose the next variable that maximizes the model adjusted \(R^{2}\).
- Repeat until adding additional variables does not improve the model fit significantly.
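The steps above can be sketched with the base R `step()` function. Note that `step()` adds variables by AIC rather than by adjusted \(R^2\); the `mtcars` data and candidate predictors below are illustrative assumptions, not part of the text.

```r
# Forward selection sketch: start from the intercept-only model and let
# step() add one variable at a time, stopping when AIC no longer improves.
null_model <- lm(mpg ~ 1, data = mtcars)          # just the mean of Y
fwd <- step(null_model,
            scope = ~ wt + hp + disp + drat,      # candidate predictors
            direction = "forward", trace = 0)
summary(fwd)
```

Setting `trace = 1` instead prints the AIC at each step, which is useful for seeing the order in which variables enter.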

*Backward elimination*: Variables are removed one at a time until an optimal model is reached.

- Put all variables into the model.
- Remove the least useful variable in the model. This can be done by choosing the variable with the largest \(p\)-value.
- Repeat until removing additional variables reduces the model fit significantly.
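A minimal sketch of backward elimination, again using `step()` on the illustrative `mtcars` data; note `step()` drops variables by AIC rather than by the largest \(p\)-value described above.

```r
# Backward elimination sketch: start with all candidate predictors and let
# step() remove the least useful one at a time.
full_model <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
bwd <- step(full_model, direction = "backward", trace = 0)
summary(bwd)
```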

*Stepwise selection*: Combination of forward and backward.

- Start with no variables (just \(\bar{Y}\))
- Add the variable that results in the greatest improvement in model fit.
- Add another variable that results in the greatest improvement in model fit after controlling for the first.
- Check to see if removing any variable currently in the model improves the fit.
- Add another variable…
- Check to remove variables…
- Repeat until no variables can be added or removed.
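The stepwise procedure above corresponds to `direction = "both"` in `step()`; the data and predictors are again illustrative assumptions.

```r
# Stepwise sketch: at each step, step() considers both adding and removing
# a variable, stopping when neither move improves the AIC.
null_model <- lm(mpg ~ 1, data = mtcars)
both <- step(null_model,
             scope = ~ wt + hp + disp + drat,
             direction = "both", trace = 0)
summary(both)
```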

Most programs have the option to **force** variables to be included in the model. This is important in cases where there is a primary factor of interest such as a treatment effect.
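In `step()`, forcing is done through the `lower` bound of the `scope` argument. Here `wt` stands in for a primary factor of interest; that choice is an illustrative assumption.

```r
# Forcing a variable sketch: the lower bound of the scope keeps `wt`
# in every candidate model, while the other predictors may enter or leave.
base_model <- lm(mpg ~ wt, data = mtcars)
forced <- step(base_model,
               scope = list(lower = ~ wt,
                            upper = ~ wt + hp + disp + drat),
               direction = "both", trace = 0)
```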

“… perhaps the most serious source of error lies in letting statistical procedures make decisions for you.” “Don’t be too quick to turn on the computer. Bypassing the brain to compute by reflex is a sure recipe for disaster.”

Good and Hardin, Common Errors in Statistics (and How to Avoid Them), p. 3, p. 152

Warnings:

- Stopping criteria and algorithm can be different for different software programs.
- Can reject perfectly plausible models from later consideration.
- Hides relationships between variables (e.g., if \(X_3\) is added and \(X_1\) is no longer significant, the relationship between \(X_1\) and \(X_3\) should be examined).

*Best Subsets*

- Select one X with highest simple \(r\) with Y
- Select two X’s with highest multiple \(r\) with Y
- Select three X’s with highest multiple \(r\) with Y etc.
- Compute the adjusted \(R^{2}\), AIC, or BIC each time.
- Compare and choose among the “best subsets” of various sizes.
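A best-subsets sketch, assuming the CRAN package `leaps` is installed; the `mtcars` predictors are illustrative.

```r
# Best subsets sketch using leaps::regsubsets: fit the best model of each
# size, then compare the fits across sizes.
library(leaps)
bs  <- regsubsets(mpg ~ wt + hp + disp + drat + qsec,
                  data = mtcars, nbest = 1)
fit <- summary(bs)
fit$adjr2              # adjusted R^2 for the best model of each size
fit$bic                # BIC for the best model of each size
which.max(fit$adjr2)   # subset size maximizing adjusted R^2
```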

### 9.6.2 Implementation in R

Refer to the following references:

- Notes maintained by Xiaorui Zhu for a data mining class at the Lindner College of Business: https://xiaoruizhu.github.io/Data-Mining-R/lecture/3_LinearReg.html
- A Jupyter notebook (R kernel) from Stats 191 at Stanford. This one uses cross-validation with the stepwise procedures, and demonstrates the dangers of trusting models that come out of blind use of variable selection methods. https://web.stanford.edu/class/stats191/notebooks/Selection.html
- Author and year unknown, but demonstrates best subsets. At least two years old; the content has not been tested recently. https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

### 9.6.3 Lasso

**L**east **A**bsolute **S**hrinkage and **S**election **O**perator.

Goal is to minimize

\[ RSS + \lambda \sum_{j} |\beta_{j}| \]

where \(\lambda\) is a model complexity penalty parameter.

- The penalty \(\lambda\) is typically chosen using cross-validation or AIC/BIC.
- “Shrinks” the coefficients, setting some to exactly 0.

- For now, use the lasso to choose variables, then fit a model with only those selected variables.
- The variables chosen in this manner are likely important, but the shrunken coefficient estimates are biased.
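The two-stage approach in the bullets above can be sketched as follows, assuming the CRAN package `glmnet` is installed; the `mtcars` predictors are illustrative.

```r
# Lasso sketch: choose lambda by cross-validation with cv.glmnet, keep the
# variables with nonzero coefficients, then refit OLS on just those.
library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg
cvfit <- cv.glmnet(x, y, alpha = 1)        # alpha = 1 is the lasso penalty
b <- coef(cvfit, s = "lambda.1se")         # sparse coefficients; zeros dropped
keep <- setdiff(rownames(b)[as.vector(b) != 0], "(Intercept)")
final <- lm(reformulate(keep, response = "mpg"), data = mtcars)
summary(final)
```

Note the refit OLS standard errors do not account for the selection step, so they should be read descriptively rather than inferentially.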