## 8.6 Categorical Predictors

Let’s continue to model the length of the iris petal based on the length of the sepal, controlling for species. But here we’ll keep species as a categorical variable. What happens if we just put the variable in the model?

summary(lm(Petal.Length ~ Sepal.Length + Species, data=iris))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.76390 -0.17875  0.00716  0.17461  0.79954
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -1.70234    0.23013  -7.397 1.01e-11 ***
## Sepal.Length       0.63211    0.04527  13.962  < 2e-16 ***
## Speciesversicolor  2.21014    0.07047  31.362  < 2e-16 ***
## Speciesvirginica   3.09000    0.09123  33.870  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2826 on 146 degrees of freedom
## Multiple R-squared:  0.9749, Adjusted R-squared:  0.9744
## F-statistic:  1890 on 3 and 146 DF,  p-value: < 2.2e-16

Examine the coefficient names, Speciesversicolor and Speciesvirginica. R (like most statistical software) automatically converts a categorical variable into a series of binary indicator variables. Let’s look at what the software does in the background. Below is a sample of the iris data showing 2 rows from each species. The first column is the row number, the second is the sepal length, the third is the binary indicator for whether the iris is from species versicolor, the fourth is the binary indicator for whether the iris is from species virginica, and the last is the species as a 3-level categorical variable (which is what we’re used to seeing at this point).

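| Row | Sepal.Length | Speciesversicolor | Speciesvirginica | Species    |
|-----|--------------|-------------------|------------------|------------|
| 1   | 5.1          | 0                 | 0                | setosa     |
| 2   | 4.9          | 0                 | 0                | setosa     |
| 51  | 7.0          | 1                 | 0                | versicolor |
| 52  | 6.4          | 1                 | 0                | versicolor |
| 101 | 6.3          | 0                 | 1                | virginica  |
| 102 | 5.8          | 0                 | 1                | virginica  |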
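The indicator columns can be inspected directly. The following is a minimal sketch using base R's `model.matrix()`, which builds the design matrix for a formula; the object names `X`, `rows`, and `sample.rows` are illustrative and not from the original text.

```r
# Build the design matrix that lm() would use for this formula.
X <- model.matrix(~ Sepal.Length + Species, data = iris)

# Pick the same six sample rows as the table above (2 per species).
rows <- c(1, 2, 51, 52, 101, 102)

# Drop the intercept column and attach the original 3-level factor for comparison.
sample.rows <- data.frame(X[rows, -1], Species = iris$Species[rows])
print(sample.rows)
```

Setosa rows have a 0 in both indicator columns, which is how the design matrix encodes the reference category.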

### 8.6.1 Factor variable coding

• Most commonly known as “dummy coding”, which is not an informative term.
• A better term: indicator variable.
• Math notation: I(gender == “Female”), which equals 1 when the condition is true and 0 otherwise.
• Also known as reference coding.
• For a nominal X with K categories, define K indicator variables.
• Choose a reference (referent) category and leave its indicator out.
• Use the remaining K-1 indicators in the regression.
• Often, the largest category is chosen as the reference category.
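In R, the reference category of a factor is its first level, and it can be changed with `relevel()`. The sketch below is one way to do this; the copy `iris.releveled` and the choice of versicolor as the new reference are arbitrary and only for illustration (all three iris species have 50 rows, so there is no "largest" category here).

```r
# The first factor level is the reference category by default.
levels(iris$Species)

# Work on a copy so the built-in data set is left untouched.
iris.releveled <- iris
iris.releveled$Species <- relevel(iris.releveled$Species, ref = "versicolor")

# Same model, two different reference categories.
fit.setosa.ref     <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
fit.versicolor.ref <- lm(Petal.Length ~ Sepal.Length + Species, data = iris.releveled)

coef(fit.setosa.ref)
coef(fit.versicolor.ref)
```

Changing the reference category does not change the fitted model, only what each species coefficient is compared against. The slope for Sepal.Length is identical in both fits, and the setosa coefficient in the second fit is the negative of the versicolor coefficient in the first.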

For the iris example, 2 indicator variables are created, one for versicolor and one for virginica. Each species coefficient is interpreted as a comparison to the reference group, which in this case is species setosa.

The mathematical model is now written as follows, where $$x_{1}$$ is Sepal Length, $$x_{2}$$ is the indicator for versicolor, and $$x_{3}$$ is the indicator for virginica:

$Y_{i} \sim \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i}+ \epsilon_{i}$
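
Substituting the indicator values for each species into the systematic part of this model (that is, setting the error term aside) gives one line per species. This is only a worked restatement of the equation above:

• Setosa ($$x_{2} = 0, x_{3} = 0$$): $$\beta_{0} + \beta_{1}x_{1}$$
• Versicolor ($$x_{2} = 1, x_{3} = 0$$): $$(\beta_{0} + \beta_{2}) + \beta_{1}x_{1}$$
• Virginica ($$x_{2} = 0, x_{3} = 1$$): $$(\beta_{0} + \beta_{3}) + \beta_{1}x_{1}$$

So $$\beta_{2}$$ and $$\beta_{3}$$ are the vertical shifts of the versicolor and virginica lines relative to the setosa line.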

Let’s look at the regression coefficients and their 95% confidence intervals from the main effects model again.

library(pander) # provides pander() for formatted model tables
main.eff.model <- lm(Petal.Length ~ Sepal.Length + Species, data=iris)
pander(main.eff.model)
Fitting linear model: Petal.Length ~ Sepal.Length + Species

|                   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------------|----------|------------|---------|------------|
| (Intercept)       | -1.702   | 0.2301     | -7.397  | 1.005e-11  |
| Sepal.Length      | 0.6321   | 0.04527    | 13.96   | 1.121e-28  |
| Speciesversicolor | 2.21     | 0.07047    | 31.36   | 9.646e-67  |
| Speciesvirginica  | 3.09     | 0.09123    | 33.87   | 4.918e-71  |

pander(confint(main.eff.model))

|                   | 2.5 %  | 97.5 % |
|-------------------|--------|--------|
| (Intercept)       | -2.157 | -1.248 |
| Sepal.Length      | 0.5426 | 0.7216 |
| Speciesversicolor | 2.071  | 2.349  |
| Speciesvirginica  | 2.91   | 3.27   |
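
The intervals from `confint()` can be reproduced by hand as estimate ± critical t value × standard error. The following is a minimal sketch, assuming the same main effects model; the object names `fit`, `t.crit`, and `manual.ci` are illustrative and not from the original text.

```r
# Refit the main effects model so this block runs on its own.
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)

# Pull estimates and standard errors from the coefficient table.
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical value for a 95% interval on the residual degrees of freedom.
t.crit <- qt(0.975, df = df.residual(fit))

# Lower and upper limits for each coefficient.
manual.ci <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
round(manual.ci, 3)
```

The residual degrees of freedom here are 146, matching the model summary shown earlier.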

In this main effects model, species only changes the intercept. The effect of species is added to the prediction rather than multiplied by sepal length, so the slope for sepal length is the same for every species. The interpretations are the following:

• $$b_{1}$$: After controlling for species, petal length significantly increases with sepal length, by 0.63cm for each additional 1cm of sepal length (95% CI 0.54-0.72, p<.0001).
• $$b_{2}$$: At the same sepal length, versicolor petals are on average 2.2cm longer than setosa petals (95% CI 2.1-2.3, p<.0001).
• $$b_{3}$$: At the same sepal length, virginica petals are on average 3.1cm longer than setosa petals (95% CI 2.9-3.3, p<.0001).
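
To see that species only shifts the intercept, the species-specific intercepts and predictions can be computed from the fitted coefficients. This is a minimal sketch; the sepal length of 6cm is an arbitrary illustrative value, and the names `b`, `intercepts`, `new.flowers`, and `pred` are not from the original text.

```r
# Refit the main effects model so this block runs on its own.
main.eff.model <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
b <- coef(main.eff.model)

# Species-specific intercepts: the reference intercept plus that species' shift.
intercepts <- c(
  setosa     = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica  = unname(b["(Intercept)"] + b["Speciesvirginica"])
)
round(intercepts, 2)

# One hypothetical flower per species, all with the same sepal length.
new.flowers <- data.frame(
  Sepal.Length = 6,
  Species = factor(levels(iris$Species), levels = levels(iris$Species))
)
pred <- predict(main.eff.model, newdata = new.flowers)

# Differences from setosa at a fixed sepal length equal b2 and b3.
pred - pred[1]
```

The three fitted lines share the slope of about 0.63 and differ only in intercept (about -1.70, 0.51, and 1.39), so the gap between any two species is the same at every sepal length.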