9.4 Interpreting Coefficients
Similar to simple linear regression, each \(\beta_{j}\) coefficient is interpreted as a slope: the amount \(Y\) is expected to change for every 1 unit increase in \(X_{j}\). In a multiple variable regression model, \(\beta_{j}\) is the estimated change in \(Y\) for a 1 unit increase in \(X_{j}\), after controlling for the other predictors in the model.
9.4.1 Continuous predictors
mlr.dad.model <- lm(FFEV1 ~ FAGE + FHEIGHT, data=fev)
summary(mlr.dad.model)
##
## Call:
## lm(formula = FFEV1 ~ FAGE + FHEIGHT, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34708 -0.34142 0.00917 0.37174 1.41853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.760747 1.137746 -2.427 0.0165 *
## FAGE -0.026639 0.006369 -4.183 4.93e-05 ***
## FHEIGHT 0.114397 0.015789 7.245 2.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 147 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.3247
## F-statistic: 36.81 on 2 and 147 DF, p-value: 1.094e-13
confint(mlr.dad.model)
## 2.5 % 97.5 %
## (Intercept) -5.00919751 -0.51229620
## FAGE -0.03922545 -0.01405323
## FHEIGHT 0.08319434 0.14559974
- Holding height constant, a father who is one year older is expected to have an FEV1 value 0.03 (0.01, 0.04) liters lower than a younger man (p<.0001).
- Holding age constant, a father who is 1cm taller than another man is expected to have an FEV1 value 0.11 (0.08, 0.15) liters greater than the shorter man (p<.0001).
For the model that includes age, the coefficient for height is now 0.11, which is interpreted as the rate of change of FEV1 as a function of height after adjusting for age. This is also called the partial regression coefficient of FEV1 on height after adjusting for age.
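To see this adjustment in action, here is a quick sketch (output not shown) comparing the unadjusted height slope to the partial slope of 0.11 from the model above.
# slope for height ignoring age (unadjusted)
coef(lm(FFEV1 ~ FHEIGHT, data = fev))
# partial slope for height after adjusting for age (matches 0.11 above)
coef(lm(FFEV1 ~ FAGE + FHEIGHT, data = fev))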
9.4.2 Binary predictors
Binary predictors (categorical variables with only 2 levels) get converted to a numeric binary indicator variable which only has the values 0 and 1. Whichever level gets assigned to be 0 is called the reference group or level. The regression estimate \(b\) then is the effect of being in group (\(x=1\)) compared to being in the reference (\(x=0\)) group.
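As a toy sketch (a hypothetical vector, not from the fev data), this is all the indicator coding amounts to; R builds this 0/1 variable automatically for factors.
# "M" is the reference group, coded 0; "F" is coded 1
sex <- c("M", "F", "F", "M")
ifelse(sex == "F", 1, 0)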
Does gender also play a role in FEV1? Let’s look at how gender may impact or change the relationship between FEV1 and either height or age.
Note, the fev data set is in wide form right now, with different columns for mothers and fathers. First I need to reshape the data into long format, so gender is its own variable.
# a pivot_longer() probably would have worked here as well
fev_long <- data.frame(gender = c(fev$FSEX, fev$MSEX),
                       fev1 = c(fev$FFEV1, fev$MFEV1),
                       ht = c(fev$FHEIGHT, fev$MHEIGHT),
                       age = c(fev$FAGE, fev$MAGE),
                       area = c(fev$AREA, fev$AREA))
fev_long$gender <- factor(fev_long$gender, labels=c("M", "F"))
fev_long$area <- factor(fev_long$area, labels=c("Burbank", "Lancaster", "Long Beach", "Glendora"))
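For reference, here is a hedged sketch of the pivot_longer() approach mentioned in the comment above. It assumes the parent columns follow the F/M prefix naming used in the data.frame() call (FSEX, MSEX, FFEV1, MFEV1, and so on); the factor() conversions for gender and area would still be applied afterwards.
library(dplyr)
library(tidyr)
fev_long2 <- fev |>
  select(AREA, FSEX, MSEX, FFEV1, MFEV1, FHEIGHT, MHEIGHT, FAGE, MAGE) |>
  pivot_longer(cols = -AREA,
               names_to = c("parent", ".value"),
               names_pattern = "^([FM])(.+)$") |>
  rename(gender = SEX, fev1 = FEV1, ht = HEIGHT, age = AGE, area = AREA)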
So the model being fit looks like:
\[ y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \epsilon_{i}\]
where
- \(x_{1}\): Age
- \(x_{2}\): height
- \(x_{3}\): 0 if Male, 1 if Female
lm(fev1 ~ age + ht + gender, data=fev_long)
##
## Call:
## lm(formula = fev1 ~ age + ht + gender, data = fev_long)
##
## Coefficients:
## (Intercept) age ht genderF
## -2.24051 -0.02354 0.10509 -0.63775
In this model gender is a binary categorical variable, with reference group “Male”. This is detected because the variable that shows up in the regression model output is genderF. So the estimate shown is for females compared to males.
Note that I DID NOT have to convert the categorical variable gender to a binary numeric variable before fitting it into the model. R (and most other software programs) will do this for you.
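As a minimal sketch, you can peek at the design matrix R builds behind the scenes; gender shows up as the 0/1 indicator column genderF.
# first few rows of the design matrix used by lm()
head(model.matrix(~ age + ht + gender, data = fev_long))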
The regression equation for the model with gender is
\[ y = -2.24 - 0.02 age + 0.11 height - 0.64genderF \]
- \(b_{0}:\) For a male who is 0 years old and 0 cm tall, their FEV is -2.24L.
- \(b_{1}:\) For every additional year older an individual is, their FEV1 decreases by 0.02L.
- \(b_{2}:\) For every additional cm taller an individual is, their FEV1 increases by 0.11L.
- \(b_{3}:\) Females have 0.64L lower FEV compared to males.
Note: The interpretation of categorical variables still falls under the template language of “for every one unit increase in \(X_{p}\), \(Y\) changes by \(b_{p}\)”. Here, \(X_{3}=0\) for males and 1 for females, so a 1 “unit” increase corresponds to comparing females to males.
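A small sketch with hypothetical covariate values (here, the sample means) makes this concrete: two people identical in age and height but differing in gender have predicted FEV1 values that differ by exactly the genderF coefficient.
fit <- lm(fev1 ~ age + ht + gender, data = fev_long)
two_people <- data.frame(age = mean(fev_long$age),
                         ht = mean(fev_long$ht),
                         gender = factor(c("M", "F"), levels = c("M", "F")))
predict(fit, newdata = two_people)
diff(predict(fit, newdata = two_people))  # equals the genderF coefficient, about -0.64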
9.4.3 Categorical Predictors
Let’s continue to model the FEV for individuals living in Southern California, but now we also consider the effect of the city they live in. For those unfamiliar with the region, these cities represent very different environmental profiles.
Let’s fit a model with area; notice again I do not do anything to the variable area itself aside from adding it into the model.
lm(fev1 ~ age + ht + gender + area, data=fev_long) |> summary()
##
## Call:
## lm(formula = fev1 ~ age + ht + gender + area, data = fev_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32809 -0.29573 0.00578 0.31588 1.37041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.250564 0.752414 -2.991 0.00302 **
## age -0.022801 0.004151 -5.493 8.59e-08 ***
## ht 0.103866 0.010555 9.841 < 2e-16 ***
## genderF -0.642168 0.078400 -8.191 8.10e-15 ***
## areaLancaster 0.031549 0.084980 0.371 0.71072
## areaLong Beach 0.061963 0.104057 0.595 0.55199
## areaGlendora 0.121589 0.082097 1.481 0.13967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4777 on 293 degrees of freedom
## Multiple R-squared: 0.6529, Adjusted R-squared: 0.6458
## F-statistic: 91.86 on 6 and 293 DF, p-value: < 2.2e-16
Examine the coefficient names: areaLancaster, areaLong Beach and areaGlendora. Again, R automatically takes the categorical variable and turns it into a series of binary indicator variables, where a 1 indicates that a person is from that area. Notice in the table below how someone from Burbank has 0’s for all three indicator variables, someone from Lancaster has a 1 only in the areaLancaster variable and 0 otherwise, and so on for each other area.
 | areaLancaster | areaLong.Beach | areaGlendora | area
---|---|---|---|---
1 | 0 | 0 | 0 | Burbank
51 | 1 | 0 | 0 | Lancaster
75 | 0 | 1 | 0 | Long Beach
101 | 0 | 0 | 1 | Glendora
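One way to display this indicator coding (a hedged sketch, not necessarily how the table above was made) is to pull the area columns out of the design matrix and show one example row per area.
ind <- data.frame(model.matrix(~ area, data = fev_long)[, -1],
                  area = fev_long$area)
ind[!duplicated(ind$area), ]  # one example row per area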
- Most commonly known as “dummy coding”. Not an informative term to use.
- Better term: indicator variable.
- Math notation: I(gender == “Female”).
- A.k.a. “reference coding” or “one-hot encoding”.
- For a nominal X with K categories, define K indicator variables.
- Choose a reference (referent) category:
  - Leave it out.
  - Use the remaining K-1 in the regression.
- Often, the largest category is chosen as the reference category.
The regression coefficients for area are interpreted as comparisons to the reference group, which in this case is the Burbank area. Why Burbank? Because that is what R sees as the first factor level. If you want a different reference, you need to change the factor ordering, as in the sketch below.
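A sketch of changing the reference level (not run for the rest of this section, which keeps Burbank as the reference): levels() shows the current ordering, and relevel() changes which level is used as the reference.
levels(fev_long$area)  # first level listed is the reference
# e.g., make Glendora the reference instead of Burbank
fev_releveled <- fev_long
fev_releveled$area <- relevel(fev_releveled$area, ref = "Glendora")
lm(fev1 ~ age + ht + gender + area, data = fev_releveled)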
The mathematical model is now written as follows,
\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \beta_{4}x_{4i} + \beta_{5}x_{5i} + \beta_{6}x_{6i} + \epsilon_{i}\]
where
- \(x_{1}\): Age
- \(x_{2}\): height
- \(x_{3}\): 0 if Male, 1 if Female
- \(x_{4}\): 1 if living in Lancaster, 0 otherwise
- \(x_{5}\): 1 if living in Long Beach, 0 otherwise
- \(x_{6}\): 1 if living in Glendora, 0 otherwise
For someone living in Burbank, \(x_{4}=x_{5}=x_{6} =0\) so the model then is
\[Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \epsilon_{i}\]
For someone living in Lancaster, \(x_{4}=1, x_{5}=0, x_{6} =0\) so the model then is
\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \beta_{4}(1) + \epsilon_{i} \\ Y_{i} = (\beta_{0} + \beta_{4}) + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \epsilon_{i} \]
For someone living in Long Beach, \(x_{4}=0, x_{5}=1, x_{6} =0\) so the model then is
\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \beta_{5}(1) + \epsilon_{i} \\ Y_{i} = (\beta_{0} + \beta_{5}) + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \epsilon_{i} \]
and the model for someone living in Glendora \(x_{4}=0, x_{5}=0, x_{6} =1\) is
\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \beta_{6}(1) + \epsilon_{i} \\ Y_{i} = (\beta_{0} + \beta_{6}) + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{3i} + \epsilon_{i} \]
In summary, each area gets its own intercept, but all areas share a common slope for every other variable.
\[ y_{i.Burbank} = -2.25 - 0.023(age) + 0.10(ht) -0.64(female) \\ y_{i.Lancaster} = -2.22 - 0.023(age) + 0.10(ht) -0.64(female)\\ y_{i.Long.Beach} = -2.19 - 0.023(age) + 0.10(ht) -0.64(female) \\ y_{i.Glendora} = -2.13 - 0.023(age) + 0.10(ht) -0.64(female) \]
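These area-specific intercepts come straight from the coefficients: the overall intercept plus each area’s coefficient. A quick sketch; the results should match the four intercepts in the equations above.
b <- coef(lm(fev1 ~ age + ht + gender + area, data = fev_long))
# Burbank keeps the overall intercept; the others add their area coefficient
b["(Intercept)"] + c(Burbank = 0, b[c("areaLancaster", "areaLong Beach", "areaGlendora")])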
Let’s interpret the regression coefficients and their 95% confidence intervals from the main effects model again.
Characteristic | Beta | 95% CI¹ | p-value
---|---|---|---
age | -0.02 | -0.03, -0.01 | <0.001
ht | 0.10 | 0.08, 0.12 | <0.001
gender | | |
M | — | — |
F | -0.64 | -0.80, -0.49 | <0.001
area | | |
Burbank | — | — |
Lancaster | 0.03 | -0.14, 0.20 | 0.7
Long Beach | 0.06 | -0.14, 0.27 | 0.6
Glendora | 0.12 | -0.04, 0.28 | 0.14

¹ CI = Confidence Interval
- \(b_{4}\): After controlling for age, height and gender, those that live in Lancaster have 0.03 (-0.14, 0.20) higher FEV1 compared to someone living in Burbank (p=0.7).
- \(b_{5}\): After controlling for age, height and gender, those that live in Long Beach have 0.06 (-0.14, 0.27) higher FEV1 compared to someone living in Burbank (p=0.6).
- \(b_{6}\): After controlling for age, height and gender, those that live in Glendora have 0.12 (-0.04, 0.28) higher FEV1 compared to someone living in Burbank (p=0.14).
Beta coefficients for categorical variables are always interpreted as the difference between that particular level and the reference group.
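As a check, the estimates and confidence intervals in the table above can be pulled directly from the fitted model. (The formatted table itself resembles gtsummary output, but that is my assumption.)
main_model <- lm(fev1 ~ age + ht + gender + area, data = fev_long)
confint(main_model)
# gtsummary::tbl_regression(main_model)  # one way to build a formatted table like the one above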