9.4 Interpreting Coefficients

Similar to simple linear regression, each \(\beta_{j}\) coefficient is considered a slope. That is, the amount \(Y\) will change for every 1 unit increase in \(X_{j}\). In a multiple variable regression model, \(\b_{j}\) is the estimated change in \(Y\) after controlling for other predictors in the model.

9.4.1 Continuous predictors

mlr.dad.model <- lm(FFEV1 ~ FAGE + FHEIGHT, data=fev)
summary(mlr.dad.model)
## 
## Call:
## lm(formula = FFEV1 ~ FAGE + FHEIGHT, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34708 -0.34142  0.00917  0.37174  1.41853 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.760747   1.137746  -2.427   0.0165 *  
## FAGE        -0.026639   0.006369  -4.183 4.93e-05 ***
## FHEIGHT      0.114397   0.015789   7.245 2.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 147 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.3247 
## F-statistic: 36.81 on 2 and 147 DF,  p-value: 1.094e-13
confint(mlr.dad.model)
##                   2.5 %      97.5 %
## (Intercept) -5.00919751 -0.51229620
## FAGE        -0.03922545 -0.01405323
## FHEIGHT      0.08319434  0.14559974
  • Holding height constant, a father who is one year older is expected to have a FEV value 0.03 (0.01, 0.04) liters less than another man (p<.0001).
  • Holding age constant, a father who is 1cm taller than another man is expected to have a FEV value of 0.11 (.08, 0.15) liter greater than the other man (p<.0001).

For the model that includes age, the coefficient for height is now 0.11, which is interpreted as the rate of change of FEV1 as a function of height after adjusting for age. This is also called the partial regression coefficient of FEV1 on height after adjusting for age.

9.4.2 Binary predictors

Binary predictors (categorical variables with only 2 levels) get converted to a numeric binary indicator variable which only has the values 0 and 1. Whichever level gets assigned to be 0 is called the reference group or level. The regression estimate \(b\) then is the effect of being in group (\(x=1\)) compared to being in the reference (\(x=0\)) group.

Does gender also play a roll in FEV? Let’s look at how gender may impact or change the relationship between FEV and either height or age.

Note, the fev data set is in wide form right now, with different columns for mothers and fathers. First I need to reshape the data into long format, so gender is it’s own variable.

# a pivot_longer() probably would have worked here as well
fev_long <- data.frame(gender = c(fev$FSEX, fev$MSEX), 
                   fev1 = c(fev$FFEV1, fev$MFEV1), 
                   ht = c(fev$FHEIGHT, fev$MHEIGHT), 
                   age = c(fev$FAGE, fev$MAGE), 
                   area = c(fev$AREA, fev$AREA))
fev_long$gender <- factor(fev_long$gender, labels=c("M", "F"))
fev_long$area   <- factor(fev_long$area, labels=c("Burbank", "Lancaster", "Long Beach", "Glendora"))

So the model being fit looks like:

\[ y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \epsilon_{i}\]

where

  • \(x_{1}\): Age
  • \(x_{2}\): height
  • \(x_{3}\): 0 if Male, 1 if Female
lm(fev1 ~ age + ht + gender, data=fev_long)
## 
## Call:
## lm(formula = fev1 ~ age + ht + gender, data = fev_long)
## 
## Coefficients:
## (Intercept)          age           ht      genderF  
##    -2.24051     -0.02354      0.10509     -0.63775

In this model gender is a binary categorical variable, with reference group “Male”. This is detected because the variable that shows up in the regression model output is genderF. So the estimate shown is for males, compared to females.

Note that I DID NOT have to convert the categorical variable gender to a binary numeric variable before fitting it into the model. R (and any other software program) will do this for you already.

The regression equation for the model with gender is

\[ y = -2.24 - 0.02 age + 0.11 height - 0.64genderF \]

  • \(b_{0}:\) For a male who is 0 years old and 0 cm tall, their FEV is -2.24L.
  • \(b_{1}:\) For every additional year older an individual is, their FEV1 decreases by 0.02L.
  • \(b_{2}:\) For every additional cm taller an individual is, their FEV1 increases by 0.16L.
  • \(b_{3}:\) Females have 0.64L lower FEV compared to males.

Note: The interpretation of categorical variables still falls under the template language of “for every one unit increase in \(X_{p}\), \(Y\) changes by \(b_{p}\)”. Here, \(X_{3}=0\) for males, and 1 for females. So a 1 “unit” change is females compared to males.

9.4.3 Categorical Predictors

Let’s continue to model the FEV for individuals living in Southern California, but now we also consider the effect of city they live in. For those unfamiliar with the region, these cities represent very different environmental profiles.

table(fev_long$area)
## 
##    Burbank  Lancaster Long Beach   Glendora 
##         48         98         38        116

Let’s fit a model with area, notice again I do not do anything to the variable area itself aside from add it into the model.

lm(fev1 ~ age + ht + gender + area, data=fev_long) |> summary()
## 
## Call:
## lm(formula = fev1 ~ age + ht + gender + area, data = fev_long)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32809 -0.29573  0.00578  0.31588  1.37041 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.250564   0.752414  -2.991  0.00302 ** 
## age            -0.022801   0.004151  -5.493 8.59e-08 ***
## ht              0.103866   0.010555   9.841  < 2e-16 ***
## genderF        -0.642168   0.078400  -8.191 8.10e-15 ***
## areaLancaster   0.031549   0.084980   0.371  0.71072    
## areaLong Beach  0.061963   0.104057   0.595  0.55199    
## areaGlendora    0.121589   0.082097   1.481  0.13967    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4777 on 293 degrees of freedom
## Multiple R-squared:  0.6529, Adjusted R-squared:  0.6458 
## F-statistic: 91.86 on 6 and 293 DF,  p-value: < 2.2e-16

Examine the coefficient names, areaLancaster, areaLong Beach and areaGlendora. Again R automatically take a categorical variable and turn it into a series of binary indicator variables where a 1 indicates if a person is from that area. Notice how someone from Burbank has 0’s for all of the three indicator variables, someone from Lancaster only has a 1 in the areaLancaster variable and 0 otherwise. And etc for each other area.

areaLancaster areaLong.Beach areaGlendora area
1 0 0 0 Burbank
51 1 0 0 Lancaster
75 0 1 0 Long Beach
101 0 0 1 Glendora
  • Most commonly known as “Dummy coding”. Not an informative term to use.
  • Better used term: Indicator variable
  • Math notation: I(gender == “Female”).
  • A.k.a “reference coding” or “one hot encoding”
  • For a nominal X with K categories, define K indicator variables.
    • Choose a reference (referent) category:
    • Leave it out
    • Use remaining K-1 in the regression.
    • Often, the largest category is chosen as the reference category.

Interpreting the regression coefficients are going to be compared to the reference group. In this case, it is the Burbank area. Why Burbank? Because that is what R sees as the first level. If you want something different, you need to change the factor ordering.

levels(fev_long$area)
## [1] "Burbank"    "Lancaster"  "Long Beach" "Glendora"

The mathematical model is now written as follows,

\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \beta_{4}x_{4i} + \beta_{5}x_{5i} +\beta_{6}x_{6i}\epsilon_{i}\]

where

  • \(x_{1}\): Age
  • \(x_{2}\): height
  • \(x_{3}\): 0 if Male, 1 if Female
  • \(x_{4}\): 1 if living in Lancaster, 0 otherwise
  • \(x_{5}\): 1 if living in Long Beach, 0 otherwise
  • \(x_{6}\): 1 if living in Glendora, 0 otherwise

For someone living in Burbank, \(x_{4}=x_{5}=x_{6} =0\) so the model then is

\[Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \epsilon_{i}\]

For someone living in Lancaster, \(x_{4}=1, x_{5}=0, x_{6} =0\) so the model then is

\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \beta_{4}(1) \\ Y_{i} \sim (\beta_{0} + \beta_{4}) + \beta_{1}x_{i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} \epsilon_{i} \]

For someone living in Long Beach, \(x_{4}=0, x_{5}=1, x_{6} =0\) so the model then is

\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \beta_{5}(1) \\ Y_{i} \sim (\beta_{0} + \beta_{5}) + \beta_{1}x_{i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} \epsilon_{i} \]

and the model for someone living in Glendora \(x_{4}=0, x_{5}=0, x_{6} =1\) is

\[ Y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} + \beta_{6}(1) \\ Y_{i} \sim (\beta_{0} + \beta_{6}) + \beta_{1}x_{i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} \epsilon_{i} \]

In summary, each area gets it’s own intercept, but still has a common slope for all other variables.

\[ y_{i.Burbank} = -2.25 - 0.023(age) + 0.10(ht) -0.64(female) \\ y_{i.Lancaster} = -2.22 - 0.023(age) + 0.10(ht) -0.64(female)\\ y_{i.Long.Beach} = -2.19 - 0.023(age) + 0.10(ht) -0.64(female) \\ y_{i.Glendora} = -2.13 - 0.023(age) + 0.10(ht) -0.64(female) \]

Let’s look interpret the regression coefficients and their 95% confidence intervals from the main effects model again.

lm(fev1 ~ age + ht + gender + area, data=fev_long) |> tbl_regression()
Characteristic Beta 95% CI1 p-value
age -0.02 -0.03, -0.01 <0.001
ht 0.10 0.08, 0.12 <0.001
gender
    M
    F -0.64 -0.80, -0.49 <0.001
area
    Burbank
    Lancaster 0.03 -0.14, 0.20 0.7
    Long Beach 0.06 -0.14, 0.27 0.6
    Glendora 0.12 -0.04, 0.28 0.14
1 CI = Confidence Interval
  • \(b_{4}\): After controlling for age, height and gender, those that live in Lancaster have 0.03 (-0.14, 0.20) higher FEV1 compared to someone living in Burbank (p=0.7).
  • \(b_{5}\): After controlling for age, height and gender, those that live in Long Beach have 0.06 (-0.14, 0.27) higher FEV1 compared to someone living in Burbank (p=0.6).
  • \(b_{6}\): After controlling for age, height and gender, those that live in Glendora have 0.12 (-0.04, 0.28) higher FEV1 compared to someone living in Burbank (p=0.14).

Beta coefficients for categorical variables are always interpreted as the difference between that particular level and the reference group