11.4 Logistic Regression

Since the link between \(X\) and \(Y\) is no longer linear, a one unit increase in \(X_{p}\) is no longer associated with a \(b_{p}\) unit increase in \(Y\). The regression coefficients \(b_{p}\) from a logistic regression must be exponentiated before they can be interpreted.

\[OR = e^{b}\]

The Odds Ratio (OR) provides a directly understandable statistic for the relationship between \(y\) and a specific \(x\), given that all other \(x\)'s in the model are held fixed. Section 11.4.4 provides a numeric example of how Odds Ratios are calculated.

In testing for a relationship between \(x\) and \(y\), our null hypothesis is that \(\beta_{p}=0\). Since \(e^{0}=1\), the reference value for the Odds Ratio that signifies no relationship is 1, not 0.

11.4.1 Interpreting Odds Ratios

Consider a binary outcome with values YES, coded as 1, and NO, coded as 0.

  • OR = 1: there is an equal chance of the response variable being YES at any value of the explanatory variable. You are not able to predict participants' responses by knowing their explanatory variable value. This corresponds to a non-significant p-value for the explanatory variable in the parameter estimate table.
  • OR > 1: as the explanatory variable value increases, a YES response becomes more likely. For a binary explanatory variable, a participant whose explanatory variable value is YES (1) is more likely to have a YES (1) response.
  • OR < 1: as the explanatory variable value increases, a YES response becomes less likely. For a binary explanatory variable, a participant whose explanatory variable value is YES (1) is less likely to have a YES (1) response.

For a continuous variable \(X\) with slope coefficient \(\beta\), the quantity \(e^{\beta}\) is interpreted as the ratio of the odds for a person with value \(X+1\) relative to the odds for a person with value \(X\).
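
To see why, take the ratio of the odds at \(X+1\) and at \(X\); the intercept term cancels:

\[ OR = \frac{Odds(Y=1|X+1)}{Odds(Y=1|X)} = \frac{e^{\beta_{0} + \beta(X+1)}}{e^{\beta_{0} + \beta X}} = e^{\beta} \]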

11.4.2 Confidence Intervals

Confidence intervals give a range of plausible values for the population odds ratio based on the sample data. We are 95% confident that the true population odds ratio lies between those two values.

The OR is not a linear function of the \(x's\), but \(\beta\) is. This means that a CI for the OR is created by calculating a CI for \(\beta\), and then exponentiating the endpoints. A 95% CI for the OR is calculated as:

\[e^{\hat{\beta} \pm 1.96 SE_{\beta}} \]

This math holds for any \(k\) unit change in \(x\): compute the interval on the \(\beta\) scale, then exponentiate. The linearity applies only on the untransformed scale of the \(\beta\)'s, NOT on the odds ratio scale.
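
Written out explicitly, a 95% CI for the OR corresponding to a \(k\) unit change in \(x\) is:

\[e^{k\hat{\beta} \pm 1.96 \, k \, SE_{\beta}} \]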

11.4.3 Example: The effect of gender on Depression

This uses a cleaned version of the depression data set from PMA6.

  • Binary outcome variable: Symptoms of Depression (cases)
  • Binary predictor variable: Gender (sex) as an indicator of being female

The outcome \(y\) is a 0/1 Bernoulli random variable. The sum of a vector of Bernoullis (\(\sum_{i=1}^{n}y_{i}\)) has a Binomial distribution. When we specify family = "binomial", the glm() function uses the "logit" link function by default.
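
As a quick side check (this snippet is ours, not from PMA6): family = "binomial" is shorthand for binomial(link = "logit"), which can be verified directly.

binomial()$link   # returns "logit", the default link for the binomial family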

dep_sex_model <- glm(cases ~ sex, data=depress, family="binomial")
summary(dep_sex_model)
## 
## Call:
## glm(formula = cases ~ sex, family = "binomial", data = depress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3125     0.3315  -6.976 3.04e-12 ***
## sex           1.0386     0.3767   2.757  0.00583 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.12  on 293  degrees of freedom
## Residual deviance: 259.40  on 292  degrees of freedom
## AIC: 263.4
## 
## Number of Fisher Scoring iterations: 5

We exponentiate the coefficients to back-transform the \(\beta\) estimates into Odds Ratios.

exp(coef(dep_sex_model))
## (Intercept)         sex 
##   0.0990099   2.8251748
exp(confint(dep_sex_model))
##                  2.5 %    97.5 %
## (Intercept) 0.04843014 0.1801265
## sex         1.39911056 6.2142384

Females have 2.8 (1.4, 6.2) times the odds of showing signs of depression compared to males.
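
As a sketch of the formula in 11.4.2 (this check is ours, not part of the original example), the Wald interval can be computed by hand from the estimate and standard error in the summary table and then exponentiated. Note that confint() on a glm uses profile likelihood, so the two intervals will be close but not identical.

b.sex  <- coef(summary(dep_sex_model))["sex", "Estimate"]     # slope estimate for sex
se.sex <- coef(summary(dep_sex_model))["sex", "Std. Error"]   # its standard error
exp(b.sex + c(-1, 1) * 1.96 * se.sex)                         # Wald 95% CI for the odds ratio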

11.4.4 Multiple Logistic Regression

Let’s continue with the depression model, but now also include age and income as potential predictors of symptoms of depression.

mvmodel <- glm(cases ~ age + income + sex, data=depress, family="binomial")
summary(mvmodel)
## 
## Call:
## glm(formula = cases ~ age + income + sex, family = "binomial", 
##     data = depress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.67646    0.57881  -1.169  0.24253   
## age         -0.02096    0.00904  -2.318  0.02043 * 
## income      -0.03656    0.01409  -2.595  0.00946 **
## sex          0.92945    0.38582   2.409  0.01600 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.12  on 293  degrees of freedom
## Residual deviance: 247.54  on 290  degrees of freedom
## AIC: 255.54
## 
## Number of Fisher Scoring iterations: 5
  • The sign of the \(\beta\) coefficients can be interpreted in the same manner as in linear regression, as indicating a positive or negative relationship.
  • The odds of being depressed are less if the respondent has a higher income and is older, and higher if the respondent is female.

The odds of a female being depressed are 2.53 times greater than the odds for males, after adjusting for the effects of age and income (p=.016).

Example calculation for \(OR = e^{\beta}\)

The full model is: \[log(odds) = -0.676 - 0.02096*age - .03656*income + 0.92945*gender\]

We want to calculate the Odds Ratio of depression for women compared to men. \[ OR = \frac{Odds (Y=1|F)}{Odds (Y=1|M)} \]

Write out the equations for men and women separately. \[ = \frac{e^{-0.676 - 0.02096*age - .03656*income + 0.92945(1)}} {e^{-0.676 - 0.02096*age - .03656*income + 0.92945(0)}}\]

Applying rules of exponents to simplify. \[ = \frac{e^{-0.676}e^{- 0.02096*age}e^{- .03656*income}e^{0.92945(1)}} {e^{-0.676}e^{- 0.02096*age}e^{- .03656*income}e^{0.92945(0)}}\]

\[ = \frac{e^{0.92945(1)}} {e^{0.92945(0)}}\]

\[ = e^{0.92945} \]

exp(.92945)
## [1] 2.533116
exp(coef(mvmodel)[4])
##      sex 
## 2.533112

11.4.5 Effect of a k unit change

  • Sometimes a 1 unit change in a continuous variable is not meaningful.
  • \(exp(kb)\) is the incremental odds ratio corresponding to an increase of \(k\) units in the variable X, assuming that the values of all other X variables remain unchanged.
tbl_regression(mvmodel, exponentiate=TRUE)
Characteristic     OR      95% CI        p-value
age                0.98    0.96, 1.00    0.020
income             0.96    0.94, 0.99    0.009
sex                2.53    1.23, 5.66    0.016
(OR = Odds Ratio, CI = Confidence Interval)
  • The Adjusted Odds Ratio (AOR) for an increase of 1 year of age is 0.98 (95% CI 0.96, 1.00).
  • How about a 10 year increase in age? \(e^{10*\beta_{age}} = e^{-.21} = .81\)

In generalized linear models, the assumption of an approximately normal distribution applies to the \(\hat{\beta}\) estimates, not to the odds ratios.

Refer back to the confidence interval formula in section 11.4.2: the linearity of the relationship holds on the \(\beta\) level, not on the odds ratio level. Do all transformations on the \(\beta\)'s first; the last step is the non-linear transformation of exponentiation.

So the mathematical order of operations is important! First use coef() to extract the coefficients out of the model object (here named mvmodel), then pull out the 2nd item with coef(mvmodel)[2], then multiply it by 10, and lastly exponentiate this value.

exp(10*coef(mvmodel)[2])
##       age 
## 0.8109285

Do the same to calculate the confidence interval for a 10 year time span.

round(exp(10*confint(mvmodel)[2,]),3)
##  2.5 % 97.5 % 
##  0.676  0.965

Controlling for gender and income, an individual has 0.81 (95% CI 0.68, 0.97) times the odds of being depressed compared to someone who is 10 years younger than them.

11.4.6 Example: Predictors of smoking status

Consider a logistic model on smoking status (0= never smoked, 1=has smoked) using gender, income, and blood pressure class (bp_class) as predictors.

\[ logit(Y) = \beta_{0} + \beta_{1}\mbox{(female)} + \beta_{2}\mbox{(income)} + \beta_{3}\mbox{(Pre-HTN)} + \beta_{4}\mbox{(HTN-I)} + \beta_{5}\mbox{(HTN-II)} \]

bp.mod <- glm(smoke ~ female_c + income + bp_class, data=addhealth, family='binomial')
tbl_regression(bp.mod, exponentiate=TRUE)
Characteristic     OR      95% CI        p-value
female_c
    Male
    Female         0.54    0.46, 0.63    <0.001
income             1.00    1.00, 1.00    0.005
bp_class
    Normal
    Pre-HTN        1.08    0.92, 1.26    0.4
    HTN-I          0.98    0.79, 1.21    0.8
    HTN-II         1.03    0.71, 1.50    0.9
(OR = Odds Ratio, CI = Confidence Interval)

It is unlikely that blood pressure class is associated with smoking status; none of the groups are statistically significantly different from the reference group (all p-values are large). Let's test that hypothesis formally using a Wald Test.

survey::regTermTest(bp.mod, "bp_class")
## Wald test for bp_class
##  in glm(formula = smoke ~ female_c + income + bp_class, family = "binomial", 
##     data = addhealth)
## F =  0.428004  on  3  and  3723  df: p= 0.73294

The Wald Test has a large p-value of 0.73, thus blood pressure classification is not associated with smoking status.

  • This means blood pressure classification should not be included in a model to explain smoking status.
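
As a cross-check (this alternative is ours, not part of the original example), a likelihood ratio test for dropping each term from the model can be obtained with drop1(); the conclusion for bp_class is the same.

drop1(bp.mod, test = "LRT")   # likelihood ratio test for removing each term, including bp_class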

11.4.7 Model Fit

📖 PMA6 12.8

11.4.7.1 Pseudo \(R^{2}\)

Recall that in linear regression, the coefficient of determination \(R^{2}\) could be used as a measure of model fit, as it represents the proportion of variance in the outcome explained by the model. This statistic can't be calculated for logistic regression, but several versions of a Pseudo \(R^{2}\) have been proposed, and these are printed as part of the output for many statistical packages such as Stata.
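
For example, one common version, McFadden's pseudo \(R^{2}\), can be computed by hand from the null and residual deviances stored in a fitted glm object. A minimal sketch using the mvmodel object from 11.4.4:

1 - (mvmodel$deviance / mvmodel$null.deviance)   # McFadden's pseudo R-squared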

11.4.7.2 Goodness of Fit

Another way to assess model fit for the entire model is to perform a goodness-of-fit test. The one developed by Hosmer and Lemeshow (1980) is the most commonly used GoF test. In this approach,

  1. the predicted probability of belonging to the event group (y=1) is calculated for every observation.
  2. these probabilities are sorted in ascending order, then divided into \(J\) equal-sized subgroups, typically deciles, to create 10 ‘bins’.
  3. For each ‘bin’, or range of probabilities, count the observed number of individuals with the event (y=1) (\(O_{j}\)).
  4. For each ‘bin’, calculate the expected number of events (\(E_{j}\)) by adding the predicted probabilities in that bin.

The goodness of fit statistic is then calculated as follows:

\[ GoF \chi^{2} = \sum_{j} \frac{(O_{j}-E_{j})^2}{E_{j}} \]

which approximately follows a \(\chi_{J-2}^{2}\) distribution under relatively large sample sizes.

A large GoF statistic (or small p-value) indicates that the fit may NOT be good.
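
To make the steps above concrete, here is an illustrative by-hand sketch (ours, not the recommended implementation) that computes the simplified statistic given above for the mvmodel object from 11.4.4; it assumes the decile cut points are distinct.

p.hat <- fitted(mvmodel)                                  # step 1: predicted probability for each observation
bins  <- cut(p.hat, quantile(p.hat, probs = seq(0, 1, 0.1)),
             include.lowest = TRUE)                       # step 2: sort into 10 decile 'bins'
O <- tapply(mvmodel$y, bins, sum)                         # step 3: observed events per bin
E <- tapply(p.hat, bins, sum)                             # step 4: expected events per bin
gof <- sum((O - E)^2 / E)                                 # simplified GoF chi-square statistic
pchisq(gof, df = length(O) - 2, lower.tail = FALSE)       # approximate p-value on J-2 df

In practice a packaged implementation such as the one below is preferable.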

R package: MKmisc, function HLgof.test.

MKmisc::HLgof.test(fit=fitted(model), obs=model$y)

11.4.7.3 Accurate predictions

Since the logistic regression is all about modeling the probability of an event occurring, we can use that model to create predicted probabilities. The next chapter discusses how to use a logistic regression model for prediction, and what measures of fit are used to assess how “accurate” the model is.
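
As a small sketch of that idea (using the mvmodel object from 11.4.4), the predicted probability of depression for each observation can be extracted with predict():

phat <- predict(mvmodel, type = "response")   # predicted P(case = 1) for each person
summary(phat)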