10.3 Logistic Regression

Since the link between \(X\) and \(Y\) is no longer linear, a one unit increase in \(X_{p}\) is no longer associated with a \(b_{p}\) increase in \(Y\). The regression coefficients \(b_{p}\) from a logistic regression must be exponentiated before interpretation.

\[OR = e^{b}\]

The Odds Ratio (OR) provides a directly understandable statistic for the relationship between \(y\) and a specific \(x\) given all other \(x\)’s in the model are fixed. A later example in 10.3.3 provides a numeric example of how Odds Ratios are calculated.

In testing for a relationship between \(x\) and \(y\), our hypothesis is that \(\beta_{p}=0\). Since \(e^{0}=1\), the reference value for the Odds Ratio that signifies no relationship, is 1, not 0.

10.3.1 Interpreting Odds Ratios

Consider a binary outcome with values YES, coded as 1, and NO, coded as 0.

  • OR = 1 = equal chance of response variable being YES given any explanatory variable value. You are not able to predict participants’ responses by knowing their explanatory variable value. This would be a non significant model when looking at the p-value for the explanatory variable in the parameter estimate table.
  • OR > 1 = as the explanatory variable value increases, the presence of a YES response is more likely. We can say that when a participant’s response to the explanatory variable is YES (1), they are more likely to have a response that is a YES (1).
  • OR <1 = as the explanatory variable value increases, the presence of a YES response is less likely. We can say that when a participant’s response to the explanatory variable is YES (1) they are less likely to have a response that is a YES (1).

For a continuous variable X with slope coefficient \(\beta\), the quantity \(e^{b}\) is interpreted as the ratio of the odds for a person with value (X+1) relative to the odds for a person with value X.

Confidence Intervals

Confidence intervals are a range for the population’s predicted odds ratio based on the sample data. We are 95% confident that any given population’s odds ratio would range between those two values.

The OR is not a linear function of the \(x's\), but \(\beta\) is. This means that a CI for the OR is created by calculating a CI for \(\beta\), and then exponentiating the endpoints. A 95% CI for the OR is calculated as:

\[e^{\hat{\beta} \pm 1.96 SE_{\beta}} \]

10.3.2 Example: The effect of gender on Depression

This uses a cleaned version of the depression data set from PMAS5.

  • Binary outcome variable: Symptoms of Depression (cases)
  • Binary predictor variable: Gender (sex) as an indicator of being female

The outcome \(y\) is a 0/1 Bernoulli random variable. The sum of a vector of Bernoulli’s (\(\sum_{i=1}^{n}y_{i}\)) has a Binomial distribution. When we specify that family = "binomial" the glm() function auto-assigns “logit” link function.

We exponentiate the coefficients to back transform the \(\beta\) estimates into Odds Ratios

Females have 2.8 (1.4, 6.2) times the odds of showing signs of depression compared to males.

10.3.3 Multiple Logistic Regression

Let’s continue with the depression model, but now also include age and income as potential predictors of symptoms of depression.

  • The sign of the \(\beta\) coefficients can be interpreted in the same manner as with linear regression as having a positive or negative relationship.
  • The odds of being depressed are less if the respondent has a higher income and is older, and higher if the respondent is female.

The odds of a female being depressed are 2.53 times greater than the odds for Males after adjusting for the effects of age and income (p=.016).

Example calculation for \(OR = e^{\beta}\)

The full model is: \[log(odds) = -0.676 - 0.02096*age - .03656*income + 0.92945*gender\]

We want to calculate the Odds Ratio of depression for women compared to men. \[ OR = \frac{Odds (Y=1|F)}{Odds (Y=1|M)} \]

Write out the equations for men and women separately. \[ = \frac{e^{-0.676 - 0.02096*age - .03656*income + 0.92945(1)}} {e^{-0.676 - 0.02096*age - .03656*income + 0.92945(0)}}\]

Applying rules of exponents to simplify. \[ = \frac{e^{-0.676}e^{- 0.02096*age}e^{- .03656*income}e^{0.92945(1)}} {e^{-0.676}e^{- 0.02096*age}e^{- .03656*income}e^{0.92945(0)}}\]

\[ = \frac{e^{0.92945(1)}} {e^{0.92945(0)}}\]

\[ = e^{0.92945} \]

10.3.4 Effect of a k unit change

  • Sometimes a 1 unit change in a continuous variable is not meaningful.
  • \(exp(kb)\) is the incremental odds ratio corresponding to an increase of \(k\) units in the variable X, assuming that the values of all other X variables remain unchanged.
  • The Adjusted odds ratio (AOR) for increase of 1 year of age is 0.98 (95%CI .96, 1.0)
  • How about a 10 year increase in age? \(e^{10*\beta_{age}} = e^{-.21} = .81\)

with a confidence interval of

Controlling for gender and income, an individual has 0.81 (95% CI 0.68, 0.97) times the odds of being depressed compared to someone who is 10 years younger than them.

10.3.5 Example: Predictors of smoking status

Consider a logistic model on smoking status (0= never smoked, 1=has smoked) using gender, income, and blood pressure class (bp_class) as predictors.

\[ logit(Y) = \beta_{0} + \beta_{1}\mbox{(female)} + \beta_{2}\mbox{(income)} + \beta_{3}\mbox{(Pre-HTN)} + \beta_{4}\mbox{(HTN-I)} + \beta_{5}\mbox{(HTN-II)} \]

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.046 0.1064 9.836 7.881e-23
female_cFemale -0.6182 0.07617 -8.117 4.798e-16
income -3.929e-06 1.411e-06 -2.785 0.005346
bp_classPre-HTN 0.07289 0.08206 0.8882 0.3745
bp_classHTN-I -0.02072 0.1093 -0.1895 0.8497
bp_classHTN-II 0.02736 0.1888 0.1449 0.8848

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 4853 on 3728 degrees of freedom
Residual deviance: 4769 on 3723 degrees of freedom

It is unlikely that blood pressure is associated with smoking status, all groups are not statistically significantly different from the reference group (all p-values are large). Let’s test that hypothesis formally using a Wald Test.

The Wald Test has a large p-value of 0.73, thus blood pressure classification is not associated with smoking status.

  • This means blood pressure classification should not be included in a model to explain smoking status.

10.3.6 Reporting

The following code can be used to create a nice table of odds ratios and their confidence intervals.

95% CI
Female 0.54 0.46 0.63 <0.001
Income 1.00 1.00 1.00 0.005
Pre-HTN vs Normal 1.08 0.92 1.26 0.374
HTN-I vs Normal 0.98 0.79 1.21 0.850
HTN-II vs Normal 1.03 0.71 1.50 0.885

10.3.7 Model Fit

📖 PMA6 12.8 Pseudo \(R^{2}\)

Recall in linear regression, the coefficient of determination \(R^{2}\) could be used as a measure of model fit, as it represents the proportion of variance in the outcome explained by the model. This statistic can’t be calculated for Logistic regression, but several versions of a Pseudo \(R^{2}\) have been proposed, and are printed as part of the output for many statistical packages such as Stata. Goodness of Fit

Another way to assess model fit for the entire model is to perform a goodness-of-fit test. The one developed by Hosmer and Lemeshow (1980) is the most commonly used GoF test. In this approach,

  1. the predicted probability of belonging to the event group (y=1) is calculated for every observation.
  2. these probabilities are sorted in ascending order, then divided into \(j\) equal sized subgroups, typically deciles to create 10 ‘bins’.
  3. For each ‘bin’, or range of probabilities, count the observed number of individuals with the event (y=1) (\(O_{j}\)).
  4. For each ‘bin’, calculate the expected number of events by adding the predicted probabilities in that bin. (\(E_{j}\))

The goodness of fit statistic is then calculated as follows:

\[ GoF \chi^{2} = \sum_{j} \frac{(O_{j}-E_{j})^2}{E_{j}} \]

which approximately follows a \(\chi_{J-2}^{2}\) distribution under relatively large sample sizes.

A large GoF statistic (or small p-value) indicate that the fit may NOT be good

R package: MKmisc, function HLgof.test.

Mkmisc::HLgof.test(fit=fitted(model), obs=model$y) Accurate predictions

Since the logistic regression is all about modeling the probability of an event occurring, we can use that model to create predicted probabilities. The next chapter discusses how to use a logistic regression model for prediction, and what measures of fit are used to assess how “accurate” the model is.