11.4 Logistic Regression
Since the relationship between \(X\) and \(Y\) is no longer linear, a one unit increase in \(X_{p}\) is no longer associated with a \(b_{p}\) unit increase in \(Y\); instead, \(b_{p}\) represents the change in the log odds. The regression coefficients \(b_{p}\) from a logistic regression must be exponentiated before interpretation.
\[OR = e^{b}\]
The Odds Ratio (OR) provides a directly understandable statistic for the relationship between \(y\) and a specific \(x\), given that all other \(x\)’s in the model are held fixed. Section 11.4.4 provides a numeric example of how Odds Ratios are calculated.
In testing for a relationship between \(x\) and \(y\), our null hypothesis is that \(\beta_{p}=0\). Since \(e^{0}=1\), the reference value for the Odds Ratio that signifies no relationship is 1, not 0.
11.4.1 Interpreting Odds Ratios
Consider a binary outcome with values YES, coded as 1, and NO, coded as 0.
- OR = 1 = equal chance of the response variable being YES given any explanatory variable value. You are not able to predict participants’ responses by knowing their explanatory variable value. This would be a non-significant model when looking at the p-value for the explanatory variable in the parameter estimate table.
- OR > 1 = as the explanatory variable value increases, a YES response is more likely. For a binary explanatory variable: participants who answer YES (1) on the explanatory variable are more likely to have a YES (1) response.
- OR < 1 = as the explanatory variable value increases, a YES response is less likely. For a binary explanatory variable: participants who answer YES (1) on the explanatory variable are less likely to have a YES (1) response.
For a continuous variable \(X\) with slope coefficient \(\beta\), the quantity \(e^{\beta}\) is interpreted as the ratio of the odds for a person with value \(X+1\) relative to the odds for a person with value \(X\).
11.4.2 Confidence Intervals
Confidence intervals give a range for the population odds ratio based on the sample data. We are 95% confident that the true population odds ratio lies between those two values.
The OR is not a linear function of the \(x\)’s, but \(\beta\) is. This means that a CI for the OR is created by calculating a CI for \(\beta\) and then exponentiating the endpoints. A 95% CI for the OR is calculated as:
\[e^{\hat{\beta} \pm 1.96 SE_{\beta}} \]
This math holds for any \(k\) unit change in \(x\): a 95% CI for the OR associated with a \(k\) unit increase is \(e^{k\hat{\beta} \pm 1.96\,k\,SE_{\hat{\beta}}}\). The linearity of the confidence interval only applies on the untransformed scale of the \(\beta\)’s, NOT the odds ratio, so all arithmetic is done on the \(\beta\) scale before exponentiating.
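As a quick illustration, here is a minimal sketch with made-up numbers (the slope estimate and standard error below are hypothetical, not from any model in this chapter):
b  <- 0.5                        # hypothetical slope estimate
se <- 0.2                        # hypothetical standard error
exp(b)                           # point estimate of the OR
exp(b + c(-1, 1) * 1.96 * se)    # 95% CI for the OR: exponentiate the endpoints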
11.4.3 Example: The effect of gender on Depression
This uses a cleaned version of the depression data set from PMA6.
- Binary outcome variable: Symptoms of Depression (cases)
- Binary predictor variable: Gender (sex) as an indicator of being female
The outcome \(y\) is a 0/1 Bernoulli random variable. The sum of a vector of Bernoullis (\(\sum_{i=1}^{n}y_{i}\)) has a Binomial distribution. When we specify family = "binomial", the glm() function automatically uses the “logit” link function.
dep_sex_model <- glm(cases ~ sex, data=depress, family="binomial")
summary(dep_sex_model)
##
## Call:
## glm(formula = cases ~ sex, family = "binomial", data = depress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3125 0.3315 -6.976 3.04e-12 ***
## sex 1.0386 0.3767 2.757 0.00583 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.12 on 293 degrees of freedom
## Residual deviance: 259.40 on 292 degrees of freedom
## AIC: 263.4
##
## Number of Fisher Scoring iterations: 5
We exponentiate the coefficients to back-transform the \(\beta\) estimates into Odds Ratios.
exp(coef(dep_sex_model))
## (Intercept) sex
## 0.0990099 2.8251748
exp(confint(dep_sex_model))
## 2.5 % 97.5 %
## (Intercept) 0.04843014 0.1801265
## sex 1.39911056 6.2142384
Females have 2.8 (95% CI 1.4, 6.2) times the odds of showing signs of depression compared to males.
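As a check, we can apply the formula from Section 11.4.2 by hand using the estimate and standard error for sex from the summary output. Note that confint() on a glm object returns profile-likelihood intervals, so its endpoints differ slightly from this Wald-style interval.
exp(1.0386 + c(-1, 1) * 1.96 * 0.3767)   # Wald 95% CI for the OR for sex
# compare to the profile-likelihood interval (1.40, 6.21) shown above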
11.4.4 Multiple Logistic Regression
Let’s continue with the depression model, but now also include age and income as potential predictors of symptoms of depression.
mvmodel <- glm(cases ~ age + income + sex, data=depress, family="binomial")
summary(mvmodel)
##
## Call:
## glm(formula = cases ~ age + income + sex, family = "binomial",
## data = depress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67646 0.57881 -1.169 0.24253
## age -0.02096 0.00904 -2.318 0.02043 *
## income -0.03656 0.01409 -2.595 0.00946 **
## sex 0.92945 0.38582 2.409 0.01600 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.12 on 293 degrees of freedom
## Residual deviance: 247.54 on 290 degrees of freedom
## AIC: 255.54
##
## Number of Fisher Scoring iterations: 5
- The sign of the \(\beta\) coefficients can be interpreted in the same manner as in linear regression: positive coefficients indicate a positive relationship and negative coefficients a negative relationship, on the log odds scale.
- The odds of being depressed are lower if the respondent has a higher income or is older, and higher if the respondent is female.
The odds of a female being depressed are 2.53 times the odds for males, after adjusting for the effects of age and income (p=.016).
Example calculation for \(OR = e^{\beta}\):
The full model is: \[\log(odds) = -0.676 - 0.02096 \cdot age - 0.03656 \cdot income + 0.92945 \cdot sex\]
We want to calculate the Odds Ratio of depression for women compared to men. \[ OR = \frac{Odds(Y=1|F)}{Odds(Y=1|M)} \]
Write out the equations for men and women separately. \[ = \frac{e^{-0.676 - 0.02096 \cdot age - 0.03656 \cdot income + 0.92945(1)}} {e^{-0.676 - 0.02096 \cdot age - 0.03656 \cdot income + 0.92945(0)}}\]
Apply the rules of exponents to simplify. \[ = \frac{e^{-0.676}e^{-0.02096 \cdot age}e^{-0.03656 \cdot income}e^{0.92945(1)}} {e^{-0.676}e^{-0.02096 \cdot age}e^{-0.03656 \cdot income}e^{0.92945(0)}}\]
\[ = \frac{e^{0.92945(1)}}{e^{0.92945(0)}}\]
\[ = e^{0.92945} = 2.53 \]
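We can confirm this calculation in R using the fitted model object:
exp(coef(mvmodel)["sex"])   # e^0.92945, approximately 2.53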
11.4.5 Effect of a k unit change
- Sometimes a 1 unit change in a continuous variable is not meaningful.
- \(e^{k\beta}\) is the incremental odds ratio corresponding to an increase of \(k\) units in the variable \(X\), assuming that the values of all other \(X\) variables remain unchanged.
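The adjusted Odds Ratios and confidence intervals shown below can be produced directly from the model object. A minimal sketch using tbl_regression() from the gtsummary package (the same function used in the smoking example later in this section):
library(gtsummary)
tbl_regression(mvmodel, exponentiate = TRUE)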
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| age | 0.98 | 0.96, 1.00 | 0.020 |
| income | 0.96 | 0.94, 0.99 | 0.009 |
| sex | 2.53 | 1.23, 5.66 | 0.016 |

¹ OR = Odds Ratio, CI = Confidence Interval
- The Adjusted Odds Ratio (AOR) for an increase of 1 year of age is 0.98 (95% CI 0.96, 1.00).
- How about a 10 year increase in age? \(e^{10 \times \beta_{age}} = e^{-0.21} = 0.81\)
In generalized linear models, the assumption of normality applies to the \(\beta\)’s, not to the odds ratios. Refer back to the confidence interval formula in Section 11.4.2: the linearity of the relationship is on the \(\beta\) scale, not the odds ratio scale. Do all transformations on the \(\beta\)’s; the last step is the non-linear transformation of exponentiation.
So the mathematical order of operations is important! First use coef to extract the coefficients out of the model object (here named mvmodel), then pull out the 2nd item with coef(mvmodel)[2], then multiply it by 10, and lastly exponentiate this value.
Do the same to calculate the confidence interval for a 10 year time span.
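A sketch of those steps (confint() computes profile-likelihood intervals for glm objects, so it may take a moment to run):
exp(10 * coef(mvmodel)[2])        # point estimate: OR for a 10 year increase in age
exp(10 * confint(mvmodel)[2, ])   # 95% CI for the 10 year Odds Ratio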
Controlling for gender and income, an individual has 0.81 (95% CI 0.68, 0.97) times the odds of being depressed compared to someone who is 10 years younger than them.
11.4.6 Example: Predictors of smoking status
Consider a logistic model on smoking status (0 = never smoked, 1 = has smoked) using gender, income, and blood pressure class (bp_class) as predictors.
\[ logit(Y) = \beta_{0} + \beta_{1}\mbox{(female)} + \beta_{2}\mbox{(income)} + \beta_{3}\mbox{(Pre-HTN)} + \beta_{4}\mbox{(HTN-I)} + \beta_{5}\mbox{(HTN-II)} \]
bp.mod <- glm(smoke ~ female_c + income + bp_class, data=addhealth, family='binomial')
tbl_regression(bp.mod, exponentiate=TRUE)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| female_c | | | |
| Male | — | — | |
| Female | 0.54 | 0.46, 0.63 | <0.001 |
| income | 1.00 | 1.00, 1.00 | 0.005 |
| bp_class | | | |
| Normal | — | — | |
| Pre-HTN | 1.08 | 0.92, 1.26 | 0.4 |
| HTN-I | 0.98 | 0.79, 1.21 | 0.8 |
| HTN-II | 1.03 | 0.71, 1.50 | 0.9 |

¹ OR = Odds Ratio, CI = Confidence Interval
It is unlikely that blood pressure is associated with smoking status: none of the blood pressure groups is statistically significantly different from the reference group (all p-values are large). Let’s test that hypothesis formally using a Wald Test.
survey::regTermTest(bp.mod, "bp_class")
## Wald test for bp_class
## in glm(formula = smoke ~ female_c + income + bp_class, family = "binomial",
## data = addhealth)
## F = 0.428004 on 3 and 3723 df: p= 0.73294
The Wald Test has a large p-value of 0.73, so we conclude that blood pressure classification is not associated with smoking status.
- This means blood pressure classification should not be included in a model to explain smoking status.
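As a sketch of an alternative approach, the same hypothesis can be tested with a likelihood ratio test by comparing the fitted model against one with bp_class removed:
reduced <- update(bp.mod, . ~ . - bp_class)   # refit without bp_class
anova(reduced, bp.mod, test = "LRT")          # likelihood ratio test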
11.4.7 Model Fit
📖 PMA6 12.8
11.4.7.1 Pseudo \(R^{2}\)
Recall that in linear regression, the coefficient of determination \(R^{2}\) could be used as a measure of model fit, as it represents the proportion of variance in the outcome explained by the model. This statistic cannot be calculated for logistic regression, but several versions of a pseudo \(R^{2}\) have been proposed, and are printed as part of the output of many statistical packages such as Stata.
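For example, McFadden’s pseudo \(R^{2}\) (a sketch; R does not print this by default) compares the residual deviance of the fitted model to the deviance of the intercept-only model:
# McFadden's pseudo R-squared: 1 - (residual deviance / null deviance)
1 - (mvmodel$deviance / mvmodel$null.deviance)
# using the deviances printed in the summary above: 1 - 247.54/268.12, about 0.077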
11.4.7.2 Goodness of Fit
Another way to assess model fit for the entire model is to perform a goodness-of-fit test. The one developed by Hosmer and Lemeshow (1980) is the most commonly used GoF test. In this approach,
- the predicted probability of belonging to the event group (y=1) is calculated for every observation.
- these probabilities are sorted in ascending order, then divided into \(J\) equal sized subgroups, typically deciles, to create 10 ‘bins’.
- For each ‘bin’, or range of probabilities, count the observed number of individuals with the event (y=1) (\(O_{j}\)).
- For each ‘bin’, calculate the expected number of events by adding the predicted probabilities in that bin. (\(E_{j}\))
The goodness of fit statistic is then calculated as follows:
\[ GoF \chi^{2} = \sum_{j} \frac{(O_{j}-E_{j})^2}{E_{j}} \]
which approximately follows a \(\chi_{J-2}^{2}\) distribution under relatively large sample sizes.
A large GoF statistic (or small p-value) indicates that the fit may NOT be good.
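A minimal sketch of the events-only calculation described above, using the fitted mvmodel (the published Hosmer-Lemeshow statistic also includes a matching term for the non-events, so the result of HLgof.test below will differ somewhat):
p <- fitted(mvmodel)                                  # predicted probabilities
bins <- cut(p, breaks = quantile(p, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)                    # 10 decile 'bins'
O <- tapply(mvmodel$y, bins, sum)                     # observed events per bin
E <- tapply(p, bins, sum)                             # expected events per bin
GoF <- sum((O - E)^2 / E)                             # chi-squared GoF statistic
pchisq(GoF, df = 10 - 2, lower.tail = FALSE)          # approximate p-value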
This test is implemented in the R package MKmisc, function HLgof.test:
MKmisc::HLgof.test(fit = fitted(model), obs = model$y)
11.4.7.3 Accurate predictions
Since logistic regression models the probability of an event occurring, we can use the fitted model to create predicted probabilities. The next chapter discusses how to use a logistic regression model for prediction, and which measures of fit are used to assess how “accurate” the model is.