Princeton University
2024-02-19
Nuts and bolts of logistic regression
Motivating example: Weight loss, gender, BMI
Everything so far has included a continuous outcome
Model \(y\) as a linear function of predictors
\[ Y_i = b_0 + b_1 x_i + e_i \]
\[ Y \sim \operatorname{Normal}(\mu, \sigma) \]
The simplest kind of categorical data is binary: each case takes one of two values
Note
We usually code these as 0s and 1s and model the probability of 1 occurring
Linear models are not appropriate for this type of data
Makes impossible predictions (predicted values below 0 or above 1)
Non-normality of residuals
Heteroskedasticity
Regression models with non-normal response likelihoods
\[\begin{align*} Y_i & \sim \mathrm{Dist}(\mu_i, \tau) \\ g(\mu_i) & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots \end{align*}\]
Where:
\(\mu_i\) is the expected value of \(Y_i\) (the model for \(y\) is expressed through \(\mu\))
\(g\) is the link function (in logistic regression we use the logit link)
We assume that there is some probability p of responding correctly (or selecting what is coded as 1)
We don't know this probability, so we want to use the data to estimate it
Distribution of a single, discrete binary outcome \[y \sim \operatorname{Bernoulli}(p)\]
A Bernoulli distribution generates a 1 ("success") with probability \(p\)
And a 0 ("failure") with probability \(1 - p = q\)
Distribution of discrete counts of independent outcomes over a certain set of trials
\[y \sim \operatorname{Binomial}(N, p)\]
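In R, draws from both distributions come from rbinom() (a Bernoulli is just a binomial with size = 1); a minimal sketch:

set.seed(1)
rbinom(5, size = 1, prob = 0.7)   # five Bernoulli (0/1) draws
rbinom(1, size = 10, prob = 0.7)  # one binomial count of successes in 10 trials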
In R we use the glm() function as opposed to lm()
library(fivethirtyeight)
library(dplyr)
library(knitr)

bechdel <- fivethirtyeight::bechdel

# binary is a character vector ("FAIL"/"PASS"), so convert it to a factor
fit <- glm(factor(binary) ~ 1 + budget_2013,
           family = binomial(link = "logit"),
           data = bechdel)

bechdel %>% dplyr::select(title, binary) %>% head() %>% kable()
| title | binary |
|---|---|
| 21 & Over | FAIL |
| Dredd 3D | PASS |
| 12 Years a Slave | FAIL |
| 2 Guns | FAIL |
| 42 | FAIL |
| 47 Ronin | FAIL |
\[\begin{equation*} \log\left(\frac{p}{1 - p}\right)=b_0+b_1X \end{equation*}\]
\[\begin{equation*} p(\mathrm{logit})=\frac{e^{\mathrm{logit}}}{1+e^{\mathrm{logit}}}=\frac{\exp(\mathrm{logit})}{1+\exp(\mathrm{logit})} \end{equation*}\]
Step 1
Transform probability into odds
\[\textrm{Odds} = \frac{\# \textrm{successes}}{\# \textrm{failures}}= \frac{\# \textrm{successes}/n}{\# \textrm{failures}/n}= \frac{p}{1-p}\]
Step 2
When we convert a probability to odds, the odds are always > 0
Asymmetric
Curvy
Solution: take log of the odds
\[\mathrm{logit} = \log\left(\frac{p}{1-p}\right)\]
This log odds conversion has a nice property
It converts odds of less than one to negative numbers, because the log of a number between 0 and 1 is always negative
Our data is now linear!
and symmetric
We need a function that scrunches the output of the linear predictor into a range appropriate for this parameter
A squiggle (logistic or Sigmoid curve) to the rescue!
\[p(\mathrm{logit})=\frac{e^{\mathrm{logit}}}{1+e^{\mathrm{logit}}}\]
We keep rotating the log-odds line, projecting the data onto it, and transforming the results into probabilities
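In R both transformations are built in: qlogis() is the logit and plogis() is its inverse; a minimal sketch:

p <- 0.75
qlogis(p)          # log odds: log(0.75 / 0.25) ≈ 1.099
plogis(qlogis(p))  # back to the original probability: 0.75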
These data are a sample of 500 teens from data collected in 2009 through the U.S. Youth Risk Behavior Surveillance System (YRBSS)
Are the odds that young females report trying to lose weight greater than the odds that males do? Is increasing BMI associated with an interest in losing weight, regardless of sex?
DV: lose.wt.01, which is coded 1 when someone indicated they were trying to lose weight and 0 when they were not. This is a dichotomized version of the lose.wt factor variable.
Our two predictor variables will be:
sex and its related dummy variable female
bmipct, which is the percentile for a given BMI among members of the same sex

[Plots: lose.wt.01 by sex; bmipct faceted by lose.wt; bmipct faceted by both lose.wt and sex]
\[ \begin{align*} \text{lose.wt.01}_i & \sim \operatorname{Binomial}(n = 1, p_i)\\ \text{logit}(p_i) & = \beta_0 \end{align*} \]
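A sketch of fitting this intercept-only model; the data-frame name yrbss is assumed (the source does not name it), and fit1 follows the fit2–fit5 naming used later:

# fit1: intercept-only logistic regression
# (yrbss is a hypothetical name for the 500-teen YRBSS sample)
fit1 <- glm(lose.wt.01 ~ 1,
            family = binomial(link = "logit"),
            data = yrbss)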
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.2120267 | 0.0953418 | -2.22386 | 0.0261579 |
Hypotheses:
\(H_0\): \(\beta_j\) = 0
\(H_a\): \(\beta_j \neq 0\)
Test Statistic: \[\text{Wald z} = \frac{\hat{\beta}_j - 0}{SE_{\hat{\beta}_j}}\]
CIs: \[\Large{\hat{\beta}_j \pm z^* SE_{\hat{\beta}_j}}\]
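Checking the arithmetic against the intercept row above:

# Wald z for the intercept: estimate / standard error
-0.2120267 / 0.0953418  # ≈ -2.224, matching the statistic column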
The intercept \(\beta_0\) is in the log-odds metric
We can convert it to a probability with the plogis() function
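For example, applying it to the intercept estimate from the table above:

plogis(-0.2120267)  # predicted probability of trying to lose weight ≈ 0.447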
Next we add the predictor female, which gives us the model

\[ \begin{align*} \text{lose.wt.01}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \text{logit}(p_i) & = \beta_0 + \beta_1 \text{female}_i, \end{align*} \]

where female is a 0/1 dummy variable.
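A sketch of the corresponding fit, matching the fit2 name used later (yrbss again assumed):

# fit2: add the dummy-coded sex predictor
fit2 <- glm(lose.wt.01 ~ 1 + female,
            family = binomial(link = "logit"),
            data = yrbss)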
If X is a categorical predictor, we can take the exp() of the log odds and get an odds ratio (OR)
The odds of Y for group X (females) are expected to be exp(\(b_1\)) times the odds of Y for the baseline group
Greater than 1: One unit increase, the event is more likely to occur
Less than 1: One unit increase, the event is less likely to occur
Equals 1: One unit increase, the likelihood of the event does not change
We can describe the OR in terms of a percentage increase/decrease: (OR - 1) × 100%
The odds of wanting to lose weight for females are 2.98 times the odds for males (a 198% increase)
emmeans
Can use emmeans to get probabilities directly from the model
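A sketch of calls that produce output like the following:

library(emmeans)

# estimated marginal probabilities for each level of female
emmeans(fit2, ~ female, type = "response")

# odds-ratio contrast between the two groups
contrast(emmeans(fit2, ~ female), "revpairwise", type = "response")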
female prob SE df asymp.LCL asymp.UCL
0 0.320 0.0307 Inf 0.263 0.383
1 0.584 0.0337 Inf 0.517 0.648
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
contrast odds.ratio SE df null z.ratio p.value
female1 / female0 2.98 0.589 Inf 1 5.520 <.0001
Tests are performed on the log odds ratio scale
We can also calculate differences in probabilities
Marginal effects
Change in probability that the outcome occurs for a 1 unit change in the predictor
# can also do this with emmeans, but it requires more steps
library(marginaleffects)

# average marginal effect: difference in predicted probabilities
avg_comparisons(fit2, comparison = "difference") %>%
  kable()
| term | contrast | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|---|
| female | 1 - 0 | 0.2637658 | 0.0455817 | 5.786657 | 0 | 27.05335 | 0.1744273 | 0.3531044 |
Now we add our continuous predictor bmipct, which gives us the model:
\[ \begin{align*} \text{lose.wt.01}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \text{logit}(p_i) & = \beta_0 + \beta_1 \text{bmipct}_i, \end{align*} \]
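A sketch of this fit, matching the fit3 name used below (yrbss assumed):

# fit3: logistic regression of weight-loss intent on BMI percentile
fit3 <- glm(lose.wt.01 ~ 1 + bmipct,
            family = binomial(link = "logit"),
            data = yrbss)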
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2.5685641 | 0.3138083 | -8.185138 | 0 |
| bmipct | 0.0355475 | 0.0042828 | 8.300008 | 0 |
What is the OR for \(b_1\)? How do we interpret this?
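One way to answer: exponentiate the bmipct slope from the table above.

exp(0.0355475)  # OR ≈ 1.036: each one-point increase in BMI percentile
                # multiplies the odds of wanting to lose weight by ~1.04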
With continuous variables, obtaining the marginal effects is a bit trickier
Instantaneous rate of change in the predicted probability
# the slope is computed at each observed value in the dataset, then averaged
avg_slopes(fit3) %>% kable()
| term | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|
| bmipct | 0.0071934 | 0.0005805 | 12.39111 | 0 | 114.7217 | 0.0060556 | 0.0083312 |
Tip
A one-unit increase in BMI percentile is associated with about a 0.7 percentage-point increase in the probability of wanting to lose weight
Now we include both female and bmipct as predictors:

\[ \begin{align*} \text{lose.wt.01}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \text{logit}(p_i) & = \beta_0 + \beta_1 \text{female}_i + \beta_2 \text{bmipct}_i. \end{align*} \]
We can also add their interaction:

\[ \begin{align*} \text{lose.wt.01}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \text{logit}(p_i) & = \beta_0 + \beta_1 \text{female}_i + \beta_2 \text{bmipct}_i + \beta_3 \text{female}_i \cdot \text{bmipct}_i. \end{align*} \]
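A sketch of fitting both, matching the fit4 and fit5 names used in the comparison below (yrbss assumed):

# fit4: additive model
fit4 <- glm(lose.wt.01 ~ 1 + female + bmipct,
            family = binomial(link = "logit"), data = yrbss)

# fit5: interaction model (female * bmipct expands to both main
# effects plus their interaction)
fit5 <- glm(lose.wt.01 ~ 1 + female * bmipct,
            family = binomial(link = "logit"), data = yrbss)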
The interaction term female:bmipct (i.e., \(\beta_3\)) would not be statistically significant:

| Name | Model | df | df_diff | Chi2 | p |
|---|---|---|---|---|---|
| fit4 | glm | 3 | NA | NA | NA |
| fit5 | glm | 4 | 1 | 0.4976291 | 0.4805437 |
Comparing fit4 with the interaction model fit5, it appears that the interaction term does not add anything.

\[R^2 = \frac{SSM}{SST} = \frac{SST-SSE}{SST}=1-\frac{SSE}{SST}\]

\[R_{MF}^2 = 1-\frac{D_{\rm residual}}{D_{\rm null}}\]
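A sketch of computing McFadden's pseudo-R² with the performance package (part of easystats), which prints output like the following (fit4 assumed):

library(performance)
r2_mcfadden(fit4)  # McFadden's R2 and its adjusted version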
# R2 for Generalized Linear Regression
R2: 0.243
adj. R2: 0.240
Binned residuals
Look for patterns
If bins have average residuals with large magnitude
Look at averages of other predictors across bins
Interaction may be required
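A sketch of producing this check with the performance package (assuming the additive model fit4):

library(performance)
# plots average residuals within bins of fitted values, with error bounds
plot(binned_residuals(fit4))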
Outcome is dichotomous
Observations are independent
Linearity (in the log odds)
No normality or homogeneity of variance assumptions
Can plot with easystats :)
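For example, check_model() from the performance package (part of easystats) draws the standard diagnostic panels; a minimal sketch:

library(performance)
check_model(fit4)  # binned residuals, influential observations, and more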
Tip
We fitted a logistic model with a logit link (estimated using ML) with sex and bmipct predicting desire to lose weight (formula: lose.wt.01 ~ 1 + female + bmipct). The model's explanatory power is substantial (Tjur's \(R^2\) = 0.30). The probability of wanting to lose weight was 33 percentage points higher for females, b = 1.86, 95% CI [1.37, 2.38], p < .001; OR = 6.42, 95% CI [3.93, 10.9]. An increase in BMI percentile was associated with an increased probability of wanting to lose weight, b = 0.05, 95% CI [0.04, 0.06], p < .001, OR = 1.05, 95% CI [1.04, 1.06]. Looking at the average marginal effect, a one-unit increase in BMI percentile was associated with a 0.8 percentage-point increase in the probability of wanting to lose weight.
PSY 504: Advanced Statistics