Princeton University
2024-01-31
The nuts and bolts of multilevel models
How to do it (Monday)
Simpson’s paradox
The world we live in is highly interdependent!
An elaboration on regression
A technique that allows us to deal with non-independence between data points (i.e., clustered/nested data)
Explicit partitioning of the variance
Within (intra-group differences)
Between (inter-group differences)
Clustering = Nesting = Grouping = Hierarchies
Key idea: sampling occurs along more than one dimension simultaneously (e.g., participants and words)
“Nested” designs
Repeated-measures and longitudinal designs
Any complex mixed design
For now we will focus on data with two levels:
Crossed designs (sometimes called cross-classified)
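For concreteness, here is a minimal sketch (in Python, with invented column names `score`, `student`, and `school`) of what two-level nested data look like in long format: level-1 observations grouped within level-2 clusters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(504)

# Two-level structure: 20 students (level 1) nested within each of 10 schools (level 2).
n_schools, n_students = 10, 20
school_effect = rng.normal(0, 2, size=n_schools)   # between-school (level-2) variation

rows = []
for j in range(n_schools):
    for i in range(n_students):
        rows.append({
            "school": j,                            # level-2 grouping (clustering) variable
            "student": f"s{j}-{i}",                 # level-1 unit, nested within a school
            "score": 50 + school_effect[j] + rng.normal(0, 5),  # plus within-school noise
        })

df = pd.DataFrame(rows)
print(df.head())
```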
Radon is a carcinogen – a naturally occurring radioactive gas whose decay products are also radioactive – known to cause lung cancer in high concentrations. The EPA sampled more than 80,000 homes across the U.S. Each house came from a randomly selected county and measurements were made on each level of each home. Uranium measurements at the county level were included to improve the radon estimates.
Classic analysis:
Drawback of classic analysis:
MLM Approach:
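As a rough sketch (not the actual EPA analysis), the MLM approach might be fit in Python with statsmodels as below; the simulated data frame and its column names (`log_radon`, `floor`, `log_uranium`, `county`) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the radon data (hypothetical variable names).
rng = np.random.default_rng(1)
n_counties, n_houses = 30, 15
log_uranium = rng.normal(0, 1, n_counties)                       # county-level predictor
county_intercept = 1.5 + 0.7 * log_uranium + rng.normal(0, 0.3, n_counties)

rows = []
for j in range(n_counties):
    for _ in range(n_houses):
        floor = int(rng.integers(0, 2))                          # 0 = basement, 1 = first floor
        rows.append({
            "county": j,
            "log_uranium": log_uranium[j],
            "floor": floor,
            "log_radon": county_intercept[j] - 0.6 * floor + rng.normal(0, 0.7),
        })
radon_df = pd.DataFrame(rows)

# Houses (level 1) nested within counties (level 2): county gets a random intercept,
# while floor (house level) and log_uranium (county level) are fixed effects.
model = smf.mixedlm("log_radon ~ floor + log_uranium",
                    data=radon_df, groups=radon_df["county"])
result = model.fit()
print(result.summary())
```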
Classic Analysis:
Drawback of this approach:
Missing data
MLM Approach:
Classic Analysis:
Drawback of ANOVA:
MLM Approach:
When to use them:
Nested designs
Repeated measures
Longitudinal data
Complex designs
Why use them:
What they are:
Don't really care about the variance (it is just a nuisance variable)
Data is not actually interdependent
Small number of groups/clusters
You only have a between-subjects design
Words you hear constantly in MLM Land:
What do they all mean?
Two sides to any model
\[ y_i = \color{blue}{b_{0_{\text{(intercept)}}} + b_{1_{\text{(slope)}}} x_i} + e_{i_{\text{(error)}}} \]
Model for the means (fixed part):
Fixed effect (constant effect):
Note
\[ y_i = {b_{0_{\text{(intercept)}}} + b_{1_{\text{(slope)}}} x_i} + \color{red}{e_{i_{\text{(error)}}}} \]
Uncorrelated with fixed part
Variation around the expected values
Normally distributed: \(\sim N(\mu,\sigma)\)
Random factors:
Represent higher level grouping variables
Note
Can only be categorical!
The random factor is your clustering variable:
Participants 🧑🤝🧑
Schools 🏫
Words
Pictures 🖼️
Random effects:
How random factors are allowed to vary
Random intercept (most common) : \(U_{0j}\)
Random slope: \(U_{1j}\)
Each level-2 cluster has its own coefficient for the effect of a predictor on the outcome
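A sketch of how a random intercept versus a random intercept plus random slope might be specified with statsmodels `mixedlm`; the simulated data frame and the names `y`, `x`, and `group` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated clustered data (hypothetical names): 25 groups, 20 observations each.
rng = np.random.default_rng(7)
group = np.repeat(np.arange(25), 20)
x = rng.normal(size=group.size)
u0 = rng.normal(0, 1.0, 25)[group]            # U_0j: group-specific intercept shifts
u1 = rng.normal(0, 0.5, 25)[group]            # U_1j: group-specific slope shifts
y = 2.0 + u0 + (0.8 + u1) * x + rng.normal(0, 1, group.size)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# Random intercept only: each group j gets its own U_0j.
ri_fit = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()

# Random intercept + random slope: re_formula adds a group-specific U_1j for x.
rs_fit = smf.mixedlm("y ~ x", data=df, groups=df["group"], re_formula="~x").fit()

print(ri_fit.summary())
print(rs_fit.summary())
```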
Should my variable be fixed or random?
Fixed if it is continuous, has few levels (< 5), or is an experimental manipulation
Want to estimate variance at each level of factor?
Want a general estimate of variance of factor?
Scenario: Investigating how student performance is influenced by teaching methods and individual student characteristics across different schools.
Data Collected: student socio-economic status (SES), teaching method used (e.g., traditional, modern), and school ID
What is fixed?
What is random?
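One possible answer, as a hedged sketch: SES and teaching method enter as fixed effects (the predictors of interest), while school is the random factor (the categorical clustering variable) and receives a random intercept. The data frame below is simulated and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the scenario (hypothetical names and values).
rng = np.random.default_rng(504)
n_schools, n_students = 20, 25
school = np.repeat(np.arange(n_schools), n_students)
ses = rng.normal(size=school.size)
method = rng.choice(["traditional", "modern"], size=school.size)
school_shift = rng.normal(0, 3, n_schools)[school]          # U_0j: school-level intercepts
performance = (70 + 2.5 * ses + 4 * (method == "modern")
               + school_shift + rng.normal(0, 5, school.size))
school_df = pd.DataFrame({"performance": performance, "ses": ses,
                          "method": method, "school": school})

# Fixed: ses (continuous) and method (experimental manipulation with 2 levels).
# Random: school (clustering variable), modeled with a random intercept.
fit = smf.mixedlm("performance ~ ses + C(method)",
                  data=school_df, groups=school_df["school"]).fit()
print(fit.summary())
```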
Blue = fixed
Red = random
\[y_i = \color{blue}{b_{0_{\text{(intercept)}}} + b_{1_{\text{(slope)}}} x_i} + \color{red}{ e_{i_{\text{(error)}}}}\]
\[ e_{i_{\text{(error)}}} = y_i - \hat{y}_i \]
\[ y_{ij} = (\color{blue}{b_{0j_{\text{(intercept)}}}} + \color{red}{U_{0j_{\text{(random intercept)}}}}) + \color{blue}{b_{1_{\text{(slope)}}} x_{ij}} + \color{red}{e_{ij_{\text{(error)}}}} \]
\[ U_{0j} = b_{0j} - b_0 \]
i = individual observation; j = group
\[ y_{ij} = ({b_{0j_{\text{(intercept)}}} + U_{0j_{\text{(random intercept)}}}}) + b_{1_{\text{(slope)}}} x_{ij} + \color{red}{ e_{ij_{\text{(error)}}}} \]
Within-group variation
\[ y_{ij} = (\color{blue}{b_{0j_{\text{(intercept)}}}} + \color{red}{U_{0j_{\text{(random intercept)}}}}) + (\color{blue}{b_{1_{\text{(slope)}}} x_{ij}} + \color{red}{U_{1j_{\text{(random slope)}}}}) + \color{red}{e_{ij_{\text{(error)}}}} \]
Important
\[ U_{1j} = b_{1j} - b_1 \]
| Level | Equation |
|---|---|
| Level 1 | \(y_{ij} = b_{0j} + b_{1j}X_{ij} + e_{ij}\) |
| Level 2 | \(b_{0j} = \gamma_{00} + U_{0j}\), \(b_{1j} = \gamma_{10} + U_{1j}\) |
| Combined | \(y_{ij} = \gamma_{00} + \gamma_{10}X_{ij} + U_{0j} + U_{1j}X_{ij} + e_{ij}\) |
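The combined equation comes from substituting the level-2 equations for \(b_{0j}\) and \(b_{1j}\) into the level-1 equation:

\[
\begin{aligned}
y_{ij} &= b_{0j} + b_{1j}X_{ij} + e_{ij} \\
       &= (\gamma_{00} + U_{0j}) + (\gamma_{10} + U_{1j})X_{ij} + e_{ij} \\
       &= \gamma_{00} + \gamma_{10}X_{ij} + U_{0j} + U_{1j}X_{ij} + e_{ij}
\end{aligned}
\]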
Different averages
Random intercept
Different relationships between x and y
Random Slope
Multiple sources of variance?
Each source of variance gets its own residual term
Residuals capture variance
Adding these residuals renders the observations conditionally independent
The fixed effects (usually) hold your hypothesis tests
Can (essentially) be interpreted like GLM output
For most people, this is all that matters
MLM has random effects output
Variance explained by the random effect
This may be interesting, in and of itself
Fixed effects of random terms are the average estimates across groups
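As an illustration of reading both sides of the output, here is a minimal Python sketch (assuming a fitted statsmodels `MixedLM` random-intercept result, e.g. from one of the earlier sketches) that pulls out the fixed effects and the variance components and computes an intraclass correlation.

```python
def describe_mixedlm(result):
    """Summarize a fitted statsmodels MixedLM result (random-intercept model assumed)."""
    # Fixed part: average intercept/slopes across groups; this is where the
    # hypothesis tests usually live, read much like ordinary regression output.
    print("Fixed effects (average estimates across groups):")
    print(result.fe_params, "\n")

    # Random part: between-group intercept variance (tau_00) and the
    # residual, within-group variance (sigma^2).
    tau00 = float(result.cov_re.iloc[0, 0])
    sigma2 = float(result.scale)

    # Intraclass correlation: share of the total variance that lies between groups.
    icc = tau00 / (tau00 + sigma2)
    print(f"Between-group variance (tau_00): {tau00:.3f}")
    print(f"Within-group variance (sigma^2): {sigma2:.3f}")
    print(f"ICC: {icc:.3f}")

# Usage (hypothetical): describe_mixedlm(ri_fit), with ri_fit from the earlier sketch.
```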
PSY 504: Advanced Statistics