I have a dataset that most likely requires a mixed effects model, but which has a few quirks that make me want to confirm that my current modeling approach is correct.
Background
My DV is a measurement of human behavior before and after an event (each event has exactly 2 measurements: 1 pre-event and 1 post-event) and I am interested in understanding how, if at all, the event influenced the dependent variable. (In case it matters, the dependent variable is a ratio that varies from 0 to 1.)
Not all observations are independent; some observations correspond to the same people to whom the event occurred more than once (but the event has occurred to everyone: once to 65% of people and mostly twice to everyone else).
Setup
I am using a mixed effects model using Python's statsmodels in which I have:
- dv – one row for every dependent variable measurement
- fixed effect – dummy variable indicating if the measurement happened pre- or post- event
- random effect – person ID (groups = ~10k, mean group size = ~3)
- random slope – same dummy variable used as fixed effect (to acknowledge that the rate of change from pre- to post-event may differ by person)
My mixedlm formula looks as follows:
smf.mixedlm('dv ~ pre_post', data, re_formula='pre_post', groups='person_id')
Question
Does this approach make sense? Am I missing something in capturing a repeated measures design nested by person using a mixed effects model?
Best Answer
I am unfamiliar with mixed modeling in python, and I'm not completely certain exactly what the model is that you're fitting. For example, I don't understand how, if your treatment variable is categorical rather than numeric, you are modeling slopes based on that treatment in your model.
What I would do in this situation is, barring any issues with the distributional assumption of normality, fit a mixed model where each response in your dataset is a function of:
a fixed effect corresponding to whether the measurement is taken before the first event, between the first and second event, or after the second event (very much like what you seem to have done)
plus a random effect for the person (like you seem to have done)
plus iid observation specific errors (which I'm assuming the function you're using does)
The equation for this model looks something like:
$Y_{tp} = \mu_{t} + u_{p} + \epsilon_{tp}$, where
$u_{p} \stackrel{iid}{\sim} N(0,\sigma_{u}^2)$
$\epsilon_{tp} \stackrel{iid}{\sim} N(0, \sigma_{\epsilon}^2)$.
Here, $Y_{tp}$ is the observed response for person $p$ at time $t$, where $t \in {0,1,2}$. So $t = 0$ is the before event time, $t = 1$ is the time between the first and second event, and $t = 2$ is the time after the second event. You can assess the effect of the event by looking at $\mu_1 - \mu_0$. Presumably the function you are using in python provides the necessary functionality to get an estimate and the standard error of this quantity.
I will note a few things:
The basic approach seems fine, it's just the details of your data and model that aren't clear to me.