Solved – Mixed Effects Model with Pre-Post Dependent Variable

mixed modelpythonregressionrepeated measuresstatsmodels

I have a dataset that most likely requires a mixed effects model, but which has a few quirks that make me want to confirm that my current modeling approach is correct.

Background

My DV is a measurement of human behavior before and after an event (each event has exactly 2 measurements: 1 pre-event and 1 post-event) and I am interested in understanding how, if at all, the event influenced the dependent variable. (In case it matters, the dependent variable is a ratio that varies from 0 to 1.)

Not all observations are independent; some observations correspond to the same people to whom the event occurred more than once (but the event has occurred to everyone: once to 65% of people and mostly twice to everyone else).

Setup

I am using a mixed effects model using Python's statsmodels in which I have:

  • dv – one row for every dependent variable measurement
  • fixed effect – dummy variable indicating if the measurement happened pre- or post- event
  • random effect – person ID (groups = ~10k, mean group size = ~3)
  • random slope – same dummy variable used as fixed effect (to acknowledge that the rate of change from pre- to post-event may differ by person)

My mixedlm formula looks as follows:

smf.mixedlm('dv ~ pre_post', data, re_formula='pre_post', groups='person_id')

Question

Does this approach make sense? Am I missing something in capturing a repeated measures design nested by person using a mixed effects model?

Best Answer

I am unfamiliar with mixed modeling in python, and I'm not completely certain exactly what the model is that you're fitting. For example, I don't understand how, if your treatment variable is categorical rather than numeric, you are modeling slopes based on that treatment in your model.

What I would do in this situation is, barring any issues with the distributional assumption of normality, fit a mixed model where each response in your dataset is a function of:

  • a fixed effect corresponding to whether the measurement is taken before the first event, between the first and second event, or after the second event (very much like what you seem to have done)

  • plus a random effect for the person (like you seem to have done)

  • plus iid observation specific errors (which I'm assuming the function you're using does)

The equation for this model looks something like:

$Y_{tp} = \mu_{t} + u_{p} + \epsilon_{tp}$, where

$u_{p} \stackrel{iid}{\sim} N(0,\sigma_{u}^2)$

$\epsilon_{tp} \stackrel{iid}{\sim} N(0, \sigma_{\epsilon}^2)$.

Here, $Y_{tp}$ is the observed response for person $p$ at time $t$, where $t \in {0,1,2}$. So $t = 0$ is the before event time, $t = 1$ is the time between the first and second event, and $t = 2$ is the time after the second event. You can assess the effect of the event by looking at $\mu_1 - \mu_0$. Presumably the function you are using in python provides the necessary functionality to get an estimate and the standard error of this quantity.

I will note a few things:

  1. First is that the error term already takes into account the fact that the difference from $t = 0$ to $t = 1$ is different for each person. There is no need to further model "random slopes". Perhaps you were just describing the observation specific random error when you mentioned this, but that was not clear to me.
  2. If you only had two measurements on an individual, as it seems to be the case for most of your data, this model results in a covariance structure that will be the same as if you considered a repeated measures model with an autoregressive covariance structure. The general sense I've gotten from my former statistics professors is that explicitly modeling the temporal structure of your data becomes more necessary as you have measurements over more time points. For two time points, it is unnecessary, and it is likely sufficient to simply consider the random effects model above for three time points.
  3. Modeling repeated measures with an autoregressive type covariance structure requires that measurements are taken at equally spaced time points, and since you don't have the same number of events for each person, this would be a concern for me.
  4. I'm not sure what the "(groups =~ 10k, mean group size = ~ 3)" means, but I'd be concerned that your experimental units are not, in fact, individuals, but groups. This would require further model modification.

The basic approach seems fine, it's just the details of your data and model that aren't clear to me.

Related Question