Regression – How to Conduct a Multilevel Model/Regression for Panel Data in Python

multilevel-analysispanel datapythonregressionrepeated measures

I have yearly data over time (longitudinal data) with repeated measures for many of the subjects. I think I need multilevel modeling/regressions to deal with sure-to-be correlated clusters of measurements for the same individuals over time. The data currently is in separate tables for each year.

I was wondering if there was a way that was built into scikit-learn, like LinearRegression(), that would be able to conduct a multilevel regression where Level 1 is all the data over the years, and Level 2 is for the clustered on the subjects (clusters for each subject's measurements over time). And if so, if it's better to have the longitudinal data laid out length-wise (where the each subject's measures over time are all in one row) or stacked (where each measure for each year is it's own row).

Is there a way to do this?

Best Answer

Linear regression will not be suitable for a multilevel model.

A mixed effects model is a good way to fit most multilevel models.

In python you can use mixedlm in statsmodels. For example:

In [1]: import statsmodels.api as sm

In [2]: import statsmodels.formula.api as smf

In [3]: data = sm.datasets.get_rdataset("dietox", "geepack").data

In [4]: md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])

In [5]: mdf = md.fit()

In [6]: print(mdf.summary())
         Mixed Linear Model Regression Results
========================================================
Model:            MixedLM Dependent Variable: Weight    
No. Observations: 861     Method:             REML      
No. Groups:       72      Scale:              11.3669   
Min. group size:  11      Log-Likelihood:     -2404.7753
Max. group size:  12      Converged:          Yes       
Mean group size:  12.0                                  
--------------------------------------------------------
             Coef.  Std.Err.    z    P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept    15.724    0.788  19.952 0.000 14.179 17.268
Time          6.943    0.033 207.939 0.000  6.877  7.008
Group Var    40.394    2.149                            
========================================================

Related Solutions

Solved – multilevel model vs panel analysis

In general, you want to use fixed effects over random effects when the goal of the analysis is a causal estimate of the effect of some time-varying variable, and you can achieve conditional independence by controlling for all time-invariant heterogeneity via the fixed effects.

Random effects can be seen as fixed effects that have been subject to a ridge penalty. As such, they are biased. Being biased, they will not control for time-invariant heterogeneity, though they will generally provide more efficient prediction.

If you do not need to control for the fixed effects to achieve an unbiased estimate of your coefficient on your time-varying variable, then random effects is the more efficient estimator. Hierarchical models are special cases of random effects models.

Linear Mixed Effects Model – Implementation of 4-Level Repeated Measures

It is important to think about how many data points the model is trying to fit, and also to remember which variables are fixed effects factors, which variables are random effects, and which variables are numeric covariates.

Data points. If the dependent variable is measured at group level, then the amount of data to fit is one value for each group, multiplied by the number of repeated measures sessions and the number of test conditions within each session. The data for regression should have this many rows of data.

Fixed effects. Factors are fixed effects if they span the full range of values relevant to the study. In an analysis of performance under three different study conditions, the study conditions are a fixed effect, lets say Study_Type. If session is to be treated as a factor rather than an ordinal session number, then Session_ID is a fixed effect. Within a session containing multiple study conditions we might treat the measurement order as a factor, with Order=1 (first performance measurement), Order=2 (second measurement with the session), etc.

Roughly, a mixed effects model will estimate the mean value of the dependent variable for each fixed effect cell of the regression model. In this study we will be particularly interested in whether the mean differs by Study_Type.

Random effects. Factors are random effects if the levels included in the study are a subset of a wider potential population, and are meant to be representative of that wider population. If we have a set of groups of subjects that are meant to be roughly indicative of a bigger universe of groups, then Group_ID is a random factor.

Roughly, a mixed effects model estimates how much variance is added to the dependent variable by a random factor as a whole, without bickering about 'oo killed 'oo or which random group had higher or lower performance.

Covariates. In a regression model that includes some categorical predictors (like Study_Type) and some numerical predictor variables (like Group_IV1), the numerical predictors may be called covariates. In this study some numerical predictors were measured at the level of subjects, with multiple subjects within each group. Each of these (Subj_IV1, Subj_IV2) can be aggregated to obtain a single composite measure for the group. I'll call these Avg_IV1, Avg_IV2.

Design and model

The experimental design has

three fixed effects - Study_Type, Order, Session_ID
one random effect - Group
four covariates - Group_IV1, Group_IV2, Avg_IV1, Avg_IV2
one dependent variable - Group_DV

Fixed effects. For sure we want to know how the dependent variable is affected by Study_Type. But the effect of Study_Type might vary across sessions (e.g. if novelty wanes as the experiment drags on, or if subjects learn to learn as they acquire more experience with each session). The effect of Study_Type might also vary depending on the order of measurement within each session. So we probably want to model interactions as well as main effects of the fixed effects.

~ Study_Type * Order * Session_ID

Random effect. Some groups might be better than others at everything. We could model this with a simple random effect:

~ (1|Group_ID)

But more than that, group differences might vary across the different study conditions, which we can model by allowing the "slope" of the group differences to vary with Study_Type:

~ (1+Study_Type|Group_ID)

In principle it is possible that group differences might further vary across the different sessions, or measurement orders, and so forth. If those nuisance effects are substantial, we would do well to include them in the model in order to more clearly measure the effects of interest (like the effect of Study_Type). But if some of the nuisance effects are probably small and we have only a finite amount of data because we tested a limited number of groups, then we might omit the nuisance effects from the model for the sake of simplicity and practicality. Here, I will opt for simplicity and leave out potential interactions of the nuisance fixed effects with groups.

Covariates. We want to test hypotheses about potential effects of the covariates (numerical predictors) on the dependent measure. A simple set of hypotheses would be that the effect of each covariate is the same regardless of what the other covariates are doing (that is, we assume the covariates do not interact with each other). Then our model can just include simple terms for the main effects of the covariates:

~ Group_IV1 + Group_IV2 + Avg_IV1 + Avg_IV2

Possibly the effect of a covariate could be attenuated or enhanced as the study proceeds across sessions. We might wonder if we need to treat session as an ordinal variable rather than a categorial factor. But if there are only two or three sessions we might forge ahead with the categorical Session_ID factor, and consider including interactions of Session_ID with the covariates:

~ Group_IV1 + Group_IV1:Session_ID + Group_IV2 + Group_IV2:Session_ID + ...

Similarly, the effect of a covariate might vary depending on Study_Type (or equivalently, the effect of some Study_Type might be enhanced or attenuated as a function of one of the covariates). In that case our model might want to include terms for interactions between covariates and Study_Type.

Putting it all together.

A plausible model (leaving out potential interactions between covariates and the fixed effects) might be:

Group_DV ~ Group_IV1 + Group_IV2 + Avg_IV1 + Avg_IV2
+ Study_Type * Order * Session_ID
+ (1+Study_Type|Group_ID)