Solved – Hierarchical modelling in Python with statsmodels

multilevel-analysis, python, statsmodels

I have a dataset with random effects at different levels of a hierarchy, and I want to analyze how they influence my target variable. I am looking at the Linear Mixed Effects Models in statsmodels, but I can't work out from the documentation how to achieve this.

Building on the statsmodels example that uses the dietox dataset, my code is:

import statsmodels.api as sm
import statsmodels.formula.api as smf

data = sm.datasets.get_rdataset("dietox", "geepack").data
# Only take the last week
data = data.copy().query("Time == 12")
# Convert Vitamin E to number
data["Evit"] = data["Evit"].map(lambda s: int(s.replace("Evit", "")))
md = smf.mixedlm("Weight ~ Feed + Evit", data, groups=data["Evit"])
mdf = md.fit()
print(mdf.summary())

I want to predict a pig's weight in week 12 from the cumulative feed intake Feed and the vitamin E dosage Evit. The hierarchy is meant to be groups that share a vitamin E dose, each with multiple pigs assigned to it.

I would expect a model in which, for every group $i = 1 \dots N$ and every pig $j = 1 \dots M$ within it, the weight is
$$Weight_{ij} = \beta_0 + \beta_1 Evit_{i} + \gamma_{0i} + \gamma_{1i} Feed_{ij}$$
Here $\beta_0$ is the common intercept shared among all pigs, and $\beta_1$ is the slope that shows the effect of vitamin E. $\gamma_{0i}$ is the intercept for the group receiving the same vitamin E injections (this does not make much sense with this dataset, but it is required for my actual problem). $\gamma_{1i}$ is the group's slope for Feed, which shows whether different vitamin doses lead to different weight gain for the same amount of food.

I would also be interested in how to remove the intercept.

In my actual dataset, the group (Evit) determines further variables (for example $Light$) that are constant within a group but differ between groups, and I would like to include these as well. There are also further individual-level variables (for example $Hairiness$), so the final model could look more like this:
$$Weight_{ij} = \beta_0 + \beta_1 Evit_{i} + \beta_2 Light_{i} + \gamma_{0i} + \gamma_{1i} Feed_{ij} + \gamma_{2i} Hairiness_{ij}$$

My question is how to model this in statsmodels or another Python library.

Best Answer

As I understand it, your betas are fixed coefficients and your gammas are random coefficients. You have only one level of clustering (j within i).

Your data should be in long form (one row per observation, not one row per group).

If you are using formulas, you would have a formula for the fixed effects part of the model (the mean structure) and a formula for the random effects part of the model. You also need an indicator variable defining the groups. It would look something like this:

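import statsmodels.api as sm
# `data` is the long-form DataFrame; `groups` names the column identifying cluster i (here the vitamin E group)
model = sm.MixedLM.from_formula("weight ~ evit + light", data, groups="evit", re_formula="feed + hairiness")
result = model.fit()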
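
To make the pieces concrete, here is a minimal sketch on synthetic data, not the asker's dataset. The column names (`group`, `evit`, `feed`, `weight`), the group sizes and the simulated coefficients are all assumptions made up for illustration. It fits the first model from the question (fixed intercept and Evit slope, random intercept and random Feed slope per group) and also builds an intercept-free variant by writing `0 +` in each formula, which is the usual formula syntax for dropping an intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per_group = 30, 20  # hypothetical sizes

# Hypothetical long-form data: one row per pig, `group` identifies cluster i.
group = np.repeat(np.arange(n_groups), n_per_group)
evit = np.repeat(rng.choice([0.0, 1.0, 2.0], size=n_groups), n_per_group)
feed = rng.normal(size=group.size)

gamma0 = rng.normal(scale=1.0, size=n_groups)  # random intercept per group
gamma1 = rng.normal(scale=0.5, size=n_groups)  # random Feed slope per group
noise = rng.normal(scale=0.5, size=group.size)
weight = 50.0 + 2.0 * evit + gamma0[group] + gamma1[group] * feed + noise

df = pd.DataFrame({"group": group, "evit": evit, "feed": feed, "weight": weight})

# Fixed part: intercept + evit. Random part: intercept + feed slope per group.
model = smf.mixedlm("weight ~ evit", df, groups="group", re_formula="~feed")
result = model.fit()
print(result.summary())

# Intercept-free variant: "0 +" drops the fixed intercept in the main formula
# and the random intercept in re_formula (built here, not fitted).
no_intercept = smf.mixedlm("weight ~ 0 + evit", df, groups="group",
                           re_formula="~0 + feed")
print(no_intercept.exog_names, no_intercept.k_re)
```

Further group-level covariates (such as Light) go into the main formula, and further individual-level random slopes (such as Hairiness) go into `re_formula`, in the same way. If Feed also has a non-zero average effect, it would normally appear in the fixed-effects formula as well, since a random slope on its own is centered at zero.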

Related Solutions

Solved – Logistic regression with grouped data

So after searching through leads kindly provided by this thread I've concluded that a Cox proportional hazards model would probably be the most appropriate, as this allows for stratification of the data by an ID as above.

For the curious, I came across lifelines for Python, which has a good implementation, and I have been running some moderately successful tests with it.

Thanks all!

Least Squares – How to Decompose Total Slope into Between-Group and Within-Group Contributions?

Let g be the indicator of the first group. That is, it is a vector of length 8 whose first 4 elements are 1 and whose last 4 are 0.

Let P be the projection onto the space spanned by g and 1-g -- if there were k groups then we would consider the space spanned by k vectors but here we have only two -- and let Q=I-P be the orthogonal complement projection. Also let y be ppSpend and x be pctPoor.

Let b, w and t be the between, within and total slopes. That is they are the slopes of the regression (including intercept) of y on Px, y on Qx and y on x respectively. Then we interpret the question as asking what the relationship is among b, w and t and it is:

var(Px) * b + var(Qx) * w = var(x) * t

which follows from the fact that the slopes are given by the three expressions below and that the numerators of b and w sum to the numerator of t (and similarly for the denominators).

b = cov(Px, y) / var(Px)
w = cov(Qx, y) / var(Qx)
t = cov(x, y) / var(x)

Dividing through the equation involving b, w and t by var(x) and letting a = var(Px)/var(x) we can write it as this convex combination.

a * b + (1-a) * w = t

The ratio var(Px) / var(x) can be regarded as the squared cosine of the angle between Px and x if we take squared length to be var.

We can illustrate this using R.

g <- rep(1:0, each = 4)
x <- c(0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1)
y <- c(8, 9, 10, 11, 5, 6, 7, 8)

n <- length(y)
G <- cbind(g, 1-g)
P <- G %*% solve(crossprod(G), t(G))
Q <- diag(n) - P

b = cov(P %*% x, y) / var(P %*% x); b  # or coef(lm(y ~ P %*% x))[[2]]
##           [,1]
## [1,] -4.285714

w = cov(Q %*% x, y) / var(Q %*% x); w  # or coef(lm(y ~ Q %*% x))[[2]]
##      [,1]
## [1,]   10

t = cov(x, y) / var(x); t  # or coef(lm(y ~ x))[[2]]
## [1] -2.962963

a <- var(P %*% x) / var(x); a
##           [,1]
## [1,] 0.9074074

# P %*% x also equals ave(x, g) in R so we can alternately write a as:
var(ave(x, g)) / var(x)
## [1] 0.9074074

# Using a, b and w from above, we see this equals the t shown above
a * b + (1-a) * w
##           [,1]
## [1,] -2.962963
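
The same check can be sketched in Python with NumPy. This is an illustrative translation of the R session, not part of the original answer; it computes Px directly as the vector of group means (the `ave(x, g)` shortcut noted above) rather than building the projection matrix, and the helper name `slope` is made up here.

```python
import numpy as np

g = np.repeat([1, 0], 4)  # indicator of the first group
x = np.array([0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0])
y = np.array([8, 9, 10, 11, 5, 6, 7, 8], dtype=float)

# Px replaces each x by its group mean; Qx is the within-group deviation.
group_means = {k: x[g == k].mean() for k in np.unique(g)}
Px = np.array([group_means[k] for k in g])
Qx = x - Px


def slope(u, v):
    """Slope of the least-squares regression (with intercept) of v on u."""
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)


b, w, t = slope(Px, y), slope(Qx, y), slope(x, y)
a = np.var(Px, ddof=1) / np.var(x, ddof=1)

print(b, w, t, a)
print(a * b + (1 - a) * w)  # equals t up to floating-point rounding
```

Up to floating-point rounding, the printed values should match those from the R session above.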
