Multilevel-Model – Analyzing Multilevel Models with Responses Only at Level 2

logistic, mixed-model, multilevel-analysis

I have hierarchical data on individuals nested within families. For each individual, I have independent variables such as age, gender, education, and familiarity with the product. For each family unit, I also have covariates such as household income, purchase behavior, and distance to retail centers.

The dependent satisfaction measure is only recorded at the family level. More specifically, satisfaction is asked of a head-of-household respondent, who ideally represents the household. While satisfaction is measured on a 5-point scale, we typically re-express it as dichotomous (top 2 box).

I would like to take into consideration the individual-level effects as well as the family-level effects in modeling product satisfaction propensity. Is it appropriate to explore multilevel modeling when the outcome is only measured at the second level? If not, is there a different approach I should be following?

Best Answer

A nice paper about this is the following:

Croon, M. A., & van Veldhoven, M. J. (2007). Predicting group-level outcome variables from variables measured at the individual level: A latent variable multilevel model. Psychological Methods, 12(1), 45.

Basically, the approach they outline involves computing adjusted group means of the predictor variables and then regressing the outcome on those adjusted group means. The adjusted group mean for each group is the best linear unbiased predictor (BLUP) of the predictor variable for that group. You can compute these using the equations given in the paper or, if you're using R, using the lme4 package and its coef() function.
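As a rough illustration of the coef() route, here is a minimal sketch for a single predictor. The data are simulated and every name in it (indiv, family, age, fam_mu) is made up for the example. It fits a plain random-intercept model to the predictor, which may differ in detail from the adjustment formulas in the paper (for instance when several predictors or group-level covariates are involved).

```r
library(lme4)

set.seed(1)

# Hypothetical data: 50 families with 2 to 5 members each (all names are illustrative)
n_fam  <- 50
sizes  <- sample(2:5, n_fam, replace = TRUE)
fam_id <- rep(seq_len(n_fam), times = sizes)
fam_mu <- rnorm(n_fam, mean = 40, sd = 5)   # simulated family-level mean age
indiv  <- data.frame(
  family = factor(fam_id),
  age    = rnorm(length(fam_id), mean = fam_mu[fam_id], sd = 8)
)

# Random-intercept model for the individual-level predictor
fit_age <- lmer(age ~ 1 + (1 | family), data = indiv)

# coef() adds each family's predicted random intercept to the fixed intercept,
# which gives a shrunken family mean
adj_mean <- coef(fit_age)$family[, "(Intercept)"]

# Unadjusted family means, for comparison
raw_mean <- as.vector(tapply(indiv$age, indiv$family, mean))

head(cbind(raw_mean, adj_mean))
```

The adjusted means are pulled toward the overall mean, more strongly for small families. They would then enter a family-level regression of the outcome as ordinary predictors.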

Edit 2020-09-19:

Since writing this answer in 2015, I've become convinced that the Croon & van Veldhoven (CvV) procedure that I mentioned above is not actually the best way to address this issue. In fact, the intuitive approach of simply aggregating the predictors up to the group level and then running an OLS regression of the group outcomes on these (unadjusted) predictor group means seems to work just as well, if not better. These two methods are compared in this simulation paper:

Predicting group-level outcome variables: An empirical comparison of analysis strategies

Summary of the paper: while the CvV method does indeed eliminate bias in the OLS parameter estimates from models with group-level outcomes, this comes at the cost of an enormous amount of variance in the parameter estimates, so that the CvV method actually does worse in terms of mean squared error. Furthermore, the simple unadjusted group means procedure is essentially just as effective as the CvV method in controlling Type I and Type II error rates.
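For the setup in the question, the unadjusted approach could look roughly like the sketch below. It uses simulated data and hypothetical names (fam, indiv, income, satisfied, mean_age, prop_female). Note one substitution: the comparison above is about OLS, but because the question dichotomizes satisfaction, the sketch fits a logistic regression with glm() instead.

```r
set.seed(2)

# Hypothetical family-level data: one row per family, with the binary outcome
n_fam <- 200
fam <- data.frame(
  family    = seq_len(n_fam),
  income    = rnorm(n_fam, mean = 60, sd = 15),
  satisfied = rbinom(n_fam, size = 1, prob = 0.4)   # top-2-box indicator
)

# Hypothetical individual-level data: 1 to 5 members per family
sizes <- sample(1:5, n_fam, replace = TRUE)
indiv <- data.frame(
  family = rep(fam$family, times = sizes),
  age    = rnorm(sum(sizes), mean = 40, sd = 12),
  female = rbinom(sum(sizes), size = 1, prob = 0.5)
)

# Aggregate the individual-level predictors to plain (unadjusted) family means
fam_means <- aggregate(cbind(age, female) ~ family, data = indiv, FUN = mean)
names(fam_means)[-1] <- c("mean_age", "prop_female")

# One row per family: family covariates plus aggregated individual predictors
dat <- merge(fam, fam_means, by = "family")

# Family-level logistic regression on the aggregated predictors
fit <- glm(satisfied ~ income + mean_age + prop_female,
           data = dat, family = binomial)
summary(fit)
```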

Related Solutions

Solved – Multilevel model with nested repeated measures design

The only random part here is the individual. Both Time and Treatment are fixed parts. As I understand it, you want global (i.e., fixed) estimates of the effect of:

Time
Each level of Treatment (except for the reference level)
The interaction between each level of Treatment (except for the reference level) and Time.

The following models will give you that.

fm1 <- lmer(PosQ ~ Treatm * Time + (1|ID), data = analyses.4)
fm2 <- glmer(Conc ~ Treatm * Time + (1|ID), data = analyses.4, family = binomial)

That being said, you can also fit a random effect of time, i.e., a random slope model in which the effect of time varies between individuals.

fm3 <- lmer(PosQ ~ Treatm * Time + (Time|ID), data = analyses.4)
fm4 <- glmer(Conc ~ Treatm * Time + (Time|ID), data = analyses.4, family = binomial)

This is possible since there is within-subject variation with respect to time. However, since there is no within-subject variation with respect to treatment, you cannot do the same for treatment.

Since there is no within-subject variation with respect to treatment, the random slope for time in such a model is the deviation of an individual's time effect from the fixed time effect for the particular treatment that individual received. The fixed coefficient of Time itself measures the average time effect in the reference category of the variable Treatment.

You can use anova() to compare the models and test whether or not it is justified to let the effect of time vary by subject:

anova(fm1, fm3)
anova(fm2, fm4)

These likelihood-ratio comparisons are the tests you need.
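To try the linear comparison without the original data, here is a sketch on simulated data. The data frame sim and all parameter values are invented for illustration. Only the variable names PosQ, Treatm, Time and ID follow the models above.

```r
library(lme4)

set.seed(3)

# Hypothetical design: 60 subjects, 4 time points, treatment fixed within subject
n_id <- 60
sim  <- expand.grid(ID = factor(seq_len(n_id)), Time = 0:3)
id   <- as.integer(sim$ID)
treat_by_id <- rep(c("control", "A", "B"), length.out = n_id)
sim$Treatm  <- factor(treat_by_id[id], levels = c("control", "A", "B"))

# Simulated subject-specific intercepts and time slopes
u0 <- rnorm(n_id, sd = 1)
u1 <- rnorm(n_id, sd = 0.5)
sim$PosQ <- 5 + 0.3 * sim$Time + 0.4 * (sim$Treatm == "A") * sim$Time +
  u0[id] + u1[id] * sim$Time + rnorm(nrow(sim))

# Random intercept only versus random intercept plus random slope for Time
fm1 <- lmer(PosQ ~ Treatm * Time + (1 | ID),    data = sim)
fm3 <- lmer(PosQ ~ Treatm * Time + (Time | ID), data = sim)

# anova() refits both models with ML and reports a likelihood-ratio test
cmp <- anova(fm1, fm3)
cmp
```

A small p-value in the second row suggests that letting the time effect vary by subject improves the fit.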

Linear Mixed Effects Model – Implementation of 4-Level Repeated Measures

It is important to think about how many data points the model is trying to fit, and also to remember which variables are fixed effects factors, which variables are random effects, and which variables are numeric covariates.

Data points. If the dependent variable is measured at group level, then the amount of data to fit is one value for each group, multiplied by the number of repeated measures sessions and the number of test conditions within each session. The regression data set should have this many rows.
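As a quick check of that arithmetic, the design rows can be enumerated directly. The counts below (12 groups, 3 sessions, 3 conditions per session) are made-up numbers, not values from the study.

```r
# Hypothetical counts, chosen only to illustrate the row arithmetic
n_groups     <- 12
n_sessions   <- 3
n_conditions <- 3

# One row per group x session x within-session measurement
design <- expand.grid(
  Group_ID   = factor(seq_len(n_groups)),
  Session_ID = factor(seq_len(n_sessions)),
  Order      = factor(seq_len(n_conditions))
)

nrow(design)   # n_groups * n_sessions * n_conditions
```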

Fixed effects. Factors are fixed effects if they span the full range of values relevant to the study. In an analysis of performance under three different study conditions, the study conditions are a fixed effect, let's say Study_Type. If session is to be treated as a factor rather than an ordinal session number, then Session_ID is a fixed effect. Within a session containing multiple study conditions we might treat the measurement order as a factor, with Order=1 (first performance measurement), Order=2 (second measurement within the session), etc.

Roughly, a mixed effects model will estimate the mean value of the dependent variable for each fixed effect cell of the regression model. In this study we will be particularly interested in whether the mean differs by Study_Type.

Random effects. Factors are random effects if the levels included in the study are a subset of a wider potential population, and are meant to be representative of that wider population. If we have a set of groups of subjects that are meant to be roughly indicative of a bigger universe of groups, then Group_ID is a random factor.

Roughly, a mixed effects model estimates how much variance is added to the dependent variable by a random factor as a whole, without bickering about 'oo killed 'oo or which random group had higher or lower performance.

Covariates. In a regression model that includes some categorical predictors (like Study_Type) and some numerical predictor variables (like Group_IV1), the numerical predictors may be called covariates. In this study some numerical predictors were measured at the level of subjects, with multiple subjects within each group. Each of these (Subj_IV1, Subj_IV2) can be aggregated to obtain a single composite measure for the group. I'll call these Avg_IV1, Avg_IV2.
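A minimal sketch of that aggregation step, assuming a hypothetical subject-level table subj with 5 subjects per group and a group-by-session table design. The resulting group averages are then attached to every repeated-measures row of the group.

```r
set.seed(4)

# Hypothetical subject-level measurements: 12 groups, 5 subjects each
n_groups <- 12
subj <- data.frame(
  Group_ID = rep(seq_len(n_groups), each = 5),
  Subj_IV1 = rnorm(n_groups * 5),
  Subj_IV2 = rnorm(n_groups * 5)
)

# Collapse to one composite value per group
avg <- aggregate(cbind(Subj_IV1, Subj_IV2) ~ Group_ID, data = subj, FUN = mean)
names(avg) <- c("Group_ID", "Avg_IV1", "Avg_IV2")

# Attach the group averages to each group-level row (here: 3 sessions per group)
design <- expand.grid(Group_ID = seq_len(n_groups), Session_ID = 1:3)
design <- merge(design, avg, by = "Group_ID")
head(design)
```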

Design and model

The experimental design has

three fixed effects - Study_Type, Order, Session_ID
one random effect - Group
four covariates - Group_IV1, Group_IV2, Avg_IV1, Avg_IV2
one dependent variable - Group_DV

Fixed effects. For sure we want to know how the dependent variable is affected by Study_Type. But the effect of Study_Type might vary across sessions (e.g. if novelty wanes as the experiment drags on, or if subjects learn to learn as they acquire more experience with each session). The effect of Study_Type might also vary depending on the order of measurement within each session. So we probably want to model interactions as well as main effects of the fixed effects.

~ Study_Type * Order * Session_ID

Random effect. Some groups might be better than others at everything. We could model this with a simple random effect:

~ (1|Group_ID)

But more than that, group differences might vary across the different study conditions, which we can model by allowing the "slope" of the group differences to vary with Study_Type:

~ (1+Study_Type|Group_ID)

In principle it is possible that group differences might further vary across the different sessions, or measurement orders, and so forth. If those nuisance effects are substantial, we would do well to include them in the model in order to more clearly measure the effects of interest (like the effect of Study_Type). But if some of the nuisance effects are probably small and we have only a finite amount of data because we tested a limited number of groups, then we might omit the nuisance effects from the model for the sake of simplicity and practicality. Here, I will opt for simplicity and leave out potential interactions of the nuisance fixed effects with groups.

Covariates. We want to test hypotheses about potential effects of the covariates (numerical predictors) on the dependent measure. A simple set of hypotheses would be that the effect of each covariate is the same regardless of what the other covariates are doing (that is, we assume the covariates do not interact with each other). Then our model can just include simple terms for the main effects of the covariates:

~ Group_IV1 + Group_IV2 + Avg_IV1 + Avg_IV2

Possibly the effect of a covariate could be attenuated or enhanced as the study proceeds across sessions. We might wonder if we need to treat session as an ordinal variable rather than a categorical factor. But if there are only two or three sessions we might forge ahead with the categorical Session_ID factor, and consider including interactions of Session_ID with the covariates:

~ Group_IV1 + Group_IV1:Session_ID + Group_IV2 + Group_IV2:Session_ID + ...

Similarly, the effect of a covariate might vary depending on Study_Type (or equivalently, the effect of some Study_Type might be enhanced or attenuated as a function of one of the covariates). In that case our model might want to include terms for interactions between covariates and Study_Type.
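To see which coefficients such interaction terms add, one option is to inspect the design matrix that R builds from the formula. The toy data frame below is hypothetical (3 sessions, 3 study types, one simulated covariate).

```r
set.seed(6)

# Hypothetical design rows: 3 sessions x 3 study types, one numeric covariate
toy <- expand.grid(
  Session_ID = factor(1:3),
  Study_Type = factor(c("A", "B", "C"))
)
toy$Group_IV1 <- rnorm(nrow(toy))

# Covariate slope allowed to differ by session
X_session <- model.matrix(~ Group_IV1 + Group_IV1:Session_ID, data = toy)
colnames(X_session)

# Covariate slope allowed to differ by study type (main effects plus interaction)
X_type <- model.matrix(~ Group_IV1 * Study_Type, data = toy)
colnames(X_type)
```

Each interaction column is a slope difference relative to the reference level, so every extra covariate-by-factor interaction costs one coefficient per non-reference level.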

Putting it all together.

A plausible model (leaving out potential interactions between covariates and the fixed effects) might be:

Group_DV ~ Group_IV1 + Group_IV2 + Avg_IV1 + Avg_IV2 +
    Study_Type * Order * Session_ID +
    (1 + Study_Type | Group_ID)
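Here is a sketch of fitting this model with lmer() from lme4. The data are simulated, and the design is assumed rather than taken from the study: 60 groups, 3 sessions, each session running the three study types once in random order, and covariates that are constant within a group.

```r
library(lme4)

set.seed(5)

n_groups <- 60
types    <- c("A", "B", "C")

# One row per group x session x measurement order (Order varies fastest)
d <- expand.grid(Order = 1:3, Session_ID = 1:3, Group_ID = seq_len(n_groups))

# Within each group-session block of three rows, run the study types in random order
d$Study_Type <- as.vector(replicate(nrow(d) / 3, sample(types)))

# Hypothetical group-level covariates, constant across a group's rows
g <- data.frame(
  Group_ID  = seq_len(n_groups),
  Group_IV1 = rnorm(n_groups), Group_IV2 = rnorm(n_groups),
  Avg_IV1   = rnorm(n_groups), Avg_IV2   = rnorm(n_groups)
)
d <- merge(d, g, by = "Group_ID")

# Simulated outcome: group intercepts plus group-specific Study_Type effects
u0  <- rnorm(n_groups)
uB  <- rnorm(n_groups, sd = 0.7)
uC  <- rnorm(n_groups, sd = 0.7)
gid <- d$Group_ID
d$Group_DV <- 10 + 0.5 * d$Group_IV1 + 0.3 * d$Avg_IV1 +
  (1 + uB[gid]) * (d$Study_Type == "B") +
  uC[gid] * (d$Study_Type == "C") +
  u0[gid] + rnorm(nrow(d))

# Treat the design variables as factors
for (v in c("Study_Type", "Order", "Session_ID", "Group_ID")) d[[v]] <- factor(d[[v]])

fit <- lmer(
  Group_DV ~ Group_IV1 + Group_IV2 + Avg_IV1 + Avg_IV2 +
    Study_Type * Order * Session_ID +
    (1 + Study_Type | Group_ID),
  data = d
)

fixef(fit)[c("Group_IV1", "Avg_IV1", "Study_TypeB")]
VarCorr(fit)
```

With fewer groups than in this simulation, the random-slope part may produce singular-fit or convergence warnings, in which case falling back to (1 | Group_ID) is a reasonable simplification.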