Predictive Models – Usefulness of Mixed Models

mixed modelpredictive-models

I am a bit confused about advantages of mixed models in regard to predictive modelling. Since predictive models are usually meant to predict values of previously unknown observations then it seems obvious to me that the only way a mixed model may be useful is through its ability to provide population-level predictions (that is without adding any random effects). However, the problem is that so far in my experience population-level predictions based on mixed models are significantly worse than predictions based on standard regression models with fixed effects only.

So what is the point of mixed models in regard to prediction problems?

EDIT.
The problem is the following: I fitted a mixed model (with both fixed and random effects) and standard linear model with fixed effects only. When I do cross-validation I get a following hierarchy of predictive accuracy: 1) mixed models when predicting using fixed and random effects (but this works of course only for observations with known levels of random effects variables, so this predictive approach seems not to be suitable for real predictive applications!); 2) standard linear model; 3) mixed model when using population-level predictions (so with random effects thrown out). Thus, the only difference between standard linear model and mixed model are somewhat different value of coefficients due to different estimation methods (i.e. there are the same effects/predictors in both models, but they have different associated coefficients).

So my confusion boils down to a question, why would I ever use a mixed model as a predictive model, since using mixed model to generate population-level predictions seems to be an inferior strategy in comparison to a standard linear model.

Best Answer

It depends on the nature of the data, but in general I would expect the mixed model to outperform the fixed-effects only models.

Let's take an example: modelling the relationship between sunshine and the height of wheat stalks. We have a number of measurements of individual stalks, but many of the stalks are measured at the same sites (which are similar in soil, water and other things that may affect height). Here are some possible models:

1) height ~ sunshine

2) height ~ sunshine + site

3) height ~ sunshine + (1|site)

We want to use these models to predict the height of new wheat stalks given some estimate of the sunshine they will experience. I'm going to ignore the parameter penalty you would pay for having many sites in a fixed-effects only model, and just consider the relative predictive power of the models.

The most relevant question here is whether these new data points you are trying to predict are from one of the sites you have measured; you say this is rare in the real world, but it does happen.

A) New data are from a site you have measured

If so, models #2 and #3 will outperform #1. They both use more relevant information (mean site effect) to make predictions.

B) New data are from an unmeasured site

I would still expect model #3 to outperform #1 and #2, for the following reasons.

(i) Model #3 vs #1:

Model #1 will produce estimates that are biased in favour of overrepresented sites. If you have similar numbers of points from each site and a reasonably representative sample of sites, you should get similar results from both.

(ii) Model #3 vs. #2:

Why would model #3 be better that model #2 in this case? Because random effects take advantage of shrinkage - the site effects will be 'shrunk' towards zero. In other words, you will tend to find less extreme values for site effects when it is specified as a random effect than when it is specified as a fixed effect. This is useful and improves your predictive ability when the population means can reasonably be thought of as being drawn from a normal distribution (see Stein's Paradox in Statistics). If the population means are not expected to follow a normal distribution, this might be a problem, but it's usually a very reasonable assumption and the method is robust to small deviations.

[Side note: by default, when fitting model #2, most software would use one of the sites as a reference and estimate coefficients for the other sites that represent their the deviation from the reference. So it may appear as though there is no way to calculate an overall 'population effect'. But you can calculate this by averaging across predictions for all of the individual sites, or more simply by changing the coding of the model so that coefficients are calculated for every site.]

Related Question