Solved – For model-averaging a GLM, do we average the predictions on the link or response scale?

Tags: generalized-linear-model, model-averaging

To compute the model-averaged predictions on the response scale of a GLM, which is "correct" and why?

  1. Compute the model averaged prediction on the link scale and then back-transform to the response scale, or
  2. Back-transform the predictions to the response scale and then compute the model average.

The predictions are close but not equal if the model is a GLM. Different R packages offer options for both (with different defaults). Several colleagues have argued vociferously that #1 is wrong because "everyone does #2". My intuition says that #1 is "correct" because it keeps all linear math linear (#2 averages quantities that are not on a linear scale). A simple simulation finds that #2 has a very (very!) slightly smaller MSE than #1. If #2 is correct, what is the reason? And, if #2 is correct, why is my reasoning (keep linear math linear) poor reasoning?

Edit 1: Computing marginal means over the levels of another factor in a GLM is a similar problem to the question that I am asking above. Russell Lenth computes marginal means of GLM models using the "timing" (his words) of #1 (in the emmeans package) and his argument is similar to my intuition.

Edit 2: I am using model-averaging to refer to the alternative to model selection where a prediction (or a coefficient) is estimated as the weighted average over all or a subset of "best" nested models (see references and R packages below).

Given $M$ nested models, where $\eta_i^m$ is the linear predictor (on the link scale) for individual $i$ under model $m$, and $w_m$ is the weight for model $m$, the model-averaged prediction using #1 above (average on the link scale and then back-transform to the response scale) is:

$$\hat{Y}_i = g^{-1}\Big(\sum_{m=1}^M{w_m \eta_i^m}\Big)$$

and the model-averaged prediction using #2 above (back-transform all $M$ predictions and then average on the response scale) is:

$$\hat{Y}_i = \sum_{m=1}^M{w_m\, g^{-1}(\eta_i^m)}$$
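As a small numeric illustration of the two orderings (hypothetical weights $w_m$ and link-scale predictions $\eta_i^m$, logit link; this is just a sketch, not taken from any of the packages below):

```python
import math

def inv_logit(eta):
    """Inverse logit (logistic) link: maps the link scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical model weights and link-scale predictions for one individual
weights = [0.5, 0.3, 0.2]
etas = [0.4, 1.2, -0.3]

# Method 1: average on the link scale, then back-transform
pred1 = inv_logit(sum(w * e for w, e in zip(weights, etas)))

# Method 2: back-transform each prediction, then average on the response scale
pred2 = sum(w * inv_logit(e) for w, e in zip(weights, etas))

print(pred1, pred2)  # close, but not equal, because inv_logit is nonlinear
```

Because $g^{-1}$ is nonlinear, the two quantities differ whenever the $\eta_i^m$ differ across models; they coincide only when all models agree or when the link is the identity.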

Some Bayesian and Frequentist methods of model averaging are:

  • Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T., 1999. Bayesian model averaging: a tutorial. Statistical science, pp.382-401.

  • Burnham, K.P. and Anderson, D.R., 2003. Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.

  • Hansen, B.E., 2007. Least squares model averaging. Econometrica, 75(4), pp.1175-1189.

  • Claeskens, G. and Hjort, N.L., 2008. Model selection and model averaging. Cambridge Books.

R packages include BMA, MuMIn, BAS, and AICcmodavg. (Note: this is not a question about the wisdom of model-averaging more generally.)

Best Answer

The optimal way of combining estimators or predictors depends on the loss function that you are trying to minimize (or the utility function you are trying to maximize).

Generally speaking, if the loss function measures prediction errors on the response scale, then averaging predictors on the response scale is correct. If, for example, you are seeking to minimize the expected squared error of prediction on the response scale, then the posterior mean predictor will be optimal and, depending on your model assumptions, that may be equivalent to averaging predictions on the response scale.
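A quick Monte Carlo sketch of that last point (using a hypothetical response-scale predictive distribution, here a Beta; none of this comes from the packages above): under squared-error loss on the response scale, the mean of the response-scale draws is the optimal point prediction.

```python
import random

random.seed(1)

# Draws from a hypothetical posterior-predictive distribution on the response scale
draws = [random.betavariate(2, 5) for _ in range(10_000)]
mean_draw = sum(draws) / len(draws)

def mse(c):
    """Monte Carlo estimate of expected squared error for point prediction c."""
    return sum((y - c) ** 2 for y in draws) / len(draws)

# Shifting the prediction away from the response-scale mean can only increase
# the squared-error loss (MSE(c) = variance + (c - mean)^2):
assert mse(mean_draw) < mse(mean_draw + 0.05)
assert mse(mean_draw) < mse(mean_draw - 0.05)
```

This is why "which scale do I average on?" reduces to "which scale does my loss function live on?".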

Note that averaging on the linear predictor scale can perform very poorly for discrete models. Suppose that you are using a logistic regression to predict the probability of a binary response variable. If any of the models gives an estimated probability of zero, then the linear predictor for that model will be minus infinity. A weighted average of minus infinity with any number of finite values is still minus infinity, so the back-transformed model-averaged prediction collapses to zero regardless of what the other models say.
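A minimal sketch of this failure mode, again with a logit link and hypothetical weights and predictions:

```python
import math

def logit(p):
    """Logit link; p = 0 maps to minus infinity on the link scale."""
    return math.log(p / (1.0 - p)) if 0 < p < 1 else math.copysign(math.inf, p - 0.5)

def inv_logit(eta):
    """Inverse logit, extended to map -inf -> 0 and +inf -> 1."""
    if not math.isfinite(eta):
        return 0.0 if eta < 0 else 1.0
    return 1.0 / (1.0 + math.exp(-eta))

# One model predicts probability exactly 0; the others are ordinary
probs = [0.0, 0.4, 0.6]
weights = [0.1, 0.5, 0.4]

# Method 1: the -inf term dominates the weighted sum on the link scale
link_avg = sum(w * logit(p) for w, p in zip(weights, probs))
pred_link_scale = inv_logit(link_avg)  # collapses to 0

# Method 2: averaging on the response scale stays well-behaved
pred_response_scale = sum(w * p for w, p in zip(weights, probs))

print(pred_link_scale, pred_response_scale)
```

Even though the degenerate model carries only 10% of the weight, it completely determines the link-scale average.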

Have you consulted the references that you list? I am sure that Hoeting et al. (1999), for example, discuss loss functions, although perhaps not in much detail.
