Solved – Weights with prediction intervals

Tags: prediction, prediction-interval, r, regression, weighted-regression

I fitted a weighted regression model to predict age as a function of several DNA methylation markers (expressed in percentages). I used weighted regression because the residual variance of my original OLS model increases with age.

When using the predict function to generate prediction intervals for a set of new samples,

predict(fGLS, newdata = Testset, interval = "prediction", level = 0.95)

I get the following warning:

Warning message:
In predict.lm(fGLS, newdata = Testset, interval = "prediction",  :
  Assuming constant prediction variance even though model fit is weighted

I tried adding the same weights I used to fit the model, and this no longer yielded a warning:

predict(fGLS, newdata = Testset, interval = "prediction", level = 0.95,
        weights = 1/hhat)

I have two questions:

  1. Am I correct in simply adding the same weights I used to fit the weighted regression model, to the predict function? What does this effectively do?

  2. In the first situation, my prediction intervals are roughly the same size throughout the data in my test set. In the second situation, the prediction intervals become larger with increasing age. Does this mean my prediction intervals in the first situation are wrong? Or is it okay to have equal interval sizes since I "corrected" for heteroskedasticity by using weighted regression? In other words, can I afford to simply ignore the warning?

Best Answer

There seems to be some confusion about the purpose of a prediction interval.

With frequency weights, an element Weights[i] = 10 of the weights vector means that the i-th factor level represents 10 people or observations sharing the same covariate values.

That weight is intrinsic to the fitted model and to nothing else. A prediction interval is for an independent 11th person or observation: its uncertainty is the sum of the uncertainty in your estimates (the confidence interval) and that individual's own variability (sampling error).
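As a sketch of that decomposition in symbols, assume the usual linear model with independent errors. The notation (a new observation y_new at covariates x_0, fitted value \hat{y}_0, residual degrees of freedom \nu) is introduced here for illustration and does not come from the question:

```latex
\operatorname{Var}\bigl(y_{\text{new}} - \hat{y}_0\bigr)
  = \underbrace{\operatorname{Var}(\hat{y}_0)}_{\text{uncertainty in the estimates}}
  + \underbrace{\sigma^2_{\text{new}}}_{\text{individual variability}},
\qquad
\text{PI: } \hat{y}_0 \pm t_{1-\alpha/2,\,\nu}
  \sqrt{\widehat{\operatorname{Var}}(\hat{y}_0) + \hat{\sigma}^2_{\text{new}}}
```

A confidence interval keeps only the first term. With precision weights, the second term for a new observation with weight w_new would be \sigma^2 / w_new, which is why the interval width can differ between observations.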

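If, in a contrived way, you conduct an independent study, resample another 10 or even 20 people for that i-th factor level, and want a prediction interval for their aggregate mean, you can calculate it yourself: the individual-variability part of the variance is divided by the number of new observations, while the estimation part is unchanged. For a single factor level fitted from n observations and 10 new ones, the interval scales with sqrt(1/10 + 1/n)*sigma, where sigma is the residual standard error.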
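A minimal sketch of the aggregate-mean case in R, on simulated data (the data frame `d`, the variables `x` and `y`, and the group size `m` are hypothetical). It relies on the fact that `predict.lm` computes the prediction variance as `res.var / weights`, so passing `weights = m` should correspond to the mean of `m` new observations:

```r
set.seed(7)

# Simulated data, for illustration only
n <- 80
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 + 1.5 * d$x + rnorm(n, sd = 2)
fit <- lm(y ~ x, data = d)

new <- data.frame(x = 5)
m <- 10  # number of new observations whose mean we want to predict

# Pieces needed for a manual interval
p  <- predict(fit, newdata = new, se.fit = TRUE)
tq <- qt(0.975, df = p$df)

# Estimation variance is unchanged; residual variance is divided by m
se_mean_m <- sqrt(p$se.fit^2 + p$residual.scale^2 / m)
manual <- c(p$fit - tq * se_mean_m, p$fit + tq * se_mean_m)

# Same interval via predict(): pred.var = res.var / weights
builtin <- predict(fit, newdata = new, interval = "prediction", weights = m)

# Ordinary prediction interval for a single new observation, for comparison
single <- predict(fit, newdata = new, interval = "prediction")

width_mean   <- unname(builtin[1, "upr"] - builtin[1, "lwr"])
width_single <- unname(single[1, "upr"] - single[1, "lwr"])
```

The interval for the mean of `m` new observations is narrower than the one for a single new observation, but never narrower than the confidence interval, because the estimation uncertainty does not shrink with `m`.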

Your problem is most easily understood by trying to replicate the results obtained from predict with the interval='confidence' and interval='prediction' arguments.
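One way to do that replication, sketched on simulated data for an unweighted fit (the variable names are hypothetical and only stand in for the real data):

```r
set.seed(1)

# Simulated data, for illustration only
n <- 100
x <- runif(n, 0, 100)
y <- 5 + 0.8 * x + rnorm(n, sd = 4)
fit <- lm(y ~ x)
new <- data.frame(x = c(10, 50, 90))

# Fitted values, their standard errors, residual df and residual scale
p  <- predict(fit, newdata = new, se.fit = TRUE)
tq <- qt(0.975, df = p$df)

# Confidence interval: uncertainty in the fitted mean only
ci_manual <- cbind(fit = p$fit,
                   lwr = p$fit - tq * p$se.fit,
                   upr = p$fit + tq * p$se.fit)

# Prediction interval: add the residual variance of one new observation
pi_se <- sqrt(p$se.fit^2 + p$residual.scale^2)
pi_manual <- cbind(fit = p$fit,
                   lwr = p$fit - tq * pi_se,
                   upr = p$fit + tq * pi_se)

# Built-in equivalents
ci_builtin <- predict(fit, newdata = new, interval = "confidence")
pi_builtin <- predict(fit, newdata = new, interval = "prediction")

all.equal(ci_manual, ci_builtin, check.attributes = FALSE)
all.equal(pi_manual, pi_builtin, check.attributes = FALSE)
```

The only difference between the two intervals is the extra `residual.scale^2` term, and that is the term the `weights` argument of `predict` rescales.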

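However, it seems that in your case the purpose of the weighting was precision weighting. In that case, you are correct to supply weights to predict, and this should yield wider prediction intervals for factor levels with more variance (higher age and more varied methylation). Note that predict applies the weights to the rows of newdata, so they should be the weights appropriate to the test samples rather than a copy of the training-set vector. You can easily check this result for yourself.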
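A sketch of such a check on simulated data. The question does not say how `hhat` was obtained, so this assumes a feasible-GLS style variance model (log squared OLS residuals regressed on the predictor). The names `train`, `meth`, `varmod` and `hhat_new` are hypothetical:

```r
set.seed(42)

# Simulated training data: error spread grows with the marker, hence with age
n <- 300
train <- data.frame(meth = runif(n, 10, 90))
train$age <- 5 + 0.8 * train$meth + rnorm(n, sd = 0.08 * train$meth)

# Step 1: OLS fit, then an assumed variance model for the residuals
ols <- lm(age ~ meth, data = train)
train$logr2 <- log(resid(ols)^2)
varmod <- lm(logr2 ~ meth, data = train)
train$w <- 1 / exp(fitted(varmod))   # precision weights, 1 / hhat

# Step 2: weighted fit
fGLS <- lm(age ~ meth, data = train, weights = w)

# Step 3: variance estimates for the NEW samples, from the same variance model
Testset <- data.frame(meth = c(20, 50, 80))
hhat_new <- exp(predict(varmod, newdata = Testset))

# Without weights: warns and assumes constant prediction variance
pi_const <- suppressWarnings(
  predict(fGLS, newdata = Testset, interval = "prediction", level = 0.95)
)

# With weights computed for the test rows
pi_wtd <- predict(fGLS, newdata = Testset, interval = "prediction",
                  level = 0.95, weights = 1 / hhat_new)

width_const <- unname(pi_const[, "upr"] - pi_const[, "lwr"])
width_wtd   <- unname(pi_wtd[, "upr"] - pi_wtd[, "lwr"])
cbind(meth = Testset$meth, width_const, width_wtd)
```

In this simulation the unweighted call gives nearly equal widths, while the weighted call gives widths that grow with the marker, mirroring the two situations described in the question. Without `weights`, `predict` uses the residual variance on the weighted scale, which describes an observation with weight 1 rather than the actual test samples.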