Solved – Weights with prediction intervals

prediction, prediction-interval, r, regression, weighted-regression

I fitted a weighted regression model to predict age as a function of several DNA methylation markers (expressed in percentages). I used weighted regression because the variance of my original OLS model increases with age.

When using the predict function to generate prediction intervals for a set of new samples,

predict(fGLS, newdata = Testset, interval = "prediction", level = 0.95)

I get the following warning:

Warning message:
In predict.lm(fGLS, newdata = Testset, interval = "prediction",  :
  Assuming constant prediction variance even though model fit is weighted

I tried adding the same weights I used to fit the model, and this no longer yielded a warning:

predict(fGLS, newdata = Testset, interval = "prediction", level = 0.95,
        weights = 1/hhat)

I have two questions:

1. Am I correct in simply adding the same weights I used to fit the weighted regression model to the predict function? What does this effectively do?
2. In the first situation, my prediction intervals are roughly the same size throughout the data in my test set. In the second situation, the prediction intervals become larger with increasing age. Does this mean my prediction intervals in the first situation are wrong? Or is it okay to have equal interval sizes since I "corrected" for heteroskedasticity by using weighted regression? In other words, can I afford to simply ignore the warning?

Best Answer

There seems to be some confusion about the purpose of a prediction interval.

With frequency weights, an element Weights[i] = 10 of the weights vector indicates that, for the i-th factor level, there were 10 people or observations with a similar distribution of characteristics.

That weight is intrinsic to the model and the model alone. When you calculate a prediction interval, it is for an independent 11th person or observation: its width combines the uncertainty in your estimates (the confidence interval) with that individual's own variability (sampling error).

If, in a contrived way, you ran an independent study, sampled another 10 or even 20 people at that i-th factor level, and wanted a prediction interval for their aggregate mean, you could calculate it yourself by scaling the residual standard error by sqrt(1/10 + 1/n) for 10 new people (with n the number of observations behind the estimate for that level), instead of the sqrt(1 + 1/n) that applies to a single new observation.

Your problem is easily understood by trying to replicate the results obtained from predict with the interval='confidence' and interval='prediction' arguments.
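One way to try that replication is sketched below. It uses the built-in cars data purely as a stand-in (not the asker's methylation data) and an unweighted fit, and the object names are illustrative. The confidence interval uses only the standard error of the fitted mean, while the prediction interval adds the residual variance for one new observation.

```r
# Illustrative unweighted fit on the built-in cars data (a stand-in example)
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 20))

# se.fit = TRUE returns fit, se.fit, df and residual.scale
p <- predict(fit, newdata = new, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)

# Confidence interval: uncertainty in the estimated mean only
ci_manual <- cbind(fit = p$fit,
                   lwr = p$fit - tcrit * p$se.fit,
                   upr = p$fit + tcrit * p$se.fit)

# Prediction interval: add the residual variance of a single new observation
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)
pi_manual <- cbind(fit = p$fit,
                   lwr = p$fit - tcrit * se_pred,
                   upr = p$fit + tcrit * se_pred)

# Intervals as reported by predict() itself
ci_builtin <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pi_builtin <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# Both comparisons should report TRUE
all.equal(ci_manual, ci_builtin, check.attributes = FALSE)
all.equal(pi_manual, pi_builtin, check.attributes = FALSE)
```

The only difference between the two manual intervals is the extra residual.scale^2 term, which is the "individual uncertainty" part of a prediction interval.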

However, it seems that in your case the purpose of the weighting was precision weighting. In that case, you are correct to re-apply the weights: this should yield wider prediction intervals for more variable factor levels (higher age and more varied methylation). You can easily check this result for yourself.
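A minimal sketch of such a check follows, on simulated data where the error standard deviation grows with the predictor. The data, the weights 1/x^2 (assumed known here) and all object names are made up for illustration and are not the asker's model. In predict.lm the weights argument supplies variance weights for the new rows, so the predictive variance used for each row is the residual variance divided by that row's weight; the weights passed should therefore describe the rows of newdata.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error SD grows with x
w <- 1 / x^2                               # precision weights, assumed known
fit <- lm(y ~ x, weights = w)

new <- data.frame(x = c(2, 5, 9))
w_new <- 1 / new$x^2                       # weights for the NEW rows

# Without weights: predict() warns and assumes constant prediction variance
# (the warning quoted in the question is suppressed here)
pi_const <- suppressWarnings(
  predict(fit, newdata = new, interval = "prediction", level = 0.95)
)

# With weights: predictive variance is residual variance / weight
pi_wtd <- predict(fit, newdata = new, interval = "prediction",
                  level = 0.95, weights = w_new)

width_const <- unname(pi_const[, "upr"] - pi_const[, "lwr"])
width_wtd   <- unname(pi_wtd[, "upr"] - pi_wtd[, "lwr"])

# Manual version of the weighted interval width
p <- predict(fit, newdata = new, se.fit = TRUE)
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2 / w_new)
width_manual <- unname(2 * qt(0.975, df = p$df) * se_pred)

cbind(x = new$x, width_const, width_wtd, width_manual)
```

In this simulation the unweighted call gives intervals of nearly equal width, while the weighted call gives intervals that widen with x, matching the two situations described in the question.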

Related Solutions

Solved – R predict with “prediction” option

You didn't construct your new object correctly. If you compute the interval manually, you need to account for the intercept term by adding a "1" to your linear function (assuming the model is fitted with an intercept). See how I created the vector a in the following code. If you use the predict function to get the interval directly, the newdata argument does not need a "1" for the intercept; R takes care of that. See how I used a.dat below and found identical results:

> #Model estimation
> lm.1<- lm(Ozone ~., data = airquality_clean)

> #Design Matrix
> x=model.matrix(lm.1)
> 
> #Defining linear function
> a=c(1,200,10,70,1,3)
> 
> #Defining new data.frame
> a.dat=data.frame(Solar.R=a[2],Wind=a[3],Temp=a[4],Month=a[5],Day=a[6])
> 
> #Finding the upper prediction limit
> predict(lm.1,newdata=a.dat)+qt(.995,summary(lm.1)$df[2])*summary(lm.1)$sigma*sqrt(1+t(a)%*%solve(t(x)%*%x)%*%a)
         [,1]
[1,] 103.2434
> 
> #Finding the lower prediction limit
> predict(lm.1,newdata=a.dat)-qt(.995,summary(lm.1)$df[2])*summary(lm.1)$sigma*sqrt(1+t(a)%*%solve(t(x)%*%x)%*%a)
          [,1]
[1,] -16.76174
> 
> 
> #Using predict function on the same fitted model
> predict(lm.1, a.dat, interval="prediction", level=0.99)
       fit       lwr      upr
1 43.24083 -16.76174 103.2434
>

Solved – How to calculate prediction intervals for LOESS

I don't know how to get prediction bands from the original loess function, but the loess.sd function in the msir package does just that. Almost verbatim from the msir documentation:

library(msir)
data(cars)
# Calculates and plots a 1.96 * SD prediction band, that is,
# a 95% prediction band
l <- loess.sd(cars, nsigma = 1.96)
plot(cars, main = "loess.sd(cars)", col="red", pch=19)
lines(l$x, l$y)
lines(l$x, l$upper, lty=2)
lines(l$x, l$lower, lty=2)

Your second question is a bit trickier since loess.sd doesn't come with a prediction function, but you can hack it together by linearly interpolating the predicted means and SDs you get out of loess.sd (using approx). These can, in turn, be used to simulate data using a normal distribution with the predicted means and SDs:

# Simulate x data uniformly and y data according to the loess fit
sim_x <- runif(100, min(cars[,1]), max(cars[,1]))
pred_mean <- approx(l$x, l$y, xout = sim_x)$y
pred_sd <- approx(l$x, l$sd, xout = sim_x)$y
sim_y <- rnorm(100, pred_mean, pred_sd) 

# Plots 95% prediction bands with simulated data 
plot(cars, main = "loess.sd(cars)", col="red", pch=19)
points(sim_x, sim_y, col="blue")
lines(l$x, l$y)
lines(l$x, l$upper, lty=2)
lines(l$x, l$lower, lty=2)
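For comparison, a rough band can also be sketched with base R's loess alone, since predict on a loess fit with se = TRUE returns se.fit, residual.scale and df. This is an approximation under the assumption of roughly constant, normal errors, so it is not equivalent to loess.sd, whose band is built from an estimated standard deviation that can vary along x. The names lo, grid and band are illustrative.

```r
# Base-R sketch: approximate 95% prediction band from a loess fit,
# assuming roughly constant, normally distributed errors
lo <- loess(dist ~ speed, data = cars)
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 50))
p <- predict(lo, newdata = grid, se = TRUE)

# Combine uncertainty in the fitted curve with the residual scale
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)
tcrit <- qt(0.975, df = p$df)

band <- data.frame(speed = grid$speed,
                   fit = as.numeric(p$fit),
                   lwr = as.numeric(p$fit - tcrit * se_pred),
                   upr = as.numeric(p$fit + tcrit * se_pred))

plot(cars, main = "loess with approximate prediction band",
     col = "red", pch = 19)
lines(band$speed, band$fit)
lines(band$speed, band$upr, lty = 2)
lines(band$speed, band$lwr, lty = 2)
```

Because a single residual scale is used everywhere, this band keeps a nearly constant width; if the spread clearly changes with x, the loess.sd approach above is the better fit for the data.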
