Solved – R predict with “prediction” option

prediction intervalr

I'm trying to match the prediction interval of the predict.lm() function in R using the formula found in this discussion :

Obtaining a formula for prediction limits in a linear model

I'm using a student's quantile in my interval but in the end it's far larger from the one given by predict().

Is there any specific calculation in the predict function, I tried to look at the code but couldn't find any answer.
The formula looks ok as I found exactly the same from others source.

My R code :

airquality_clean <- na.omit(airquality)
attach(airquality_clean)

#Model estimation
model_1 <- lm(Ozone ~., data = airquality_clean)

#Unbias variance of the residuals
sigma_2 <- sum(model_1$residuals**2)/(dim(airquality_clean)[1]-dim(airquality_clean)[2])

#New observation
new <- data.frame(Solar.R=200,Wind=10,Temp=70,Day=1,Month=3)

#Calculated prediction interval
sigma <- sqrt(sigma_2*(1 + as.matrix(new)%*%solve(as.matrix(t(airquality_clean[,-1]))%*%as.matrix(airquality_clean[,-1]))%*%as.matrix(t(new))))
qt <- qt(0.995, df = dim(airquality_clean)[1]-dim(airquality_clean)[2])
int_pred_t <- cbind(predict(model_1, new)-(qt*sigma),predict(model_1, new)+(qt*sigma))
int_pred_t
          [,1]     [,2]
[1,] -22.59931 95.82563

#R prediction interval
predict(model_1, new, interval="predict", level=0.99)}
       fit       lwr      upr
1 36.61316 -21.12916 94.35548

I'm not too far but it's not the same results. If I use a p value from a normal distribution and not a student I'm even closer.

Thank you.

Best Answer

You didn't construct your new object correctly. You need to take into account the effect of the intercept term by adding a "1" (if you are fitting with an intercept) to your linear function if you are finding the C.I. manually yourself. Check out how I created vector a in the following code. But if you use the predict function to directly find the C.I., then the newdat argument does not need to have any "1" for the intercept. R will take care of that! Check out how I used a.dat below and and found identical results:

> #Model estimation
> lm.1<- lm(Ozone ~., data = airquality_clean)

> #Design Matrix
> x=model.matrix(lm.1)
> 
> #Defining linear function
> a=c(1,200,10,70,1,3)
> 
> #Defining new data.frame
> a.dat=data.frame(Solar.R=a[2],Wind=a[3],Temp=a[4],Month=a[5],Day=a[6])
> 
> #Finding the upper prediction interval
> predict(lm.1,newdat=a.dat)+qt(.995,summary(lm.1)$df[2])*summary(lm.1)$sigma*sqrt(1+t(a)%*%solve(t(x)%*%x)%*%a)
         [,1]
[1,] 103.2434
> 
> #Finding the lower prediction interval
> predict(lm.1,newdat=a.dat)-qt(.995,summary(lm.1)$df[2])*summary(lm.1)$sigma*sqrt(1+t(a)%*%solve(t(x)%*%x)%*%a)
          [,1]
[1,] -16.76174
> 
> 
> #Using predict function
> predict(model_1, a.dat, interval="predict", level=0.99)
       fit       lwr      upr
1 43.24083 -16.76174 103.2434
> 
Related Question