Solved – How to correctly weight predicted values from a fitted linear model

r, regression

I am using R to fit a linear model.

My code is:

rating_lm <- lm(rating$flow ~ I(rating$raw^2) + rating$raw, data = rating, weights = 1/(rating$flow))

I then use the following code to get prediction intervals:

b <- predict(rating_lm, interval = "prediction")

The graph below shows: the fitted line (red line), the data points and the prediction intervals (blue lines).

I used the weighting 1/rating$flow because we are much more confident in the low measured Y values.

I need to use the fitted linear model in a predictive way with new X data. However, when doing this, I have found that the prediction intervals for the new data are not close to those of the fitted model.

My question is: how can I ensure that the new predicted values have the same (or very similar) prediction intervals as the fitted model?

Best Answer

When you fit a linear model and generate prediction intervals, you assume the model form holds outside the range of the data you used to fit it. The only difference between a confidence interval for the model estimate at a particular point and a prediction interval is the added uncertainty of an independent random error.
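In symbols, as a sketch under the usual normal-error linear model (the notation is introduced here for illustration and is not the asker's): with $\hat{y}_0$ the fitted value at a new point $x_0$, $\operatorname{se}(\hat{y}_0)$ its standard error, $\hat{\sigma}^2$ the estimated residual variance and $t$ the appropriate Student-$t$ quantile, the two intervals are

$$\text{confidence: } \hat{y}_0 \pm t\,\operatorname{se}(\hat{y}_0), \qquad \text{prediction: } \hat{y}_0 \pm t\,\sqrt{\operatorname{se}(\hat{y}_0)^2 + \hat{\sigma}^2}$$

For a weighted fit, the extra term becomes $\hat{\sigma}^2 / w_0$, where $w_0$ is the weight assumed for the new observation, so the width of the prediction interval depends on that choice.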

Statisticians often warn that it is dangerous to extrapolate a regression model outside the range of the data. That could be what is going on here. If you are trying to predict outside the range and the model form does not extend that far, then observed points can lie far outside the prediction interval. The problem is that the implicit assumption behind prediction intervals, namely that the model extends to the new points, is violated.
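A minimal sketch of how one might check for extrapolation and request prediction intervals for new data. The data below are simulated stand-ins, not the asker's measurements, and `new_x`, `fit_new` and `pred_new` are hypothetical names. The formula uses bare column names so that predict() can look the variables up in `newdata`. The `weights` argument of predict.lm() sets the variance weights for the new points; plugging in 1/(predicted flow) is an assumption made here, because the true flow is unknown for new X.

```r
set.seed(1)
# Simulated stand-in for the rating data: noise variance grows with flow
rating <- data.frame(raw = seq(1, 10, length.out = 50))
mean_flow <- 2 + 0.5 * rating$raw + 0.3 * rating$raw^2
rating$flow <- mean_flow + rnorm(50, sd = 0.2 * sqrt(mean_flow))

# Bare column names so that predict() can use newdata
rating_lm <- lm(flow ~ I(raw^2) + raw, data = rating, weights = 1 / flow)

# Hypothetical new X values; the last one lies outside the fitted range
new_x <- data.frame(raw = c(2.5, 5, 12))
new_x$extrapolated <- new_x$raw < min(rating$raw) | new_x$raw > max(rating$raw)

# Point predictions, used as a plug-in for the unknown flow in the weights
fit_new <- predict(rating_lm, newdata = new_x)

# Prediction intervals with variance weights for the new points (assumed 1/fit)
pred_new <- predict(rating_lm, newdata = new_x,
                    interval = "prediction", weights = 1 / fit_new)
print(cbind(new_x, pred_new))
```

Rows flagged as extrapolated are the ones where the interval relies on the quadratic form holding beyond the observed data.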

Related Solutions

Solved – Finding the fitted and predicted values for a statistical model

You have to be a bit careful with model objects in R. For example, whilst the fitted values and the predictions of the training data should be the same in the glm() model case, they are not the same when you use the correct extractor functions:

R> fitted(md2)
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112 
R> predict(md2)
         1          2          3          4          5          6 
-0.3192480 -0.3192480 -0.3252830  0.9818840 -0.2785876  0.3252830

That is because the default for predict.glm() is to return predictions on the scale of the linear predictor. To get the fitted values we want to apply the inverse of the link function to those values. fitted() does that for us, and we can get the correct values using predict() as well:

R> predict(md2, type = "response")
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112
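The relationship between the two scales can be checked directly. This is a small sketch on a hypothetical logistic model `m` fitted to made-up data; it is not the `md2` object from the answer, whose definition is not shown.

```r
# Hypothetical data and model, for illustration only
df <- data.frame(age = c(18, 18, 23, 50, 19, 39),
                 won = c(0, 0, 1, 1, 1, 0))
m <- glm(won ~ age, family = binomial, data = df)

link_scale <- predict(m)                     # linear predictor (log-odds)
resp_scale <- predict(m, type = "response")  # probabilities

# Applying the inverse link to the linear predictor recovers the fitted values
same_as_plogis  <- isTRUE(all.equal(plogis(link_scale), resp_scale))
same_as_linkinv <- isTRUE(all.equal(m$family$linkinv(link_scale), resp_scale))
same_as_fitted  <- isTRUE(all.equal(fitted(m), resp_scale))
print(c(same_as_plogis, same_as_linkinv, same_as_fitted))
```

Using `m$family$linkinv()` rather than `plogis()` keeps the check valid for links other than the logit.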

Likewise with residuals() (or resid()); the values stored in md2$residuals are the working residuals and are unlikely to be what you want. The resid() method allows you to specify the type of residual you want and has a useful default.
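A short sketch of the difference, again on a hypothetical model `m` fitted to made-up data rather than the answer's `md2`:

```r
# Hypothetical data and model, for illustration only
df <- data.frame(age = c(18, 18, 23, 50, 19, 39),
                 won = c(0, 0, 1, 1, 1, 0))
m <- glm(won ~ age, family = binomial, data = df)

stored   <- m$residuals                  # what the object stores
working  <- resid(m, type = "working")   # same values, requested explicitly
deviance <- resid(m)                     # default type is "deviance"
response <- resid(m, type = "response")  # observed minus fitted probability

print(cbind(stored, working, deviance, response))
```

The first two columns agree, while the deviance and response residuals are on different scales.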

For the glm() model, something like this will suffice:

R> data.frame(Age = df$age, Won = df$won, Fitted = fitted(md2))
  Age Won    Fitted
1  18   0 0.4208590
2  18   0 0.4208590
3  23   1 0.4193888
4  50   1 0.7274819
5  19   1 0.4308001
6  39   0 0.5806112

Something similar can be done for the lm() model:

R> data.frame(Age = df$age, Income = df$income, Fitted = fitted(md1))
  Age Income    Fitted
1  18      5  7.893273
2  18      3  7.893273
3  23     47 28.320749
4  50      8 -1.389725
5  19      6  7.603179
6  39      5 23.679251

Regression Analysis – Difference Between Prediction Interval and Confidence Interval Explained

The blue lines don't matter for your prediction. You can see what happens with your data, because the same can be expected to happen with your predictions. Notice that some past observations fall outside the blue lines, but most of them (about 95%) fall between the red lines. That is what we should expect of your predictions: although you have a single red point, the real (future observed) value is expected to fall anywhere between the red lines 95% of the time.

If we return to your model we can see better what both colours of lines mean. Your model is:

$$\mathit{Y}=\beta_0+\beta_1X_1+\varepsilon$$

where $\beta_0$ and $\beta_1$ are unknown parameters that we can only estimate and $\varepsilon$ is a random variable.

The blue lines are confidence intervals for $\beta_0+\beta_1X_1$, that is, intervals for the green line, but please notice that this is not the future observation (it is just your point estimate of it). Your future observation will include an $\varepsilon$ term, which causes more variability, and the red lines account for that extra variability.

Edit:

It was asked (and answered) in the comments what the blue lines are useful for. For completeness, this is the answer:

If you are predicting future observations, confidence intervals (blue lines) don't have a direct use. However, they can sometimes be useful. I give a couple of examples:

When epsilon stands for a measurement error, we are usually more interested in predicting means than in predicting observations. Then the blue lines are more useful than the red ones.
The blue lines show how much we can improve predictions by increasing the sample size. If the blue lines are nearly as far apart as the red lines, we could improve our predictions a lot with a larger sample; in the opposite case there is very little to gain.
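The second point can be illustrated with a small simulation. This is a sketch under assumed settings (a straight-line model with unit error variance); `interval_widths` is a hypothetical helper, not part of any package.

```r
set.seed(42)

# Hypothetical helper: simulate y = 1 + 2x + noise, fit lm(), and return the
# widths of the 95% confidence and prediction intervals at x = 0.5
interval_widths <- function(n) {
  d <- data.frame(x = runif(n))
  d$y <- 1 + 2 * d$x + rnorm(n, sd = 1)
  fit <- lm(y ~ x, data = d)
  x0 <- data.frame(x = 0.5)
  ci <- predict(fit, newdata = x0, interval = "confidence")
  pint <- predict(fit, newdata = x0, interval = "prediction")
  c(n = n,
    ci_width = unname(ci[, "upr"] - ci[, "lwr"]),
    pi_width = unname(pint[, "upr"] - pint[, "lwr"]))
}

res <- rbind(interval_widths(20), interval_widths(2000))
print(res)
```

With the larger sample the confidence interval shrinks substantially, while the prediction interval stays dominated by the error term, which no amount of data removes.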
