Regression – Understanding the Shape of Confidence Interval for Predicted Values in Linear Regression

Tags: confidence-interval, linear-model, regression, standard-error

I have noticed that the confidence interval for predicted values in a linear regression tends to be narrow around the mean of the predictor and wide around the minimum and maximum values of the predictor. This can be seen in plots of these four linear regressions:

[Plots of four linear regressions with confidence bands around the fitted lines]

I initially thought this was because most values of the predictor were concentrated around its mean. However, I then noticed that the narrow middle of the confidence interval occurred even when many values of the predictor were concentrated around its extremes, as in the bottom-left linear regression, in which many values of the predictor are concentrated near its minimum.

Can anyone explain why confidence intervals for the predicted values in a linear regression tend to be narrow in the middle and wide at the extremes?

Best Answer

I'll discuss it in intuitive terms.

Both confidence intervals and prediction intervals in regression take account of the fact that the intercept and slope are uncertain - you estimate the values from the data, but the population values may be different (if you took a new sample, you'd get different estimated values).

A regression line will pass through $(\bar x, \bar y)$, and it's best to center the discussion about changes to the fit around that point - that is to think about the line $y= a + b(x-\bar x)$ (in this formulation, $\hat a = \bar y$).

If the line still went through that $(\bar x, \bar y)$ point, but the slope were a little higher or lower (i.e. if the height of the line at the mean were fixed but the slope were slightly different), what would that look like?

You'd see that the new line would move further away from the current line near the ends than near the middle, making a kind of slanted X that crossed at the mean (as each of the purple lines below do with respect to the red line; the purple lines represent the estimated slope $\pm$ two standard errors of the slope).

[Plot: fitted line (red) with two purple lines of slope $\pm$ two standard errors, crossing the red line at $(\bar x, \bar y)$]

If you drew a collection of such lines with the slope varying a little from its estimate, you'd see the distribution of predicted values near the ends 'fan out' (imagine the region between the two purple lines shaded in grey, as if we had sampled again and drawn many such lines with slopes near the estimated one). We can get a sense of this by bootstrapping a line through the point $(\bar{x},\bar{y})$. Here's an example using 2000 resamples with a parametric bootstrap:

[Plot: 2000 parametric-bootstrap lines through $(\bar x, \bar y)$, fanning out away from the mean of $x$]
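The bootstrap just described can be sketched as follows; this is a minimal illustration with invented data and variable names of my own, pinning every resampled line at $(\bar x, \bar y)$ as in the figure, so that only the slope varies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed example, not the original poster's data)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / sxx
resid = y - (ybar + b * (x - xbar))
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Parametric bootstrap: resample y from the fitted model, refit the slope,
# and draw each new line through the original (xbar, ybar)
grid = np.linspace(x.min(), x.max(), 50)
lines = []
for _ in range(2000):
    y_star = ybar + b * (x - xbar) + rng.normal(0, sigma_hat, n)
    b_star = np.sum((x - xbar) * (y_star - y_star.mean())) / sxx
    lines.append(ybar + b_star * (grid - xbar))
lines = np.array(lines)

# The spread of the bootstrapped lines is smallest near xbar
# and grows towards the ends of the x range
spread = lines.std(axis=0)
print(spread[np.argmin(np.abs(grid - xbar))], spread[0], spread[-1])
```

Plotting all 2000 lines (e.g. with matplotlib) reproduces the grey fan shape described above.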

If instead you take account of the uncertainty in the constant (making the line pass close to but not quite through $(\bar x, \bar y)$), that moves the line up and down, so intervals for the mean at any $x$ will sit above and below the fitted line.

[Plot: fitted line (red) with two parallel purple lines shifted up and down from it]

(Here the purple lines are $\pm$ two standard errors of the constant term either side of the estimated line).

When you do both at once (the line may be up or down a tiny bit, and the slope may be slightly steeper or shallower), then you get some amount of spread at the mean, $\bar x$, because of the uncertainty in the constant, and you get some additional fanning out due to the slope's uncertainty, between them producing the characteristic hyperbolic shape of your plots.

That's the intuition.


Now, if you like, we can consider a little algebra (but it's not essential):

It's actually the square root of the sum of the squares of those two effects - you can see it in the confidence interval's formula. Let's build up the pieces:

The $a$ standard error with $b$ known is $\sigma /\sqrt{n}$ (remember $a$ here is the expected value of $y$ at the mean of $x$, not the usual intercept; it's just a standard error of a mean). That's the standard error of the line's position at the mean ($\bar x$).

The $b$ standard error with $a$ known is $\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}$. The effect of uncertainty in slope at some value $x^*$ is multiplied by how far you are from the mean ($x^*-\bar x$) (because the change in level is the change in slope times the distance you move), giving $(x^*-\bar x)\cdot\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}$.

Now the overall effect is just the square root of the sum of the squares of those two things (why? because variances of uncorrelated things add, and if you write your line in the $y= a + b(x-\bar x)$ form, the estimates of $a$ and $b$ are uncorrelated). So the overall standard error is the square root of the overall variance, and the variance is the sum of the variances of the components - that is, we have

$\sqrt{(\sigma /\sqrt{n})^2+ \left[(x^*-\bar x)\cdot\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\right]^2 }$

A little simple manipulation gives the usual formula for the standard error of the estimated mean value at $x^*$:

$\sigma\sqrt{\frac{1}{n}+ \frac{(x^*-\bar x)^2}{\sum_{i=1}^n (x_i-\bar{x})^2} }$

If you draw that as a function of $x^*$, you'll see it forms a curve (looks like a smile) with a minimum at $\bar x$, that gets bigger as you move out. That's what gets added to / subtracted from the fitted line (well, a multiple of it is, in order to get a desired confidence level).
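As a numeric check of the derivation above, the following sketch (toy numbers of my own) verifies that the root-sum-of-squares of the two components equals the simplified formula, and that the curve is minimized at $\bar x$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 25
x = rng.uniform(-3, 7, n)
xbar = x.mean()
sigma = 1.3                       # pretend this is the error sd
sxx = np.sum((x - xbar) ** 2)

grid = np.linspace(x.min(), x.max(), 201)

# The two components: uncertainty in the height at xbar, and the
# slope's uncertainty scaled by distance from xbar
se_level = sigma / np.sqrt(n)
se_slope_effect = np.abs(grid - xbar) * sigma / np.sqrt(sxx)
se_combined = np.sqrt(se_level ** 2 + se_slope_effect ** 2)

# The same thing via the simplified formula
se_formula = sigma * np.sqrt(1 / n + (grid - xbar) ** 2 / sxx)

print(np.allclose(se_combined, se_formula))   # the two forms agree
print(grid[np.argmin(se_formula)], xbar)      # minimum sits at xbar (to grid resolution)
```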

[With prediction intervals, there's also the variation in position due to the process variability; this adds another term that shifts the limits up and down, making a much wider spread, and because that term usually dominates the sum under the square root, the curvature is much less pronounced.]
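A small numeric illustration of that last point (the numbers are invented for the example): adding the extra $\sigma^2$ term makes the prediction band wider everywhere, and its relative rise from centre to edge much smaller, so the curvature is less visible:

```python
import numpy as np

# Toy values standing in for the quantities in the formula above
n, sigma, sxx, xbar = 25, 1.3, 40.0, 2.0
grid = np.linspace(-3, 7, 201)

# Confidence band for the mean vs prediction band for a new observation
se_mean = sigma * np.sqrt(1 / n + (grid - xbar) ** 2 / sxx)
se_pred = sigma * np.sqrt(1 + 1 / n + (grid - xbar) ** 2 / sxx)

# The prediction band is wider everywhere, but its edge-to-centre
# ratio is much closer to 1, so it looks flatter
print(se_mean.max() / se_mean.min())
print(se_pred.max() / se_pred.min())
```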
