It has the same meaning as any other confidence interval: under the assumption that the model is correct, if the experiment and estimation procedure are repeated over and over, 95% of the intervals constructed this way will contain the true value of the quantity of interest. In this case, the quantity of interest is the expected value of the response variable.
It is probably easiest to explain this in the context of a linear model (mixed models are just an extension of this, so the same ideas apply):
The usual assumption is that:
$y_i = X_{i1} \beta_1 + X_{i2} \beta_2 + \ldots + X_{ip} \beta_p + \epsilon_i$
where $y_i$ is the response, $X_{ij}$'s are the covariates, $\beta_j$'s are the parameters, and $\epsilon_i$ is the error term, which has mean zero. The quantity of interest is then:
$E[y_i] = X_{i1} \beta_1 + X_{i2} \beta_2 + \ldots + X_{ip} \beta_p$
which is a linear function of the (unknown) parameters, since the covariates are known (and fixed). Since we know the sampling distribution of the parameter vector, we can easily calculate the sampling distribution (and hence the confidence interval) of this quantity.
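For example (a minimal sketch with made-up data and names, not taken from the answer), with an ordinary lm fit you can get this interval either by propagating the coefficient covariance matrix by hand or directly from predict():

```r
# sketch: 95% CI for E[y] at x = 0.5, via the sampling distribution of the coefficients
set.seed(1)
d   <- data.frame(x = runif(30))
d$y <- 1 + 2 * d$x + rnorm(30, sd = 0.5)
fit <- lm(y ~ x, data = d)

X_new <- c(1, 0.5)                                     # covariate row: intercept, x = 0.5
est   <- sum(X_new * coef(fit))                        # estimate of E[y] at x = 0.5
se    <- sqrt(drop(t(X_new) %*% vcov(fit) %*% X_new))  # SE of that linear combination
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se

# the same interval, straight from predict():
predict(fit, newdata = data.frame(x = 0.5), interval = "confidence")
```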
So why would you want to know it? I guess if you're doing out-of-sample prediction, it could tell you how good your forecast is expected to be (though you'd need to take into account model uncertainty).
It's hard to step in when people of the caliber of the names above have already commented, but I did try to understand this the silly way... using the power of [R] to simulate mathematical problems. So I hope it sheds some light on what these uncertainty quantifications attached to the regression parameters mean - that was the question...
So from the perspective of the frequentist there is this Platonic world of absolute representation of every single individual - the population, and we are looking at the shadows on the wall of the cave - the sample. We know that no matter how hard we try we'll be off, but we want to have an idea of how far we'll be from the truth.
We can play god, and pretend to create the population, where everything is perfect, and the parameters governing the relationships between variables are glimmering gold. Let's do that by establishing that the variable $x$ will be related to the variable $y$ through the equation $y = 10 + 0.4\,x$. We define the x's as x <- seq(from = 0.0001, to = 100, by = 0.0001) (that is, $1$ million observations). The y's will therefore be calculated as y <- 0.4 * x + 10. We can combine these values in a data.frame: population <- data.frame(x, y).
From this population we will take $100$ samples. For each sample, we will randomly select $100$ rows of data from the dataset. Let's define the function for sampling rows:
sam <- function(){
  s <- population[sample(nrow(population), 100), ]  # draw 100 rows at random
  s$y <- s$y + rnorm(100, 0, 10)                    # add noise to the response
  s
}
Notice that we are no longer in paradise - now we have noise (rnorm).
And we are going to collect both the intercepts and the slopes (I'll call them betas) of the OLS linear regression run on each one of these $100$ samples. Let's write some lines of code for this:
betas <- numeric(100)       # slope estimates, one per sample
intercepts <- numeric(100)  # intercept estimates, one per sample
for(i in 1:100){
  s <- sam()
  fit <- lm(y ~ x, data = s)
  betas[i] <- coef(fit)[2]
  intercepts[i] <- coef(fit)[1]
}
And combine both into a new data.frame: reg_lines <- data.frame(intercepts, betas). As expected, given the normal randomness of the noise, the histogram of the slopes will be Gaussian looking:
And if we plot all the regression lines fitted to each one of the $100$ samples drawn from the population, we'll see that any single one of them is just an approximation: they oscillate between a maximum and a minimum in both intercept and slope. This is what they look like:
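(As a rough sketch - the plotting choices below are mine, not the original figures - both displays can be drawn from reg_lines with base R graphics:)

```r
# histogram of the 100 estimated slopes
hist(reg_lines$betas, breaks = 20, main = "Slopes across 100 samples", xlab = "slope")

# all 100 fitted lines, with the true population line on top
plot(population$x, population$y, type = "n", xlab = "x", ylab = "y")
for (i in 1:nrow(reg_lines)) {
  abline(a = reg_lines$intercepts[i], b = reg_lines$betas[i], col = rainbow(100)[i])
}
abline(a = 10, b = 0.4, lwd = 3)   # the "true" line: intercept 10, slope 0.4
```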
But we do live in the real world, and what we have is just a sample... just one of those multicolored lines, through which we are trying to estimate the truth (i.e. an intercept of $10$ and a slope of $0.4$). Let's conjure this sample: S <- population[sample(nrow(population), 100), ]; S$y <- S$y + rnorm(100, 0, 10), and fit its OLS regression line: fit <- lm(y ~ x, data = S).
Since we are playing god, let's plot our biased sample (dark blue dots, with its dark blue fitted regression line) together with the true line in solid green, and the maximum and minimum combinations of intercepts and slopes we got in our simulation (dashed red lines), giving us an idea of how far off we could possibly be from the true line:
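(A sketch of how such a plot could be drawn - the pairing of extreme intercepts and slopes below is just one reading of "maximum and minimum combinations":)

```r
# one sample, its fitted line, the true line, and the most extreme simulated lines
plot(S$x, S$y, pch = 16, col = "darkblue", xlab = "x", ylab = "y")
abline(fit, col = "darkblue", lwd = 2)                        # our sample's OLS line
abline(a = 10, b = 0.4, col = "green", lwd = 2)               # the true line
abline(a = max(reg_lines$intercepts), b = max(reg_lines$betas), col = "red", lty = 2)
abline(a = min(reg_lines$intercepts), b = min(reg_lines$betas), col = "red", lty = 2)
```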
Let's quantify this possible error using a Wald interval for the slope to generate the 95% confidence interval: coef(fit)[2] + c(-1,1) * 1.96 * summary(fit)$coefficients[4], where summary(fit)$coefficients[4] is the calculated standard error of the estimated slope. This gives us 0.2836088 to 0.4311044 (remember the "true" value $0.4$).
And for the intercept: coef(fit)[1] + c(-1,1) * 1.96 * summary(fit)$coefficients[3], which gives us 9.968347 to 17.640500.
Finally, let's compare these values to those generated by [R] when we type:
confint(fit)
(Intercept) 9.9204599 17.688387
x 0.2826881 0.432025
Pretty close...
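Part of the small discrepancy is that confint() uses the Student-$t$ quantile for $98$ degrees of freedom rather than the normal $1.96$; a quick sketch of the exact computation it performs for the slope:

```r
# t-based 95% interval for the slope, which is what confint() computes
se_b  <- summary(fit)$coefficients[2, 2]   # standard error of the slope
tcrit <- qt(0.975, df.residual(fit))       # about 1.98 for 98 df, versus 1.96
coef(fit)[2] + c(-1, 1) * tcrit * se_b
```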
OK, so this is a very intuitive approach to seeing what confidence intervals are trying to answer. As for the $p$-values, you can read how they are generated here. In general, the text notes that if the regression coefficient in the population is $0$ ($H_0: \beta = 0$) the $t$-statistic will be:
$$t = \frac{\hat\beta_{yx}-\beta_{yx}}{SE_{\hat\beta}} = \frac{\hat\beta_{yx}}{SE_{\hat\beta}}.$$
The $SE_{\hat\beta}$ (which we used in the Wald interval) can be calculated in different ways, although the formula given in the text quoted is:
$SE_{\hat\beta}=\sqrt{\frac{var(e)}{var(x) \, (N-2)}}$. If we calculate this manually:
The variance of the independent variable in our sample is var_x <- (sd(S$x))^2 = 719.0691. The variance of the errors is var_e <- sum((residuals(fit) - mean(residuals(fit)))^2) / (nrow(S) - 1) = 99.76605. And, with N <- nrow(S), N - 2 = 98 (we lose one $df$ for the intercept and one for the slope). Hence, $SE_{\hat\beta} = \small 0.03762643$ (SE <- sqrt(var_e / (var_x * (N - 2)))), which happily coincides with the value [R] reports for the slope of x:
summary(fit)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.80442 1.95718 7.053 2.49e-10 ***
x 0.35736 0.03763 9.497 1.49e-15 ***
So $t=\frac{\hat\beta_{yx}}{SE_{\hat\beta}} = \small 0.3573566 / 0.03762643 = 9.497488$ (t <- coef(fit)[2] / SE). What else? Right, the $p$-value... The one-tailed probability is pt(9.497488, 98, lower.tail = FALSE) = 7.460233e-16 ~ 0; doubling it for the two-sided test reproduces the Pr(>|t|) = 1.49e-15 shown in the summary above.
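Putting the manual calculation together as one runnable chunk (assuming the sample S and the fit from above):

```r
# manual reconstruction of the slope's standard error, t-statistic and p-value
N     <- nrow(S)
var_x <- (sd(S$x))^2                                               # variance of the predictor
var_e <- sum((residuals(fit) - mean(residuals(fit)))^2) / (N - 1)  # variance of the errors
SE    <- sqrt(var_e / (var_x * (N - 2)))                           # standard error of the slope
t     <- coef(fit)[2] / SE                                         # t under H0: beta = 0
p     <- 2 * pt(abs(t), df = N - 2, lower.tail = FALSE)            # two-sided p-value
c(SE, t, p)
```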
Best Answer
I'll discuss it in intuitive terms.
Both confidence intervals and prediction intervals in regression take account of the fact that the intercept and slope are uncertain - you estimate the values from the data, but the population values may be different (if you took a new sample, you'd get different estimated values).
A regression line will pass through $(\bar x, \bar y)$, and it's best to center the discussion about changes to the fit around that point - that is to think about the line $y= a + b(x-\bar x)$ (in this formulation, $\hat a = \bar y$).
If the line went through that $(\bar x, \bar y)$ point, but the slope were a little higher or lower (i.e. if the height of the line at the mean were fixed but the slope were a little different), what would that look like?
You'd see that the new line would move further away from the current line near the ends than near the middle, making a kind of slanted X that crossed at the mean (as each of the purple lines below do with respect to the red line; the purple lines represent the estimated slope $\pm$ two standard errors of the slope).
If you drew a collection of such lines with the slope varying a little from its estimate, you'd see the distribution of predicted values near the ends 'fan out' (imagine the region between the two purple lines shaded in grey, for example, because we sampled again and drew many such slopes near the estimated one; we can get a sense of this by bootstrapping a line through the point ($\bar{x},\bar{y}$)). Here's an example using 2000 resamples with a parametric bootstrap:
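(The figure isn't reproduced here, but a sketch of that kind of parametric bootstrap, with toy data and styling of my own, would be: simulate new responses from the fitted model, refit, and draw each refitted slope as a line through $(\bar x, \bar y)$.)

```r
# parametric bootstrap of the slope, drawing each resampled line through (mean(x), mean(y))
set.seed(42)
x  <- runif(50, 0, 10)
y  <- 2 + 0.5 * x + rnorm(50, sd = 1.5)
xc <- x - mean(x)                     # centred predictor, so the intercept is the mean of y
fit <- lm(y ~ xc)
sigma_hat <- summary(fit)$sigma

plot(x, y)
for (i in 1:2000) {
  y_star <- fitted(fit) + rnorm(length(y), sd = sigma_hat)  # simulate from the fitted model
  b_star <- coef(lm(y_star ~ xc))[2]                        # refitted slope
  abline(mean(y) - b_star * mean(x), b_star, col = rgb(0.5, 0.5, 0.5, 0.05))
}
abline(mean(y) - coef(fit)[2] * mean(x), coef(fit)[2], col = "red", lwd = 2)  # estimated line
```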
If instead you take account of the uncertainty in the constant (making the line pass close to but not quite through $(\bar x, \bar y)$), that moves the line up and down, so intervals for the mean at any $x$ will sit above and below the fitted line.
(Here the purple lines are $\pm$ two standard errors of the constant term either side of the estimated line).
When you do both at once (the line may be up or down a tiny bit, and the slope may be slightly steeper or shallower), then you get some amount of spread at the mean, $\bar x$, because of the uncertainty in the constant, and you get some additional fanning out due to the slope's uncertainty, between them producing the characteristic hyperbolic shape of your plots.
That's the intuition.
Now, if you like, we can consider a little algebra (but it's not essential):
It's actually the square root of the sum of the squares of those two effects - you can see it in the confidence interval's formula. Let's build up the pieces:
The $a$ standard error with $b$ known is $\sigma /\sqrt{n}$ (remember $a$ here is the expected value of $y$ at the mean of $x$, not the usual intercept; it's just a standard error of a mean). That's the standard error of the line's position at the mean ($\bar x$).
The $b$ standard error with $a$ known is $\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}$. The effect of uncertainty in slope at some value $x^*$ is multiplied by how far you are from the mean ($x^*-\bar x$) (because the change in level is the change in slope times the distance you move), giving $(x^*-\bar x)\cdot\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}$.
Now the overall effect is just the square root of the sum of the squares of those two things (why? because variances of uncorrelated things add, and if you write your line in the $y= a + b(x-\bar x)$ form, the estimates of $a$ and $b$ are uncorrelated). So the overall standard error is the square root of the overall variance, and the variance is the sum of the variances of the components - that is, we have
$\sqrt{(\sigma /\sqrt{n})^2+ \left[(x^*-\bar x)\cdot\sigma/\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\right]^2 }$
A little simple manipulation gives the usual term for the standard error of the estimate of the mean value at $x^*$:
$\sigma\sqrt{\frac{1}{n}+ \frac{(x^*-\bar x)^2}{\sum_{i=1}^n (x_i-\bar{x})^2} }$
If you draw that as a function of $x^*$, you'll see it forms a curve (looks like a smile) with a minimum at $\bar x$, that gets bigger as you move out. That's what gets added to / subtracted from the fitted line (well, a multiple of it is, in order to get a desired confidence level).
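To see that formula at work, here's a small check (illustrative data and names, not from the answer) that it reproduces the standard error R itself uses for the mean response:

```r
# hand-built SE of the estimated mean at new points, versus predict()'s se.fit
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1.5)
fit <- lm(y ~ x)

x_star    <- c(1, 5, 9)
sigma_hat <- summary(fit)$sigma                 # residual standard error
se_manual <- sigma_hat * sqrt(1 / length(x) +
                              (x_star - mean(x))^2 / sum((x - mean(x))^2))
se_R <- predict(fit, newdata = data.frame(x = x_star), se.fit = TRUE)$se.fit

cbind(se_manual, se_R)   # the two columns agree; both are smallest near mean(x)
```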
[With prediction intervals, there's also the variation in position due to the process variability; this adds another term that shifts the limits up and down, making a much wider spread, and because that term usually dominates the sum under the square root, the curvature is much less pronounced.]
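Continuing the sketch above: the prediction interval simply adds $\sigma^2$ under the square root, which is why it is much wider and only slightly curved.

```r
# confidence vs prediction interval at one new point (reusing fit and sigma_hat from above)
new <- data.frame(x = 5)
predict(fit, new, interval = "confidence")   # uncertainty in the fitted mean only
predict(fit, new, interval = "prediction")   # adds sigma^2 for a new observation
se_mean <- predict(fit, new, se.fit = TRUE)$se.fit
qt(0.975, df.residual(fit)) * sqrt(sigma_hat^2 + se_mean^2)   # half-width of the prediction interval
```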