[Math] Prediction intervals around a regression line


I have a set of observations (x,y).

I want to use x values to predict y.

I fit a simple linear regression, which gives me an equation y = mx + c. This is the thin black line.

How do I construct confidence intervals around the predicted value of y for any given x (e.g. like the red lines on the graph), with the aim that these red lines should theoretically contain 95% of the data?

[Figure: scatter plot of the data with the fitted regression line (thin black) and the desired 95% bands (red).]

Edit: Here is the solution graph using the accepted answer.

[Figure: solution graph produced using the accepted answer.]

Best Answer

What you're basically trying to get is a 95% prediction interval for $y_0$, a new point with $x$-coordinate equal to some value $x_0$, say. (Since it must cover a new observation rather than just the mean response, this is a prediction interval, not a confidence interval for the mean.) If we let $\mu(x_0)$ denote the true mean, according to our linear model, of points with $x = x_0$, then the model for $y_0$ is:

$$y_0 = \mu(x_0) + \epsilon_0$$

where $\epsilon_0$ is the (normally distributed) random error, iid across all points. What this basically means is that if you're trying to predict where a new point $y_0$ will lie, your guess will have randomness arising from your estimate $\hat{\mu}(x_0)$ of $\mu(x_0)$, but also some randomness arising from the error term $\epsilon_0$.

Therefore, to get an interval for $y_0$, we just need to study the variability of our estimate of $\mu(x_0)$, and then add a term to account for the randomness in $\epsilon_0$. We know that the variance of $\epsilon_0$ is $\sigma^2$, which is estimated by $s^2$, so the crux of the matter is to study the standard deviation of your estimate $\hat{\mu}(x_0)$ of the true mean $\mu(x_0)$ and then fold in $s$ to account for the error term.

First, express your estimated mean for a point at $x=x_0$ in terms of things you already know. Notice that, if $a$ is your estimate of the $y$-intercept and if $b$ is your estimate of the true slope, then:

$$\hat{\mu}(x_0) = a + x_0 b$$

Now, $a$ and $b$ are random variables. If we assume normal errors in the model, then $a$ and $b$ are themselves normally distributed, which means that the linear combination $\hat{\mu}(x_0) = a + x_0 b$ will also be normally distributed. All that's left is to calculate the variance (and hence standard deviation) of $\hat{\mu}(x_0)$, and then we can find the interval using a $t$ critical value (since we also have to estimate the true variance of the model's error term).
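As a concrete illustration, here is a minimal Python sketch of these least-squares estimates. The synthetic data, seed, and query point `x0` are assumptions for illustration, not taken from the question:

```python
# Minimal sketch: estimate the intercept a and slope b by ordinary
# least squares, using numpy and synthetic placeholder data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=30)  # true line plus noise

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)  # the sum-of-squares term in the formulas below

b = np.sum((x - x_bar) * (y - y_bar)) / Sxx  # estimated slope
a = y_bar - b * x_bar                        # estimated intercept

x0 = 5.0            # hypothetical query point
mu_hat = a + x0 * b  # estimated mean response at x = x0
```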

To find the variance of $\hat{\mu}(x_0)$ we just need the variances of $a$ and $b$ and then their covariance. The variances of $a$ and $b$ are:

$$Var(a) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i-\bar{x})^2} \right),\quad Var(b) = \sigma^2 \frac{1}{\sum (x_i - \bar{x})^2}$$

Their covariance is:

$$Cov(a,b) = -\sigma^2 \frac{\bar{x}}{\sum (x_i - \bar{x})^2}$$

Combining these via $Var(\hat{\mu}(x_0)) = Var(a) + x_0^2 Var(b) + 2 x_0 Cov(a,b)$ and doing some algebra, this means that

$$SD(\hat{\mu}(x_0)) = \sigma \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

Now, replacing $\sigma$ with $s$, the standard error calculated from your residuals, and using the appropriate $t$ critical value for 95% confidence (with $n-2$ degrees of freedom!) yields the half-width of your confidence interval for the mean:

$$ t^* \cdot s \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}} $$
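Continuing the sketch above, this interval for the mean at $x_0$ could be computed as follows (assuming scipy is available for the $t$ critical value):

```python
# Continuing the sketch: 95% confidence interval for the mean response
# at x = x0, using a t critical value with n - 2 degrees of freedom.
from scipy import stats

resid = y - (a + b * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))  # residual standard error

t_star = stats.t.ppf(0.975, df=n - 2)    # two-sided 95% critical value
se_mean = s * np.sqrt(1.0 / n + (x0 - x_bar) ** 2 / Sxx)

ci = (mu_hat - t_star * se_mean, mu_hat + t_star * se_mean)
```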

But this is only for the estimated mean. To come full circle and build a prediction interval for a new point, you need to add in the variance $s^2$ of the random error which takes the point off of the mean; this puts an extra $1$ under the square root, so the half-width becomes:

$$ t^* \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}} $$
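In the same sketch, the prediction interval differs only by that extra $1$ under the square root:

```python
# Continuing the sketch: 95% prediction interval for a new observation
# at x = x0; the standard error gains an extra 1 under the square root.
se_pred = s * np.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / Sxx)

pi = (mu_hat - t_star * se_pred, mu_hat + t_star * se_pred)
```

If statsmodels is available, `sm.OLS(y, sm.add_constant(x)).fit().get_prediction(...)` with `summary_frame()` should reproduce both the mean (`mean_ci_*`) and observation (`obs_ci_*`) bounds, which makes a convenient cross-check on the hand-rolled formulas.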
