Solved – How to estimate the standard deviation of residuals on a residual plot

data visualizationdescriptive statisticsregressionresidualsstandard deviation

So I understand that on the normal histogram and scatter plot, we estimate the standard deviation by looking at the portion which contains $\frac23$ of the data.

In terms of a residual plot from a regression, however, it is a randomly sized region around (x or y)= 0. How do we estimate the standard deviation of the residuals?

Best Answer

You can do this on a plot of residuals - plain, raw residuals, not standardized or studentized - vs anything at all - residuals vs fitted, residuals vs x (indeed any predictor in a multiple regression), residuals vs index number, residuals vs a variable you didn't use in the regression, whatever you like. It's only the fact that the residuals ($y-\hat{y}$) are plotted on one of the axes that's important.

The residuals are most typically plotted on the y-axis so in that case it's the y-axis you pay attention to. It doesn't make a difference what's on the x-axis when doing this, you don't pay any attention to it. [If your residuals have been plotted on the x-axis for some reason then that is the direction you pay attention to.]

Here's a plot of residuals vs fitted values for regression on a particular data set. There are 50 observations (though one of the residuals near 0 is not visible because it lays exactly over another point). Since we want an interval containing 2/3 of the points, we'll have 1/6 of the points above the top of the interval and 1/6 of the points below it.

plot of residuals vs fitted with horizontal lines marked between the 8th and 9th largest and 8th and 9th smallest residuals

Since 50/6 is 8-and-a-bit, let's leave 8 points outside either end, so we'll mark our lines about halfway between the 8th and 9th points from the top and bottom of the plot. [You could mount an argument for drawing the lines right at the 9th point and saying that's 8.5 points outside each of the lines and 50-8.5-8.5 = 33 between them, but this is pretty approximate anyway so I won't worry too much. For example, 2/3 is slightly too small so I'll stick with 8 points outside each end of the interval.]

Measuring the distance between those two horizontal lines gives a distance of about 2.2 in this case. Since that interval containing the middle 2/3 of the data should be about 2 standard deviations wide, we estimate the standard deviation to be $s\approx 2.2/2 = 1.1$.

The actual standard deviation of residuals from the regression output ($\sqrt{SSE/(n-2)}\,$) was $s = 1.102$.

Not too shabby!