Variance vs Mean Squared Error – Understanding the Difference

Tags: error, variance

I'm surprised this hasn't been asked before, but I cannot find the question on stats.stackexchange.

This is the formula to calculate the variance of a normally distributed sample:

$$\frac{\sum (X_i - \bar{X})^2}{n-1}$$

This is the formula to calculate the mean squared error of observations in a simple linear regression:

$$\frac{\sum (y_i - \hat{y}_i)^2}{n-2}$$

What's the difference between these two formulas? The only difference I can see is that the MSE divides by $n-2$. If that's the only difference, why not call both of them a variance, just with different degrees of freedom?

Best Answer

The mean squared error as you have written it for OLS is hiding something:

$$\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^{n}\left[y_i - \left(\hat{\beta}_{0} + \hat{\beta}_{1}x_{i}\right)\right]^2}{n-2}$$

Notice that the numerator sums over a function of both $y$ and $x$, so you lose a degree of freedom for each variable (or for each estimated parameter, the intercept and the slope, explaining one variable as a function of the other if you prefer), hence $n-2$. In the formula for the sample variance, the numerator is a function of a single variable and only one parameter (the sample mean) is estimated, so you lose just one degree of freedom in the denominator.
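To make the two denominators concrete, here is a minimal sketch in Python (the simulated data, variable names, and use of NumPy are illustrative assumptions, not anything from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear relationship plus noise
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.5, size=n)

# Sample variance of y: one estimated parameter (the mean), so divide by n - 1
y_bar = y.mean()
sample_var = np.sum((y - y_bar) ** 2) / (n - 1)

# OLS fit of y on x: two estimated parameters (intercept and slope),
# so the residual sum of squares is divided by n - 2
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
mse = np.sum((y - y_hat) ** 2) / (n - 2)

print(sample_var)  # spread of y around its mean
print(mse)         # spread of y around the fitted line
```

In this simulated example the MSE sits near the noise variance used to generate the data, while the sample variance of $y$ is larger because it also includes the spread induced by $x$.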

However, you are on the right track in noticing that these are conceptually similar quantities. The sample variance measures the spread of the data around the sample mean (in squared units), while the MSE measures the vertical spread of the data around the sample regression line (in squared vertical units).
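One way to see the link directly: if you fit an intercept-only regression, every fitted value is $\bar{y}$, only one parameter is estimated, and the residual mean square with $n-1$ in the denominator is exactly the sample variance. Continuing the hypothetical sketch above:

```python
# Intercept-only "regression": the single fitted parameter is the sample mean,
# so y_hat is y_bar everywhere and only one degree of freedom is lost
y_hat_intercept_only = np.full(n, y_bar)
mse_intercept_only = np.sum((y - y_hat_intercept_only) ** 2) / (n - 1)

print(np.isclose(mse_intercept_only, sample_var))  # True: same quantity
```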