Variance – Estimator of Variance of Error

Tags: linear-model, regression, variance

i. It is a known fact that an unbiased estimator of the population variance is given by: $S^{2}=\dfrac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}$

This requires no distributional assumption, so it holds for any distribution of $X$ with finite variance.
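As a quick sketch of (i) in code (a minimal numpy example; the exponential distribution is just an arbitrary non-normal choice), this is exactly what `numpy.var` computes with `ddof=1`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50)  # non-normal on purpose; true variance is 2.0**2 = 4

# Divide the sum of squared deviations by n - 1, which is what numpy's ddof=1 does
s2 = np.sum((x - x.mean()) ** 2) / (len(x) - 1)
print(s2, np.var(x, ddof=1))  # identical values
```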

ii. In linear regression, an unbiased estimator of the variance of the error $\epsilon$ is given by:
$S^{2}=\dfrac{1}{n-p-1}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$

which is the residual sum of squares divided by the sample size minus the number of estimated parameters: the $p$ feature coefficients plus the intercept.
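For concreteness, here is a minimal numpy sketch of this estimator on made-up data (the design, coefficients, and noise level below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
eps = rng.normal(scale=1.5, size=n)            # true error sd 1.5, so Var(eps) = 2.25
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + eps

# OLS fit with an explicit intercept column
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta_hat

s2 = resid @ resid / (n - p - 1)               # RSS / (n - p - 1)
print(s2)                                      # should be near 2.25
```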

My question is: why don't we use the formula in (i), treating the residuals as a sample from the distribution of $\epsilon$? What is the difference between the two?

Best Answer

Suppose you have a random variable $Y$ that for simplicity you assume is normally distributed. So $$Y = \mu + \epsilon \quad \text{where } \epsilon \sim N(0, \sigma^2), \quad \text{so that } Y \sim N(\mu, \sigma^2)\,.$$

If you knew $\mu$, then an unbiased estimator of $\sigma^2$ would be $$S_1^{2}=\dfrac{1}{n}\sum_{i=1}^{n}\left(y_{i}- \mu\right)^{2}\,.$$

If $\mu$ is not known, then you can estimate it with $\bar{y}$. Now an unbiased estimator of the variance $\sigma^2$, as you mention, would be $$S_2^{2}=\dfrac{1}{n-1}\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\,.$$ The $n$ in the denominator changes to $n-1$ because some effort has now gone into estimating $\mu$; this is the concept of degrees of freedom.
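A small Monte Carlo sketch of this distinction (all parameter values below are arbitrary): with $\mu$ known, dividing by $n$ is unbiased; with $\mu$ estimated by $\bar{y}$, dividing by $n$ is biased low and $n-1$ corrects it:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 5.0, 4.0, 10, 200_000

# reps independent samples of size n from N(mu, sigma2)
y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

s1 = ((y - mu) ** 2).mean(axis=1)              # mu known: divide by n
ybar = y.mean(axis=1, keepdims=True)
s2 = ((y - ybar) ** 2).sum(axis=1) / (n - 1)   # mu estimated: divide by n - 1
naive = ((y - ybar) ** 2).mean(axis=1)         # mu estimated, but still divide by n

print(s1.mean(), s2.mean(), naive.mean())
# approximately 4.0, 4.0, 3.6: the naive version is biased low by a factor (n-1)/n
```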

Now suppose you don't know $\mu$, but you know that it is of the form $\mu_X = \beta_0 + \beta_1X_1 + \dots + \beta_p X_p$, where the $X$s are given to you and the $\beta$s are unknown. This is the linear regression model. $$\mu_X = E(Y \mid X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\,.$$ Since the $\beta$s are unknown, you estimate them using ordinary least squares and obtain an estimator of $\mu_X$: $$\hat{y}_i = \hat{\mu}_X = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p\,.$$ In estimating $\mu_X$, we had to estimate all of $\beta_0, \beta_1, \dots, \beta_p$, which is $p+1$ parameters. So now an unbiased estimator is

$$S_3^{2}=\dfrac{1}{n-p-1}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_i\right)^{2}\,.$$

Thus, the change in the denominator comes from the fact that each time we use a different estimator for the mean, and each parameter estimated from the data costs one degree of freedom. This is also why formula (i) fails on the residuals: with an intercept in the model the residuals average to zero, so (i) reduces to the residual sum of squares over $n-1$, and it underestimates $\sigma^2$ because the fitted values $\hat{y}_i$ are closer to the $y_i$ than the true means are.
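A Monte Carlo sketch of that bias (the dimensions and noise level below are arbitrary choices), comparing $\text{RSS}/(n-p-1)$ with the naive $\text{RSS}/(n-1)$ obtained by treating the residuals as a plain sample:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2, reps = 30, 5, 1.0, 20_000

est_correct, est_naive = [], []
for _ in range(reps):
    # random design with an intercept column, random true coefficients
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta = rng.normal(size=p + 1)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss = resid @ resid
    est_correct.append(rss / (n - p - 1))  # proper degrees of freedom
    est_naive.append(rss / (n - 1))        # formula (i) on the residuals
                                           # (their mean is 0 thanks to the intercept)

print(np.mean(est_correct), np.mean(est_naive))
# approximately 1.0 vs 0.83 = (n - p - 1)/(n - 1): the naive estimator is biased low
```

The naive estimator's downward bias is exactly the factor $(n-p-1)/(n-1)$, which shrinks toward zero as $p$ grows relative to $n$: the more parameters you fit, the more the residuals understate the true error variance.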