Solved – Why does the error sum of squares have n-2 df (possibly not a duplicate, please read on)? (Regression Question Series – Part 4)

chi-squared-distribution, degrees-of-freedom, error, regression, sums-of-squares

In simple linear regression, the error sum of squares is given by

$$
\text{SSE} = \sum_{i=1}^n(y_i - \hat{y_i})^2 \\
\hat{\sigma}^2 = s^2 = \dfrac{\text{SSE}}{n-2}
$$

where $n-2$ is the degrees of freedom.
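For concreteness, here is a minimal Python sketch (the data-generating values below are made up purely for illustration) that fits the least-squares line by hand and computes $\text{SSE}$ and $s^2 = \text{SSE}/(n-2)$:

```python
import numpy as np

# Illustrative data only: true line y = 2 + 0.5 x with sigma = 1.
rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Least-squares estimates of the intercept and slope.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)   # error sum of squares
s2 = sse / (n - 2)               # divide by n-2, the degrees of freedom
print(sse, s2)
```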

Question:
1. Why n-2?

Answers elsewhere:

  1. Most stop at telling us it is $n-2$ because we need to estimate $\beta_1,\beta_0$
    before calculating $\hat{y}$ (source).
  2. An answer here suggests that, assuming the errors are normally distributed ($\varepsilon \sim N(0,\sigma^2)$), the residual sum of squares has a chi-squared distribution with $n-2$ df, as below.
    $$\begin{aligned}
    \text{SSE} \sim \sigma^2 \text{Chi-Sq(df=n-2)}
    \end{aligned}$$

    Here is a proof of the above, which again involves matrices, and I was lost at the orthogonal transformation. Another one here uses the hat matrix.

What did I do?
1. Hoping for a simpler proof, along the lines of showing that the sample variance is an unbiased estimator (as shown here), I attempted the following, but got stuck after a few steps.

$$\begin{aligned}
E(s^2) &= E\bigg(\dfrac{1}{n-2}\sum_{i=1}^n(y_i - \hat{y_i})^2\bigg) \\
&= E\bigg(\dfrac{1}{n-2}\sum_{i=1}^n(y_i^2 + \hat{y_i}^2 - 2y_i\hat{y_i})\bigg) \\
\end{aligned}$$

For any random variables $X_1$ and $X_2$,
$$
E\bigg( \sum_{i=1}^2 X_i \bigg) = E\bigg( X_1 + X_2 \bigg) = E(X_1) + E(X_2) = \sum_{i=1}^2 E(X_i)
$$

That is, the expectation moves inside the summation because $E(X+Y) = E(X) + E(Y)$.

Using the same technique,

$$\begin{aligned}
E(s^2) &= E\bigg(\dfrac{1}{n-2}\sum_{i=1}^n(y_i^2 + \hat{y_i}^2 - 2y_i\hat{y_i})\bigg) \\
&=\dfrac{1}{n-2} \sum_{i=1}^n \big( \ E(y_i^2) + E(\hat{y_i}^2) - 2E(y_i\hat{y_i}) \ \big) & \text{(1) stuck}
\end{aligned}$$

I am stuck after this step. I wanted to show that the above ends up as $\sigma^2$, thus proving that $s^2 = \text{SSE}/(n-2)$ is unbiased.

Is it a duplicate Q?:
I am learning these topics as part of "Intro to Statistics" on Udacity, which gives very little mathematical background (basic intuition plus a formula to apply, without understanding the underlying system), so I have been using a few books 1, 2 as references and filling gaps with SE. The topics completed so far (distributions, MLE, CI, hypothesis testing) have not required matrices/vectors/quadratic forms, because so far I have only dealt with single random variables (univariate), and chi-squared has not yet been covered. The books are "introductory". However, many of the proofs I find here use vectors/matrices, which I find difficult to grasp, so in the hope of a simpler answer for an "introductory" student I am posting this question; hopefully that also makes it not a duplicate.


Best Answer

The key form you want to reach is expression (5) below. The derivation is all algebra.

Assume the model $Y_i=\beta_0+\beta_1X_i+\varepsilon_i$. Then $\bar Y=\beta_0 +\beta_1\bar X+\bar\varepsilon$ so that $$Y_i-\bar Y = \beta_1(X_i-\bar X) + (\varepsilon_i-\bar\varepsilon).\tag1$$ The least squares estimators of $\beta_0$ and $\beta_1$ are, respectively, $$ \hat{\beta_0}:=\bar Y -\hat{\beta_1}\bar X \qquad{\text {and}}\qquad \hat{\beta_1}:=\frac{\sum(X_i-\bar X)(Y_i-\bar Y)}{\operatorname{SSX}},\tag2 $$ where $\operatorname{SSX}:=\sum(X_i-\bar X)^2$. Substitute (1) into the expression for $\hat\beta_1$: $$ \hat\beta_1=\frac{\sum(X_i-\bar X)\left(\beta_1(X_i-\bar X) + (\varepsilon_i-\bar\varepsilon)\right)}{\operatorname{SSX}}=\beta_1+\frac{\sum(X_i-\bar X)(\varepsilon_i-\bar\varepsilon)}{\operatorname{SSX}} $$ to obtain an alternative expression for $\hat\beta_1$ for later use: $$\hat\beta_1-\beta_1=\frac{\sum(X_i-\bar X)(\varepsilon_i-\bar\varepsilon)}{\operatorname{SSX}}.\tag{3} $$ Now derive an expression for $\operatorname{SSE}$. Plug $\hat{\beta_0}$ into $\hat Y_i:=\hat{\beta_0}+\hat{\beta_1}X_i$ to get $$ Y_i-\hat {Y_i} = (\varepsilon_i-\bar\varepsilon) - (\hat{\beta_1}-\beta_1)(X_i-\bar X).\tag4 $$ Square both sides of (4) and sum over $i$. This yields $$ \begin{aligned} \operatorname{SSE}&:=\sum(Y_i-\hat {Y_i})^2\\ &=\sum(\varepsilon_i-\bar\varepsilon)^2-2(\hat\beta_1-\beta_1)\sum(\varepsilon_i-\bar\varepsilon)(X_i-\bar X)+(\hat\beta_1-\beta_1)^2\sum(X_i-\bar X)^2 \\ &\stackrel{(3)}=\sum(\varepsilon_i-\bar\varepsilon)^2 - (\hat{\beta_1}-\beta_1)^2\operatorname{SSX}. \end{aligned} $$ Writing $\sum(\varepsilon_i-\bar\varepsilon)^2=\sum\varepsilon_i^2-n\bar\varepsilon^2$, divide through by $\sigma^2$ in this last expression for SSE and rearrange to the form $$ \boxed{\sum_{i=1}^n\left[\frac{\varepsilon_i}\sigma\right]^2= \left[\frac{\bar\varepsilon}{\sigma/\sqrt n}\right]^2 + \left[\frac{\hat{\beta_1}-\beta_1}{\sigma/\sqrt{\operatorname{SSX}}}\right]^2+\frac{\operatorname{SSE}}{\sigma^2}.}\tag5 $$ Now for some distribution theory. You can check$^\color{red}a$ that each of the bracketed items in (5) has a standard normal distribution. The expectation of the square of a standard normal equals its variance, which is 1. Conclude from (5) the expectation of $\operatorname{SSE}/\sigma^2$ is $n-2$, so $\operatorname{SSE}/(n-2)$ is an unbiased estimator of $\sigma^2$.
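If it helps to see the algebra above numerically, here is a small simulation sketch (the parameter values are arbitrary choices for illustration) that checks identity (5) exactly on one sample and then averages $\operatorname{SSE}/(n-2)$ over many replications to illustrate unbiasedness:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 30, 1.0, 2.0, 1.5   # arbitrary illustrative values
x = rng.uniform(0, 10, size=n)               # fixed design points
ssx = np.sum((x - x.mean()) ** 2)

def fit_sse(y):
    """Least-squares fit of y on x; returns (beta1_hat, SSE)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
    b0 = y.mean() - b1 * x.mean()
    return b1, np.sum((y - b0 - b1 * x) ** 2)

# Check identity (5) on a single sample: LHS and RHS should agree exactly.
eps = rng.normal(0.0, sigma, size=n)
y = beta0 + beta1 * x + eps
b1_hat, sse = fit_sse(y)
lhs = np.sum((eps / sigma) ** 2)
rhs = ((eps.mean() / (sigma / np.sqrt(n))) ** 2
       + ((b1_hat - beta1) / (sigma / np.sqrt(ssx))) ** 2
       + sse / sigma ** 2)
print(lhs, rhs)                              # equal up to rounding

# Monte Carlo check that E[SSE/(n-2)] = sigma^2.
reps = 20000
s2 = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    s2[r] = fit_sse(y)[1] / (n - 2)
print(s2.mean(), sigma ** 2)                 # the Monte Carlo mean should be close to sigma^2
```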

We can go further and derive the chi-square($n-2$) distribution for $\operatorname{SSE}/\sigma^2$. What is not obvious, and this is the step that requires matrix algebra and/or multivariable calculus to prove, is that the three terms on the RHS of (5) are mutually independent$^\color{red}b$. Using this, and the fact that the LHS of (5) is the sum of squares of $n$ independent standard normal variables, it follows that $\operatorname{SSE}/\sigma^2$ must have$^\color{red}c$ the distribution of the sum of squares of $n-2$ independent standard normal variables. This is the chi-square($n-2$) distribution.
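As a sanity check on this distributional claim (again just a simulation sketch with arbitrary parameter values), one can compare simulated values of $\operatorname{SSE}/\sigma^2$ against the chi-square($n-2$) distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, beta0, beta1, sigma = 15, 1.0, 2.0, 1.5   # arbitrary illustrative values
x = rng.uniform(0, 10, size=n)
ssx = np.sum((x - x.mean()) ** 2)

reps = 10000
sse_over_var = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
    b0 = y.mean() - b1 * x.mean()
    sse_over_var[r] = np.sum((y - b0 - b1 * x) ** 2) / sigma ** 2

# Compare simulated SSE/sigma^2 with the chi-square(n-2) distribution.
print(sse_over_var.mean(), n - 2)            # mean of chi-square(n-2) is n-2
print(sse_over_var.var(), 2 * (n - 2))       # its variance is 2(n-2)
print(stats.kstest(sse_over_var, stats.chi2(df=n - 2).cdf))
```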


$\color{red}{a}$: Since $\sum(X_i-\bar X)(\varepsilon_i-\bar\varepsilon)=\sum(X_i-\bar X)\varepsilon_i$, we deduce from (3) that $$E(\hat\beta_1)=\beta_1\qquad\text{and}\qquad\operatorname{Var}(\hat\beta_1)=\frac{\sigma^2}{\operatorname{SSX}}.$$

$\color{red}b$: Define a change of variables from $(\varepsilon_1,\ldots,\varepsilon_n)$ to $(Z_1,\ldots,Z_n)$ by $$ \begin{aligned} Z_1&:=\bar\varepsilon\\ Z_2&:=\hat\beta_1\\ Z_i&:=Y_i-\hat Y_i,\qquad i=3,\ldots,n. \end{aligned} $$ From (4) we see that $\sum(Y_i-\hat Y_i)=0$ and $\sum(X_i-\bar X)(Y_i-\hat Y_i)=0$, implying that we can solve for $Y_1-\hat Y_1$ and $Y_2-\hat Y_2$ in terms of $Z_3,\ldots,Z_n$, and therefore that $\operatorname{SSE}$ is a function of $Z_3,\ldots,Z_n$. Given (5), we see the joint density of $Z_1,\ldots,Z_n$ has the form $$ g(z_1)h(z_2) k(z_3,\ldots,z_n); $$ note the Jacobian of the transformation is free of $z$ since the map from $\varepsilon$ to $z$ is multiplication by a constant matrix. This factorization means that $Z_1$, $Z_2$, and $(Z_3,\ldots,Z_n)$ are mutually independent, and therefore so are the three terms on the RHS of (5).
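A quick numerical check of footnotes $\color{red}a$ and $\color{red}b$ (a simulation sketch with arbitrary parameter values; near-zero sample correlations are consistent with, though of course not a proof of, independence):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta0, beta1, sigma = 25, 1.0, 2.0, 1.5   # arbitrary illustrative values
x = rng.uniform(0, 10, size=n)
ssx = np.sum((x - x.mean()) ** 2)

reps = 20000
b1_hat = np.empty(reps)
term1 = np.empty(reps)   # [eps_bar / (sigma/sqrt(n))]^2
term2 = np.empty(reps)   # [(beta1_hat - beta1) / (sigma/sqrt(SSX))]^2
term3 = np.empty(reps)   # SSE / sigma^2
for r in range(reps):
    eps = rng.normal(0.0, sigma, size=n)
    y = beta0 + beta1 * x + eps
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - b0 - b1 * x) ** 2)
    b1_hat[r] = b1
    term1[r] = (eps.mean() / (sigma / np.sqrt(n))) ** 2
    term2[r] = ((b1 - beta1) / (sigma / np.sqrt(ssx))) ** 2
    term3[r] = sse / sigma ** 2

print(b1_hat.mean(), beta1)                  # footnote a: E(beta1_hat) = beta1
print(b1_hat.var(), sigma ** 2 / ssx)        # footnote a: Var(beta1_hat) = sigma^2/SSX
print(np.corrcoef([term1, term2, term3]))    # off-diagonal entries near 0
```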

$\color{red}c$: One way to prove this: use moment generating functions, independence, and the fact that the moment generating function determines the distribution uniquely.
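To spell this out (a sketch, using the independence from footnote $\color{red}b$ and the fact that a chi-square($k$) variable has moment generating function $(1-2t)^{-k/2}$ for $t<1/2$): the LHS of (5) is chi-square($n$) and the first two bracketed terms on the RHS are each chi-square(1), so

$$\begin{aligned}
(1-2t)^{-n/2} &= E\Big[e^{t\sum_{i=1}^n(\varepsilon_i/\sigma)^2}\Big] \\
&= E\Big[e^{t\left[\bar\varepsilon/(\sigma/\sqrt n)\right]^2}\Big]\;E\Big[e^{t\left[(\hat\beta_1-\beta_1)/(\sigma/\sqrt{\operatorname{SSX}})\right]^2}\Big]\;E\Big[e^{t\operatorname{SSE}/\sigma^2}\Big] \\
&= (1-2t)^{-1/2}\,(1-2t)^{-1/2}\,E\Big[e^{t\operatorname{SSE}/\sigma^2}\Big],
\end{aligned}$$

which gives $E\big[e^{t\operatorname{SSE}/\sigma^2}\big] = (1-2t)^{-(n-2)/2}$, the moment generating function of a chi-square($n-2$) variable; since the moment generating function determines the distribution, $\operatorname{SSE}/\sigma^2 \sim \chi^2_{n-2}$.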
