[Math] How to prove that the sum of squared errors follows a chi-square distribution with $n-2$ degrees of freedom in simple linear regression

linear regression, regression, regression analysis

In simple linear regression, the model is
\begin{equation}
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
\end{equation}

where $\varepsilon_i$ are i.i.d., and
\begin{equation}
\varepsilon_i \sim N(0, \sigma^2)
\end{equation}

Suppose $b_0$ and $b_1$ are the least squares estimators of $\beta_0$ and $\beta_1$, respectively. Then the fitted values are
\begin{equation}
\hat{Y_i} = b_0 + b_1X_i
\end{equation}

Define SSE as
\begin{equation}
SSE = \sum_{i=1}^{n}(Y_i - \hat{Y_i})^2
\end{equation}

How can one prove that $\frac{SSE}{\sigma^2} \sim \chi^2(n-2)$?

NOTE: There are several similar questions, but their answers do not really address this one: they use matrix notation and matrix-related results rather than working directly through this simplest case. Here there is only one independent variable $X$; please do not use matrix notation. Thanks!
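As a numerical sanity check of the claim (not a proof), here is a minimal simulation sketch; the sample size, design vector $X$, coefficients, and $\sigma$ below are arbitrary illustrative choices, not part of the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: the question does not fix n, X, the betas, or sigma.
n, beta0, beta1, sigma = 20, 1.0, 2.0, 3.0
X = rng.uniform(0, 10, size=n)  # fixed design, reused in every replication

reps = 20000
sse_over_sigma2 = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, sigma, size=n)
    Y = beta0 + beta1 * X + eps
    # Least squares fit with one predictor
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    resid = Y - (b0 + b1 * X)
    sse_over_sigma2[r] = np.sum(resid ** 2) / sigma ** 2

# Compare the simulated distribution of SSE / sigma^2 with chi-square(n - 2).
print("simulated mean:", sse_over_sigma2.mean(), "  theoretical:", n - 2)
print("simulated var :", sse_over_sigma2.var(), "  theoretical:", 2 * (n - 2))
print(stats.kstest(sse_over_sigma2, stats.chi2(df=n - 2).cdf))
```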

Best Answer

The chi-square distribution can be deduced using a bit of algebra, and then some distribution theory.

Algebra: Using the overbar to denote sample mean, we have $\bar Y=\beta_0 +\beta_1\bar X+\bar\varepsilon$ so that $$Y_i-\bar Y = \beta_1(X_i-\bar X) + (\varepsilon_i-\bar\varepsilon).\tag1$$ The least squares estimators of $\beta_0$ and $\beta_1$ are, respectively, $$ \hat{\beta_0}=\bar Y -\hat{\beta_1}\bar X \qquad{\text {and}}\qquad \hat{\beta_1}=\frac{\sum(X_i-\bar X)(Y_i-\bar Y)}{\operatorname{SSX}},\tag2 $$ where $\operatorname{SSX}:=\sum(X_i-\bar X)^2$. Plug $\hat{\beta_0}$ into $\hat Y_i:=\hat{\beta_0}+\hat{\beta_1}X_i$ to obtain $$ Y_i-\hat {Y_i} = (\varepsilon_i-\bar\varepsilon) - (\hat{\beta_1}-\beta_1)(X_i-\bar X).\tag3 $$ Square both sides of (3) and sum over $i$. This yields [see (*) below] $$ \operatorname{SSE}:=\sum(Y_i-\hat {Y_i})^2=\sum(\varepsilon_i-\bar\varepsilon)^2 - (\hat{\beta_1}-\beta_1)^2\sum(X_i-\bar X)^2.\tag4 $$
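In more detail, (3) is obtained by substituting $\hat{\beta_0}=\bar Y-\hat{\beta_1}\bar X$ from (2) into $\hat Y_i$ and then applying (1):
$$
\begin{align}
\hat Y_i&=\hat{\beta_0}+\hat{\beta_1}X_i=\bar Y-\hat{\beta_1}\bar X+\hat{\beta_1}X_i=\bar Y+\hat{\beta_1}(X_i-\bar X),\\
Y_i-\hat Y_i&=(Y_i-\bar Y)-\hat{\beta_1}(X_i-\bar X)\stackrel{(1)}=(\varepsilon_i-\bar\varepsilon)-(\hat{\beta_1}-\beta_1)(X_i-\bar X).
\end{align}
$$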

Writing $\sum(\varepsilon_i-\bar\varepsilon)^2=\sum\varepsilon_i^2-n\bar\varepsilon^2$, divide (4) through by $\sigma^2$ and rearrange to the form

$$ \sum\left[\frac{\varepsilon_i}\sigma\right]^2=\frac{\operatorname{SSE}}{\sigma^2} + \left[\frac{\bar\varepsilon}{\sigma/\sqrt n}\right]^2 + \left[\frac{\hat{\beta_1}-\beta_1}{\sigma/\sqrt{\operatorname{SSX}}}\right]^2.\tag5 $$ Distribution theory: It is easy to check that each of the bracketed items in (5) has a standard normal distribution. What is not so obvious, and this is the step that requires matrix algebra to prove, is that the three terms on the RHS of (5) are mutually independent. Since the LHS of (5) is the sum of squares of $n$ independent standard normal variables, it follows that $\operatorname{SSE}/\sigma^2$ must have the distribution of the sum of squares of $n-2$ independent standard normal variables -- this is the chi-square($n-2$) distribution.
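For reference, the marginal distributions behind "easy to check" (with the $X_i$ treated as fixed): $\bar\varepsilon$ is the average of $n$ i.i.d. $N(0,\sigma^2)$ errors, and, by the calculation in (*) together with $\sum(X_i-\bar X)=0$, $\hat{\beta_1}-\beta_1=\sum(X_i-\bar X)\varepsilon_i/\operatorname{SSX}$ is a linear combination of the errors, so
$$
\bar\varepsilon\sim N\!\Bigl(0,\frac{\sigma^2}{n}\Bigr)
\qquad\text{and}\qquad
\hat{\beta_1}-\beta_1\sim N\!\Bigl(0,\frac{\sigma^2}{\operatorname{SSX}}\Bigr).
$$
Hence the second and third bracketed terms on the RHS of (5) are standard normal, as is each $\varepsilon_i/\sigma$ on the LHS.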


(*) What happened to the cross term? After squaring the RHS of (3) and summing over $i$ the cross term is $-2(\hat{\beta_1}-\beta_1)\sum(X_i-\bar X)(\varepsilon_i-\bar\varepsilon)$, which equals $-2(\hat{\beta_1}-\beta_1)^2\sum(X_i-\bar X)^2. $ This follows from the calculation $$ \begin{align} \hat{\beta_1}\sum(X_i-\bar X)^2\stackrel{(2)}=\sum(X_i-\bar X)(Y_i-\bar Y)&=\sum(X_i-\bar X)[\beta_1(X_i-\bar X)+(\varepsilon_i-\bar\varepsilon)]\\&=\beta_1\sum(X_i-\bar X)^2 +\sum(X_i-\bar X)(\varepsilon_i-\bar\varepsilon). \end{align}$$
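Since (4), and hence the cross-term cancellation above, is a purely algebraic identity in the data and the errors, it can be checked exactly (up to floating point) on any single simulated sample. A minimal sketch, with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values; identity (4) is algebraic, so any choice works.
n, beta0, beta1, sigma = 15, -0.5, 1.5, 2.0
X = rng.normal(0, 3, size=n)
eps = rng.normal(0, sigma, size=n)
Y = beta0 + beta1 * X + eps

ssx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / ssx
b0 = Y.mean() - b1 * X.mean()

sse = np.sum((Y - (b0 + b1 * X)) ** 2)                           # LHS of (4)
rhs = np.sum((eps - eps.mean()) ** 2) - (b1 - beta1) ** 2 * ssx  # RHS of (4)

print(sse, rhs, np.isclose(sse, rhs))  # agree to floating-point precision
```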