[Math] Derivation of standard error of regression estimate with degrees of freedom

regression analysis, standard deviation, statistics

I am taking a course in Econometrics.

I need help understanding how we arrive at the formula for the standard error of regression $$\hat{\sigma}^2=\frac{\sum{e_i^2}}{n-k}.$$
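For concreteness, here is a minimal numpy sketch of the quantity I mean (purely simulated data; every name and number below is illustrative, and $k$ counts every estimated coefficient, intercept included):

```python
import numpy as np

# Minimal sketch of sigma_hat^2 = sum(e_i^2) / (n - k) on simulated data.
# Here k counts all estimated coefficients, including the intercept.
rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 1.5
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
e = y - X @ beta_hat                              # residuals e_i
sigma2_hat = (e @ e) / (n - k)                    # the estimator in question
print(sigma2_hat)                                 # should be near sigma_true**2 = 2.25
```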

I understand Bessel's correction, which removes the bias inherent in the sample variance; a proof is available at \href{https://en.wikipedia.org/wiki/Bessel's_correction#Proof_of_correctness_.E2.80.93_Alternate_2}{Bessel's Correction, Proof of Correctness}.

I also found \href{https://stats.stackexchange.com/questions/68766/standard-deviation-of-error-in-simple-linear-regression}{Standard deviation of error in simple linear regression} and \href{https://stats.stackexchange.com/questions/85943/how-to-derive-the-standard-error-of-linear-regression-coefficient}{How to derive the standard error of linear regression coefficient}.

But I could not find the proof for the above expression (standard error of regression estimate).

I tried to expand the expression along the lines of the Bessel's correction proof.

$$\sum e_i^2=\text{Total SS}- \text{Explained SS}$$

Then I tried to expand the explained sum of squares term, but I got stuck at

$$ \sum _{i=1}^n \operatorname {E} \left((\beta\mathbf{ X}-\bar{y} )^2 \right) = \beta^2 E(x^2)-2\beta\bar{xy}+E(\bar{y}^2)$$

I don't know how to proceed. Can anyone please help?

Then I read this:

The term "standard error" is more often used in the context of a regression model, and you can find it as "the standard error of regression". It is the square root of the sum of squared residuals from the regression – divided sometimes by sample size n (and then it is the maximum likelihood estimator of the standard deviation of the error term), or by $n−k$ ($k$ being the number of regressors), and then it is the ordinary least squares (OLS) estimator of the standard deviation of the error term.

on \href{https://stats.stackexchange.com/questions/73390/standard-error-vs-standard-deviation-of-sample-mean}{Standard Error vs. Standard Deviation of Sample Mean}

Can anyone suggest a textbook where I can read about these derivations in more detail?

Best Answer

Here's one way. This will work only if you understand matrix algebra and the geometry of $n$-dimensional Euclidean space.

The model says $y_i = \alpha_0 + \sum_{\ell=1}^k \alpha_\ell x_{\ell i} + \varepsilon_i, \quad i=1,\ldots,n $ where

  • $y_i$ and $x_{\ell i}$ are observed;
  • The $\alpha$s are not observed and are to be estimated by least squares;
  • The $\alpha$s are not random, i.e. if a new sample with all new $x$s and $y$s is taken, the $\alpha$s will not change;
  • The $x$s are in effect treated as not random. This is justified by saying we're interested in the conditional distribution of the $y$s given the $x$s. The $y$s are random only because the $\varepsilon$s are;
  • The $\varepsilon$s are not observed. They have expected value $0$ and variance $\sigma^2$ and are uncorrelated. These assumptions are weaker than assuming normality and independence.

The $n\times(k+1)$ "design matrix" is $$ X= \begin{bmatrix} 1 & x_{11} & \cdots & x_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{bmatrix} $$ with linearly independent columns and typically $n\gg k$.
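As a small illustration (hypothetical data; `x_raw` is simply a stand-in for the observed regressors), the design matrix is just the raw regressors with a column of ones prepended:

```python
import numpy as np

# Hypothetical raw regressors x_{1i}, ..., x_{ki} stored as the columns of x_raw.
rng = np.random.default_rng(1)
n, k = 100, 2
x_raw = rng.normal(size=(n, k))

# Prepend the column of ones to obtain the n x (k+1) design matrix X.
X = np.column_stack([np.ones(n), x_raw])
print(X.shape)                    # (100, 3), i.e. n x (k+1)
print(np.linalg.matrix_rank(X))   # k+1 = 3 when the columns are linearly independent
```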

The $(k+1)\times 1$ vector of coefficients to be estimated is $$ \alpha= \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}. $$ The model can then be written as $Y= X\alpha+\varepsilon$, where $Y, \varepsilon \in\mathbb R^{n\times 1}$. Then $Y$ has expected value $X\alpha\in\mathbb R^{n\times 1}$ and variance $\sigma^2 I_n\in\mathbb R^{n\times n}$.

The "hat matrix" is $H = X(X^T X)^{-1} X^T$, an $n\times n$ matrix of rank $k+1$. The vector $\widehat Y = HY$ is the orthogonal projection of $Y$ onto the column space of $X$. It is also $\widehat Y=HY = X\widehat\alpha$, where $\widehat\alpha$ is the vector of least-squares estimates of the components of $\alpha$.

The residuals are $\widehat\varepsilon_i = e_i = Y_i-\widehat Y_i = Y_i-(\widehat\alpha_0 + \sum_{\ell=1}^k \widehat\alpha_\ell x_{\ell i})$. These are observable estimates of the unobservable errors. The vector of residuals is $$ \widehat\varepsilon = e = (I-H)Y. $$ This has expected value $(I-H)\operatorname{E}(Y) = (I-H)X\alpha = 0$.
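The two facts used here, that $(I-H)$ annihilates the column space of $X$ (so the residuals have expectation $0$) and that the residual vector is orthogonal to that column space, can also be checked numerically (again on made-up data):

```python
import numpy as np

# Residual vector e = (I - H) Y, with (I - H) X alpha = 0, hence E(e) = 0.
rng = np.random.default_rng(3)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
alpha = np.array([1.0, -2.0, 0.5])                 # illustrative true coefficients
Y = X @ alpha + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ Y                            # residuals

print(np.allclose((np.eye(n) - H) @ (X @ alpha), 0))   # (I - H) kills col(X)
print(np.allclose(X.T @ e, 0))                          # residuals orthogonal to col(X)
```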

We seek \begin{align} & \operatorname{E}(\|\widehat\varepsilon\|^2) = \operatorname{E}(\|e\|^2) \\[10pt] = {} & \operatorname{E} ( \Big((I-H)Y\Big)^T \Big((I-H)Y\Big)) \\[10pt] = {} & \operatorname{E} (Y^T (I-H) Y) \qquad \text{since } (I-H)^T = I-H = (I-H)^2. \text{ (Check that.)} \end{align} We've projected $Y$ onto the $(n-(k+1))$-dimensional column space of $I-H$. The expected value of the projection is $0$.

I claim the variance of the projection is just $\sigma^2$ times the identity operator on that $(n-(k+1))$-dimensional space. The reason for that is that $I-H$ is itself the identity operator on that $(n-(k+1))$-dimensional space, which is the orthogonal complement of the column space of $X$.

So it's as if we have a random vector $w$ in $(n-(k+1))$-dimensional space with expected value $0$ and variance $\sigma^2 I_{(n-(k+1))\times(n-(k+1))}$, and we're asking what $\operatorname{E}(\|w\|^2)$ is. Since $\|w\|^2$ is the sum of the squares of $n-(k+1)$ components, each with mean $0$ and variance $\sigma^2$, that is $\sigma^2(n-(k+1))$.
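For readers who prefer algebra to geometry, the same number falls out of the standard trace identity for quadratic forms (this check is not part of the argument above):

\begin{align}
\operatorname{E}\left(Y^T (I-H) Y\right) &= \operatorname{tr}\!\big((I-H)\operatorname{Var}(Y)\big) + \operatorname{E}(Y)^T (I-H)\operatorname{E}(Y) \\[6pt]
&= \sigma^2 \operatorname{tr}(I-H) + (X\alpha)^T (I-H)(X\alpha) \\[6pt]
&= \sigma^2\big(n-(k+1)\big) + 0,
\end{align}

since $\operatorname{tr}(H)=\operatorname{rank}(H)=k+1$ for a projection matrix and $(I-H)X\alpha=0$.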

Hence the expected value of the sum of squares of residuals (which is the "unexplained" sum of squares) is $\sigma^2(n-(k+1))$, so dividing $\sum e_i^2$ by $n-(k+1)$ gives an unbiased estimator of $\sigma^2$. This matches the formula in the question once $k$ there is read as the total number of estimated coefficients, intercept included.
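A Monte Carlo sketch of this conclusion (hypothetical fixed design, repeated draws of $\varepsilon$; nothing here comes from the original question):

```python
import numpy as np

# Check E(sum of squared residuals) = sigma^2 * (n - (k+1)) by simulation.
rng = np.random.default_rng(4)
n, k, sigma = 50, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # fixed design matrix
alpha = np.array([1.0, 0.5, -1.0])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T             # I - H

sse = []
for _ in range(20_000):
    Y = X @ alpha + rng.normal(scale=sigma, size=n)          # new errors, same X
    e = M @ Y                                                # residuals
    sse.append(e @ e)

print(np.mean(sse))               # close to sigma^2 * (n - (k+1)) = 4 * 47 = 188
print(sigma**2 * (n - (k + 1)))   # 188.0
```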
