Solved – How do we derive the OLS estimate of the variance

least squaresvariance

If we have a linear regression equation $y=X\beta + u$, then we can find the OLS estimate of $\beta$ by minimizing wrt $\hat \beta$: $E(\hat u)=E(y-X\hat\beta )$

However, my textbook suddenly says, out of nowhere, that the OLS estimate of the variance of $u$ (each $u_i$ is iid). $\sigma ^2$ is $\hat \sigma ^2 = \frac {\hat u^T \hat u}{n-K}$, where $n $ is the sample size and $K$ is the amount of independent variables.

I understand that this estimator is unbiased, but I have absolutely no idea how it is derived from the assumption of OLS, or why it is called the OLS estimate of $\sigma$.

How do we derive this estimator?

Best Answer

The estimator for the variance commonly used in regression does not come from the least squares principle, which only produces an estimate for $\boldsymbol{\beta}$. It is just a bias-corrected version (by the factor $\frac{n}{n-K})$ of the empirical variance

$$\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{x}_i^{T} \widehat{\boldsymbol{\beta}} \right)^2$$ which in turn is the maximum likelihood estimator for $\sigma^2$ under the assumption of a normal distribution. It's confusing that many people claim that that is the OLS estimator of the variance.

Related Solutions

Regression – Conditional Mean Independence: Unbiasedness and Consistency of OLS Estimator

It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.

Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|x,z)$ leads to omitted variables bias. As you probably recall, the bias term from omitted variables (when the omitted variable has a coefficient of 1) is controlled by the coefficients from the following auxiliary regression: \begin{align} E(u|z) = x\alpha_1 + z\alpha_2 + \nu \end{align} The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.

Here is an example to prove the point: \begin{align} \xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\ z &=\xi\\ x &= z^2 + \zeta\\ u &= z+z^2-E(z+z^2)+\nu \end{align}

Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$ Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.

Here is a very simple example in R which demonstrates the point:

set.seed(12344321)
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)
y <- x + z + u

summary(lm(y~x+z))

# auxiliary regression
summary(lm(z+z^2~x+z))

Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.

So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4): \begin{align} y = x\beta + z\gamma + E(u|z) + v \end{align}

It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.

Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS): \begin{align} y = x\beta + f(z) + v \end{align} That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.

Alternatively, if we had enough data, we could go ``all the way'' in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression: \begin{align} y = x\beta + v \end{align} This would give unbiased, consistent estimators for the $\beta$ except for the intercept, of course, which would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators from which you could make a great estimator by, say, averaging them all together somehow.

This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$ or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is ``close enough'' to being identical.

Solved – Is the derivations of OLS different if we do not have an intercept

That is not correct if there is no intercept. Without an intercept the OLS estimate would be:

$$ \beta = \frac{\operatorname{E}[XY]}{\operatorname{E}{[X^2]}}$$

In the case of a finite sample, your estimate would be:

$$ \hat{\beta} = \frac{\sum_{i=1}^n x_i y_i }{\sum_{i=1}^n x_i^2} $$

The general case (skip over this if you don't know matrix algebra yet)

If you know matrix algebra, all these are special cases. Minimizing the sum of squares can be written as:

$$ \text{minimize (over $\mathbf{b}$) } \quad\left( \mathbf{y} - X \mathbf{b}\right)'\left( \mathbf{y} - X \mathbf{b}\right) $$

Which has the solution:

$$ \hat{\mathbf{b}} = (X'X)^{-1} X'\mathbf{y}$$

The algebra behind that can be found (among numerous places) on my answer here.

Best Answer

Related Solutions

Regression – Conditional Mean Independence: Unbiasedness and Consistency of OLS Estimator

Solved – Is the derivations of OLS different if we do not have an intercept

The general case (skip over this if you don't know matrix algebra yet)

Related Question