Solved – How to derive variance-covariance matrix of coefficients in linear regression

regression

I am reading a book on linear regression and have some trouble understanding the variance-covariance matrix of $\mathbf{b}$:

[Figure from the book: the variance-covariance matrix of $\mathbf{b}$, with the variances of $b_0$ and $b_1$ on the diagonal and the covariance $\sigma(b_0, b_1)$ on the off-diagonal.]

The diagonal entries are easy enough, but the off-diagonal ones are a bit more difficult. What puzzles me is that
$$
\sigma(b_0, b_1) = E(b_0 b_1) - E(b_0)E(b_1) = E(b_0 b_1) - \beta_0 \beta_1
$$

but there is no trace of $\beta_0$ and $\beta_1$ here.

Best Answer

This is actually a cool question that challenges your basic understanding of a regression.

First, let us clear up any initial confusion about notation. We are looking at the regression:

$$y=b_0+b_1x+\hat{u}$$

where $b_0$ and $b_1$ are the estimators of the true $\beta_0$ and $\beta_1$, and $\hat{u}$ are the residuals of the regression. Note that the underlying true and unobserved regression is thus denoted as:

$$y=\beta_0+\beta_1x+u$$

with expectation $E[u]=0$ and variance $E[u^2]=\sigma^2$. Some books denote $b$ as $\hat{\beta}$, and we adopt this convention here. We also make use of matrix notation, where $b$ is the $2\times 1$ vector that holds the estimators of $\beta=[\beta_0, \beta_1]'$, namely $b=[b_0, b_1]'$. (Also, for the sake of clarity, I treat $X$ as fixed in the following calculations.)
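As a concrete illustration of this notation (a minimal sketch with made-up data, not from the book): the design matrix $X$ has a column of ones for the intercept, and the estimator $b=(X'X)^{-1}X'y$ can be computed directly.

```python
import numpy as np

# Hypothetical example data (purely illustrative, not from the book)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix X: a column of ones (intercept) next to the regressor x
X = np.column_stack([np.ones_like(x), x])

# OLS estimator b = (X'X)^{-1} X'y, the 2x1 vector [b0, b1]'
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b0, b1 =", b)
```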

Now to your question. Your formula for the covariance is indeed correct, that is:

$$\sigma(b_0, b_1) = E(b_0 b_1) - E(b_0)E(b_1) = E(b_0 b_1) - \beta_0 \beta_1 $$

I think you want to know how the true unobserved coefficients $\beta_0, \beta_1$ end up in this formula. They actually cancel out once we expand the formula a step further. To see this, note that the population variance of the estimator is given by:

$$Var(\hat\beta)=\sigma^2(X'X)^{-1}$$

This matrix holds the variances in the diagonal elements and covariances in the off-diagonal elements.
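To see what this matrix looks like numerically (again a hypothetical sketch, assuming some value for $\sigma^2$ and a small fixed design), one can evaluate $\sigma^2(X'X)^{-1}$ directly and read off the variances and the covariance:

```python
import numpy as np

sigma2 = 1.5  # assumed value of the error variance (hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])  # fixed design with intercept column

# Variance-covariance matrix of b: sigma^2 (X'X)^{-1}
V = sigma2 * np.linalg.inv(X.T @ X)

print("Var(b0)     =", V[0, 0])  # top-left diagonal element
print("Var(b1)     =", V[1, 1])  # bottom-right diagonal element
print("Cov(b0, b1) =", V[0, 1])  # off-diagonal element (matrix is symmetric)
```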

To arrive at the above formula, let's generalize your claim using matrix notation. Let us therefore denote variance by $Var[\cdot]$ and expectation by $E[\cdot]$.

$$Var[b]=E[bb']-E[b]E[b']$$

This is just the general variance formula written in matrix notation; in what follows, "squaring" a vector expression is shorthand for the outer product with its own transpose, e.g. $b^2 := bb'$. The equation resolves once we substitute in the standard expression for the estimator, $b=(X'X)^{-1}X'y$. Also, assume $E[b]=\beta$, i.e. that $b$ is an unbiased estimator. Hence, we obtain:

$$E[((X'X)^{-1}X'y)^2] - \underset{2 \times 2}{\beta^2}$$

Note that on the right-hand side we have $\beta^2$, a $2\times 2$ matrix, namely $\beta\beta'$, but you may at this point already guess what will happen to this term shortly.

Replacing $y$ with our expression for the true underlying data generating process above, we have:

\begin{align*}
E\Big[\Big((X'X)^{-1}X'y\Big)^2\Big] - \beta^2 &= E\Big[\Big((X'X)^{-1}X'(X\beta+u)\Big)^2\Big]-\beta^2 \\
&= E\Big[\Big(\underbrace{(X'X)^{-1}X'X}_{=I}\beta+(X'X)^{-1}X'u\Big)^2\Big]-\beta^2 \\
&= E\Big[\Big(\beta+(X'X)^{-1}X'u\Big)^2\Big]-\beta^2 \\
&= \beta^2+E\Big[\Big((X'X)^{-1}X'u\Big)^2\Big]-\beta^2
\end{align*}

where the last equality holds because the cross terms have expectation zero, since $E[u]=0$ and $X$ is treated as fixed. Furthermore, the quadratic $\beta^2$ term cancels out, as anticipated.

Thus we have:

$$Var[b]=((X'X)^{-1}X')^2E[u^2]$$

by linearity of expectations. Note that by assumption $E[u^2]=\sigma^2$, and that $((X'X)^{-1}X')^2=(X'X)^{-1}X'X(X'X)'^{-1}=(X'X)^{-1}$, since $X'X$ is a $K\times K$ symmetric matrix and thus equal to its own transpose. Finally, we arrive at

$$Var[b]=\sigma^2(X'X)^{-1}$$

Now we have gotten rid of all the $\beta$ terms. Intuitively, the variance of the estimator is independent of the value of the true underlying coefficient, since that is not a random variable per se. The result holds for every individual element of the variance-covariance matrix shown in the book, and thus also for the off-diagonal elements, where $\beta_0\beta_1$ cancels out in the same way. The only issue was that you had applied the general formula for the variance, which does not reflect this cancellation at first sight.
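If you want to convince yourself numerically, here is a small Monte Carlo sketch (assuming a fixed design and i.i.d. normal errors with variance $\sigma^2$; all values are made up). The simulated covariance matrix of $b$ matches $\sigma^2(X'X)^{-1}$ and is unchanged when the true $\beta$ is swapped for another value:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.5
x = np.linspace(1.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])      # fixed design
V_theory = sigma2 * np.linalg.inv(X.T @ X)     # sigma^2 (X'X)^{-1}

def simulated_cov(beta, n_sim=20000):
    """Empirical covariance matrix of b over repeated samples drawn with true beta."""
    bs = np.empty((n_sim, 2))
    for i in range(n_sim):
        u = rng.normal(scale=np.sqrt(sigma2), size=x.size)
        y = X @ beta + u                           # true DGP: y = X beta + u
        bs[i] = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimate b
    return np.cov(bs, rowvar=False)

print("theory:\n", V_theory)
print("beta = (1, 2):\n", simulated_cov(np.array([1.0, 2.0])))
print("beta = (-5, 10):\n", simulated_cov(np.array([-5.0, 10.0])))
```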

Ultimately, the variance of the coefficients reduces to $\sigma^2(X'X)^{-1}$, independent of $\beta$. But what does this mean? (I believe you also asked for a more general understanding of the covariance matrix.)

Look at the formula in the book. It simply says that the variance of the estimator increases when the underlying error term is noisier ($\sigma^2$ increases), but decreases when the spread of $X$ increases. This is because having observations spread more widely around the true value lets you, in general, build an estimator that is more accurate and thus closer to the true $\beta$. On the other hand, the covariance terms on the off-diagonal become practically relevant when testing joint hypotheses such as $b_0=b_1=0$. Other than that, they are a bit of a fudge, really. Hope this clarifies all questions.
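To make the "spread of $X$" point concrete (an illustrative sketch with arbitrary numbers, same $\sigma^2$ in both cases), compare two designs: the more spread-out one yields a smaller $Var(b_1)$:

```python
import numpy as np

sigma2 = 1.0  # same error variance for both designs (hypothetical value)

def var_b1(x):
    """Var(b1) read off sigma^2 (X'X)^{-1} for a fixed design with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return (sigma2 * np.linalg.inv(X.T @ X))[1, 1]

x_narrow = np.array([4.0, 4.5, 5.0, 5.5, 6.0])  # regressor with little spread
x_wide   = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # same size, much more spread

print("Var(b1), narrow design:", var_b1(x_narrow))  # larger variance
print("Var(b1), wide design:  ", var_b1(x_wide))    # smaller variance
```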
