Zero Covariance vs Independence of Slope and Intercept Estimators in Linear Models with Least Squares

Tags: covariance, independence, least-squares, mathematical-statistics, self-study

$\newcommand{\Cov}{\operatorname{Cov}}$Problem Statement: Under the assumptions of Exercise 11.16, find
$\Cov\big(\hat\beta_0,\hat\beta_1\big).$ Use this answer to show that
$\hat\beta_0$ and $\hat\beta_1$ are independent if $\displaystyle\sum_{i=1}^n
x_i=0.$
[Hint: $\Cov\big(\hat\beta_0,\hat\beta_1\big)=
\Cov\big(\overline{Y}-\hat\beta_1\overline{x},\hat\beta_1\big).$
Use Theorem
5.12 and the results of this section.]

Note: This is Problem 11.17 in Mathematical Statistics with Applications, 5th Ed., by Wackerly, Mendenhall, and Scheaffer.

My Work So Far: The assumptions of Exercise 11.16 are that $Y_1, Y_2,\dots,Y_n$ are independent normal random variables with $E(Y_i)=\beta_0+\beta_1 x_i$ and $V(Y_i)=\sigma^2.$ The first part of this question is largely done for us in the book.
That is, it is derived that
$$\Cov\big(\hat\beta_0,\hat\beta_1\big)
=-\frac{\overline{x}\,\sigma^2}{\displaystyle\sum_{i=1}^n(x_i-\overline{x})^2},$$

where $\operatorname{Var}(Y_i)=\sigma^2.$
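
For completeness, here is a sketch of the computation the hint suggests, using the standard facts that $\Cov\big(\overline{Y},\hat\beta_1\big)=0$ (because $\sum_{i=1}^n(x_i-\overline{x})=0$) and $V\big(\hat\beta_1\big)=\sigma^2\big/\sum_{i=1}^n(x_i-\overline{x})^2$:
$$\Cov\big(\hat\beta_0,\hat\beta_1\big)
=\Cov\big(\overline{Y}-\hat\beta_1\overline{x},\ \hat\beta_1\big)
=\Cov\big(\overline{Y},\hat\beta_1\big)-\overline{x}\,V\big(\hat\beta_1\big)
=-\frac{\overline{x}\,\sigma^2}{\displaystyle\sum_{i=1}^n(x_i-\overline{x})^2}.$$
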
Now $\overline{x}=0$ if and only if $\sum_{i=1}^n x_i=0$, so if the sum is zero, the covariance is zero. However, the fact that $\hat\beta_0$ and $\hat\beta_1$ are each normally distributed with zero covariance does not by itself make them independent; that would only follow if they were jointly (bivariate) normally distributed.

My Questions: Is what I'm being asked to show even true? That is, is there something about $\hat\beta_0$ and $\hat\beta_1$ being OLS estimators that makes this result hold? Or can I show that they are jointly (bivariate) normally distributed? Zero covariance does not imply independence in general; why should it do so in this situation?

Note 1: in silverfish's answer to this question, it is mentioned in the paragraph beginning with "These two uncertainties apply independently…" that these two uncertainties "…should be technically independent." But it is not proven there, though it is intuitively explained and I could believe it.

Note 2: In this thread, Alecos simply makes the argument that I think the book wants here, but doesn't say anything about why zero covariance implies independence.

Note 3: I have reviewed a few other threads related to this, but none of them answers the main question of why zero covariance should imply independence in this situation, when it doesn't in general.

Best Answer

$\newcommand{\one}{\mathbf 1}\newcommand{\e}{\varepsilon}$I would just go for a linear algebra approach since then we get joint normality easily. You have $y = X\beta + \e$ with $X = (\one \mid x)$ and $\e\sim\mathcal N(\mathbf 0, \sigma^2 I)$.

We know $$ \hat\beta = (X^TX)^{-1}X^Ty \sim \mathcal N(\beta, \sigma^2 (X^TX)^{-1}) $$ where $$ (X^TX)^{-1} = \begin{bmatrix} n & n \bar x \\ n \bar x & x^Tx\end{bmatrix}^{-1} = \frac{1}{x^Tx - n\bar x^2}\begin{bmatrix} x^Tx/n & - \bar x \\ - \bar x & 1\end{bmatrix}. $$ By assumption $X$ is full rank, which in this case means $x$ is not constant (since the only way to be low rank is for $x$ to be in the span of $\one$). This means $\det X^TX \neq 0$, so $\text{Cov}(\hat\beta_0, \hat\beta_1) = 0$ if and only if $\bar x = 0$. Since $(\hat\beta_0, \hat\beta_1)$ is bivariate normal, zero covariance here is equivalent to independence.
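
To make this concrete, here is a small simulation sketch of my own (not from the original answer; the variable names are hypothetical). It draws many response vectors from the model with a centered $x$, computes the OLS estimates, and checks that their empirical covariance matches $\sigma^2(X^TX)^{-1}$, whose off-diagonal entry vanishes precisely because $\bar x = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_sims = 25, 2.0, 200_000
beta = np.array([1.0, 3.0])           # true (beta_0, beta_1)

x = rng.normal(size=n)
x = x - x.mean()                      # center x so that x-bar = 0
X = np.column_stack([np.ones(n), x])  # design matrix (1 | x)

# Simulate y = X beta + eps and compute the OLS estimates for each replication.
eps = rng.normal(scale=sigma, size=(n_sims, n))
Y = X @ beta + eps                    # each row is one simulated response vector
XtX_inv = np.linalg.inv(X.T @ X)
betas = Y @ X @ XtX_inv               # row i holds (beta0_hat, beta1_hat) for y_i

print(np.cov(betas.T))                # empirical covariance of the estimates
print(sigma**2 * XtX_inv)             # theoretical covariance; off-diagonal is 0
```
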


Here's a different approach that avoids using the normal equations. We know $$ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \\ \hat\beta_1 = \frac{\text{Cov}(x,y)}{\text{Var}(x)} $$ and we want to show $\bar x = 0 \implies \hat\beta_0 \perp \hat\beta_1$, where I'm using "$\perp$" to denote independence.
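
To be explicit about the notation, $\text{Cov}(x,y)$ and $\text{Var}(x)$ above are the sample covariance and sample variance, so the slope estimator can be written as
$$ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{x^T(y - \bar y\,\one)}{x^Tx - n\bar x^2}, $$
which is the form used in the next display once $\bar x = 0$ and $x^Tx = 1$.
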

Without losing any generality I'll assume $x^Tx = 1$ (this preserves $\bar x = 0$). Then under the assumption of $\bar x = 0$ we have $$ \hat\beta_0 = \bar y = n^{-1}\one^Ty \\ \hat\beta_1 = x^Ty - \bar y x^T\one = x^Ty. $$

This means $$ {\hat\beta_0 \choose \hat\beta_1} = (n^{-1}\one \mid x)^Ty $$ so this is a linear transformation of a Gaussian and is in turn Gaussian, and the covariance matrix is proportional to $$ (n^{-1}\one \mid x)^T(n^{-1}\one \mid x) = \begin{bmatrix} n^{-1} & 0 \\ 0 & 1\end{bmatrix} $$ which gives us independence.
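
For reference, the general fact being used here (stated in my own notation, with $B := (n^{-1}\one \mid x)$) is that a fixed linear transformation of a Gaussian vector is again Gaussian: $$ y \sim \mathcal N(X\beta, \sigma^2 I) \quad\Longrightarrow\quad B^Ty \sim \mathcal N\big(B^TX\beta,\ \sigma^2 B^TB\big). $$ So joint normality of $(\hat\beta_0, \hat\beta_1)$ comes for free, and a diagonal $B^TB$ is exactly what upgrades zero covariance to independence.
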


This result can be generalized by noting that $\bar x = 0$ is equivalent to having an orthogonal design matrix in this case.

Suppose now we have an $n\times p$ full column rank covariate matrix $X$ which is partitioned as $X = (Z\mid W)$ where $Z$ has orthonormal columns and $W$ is unconstrained.

If all of the columns of $X$ are orthonormal, i.e. $X=Z$, the result is easy since $X^TX = I$, so $$ \hat\beta \sim \mathcal N(\beta, \sigma^2I). $$

I'll prove the following more interesting result: letting $\hat\beta_A$ denote the vector of coefficients for block $A$ of $X$, the elements of $\hat\beta_Z$ are conditionally independent given $\hat\beta_W$.

This can be shown by directly computing the covariance matrix of $\hat\beta_Z \mid \hat\beta_W$ and since $\hat\beta_Z\mid\hat\beta_W$ is still multivariate Gaussian, this gives us independence. I'll take $\sigma^2 = 1$ without losing any generality.
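
The computation uses the standard conditional-covariance formula for a multivariate Gaussian (stated here for reference): if $(u, v)$ is jointly Gaussian with covariance blocks $\Sigma_{uu}$, $\Sigma_{uv}$, $\Sigma_{vv}$, then $$ \text{Var}(u \mid v) = \Sigma_{uu} - \Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{vu}, $$ applied below with $u = \hat\beta_Z$ and $v = \hat\beta_W$.
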

I'll start with the full covariance matrix of $\hat\beta$, which is proportional to $(X^TX)^{-1}$. $X^TX$ is a $2\times 2$ block matrix so we can invert it as $$ (X^TX)^{-1} = \begin{bmatrix}I & Z^TW \\ W^TZ & W^TW\end{bmatrix}^{-1} = \begin{bmatrix} I + Z^TWA^{-1}W^TZ & -Z^TWA^{-1} \\ -A^{-1}W^TZ & A^{-1} \end{bmatrix} $$ where $A = W^TW - W^TZZ^TW = W^T(I-ZZ^T)W$ (the Schur complement of the top-left block) is the cross-product matrix of $W$ after projecting its columns onto the space orthogonal to the column space of $Z$.

It is not true in general that $I + Z^TWA^{-1}W^TZ = I$, so marginally we are not guaranteed independence in the $\hat\beta_Z$. But now if we condition $\hat\beta_Z$ on $\hat\beta_W$ we obtain $$ \text{Var}(\hat\beta_Z \mid \hat\beta_W) = I + Z^TWA^{-1}W^TZ - Z^TWA^{-1} \cdot A \cdot A^{-1}W^TZ = I $$ so we do indeed have conditional independence.

$\square$
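
As a final sanity check (my own addition, not part of the proof; variable names are hypothetical), here is a short NumPy sketch that builds a design with an orthonormal block $Z$ and an arbitrary block $W$ and verifies numerically that $\text{Var}(\hat\beta_Z \mid \hat\beta_W) = I$ when $\sigma^2 = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, r = 50, 3, 2                    # q columns in Z, r columns in W

# Z gets orthonormal columns via a QR factorization; W is unconstrained.
Z, _ = np.linalg.qr(rng.normal(size=(n, q)))
W = rng.normal(size=(n, r))
X = np.hstack([Z, W])

# With sigma^2 = 1, Cov(beta_hat) = (X^T X)^{-1}; split it into blocks.
S = np.linalg.inv(X.T @ X)
S_zz, S_zw, S_ww = S[:q, :q], S[:q, q:], S[q:, q:]

# Gaussian conditioning: Var(beta_Z | beta_W) = S_zz - S_zw S_ww^{-1} S_zw^T.
cond = S_zz - S_zw @ np.linalg.inv(S_ww) @ S_zw.T
print(np.round(cond, 10))             # prints the q x q identity (up to rounding)
```
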