$E[\Sigma(y_i-\bar{y})^2]=(n-1)\sigma^2 +\beta_1^2\Sigma(x_i-\bar{x})^2$ proof

regression, statistics

I am having trouble proving the identity below.

$E[\Sigma(y_i-\bar{y})^2]=(n-1)\sigma^2 +\beta_1^2\Sigma(x_i-\bar{x})^2$

where the assumptions are

$Cov[y_i,y_j]=0$ for $i \ne j$

$E[y_i]=\beta_0+\beta_1x_i, Var[y_i]=\sigma^2$

$\hat\beta_0$ and $\hat\beta_1$ are the least squares estimates of $\beta_0$ and $\beta_1$

So far I understand that
$$E[\hat\beta_1^2]=\beta_1^2 +\frac{\sigma^2}{\Sigma(x_i-\bar x)^2}$$

but I seem to be having real trouble understanding the relationship between $x$ and $y$ 🙁

I am thinking that

$$E\left[n \frac{1}{n}\Sigma(y_i-\bar{y})^2 \right]=n E[Var[Y_i]]= n\sigma^2$$
which looks nothing like the expression…

May I get some help, please?

Best Answer

What follows is a complete explanation, in more detail than may strictly be needed, for the sake of a full understanding.

In the linear regression model $$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$ the only random variable on the right-hand side of this equation is $$\epsilon_i \sim \operatorname{Normal}(0, \sigma^2).$$ Everything else is either a parameter ($\beta_0$, $\beta_1$) or a covariate ($x_i$). The left-hand side $y_i$ is therefore a random variable, whose randomness comes entirely from the error term. As the errors are independent, so are the responses.
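For concreteness, here is a minimal simulation sketch of this setup (assuming NumPy; the parameter values, sample size, and covariate grid are arbitrary illustrations, not taken from the question): the $x_i$ are held fixed, and the $y_i$ inherit all of their randomness from the $\epsilon_i$.

```python
# A minimal sketch of the model y_i = beta0 + beta1 * x_i + eps_i,
# with fixed covariates and illustrative (assumed) parameter values.
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 1.0, 2.0, 0.5         # assumed "true" parameters
x = np.linspace(0.0, 1.0, 20)               # fixed, non-random covariates

eps = rng.normal(0.0, sigma, size=x.shape)  # the only source of randomness
y = beta0 + beta1 * x + eps                 # the responses inherit it
```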

Note there is no parameter estimation mentioned here. $\beta_0$ and $\beta_1$ represent the true parameters for the model, in the sense that if you were to make numerous observations of the response for a given value of $x_i$, you would find that these would be normally distributed with mean $\mu_i = \beta_0 + \beta_1 x_i$ and variance $\sigma^2$.

The best way to understand the random variable $$\sum_{i=1}^n (y_i - \bar y)^2$$ is to first ask: if a single $y_i$ has expectation $\mu_i$, then what is the expectation of the sample mean $\bar y$? This follows easily from the linearity of expectation: $$\operatorname{E}[\bar y] = \frac{1}{n} \sum_{i=1}^n \operatorname{E}[y_i] = \frac{1}{n} \sum_{i=1}^n (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1 \bar x,$$ which is simply the expected response at the mean value of the covariate. Let's call this value $\mu$.

Now it is easy to partition the sum of squares about $\mu$, using the fact that $\sum_{i=1}^n (y_i - \mu) = n(\bar y - \mu)$: $$\begin{align*} \sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n (y_i - \mu + \mu - \bar y)^2 \\ &= \sum_{i=1}^n \left( (y_i - \mu)^2 + 2(y_i - \mu)(\mu - \bar y) + (\mu - \bar y)^2 \right) \\ &= \sum_{i=1}^n (y_i - \mu)^2 + 2 (\mu - \bar y) \sum_{i=1}^n (y_i - \mu) + n(\mu - \bar y)^2 \\ &= \sum_{i=1}^n (y_i - \mu)^2 + 2 (\mu - \bar y)(n \bar y - n \mu) + n(\mu - \bar y)^2 \\ &= \sum_{i=1}^n (y_i - \mu)^2 - n(\mu - \bar y)^2. \end{align*}$$ Because $\operatorname{E}[\bar y] = \mu$, the expected value $\operatorname{E}[(\bar y - \mu)^2] = \operatorname{Var}[\bar y]$, and by the independence of the responses, $$\operatorname{Var}[\bar y] \overset{\text{ind}}{=} \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}[y_i] = \frac{\sigma^2}{n}.$$ So all that remains is to compute the expectation of the first term. But $$\operatorname{E}\left[\sum_{i=1}^n (y_i - \mu)^2\right] = \sum_{i=1}^n \operatorname{E}[(y_i - \mu)^2],$$ and since $$\begin{align*}\operatorname{E}[(y_i - \mu)^2] &= \operatorname{E}[(y_i - \mu_i + \mu_i - \mu)^2] \\ &= \operatorname{E}[(y_i - \mu_i)^2] + 2(\mu_i - \mu) \operatorname{E}[y_i - \mu_i] + (\mu_i - \mu)^2 \\ &= \operatorname{Var}[y_i] + (\mu_i - \mu)^2 \\ &= \sigma^2 + (\mu_i - \mu)^2, \end{align*}$$ where the cross term vanishes because $\operatorname{E}[y_i - \mu_i] = 0$, we obtain after putting everything together $$\begin{align*} \operatorname{E}\left[\sum_{i=1}^n (y_i - \bar y)^2\right] &= n \sigma^2 + \sum_{i=1}^n (\mu_i - \mu)^2 - \sigma^2 \\ &= (n-1)\sigma^2 + \sum_{i=1}^n (\beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 \bar x))^2 \\ &= (n-1)\sigma^2 + \beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 \end{align*}$$ as claimed.
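If it helps to see the identity numerically, here is a small Monte Carlo check (again assuming NumPy and the same arbitrary, illustrative parameter values; this is a sanity-check sketch, not part of the proof): simulate many datasets from the model and compare the average of $\sum_i (y_i - \bar y)^2$ with $(n-1)\sigma^2 + \beta_1^2 \sum_i (x_i - \bar x)^2$.

```python
# Monte Carlo sanity check of
#   E[sum (y_i - ybar)^2] = (n - 1) sigma^2 + beta1^2 * sum (x_i - xbar)^2,
# using assumed, illustrative parameter values.
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 1.0, 2.0, 0.5
n, n_reps = 20, 100_000
x = np.linspace(0.0, 1.0, n)               # fixed covariates
mu_i = beta0 + beta1 * x                   # E[y_i] for each i

# Simulate n_reps independent datasets at once (one dataset per row).
y = mu_i + rng.normal(0.0, sigma, size=(n_reps, n))
ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

lhs = ss.mean()                            # Monte Carlo estimate of the expectation
rhs = (n - 1) * sigma**2 + beta1**2 * ((x - x.mean()) ** 2).sum()
print(lhs, rhs)                            # should agree up to simulation error
```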

Throughout this discussion, the only random variables have been the $\epsilon_i$ and functions of them, such as $y_i$ and $\bar y$. The quantities $\mu_i$ and $\mu$ are not random, being functions of the parameters and the covariates. It is important to keep this in mind.