Solved – How to prove whether or not the OLS estimator $\hat{\beta_1}$ will be a biased estimator of $\beta_1$

least-squares, multiple-regression, self-study, unbiased-estimator

I scanned through several posts on similar topics, but only found intuitive explanations (no proof-based ones).

Let's say I have two models, the first of which represents the true data, $y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \varepsilon$, where $X_1$ and $X_2$ are fixed regressors, and the second of which represents the reduced version, $y = \beta_0 + \beta_1X_1 + \varepsilon$. The second model gives us $\hat{\beta_1}$. Will $\hat{\beta_1}$ be a biased estimator for $\beta_1$?

My first instinct is that it will be biased only if $X_2$ is a genuine predictor, that is, if $\beta_2 \ne 0$ and $X_2$ is correlated with $X_1$.

I found mixed ways of going about this, but this is the best I came up with.

$\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})\,y_i}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$.

$E(\hat{\beta_1}) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})\,E(y_i)}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$

$= \beta_0\frac{\sum_{i=1}^{n}(x_i-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} + \beta_1\frac{\sum_{i=1}^{n}(x_i-\bar{x})\,x_i}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$.

Does this sufficiently prove that it is unbiased for $\beta_1$?

Best Answer

We need to take some care with the notation because the models differ.

Let the first (correct) model be

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\tag{1}$$

where the $\varepsilon_i$ have a common variance and zero means; and write the second model (which governs the very same variables $Y$, so no need to change their name) as

$$Y = \alpha_0 + \alpha_1 X_1 + \delta.\tag{2}$$

As an aside, we may impose no additional assumptions on $\delta$ because these random variables are completely determined by equating the two right hand sides (which, after all, equal the same things):

$$\delta = (\beta_0 - \alpha_0) + (\beta_1 - \alpha_1)X_1 + \beta_2 X_2 + \varepsilon.$$

(From now on I will drop the generic discussion of models to focus on a dataset with explanatory values $x_{1i}$ and $x_{2i},$ responses $y_i,$ and associated errors $\varepsilon_i$ and $\delta_i.$)

We can infer, however, that the $\delta_i$ all have the same variance as the $\varepsilon_i$ and that their means are

$$E[\delta_i] = (\beta_0 - \alpha_0) + (\beta_1 - \alpha_1)x_{1i} + \beta_2 x_{2i},$$

which may vary among observations.

Let's return to the analysis. Fitting the second model gives the slope estimate

$$\hat\alpha_1 = \frac{\sum_{i} (y_i - \bar y)(x_{1i} - \bar{x}_1)}{\sum_{i} (x_{1i} - \bar{x}_1)^2}.\tag{*}$$
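As a quick sanity check on $(*)$, here is a minimal NumPy sketch (the variable names and example data are illustrative, not from the post) that computes the simple-regression slope both directly from the formula and via a degree-1 least-squares fit:

```python
import numpy as np

# Illustrative data (hypothetical values, only for checking the formula)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Slope estimate from (*): cross-deviations over squared deviations of x1
alpha1_hat = np.sum((y - y.mean()) * (x1 - x1.mean())) / np.sum((x1 - x1.mean()) ** 2)

# The same quantity from a degree-1 least-squares polynomial fit
alpha1_polyfit = np.polyfit(x1, y, deg=1)[0]

print(alpha1_hat, alpha1_polyfit)  # the two values agree
```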

The estimator $(*)$ is a linear combination of the $y_i-\bar y,$ so use the zero-mean assumption about the $\varepsilon_i$ to compute

$$E[y_i - \bar y] = (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) -(\beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2) = \beta_1(x_{1i}-\bar{x}_1) + \beta_2(x_{2i} - \bar{x}_2)$$

and apply linearity of expectation in $(*)$ to compute

$$E[\hat\alpha_1] = \beta_1 + \beta_2\frac{\sum_{i} (x_{2i}-\bar{x}_2)(x_{1i} - \bar{x}_1)}{\sum_{i} (x_{1i} - \bar{x}_1)^2}.$$

Equating this with $\beta_1$ to assess the bias in using $\hat\alpha_1$ to estimate $\beta_1,$ we find it will be unbiased if and only if the second term is zero. This can happen in two ways:

  1. If $\beta_2 = 0.$ (This just means the second model is correct.)

  2. If $\sum_{i} (x_{2i}-\bar{x}_2)(x_{1i} - \bar{x}_1)=0.$ This means the covariance of the $x_1$ data and the $x_2$ data is zero: that is, the centered design vectors are orthogonal.

If neither of these is the case, the bias is nonzero. That agrees exactly with your intuition.
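To see the result numerically, here is a small simulation sketch; all coefficients and design values are made up for illustration. It holds the regressors fixed, generates many responses from model $(1)$, fits the misspecified model $(2)$, and compares the average fitted slope with $\beta_1$ plus the bias term derived above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (non-random) regressors; x2 is deliberately correlated with x1
n = 50
x1 = np.linspace(0.0, 1.0, n)
x2 = 0.5 * x1 + rng.normal(scale=0.1, size=n)   # generated once, then held fixed

beta0, beta1, beta2 = 1.0, 2.0, 3.0             # hypothetical true coefficients

# Theoretical bias term:
# beta2 * sum((x2 - x2bar)(x1 - x1bar)) / sum((x1 - x1bar)^2)
bias_term = beta2 * np.sum((x2 - x2.mean()) * (x1 - x1.mean())) / np.sum((x1 - x1.mean()) ** 2)
expected_slope = beta1 + bias_term

# Monte Carlo: repeatedly draw responses from model (1) and fit model (2)
slopes = []
for _ in range(5000):
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)  # true model (1)
    slopes.append(np.polyfit(x1, y, deg=1)[0])                # reduced fit (2)

print("average fitted slope:", np.mean(slopes))
print("beta1 + bias term:   ", expected_slope)
```

With these (arbitrary) settings the two printed numbers match closely; setting `beta2 = 0`, or replacing `x2` with a vector whose centered values are orthogonal to the centered `x1`, makes the bias term vanish and the average slope return to $\beta_1$, in line with conditions 1 and 2 above.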
