Say I have the true linear model with normal errors:
$y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$
However, I only observe $Z = X_1 + X_2$, so I estimate instead:
$y = \delta (X_1 + X_2) + e$,
can I express $\delta$ in terms of $\beta_1$ and $\beta_2$ (and perhaps $X_1$ and $X_2$)?
Empirically, when I generate data and estimate the model, $\delta$ seems to be the average of the coefficients weighted by the variances of the regressors, but I can't seem to derive this result.
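For concreteness, here is a minimal sketch of the kind of simulation I mean (the coefficients, variances, and seed are arbitrary choices):

```r
# Sketch: independent regressors with different variances (values are illustrative)
set.seed(1)
n  <- 1e5
x1 <- rnorm(n, sd = 1)                    # low-variance regressor
x2 <- rnorm(n, sd = 3)                    # high-variance regressor
y  <- 2 * x1 + 5 * x2 + rnorm(n)
z  <- x1 + x2

delta_hat <- coef(lm(y ~ 0 + z))          # short regression on the sum
weighted  <- (2 * var(x1) + 5 * var(x2)) / (var(x1) + var(x2))
c(delta_hat, weighted)                    # both roughly 4.7
```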
Incorporating hints:
Thanks for the hints, which are helpful. Here is an attempt — although I feel I haven't gotten to the bottom of it yet:
$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$ (thanks @whuber!)
$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,
$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore
$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.
Solving for $a$ and $b$ we get:
$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$
Effectively I am estimating $a$ while omitting $b$, so
$y = a(X_1 + X_2) + u,$
$u = b(X_1 - X_2) + \epsilon$, so my estimate of $\delta$ above will be the estimate of $a$ with omitted-variable bias. Calling $X_1 + X_2 = Z$, my estimate $\hat{\delta}$ is therefore:
$\hat{\delta} = (Z'Z)^{-1}(Z'[Za + (X_1 - X_2)b + \epsilon])$
Ignoring $\epsilon$ as it disappears in expectation, I get
$\hat{\delta} = a + (Z'Z)^{-1}(Z'(X_1 - X_2))b$
$\hat{\delta} = \frac{\beta_1 + \beta_2}{2} + (Z'Z)^{-1}Z'(X_1 - X_2) \frac{\beta_1 - \beta_2}{2}$
Does this look like it's on the right track? Can we simplify the trailing expression?
I was thinking that since the expression $(Z'Z)^{-1}Z'(X_1 - X_2)$ looks like a regression of $X_1 - X_2$ on $Z$, I could perhaps re-write it as $Var(X_1 + X_2)^{-1}Cov(X_1 + X_2, X_1 - X_2) = Var(X_1+X_2)^{-1}(Var(X_1) - Var(X_2))$?
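A quick numeric check of that guess (a sketch; the regressors are mean-zero by construction so that the no-intercept OLS expression and the variance/covariance version line up, and the other values are arbitrary):

```r
# Sketch: compare (Z'Z)^{-1} Z'(X1 - X2) with Var(X1 + X2)^{-1} (Var(X1) - Var(X2));
# regressors are mean-zero by construction (other values arbitrary)
set.seed(2)
n  <- 1e5
x1 <- rnorm(n, sd = 1)
x2 <- 0.4 * x1 + rnorm(n, sd = 2)            # correlation between regressors is allowed
z  <- x1 + x2
lambda_ols <- sum(z * (x1 - x2)) / sum(z^2)  # (Z'Z)^{-1} Z'(X1 - X2)
lambda_var <- (var(x1) - var(x2)) / var(z)   # variance/covariance version
c(lambda_ols, lambda_var)                    # approximately equal
```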
Best Answer
With a great deal of prodding (full credit to @whuber) I seem to have solved it:
We re-write $y = \beta_1X_1 + \beta_2X_2 + \epsilon$ as
$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$
$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,
$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore
$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.
Solving for $a$ and $b$ we get:
$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$
What I estimate is equivalent to
$y = \delta(X_1 + X_2) + u,$
$u = b(X_1 - X_2) + \epsilon$. We know that the $k$th coefficient of a multiple regression can be recovered by first partialling out the regressors other than $k$. Thus, if I let $z = X_1 + X_2$ and $q = X_1 - X_2$, the regression coefficient $b$ in:
$y = az + bq + \epsilon$ can be obtained with the steps
(1) $y = \delta z + u$,
(2) $q = \lambda z + e$
(3) $u = be + \epsilon$. Substituting expressions back in,
$y = \delta z + be + \epsilon$, substituting (3) into (1).
$y = \delta z + b(q - \lambda z) + \epsilon$, using (2) to write $e = q - \lambda z$. Therefore:
$y = (\delta - b \lambda)z + bq + \epsilon$. Now we know that $(\delta - b\lambda ) = a = (\beta_1 + \beta_2)/2$, so we solve for $\delta$ which yields:
$\delta = (\beta_1 + \beta_2)/2 + \lambda (\beta_1 - \beta_2)/2$
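As a connection back to the observation in the question: if $X_1$ and $X_2$ are uncorrelated (an extra assumption for this special case), then in large samples $\lambda = Cov(Z, X_1 - X_2)/Var(Z) = (Var(X_1) - Var(X_2))/(Var(X_1) + Var(X_2))$, and writing $V_j = Var(X_j)$,
$$ \delta = \frac{\beta_1 + \beta_2}{2} + \frac{V_1 - V_2}{V_1 + V_2}\cdot\frac{\beta_1 - \beta_2}{2} = \frac{\beta_1 V_1 + \beta_2 V_2}{V_1 + V_2}, $$
the variance-weighted average of the coefficients noted in the question.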
Here is some R code to check the solution, or to play around with:
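A sketch of such a check (the coefficients, variances, correlation, and seed below are arbitrary choices; the regressors are deliberately correlated to show that independence is not needed):

```r
# Sketch: verify delta_hat ~= (b1 + b2)/2 + lambda * (b1 - b2)/2
# (all parameter values below are arbitrary choices)
set.seed(42)
n  <- 1e5
x1 <- rnorm(n, sd = 1)
x2 <- 0.3 * x1 + rnorm(n, sd = 2)        # correlated regressors
b1 <- 2; b2 <- 5
y  <- b1 * x1 + b2 * x2 + rnorm(n)

z <- x1 + x2
q <- x1 - x2

delta_hat  <- coef(lm(y ~ 0 + z))        # short regression of y on the sum
lambda_hat <- coef(lm(q ~ 0 + z))        # regression of x1 - x2 on the sum
predicted  <- (b1 + b2) / 2 + lambda_hat * (b1 - b2) / 2
c(delta_hat, predicted)                  # agree up to sampling error
```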
Update for the case with $P$ predictors, September 2020
Returning to this much later, by request, to generalize to the case with $P$ predictors. The underlying principle is the same: we are going to (1) re-parameterize the true model so that the sum of the predictors enters as a single regressor, (2) solve for the new parameters in terms of the original $\beta$'s, and (3) treat the short regression as an omitted-variable problem via Frisch-Waugh-Lovell.
With $P$ predictors the true data generating process is:
$$y = \sum_{p=1}^{P} \beta_p x_p + \epsilon$$.
This data-generating process can be re-parameterized in a mathematically equivalent way as
$$ y = a (\sum_{p=1}^{P} x_p) + \sum_{p=2}^{P} b_p (x_1 - x_p) + \epsilon \\ y = (a + \sum_{p=2}^{P} b_p) x_1 + \sum_{p=2}^{P}(a - b_p)x_p + \epsilon $$
Now we can solve for these new parameters ($a$ and the $b_p$'s) in terms of the old parameters (the $\beta$'s).
$$ a + \sum_{p=2}^{P} b_p = \beta_1 \\ a - b_2 = \beta_2 \\ ... \\ a - b_P = \beta_P $$
Summing all $P$ equations, the $b_p$'s cancel and we get $Pa = \sum_{p=1}^{P} \beta_p$, therefore
$$ a = \frac{\sum_{p=1}^{P} \beta_p}{P} = \bar{\beta} \\ b_p = \bar{\beta} - \beta_p $$.
Now let's simplify life a little by setting $$ z = \sum_{p=1}^{P}x_p, \qquad Q = \begin{bmatrix} x_1 - x_2 & \cdots & x_1 - x_P \end{bmatrix}, $$ so we can write $$ y = \bar{\beta}z + QA' + \epsilon $$ where $Q$ is the $N \times (P-1)$ matrix of column differences and $A = (b_2, \dots, b_P)$ is a $(P-1)$-length row vector collecting the coefficients.
We now move to step (3), namely that we are not actually estimating the model above; we are estimating $$ y = \delta z + u. $$ We just repeat the process from the $P = 2$ case (the application of Frisch-Waugh-Lovell), but try not to lose track of dimensions:
First we partial $z$ out of each column of $Q$; with some abuse of notation, $$ q_p = \lambda_p z + e_p, $$ so $\Lambda = (\lambda_2, \dots, \lambda_P)$ is a row vector of length $(P-1)$ and $e$ is an $N \times (P-1)$ matrix of residuals.
Then we can re-write $u$ with $z$ partialled out: $$ u = eA' + \epsilon. $$ Plugging terms back in, $$ y = \delta z + eA' + \epsilon \\ y = \delta z + QA' - (\Lambda A') z + \epsilon \\ y = (\delta - \Lambda A') z + QA' + \epsilon \\ \delta = \bar{\beta} + \Lambda A' $$
In words: the coefficient in the short regression of $y$ on the sum of the predictors equals the average of the regression coefficients in the original model, plus a 'bias' term equal to the inner product of the omitted coefficients in the reparameterization and the coefficients from regressing the omitted variables on the included variable.
Since it's always risky to get your math from randos on the internet, here's a simulation to support the derivation:
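A sketch of such a simulation (the number of predictors, coefficients, variances, and seed are arbitrary choices):

```r
# Sketch: check delta = mean(beta) + Lambda A' with P predictors
# (P, beta, the variances, and the seed are arbitrary choices)
set.seed(7)
n <- 1e5
P <- 4
X <- matrix(rnorm(n * P), n, P) %*% diag(c(1, 2, 0.5, 3))   # unequal variances
beta <- c(1, -2, 4, 0.5)
y <- drop(X %*% beta) + rnorm(n)

z <- rowSums(X)                    # sum of the predictors
Q <- X[, 1] - X[, -1]              # N x (P-1) matrix with columns x1 - xp, p = 2..P
A <- mean(beta) - beta[-1]         # b_p = beta-bar - beta_p

coef(lm(y ~ 0 + z + Q))            # reparameterized fit: approx. mean(beta), then the b_p's

delta_hat <- coef(lm(y ~ 0 + z))   # short regression on the sum
Lambda    <- coef(lm(Q ~ 0 + z))   # partial z out of each column of Q
c(delta_hat, mean(beta) + sum(Lambda * A))   # agree up to sampling error
```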