Solved – regression coefficient on sum of regressors

Tags: regression, regression-coefficients

Say I have the true linear model with normal errors:

$y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$

However, I only observe $Z = X_1 + X_2$, so I estimate instead:

$y = \delta (X_1 + X_2) + e$,

can I express $\delta$ in terms of $\beta_1$ and $\beta_2$ (and perhaps $X_1$ and $X_2$)?

Empirically, when I generate data and estimate this model, $\delta$ appears to be the average of the coefficients weighted by the variances of the regressors, but I can't seem to derive that result.
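
For concreteness, here is a minimal sketch of the kind of simulation I mean (arbitrary parameter values, names of my own choosing); with independent mean-zero regressors the fitted $\delta$ lands near the variance-weighted average:

# Minimal sketch: the short-regression coefficient is close to the
# variance-weighted average of the true betas when x1 and x2 are independent
set.seed(1)
N <- 1e5
x1 <- rnorm(N, 0, 1)              # Var(X1) = 1
x2 <- rnorm(N, 0, 3)              # Var(X2) = 9
y <- 2*x1 + 5*x2 + rnorm(N)       # beta1 = 2, beta2 = 5
coef(lm(y ~ I(x1 + x2) - 1))
# variance-weighted average: (2*1 + 5*9)/(1 + 9) = 4.7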

Incorporating hints:

Thanks for the hints, which are helpful. Here is an attempt — although I feel I haven't gotten to the bottom of it yet:

$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$ (thanks @whuber!)

$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,

$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore

$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.

Solving for $a$ and $b$ we get:

$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$

Effectively I am estimating $a$ while omitting the $b(X_1 - X_2)$ term, so

$y = a(X_1 + X_2) + u,$

$u = b(X_1 - X_2) + \epsilon$, therefore my estimate of $\delta$ above will be the estimate of $a$ with omitted-variable bias. Writing $Z = X_1 + X_2$, my estimate $\hat{\delta}$ is therefore:

$\hat{\delta} = (Z'Z)^{-1}(Z'[Za + (X_1 - X_2)b + \epsilon])$

Ignoring $\epsilon$ as it disappears in expectation, I get

$\hat{\delta} = a + (Z'Z)^{-1}(Z'(X_1 - X_2))b$

$\hat{\delta} = \frac{\beta_1 + \beta_2}{2} + (Z'Z)^{-1}Z'[X_1 - X_2] \frac{\beta_1 - \beta_2}{2}$

Does this look like it's on the right track? Can the trailing term be simplified?

I was thinking that since the expression $(Z'Z)^{-1}Z'[X_1 - X_2]$ looks like a regression of $X_1 - X_2$ on $Z$, I could perhaps re-write it as $Var(X_1 + X_2)^{-1}Cov(X_1 + X_2, X_1 - X_2) = Var(X_1+X_2)^{-1}(Var(X_1) - Var(X_2))$?
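
That covariance identity is easy to check numerically, since the $Cov(X_1, X_2)$ cross terms cancel by bilinearity even when the regressors are correlated. A quick sketch with made-up data (with mean-zero regressors, the no-intercept regression coefficient matches this covariance ratio):

# Cov(X1 + X2, X1 - X2) = Var(X1) - Var(X2): the cross terms cancel
set.seed(2)
x1 <- rnorm(1000, 0, 1)
x2 <- 0.5*x1 + rnorm(1000, 0, 2)  # deliberately correlated
all.equal(cov(x1 + x2, x1 - x2), var(x1) - var(x2))
#> [1] TRUE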

Best Answer

With a great deal of prodding (full credit to @whuber) I seem to have solved it:

We re-write $y = \beta_1X_1 + \beta_2X_2 + \epsilon$ as

$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$

$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,

$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore

$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.

Solving for $a$ and $b$ we get:

$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$

What I estimate is equivalent to

$y = \delta(X_1 + X_2) + u,$

$u = b(X_1 - X_2) + \epsilon$. We know that the $k$th coefficient of a multiple regression can be recovered by first partialling the other regressors out of both the response and regressor $k$. Thus, if I let $z = X_1 + X_2$ and $q = X_1 - X_2$, the regression coefficient $b$ in:

$y = az + bq + \epsilon$ can be obtained with the steps

(1) $y = \delta z + u$,

(2) $q = \lambda z + e$

(3) $u = be + \epsilon$. Substituting expressions back in,

$y = \delta z + be + \epsilon$, substituting (3) into (1).

$y = \delta z + b(q - \lambda z) + \epsilon$, using (2) to replace $e$. Therefore:

$y = (\delta - b \lambda)z + bq + \epsilon$. Now we know that $(\delta - b\lambda ) = a = (\beta_1 + \beta_2)/2$, so we solve for $\delta$ which yields:

$\delta = (\beta_1 + \beta_2)/2 + \lambda (\beta_1 - \beta_2)/2$
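
The partialling-out claim used in steps (1)-(3) is the Frisch-Waugh-Lovell theorem, and it can be verified in a few lines (a sketch with made-up data and variable names):

# FWL: the coefficient on q in the full regression equals the coefficient
# from regressing the z-residuals of y on the z-residuals of q
set.seed(3)
z <- rnorm(200)
q <- 0.3*z + rnorm(200)
y <- 1.5*z - 2*q + rnorm(200)
b_full <- coef(lm(y ~ z + q - 1))["q"]
e_q <- resid(lm(q ~ z - 1))   # partial z out of q
e_y <- resid(lm(y ~ z - 1))   # partial z out of y
b_fwl <- coef(lm(e_y ~ e_q - 1))
all.equal(unname(b_full), unname(b_fwl))
#> [1] TRUE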

Here is some R code to check the solution, or to play around with:

rep.func <- function(N = 100, b1 = 2, b2 = 10, s1 = 1, s2 = 5) {
  # simulate the true model y = b1*x1 + b2*x2 + eps
  x1 <- rnorm(N, 0, s1)
  x2 <- rnorm(N, 0, s2)
  eps <- rnorm(N)
  y <- x1*b1 + x2*b2 + eps
  z <- x1 + x2
  q <- x1 - x2
  # lambda: coefficient from partialling z out of q
  lambda <- lm(q ~ z - 1)$coefficients
  # delta from the short regression of y on the sum
  est.delta <- lm(y ~ z - 1)$coefficients
  # betas from the full (infeasible) regression
  est.betas <- lm(y ~ x1 + x2 - 1)$coefficients
  # delta from the derived formula
  derived.delta <- sum(est.betas)/2 + lambda * (est.betas[1] - est.betas[2])/2
  c("est.delta" = est.delta, "derived.delta" = derived.delta)
}

rep.func()
#>     est.delta.z derived.delta.z 
#>         9.77236         9.77236

library(ggplot2)
res <- t(replicate(1000, rep.func()))

all.equal(res[,1], res[,2])
#> [1] TRUE

ggplot(as.data.frame(res), aes(est.delta.z, derived.delta.z)) + 
  geom_point() + geom_smooth(method='lm') + theme_minimal()

Update for the case with $P$ predictors, September 2020

Returning to this much later, by request, to generalize to the case of $P$ predictors. The underlying principle is the same: we are going to

  1. re-parameterize the true data-generating process so that it contains a term for the quantity we are interested in,
  2. determine the values of the re-parameterized coefficients in terms of the original coefficients, and
  3. evaluate the effect of omitted-variable bias on the estimation of the re-parameterized coefficient.

With $P$ predictors the true data generating process is:

$$y = \sum_{p=1}^{P} \beta_p x_p + \epsilon.$$

This true data-generating process can be re-parameterized in a mathematically equivalent way as

$$ y = a (\sum_{p=1}^{P} x_p) + \sum_{p=2}^{P} b_p (x_1 - x_p) + \epsilon \\ y = (a + \sum_{p=2}^{P} b_p) x_1 + \sum_{p=2}^{P}(a - b_p)x_p + \epsilon $$

Now we can solve for these new parameters ($a$ and the $b_p$'s) in terms of the old parameters (the $\beta$'s).

$$ a + \sum_{p=2}^{P} b_p = \beta_1 \\ a - b_2 = \beta_2 \\ ... \\ a - b_P = \beta_P $$

therefore

$$ a = \frac{\sum_{p=1}^{P} \beta_p}{P} = \bar{\beta} \\ b_p = \bar{\beta} - \beta_p. $$

Now let's simplify our lives a little by setting $$ z = \sum_{p=1}^{P}x_p, \qquad q_p = x_1 - x_p \quad (p = 2, \dots, P), $$ so we can write $$ y = \bar{\beta}z + QA' + \epsilon $$ where $Q$ is the $N \times (P-1)$ matrix whose columns are the differences $q_p$ and $A$ is a $(P-1)$-length row vector collecting the coefficients $b_p$.
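
Before moving on, the reparameterization itself can be sanity-checked: because $(z, Q)$ is an invertible linear transformation of the columns of $X$, regressing $y$ on $(z, Q)$ must reproduce $a = \bar{\beta}$ and $b_p = \bar{\beta} - \beta_p$ exactly in terms of the estimated $\hat{\beta}$'s. A quick sketch (arbitrary values):

# The reparameterized regression recovers a = mean(beta_hat) and
# b_p = mean(beta_hat) - beta_hat_p exactly, since (z, Q) spans the
# same column space as X
set.seed(4)
N <- 500; P <- 4
X <- matrix(rnorm(N*P), N, P)
y <- as.vector(X %*% c(1, -2, 3, 0.5) + rnorm(N))
z <- rowSums(X)
Q <- X[, 1] - X[, 2:P]             # columns x1 - xp for p = 2..P
beta_hat <- coef(lm(y ~ X - 1))
ab_hat <- coef(lm(y ~ z + Q - 1))  # (a, b_2, ..., b_P)
all.equal(unname(ab_hat),
          unname(c(mean(beta_hat), mean(beta_hat) - beta_hat[2:P])))
#> [1] TRUE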

We now move to step 3, namely that we are not actually estimating the model above; we are estimating $$ y = \delta z + u. $$ We just repeat the process above for $P = 2$ (the application of Frisch-Waugh-Lovell), but try not to lose track of dimensions:

First we partial $z$ out of each column of $Q$, with some abuse of notation, $$ q_p = \lambda_p z + e_p, $$ so $\Lambda$ (collecting the $\lambda_p$) is a row vector of length $(P-1)$, and $e$ is an $N \times (P-1)$ matrix of residuals.

Then we can re-write $u$ with $z$ removed: $$ u = eA' + \epsilon. $$ Plugging terms back in: $$ y = \delta z + eA' + \epsilon \\ y = \delta z + QA' - (\Lambda A') z + \epsilon \\ y = (\delta - \Lambda A') z + QA' + \epsilon \\ \delta = \bar{\beta} + \Lambda A' $$

In words: the coefficient in the short regression of $y$ on the sum of the predictors equals the average of the regression coefficients in the original model, plus a 'bias' term given by the inner product of the coefficients omitted in the reparameterization and the coefficients from a regression of the omitted variables on the included variable.

Since it's always risky to get your math from randos on the internet, here's a simulation to support the derivation:

# Load mvtnorm because things are more interesting
# when the predictors are correlated
library(mvtnorm)
library(ggplot2)

# make the draws reproducible
set.seed(42)

rep_func <- function(N = 1000, P = 10) {
  # how the predictors are correlated (note: this construction is not
  # guaranteed positive semidefinite, hence the rmvnorm warning below)
  sig <- matrix(round(runif(P^2),2), nrow = P)
  diag(sig) <- 1:P
  sig[lower.tri(sig)] <- t(sig)[lower.tri(sig)]
  
  # Draw covariates
  X <- matrix(rmvnorm(N, mean = 1:P, sigma = sig), nrow = N)
  
  # draw true parameters
  betas <- rnorm(P)
  
  # draw true errors
  eps <- rnorm(N)
  
  # gen y
  y <- X %*% betas + eps
  
  # predictor sums
  Z <- rowSums(X)
  
  # create some helpers
  x1 <- X[,1]
  Q <- x1 - X[,2:P]  # columns x1 - xp, via column-wise recycling
  
  # delta through regression of y on sum of predictors
  delta <- lm(y ~ Z - 1)$coef
  
  # true coefficients from reparameterization
  agg.reg <- lm(y ~ Z + Q - 1)$coef
  a <- agg.reg[1]
  A <- agg.reg[2:P]
  
  # lambdas
  lambda <- lm(Q ~ Z - 1)$coef
  
  # delta from the derived formula (mean(betas) is the true beta-bar)
  delta_est <- sum(A * lambda) + mean(betas)
  
  c("lm_delta" = delta[[1]], "derived_delta" = delta_est)
}

rep_func()
#>      lm_delta derived_delta 
#>     0.3196803     0.3229088

res <- as.data.frame(t(replicate(1000, rep_func())))
#> Warning in rmvnorm(N, mean = 1:P, sigma = sig): sigma is numerically not
#> positive semidefinite

ggplot(res, aes(lm_delta - derived_delta)) + geom_density() +
  geom_vline(xintercept = 0) +
  hrbrthemes::theme_ipsum() +
  xlim(c(-1*sd(res$lm_delta), 1*sd(res$lm_delta))) +
  ggtitle("LM estimated delta - derived delta", 
          subtitle = "X-axis +/- 1 standard error of estimate")

Created on 2020-09-20 by the reprex package (v0.3.0)
