With a great deal of prodding (full credit to @whuber) I seem to have solved it:
We re-write $y = \beta_1X_1 + \beta_2X_2 + \epsilon$ as
$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$
$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,
$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore
$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.
Solving for $a$ and $b$ we get:
$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$
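A quick numerical sanity check of this reparameterization (a minimal sketch; the coefficients and data below are made up for illustration):

set.seed(1)
N  <- 200
x1 <- rnorm(N); x2 <- rnorm(N)
b1 <- 2; b2 <- 10                         # illustrative true coefficients
y  <- b1 * x1 + b2 * x2 + rnorm(N)
fit <- lm(y ~ I(x1 + x2) + I(x1 - x2) - 1)
coef(fit)                                 # estimates of a and b
c(a = (b1 + b2) / 2, b = (b1 - b2) / 2)   # should be close: 6 and -4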
What I estimate is equivalent to
$y = \delta(X_1 + X_2) + u,$
$u = b(X_1 - X_2) + \epsilon$. We know that the $k$th coefficient of a multiple regression can be recovered by first partialling out the regressors other than $k$. Thus, if I let $z = X_1 + X_2$ and $q = X_1 - X_2$, the regression coefficient $b$ in:
$y = az + bq + \epsilon$ can be obtained with the steps
(1) $y = \delta z + u$,
(2) $q = \lambda z + e$
(3) $u = be + \epsilon$. Substituting these expressions back in:
$y = \delta z + be + \epsilon$, plugging (3) into (1);
$y = \delta z + b(q - \lambda z) + \epsilon$, using (2) to replace $e$. Therefore:
$y = (\delta - b \lambda)z + bq + \epsilon$. Now we know that $(\delta - b\lambda ) = a = (\beta_1 + \beta_2)/2$, so we solve for $\delta$ which yields:
$\delta = (\beta_1 + \beta_2)/2 + \lambda (\beta_1 - \beta_2)/2$
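As a quick sanity check on steps (1)-(3) themselves, here is a minimal Frisch-Waugh-Lovell sketch (all data and coefficients below are made up for illustration): the residual-on-residual regression recovers the same $b$ as the full regression.

set.seed(2)
N  <- 500
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 2 * x1 + 10 * x2 + rnorm(N)    # illustrative true model
z  <- x1 + x2
q  <- x1 - x2
u <- resid(lm(y ~ z - 1))            # step (1): partial z out of y
e <- resid(lm(q ~ z - 1))            # step (2): partial z out of q
b_fwl  <- coef(lm(u ~ e - 1))        # step (3): residual-on-residual regression
b_full <- coef(lm(y ~ z + q - 1))["q"]
c(b_fwl, b_full)                     # identical, by Frisch-Waugh-Lovell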
Here is some R code to check the solution, or to play around with:
rep.func <- function(N = 100, b1 = 2, b2 = 10, s1 = 1, s2 = 5) {
  # simulate two predictors and the outcome
  x1  <- rnorm(N, 0, s1)
  x2  <- rnorm(N, 0, s2)
  eps <- rnorm(N)
  y   <- x1 * b1 + x2 * b2 + eps
  # sum and difference of the predictors
  z <- x1 + x2
  q <- x1 - x2
  # lambda from the auxiliary regression, delta from the short regression,
  # and the betas from the full regression
  lambda    <- lm(q ~ z - 1)$coefficients
  est.delta <- lm(y ~ z - 1)$coefficients
  est.betas <- lm(y ~ x1 + x2 - 1)$coefficients
  # delta derived from the formula above
  derived.delta <- sum(est.betas) / 2 + lambda * (est.betas[1] - est.betas[2]) / 2
  c("est.delta" = est.delta, "derived.delta" = derived.delta)
}
rep.func()
#> est.delta.z derived.delta.z
#> 9.77236 9.77236
library(ggplot2)
res <- t(replicate(1000, rep.func()))
all.equal(res[,1], res[,2])
#> [1] TRUE
ggplot(as.data.frame(res), aes(est.delta.z, derived.delta.z)) +
geom_point() + geom_smooth(method='lm') + theme_minimal()
Update for the case with P predictors, September 2020
Returning to this much later to generalize to the case with $P$ predictors, by request. The underlying principle is the same: we are going to
- re-parameterize the true data-generating process so that it contains a term for the quantity we are interested in,
- determine the values of the re-parameterized coefficients in terms of the original coefficients, and
- evaluate the effect of omitted-variable bias on the estimation of the re-parameterized coefficient.
With $P$ predictors the true data generating process is:
$$y = \sum_{p=1}^{P} \beta_p x_p + \epsilon$$.
This true data generating process can be re-parameterized in a way that is mathematically equivalent as
$$
y = a (\sum_{p=1}^{P} x_p) + \sum_{p=2}^{P} b_p (x_1 - x_p) + \epsilon \\
y = (a + \sum_{p=2}^{P} b_p) x_1 + \sum_{p=2}^{P}(a - b_p)x_p + \epsilon
$$
Now we can solve for the new parameters ($a$ and the $b_p$'s) in terms of the old parameters (the $\beta$'s).
$$
a + \sum_{p=2}^{P} b_p = \beta_1 \\
a - b_2 = \beta_2 \\
... \\
a - b_P = \beta_P
$$
Substituting $b_p = a - \beta_p$ from the last $P-1$ equations into the first gives $Pa = \sum_{p=1}^{P}\beta_p$, therefore
$$
a = \frac{\sum_{p=1}^{P} \beta_p}{P} = \bar{\beta} \\
b_p = \bar{\beta} - \beta_p.
$$
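A quick arithmetic check of this solution (the $\beta$'s below are arbitrary made-up values): reconstructing the original coefficients from $a$ and the $b_p$'s should give back the $\beta$'s.

betas <- c(2, -1, 0.5, 3, -4)            # illustrative coefficients
P     <- length(betas)
a     <- mean(betas)                     # a = beta_bar
b     <- a - betas[2:P]                  # b_p = beta_bar - beta_p
c(a + sum(b), a - b) - betas             # all zero: the betas are recovered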
Now let's simplify our life a little by setting
$$
z = \sum_{p=1}^{P}x_p \\
Q = \begin{bmatrix} x_1 - x_2 & \cdots & x_1 - x_P \end{bmatrix}
$$
so we can write
$$
y = \bar{\beta}z + QA' + \epsilon
$$
where $Q$ is the $N \times (P-1)$ matrix of column differences defined above and $A$ is a $(P-1)$-length row vector collecting the coefficients $b_2, \dots, b_P$.
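As a small sketch (with made-up data and coefficients), fitting this reparameterized model directly returns approximately $\bar{\beta}$ and the $b_p$'s:

set.seed(3)
N <- 2000; P <- 4
X     <- matrix(rnorm(N * P), nrow = N)
betas <- c(1, -2, 0.5, 3)                 # illustrative true coefficients
y     <- X %*% betas + rnorm(N)
z <- rowSums(X)
Q <- X[, 1] - X[, 2:P]                    # columns x1 - x_p, p = 2, ..., P
coef(lm(y ~ z + Q - 1))                   # approx. (beta_bar, b_2, ..., b_P)
c(mean(betas), mean(betas) - betas[2:P])  # theoretical values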
We now move to the third step: we are not actually estimating the model above; instead, we are estimating
$$
y = \delta z + u
$$
We just repeat the process above for the $P = 2$ case (the application of Frisch-Waugh-Lovell), but try not to lose track of dimensions:
First we partial $z$ out of each column of $Q$; with some abuse of notation,
$$
q_p = \lambda_p z + e_p
$$
so $\Lambda$ would be a vector of length $(P-1)$, and $e$ would be an $N$ by $(P-1)$ matrix.
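In R, this per-column partialling can be done with a single multivariate lm call (a sketch with made-up data; $z$ and $Q$ constructed as above):

set.seed(4)
N <- 1000; P <- 4
X <- matrix(rnorm(N * P), nrow = N)
z <- rowSums(X)
Q <- X[, 1] - X[, 2:P]
fit_q  <- lm(Q ~ z - 1)   # one regression per column of Q
lambda <- coef(fit_q)     # 1 x (P-1) matrix of lambda_p
e      <- resid(fit_q)    # N x (P-1) matrix of residuals e_p
dim(lambda); dim(e)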
Then we can re-write $u$ with $z$ removed:
$$
u = eA' + \epsilon
$$
Plugging terms back in:
$$
y = \delta z + eA' + \epsilon \\
y = \delta z + QA' - (\Lambda A') z + \epsilon \\
y = (\delta - \Lambda A') z + QA' + \epsilon
$$
Matching the coefficient on $z$ with the reparameterized model gives $\delta - \Lambda A' = \bar{\beta}$, therefore
$$
\delta = \bar{\beta} + \Lambda A'
$$
In words, this says that the coefficient in the short regression of $y$ on the sum of the predictors equals the average of the regression coefficients in the original model, plus a 'bias' term equal in magnitude to the inner product of the omitted coefficients in the re-parameterization and the regression coefficients from a regression of the omitted variables on the included variable.
Since it's always risky to get your math from randos on the internet, here's a simulation to support the derivation:
## Load mvtnorm because things are more interesting
# when predictors are correlated
library(mvtnorm)
library(ggplot2)
# create correlation matrix
set.seed(42)
rep_func <- function(N = 1000, P = 10) {
  # how the predictors are correlated
  sig <- matrix(round(runif(P^2), 2), nrow = P)
  diag(sig) <- 1:P
  sig[lower.tri(sig)] <- t(sig)[lower.tri(sig)]
  # Draw covariates
  X <- matrix(rmvnorm(N, mean = 1:P, sigma = sig), nrow = N)
  # draw true parameters
  betas <- rnorm(P)
  # draw true errors
  eps <- rnorm(N)
  # gen y
  y <- X %*% betas + eps
  # predictor sums
  Z <- rowSums(X)
  # create some helpers
  x1 <- X[, 1]
  Q <- x1 - X[, 2:P]
  # delta through regression of y on sum of predictors
  delta <- lm(y ~ Z - 1)$coef
  # true coefficients from reparameterization
  agg.reg <- lm(y ~ Z + Q - 1)$coef
  a <- agg.reg[1]
  A <- agg.reg[2:P]
  # lambdas
  lambda <- lm(Q ~ Z - 1)$coef
  # compare deltas
  delta_est <- sum(A * lambda) + mean(betas)
  c("lm_delta" = delta[[1]], "derived_delta" = delta_est)
}
rep_func()
#> lm_delta derived_delta
#> 0.3196803 0.3229088
res <- as.data.frame(t(replicate(1000, rep_func())))
#> Warning in rmvnorm(N, mean = 1:P, sigma = sig): sigma is numerically not
#> positive semidefinite
ggplot(res, aes(lm_delta - derived_delta)) + geom_density() +
geom_vline(xintercept = 0) +
hrbrthemes::theme_ipsum() +
xlim(c(-1*sd(res$lm_delta), 1*sd(res$lm_delta))) +
ggtitle("LM estimated delta - derived delta",
subtitle = "X-axis +/- 1 standard error of estimate")
Created on 2020-09-20 by the reprex package (v0.3.0)
Best Answer
The two approaches test completely different things. In other words, from a theoretical point of view, the hypothesis of interest is different in each case.
When performing the t-test, your hypothesis of interest is $$ H_0: \beta_i = 0\: \text{ vs }\: H_1: \beta_i\ne0, \quad i\in\{1,2\}, $$ whereas in the Wald test you're testing $$ H_0: \beta_1=\beta_2\: \text{ vs }\: H_1: \beta_1\ne\beta_2. $$ However, based on the way you formulated your question, it appears that your null hypothesis is something of the form $$ H_0: \begin{pmatrix}1&0\\0&1\\1&-1\end{pmatrix}\begin{pmatrix}\beta_1\\\beta_2\end{pmatrix}=\begin{pmatrix}0\\0\\0\end{pmatrix}, $$ and therefore you need to perform the Wald test for that null hypothesis.
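For concreteness, here is one way to carry out such a Wald test in R with a hand-rolled statistic for the single restriction $\beta_1 = \beta_2$ (the data and model are made up for illustration; car::linearHypothesis(fit, "x1 = x2") would give an equivalent F-form test):

set.seed(5)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 2.5 * x2 + rnorm(n)    # illustrative data
fit <- lm(y ~ x1 + x2)
b <- coef(fit)
V <- vcov(fit)
R <- matrix(c(0, 1, -1), nrow = 1)        # restriction: beta_1 - beta_2 = 0
W <- as.numeric(t(R %*% b) %*% solve(R %*% V %*% t(R)) %*% (R %*% b))
pchisq(W, df = nrow(R), lower.tail = FALSE)   # asymptotic chi-square p-value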
A final word regarding the example at hand: sometimes your inference may be flawed because you did not take certain features of your data into account when (1) estimating the parameters of interest or (2) choosing the critical values. For example, the normal (or $\chi^2$) approximation is not a great one if you have very few observations, since it is an asymptotic approximation.
Hope this helps.