With a great deal of prodding (full credit to @whuber) I seem to have solved it:
We re-write $y = \beta_1X_1 + \beta_2X_2 + \epsilon$ as
$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$
$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,
$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore
$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.
Solving for $a$ and $b$ we get:
$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$
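A quick numerical check of this reparameterization (a sketch; the simulated data and names here are mine, not part of the derivation):

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- 2 * x1 + 10 * x2 + rnorm(200)

# (x1 + x2, x1 - x2) spans the same column space as (x1, x2),
# so the fits are identical and the coefficients transform exactly
betas <- coef(lm(y ~ x1 + x2 - 1))
ab    <- coef(lm(y ~ I(x1 + x2) + I(x1 - x2) - 1))

all.equal(ab[[1]], (betas[[1]] + betas[[2]]) / 2)  # a = (beta1 + beta2)/2: TRUE
all.equal(ab[[2]], (betas[[1]] - betas[[2]]) / 2)  # b = (beta1 - beta2)/2: TRUE
```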
What I estimate is equivalent to
$y = \delta(X_1 + X_2) + u,$
$u = b(X_1 - X_2) + \epsilon$. We know that the $k$th coefficient of a multiple regression can be recovered by first partialling out the regressors other than the $k$th. Thus, if I let $z = X_1 + X_2$ and $q = X_1 - X_2$, the regression coefficient $b$ in:
$y = az + bq + \epsilon$ can be obtained with the steps
(1) $y = \delta z + u$,
(2) $q = \lambda z + e$
(3) $u = be + \epsilon$. Substituting expressions back in:
$y = \delta z + be + \epsilon$, from (3) in (1);
$y = \delta z + b(q - \lambda z) + \epsilon$, from (2). Therefore:
$y = (\delta - b \lambda)z + bq + \epsilon$. Now we know that $(\delta - b\lambda ) = a = (\beta_1 + \beta_2)/2$, so we solve for $\delta$ which yields:
$\delta = (\beta_1 + \beta_2)/2 + \lambda (\beta_1 - \beta_2)/2$
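The Frisch–Waugh–Lovell steps (1)–(3) can themselves be verified numerically (a sketch; the data are simulated here purely for illustration):

```r
set.seed(2)
x1 <- rnorm(200, 0, 1)
x2 <- rnorm(200, 0, 5)
y  <- 2 * x1 + 10 * x2 + rnorm(200)
z  <- x1 + x2
q  <- x1 - x2

# b from the full regression y ~ z + q
b_full <- coef(lm(y ~ z + q - 1))[["q"]]

# partial z out of y and q, then regress residual on residual
u <- resid(lm(y ~ z - 1))          # step (1)
e <- resid(lm(q ~ z - 1))          # step (2)
b_fwl <- coef(lm(u ~ e - 1))[[1]]  # step (3)

all.equal(b_full, b_fwl)  # TRUE
```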
Here is some R code to check the solution, or to play around with:
rep.func <- function(N = 100, b1 = 2, b2 = 10, s1 = 1, s2 = 5) {
  x1 <- rnorm(N, 0, s1)
  x2 <- rnorm(N, 0, s2)
  eps <- rnorm(N)
  y <- x1*b1 + x2*b2 + eps
  z <- x1 + x2
  q <- x1 - x2
  lambda <- lm(q ~ z - 1)$coefficients
  est.delta <- lm(y ~ z - 1)$coefficients
  est.betas <- lm(y ~ x1 + x2 - 1)$coefficients
  derived.delta <- sum(est.betas)/2 + lambda * (est.betas[1] - est.betas[2])/2
  c("est.delta" = est.delta, "derived.delta" = derived.delta)
}
rep.func()
#> est.delta.z derived.delta.z
#> 9.77236 9.77236
library(ggplot2)
res <- t(replicate(1000, rep.func()))
all.equal(res[,1], res[,2])
#> [1] TRUE
ggplot(as.data.frame(res), aes(est.delta.z, derived.delta.z)) +
geom_point() + geom_smooth(method='lm') + theme_minimal()
Update for case with P predictors, September 2020
Returning to this much later to generalize to the case with $P$ predictors, by request. The underlying principle is the same: we are going to
- re-parameterize the true data-generating process so that it contains a term for the quantity we are interested in,
- determine the values of the re-parameterized coefficients in terms of the original coefficients,
- evaluate the effect of omitted-variable bias on the estimation of the re-parameterized coefficient.
With $P$ predictors the true data generating process is:
$$y = \sum_{p=1}^{P} \beta_p x_p + \epsilon$$.
This true data generating process can be re-parameterized in a way that is mathematically equivalent as
$$
y = a (\sum_{p=1}^{P} x_p) + \sum_{p=2}^{P} b_p (x_1 - x_p) + \epsilon \\
y = (a + \sum_{p=2}^{P} b_p) x_1 + \sum_{p=2}^{P}(a - b_p)x_p + \epsilon
$$
Now we can solve for these new parameters ($a$ and the $b_p$'s) in terms of the original parameters (the $\beta$'s).
$$
a + \sum_{p=2}^{P} b_p = \beta_1 \\
a - b_2 = \beta_2 \\
... \\
a - b_P = \beta_P
$$
therefore
$$
a = \frac{\sum_{p=1}^{P} \beta_p}{P} = \bar{\beta} \\
b_p = \bar{\beta} - \beta_p
$$
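For concreteness, take $P = 3$. The system reads
$$
a + b_2 + b_3 = \beta_1 \\
a - b_2 = \beta_2 \\
a - b_3 = \beta_3
$$
and summing all three equations cancels the $b$'s: $3a = \beta_1 + \beta_2 + \beta_3$, so $a = \bar{\beta}$ and $b_p = a - \beta_p = \bar{\beta} - \beta_p$.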
Now let's simplify our lives a little by setting
$$
z = \sum_{p=1}^{P}x_p \\
q_p = x_1 - x_p, \quad p = 2, \dots, P
$$
so we can write
$$
y = \bar{\beta}z + QA' + \epsilon
$$
where $Q$ is an $N \times (P-1)$ matrix whose columns are the differences $q_p$, and $A$ is a $(P-1)$-length row vector collecting the $b_p$ coefficients.
We now move to the third step above, namely that we are not actually estimating the model just written; instead, we are estimating
$$
y = \delta z + u
$$
We just repeat the process above for $P = 2$ (the application of Frisch–Waugh–Lovell), but try not to lose track of dimensions.
First we partial $z$ out of each column of $Q$, with some abuse of notation:
$$
q_p = \lambda_p z + e_p
$$
so $\Lambda$ would be a vector of length $(P-1)$, and $e$ would be an $N \times (P-1)$ matrix.
Then we can re-write $u$ with $z$ removed:
$$
u = eA' + \epsilon
$$
Plugging the terms back in:
$$
y = \delta z + eA' + \epsilon \\
y = \delta z + QA' - (\Lambda A') z + \epsilon \\
y = (\delta - \Lambda A') z + QA' + \epsilon \\
\delta = \bar{\beta} + \Lambda A'
$$
In words, this says that the coefficient in the short regression of $y$ on the sum of the predictors equals the average of the regression coefficients in the original model, plus a 'bias' term equal to the inner product of the omitted coefficients in the reparameterization and the regression coefficients from a regression of the omitted variables on the included variable.
Since it's always risky to get your math from randos on the internet, here's a simulation to support the derivation:
## Load mvtnorm because the check is more interesting
# when predictors are correlated
library(mvtnorm)
library(ggplot2)
# create correlation matrix
set.seed(42)
rep_func <- function(N = 1000, P = 10) {
  # how the predictors are correlated
  sig <- matrix(round(runif(P^2), 2), nrow = P)
  diag(sig) <- 1:P
  sig[lower.tri(sig)] <- t(sig)[lower.tri(sig)]
  # draw covariates
  X <- matrix(rmvnorm(N, mean = 1:P, sigma = sig), nrow = N)
  # draw true parameters
  betas <- rnorm(P)
  # draw true errors
  eps <- rnorm(N)
  # generate y
  y <- X %*% betas + eps
  # predictor sums
  Z <- rowSums(X)
  # create some helpers
  x1 <- X[, 1]
  Q <- x1 - X[, 2:P]
  # delta through regression of y on sum of predictors
  delta <- lm(y ~ Z - 1)$coef
  # coefficients from the reparameterization
  agg.reg <- lm(y ~ Z + Q - 1)$coef
  a <- agg.reg[1]
  A <- agg.reg[2:P]
  # lambdas
  lambda <- lm(Q ~ Z - 1)$coef
  # compare deltas
  delta_est <- sum(A * lambda) + mean(betas)
  c("lm_delta" = delta[[1]], "derived_delta" = delta_est)
}
rep_func()
#> lm_delta derived_delta
#> 0.3196803 0.3229088
res <- as.data.frame(t(replicate(1000, rep_func())))
#> Warning in rmvnorm(N, mean = 1:P, sigma = sig): sigma is numerically not
#> positive semidefinite
ggplot(res, aes(lm_delta - derived_delta)) + geom_density() +
geom_vline(xintercept = 0) +
hrbrthemes::theme_ipsum() +
xlim(c(-1*sd(res$lm_delta), 1*sd(res$lm_delta))) +
ggtitle("LM estimated delta - derived delta",
subtitle = "X-axis +/- 1 standard error of estimate")
Created on 2020-09-20 by the reprex package (v0.3.0)
Best Answer
The general result you are looking for (under the stated assumptions) is the following: for linear regression with $p$ predictor variables (you have two, $X$ and $X^2$) and an intercept, with $n$ observations, $\mathbf{X}$ the $n \times (p+1)$ design matrix and $\hat{\beta}$ the $(p+1)$-dimensional estimator, then for any $a \in \mathbb{R}^{p+1}$
$$ \frac{a^T\hat{\beta} - a^T \beta}{\hat{\sigma} \sqrt{a^T(\mathbf{X}^T\mathbf{X})^{-1}a}} \sim t_{n-p-1}.$$
The consequence is that you can construct confidence intervals for any linear combination of the $\beta$ vector using the same $t$-distribution you use to construct a confidence interval for one of the coordinates.
In your case, $p = 2$ and $a^T = (0, x_2 - x_1, x_2^2 - x_1^2)$. The denominator in the formula above is the square root of what you compute as the estimate of the standard error (provided that this is what the software computes ...). Note that the variance estimator, $\hat{\sigma}^2$, is supposed to be the (usual) unbiased estimator, where you divide by the degrees of freedom, $n-p-1$, and not the number of observations $n$.
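As a sketch of how such an interval can be computed in R (the data and the two predictor values `x1`, `x2` below are made up for illustration; `vcov()` for an `lm` fit already uses the unbiased $\hat{\sigma}^2$):

```r
set.seed(3)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x - 0.3 * x^2 + rnorm(100, 0, 0.2)
fit <- lm(y ~ x + I(x^2))

# contrast vector for the difference in fitted means at x2 vs x1
x1 <- 0.5
x2 <- 1.5
a <- c(0, x2 - x1, x2^2 - x1^2)

est <- sum(a * coef(fit))                    # a' beta-hat
se  <- sqrt(drop(t(a) %*% vcov(fit) %*% a))  # sigma-hat * sqrt(a' (X'X)^-1 a)
ci  <- est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se
```

Here `fit$df.residual` is $n - p - 1$, so the interval uses the same $t$-distribution as the per-coordinate intervals.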