With a great deal of prodding (full credit to @whuber) I seem to have solved it:
We re-write $y = \beta_1X_1 + \beta_2X_2 + \epsilon$ as
$y = a(X_1 + X_2) + b(X_1 - X_2) + \epsilon$
$y = aX_1 + aX_2 + bX_1 - bX_2 + \epsilon$,
$y = (a + b)X_1 + (a - b)X_2 + \epsilon$, therefore
$\beta_1 = (a+b)$ and $\beta_2 = (a - b)$.
Solving for $a$ and $b$ we get:
$a = (\beta_1 + \beta_2)/2$ and $b = (\beta_1 - \beta_2)/2$
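A quick numerical check of this reparameterization (a sketch; the simulated data and names here are mine, not part of the derivation):

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- 2 * x1 + 10 * x2 + rnorm(200)

# (x1 + x2, x1 - x2) spans the same column space as (x1, x2),
# so the fits are identical and the coefficients transform exactly
betas <- coef(lm(y ~ x1 + x2 - 1))
ab    <- coef(lm(y ~ I(x1 + x2) + I(x1 - x2) - 1))

all.equal(ab[[1]], (betas[[1]] + betas[[2]]) / 2)  # a = (beta1 + beta2)/2: TRUE
all.equal(ab[[2]], (betas[[1]] - betas[[2]]) / 2)  # b = (beta1 - beta2)/2: TRUE
```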
What I estimate is equivalent to
$y = \delta(X_1 + X_2) + u,$
$u = b(X_1 - X_2) + \epsilon$. We know that the $k$th coefficient of a multiple regression can be recovered by first partialling out the regressors other than the $k$th. Thus, if I let $z = X_1 + X_2$ and $q = X_1 - X_2$, the regression coefficient $b$ in:
$y = az + bq + \epsilon$ can be obtained with the steps
(1) $y = \delta z + u$,
(2) $q = \lambda z + e$
(3) $u = be + \epsilon$. Substituting expressions back in:
$y = \delta z + be + \epsilon$, from (3) in (1);
$y = \delta z + b(q - \lambda z) + \epsilon$, from (2). Therefore:
$y = (\delta - b \lambda)z + bq + \epsilon$. Now we know that $(\delta - b\lambda ) = a = (\beta_1 + \beta_2)/2$, so we solve for $\delta$ which yields:
$\delta = (\beta_1 + \beta_2)/2 + \lambda (\beta_1 - \beta_2)/2$
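The Frisch–Waugh–Lovell steps (1)–(3) can themselves be verified numerically (a sketch; the data are simulated here purely for illustration):

```r
set.seed(2)
x1 <- rnorm(200, 0, 1)
x2 <- rnorm(200, 0, 5)
y  <- 2 * x1 + 10 * x2 + rnorm(200)
z  <- x1 + x2
q  <- x1 - x2

# b from the full regression y ~ z + q
b_full <- coef(lm(y ~ z + q - 1))[["q"]]

# partial z out of y and q, then regress residual on residual
u <- resid(lm(y ~ z - 1))          # step (1)
e <- resid(lm(q ~ z - 1))          # step (2)
b_fwl <- coef(lm(u ~ e - 1))[[1]]  # step (3)

all.equal(b_full, b_fwl)  # TRUE
```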
Here is some R code to check the solution, or to play around with:
rep.func <- function(N = 100, b1 = 2, b2 = 10, s1 = 1, s2 = 5) {
  x1 <- rnorm(N, 0, s1)
  x2 <- rnorm(N, 0, s2)
  eps <- rnorm(N)
  y <- x1*b1 + x2*b2 + eps
  z <- x1 + x2
  q <- x1 - x2
  lambda <- lm(q ~ z - 1)$coefficients
  est.delta <- lm(y ~ z - 1)$coefficients
  est.betas <- lm(y ~ x1 + x2 - 1)$coefficients
  derived.delta <- sum(est.betas)/2 + lambda * (est.betas[1] - est.betas[2])/2
  c("est.delta" = est.delta, "derived.delta" = derived.delta)
}
rep.func()
#> est.delta.z derived.delta.z
#> 9.77236 9.77236
library(ggplot2)
res <- t(replicate(1000, rep.func()))
all.equal(res[,1], res[,2])
#> [1] TRUE
ggplot(as.data.frame(res), aes(est.delta.z, derived.delta.z)) +
geom_point() + geom_smooth(method='lm') + theme_minimal()
Update for case with P predictors, September 2020
Returning to this much later to generalize to the case with $P$ predictors, by request. The underlying principle is the same: we are going to
- re-parameterize the true data-generating process so that it contains a term for the quantity we are interested in,
- determine the values of the re-parameterized coefficients in terms of the original coefficients,
- evaluate the effect of omitted-variable bias on the estimation of the re-parameterized coefficient.
With $P$ predictors the true data generating process is:
$$y = \sum_{p=1}^{P} \beta_p x_p + \epsilon$$.
This true data generating process can be re-parameterized in a way that is mathematically equivalent as
$$
y = a (\sum_{p=1}^{P} x_p) + \sum_{p=2}^{P} b_p (x_1 - x_p) + \epsilon \\
y = (a + \sum_{p=2}^{P} b_p) x_1 + \sum_{p=2}^{P}(a - b_p)x_p + \epsilon
$$
Now we can solve for these new parameters ($a$ and the $b_p$'s) in terms of the original parameters (the $\beta$'s).
$$
a + \sum_{p=2}^{P} b_p = \beta_1 \\
a - b_2 = \beta_2 \\
... \\
a - b_P = \beta_P
$$
therefore
$$
a = \frac{\sum_{p=1}^{P} \beta_p}{P} = \bar{\beta} \\
b_p = \bar{\beta} - \beta_p
$$
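For concreteness, take $P = 3$. The system reads
$$
a + b_2 + b_3 = \beta_1 \\
a - b_2 = \beta_2 \\
a - b_3 = \beta_3
$$
and summing all three equations cancels the $b$'s: $3a = \beta_1 + \beta_2 + \beta_3$, so $a = \bar{\beta}$ and $b_p = a - \beta_p = \bar{\beta} - \beta_p$.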
Now let's simplify our lives a little by setting
$$
z = \sum_{p=1}^{P}x_p \\
q_p = x_1 - x_p, \quad p = 2, \dots, P
$$
so we can write
$$
y = \bar{\beta}z + QA' + \epsilon
$$
where $Q$ is an $N \times (P-1)$ matrix whose columns are the differences $q_p$, and $A$ is a $(P-1)$-length row vector collecting the $b_p$ coefficients.
We now move to the third step above, namely that we are not actually estimating the model just written; instead, we are estimating
$$
y = \delta z + u
$$
We just repeat the process above for $P = 2$ (the application of Frisch–Waugh–Lovell), but try not to lose track of dimensions.
First we partial $z$ out of each column of $Q$, with some abuse of notation:
$$
q_p = \lambda_p z + e_p
$$
so $\Lambda$ would be a vector of length $(P-1)$, and $e$ would be an $N \times (P-1)$ matrix.
Then we can re-write $u$ with $z$ removed:
$$
u = eA' + \epsilon
$$
Plugging the terms back in:
$$
y = \delta z + eA' + \epsilon \\
y = \delta z + QA' - (\Lambda A') z + \epsilon \\
y = (\delta - \Lambda A') z + QA' + \epsilon \\
\delta = \bar{\beta} + \Lambda A'
$$
In words, this says that the coefficient in the short regression of $y$ on the sum of the predictors equals the average of the regression coefficients in the original model, plus a 'bias' term equal to the inner product of the omitted coefficients in the reparameterization and the regression coefficients from a regression of the omitted variables on the included variable.
Since it's always risky to get your math from randos on the internet, here's a simulation to support the derivation:
## Load mvtnorm because the check is more interesting
# when predictors are correlated
library(mvtnorm)
library(ggplot2)
# create correlation matrix
set.seed(42)
rep_func <- function(N = 1000, P = 10) {
  # how the predictors are correlated
  sig <- matrix(round(runif(P^2), 2), nrow = P)
  diag(sig) <- 1:P
  sig[lower.tri(sig)] <- t(sig)[lower.tri(sig)]
  # draw covariates
  X <- matrix(rmvnorm(N, mean = 1:P, sigma = sig), nrow = N)
  # draw true parameters
  betas <- rnorm(P)
  # draw true errors
  eps <- rnorm(N)
  # generate y
  y <- X %*% betas + eps
  # predictor sums
  Z <- rowSums(X)
  # create some helpers
  x1 <- X[, 1]
  Q <- x1 - X[, 2:P]
  # delta through regression of y on sum of predictors
  delta <- lm(y ~ Z - 1)$coef
  # coefficients from the reparameterization
  agg.reg <- lm(y ~ Z + Q - 1)$coef
  a <- agg.reg[1]
  A <- agg.reg[2:P]
  # lambdas
  lambda <- lm(Q ~ Z - 1)$coef
  # compare deltas
  delta_est <- sum(A * lambda) + mean(betas)
  c("lm_delta" = delta[[1]], "derived_delta" = delta_est)
}
rep_func()
#> lm_delta derived_delta
#> 0.3196803 0.3229088
res <- as.data.frame(t(replicate(1000, rep_func())))
#> Warning in rmvnorm(N, mean = 1:P, sigma = sig): sigma is numerically not
#> positive semidefinite
ggplot(res, aes(lm_delta - derived_delta)) + geom_density() +
geom_vline(xintercept = 0) +
hrbrthemes::theme_ipsum() +
xlim(c(-1*sd(res$lm_delta), 1*sd(res$lm_delta))) +
ggtitle("LM estimated delta - derived delta",
subtitle = "X-axis +/- 1 standard error of estimate")
Created on 2020-09-20 by the reprex package (v0.3.0)
Best Answer
The general result you are looking for (under the stated assumptions) is the following: for linear regression with $p$ predictor variables (you have two, $X$ and $X^2$) and an intercept, with $n$ observations, $\mathbf{X}$ the $n \times (p+1)$ design matrix and $\hat{\beta}$ the $(p+1)$-dimensional estimator, then for any $a \in \mathbb{R}^{p+1}$
$$ \frac{a^T\hat{\beta} - a^T \beta}{\hat{\sigma} \sqrt{a^T(\mathbf{X}^T\mathbf{X})^{-1}a}} \sim t_{n-p-1}.$$
The consequence is that you can construct confidence intervals for any linear combination of the $\beta$ vector using the same $t$-distribution you use to construct a confidence interval for one of the coordinates.
In your case, $p = 2$ and $a^T = (0, x_2 - x_1, x_2^2 - x_1^2)$. The denominator in the formula above is the square root of what you compute as the estimate of the standard error (provided that this is what the software computes ...). Note that the variance estimator, $\hat{\sigma}^2$, is supposed to be the (usual) unbiased estimator, where you divide by the degrees of freedom, $n-p-1$, and not the number of observations $n$.
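As a sketch of how such an interval can be computed in R (the data and the two predictor values `x1`, `x2` below are made up for illustration; `vcov()` for an `lm` fit already uses the unbiased $\hat{\sigma}^2$):

```r
set.seed(3)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x - 0.3 * x^2 + rnorm(100, 0, 0.2)
fit <- lm(y ~ x + I(x^2))

# contrast vector for the difference in fitted means at x2 vs x1
x1 <- 0.5
x2 <- 1.5
a <- c(0, x2 - x1, x2^2 - x1^2)

est <- sum(a * coef(fit))                    # a' beta-hat
se  <- sqrt(drop(t(a) %*% vcov(fit) %*% a))  # sigma-hat * sqrt(a' (X'X)^-1 a)
ci  <- est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se
```

Here `fit$df.residual` is $n - p - 1$, so the interval uses the same $t$-distribution as the per-coordinate intervals.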