Linear Regression – Simple Linear Regressions Among Three Pairs of Variables

Tags: linear, mathematical-statistics, regression, regression coefficients

Let the "ordinary-least-squares regression of $Y$ on $X$" be given by
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\text{.}$$
Suppose I run the following:

  • The OLS regression of $Y$ on $X$
  • The OLS regression of $Y$ on $Z$
  • The OLS regression of $X$ on $Z$

and in all three cases the estimated slope coefficient is $\hat{\beta}_1 = 2$.

I am interested in the slope coefficient of $Y$ on $X + Z$. How does this compare to the value $2$ (i.e., less than, equal to, greater than, or impossible to know)?

Attempt. Let $\hat{\beta}_{Y, X}$ be $\hat{\beta}_1$ in the case of the OLS regression of $Y$ on $X$, and similarly for the other three cases. Then we know that
$$\hat{\beta}_{Y, X} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
hence
$$\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = 2\text{.}$$
We also know that regression coefficients are unaffected by centering, so without loss of generality, I assume that all variables are centered and that $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(z_i - \bar{z})^2 = 1$, leading to
$$\sum_{i=1}^{n}x_iy_i = \sum_{i=1}^{n}z_iy_i = \sum_{i=1}^{n}x_iz_i = 2\text{.}$$
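The closed-form slope and the centering claim used above can be checked numerically. The following is a minimal sketch on made-up data (the seed, sample size, and coefficients are arbitrary choices for illustration, not part of the question):

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rnorm(20)                     # made-up regressor
y <- 1 + 2 * x + rnorm(20)         # made-up response

# Slope from the sum-of-products formula
slope_formula <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Slope reported by lm() with an intercept
slope_lm <- unname(coef(lm(y ~ x))["x"])

# Slope after centering both variables and dropping the intercept
xc <- x - mean(x)
yc <- y - mean(y)
slope_centered <- unname(coef(lm(yc ~ xc + 0)))

all.equal(slope_formula, slope_lm)
all.equal(slope_formula, slope_centered)
```

Both comparisons should agree up to floating-point tolerance.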
Thus, by linearity of summation, we have
$$\hat{\beta}_{Y, X+Z} = \dfrac{\sum_{i=1}^{n}y_i(x_i + z_i)}{\sum_{i=1}^{n}(x_i + z_i)^2} = \dfrac{4}{\sum_{i=1}^{n}(x_i^2 + 2x_iz_i + z_i^2)}\text{.}$$
Here's where I'm stuck. Can we do something clever with the above quantity?

I am told the answer is that $\hat{\beta}_{Y, X+Z}< 2$.

Best Answer

First, it is possible for these conditions to hold simultaneously, as I will show.

Second, the slope of the regression of $Y$ on $X+Z$ must lie in the open interval $(10/9,2)$ and can attain any value in that interval.


Vector notation is particularly convenient here.

The given information tells us (in the order given in the question) that

  1. $Y = 2X + E$ where $E$ is orthogonal to $X.$
  2. $Y = 2Z + \beta W + F$ where $F$ is orthogonal to $Z$ and $W$ (and $\beta$ is at this point unknown).
  3. $X = 2Z + W$ where $W$ is a zero-sum vector orthogonal to $Z.$
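The orthogonality in these decompositions is the standard property of OLS residuals when an intercept is fitted: they sum to zero and are orthogonal to the regressor. A minimal sketch on made-up data (the vectors below are illustrative and do not reproduce the slope of exactly $2$):

```r
set.seed(2)                  # arbitrary seed, for reproducibility only
z <- rnorm(10)               # made-up regressor
x <- 2 * z + rnorm(10)       # made-up response

w <- residuals(lm(x ~ z))    # residual vector, playing the role of W

sum(w)                       # zero up to rounding: W is a zero-sum vector
sum(w * z)                   # zero up to rounding: W is orthogonal to Z
```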

Implicitly, not all of $X,Y,$ and $Z$ are zero, for otherwise there's nothing of interest: we would just be saying, three times over, that $2$ times the zero vector is zero; and then there would be no bounds on the regression of $Y$ against $X+Z.$ Consequently, all of these vectors must be nonzero.

To find $\beta,$ use $(2)$ and $(3)$ to regress $Y$ on $X$ as

$$2 = \hat\beta_{Y;X} = \frac{Y\cdot X}{||X||^2} = \frac{4||Z||^2 + \beta||W||^2}{4||Z||^2 + ||W||^2}.$$

When $W\ne 0,$ the unique solution is

$$\beta = 2 + \frac{4||Z||^2}{||W||^2}.$$

(When $W=0$ the equation reads $2=1,$ which has no solutions.)
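
Spelled out, the steps behind this solution use only the orthogonality relations $Z\cdot W = 0$ and $F\cdot Z = F\cdot W = 0$ from the decompositions:

$$Y\cdot X = (2Z + \beta W + F)\cdot(2Z + W) = 4||Z||^2 + \beta||W||^2, \qquad ||X||^2 = ||2Z + W||^2 = 4||Z||^2 + ||W||^2,$$

$$2\left(4||Z||^2 + ||W||^2\right) = 4||Z||^2 + \beta||W||^2 \implies \beta||W||^2 = 4||Z||^2 + 2||W||^2 \implies \beta = 2 + \frac{4||Z||^2}{||W||^2}.$$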

We may now compute the regression of $Y$ against $X+Z$ as

$$\hat\beta_{Y;X+Z} = \frac{Y\cdot(X+Z)}{||X+Z||^2} = \frac{(2Z+\beta W + F)\cdot(3Z + W)}{||3Z + W||^2} = \frac{10||Z||^2 + 2||W||^2}{9||Z||^2 + ||W||^2}.$$
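
The numerator simplifies in two steps: the cross terms vanish by orthogonality, and then $\beta||W||^2 = 2||W||^2 + 4||Z||^2$ follows from the value of $\beta$ found earlier:

$$(2Z + \beta W + F)\cdot(3Z + W) = 6||Z||^2 + \beta||W||^2 = 6||Z||^2 + 2||W||^2 + 4||Z||^2 = 10||Z||^2 + 2||W||^2.$$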

The right hand fraction is the slope of a ray emanating from the origin and passing through some point in the interior of the segment connecting the points $(9,10)$ and $(1,2)$ in the plane (with the weights given by the relative values of $||Z||^2$ and $||W||^2$), making it obvious the bounds are $10/9$ and $2/1$; they cannot be attained because both weights are nonzero, QED.
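
The bounds can also be seen algebraically, since the fraction equals $2 - 8||Z||^2/(9||Z||^2 + ||W||^2)$. As a rough numerical illustration, the sketch below evaluates the fraction over a wide range of ratios $||W||^2/||Z||^2$; the helper name `slope_yxz` and the grid of ratios are hypothetical choices made for this example:

```r
# Slope of Y on X+Z as a function of the squared norms of Z and W
# (hypothetical helper; it only encodes the closed-form fraction)
slope_yxz <- function(z2, w2) (10 * z2 + 2 * w2) / (9 * z2 + w2)

ratios <- 10^seq(-6, 6, by = 2)   # illustrative values of ||W||^2 / ||Z||^2
vals <- slope_yxz(1, ratios)      # take ||Z||^2 = 1 without loss of generality

all(vals > 10/9 & vals < 2)       # every value lies strictly inside (10/9, 2)
all(diff(vals) > 0)               # the slope increases with ||W||^2 / ||Z||^2
range(vals)                       # approaches, but does not reach, the bounds
```

Small ratios push the slope toward $10/9$ and large ratios push it toward $2$, consistent with both bounds being unattainable.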


Here is a histogram of the regression coefficients for a thousand simulated configurations (in $\mathbb{R}^{10}$):

Figure

The red vertical lines mark the bounds. The simulation supports these by exhibiting a full range of values filling the interval between the bounds but never extending beyond them.

The R code shows how the foregoing analysis was implemented. (It includes a post-simulation check that all the given regression coefficients equal $2,$ as intended.)

z <- scale(c(1, rep(0,9))) 
z <- z / sqrt(length(z)-1) # Must be zero mean, unit norm

set.seed(17)
sim <- replicate(1e3, {
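  w <- rexp(length(z))
  w <- residuals(lm(w ~ z))    # Make w zero-sum and orthogonal to z
  w2 <- sum(w*w)               # Squared norm of w
  y <- 2 * z + (4/w2 + 2)*w    # beta = 2 + 4/||w||^2 since ||z|| = 1; F is taken to be 0
  x <- 2*z + w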
  
  a <- coefficients(lm(y ~ x + 0))
  b <- coefficients(lm(y ~ z + 0))
  c. <- coefficients(lm(x ~ z + 0))
  d <- coefficients(lm(y ~ I(x+z) + 0))
  c(a,b,c.,d)
})
table(sim[1:3,]) # All 2

hist(sim[4,], xlim=c(10/9,2), breaks=seq(10/9, 2, by=1/18), freq=FALSE,
     col=gray(.9), xlab="Value",
     main=expression(hat(beta)[group("", list(Y, X+Z), "")]))
abline(v=c(10/9, 2), col="Red", lwd=2, lty=2)