# Linear Regression – Simple Linear Regressions Among Three Pairs of Variables


Let the "ordinary-least-squares regression of $$Y$$ on $$X$$" be given by
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\text{.}$$
Suppose I run the following:

• The OLS regression of $$Y$$ on $$X$$
• The OLS regression of $$Y$$ on $$Z$$
• The OLS regression of $$X$$ on $$Z$$

and in all three cases the slope coefficient is $$\hat{\beta}_1 = 2$$.

I am interested in the slope coefficient of $$Y$$ on $$X + Z$$. How does this compare to the value $$2$$ (i.e., less than, equal to, greater than, or impossible to know)?

Attempt. Let $$\hat{\beta}_{Y, X}$$ be $$\hat{\beta}_1$$ in the case of the OLS regression of $$Y$$ on $$X$$, and similarly for the other three cases. Then we know that
$$\hat{\beta}_{Y, X} = \dfrac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$
hence
$$\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = 2\text{.}$$
We also know that regression coefficients are unaffected by centering, so without loss of generality, I assume that all variables are centered and that $$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(z_i - \bar{z})^2 = 1$$, leading to
$$\sum_{i=1}^{n}x_iy_i = \sum_{i=1}^{n}z_iy_i = \sum_{i=1}^{n}x_iz_i = 2\text{.}$$
Thus, by linearity of summation, we have
$$\hat{\beta}_{Y, X+Z} = \dfrac{\sum_{i=1}^{n}y_i(x_i + z_i)}{\sum_{i=1}^{n}(x_i + z_i)^2} = \dfrac{4}{\sum_{i=1}^{n}(x_i^2 + 2x_iz_i + z_i^2)}\text{.}$$
Here's where I'm stuck. Can we do something clever with the above quantity?

I am told the answer is that $$\hat{\beta}_{Y, X+Z}< 2$$.

First, it is possible for these conditions to hold simultaneously, as I will show.

Second, the slope of the regression of $$Y$$ on $$X+Z$$ must lie in the open interval $$(10/9,2)$$ and can attain any value in that interval.

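Vector notation is particularly convenient here: treat each centered variable as a vector in $$\mathbb{R}^n,$$ so that every slope is a ratio of dot products.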

The given information tells us (in the order given in the question) that

1. $$Y = 2X + E$$ where $$E$$ is orthogonal to $$X.$$
2. $$Y = 2Z + \beta W + F$$ where $$F$$ is orthogonal to both $$Z$$ and $$W$$ (with $$W$$ as defined in item 3; $$\beta$$ is at this point unknown).
3. $$X = 2Z + W$$ where $$W$$ is a zero-sum vector orthogonal to $$Z.$$

Implicitly, not all of $$X,Y,$$ and $$Z$$ are zero, for otherwise there's nothing of interest: we would just be saying, three times over, that $$2$$ times the zero vector is zero; and then there would be no bounds on the regression of $$Y$$ against $$X+Z.$$ Consequently, all of these vectors must be nonzero.

To find $$\beta,$$ use $$(2)$$ and $$(3)$$ to regress $$Y$$ on $$X$$ as

$$2 = \hat\beta_{Y;X} = \frac{Y\cdot X}{||X||^2} = \frac{4||Z||^2 + \beta||W||^2}{4||Z||^2 + ||W||^2}.$$
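
To see where the numerator and denominator come from, expand the dot products using $$(2)$$ and $$(3)$$. Every cross term vanishes because $$Z,$$ $$W,$$ and $$F$$ are mutually orthogonal:

$$Y\cdot X = (2Z + \beta W + F)\cdot(2Z + W) = 4||Z||^2 + \beta||W||^2, \qquad ||X||^2 = (2Z + W)\cdot(2Z + W) = 4||Z||^2 + ||W||^2.$$

Clearing the denominator then gives $$8||Z||^2 + 2||W||^2 = 4||Z||^2 + \beta||W||^2.$$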

When $$W\ne 0,$$ the unique solution is

$$\beta = 2 + \frac{4||Z||^2}{||W||^2}.$$

(When $$W=0$$ the equation reads $$2=1,$$ which has no solutions.)
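
As a quick sanity check that such configurations exist, here is a minimal sketch in R with hand-picked vectors in $$\mathbb{R}^4.$$ The particular vectors and the helper name `slope` are illustrative assumptions, not part of the analysis above. The sketch also includes a nonzero $$F$$ to show that it does not affect any of the slopes.

```r
# Helper (hypothetical name): OLS slope of y on x, with an intercept
slope <- function(y, x) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

# Hand-picked zero-sum, mutually orthogonal vectors (illustrative choices)
z <- c(1, -1, 0, 0) / sqrt(2)  # ||z||^2 = 1
w <- c(1, 1, -2, 0)            # ||w||^2 = 6
f <- c(1, 1, 1, -3)            # orthogonal to both z and w

beta <- 2 + 4 * sum(z^2) / sum(w^2)  # the formula for beta derived above
x <- 2 * z + w
y <- 2 * z + beta * w + f

slopes <- c(
  y_on_x  = slope(y, x),
  y_on_z  = slope(y, z),
  x_on_z  = slope(x, z),
  y_on_xz = slope(y, x + z)
)
print(slopes)
```

Up to floating-point rounding, the first three slopes should equal $$2$$ and the last should equal $$22/15,$$ which lies inside $$(10/9, 2).$$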

We may now compute the slope of the regression of $$Y$$ against $$X+Z$$ as

$$\hat\beta_{Y;X+Z} = \frac{Y\cdot(X+Z)}{||X+Z||^2} = \frac{(2Z+\beta W + F)\cdot(3Z + W)}{||3Z + W||^2} = \frac{10||Z||^2 + 2||W||^2}{9||Z||^2 + ||W||^2}.$$
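
The last equality can be checked by expanding the numerator and substituting the value of $$\beta$$ found above (again, all cross terms vanish by orthogonality):

$$(2Z+\beta W + F)\cdot(3Z + W) = 6||Z||^2 + \beta||W||^2 = 6||Z||^2 + \left(2 + \frac{4||Z||^2}{||W||^2}\right)||W||^2 = 10||Z||^2 + 2||W||^2.$$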

The right-hand fraction is the slope of a ray emanating from the origin and passing through a point in the interior of the segment connecting the points $$(9,10)$$ and $$(1,2)$$ in the plane (with the weights given by the relative values of $$||Z||^2$$ and $$||W||^2$$). This makes it obvious that the bounds are $$10/9$$ and $$2/1,$$ and they cannot be attained because both weights are nonzero, QED.
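
The bounds can also be seen numerically. Writing $$t = ||W||^2/||Z||^2 > 0$$ (a symbol introduced here only for illustration), the slope becomes $$(10 + 2t)/(9 + t) = 2 - 8/(9 + t),$$ which increases from $$10/9$$ toward $$2$$ as $$t$$ grows. A minimal sketch in R, with the function name `slope_of_t` being a hypothetical label:

```r
# Slope of Y on X+Z as a function of t = ||W||^2 / ||Z||^2
# (t and slope_of_t are labels introduced for this illustration)
slope_of_t <- function(t) (10 + 2 * t) / (9 + t)

# From W tiny relative to Z (t near 0) to W dominant (t large)
t_values <- 10^seq(-6, 6, by = 2)
s <- slope_of_t(t_values)
print(cbind(t = t_values, slope = s))

# Every value stays strictly inside (10/9, 2) and grows with t
stopifnot(all(s > 10 / 9), all(s < 2), all(diff(s) > 0))
```

Small $$t$$ pushes the slope toward $$10/9$$ and large $$t$$ pushes it toward $$2,$$ but neither endpoint is reached for any finite positive $$t.$$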

Here is a histogram of the regression coefficients for a thousand simulated configurations (in $$\mathbb{R}^{10}$$):

The red vertical lines mark the bounds. The simulation supports these by exhibiting a full range of values filling the interval between the bounds but never extending beyond them.

The R code shows how the foregoing analysis was implemented. (It includes a post-simulation check that all the given regression coefficients equal $$2,$$ as intended.)

    z <- scale(c(1, rep(0, 9)))
    z <- z / sqrt(length(z) - 1) # Must be zero mean, unit norm

    set.seed(17)
    sim <- replicate(1e3, {
      w <- rexp(length(z))
      w <- residuals(lm(w ~ z))       # Make w zero-sum and orthogonal to z
      w2 <- sum(w * w)                # ||w||^2
      y <- 2 * z + (4 / w2 + 2) * w   # beta = 2 + 4 / ||w||^2 since ||z|| = 1; F = 0
      x <- 2 * z + w

      a <- coefficients(lm(y ~ x + 0))        # Slope of Y on X
      b <- coefficients(lm(y ~ z + 0))        # Slope of Y on Z
      c. <- coefficients(lm(x ~ z + 0))       # Slope of X on Z
      d <- coefficients(lm(y ~ I(x + z) + 0)) # Slope of Y on X + Z
      c(a, b, c., d)
    })
    table(sim[1:3, ]) # All 2

    hist(sim[4, ], xlim = c(10/9, 2), breaks = seq(10/9, 2, by = 1/18), freq = FALSE,
         col = gray(.9), xlab = "Value",
         main = expression(hat(beta)[group("", list(Y, X+Z), "")]))
    abline(v = c(10/9, 2), col = "Red", lwd = 2, lty = 2) # Mark the bounds