Let the "ordinary-least-squares regression of $Y$ on $X$" be given by
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\text{.}$$
Suppose I run the following:
- The OLS regression of $Y$ on $X$
- The OLS regression of $Y$ on $Z$
- The OLS regression of $X$ on $Z$
and in all three cases the slope coefficient is $\hat{\beta}_1 = 2$.
I am interested in the slope coefficient of $Y$ on $X + Z$. How does this compare to the value $2$ (i.e., less than, equal to, greater than, or impossible to know)?
Attempt. Let $\hat{\beta}_{Y, X}$ be $\hat{\beta}_1$ in the case of the OLS regression of $Y$ on $X$, and similarly for the other regressions. Then we know that
$$\hat{\beta}_{Y, X} = \dfrac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$
hence
$$\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = 2\text{.}$$
We also know that regression coefficients are unaffected by centering, so without loss of generality, I assume that all variables are centered and that $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(z_i - \bar{z})^2 = 1$, leading to
$$\sum_{i=1}^{n}x_iy_i = \sum_{i=1}^{n}z_iy_i = \sum_{i=1}^{n}x_iz_i = 2\text{.}$$
Thus, by linearity of summation, we have
$$\hat{\beta}_{Y, X+Z} = \dfrac{\sum_{i=1}^{n}y_i(x_i + z_i)}{\sum_{i=1}^{n}(x_i + z_i)^2} = \dfrac{4}{\sum_{i=1}^{n}(x_i^2 + 2x_iz_i + z_i^2)}\text{.}$$
Here's where I'm stuck. Can we do something clever with the above quantity?
I am told the answer is that $\hat{\beta}_{Y, X+Z}< 2$.
Best Answer
First, it is possible for these conditions simultaneously to hold, as I will show.
Second, the slope of the regression of $Y$ on $X+Z$ must lie in the open interval $(10/9,2)$ and can attain any value in that interval.
Vector notation is particularly convenient here.
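With all three variables centered (which does not change any slope), the given information tells us (in the order given in the question) that
$$Y\cdot X = 2||X||^2,\quad Y\cdot Z = 2||Z||^2,\quad X\cdot Z = 2||Z||^2.$$
The third equation says that $X - 2Z$ is orthogonal to $Z,$ so we may write
$$X = 2Z + W,\quad W\cdot Z = 0. \tag{1}$$
Likewise, the second says that $Y - 2Z$ is orthogonal to $Z.$ Splitting that difference into its component along $W$ and a remainder $F$ gives, for some number $\beta,$
$$Y = 2Z + \beta W + F,\quad F\cdot Z = F\cdot W = 0. \tag{2}$$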
Implicitly, not all of $X,Y,$ and $Z$ are zero, for otherwise there's nothing of interest: we would just be saying, three times over, that $2$ times the zero vector is zero; and then there would be no bounds on the regression of $Y$ against $X+Z.$ Consequently, all of these vectors must be nonzero.
To find $\beta,$ use $(2)$ to regress $Y$ on $X$ as
$$2 = \hat\beta_{Y;X} = \frac{Y\cdot X}{||X||^2} = \frac{4||Z||^2 + \beta||W||^2}{4||Z||^2 + ||W||^2}.$$
When $W\ne 0,$ the unique solution is
$$\beta = 2 + \frac{4||Z||^2}{||W||^2}.$$
(When $W=0$ the equation reads $2=1,$ which has no solutions.)
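To make the existence claim concrete, here is a small check in exact arithmetic. It is only a sketch: the particular vectors `Z`, `W`, and `F` are hypothetical choices in $\mathbb{R}^3$ (centered, with `W` orthogonal to `Z`), and the helper names `dot` and `slope` are introduced here for illustration. It builds $X = 2Z + W$ and $Y = 2Z + \beta W + F$ with the value of $\beta$ just derived, then confirms that all three given slopes equal $2.$

```python
from fractions import Fraction

def dot(u, v):
    """Exact dot product of two equal-length vectors."""
    return sum(Fraction(a) * Fraction(b) for a, b in zip(u, v))

def slope(y, x):
    """OLS slope of y on x, valid because the vectors are already centered."""
    return dot(y, x) / dot(x, x)

# Hypothetical centered vectors (each sums to zero); W is orthogonal to Z.
Z = [3, -3, 0]
W = [3, 3, -6]
F = [0, 0, 0]  # in R^3 no nonzero centered vector is orthogonal to both Z and W

# beta = 2 + 4||Z||^2 / ||W||^2, the solution derived above.
beta = 2 + 4 * dot(Z, Z) / dot(W, W)

X = [2 * z + w for z, w in zip(Z, W)]
Y = [2 * z + beta * w + f for z, w, f in zip(Z, W, F)]

print(beta)                                    # 10/3
print(slope(Y, X), slope(Y, Z), slope(X, Z))   # 2 2 2
```

Using `Fraction` avoids any floating-point doubt about whether the slopes are exactly $2.$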
We may now compute the regression of $Y$ against $X+Z$ as
$$\hat\beta_{Y;X+Z} = \frac{Y\cdot(X+Z)}{||X+Z||^2} = \frac{(2Z+\beta W + F)\cdot(3Z + W)}{||3Z + W||^2} = \frac{10||Z||^2 + 2||W||^2}{9||Z||^2 + ||W||^2}.$$
The right-hand fraction is the slope of a ray emanating from the origin and passing through some point in the interior of the segment connecting the points $(9,10)$ and $(1,2)$ in the plane (with the weights given by the relative values of $||Z||^2$ and $||W||^2$), making it obvious that the bounds are $10/9$ and $2/1.$ They cannot be attained because both weights are nonzero, QED.
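The same bounds can also be checked by direct algebra, for readers who prefer that to the geometric picture. Writing $a = ||Z||^2 > 0$ and $b = ||W||^2 > 0$ (shorthand introduced here for illustration), the fraction can be rearranged in two ways:

$$\frac{10a + 2b}{9a + b} = 2 - \frac{8a}{9a + b} = \frac{10}{9} + \frac{8b}{9(9a + b)}.$$

The term subtracted from $2$ and the term added to $10/9$ are both strictly positive, so the slope lies strictly between $10/9$ and $2.$ It approaches $2$ as $a/b \to 0$ and approaches $10/9$ as $a/b \to \infty,$ so every value in between is attained for some ratio.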
Here is a histogram of the regression coefficients for a thousand simulated configurations (in $\mathbb{R}^{10}$):
The red vertical lines mark the bounds. The simulation supports these by exhibiting a full range of values filling these bounds but never extending beyond them.
The `R` code shows how the foregoing analysis was implemented. (It includes a post-simulation check that all the given regression coefficients equal $2,$ as intended.)
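As an independent illustration (this is not the author's `R` code), a comparable simulation might look like the Python sketch below. It assumes the decomposition $X = 2Z + W,$ $Y = 2Z + \beta W + F$ used in the analysis. The function names, the Gaussian draws, and the seed are all choices made here for illustration. It uses only the standard library.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    """Subtract the mean so the vector sums to zero."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def remove(v, basis):
    """Subtract from v its projections onto mutually orthogonal vectors."""
    for b in basis:
        c = dot(v, b) / dot(b, b)
        v = [a - c * bi for a, bi in zip(v, b)]
    return v

def slope(y, x):
    """OLS slope of y on x for centered vectors."""
    return dot(y, x) / dot(x, x)

def simulate(n=10, rng=random):
    """Draw one configuration and return the four slopes."""
    Z = center([rng.gauss(0, 1) for _ in range(n)])
    W = remove(center([rng.gauss(0, 1) for _ in range(n)]), [Z])     # W orthogonal to Z
    F = remove(center([rng.gauss(0, 1) for _ in range(n)]), [Z, W])  # F orthogonal to Z and W
    beta = 2 + 4 * dot(Z, Z) / dot(W, W)
    X = [2 * z + w for z, w in zip(Z, W)]
    Y = [2 * z + beta * w + f for z, w, f in zip(Z, W, F)]
    S = [x + z for x, z in zip(X, Z)]                                # the regressor X + Z
    return slope(Y, X), slope(Y, Z), slope(X, Z), slope(Y, S)

rng = random.Random(1)  # arbitrary seed, for reproducibility only
results = [simulate(10, rng) for _ in range(1000)]

# Post-simulation check: the three given slopes should all be 2 up to rounding.
worst = max(abs(s - 2) for r in results for s in r[:3])
lo = min(r[3] for r in results)
hi = max(r[3] for r in results)
print(worst, lo, hi)
```

The printed `lo` and `hi` are the smallest and largest simulated slopes of $Y$ on $X+Z,$ which should fall strictly inside $(10/9, 2)$ if the analysis holds.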