Let the "ordinary-least-squares regression of $Y$ on $X$" be given by

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\text{.}$$

Suppose I run the following:

- The OLS regression of $Y$ on $X$
- The OLS regression of $Y$ on $Z$
- The OLS regression of $X$ on $Z$

and in all three cases, the slope coefficient is $\hat{\beta}_1 = 2$.

I am interested in the slope coefficient of $Y$ on $X + Z$. How does this compare to the value $2$ (i.e., less than, equal to, greater than, or impossible to know)?

**Attempt**. Let $\hat{\beta}_{Y, X}$ be $\hat{\beta}_1$ in the case of the OLS regression of $Y$ on $X$, and similarly for the other three cases. Then we know that

$$\hat{\beta}_{Y, X} = \dfrac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

hence

$$\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^{n}(z_i - \bar{z})^2} = 2\text{.}$$

We also know that slope coefficients are unaffected by centering, so without loss of generality, I assume that all variables are centered and that $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(z_i - \bar{z})^2 = 1$, leading to

$$\sum_{i=1}^{n}x_iy_i = \sum_{i=1}^{n}x_iz_i = 2\text{.}$$

Thus, by linearity of the arithmetic mean, we have

$$\hat{\beta}_{Y, X+Z} = \dfrac{\sum_{i=1}^{n}y_i(x_i + z_i)}{\sum_{i=1}^{n}(x_i + z_i)^2} = \dfrac{4}{\sum_{i=1}^{n}(x_i^2 + 2x_iz_i + z_i^2)}\text{.}$$

Here's where I'm stuck. Can we do something clever with the above quantity?

I am told the answer is that $\hat{\beta}_{Y, X+Z}< 2$.

## Best Answer

First, it is possible for these conditions to hold simultaneously, as I will show. Second, the regression of $Y$ on $X+Z$ must lie in the open interval $(10/9, 2)$, and it can attain any value in that interval.

Vector notation is particularly convenient here. The given information tells us (in the order given in the question) that

$$Y = 2X + U, \quad U \cdot X = 0; \tag{1}$$

$$Y = 2Z + V, \quad V \cdot Z = 0; \tag{2}$$

$$X = 2Z + W, \quad W \cdot Z = 0; \tag{3}$$

where $U$, $V$, and $W$ are the residual vectors of the three regressions.

Implicitly, not all of $X$, $Y$, and $Z$ are zero, for otherwise there's nothing of interest: we would just be saying, three times over, that $2$ times the zero vector is zero, and then there would be no bounds on the regression of $Y$ against $X+Z$. Consequently, all of these vectors must be nonzero.

Decompose the residual in $(2)$ as $V = \beta W + F$, where $F$ is orthogonal to both $Z$ and $W$, so that $Y = 2Z + \beta W + F$. To find $\beta$, use $(2)$ in this form to regress $Y$ on $X = 2Z + W$:

$$2 = \hat\beta_{Y;X} = \frac{Y\cdot X}{||X||^2} = \frac{4||Z||^2 + \beta||W||^2}{4||Z||^2 + ||W||^2}.$$

When $W \ne 0$, the unique solution is

$$\beta = 2 + \frac{4||Z||^2}{||W||^2}.$$

(When $W = 0$, the equation reads $2 = 1$, which has no solutions.)
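For completeness, the algebra behind that solution: clearing the denominator and collecting the $\beta$ term gives

$$2\left(4||Z||^2 + ||W||^2\right) = 4||Z||^2 + \beta||W||^2 \quad\Longrightarrow\quad \beta||W||^2 = 4||Z||^2 + 2||W||^2 \quad\Longrightarrow\quad \beta = 2 + \frac{4||Z||^2}{||W||^2}.$$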

We may now compute the regression of $Y$ against $X+Z$ as

$$\hat\beta_{Y;X+Z} = \frac{Y\cdot(X+Z)}{||X+Z||^2} = \frac{(2Z+\beta W + F)\cdot(3Z + W)}{||3Z + W||^2} = \frac{6||Z||^2 + \beta||W||^2}{9||Z||^2 + ||W||^2} = \frac{10||Z||^2 + 2||W||^2}{9||Z||^2 + ||W||^2}.$$

The right-hand fraction is the slope of a ray emanating from the origin and passing through a point in the interior of the segment connecting $(9, 10)$ and $(1, 2)$ in the plane (namely, their combination with weights $||Z||^2$ and $||W||^2$), making it obvious that the bounds are $10/9$ and $2/1$; they cannot be attained because both weights are strictly positive. *QED.*

Here is a histogram of the regression coefficients for a thousand simulated configurations (in $\mathbb{R}^{10}$):

The red vertical lines mark the bounds. The simulation supports them by exhibiting values throughout the interval but never beyond it.
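Algebraically (a rephrasing of the ray argument, not in the original derivation): setting $t = ||Z||^2/(||Z||^2 + ||W||^2) \in (0, 1)$ and dividing numerator and denominator by $||Z||^2 + ||W||^2$ gives

$$\hat\beta_{Y;X+Z} = \frac{10t + 2(1-t)}{9t + (1-t)} = \frac{2 + 8t}{1 + 8t},$$

a strictly decreasing function of $t$ mapping $(0, 1)$ onto the open interval $(10/9, 2)$. This confirms both the bounds and that every value strictly between them is attained by a suitable choice of $||Z||^2/||W||^2$.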

The `R` code shows how the foregoing analysis was implemented. (It includes a post-simulation check that all the given regression coefficients equal $2$, as intended.)
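The original `R` listing is not reproduced in the text above. As a stand-in, here is a minimal Python sketch (my reconstruction, with an arbitrary seed and helper names of my own choosing) that builds one configuration in $\mathbb{R}^{10}$ exactly as in the vector argument and performs the analogous check:

```python
# Reconstruction of the simulation described in the answer (not the
# author's original R code): build Z, W, F with the stated orthogonality,
# form X and Y per equations (2) and (3), and check the slopes.
import numpy as np

rng = np.random.default_rng(2024)

def remove_projections(v, *others):
    """Subtract from v its orthogonal projection onto each vector in others."""
    for u in others:
        v = v - (v @ u) / (u @ u) * u
    return v

def slope(y, x):
    """No-intercept OLS slope of y on x: (y . x) / ||x||^2."""
    return (y @ x) / (x @ x)

n = 10
Z = rng.normal(size=n)
W = remove_projections(rng.normal(size=n), Z)     # W is orthogonal to Z
F = remove_projections(rng.normal(size=n), Z, W)  # F is orthogonal to Z and W

beta = 2 + 4 * (Z @ Z) / (W @ W)  # the unique solution derived above
X = 2 * Z + W                     # equation (3): slope of X on Z is 2
Y = 2 * Z + beta * W + F          # equation (2) with V = beta*W + F

# Post-simulation check: the three given slopes all equal 2 ...
print(slope(Y, X), slope(Y, Z), slope(X, Z))
# ... and the slope of Y on X+Z lies strictly between 10/9 and 2.
print(slope(Y, X + Z))
```

Repeating this for many random configurations, with varying ratios $||Z||^2/||W||^2$, yields slopes filling the interval $(10/9, 2)$, as in the histogram described above.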