Solved – Costs and benefits of adding more variables to multiple regression

Tags: linear model, multiple regression, regression

What are the costs and the benefits of adding more variables to multiple regression?

Adding a relevant variable can prevent bias in the estimates of the other regression coefficients, but it can also increase their variance.

Adding an irrelevant variable can increase the variance of the estimates of the other regression coefficients and has no benefit.

Are these statements correct, and what are the other pros and cons of adding more variables?

Best Answer

Both statements are true.

For the relevant-variable case, I suggest Wikipedia's article for both the intuition and the algebra behind it.

As for adding what you called an irrelevant variable (one whose true coefficient is 0): suppose your relevant variables are in a matrix $X$ and you consider adding another variable $u$, so the new design matrix is $X_{+} = \begin{bmatrix}X &u\end{bmatrix}$. Skipping some steps, a few linear algebra properties give

$$ (X_+^TX_+) = \begin{bmatrix} X^TX & X^Tu\\ u^TX & u^Tu \end{bmatrix}, $$

$$(X_+^TX_+)^{-1} = \begin{bmatrix} A_{11} & A_{12}^T\\ A_{12} & A_{22} \end{bmatrix}. $$

In particular, we're interested in how the variances of the estimates for the important variables (the ones in $X$) behave, which means we want to look at the diagonal entries of $A_{11}$. Using the block matrix inversion formula we get

$$A_{11} = (X^TX)^{-1} + (X^TX)^{-1}X^Tu\,\left(u^Tu - u^TX(X^TX)^{-1}X^Tu\right)^{-1}u^TX(X^TX)^{-1}.$$

The second term has non-negative diagonal entries (in fact they are zero exactly when $u$ is orthogonal to the columns of $X$, i.e. $X^Tu = 0$), so the variances of the original coefficient estimates can only stay the same or increase.
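To see this numerically, here is a small simulation sketch (the data-generating process, variable names and settings are my own, not from the answer): the sampling variability of the coefficient on a relevant predictor grows once an irrelevant but correlated predictor is added.

```python
# Sketch: compare the sampling sd of beta1_hat with and without an irrelevant predictor u.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
beta0, beta1 = 1.0, 2.0  # true coefficients; u has true coefficient 0

est_without_u, est_with_u = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    u = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # irrelevant but correlated with x1
    y = beta0 + beta1 * x1 + rng.normal(size=n)

    X_small = np.column_stack([np.ones(n), x1])
    X_big = np.column_stack([np.ones(n), x1, u])
    est_without_u.append(np.linalg.lstsq(X_small, y, rcond=None)[0][1])
    est_with_u.append(np.linalg.lstsq(X_big, y, rcond=None)[0][1])

print("sd of beta1_hat without u:", np.std(est_without_u))  # smaller
print("sd of beta1_hat with u:   ", np.std(est_with_u))     # larger (variance inflation)
```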

To get some intuition on how this works, consider the simple linear regression case

$$Y_i = \beta_0 + \beta_1x_i + \epsilon_i, $$ with $\beta_1 = 0$ ($x$ has no effect on the expected value of $Y$).

When $x$ is included in the model, $\hat{\beta_0} = \bar{Y} - \hat\beta_1 \bar x$, while in the "true" model (without $x$) we would simply have $\hat\beta_0 = \bar Y$. Analogous to the algebra above, and using the fact that $\bar Y$ and $\hat\beta_1$ are uncorrelated under the standard assumptions, we have (with $x$ in the model)

$$ Var(\hat \beta_0) = Var(\bar Y) + \bar x^2 Var(\hat \beta_1) \geq Var(\bar Y), $$

with equality if $\bar x = 0$ (the orthogonality condition mentioned above, since $\bar x = 0$ means $x$ is orthogonal to the intercept column).
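A quick numerical check of this identity (again, the setup here is my own sketch): with $\beta_1 = 0$ and $\bar x \neq 0$, including $x$ inflates $Var(\hat\beta_0)$ by $\bar x^2\,Var(\hat\beta_1)$ relative to just using $\bar Y$.

```python
# Sketch: Var(beta0_hat) including an irrelevant x vs. Var(Ybar) in the true model.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 20000
x = rng.normal(loc=3.0, size=n)  # fixed design with nonzero mean, so xbar != 0

b0_with_x, b0_without_x = [], []
for _ in range(reps):
    y = 1.0 + rng.normal(size=n)           # true model: beta0 = 1, beta1 = 0
    X = np.column_stack([np.ones(n), x])
    b0_with_x.append(np.linalg.lstsq(X, y, rcond=None)[0][0])
    b0_without_x.append(y.mean())           # "true" model estimate: beta0_hat = Ybar

print("Var(beta0_hat) including x:", np.var(b0_with_x))    # larger
print("Var(Ybar):                 ", np.var(b0_without_x)) # sigma^2 / n, smaller
```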

In summary, adding a variable (provided it is not orthogonal to the existing ones) to a linear regression model increases the variances of the coefficient estimates; if the added variable is relevant, this cost is offset by a reduction in bias, while if it is irrelevant you pay the variance cost with no benefit. Since you never know in advance which variables are truly relevant, you need to balance this bias-variance trade-off.

There are many methods proposed for variable selection; one well-known example is the LASSO.
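For completeness, here is a minimal LASSO sketch using scikit-learn (the simulated data and settings are my own illustration, not part of the answer): the $\ell_1$ penalty shrinks the coefficients of irrelevant predictors toward zero, trading a little bias for a large variance reduction.

```python
# Sketch: cross-validated LASSO on data where only the first two predictors matter.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # columns 2..9 are irrelevant

lasso = LassoCV(cv=5).fit(X, y)   # penalty strength chosen by cross-validation
print("estimated coefficients:", lasso.coef_)
# coefficients of the irrelevant columns are typically shrunk to (or near) zero
```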