Solved – Does adding more variables to a multivariable regression change the coefficients of the existing variables?

Tags: multiple-regression, multivariable, regression

Say I have a multivariable regression (i.e., one with several independent variables) that consists of 3 variables, each with its own estimated coefficient. If I introduce a 4th variable and rerun the regression, will the coefficients of the 3 original variables change?

More broadly: in a multivariable (multiple independent variables) regression, is the coefficient of a given variable influenced by the coefficient of another variable?

Best Answer

A parameter estimate in a regression model (e.g., $\hat\beta_i$) will change if a variable, $X_j$, is added to the model that is:

  1. correlated with that parameter's corresponding variable, $X_i$ (which was already in the model), and
  2. correlated with the response variable, $Y$

An estimated beta will not change when a new variable is added if either of the above sample correlations is exactly $0$. Note that whether the variables are uncorrelated in the population (i.e., $\rho_{(X_i, X_j)}=0$, or $\rho_{(X_j, Y)}=0$) is irrelevant; what matters is whether both sample correlations are exactly $0$. That will essentially never happen in practice unless you are working with experimental data where the variables were manipulated so that they are uncorrelated by design.
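To make this concrete, here is a small pure-Python sketch (the data are made up for illustration) that fits OLS via the normal equations. Adding a predictor whose sample correlation with $X_i$ is exactly zero leaves $X_i$'s coefficient untouched, while adding a correlated predictor changes it:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(predictors, y):
    """OLS coefficients [intercept, b1, b2, ...] from the normal equations X'X b = X'y."""
    X = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

x1 = [1, 2, 3, 4, 5]
x2 = [1, -1, 0, -1, 1]        # sample correlation with x1 is exactly 0
x3 = [0, 1, 1, 3, 5]          # sample correlation with x1 is nonzero

# Case 1: y depends on x1 and x2, but x2 is orthogonal to x1 in this sample.
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
b_alone = ols([x1], y)        # slope on x1 by itself
b_with2 = ols([x1, x2], y)    # slope on x1 after adding the uncorrelated x2
print(round(b_alone[1], 6), round(b_with2[1], 6))   # 2.0 2.0 -- unchanged

# Case 2: y depends on x1 and x3, and x3 is correlated with x1.
y2 = [2 * a + 3 * c for a, c in zip(x1, x3)]
b2_alone = ols([x1], y2)
b2_with3 = ols([x1, x3], y2)  # slope on x1 after adding the correlated x3
print(round(b2_alone[1], 6), round(b2_with3[1], 6)) # 5.6 2.0 -- it changed
```

The `solve` and `ols` helpers are just minimal stand-ins for what statistical software does; the point is the two printed comparisons.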

Note also that the amount the parameters change may not be terribly meaningful (that depends, at least in part, on your theory). Moreover, the amount they can change is a function of the magnitudes of the two correlations above.

On a different note, it is not really correct to think of this phenomenon as "the coefficient of a given variable [being] influenced by the coefficient of another variable". It isn't the betas that are influencing each other. This phenomenon is a natural result of the algorithm that statistical software uses to estimate the slope parameters. Imagine a situation where $Y$ is caused by both $X_i$ and $X_j$, which in turn are correlated with each other. If only $X_i$ is in the model, some of the variation in $Y$ that is due to $X_j$ will be inappropriately attributed to $X_i$. This means that the estimated coefficient of $X_i$ is biased; this is called omitted variable bias.
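The misattribution described above can be written down exactly. If the true model is $Y = \beta_1 X_1 + \beta_2 X_2$ and $X_2$ is omitted, the simple-regression slope of $Y$ on $X_1$ comes out as $\beta_1 + \beta_2 \, \mathrm{cov}(X_1, X_2)/\mathrm{var}(X_1)$. A short sketch with made-up data (no noise, so the match is exact):

```python
def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

x1 = [1, 2, 3, 4, 5]
x2 = [0, 1, 1, 3, 5]                      # the omitted variable, correlated with x1
b1, b2 = 2.0, 3.0                         # true coefficients
y = [b1 * a + b2 * b for a, b in zip(x1, x2)]

slope = cov(x1, y) / cov(x1, x1)          # simple-regression slope of y on x1
bias = b2 * cov(x1, x2) / cov(x1, x1)     # x2's variation attributed to x1
print(round(slope, 6), round(b1 + bias, 6))   # 5.6 5.6
```

The fitted slope ($5.6$) is the true $\beta_1 = 2$ plus the bias term ($3.6$), which also shows why the size of the change depends on the magnitudes of the two correlations mentioned earlier.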