Regression Analysis – Change of Coefficient Value When Including a New Variable in Multiple Regression

effect-size, multiple regression, regression, regression coefficients

I have quite a basic question.

It seems quite obvious that when a new variable is included in a multiple regression, the coefficients of the independent variables already in the regression can change, so that the regression model fits the data better.

What's bothering me is the usual interpretation of a coefficient: "the effect of X on Y when the other independent variables are held constant".

My questions are:

  1. Why doesn't the effect of the old variables remain the same when a new variable is included?
  2. How does the model decide the effect (coefficient) of each independent variable? (I've learned the normal equations and least squares, but I'm hoping there is a more intuitive answer.)

Best Answer

Why doesn't the effect of the old variables remain the same when a new variable is included?

Reading your question again, I believe that you are referring to the earlier point, "the effect of X on Y when the other independent variables are held constant", since you mentioned that you already understood the change in coefficients.

So, what is described here is the way to interpret the effect of a specific independent variable on the dependent variable. It does not mean that the coefficients of the other variables remain the same when a new variable is included. For example, let's assume that you are studying real estate prices in a given city based on the number of rooms and the living surface. Your model tells you that:

price = 50000 + 10000 * room + 1000 * living_surface

Then, a way to interpret the effect of the number of rooms is that the price increases by 10000 for each extra room (while living_surface is held constant). In the same way, the price increases by 1000 for each additional square meter of living surface (while room is held constant).
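To make the coefficient change concrete, here is a minimal sketch in Python (NumPy only, with made-up simulated data that mirrors the example above; the names room, living_surface and the coefficients 50000, 10000 and 1000 are just this answer's illustration, not real data). Because room and living_surface are simulated to be correlated, the coefficient on room looks very different depending on whether living_surface is in the model:

```python
# Minimal sketch: why an existing coefficient changes when a correlated
# variable is added. Simulated data only; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500

room = rng.integers(1, 6, size=n).astype(float)
# Living surface is generated to be correlated with the number of rooms.
living_surface = 20 * room + rng.normal(0, 10, size=n)
# Data-generating process mirrors the example: 50000 + 10000*room + 1000*living_surface
price = 50000 + 10000 * room + 1000 * living_surface + rng.normal(0, 5000, size=n)

def ols(X, y):
    """Least-squares coefficients for X, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

print("price ~ room:                 ", ols(room, price))
print("price ~ room + living_surface:", ols(np.column_stack([room, living_surface]), price))
# In the first model the 'room' coefficient is around 30000: it also absorbs the
# effect of the omitted, correlated living_surface. Once living_surface is added,
# the 'room' coefficient drops back towards 10000.
```

That drop in the room coefficient once living_surface is added is exactly the change in the "old" coefficients that your first question is about.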

How does the model decide the effect (coefficient) of each independent variable? (I've learned the normal equations and least squares, but I'm hoping there is a more intuitive answer.)

In a nutshell, the goal of the regression is to minimize the residuals (i.e. the distances between the observed values $y_i$ and the fitted values $\hat y_i$). Because residuals can be positive or negative (and might cancel each other out), OLS squares them (to get only positive values) and sums them: $\sum_i (y_i - \hat y_i)^2$. OLS then identifies the line that minimizes this quantity.
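If it helps to see that idea in code, here is a small sketch (again NumPy, with toy simulated data and purely illustrative names x1, x2): the coefficients come from solving the normal equations $(X^\top X)\beta = X^\top y$, and any other coefficient vector gives a larger sum of squared residuals:

```python
# Minimal sketch: OLS coefficients minimize the sum of squared residuals.
# Toy simulated data; variable names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: (X'X) b = X'y

def ssr(beta):
    """Sum of squared residuals for a given coefficient vector."""
    residuals = y - X @ beta
    return np.sum(residuals ** 2)

print("OLS coefficients:", beta_hat)
print("SSR at OLS solution:", ssr(beta_hat))
# Perturbing the OLS coefficients in any direction only increases the SSR:
for delta in ([0.1, 0, 0], [0, -0.2, 0], [0, 0, 0.3]):
    print("SSR after perturbing:", ssr(beta_hat + np.array(delta)))
```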

Even if it does not directly address your question, you might be interested in this post.
