Solved – How to perform multiple regression when one predictor is the sum of two other predictors

Tags: multiple-regression, regression

I have a query regarding multiple linear regression. Suppose I have three predictor variables with an exact linear relationship among them:

$c = a + b$

On inspection, I find that $a$, $b$, and $c$ are all significantly correlated with one another.

  • Is it appropriate to construct a multiple linear regression with $a$, $b$, and $c$ all included as predictors?
  • Or is it more appropriate to construct two linear regressions, one with $c$ alone and one with only $a$ and $b$, just to test which explains a greater proportion of variance: the whole or the sum of its parts?
  • If it is not appropriate, what is the best solution to this kind of problem?

Best Answer

If your independent variables (slightly more common jargon than "predictor variables") satisfy an exact linear relationship such as $c = a + b$, then you cannot estimate the regression meaningfully. In other words, the model is misspecified in the sense that it suffers from perfect multicollinearity, and statistical software will usually stop with an error message. Intuitively, there is no unique estimator because one regressor carries no variation independent of the others; in technical terms, the matrix $X^\top X$ is not invertible, and solving the normal equations amounts to dividing by zero.
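To see this concretely, here is a minimal numpy sketch (the variable names $a$, $b$, $c$ mirror the question; the data are simulated): with $c = a + b$, the design matrix loses a rank, so different coefficient vectors produce exactly the same fitted values and no unique least-squares solution exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b                                  # exact linear dependence

# Design matrix with intercept and all three predictors
X = np.column_stack([np.ones(n), a, b, c])
print(np.linalg.matrix_rank(X))            # 3, not 4: one column is redundant

# Two different coefficient vectors give identical fitted values,
# so the normal equations have no unique solution.
beta1 = np.array([1.0, 2.0, -1.0, 0.0])
beta2 = np.array([1.0, 1.0, -2.0, 1.0])    # weight shifted from a, b onto c
print(np.allclose(X @ beta1, X @ beta2))   # True
```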

If instead you have a strong but not exact correlation between, let's say, $a$ and $c$, then this is ordinary strong multicollinearity. You can still estimate the coefficients, but you need to take three things into account:

  1. the standard errors will be very high and the $t$-values very low: essentially, two variables are competing to explain the same dependent variable.
  2. the estimated coefficients will be very sensitive to outliers, so data contamination is a big issue here: a single outlier can change a coefficient dramatically.
  3. given that the data are not contaminated, you need to adjust your interpretation of the coefficients. In a multiple linear OLS regression, each coefficient indicates what happens to the dependent variable when that predictor changes and all other variables are held constant. Take for example: $$ \text{wage} = \text{constant} + \beta_1 \, \text{education} + \beta_2 \, \text{IQ} + u. $$ (Here IQ and education are expected to be strongly correlated.)

You may then be surprised to find that your coefficient $\beta_1$ turns out to be negative, even though theory says it should by all means be positive. This can happen precisely because education and intelligence are so strongly correlated: intelligence may have the greater effect on wages, and $\beta_1$, which measures the effect of education holding IQ constant, is adjusted downward by that effect. The estimate is correct, but its interpretation is now different.
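As a rough illustration of points 1 and 3, here is a small simulation sketch using statsmodels; all the numbers (sample size, coefficients, noise levels) are made up for the example. With education almost a linear function of IQ, the standard error on education's coefficient is badly inflated relative to its true value, and in small samples the estimate can even come out negative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
iq = rng.normal(100, 15, size=n)
educ = 0.12 * iq + rng.normal(0, 0.5, size=n)   # education nearly determined by IQ
wage = 5 + 0.10 * educ + 0.20 * iq + rng.normal(0, 2, size=n)

print(np.corrcoef(educ, iq)[0, 1])              # correlation around 0.96

X = sm.add_constant(np.column_stack([educ, iq]))
fit = sm.OLS(wage, X).fit()
print(fit.params)   # [const, beta_educ, beta_iq]; beta_educ can even be negative
print(fit.bse)      # note the large standard error on the education coefficient
```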

However, if you take the changed interpretation of the coefficients into account and your data are clean, OLS under strong multicollinearity still yields unbiased estimates.
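That unbiasedness claim is easy to check in a toy simulation (a numpy sketch with made-up coefficients): averaging the OLS estimates over many replications recovers the true values, even though each individual estimate of the collinear pair is very noisy.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = np.array([1.0, 2.0, -1.0])            # intercept, x1, x2
estimates = []
for _ in range(2000):
    x1 = rng.normal(size=80)
    x2 = 0.98 * x1 + 0.02 * rng.normal(size=80)   # near-collinear, not exact
    y = true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2 + rng.normal(size=80)
    X = np.column_stack([np.ones(80), x1, x2])
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

estimates = np.array(estimates)
print(estimates.mean(axis=0))   # close to (1, 2, -1): unbiased on average
print(estimates.std(axis=0))    # but the x1, x2 estimates have a huge spread
```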
