Solved – How to perform multiple regression when one predictor is the sum of two other predictors

Tags: multiple-regression, regression

I have a query regarding multiple linear regression. Suppose I have three predictor variables with an exact linear relationship among them:

$c = a + b$

On inspection, I find that $a$, $b$, and $c$ are all significantly correlated with one another.

  • Is it appropriate to construct a multiple linear regression with $a$, $b$, and $c$ all included as predictors?
  • Or is it more appropriate to construct two linear regressions, one with $c$ alone and one with only $a$ and $b$, just to test which explains a greater proportion of variance: the whole or the sum of its parts?
  • If it is not appropriate, what is the best solution to this kind of problem?

Best Answer

If your independent variables (slightly more common jargon than "predictor variables") satisfy an exact linear relationship such as $c = a + b$, then you cannot estimate the regression meaningfully. In other words, the model is misspecified in the sense that it suffers from perfect multicollinearity, and statistical software will usually stop with an error message. Intuitively, there is no unique estimator because one regressor carries no variation independent of the others; in technical terms, the matrix $X^\top X$ is not invertible, and solving the normal equations amounts to dividing by zero.
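To see this concretely, here is a minimal numpy sketch (the variable names $a$, $b$, $c$ mirror the question; the data are simulated): with $c = a + b$, the design matrix loses a rank, so different coefficient vectors produce exactly the same fitted values and no unique least-squares solution exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b                                  # exact linear dependence

# Design matrix with intercept and all three predictors
X = np.column_stack([np.ones(n), a, b, c])
print(np.linalg.matrix_rank(X))            # 3, not 4: one column is redundant

# Two different coefficient vectors give identical fitted values,
# so the normal equations have no unique solution.
beta1 = np.array([1.0, 2.0, -1.0, 0.0])
beta2 = np.array([1.0, 1.0, -2.0, 1.0])    # weight shifted from a, b onto c
print(np.allclose(X @ beta1, X @ beta2))   # True
```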

If instead you have a strong but not exact correlation between, let's say, $a$ and $c$, then this is ordinary strong multicollinearity. You can still estimate the coefficients, but you need to take three things into account:

  1. the standard errors will be very high and the $t$-values very low: essentially, two variables are competing to explain the same dependent variable.
  2. the estimated coefficients will be very sensitive to outliers, so data contamination is a big issue here: a single outlier can change a coefficient dramatically.
  3. given that the data are not contaminated, you need to adjust your interpretation of the coefficients. In a multiple linear OLS regression, each coefficient indicates what happens to the dependent variable when that predictor changes and all other variables are held constant. Take for example: $$ \text{wage} = \text{constant} + \beta_1 \, \text{education} + \beta_2 \, \text{IQ} + u. $$ (Here IQ and education are expected to be strongly correlated.)

You may then be surprised to find that your coefficient $\beta_1$ turns out to be negative, even though theory says it should by all means be positive. This can happen precisely because education and intelligence are so strongly correlated: intelligence may have the greater effect on wages, and $\beta_1$, which measures the effect of education holding IQ constant, is adjusted downward by that effect. The estimate is correct, but its interpretation is now different.
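As a rough illustration of points 1 and 3, here is a small simulation sketch using statsmodels; all the numbers (sample size, coefficients, noise levels) are made up for the example. With education almost a linear function of IQ, the standard error on education's coefficient is badly inflated relative to its true value, and in small samples the estimate can even come out negative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
iq = rng.normal(100, 15, size=n)
educ = 0.12 * iq + rng.normal(0, 0.5, size=n)   # education nearly determined by IQ
wage = 5 + 0.10 * educ + 0.20 * iq + rng.normal(0, 2, size=n)

print(np.corrcoef(educ, iq)[0, 1])              # correlation around 0.96

X = sm.add_constant(np.column_stack([educ, iq]))
fit = sm.OLS(wage, X).fit()
print(fit.params)   # [const, beta_educ, beta_iq]; beta_educ can even be negative
print(fit.bse)      # note the large standard error on the education coefficient
```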

However, if you take the changed interpretation of the coefficients into account and your data are clean, OLS under strong multicollinearity still yields unbiased estimates.
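That unbiasedness claim is easy to check in a toy simulation (a numpy sketch with made-up coefficients): averaging the OLS estimates over many replications recovers the true values, even though each individual estimate of the collinear pair is very noisy.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = np.array([1.0, 2.0, -1.0])            # intercept, x1, x2
estimates = []
for _ in range(2000):
    x1 = rng.normal(size=80)
    x2 = 0.98 * x1 + 0.02 * rng.normal(size=80)   # near-collinear, not exact
    y = true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2 + rng.normal(size=80)
    X = np.column_stack([np.ones(80), x1, x2])
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

estimates = np.array(estimates)
print(estimates.mean(axis=0))   # close to (1, 2, -1): unbiased on average
print(estimates.std(axis=0))    # but the x1, x2 estimates have a huge spread
```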
