Residualizing a Binary Variable to Remedy Multicollinearity Issues


Imagine a regression model with a continuous-valued response variable and three continuous-valued explanatory variables. For concreteness, imagine that we are interested in the effects of "Predictability," "Length," and "Frequency" on the reading times "RT" of words.

Suppose further that we know that there is collinearity between "Frequency" and the other two explanatory variables. One way of dealing with this in the model is to residualize "Frequency," along the lines of the R code below (where the second line is the model formula passed to a suitable regression function):

r. <- function(formula, ...) rstandard(lm(formula, ...))   # standardized residuals
RT ~ Predictability + Length + r.(Frequency ~ Length + Predictability)
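
To make the mechanics concrete, here is a minimal simulated example (all data and coefficients are invented for illustration). The residuals of the Frequency regression are, by construction, essentially uncorrelated with the other two predictors:

set.seed(1)
n <- 1000
Predictability <- rnorm(n)
Length         <- rnorm(n)
Frequency <- 0.6 * Predictability + 0.6 * Length + rnorm(n)   # collinear predictor
RT <- 500 - 10 * Predictability - 8 * Length - 12 * Frequency + rnorm(n, sd = 25)

r. <- function(formula, ...) rstandard(lm(formula, ...))
Freq.res <- r.(Frequency ~ Length + Predictability)

cor(Freq.res, Length)            # ~ 0
cor(Freq.res, Predictability)    # ~ 0
summary(lm(RT ~ Predictability + Length + Freq.res))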

Now suppose that one of the variables of interest is binary-valued. Let's say that we wanted to run a model of the form:

RT ~ Education + ReadsDaily

but we found that neither variable had a significant effect when both were included. When a saturated systematic component, Education * ReadsDaily, was used instead, the model found highly significant results for both coefficients as well as for the interaction term. When the two models are compared, the inclusion of the interaction term decreases the deviance by 16.7. When a VIF analysis is run on the model with the interaction term, very high values are reported:

ReadsDailyY             Education   ReadsDailyY:Education 
   8.693957              4.266084               15.607665 

The model without an interaction term, which fits poorly, has low values in the VIF analysis:

ReadsDailyY   Education 
   1.227842    1.227842 
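
For reference, numbers of this kind can be reproduced along the following lines. This is only a sketch with simulated stand-ins for the data discussed above, not the original analysis; the per-coefficient VIF is computed by hand so that nothing beyond base R is needed:

set.seed(3)
n <- 300
Education <- rnorm(n, mean = 12, sd = 2)
p <- plogis((Education - 12) / 2)               # ReadsDaily correlated with Education
ReadsDaily <- factor(ifelse(runif(n) < p, "Y", "N"))
RT <- 500 - 10 * (Education - 12) - 15 * (ReadsDaily == "Y") +
      5 * (Education - 12) * (ReadsDaily == "Y") + rnorm(n, sd = 20)

m.add <- glm(RT ~ Education + ReadsDaily)       # additive model
m.int <- glm(RT ~ Education * ReadsDaily)       # saturated systematic component
deviance(m.add) - deviance(m.int)               # drop in deviance

## Per-coefficient VIF: 1 / (1 - R^2) from regressing each design-matrix
## column on the others (names like "ReadsDailyY" come from dummy coding)
vif.by.hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
  setNames(vifs, colnames(X))
}
vif.by.hand(m.int)    # interaction model: inflated values
vif.by.hand(m.add)    # additive model: modest values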

I think this means that there is multicollinearity between the two predictors (cf. Agresti 2007:138):

…models with several predictors often suffer from multicollinearity –
correlations among predictors making it seem that no one variable is
important when all the others are in the model. A variable may seem to
have little effect because it overlaps considerably with other
predictors in the model, itself being predicted well by the other
predictors. Deleting such a redundant predictor can be helpful, for
instance to reduce standard errors of other estimated effects.

We would, however, like to try to separate their effects on "RT" rather than combining the variables or discarding one, if at all possible, since the distinction between the two variables is of theoretical importance. It would be possible to do something like the following in R, which runs without error messages:

r. <- function(formula) rstandard(glm(formula, family = binomial()))
RT ~ Education + r.(ReadsDaily ~ Education)

The residualized "ReadsDaily" term in the model would then be continuous-valued. I suspect that this strategy for dealing with collinearity is methodologically suspect, but I do not know enough to say why it should or should not be dispreferred.
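
A quick simulated check (again with invented data) confirms that the residualization runs and yields a continuous predictor:

set.seed(42)
n <- 200
Education  <- rnorm(n)
ReadsDaily <- rbinom(n, 1, plogis(Education))   # binary, correlated with Education
RT <- 500 - 10 * Education - 15 * ReadsDaily + rnorm(n, sd = 20)

r. <- function(formula) rstandard(glm(formula, family = binomial()))
head(r.(ReadsDaily ~ Education))                # standardized deviance residuals
summary(lm(RT ~ Education + r.(ReadsDaily ~ Education)))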

Agresti, A. (2007). An Introduction to Categorical Data Analysis. Wiley.

Best Answer

Interaction terms, especially ones built from a dummy/dichotomous/categorical variable, can themselves induce multicollinearity. It is therefore recommended to center your variables before forming the interaction term (subtracting the mean from X so that it has mean 0), although this is not essential: centering changes the individual coefficients but leaves the R-squared unchanged. Especially if you are trying to explain a model rather than predict with it, it is important to remove this kind of multicollinearity from interaction terms.
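
A small simulated demonstration of the point (all names and numbers invented): centering removes most of the correlation between the product term and its components while leaving the fit itself untouched.

set.seed(7)
n <- 500
Education  <- rnorm(n, mean = 12, sd = 2)
ReadsDaily <- rbinom(n, 1, 0.5)
RT <- 400 - 5 * Education - 20 * ReadsDaily +
      3 * Education * ReadsDaily + rnorm(n, sd = 30)

cor(ReadsDaily, Education * ReadsDaily)          # raw product: very high

Education.c  <- Education  - mean(Education)
ReadsDaily.c <- ReadsDaily - mean(ReadsDaily)
cor(ReadsDaily.c, Education.c * ReadsDaily.c)    # centered product: near zero

m.raw <- lm(RT ~ Education   * ReadsDaily)
m.ctr <- lm(RT ~ Education.c * ReadsDaily.c)
c(summary(m.raw)$r.squared, summary(m.ctr)$r.squared)   # identical R-squared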

Aiken and West's book covers the interpretation of interactions of all kinds (continuous, binary, categorical, etc.):

Aiken, L. S., West, S. G., & Reno, R. R. (1991). Multiple Regression: Testing and Interpreting Interactions. Sage Publications.
