Regression: interaction term correlated with the variables

interaction, regression

Before fitting a multivariable regression model, it's common to check whether the predictors are correlated.
That can be done by inspecting the correlation matrix, at least for linear relationships.

Simple least squares regression requires that the predictor variables be independent of one another.
We can tolerate small correlations, but the problem becomes serious as the variables approach perfect collinearity.

It's common to drop some of the correlated variables, keeping the most meaningful ones.
More complex alternatives, such as PCA or ridge regression, also exist.

But here is my question:
If your model includes interactions, those interactions tend to be highly correlated with other variables.
For example, in $F = a + b \cdot X + c \cdot Y + d \cdot X \cdot Y$,
$X \cdot Y$ is likely to be highly correlated with $X$ and $Y$.
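
To make the concern concrete, here is a minimal sketch that measures how strongly a product term correlates with its own components. The data are invented for illustration: `x` and `y` are assumed to be independent normal draws whose means sit well away from zero, and `pearson` is a small helper written for this example, not a library function.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


random.seed(0)
n = 2000
# Assumed toy data: independent predictors with means far from zero.
x = [random.gauss(10, 2) for _ in range(n)]
y = [random.gauss(5, 1) for _ in range(n)]
xy = [a * b for a, b in zip(x, y)]  # the interaction term X * Y

corr_x_y = pearson(x, y)
corr_x_xy = pearson(x, xy)
corr_y_xy = pearson(y, xy)

print(f"corr(X, Y)   = {corr_x_y:.2f}")
print(f"corr(X, X*Y) = {corr_x_xy:.2f}")
print(f"corr(Y, X*Y) = {corr_y_xy:.2f}")
```

With these particular means and standard deviations, both correlations involving the product should come out near 0.7, even though `x` and `y` themselves are generated independently. The size of the effect depends on the assumed distributions.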

If my model has an interaction term (and it's statistically significant and important to me) but it's highly correlated with some variables…
Should I drop it or keep it?
If I keep it, I will be violating the regression condition of uncorrelated terms.

Best Answer

Keep it. It's one of those choices between unbiasedness and precision.

The negative effect of having correlated independent variables is that they inflate the variance of each other's coefficient estimates (the statistic that quantifies this phenomenon is called the variance inflation factor). The result is larger standard errors, which lead to lower t-statistics, which in turn lead to higher p-values.
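
As a rough illustration of the variance inflation factor, the sketch below computes it from its textbook definition, $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors plus an intercept. The helper name `vif` and the simulated data are assumptions made for this example, using NumPy only.

```python
import numpy as np


def vif(design):
    """Variance inflation factor for each column of `design`.

    `design` holds the predictors only (no intercept column); an
    intercept is added inside each auxiliary regression.
    """
    n, p = design.shape
    factors = []
    for j in range(p):
        target = design[:, j]
        others = np.delete(design, j, axis=1)
        aux = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(aux, target, rcond=None)
        resid = target - aux @ coef
        # R^2 of predictor j explained by the other predictors.
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
# Assumed toy data: independent predictors with non-zero means.
x = rng.normal(10, 2, size=2000)
y = rng.normal(5, 1, size=2000)

vifs = vif(np.column_stack([x, y, x * y]))
for name, value in zip(["X", "Y", "X*Y"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

A factor of 1 means no inflation. In this made-up setup the factors for all three terms should be large, which is the inflated-standard-error effect described above.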

If the interaction term is already statistically significant, then the above problem does not concern your model as much. On the other hand, taking it out can have drastic consequences, because without that interaction the remaining estimates in the model ($b$ and $c$) can be biased. Once an estimate is biased, there is not much point in discussing its precision. For that reason it's better to keep the interaction term.
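
The bias argument can be checked with a small simulation. This is a sketch under stated assumptions, not a general proof: the "true" coefficients `a, b, c, d`, the predictor distributions, and the noise level are all invented, and the model is fitted by plain least squares with NumPy, once with the interaction column and once without it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Assumed "true" coefficients of F = a + b*X + c*Y + d*X*Y + noise.
a, b, c, d = 1.0, 2.0, 3.0, 0.5

x = rng.normal(10, 2, size=n)
y = rng.normal(5, 1, size=n)
f = a + b * x + c * y + d * x * y + rng.normal(0, 1, size=n)

ones = np.ones(n)
full_design = np.column_stack([ones, x, y, x * y])  # keeps the interaction
reduced_design = np.column_stack([ones, x, y])      # drops the interaction

full_coef, *_ = np.linalg.lstsq(full_design, f, rcond=None)
reduced_coef, *_ = np.linalg.lstsq(reduced_design, f, rcond=None)

print("true    (a, b, c, d):", (a, b, c, d))
print("full    (a, b, c, d):", np.round(full_coef, 2))
print("reduced (a, b, c):   ", np.round(reduced_coef, 2))
```

In this setup the full fit should land close to the assumed $b$ and $c$ despite the collinearity. The reduced fit's slopes should instead drift toward roughly $b + d \cdot E[Y]$ and $c + d \cdot E[X]$, because the omitted product gets absorbed into the main effects.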

Also, I'd suggest careful consideration when checking for possible interactions. While we can examine each of them statistically, the importance of having a causal framework behind them cannot be emphasized enough. Lastly, many analyses are not powered to detect interactions (most researchers don't treat this as a formal hypothesis), so be mindful not to fool yourself into thinking there isn't one just because the test didn't find one.
