Solved – Justification for adding interaction terms to a multiple regression

interaction, regression

If you have no domain knowledge, is there any justification for adding interaction terms to a multiple regression?
If so, is there any intelligent way to select them? With a large number of independent variables (say over 6) there is a large number of interaction terms (over 58) – wouldn't this just be data mining?

Best Answer

In general, if you have no domain knowledge, you should stop, read up on the domain you are working in so that you have a general idea of what is going on, and then revisit your analysis. You do not have to become an expert, but you need to understand the basic concepts so that you avoid reinventing the wheel and are able to communicate your results to domain practitioners.

Having said that, your worry about over-fitting your data / data-dredging is fully legitimate. Multiple unexpected interaction terms are a clear warning sign of potential over-fitting. There are some great posts on this matter (e.g. here and here) and I would urge you to look into this further. Going back to your original question about interaction terms, there are some basic rules of thumb you can follow on whether or not to include them. For example, if we include squared terms for the variables $x_1$ and $x_2$, not including their interaction term would correspond to a fitted regression surface that is aligned with the coordinate axes used [1]. Similarly, if it is generally accepted that there is a temporal pattern between your covariates (e.g. biomass - life-cycle [2] or (credit) amount - age [3]), you would include an interaction term, at least on a first pass.
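To spell out the squared-terms rule of thumb, here is a short sketch. The coefficient names $\beta_{ij}$ are notation introduced here for illustration and are not taken from [1]. Consider the full second-order model

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 .
$$

Its quadratic part can be written as $\mathbf{x}^\top B \mathbf{x}$ with

$$
B = \begin{pmatrix} \beta_{11} & \beta_{12}/2 \\ \beta_{12}/2 & \beta_{22} \end{pmatrix},
$$

and the principal axes of the fitted surface are the eigenvectors of $B$. If the interaction is dropped ($\beta_{12} = 0$), $B$ is diagonal, so its eigenvectors can be taken to be the coordinate directions $(1, 0)$ and $(0, 1)$, and the contours of the surface are forced to line up with the $x_1$ and $x_2$ axes. With $\beta_{12} \neq 0$ and $\beta_{11} \neq \beta_{22}$, the axes are rotated by an angle $\theta$ satisfying

$$
\tan 2\theta = \frac{\beta_{12}}{\beta_{11} - \beta_{22}} ,
$$

so the interaction term is what lets the data, rather than the choice of coordinates, decide the orientation of the surface.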

If your criterion of choice (be it AIC, BIC, some $k$-fold CV error, MAD, out-of-sample prediction error, or whatever else) suggests that a model with an interaction term is better than a model without it (by your preferred margin), then it is probably reasonable to include that interaction term. Nevertheless, one should understand what is being done, because ultimately one needs to explain why the chosen criterion is relevant for this model and how the model in question is relevant to the domain.
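As a rough illustration of such a comparison, the sketch below fits a linear model with and without an $x_1 x_2$ term and compares a Gaussian AIC and a 5-fold CV error. Everything in it is an assumption made for the example: the data are synthetic (simulated with a true interaction), the helper names `design`, `gaussian_aic` and `kfold_mse` are hypothetical, and the AIC is computed only up to an additive constant that is the same for both models.

```python
import numpy as np


def design(x1, x2, interaction):
    """Design matrix with intercept, main effects and, optionally, x1*x2."""
    cols = [np.ones_like(x1), x1, x2]
    if interaction:
        cols.append(x1 * x2)
    return np.column_stack(cols)


def gaussian_aic(X, y):
    """AIC of an OLS fit under Gaussian errors, up to a shared constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    # k regression coefficients plus one parameter for the error variance
    return n * np.log(rss / n) + 2 * (k + 1)


def kfold_mse(X, y, k=5, seed=0):
    """Mean squared prediction error averaged over k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))


# Synthetic data (assumption): the true model contains an x1*x2 interaction.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

X_main = design(x1, x2, interaction=False)
X_int = design(x1, x2, interaction=True)

aic_main, aic_int = gaussian_aic(X_main, y), gaussian_aic(X_int, y)
cv_main, cv_int = kfold_mse(X_main, y), kfold_mse(X_int, y)

print(f"AIC      main effects: {aic_main:.1f}   with interaction: {aic_int:.1f}")
print(f"5-fold   main effects: {cv_main:.3f}   with interaction: {cv_int:.3f}")
```

Lower is better for both criteria. Using the same `seed` for both calls to `kfold_mse` keeps the folds identical, so the two CV errors are compared on the same splits. Note that repeating this over dozens of candidate interactions and keeping whichever wins is exactly the data-dredging the question worries about, so the comparison is best reserved for a small number of interactions chosen beforehand.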

  1. Practical Regression and Anova using R by J. Faraway, Chapt. 10
  2. Mixed Effects Models and Extensions in Ecology with R by A. Zuur et al., Chapt. 2
  3. Handbook of Computational Statistics by J. Gentle et al., Chapt. 24