Should I include an interaction term for a covariate if I expect it to be correlated with one or more of the other variables?

Tags: interaction, linear model, multiple regression, regression

I'm fitting a linear model where the response variable is a measure of physical performance (running speed, for example) and the predictor variables are sex and drug treatment, with an interaction term.

$$Y = \beta_{0}+\beta_1X_{sex}+\beta_2X_{drug}+\beta_3X_{sex}X_{drug}$$

We suspect that body weight influences running speed, so we need to include body weight as a covariate to make sure the differences we're seeing aren't adequately explained by body weight alone.

However, body weight is affected by the drug, and of course it also correlates with sex. That makes me think that body weight should be included only as an additive term and its interactions with other terms should not be considered (because they will be spuriously significant):

$$Y = \beta_{0}+\beta_1X_{sex}+\beta_2X_{drug}+\beta_3X_{sex}X_{drug}+\beta_4X_{bw}$$

Am I correct to not include body weight interactions, and are my reasons for doing so correct? Thanks.

Best Answer

I don't know whether you are correct in not including the body weight interactions (the data should decide that), but the reasoning does not appear to be valid.

To see why not, suppose (purely hypothetically) that there is an ideal body weight for running that lies somewhere between the (low) mean female weight and (high) mean male weight. Then it is plausible that performance increases with female body weight and decreases with male body weight. That's a strong interaction between body weight and gender.

(There's also an important nonlinearity involved here: my supposition of an ideal intermediate weight is tantamount to saying that performance is not linearly related to body weight. But in some circumstances, such as when male and female body weights are well separated in the dataset, such nonlinearity could be adequately modeled in a linear fashion by means of this interaction. In effect, the relationship between weight and performance could be $\wedge$-shaped and gender becomes an indicator of which arm of the $\wedge$ is involved.)
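To make the hypothetical concrete, here is a minimal simulation sketch in Python with NumPy. Every number in it (the ideal weight of 70 kg, the group means, the noise levels) is invented purely for illustration and is not taken from any real data. It builds a $\wedge$-shaped weight-performance relationship with well-separated female and male weights, then fits a model with and without a weight-by-gender interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # subjects per group; all numbers here are made up for illustration

# Hypothetical weights (kg): females mostly below, males mostly above
# an assumed ideal running weight of 70 kg.
sex = np.repeat([0, 1], n)  # 0 = female, 1 = male
weight = np.where(sex == 0,
                  rng.normal(60, 4, 2 * n),
                  rng.normal(80, 4, 2 * n))

# Inverted-V performance: speed drops linearly on either side of the ideal.
ideal = 70.0
speed = 10.0 - 0.1 * np.abs(weight - ideal) + rng.normal(0, 0.2, 2 * n)

def fit(X, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

ones = np.ones_like(weight)
X_add = np.column_stack([ones, sex, weight])                # additive only
X_int = np.column_stack([ones, sex, weight, sex * weight])  # adds sex x weight

beta_add, rss_add = fit(X_add, speed)
beta_int, rss_int = fit(X_int, speed)

# In the interaction model the weight slope differs by group.
slope_female = beta_int[2]
slope_male = beta_int[2] + beta_int[3]
print(f"female slope: {slope_female:.3f}, male slope: {slope_male:.3f}")
print(f"RSS additive: {rss_add:.1f}, RSS with interaction: {rss_int:.1f}")
```

Under these assumptions the fitted weight slope should come out positive for females and negative for males, and the additive model should leave a clearly larger residual sum of squares, because it is forced to use a single weight slope for both groups.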

Now of course body weights and running performance do not behave exactly like this, but nevertheless the mere possibility that three variables (body weight, gender, and performance) could be so related shows that the reasoning is incorrect. Whether or not there is a body weight-gender interaction will depend on your data: consider including it during preliminary analyses until it is clear that it adds no value to the model.
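One way to let the data decide is to compare the additive model with a larger one that adds the body weight interactions. The sketch below does this with a partial F statistic on simulated data. The effect sizes, sample size, and variable names are all assumptions for illustration. Body weight is deliberately made to depend on both sex and drug, as in the question, while the simulated truth contains no body weight interactions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400  # hypothetical sample size

sex = rng.integers(0, 2, n)   # 0 = female, 1 = male
drug = rng.integers(0, 2, n)  # 0 = control, 1 = treated
# Body weight depends on both sex and drug (made-up effects).
bw = 60 + 15 * sex - 5 * drug + rng.normal(0, 5, n)
# Assumed truth: body weight enters additively, with no weight interactions.
y = (12 + 1.0 * sex + 0.5 * drug + 0.8 * sex * drug
     - 0.05 * bw + rng.normal(0, 1, n))

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

ones = np.ones(n)
# Reduced model: sex, drug, sex x drug, and additive body weight.
X_reduced = np.column_stack([ones, sex, drug, sex * drug, bw])
# Full model: additionally bw x sex and bw x drug.
X_full = np.column_stack([X_reduced, bw * sex, bw * drug])

rss_reduced = rss(X_reduced, y)
rss_full = rss(X_full, y)

q = X_full.shape[1] - X_reduced.shape[1]  # number of added terms
df_resid = n - X_full.shape[1]            # residual df of the full model
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / df_resid)
print(f"partial F = {f_stat:.2f} on {q} and {df_resid} df")
```

A p-value can be read from the F distribution with `q` and `df_resid` degrees of freedom, for instance with `scipy.stats.f.sf(f_stat, q, df_resid)` if SciPy is available. Because the simulated truth is additive, the statistic here behaves like a draw from that null distribution even though body weight is correlated with sex and drug. Correlation among predictors alone does not make interaction terms spuriously significant in a correctly specified model.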
