Solved – In linear regression, why should we include quadratic terms when we are only interested in interaction terms

interaction, linear model, multiple regression, regression

Suppose I am interested in the linear regression model $$Y_i = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \varepsilon_i,$$ because I would like to see whether an interaction between the two covariates has an effect on $Y$.

In a professor's course notes (a professor I do not have contact with), it is stated:
When including interaction terms, you should also include their second-degree terms, i.e., the model $$Y_i = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 +\beta_4x_1^2 + \beta_5x_2^2 + \varepsilon_i$$ should be fitted in the regression.

Why should one include second-degree terms when we are only interested in the interaction?

Best Answer

It depends on the goal of inference. If you want to make inferences about whether an interaction exists, for instance in a causal context (or, more generally, if you want to interpret the interaction coefficient), then your professor's recommendation makes sense: misspecification of the functional form can lead to wrong inferences about the interaction.

Here is a simple example in which there is no interaction between $x_1$ and $x_2$ in the structural equation for $y$; yet, if you do not include the quadratic term in $x_1$, you will wrongly conclude that $x_1$ interacts with $x_2$ when in fact it does not.

    set.seed(10)
    n <- 1e3
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n)              # x2 is correlated with x1
    y <- x1 + x2 + x1^2 + rnorm(n)   # true model: quadratic in x1, no interaction
    summary(lm(y ~ x1 + x2 + x1:x2)) # misspecified: the quadratic term is omitted
    
    Call:
    lm(formula = y ~ x1 + x2 + x1:x2)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.7781 -0.8326 -0.0806  0.7598  7.7929 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.30116    0.04813   6.257 5.81e-10 ***
    x1           1.03142    0.05888  17.519  < 2e-16 ***
    x2           1.01806    0.03971  25.638  < 2e-16 ***
    x1:x2        0.63939    0.02390  26.757  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.308 on 996 degrees of freedom
    Multiple R-squared:  0.7935,    Adjusted R-squared:  0.7929 
    F-statistic:  1276 on 3 and 996 DF,  p-value: < 2.2e-16
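
The mechanism is visible in the simulation itself: since $x_2$ is built as $x_1$ plus noise, the product $x_1x_2$ is strongly correlated with the omitted $x_1^2$, so the interaction coefficient soaks up the quadratic effect. A quick check, using the variables simulated above:

    # x2 = x1 + noise, so the product x1*x2 tracks the omitted x1^2 term closely
    cor(x1^2, x1 * x2)   # roughly 0.8 under this data-generating process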

This can be interpreted simply as a case of omitted-variable bias, with $x_1^2$ as the omitted variable. If you go back and include the squared term in the regression, the apparent interaction disappears.

    summary(lm(y ~ x1 + x2 + x1:x2 + I(x1^2)))   
     
    Call:
    lm(formula = y ~ x1 + x2 + x1:x2 + I(x1^2))
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.4574 -0.7073  0.0228  0.6723  3.7135 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -0.0419958  0.0398423  -1.054    0.292    
    x1           1.0296642  0.0458586  22.453   <2e-16 ***
    x2           1.0017625  0.0309367  32.381   <2e-16 ***
    I(x1^2)      1.0196002  0.0400940  25.430   <2e-16 ***
    x1:x2       -0.0006889  0.0313045  -0.022    0.982    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.019 on 995 degrees of freedom
    Multiple R-squared:  0.8748,    Adjusted R-squared:  0.8743 
    F-statistic:  1739 on 4 and 995 DF,  p-value: < 2.2e-16
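
If you want a single test of whether the quadratic term matters, the two nested specifications can also be compared directly, for example with an F test via anova() (a sketch refitting the two models above):

    # nested-model comparison: does adding x1^2 improve the fit?
    fit_reduced <- lm(y ~ x1 + x2 + x1:x2)
    fit_full    <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2))
    anova(fit_reduced, fit_full)  # the quadratic term should show up as highly significant here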

Of course, this reasoning applies not only to quadratic terms but to misspecification of the functional form in general. The goal is to model the conditional expectation function well enough to assess the interaction. If you limit yourself to linear regression, you will need to include these nonlinear terms manually. An alternative is to use a more flexible regression method, such as kernel ridge regression.
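
To illustrate the latter, here is a minimal kernel ridge regression sketch in base R with a Gaussian kernel, reusing the simulated data above; the bandwidth and penalty values are placeholders chosen for illustration, not tuned:

    # kernel ridge regression sketch: Gaussian kernel, fit in closed form
    X <- cbind(x1, x2)
    sigma  <- 1                              # kernel bandwidth (placeholder)
    lambda <- 0.1                            # ridge penalty (placeholder)
    D2 <- as.matrix(dist(X))^2               # squared pairwise distances between observations
    K  <- exp(-D2 / (2 * sigma^2))           # Gaussian kernel matrix
    alpha <- solve(K + lambda * diag(n), y)  # dual coefficients
    y_hat <- K %*% alpha                     # fitted values; no hand-built x1^2 or x1:x2 terms needed

Here the kernel handles the nonlinearity (and any interaction) automatically, so there is no need to guess in advance which polynomial or product terms to include.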