Solved – the difference between a confounder, collinearity, and interaction term

confounding, interaction, multicollinearity, multiple regression, regression

These terms kind of confuse me because they all seem to imply a certain correlation.

Confounder: influences dependent and independent variable

Collinearity: to me just means correlation between independent variables

Interaction term: joint effect of independent variables (but doesn't this require correlation between those variables?)

Best Answer

Your understanding of confounding and collinearity is correct. Note that in many contexts collinearity really refers to "perfect collinearity" where one variable is a linear combination of one or more other variables, but in some contexts it just refers to "high correlation" between variables.

Of course, in order for confounding to occur, there has to be a degree of correlation, though I would avoid saying "collinearity" due to the above.
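To make the definition of a confounder concrete, here is a minimal simulation sketch. The variable Z, the coefficients, the seed, and the sample size are all assumptions chosen for illustration; none of them come from the question. Z influences both X and Y, so X and Z are correlated, and leaving Z out of the model distorts the estimated effect of X.

```r
set.seed(42)                       # arbitrary seed, used only for reproducibility
n <- 5000
Z <- rnorm(n)                      # hypothetical confounder
X <- 0.8 * Z + rnorm(n)            # Z influences the independent variable
Y <- 1.0 * X + 2.0 * Z + rnorm(n)  # Z also influences the dependent variable; true effect of X is 1

b_naive    <- coef(lm(Y ~ X))["X"]      # Z omitted: the estimate absorbs part of Z's effect
b_adjusted <- coef(lm(Y ~ X + Z))["X"]  # Z included: the estimate should be close to 1

cor(X, Z)                                                  # clearly nonzero
c(naive = unname(b_naive), adjusted = unname(b_adjusted))
```

Under these assumptions the naive slope should come out well above 1, while the adjusted slope should be close to 1.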

However:

interaction term: joint effect of independent variables (but doesn't this require correlation between those variables?)

A "joint effect" is a good way to understand it, but in no way does it require correlation between the variables. Consider an orthogonal factorial design experiment, for example.
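A rough sketch of such a design, assuming two hypothetical factors A and B coded -1/+1 in a balanced 2x2 layout (the factor names, the number of replicates, and the effect size are invented for illustration). The factors are exactly uncorrelated by construction, yet the response depends only on their product:

```r
set.seed(1)
# Balanced 2x2 factorial design, 25 runs per cell (illustrative sizes).
design <- expand.grid(A = c(-1, 1), B = c(-1, 1), run = 1:25)
cor(design$A, design$B)   # 0 by construction: the design is orthogonal

# The response is driven purely by the interaction A * B.
design$Y <- 2 * design$A * design$B + rnorm(nrow(design))
fit <- lm(Y ~ A * B, data = design)
coef(fit)                 # the A:B estimate should be near 2, main effects near 0
```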

As another example, a simple simulation of bivariate data shows that X1 and X2 can be uncorrelated while a meaningful interaction still exists:

> set.seed(1)
> N <- 100
> X1 <- rnorm(N)
> X2 <- rnorm(N)
> cor(X1, X2)
[1] -0.0009943199   # X1 and X2 are uncorrelated
> 
> Y <- X1 * X2 + rnorm(N)
> summary(lm(Y ~ X1 * X2))   # base R call; the original %>% pipe needs magrittr loaded

Call:
lm(formula = Y ~ X1 * X2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.92554 -0.43139  0.00249  0.65651  2.60188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03107    0.10439   0.298    0.767    
X1          -0.03352    0.12064  -0.278    0.782    
X2          -0.02822    0.10970  -0.257    0.798    
X1:X2        0.76032    0.14847   5.121 1.57e-06 ***

Related Solutions

Solved – In the logistic regression model one of the independent variables is redundant with the interaction term. How should I deal with it

Software will drop variables when they are collinear. Understanding this situation amounts to figuring more precisely what that means.

There are three independent variables involved, including the constant term. Let's represent their values as the constant (column) vector $X_1 = (1, 1, \ldots, 1)$, a vector of ones and zeros for the dummy $X_2 = (1, 1, \ldots, 1, 0, 0,\ldots, 0)$, and a third apparently arbitrary vector $X_3 = (x_1, x_2, \ldots, x_n)$. (All other valid dummy codings are linear combinations of this particular $X_1$ and $X_2$, so no generality is lost by assuming that this particular binary (0-1) encoding is used.) I have sorted the data so that all the records where the dummy is $1$ come first; suppose there are $k$ of them. (We know $k \ge 1$ and $k \lt n$, for otherwise the dummy would be constant and could not be included in any regression with a constant term.)

Collinearity of these three vectors along with the $X_2 X_3$ interaction means (by definition) that there is a nontrivial linear relation

$$0 = \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 X_2 X_3$$

The first $k$ equations in this linear combination are

$$0 = \alpha_1 + \alpha_2 + \alpha_3 x_i + \alpha_4 x_i,\quad i=1, 2, \ldots, k.$$

The remaining equations are

$$0 = \alpha_1 + \alpha_3 x_i,\quad i = k+1, \ldots, n.$$

The first group of equations tells us that $(\alpha_3 + \alpha_4)x_i = -(\alpha_1+\alpha_2)$ for all $1 \le i \le k$. The second group tells us that $\alpha_3 x_i = -\alpha_1$ for all $k \lt i \le n$. Suppose first that $\alpha_3 \ne 0$. Then the second group forces all the $x_i$ to be equal to one another for $i \gt k$, while the first group places no restriction on the $x_i$ for $1 \le i \le k$ provided $\alpha_3 + \alpha_4=0$ (and $\alpha_1 + \alpha_2 = 0$). Suppose instead that $\alpha_3 = 0$. Then the second group gives $\alpha_1 = 0$ and the first group reduces to $\alpha_4 x_i = -\alpha_2$, so either all the $x_i$ are equal to each other for $1\le i \le k$ or $\alpha_4=0$. If $\alpha_4=0$, then $\alpha_2 = 0$ as well, reducing all the $\alpha_i$ to $0$: but that was not the case (the linear relation was nontrivial).

In words, what we have deduced is that the continuous variable $X_3$ exhibits no variation among at least one of the two groups of dummy values.

To confirm this conclusion we may create three examples of such data in R. I have chosen $k=2$ and $n=4$: there are two records for each group of dummy values. In the first case, assigning random values to $X_3$ virtually guarantees there will be variation within both groups:

> set.seed(17)
> x2 <- c(1, 1, 0, 0) # The dummy (binary) variable, sorted as in the analysis
> x3 <- rnorm(4)      # The continuous independent variable
> y <- rnorm(4)       # The dependent variable may have *any* values
> lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
     0.6763      -0.9218      -1.2728       0.2703

All variables are retained. (This is OLS regression, not logistic regression, but that doesn't matter: both methods behave identically concerning treatment of collinear independent variables.)

In the second case, let's set the first two elements of $X_3$ to the same value:

> x3[1] <- x3[2]; lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
     0.6763      -0.4745      -1.2728           NA

The interaction is dropped due to the collinearity.

In the third case, let's set the last two elements of $X_3$ to a common value while varying the first two. To do this, I just reverse all the elements of $X_3$:

> x3 <- rev(x3); lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
      1.217       -1.756       -1.605           NA

Once again the interaction is dropped due to collinearity.
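One way to see the linear relation directly, rather than inferring it from the NA coefficient, is to inspect the rank of the design matrix. This is only a sketch: the values of x3 below are made up so that x3 is constant within the group where the dummy is 1, and they are not the values generated above.

```r
x2 <- c(1, 1, 0, 0)            # dummy variable, sorted as in the analysis
x3 <- c(0.5, 0.5, -1.2, 0.7)   # made-up values: constant within the x2 == 1 group

X <- model.matrix(~ x2 * x3)   # columns: (Intercept), x2, x3, x2:x3
ncol(X)                        # 4 columns
qr(X)$rank                     # rank 3, so one column is redundant

# Here the interaction column equals 0.5 * x2, which is the nontrivial linear relation.
all.equal(unname(X[, "x2:x3"]), 0.5 * x2)
```

A rank smaller than the number of columns is what leads `lm` to report an NA coefficient for one of the terms.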

It sounds like SPSS behaves in the same way as R in such cases.

Solved – Interaction terms and effect sizes in multiple regression

If you include an interaction term, then "the" effect no longer exists. Instead you have multiple effects: one for each level of the other variable with which you created the interaction. This is the very point of including interactions, so there is no way around it.
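The following sketch illustrates the point with two continuous predictors. The data-generating model, the seed, and the helper `slope_X1_at` are hypothetical and introduced only for this example. With an interaction in the model, the estimated slope of Y on X1 is the X1 coefficient plus the interaction coefficient times X2, so it changes with the value of X2:

```r
set.seed(1)
N  <- 200
X1 <- rnorm(N)
X2 <- rnorm(N)
Y  <- 0.5 * X1 + 1.0 * X1 * X2 + rnorm(N)   # illustrative data-generating model

b <- coef(lm(Y ~ X1 * X2))

# Hypothetical helper: estimated slope of Y on X1 at a given value of X2.
slope_X1_at <- function(x2) unname(b["X1"] + b["X1:X2"] * x2)

slopes <- slope_X1_at(c(-1, 0, 1))
slopes   # three different "effects" of X1, one per chosen value of X2
```

Reporting a single effect size for X1 would therefore require fixing a value of X2 at which to evaluate it.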
