Solved – In a logistic regression model one of the independent variables is redundant with the interaction term. How should I deal with it?

identifiability, interaction, logistic regression

In my logistic regression the dependent variable is a dummy variable, and I have two independent variables: one is a dummy variable and the other is a metric variable. I also assume an interaction between those two variables.
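
In symbols, the model is

$$\operatorname{logit}\,\Pr(Y=1) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3\, D X,$$

where $Y$ is the outcome, $D$ the independent dummy and $X$ the metric variable (generic names, since the actual variable names don't matter here).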

I am computing three regressions because I want to explore the influence of the independent variables on the dependent variable in period 1, in period 2, and in both periods together.

When I compute the regressions for period 2 and for both periods together, there is no problem.

But when I compute the regression for period 1, SPSS generates a warning that "because of redundancy the degrees of freedom for at least one variable have been reduced". I do not know exactly what that means, but I found that when I exclude the independent dummy variable from the model for period 1, the interaction term is included in the model. So the two variables are somehow identical.

My question is how I should deal with this in my work. Should I just say that for the first period the interaction term and the dummy variable are identical? Or are there other consequences for the interpretation of my model?

I hope this makes my question a little bit clearer. Thanks again.

Best Answer

Software will drop variables when they are collinear. Understanding this situation amounts to figuring out more precisely what that means.

There are three independent variables involved, including the constant term. Let's represent their values as the constant (column) vector $X_1 = (1, 1, \ldots, 1)$, a vector of ones and zeros for the dummy $X_2 = (1, 1, \ldots, 1, 0, 0,\ldots, 0)$, and a third apparently arbitrary vector $X_3 = (x_1, x_2, \ldots, x_n)$. (All other valid dummy codings are linear combinations of this particular $X_1$ and $X_2$, so no generality is lost by assuming that this particular binary (0-1) encoding is used.) I have sorted the data so that all the records where the dummy is $1$ come first; suppose there are $k$ of them. (We know $k \ge 1$ and $k \lt n$, for otherwise the dummy would be constant and could not be included in any regression with a constant term.)

Collinearity of these three vectors along with the $X_2 X_3$ interaction means (by definition) that there is a nontrivial linear relation

$$0 = \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 X_2 X_3$$

The first $k$ equations in this linear combination are

$$0 = \alpha_1 + \alpha_2 + \alpha_3 x_i + \alpha_4 x_i,\quad i=1, 2, \ldots, k.$$

The remaining equations are

$$0 = \alpha_1 + \alpha_3 x_i,\quad i = k+1, \ldots, n.$$

The first group of equations tells us that $(\alpha_3 + \alpha_4)x_i = -(\alpha_1+\alpha_2)$ for $1 \le i \le k$, and the second group tells us that $\alpha_3 x_i = -\alpha_1$ for $k \lt i \le n$. Now suppose the $x_i$ are not all equal within the first group: then necessarily $\alpha_3 + \alpha_4 = 0$, and consequently $\alpha_1 + \alpha_2 = 0$. Suppose the $x_i$ are also not all equal within the second group: then necessarily $\alpha_3 = 0$, and consequently $\alpha_1 = 0$. Together these force $\alpha_4 = 0$ and $\alpha_2 = 0$ as well, reducing all the $\alpha_i$ to $0$: that contradicts the assumption that the linear relation is nontrivial.

In words, what we have deduced is that the continuous variable $X_3$ exhibits no variation within at least one of the two groups defined by the dummy.
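
For concreteness (this is just an illustration, not something stated in the question): if $X_3$ equals a constant $c$ throughout the second group, where the dummy is $0$, then one explicit nontrivial relation is

$$0 = -c\,X_1 + c\,X_2 + X_3 - X_2 X_3,$$

because record by record the right-hand side is $-c + c + x_i - x_i = 0$ for $i \le k$ and $-c + 0 + c - 0 = 0$ for $i \gt k$.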


To confirm this conclusion we may create three examples of such data in R. I have chosen $k=2$ and $n=4$: there are two records for each group of dummy values. In the first case, assigning random values to $X_3$ virtually guarantees there will be variation within both groups:

> set.seed(17)
> x2 <- c(1, 1, 0, 0) # The dummy (binary) variable, sorted as in the analysis
> x3 <- rnorm(4)      # The continuous independent variable
> y <- rnorm(4)       # The dependent variable may have *any* values
> lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
     0.6763      -0.9218      -1.2728       0.2703 

All variables are retained. (This is OLS regression, not logistic regression, but that doesn't matter: both methods treat collinear independent variables identically. A quick logistic-regression check is sketched after the third example below.)

In the second case, let's set the first two elements of $X_3$ to the same value:

> x3[1] <- x3[2]; lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
     0.6763      -0.4745      -1.2728           NA  

The interaction is dropped due to the collinearity.

In the third case, let's set the last two elements of $X_3$ to a common value while varying the first two. To do this, I just reverse all the elements of $X_3$:

> x3 <- rev(x3); lm(y ~ x2*x3)
Coefficients:
(Intercept)           x2           x3        x2:x3  
      1.217       -1.756       -1.605           NA  

Once again the interaction is dropped due to collinearity.
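
Since the original model is logistic, the same check can be run with glm and the binomial family. This is only a sketch on the same tiny artificial data, using an arbitrary made-up binary response; with so few observations glm may warn about fitted probabilities of 0 or 1, but that is beside the point here:

> yb <- c(1, 0, 1, 0)  # an arbitrary binary response, invented purely for illustration
> glm(yb ~ x2*x3, family = binomial)

The coefficient for x2:x3 is again reported as NA: logistic regression drops the aliased interaction just as the OLS fits did.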

It sounds like SPSS behaves in the same way as R in such cases.
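
Finally, to see whether this is what happened in your period 1 data, look at the spread of the metric variable within each dummy group. A minimal sketch, assuming your period 1 data sit in a data frame with (hypothetical) columns named dummy and x3:

> tapply(period1$x3, period1$dummy, sd)  # period1, dummy and x3 are placeholder names

A standard deviation of 0 (or NA, for a group with a single record) in either group reproduces the redundancy SPSS is warning about and tells you which group lacks variation in the metric variable.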
