I want to test the presence of an interaction term in a logistic regression with glm().
The formula is:
Gest.hypertension ~ ART.conc + Parity + Age + Smoke + Nulliparity + Syst.disease.type
Age is the only continuous variable; the others are categorical.
The interaction term is ART.conc (2 levels) x Parity (3 levels), which are also the variables of interest. I found two ways of representing the interaction in glm():
- By explicitly writing ART.conc * Parity, which creates both the simple (main) effects and the interaction effect.
- By creating an artificial new variable with interaction(ART.conc, Parity) and adding the simple terms separately.
These are the results, omitting the other control variables (OR and p-value shown):
model1: Gest.hypertension ~ ART.conc * Parity + ...
ART.conc->yes 3.35 0.073 .
Parity->2 7.15 0.001 **
Parity->3 38 <0.001 ***
ART.conc->yes:Parity->2 0.262 0.054 .
ART.conc->yes:Parity->3 0.0532 0.009 **
model2: Gest.hypertension ~ interaction(ART.conc, Parity) + ART.conc + Parity + ...
interaction(ART.conc, Parity)->yes.1 3.35 0.072590 .
interaction(ART.conc, Parity)->no.2 7.15 0.001206 **
interaction(ART.conc, Parity)->yes.2 6.26 0.002698 **
interaction(ART.conc, Parity)->no.3 38.05 0.000202 ***
interaction(ART.conc, Parity)->yes.3 6.78 0.011071 *
ART.concyes NA NA
Parity2 NA NA
Parity3 NA NA
So the two models give more or less the same results when one of the two terms is zero, and that's OK. What I have trouble understanding is what happens when both terms are non-zero. For example, in the second model, keeping Parity == 2 and moving ART.conc from no to yes, we go from an OR of 7.15 to 6.26. Instead, applying what I understood of logistic regression and interactions, the coefficient for the case interaction(ART.conc, Parity)->yes.2 should be
$$OR_{parity.2}*OR_{art.yes:parity.2} = 7.15*0.262 ≈ 1.87$$
which is very different from 6.26…
Therefore, I'd like to know what I'm missing and which of the two models I should use.
Thanks!
UPDATE!!!
Thanks to @Gavin Simpson's answer I thought I had solved the problem, and ditched the interaction() trick. But then I noticed a pretty silly error of mine.
We said that to interpret the interaction you do
$$OR_{simple.eff}*OR_{interaction.eff}$$
Taking the case ART.yes/Parity.2 as an example, I said that to compute the interaction effect of ART with Parity by the classical method you should do
$$OR_{art.yes}*OR_{art.yes:parity.2} ≈ 3.35*0.262 ≈ 0.87$$
But if you perform
$$OR_{parity.2}*OR_{art.yes:parity.2} ≈ 7.15*0.262 ≈ 1.87$$
I was troubled by this disparity. I needed a single number describing the overall risk of the pathology in the case above, and instead I had two numbers that I didn't know how to interpret.
Trying to resolve this, I noticed my error: to compute the overall effect it is not enough to multiply one simple effect by the interaction effect; you must include the other simple effect too!
$$OR_{simple.eff.1}*OR_{simple.eff.2}*OR_{interaction.eff}$$
or
$$OR_{art.yes}*OR_{parity.2}*OR_{art.yes:parity.2} ≈ 3.35*7.15*0.262 ≈ 6.26$$
6.26 is exactly the OR that the interaction() trick produces for interaction(ART.conc, Parity)->yes.2.
Therefore, using interaction() with the drop = TRUE option and not including the simple effects at all (to avoid the NAs), model 1 and model 2 give the same results, with model 2 directly reporting the already-combined effect of the interaction.
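Since effects are additive on the log-odds scale, the corresponding ORs multiply; a quick arithmetic check using the values from the model output above:

```r
## ORs copied from model 1's output above
or_art   <- 3.35    # ART.conc -> yes
or_par2  <- 7.15    # Parity -> 2
or_inter <- 0.262   # ART.conc yes : Parity 2 interaction

## log-odds add, so ORs multiply
or_art * or_par2 * or_inter
## ~ 6.28; the small difference from 6.26 comes from the rounding
## of the displayed coefficients
```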
Best Answer
Using some dummy data similar to what you must have used:
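A minimal sketch of such dummy data (names and values are my assumptions, not the original code): two factors in which only 4 of the 6 ART.conc x Parity combinations actually occur:

```r
## Assumed dummy data: ART.conc == "no" is only ever seen with Parity == 2,
## so 4 of the 6 factor-level combinations are observed
dat <- data.frame(
  ART.conc = factor(rep(c("no", "yes"), times = c(6, 9))),
  Parity   = factor(c(rep(2, 6), rep(1:3, each = 3)))
)
```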
and, for simplicity, form the interaction variable and store it in dat.
OK, now model.matrix() will show us how the factor variables are converted into terms in the model. First the correct way, and then the way you did it:
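A sketch of those steps (an assumed reconstruction; the dummy data is recreated so the snippet runs on its own):

```r
## recreate the assumed dummy data
dat <- data.frame(
  ART.conc = factor(rep(c("no", "yes"), times = c(6, 9))),
  Parity   = factor(c(rep(2, 6), rep(1:3, each = 3)))
)

## store the interaction variable in dat
dat <- transform(dat, interact = interaction(ART.conc, Parity))

## the correct way: intercept, main effects, and two interaction columns
X1 <- model.matrix(~ ART.conc * Parity, data = dat)

## the interaction() way: far more columns, several of them redundant
X2 <- model.matrix(~ interact + ART.conc + Parity, data = dat)

ncol(X1)  # 6
ncol(X2)  # 9
```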
The obvious thing is that these two are not the same, because there are far more terms in the design matrix for the method using interaction() than for the correct way.
Now we also note why, in the model with the interaction() term, you get NA values for some coefficients. Compare the interactyes.3 column with the Parity3 column in the design matrix from the interaction() model. Look familiar? So interaction() has created information that duplicates the main-effects terms you added to the formula alongside the interaction() variable. At the very least, you don't need to specify the main-effects terms alongside the interaction() variable in that version of the model. However, it'd still be wrong even if you fixed that. Consider:
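For instance (still on the assumed dummy data), tabulating the two factors shows only 4 occupied cells out of the 6 level combinations that interaction() retains:

```r
## recreate the assumed dummy data
dat <- data.frame(
  ART.conc = factor(rep(c("no", "yes"), times = c(6, 9))),
  Parity   = factor(c(rep(2, 6), rep(1:3, each = 3)))
)

with(dat, table(ART.conc, Parity))                # only 4 non-empty cells
nlevels(with(dat, interaction(ART.conc, Parity))) # 6 levels kept by default
```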
What interaction() has done (by default, I'll add) is retain all possible combinations of the levels of the two factors; that's 6, even though we only observed 4 of them in this dummy data set. This shows up in the model matrix as a vector of 0s for both interactno.1 and interactno.3 if you just include the interaction() term in the model. Hence we can't identify the coefficients for those two all-0 columns, so you'll still get an NA.
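One way to see those all-0 columns (an assumed reconstruction; the ~ 0 + cell-means coding gives every interaction level its own indicator column):

```r
## recreate the assumed dummy data and interaction variable
dat <- data.frame(
  ART.conc = factor(rep(c("no", "yes"), times = c(6, 9))),
  Parity   = factor(c(rep(2, 6), rep(1:3, each = 3)))
)
dat$interact <- with(dat, interaction(ART.conc, Parity))

## one indicator column per interaction level; the unobserved combinations
## (no.1 and no.3 here) come out as columns of 0s
X <- model.matrix(~ 0 + interact, data = dat)
colSums(X)
```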
In sum, the two ways you mention are not the same model, and you'll cause yourself a world of pain if you try to fit interactions using the interaction() function. Hopefully by now you realize these forms aren't the same and why, and, more importantly, why it is a bad idea to circumvent R's formula-interface notation.
Aside
I mentioned the default behaviour of interaction(). You can do better by adding drop = TRUE to the call: now we only have the four observed levels, and you can build the design matrix again.
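A sketch of the drop = TRUE version (an assumed reconstruction, with the dummy data recreated for self-containment):

```r
## recreate the assumed dummy data
dat <- data.frame(
  ART.conc = factor(rep(c("no", "yes"), times = c(6, 9))),
  Parity   = factor(c(rep(2, 6), rep(1:3, each = 3)))
)

## drop = TRUE keeps only the observed level combinations
dat <- transform(dat, interact2 = interaction(ART.conc, Parity, drop = TRUE))
nlevels(dat$interact2)  # 4

## design matrix with the dropped-level interaction plus the main effects
X <- model.matrix(~ interact2 + ART.conc + Parity, data = dat)
colnames(X)
```

With this data, the Parity3 and interact2yes.3 columns of X are identical, which is the redundancy discussed next.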
But this still isn't right, because you have one more term than needed; again, R can't do the right thing and remove one of the combinations of levels, which gets absorbed into the Intercept term. Note the Parity3 and interact2yes.3 terms: they are coded the same way, and the Parity3 term works with the Intercept term to give you the information in interact2yes.3.