Statistical Significance – Why Main Effects Lose Significance When Interactions Are Included

Tags: lme4-nlme, statistical-significance

I have an experiment with 8 binary IVs and 1 three-level IV (all categorical), plus a continuous DV.
I fit an lmer with the main effects of all variables, like so:

library(lmerTest)  # lmerTest's lmer() provides the Satterthwaite Type III table below

main_effects <- lmer(agreement ~ dir + coref + fuzzy + B_atom + A_atom +
                       A_neg + B_neg + A_qua + B_qua + (1 | Index),
                     data = data)

Displaying the ANOVA table for that model, I get the following, where the factors dir, coref, B_atom, A_neg and B_neg are significant:

Type III Analysis of Variance Table with Satterthwaite's method
        Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
dir    13764.8  6882.4     2 282.50  8.5386 0.0002509 ***
coref   6487.7  6487.7     1 190.33  8.0488 0.0050465 ** 
fuzzy   2538.6  2538.6     1 190.52  3.1495 0.0775475 .  
B_atom  4502.8  4502.8     1 357.59  5.5863 0.0186359 *  
A_atom  2234.7  2234.7     1 357.66  2.7724 0.0967781 .  
A_neg   8802.7  8802.7     1 366.43 10.9209 0.0010446 ** 
B_neg   8995.8  8995.8     1 366.44 11.1605 0.0009215 ***
A_qua     17.3    17.3     1 381.60  0.0215 0.8835529    
B_qua     36.5    36.5     1 381.60  0.0453 0.8315259   

However, if I fit another lmer containing all the interactions, like so

interaction <- lmer(agreement ~ dir * coref * fuzzy * B_atom * A_atom *
                      A_neg * B_neg * A_qua * B_qua + (1 | Index),
                    data = data)  # full nine-way factorial

the main effects that showed in the first model completely disappear and I get only a few significant interactions (I include only the significant interactions in the output below to keep this post as short as possible):

fuzzy:B_atom                              8605.7  8605.7     1 309.35 11.3420 0.0008535 ***    
dir:A_neg                                 5929.7  2964.8     2 316.77  3.9075 0.0210654 *     
dir:B_neg                                 6676.6  3338.3     2 317.00  4.3998 0.0130391 *      
dir:coref:fuzzy                           6911.4  3455.7     2 307.90  4.5545 0.0112381 * 

Why do the main effects disappear? If there are interactions of those factors on top of the simple main effects of model 1, shouldn't the main effects show up in model 2 as well? I would understand it if it had been the other way around: no simple main effects in model 1, but main effects and interactions in model 2. Here I have the opposite situation, which I don't quite get. Most importantly, should I go with the first model or the second? In other words, would my results be unreliable if I decided to report the results of model 1 (with main effects only)?

I don't have much experience with this kind of modeling, so I am a bit lost right now.

UPDATE
I checked for multicollinearity and it seems that this is not a problem:

        GVIF Df GVIF^(1/(2*Df))
dir    1.217513  2        1.050433
coref  1.458521  1        1.207692
fuzzy  1.496577  1        1.223347
B_atom 1.099159  1        1.048408
A_atom 1.098381  1        1.048037
A_neg  1.702295  1        1.304720
B_neg  1.702118  1        1.304652
A_qua  1.534520  1        1.238758
B_qua  1.534869  1        1.238898
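
A minimal sketch of one way to produce such a table, assuming the car package (refitting just the fixed effects with lm(); not necessarily how the table above was made):

library(car)

# Refit the fixed-effects part as a plain linear model; vif() then returns
# the GVIF, Df and GVIF^(1/(2*Df)) columns shown above for factor predictors.
fixed_only <- lm(agreement ~ dir + coref + fuzzy + B_atom + A_atom +
                   A_neg + B_neg + A_qua + B_qua, data = data)
vif(fixed_only)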

From the reading I did, I found that the GVIF can be used with much the same rule of thumb as the VIF: values below 5 do not indicate problematic collinearity.
Should I then stick to my first model (main effects only) as the "reliable" one, given that collinearity is not an issue? Or do you have any other suggestions?

Best Answer

Putting aside the issues with p-value calculations in mixed models (see this page and its links), there are a couple of things that could be going on, one of which seems more likely in your case.

First, models with interactions can use up a lot of degrees of freedom. In addition to the 10 degrees of freedom for the specified predictors in your first model (8 binary predictors at 1 df each, plus 2 df for the three-level dir; plus 1 for the intercept or grand mean and more for the random effect), each of their two-way, three-way, ..., nine-way interactions is an additional parameter that the second model needs to estimate, diminishing the residual degrees of freedom. The resulting higher standard errors of the coefficient estimates in a model with interactions can thus move a coefficient from "significant" to "non-significant" even if the point estimate of the coefficient is unchanged.
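
To see the scale of the problem, here is a minimal sketch that counts the fixed-effect parameters implied by the full nine-way factorial (the factor levels are assumed from your description; adjust to your data):

# One row per design cell: 8 binary factors plus the three-level dir,
# then count the columns of the full-factorial model matrix.
grid <- expand.grid(coref = 0:1, fuzzy = 0:1, B_atom = 0:1, A_atom = 0:1,
                    A_neg = 0:1, B_neg = 0:1, A_qua = 0:1, B_qua = 0:1,
                    dir = 1:3)
grid[] <- lapply(grid, factor)
ncol(model.matrix(~ dir * coref * fuzzy * B_atom * A_atom *
                    A_neg * B_neg * A_qua * B_qua, data = grid))
#> 768   # = 2^8 * 3 design cells, one coefficient each (incl. intercept)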

That doesn't seem to explain your result here, though. The estimated denominator degrees of freedom were typically in the high 300s in the first model, but still in the low 300s for the second model.

More likely here is the second possibility: that the interactions themselves are the interesting results. Look at the predictors involved in the interaction terms you chose to display: dir, coref, fuzzy, B_atom, A_neg, B_neg. Those are all of the predictors that were significant in the first model, plus the one that came closest to the significance cutoff (fuzzy). What the first model told you was that each of those predictors was related to the outcome if you hold constant (ignore) the values of all the other predictors. What the significant interaction terms in the second model tell you is that you cannot fairly ignore the values of the other predictors when you consider the relation of any one of those predictors to the outcome.

In that context of important interactions, the "significance" of any "main effect" doesn't really matter: it's the particular combinations of values of dir, coref, fuzzy, B_atom, A_neg, and B_neg that matter. Those "main effects" didn't "disappear"; they just didn't reach a threshold of statistical significance on their own. You still need the "main-effect" estimates to predict the outcome at any specified combination of the predictors.
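
If you want to work with those combinations directly, estimated marginal means are one way to do it. A minimal sketch, assuming the emmeans package and using one of your significant interactions (dir:A_neg) as an example:

library(emmeans)

# Estimated marginal means for each dir x A_neg combination, averaged over
# the other predictors in the model, plus pairwise comparisons.
emm <- emmeans(interaction, ~ dir * A_neg)
emm
pairs(emm)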

Complicating the issue is that different software can report p-values for categorical predictors differently when there are interactions. Some report them for the situation in which all other predictors are at their reference levels; in that case it is quite possible to get a "non-significant" p-value for a predictor even if it is highly associated with the outcome at a non-reference level of another predictor, due to an interaction. Other software reports p-values for a predictor together with all of its interactions. So you have to understand how the particular software you are using presents its p-values before you can even make the (often arbitrary) decision that a predictor is "non-significant."
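
In R, the contrast coding makes this concrete. A minimal sketch with a single interaction to keep it short (the model names m1/m2 are hypothetical):

# Under the default treatment coding, the dir coefficients in summary()
# describe the dir effect at the reference level of A_neg; under
# sum-to-zero coding they describe the dir effect averaged over A_neg.
m1 <- lmer(agreement ~ dir * A_neg + (1 | Index), data = data)
m2 <- lmer(agreement ~ dir * A_neg + (1 | Index), data = data,
           contrasts = list(dir = "contr.sum", A_neg = "contr.sum"))
summary(m1)$coefficients  # conditional on A_neg = reference level
summary(m2)$coefficients  # averaged over the levels of A_neg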