Using some dummy data similar to what you might have used:
set.seed(10)
dat <- data.frame(ART.conc = factor(sample(c("yes","no"), 10, replace = TRUE),
levels = c("yes","no")),
Parity = factor(sample(1:3, 10, replace = TRUE)))
and, for simplicity, form the interaction variable and store it in dat:
dat <- transform(dat, interact = interaction(ART.conc, Parity))
OK, now model.matrix will show us the conversion to terms in the model that represent the factor variables. First, the correct way:
> model.matrix( ~ ART.conc * Parity, data = dat)
(Intercept) ART.concno Parity2 Parity3 ART.concno:Parity2 ART.concno:Parity3
1 1 1 1 0 1 0
2 1 0 1 0 0 0
3 1 0 0 0 0 0
4 1 1 1 0 1 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 0 0 0
8 1 0 0 0 0 0
9 1 1 1 0 1 0
10 1 0 0 1 0 0
And now the way you did it:
> model.matrix( ~ interact + ART.conc + Parity, data = dat)
(Intercept) interactno.1 interactyes.2 interactno.2 interactyes.3
1 1 0 0 1 0
2 1 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 1 0 1 0 0
6 1 0 1 0 0
7 1 0 0 0 0
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 0 0 1
interactno.3 ART.concno Parity2 Parity3
1 0 1 1 0
2 0 0 1 0
3 0 0 0 0
4 0 1 1 0
5 0 0 1 0
6 0 0 1 0
7 0 0 0 0
8 0 0 0 0
9 0 1 1 0
10 0 0 0 1
The obvious thing is that these two are not the same because there are far more terms in the design matrix for the method using interaction() than for the correct way.
Now we also note why, in the model with the interaction() term, you get NA values for some coefficients. Compare the interactyes.3 column with the Parity3 column in the design matrix from the interaction() model. Look familiar? So interaction() has created information that duplicates the main effects terms you added to the formula alongside the interaction() variable. At the very least, you don't need to specify the main effects terms alongside the interaction() variable in that version of the model. However, it'd still be wrong even if you fixed that.
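To see the consequence directly, here is a quick illustration; the response y is made up purely so that a model can be fitted:

## purely illustrative: invent a response for the dummy data
set.seed(42)
dat$y <- rnorm(nrow(dat))

## the interaction()-based formula yields NA (aliased) coefficients
coef(lm(y ~ interact + ART.conc + Parity, data = dat))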
Consider
> with(dat, levels(interact))
[1] "yes.1" "no.1" "yes.2" "no.2" "yes.3" "no.3"
> with(dat, table(interact))
interact
yes.1 no.1 yes.2 no.2 yes.3 no.3
3 0 3 3 1 0
What interaction() has done (by default, I'll add) is retain all possible combinations of the levels of the two factors; that's 6, even though we only observed 4 of them in this dummy data set. This shows up in the model matrix as a column of 0s for both interactno.1 and interactno.3 if you just include the interaction() term in the model:
> model.matrix( ~ interact, data = dat)
(Intercept) interactno.1 interactyes.2 interactno.2 interactyes.3
1 1 0 0 1 0
2 1 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 1 0 1 0 0
6 1 0 1 0 0
7 1 0 0 0 0
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 0 0 1
interactno.3
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
Hence we can't identify the coefficients for those two all-0 columns, so you'll still get NAs.

In summary, the two ways you mention are not the same model, and you'll cause yourself a world of pain if you try to fit interactions using the interaction() function.
Hopefully by now you realize that these forms aren't the same and why, and, more importantly, why it is a bad idea to circumvent R's formula interface.
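For completeness, the * operator used above is just shorthand for the two main effects plus their : interaction; you can verify that both spellings produce the same design matrix:

m1 <- model.matrix( ~ ART.conc * Parity, data = dat)
m2 <- model.matrix( ~ ART.conc + Parity + ART.conc:Parity, data = dat)
all.equal(m1, m2, check.attributes = FALSE)  ## TRUE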
Aside
I mentioned the default behaviour of interaction(). You can do better by adding drop = TRUE to the call:
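(For reference, interact2 below is assumed to be built the same way as interact, just with drop = TRUE added:)

dat <- transform(dat, interact2 = interaction(ART.conc, Parity, drop = TRUE))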
> with(dat, levels(interact2))
[1] "yes.1" "yes.2" "no.2" "yes.3"\
Now we only have the four observed levels. Then you can do:
> model.matrix( ~ ART.conc + Parity + interact2, data = dat)
(Intercept) ART.concno Parity2 Parity3 interact2yes.2 interact2no.2
1 1 1 1 0 0 1
2 1 0 1 0 1 0
3 1 0 0 0 0 0
4 1 1 1 0 0 1
5 1 0 1 0 1 0
6 1 0 1 0 1 0
7 1 0 0 0 0 0
8 1 0 0 0 0 0
9 1 1 1 0 0 1
10 1 0 0 1 0 0
interact2yes.3
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 1
But this still isn't right, because you have one more term than needed; again, R can't do the right thing here and remove one of the combinations of levels, which would get absorbed into the Intercept term. Note the Parity3 and interact2yes.3 terms; they are coded the same way; the Parity3 term works with the Intercept term to give you the information in interact2yes.3.
I think that your original model, with 504 levels each having two readings, is problematic because it potentially suffers from complete separation, especially given the small number of positives in your sample. By complete separation I mean that for a given combination of covariates all responses are the same (usually all 0 or all 1). You might want to try a different optimizer, i.e. something along the lines of glmerControl(optimizer = "bobyqa") or glmerControl(optimizer = "Nelder_Mead") (see the sketch below), but I would not be very confident that this would work either. In general, having some levels with one or two observations is not a problem, but when all of them are so low, things become computationally odd because you start having identifiability issues (e.g. you definitely cannot estimate any slopes, as a random slope plus a random intercept for every individual would give you one random effect per observation). You really lose a lot of degrees of freedom any way you count them. You do not show the glmmPQL output, but I suspect a very high variance of the random effect, which would strongly suggest that there is complete separation. (EDIT: You now show that output, and you can clearly see that the ratio is indeed very high.)
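A sketch of what switching the optimizer could look like; the names resp, mydata, and ID are placeholders, not taken from your post:

library(lme4)

## placeholder model: substitute your own response, covariates, and grouping factor
fit <- glmer(resp ~ ART.conc * Parity + (1 | ID),
             data = mydata, family = binomial,
             control = glmerControl(optimizer = "bobyqa"))
## or, e.g.: control = glmerControl(optimizer = "Nelder_Mead")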
You might want to consider using the function logistf from the package of the same name. logistf will fit a penalized logistic regression model that will probably alleviate the issue you experience; note that it will not use any random effects.
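A minimal sketch, with the same placeholder names as above:

library(logistf)

## penalized (Firth-type) logistic regression; no random effects involved
fit_f <- logistf(resp ~ ART.conc * Parity, data = mydata)
summary(fit_f)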
The rule of thumb for the lowest number of levels at which a random effect can be estimated reasonably is "5 or 6"; below that, your estimate of the standard deviation of that effect will really suffer. With this in mind, no: using Area, which has just four (4) levels, as a random effect is too aggressive. It probably makes more sense to use it as a fixed effect. In general, if I do not have at least 10-11 levels for a random effect, I am a bit worried about the validity of the random-effects assumption; we are estimating a Gaussian, after all.
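In lme4 terms that would mean something along these lines (placeholder names again):

## Area moved into the fixed-effects part; only ID kept as a random intercept
fit2 <- glmer(resp ~ ART.conc * Parity + Area + (1 | ID),
              data = mydata, family = binomial)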
Yes, you could use drop1, but be really careful not to start data-dredging (which is a bad thing). Take any variable selection procedure with a grain of salt. This issue is extensively discussed on Cross Validated; e.g. see the following two great threads for starters: here and here. Maybe it is more reasonable to include certain "insignificant" variables in a model so one can control for them, and then comment on why they came out insignificant, rather than just stripping the model down to the absolute bare bones where everything is uber-significant. In any case, I would strongly suggest using the bootstrap to get confidence intervals for the estimated parameters.
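lme4 makes the bootstrap part straightforward via its confint() method; a sketch, assuming fit is a fitted (g)lmer model as above:

## parametric-bootstrap CIs for the fixed effects and variance components
confint(fit, method = "boot", nsim = 500)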
Best Answer
The fact that firth=FALSE doesn't give similar results to glm is puzzling to me -- hopefully someone else can answer. As far as pl goes, though, you're almost always better off with profile confidence intervals. Wald confidence intervals assume that the (implicit) log-likelihood surface is locally quadratic, which is often a bad approximation. Other than being more computationally intensive, profile confidence intervals are always (? I would welcome counterexamples ?) more accurate. The "improved" p values you get from the Wald estimates are likely overoptimistic.

What I did, in outline (a code sketch follows below):

1. Generate data.
2. Replicate the fits.
3. Compare Wald (confint.default) with profile CIs for glm (in this case the profile intervals are actually narrower).
4. Compare with the glm2 package (just to make sure that glm isn't doing something wonky).
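A sketch of those four steps on simulated data; the simulation setup and all object names here are my own assumptions, not the original code:

library(logistf)
library(glm2)

## 1. generate data: a simple logistic-regression setup
set.seed(101)
n <- 100
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
d <- data.frame(x, y)

## 2. replicate: ordinary glm vs logistf with and without the Firth penalty
g1 <- glm(y ~ x, family = binomial, data = d)
f1 <- logistf(y ~ x, data = d, firth = FALSE, pl = TRUE)
f2 <- logistf(y ~ x, data = d, firth = TRUE)

## 3. compare Wald (confint.default) with profile CIs for the glm fit
confint.default(g1)  ## Wald intervals
confint(g1)          ## profile-likelihood intervals

## 4. compare with glm2, to check glm isn't doing something wonky
g2 <- glm2(y ~ x, family = binomial, data = d)
all.equal(coef(g1), coef(g2))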