Using some dummy data similar to what you might have used:
set.seed(10)
dat <- data.frame(ART.conc = factor(sample(c("yes","no"), 10, replace = TRUE),
levels = c("yes","no")),
Parity = factor(sample(1:3, 10, replace = TRUE)))
and, for simplicity, form the interaction variable and store it in dat:
dat <- transform(dat, interact = interaction(ART.conc, Parity))
OK, now model.matrix will show us the conversion to terms in the model that represent the factor variables. First, the correct way:
> model.matrix( ~ ART.conc * Parity, data = dat)
(Intercept) ART.concno Parity2 Parity3 ART.concno:Parity2 ART.concno:Parity3
1 1 1 1 0 1 0
2 1 0 1 0 0 0
3 1 0 0 0 0 0
4 1 1 1 0 1 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 0 0 0
8 1 0 0 0 0 0
9 1 1 1 0 1 0
10 1 0 0 1 0 0
And now the way you did it:
> model.matrix( ~ interact + ART.conc + Parity, data = dat)
(Intercept) interactno.1 interactyes.2 interactno.2 interactyes.3
1 1 0 0 1 0
2 1 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 1 0 1 0 0
6 1 0 1 0 0
7 1 0 0 0 0
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 0 0 1
interactno.3 ART.concno Parity2 Parity3
1 0 1 1 0
2 0 0 1 0
3 0 0 0 0
4 0 1 1 0
5 0 0 1 0
6 0 0 1 0
7 0 0 0 0
8 0 0 0 0
9 0 1 1 0
10 0 0 0 1
The obvious thing is that these two are not the same because there are far more terms in the design matrix for the method using interaction() than for the correct way.
Now we also note why, in the model with the interaction() term, you get NA values for some coefficients. Compare the interactyes.3 column with the Parity3 column in the design matrix from the interaction() model. Look familiar? So interaction() has created information that duplicates the main effects terms you added to the formula alongside the interaction() variable. At the very least, you don't need to specify the main effects terms alongside the interaction() variable in that version of the model. However, it'd still be wrong even if you fixed that.
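To see the consequence directly, here is a quick illustration; the response y is made up purely so that a model can be fitted:

## purely illustrative: invent a response for the dummy data
set.seed(42)
dat$y <- rnorm(nrow(dat))

## the interaction()-based formula yields NA (aliased) coefficients
coef(lm(y ~ interact + ART.conc + Parity, data = dat))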
Consider
> with(dat, levels(interact))
[1] "yes.1" "no.1" "yes.2" "no.2" "yes.3" "no.3"
> with(dat, table(interact))
interact
yes.1 no.1 yes.2 no.2 yes.3 no.3
3 0 3 3 1 0
What interaction() has done (by default, I'll add) is retain all possible combinations of the levels of the two factors; that's 6, even though we only observed 4 of them in this dummy data set. This shows up in the model matrix as a column of 0s for both interactno.1 and interactno.3 if you just include the interaction() term in the model:
> model.matrix( ~ interact, data = dat)
(Intercept) interactno.1 interactyes.2 interactno.2 interactyes.3
1 1 0 0 1 0
2 1 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 1 0 1 0 0
6 1 0 1 0 0
7 1 0 0 0 0
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 0 0 1
interactno.3
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
Hence we can't identify the coefficients for those two all-0 columns, so you'll still get NAs.

In summary, the two ways you mention are not the same model, and you'll cause yourself a world of pain if you try to fit interactions using the interaction() function.
Hopefully by now you realize that these forms aren't the same and why, and, more importantly, why it is a bad idea to circumvent R's formula interface.
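For completeness, the * operator used above is just shorthand for the two main effects plus their : interaction; you can verify that both spellings produce the same design matrix:

m1 <- model.matrix( ~ ART.conc * Parity, data = dat)
m2 <- model.matrix( ~ ART.conc + Parity + ART.conc:Parity, data = dat)
all.equal(m1, m2, check.attributes = FALSE)  ## TRUE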
Aside
I mentioned the default behaviour of interaction(). You can do better by adding drop = TRUE to the call:
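(For reference, interact2 below is assumed to be built the same way as interact, just with drop = TRUE added:)

dat <- transform(dat, interact2 = interaction(ART.conc, Parity, drop = TRUE))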
> with(dat, levels(interact2))
[1] "yes.1" "yes.2" "no.2" "yes.3"\
Now we only have the four observed levels. Then you can do:
> model.matrix( ~ ART.conc + Parity + interact2, data = dat)
(Intercept) ART.concno Parity2 Parity3 interact2yes.2 interact2no.2
1 1 1 1 0 0 1
2 1 0 1 0 1 0
3 1 0 0 0 0 0
4 1 1 1 0 0 1
5 1 0 1 0 1 0
6 1 0 1 0 1 0
7 1 0 0 0 0 0
8 1 0 0 0 0 0
9 1 1 1 0 0 1
10 1 0 0 1 0 0
interact2yes.3
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 1
But this still isn't right, because you have one more term than needed; again, R can't do the right thing here and remove one of the combinations of levels, which would get absorbed into the Intercept term. Note the Parity3 and interact2yes.3 terms; they are coded the same way; the Parity3 term works with the Intercept term to give you the information in interact2yes.3.
I think that your original model, with 504 levels each having two readings, is problematic because it potentially suffers from complete separation, especially given the small number of positives in your sample. By complete separation I mean that for a given combination of covariates all responses are the same (usually all 0 or all 1). You might want to try a different optimizer, i.e. something along the lines of glmerControl(optimizer = "bobyqa") or glmerControl(optimizer = "Nelder_Mead") (see the sketch below), but I would not be very confident that this would work either. In general, having some levels with one or two observations is not a problem, but when all of them are so low, things become computationally odd because you start having identifiability issues (e.g. you definitely cannot estimate any slopes, as a random slope plus a random intercept for every individual would give you one random effect per observation). You really lose a lot of degrees of freedom any way you count them. You do not show the glmmPQL output, but I suspect a very high variance of the random effect, which would strongly suggest that there is complete separation. (EDIT: You now show that output, and you can clearly see that the ratio is indeed very high.)
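A sketch of what switching the optimizer could look like; the names resp, mydata, and ID are placeholders, not taken from your post:

library(lme4)

## placeholder model: substitute your own response, covariates, and grouping factor
fit <- glmer(resp ~ ART.conc * Parity + (1 | ID),
             data = mydata, family = binomial,
             control = glmerControl(optimizer = "bobyqa"))
## or, e.g.: control = glmerControl(optimizer = "Nelder_Mead")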
You might want to consider using the function logistf from the package of the same name. logistf will fit a penalized logistic regression model that will probably alleviate the issue you experience; note that it will not use any random effects.
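A minimal sketch, with the same placeholder names as above:

library(logistf)

## penalized (Firth-type) logistic regression; no random effects involved
fit_f <- logistf(resp ~ ART.conc * Parity, data = mydata)
summary(fit_f)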
The rule of thumb for the lowest number of levels at which a random effect can be estimated reasonably is "5 or 6"; below that, your estimate of the standard deviation of that effect will really suffer. With this in mind, no: using Area, which has just four (4) levels, as a random effect is too aggressive. It probably makes more sense to use it as a fixed effect. In general, if I do not have at least 10-11 levels for a random effect, I am a bit worried about the validity of the random-effects assumption; we are estimating a Gaussian, after all.
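In lme4 terms that would mean something along these lines (placeholder names again):

## Area moved into the fixed-effects part; only ID kept as a random intercept
fit2 <- glmer(resp ~ ART.conc * Parity + Area + (1 | ID),
              data = mydata, family = binomial)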
Yes, you could use drop1, but be really careful not to start data-dredging (which is a bad thing). Take any variable selection procedure with a grain of salt. This issue is extensively discussed on Cross Validated; e.g. see the following two great threads for starters: here and here. Maybe it is more reasonable to include certain "insignificant" variables in a model so one can control for them, and then comment on why they came out insignificant, rather than just stripping the model down to the absolute bare bones where everything is uber-significant. In any case, I would strongly suggest using the bootstrap to get confidence intervals for the estimated parameters.
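lme4 makes the bootstrap part straightforward via its confint() method; a sketch, assuming fit is a fitted (g)lmer model as above:

## parametric-bootstrap CIs for the fixed effects and variance components
confint(fit, method = "boot", nsim = 500)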
Best Answer
The fact that firth=FALSE doesn't give similar results to glm is puzzling to me -- hopefully someone else can answer. As far as pl goes, though, you're almost always better off with profile confidence intervals. Wald confidence intervals assume that the (implicit) log-likelihood surface is locally quadratic, which is often a bad approximation. Other than being more computationally intensive, profile confidence intervals are always (? I would welcome counterexamples ?) more accurate. The "improved" p values you get from the Wald estimates are likely overoptimistic.

What I did, in outline (a code sketch follows below):

1. Generate data.
2. Replicate the fits.
3. Compare Wald (confint.default) with profile CIs for glm (in this case the profile intervals are actually narrower).
4. Compare with the glm2 package (just to make sure that glm isn't doing something wonky).
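A sketch of those four steps on simulated data; the simulation setup and all object names here are my own assumptions, not the original code:

library(logistf)
library(glm2)

## 1. generate data: a simple logistic-regression setup
set.seed(101)
n <- 100
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
d <- data.frame(x, y)

## 2. replicate: ordinary glm vs logistf with and without the Firth penalty
g1 <- glm(y ~ x, family = binomial, data = d)
f1 <- logistf(y ~ x, data = d, firth = FALSE, pl = TRUE)
f2 <- logistf(y ~ x, data = d, firth = TRUE)

## 3. compare Wald (confint.default) with profile CIs for the glm fit
confint.default(g1)  ## Wald intervals
confint(g1)          ## profile-likelihood intervals

## 4. compare with glm2, to check glm isn't doing something wonky
g2 <- glm2(y ~ x, family = binomial, data = d)
all.equal(coef(g1), coef(g2))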