Solved – How to make the rare events corrections described in King and Zeng (2001)

case-control-study, logistic, rare-events, unbalanced-classes, weighted-regression

I have a dataset with a binary (survival) response variable and 3 explanatory variables (A = 3 levels, B = 3 levels, C = 6 levels). In this dataset, the data is well balanced, with 100 individuals per ABC category. I already studied the effect of these A, B, and C variables with this dataset; their effects are significant.

I have a subset. In each ABC category, 25 of the 100 individuals, of which approximately half are alive and half are dead (when fewer than 12 were alive or dead, the number was made up with individuals from the other category), were further investigated for a 4th variable (D). I see three problems here:

I need to weight the data using the rare events corrections described in King and Zeng (2001), to take into account that the approximate 50%/50% split in the subset is not equal to the 0/1 proportion in the bigger sample.
This non-random sampling of 0 and 1 leads to a different probability for individuals to be sampled in each of the ABC categories, so I think I have to use true proportions from each category rather than the global proportion of 0/1 in the big sample.
This 4th variable has 4 levels, and the data are really not balanced in these 4 levels (90% of the data is within 1 of these levels, say level D2).

I have read the King and Zeng (2001) paper carefully, as well as this CV question that led me to King and Zeng (2001) paper, and later this other one that led me to try the logistf package (I use R).
I tried to apply what I understood from King and Zeng (2001), but I am not sure what I did is right. I understood there are two methods:

For the prior correction method, I understood you only correct the intercept. In my case, the intercept is the A1B1C1 category, and in this category survival is 100%, so survival in the big dataset and the subset are the same, and therefore the correction changes nothing. I suspect this method should not apply to me anyway, because I do not have an overall true proportion, but a proportion for each category, and this method ignores that.
For the weighting method: I calculated w_i, and from what I understood in the paper: "All researchers need to do is to calculate w_i in Eq. (8), choose it as the weight in their computer program, and then run a logit model". So I first ran my glm as:
```
glm(R~ A+B+C+D, weights=wi, data=subdata, family=binomial)
```
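For concreteness, here is a minimal sketch of how the Eq. (8) weights, w_i = w1·Y_i + w0·(1 − Y_i) with w1 = τ/ȳ and w0 = (1 − τ)/(1 − ȳ), could be computed per category rather than globally. Everything in it is an assumption for illustration: the data are simulated, the names `full`, `sub`, `stratum`, and `take` are hypothetical, the subsampling is simplified (no topping-up from the other class), and whether per-category τ is the right choice is part of what is being asked here. `quasibinomial` is used only to avoid the non-integer-weights warning; it gives the same point estimates as `binomial`.

```r
set.seed(1)

# Hypothetical "big" dataset: 3 strata of 100 individuals with different survival
full <- data.frame(
  stratum = rep(c("s1", "s2", "s3"), each = 100),
  R       = c(rbinom(100, 1, 0.9), rbinom(100, 1, 0.6), rbinom(100, 1, 0.2))
)

# tau: true proportion of 1s in each stratum of the big dataset
tau <- tapply(full$R, full$stratum, mean)

# Hypothetical choice-based subsample: up to 12 ones and 12 zeros per stratum
take <- function(d, n_each = 12) {
  ones  <- which(d$R == 1)
  zeros <- which(d$R == 0)
  idx <- c(ones[sample.int(length(ones), min(n_each, length(ones)))],
           zeros[sample.int(length(zeros), min(n_each, length(zeros)))])
  d[idx, ]
}
sub <- do.call(rbind, lapply(split(full, full$stratum), take))

# ybar: proportion of 1s in each stratum of the subsample
ybar <- tapply(sub$R, sub$stratum, mean)

# Eq. (8), applied with stratum-specific tau and ybar
s      <- as.character(sub$stratum)
tau_i  <- as.numeric(tau[s])
ybar_i <- as.numeric(ybar[s])
sub$wi <- ifelse(sub$R == 1, tau_i / ybar_i, (1 - tau_i) / (1 - ybar_i))

# Sanity check: the weighted proportion of 1s per stratum recovers tau
wprop <- tapply(sub$wi * sub$R, sub$stratum, sum) / tapply(sub$wi, sub$stratum, sum)
cbind(tau = tau, ybar = ybar, weighted = wprop)

# Weighted logit on the subsample
fit <- glm(R ~ stratum, weights = wi, family = quasibinomial, data = sub)
```

With only the stratum in the model, the weighted fit reproduces τ for each stratum, which is a quick way to check that the weights were built as intended before adding further covariates.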
I am not sure I should include A, B, and C as explanatory variables, since I would normally expect them to have no effect on survival in this subsample (each category contains about 50% dead and 50% alive). Anyway, it should not change the output much if they are not significant. With this correction, I get a good fit for level D2 (the level with most of the individuals), but not at all for the other levels of D (D2 predominates). See the top right graph:

Fits of a non-weighted glm model and of a glm model weighted with w_i. Each dot represents one category. Proportion in the big dataset is the true proportion of 1 in the ABC category in the big dataset, Proportion in the sub dataset is the true proportion of 1 in the ABC category in the subdataset, and Model predictions are the predictions of glm models fitted with the subdataset. Each pch symbol represents a given level of D. Triangles are level D2.

Only later, on seeing that there is a logistf package, did I think this is perhaps not that simple. I am not sure now. When running logistf(R~ A+B+C+D, weights=wi, data=subdata, family=binomial),
I get estimates, but the predict function does not work, and the default model test returns infinite chi-squared values (except one) and all p-values = 0 (except one).

Questions:

Did I properly understand King and Zeng (2001)? (How far am I from understanding it?)
In my glm fits, A, B, and C have significant effects. All this means is that I depart a lot from the half / half proportions of 0 and 1 in my subset, and differently in the different ABC categories – isn't that right?
Can I apply King and Zeng's (2001) weighting correction despite the fact that I have a value of tau and a value of $\bar y$ for each ABC category instead of global values?
Is it an issue that my D variable is so unbalanced, and if it is, how can I handle it? (Taking into account I will already have to weight for the rare event correction…Is "double weighting", i.e. weighting the weights, possible?)
Thanks!

Edit: See what happens if I remove A, B and C from the models. I do not understand why there is such differences.

Fits without A, B, and C as explanatory variables in models

Best Answer

The logistf() function does not implement rare event logistic regression; that is done by the relogit() function in the Zelig package, on CRAN. You should test that one!

Related Solutions

Binary Outcome Modelling – Adjusting for Varying Census Intervals

An alternative and slightly easier approach is to use a complementary log-log link (cloglog), which estimates the log-hazard rather than the log-odds of outcomes such as mortality. Copying from rpubs:

A very common situation in ecology (and elsewhere) is a survival/binary-outcome model where individuals (each measured once) differ in their exposure. The classical approach to this problem is to use a complementary log-log link. The complementary log-log or "cloglog" function is $C(\mu)=\log(-\log(1-\mu))$; its inverse is $\mu=C^{-1}(\eta)=1-\exp(-\exp(\eta))$. Thus if we expect mortality $\mu_0$ over a period $\Delta t=1$ and the linear predictor is $\eta=C(\mu_0)$, then $$ C^{-1}(\eta+\log\Delta t)=1-\exp(-\exp(\eta) \cdot \Delta t) $$ Since $\exp(\eta)=-\log(1-\mu_0)$, this is equal to $1-(1-\mu_0)^{\Delta t}$, which is what we want.
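The identity can be checked numerically. This is a small sketch, taking η to be the cloglog transform of μ0; the values of `mu0` and `delta_t` are arbitrary illustrative choices, and `cloglog` / `inv_cloglog` are helper names defined here rather than library functions.

```r
# Complementary log-log link and its inverse
cloglog     <- function(mu)  log(-log(1 - mu))
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

mu0     <- 0.1   # assumed mortality over one unit of time
delta_t <- 3     # assumed exposure, in the same time units

eta <- cloglog(mu0)
lhs <- inv_cloglog(eta + log(delta_t))  # mortality over delta_t via the log-exposure offset
rhs <- 1 - (1 - mu0)^delta_t            # 1 - P(surviving delta_t consecutive periods)

c(lhs = lhs, rhs = rhs)                 # both are 0.271
all.equal(lhs, rhs)
```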

The function $\exp(-\exp(x))$ is called the Gompertz function (it is also the CDF of the extreme value distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog link to the survival, rather than the mortality, probability) is also called a gompit (or extreme value) regression.

To use this approach in R, specify family=binomial(link="cloglog") and add a term of the form offset(log(exposure)) to the formula (alternatively, some modeling functions take offset as a separate argument). For example,

```
glm(surv~x1+x2+offset(log(exposure)),
    family=binomial(link="cloglog"),
    data=my_surv_data)
```

where exposure is the length of time for which a given individual is exposed to the possibility of dying/failing (e.g., census interval or time between observations or total observation time).

You may also want to consider checking the model where log(exposure) is included as a covariate rather than an offset - this makes the log-hazard have a $\beta_t \log(t)$ term, or equivalently makes the hazard proportional to $t^{\beta_t}$ rather than to $t$ (I believe this makes the survival distribution Weibull rather than exponential, but I haven't checked that conclusion carefully).
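As a rough illustration of the offset-versus-covariate comparison, here is a sketch on simulated data. The data-generating model (constant hazard), the variable names `died`, `x1`, `exposure`, and the parameter values are all assumptions made up for this example; the response is coded 1 = event (death), so the cloglog model is on the mortality probability.

```r
set.seed(42)
n        <- 500
x1       <- rnorm(n)
exposure <- runif(n, 1, 10)              # hypothetical exposure times

# Constant-hazard truth: P(death) = 1 - exp(-exp(eta) * exposure)
eta  <- -3 + 0.5 * x1
died <- rbinom(n, 1, 1 - exp(-exp(eta) * exposure))
d    <- data.frame(died, x1, exposure)

# log(exposure) as an offset: its coefficient is fixed at 1
fit_offset <- glm(died ~ x1 + offset(log(exposure)),
                  family = binomial(link = "cloglog"), data = d)

# log(exposure) as a covariate: its coefficient is estimated
fit_cov <- glm(died ~ x1 + log(exposure),
               family = binomial(link = "cloglog"), data = d)

coef(fit_cov)["log(exposure)"]           # compare with 1, the value the offset imposes
AIC(fit_offset, fit_cov)

# Predictions for exposure times not in the original data
p_new <- predict(fit_offset,
                 newdata = data.frame(x1 = 0, exposure = c(1, 5)),
                 type = "response")
```

Because the two models are nested, a likelihood-ratio test (`anova(fit_offset, fit_cov, test = "Chisq")`) is another way to compare them.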

Advantages of using this approach rather than Schaffer's power-logistic method:

because the exposure time is incorporated in the offset rather than in the definition of the link function itself, R handles this a bit more gracefully (for example, it will be easier to generate predictions with different exposure times from the original data set).
it is slightly older and more widely used in statistics; Googling "cloglog logistic regression" or searching for cloglog on CrossValidated will bring you to more resources.

The only disadvantage I can think of off the top of my head is that people in the nest survival world are more used to Schaffer's method. For a large enough data set you might be able to tell which link actually fits the data better (e.g. fit with both approaches and compare AIC values), but in general I doubt there's very much difference.

Solved – Rare event logistic regression bias: how to simulate the underestimated p’s with a minimal example

This is an interesting question - I have done a few simulations that I post below in the hope that this stimulates further discussion.

First of all, a few general comments:

The paper you cite is about rare-event bias. What was not clear to me before (also with respect to comments that were made above) is if there is anything special about cases where you have 10/10000 as opposed to 10/30 observations. However, after some simulations, I would agree there is.
A problem I had in mind (I have encountered this often, and there was recently a paper in Methods in Ecology and Evolution on that, I couldn't find the reference though) is that you can get degenerate cases with GLMs in small-data situations, where the MLE is FAAAR away from the truth, or even at - / + infinity (due to the nonlinear link I suppose). It's not clear to me how one should treat these cases in the bias estimation, but from my simulations I would say they seem key for the rare-event bias. My intuition would be to remove them, but then it's not quite clear how far out they have to be to be removed. Maybe something to keep in mind for bias-correction.
Also, these degenerate cases seem prone to cause numerical problems (I have therefore increased maxit in the glm function, but one could think about increasing epsilon as well to make sure one actually reports the true MLE).
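To make the degenerate case concrete, here is a small sketch of what happens when a sample contains no events at all, which is quite possible in a rare-event setting. The all-zero response is constructed by hand for illustration, and `is_degenerate` is a hypothetical helper with an arbitrary tolerance, not a standard diagnostic.

```r
set.seed(1)
N    <- 4000
pred <- runif(N, min = -1, max = 1)
y    <- rep(0, N)                        # a sample that happens to contain no events

# The MLE of the intercept is -Inf here; glm stops at a large negative value
fit <- suppressWarnings(
  glm(y ~ pred, family = binomial, control = list(maxit = 300))
)

coef(fit)                                # intercept is large and negative
max(fitted(fit))                         # fitted probabilities are numerically zero

# Hypothetical flag: every fitted probability is numerically 0 or 1
is_degenerate <- function(fit, tol = 1e-8) {
  p <- fitted(fit)
  all(p < tol | p > 1 - tol)
}
is_degenerate(fit)
```

A flag like this could be recorded for each replicate in a simulation, so that degenerate fits are identified by what they are rather than by an arbitrary cutoff on the intercept.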

Anyway, here is some code that calculates the difference between estimates and truth for intercept, slope and predictions in a logistic regression, first for a low sample size / moderate incidence situation:

```
set.seed(123)
replicates = 1000
N= 40
slope = 2 # slope (linear scale)
intercept = - 1 # intercept (linear scale)

bias <- matrix(NA, nrow = replicates, ncol = 3)
incidencePredBias <- rep(NA, replicates)

for (i in 1:replicates){
  pred = runif(N,min=-1,max=1)
  linearResponse = intercept + slope*pred
  data = rbinom(N, 1, plogis(linearResponse))
  fit <- glm(data ~ pred, family = 'binomial', control = list(maxit = 300))
  bias[i,1:2] = fit$coefficients - c(intercept, slope)
  bias[i,3] = mean(predict(fit,type = "response")) - mean(plogis(linearResponse))
}

par(mfrow = c(1,3))
text = c("Bias intercept", "Bias slope", "Bias prediction")

for (i in 1:3){
  hist(bias[,i], breaks = 100, main = text[i])
  abline(v=mean(bias[,i]), col = "red", lwd = 3)
}

apply(bias, 2, mean)
apply(bias, 2, sd) / sqrt(replicates)
```

The resulting bias and standard errors for intercept, slope and prediction are

```
-0.120429315  0.296453122 -0.001619793
 0.016105833  0.032835468  0.002040664
```

I would conclude that there is pretty good evidence for a slight negative bias in the intercept, and a slight positive bias in the slope, although a look at the plotted results shows that the bias is small compared to the variance of the estimated values.

If I'm setting the parameters to a rare-event situation

```
N= 4000
slope = 2 # slope (linear scale)
intercept = - 10 # intercept (linear scale)
```

I'm getting a larger bias for the intercept, but still NONE on the prediction

```
-1.716144e+01  4.271145e-01 -3.793141e-06
 5.039331e-01  4.806615e-01  4.356062e-06
```
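To rerun the experiment with other settings without editing globals, the loop can be wrapped in a function. This is only a sketch: `simulate_bias` is a hypothetical name, the number of replicates is reduced to keep it quick, and it is not expected to reproduce the exact figures quoted above.

```r
# Bias of intercept, slope, and mean prediction over repeated simulated datasets
simulate_bias <- function(N, intercept, slope, replicates = 1000) {
  bias <- matrix(NA_real_, nrow = replicates, ncol = 3,
                 dimnames = list(NULL, c("intercept", "slope", "prediction")))
  for (i in seq_len(replicates)) {
    pred <- runif(N, min = -1, max = 1)
    linearResponse <- intercept + slope * pred
    y <- rbinom(N, 1, plogis(linearResponse))
    # Warnings about fitted probabilities of 0 or 1 are expected for rare events
    fit <- suppressWarnings(
      glm(y ~ pred, family = binomial, control = list(maxit = 300))
    )
    bias[i, 1:2] <- coef(fit) - c(intercept, slope)
    bias[i, 3]   <- mean(fitted(fit)) - mean(plogis(linearResponse))
  }
  bias
}

set.seed(123)
bias_rare <- simulate_bias(N = 4000, intercept = -10, slope = 2, replicates = 200)

colMeans(bias_rare)                               # mean bias per quantity
apply(bias_rare, 2, sd) / sqrt(nrow(bias_rare))   # standard error of the mean bias
```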

In the histogram of the estimated values, we see the phenomenon of degenerate parameter estimates (if we should call them like that)

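Let's remove all rows for which the intercept bias is below -20

```
keep <- bias[,1] > -20
apply(bias[keep,], 2, mean)
# Divide by the number of rows actually kept. The original call used
# length(bias[,1] > -10), which is always equal to `replicates`, so the
# standard errors printed below correspond to dividing by sqrt(replicates).
apply(bias[keep,], 2, sd) / sqrt(sum(keep))
```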

The bias decreases, and things become a bit more clear in the figures - parameter estimates are clearly not normally distributed. I wonder what that means for the validity of the CIs that are reported.

```
-0.6694874106  1.9740437782  0.0002079945
1.329322e-01 1.619451e-01 3.242677e-06
```

I would conclude the rare event bias on the intercept is driven by rare events themselves, namely those rare, extremely small estimates. Not sure if we want to remove them or not, not sure what the cutoff would be.

An important thing to note though is that, either way, there seems to be no bias on predictions at the response scale - the link function simply absorbs these extremely small values.
