Solved – How to make the rare events corrections described in King and Zeng (2001)

case-control-studylogisticrare-eventsunbalanced-classesweighted-regression

I have a dataset with a binary (survival) response variable and 3 explanatory variables (A = 3 levels, B = 3 levels, C = 6 levels). In this dataset, the data is well balanced, with 100 individuals per ABC category. I already studied the effect of these A, B, and C variables with this dataset; their effects are significant.

I have a subset. In each ABC category, 25 of the 100 individuals, of which approximately half are alive and half are dead (when less than 12 are alive or dead, the number was completed with the other category), were further investigated for a 4th variable (D). I see three problems here:

  1. I need to weight the data the rare events corrections described in King and Zeng (2001) to take into account the approximate 50% – 50% is not equal to 0/1 proportion in the bigger sample.
  2. This non-random sampling of 0 and 1 leads to a different probability for individuals to be sampled in each of the ABC categories, so I think I have to use true proportions from each category rather than the global proportion of 0/1 in the big sample.
  3. This 4th variable has 4 levels, and the data are really not balanced in these 4 levels (90% of the data is within 1 of these levels, say level D2).

I have read the King and Zeng (2001) paper carefully, as well as this CV question that led me to King and Zeng (2001) paper, and later this other one that led me to try the logistf package (I use R).
I tried to apply what I understood from King and Zheng (2001), but I am not sure what I did is right. I understood there are two methods:

  • For the prior correction method, I understood you only correct the intercept. In my case, the intercept is the A1B1C1 category, and in this category survival is 100%, so survival in the big dataset and the subset are the same, and therefore the correction changes nothing. I suspect this method should not apply to me anyway, because I do not have an overall true proportion, but a proportion for each category, and this method ignores that.
  • For the weighting method: I calculated wi, and from what I understood in the paper: "All
    researchers need to do is to calculate wi in Eq. (8), choose it as the weight in their computer program, and then run a logit model". So I first ran my glm as:

    glm(R~ A+B+C+D, weights=wi, data=subdata, family=binomial)
    

    I am not sure I should include A, B, and C as explanatory variables, since I normally expect them to have no effect on survival in this subsample (each category contains about the 50% dead and alive). Anyway, it should not change the output a lot if they are not significant. With this correction, I get a good fit for level D2 (the level with most of individuals), but not at all for others levels of D (D2 preponderates). See the top right graph:

Fits
Fits of a non-weighted glm model and of a glm model weighted with wi. Each dot represents one category. Proportion in the big dataset is the true proportion of 1 in the ABC category in the big dataset, Proportion in the sub dataset is the true proportion of 1 in the ABC category in the subdataset, and Model predictions are the predictions of glm models fitted with the subdataset. Each pch symbol represents a given level of D. Triangles are level D2.

Only later when seeing there is a logistf, I though this is perhaps not that simple. I am not sure now. When doing logistf(R~ A+B+C+D, weights=wi, data=subdata, family=binomial),
I get estimates, but the predict function does not work, and the default model test returns infinite chi squared values (except one) and all p-values = 0 (except 1).

Questions:

  • Did I properly understand King and Zeng (2001)? (How far am I from understanding it?)
  • In my glm fits, A, B, and C have significant effects. All this means is that I deparse a lot from the half / half proportions of 0 and 1 in my subset and differently in the different ABC categories – isn't that right?
  • Can I apply King and Zeng's (2001) weighting correction despite the fact that I have a value of tau and a value of $\bar y$ for each ABC category instead of global values?
  • Is it an issue that my D variable is so unbalanced, and if it is, how can I handle it? (Taking into account I will already have to weight for the rare event correction…Is "double weighting", i.e. weighting the weights, possible?)
    Thanks!

Edit: See what happens if I remove A, B and C from the models. I do not understand why there is such differences.

Fits2
Fits without A, B, and C as explanatory variables in models

Best Answer

The logistf() function do not implement rare event logistic regression, that is done by the relogit() function in the Zelig package, on CRAN. You should test that one!

Related Question