Solved – Strategy to deal with rare events logistic regression

logistic, rare-events

I would like to study rare events in a finite population. Since I am unsure which strategy is best suited, I would appreciate tips and references on this matter, although I am well aware it has been covered extensively. I just don't really know where to begin.

My problem is a political science one: I have a finite population comprising 515,843 records, each associated with a binary dependent variable, with 513,334 "0"s and 2,509 "1"s. I can call my "1"s rare events since they account for only 0.49% of the population.

I have a set of around 10 independent variables I would like to build a model with to explain the presence of "1"s. Like many of us, I read King & Zeng's 2001 article on rare-events correction. Their approach was to use a case-control design to reduce the number of "0"s, then apply a correction to the intercept.
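If I understand their prior correction right, after fitting the logit on the subsample one replaces the estimated intercept $\hat\beta_0$ with

$$\hat\beta_0 - \ln\left[\left(\frac{1-\tau}{\tau}\right)\left(\frac{\bar{y}}{1-\bar{y}}\right)\right],$$

where $\tau$ is the fraction of "1"s in the population and $\bar{y}$ the fraction of "1"s in the subsample, while the slope coefficients stay as estimated.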

However, this post says that King & Zeng's correction is not necessary if the data already cover the whole population, which is my case. Therefore, I have to use the classical logit model. Unfortunately for me, although I obtain good, significant coefficients, my model is completely useless in terms of prediction (it fails to predict 99.48% of my "1"s).

After reading King & Zeng's article, I tried a case-control design, keeping all the "1"s and only 10% of the "0"s. With almost the same coefficients, the model was able to predict almost one third of the "1"s when applied to the full population. Of course, there are a lot of false positives.
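For concreteness, here is a minimal sketch of the subsample-then-correct procedure I mean (Python on simulated data; the covariates, coefficients, and use of statsmodels are my own illustration, not King & Zeng's code):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the full population (the real one has 515,843
# records and ~0.49% "1"s; these covariates and coefficients are made up).
n = 515_843
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
eta = -6.0 + 0.8 * X["x1"] + 0.5 * X["x2"] - 0.3 * X["x3"]
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-eta))), name="y")
df = pd.concat([X, y], axis=1)

# Case-control subsample: keep every "1", draw 10% of the "0"s.
sub = pd.concat([df[df["y"] == 1],
                 df[df["y"] == 0].sample(frac=0.10, random_state=0)])

fit = sm.Logit(sub["y"], sm.add_constant(sub[["x1", "x2", "x3"]])).fit(disp=0)
params = fit.params.copy()

# King & Zeng's prior correction: pull the intercept back toward the known
# population event rate tau from the inflated subsample rate ybar.
tau, ybar = df["y"].mean(), sub["y"].mean()
params["const"] -= np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Corrected predicted probabilities for the whole population.
p = 1 / (1 + np.exp(-sm.add_constant(df[["x1", "x2", "x3"]]).dot(params)))
```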

I thus have three questions I would like to ask you:

1) If King & Zeng's approach is inadvisable when you have full knowledge of the population, why do they use a setting in their article where the population is fully known to prove their point?

2) If I have good, significant coefficients in a logit regression but very poor predictive power, does that mean the variation explained by these variables is meaningless?

3) What is the best approach to deal with rare events? I read about King's relogit model, Firth's approach, exact logistic regression, etc. I must confess I am lost among all these solutions.

Best Answer

(1) If you've "full knowledge of a population", why do you need a model to make predictions? I suspect you're implicitly considering your records as a sample from a hypothetical super-population (see here & here). So should you throw away observations from your sample? No. King & Zeng don't advocate this:

[...] in fields like international relations, the number of observable 1’s (such as wars) is strictly limited, so in most applications it is best to collect all available 1’s or a large sample of them. The only real decision then is how many 0’s to collect as well. If collecting 0’s is costless, we should collect as many as we can get, since more data are always better.

The situation I think you're talking about is the example "Selecting on $Y$ in Militarized Interstate Dispute Data". K.&Z. use it to, well, prove their point: in this example, if a researcher had tried to economize by collecting all the 1's & a proportion of the 0's, their estimates would be similar to those of one who'd sampled all available 1's & 0's. How else would you illustrate that?

(2) The main issue here is the use of an improper scoring rule to assess your model's predictive performance. Suppose your model were true, so that for any individual you knew the probability of a rare event, say being bitten by a snake in the next month. What more do you learn by stipulating an arbitrary probability cut-off & predicting that those above it will be bitten & those below it won't be? If you make the cut-off 50% you'll likely predict no one will get bitten. If you make it low enough you can predict everyone will get bitten. So what? Sensible application of a model requires discrimination (who should be given the only vial of anti-venom?) or calibration (for whom is it worth buying boots, given their cost relative to that of a snake-bite?).
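To see the contrast in numbers, here's a small simulation sketch (the event rate & distributions are invented): a perfectly calibrated model for a roughly 0.5% event flags no one at a 50% cut-off, yet proper scoring rules such as the Brier score & log loss still prefer it to a constant base-rate predictor:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(1)

# Invented true event probabilities for a rare outcome (~0.5% on average).
p_true = rng.beta(0.2, 40.0, size=100_000)
y = rng.binomial(1, p_true)

# A perfectly calibrated "model" reporting the true probabilities still
# flags no one at a 50% cut-off: accuracy looks great but is useless.
pred = (p_true > 0.5).astype(int)
print("events flagged at 0.5 cut-off:", pred.sum())  # almost certainly 0
print("accuracy:", (pred == y).mean())               # ~0.995 regardless

# Proper scoring rules evaluate the probabilities themselves and
# favour the calibrated model over a constant base-rate predictor.
p_const = np.full_like(p_true, y.mean())
print("Brier, true probs:", brier_score_loss(y, p_true))
print("Brier, base rate: ", brier_score_loss(y, p_const))
print("log loss, true probs:", log_loss(y, p_true))
print("log loss, base rate: ", log_loss(y, p_const))
```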
