I am running a rare events logistic regression on a binary dependent variable. I have 538 observations and only 10 events (so 528 values of 0 and 10 of 1), which is why I chose to use a rare events logistic regression.
When I run the regression, one of the independent variables in the model has a huge coefficient (around 25,000,000) and is found to be significant. The range on the independent variable is 0 to 1. Is this a problem? Could anyone explain why this is happening?
When I run the same model with an ordinary logistic regression, this variable is insignificant.
I'm not sure what is happening. Any advice would be appreciated.
Best Answer
In all likelihood, you have undiagnosed complete separation / perfect prediction in your model: a combination of the explanatory variables (if you used interactions), or more likely a single explanatory variable, uniquely identifies some of the rare events. Say that whenever $x > 10$ the outcome is always a one, while for $x < 10$ there can be a mix of zeroes and ones. Then the greater the coefficient on $x$, the closer the predicted probability gets to 1 for the cases with $x > 10$. Since their contribution to the likelihood is $\ln \hat p_i$, maximum likelihood keeps pushing that coefficient up as far as it can (while keeping the other coefficients at bay so that the probabilities for $x < 10$ remain OK), and the sky is the limit... except that the finite precision of computer arithmetic prevents that from technically happening, so the optimizer will stop somewhere around $\hat p_i = 1 - 10^{-8}$ or so. This is a known problem for `glm` in R; Stata diagnoses it and drops the perfectly predicted observations.

You need to identify which of your explanatory variables perfectly predicts the outcome, and do something about it -- exclude it from the regression, find another measure of the underlying concept, etc. Another solution is to use Firth logistic regression, which is the frequentist version of Bayesian logistic regression with the Jeffreys prior, or, more loosely, a kind of ridge regression for binary outcomes.
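To see the mechanism concretely, here is a minimal pure-Python sketch on made-up data (not your data). The predictor `x` is a 0/1 indicator, as in your model, and every case with `x = 1` is an event, so the slope's maximum-likelihood estimate is infinite; plain gradient ascent on the log-likelihood shows the coefficient climbing forever instead of converging:

```python
import math

# Toy data with quasi-complete separation: every x = 1 case is an event,
# while the x = 0 group has a mix of zeroes and ones.
x = [0] * 10 + [1] * 3
y = [0] * 8 + [1] * 2 + [1] * 3

def fit_logit(x, y, steps, lr=0.1):
    """Fit intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted probability
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# The slope never stops growing: every x = 1 case has y = 1, so its
# fitted probability is always below 1 and the score for b1 stays positive.
_, slope_short = fit_logit(x, y, steps=500)
_, slope_long = fit_logit(x, y, steps=5000)
print(slope_short, slope_long)  # the longer we run, the bigger the slope
```

In a real fitting routine the iteration stops at some convergence tolerance or iteration cap, which is why you see an arbitrary, enormous finite coefficient (and a meaningless standard error) rather than an error message.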