Solved – Logistic regression with low event rate

I'm trying to build a logistic regression model to predict 90+ days past due (DPD) events. The data set has 96,000 observations, with an event rate of 6%. We ran the entire data set through the information value (IV) process and converted the variables into Weight of Evidence (WOE) bins. When I build the model on 60% of the data (the development sample; the other 40% is held out for validation), the logistic regression gives me 7 significant variables and a very high Wald statistic for the intercept. The results from the model are:

  1. The overall model is significant (P<0.0001).
  2. The area under the ROC curve is 0.84.
  3. The Hosmer and Lemeshow (HL) Test is significant (P<0.0001), which implies the model does not adequately fit the data.
  4. The accuracy of this model is poor: the correct classification rate is 21%.
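
For readers unfamiliar with the test in item 3, here is a minimal sketch of how a Hosmer-Lemeshow statistic is usually computed: sort the records by predicted probability, split them into groups, and compare observed with expected event counts in each group. The function names and the toy data are assumptions made for illustration, not the poster's code, and the p-value helper uses a closed form that only holds for an even number of degrees of freedom (10 groups give 8).

```python
import math

def hosmer_lemeshow(y_true, p_pred, groups=10):
    """Return (statistic, degrees_of_freedom) for a Hosmer-Lemeshow style test.

    Assumes at least `groups` records and that no group has a mean
    predicted probability of exactly 0 or 1.
    """
    pairs = sorted(zip(p_pred, y_true))        # order records by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)    # events actually seen in the group
        expected = sum(p for p, _ in chunk)    # events the model predicts
        mean_p = expected / size
        # Combined chi-square contribution of events and non-events
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Made-up toy data: two risk bands, the lower one under-predicted by the model.
toy_p = [0.2] * 10 + [0.8] * 10
toy_y = [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
toy_stat, toy_df = hosmer_lemeshow(toy_y, toy_p, groups=2)
print("HL statistic on toy data:", round(toy_stat, 3))
```

A large statistic relative to the chi-square distribution with `groups - 2` degrees of freedom means predicted and observed event counts disagree in at least some risk bands.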

Please tell me your views on this, specifically:
  - Are there any methods that can help improve performance on the HL Test, since we need to use the predicted probability?
  - Can I exclude a few good loans based on some business rules?
  - Is there a different methodology we should try?

I'm fairly new to credit risk modeling and look forward to your views.

Thank you in advance.

Best Answer

You can adjust the penalty for the different kinds of misclassification (false positives versus false negatives). Consider that you could get 94% correct just by putting everyone in the "no event" category, but that wouldn't be useful. Since your model classifies only 21% correctly, whatever program you are using is clearly, implicitly or explicitly, using a different weighting.
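
To make the point about weighting concrete, here is a small sketch with made-up scores for 100 loans at a 6% event rate. It shows that a cutoff which flags nobody reaches 94% accuracy, and that attaching costs to the two kinds of error moves the preferred cutoff. The cost values (a missed event costing 10 times a false alarm), the candidate cutoffs, and all function names are assumptions chosen purely for illustration.

```python
def confusion(y_true, p_pred, cutoff):
    """Count (tp, fp, tn, fn) when records with p >= cutoff are flagged as events."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        flagged = p >= cutoff
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(y_true, p_pred, cutoff):
    tp, fp, tn, fn = confusion(y_true, p_pred, cutoff)
    return (tp + tn) / (tp + fp + tn + fn)

def total_cost(y_true, p_pred, cutoff, cost_fn, cost_fp):
    """Weighted misclassification cost at a given cutoff."""
    _, fp, _, fn = confusion(y_true, p_pred, cutoff)
    return cost_fn * fn + cost_fp * fp

def best_cutoff(y_true, p_pred, candidates, cost_fn, cost_fp):
    """Pick the candidate cutoff with the lowest weighted cost."""
    return min(candidates, key=lambda c: total_cost(y_true, p_pred, c, cost_fn, cost_fp))

# Made-up data: 6 events and 94 non-events with hypothetical model scores.
y = [1] * 6 + [0] * 94
p = [0.40, 0.35, 0.30, 0.20, 0.15, 0.05] + [0.30] * 4 + [0.10] * 20 + [0.02] * 70

print("accuracy at cutoff 0.5:", accuracy(y, p, 0.5))  # nobody is flagged
chosen = best_cutoff(y, p, [0.05, 0.10, 0.15, 0.20, 0.30, 0.50], cost_fn=10, cost_fp=1)
print("cost-based cutoff:", chosen)
```

In practice the costs would come from the business (for example, the loss on a defaulted loan versus the margin lost on a declined good loan), and the cutoff would be chosen on the development sample and checked on the validation sample.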

You can look at graphs: for continuous independent variables, parallel boxplots by outcome; for categorical ones, mosaic plots.

You can avoid binning the predictors; binning can lower statistical power.
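
As background on why binning discards information, here is a rough sketch of a WOE transformation for one binned variable. Every record in a bin receives the same WOE value, so any variation within the bin is no longer visible to the model. The sign convention (log of non-event share over event share), the function names, and the toy counts are assumptions; the sketch also assumes every bin contains at least one event and one non-event.

```python
import math

def woe_and_iv(bins, events):
    """Return ({bin: WOE}, information value) for one binned predictor.

    bins   : bin label for each record
    events : 1 if the record is an event (e.g. 90+ DPD), else 0
    """
    total_events = sum(events)
    total_non_events = len(events) - total_events
    counts = {}
    for b, y in zip(bins, events):
        ev, non = counts.get(b, (0, 0))
        counts[b] = (ev + y, non + (1 - y))
    woe = {}
    iv = 0.0
    for b, (ev, non) in counts.items():
        share_events = ev / total_events          # share of all events in this bin
        share_non_events = non / total_non_events
        woe[b] = math.log(share_non_events / share_events)
        iv += (share_non_events - share_events) * woe[b]
    return woe, iv

# Made-up example: two utilization bins, 50 records each.
toy_bins = ["low"] * 50 + ["high"] * 50
toy_events = [1] * 2 + [0] * 48 + [1] * 8 + [0] * 42
toy_woe, toy_iv = woe_and_iv(toy_bins, toy_events)
print({b: round(w, 3) for b, w in toy_woe.items()}, round(toy_iv, 3))
```

Two borrowers at opposite ends of the "high" bin get the same WOE here, which is the kind of within-bin detail a model fitted on the raw continuous variable could still use.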

You could try other methods, such as classification trees.

But, in the end, you may just not have a good model.