Logistic Regression – How to Predict Unbalanced Classes Effectively

logistic, predictive-models, regression, scoring-rules, unbalanced-classes

I am running an analysis on the probability of loan default using logistic regression and random forests.

When I use logistic regression, the prediction is always all '1' (which means good loan). I have never seen this before and do not know where to start in terms of sorting out the issue. There are 22 columns and 600K rows. When I decrease the number of columns, I get the same result from the logistic regression.

Why could the logistic regression be so wrong?

**Actual from the data**

0: 41932
1: 573426

**Logistic regression output** 

prediction for 1 when actually 0: 41932
prediction for 1 when actually 1: 573426

**As you can see, it always predicts a 1.**


**The random forest does better:**

actual 0, pred 0 : 38800 
actual 1, pred 0 : 27 
actual 0, pred 1 : 3132
actual 1, pred 1 : 573399

Best Answer

Well, it does make sense that your model always predicts 1. Have a look at your data set: it is severely imbalanced in favor of the positive class, which makes up ~93% of the rows. The logistic regression's fitted probabilities may be perfectly sensible, but with so few negatives, P(y = 1) rarely drops below the default 0.5 decision threshold, so every hard prediction comes out as 1. Try re-balancing your training set, using a cost-sensitive algorithm (e.g. class weights), or lowering the decision threshold rather than classifying at 0.5.
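A minimal sketch of both remedies, assuming scikit-learn (the question doesn't name its tooling) and using `make_classification` as a stand-in for the loan data, with roughly the same ~7% negative rate:

```python
# Sketch: why plain logistic regression predicts all 1s on imbalanced data,
# and two fixes: class weighting and moving the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# ~7% negatives, mirroring the 41932 / 573426 split in the question
X, y = make_classification(n_samples=20000, n_features=22,
                           weights=[0.07, 0.93], class_sep=0.5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Default model: hard predictions threshold P(y=1) at 0.5,
# which almost no rows fall below when 93% of the data is positive.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Fix 1: cost-sensitive fit, weighting each class inversely to its frequency.
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X_tr, y_tr)

# Fix 2: keep the plain model but raise the threshold for calling a loan good,
# e.g. near the positive base rate instead of 0.5.
proba = plain.predict_proba(X_te)[:, 1]
thresholded = (proba > 0.93).astype(int)

for name, pred in [("plain 0.5 threshold", plain.predict(X_te)),
                   ("class_weight=balanced", weighted.predict(X_te)),
                   ("threshold=0.93", thresholded)]:
    print(name, "-> predicted 0s:", int((pred == 0).sum()))
```

With the default threshold the plain model predicts 0 almost never; both the weighted fit and the shifted threshold recover a substantial number of 0 predictions, at the cost of more false negatives on the majority class.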
