I am running an analysis on the probability of loan default using logistic regression and random forests.
When I use logistic regression, every prediction is '1' (which means a good loan). I have never seen this before and do not know where to start troubleshooting. The data set has 22 columns and 600K rows. When I reduce the number of columns, logistic regression gives the same result.
Why could the logistic regression be so wrong?
**Actual from the data**
0 : 41932
1 : 573426
**Logistic regression output**
prediction for 1 when actually 0: 41932
prediction for 1 when actually 1: 573426
**As you can see, it always predicts a 1**
**Random forests does better:**
actual 0, pred 0 : 38800
actual 1, pred 0 : 27
actual 0, pred 1 : 3132
actual 1, pred 1 : 573399
Best Answer
It does make sense that your model always predicts 1. Have a look at your data set: it is severely imbalanced in favor of your positive class. The negative class makes up only ~7% of your data, so a model that predicts 1 for every row is already about 93% accurate. Try re-balancing your training set or use a cost-sensitive algorithm.
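To make the imbalance concrete, here is a minimal sketch in plain Python that uses only the counts posted in the question. It computes accuracy and per-class recall for both confusion matrices, plus one common cost-sensitive weighting, `n_samples / (n_classes * class_count)`. The helper names `balanced_class_weights` and `summarize` are made up for illustration, and the question does not say which software was used, so treat this as an illustration rather than a fix for your exact setup.

```python
# Class counts and confusion-matrix cells are copied from the question.
# Helper names below are hypothetical, chosen only for this sketch.
counts = {0: 41932, 1: 573426}


def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * c) for label, c in class_counts.items()}


def summarize(tn, fp, fn, tp):
    """Accuracy and per-class recall for a 2x2 confusion matrix,
    treating class 1 (good loan) as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_0": tn / (tn + fp),  # share of actual 0s predicted as 0
        "recall_1": tp / (tp + fn),  # share of actual 1s predicted as 1
    }


weights = balanced_class_weights(counts)

# Logistic regression: every row predicted as 1.
logit = summarize(tn=0, fp=41932, fn=0, tp=573426)

# Random forest, using the four cells posted in the question.
forest = summarize(tn=38800, fp=3132, fn=27, tp=573399)

print("class weights:", weights)
print("logistic regression:", logit)
print("random forest:", forest)
```

The logistic regression scores roughly 93% accuracy while catching none of the bad loans (`recall_0` is 0), which is why accuracy alone hides the problem. With these weights each class contributes the same total weight to the loss, so errors on the rare class 0 count much more than errors on class 1. If you happen to be using scikit-learn (an assumption), `LogisticRegression(class_weight="balanced")` applies this same weighting formula.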