Solved – How to increase the accuracy of the logistic regression model

logistic, predictive-models, regression

I am dealing with a tricky, unbalanced data set and trying to run a logistic regression model. One class is present with a 10:1 ratio.

My objective here is to boost my predictive accuracy – minimize the incorrect predictions and maximize the correct ones.

I've tried undersampling (which doesn't work very well), and I have tried ordinary logistic regression, case-weighted logistic regression, and Firth logistic regression.

None of them are very successful.

While I can sometimes yield good true negative rates depending on the dataset, in the end the prediction is merely representative of its underlying class distribution. The case-weighted logistic does about as well as the Firth when I test it – which is to say, it does horribly.

So…

  1. Is there anything else I can do to meet my objective? From what I
    can tell, exact logistic regression is an option but only for very
    small datasets, which is not the case here.

  2. Do I need to go back and explore variable selection?

  3. Why is it that the penalized model (Firth) is not a substantial
    improvement over the case-weighted logistic?

  4. Are there any options I am not considering?

  5. Should I change tactics and look to something like an anomaly
    detection model instead?


Please excuse me if this question is lacking in details; this is my first time asking a question here. I would be happy to add anything that helps.

Best Answer

Consider that you have a 10:1 class ratio and yet your interest is in zero-one loss, that is, maximizing the proportion of cases correctly classified. A trivial model that just guesses the modal class for every case would have an accuracy of 10/(10 + 1) ≈ 91%, which is already high. To substantially beat 91%, for example to reach 95% accuracy, you need one or more highly predictive features. If you don't have any, as is often the case in real problems, the best you can hope for is a small improvement on 91%. In a nutshell, the problem as posed is too easy: the trivial answer is already so good that statistics or machine learning cannot get you much more predictive power.
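To make the arithmetic concrete, here is a minimal Python sketch. The function names are my own, chosen for illustration, and the 95% target is just the example figure used above. It computes the accuracy of always guessing the modal class, and the share of that baseline's errors a model would have to remove (without introducing new errors on the majority class) to reach the target.

```python
def majority_baseline_accuracy(majority: int, minority: int) -> float:
    """Accuracy of a trivial model that always predicts the more common class."""
    return majority / (majority + minority)


def errors_to_remove(baseline_acc: float, target_acc: float) -> float:
    """Net fraction of the baseline's errors that must be eliminated
    for overall accuracy to rise from baseline_acc to target_acc."""
    return (target_acc - baseline_acc) / (1.0 - baseline_acc)


if __name__ == "__main__":
    # 10:1 class ratio, as in the question.
    baseline = majority_baseline_accuracy(10, 1)
    print(f"baseline accuracy: {baseline:.3f}")

    # Hypothetical target of 95% accuracy.
    share = errors_to_remove(baseline, 0.95)
    print(f"share of baseline errors to remove: {share:.2f}")
```

With a 10:1 ratio the baseline is about 0.909, so reaching 95% means cutting the baseline's errors by roughly 45% on net. That is a lot to ask of features that are only weakly predictive.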

Possibly I can give you more specific advice if you add details.