Solved – Adjusting probability threshold for sklearn’s logistic regression model

logistic, machine learning, python, scikit-learn, unbalanced-classes

I am a 10th grade student working on a binary classification problem and I have decided to use the logistic regression model from Scikit-Learn. I am looking to predict patient adherence given the time of day, day of week, or both. I have simulated data and have made it so that certain timeslots have many more 0s (meaning the patient did not take the medicine) to simulate a trend, but my model is still predicting "1" for every single input. I believe my data is very imbalanced and without any class weights, the model puts every input into the "1" class. Obviously, this results in terrible accuracy, AUC and everything in between. Sklearn does have a class_weight parameter, but since that is dichotomous and only gives the "balanced" option, it really does not help and in some cases makes accuracy far worse than just assuming everything to be in the 1 class. I think changing the threshold to 0.75 will work, given what I have seen from the data, but I can't find anything about adjusting the threshold in any documentation.

Is there any way to adjust this threshold? Or is there any other way to deal with my imbalanced data?

Let me know if you want me to elaborate on the specifics of my data.

Best Answer

There is almost never a good reason to do this! As Kjetil said above, see here.

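You should be able to get the probability outputs from ‘predict_proba’. It needs the feature matrix as its argument and returns one column per class, so take column 1 (the probability of class 1) and compare it to your threshold:

decisions = (model.predict_proba(X)[:, 1] >= mythreshold).astype(int)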
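For a complete picture, here is a minimal sketch of that thresholding step end to end. The data is a synthetic stand-in generated with make_classification (not the asker's adherence data), and the names my_threshold and proba_of_one are made up for illustration; the 0.75 cutoff is simply the value mentioned in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (roughly 80% of labels are 1).
# This is an assumption for illustration, not the asker's dataset.
X, y = make_classification(
    n_samples=500, n_features=4, weights=[0.2, 0.8], random_state=0
)

model = LogisticRegression().fit(X, y)

my_threshold = 0.75  # hypothetical cutoff, taken from the question

# predict_proba returns one column per class, ordered as model.classes_.
# With labels 0 and 1, column 1 holds the estimated P(y = 1).
proba_of_one = model.predict_proba(X)[:, 1]

# Label an input 1 only when its estimated probability reaches the cutoff.
decisions = (proba_of_one >= my_threshold).astype(int)

print("predicted 1s with default predict():", int(model.predict(X).sum()))
print("predicted 1s with custom threshold: ", int(decisions.sum()))
```

Raising the cutoff above 0.5 can only move predictions from 1 to 0, so the second count is never larger than the first.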

Note that, as stated, logistic regression itself does not have a threshold: it only outputs probabilities. Unfortunately, sklearn builds one into the “predict” function, which labels an input 1 whenever the “decision function” (the log-odds) is positive. That is the same as a fixed cutoff of 0.5 on the predicted probability. This is why sklearn treats logistic regression as a classifier.
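The built-in cutoff can be checked directly. The following is a small sketch on synthetic data (again an assumption, not the asker's data) comparing predict with the sign of decision_function and with a 0.5 cutoff on predict_proba for a binary model; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to illustrate the point.
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
clf = LogisticRegression().fit(X, y)

scores = clf.decision_function(X)   # log-odds of class 1
proba = clf.predict_proba(X)[:, 1]  # estimated P(y = 1)

# For a binary model, the probability is the sigmoid of the log-odds.
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))

# predict() labels an input 1 exactly when its score is positive,
# which corresponds to a probability above 0.5.
from_scores = (scores > 0).astype(int)
from_proba = (proba > 0.5).astype(int)

print(np.array_equal(from_scores, clf.predict(X)))
print(np.array_equal(from_proba, clf.predict(X)))
```

Because predict only ever applies this fixed cutoff, any other threshold has to be applied by hand to the output of predict_proba.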
