Solved – Adjusting probability threshold for sklearn’s logistic regression model

logistic, machine learning, python, scikit learn, unbalanced-classes

I am a 10th grade student working on a binary classification problem and I have decided to use the logistic regression model from Scikit-Learn. I am looking to predict patient adherence given the time of day, day of week, or both. I have simulated data and have made it so that certain timeslots have many more 0s (meaning the patient did not take the medicine) to simulate a trend, but my model is still predicting "1" for every single input. I believe my data is very imbalanced and without any class weights, the model puts every input into the "1" class. Obviously, this results in terrible accuracy, AUC and everything in between. Sklearn does have a class_weight parameter, but since that is dichotomous and only gives the "balanced" option, it really does not help and in some cases makes accuracy far worse than just assuming everything to be in the 1 class. I think changing the threshold to 0.75 will work, given what I have seen from the data, but I can't find anything about adjusting the threshold in any documentation.

Is there any way to adjust this threshold? Or any other way to deal with my imbalanced data?

Let me know if you want me to elaborate on the specifics of my data.

Best Answer

There is almost never a good reason to do this! As Kjetil said above, see here.

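You should be able to get the probability outputs from ‘predict_proba’, then you can just write

decisions = (model.predict_proba(X)[:, 1] >= mythreshold).astype(int)

where X is your feature matrix and column 1 of the output holds the predicted probability of class 1.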

Note, as stated, that logistic regression itself does not have a threshold: it only outputs probabilities. However, sklearn's “predict” function applies a threshold directly. It returns class 1 whenever the “decision function” (the log-odds) is positive, which is the same as a predicted probability above 0.5. Hence sklearn treats logistic regression as a classifier, unfortunately.
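
As a minimal sketch of the thresholding described above, the Python below fits sklearn's LogisticRegression on made-up imbalanced data and compares the default predictions with a 0.75 cutoff. The data, the coefficients used to generate it, and the names X, y and my_threshold are assumptions for illustration only, not the asker's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical simulated data: one feature, with most labels equal to 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-(1.8 + 1.2 * X[:, 0])))
y = rng.binomial(1, p_true)

model = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is P(y = 1), because model.classes_ is [0, 1].
proba_1 = model.predict_proba(X)[:, 1]

my_threshold = 0.75  # the cutoff proposed in the question
decisions = (proba_1 >= my_threshold).astype(int)

# predict() corresponds to a 0.5 cutoff on the same probabilities.
default_decisions = model.predict(X)

print("predicted 1s at 0.50:", default_decisions.sum())
print("predicted 1s at 0.75:", decisions.sum())
```

Raising the cutoff can only move predictions from 1 to 0, so the second count is never larger than the first. Separately, class_weight is not limited to "balanced": it also accepts a dict of per-class weights, for example class_weight={0: 3, 1: 1}, where those particular weights are again only an illustration.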

General Comments

"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics who rely on simulation studies to show that their methods work use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for the normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing in ?Distributions will show you a help page on them. However, there are many other cool packages like mvtnorm or simstudy that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to things.
It seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple (maybe making a function or two for generating data) and then building up from there.

Specific Comments

It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like to create two simulated data sets: One where there is a relationship and one where there is not.

You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
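
To make "each person has their own probability" concrete, here is a minimal Python sketch using only the standard library. It gives every person their own baseline log-odds of adhering and then draws repeated observations from it. The number of people, the observations per person, and the mean and standard deviation of the log-odds are all hypothetical values picked for illustration, and no time-of-day slope is included.

```python
import math
import random

random.seed(1839)
n_people = 200  # hypothetical number of patients
n_obs = 30      # hypothetical observations per patient


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))


rows = []         # (person_id, took_medicine) pairs
person_prob = []  # each person's own adherence probability
for person in range(n_people):
    # Each person's log-odds varies around a shared average (assumed 0.5, SD 1).
    p = inv_logit(random.gauss(0.5, 1.0))
    person_prob.append(p)
    for _ in range(n_obs):
        rows.append((person, 1 if random.random() < p else 0))

print("lowest personal probability: ", round(min(person_prob), 3))
print("highest personal probability:", round(max(person_prob), 3))
```

Observations from the same person are correlated here because they share one underlying probability, which is the structure a multilevel model is meant to capture.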

Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.

Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in the afternoon, and 65% in the evening. I paste the code below, with some comments after #:

set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable

# make adherence probabilities based on time
adhere_prob <- ifelse(
  time == "morning", .80, 
  ifelse(
    time == "afternoon", .50, .65
  )
)

# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)

# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)

This summary shows, in part:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.02882    0.10738   0.268  0.78839    
timeevening  0.45350    0.15779   2.874  0.00405 ** 
timemorning  1.39891    0.17494   7.996 1.28e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Intercept represents the afternoon, and we can see that both the evening and the morning have a significantly higher probability of adherence than the afternoon. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
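
The coefficients in the summary above are on the log-odds scale, so it can help to convert them back to probabilities. The short Python sketch below does that with the inverse logit, using the three estimates copied from the table; the helper name inv_logit is my own. The results come out at roughly 0.51, 0.62 and 0.81, close to the 0.50, 0.65 and 0.80 used to simulate the data.

```python
import math


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))


# Estimates copied from the glm summary; afternoon is the reference level.
intercept = 0.02882
time_evening = 0.45350
time_morning = 1.39891

probs = {
    "afternoon": inv_logit(intercept),
    "evening": inv_logit(intercept + time_evening),
    "morning": inv_logit(intercept + time_morning),
}

for slot, p in probs.items():
    print(f"{slot:<10} estimated P(adhere) = {p:.3f}")
```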

I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).

Lastly, you can also simulate having no effect by setting all of the times to have the same probability:

set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))

Which returns:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.40306    0.10955   3.679 0.000234 ***
timeevening -0.06551    0.15806  -0.414 0.678535    
timemorning  0.18472    0.15800   1.169 0.242360    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This shows no significant differences between the times, as we would expect from the probability being the same across times.
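
Since the original question was in Python, here is a rough standard-library sketch of the first simulation (the one with a time effect). It only reproduces the data-generating step and reports the observed adherence rate per time slot; it does not fit a logistic regression. Python's random number generator differs from R's, so the seed 1839 will not reproduce the R numbers above.

```python
import random

random.seed(1839)  # fixes the draws; results will not match the R output
n = 1000
times = ["morning", "afternoon", "evening"]
adhere_prob = {"morning": 0.80, "afternoon": 0.50, "evening": 0.65}

# Assign each person a time slot, then draw one 0/1 outcome for them.
time = [random.choice(times) for _ in range(n)]
adhere = [1 if random.random() < adhere_prob[t] else 0 for t in time]

# Observed adherence rate per slot; these should sit near 0.80, 0.50, 0.65.
rates = {}
for slot in times:
    outcomes = [a for t, a in zip(time, adhere) if t == slot]
    rates[slot] = sum(outcomes) / len(outcomes)
    print(f"{slot:<10} n={len(outcomes):4d} rate={rates[slot]:.3f}")
```

Setting all three probabilities in adhere_prob to the same value, such as 0.6, gives the no-effect version.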

Related Solutions

Solved – How to simulate data to be statistically significant
