Solved – Low probability levels when doing logistic regression

churn, logistic

I am building a logistic regression model for a churn problem. When I score the out-of-sample data set, the output probabilities are very low. Conventionally, I would use .5 as the cut-off, but this scored population doesn't have many customers above .5 (say just 1%). Given the business case, we need to approach at least 5% of customers to have an impact.

I therefore reduced the cut-off probability used to classify the scored data set, and I am now using .1 as the cut-off. The model is very good at that level, in that it distinguishes my target from non-target perfectly.

  1. Is there any problem with this approach, given that the model has very good accuracy at the .1 level?
  2. What, in general, causes low probabilities in the scored population?

Best Answer

Re 1: If you predict well in the hold-out sample then you're doing well (no time to worry about propriety ;-) But since you're asking...

One way to look at the threshold is that when you set it to 0.1 you are implicitly specifying a loss function. That is, you are separating the question of what to do (e.g. approach a customer) from what to infer (e.g. that the probability of a 1 is 0.15). Indeed, you might make this separation a bit more explicit in your question. For example, you talk about needing to approach 5% of some people for something to be worthwhile, and then about how well you can predict cases. Is the issue that to approach the 'right' 5% (presumably the true '1's) you might have to approach many more (true '0's) to no effect? Then the cost of an approach is relevant and the threshold should be set to minimise loss. But you also say you can predict the held-out cases well when the threshold is set at 0.1...
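To make the "implicit loss function" point concrete, here is a minimal sketch in Python. It assumes a simple two-cost setup that is not in the question: cost_fp is the cost of approaching a customer who would not have churned, and cost_fn is the cost of not approaching one who does. The function names are made up for illustration, and the result only holds if the model's probabilities are reasonably well calibrated.

```python
def loss_minimising_threshold(cost_fp, cost_fn):
    """Cut-off that minimises expected loss: approach when p * cost_fn > (1 - p) * cost_fp."""
    return cost_fp / (cost_fp + cost_fn)


def implied_cost_ratio(threshold):
    """Ratio cost_fn / cost_fp that a chosen cut-off implicitly assumes."""
    return (1.0 - threshold) / threshold


def expected_loss(probs, threshold, cost_fp, cost_fn):
    """Average expected loss of the rule 'approach if p >= threshold'."""
    total = 0.0
    for p in probs:
        if p >= threshold:
            total += (1.0 - p) * cost_fp  # approached, but would not have churned
        else:
            total += p * cost_fn          # not approached, but churns
    return total / len(probs)


# Hypothetical scored probabilities, for illustration only.
scores = [0.02, 0.05, 0.15, 0.30, 0.60]
print(implied_cost_ratio(0.1))
print(expected_loss(scores, 0.1, cost_fp=1.0, cost_fn=9.0))
print(expected_loss(scores, 0.5, cost_fp=1.0, cost_fn=9.0))
```

Read this way, a cut-off of 0.1 amounts to saying that a missed churner costs roughly nine times as much as a wasted approach. If that ratio looks wrong for your business, the cut-off should probably move with it.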

Re 2: The cause of low probabilities is an unbalanced category distribution: when churners are rare, the fitted probabilities sit close to the low base rate. This may cause estimation problems, though don't automatically assume that it will. If it does, you can often correct them quite easily by changing the structure of the training data set and then correcting the parameters, or in other ways. There's some discussion here, a link to a good paper, and much more discussion elsewhere on the site - just search for 'unbalanced sample'.
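As one example of "changing the training data and correcting parameters", here is a rough sketch of the standard prior correction for the intercept. It assumes you refit the model on a sample in which churners were deliberately over-represented, and that you know (or can estimate) the churn rate in the real population. The function names and the numbers are hypothetical, and this is just one possible correction, not necessarily the one the linked discussion recommends.

```python
import math


def corrected_intercept(intercept, sample_rate, population_rate):
    """Shift an intercept fitted on a rebalanced sample back to the population scale.

    sample_rate: fraction of 1s in the (rebalanced) training sample.
    population_rate: fraction of 1s in the population being scored.
    """
    offset = math.log(
        ((1.0 - population_rate) / population_rate)
        * (sample_rate / (1.0 - sample_rate))
    )
    return intercept - offset


def predict_proba(intercept, coefs, x):
    """Plain logistic prediction for one row of features."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical numbers: trained on a 50/50 sample, true churn rate 5%.
b0 = corrected_intercept(0.0, sample_rate=0.5, population_rate=0.05)
print(predict_proba(b0, coefs=[], x=[]))
```

Only the intercept changes; the slope coefficients are left alone. After the correction the predicted probabilities will again be low, which is what you would expect when the event itself is rare.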