Solved – Adding predicted probabilities from logistic regression instead of using cut value

logistic, prediction, regression

I am using a logistic regression model to predict a binary decision (purchase, don't purchase) based on several independent variables (income, age, education, etc.) for a population of individuals (customers). I have data for individuals from one or more previous time periods, and I want to predict behavior for different individuals in a future time period. Unfortunately, my experience is with explanation, not prediction.

My real interest is in predicting aggregate behavior. For example, what are total predicted purchases by customers in a future time period, based on their characteristics? I can see two ways of doing this. First, I could use the parameters from the logistic regression model to generate a probability (between 0 and 1) for each customer in the future time period, then use a cut value (0.5) to resolve those probabilities to either 0 or 1, then sum the 1s to generate an estimate of total purchases. Second, I could generate the same per-customer probabilities as before, then simply sum those probabilities to generate an estimate of total purchases (without using a cut value).
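To make the two options concrete, here is a minimal sketch in Python with NumPy. The coefficients in `beta` and the four customers in `X` are made-up values for illustration only, not taken from any real model, and the function names are hypothetical.

```python
import numpy as np

def predicted_probabilities(X, beta):
    """Logistic-model purchase probability for each row of X.

    The first column of X is assumed to be a constant 1 (the intercept).
    """
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def total_by_cutoff(probs, cut=0.5):
    """Approach 1: round each probability to 0 or 1, then count the 1s."""
    return int(np.sum(probs >= cut))

def total_by_sum(probs):
    """Approach 2: add the probabilities directly."""
    return float(np.sum(probs))

# Hypothetical fitted coefficients: intercept and one predictor.
beta = np.array([-0.5, 0.8])

# Hypothetical future customers: three with predictor 0, one with predictor 1.
X = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
])

probs = predicted_probabilities(X, beta)
cutoff_total = total_by_cutoff(probs)
summed_total = total_by_sum(probs)

print("probabilities:", np.round(probs, 3))
print("total via cut value:", cutoff_total)
print("total via summed probabilities:", round(summed_total, 3))
```

In this toy case three of the four customers sit below 0.5, so the cut value counts only one purchase, while the summed probabilities give a total of roughly 1.7.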

The second approach (adding the probabilities) makes the most sense to me, but the reference material I have consulted so far frames the prediction task in terms of cut values and classification tables. Is the second approach conceptually flawed? If so, why? Thanks very much.

ADDENDUM: With regard to the references I consulted, it was often suggested to use cut values and classification tables, with training and validation sets, to evaluate the real-world performance of a logit model. However, I would have thought that summing the probabilities would have been a better way to do that.

Best Answer

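For what you're trying to do, the second approach seems better to me. Each predicted probability is the expected number of purchases for that customer, so the sum of the probabilities is the expected total number of purchases. If your model is correctly specified, then the error in that total, relative to the number of customers, shrinks as $n \rightarrow \infty$. Meanwhile, the first approach, i.e., cutting at 0.5, could go horribly wrong if you have lots of probabilities that are systematically around 40% or 60%. If every customer had a predicted probability of 0.4, for instance, the cut value would predict zero purchases, even though you would expect about 40% of them to buy.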
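A small simulation can illustrate the point. This is only a sketch under strong assumptions: the probabilities are drawn uniformly between 0.35 and 0.45 (an arbitrary choice standing in for "systematically around 40%"), and they are treated as the true purchase probabilities, i.e., the model is assumed to be perfectly specified.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_customers = 10_000

# Assumed true purchase probabilities, all clustered around 40%.
probs = rng.uniform(0.35, 0.45, size=n_customers)

# Simulate one purchase decision (0 or 1) per customer.
purchases = rng.binomial(1, probs)
actual_total = int(purchases.sum())

# Approach 1: cut at 0.5. Every probability is below 0.5, so this is 0.
cut_total = int(np.sum(probs >= 0.5))

# Approach 2: sum the probabilities.
sum_total = float(probs.sum())

print("simulated actual purchases:", actual_total)
print("estimate from cut value:", cut_total)
print("estimate from summed probabilities:", round(sum_total, 1))
```

Because no probability reaches 0.5, the cut-value estimate is exactly zero, while the summed probabilities land close to the simulated total.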

One follow-up, though: Are you sure you will always have 0 or 1 purchases per time period? Or is it also possible for people to buy 2, 3, etc. items? If so, I'd recommend doing Poisson regression instead of logistic regression. Poisson regression is a standard way of predicting quantities of the form "number of times something happens."
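If purchases can exceed one per period, the same "sum the predictions" idea carries over: sum each customer's predicted mean count. The sketch below fits a Poisson regression with a log link using a hand-rolled Newton (IRLS) loop in NumPy, purely for illustration. The data are simulated, the "true" coefficients are invented, and `fit_poisson` is a hypothetical helper; in practice you would more likely use the GLM routine of a statistics package.

```python
import numpy as np

def fit_poisson(X, y, n_iter=25):
    """Fit Poisson regression (log link) by Newton's method.

    X: design matrix whose first column is a constant 1.
    y: non-negative integer counts.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)              # predicted mean count per row
        grad = X.T @ (y - mu)              # gradient of the log-likelihood
        hess = X.T @ (X * mu[:, None])     # negative Hessian (X' diag(mu) X)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)

# Simulated past customers with one standardized predictor (e.g. income).
n_past = 5_000
X_past = np.column_stack([np.ones(n_past), rng.normal(size=n_past)])
true_beta = np.array([-0.5, 0.3])          # invented for this illustration
y_past = rng.poisson(np.exp(X_past @ true_beta))

beta_hat = fit_poisson(X_past, y_past)

# Future customers: predicted mean purchases, summed into an aggregate.
n_future = 1_000
X_future = np.column_stack([np.ones(n_future), rng.normal(size=n_future)])
expected_counts = np.exp(X_future @ beta_hat)
predicted_total = float(expected_counts.sum())

print("estimated coefficients:", np.round(beta_hat, 3))
print("predicted total purchases:", round(predicted_total, 1))
```

No cut value appears anywhere here: each customer contributes a fractional expected count, and the aggregate prediction is their sum.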
