Predicting whether a potential sale will be won or lost

finance, logistic, probability, r, random forest

I am currently working on a project using a sales system and trying to come up with a way to use the current pipeline of potential sales to predict the amount of product that will be sold in the future. I’m looking for advice on how to approach this problem and hopefully some resources to teach me what approach to use and why.

The sales system I’m using has historical data for opportunities (potential sales). Around 50,000 of the opportunities are “closed”, meaning that they have been either won or lost. I have around 1,000 “open” opportunities that have not yet been won or lost. Some variables that I have on each sale include the product (which is generally homogeneous except for the amount), the amount, the salesman, the date, the time it was input into the system, the customer, and other data about the customer.

I understand that if I want to predict a dichotomous variable like win / lose then I should look at logistic regression. However, I’m looking for general advice on how to:

  1. Predict the probability of each individual opportunity closing as won using the data I have (and how to tell if I've done it correctly).
  2. Estimate the total amount of won opportunities for a period.

I found a similar question here, “Using a logistic model on the estimates of several other classification models”, but I’m hoping for a response that gives me a better idea of where to start. I’m comfortable using R or any other statistical software, but ideally I'd like some kind of book or other reference material that is as low-level as possible.

Best Answer

  1. I deal with this sort of messy data a lot. If you have lots of time, there is no reason not to test all the factors (though maybe not the customer, if you don't have repeat customers).

More than likely, you want information that is easier to use. I suggest grouping your salesmen into "Best", "Medium", and "Acceptable" categories, or however many categories you want. This way your system will still work as new salesmen come in, and you can quickly grade them. Similarly, group the customers into categories ranging from "high quality, we want lots of business from them" to "one-shot, small-time customers". Having simple categories like these will help with the regression.
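As a rough illustration of the grouping idea, here is a minimal sketch that ranks salesmen by historical win rate and splits the ranking into equal-sized tiers. The function name `tier_salesmen`, the `(salesman, won)` input layout, and the choice of win rate as the grading criterion are all assumptions for the example; you might grade on revenue or something else entirely.

```python
from collections import defaultdict


def tier_salesmen(closed, labels=("Acceptable", "Medium", "Best")):
    """Assign each salesman a tier from the rank of their historical win rate.

    `closed` is an iterable of (salesman, won) pairs taken from closed
    opportunities, where `won` is True or False. (Hypothetical layout.)
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for salesman, won in closed:
        totals[salesman] += 1
        wins[salesman] += int(won)

    # Rank salesmen from lowest to highest win rate.
    ranked = sorted(totals, key=lambda s: wins[s] / totals[s])

    # Split the ranking into equal-sized groups, one per label.
    n = len(ranked)
    return {
        s: labels[min(i * len(labels) // n, len(labels) - 1)]
        for i, s in enumerate(ranked)
    }


# Made-up example data, for illustration only.
closed = [
    ("ann", True), ("ann", True), ("ann", False),
    ("bob", False), ("bob", False), ("bob", True),
    ("cat", True), ("cat", True), ("cat", True),
]
tiers = tier_salesmen(closed)
```

Because the tiers are built from the outcome you are trying to predict, it is safer to compute them from an earlier period than the one used to fit the regression.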

Testing whether you've done it correctly is a different ballgame altogether. It really helps to look at the numbers you get from the regression and think about the trend they show. If your "Not good" category of salesmen shows better sales than the others, that should immediately clue you in that something is wrong, since they are defined by not having the best sales records. If the model shows no correlations at all, it doesn't mean you did it wrong; it might mean there is too much unaccounted-for variance, so I can't really answer that part for you.
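Beyond the sanity check described above, one common complement (not something this answer prescribes) is to hold out some closed opportunities, predict their win probabilities, and compare predicted probability with the observed win rate in bins. The sketch below assumes you already have predicted probabilities from whatever model you fit; `calibration_table` and its inputs are hypothetical names.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Compare mean predicted probability with observed win rate per bin.

    `probs` are predicted win probabilities in [0, 1] for held-out closed
    opportunities; `outcomes` are the matching 1 (won) / 0 (lost) results.
    Returns one (count, mean_predicted, observed_win_rate) row per
    non-empty equal-width bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, won in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, won))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        win_rate = sum(w for _, w in members) / len(members)
        rows.append((len(members), mean_pred, win_rate))
    return rows


# Made-up predictions and outcomes, for illustration only.
probs = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]
rows = calibration_table(probs, outcomes)
```

If the observed win rate in each row is far from the mean predicted probability, the probabilities should not be trusted as they stand.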

  2. One of the problems I see in "Open" cases is that messy paperwork can keep cases open for years. Make sure you clear out anything that has been open past a reasonable time frame (you need to decide what makes sense for your situation).
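A minimal sketch of that clean-up step, assuming each open opportunity is a dict with a `created` date. The field names and the 365-day cutoff are placeholders, not values from the question.

```python
from datetime import date, timedelta


def drop_stale(open_opps, today, max_age_days=365):
    """Keep only open opportunities created within the last `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return [o for o in open_opps if o["created"] >= cutoff]


# Made-up open pipeline, for illustration only.
open_opps = [
    {"id": 1, "created": date(2020, 1, 15), "amount": 500.0},
    {"id": 2, "created": date(2023, 11, 1), "amount": 1200.0},
]
fresh = drop_stale(open_opps, today=date(2024, 1, 1))
```

Looking at how long closed opportunities took to close is one way to pick a sensible cutoff.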

I would set a time frame (monthly or quarterly) and compare the ratio of opens to closes in each period; then, looking at closes only, compare wins to losses. From there you can start looking at baseline sales and trends. It is really difficult to accurately extrapolate sales data, and any recommendation I give starts with a long preamble about that.
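To make the per-period comparison concrete, here is a rough sketch that computes the win rate of closed opportunities by quarter and then estimates the total won amount of the open pipeline as the sum of amount times win probability. The data layout and function names are assumptions; the probability for each open opportunity could be a per-opportunity model estimate or simply a historical win rate.

```python
from collections import defaultdict
from datetime import date


def quarter(d):
    """Return (year, quarter number) for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def win_rate_by_quarter(closed):
    """closed: iterable of (close_date, won). Returns {(year, q): win rate}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for close_date, won in closed:
        q = quarter(close_date)
        totals[q] += 1
        wins[q] += int(won)
    return {q: wins[q] / totals[q] for q in sorted(totals)}


def expected_won_amount(open_opps):
    """open_opps: iterable of (amount, win_probability).

    The expected total is the sum of amount * probability.
    """
    return sum(amount * p for amount, p in open_opps)


# Made-up closed opportunities, for illustration only.
closed = [
    (date(2023, 1, 10), True), (date(2023, 2, 5), False),
    (date(2023, 3, 20), True), (date(2023, 3, 25), False),
    (date(2023, 4, 2), True), (date(2023, 5, 9), True),
    (date(2023, 6, 30), True), (date(2023, 6, 30), False),
]
rates = win_rate_by_quarter(closed)

# Made-up open pipeline: (amount, assumed probability of winning).
expected = expected_won_amount([(1000.0, 0.5), (2000.0, 0.25)])
```

This gives only an expected value. Actual totals will vary around it, which is part of why extrapolating sales is hard.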


In short, the regression is the easy part; cleaning your data takes all the work.
