Solved – Predicting the probability of a product being bought

classification, isotonic, machine-learning, probability, unbalanced-classes

I want to use statistical/machine-learning methods to predict the probability that a person buys a product on a website, given the characteristics of the product and of the other products it is compared to on the same website.

Each "line" of my dataset is the characterics of a product, a summary of the characterics of the other proposed products (mean, minimum, maximum of prices and other characteristics) and what product has been chosen.
The number of bought products is much lower than the number of non-bought ones: ratio between the two is several hundreds. As this dataset is unbalanced, I have implemented a subsampling method that

I have used the following algorithm (see Easy-Ensemble; basically an average of models trained on under-sampled data):

  1. Subsample the non-bought products (majority class) $N$ times
  2. Define $N$ training datasets as each of the $N$ subsets combined with the whole set of bought products (minority class)
  3. Train $N$ models on these $N$ datasets
  4. The probability of a product being bought is the mean of the probabilities given by the $N$ models.

The models are random forests.
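For concreteness, here is a minimal sketch of steps 1–4, assuming the features are in a NumPy array `X` with binary labels `y` (1 = bought); the function name and its parameters are illustrative, not part of any library:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def easy_ensemble_proba(X, y, X_new, n_models=10, seed=0):
    """Steps 1-4: train n_models random forests, each on the full
    minority class plus an equally sized random subsample of the
    majority class, and average their predicted probabilities."""
    rng = np.random.RandomState(seed)
    minority = np.flatnonzero(y == 1)   # bought products
    majority = np.flatnonzero(y == 0)   # non-bought products
    scores = np.zeros(len(X_new))
    for _ in range(n_models):
        # Step 1: subsample the majority class
        sub = rng.choice(majority, size=len(minority), replace=False)
        # Step 2: balanced training set = subsample + all minority examples
        idx = np.concatenate([minority, sub])
        # Step 3: train one model
        rf = RandomForestClassifier(n_estimators=200,
                                    random_state=rng.randint(2**31 - 1))
        rf.fit(X[idx], y[idx])
        # Step 4: accumulate the probability of the "bought" class
        scores += rf.predict_proba(X_new)[:, 1]
    return scores / n_models
```

Note that each training set is balanced, which is exactly why the averaged score in step 4 over-represents the rare "bought" event; this is the bias discussed below.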

My questions:

The models have been trained on data where bought products are far over-represented compared with the real data, so one could expect the obtained quantity to be biased towards bought products.

  1. Can we consider the probability obtained in step 4 to be an estimate of the real probability?
  2. If not, is there a way to "scale" the quantity obtained in step 4 to get an estimate of the probability of the product being bought?

Re-phrasing

What is the effect of the biased sampling used for the modelling? How could I transform the obtained model's output to counterbalance this effect?

Best Answer

I have found quite satisfactory answers to my two questions that are worth sharing:

1. Could the output score of the classification model be considered as the true conditional probability we wanted to estimate?

Not in general. In particular, for models learnt on undersampled datasets, Dal Pozzolo et al. (2015) have quantified the bias induced by the sampling method.

To assess if the output score is close to the true conditional probability, one can plot the so-called calibration curve (also called reliability diagram). See, for instance, Niculescu-Mizil and Caruana (2005) and this scikit-learn demo.
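As a minimal sketch of such a check, assuming `y_true` holds the held-out 0/1 outcomes and `y_score` the averaged ensemble scores (both placeholder names):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def plot_reliability(y_true, y_score, n_bins=10):
    """Plot the fraction of positives against the mean predicted score
    per bin; a well-calibrated model follows the diagonal."""
    frac_pos, mean_score = calibration_curve(y_true, y_score, n_bins=n_bins)
    plt.plot(mean_score, frac_pos, "s-", label="model")
    plt.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
    plt.xlabel("Mean predicted score")
    plt.ylabel("Fraction of positives")
    plt.legend()
    plt.show()
```

For a model trained on undersampled data, the curve typically lies well below the diagonal: the scores systematically overstate the true probability of the rare class.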

2. How could I transform the obtained model's output to get the sought probabilities?

This transformation is called calibration.

The two main classical methods are the following:

  1. Platt scaling, also called the sigmoid method (Platt, 1999)
  2. Isotonic regression (Zadrozny & Elkan, 2001)

See Niculescu-Mizil and Caruana (2005) for an explanation of both methods. Moreover, both are implemented in scikit-learn. More recent works have been published in this domain, among which Naeini and Cooper (2015).
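A minimal sketch using scikit-learn's `CalibratedClassifierCV`, which wraps a base classifier and fits either method on held-out folds (the forest parameters and the `X_train`/`y_train`/`X_test` arrays are placeholder assumptions):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# method="sigmoid" gives Platt scaling, method="isotonic" isotonic regression
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200),
                                    method="isotonic", cv=3)
calibrated.fit(X_train, y_train)            # assumed training arrays
p = calibrated.predict_proba(X_test)[:, 1]  # calibrated probabilities
```

Note that isotonic regression tends to overfit when calibration data are scarce, in which case the sigmoid method is usually preferred (Niculescu-Mizil and Caruana, 2005).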

Furthermore, works specifically addressing the undersampling case have been published:

  1. A method to calibrate the minority and majority classes equally (Wallace and Dahabreh, 2014)
  2. A closed-form formula to correct the bias induced by undersampling (see Dal Pozzolo et al., 2015); a sketch follows this list
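
If I read Dal Pozzolo et al. (2015) correctly, their correction can be sketched as follows: with $p_s$ the score output by the model trained on the undersampled data and $\beta$ the probability that a majority-class example is kept by the undersampling, the corrected probability is

$$p = \frac{\beta\, p_s}{\beta\, p_s - p_s + 1}.$$

In the balanced scheme above, $\beta$ would be roughly $N_{\text{bought}} / N_{\text{non-bought}}$, i.e. the inverse of the several-hundred-to-one class ratio.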