Machine Learning – How to Use Supervised Binning on Train Data Without Data Leakage

cart, classification, machine learning, mathematical-statistics, neural networks

I have a dataset that includes quantity ordered (along with other variables such as product type, product price, customer group, etc.). The target variable is whether the customer churned or not. I want to convert my continuous variable into categorical values like high, medium, and low based on the quantity-ordered level.

However, my question is not based on the dataset itself but on the technique called supervised binning.

Doesn't supervised binning qualify as data leakage? After all, we create bins based on the target variable (train data only), and later we use that information (bins derived from the target column) as input to the model.

Can you share some insights on whether it is recommended to do this?

If yes, why so?

If not, why not? I see a lot of tutorials and posts about using supervised binning for discretization of continuous variables (during data preparation). Should I only use unsupervised binning?

Best Answer

As already noted in the comments and another answer, you need to train the binning algorithm on the training data only; in that case it has no chance of leaking the test data, since it has never seen it.
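As a minimal sketch of that train-only workflow, one common form of supervised binning is to fit a shallow decision tree on the training split and reuse its split thresholds as bin edges everywhere; the column name and the data below are purely illustrative, not from the question.

```python
# Supervised binning without test-set leakage: the "binner" (a shallow tree)
# only ever sees the training split; its thresholds are then applied to test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
qty = rng.exponential(scale=50, size=1000)                       # continuous feature (hypothetical "qty ordered")
churn = (rng.random(1000) < 1 / (1 + np.exp(-(qty - 60) / 20))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    qty.reshape(-1, 1), churn, test_size=0.3, random_state=0
)

# Fit the binner on training data only (depth 2 -> at most 4 bins).
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train)

# Extract the learned split thresholds (internal nodes have feature >= 0).
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# Apply the same edges to both splits; the test labels are never used.
train_bins = np.digitize(X_train.ravel(), thresholds)
test_bins = np.digitize(X_test.ravel(), thresholds)
```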

But you seem to be concerned with the fact that the binning algorithm uses the labels, so it "leaks" the labels into the features. This concern makes sense; after all, if you had a model like

$$ y = f(y) $$

it would be quite useless: it would predict nothing new, and it would be unusable at prediction time, when you have no access to the labels. But the situation here is not that bad.

First, notice that any supervised machine learning algorithm has access to both the labels and the features during training, so if you weren't allowed to look at the labels while training, you couldn't train at all. The best example is the naive Bayes algorithm, which groups the data by the labels $Y$, calculates the empirical probabilities of the labels $p(Y=c)$ and the empirical probabilities of the features given (grouped by) each label $p(X_i \mid Y=c)$, and combines those using Bayes' theorem:

$$ p(Y=c \mid X_1, \dots, X_n) \propto p(Y=c) \prod_{i=1}^n p(X_i \mid Y=c) $$

If you think about it, this is almost a generalization of the binning idea to smooth categories: in binning we transform $X_i \mid Y=c$ into discrete bins, while naive Bayes replaces it with a probability (a continuous score!). Of course, the difference is that with binning you then use the features as input for another model, but the idea is basically a kind of poor man's naive Bayes.
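To make the "grouped by label" computation concrete, here is a small sketch of the empirical quantities naive Bayes uses, for a single categorical feature; the toy data frame and column names are made up for illustration.

```python
# Empirical class priors and per-class feature frequencies, as used by
# (categorical) naive Bayes; combined via Bayes' theorem for one observation.
import pandas as pd

df = pd.DataFrame({
    "product_type": ["A", "A", "B", "B", "B", "C"],
    "churned":      [1,   0,   1,   1,   0,   0],
})

# p(Y = c): empirical class priors
prior = df["churned"].value_counts(normalize=True)

# p(X_i | Y = c): feature frequencies computed within each label group
cond = (
    df.groupby("churned")["product_type"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)

# Unnormalised posterior p(Y=c) * p(x | Y=c) for a new observation x = "B"
posterior = prior * cond["B"]
print(posterior / posterior.sum())
```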

Finally, as noted by Stephan Kolassa in the comments, binning is usually discouraged. It results in losing information, so you end up with lower-quality features for training compared to the raw data. Ask yourself whether you really need to bin the data in the first place.
