Machine Learning – How to Use Supervised Binning on Train Data Without Data Leakage

cart, classification, machine learning, mathematical-statistics, neural networks

I have a dataset that includes quantity ordered (along with other variables such as product type, product price, customer group, etc.). The target variable is whether the customer churned or not. I want to convert my continuous variable into categorical values like high, medium, and low based on the quantity-ordered level.

However, my question is not based on the dataset itself but on the technique called supervised binning.

Doesn't supervised binning qualify as data leakage? After all, we create bins based on the target variable (train data only), and later we use that information (bins derived from the target column) as input to the model.

Can you share some insights on whether it is recommended to do this?

If yes, why so?

If not, why not? I see a lot of tutorials and posts about using supervised binning for discretization of continuous variables (during data preparation). Should I only use unsupervised binning?

Best Answer

As already noted in the comments and another answer, you need to train the binning algorithm on the training data only; in that case it has no chance of leaking the test data, since it has never seen it.
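As a minimal sketch of that train-only workflow, one common form of supervised binning is to fit a shallow decision tree on the training split and reuse its split thresholds as bin edges everywhere; the column name and the data below are purely illustrative, not from the question.

```python
# Supervised binning without test-set leakage: the "binner" (a shallow tree)
# only ever sees the training split; its thresholds are then applied to test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
qty = rng.exponential(scale=50, size=1000)                       # continuous feature (hypothetical "qty ordered")
churn = (rng.random(1000) < 1 / (1 + np.exp(-(qty - 60) / 20))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    qty.reshape(-1, 1), churn, test_size=0.3, random_state=0
)

# Fit the binner on training data only (depth 2 -> at most 4 bins).
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train)

# Extract the learned split thresholds (internal nodes have feature >= 0).
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# Apply the same edges to both splits; the test labels are never used.
train_bins = np.digitize(X_train.ravel(), thresholds)
test_bins = np.digitize(X_test.ravel(), thresholds)
```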

But you seem to be concerned with the fact that the binning algorithm uses the labels, so it "leaks" the labels into the features. This concern makes sense; after all, if you had a model like

$$ y = f(y) $$

it would be quite useless: it would predict nothing new, and it would be unusable at prediction time, when you have no access to the labels. But the situation here is not that bad.

First, notice that any supervised machine learning algorithm has access to both the labels and the features during training, so if you weren't allowed to look at the labels while training, you couldn't train at all. The best example is the naive Bayes algorithm, which groups the data by the labels $Y$, calculates the empirical probabilities of the labels $p(Y=c)$ and the empirical probabilities of the features given (grouped by) each label $p(X_i \mid Y=c)$, and combines those using Bayes' theorem:

$$ p(Y=c \mid X_1, \dots, X_n) \propto p(Y=c) \prod_{i=1}^n p(X_i \mid Y=c) $$

If you think about it, this is almost a generalization of the binning idea to smooth categories: in binning we transform $X_i \mid Y=c$ into discrete bins, while naive Bayes replaces it with a probability (a continuous score!). Of course, the difference is that with binning you then use the features as input for another model, but the idea is basically a kind of poor man's naive Bayes.
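To make the "grouped by label" computation concrete, here is a small sketch of the empirical quantities naive Bayes uses, for a single categorical feature; the toy data frame and column names are made up for illustration.

```python
# Empirical class priors and per-class feature frequencies, as used by
# (categorical) naive Bayes; combined via Bayes' theorem for one observation.
import pandas as pd

df = pd.DataFrame({
    "product_type": ["A", "A", "B", "B", "B", "C"],
    "churned":      [1,   0,   1,   1,   0,   0],
})

# p(Y = c): empirical class priors
prior = df["churned"].value_counts(normalize=True)

# p(X_i | Y = c): feature frequencies computed within each label group
cond = (
    df.groupby("churned")["product_type"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)

# Unnormalised posterior p(Y=c) * p(x | Y=c) for a new observation x = "B"
posterior = prior * cond["B"]
print(posterior / posterior.sum())
```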

Finally, as noted by Stephan Kolassa in the comments, binning is usually discouraged. It results in losing information, so you end up with lower-quality features for training compared to the raw data. Ask yourself whether you really need to bin the data in the first place.
