Machine Learning – Is Creating a Feature Using the Outcome Label Data Leakage?

classification, feature selection, hypothesis testing, machine learning, neural networks

I have a small dataset of 1000 rows and 10 features, and I am working on a binary classification problem to predict whether a supplier will meet their target (coded as positive and negative).

For these 1000 rows, the labels were generated based on business rules.

During exploratory analysis, I found the following:

a) Suppliers A, B and C show good adherence to target, meaning they mostly fulfill/meet their targets (e.g. out of 10 transactions, they meet the target 7 times and fail 3 times).

b) Suppliers D, E and F show poor adherence to target, meaning they mostly do not fulfill/meet their targets (e.g. out of 10 transactions, they meet the target 2 times and fail 8 times).

So, I would like to create a new feature called supplier adherence, with values good and bad, and use it when building the ML model.

My questions are as follows:

a) Does creating a new feature (supplier adherence) based on the target variable qualify as data leakage? If not, why not? I ask because I am deriving this information from the target label. Of course, I could also have checked with the business owners, whose view on supplier quality could likewise have helped me create such a feature.

b) After creating this feature, I intend to split the dataset into train and test sets. Is this the right approach?

c) If a new supplier (unseen data) comes in, I believe their adherence should be 0, because we will have no historical data about that supplier to calculate adherence from. So they would be considered a good supplier (due to lack of data). Am I right?

Could you help me with each of these questions, please?

Project Background

Okay, I have this dataset obtained by merging two data sources. One dataset contains the datetime when a deal was made, i.e. the supplier signed a deal with us agreeing to buy 1000 (target) units. The other dataset contains information about how much they have actually bought so far (e.g. 600 units); the remaining 400 units can still be bought in the upcoming days/months/years, but as of now the target has not been fulfilled. So we would like to follow up with the supplier and ask them to book the remaining orders early; if we don't follow up, they may or may not book them.

So, based on the above, each supplier can have multiple deals with us (for different products). They show signs of fulfilling some and not fulfilling others. I say "signs" because these deals are still in progress (not closed). So we use a threshold of 50%: if a supplier has booked at least 50% of the target quantity, we consider the deal fulfilled (because in practice no one hits 100%). A small sketch of this labelling rule is shown below.
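For concreteness, here is a minimal sketch of that labelling rule in pandas; the column names (target_units, booked_units) are hypothetical and would need to be adjusted to the actual merged dataset:

```python
import pandas as pd

# Hypothetical column names for the merged deals data; adjust to your schema.
deals = pd.DataFrame({
    "supplier":     ["A", "A", "D"],
    "target_units": [1000, 500, 800],
    "booked_units": [600, 100, 450],
})

# Business rule described above: a deal counts as fulfilled once the supplier
# has booked at least 50% of the target quantity.
deals["fulfilled"] = (deals["booked_units"] / deals["target_units"] >= 0.5).astype(int)
print(deals)
```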

These deals have dates such as the deal date, which indicates when the deal was made (e.g. we have deals from 2017, 2018, 2019, 2020, 2021, etc.). We also have the supplier's latest order booking date (for a matching deal).

Best Answer

a) Yes, this would certainly be data leakage, precisely because you are using the output to design a feature that is then used to predict that same output. It would be analogous to predicting whether a team won or lost an individual game last season while giving the model, as a feature, the team's total win/loss ratio over that entire season. I hope it makes sense why this is data leakage, but if it doesn't, please leave a comment.

Most fundamentally, it's data leakage because you're using information (the win/loss record for the entire season) that you wouldn't have had at prediction time (halfway through the season, you don't yet know the record for the whole season).

Note: for the rest of my answer, I will be assuming that your "supplier adherence score" feature is not binary (good/bad), but some kind of continuous variable (like a score from 0-10, or a percentage or something). This, I think, just makes more sense.

There is a way to fix this, though, provided your data has a time component. For example, if you are predicting a supplier's likelihood of meeting a target, and this target comes once a month, every month, so you run one prediction in March to predict whether March's target will be met, then one in April to predict whether April's target will be met, and so on, then there is a solution.

Going back to the sports example, what would not be data leakage is if you used a team's running tally as a feature, because you would only be using information actually available at that time and in that situation. For example, to predict the outcome of a game halfway through the season, it would be totally fine to use their win/loss ratio in the first half of the season, up until that game in question. In your case, if there is a time element, you could calculate a supplier adherence at every point in time, determined by the supplier's adherence up until that point. I hope that makes sense, if your data has a time aspect.
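If your data does have that time element, a running adherence could be computed along these lines. This is only a sketch with hypothetical column names (supplier, deal_date, fulfilled); the key point is that each deal's feature is built from the supplier's earlier deals only:

```python
import pandas as pd

# Toy deals table with hypothetical columns; fulfilled is the 0/1 label.
deals = pd.DataFrame({
    "supplier":  ["A", "A", "A", "D", "D"],
    "deal_date": pd.to_datetime(["2019-01-10", "2019-06-01", "2020-03-15",
                                 "2019-02-20", "2020-07-01"]),
    "fulfilled": [1, 1, 0, 0, 1],
})

# Sort by time so "earlier" really means earlier.
deals = deals.sort_values("deal_date")

# For each deal, average only the supplier's *previous* outcomes:
# shift(1) drops the current deal's own label, and expanding().mean()
# aggregates everything that came before it.
deals["running_adherence"] = (
    deals.groupby("supplier")["fulfilled"]
         .transform(lambda s: s.shift(1).expanding().mean())
)
print(deals)
```

Note that each supplier's first deal has no history, so the feature comes out missing (NaN) there; how you fill that gap is exactly the judgement call in part (c) below.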

b) Broadly speaking, this is the correct thing to do. However, it depends on what exactly you're trying to do. 1000 rows is not a huge amount of data. So if you only have one train set (say, the first 800 rows) and one test set (say, the last 200 rows), and you run the model and it does poorly on the test set, boom, you've blown your unseen data. Why? Because now, when you go back to change things in the model, you yourself have "seen" the test data, so you are biased in a way that makes it more likely you build a model that does well on the test set but is overfitted (for example, imagine you kept repeating this process, trying out 100 different combinations of variables until you finally found one that just happens to do well on the test set). I hope it makes sense why this is bad and would not lead to a good production model. If it doesn't, please ask.

What you need to do is use cross validation and a hold-out test set.
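Here is a minimal sketch of that workflow with scikit-learn, using synthetic stand-in data with the same shape as your problem (1000 rows, 10 features); replace it with your real feature matrix and labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data shaped like the problem in the question (1000 rows, 10 features).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set once and leave it untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)

# Do all model comparison and tuning via cross-validation on the training data only.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))

# Evaluate the final chosen model on the held-out test set exactly once.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("Test AUC: %.3f" % test_auc)
```

One caveat: since your deals have a time component, a time-aware split (for example scikit-learn's TimeSeriesSplit, training on older deals and testing on newer ones) is usually safer than a purely random split, especially once you add the running-adherence feature.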

c) This is going to have to be a judgement call, based on your experience. Maybe it's a good idea, like you said, to start each new supplier out at a "good supplier adherence" score. Maybe it's worth starting each new supplier with an adherence score equal to the average of all your suppliers at that time (perhaps this would lead to less bias?). It seems like a judgement call based on your expertise. You could also try out different options; maybe it will turn out to make very little difference.
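For concreteness, two of those options could look like this, continuing the hypothetical running_adherence sketch from part (a):

```python
import pandas as pd

# A brand-new supplier has no history, so their running_adherence is missing.
deals = pd.DataFrame({
    "supplier":          ["A", "A", "NEW"],
    "fulfilled":         [1, 0, 1],
    "running_adherence": [0.7, 0.7, float("nan")],
})

# Option 1: assume a new supplier behaves like the average supplier so far.
global_mean = deals["fulfilled"].mean()
deals["adherence_avg_filled"] = deals["running_adherence"].fillna(global_mean)

# Option 2: assume good until proven otherwise, as suggested in the question
# (here coded as 1.0, i.e. a perfect record so far).
deals["adherence_optimistic"] = deals["running_adherence"].fillna(1.0)

print(deals)
```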
