CatBoost Encoding – Why It Does Not Cause Target Leakage

catboost, categorical-encoding, data-leakage, target-encoders

I'm currently working on a fraud detection problem with a dataset of 300,000 rows and 500 columns, 70 of which are categorical with more than 10 categories each. I'm running into memory constraints and am exploring target encoding as a way to handle the categorical columns.

I've recently come across the CatBoost Encoder, which is often praised for its ability to prevent target leakage. However, I'm struggling to understand why this method prevents target leakage.

What's the intuitive explanation of this method?

Edit:

My question is similar to this one: How do Ordered Target Statistics work for CatBoost?. It's different because the author of that question is asking about the meaning of the "history". I understand that, but I don't understand why the method prevents target leakage.

Best Answer

If you read the CatBoost paper, you'll see that the authors answer this intuitively. Paraphrasing:

Assuming an (artificial) time ordering over the rows lets you encode each row using only data from its "past", so no information from the "future", including the row's own target, can leak into the encoding.

That's the intuitive explanation.
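
To make the "past only" idea concrete, here is a minimal sketch of an ordered target statistic in Python. The function name, the `prior_weight` parameter, and the single fixed ordering are my own simplifications; CatBoost itself averages over several random permutations, but the per-row formula (sum of past targets plus a weighted prior, divided by the past count plus the prior weight) is the one from the paper.

```python
import numpy as np
import pandas as pd

def ordered_target_encode(cat, y, prior_weight=1.0):
    """Encode each row using only the targets of *earlier* rows with the
    same category (its "past"), smoothed towards the global prior."""
    prior = y.mean()                  # global target mean used as the prior p
    sums, counts = {}, {}             # running per-category statistics
    encoded = np.empty(len(y))
    for i, (c, t) in enumerate(zip(cat, y)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # past-only statistic: (sum_past + a * p) / (count_past + a)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # only *after* row i is encoded does its own target enter the history
        sums[c], counts[c] = s + t, n + 1
    return encoded

# toy example: one categorical column and a binary target
df = pd.DataFrame({"cat": ["a", "a", "b", "a", "b"],
                   "y":   [1,    0,   1,   1,   0]})
df["cat_encoded"] = ordered_target_encode(df["cat"].to_numpy(), df["y"].to_numpy())
print(df)
```

Notice that the same category can receive a different encoded value on every row, because the "history" it sees keeps growing. That is exactly the variance discussed below.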

Mathematically:

Iteratively splitting your data into "past" and "future" and conditioning each row's encoding on its past introduces a significant amount of variance into the encoding of your categories. The variance acts like random noise added on top of the target encoding and partially breaks the relationship between the target and your categorical column.

Partially breaking the relationship prevents target leakage and improves your model's ability to generalise.
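
One way to see what this buys you is a small simulation, sketched below, in which the category is pure noise (independent of the target): any correlation between the encoding and the label is then leakage. A greedy per-category mean, which includes each row's own label, still correlates with the target, while the past-only encoding does not. The exact numbers depend on the random seed, but the gap is consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
cat = rng.integers(0, 500, size=n)            # high-cardinality category, pure noise
y = rng.integers(0, 2, size=n).astype(float)  # target independent of the category

# greedy target encoding: per-category mean over ALL rows, own label included
greedy = np.zeros(n)
for c in np.unique(cat):
    mask = cat == c
    greedy[mask] = y[mask].mean()

# ordered target encoding: per-category mean over earlier rows only (prior weight a = 1)
prior, sums, counts = y.mean(), {}, {}
ordered = np.empty(n)
for i, (c, t) in enumerate(zip(cat, y)):
    s, k = sums.get(c, 0.0), counts.get(c, 0)
    ordered[i] = (s + prior) / (k + 1.0)
    sums[c], counts[c] = s + t, k + 1

print(np.corrcoef(greedy, y)[0, 1])   # noticeably positive: the encoding leaks the label
print(np.corrcoef(ordered, y)[0, 1])  # near zero: no leakage from the row's own label
```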

This method can also be better than adding explicit random noise, because it doesn't require you to assume a distribution for the noise you add to your category's encoded values.

This combination is powerful and is responsible for Ordered Encoding's success.