CatBoost Encoding – Why It Does Not Cause Target Leakage

catboost, categorical-encoding, data-leakage, target-encoders

I'm currently working on a fraud detection problem with a dataset of 300,000 rows and 500 columns, 70 of which are categorical with more than 10 categories each. I'm running into memory constraints and am exploring target encoding as a way to handle the categorical columns.

I've recently come across the CatBoost Encoder, which is often praised for its ability to prevent target leakage. However, I'm struggling to understand why this method prevents target leakage.

What's the intuitive explanation of this method?

Edit:

My question is similar to this one: How do Ordered Target Statistics work for CatBoost?. It's different because the author of that question is asking about the meaning of the "history". I understand that, but I don't understand why the method prevents target leakage.

Best Answer

If you read the CatBoost paper, you'll see that the authors answer this intuitively. Paraphrasing:

Assuming an (artificial) time ordering over the rows lets you encode each row using only data from its "past", so no information from the "future", including the row's own target, can leak into the encoding.

That's the intuitive explanation.
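
To make the "past only" idea concrete, here is a minimal sketch of an ordered target statistic in Python. The function name, the `prior_weight` parameter, and the single fixed ordering are my own simplifications; CatBoost itself averages over several random permutations, but the per-row formula (sum of past targets plus a weighted prior, divided by the past count plus the prior weight) is the one from the paper.

```python
import numpy as np
import pandas as pd

def ordered_target_encode(cat, y, prior_weight=1.0):
    """Encode each row using only the targets of *earlier* rows with the
    same category (its "past"), smoothed towards the global prior."""
    prior = y.mean()                  # global target mean used as the prior p
    sums, counts = {}, {}             # running per-category statistics
    encoded = np.empty(len(y))
    for i, (c, t) in enumerate(zip(cat, y)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # past-only statistic: (sum_past + a * p) / (count_past + a)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # only *after* row i is encoded does its own target enter the history
        sums[c], counts[c] = s + t, n + 1
    return encoded

# toy example: one categorical column and a binary target
df = pd.DataFrame({"cat": ["a", "a", "b", "a", "b"],
                   "y":   [1,    0,   1,   1,   0]})
df["cat_encoded"] = ordered_target_encode(df["cat"].to_numpy(), df["y"].to_numpy())
print(df)
```

Notice that the same category can receive a different encoded value on every row, because the "history" it sees keeps growing. That is exactly the variance discussed below.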

Mathematically:

Iteratively splitting your data into "past" and "future" and conditioning each row's encoding on its past introduces a significant amount of variance into the encoding of your categories. The variance acts like random noise added on top of the target encoding and partially breaks the relationship between the target and your categorical column.

Partially breaking the relationship prevents target leakage and improves your model's ability to generalise.
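
One way to see what this buys you is a small simulation, sketched below, in which the category is pure noise (independent of the target): any correlation between the encoding and the label is then leakage. A greedy per-category mean, which includes each row's own label, still correlates with the target, while the past-only encoding does not. The exact numbers depend on the random seed, but the gap is consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
cat = rng.integers(0, 500, size=n)            # high-cardinality category, pure noise
y = rng.integers(0, 2, size=n).astype(float)  # target independent of the category

# greedy target encoding: per-category mean over ALL rows, own label included
greedy = np.zeros(n)
for c in np.unique(cat):
    mask = cat == c
    greedy[mask] = y[mask].mean()

# ordered target encoding: per-category mean over earlier rows only (prior weight a = 1)
prior, sums, counts = y.mean(), {}, {}
ordered = np.empty(n)
for i, (c, t) in enumerate(zip(cat, y)):
    s, k = sums.get(c, 0.0), counts.get(c, 0)
    ordered[i] = (s + prior) / (k + 1.0)
    sums[c], counts[c] = s + t, k + 1

print(np.corrcoef(greedy, y)[0, 1])   # noticeably positive: the encoding leaks the label
print(np.corrcoef(ordered, y)[0, 1])  # near zero: no leakage from the row's own label
```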

This method can also be better than adding explicit random noise, because it doesn't require you to assume a distribution for the noise you add to your category's encoded values.

This combination is powerful and is responsible for Ordered Encoding's success.