Feature Engineering – How to Handle Target Encoding in Test Data and Avoid Target Leakage

categorical-encoding, feature-engineering

I understand target encoding: the average of the target value by category, computed as an out-of-fold mean within each fold, although you get slightly different means for the same value of a categorical variable across different folds. An example (source):

Let's say we have 20-fold cross validation. We need to somehow
calculate the mean value of the feature for fold #1 using information
from folds #2-#20 only.

So, you take folds #2-#20 and create another cross-validation set
within them. Calculate means for every leave-one-out fold. You average these means and apply that
vector to your primary #1 validation set. Repeat that for the remaining
19 folds.

My questions are:

  1. How does using the out-of-fold mean within each fold prevent target leakage? Could you give a graphical or mathematical example, or is it too complex to show with math?
  2. Related to question (1), I have read that regularization prevents target leakage when using the out-of-fold mean within each fold. So which one prevents the leakage, or do both?
  3. How do I implement this method on the test set? I won't have the target variable, and I can't simply replace a value of a categorical variable in the test set with the target mean of that value in the train set, because the train set has slightly different means for that value across folds. So which mean do I use for each specific value in the test set?

Best Answer

There's an excellent tutorial on this in the "Learn from Top Kagglers: How to Win a Data Science Competition" Coursera course, which is currently unavailable because of the course's affiliation with Moscow State University. It answers several of these questions, and I'm not aware of any other resource that is nearly as good (there are several good YouTube videos, but they gloss over some details, such as the issue of target leakage within the training data).

Another source that I've looked at is the "Approaching (almost) any Machine Learning problem" book by Abhishek Thakur, in particular the brief section from page 132 onwards. There may also be other good materials by other Kagglers, because this is a technique that is widely used in data science competitions but has received comparatively little academic attention. Additionally, people taking part in serious data science competitions are extremely well incentivized to find approaches that generalize well to previously unseen data (or at least to the competition's test set, which they cannot see) and to avoid any target leakage, including in their own evaluation of their models. It seems that, for that reason, academic descriptions of the topic tend to gloss over very important details that regular competition participants are aware of. I'm sure there are notable positive exceptions, and the two communities do of course overlap substantially, so take these comments as subjective impressions of "average" papers I've seen.

Calculating target encoding out-of-fold

This is critical for getting a fair evaluation of your model (including the target-encoded features) on the validation part of a fold: nothing that is used as a predictor when evaluating the model on the validation part of a fold may in any way, shape, or form use the target information from the validation part of that fold.
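To make this concrete, here is a minimal sketch of out-of-fold target encoding with pandas and scikit-learn. The column names ("category", "label" style arguments), the number of folds, and the fallback for categories missing from the other folds are illustrative assumptions, not something taken from the course or book mentioned above:

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Encode each row using target means computed from the *other* folds only."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(df):
        train_part = df.iloc[train_idx]
        # Per-category target means, computed without the validation part.
        means = train_part.groupby(cat_col)[target_col].mean()
        enc = df.iloc[valid_idx][cat_col].map(means)
        # Categories absent from the other folds fall back to the
        # training part's overall mean.
        encoded.iloc[valid_idx] = enc.fillna(train_part[target_col].mean()).values
    return encoded
```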

At this point, we have avoided target leakage into the validation part of each fold. That's important for evaluating our model. However, we still have target leakage between records in the training data. For example, let's assume our training data is this:

Record  Category  Label
1       A         0
2       A         0
3       A         0
4       B         0
5       B         0
6       B         1
7       B         1
8       C         1
9       C         1
10      C         1

In this example, a target encoding of A = 0, B = 0.5 and C = 1.0 allows for overfitting: the target encoding as a feature for record 1 already gives away that record 1 must have a label of 0, because otherwise the encoding for A would not be 0. Next, you might go for leave-current-record-out target encoding, but even that has issues: records 4 and 5 get encoding 0.67 and records 6 and 7 get encoding 0.33, which again leaks that 4 and 5 must be 0s and 6 and 7 must be 1s (otherwise their labels wouldn't differ from the other records of the same category). Additionally, this example shows a weird reversal of the effect of the target encoding as a predictor (i.e. a lower target encoding means a higher label value for the current record). In any case, some classes of models will be able to overfit to this target leakage (and this really is just overfitting, because you cannot leak the true label in such a way on a real test set).
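Here is a small sketch (pandas only, with illustrative column names) that reproduces the leave-one-out encodings from the table above and makes the leakage and the reversal visible:

```python
import pandas as pd

df = pd.DataFrame({
    "category": list("AAABBBBCCC"),
    "label":    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

grp = df.groupby("category")["label"]
sums, counts = grp.transform("sum"), grp.transform("count")
# Leave-one-out mean: remove the current row's own label before averaging.
df["loo_encoding"] = (sums - df["label"]) / (counts - 1)
print(df)
# Within category B, the rows labelled 0 get 0.67 and the rows labelled 1
# get 0.33: the encoding alone reveals the label it was supposed to hide.
```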

So, in summary, the remaining problem with "naive target encoding in a cross-validated fashion" does not make the CV evaluation invalid. However, it may hurt model performance, because it leads to overfitting. There are approaches to reduce its impact (e.g. regularization, see the next section) and even to prevent it (e.g. further nested splitting of the data within the training part of a fold-split when creating target encodings). To illustrate the latter idea: if you split your data 5-fold, each fold-split can be viewed as consisting of 5 parts (one validation part and 4 training parts); for each training part you can then calculate the target encoding from the other 3 training parts and use that as the feature on this training part, as sketched below.
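Here is a sketch of that nested-splitting idea, under the same illustrative assumptions as the earlier snippet: within the training part of an outer fold, the encoding for each inner part is computed from the remaining inner parts only.

```python
import pandas as pd
from sklearn.model_selection import KFold

def nested_oof_encode(train_part, cat_col, target_col, n_inner=4, seed=0):
    """Target-encode the training part of an outer fold without
    leaking any record's own label into its encoding."""
    encoded = pd.Series(index=train_part.index, dtype=float)
    inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
    for other_idx, part_idx in inner.split(train_part):
        others = train_part.iloc[other_idx]
        # Means from the other inner parts only.
        means = others.groupby(cat_col)[target_col].mean()
        enc = train_part.iloc[part_idx][cat_col].map(means)
        encoded.iloc[part_idx] = enc.fillna(others[target_col].mean()).values
    return encoded
```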

Regularization

Regularization here can mean multiple things. One commonly used technique is to take a weighted average with some weight parameter $\lambda \in [0,1]$ so that the target encoding would be $$\lambda \times \text{overall average} + (1-\lambda) \times \text{average for category}.$$ Another version is to pick some $N_\text{pseudo}>0$ (kind of a number of pseudo-observations that pull the category average to the overall average) and to base the regularization on the amount of records in the category $$N_\text{pseudo} / (N_\text{category}+N_\text{pseudo}) \times \text{overall average} \\ + N_\text{category} / (N_\text{category}+N_\text{pseudo}) \times \text{average for category}.$$ This second option has the nice feature of more regularization for smaller categories.
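A sketch of the second (pseudo-count) variant, with `n_pseudo` playing the role of $N_\text{pseudo}$ from the formula above and illustrative column names:

```python
import pandas as pd

def smoothed_target_encode(train, cat_col, target_col, n_pseudo=10.0):
    overall = train[target_col].mean()
    stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Shrink each category mean toward the overall mean; small categories
    # are pulled harder because n_pseudo dominates their count.
    # Returns a Series mapping category -> regularized encoding.
    return (stats["count"] * stats["mean"] + n_pseudo * overall) / (
        stats["count"] + n_pseudo
    )
```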

You can now tune these parameters like other hyperparameters and pick ones that lead to good out-of-fold performance.

This form of regularization does not really prevent target leakage, but can reduce the impact of target leakage within the training data of a fold-split (which is basically overfitting).

Implementation on the test set

This will, as it always should, be done the same way as for the validation part of each cross-validation fold-split, i.e. without using the test (or validation, in the case of CV) target information at all. The test set encoding is based on the target encoding from the training data. That's all the labelled data, if you re-train a model on all your data before applying it to the test data; or the training data of each fold-split, if you apply the models from each of your CV folds to the test data and average their results. In the latter case, each of those models uses a different target encoding.
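A minimal sketch of both options from the paragraph above, assuming pandas DataFrames `train` and `test` with illustrative columns "category" and "label":

```python
import pandas as pd
from sklearn.model_selection import KFold

# Option 1: one model re-trained on all labelled data, so one encoding
# computed from the full training set.
full_means = train.groupby("category")["label"].mean()
test["te_full"] = test["category"].map(full_means)

# Option 2: apply each CV fold's model to the test set; each model uses
# the encoding from its own fold's training part.
for fold, (tr_idx, _) in enumerate(KFold(n_splits=5).split(train)):
    fold_means = train.iloc[tr_idx].groupby("category")["label"].mean()
    test[f"te_fold{fold}"] = test["category"].map(fold_means)
```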

Note that you may have to take care of previously unseen categories that were not in the training data. The good thing is that these may already occur in your cross-validation, in which case you can evaluate your approach for dealing with them via the cross-validation. E.g. do you just use the overall average (and possibly also a frequency encoding, or some other way of flagging that it's a rare category), or do you pool rare and/or previously unseen categories into a larger "other" category, or something else? Both options are sketched below.
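Continuing the assumptions from the previous snippet, here is a sketch of both suggestions; the 1% rarity threshold is an illustrative choice:

```python
# Option A: fall back to the overall training mean for unseen categories,
# plus a frequency encoding that flags how rare each category was.
overall = train["label"].mean()
test["te"] = test["category"].map(full_means).fillna(overall)
freq = train["category"].value_counts(normalize=True)
test["cat_freq"] = test["category"].map(freq).fillna(0.0)

# Option B: pool rare and unseen categories into one "other" bucket
# before encoding.
common = freq[freq >= 0.01].index
train_pooled = train["category"].where(train["category"].isin(common), "other")
pooled_means = train["label"].groupby(train_pooled).mean()
test_pooled = test["category"].where(test["category"].isin(common), "other")
# If no training category was rare, "other" has no mean; fall back again.
test["te_pooled"] = test_pooled.map(pooled_means).fillna(overall)
```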
