Solved – Target encoding a categorical variable in a highly imbalanced dataset for binary classification

categorical-encoding, feature-engineering, machine-learning, predictive-models

I have a categorical variable, Industry, in a dataset of over 400K datapoints. The dataset is highly imbalanced, with a class ratio of roughly 99/1. What I am doing is significantly undersampling the majority class to create a balanced 50/50 dataset of 8,000 datapoints. (I lose most of the training points this way, but I need most of them to have predictions made on them anyway, and since they are mainly from the majority class I am not losing much rare-class data.)

What I would like to do is mean encode this variable, Industry, against the target. For example, suppose "NYC" appears 1,000 times in the training data and 100 of those rows are in the positive class; then every datapoint with NYC gets 100/1000 = 0.1 as the value of this new feature, and so on.
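In pandas terms, the computation described above is just the per-category mean of the binary target. A minimal sketch (the column names and toy counts are illustrative, not from the actual dataset):

```python
import pandas as pd

# Toy training frame standing in for the real data: "NYC" appears 1,000 times
# with 100 positives, "Other" appears 500 times with 100 positives.
train = pd.DataFrame({
    "Industry": ["NYC"] * 1000 + ["Other"] * 500,
    "target":   [1] * 100 + [0] * 900 + [1] * 100 + [0] * 400,
})

# Mean of a 0/1 target per category = positive rate for that category.
encoding = train.groupby("Industry")["target"].mean()

# Map each row's category to its encoded value.
train["Industry_te"] = train["Industry"].map(encoding)

print(encoding["NYC"])    # 100/1000 = 0.1
print(encoding["Other"])  # 100/500  = 0.2
```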

The problem is that undersampling to balance the dataset skews this ratio: it creates artificially good ratios for the minority class, which now represents 50% of my training data rather than 1%. As a result, the encoded column in training will significantly overstate the ratio compared to the test set and will not help my machine learning algorithms at all.
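To make the skew concrete: a small simulation (assuming uniform random undersampling of the negatives, which is not stated explicitly in the question) shows the per-category rate on the balanced sample is wildly inflated. If the encoding must be computed on the balanced sample, the inflated rate can be mapped back to the full-data rate with the standard prior-correction formula for undersampling; the simpler fix is to compute the encoding on the full data before undersampling.

```python
import pandas as pd

# One Industry level in an imbalanced dataset: 10,000 rows, 1% positive.
full = pd.DataFrame({
    "Industry": ["Retail"] * 10_000,
    "target":   [1] * 100 + [0] * 9_900,
})
true_rate = full.groupby("Industry")["target"].mean()["Retail"]  # 0.01

# 50/50 undersample: keep all 100 positives, sample 100 negatives.
pos = full[full["target"] == 1]
neg = full[full["target"] == 0].sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg])
skewed_rate = balanced.groupby("Industry")["target"].mean()["Retail"]  # 0.5

# Prior correction: if negatives were kept with probability beta, the
# full-data rate p relates to the skewed rate p_s by
#   p = beta * p_s / (beta * p_s - p_s + 1)
beta = len(neg) / (full["target"] == 0).sum()  # 100 / 9900
corrected = beta * skewed_rate / (beta * skewed_rate - skewed_rate + 1)

print(true_rate, skewed_rate, corrected)  # 0.01, 0.5, 0.01
```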

What should I do in this case to create a good mean encoded ratio?

Best Answer

I hope you have followed the good advice in the comments. Why do you use downsampling? It is most often used to solve a non-problem; see Why downsample? and many of its answers. With only 400K rows, memory shouldn't be a problem, and if it is, get better software. The real problem may be the use of accuracy, which is an improper scoring rule; see Is accuracy an improper scoring rule in a binary classification setting?.

Then the question about target (or mean) encoding. That is an idea from machine learning, used for categorical variables with very many levels. Your variable Industry probably does not have that many levels, so you could try other ideas, such as dummy variables with regularization; maybe try glmnet. glmnet uses sparse matrices, so many levels isn't a big problem. If there are many levels, see some of the ideas here: Principled way of collapsing categorical variables with many levels?.
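glmnet is an R package; for readers working in Python, an analogous sketch of "dummy variables with regularization" is one-hot encoding plus a penalized logistic regression in scikit-learn (the data below is synthetic and the penalty strength is an arbitrary placeholder, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Industry column and a binary target:
# "Tech" has a higher positive rate than the other levels.
rng = np.random.default_rng(0)
industries = rng.choice(["Retail", "Tech", "Energy", "Media"], size=500)
y = (rng.random(500) < np.where(industries == "Tech", 0.3, 0.1)).astype(int)

# One dummy per level; L2 regularization shrinks the coefficients of rare
# levels toward zero, which is the role glmnet plays in the R workflow.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # sparse output by default
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
model.fit(industries.reshape(-1, 1), y)
probs = model.predict_proba(industries.reshape(-1, 1))[:, 1]
```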

Finally, if you still go for target encoding, see my answer here: Strange encoding for categorical features.
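If you do go for target encoding, two common precautions (a general recipe, not necessarily what the linked answer recommends) are smoothing small categories toward the overall mean and computing the encoding out-of-fold to limit target leakage. A sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5, smoothing=10.0):
    """Out-of-fold, smoothed target encoding (a common recipe)."""
    encoded = pd.Series(index=df.index, dtype=float)
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        tr = df.iloc[train_idx]
        fold_mean = tr[target].mean()
        stats = tr.groupby(col)[target].agg(["mean", "count"])
        # Shrink small categories toward the fold's overall mean.
        smooth = (stats["count"] * stats["mean"] + smoothing * fold_mean) \
                 / (stats["count"] + smoothing)
        # Unseen categories in the validation fold fall back to the fold mean.
        encoded.iloc[val_idx] = (
            df[col].iloc[val_idx].map(smooth).fillna(fold_mean).values
        )
    return encoded

# Toy usage: level "A" has a 20% positive rate, level "B" 60%.
df = pd.DataFrame({
    "Industry": ["A"] * 50 + ["B"] * 50,
    "target":   [1] * 10 + [0] * 40 + [1] * 30 + [0] * 20,
})
df["Industry_te"] = target_encode_oof(df, "Industry", "target")
```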
