Machine Learning Classification – How to Transform Data-Dependent Target Variables into Classes Using Scikit-Learn

binning, classification, machine-learning, scikit-learn

I was redirected from StackOverflow because my question is more about theory.

I have the usual set-up: a pandas dataframe with some features and a numeric target variable (financial returns, for example). Now I want to turn this into a classification problem: rather than predicting the numerical value of the return, I want to predict classes. My question is about how to correctly create the classes of the target variable when the class definitions themselves depend on the data. For example, I want to create 4 classes (1, …, 4) for the target variable based on its quartiles. My belief is that, even though I have the full data set, I cannot calculate the quantiles on the whole target variable, make a train/test split afterwards, and then do CV on the train set, because the quantile values used to create the classes would then be based on the test data as well.

So my question is: how can I approach such a task within an sklearn framework? I saw that there is a class TransformedTargetRegressor which goes in this direction; one could possibly use it together with KBinsDiscretizer to transform the target variable. But a problem I see there is that it always transforms the classes back into numerical values when calling .predict etc., whereas I want to do classification, not predict numerical values.
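For concreteness, here is a toy version of the setup from the first paragraph (the data and column names are made up); the pd.qcut call on the full target before the split is exactly the step I suspect leaks test information into the class definitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, purely for illustration.
df = pd.DataFrame({
    "feature": [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.5, -0.3, 0.2],
    "ret":     [0.05, -0.02, 0.12, 0.01, 0.08, -0.06, 0.03, 0.15, -0.10, 0.10],
})

# Quartile classes computed on the WHOLE target -- this is what I suspect is wrong,
# because the bin edges then also depend on observations that end up in the test set.
df["ret_class"] = pd.qcut(df["ret"], q=4, labels=[1, 2, 3, 4])

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature"]], df["ret_class"], test_size=0.3, random_state=0
)
```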

Or: would it be acceptable to estimate the quartiles on the whole dataset and then classify all target observations based on them? But then I would have a data leakage problem, right?

Happy for any help.

Best Answer

As per @Sycorax's suggestion, I'm expanding my first comment as an answer...


I think you should do it in the same way you would use any other preprocessing transformer. That is:

  1. Calculate the quartiles using the training set only
  2. Transform your target variable from continuous to categorical on both the training and test sets, using the quartiles found in the previous step (see the sketch below)
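A minimal sketch of how this could look in scikit-learn, assuming y_train and y_test are the continuous targets after the split. KBinsDiscretizer with strategy="quantile" is my suggestion for automating the quartile step, not something from the original question; note that it labels the bins 0–3 by default, and its convention for values falling exactly on a cut point may differ slightly from the formula further down:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Fit the quartile bins on the TRAINING target only (the transformer expects a 2-D array).
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
y_train_class = binner.fit_transform(np.asarray(y_train).reshape(-1, 1)).ravel().astype(int) + 1

# Re-use the same training-set bin edges for the test target.
y_test_class = binner.transform(np.asarray(y_test).reshape(-1, 1)).ravel().astype(int) + 1
```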

I'll illustrate this with an example. For simplicity, let's imagine you only had 10 observations. This is what your target variable might look like: [figure: full dataset of 10 target values]

Next, you randomly split your dataset into train (70%) and test (30%).

Train: [figure: training set]

Test: [figure: test set]

(I know this all looks a bit ridiculous with such a small number of observations, but the main idea is the important bit.)

Now, you calculate the quartiles from the training set. These are: $$q_1=-0.040 \\ q_2=0.100 \\ q_3=0.145$$ Using this information, you now proceed to transform your target variable on both the training and test sets using the following logic: $$Y^* = \begin{cases} 1 & \text{if } Y\leq q_1\\ 2 & \text{if } q_1 < Y\leq q_2\\ 3 & \text{if } q_2 < Y\leq q_3\\ 4 & \text{if } Y > q_3 \end{cases}$$
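In code, this mapping is essentially a one-liner; here is a small sketch using np.digitize with the cut points above (the sample values are made up), where right=True reproduces the "less than or equal" convention of the formula:

```python
import numpy as np

q1, q2, q3 = -0.040, 0.100, 0.145  # quartiles estimated from the training set

def to_class(y):
    # np.digitize with right=True gives 0 for y <= q1, 1 for q1 < y <= q2, etc.;
    # adding 1 yields the labels 1..4 from the formula.
    return np.digitize(y, bins=[q1, q2, q3], right=True) + 1

to_class(np.array([-0.10, 0.05, 0.12, 0.30]))  # -> array([1, 2, 3, 4])
```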

Train: [figure: transformed training set]

Test: [figure: transformed test set]

Now you can train a model using $Y^*$ as your target variable.
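For completeness, a minimal sketch of that last step, assuming X_train and X_test hold your features and y_train_class / y_test_class come from the transformation above (RandomForestClassifier is just an arbitrary choice of classifier):

```python
from sklearn.ensemble import RandomForestClassifier

# Fit an ordinary classifier on the class labels derived from the training-set quartiles.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train_class)

# Predictions are class labels (1..4), not back-transformed numeric returns.
pred_classes = clf.predict(X_test)
print(clf.score(X_test, y_test_class))  # accuracy on the test classes
```

Because the target is transformed once, up front, .predict returns class labels rather than back-transformed numeric returns, which avoids the issue you describe with TransformedTargetRegressor.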
