I was redirected from StackOverflow because my question is more about theory.
I have a typical setup: a pandas DataFrame with some features and a numeric target variable (financial returns, for example). I want to turn this into a classification problem: rather than predicting the numerical value of the return, I want to predict classes. My question is how to create the classes of the target variable correctly when their definition depends on the data. For example, I want to create 4 classes (1, …, 4) based on the quartiles of the target variable. My belief is that I cannot compute the quartiles on the whole target variable and only afterwards make a train/test split and run CV on the training set, because the quartile values used to create the classes would then also be based on the test data. So how can I approach such a task within the sklearn framework? I saw that there is a class TransformedTargetRegressor, which goes in this direction: one could possibly combine it with KBinsDiscretizer to transform the target variable. The problem I see is that it always back-transforms the classes into numerical values when calling .predict etc., but I want to do classification, not predict numerical values.
Or would it be acceptable to estimate the quartiles on the whole dataset and then classify all target observations based on them? But then I would have a data leakage problem, right?
Happy for any help.
Best Answer
As per @Sycorax's suggestion, I'm expanding my first comment as an answer...
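I think you should do it the same way you would use any other preprocessing transformer. That is: estimate the transformation (here, the quartiles) on the training set only, and then apply it to both the training and test sets.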
I'll illustrate this with an example. For simplicity, let's imagine you only had 10 observations. This is what your target variable might look like:
Next, you randomly split your dataset into train (70%) and test (30%).
Train:
Test:
(I know this all looks a bit ridiculous with such a small number of observations but the main idea is the important bit).
Now, you calculate the quartiles from the training set. These are:
$$q_1=-0.040, \quad q_2=0.100, \quad q_3=0.145$$
Using this information, you now proceed to transform your target variable on both the training and test sets using the following logic:
$$Y^* = \begin{cases} 1 & \text{if } Y\leq q_1\\ 2 & \text{if } q_1 < Y\leq q_2\\ 3 & \text{if } q_2 < Y\leq q_3\\ 4 & \text{if } Y > q_3 \end{cases}$$
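The binning rule above could be written in NumPy roughly as follows. This is only a sketch: the simulated returns, the seed, and the helper name `to_class` are my own illustrative assumptions, not the 10 observations from the example. The point is that the cut points come from the training part only and are then reused unchanged on the test part.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical returns, simulated purely for illustration.
y = rng.normal(loc=0.05, scale=0.1, size=1000)

# Random 70/30 split of the target.
idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
y_train, y_test = y[idx[:n_train]], y[idx[n_train:]]

# Quartile cut points estimated from the training set only.
q1, q2, q3 = np.quantile(y_train, [0.25, 0.50, 0.75])


def to_class(values, cuts):
    """Map values to classes 1..4 using right-closed bins:
    1 if Y <= q1, 2 if q1 < Y <= q2, 3 if q2 < Y <= q3, 4 if Y > q3."""
    return np.digitize(values, cuts, right=True) + 1


# The same training-set cut points are applied to both parts.
y_train_cls = to_class(y_train, [q1, q2, q3])
y_test_cls = to_class(y_test, [q1, q2, q3])
```

Note that the classes will be (almost) perfectly balanced on the training set by construction, but not necessarily on the test set, which is expected.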
Train:
Test:
Now you can train a model using $Y^*$ as your target variable.
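Since the question also mentions cross-validation on the training set, here is one way this could look in scikit-learn. Again a sketch under assumptions: the data are simulated, `LogisticRegression` is just a placeholder classifier, and `quartile_cuts` and `to_class` are hypothetical helpers. Because a scikit-learn `Pipeline` only transforms `X` and not `y`, the sketch uses an explicit loop over folds and re-estimates the quartiles inside each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
# Hypothetical features and returns, simulated purely for illustration.
X = rng.normal(size=(500, 3))
y = 0.05 * X[:, 0] + rng.normal(scale=0.1, size=500)


def quartile_cuts(y_fit):
    """Estimate the three quartile cut points from the given targets."""
    return np.quantile(y_fit, [0.25, 0.50, 0.75])


def to_class(values, cuts):
    """Map values to classes 1..4 using right-closed bins."""
    return np.digitize(values, cuts, right=True) + 1


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Cross-validation on the training set: the cut points are re-estimated
# on the fitting part of each fold and applied to the held-out part.
cv_scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, val_idx in kfold.split(X_train):
    cuts = quartile_cuts(y_train[fit_idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[fit_idx], to_class(y_train[fit_idx], cuts))
    cv_scores.append(
        clf.score(X_train[val_idx], to_class(y_train[val_idx], cuts))
    )

# Final model: cut points from the full training set, evaluated on test.
cuts = quartile_cuts(y_train)
clf = LogisticRegression(max_iter=1000).fit(X_train, to_class(y_train, cuts))
test_accuracy = clf.score(X_test, to_class(y_test, cuts))
```

The model's `.predict` then returns class labels directly, so no back-transformation to numerical values is involved.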