Solved – How to make train/test split with given class weights

kaggle, machine learning, python

I am working on a simple multi-class classification problem.

I was given training data with perfectly balanced classes. However, the data I must predict on is not balanced. I was able to deduce the class proportions of the test data.

Is there a way to split the training data into training and validation sets so that the validation set has arbitrary, user-specified class proportions?

In short: lots of people want to make balanced training and validation sets from imbalanced data. I want the reverse: an imbalanced validation set made from a balanced training set.

Reasoning: I want my validation set to look like the test set. I know that 2 labels out of 7 cover 90% of the data in the test set (while they cover only 28% in the training set), and I want my validation set to have the same structure.

Best Answer

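I'm not sure about the purpose of your task, but note that the stratify argument of train_test_split takes the array of class labels, not a list of proportions:

from sklearn.model_selection import train_test_split

# stratify expects the labels themselves; each split keeps the class mix of y
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)

Because this reproduces the proportions already present in y, balanced data gives a balanced validation set. To impose arbitrary proportions, you have to sample the validation rows class by class, drawing from each class a count proportional to its target share, and keep the remaining rows for training.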
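As a rough illustration of drawing a validation set with chosen class proportions, here is a minimal sketch using NumPy only. The helper name split_with_target_proportions, its parameters, the toy data (7 classes with 100 rows each), and the target shares (45% each for two classes, 2% for the rest) are all assumptions made for the example; the question does not give the actual per-class breakdown.

```python
import numpy as np


def split_with_target_proportions(y, target_props, val_size, seed=0):
    """Hypothetical helper: return (train_idx, val_idx) index arrays.

    y            : 1-D array of class labels
    target_props : dict mapping label -> desired share of the validation set
    val_size     : desired number of validation rows
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    val_parts = []
    for label, share in target_props.items():
        class_idx = np.flatnonzero(y == label)   # rows belonging to this class
        n_take = int(round(share * val_size))    # rows this class contributes
        if n_take > class_idx.size:
            raise ValueError(
                f"class {label!r} has only {class_idx.size} rows, need {n_take}"
            )
        # Sample without replacement so no row is picked twice
        val_parts.append(rng.choice(class_idx, size=n_take, replace=False))
    val_idx = np.sort(np.concatenate(val_parts))
    # Everything not drawn for validation stays in the training set
    train_idx = np.setdiff1d(np.arange(y.size), val_idx)
    return train_idx, val_idx


# Toy data (illustrative only): 7 balanced classes, 100 rows each
y = np.repeat(np.arange(7), 100)
X = np.arange(y.size).reshape(-1, 1)

# Assumed target: classes 0 and 1 cover 90% of validation, the rest share 10%
target = {0: 0.45, 1: 0.45, 2: 0.02, 3: 0.02, 4: 0.02, 5: 0.02, 6: 0.02}

train_idx, val_idx = split_with_target_proportions(y, target, val_size=100)
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
```

Two things to keep in mind with this approach: the validation size is capped by the most in-demand class (with N rows per class and a largest share p, at most about N / p validation rows are possible), and the remaining training set is no longer perfectly balanced, since the frequent validation classes lose more rows.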
