Solved – Class labels in data partitions

classification, data mining, machine learning, partitioning

Suppose one partitions the data into training/validation/test sets before applying some classification algorithm, and it turns out that the training set doesn't contain all the class labels present in the complete dataset. For example, records with label "x" appear only in the validation set and not in the training set.

Is this a valid partitioning? It can have several consequences: the confusion matrix would no longer be square, and any error evaluated during the algorithm would be affected by labels that were never seen in the training set.

The second question is: is it common for partitioning algorithms to take care of this issue and partition the data so that the training set contains all existing labels?

Thank you

Best Answer

If you consider stratified sampling, I think something similar could be done here, assuming no class is so under-represented that it has fewer than 3 examples (one for training, one for validation and one for testing).
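To check that assumption before splitting, you could count the examples per class and flag any class that cannot cover all three sets. This is only a minimal sketch: the helper name `rare_classes` and the toy `labels` list are made up for illustration.

```python
from collections import Counter

def rare_classes(labels, min_count=3):
    """Return {class: count} for classes with fewer than min_count examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_count}

# Toy labels (hypothetical): class "x" has only 2 examples,
# so it cannot appear in all three of training/validation/test.
labels = ["a"] * 10 + ["b"] * 5 + ["x"] * 2
print(rare_classes(labels))  # {'x': 2}
```

Any class reported here has to be handled separately (more data, merging classes, or dropping it) before a three-way stratified split can work.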

Using a method like stratified sampling, you split each class separately: instances of each class are randomly assigned to every data set, which makes certain that each class is represented in all of them.
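As a rough illustration of the idea, here is a plain-Python sketch of a three-way stratified split. The function name `stratified_split`, the 60/20/20 default fractions and the toy labels are my own assumptions, not something from a particular library. (If you use scikit-learn, `train_test_split` has a `stratify` parameter that does something similar for a two-way split.)

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split example indices into train/validation/test lists, class by class.

    Every class needs at least 3 examples and gets at least one example
    in each set; the rest follow `fractions` approximately.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label, indices in by_class.items():
        n = len(indices)
        if n < 3:
            raise ValueError(f"class {label!r} has fewer than 3 examples")
        rng.shuffle(indices)
        # At least one example per set; cap the sizes so that the
        # training set always keeps at least one example too.
        n_val = min(max(1, round(n * fractions[1])), n - 2)
        n_test = min(max(1, round(n * fractions[2])), n - n_val - 1)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test

# Toy labels (hypothetical): "x" is rare but still has 3 examples.
labels = ["a"] * 10 + ["b"] * 5 + ["x"] * 3
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 10 4 4
```

With this kind of split, the label "x" shows up in all three sets, so the confusion matrix stays square and no label is unseen during training.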

If you are running into this problem, you might also question whether you have enough data to train your algorithm well. Sometimes the correct answer is to get more data, assuming that is possible.