Solved – Cross Validation with duplicates and (un)balanced data

classification · cross-validation · unbalanced-classes

I am currently working on a student project where we do a binary classification, but the data is highly skewed!

The train AND test data contain a huge number of duplicates, where every row is identical. We don't know exactly how to handle this problem, because the test data contains many duplicates as well, and we cannot delete rows there since we need to predict the binary outcome for them.

My question now relates to how we can do cross validation with the training data to estimate our model's performance, with the best attributes and parameters, for the unknown test data.

If we keep the duplicates in our training data, the binary classification problem is balanced: we have almost the same number of 0s and 1s for the binary class. But if we delete the duplicates, we get a lot more 0s than 1s (2/3 are 0s and 1/3 are 1s). As already mentioned, we also have a lot of duplicates in the test data, so we assume that with the duplicates the test data is balanced as well, and that without them it has more 0s than 1s.

How do we do a good Cross Validation for this problem?

Do we leave the training data as it is and not delete any duplicate rows? Do we delete the duplicate rows and then do the cross validation and prediction on the unbalanced data? Or do we have to delete the duplicates per CV fold for the training fold, then balance it, and then predict on the unbalanced CV test fold where we didn't delete any rows?

Best Answer

If your training data contains exact duplicates with equal labels (absolutely the same data point with the same label, repeated over and over), delete the duplicates. Make sure they are exactly the same in both observation and label (e.g., for some reason, the same exact data from the same subject was recorded twice).
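For illustration, a minimal sketch of that deduplication step, assuming the training data sits in a pandas DataFrame; the file `train.csv` and the `label` column are hypothetical:

```python
import pandas as pd

# Hypothetical file and column names; adjust to your data.
train = pd.read_csv("train.csv")

# Drop rows that are identical in EVERY column, i.e. same features AND same
# label. Rows that share features but differ in label are kept.
train_dedup = train.drop_duplicates()

print(f"rows before: {len(train)}, after deduplication: {len(train_dedup)}")
```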

After removing your duplicates, create your folds with respect to the number of observations in each class, i.e. stratify so that each fold represents the overall class distribution. Then each fold represents the true distribution of your dataset.

If you do so, you train on all folds except one and test on the remaining one, and the test fold also has a realistic class distribution.
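As a sketch of such stratified folds, using scikit-learn's StratifiedKFold (file and column names are again hypothetical assumptions):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train_dedup = pd.read_csv("train.csv").drop_duplicates()
X = train_dedup.drop(columns="label").to_numpy()
y = train_dedup["label"].to_numpy()

# Stratification preserves the overall class ratio (here roughly 2/3 vs 1/3)
# in every fold, so each held-out fold mirrors the realistic distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in skf.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit the model on (X_train, y_train), evaluate on (X_test, y_test) ...
```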

The way you use the training data to train your model within cross validation, though, is another story. If you use neural nets or something else that requires minibatch training, use stratified minibatches. If not, you can randomly select an equal number of observations from both classes for training, or use other tricks. But create your folds realistically, and keep this decision separate from fixing the training issues caused by unbalanced data.
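A minimal sketch of one such trick, random undersampling of the majority class applied to the training fold only; the helper `undersample` is a hypothetical name, not from any particular library:

```python
import numpy as np

def undersample(X_train, y_train, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.
    Apply this to the training fold ONLY; leave the test fold untouched so
    it keeps its realistic class distribution."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y_train == 1)
    idx_neg = np.flatnonzero(y_train == 0)
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, size=n, replace=False),
                           rng.choice(idx_neg, size=n, replace=False)])
    return X_train[keep], y_train[keep]
```

Many scikit-learn classifiers also accept `class_weight="balanced"`, which reweights the loss instead of discarding data; either way, the held-out fold is left as-is.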

Here is some more information regarding removing duplicates in different scenarios:

1) It does not make any sense to have the same data point more than once; it does not add any new information. If the duplicates are not exactly the same series, or match only in some of the feature dimensions, that's another story: keep them as they are (see the sketch after this list).

2) If you have two separate time series recorded from the same subject, and the values at all time steps are identical, then I think you can remove one of them, provided the time step unit is small enough (e.g., a second, not a year).

3) But if your data is something like stock market prices and the time series are the prices during a month, it is possible that two identical series are not actually the same (e.g., the price didn't change at all for two months in a row); this information is indeed valuable, and such series should be kept.
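To separate the two cases in scenario 1, pandas can distinguish full duplicates (same features AND same label, safe to drop) from feature-only duplicates (same features, possibly different labels, which should be kept); the `subset` argument of `duplicated` is the standard mechanism, and the column names are again hypothetical:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical, as above
feature_cols = [c for c in train.columns if c != "label"]

# Full duplicates: identical in every column, drop all but the first copy.
full_dupe = train.duplicated(keep="first")
# Feature-only duplicates: same features but not a full duplicate -- keep them.
feature_only_dupe = train.duplicated(subset=feature_cols, keep="first") & ~full_dupe

train_clean = train[~full_dupe]
```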

UPDATE: I added a summary of our discussions with @janbauer and @cbeleites in the comments to the answer.
