Solved – Classification accuracy heavily depends on random seed

classification, cross-validation, machine-learning, randomness

I want to compare different classification methods and evaluate their prediction measures (accuracy, etc.). I first split the data into training and test sets. With the training data I then perform 10-fold CV to tune the classification method (SVM, LASSO).

After that I calculate the prediction measures with the best model on the test data. Depending on how I split my data set into training and test sets (i.e., which random seed I use), the performance of the models changes heavily, and one method suddenly becomes better than another.
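Roughly, my procedure looks like the following sketch (scikit-learn on a toy dataset, just to illustrate; my real data and tuning grids are different):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# toy data standing in for my real dataset
X, y = make_classification(n_samples=150, n_features=20, random_state=0)

for seed in range(5):
    # split into training and test set -- the seed is what I vary
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    # 10-fold CV on the training data to tune the classifier
    model = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=10)
    model.fit(X_tr, y_tr)
    # the test accuracy of the tuned model changes noticeably with the seed
    print(seed, model.score(X_te, y_te))
```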

I could repeat the calculation of the performance measures multiple times with different random seeds, each time taking the entire data set again and splitting it into new training and test sets after having tuned the classification method.
It seems that this way I would be underestimating the 'true' error, but I am not sure whether that is really the case.

Are there better solutions to this problem?

Best Answer

If I understand your methodology correctly, you are using a single train-validation-test split: you train a classifier on the training data, tune it on the validation set, and then evaluate final performance on the test set. There is nothing inherently wrong with this, but if you have a relatively small dataset with a relatively large amount of noise, a single data split is likely to give estimates that vary quite a bit depending on the partitions, which is exactly what you have described. There has been a lot of discussion on data splitting in general and when it is appropriate, such as here.

In your case, I would look into nested cross-validation, where you would partition the entire dataset into, say, five folds. For a single iteration, you would create a training set out of four of those folds and leave the last fold as the test set. The training set would then be partitioned again into, say, another five folds (or perhaps an inner train-validation split if this is computationally too expensive) to tune your hyperparameters/select the best model, and then you would evaluate the best model found on the test set. You would then repeat this for the other folds to get five error estimates, from which you can take an average.
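To make the idea concrete, here is a minimal sketch of one pass of nested cross-validation with scikit-learn (the estimator, grid, and fold counts are just placeholders, not a recommendation for your data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # estimates generalization error

# the GridSearchCV object re-tunes the SVM inside every outer training fold
tuned_svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# one error estimate per outer test fold; average them for the final figure
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(outer_scores, outer_scores.mean())
```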

This still doesn't get around the problem that the initial split of the entire dataset could lead to significant differences in the performance estimates. In that case, you could use repeated nested cross-validation: repeat the entire process above multiple times with different seeds to generate different partitions (search for "repeated k-fold cross-validation") to hopefully get better estimates of performance. I know there are also other alternatives, such as the optimism-adjusted bootstrap, which can be read about at this site. Regardless, the point is that we do many repeats of our resampling strategy. With your original strategy of a single train-validation-test split, you were on the right track: just repeat that process many more times (with different seeds, of course) and take an average over the repeats to get estimates with a lower standard error. Pick the classifier that gives the best error estimate over all repeats. The difference between the nested cross-validation approach I have described and your approach, however, is that every observation is guaranteed to appear in the test set exactly once per repeat.
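The repeated version is just the nested procedure wrapped in a loop over seeds; something along these lines (again only a sketch, with arbitrary choices for the grid and the number of repeats):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

repeat_means = []
for seed in range(10):  # 10 repeats of nested CV
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=seed + 100)  # new partition each repeat
    tuned_svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
    repeat_means.append(cross_val_score(tuned_svm, X, y, cv=outer_cv).mean())

# averaging over repeats reduces the standard error of the performance estimate
print(np.mean(repeat_means), np.std(repeat_means))
```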

I hope this helps.