Solved – repeated cross-validation with a small dataset, and/or how can I improve confidence in the cross-validation results

cross-validation

For a university project we need to classify 3 cancer types and give an estimate of how well our model will perform. We received a dataset with 100 samples. We split the data into a training set and a test set using stratified sampling with a 0.7/0.3 ratio. The resulting training set consists of 69 samples and the test set of 31 samples.
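For reference, a minimal sketch of such a stratified 70/30 split using scikit-learn; the synthetic data, feature count, and random seed here are illustrative placeholders, not our actual dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 100-sample, 3-class dataset (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)

# Stratified 70/30 split: class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # e.g. (70, 20) (30, 20)
```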

We used 10-fold cross-validation to estimate the accuracy of our models. When we then apply each model to the test set, for most models the test-set accuracy is 10-15% worse than the cross-validation accuracy on the training set, except for one model whose test-set accuracy was 2% better than during cross-validation.
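This is roughly how we compute the 10-fold cross-validated accuracy; the logistic-regression model and the generated data below are just placeholders for the classifiers and data we actually use:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative 3-class training set of roughly 70 samples
X_train, y_train = make_classification(
    n_samples=70, n_features=20, n_informative=5, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # placeholder classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2%} +/- {scores.std():.2%}")
```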

The problem we have now is that the two best-scoring models under cross-validation are not significantly different: one has an accuracy of 88.57% +/- 12.45%, the other an accuracy of 88.00% +/- 7.92%. However, on the test set the first scores 76% and the second scores 90%.

If we understand it correctly, we can't choose the second model as the best model based on the test-set results, because then we would effectively be using the test set for model selection, i.e., as training data. Instead, we would like to use repeated cross-validation to improve our confidence in the cross-validation results and thereby, hopefully, be able to choose the best model.

With the small dataset that we have, if we do repeated cross-validation and average the results, would we run into the problem that the same folds are used multiple times?
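For concreteness, this is roughly what we mean by repeated cross-validation, sketched with scikit-learn's RepeatedStratifiedKFold (model, data, seed, and number of repeats are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative 3-class training set of roughly 70 samples
X_train, y_train = make_classification(
    n_samples=70, n_features=20, n_informative=5, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # placeholder classifier
# 10-fold CV repeated 10 times; each repetition uses a different random split into folds
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=rcv, scoring="accuracy")
print(f"Repeated CV accuracy: {scores.mean():.2%} +/- {scores.std():.2%}")
```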

Best Answer

It seems as if you are using an improper scoring rule: the proportion correctly classified. Optimizing this measure will select a bogus model.

You will need to repeat 10-fold cross-validation 100 times to get sufficiently precise validation estimates, and be sure to use a proper scoring rule (e.g., the Brier score (quadratic error score) or the logarithmic scoring rule (log-likelihood)).
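A minimal sketch of that recipe with scikit-learn: 10-fold cross-validation repeated 100 times, scored with the logarithmic scoring rule (log loss). The model and data are placeholders; a multiclass Brier score would need a small custom scorer, since scikit-learn's built-in brier_score_loss handles only the binary case.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative 3-class training set of roughly 70 samples
X_train, y_train = make_classification(
    n_samples=70, n_features=20, n_informative=5, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # placeholder probability model
# 10-fold CV repeated 100 times, scored with log loss (a proper scoring rule)
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=rcv, scoring="neg_log_loss")
print(f"Mean log loss across 1000 folds: {-scores.mean():.3f}")
```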