Solved – How to compare accuracy with k-fold cross-validation with different ‘k’

I need to compare the accuracy and F1 measures of my machine learning classifier (C1) with those of a state-of-the-art classifier (C2). However, the paper that proposed C2 tested it with 3-fold cross-validation.

Is it possible to compare 10-fold CV on C1 with 3-fold CV on C2?

If yes, could you teach me how?

I am not sure about this, since a larger k means more training data is used to learn each model, which might make the comparison unfair.

Best Answer

  • Cross validation is a technique to estimate the generalization error of a model. Comparing generalization error of two models M1 and M2 is certainly possible, and not restricted to $k$-fold cross validation (nor to equal $k$).

  • From your post it is not entirely clear whether you actually want to compare models M1 and M2 or training algorithms A1 and A2.
    Comparing training algorithms via cross validation makes stronger assumptions than comparing the predictive performance of specific (fully trained) models. Also, resampling validation (to which cross validation belongs) cannot fully measure the variance that matters for algorithm comparison - which usually leads to the additional implicit assumption that this variance doesn't matter...

  • The most powerful way to do such comparisons is a so-called paired design (read up on paired design of experiments), where you compare the performance of both models (or algorithms) on exactly the same data: that way, you are sure that both models (or algorithms) have to deal with problems of exactly the same difficulty.
    In order to do that, you need access to the exact data (including the cross validation splits) used for the reference classifier M2, and then use the same data to train and test your model M1. If comparing algorithms, you can also work with a reference implementation of A2 and then train models M2i using A2 and models M1i using your algorithm A1, both on the same splits (and many of those).
    You can then compare the predictions pairwise: same test case, M1 prediction vs. M2 prediction (see the sketch after this list).

  • If that is not possible, because all you really have is the publication about the reference classifier, the uncertainty in the comparison is larger. I.e., the "I don't know whether my classifier is better" zone is wider.
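
A minimal sketch of such a paired comparison follows. It uses assumed stand-ins that are not from the original question - a synthetic data set and logistic regression vs. a random forest in place of the actual classifiers. The point it demonstrates is that both models are fitted and evaluated on exactly the same cross validation folds, so their per-fold accuracies (and per-case predictions) can be compared pairwise.

```python
# Minimal paired-comparison sketch (assumed stand-ins: synthetic data,
# logistic regression and random forest instead of the actual C1/C2).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc_1, acc_2 = [], []

for train_idx, test_idx in cv.split(X, y):
    m1 = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    m2 = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # Both models are evaluated on exactly the same test cases -> paired design.
    acc_1.append(accuracy_score(y[test_idx], m1.predict(X[test_idx])))
    acc_2.append(accuracy_score(y[test_idx], m2.predict(X[test_idx])))

diff = np.array(acc_1) - np.array(acc_2)
print(f"mean paired accuracy difference: {diff.mean():+.3f} (sd {diff.std(ddof=1):.3f})")
```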


Bias and variance

  • Classifier testing, like any other measurement, is subject to systematic error (bias) and random error (variance).

  • Correctly implemented k-fold cross validation has a small pessimistic bias. This bias may be different for your 10-fold and the published 3-fold CV results, but it is hard to say anything further:

    • The 3-fold setup uses only 2/3 of the available data for training, so for the same algorithm and data, 3-fold typically has a larger pessimistic bias than 5-fold or 10-fold CV. But you have neither the same algorithm nor the same data set (or at least, if you have the same data set, you can do a paired design as explained above - and that implies using the same splits, so it's not 10-fold vs. 3-fold).
    • The reason for the pessimistic bias of cross validation is that we take an (unbiased) measurement of the generalization error of a model trained on $\frac{k-1}{k}$ of the data and use this as an approximation of the generalization error of a model trained on the whole data set. That latter model is on average better because it has more training data. So the pessimistic bias is a direct consequence of how much the learning curve still improves when going from $\frac{k-1}{k}$ of the data to the whole data set - for that algorithm (for algorithm comparison) or for that algorithm together with the data set at hand (for model comparison).
    • And this is why all bets are off here: the learning curve of your algorithm on your data may be steeper or flatter than that of the state-of-the-art algorithm (on your data or on its own data), even if you use a data set of the same size as the reference publication.
  • This hinges on the cross validation splits being statistically independent: if you have groups in your data, read up on independent splitting and confounding variables.

  • Random error, on the other hand, depends on the absolute number of tested cases across all k folds, i.e. on the total number of cases you have. Further sources of random error are model instability, which you can measure by iterating/repeating the cross validation (sketched below), and - for algorithm comparison only - the random deviation of the actual data set you have from the population that data comes from.
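
The rough sketch below illustrates both effects for a single algorithm on a single data set, again with assumed stand-ins (synthetic data and logistic regression): the means of repeated 3-fold and 10-fold CV hint at the difference in pessimistic bias, while the spread over the repeated splits reflects the random error.

```python
# Rough sketch with assumed stand-ins (synthetic data, logistic regression):
# repeated 3-fold vs. 10-fold CV on the same data. The means hint at the
# pessimistic bias (3-fold trains on less data); the spread over the repeated
# splits reflects the random error / model instability.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
clf = LogisticRegression(max_iter=1000)

for k in (3, 10):
    cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=20, random_state=1)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{k:2d}-fold CV: mean accuracy {scores.mean():.3f}, "
          f"spread over folds/repeats {scores.std(ddof=1):.3f}")
```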

Actually comparing the two classifiers

  • Make sure the figure of merit you use is a good one: read up on proper scoring rules.
  • If you have to use fraction-of-tested-cases figures of merit (such as accuracy), confidence intervals for these can be calculated; see binomial confidence intervals or confidence intervals for proportions.
    Depending on your sample size, a quick look at an approximate confidence interval (sketched below) may already tell you all you need.
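
For instance, statsmodels offers binomial confidence intervals via proportion_confint; the counts in the sketch below are placeholders for your pooled results over all test folds, not real numbers. (Proper scoring rules such as the Brier score and log loss are available in sklearn.metrics as brier_score_loss and log_loss.)

```python
# Quick sketch of an approximate (Wilson) binomial confidence interval for a
# reported accuracy. n_total and n_correct are placeholder counts pooled over
# all cross validation test folds.
from statsmodels.stats.proportion import proportion_confint

n_total = 450      # total number of tested cases across all folds (placeholder)
n_correct = 390    # correctly classified cases (placeholder)

low, high = proportion_confint(n_correct, n_total, alpha=0.05, method="wilson")
print(f"accuracy {n_correct / n_total:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```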