Significance Testing of Cross-Validated Classification Accuracy: Shuffling vs Binomial Test

classification · cross-validation · hypothesis-testing · monte-carlo

I have a dataset with two classes and a fixed procedure for building a binary classifier. I want to measure its performance and test whether it is significantly above chance level. I measure performance with repeated cross-validation (see below). My question is: how do I test for significance?

Here is my cross-validation procedure. I use 100-fold stratified Monte Carlo cross-validation (I am not exactly sure that is the correct term; some people seem to call it bootstrap, out-of-bootstrap, or leave-group-out cross-validation): on each fold I randomly select $K=4$ test cases, 2 from each class, train the classifier on the remaining data, and classify these 4 cases. Let's say I get $a_i$ correct classifications on fold $i$. This is repeated $N=100$ times, so I get an overall number of correct classifications $A = \sum_i a_i$. I report the mean classification accuracy $A/400$ and the standard deviation of the individual fold accuracies $a_i/4$.
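For concreteness, here is a rough sketch of this procedure in Python; the names `X`, `y`, and `make_classifier` are placeholders for my data and for whatever builds the classifier, not actual code from my analysis:

```python
# Minimal sketch of the repeated (Monte Carlo) stratified cross-validation described above.
import numpy as np

def monte_carlo_cv(X, y, make_classifier, n_folds=100, per_class=2, seed=0):
    """Return the per-fold correct counts a_i (length n_folds)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    correct = np.empty(n_folds, dtype=int)
    for i in range(n_folds):
        # draw a stratified test set: `per_class` cases from each class
        test_idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
            for c in classes
        ])
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        clf = make_classifier()                      # placeholder for my classifier
        clf.fit(X[train_mask], y[train_mask])
        correct[i] = np.sum(clf.predict(X[test_idx]) == y[test_idx])
    return correct   # A = correct.sum(); mean accuracy = A / (n_folds * 2 * per_class)
```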

  • Note 1: For reasons that I think are not important here, I cannot increase $K$ and cannot use the usual k-fold cross-validation, so this Monte Carlo approach is the only one available to me. The variance of my estimator is quite large, but there is nothing I can do about that.
  • Note 2: @FrankHarrell would say that classification accuracy is a "discontinuous improper scoring rule". I know, but in this particular case I am fine with it. I am not optimizing any model here; my classifier is already given.

Now, naively I would think that a chance-level classifier predicts each case correctly with probability 50%, so the total number of correct classifications under the null hypothesis would follow $\mathrm{Binom}(400, 0.5) \approx \mathcal{N}(200,100)$. I could then simply test whether my $A$ falls above the upper $\alpha$ (say 1%) quantile of this binomial/normal distribution.
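Under this naive null, that would just be a one-sided binomial test; a sketch (reusing the `correct` array from the snippet above, and requiring SciPy ≥ 1.7 for `binomtest`):

```python
# One-sided binomial test of the total correct count A against chance (p = 0.5).
from scipy.stats import binomtest

A = correct.sum()                                   # total correct over all folds
result = binomtest(A, n=4 * len(correct), p=0.5, alternative='greater')
print(A, result.pvalue)
```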

However, I decided to do a shuffling test. I shuffle my labels, run the whole procedure above with 100 folds to get a total number of correct classifications $B_j$, and repeat this shuffling $M=100$ times. The purpose is to sample the distribution of the statistic under the null hypothesis. After obtaining 100 values $B_j$ I look at their distribution. The mean is very close to 200, which is good. However, the variance is much larger than 100; it is around 1500. I don't understand how that is possible.
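The shuffling loop, again as a sketch on top of the `monte_carlo_cv` placeholder from above:

```python
# Sample the null distribution of the total correct count B_j by repeating the
# whole Monte Carlo CV on permuted labels.
def shuffle_null(X, y, make_classifier, n_shuffles=100, seed=1, **cv_kwargs):
    rng = np.random.default_rng(seed)
    B = np.empty(n_shuffles)
    for j in range(n_shuffles):
        y_perm = rng.permutation(y)                  # destroy the label-feature association
        B[j] = monte_carlo_cv(X, y_perm, make_classifier, seed=j, **cv_kwargs).sum()
    return B                                         # compare mean/variance to Binom(400, 0.5)
```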

Looking closer, I noticed that within each shuffle the variance of the per-fold correct counts over the 100 folds is around 1, as expected ($4 \cdot 0.5 \cdot (1-0.5) = 1$). But within some shuffles the mean number of correct classifications per fold is noticeably below 2, and within others it is noticeably above 2. This additional between-shuffle variation is what makes the variance of $B_j$ so high. In contrast, if within each shuffle I use a truly random classifier instead of the classifier trained on the shuffled data according to my method, then the $B_j$ nicely follow $\mathcal{N}(200,100)$.
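Here is how I check this decomposition; the sketch assumes the per-fold correct counts from every shuffle were stored as an `n_shuffles x n_folds` array (my actual bookkeeping may differ):

```python
# Compare the within-shuffle (per-fold) variance with the across-shuffle variance
# of the totals B_j. If folds were independent, Var(B_j) would be about
# n_folds * within-fold variance, i.e. roughly 100.
def variance_decomposition(correct_per_shuffle):
    n_folds = correct_per_shuffle.shape[1]
    within_var = correct_per_shuffle.var(axis=1, ddof=1).mean()   # ~1, as expected
    B = correct_per_shuffle.sum(axis=1)
    print("mean within-shuffle fold variance:", within_var)
    print("variance of B_j across shuffles:  ", B.var(ddof=1))    # observed ~1500
    print("independent-folds prediction:     ", n_folds * within_var)
```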

How is this possible? Should I use the binomial or the empirical distribution for statistical testing? The difference is very large. And if I should use the empirical one, is there a way to approximate it without actually performing the shuffles (which take ages)?

Update

There is actually a third way to test for significance that comes to mind: I can look at a confidence interval for my $A$ and check whether it excludes 200. No binomial test, no shuffling. However, I am confused about which variance to build the confidence interval upon: the standard deviation of the $a_i$, or the standard error of their mean? The difference is of course huge. If the total number of samples is large, then my individual Monte Carlo folds can be assumed to be roughly independent, and I guess I can take the standard error of the mean of the $a_i$. But if the total number of samples is not so large, the folds are not independent anymore. On the other hand, the raw standard deviation is too large to be useful for testing. It seems I would need to divide the standard deviation of the $a_i$ by the square root of the effective number of folds (which would be less than 100), but I have no idea how to estimate it.
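To make this concrete, here is a sketch of a normal-approximation interval for $A$ built from the $a_i$; `n_eff` is a placeholder for the effective number of independent folds, which is exactly the quantity I do not know how to estimate:

```python
# Confidence interval for the total A from the fold-level counts a_i.
# With n_eff = n_folds this is the usual SEM-based interval, i.e. it assumes
# independent folds; a smaller n_eff widens the interval accordingly.
import numpy as np
from scipy.stats import norm

def ci_for_total(correct, alpha=0.01, n_eff=None):
    n_folds = len(correct)
    n_eff = n_folds if n_eff is None else n_eff
    A = correct.sum()
    se_total = correct.std(ddof=1) * n_folds / np.sqrt(n_eff)    # SE of the sum
    z = norm.ppf(1 - alpha / 2)
    return A - z * se_total, A + z * se_total                    # check whether 200 is excluded
```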

Best Answer

Radmacher and colleagues (J Comput Biol. 9:505-511) describe a procedure for assessing the significance of an error rate: permute the class labels (a few thousand times) and repeat the entire cross-validation procedure to estimate the probability of producing a cross-validated error rate as small as the one observed. A point directly relevant to the question is that fold-level summaries from cross-validation are not independent, which creates difficulties for obtaining confidence intervals and p-values; this is discussed by Jiang and colleagues (Stat Appl Genet Mol Biol. 2008;7(1)). Also note that a 'significant' error rate is a bare-minimum requirement for a prediction rule and does not say much about its usefulness. However, given that small samples are common and prediction rules have many problems, it is still helpful as a sanity check.
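As an illustrative sketch (not code from the paper), the permutation p-value is just the proportion of label-shuffled runs that do at least as well as the observed classifier, with the usual add-one correction:

```python
# Empirical (permutation) p-value from the observed total A and the shuffled totals B_j.
import numpy as np

def permutation_pvalue(A_observed, B_null):
    B_null = np.asarray(B_null)
    return (1 + np.sum(B_null >= A_observed)) / (1 + len(B_null))
```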