Classification – Is Cross-Validation Effective with Small Sample Sizes?

classification, cross-validation

I have what seems to be a very basic confusion about cross-validation.

Let us say I am building a linear binary classifier, and I want to use cross-validation to estimate the classification accuracy. Imagine now that my sample size $N$ is small, but the number $k$ of features is large. Even when features and classes are randomly generated (i.e. the "actual" classification accuracy should be 50%), it can happen that one of the features perfectly predicts the binary class. If $N$ is small and $k \gg N$, such a situation is not unlikely. In this scenario I will get 100% classification accuracy with any number of cross-validation folds, which obviously does not represent the actual power of my classifier, in the sense that the probability of classifying a new sample correctly is still only 50%. [Update: this is wrong. See my answer below for the demonstration of why it is wrong.]

Are there any common methods of dealing with such a situation?

For example, if I want to assess the statistical difference between my two classes, I could run a MANOVA, which in the case of two groups reduces to computing Hotelling's $T^2$. Even if some of the features yield significant univariate differences ("false positives"), I should get an overall non-significant multivariate difference. However, I do not see anything in the cross-validation procedure that would account for such false positives ("false discriminants"?). What am I missing?

One thing I can think of myself would be to cross-validate over features as well, e.g. to select a random subset of features (in addition to randomly selecting a test set) on each cross-validation fold. But I do not think such an approach is often (ever?) used.

Update: Section 7.10.3 of "The Elements of Statistical Learning", entitled "Does Cross-Validation Really Work?", asks exactly the same question and claims that such a situation can never arise (the cross-validation accuracy will be 50%, not 100%). So far I am not convinced; I will run some simulations myself. [Update: they are right; see below.]
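
For reference, here is a minimal sketch of the kind of simulation I have in mind (assuming pure-noise Gaussian features, random binary classes, and a naive "pick the best single feature on the training folds" classifier; all parameter values are just illustrative):

    ## N cases, k >> N pure-noise features, random binary classes;
    ## leave-one-out CV of a classifier that thresholds the single feature
    ## with the best *training-set* accuracy (illustrative values only)
    set.seed(1)
    N <- 10; k <- 200
    X <- matrix(rnorm(N * k), nrow = N)
    y <- rep(c(0, 1), length.out = N)

    loo.pred <- sapply(seq_len(N), function(i) {
      Xtr <- X[-i, , drop = FALSE]; ytr <- y[-i]
      ## training accuracy of thresholding each feature at the midpoint
      ## between the two class means (class 1 assumed to lie above it)
      acc <- apply(Xtr, 2, function(f) {
        thr <- mean(tapply(f, ytr, mean))
        mean(as.numeric(f > thr) == ytr)
      })
      best <- which.max(acc)                 # feature selected on training data only
      thr  <- mean(tapply(Xtr[, best], ytr, mean))
      as.numeric(X[i, best] > thr)           # prediction for the held-out case
    })

    mean(loo.pred == y)   # averages around 0.5 (not 1) over repeated runs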

Best Answer

I don't think there is much confusion in your thoughts; you're putting your finger on one very important problem of classifier validation: not only classifier training but also classifier validation has certain sample size needs.


Well, seeing the edit: there may be some confusion after all... What the "Elements" tell you is that in practice the most likely cause of such an observation is a leak between training and testing, e.g. because the "test" data was used to optimize the model (which is a training task).

That section of the Elements is concerned with the optimistic bias caused by such a leak. But there is also variance uncertainty, and even when all splitting is done correctly you can observe extreme outcomes.


IIRC the variance issue is not discussed in great detail in the Elements (there is more to it than what the Elements discuss in section 7.10.1), so I'll give you a start here:

Yes, it can and does happen that you accidentally have a predictor that predicts this particular small data set (train & test set) perfectly. You may even just get a split that accidentally leads to seemingly perfect results while the resubstitution error would be > 0.

This can also happen with correct (and thus unbiased) cross-validation, because the results are subject to variance as well.

IMHO it is a problem that people do not take this variance uncertainty into account (in contrast, bias is often discussed at great length; I've hardly seen any paper discussing the variance uncertainty of their results, although in my field, with usually < 100 and frequently even < 20 patients in one study, it is the predominant source of uncertainty). It is not that difficult to set up a few basic sanity checks that would avoid most of these issues.

There are two points here:

  • With too few training cases (training samples relative to model complexity and number of variates), models become unstable. Their predictive power can be all over the place. On average it isn't that great, but it can accidentally be truly good.
    You can measure the influence of model instability on the predictions in a very easy way using the results of an iterated/repeated $k$-fold cross-validation: in each iteration, each case is predicted exactly once. As the case stays the same, any variation in these predictions is caused by instability of the surrogate models, i.e. the reaction of the model to exchanging a few training cases (a minimal sketch follows after this list).
    See e.g. Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 390, 1261-1271 (2008).
    DOI: 10.1007/s00216-007-1818-6

    IMHO checking whether the surrogate models are stable is a sanity check that should always be done in small sample size situations. Particularly as it comes at nearly zero cost: it just needs a slightly different aggregation of the cross-validation results (and $k$-fold cross-validation should be iterated anyway unless it is shown that the models are stable).

  • Like you say: with too few test cases, your observed successes and failures may be all over the place. If you calculate proportions such as error rate or hit rate, they will also be all over the place; this is known as these proportions being subject to high variance.
    E.g. if the model truly has a 50% hit rate, the probability of observing 3 correct out of 3 predictions is $0.5^3 = 12.5\%$ (binomial distribution). However, it is possible to calculate confidence intervals for proportions, and these take into account how many cases were tested. There is a whole lot of literature about how to calculate them, and about which approximations work well or not at all in which situations. For the extremely small sample size of 3 test cases in my example:

    library (binom)   # for binom.confint(); the prior.shape* arguments are passed to the "bayes" method
    binom.confint (x=3, n=3, prior.shape1=1, prior.shape2=1)
    #           method x n mean     lower     upper
    # 1  agresti-coull 3 3  1.0 0.3825284 1.0559745  
    # 2     asymptotic 3 3  1.0 1.0000000 1.0000000  
    # 3          bayes 3 3  0.8 0.4728708 1.0000000  
    # 4        cloglog 3 3  1.0 0.2924018 1.0000000
    # 5          exact 3 3  1.0 0.2924018 1.0000000
    # 6          logit 3 3  1.0 0.2924018 1.0000000
    # 7         probit 3 3  1.0 0.2924018 1.0000000
    # 8        profile 3 3  1.0 0.4043869 1.0000000 # generates warnings
    # 9            lrt 3 3  1.0 0.5271642 1.0000000
    # 10     prop.test 3 3  1.0 0.3099881 0.9682443
    # 11        wilson 3 3  1.0 0.4385030 1.0000000
    

    You'll notice that there is quite some variation, particularly in the lower bound. This alone is an indicator that the test sample size is so small that hardly anything can be concluded from the test results.
    However, in practice it hardly matters whether the confidence interval spans the range from "guessing" to "perfect" or from "worse than guessing" to "perfect": either way, little can be concluded.

  • conclusion 1: think beforehand how precise the performance results need to be in order to allow a useful interpretation. From that, you can roughly calculate the needed (test) sample size (a rough calculation is sketched after this list).

  • conclusion 2: calculate confidence intervals for your performance estimates

  • For model comparisons on the basis of correct/wrong predictions, don't even think of doing that with fewer than several hundred test cases for each classifier.
    Have a look at McNemar's test (for paired situations, i.e. when you can test the same cases with both classifiers); a toy example is sketched after this list. If you cannot do the comparison paired, look into "comparison of proportions"; you'll need even more cases, see the paper I link below for examples.

  • You may be interested in our paper about these problems:
    Beleites, C. et al.: Sample size planning for classification models., Anal Chim Acta, 760, 25-33 (2013). DOI: 10.1016/j.aca.2012.11.007; arXiv: 1211.1323
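
To make the stability check from the first point above concrete, here is a minimal sketch (toy data and model; everything here, including the built-in iris data, the logistic regression, $k = 5$ and 20 iterations, is just a placeholder): run several iterations of $k$-fold cross-validation and look at how much the prediction of each individual case varies across iterations.

    ## iterated k-fold CV: each iteration predicts every case exactly once,
    ## so per-case variation across iterations reflects instability of the
    ## surrogate models (toy data and model, for illustration only)
    set.seed(2)
    d <- iris[iris$Species != "setosa", ]        # reduce to a 2-class toy problem
    d$y <- as.numeric(d$Species == "virginica")
    n <- nrow(d); k <- 5; n.iter <- 20

    pred <- matrix(NA, nrow = n, ncol = n.iter)  # rows: cases, columns: iterations
    for (it in seq_len(n.iter)) {
      folds <- sample(rep(seq_len(k), length.out = n))   # new random split each time
      for (f in seq_len(k)) {
        test  <- folds == f
        model <- glm(y ~ Sepal.Length + Sepal.Width, data = d[!test, ],
                     family = binomial)
        pred[test, it] <- as.numeric(predict(model, d[test, ], type = "response") > 0.5)
      }
    }

    ## per-case instability: how often does a case's predicted label deviate
    ## from that case's majority prediction across the iterations?
    instab <- apply(pred, 1, function(p) mean(p != round(mean(p))))
    summary(instab)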
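
For conclusion 1, the rough calculation can be as simple as checking how wide a binomial confidence interval would be for the test sample sizes you can afford (again only a sketch; the assumed hit rate of 0.8 and the candidate sample sizes are made-up numbers):

    ## expected width of the Wilson confidence interval for the hit rate,
    ## as a function of the number of test cases (assumed hit rate: 0.8)
    library (binom)
    n.test <- c(10, 25, 50, 100, 250, 500)
    ci <- binom.confint(x = round(0.8 * n.test), n = n.test, methods = "wilson")
    data.frame(n = n.test, ci.width = ci$upper - ci$lower)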
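
And for the model comparison point, a toy McNemar example on paired predictions (the 2x2 counts are invented purely to show the mechanics):

    ## paired comparison of two classifiers tested on the same cases:
    ## cross-tabulate correct/wrong per classifier, then apply McNemar's test
    ## (counts are invented for illustration)
    tab <- matrix(c(60, 10,    # classifier A correct: B correct / B wrong
                     4, 26),   # classifier A wrong:   B correct / B wrong
                  nrow = 2, byrow = TRUE,
                  dimnames = list(A = c("correct", "wrong"),
                                  B = c("correct", "wrong")))
    mcnemar.test(tab)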


Second update about randomly selecting features: random forests regularly use this strategy (random feature subsets at each split, in addition to the bagging of cases). Outside that context I think it is seldom used, but it is a valid possibility.