Cross-Validation Errors – Identifying False Cross Validation Points in Models

cross-validation, model, overfitting

So I have a practice test on k-fold cross validation, and it asks me to select the false statements:

    Which of the following are incorrect about k-fold cross validation?
     
    
  • You repeat the cross validation process ‘k’ times.
  • Each ‘kth’ fold is used as the validation data once.
  • You repeat the process k-1 times.
  • A model trained with k-fold cross validation will never overfit.

Now I am not sure whether the process is repeated k times or k - 1 times. I searched the internet but did not find an unambiguous answer.

Also, when I checked some articles, one of them ("Can K-fold cross validation cause overfitting?") says:

It cannot "cause" overfitting in the sense of causality. However,
there is no guarantee that k-fold cross-validation removes overfitting

Again, I cannot conclusively decide whether that last point is false. Can anyone here help me identify the incorrect statements in the list above and give some insight into why they are incorrect?

Best Answer

First of all, I have to say that I find the wording quite ambiguous.

  • "cross validation process": IMHO this can refer to at least 3 different processes/steps:
    • Dealing with one "fold": in that case, the "process" of training on k - 1 folds and testing on the left-out kth fold is repeated k times in a complete "run" of k-fold cross validation.
    • However, "cross validation process" may also refer to such a complete "run".
    • Runs may be repeated with a different fold assignment/split. Again, that may be referred to as "the cross validation process".
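The three levels listed above can be made concrete with a small sketch. It uses only the Python standard library; the helper names `k_fold_indices` and `one_run`, and the values of `k`, `n_samples` and `n_repeats`, are made up for illustration and do not come from the question.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def one_run(n_samples, k, seed=0):
    """One complete run: yield (train, validation) index lists, once per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        # Train on the other k - 1 folds, validate on the left-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

k, n_samples, n_repeats = 5, 20, 3  # illustrative values only

# Level 1 and 2: count the train/validate steps inside a single run.
steps_per_run = 0
validation_counts = [0] * n_samples
for train, validation in one_run(n_samples, k):
    steps_per_run += 1
    for idx in validation:
        validation_counts[idx] += 1

# Level 3: repeat the whole run with a different split each time.
total_steps = sum(1 for seed in range(n_repeats) for _ in one_run(n_samples, k, seed))

print(steps_per_run)           # 5: one step per fold, i.e. k steps per run
print(set(validation_counts))  # {1}: every sample is validated exactly once per run
print(total_steps)             # 15: k steps per run times n_repeats runs
```

Within one run, the train/validate step happens k times and each fold serves as validation data exactly once. Only when whole runs are repeated with new splits does the total number of steps exceed k.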

I'd usually assume "cross validation process" to refer to the middle level, i.e. a run of dealing with k different folds.
From the provided answer options, I guess that in this question it may refer to dealing with a single fold. In that case, option 1 would be true (so it should not be marked); otherwise it would be false.

Similarly, whether option 2 is true or not depends on whether repetitions are considered to be part of k-fold cross validation or not (bullet point 3 above).

Option 3 is almost certainly meant to be false (i.e. it should be marked as incorrect), but it is logically true in the sense that to repeat something k times, you first have to repeat it k - 1 times and then one additional time...

The only option that is unambiguously false (should be marked) is no. 4.
Cross validation in itself is rather orthogonal to overfitting. It can be used as part of the toolbox against overfitting in some particular (and important) "design patterns", but there are other perfectly valid uses of cross validation that will not remove the overfitting. Some of them can help detect overfitting. But there are types and causes of overfitting that cannot be detected by cross validation.
Last but not least, among the types of overfitting that can in principle be detected by cross validation, whether and which of them are actually detected depends on whether the cross validation is set up correctly with respect to the particular application and data. This may be considered ambiguous, since one may argue that doing cross validation implies doing it correctly.
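To illustrate why option 4 is false, here is a toy sketch in plain Python. It is only one of many possible demonstrations. The data (unique ids as features, coin-flip labels) and the "memorizer" model are hypothetical and chosen so that the model overfits by construction. Cross validation reports the poor generalization here, but the model evaluated this way overfits just the same.

```python
import random

rng = random.Random(0)

# Hypothetical data: each feature is a unique id and the labels are pure noise,
# so there is nothing to learn and any perfect training fit is overfitting.
n = 200
X = list(range(n))
y = [rng.randint(0, 1) for _ in range(n)]

def fit_memorizer(X_train, y_train):
    """Toy model: memorize training labels, fall back to the majority label."""
    table = dict(zip(X_train, y_train))
    majority = int(2 * sum(y_train) >= len(y_train))
    return lambda x: table.get(x, majority)

def accuracy(model, X_eval, y_eval):
    return sum(model(x) == t for x, t in zip(X_eval, y_eval)) / len(y_eval)

# One run of k-fold cross validation.
k = 5
indices = list(range(n))
rng.shuffle(indices)
folds = [indices[i::k] for i in range(k)]

cv_scores = []
for i in range(k):
    val = folds[i]
    train = [idx for j in range(k) if j != i for idx in folds[j]]
    model = fit_memorizer([X[t] for t in train], [y[t] for t in train])
    cv_scores.append(accuracy(model, [X[v] for v in val], [y[v] for v in val]))
cv_accuracy = sum(cv_scores) / k

# The same model fitted on all data reproduces its training labels perfectly.
final_model = fit_memorizer(X, y)
train_accuracy = accuracy(final_model, X, y)

print(train_accuracy)  # 1.0: perfect fit on the training data
print(cv_accuracy)     # around chance level, since the labels are noise
```

The gap between the training accuracy and the cross-validated accuracy is what reveals the overfitting. Nothing in the cross validation procedure changed the model so that it overfits less.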


Side note: IMHO the negation in the question is bad multiple-choice practice: it could just as easily ask which statements are correct. That would measure understanding of the topic more directly, without confounding it with overlooking the additional negation due to exam stress.
