Solved – Does k-fold cross validation always imply k uniformly sized subsets

cross-validation

I'm a bit confused on a minor point about a cross-validation strategy I've come across in my work: it creates k folds, but the folds are not of equal size (for example, some folds have 17 items, others 18, on up to 24). Is k-fold cross-validation constrained to folds of equal length? Arbitrary choices of data-set size and fold count can of course yield fractional fold sizes, where one fold draws the short stick, but would it be accurate to say k-fold cross-validation attempts to make the fold sizes roughly equal?

In particular, I'm hearing contradictory messages in this question:

Matt Krause "divided into different, mutually-exclusive 'folds'"

a Data Head "k-fold cross-validation (kFCV) divides the N data points into k mutually exclusive subsets of equal size."

Best Answer

I didn't mean to imply that the folds should be different sizes in that answer (and I'll update it to match).

Each fold should contain an equal number of observations, or as close to equal as possible. If you want to perform 10-fold cross-validation on $N=101$ observations, one fold will have 11 items rather than 10. That's fine. If you have 102 observations, it'd be best to have two folds of 11 items each (and eight of 10), rather than one fold of 12 and nine of 10, though I doubt this matters much in practice, particularly as $N/k$ increases.

There is no need to choose $k$ so that it evenly divides $N$ (how would you even run cross-validation on a prime-number-sized data set?), or to discard examples until $N$ is evenly divisible by $k$ (data is hard to get; don't throw it away).
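To make the "as close to equal as possible" rule concrete, here is a minimal sketch of how the remainder gets spread across folds: with `divmod(N, k)`, the first `N % k` folds each receive one extra observation (this matches the convention scikit-learn's `KFold` documents, though any assignment of the leftovers works equally well).

```python
def fold_sizes(n, k):
    """Split n observations into k near-equal folds.

    The first (n % k) folds get one extra item, so fold sizes
    differ by at most one and always sum to n.
    """
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

# N = 101, k = 10: one fold of 11, nine folds of 10
print(fold_sizes(101, 10))
# N = 102, k = 10: two folds of 11, eight folds of 10
print(fold_sizes(102, 10))
```

Note that a prime-sized data set poses no problem: `fold_sizes(101, 10)` simply hands the leftover observation to the first fold.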
