Solved – Partitioning data for k-fold cross validation that will not have equal partitions

cross-validation

From Wikipedia:

In k-fold cross-validation, the original sample is randomly
partitioned into k equal size subsamples.

I am working on a 10-fold cross-validation project. My dataset has 76 elements, which means that I cannot have equal-size partitions.

What are the approaches for handling the remaining data (6 elements in my example)? Ignore them, put them all into one partition so that it has 16 elements, give 6 partitions 11 elements each, or something else?

Best Answer

Usually the $k$-fold cross validation subsets have approximately equal size. It is just crucial that they don't overlap.

For example, I just had a look at what WEKA does. Say that you have $N$ instances and $k$ folds; then $$ r = N \bmod k $$ (the remainder of $N$ divided by $k$) is the number of surplus records. The first $r$ partitions will have $\lfloor N/k \rfloor + 1$ records, and the remaining $k - r$ will have just $\lfloor N/k \rfloor$.

Regarding your example: $$N = 76 $$ $$k = 10 $$ $$ r = N \bmod k = 6 $$

The first $6$ partitions will have $ \lfloor N/k \rfloor + 1 = 7 + 1 = 8$ records, and the remaining $4$ will have $7$, which accounts for all $6 \cdot 8 + 4 \cdot 7 = 76$ records.
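
The rule above can be sketched in a few lines of Python. This is only an illustration of the remainder-based split described in this answer, not WEKA's actual code; the function names `fold_sizes` and `k_fold_indices` and the `seed` parameter are made up for the example.

```python
import random


def fold_sizes(n, k):
    """Return k fold sizes; the first n % k folds get one extra record."""
    base, r = divmod(n, k)  # base = floor(n / k), r = n mod k
    return [base + 1 if i < r else base for i in range(k)]


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and cut them into k disjoint folds.

    Hypothetical helper: the fixed seed is only there to make the
    shuffle reproducible.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds, start = [], 0
    for size in fold_sizes(n, k):
        folds.append(indices[start:start + size])
        start += size
    return folds


sizes = fold_sizes(76, 10)
folds = k_fold_indices(76, 10)
print(sizes)  # [8, 8, 8, 8, 8, 8, 7, 7, 7, 7]
```

Because the folds are consecutive slices of one shuffled index list, they cannot overlap, and every record ends up in exactly one fold, so nothing has to be discarded.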
