Solved – Would class-sorted data harm a 10-fold cross validation

confusion-matrix, cross-validation, weka

I am using the Java API of Weka to apply a Naive Bayes classifier to an .arff file I've created. The @data section of the file has the following format:

0 0 0 0 1 0 ... 0
1 0 0 0 0 1 ... 0
.
.
.
0 0 1 0 0 1 ... 3 
.
.
.
0 0 0 0 1 0 ... 5

Where each number is either 0 or 1, except the last one, which is the class label (an integer from 0 to 5).

Considering that I am using 10-fold cross validation, would it be a mistake to have my data sorted by class? Could that cause the test data to be taken from the last class only, for example?

I am getting the following confusion matrix, which is obviously wrong:

0.0 |0.0 |0.0 |0.0  |0.0  |0.0|  
0.0 |0.0 |0.0 |0.0  |0.0  |0.0|     
0.0 |0.0 |0.0 |0.0  |0.0  |0.0|   
0.0 |0.0 |0.0 |0.0  |0.0  |0.0|  
42.0|14.0|15.0|114.0|233.0|7.0|  
71.0|16.0|30.0|241.0|86.0 |7.0|

Any ideas why the first 4 classes are only zeroes? My .arff file has examples (not so evenly distributed) from all 6 classes.

EDIT: I shuffled my data and now I am getting a much more reasonable result.

27.0|2.0|8.0 |24.0 |27.0 |2.0|
4.0 |2.0|0.0 |2.0  |1.0  |0.0|
6.0 |3.0|15.0|16.0 |19.0 |3.0|
29.0|4.0|13.0|326.0|87.0 |33.0|
20.0|5.0|7.0 |37.0 |110.0|6.0|
5.0 |0.0|1.0 |17.0 |8.0  |7.0|

I am using the code I found here. Is there anything wrong with it?

Best Answer

tl;dr: You're calling the wrong function! The two-argument trainCV doesn't randomly partition the data.

Background

Weka has a few different ways to set up cross validation.

The "top-level" class is weka.classifiers.Evaluation. This class operates like (and presumably backs) the Experimenter GUI. If you do NOT provide a test set (and don't set the "no-cv" option), it will perform a stratified cross validation. The instance order is shuffled (and you can provide a seed for it). This class will take a set of instances right through to a performance measurement (e.g., accuracy), which would obviate most of the code in your link. Use this class if you can.
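For example (a minimal sketch; the filename and random seed are placeholders, and NaiveBayes matches the classifier in your question):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// "mydata.arff" is a placeholder for your file; the last attribute is the class.
Instances data = DataSource.read("mydata.arff");
data.setClassIndex(data.numAttributes() - 1);

// crossValidateModel shuffles and stratifies internally,
// so class-sorted input is not a problem here.
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString()); // the confusion matrix
```

That is the whole cross validation, confusion matrix included.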

Weka also provides "filters" that process sets of instances. There are two cross validation-related filters: the supervised weka.filters.supervised.instance.StratifiedRemoveFolds and its unsupervised counterpart weka.filters.unsupervised.instance.RemoveFolds. These shuffle your data and then keep/remove the specified fold. You can provide a seed so that the fold assignments are reproducible across runs, if you like.
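A sketch of the stratified filter (the filename and seed are placeholders; setter names correspond to the filter's -N, -F, -S, and -V options):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

// "mydata.arff" is a placeholder; the last attribute is the class.
Instances data = DataSource.read("mydata.arff");
data.setClassIndex(data.numAttributes() - 1);

StratifiedRemoveFolds filter = new StratifiedRemoveFolds();
filter.setNumFolds(10);           // 10-fold cross validation
filter.setFold(1);                // operate on fold 1 (folds are 1-based)
filter.setSeed(42);               // reproducible fold assignment
filter.setInvertSelection(false); // false: keep only this fold (the test set)
filter.setInputFormat(data);
Instances test = Filter.useFilter(data, filter);
// A fresh filter with setInvertSelection(true) yields the
// complementary nine folds, i.e. the training set.
```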

Finally, the Instances class (weka.core.Instances) includes the trainCV/testCV pair of methods. trainCV comes in two flavours: a three-argument version, which takes a Java Random object and uses it to shuffle the data, and a two-argument version which just blindly assigns the first $n/k$ points to the first fold (etc), without shuffling (testCV has only the two-argument form). This is potentially bad, as you've just discovered.
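The difference matters. A sketch of a manual loop using the shuffling, three-argument trainCV (again, the filename and seed are placeholders):

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// "mydata.arff" is a placeholder; the last attribute is the class.
Instances data = DataSource.read("mydata.arff");
data.setClassIndex(data.numAttributes() - 1);

int folds = 10;
Random rand = new Random(42);
Instances randData = new Instances(data); // copy so the original order survives
randData.randomize(rand);                 // shuffle the instances
randData.stratify(folds);                 // balance class proportions per fold

for (int i = 0; i < folds; i++) {
    Instances train = randData.trainCV(folds, i, rand); // three-arg: shuffles
    Instances test  = randData.testCV(folds, i);        // fold indices are 0-based
    // ... build and evaluate your classifier on train/test here
}
```

After randomize/stratify even the two-argument trainCV would be safe, but the three-argument form shuffles the training fold as well.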

Your Problem

Your code uses the non-shuffling, two-argument version of Instances.trainCV (see the CrossValidationSplit method in "Step 3" of your code). Since the data are sorted by class, each fold is severely unbalanced (even more than your whole data set), which is why your performance was initially terrible and improved once you shuffled the data.

For best results, I'd recommend a stratified cross validation with shuffling. If you can use the Evaluation class, you can get the whole thing done in a handful of lines. Otherwise, check out the StratifiedRemoveFolds filter described above.

Source: the source (linked above)
