Solved – 30% difference in accuracy between cross-validation and a test set in Weka – is it normal?

accuracy · cross-validation · weka

I'm new to Weka and I have a problem with my text classification project.

I have a training dataset with 1000 instances and a test set of 200. The problem is that when I evaluate the accuracy of some algorithms (RandomForest, Naive Bayes, …) in Weka, the numbers given by cross-validation and by the test set are very different.

Here is an example with cross-validation:

=== Run information ===

Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation:     testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances:    1000
Attributes:   276
[list of attributes omitted]
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269



Time taken to build model: 4.9 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         740               74      %
Incorrectly Classified Instances       260               26      %
Kappa statistic                          0.5674
Mean absolute error                      0.2554
Root mean squared error                  0.3552
Relative absolute error                 60.623  %
Root relative squared error             77.4053 %
Total Number of Instances             1000     

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.479     0.083      0.723     0.479     0.576      0.795    I
                 0.941     0.352      0.707     0.941     0.808      0.894    E
                 0.673     0.023      0.889     0.673     0.766      0.964    R
Weighted Avg.    0.74      0.198      0.751     0.74      0.727      0.878

=== Confusion Matrix ===

   a   b   c   <-- classified as
 149 148  14 |   a = I
  24 447   4 |   b = E
  33  37 144 |   c = R

74% – it's something…

But now, if I try with my test set of 200 instances…

=== Run information ===

Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation:     testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances:    1000
Attributes:   276
[list of attributes omitted]
Test mode:user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269



Time taken to build model: 4.72 seconds

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances          86               43      %
Incorrectly Classified Instances       114               57      %
Kappa statistic                          0.2061
Mean absolute error                      0.3829
Root mean squared error                  0.4868
Relative absolute error                 84.8628 %
Root relative squared error             99.2642 %
Total Number of Instances              200     

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.17      0.071      0.652     0.17      0.27       0.596    I
                 0.941     0.711      0.312     0.941     0.468      0.796    E
                 0.377     0          1         0.377     0.548      0.958    R
Weighted Avg.    0.43      0.213      0.671     0.43      0.405      0.758

=== Confusion Matrix ===

  a  b  c   <-- classified as
 15 73  0 |  a = I
  3 48  0 |  b = E
  5 33 23 |  c = R

43%… obviously, something is really wrong. I used batch filtering with the test set.

What am I doing wrong? I manually labelled the test and training sets using the same criteria, so I find these differences strange.

I think I understand the concept behind cross-validation, which is why I don't understand such a big gap.

Excuse my English, and thanks – any help will be appreciated.

Best Answer

The problem is that the distribution of classes in your training dataset is dramatically different from the distribution of classes in your test dataset. In your training dataset, if you always predicted class E (the one labelled b in the confusion matrices), you would get close to 50 percent accuracy; in your test dataset, however, class E is heavily underrepresented compared to the training dataset. The algorithm predicts that the vast majority of your observations are class E, but on the test set it is wrong.
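You can check this directly from the two confusion matrices in the question: the row sums are the true class counts. A quick sketch in plain Python, with the counts transcribed from the output above:

```python
# True class counts, taken from the row sums of the two confusion
# matrices in the question (rows = actual classes I, E, R).
train_counts = {"I": 149 + 148 + 14, "E": 24 + 447 + 4, "R": 33 + 37 + 144}
test_counts  = {"I": 15 + 73 + 0,    "E": 3 + 48 + 0,   "R": 5 + 33 + 23}

def proportions(counts):
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

print(proportions(train_counts))  # E is ~47.5% of the training data
print(proportions(test_counts))   # E shrinks to ~25.5% of the test data
```

So a classifier biased toward class E looks fine under cross-validation on the training data but falls apart on a test set with a very different class mix.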

The general solution is to shuffle your dataset before splitting off a final validation set. This will help ensure the distributions of both data sets are similar.
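As a sketch of that idea in plain Python (the labels and document names here are made up; in Weka itself you could randomize the combined dataset before splitting, e.g. with the unsupervised Randomize instance filter):

```python
import random
from collections import Counter

# Hypothetical pooled dataset of (document, label) pairs; 1200 instances,
# mirroring the question's 1000 + 200. The label string is an invented mix.
data = [(f"doc{i}", label) for i, label in enumerate("IIEEERRIEE" * 120)]

random.seed(1)        # fixed seed so the split is reproducible
random.shuffle(data)  # shuffle BEFORE splitting off the hold-out set

split = len(data) - 200                 # hold out 200 instances, as in the question
train, test = data[:split], data[split:]

# After shuffling, the class proportions of the two sets should be close.
print(Counter(lbl for _, lbl in train))
print(Counter(lbl for _, lbl in test))
```

A stratified split (sampling the hold-out set per class) makes the distributions match even more closely, but a simple shuffle already removes any ordering bias in how the data was collected.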

If your data changes dramatically through time, then you should not shuffle your dataset. Having a final validation set that has observations that were recorded after your training dataset is ideal to understand how well a final model trained on all of the data will predict subsequent future observations.
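A minimal sketch of such a time-ordered split, assuming hypothetical timestamped records (here just integer days):

```python
# Hypothetical (timestamp, document) records; in a real project these would
# carry features and labels as well.
records = [(day, f"doc{day}") for day in range(1200)]

records.sort(key=lambda r: r[0])              # oldest first
train, test = records[:-200], records[-200:]  # validate on the newest 200

# Every test observation is strictly later than every training observation.
print(train[-1][0], test[0][0])
```

This deliberately accepts a class-distribution mismatch between the sets, because that mismatch is exactly what a deployed model will face on future data.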