Solved – Summary on correctly classfied instances WEKA for a 10-fold cross-validation

cross-validationweka

I ran a 10-fold cross-validation BayesNet (but it could be any method in WEKA), and the output I got was (among other things):

...
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         142               88.75   %
Incorrectly Classified Instances        18               11.25   %
Kappa statistic                          0.8594
K&B Relative Info Score              13968.9989 %
K&B Information Score                  324.3318 bits      2.0271 bits/instance
Class complexity | order 0             371.9209 bits      2.3245 bits/instance
Class complexity | scheme            13057.6688 bits     81.6104 bits/instance
Complexity improvement     (Sf)     -12685.748  bits    -79.2859 bits/instance
Mean absolute error                      0.045 
Root mean squared error                  0.212 
Relative absolute error                 14.0471 %
Root relative squared error             52.9716 %
Total Number of Instances              160     
...

My questions is related to the Correctly Classified Instances. In this case I got 142 out of 160, but on which run? To my knowledge, a 10-fold cross-validation runs several times. This statistic is for which run?

The best?

The last?

For all of them?

And if it was an average on all runs, don't you guys think that it's a little bit convenient that the number was a nice integer (I ran a lot of classifications and its always an integer).

Best Answer

Cross validation 'runs several times' but it only predicts each case one time.

In your example of 10-fold cross-validation on 160 cases, each of the 10 runs (folds) leaves out 10% of the cases (lets say cases #1-16) to be tested on while training on the remaining 90% (#17-160 cases). The trained model tests on the 16 cases in the hold-out sample, and then the process is repeated on a new hold-out sample (e.g. cases #17-32). This process is repeated until each case has been predicted one time.

The idea is to never use the same case for both the training and testing phase, which can help with problems associated with over-fitting.

comparison to bootstrapping:

iterations use new random splits.
the main difference is resampling with (bootstrap) or without (cv) replacement.
computational cost is about the same, as I'd choose no of iterations of cv $\approx$ no of bootstrap iterations / k, i.e. calculate the same total no of models.
bootstrap has advantages over cv in terms of some statistical properties (asymptotically correct, possibly you need less iterations to obtain a good estimate)
however, with cv you have the advantage that you are guaranteed that
- the number of distinct training samples is the same for all models (important if you want to calculate learning curves)
- each sample is tested exactly once in each iteration
some classification methods will discard repeated samples, so bootstrapping does not make sense

Variance for the performance

short answer: yes it does make sense to speak of variance in situation where only {0,1} outcomes exist.

Have a look at the binomial distribution (k = successes, n = tests, p = true probability for success = average k / n):

$\sigma^2 (k) = np(1-p)$

The variance of proportions (such as hit rate, error rate, sensitivity, TPR,..., I'll use $p$ from now on and $\hat p$ for the observed value in a test) is a topic that fills whole books...

Fleiss: Statistical Methods for Rates and Proportions
Forthofer and Lee: Biostatistics has a nice introduction.

Now, $\hat p = \frac{k}{n}$ and therefore:

$\sigma^2 (\hat p) = \frac{p (1-p)}{n}$

This means that the uncertainty for measuring classifier performance depends only on the true performance p of the tested model and the number of test samples.

In cross validation you assume

that the k "surrogate" models have the same true performance as the "real" model you usually build from all samples. (The breakdown of this assumption is the well-known pessimistic bias).
that the k "surrogate" models have the same true performance (are equivalent, have stable predictions), so you are allowed to pool the results of the k tests.
Of course then not only the k "surrogate" models of one iteration of cv can be pooled but the ki models of i iterations of k-fold cv.

Why iterate?

The main thing the iterations tell you is the model (prediction) instability, i.e. variance of the predictions of different models for the same sample.

You can directly report instability as e.g. the variance in prediction of a given test case regardless whether the prediction is correct or a bit more indirectly as the variance of $\hat p$ for different cv iterations.

And yes, this is important information.

Now, if your models are perfectly stable, all $n_{bootstrap}$ or $k \cdot n_{iter.~cv}$ would produce exactly the same prediction for a given sample. In other words, all iterations would have the same outcome. The variance of the estimate would not be reduced by the iteration (assuming $n - 1 \approx n$). In that case, assumption 2 from above is met and you are subject only to $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$ with n being the total number of samples tested in all k folds of the cv.
In that case, iterations are not needed (other than for demonstrating stability).

You can then construct confidence intervals for the true performance $p$ from the observed no of successes $k$ in the $n$ tests. So, strictly, there is no need to report the variance uncertainty if $\hat p$ and $n$ are reported. However, in my field, not many people are aware of that or even have an intuitive grip on how large the uncertainty is with what sample size. So I'd recommend to report it anyways.

If you observe model instability, the pooled average is a better estimate of the true performance. The variance between the iterations is an important information, and you could compare it to the expected minimal variance for a test set of size n with true performance average performance over all iterations.

Solved – 10 fold cross validation model in weka

You can use the Evaluate class to perform this 10-fold cross-validation. To define the cross-validation you have to set the parameter as '-x 10' in the EvaluateModel.

clear all; close all; clc;

%% Add jar file to path plus import dependencies
javaaddpath('/usr/local/weka-3-6-11/weka.jar');
import weka.classifiers.trees.RandomForest.*;
import weka.classifiers.meta.Bagging.*;
import weka.classifiers.Evaluation.*;
import weka.core.Instances.*

%% load the arff file and extract the informations
filename = 'algo_output/results_features_labeling2_2class.arff';
loader = weka.core.converters.ArffLoader();
loader.setFile(java.io.File(filename));
data = loader.getDataSet();
data.setClassIndex(D.numAttributes()-1);

%% classification
classifier = weka.classifiers.functions.MultilayerPerceptron();
classifier.buildClassifier(data);
classifier.toString()

%% 10-fold cross-validation
ev = weka.classifiers.Evaluation(data);

v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-x');
v(4) = java.lang.String('10');
v(5) = java.lang.String('-i');

prm = cat(1,v(1:end));
ev.evaluateModel(classifier, prm)

For more information, check this link.

Best Answer

Related Solutions

Cross Validation – Variance Estimates in K-Fold Cross-Validation

comparison to bootstrapping:

Variance for the performance

Why iterate?

Solved – 10 fold cross validation model in weka

Related Question