Solved – Estimating classifier performance using cross validation, average accuracy and standard deviation and

I want to estimate a classifier accuracy on benchmark data.
The data is not split into training and test sets, so I use 5-fold cross validation: each fold trains on 80% of the data and tests on the remaining 20%.
The whole cross validation is repeated 20 times, so in total there are 100 runs (20 repetitions * 5 folds).
Accuracy is defined as the number of correct predictions divided by the number of records in the test data.

I do not know how to calculate average accuracy and its standard deviation:

  • Should results from each fold be averaged and then the stdev calculated on 20 samples?

or

  • Should I calculate average and stdev on all 100 samples?

Another question is whether the STDEV or the STDEVP function should be used to calculate the standard deviation. They are defined as follows:

  • STDEVP – Calculates standard deviation based on the entire population given as arguments.

  • STDEV – Estimates standard deviation based on a sample.

Best Answer

How to calculate standard deviation depends on which standard deviation you need.

Your results are subject to (at least) 2 different sources of variance:

  • variance due to the actual sample you have (finite test set)
  • variance due to the differences in the surrogate models (model instability)

You can characterize the model instability variance by comparing the predictions for the same sample against that sample's mean over all runs.
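One possible way to do that comparison is sketched below. It assumes you stored, for every repetition, whether each sample was predicted correctly (1) or not (0). The name `instability_variance` and the toy data are made up for illustration, and averaging the per-sample variances is only one of several reasonable summaries.

```python
from statistics import mean, variance

def instability_variance(correct):
    """correct[r][i] is 1 if sample i was predicted correctly in repetition r, else 0.

    Returns the average, over samples, of the sample variance (n - 1 denominator)
    of each sample's outcomes across the repetitions.
    """
    n_samples = len(correct[0])
    per_sample = []
    for i in range(n_samples):
        outcomes = [run[i] for run in correct]   # same sample, all repetitions
        per_sample.append(variance(outcomes))    # spread around that sample's own mean
    return mean(per_sample)

# Hypothetical toy data: 4 repetitions, 5 samples
correct = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
]

instability = instability_variance(correct)

# Samples whose prediction never changes between repetitions contribute nothing
stable = [i for i in range(5) if len({run[i] for run in correct}) == 1]
print(instability, stable)
```

If every sample gets the same prediction in all repetitions, this value is 0 and the surrogate models can be treated as stable for this data.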

For performance characteristics that are fractions of tested cases such as % correct, error rate, sensitivity and so on, you can calculate the variance due to the finite number of test cases as variance of a binomial distribution. You can also use the variance of your per-sample-loss for each of the surrogate models. But if you calculate variance of loss for a whole run, you already mix in the model instability variance of $k=5$ surrogate models.
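As a rough sketch of the binomial calculation: for an observed fraction $\hat p$ measured on $n$ independent test cases, the variance is $\hat p (1 - \hat p) / n$. The function name and the numbers below are hypothetical. Note that $n$ counts distinct tested cases; repeating the cross validation does not add new cases.

```python
from math import sqrt

def binomial_std_error(p_hat, n_test):
    """Standard deviation of an observed proportion p_hat measured on
    n_test independent test cases: sqrt(p_hat * (1 - p_hat) / n_test)."""
    return sqrt(p_hat * (1.0 - p_hat) / n_test)

# Hypothetical numbers: 85 % correct on 200 tested cases
se = binomial_std_error(0.85, 200)
print(round(se, 4))
```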

(In none of these cases do you have the whole population, so use the sample standard deviation with the degrees-of-freedom correction, i.e. the $n-1$ denominator: STDEV rather than STDEVP.)
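To make the two aggregation options from the question concrete, here is a small sketch using Python's standard library, where `statistics.stdev` uses the $n-1$ denominator (like STDEV) and `statistics.pstdev` uses $n$ (like STDEVP). The accuracies are randomly generated placeholders, not real results.

```python
import random
from statistics import mean, stdev, pstdev

random.seed(0)

# Hypothetical accuracies: 20 repetitions x 5 folds
acc = [[random.uniform(0.80, 0.90) for _ in range(5)] for _ in range(20)]

run_means = [mean(folds) for folds in acc]        # one value per repetition (20 values)
all_folds = [a for folds in acc for a in folds]   # every fold result (100 values)

sd_runs = stdev(run_means)         # spread between repetitions, n - 1 denominator
sd_folds = stdev(all_folds)        # spread between individual folds, n - 1 denominator
sd_folds_pop = pstdev(all_folds)   # n denominator, shown only for comparison

print(mean(all_folds), sd_runs, sd_folds, sd_folds_pop)
```

With equally sized folds the average accuracy is the same either way; only the standard deviation differs. The spread of the 20 repetition means mostly reflects differences between repetitions, while the spread over all 100 folds also includes the fact that each fold tests a different subset of cases, so which one to report depends on which source of variance you want to describe.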