Solved – Difference between Mean/average accuracy and Overall accuracy

accuracy, computer vision, image processing, machine learning, mean

I got confused while reading the paper "Local Binary Pattern-Based Hyperspectral Image Classification With Superpixel Guidance".

They mention that they repeated each experiment 10 times and calculated both the mean and standard deviation. They also mention that they calculated the overall accuracy. In the results they report the mean and standard deviation of the accuracy for each class, and then the overall accuracy. What is the difference between mean/average accuracy and overall accuracy? Shouldn't they be the same?
Table where mean accuracy of each class is calculated

I found this link that explains different methods for calculating accuracy. Is the sensitivity calculated in that method for each class the same as the mean accuracy?

An example confusion matrix to calculate Class Accuracy and Overall Accuracy:

According to the references given in the answer, the mean accuracy can be calculated as:

Mean accuracy of class N: 1971 / (1971 + 19 + 1 + 8 + 0 + 1) = 1971 / 2000 = 98.55%

Overall accuracy: (1971 + 1940 + 1891 + 1786 + 1958 + 1926) / (2000 + 2000 + 2000 + 2000 + 2000 + 2000) = 11472 / 12000 = 95.60%
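
For my own sanity check, here is a small NumPy sketch of the two calculations above; the layout (rows of the confusion matrix are the true classes, 2000 test samples per class) is my assumption from the screenshot:

```python
import numpy as np

# Row of the confusion matrix for class N: all test samples whose true label is N,
# split by predicted label (2000 samples per class assumed from the screenshot).
row_n = np.array([1971, 19, 1, 8, 0, 1])
correct_n = 1971                                  # class-N samples predicted as class N
class_n_accuracy = correct_n / row_n.sum()        # 1971 / 2000 = 0.9855

# Diagonal of the confusion matrix: correctly classified samples for each of the 6 classes.
diagonal = np.array([1971, 1940, 1891, 1786, 1958, 1926])
overall_accuracy = diagonal.sum() / (6 * 2000)    # 11472 / 12000 = 0.9560

print(f"Class N accuracy: {class_n_accuracy:.2%}")   # 98.55%
print(f"Overall accuracy: {overall_accuracy:.2%}")   # 95.60%
```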

Best Answer

What the paper describes

I trust you are referring to the following quote:

Each experiment is repeated ten times with a different training set to make the comparison fair, and both the mean accuracies and standard deviation are reported. For the evaluation metrics, overall accuracy (OA) and kappa coefficient (κ) are adopted to quantify the classification performance. The OA is computed by the ratio between the number of the correctly classified test samples and the total test samples.

It appears that the authors were using a single iteration of 10-fold cross-validation but avoided using that terminology. The mean accuracy refers to the average accuracy achieved across the ten different training folds. So they build 10 different models using non-overlapping data and test how consistently they perform.

After cross-validation, an overall model is typically built using all the data from the 10 folds, and this is what is used to predict the outcomes in the test set.

Overall accuracy is clearly stated to be the accuracy achieved on the test set. This is not ideal terminology; the term 'predictive accuracy' is perhaps closer to what they have done.
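
To make the distinction concrete, here is a minimal sketch of that workflow on synthetic data (scikit-learn with an SVM as a stand-in classifier; this is not the paper's actual pipeline): the mean and standard deviation come from the cross-validation folds, while the overall accuracy comes from the untouched test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the paper's LBP features and hyperspectral images are not reproduced here.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

# Hold out an independent test set, as the answer assumes the paper does.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = SVC()

# Mean/std accuracy across ten different training folds (one run of 10-fold cross-validation).
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print(f"mean accuracy = {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# 'Overall' model fit on all the training data, scored once on the untouched test set.
overall_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
print(f"overall accuracy (test set) = {overall_accuracy:.4f}")
```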

What it means

In an ideal world, the mean accuracy of the 10 training experiments would be identical to the overall accuracy. Achieving this would require a perfect match in the distribution of samples within each subsample (each of the training folds and the test set) drawn from the parent dataset.

However, each fold contains a distinct set of samples, so we expect variation in the population characteristics of each fold and therefore in the accuracy. This is why the standard deviation is reported alongside the mean accuracy for the training set.

This means that when you move to yet another independent set of samples (the test set), you can hopefully estimate the range of accuracy to expect based on your training folds, but you will get a distinct accuracy value for that population. This is what the paper refers to as the 'overall accuracy'.

**UPDATE for comments**

The methodology states that the authors tested training sets of 7, 10, and 15 samples per class to determine sensitivity to small sample sizes. The results are presented in Fig. 8 and show that the more samples per class, the better the overall accuracy, especially for the Indian Pines data set. The table you copied into your updated question states that the training set had 10 samples per class, so the mean accuracy is simply the average accuracy for each class, but this number is fairly meaningless on its own.

To get a number more meaningful for comparison with the test set, you would need to adjust for the expected distribution of class sizes (see Tables I and II). Table II lists 4 classes with fewer than 150 samples, which makes it impossible to sample 10 independent training sets of 15 samples. I therefore now assume the authors mean that the randomisation for selection was independent but the training sets could overlap. Whether (and how) they were able to retain enough test set samples from any of the shortfall classes (C1,54,10 and 12) is not clear.

The fact remains that the class accuracy is based on the training set and the overall accuracy is based on the test set, so they will never agree. To be honest, the completely different presentation of the training and test set results makes comparison difficult.

I recommend you read the answers to the following question on CV about classification accuracy and class imbalance: Why is accuracy not the best measure for assessing classification models?

See also:

https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/

To answer your updated query about sensitivity: at first I said no, but I later realised you were right. Class accuracy only considers the actual positives for that class, so correct answers are indeed true positives and incorrect answers are false negatives. In other words, the per-class accuracy here is the same as the sensitivity (recall) for that class.
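
A tiny sketch (hypothetical labels, scikit-learn) showing that the per-class accuracy taken from the confusion matrix rows is exactly the per-class recall/sensitivity:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for a 3-class problem, purely to illustrate the equivalence.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)           # rows = true class, columns = predicted class
class_accuracy = np.diag(cm) / cm.sum(axis=1)   # TP / (TP + FN) for each class
sensitivity = recall_score(y_true, y_pred, average=None)

print(class_accuracy)   # 2/3, 2/3, 3/4 for classes 0, 1, 2
print(sensitivity)      # identical values: per-class recall (sensitivity)
```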

**Further update:** mean class accuracy is calculated as the mean of the class accuracy across the 10 training sets. So the example in your question shows how the class accuracy is calculated in one iteration. You would calculate this value for each class in each iteration, and then calculate its arithmetic mean (and standard deviation).
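
In code, the reported mean and standard deviation would be built up roughly like this (a sketch with placeholder per-iteration accuracies rather than the paper's actual results):

```python
import numpy as np

# class_accuracy[i, c] = accuracy of class c in training iteration i (10 iterations, 6 classes).
# Random placeholder values for illustration only.
rng = np.random.default_rng(0)
class_accuracy = rng.uniform(0.90, 1.00, size=(10, 6))

mean_per_class = class_accuracy.mean(axis=0)   # mean accuracy of each class over the 10 runs
std_per_class = class_accuracy.std(axis=0)     # the +/- reported next to it

for c, (m, s) in enumerate(zip(mean_per_class, std_per_class), start=1):
    print(f"class {c}: {m:.4f} +/- {s:.4f}")
```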

The paper clearly states that the class accuracy was calculated from the 10 training sets while the overall accuracy was calculated from the test set. This means the two should never be perfectly reconcilable. It also means it is very difficult to compare test set performance to training performance. Since the selection of samples for the training and test sets is not described at all, it is impossible to interpret much from the paper.