Hypothesis Testing – How to Compare Accuracy of Multiple Classifiers on the Same Data Set

accuracy, anova, classification, hypothesis testing

I have run 6 different classifiers (Naive Bayes, Decision Trees, Linear Discriminant Analysis, k-Nearest Neighbors, Support Vector Machine and a one-layer Perceptron) on the same data set (7 features and 2 classes), using a random split with 80% of the samples for training and 20% for testing, and have computed the mean precision as an accuracy metric. Now I want to compare the classifiers using a statistical test. I have found that ANOVA can handle this case, but I don't know much about it, and from what I have read, I need enough samples to compute a mean for each classifier (here I only have one accuracy value per classifier, because there is only one test set). Is there any way to figure out what I have to do for ANOVA to work, or is there a better way to solve my problem?

P.S.
I'm sorry if it is confusing why ANOVA is not working here; I'm just as confused myself. I'd appreciate any suggestions, whether they involve ANOVA or not.

Best Answer

As far as I understand, your aim is to compare the performances of the classifiers. If you wish to use ANOVA ("analysis of variance"), you must have a variance in the first place, i.e., several accuracy values per classifier rather than a single one.

For that, you will need to run each of your classifiers multiple times. As you say, you only have one test set. On the other hand, you mention that this test set is randomly picked, which suggests that the test set is not fixed.

There are several ways to get such statistics for your classifiers:

  1. Try cross-validation (CV) on your data. CV is performed by randomly partitioning the whole data set into k groups and iteratively setting one group aside for testing, while the remaining data serve for training. This gives you k models, and hence k accuracy values, per classifier. If you use Python, I would suggest the KFold (or StratifiedKFold) class from scikit-learn; see the sketch after this list.

  2. Run the same classifier with different random seeds and/or different training/validation sub-splits during training.

  3. Combine the two.
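For illustration, here is a minimal sketch of option 1, assuming scikit-learn is available and that the features are already in `X` with the binary labels in `y`; the classifier choices mirror those in the question, but the hyperparameters are placeholders:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron

# X: feature matrix of shape (n_samples, 7), y: binary labels -- assumed to exist already
classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Perceptron": Perceptron(random_state=0),
}

# 10-fold CV gives 10 scores per classifier instead of a single number;
# scoring="precision" would match the metric mentioned in the question
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    for name, clf in classifiers.items()
}

for name, s in scores.items():
    print(f"{name}: mean={s.mean():.3f}, std={s.std():.3f}")
```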

Once you have several accuracy values per classifier, and hence a mean and a standard deviation for each, you can compute, e.g., a one-way ANOVA across all classifiers or a two-tailed paired t-test p-value between any two of them.
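As a rough sketch of that last step, reusing the hypothetical `scores` dictionary from the CV example above and assuming SciPy is installed:

```python
from scipy import stats

# One-way ANOVA: are the mean accuracies of all classifiers equal?
f_stat, p_anova = stats.f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_anova:.4f}")

# Paired two-tailed t-test between two classifiers; the scores are paired
# because both models were evaluated on the same CV folds
t_stat, p_pair = stats.ttest_rel(scores["SVM"], scores["kNN"])
print(f"SVM vs kNN: t={t_stat:.3f}, p={p_pair:.4f}")
```

If you run many pairwise comparisons, keep in mind that the p-values should be adjusted for multiple testing (e.g., with a Bonferroni correction).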

That being said, if you do not have the computational power or time to perform additional training runs, you cannot compare the classifiers' performances statistically and have to rely on the single accuracy values alone.
