Solved – How to perform hypothesis testing for comparing different classifiers

classification, hypothesis testing, machine learning, t-test

I am trying to classify a small dataset (around 500 records) into two classes. I used various methods such as SVM, Naive Bayes and a k-nn classifier. Now, I would like to set the results from one of the classifiers as my baseline and perform statistical hypothesis testing. I am not very familiar with statistical testing, and I wonder how to proceed.

I have been thinking of setting the SVM classifier as my baseline, but I am not sure how to perform a t-test (or something similar) on the data. The input dataset has 10 attributes. Should I take the classification results from two classifiers and run a paired t-test on them? For example, I could take the results from Naive Bayes and pair them with the SVM results (the baseline) in a paired t-test. Is this the right approach?

Also, I am confused about the null and alternative hypotheses. Could someone explain how to set them up for this problem?

Best Answer

In general layman's terms (and not just for this problem),

  • Null hypothesis $H_0$: no change or difference (i.e. the classifiers have the same performance, however you define it)
  • Alternative hypothesis $H_1$: there is some difference in performance

For your classifier performance comparison problem, I recommend reading Chapter 6 of Japkowicz & Shah, which goes into detail on how to use significance testing to assess the performance of different classifiers. (Other chapters give more background on classifier comparison - sounds like they might interest you too.)

In your case,

  1. to compare 2 classifiers (on a single domain) you may use a matched-pairs t-test, $t = \frac{\bar{d}}{\sigma_d / \sqrt{n}}$, where $\bar{d}$ is the mean of the per-trial differences $d_i = \text{pm}_i(f_1) - \text{pm}_i(f_2)$ in your chosen performance measure between the two classifiers $f_1$ and $f_2$, $\sigma_d$ is the sample standard deviation of those differences, and $n$ is the number of trials (see the first sketch after this list)
  2. to compare multiple classifiers (on a single domain) you may use one-way ANOVA (i.e. an F-test) to check whether there is any difference among the multiple means (though it cannot tell you which ones differ), followed by post-hoc tests such as Tukey's Honest Significant Difference test to identify which pairs of classifiers differ significantly (see the second sketch after this list).
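
Here is a minimal sketch of point 1 in Python, not taken from the book: it assumes scikit-learn and SciPy, uses `make_classification` as a stand-in for your 500-record, 10-attribute dataset, and treats 10 cross-validation folds as the $n$ trials. Swap in your own data and performance measure.

```python
# Minimal sketch of the matched-pairs t-test (point 1); the data, models and
# fold scheme below are illustrative stand-ins, not a prescription.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for your dataset: ~500 records, 10 attributes, 2 classes.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Use the *same* folds for both classifiers so the scores are paired.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
acc_svm = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
acc_nb = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

# Matched-pairs t statistic: mean difference over its standard error.
d = acc_svm - acc_nb
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# scipy computes the same statistic plus a two-sided p-value.
t_stat, p_value = stats.ttest_rel(acc_svm, acc_nb)
print(f"t = {t_stat:.3f} (manual {t_manual:.3f}), p = {p_value:.3f}")
```

One practical caveat: accuracies from cross-validation folds share training data and are not fully independent, so treat the resulting p-value as approximate.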

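For point 2, a similar sketch: one-way ANOVA over per-fold accuracies of three classifiers, followed by Tukey's HSD via statsmodels. Again, the data, models and fold scheme are placeholders for your own setup.

```python
# Minimal sketch of point 2: one-way ANOVA across several classifiers,
# then Tukey's HSD to see which pairs differ. All choices are illustrative.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores = {
    "SVM": cross_val_score(SVC(), X, y, cv=cv),
    "NB": cross_val_score(GaussianNB(), X, y, cv=cv),
    "kNN": cross_val_score(KNeighborsClassifier(), X, y, cv=cv),
}

# F-test: is there *any* difference among the mean accuracies?
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.3f}")

# Post-hoc Tukey HSD: which specific pairs of classifiers differ?
acc = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), [len(s) for s in scores.values()])
print(pairwise_tukeyhsd(acc, labels, alpha=0.05))
```
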
The book goes into far more detail, so I do recommend reading that chapter.

And in terms of baselines, the tests I've mentioned don't distinguish between a baseline and a non-baseline. This is a good thing, as it gives you flexibility to decide which comparisons to give more weight in your analysis. The number of comparisons you actually make determines whether you should rely on 1. or 2. above: a single pairwise comparison calls for the t-test, while comparing several classifiers at once calls for ANOVA followed by a post-hoc test.