Solved – Comparing two classifiers on separate pairs of train and test datasets

hypothesis testing, machine learning, statistical significance

In my problem I have 2 classifiers, C1 and C2. Both C1 and C2 are Naive Bayes classifiers; the difference between them is that they use different feature-selection methods. Both classifiers are trained on a dataset of 10,000 instances (noisy-labeled) and tested on a different dataset of 1,000 instances (manually labeled), both datasets being balanced.

Now, I have plotted the accuracy of both classifiers against an increasing number of instances, and I found by visual inspection that C2 generally has better accuracy and recall than C1. I would like to know whether this difference is statistically significant, in order to claim that C2 is better than C1.

Previously, I used the same dataset for k-fold cross-validation, got the mean and variance of the accuracies of both classifiers, and computed a Student's t-test for a specific number of features. However, now I have 2 different datasets for training and testing. How could I perform a test in such a situation? Should I take the mean of the accuracies over all the different feature counts?
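To make the previous setup concrete, this is roughly what I was doing before (a minimal sketch assuming scikit-learn and SciPy; the random data and the two feature-selection methods below are just stand-ins for my real corpus and methods):

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data: in my case X is a document-term count matrix and y the
# noisy labels; here a small random stand-in just to make the sketch run.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 200))
y = rng.integers(0, 3, size=500)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Two Naive Bayes pipelines that differ only in the feature-selection step
# (chi2 vs. mutual information are just stand-ins for my two methods).
c1 = make_pipeline(SelectKBest(chi2, k=50), MultinomialNB())
c2 = make_pipeline(SelectKBest(mutual_info_classif, k=50), MultinomialNB())

acc1 = cross_val_score(c1, X, y, cv=cv, scoring="accuracy")  # per-fold accuracies
acc2 = cross_val_score(c2, X, y, cv=cv, scoring="accuracy")

t_stat, p_value = stats.ttest_rel(acc1, acc2)  # paired t-test over the folds
print(acc1.mean(), acc2.mean(), p_value)
```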

Thanks in advance…

EDIT

Regarding the domain, I am dealing with sentiment analysis (SA), classifying text data into 3 classes: positive, negative and neutral.
Regarding error cost, at this stage I assume that all error costs are the same (although I understand that the cost of classifying negative as positive would be higher than classifying negative as neutral).
Regarding the "practically significant difference": when dealing with SA I am assuming that an improvement of 1% is significant, since previous papers usually report improvements of that size.
I want to test the accuracy of C1 and C2 when trained on automatically-labeled data and tested on manually-labeled data.

Best Answer

First of all, before testing you need to define a couple of things. Do all classification errors have the same "cost"? Then choose a single measurement metric: I usually choose MCC (Matthews correlation coefficient) for binary data and Cohen's kappa for k-category classification. Next, it is very important to define the minimal difference that is significant in your domain. When I say "significant" I don't mean statistically significant (i.e. p < 1e-9), but practically significant. Most of the time an improvement of 0.01% in classification accuracy means nothing, even if it comes with a nice p-value.
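As a concrete illustration, here is a minimal sketch with scikit-learn; the labels and predictions below are made-up placeholders for your manually-labeled test set and each classifier's output:

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

y_true    = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred_c1 = ["pos", "neg", "pos", "pos", "neu", "neu", "neg", "neg"]
y_pred_c2 = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neu"]

# Cohen's kappa for the 3-class (positive/negative/neutral) case:
print("kappa C1:", cohen_kappa_score(y_true, y_pred_c1))
print("kappa C2:", cohen_kappa_score(y_true, y_pred_c2))

# MCC is what I would use if the problem were binary (scikit-learn also
# generalizes it to the multi-class case):
print("MCC  C2:", matthews_corrcoef(y_true, y_pred_c2))
```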

Now you can start comparing the methods. What are you testing: the predictor sets, the model-building process, or the final classifiers? In the first two cases I would generate many bootstrap models from the training-set data and test them on bootstrap samples drawn from the testing-set data. In the last case I would use the final models to predict bootstrap samples drawn from the testing-set data. If you have a reliable way to estimate noise in the data parameters (predictors), you may also add that noise to both the training and testing data. The end result will be two histograms of the metric values, one for each classifier. You can then compare these distributions by mean value, dispersion, etc.
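For the last case (comparing the final classifiers), a minimal sketch under assumed names, using Cohen's kappa as the metric and bootstrap samples of the testing-set data:

```python
# Score each classifier on many bootstrap samples of the test set, which
# yields one histogram of metric values per classifier. y_test, pred_c1 and
# pred_c2 are random placeholders for your real labels and the fixed
# predictions of C1/C2 on the manually-labeled test set.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
y_test  = rng.integers(0, 3, size=1000)
pred_c1 = np.where(rng.random(1000) < 0.6, y_test, rng.integers(0, 3, size=1000))
pred_c2 = np.where(rng.random(1000) < 0.7, y_test, rng.integers(0, 3, size=1000))

B = 2000                 # number of bootstrap samples
n = len(y_test)
kappas1, kappas2 = [], []
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # sample test instances with replacement
    kappas1.append(cohen_kappa_score(y_test[idx], pred_c1[idx]))
    kappas2.append(cohen_kappa_score(y_test[idx], pred_c2[idx]))
kappas1, kappas2 = np.array(kappas1), np.array(kappas2)

print("mean kappa C1:", kappas1.mean(), " C2:", kappas2.mean())
```

Note that the classifiers predict the test set only once and the predictions themselves are resampled: the models stay fixed, and the spread in each histogram comes purely from resampling the testing-set data.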

Two last notes: (1) I'm not aware of a way to account for model complexity when dealing with classifiers, so a better apparent performance may simply be the result of overfitting. (2) Having two separate data sets is a good thing, but as I understand from your question you have used both sets many times, which means that information from the testing set "leaks" into your models. Make sure you have another, held-out validation data set that is used only once, after all the decisions have been made.

Clarifications following notes

In your notes you said that "previous papers usually present such kind [i.e. 1%] of improvements". I'm not familiar with this field, but the fact that people publish 1% improvements in papers does not mean that such an improvement is practically significant :-)

Regarding the t-test, I think it would be a good choice, provided that the data is normally distributed (or transformed to a normal distribution), or that you have enough data samples, which you most probably will.
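Continuing the bootstrap sketch above (the kappas1/kappas2 arrays are the two histograms of metric values), the test could look like this with SciPy; treating the question's 1% accuracy figure as a threshold on kappa is my assumption:

```python
# Compare the two bootstrap distributions of the metric. With thousands of
# replicates per classifier, the t-test's normality requirement is not a
# practical concern.
from scipy import stats

t_stat, p_value = stats.ttest_ind(kappas1, kappas2, equal_var=False)  # Welch's t-test
print("t =", t_stat, " p =", p_value)

# Because the same bootstrap indices were used for both classifiers, the
# paired difference can also be checked against the practical threshold.
diff = kappas2 - kappas1
print("mean improvement:", diff.mean(), " >= 0.01:", diff.mean() >= 0.01)
```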
