Comparison of two classifiers based on precision/recall/F1 only

machine learning, precision-recall, statistical significance

For two classifiers h1 and h2 I have the precision, recall and F1 score as a percentage (along with the original labeled data set that they were tested on). If I had access to which samples each classifier classified right/wrong, I would be able to do, for example, McNemar's test to evaluate significance, but unfortunately I don't.

Ideally, I would like to speak to the significance of the results obtained by h2, that is, whether h2 is a significant improvement over h1. Is that impossible, or is there something I can say using only precision/recall/F1 and the labeled data set?

Best Answer

If all you have is the P/R/F1 scores for the two systems/classifiers, there is no way to test whether the difference between them is statistically significant. McNemar's test, as you suggested, needs the per-sample predictions of both systems, because it is computed from the samples on which the two classifiers disagree.
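To make concrete why the per-sample predictions are needed, here is a minimal sketch of the exact (binomial) form of McNemar's test, using only the Python standard library. The function name `mcnemar_exact` and its arguments are hypothetical names chosen for illustration. The only inputs the test uses are the two discordant counts, and neither can be recovered from aggregate P/R/F1 figures.

```python
from math import comb


def mcnemar_exact(y_true, pred_h1, pred_h2):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = samples h1 got right and h2 got wrong,
      c = samples h1 got wrong and h2 got right.
    """
    b = c = 0
    for t, p1, p2 in zip(y_true, pred_h1, pred_h2):
        if p1 == t and p2 != t:
            b += 1
        elif p1 != t and p2 == t:
            c += 1

    n = b + c
    if n == 0:
        # The classifiers never disagree in correctness: no evidence either way.
        return b, c, 1.0

    # Under the null hypothesis each discordant sample is equally likely to
    # favour h1 or h2, so min(b, c) follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Calling `mcnemar_exact(y_true, pred_h1, pred_h2)` with three equal-length label lists returns the discordant counts and the p-value. Samples that both classifiers get right, or both get wrong, do not enter the statistic at all.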

If you have other labeled data and the implementations of the two systems, you can run both on that data (shuffling and repeating 5- or 10-fold cross-validation several times) to obtain paired scores on which you can perform a statistical test.
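The repeated cross-validation procedure could be sketched roughly as follows, again with only the standard library. Everything here is an assumption for illustration: `fit_h1` and `fit_h2` are hypothetical callables that take training data and return a prediction function, labels are binary with `1` as the positive class, and the sign test is just one simple choice of paired test. Because folds from repeated cross-validation share training data, the per-fold differences are not independent, so the resulting p-value should be read as approximate.

```python
import random
from math import comb


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def repeated_kfold_f1_diffs(X, y, fit_h1, fit_h2, k=5, repeats=3, seed=0):
    """Return per-fold F1 differences (h2 minus h1) over repeated shuffled k-fold CV."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    diffs = []
    for _ in range(repeats):
        rng.shuffle(idx)                      # new random partition each repeat
        folds = [idx[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            X_tr = [X[i] for i in train_idx]
            y_tr = [y[i] for i in train_idx]
            X_te = [X[i] for i in test_idx]
            y_te = [y[i] for i in test_idx]
            scores = []
            for fit in (fit_h1, fit_h2):      # both systems see identical splits
                predict = fit(X_tr, y_tr)
                scores.append(f1_score(y_te, [predict(x) for x in X_te]))
            diffs.append(scores[1] - scores[0])
    return diffs


def sign_test_p(diffs):
    """Two-sided exact sign test on paired differences (ties are dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(pos, neg) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The key design point is that both systems are trained and scored on exactly the same splits, so the scores are paired. A different paired test can be substituted for `sign_test_p` on the same list of differences.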