Can we compare classifier scores in one-vs-all/one-vs-many

classification, machine learning, multi-class, probability, svm

In a system where we perform multi-class classification via a one-vs-all technique, are two scores comparable? E.g., if two different classifiers output 0.5 and 0.6 for the same sample, can I say that the class whose classifier outputs 0.6 is more likely to be the sample's class than the one whose classifier outputs 0.5?

I have trained each classifier using the training data from one class as positives and the training data from all other classes as negatives, as is standard for one-vs-all. A sketch of this setup is shown below.
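For concreteness, here is a minimal sketch of that training setup in Python with scikit-learn; the toy dataset, the choice of an SVM with Platt-scaled probability outputs, and all variable names are illustrative assumptions, not part of the original setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy multi-class data (3 classes); stands in for the real training set.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

classifiers = {}
for cls in np.unique(y):
    # One-vs-all: samples of `cls` are positives, everything else is negative.
    y_binary = (y == cls).astype(int)
    clf = SVC(probability=True, random_state=0)  # Platt scaling gives scores in [0, 1]
    clf.fit(X, y_binary)
    classifiers[cls] = clf

# Per-class scores for a new sample; these are the values being compared.
x_new = X[:1]
scores = {cls: clf.predict_proba(x_new)[0, 1] for cls, clf in classifiers.items()}
print(scores)
```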

I'm aware that when two different classifiers are used to classify different types of data, their scores are not directly comparable, because each classifier is calibrated differently: a score of 0.6 may indicate high confidence for one classifier but low confidence for another. I'm wondering whether the same issue applies here and, if so, what can be done to get around it.

Best Answer

It wasn't clear from your original question that you're using a classifier that outputs probabilities. In this case, assuming the probabilities are reasonably well-calibrated, you can directly compare them – that's the main advantage of using a probabilistic framework.

Now, for each item $x$ and class $j$ you essentially have a probability estimate $p_j(x)$ that $x$ is a member of class $j$. Of course, in a standard one-vs-all approach these estimates will most likely not sum to 1 ($\sum_j p_j(x) \ne 1$), so they do not form a valid probability distribution over the classes. It seems reasonable, though, to renormalize them as $\hat p_j(x) = p_j(x) \left( \sum_{j'} p_{j'}(x) \right)^{-1}$ and then treat them as actual probabilities.
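A minimal sketch of that renormalization, with the per-class estimates for one item made up purely for illustration:

```python
import numpy as np

# Hypothetical one-vs-all outputs p_j(x) for a single item; they need not sum to 1.
p = np.array([0.50, 0.60, 0.15])

# Renormalize: p_hat_j = p_j / sum_j' p_j'
p_hat = p / p.sum()

print(p_hat)        # [0.4, 0.48, 0.12]
print(p_hat.sum())  # 1.0
predicted_class = int(np.argmax(p_hat))  # same argmax as the raw scores
```

Note that dividing by the sum does not change which class has the highest score, so the predicted class is unaffected; the renormalization only makes the numbers interpretable as a distribution over the classes.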

If you trust that the probabilities from each classifier are reasonable, then this sum shouldn't be too far from 1, and the renormalization is just a "patch" to make them closer to actual probabilities. If you don't trust the probabilities from each classifier, then of course this won't necessarily be any better. I don't know of any theory saying that this procedure has nice properties, but I also don't know of any pitfalls in doing it.