Here's what I would recommend: Use probability rankings and class proportions in the training sample to determine the class assignments.
You have three (estimated) probabilities: $p_a, p_b,$ and $p_c$. And you have the original class proportions from the training sample: $m_a, m_b,$ and $m_c$, where $m_a$ is the proportion of training records that belong to class $a$ (e.g., 0.6), and so on.
You can start with the smallest class, say $b$, and use $p_b$ to rank order all records from the highest to the lowest value. From this rank-ordered list, start assigning records to class $b$ until $m_b$ percent of the records are assigned to this class. Record the value of $p_b$ at this point; it will become the cut-off for class $b$.
Now take the next smallest class, say $c$, and use $p_c$ to rank order all records, following the same steps described in the paragraph above. At the end of this step, you will have a cut-off value for $p_c$, and $m_c$ percent of all records will be assigned to class $c$.
Finally, assign all remaining records to (the largest) class $a$.
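As a minimal sketch of these steps (assuming hypothetical inputs: a dict of per-class probability arrays, a dict of training proportions, and that records already taken by a smaller class are skipped when ranking for the next one), the cut-off derivation could look like:

```python
import numpy as np

def fit_cutoffs(probs, proportions, classes_small_to_large):
    """Derive per-class probability cut-offs from training-set class proportions.

    probs:        dict, class label -> array of estimated probabilities (one per record)
    proportions:  dict, class label -> training proportion, e.g. {"a": 0.6, "b": 0.15, "c": 0.25}
    classes_small_to_large: class labels ordered from the smallest to the largest class
    """
    n = len(next(iter(probs.values())))
    assigned = np.zeros(n, dtype=bool)
    cutoffs = {}
    for c in classes_small_to_large[:-1]:       # the largest class takes the remainder
        k = int(round(proportions[c] * n))      # how many records class c should receive
        ranked = np.argsort(-probs[c])          # rank records by p_c, highest first
        take = ranked[~assigned[ranked]][:k]    # top-k records not yet assigned
        cutoffs[c] = probs[c][take[-1]]         # p_c of the last record taken in: the cut-off
        assigned[take] = True
    return cutoffs
```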
For future scoring purposes, you can follow these steps but discard the class proportions: let the probability cut-off values for classes $b$ and $c$ drive the class assignments.
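A matching sketch for scoring (same hypothetical inputs, checking the smaller classes first, with the largest class as the default):

```python
import numpy as np

def assign_with_cutoffs(probs, cutoffs, classes_small_to_large):
    """Assign a class to each new record using the stored cut-offs only."""
    n = len(next(iter(probs.values())))
    labels = np.full(n, classes_small_to_large[-1], dtype=object)  # default: the largest class
    assigned = np.zeros(n, dtype=bool)
    for c in classes_small_to_large[:-1]:             # check the smaller classes first
        take = ~assigned & (probs[c] >= cutoffs[c])   # record clears the cut-off for class c
        labels[take] = c
        assigned |= take
    return labels
```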
In order to make sure that this approach yields a reasonable level of accuracy, you can review the classification matrix (and any other measures you are using) on the validation set.
Proposed Solution: Calibration
I have read the "Multi-class" part of your post, but since you are using one-vs-all SVMs, I think you should reconsider solving the problem at the binary level. You could calibrate the individual SVMs so that the resulting output values are comparable.
Calibration methods for binary SVMs (and thus also applicable in the one-vs-all scenario) are Platt scaling [1] and isotonic regression. A nice overview with Python code is available here.
For your own use case, you would then calibrate each OvA SVM separately, and afterwards the calibrated outputs for $a$, $b$, and $c$ should be comparable.
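As a minimal sketch of this (assuming scikit-learn, where `method="sigmoid"` corresponds to Platt scaling and `method="isotonic"` to isotonic regression), calibrating each OvA SVM separately could look like:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def fit_calibrated_ova(X, y, classes, method="sigmoid"):
    """Fit one calibrated binary SVM per class (one vs. all)."""
    models = {}
    for c in classes:
        svm = LinearSVC()                                   # raw SVM: decision values, not probabilities
        clf = CalibratedClassifierCV(svm, method=method, cv=5)
        clf.fit(X, y == c)                                  # binary target: class c vs. the rest
        models[c] = clf
    return models

def calibrated_probabilities(models, X, classes):
    """Calibrated P(class | x) for each class; the columns are now comparable."""
    return np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
```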
What does calibration do here?
The key thing here is that SVMs themselves are not probabilistic. The output value you mentioned is usually a function of the classified point's distance to the hyperplane, so it is a heuristic with no significance beyond the intent that higher values are more likely to indicate the correct result.
You can measure the significance of your output values using a reliability plot. To keep it short: you want your reliability curve to be as close as possible to the diagonal. Calibration adds another mapping from raw output values to calibrated output values, which can handle, for example, classifiers that are biased towards high output values. Think of it as another translation step: "OK, I got that really confident $0.9$ from your classifier A, but I know it is always over-confident, so let's make this a $0.5$." That way, a $0.5$ from classifier A should end up close to a $0.5$ from classifier B.
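A minimal sketch of such a plot (assuming scikit-learn's `calibration_curve` and matplotlib; `y_true` and `prob_pred` below are hypothetical stand-ins for one binary SVM's labels and predicted probabilities):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                    # hypothetical binary labels
prob_pred = np.clip(0.3 + 0.5 * y_true + rng.normal(0, 0.2, 500), 0, 1)  # hypothetical scores

# Fraction of actual positives vs. mean predicted probability, per bin
frac_pos, mean_pred = calibration_curve(y_true, prob_pred, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="classifier")            # reliability curve
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")   # the diagonal
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```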
Keep in mind that when using calibration you still have to work thoroughly as usual (separate train/dev/test sets), since the calibration mapping is itself fitted on data.
[1] Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61–74.
Best Answer
It wasn't clear from your original question that you're using a classifier that outputs probabilities. In this case, assuming the probabilities are reasonably well-calibrated, you can directly compare them – that's the main advantage of using a probabilistic framework.
Now, for each item $x$ and class $j$ you essentially have a probability estimate $p_j(x)$ that $x$ is a member of class $j$. Of course, in a standard one-vs-all approach the probabilities most likely will not sum to 1 ($\sum_j p_j(x) \ne 1$), which means they're not actually valid probabilities. It seems reasonable, though, to renormalize them into $\hat p_j = p_j \left( \sum_{j'} p_{j'} \right)^{-1}$, and then treat them as actual probabilities.
If you trust that the probabilities from each classifier are reasonable, then this sum shouldn't be too far from 1, and this is just a "patch" to make them closer to actual probabilities. If you don't trust the probabilities from each classifier, then this of course won't necessarily be any better. I don't know of any theory saying that this renormalization has any nice properties, but I also don't know of any pitfalls in doing it.
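As a tiny sketch of the renormalization (assuming `p` holds the raw one-vs-all estimates, one row per item and one column per class):

```python
import numpy as np

# p[i, j]: one-vs-all estimate that item i belongs to class j; rows need not sum to 1
p = np.array([[0.7, 0.4, 0.2],
              [0.1, 0.3, 0.9]])

p_hat = p / p.sum(axis=1, keepdims=True)   # each row now sums to 1
predicted = p_hat.argmax(axis=1)           # class with the highest renormalized probability
```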