The best way to use a 2-class classifier for a multi-category case

classification, neural networks, pattern recognition

I have a labeled data set with 3 classes, which I have divided into two parts. I am using the first part to train a two-class perceptron classifier.

One approach is to train $\binom{3}{2}$ two-class classifiers on the training data, one for each pair of classes, and then use voting to classify the test samples in the multi-category case.
The obvious problem with this approach is the presence of ambiguous regions, like the ones shown in the image below:

[Image: linear decision boundaries for a 4-class problem]
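To make the pairwise scheme concrete, here is a minimal sketch of the one-vs-one approach described above: one two-class classifier per pair of classes, with majority voting at prediction time. The data `(X, y)` and the use of scikit-learn's `Perceptron` as the binary learner are assumptions for illustration; any two-class perceptron could be substituted.

```python
# One-vs-one: train binom(c, 2) two-class perceptrons and let the pairwise
# winners vote. (Sketch only; X, y and the Perceptron learner are assumptions.)
from itertools import combinations
import numpy as np
from sklearn.linear_model import Perceptron

def train_one_vs_one(X, y):
    """Train one two-class perceptron for every pair of classes."""
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])              # keep only samples of classes a and b
        classifiers[(a, b)] = Perceptron().fit(X[mask], y[mask])
    return classifiers

def predict_one_vs_one(classifiers, X, classes):
    """Each pairwise classifier casts one vote per sample; the majority wins."""
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    class_index = {c: i for i, c in enumerate(classes)}
    for clf in classifiers.values():
        for row, label in enumerate(clf.predict(X)):
            votes[row, class_index[label]] += 1
    # Ties in the vote counts correspond to the ambiguous regions in the figure above.
    return np.array(classes)[votes.argmax(axis=1)]
```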

Pattern Classification by Duda, Hart, and Stork suggests training $c$ different linear discriminant functions, where $c$ is the number of classes, such that

$$g_{i}(X) = W_{i}^{t}X + w_{i0}, \quad i = 1, \ldots, c$$

and assigning $X$ to $\omega_{i}$ if $g_{i}(X) > g_{j}(X)$ for all $j \neq i$. The resulting classifier is called a linear machine in the textbook.
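A minimal sketch of the decision rule above: one linear discriminant $g_i(X) = W_i^t X + w_{i0}$ per class, with $X$ assigned to the class whose discriminant is largest. The weight matrix `W` and bias vector `w0` are assumed to come from some training procedure; only the prediction step is shown.

```python
# Linear machine prediction rule (sketch; W and w0 are assumed already trained).
import numpy as np

def linear_machine_predict(X, W, w0):
    """X: (n, d) samples; W: (c, d) weight vectors; w0: (c,) biases."""
    scores = X @ W.T + w0          # scores[n, i] = g_i(X_n)
    return scores.argmax(axis=1)   # taking the argmax leaves no ambiguous regions
```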

The following illustration from the book shows the decision boundaries produced by a linear machine for a 3-class problem.

[Image: decision boundaries produced by a linear machine for a 3-class problem]

My question is: how is the training process for the $c$ linear discriminant functions $g_{i}(X)$ different from training $\binom{c}{2}$ two-class classifiers?

UPDATE: The one-vs-rest approach mentioned in DavidDLewis's answer is not the one I am referring to. With the one-vs-rest approach there are in fact even more ambiguous regions. I am referring to the linear-machine classifier, which has no ambiguous regions.
See the illustration of the one-vs-rest approach below:

[Image: 1-vs-rest decision regions]

Best Answer

Training $c$ linear discriminant functions is an example of a "1-vs-all" or "1-against-the-rest" approach to building a multiclass classifier from a binary classifier learning algorithm. Training $\binom{c}{2}$ two-class classifiers is an example of the "1-vs-1" approach. As $c$ grows, the 1-vs-1 approach builds many more classifiers (but each from a smaller training set).
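As a rough sketch of the two schemes, scikit-learn's meta-estimators make the count difference easy to see. The iris data and the choice of a linear SVM as the binary learner are assumptions for illustration (the linear SVM matches the Hsu and Lin comparison cited below).

```python
# Compare how many binary classifiers each scheme trains (sketch).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # binom(c, 2) binary classifiers
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # c binary classifiers

# For c = 3 both schemes happen to train 3 classifiers;
# for c = 10 it would be 45 (1-vs-1) versus 10 (1-vs-rest).
print(len(ovo.estimators_), len(ovr.estimators_))
```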

This paper compares these two approaches, among others, using SVMs as the binary classifier learner:

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

Hsu and Lin found 1-vs-1 worked best with SVMs, but this would not necessarily hold with all binary classifier learners, or all data sets.

Personally, I prefer polytomous logistic regression.
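For completeness, a minimal sketch of that preference: a single polytomous (multinomial, softmax) logistic regression model handles all $c$ classes at once, so no 1-vs-1 or 1-vs-rest decomposition is needed. The iris data is an assumption for illustration.

```python
# Polytomous (multinomial) logistic regression: one joint model over all classes (sketch).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial for multiclass by default
print(clf.predict(X[:5]))
```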
