Solved – Expected error in a multiclass classification problem

Tags: accuracy, classification, multi-class

I have a multiclass classification problem with more than 1,000 classes. I've trained several classifiers (SVM, kNN, random forests, etc.) on subsets of 10, 100, 500, and 1,000 of the classes to estimate how classifier accuracy trends as the number of classes increases.

I have around 100 samples per class and use 10-fold cross-validation. What I observe is that the accuracy (the number of correctly classified samples divided by the total number of tested samples) saturates as I increase the number of classes: 100% accuracy for 10 classes, 87% for 50, 75% for 100, 67% for 500, and 63% for 1,000.
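A scaled-down sketch of this experiment (an assumed setup: synthetic Gaussian blobs stand in for the real data, smaller class counts, and an RBF SVM as one of the classifier types mentioned above) reproduces the shape of the measurement:

```python
# Hypothetical scaled-down version of the experiment: 100 samples per class,
# 10-fold cross-validation, accuracy measured as the class count grows.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

accuracies = {}
for n_classes in (5, 10, 20):  # stand-ins for the 10/100/500/1000 subsets
    # Synthetic data: Gaussian blobs in a fixed 2-D box, so more classes
    # means more crowding (overlap) in the same space.
    X, y = make_blobs(n_samples=100 * n_classes, centers=n_classes,
                      cluster_std=2.0, random_state=0)
    # 10-fold cross-validated accuracy, as in the question.
    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
    accuracies[n_classes] = scores.mean()

print(accuracies)
```

With this crowding setup the accuracy drops as classes are added but stays well above the 1/K chance level, which is the qualitative pattern described above.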

I'm trying to understand this phenomenon, because I expected the accuracy to drop steadily toward zero. My intuition was that as the number of classes increases, it becomes more likely that two classes are similar to each other, which should produce more errors.

Can anybody provide insight into what is happening?

Best Answer

It is not clear what you mean by increasing the number of classes; it can be interpreted in two ways:

1- New data belonging to new classes is added. This is shown in the following figure, where we start with 2 classes and add two more classes of data each time.

[figure]

2- The target variable is broken into more values. This is shown in the following figure, where each target label is split into several finer labels.

[figure]

Consider that most incorrectly classified data points lie close to the boundaries that separate the classes. The exact picture is a bit different when you use a one-versus-all encoding for the multiclass classifier, but the same concept holds. In both cases you are increasing the number of boundaries, so the total area around the boundaries increases.

In the first case, however, you are also covering a larger space, so the ratio of the area around the boundaries to the total area is likely to stay roughly constant. In the second case you are passing more boundaries through the same space, so an increase in the error rate is indeed expected. Moreover, in the second case you have less data per class, so you are also more likely to overfit and obtain a result that looks better on your training data than it really is.
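The geometric argument can be sketched numerically. This is a hypothetical toy setup (1-D Gaussian classes and logistic regression, not the asker's data): in case 1 the class centers keep a fixed spacing as classes are added, so new classes occupy new space; in case 2 the same interval is divided among more and more classes:

```python
# Toy contrast of the two interpretations of "more classes" (assumed setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def accuracy(centers):
    # 100 samples per class, unit-variance 1-D Gaussians around the centers.
    X = np.concatenate([rng.normal(c, 1.0, 100) for c in centers])[:, None]
    y = np.repeat(np.arange(len(centers)), 100)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=10).mean()

# Case 1: new classes occupy new regions -> spacing between centers is fixed.
case1 = [accuracy(np.arange(k) * 4.0) for k in (2, 4, 8)]

# Case 2: more classes are squeezed into the same interval -> spacing shrinks.
case2 = [accuracy(np.linspace(0.0, 4.0, k)) for k in (2, 4, 8)]

print(case1)
print(case2)
```

In case 1 the boundary-to-volume ratio stays roughly constant, so accuracy is nearly flat; in case 2 the boundaries crowd the same interval and accuracy falls sharply, matching the argument above.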

Consider also that in some cases classifying into a smaller number of classes might even seem more difficult. For example, in the following figure the two-class classifier looks more complex than the four-class classifier (although both will probably reach about the same accuracy).

[figure]
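A hypothetical XOR-style toy example (not taken from the figures above) makes the same point with a linear model: four quadrant classes have simple linear boundaries, but merging opposite quadrants into two classes yields the XOR pattern, whose boundary a linear model cannot represent at all:

```python
# Toy example (assumed setup): four quadrant classes vs. the two-class
# XOR-like merge of opposite quadrants, both fit with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))

# Four classes: one label per quadrant (0..3).
quad = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)
# Two classes: merge opposite quadrants, producing an XOR pattern.
xor = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

acc4 = LogisticRegression(max_iter=1000).fit(X, quad).score(X, quad)
acc2 = LogisticRegression(max_iter=1000).fit(X, xor).score(X, xor)
print(acc4, acc2)
```

A flexible classifier would score similarly on both labelings, as the answer notes; the linear model is used here only to expose that the two-class boundary is the more complex one.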