I am working on an "object recognition" project. I have 3 types of objects. I am always able to detect them but unable to correctly classify them using an SVM. Very often, for example, an "unknown" object is misclassified as object 1 with very high certainty.
My SVM is trained on only 3 classes: objects 1, 2 and 3. I don't have any data about the "unknown" object when training. When an object's features don't correspond to any of the 3 objects, it should be classified as "unknown."
What classifier would be the most suited for this?
Could it be that the incorrect classification happens because object 1 has 4x more training data than object 2?
My initial thought is to use a random forest, but I'm not sure.
Best Answer
Since you know that the unknown objects in your data set do not belong to any of the three classes (you know they are past face photos of unauthorized users), you can just treat them as a separate class. Since that fourth class is less well defined than the first three and would contain very different objects, you should expect it to be scattered in a complex form over your input feature space.
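Therefore, you need a classifier that can model complex non-linear decision boundaries. SVMs can do that. Unfortunately, SVMs are inherently binary classifiers and you have multiple classes. But do you? You could also see your problem as a two-step procedure: first decide whether the object belongs to one of the known classes (authorized) or is unknown (unauthorized), then, if it is known, decide which of the known classes it belongs to.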
It is not guaranteed that this will work better, but it's worth a shot. Step one would be binary, which suits SVMs better; the ensemble schemes used to extend them to multiple classes (such as one-vs-rest or one-vs-one) come with problems of their own. Step two is only necessary if the different authorized users have different types of access or if you need to log who was there at what time. If you only need an access/no-access decision, there is no need for step two. For step two you could still try SVM ensembles or something else.
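One possible reading of the two-step idea, as a minimal sketch rather than a tested solution. It assumes scikit-learn, uses synthetic 2-D data as a stand-in for real features, and assumes label 0 means "unknown"; the helper names `fit_two_step` and `predict_two_step` are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

UNKNOWN = 0  # assumed label for the "unknown" class; 1, 2, 3 are the known objects


def fit_two_step(X, y):
    """Fit step 1 (known vs. unknown) and step 2 (which known class)."""
    known = y != UNKNOWN
    gate = SVC(kernel="rbf").fit(X, known.astype(int))  # binary SVM
    which = SVC(kernel="rbf").fit(X[known], y[known])   # known samples only
    return gate, which


def predict_two_step(gate, which, X):
    """Label a sample UNKNOWN unless step 1 accepts it, then ask step 2."""
    accepted = gate.predict(X) == 1
    labels = np.full(len(X), UNKNOWN, dtype=int)
    if accepted.any():
        labels[accepted] = which.predict(X[accepted])
    return labels


# Synthetic stand-in data: three tight clusters plus scattered "unknown" points.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X_known = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centers])
y_known = np.repeat([1, 2, 3], 100)
X_unknown = rng.uniform(-8.0, 8.0, size=(150, 2))
X = np.vstack([X_known, X_unknown])
y = np.concatenate([y_known, np.full(150, UNKNOWN)])

gate, which = fit_two_step(X, y)
pred = predict_two_step(gate, which, X)
```

If only an access/no-access decision is needed, `gate.predict` alone would be enough and the second SVM can be dropped.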
You should try a couple of algorithms (including those that use a two-step procedure and those that don't) on the same cross-validation folds and then decide.
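A rough sketch of comparing candidates on identical folds, assuming scikit-learn and synthetic data; the two candidates and their settings are placeholders, not recommendations. Fixing `random_state` on the splitter is what makes every model see the same splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: classes 1-3 are clusters, class 0 is scattered "unknown".
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack(
    [c + 0.5 * rng.standard_normal((100, 2)) for c in centers]
    + [rng.uniform(-8.0, 8.0, size=(150, 2))]
)
y = np.concatenate([np.repeat([1, 2, 3], 100), np.zeros(150, dtype=int)])

# One splitter with a fixed seed, reused for every candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Per-fold accuracy for each candidate, computed on the same five folds.
scores = {name: cross_val_score(model, X, y, cv=cv) for name, model in candidates.items()}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f}")
```

A two-step model could be added to `candidates` as well, provided it is wrapped in an object that exposes `fit` and `predict`.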
Your application scenario also can tell you whether it is a good idea or not to aggregate allowed users into one class:
Regarding the higher prevalence of object 1: class imbalance can be problematic for most algorithms (more for some than for others), but at a ratio of 4 to 1 it shouldn't be a major concern yet.
What you should do, however, is take misclassification costs into consideration. If you know how much more costly it is to misclassify in one direction than in the other, and if you have an estimate of the class probabilities, you can base your decision on this.
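To illustrate, here is a small sketch that picks the decision with the lowest expected cost. The probabilities and the cost table are invented for the example, and `min_cost_decision` is a hypothetical helper, not part of any library.

```python
def min_cost_decision(class_probs, cost):
    """Return the decision with the lowest expected cost.

    class_probs: dict mapping true class -> estimated probability for this sample
    cost: dict mapping decision -> {true class -> cost of making that decision}
    """
    expected = {
        decision: sum(cost[decision][c] * p for c, p in class_probs.items())
        for decision in cost
    }
    best = min(expected, key=expected.get)
    return best, expected


# Assumed numbers: the classifier leans towards "object1" ...
class_probs = {"object1": 0.6, "unknown": 0.4}

# ... but accepting an unknown object as object 1 is taken to be
# five times as costly as rejecting a genuine object 1.
cost = {
    "object1": {"object1": 0.0, "unknown": 5.0},
    "unknown": {"object1": 1.0, "unknown": 0.0},
}

decision, expected = min_cost_decision(class_probs, cost)
print(decision, expected)
```

With these made-up numbers the expected cost of answering "object1" is 0.4 × 5 = 2.0, against 0.6 × 1 = 0.6 for "unknown", so the cheaper decision is "unknown" even though object 1 is the more probable class.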