Solved – What classifier allows me to classify: object 1, object 2, object 3, none of those 3

classification, machine learning, multi-class, unbalanced-classes

I am working on an "object recognition" project. I have 3 types of objects. I am always able to detect them but unable to correctly classify them using an SVM. Very often, for example, an "unknown" object is misclassified as object 1 with very high certainty.

My SVM is actually trained on only 3 classes: objects 1, 2 and 3. I don't have any data about the "unknown" object when training. When an object's features don't correspond to any of the 3 objects, it should be classified as "unknown."

What classifier would be the most suited for this?

Could it be that the incorrect classification happens because object 1 has 4x more training data than object 2?

My initial thoughts are to use random forest, but I'm not sure.

Best Answer

Since you know that the unknown objects in your data set do not belong to any of the three classes (you know they are past face photos of unauthorized users), you can simply treat them as a separate class. Since that fourth class is less well defined than the first three and would contain very different objects, you should expect it to be scattered in a complex shape across your input feature space.

Therefore, you need a classifier that can model complex non-linear decision boundaries. SVMs with a non-linear kernel can do that. Unfortunately, SVMs natively perform only binary classification and you have multiple classes. But do you? You could also see your problem as a two-step procedure:

1. Decide whether the object is unknown or known.
2. If it is known, decide which of the three it is.

It is not guaranteed that this will work better, but it's worth a shot. Step 1 would be binary, which suits SVMs better; the ensembles used to extend them to multiple classes come with problems of their own. Step 2 is only necessary if the different authorized users have different types of access or if you need to log who was there at what time. If you only need an access/no access decision, there is no need for step 2. For step 2 you could still try SVM ensembles or something else.
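The two-step idea can be sketched as a thin wrapper around two separate models. This is only an illustration: the names `two_step_classify`, `toy_is_known` and `toy_which_known` are made up here, and the toy functions are distance-to-prototype stand-ins for whatever trained models (for example a binary SVM for step 1) you would actually plug in.

```python
from typing import Callable, Sequence

UNKNOWN = "unknown"

def two_step_classify(
    features: Sequence[float],
    is_known: Callable[[Sequence[float]], bool],
    which_known: Callable[[Sequence[float]], str],
) -> str:
    """Step 1: known vs. unknown. Step 2: which known class, only if step 1 passes."""
    if not is_known(features):
        return UNKNOWN
    return which_known(features)

# Hypothetical stand-ins for trained models (illustration only).
# Each known object is a prototype point; anything far from all prototypes is unknown.
PROTOTYPES = {"object 1": (0.0, 0.0), "object 2": (5.0, 0.0), "object 3": (0.0, 5.0)}

def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_is_known(features, radius=1.5):
    # Known if the point lies within `radius` of at least one prototype.
    return min(_sq_dist(features, p) for p in PROTOTYPES.values()) <= radius ** 2

def toy_which_known(features):
    # Pick the closest prototype among the known classes.
    return min(PROTOTYPES, key=lambda name: _sq_dist(features, PROTOTYPES[name]))

print(two_step_classify((0.2, -0.1), toy_is_known, toy_which_known))   # object 1
print(two_step_classify((4.6, 0.3), toy_is_known, toy_which_known))    # object 2
print(two_step_classify((10.0, 10.0), toy_is_known, toy_which_known))  # unknown
```

If you only need an access/no access decision, `is_known` alone is enough and `which_known` can be dropped.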

You should try a couple of algorithms (including those that use a two-step procedure and those that don't) on the same cross-validation folds and then decide.
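A minimal sketch of what "the same cross-validation folds" means in practice: build the folds once, then score every candidate on exactly those folds. Everything here is an assumption made for illustration: the helper names, the toy one-dimensional data, and the two candidates (a majority-class baseline and 1-nearest-neighbour), which stand in for the real algorithms you want to compare.

```python
import random
from collections import Counter

def make_folds(n_samples, k, seed=0):
    """Shuffle the indices once and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(fit, X, y, folds):
    """fit(X_train, y_train) must return a predict(x) function."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    # Baseline: always predict the most frequent training label.
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(X_train, y_train):
    # 1-nearest-neighbour on a single feature.
    def predict(x):
        j = min(range(len(X_train)), key=lambda i: abs(X_train[i][0] - x[0]))
        return y_train[j]
    return predict

# Toy data: one feature, two classes separated at 10 (hypothetical).
X = [[float(i)] for i in range(20)]
y = ["a" if i < 10 else "b" for i in range(20)]

folds = make_folds(len(X), k=5, seed=0)  # built once, reused for every candidate
for name, fit in [("majority", fit_majority), ("1-NN", fit_1nn)]:
    print(name, cross_val_accuracy(fit, X, y, folds))
```

Because every candidate sees identical train/test splits, differences in the scores reflect the algorithms rather than the luck of the split.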

Your application scenario can also tell you whether or not it is a good idea to aggregate the allowed users into one class:

- If you have only very few users who rarely if ever change, it is more practical to aggregate them into one class once and not have to retrain the classifier all the time.
- If you have a long list of authorized users, with new ones added and old ones removed correspondingly often, it would not be practical to put them all in one class and retrain the entire classifier every time a user changes (though some techniques like $k$NN aren't prohibitive in that regard).
- You could do multiple single-class classifications: user 1 or not? user 2 or not? etc. With 3 users that will work just fine. With 300 it would be dangerous because you inflate your chances of admitting an intruder. The user scans once and your system runs 300 tests (without the user seeing that there are so many tests). Each of these 300 tests has a non-zero probability of falsely admitting someone who is not that particular user, so an intruder has 300 chances of being falsely admitted. If you raise the cutoffs in all 300 tests to counterbalance this problem, you raise the chances that authorized users will have to scan multiple times.
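To put rough numbers on the last point: if each per-user test falsely accepts an intruder with probability p, and the tests are treated as independent (a simplifying assumption), the chance that at least one of n tests lets the intruder in is 1 - (1 - p)^n. The per-test rate of 0.001 below is a made-up figure for illustration, as are the function names.

```python
def prob_any_false_accept(per_test_far: float, n_tests: int) -> float:
    """Probability that at least one of n independent tests falsely accepts."""
    return 1.0 - (1.0 - per_test_far) ** n_tests

def per_test_far_for_target(overall_far: float, n_tests: int) -> float:
    """Per-test false-accept rate needed to hit a target overall rate."""
    return 1.0 - (1.0 - overall_far) ** (1.0 / n_tests)

p = 0.001  # hypothetical false-accept rate of a single "user i or not?" test

print(f"3 users:   {prob_any_false_accept(p, 3):.4f}")
print(f"300 users: {prob_any_false_accept(p, 300):.4f}")

# How strict each of the 300 tests must be to get back to an overall rate of p.
print(f"required per-test rate: {per_test_far_for_target(p, 300):.2e}")
```

With 3 users the overall false-accept rate stays near 0.3%, while with 300 users it climbs to roughly 26%. Pushing each test's rate low enough to compensate is what forces the stricter cutoffs, and with them more rejected scans for legitimate users.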

Regarding the higher class prevalence of object 1: that can be problematic for most algorithms (more for some than for others), but at a ratio of 4 to 1 it shouldn't be a major concern yet.

What you should do, however, is take misclassification costs into consideration. If you know how much more costly it is to misclassify in one direction rather than the other, and if you have an estimate of the class probabilities, you can base your decision on this.
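One common way to act on costs, sketched here under assumptions: given estimated class probabilities for a sample and a cost table, pick the decision with the lowest expected cost. The function name, the probabilities and the costs below are all hypothetical. Here a false admission is taken to be 10 times as costly as a false rejection.

```python
def min_expected_cost_decision(probabilities, cost):
    """
    probabilities: {true_class: estimated probability of that class for this sample}
    cost: {decision: {true_class: cost of taking `decision` when `true_class` is true}}
    Returns the cheapest decision and the expected cost of every decision.
    """
    expected = {
        decision: sum(cost[decision][true] * p for true, p in probabilities.items())
        for decision in cost
    }
    return min(expected, key=expected.get), expected

# Hypothetical numbers for an access-control setting.
probabilities = {"authorized": 0.7, "intruder": 0.3}
cost = {
    "admit":  {"authorized": 0.0, "intruder": 10.0},  # admitting an intruder is expensive
    "reject": {"authorized": 1.0, "intruder": 0.0},   # rejecting a real user is a nuisance
}

decision, expected = min_expected_cost_decision(probabilities, cost)
print(decision, expected)
```

With these numbers the cheaper decision is "reject", even though "authorized" is the more probable class, because the asymmetric costs outweigh the probability advantage.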
