Solved – Classifier for uncertain class labels

Tags: classification, uncertainty, weka

Let's say I have a set of instances with class labels associated. It does not matter how these instances were labelled, but how certain their class membership is. Each instance belongs to exactly one class. Let's say I can quantify the certainty of each class membership with a nominal attribute that ranges from 1 (very certain) to 3 (uncertain).

Is there some sort of classifier that takes into consideration such a certainty measure and if yes, is it available in the WEKA toolkit?

I imagine this situation occurs quite often, for example when instances are classified by human beings who are not always completely sure. In my case, I have to classify images, and sometimes an image could belong to more than one class. If this happens, I assign the class a high uncertainty, but still label it with only one class.

Or are there other approaches to this problem that do not need a specialized classifier? E.g. only using "certain" classifications for training? I fear that in this case there will be more misclassifications, because "border" cases are not covered.
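(For concreteness, the filtering idea could look like the following minimal R sketch; the data frame train, the attribute certainty, and the level coding are the hypothetical names from above, not Weka API calls:)

    # Keep only instances labelled with certainty 1 ("very certain");
    # borderline cases (levels 2 and 3) are dropped from training.
    train_certain <- subset(train, certainty == 1)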

Best Answer

First of all, as @Marc Claesen already explained, semi-supervised classification is one technique for the situation where you know that the classes are really distinct, but you are not certain which class the case actually belongs to.
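(As an illustration only, here is a minimal self-training sketch in R, one of the simplest semi-supervised schemes: train on the labelled cases, pseudo-label the confident predictions, and refit. The data frames labeled/unlabeled, the column name class, the lda base classifier, and the 0.95 threshold are all assumptions for the sketch, not something from Weka or the papers below:)

    library(MASS)  # lda as an arbitrary base classifier

    self_train <- function(labeled, unlabeled, threshold = 0.95) {
      fit <- lda(class ~ ., data = labeled)
      while (nrow(unlabeled) > 0) {
        post <- predict(fit, unlabeled)
        sure <- apply(post$posterior, 1, max) >= threshold
        if (!any(sure)) break                      # nothing confident left
        newly <- unlabeled[sure, , drop = FALSE]
        newly$class <- post$class[sure]            # pseudo-label
        labeled   <- rbind(labeled, newly)
        unlabeled <- unlabeled[!sure, , drop = FALSE]
        fit <- lda(class ~ ., data = labeled)      # refit with pseudo-labels
      }
      fit
    }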

However, there are related situations as well, where the "reality" isn't that clear and the assumption of really distinct classes is not met: borderline cases may be a "physical" reality (see below for papers about an application where we met such a condition).

There is one crucial assumption for semi-supervised classifiers that you need to make sure is met: the assumption that, in feature space, class boundaries come along with low sample density. This is referred to as the cluster assumption.
Even if the reality underlying your data has distinct classes, your data set may contain disproportionately many borderline cases: e.g. if your classification technique is targeted at classifying difficult cases while the clear and easy ones are not of interest, then already your training data will reflect this situation.

You ask: "only taking 'certain' classifications for training? I fear that in this case, there will be more misclassifications because 'border' cases are not covered."

I fully agree with you that excluding the borderline cases is often a bad idea: by removing all difficult cases you end up with an artificially easy problem. IMHO it is even worse that excluding borderline cases usually does not stop with model training: the borderline cases are also excluded from testing, so the model is tested only with easy cases. That way you would not even realize that the model performs poorly on borderline cases.
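(A minimal R sketch of that point about testing; it assumes a test-set data frame test that keeps the certainty attribute from the question, and a vector pred of predicted classes from whatever classifier was used:)

    # Report accuracy separately per certainty level instead of one pooled
    # number, so poor performance on borderline cases stays visible.
    tapply(pred == test$class, test$certainty, mean)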

Here are two papers we wrote about a problem that differs from yours in that in our application the "reality" itself can have "mixed" classes; this is a more general version of your problem, and uncertainty in the reference labels is covered as well.

The links go to the project page of an R package I developed to do the performance calculations. There you find further links, both to the official web page and to my manuscripts of the papers. While I have not used Weka myself so far, I understand that an interface to R is available.


Practical considerations:

  • While the copy-and-label-differently approach is straightforward, it does not work well with all classifiers and implementations in practice. E.g. AFAIK there is no way to tell LIBSVM's cross-validation-based tuning that all copies of a given data point need to be kept in the same cross-validation fold, so LIBSVM's tuning would probably yield a massively overfit model.
  • Also for logistic regression, I found that many implementations did not allow the partial membership labels I needed.
  • The implementation I used for the papers above is actually an ANN without a hidden layer, using the logistic function as the sigmoidal link (nnet::multinom in R); see the sketch after this list.
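To make the last bullet concrete, here is a minimal, self-contained sketch of fitting nnet::multinom with partial class memberships. The toy features, class names, and the way the soft labels are generated are made up for illustration; the relevant point is that multinom accepts a matrix response whose rows it treats as (possibly fractional) per-class counts, which encodes soft labels directly, without the copy-and-relabel trick:

    library(nnet)

    set.seed(42)
    # Toy features and soft reference labels: each row of Y holds partial
    # class memberships summing to 1 (e.g. a borderline image labelled
    # 70 % class A / 30 % class B).
    X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    Y <- matrix(runif(300), ncol = 3)
    Y <- Y / rowSums(Y)
    colnames(Y) <- c("A", "B", "C")

    # A matrix response is interpreted as per-class counts; fractional
    # values carry the partial memberships into the fit.
    fit <- multinom(Y ~ x1 + x2, data = X)
    head(predict(fit, X, type = "probs"))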