Solved – Classification when training set contains missing/unknown class labels

classification, logistic, random forest, semi-supervised-learning

I have a set of data points that belong to two classes – class A and class B. I know that a subset of the data points belong to class A. But I have no idea of the classification of the remaining data points.

This can arise in various settings. For example, if we ask people to respond with yes if they agree with something, then we cannot tell whether those who don't respond disagree or simply forgot to respond.

I wonder if it is possible to build/train a classifier using random forest, logistic regression, or some other technique, and how I can approach the problem.

Best Answer

The term that you want to search for is "semi-supervised learning." Wikipedia has an article under that title.

In a nutshell, you should think of this as "clustering with hints." You start with the labeled points and get a sense of what they look like. Then you can make a good guess that other points that look like them also have that label and, finally, that points that don't look like them are labeled differently.
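To make "clustering with hints" concrete, here is a minimal sketch in plain Python. It is only one of many possible approaches, not a reference method. The function name `seeded_two_means`, the helper `centroid`, and the toy data are all made up for illustration. It assumes that Euclidean distance is meaningful for your features and that each class forms one roughly compact cluster. The known class-A points anchor one cluster, the unlabeled point farthest from them seeds the other, and the guesses are refined as in two-cluster k-means.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_two_means(labeled_a, unlabeled, max_iter=100):
    """Guess 'A' or 'B' for each unlabeled point, using known A points as hints.

    Hypothetical helper. Assumes both lists are non-empty, Euclidean distance
    is meaningful, and each class is one compact cluster.
    """
    center_a = centroid(labeled_a)
    # There are no known B points, so seed B with the unlabeled point
    # that looks least like the known A points.
    center_b = max(unlabeled, key=lambda p: math.dist(p, center_a))
    guesses = None
    for _ in range(max_iter):
        new_guesses = [
            "A" if math.dist(p, center_a) <= math.dist(p, center_b) else "B"
            for p in unlabeled
        ]
        if new_guesses == guesses:
            break  # assignments stopped changing
        guesses = new_guesses
        # Known A points always stay in cluster A; only the guesses can move.
        members_a = labeled_a + [p for p, g in zip(unlabeled, guesses) if g == "A"]
        members_b = [p for p, g in zip(unlabeled, guesses) if g == "B"]
        center_a = centroid(members_a)
        if members_b:
            center_b = centroid(members_b)
    return guesses


def blob(cx, cy, n):
    """Toy data: n two-dimensional points scattered around (cx, cy)."""
    return [(random.gauss(cx, 0.5), random.gauss(cy, 0.5)) for _ in range(n)]


random.seed(0)
class_a = blob(0.0, 0.0, 40)
class_b = blob(5.0, 5.0, 40)
labeled_a = class_a[:10]             # the subset known to be class A
unlabeled = class_a[10:] + class_b   # true classes hidden from the algorithm
guesses = seeded_two_means(labeled_a, unlabeled)
print(guesses.count("A"), "guessed A;", guesses.count("B"), "guessed B")
```

The guessed labels could then be used, together with the known ones, to train a logistic regression or random forest. Real data is rarely this cleanly separated, so treat the guesses as noisy and check them against whatever ground truth you can get.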

As always, there are lots and lots of tricks and theories and variations that go with this, but given the breadth of your question, I think this will get you started in the right direction. Come back when your first results come through!
