Solved – Incorporate new unlabeled data into classifier trained on a small set of labeled data

classification, clustering, labeling, semi-supervised-learning

I have a set of 400 labeled samples (8 numeric features) on which I trained a binary classifier.

The problem I am facing is that once the classifier is shipped to the users, I will get additional samples, but those will be unlabeled. I was researching common ways to incorporate unlabeled data in order to increase future classification accuracy. The way I see it, I have four options:

  1. Forget about the existing binary classifier and use a semi-supervised learning algorithm such as S3VM

  2. Keep the existing binary classifier, use a transductive learning algorithm such as label propagation, and use the newly (but possibly wrongly) labeled data to retrain the binary classifier; iterate this procedure (see the sketch after this list).

  3. Keep the existing binary classifier, use a (supervised?) clustering algorithm to label new data, and use the newly (but possibly wrongly) labeled data to retrain the binary classifier; iterate this procedure. Maybe some mixture model with Expectation Maximization?

  4. Alternative idea?
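
For concreteness, here is a minimal sketch of option 2 using scikit-learn's `LabelPropagation`. The stand-in data, the RBF kernel's `gamma`, and the choice of `RandomForestClassifier` as the downstream model are my assumptions, not part of the question; scikit-learn's convention of marking unlabeled points with `-1` is real.

```python
# Sketch of option 2: propagate labels onto new unlabeled data, then retrain.
import numpy as np
from sklearn.semi_supervised import LabelPropagation
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(400, 8))        # hypothetical stand-in for the 400 labeled samples
y_labeled = rng.integers(0, 2, size=400)
X_new = rng.normal(size=(1000, 8))           # hypothetical unlabeled samples from users

# scikit-learn marks unlabeled points with -1.
X_all = np.vstack([X_labeled, X_new])
y_all = np.concatenate([y_labeled, np.full(len(X_new), -1)])

# Transductive step: spread labels over a similarity graph (RBF kernel here).
prop = LabelPropagation(kernel="rbf", gamma=0.5)
prop.fit(X_all, y_all)
y_new_guess = prop.transduction_[len(X_labeled):]

# Retrain the supervised classifier on original + propagated labels.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_all, np.concatenate([y_labeled, y_new_guess]))
```

To iterate as option 2 describes, you would fold the propagated labels back in and rerun both steps; note that this compounds any propagation errors, which is exactly the concern raised below.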

While 3) seems rather flawed, because typical clustering algorithms optimize criteria unrelated to the labels, I am not sure what to think about 1) and 2). What I do not like about 2) is that after we run a label propagation algorithm, we assume the propagated labels are correct, and based on this new set of samples we select new features and retrain our classifier. But a change in the misclassification rate can now stem either from a bad selection of features or from the fact that the new labels are wrong. To me, 1) seems to reflect the situation best.
Am I understanding the situation correctly, i.e., is it true that 1) is superior to 2) and 2) is superior to 3)?

Or did I miss the point completely, and an alternative approach is more appropriate than any of the three?

Best Answer

(3) doesn't have to be bad if you have some prior about what the clusters might look like; however, you wouldn't be using your labeled data optimally. As you point out, you can iteratively train a classifier on its own output (see the sketch below).
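
A minimal sketch of that "train on its own output" idea, using scikit-learn's `SelfTrainingClassifier`. The choice of `LogisticRegression` as the base model and the confidence threshold are my assumptions to tune, not prescriptions:

```python
# Self-training: the base classifier pseudo-labels points it is confident
# about (predicted probability >= threshold) and retrains, iteratively.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)
X_labeled = rng.normal(size=(400, 8))        # hypothetical stand-ins
y_labeled = rng.integers(0, 2, size=400)
X_new = rng.normal(size=(1000, 8))

X_all = np.vstack([X_labeled, X_new])
y_all = np.concatenate([y_labeled, np.full(len(X_new), -1)])  # -1 = unlabeled

self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                       threshold=0.9)
self_training.fit(X_all, y_all)
print(self_training.predict(X_new[:5]))
```

A high threshold keeps early pseudo-labeling conservative, which limits how fast wrong labels can feed back into training.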

(2) isn't that different from (3), really; how well it works will depend on how good your similarity metric is.

(1) is what I would recommend, though it doesn't have to be S3VM. A Bayesian model would treat all the missing labels as latent variables and learn the posterior distribution of both the missing labels and the classifier's parameters.
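
Not the full Bayesian treatment described above, but a closely related EM point estimate of the same latent-label idea: model each class as a Gaussian, treat the missing labels as latent variables, and let `GaussianMixture`'s EM infer them over all samples. The one-Gaussian-per-class assumption and the initialization from labeled class means are mine:

```python
# EM with latent labels: fit a 2-component Gaussian mixture to ALL samples,
# anchoring the components to the two classes via the labeled data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X_labeled = rng.normal(size=(400, 8))        # hypothetical stand-ins
y_labeled = rng.integers(0, 2, size=400)
X_new = rng.normal(size=(1000, 8))

# Initialize each component at the mean of one labeled class, so component k
# starts out corresponding to class k (EM refines this but does not
# guarantee the correspondence survives).
means_init = np.vstack([X_labeled[y_labeled == k].mean(axis=0) for k in (0, 1)])
gmm = GaussianMixture(n_components=2, means_init=means_init, random_state=0)
gmm.fit(np.vstack([X_labeled, X_new]))       # unlabeled data enters EM directly

# Posterior responsibilities play the role of inferred missing labels.
p_class1 = gmm.predict_proba(X_new)[:, 1]
```

A fully Bayesian version would additionally put priors on the means, covariances, and mixing weights and integrate over them, yielding a posterior distribution over the missing labels rather than a point estimate.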
