Machine Learning – Distant Supervision: Is It Supervised, Semi-Supervised, or Both?

data mining, machine learning, references, semi-supervised-learning, unsupervised learning

"Distant supervision" is a learning scheme in which a classifier is trained on a weakly labeled training set, that is, training data labeled automatically by heuristics or rules. I think both supervised and semi-supervised learning can involve "distant supervision" when their labeled data is labeled heuristically or automatically. However, on this page, "distant supervision" is defined as "semi-supervised learning" (i.e., limited to semi-supervision).

So my question is: does "distant supervision" refer exclusively to semi-supervision? In my opinion it can apply to both supervised and semi-supervised learning. Please provide reliable references, if there are any.

Best Answer

A distant supervision algorithm usually has the following steps:
1] It may have some labeled training data.
2] It has access to a pool of unlabeled data.
3] It has an operator that samples from this unlabeled data and labels the samples; the labels this operator produces are expected to be noisy.
4] The algorithm then uses the original labeled training data (if there was any) together with the new noisily labeled data to produce the final output.
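
To make the four steps above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy sentences, the emoticon rule used as the noisy labeling operator (`noisy_label`), and the word-count classifier (`train`, `predict`) standing in for whatever learner one would actually use in step 4. It is not taken from any particular paper or library.

```python
from collections import Counter

# Step 1: an optional, small hand-labeled seed set (hypothetical toy data).
seed_labeled = [
    ("great phone, love it", "pos"),
    ("terrible battery, hate it", "neg"),
]

# Step 2: a pool of unlabeled examples (also hypothetical).
unlabeled_pool = [
    "love the screen :)",
    "hate the camera :(",
    "the box arrived today",
    "awesome speakers :)",
    "awful keyboard :(",
]

# Step 3: a noisy labeling operator. Here it is an emoticon heuristic
# (an assumption for this sketch); it returns None when no rule fires.
def noisy_label(text):
    if ":)" in text:
        return "pos"
    if ":(" in text:
        return "neg"
    return None

def tokenize(text):
    # Drop the emoticons so the learner cannot simply memorize the heuristic.
    words = text.replace(",", " ").split()
    return [w for w in words if w not in (":)", ":(")]

def build_training_set(seed, pool):
    # Label the pool automatically and keep only examples where a rule fired.
    auto = [(text, noisy_label(text)) for text in pool]
    auto = [(text, label) for text, label in auto if label is not None]
    return list(seed) + auto

# Step 4: train any learner on the seed data plus the noisily labeled data.
# This one just counts how often each word appears under each label.
def train(examples):
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts

def predict(model, text):
    tokens = tokenize(text)
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    return max(scores, key=scores.get)

training_set = build_training_set(seed_labeled, unlabeled_pool)
model = train(training_set)
print(len(training_set), predict(model, "awesome screen"))
```

Note that the learner in step 4 only ever sees (text, label) pairs, so from its point of view the problem is ordinary supervised learning; the unlabeled pool enters only through the labeling operator in step 3.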

Now, to answer your question: both you and the site are correct. You are looking at the 4th step and noticing that any learning algorithm available to the user can be applied there. Hence your point that "it can be applied to both supervised and semi-supervised learning".

The site, in contrast, is looking at steps 1-4 collectively and noticing that the noisily labeled data is obtained from a pool of unlabeled data (with or without the help of pre-existing labeled training data). Because this process of obtaining noisy labels is an essential component of any distant supervision algorithm, the site classifies it as a semi-supervised algorithm.
