For the sake of simplicity, let's say I'm working on the classic example of spam/not-spam emails.
I have a set of 20000 emails. Of these, I know that 2000 are spam, but I don't have any examples of not-spam emails. I'd like to predict whether the remaining 18000 are spam or not. Ideally, the outcome I'm looking for is a probability (or a p-value) that each email is spam.
What algorithm(s) can I use to make a sensible prediction in this situation?
At the moment, I'm thinking of a distance-based method that would tell me how similar my email is to a known spam email. What options do I have?
More generally, can I use a supervised learning method, or do I necessarily need to have negative cases in my training set to do that? Am I limited to unsupervised learning approaches? What about semi-supervised methods?
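For concreteness, here is a minimal sketch of the distance-based idea I have in mind, assuming emails have already been turned into numeric feature vectors (bag-of-words counts, embeddings, etc.); the data and names below are purely illustrative:

```python
# Score each unlabeled email by its mean distance to the k nearest
# known-spam emails: smaller score = more spam-like.
# Toy stand-in: emails are 2-D vectors; spam clusters around (2, 2),
# not-spam around (-2, -2).
import math
import random

random.seed(1)

def knn_spam_score(x, spam_vectors, k=5):
    """Mean Euclidean distance from x to its k nearest known-spam vectors."""
    dists = sorted(math.dist(x, s) for s in spam_vectors)
    return sum(dists[:k]) / k

known_spam = [(random.gauss(2, 1), random.gauss(2, 1)) for _ in range(200)]

spam_like = (2.1, 1.9)    # near the spam cluster
ham_like = (-2.0, -2.1)   # far from it

print(knn_spam_score(spam_like, known_spam) < knn_spam_score(ham_like, known_spam))
```

Of course, this only gives a relative ranking, not a calibrated probability, which is part of what I'm asking about.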
Best Answer
This is called learning from positive and unlabeled data, or PU learning for short, and is an active niche of semi-supervised learning.
Briefly, it is important to use the unlabeled data in the learning process, as it yields significantly better models than so-called single-class classifiers trained exclusively on the known positives. Unlabeled data can be incorporated in several ways; the predominant approaches are (i) two-step methods, which first identify reliable negatives among the unlabeled examples and then train a standard classifier on the positives versus those reliable negatives, and (ii) biased-learning methods, which treat the entire unlabeled set as (noisy) negatives and compensate through instance weighting or score correction.
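One widely used correction in this setting comes from Elkan &amp; Noto (KDD 2008): train an ordinary probabilistic classifier to separate the labeled positives from the unlabeled examples, then rescale its scores by c = P(labeled | positive), estimated on held-out labeled positives. Here is a minimal pure-Python sketch on synthetic 1-D data; all names, numbers, and the hand-rolled logistic regression are illustrative only, and a real system would use proper email features and a library classifier:

```python
# PU learning via the Elkan & Noto rescaling trick, on toy 1-D data:
# spam ~ N(+2, 1), not-spam ~ N(-2, 1). Only some positives are labeled.
import math
import random

random.seed(0)

positives = [random.gauss(+2.0, 1.0) for _ in range(1000)]
negatives = [random.gauss(-2.0, 1.0) for _ in range(1000)]

labeled = positives[:450]              # known spam
holdout = positives[450:500]           # held-out labeled positives, to estimate c
unlabeled = positives[500:] + negatives  # hidden spam mixed with not-spam

# "Non-traditional" training set: labeled positives (s=1) vs unlabeled (s=0).
xs = labeled + unlabeled
ss = [1.0] * len(labeled) + [0.0] * len(unlabeled)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain logistic regression, fit with full-batch gradient descent.
w, b, lr, n = 0.0, 0.0, 0.5, len(xs)
for _ in range(500):
    gw = gb = 0.0
    for x, s in zip(xs, ss):
        err = sigmoid(w * x + b) - s
        gw += err * x
        gb += err
    w -= lr * gw / n
    b -= lr * gb / n

# c = P(s=1 | y=1), estimated as the mean score on held-out labeled positives.
c = sum(sigmoid(w * x + b) for x in holdout) / len(holdout)

def p_spam(x):
    """Corrected posterior P(spam | x) = P(s=1 | x) / c, clipped to [0, 1]."""
    return min(1.0, sigmoid(w * x + b) / c)

print(round(p_spam(3.0), 2), round(p_spam(-3.0), 2))
```

The key point is that the classifier trained on positives vs. unlabeled only estimates P(labeled | x), which under the "selected completely at random" assumption equals c · P(spam | x); dividing by c recovers an (approximately) calibrated spam probability, directly addressing the probability output the question asks for.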
I am active in this field, and rather than summarizing it here for you, I recommend reading two of my papers and the references therein to get an overview of the domain: