Machine Learning – How to Predict Outcome with Only Positive Cases as Training?

machine-learning, predictive-models, semi-supervised-learning, supervised-learning, unsupervised-learning

For the sake of simplicity, let's say I'm working on the classic example of spam/not-spam emails.

I have a set of 20000 emails. Of these, I know that 2000 are spam but I don't have any example of not-spam emails. I'd like to predict whether the remaining 18000 are spam or not. Ideally, the outcome I'm looking for is a probability (or a p-value) that the email is spam.

What algorithm(s) can I use to make a sensible prediction in this situation?

At the moment, I'm thinking of a distance-based method that would tell me how similar my email is to a known spam email. What options do I have?

More generally, can I use a supervised learning method, or do I necessarily need to have negative cases in my training set to do that? Am I limited to unsupervised learning approaches? What about semi-supervised methods?

Best Answer

This is called learning from positive and unlabeled data, or PU learning for short, and is an active niche of semi-supervised learning.

Briefly, it is important to use the unlabeled data in the learning process, as doing so yields significantly better models than so-called single-class classifiers trained exclusively on known positives. Unlabeled data can be incorporated in several ways, the predominant approaches being the following:

  • somehow infer a set of likely negatives from the unlabeled data and then train a supervised model to distinguish known positives from these inferred negatives.
  • treat the unlabeled set as negative and somehow account for the label noise that is known to be present.
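The second approach can be sketched with a simple two-step recipe in the spirit of Elkan and Noto's classic PU-learning paper: train an ordinary probabilistic classifier to separate labeled positives from the unlabeled pool, estimate the constant c = P(labeled | positive) from held-out positives, and divide the scores by c to recover calibrated spam probabilities. This is only an illustrative sketch on synthetic data (the Gaussian features, sample sizes, and logistic-regression choice are assumptions, not from the answer):

```python
# Hedged sketch of the "treat unlabeled as negative" PU approach
# (in the style of Elkan & Noto): fit g(x) ~ P(s=1 | x), where s=1
# means "labeled positive", then rescale by c = E[g(x) | x positive].
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for email feature vectors (NOT real data):
# 2000 known spam, plus an unlabeled pool mixing hidden spam and ham.
pos = rng.normal(loc=2.0, scale=1.0, size=(2000, 5))
unl = np.vstack([
    rng.normal(2.0, 1.0, size=(1800, 5)),    # hidden spam
    rng.normal(-2.0, 1.0, size=(16200, 5)),  # hidden non-spam
])

X = np.vstack([pos, unl])
s = np.concatenate([np.ones(len(pos)), np.zeros(len(unl))])  # s=1: labeled

X_tr, X_ho, s_tr, s_ho = train_test_split(
    X, s, test_size=0.2, random_state=0, stratify=s
)

# Step 1: "non-traditional" classifier, positives vs. unlabeled.
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = P(s=1 | y=1) as the mean score on
# held-out labeled positives.
c = g.predict_proba(X_ho[s_ho == 1])[:, 1].mean()

# Corrected spam probability for the unlabeled pool: P(y=1|x) = g(x)/c.
p_spam = np.clip(g.predict_proba(unl)[:, 1] / c, 0.0, 1.0)
print(f"estimated c = {c:.2f}, mean P(spam) on unlabeled = {p_spam.mean():.2f}")
```

The output of step 2 is exactly the per-email spam probability the question asks for; the clipping only guards against scores that slightly exceed c due to estimation noise.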

I am active in this field, and rather than summarizing it here for you, I recommend reading two of my papers and the references therein to get an overview of the domain.
