For the sake of simplicity, let's say I'm working on the classic example of spam/not-spam emails.
I have a set of 20000 emails. Of these, I know that 2000 are spam, but I don't have any examples of not-spam emails. I'd like to predict whether the remaining 18000 are spam or not. Ideally, the outcome I'm looking for is a probability (or a p-value) that each email is spam.
What algorithm(s) can I use to make a sensible prediction in this situation?
At the moment, I'm thinking of a distance-based method that would tell me how similar my email is to a known spam email. What options do I have?
More generally, can I use a supervised learning method, or do I necessarily need to have negative cases in my training set to do that? Am I limited to unsupervised learning approaches? What about semi-supervised methods?
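For concreteness, here is a minimal sketch of the distance-based idea I have in mind, assuming emails have already been turned into numeric feature vectors (bag-of-words counts, embeddings, etc.); the data and names below are purely illustrative:

```python
# Score each unlabeled email by its mean distance to the k nearest
# known-spam emails: smaller score = more spam-like.
# Toy stand-in: emails are 2-D vectors; spam clusters around (2, 2),
# not-spam around (-2, -2).
import math
import random

random.seed(1)

def knn_spam_score(x, spam_vectors, k=5):
    """Mean Euclidean distance from x to its k nearest known-spam vectors."""
    dists = sorted(math.dist(x, s) for s in spam_vectors)
    return sum(dists[:k]) / k

known_spam = [(random.gauss(2, 1), random.gauss(2, 1)) for _ in range(200)]

spam_like = (2.1, 1.9)    # near the spam cluster
ham_like = (-2.0, -2.1)   # far from it

print(knn_spam_score(spam_like, known_spam) < knn_spam_score(ham_like, known_spam))
```

Of course, this only gives a relative ranking, not a calibrated probability, which is part of what I'm asking about.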
Best Answer
This is called learning from positive and unlabeled data, or PU learning for short, and is an active niche of semi-supervised learning.
Briefly, it is important to use the unlabeled data in the learning process, as it yields significantly better models than so-called single-class classifiers trained exclusively on the known positives. Unlabeled data can be incorporated in several ways; the predominant approaches are (i) two-step methods, which first identify reliable negatives among the unlabeled examples and then train a standard classifier on the positives versus those reliable negatives, and (ii) biased-learning methods, which treat the entire unlabeled set as (noisy) negatives and compensate through instance weighting or score correction.
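One widely used correction in this setting comes from Elkan &amp; Noto (KDD 2008): train an ordinary probabilistic classifier to separate the labeled positives from the unlabeled examples, then rescale its scores by c = P(labeled | positive), estimated on held-out labeled positives. Here is a minimal pure-Python sketch on synthetic 1-D data; all names, numbers, and the hand-rolled logistic regression are illustrative only, and a real system would use proper email features and a library classifier:

```python
# PU learning via the Elkan & Noto rescaling trick, on toy 1-D data:
# spam ~ N(+2, 1), not-spam ~ N(-2, 1). Only some positives are labeled.
import math
import random

random.seed(0)

positives = [random.gauss(+2.0, 1.0) for _ in range(1000)]
negatives = [random.gauss(-2.0, 1.0) for _ in range(1000)]

labeled = positives[:450]              # known spam
holdout = positives[450:500]           # held-out labeled positives, to estimate c
unlabeled = positives[500:] + negatives  # hidden spam mixed with not-spam

# "Non-traditional" training set: labeled positives (s=1) vs unlabeled (s=0).
xs = labeled + unlabeled
ss = [1.0] * len(labeled) + [0.0] * len(unlabeled)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain logistic regression, fit with full-batch gradient descent.
w, b, lr, n = 0.0, 0.0, 0.5, len(xs)
for _ in range(500):
    gw = gb = 0.0
    for x, s in zip(xs, ss):
        err = sigmoid(w * x + b) - s
        gw += err * x
        gb += err
    w -= lr * gw / n
    b -= lr * gb / n

# c = P(s=1 | y=1), estimated as the mean score on held-out labeled positives.
c = sum(sigmoid(w * x + b) for x in holdout) / len(holdout)

def p_spam(x):
    """Corrected posterior P(spam | x) = P(s=1 | x) / c, clipped to [0, 1]."""
    return min(1.0, sigmoid(w * x + b) / c)

print(round(p_spam(3.0), 2), round(p_spam(-3.0), 2))
```

The key point is that the classifier trained on positives vs. unlabeled only estimates P(labeled | x), which under the "selected completely at random" assumption equals c · P(spam | x); dividing by c recovers an (approximately) calibrated spam probability, directly addressing the probability output the question asks for.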
I am active in this field, and rather than summarizing it here for you, I recommend reading two of my papers and the references therein to get an overview of the domain: