Classification – Constructing Target Variable from Correlates and Proxies Without Actual Target Data

Tags: classification, semi-supervised-learning, unsupervised-learning

I want to classify customers who are at risk of churning (unsubscribing). The typical path would be to have a training set of historical data that includes observations of customers who churned, so that we have a binary target variable. In such a case, there are many straightforward options.

Now, assume we have the same objective of predicting which customers are at risk of churning, but our data contains only customers who have not churned. For example, say the company providing the data doesn't understand statistics, and it automatically deletes all records of customers who churned, so that we only have the "survivors." Assume that the data on churned customers is 100% gone and unrecoverable; we can only work with the data on survivors.

In this case, we do not have the desired target variable, but we still want to predict churn. What we do have are several variables that are known to be highly correlated with churn, based on other data sets and the literature.

I've thought of fitting a model to a similar population and then using that model on my data (which requires the strong assumption that the populations are the same). I've also considered predicting something known to be correlated with churn instead, and hoping that the company can provide expert opinion, based on its experience, to infer churn risk from that (not data-driven). I can't think of any way to turn this into a supervised learning problem that results in a model that can be properly validated. Am I doomed to educated guesses, or can I pull some magic model out of a hat?

Best Answer

Your intuition is right: machine learning is based on maths, not magic. If you can't give your machine enough examples of $(X,Y)$ pairs (where $X$ is an input vector of customer attributes and past behaviour, say, and $Y$ is 1 if the customer churned and 0 otherwise), there's no way it can learn any function that maps from $X$ to $Y$. In other words, you can't build a supervised ML model without having data about the target variable.

As you also say, I think your best shot is to train a model on the other dataset (the one containing both $X$ and $Y$). If its performance on a held-out test set is good enough, and that population isn't too different from yours (the one missing $Y$), you can then predict on your data and hope for the best. In the best of cases, "hoping for the best" would involve designing an adequate experiment to find out how well your model did on the unlabelled dataset.
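A minimal sketch of that workflow, using scikit-learn on synthetic stand-in data (all features, labels, and sizes here are made up for illustration): train a classifier on the labelled "similar population" dataset, check it on a held-out split, then score the unlabelled survivors and rank them by predicted risk.

```python
# Hedged sketch: train on a similar *labelled* dataset, score the unlabelled one.
# The data below is synthetic; real column names and populations will differ.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in for the labelled dataset from a similar population.
X_lab = rng.normal(size=(1000, 4))                              # customer attributes
y_lab = (X_lab[:, 0] + rng.normal(size=1000) > 1).astype(int)   # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(
    X_lab, y_lab, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("held-out AUC:", round(auc, 3))

# Stand-in for the unlabelled survivors: score and rank by predicted risk.
# These scores are only P(churn | x) *if* the populations really match.
X_unlab = rng.normal(size=(200, 4))
risk = clf.predict_proba(X_unlab)[:, 1]
top5 = np.argsort(risk)[::-1][:5]
print("highest-risk customers (row indices):", top5)
```

Note that the held-out AUC only validates the model on the labelled population; how well the scores transfer to the survivors is exactly the assumption you'd need the follow-up experiment to test.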

On a separate but related note, you might also want to explore different techniques besides supervised ML (on your labelled data). For example, churn problems lend themselves quite well to survival-analysis techniques.
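To make the survival-analysis framing concrete: churn is naturally a time-to-event problem, where customers who are still subscribed at the end of observation are right-censored rather than discarded. A minimal Kaplan-Meier estimator in plain NumPy, on toy data (tenures and event flags invented for illustration), looks like this:

```python
# Hedged sketch: Kaplan-Meier estimate of the "still subscribed" curve S(t).
# Toy data: observed time (e.g. months of tenure) and an event flag
# (1 = churned at that time, 0 = still subscribed, i.e. right-censored).
import numpy as np

times  = np.array([2, 3, 3, 5, 8, 8, 12, 12, 12])
events = np.array([1, 1, 0, 1, 0, 1,  0,  0,  0])

def kaplan_meier(times, events):
    """Return (distinct event times, product-limit estimates of S(t))."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    s, out_t, out_s = 1.0, [], []
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                 # still subscribed just before t
        d = np.sum((times == t) & (events == 1))     # churned exactly at t
        s *= 1 - d / at_risk                         # product-limit update
        out_t.append(t)
        out_s.append(s)
    return np.array(out_t), np.array(out_s)

event_times, surv = kaplan_meier(times, events)
for t, s in zip(event_times, surv):
    print(f"S({t}) = {s:.3f}")
```

In production you would likely reach for a library such as lifelines rather than hand-rolling this, and note the caveat: this still requires some churned (event = 1) records, so in the "survivors only" scenario it would have to be fit on the external labelled dataset.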
