Solved – Predicting customer churn – train & test sets

Tags: churn, survival, validation

I'm struggling with a problem where I'm trying to predict customer churn. I have monthly snapshot data going back several years, and labels for whether a customer left during a given month.

My main question is whether I should use the entire dataset as my training set. For example, take the March 2014 end-of-month snapshot and train on whether a customer left or not. That March 2014 EOM snapshot includes the March EOM data for all current customers (or those that left in March), and the time-shifted data for any customers that left prior to March 2014. My thinking is that I CAN use the entire dataset, rather than reserving a test set, because effectively my test set can be the snapshot for April 2014 or May 2014. (Or August 2014, for that matter.)

I want to use the whole snapshot for training because the churn rate is relatively low (0.02% in a given month). I've tried splitting off a Test set from the Train set, and that usually shows good model performance on the Test set but terrible performance on the subsequent months. (That's probably my real question, but I figured I'd start with getting the Train/Test question settled.)
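
To make the setup concrete, here's a sketch of the kind of split I'm describing (the column names, features, and model are placeholders, not my actual pipeline):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

# `snapshots` is a placeholder pandas DataFrame loaded elsewhere: one row per
# customer per month, a 'snapshot_month' column stored as "YYYY-MM" strings,
# a 0/1 'churned' label, and whatever feature columns are actually available.
feature_cols = ["tenure_months", "monthly_spend", "logins_last_30d"]  # placeholders

train = snapshots[snapshots["snapshot_month"] <= "2014-03"]  # everything up to March 2014 EOM
test = snapshots[snapshots["snapshot_month"] == "2014-04"]   # the April 2014 EOM snapshot

# With ~0.02% positives, accuracy is meaningless; class weighting plus a
# ranking metric such as average precision is one common way to cope.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(train[feature_cols], train["churned"])

scores = clf.predict_proba(test[feature_cols])[:, 1]
print("April 2014 average precision:", average_precision_score(test["churned"], scores))
```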

Best Answer

It depends entirely on whether historical churn in 2013 is a good predictor of churn in mid-2014, i.e. whether behavior in the training period is predictive of behavior in the test period.

In general you should assume no. [*]

(Obviously the actual individual customers churning are different. But do they churn for different reasons? Duration? Cost? Product usage? Did those customers come in through a trial? A social-media campaign? Word-of-mouth? Was that different from how previous customers were acquired? Do you have the right features to capture those? Are you at least using random-forest classification rather than linear regression? Pay attention to feature selection: generate a ton of plausible candidate features, then use a legitimate feature-selection procedure, e.g. VIF.)
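
If you do go the VIF route, one common pruning loop looks roughly like this (the threshold and helper name are illustrative only, and it assumes a numeric feature DataFrame X):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the most collinear feature until every VIF <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=[vifs.idxmax()])  # drop the worst offender and re-check
    return X
```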

[*] Why assume no? You haven't said what product domain this is (a music website? insurance? what is it?), but things change over time: prices rise, product features get changed, competitors appear, etc. In modeling terms this is concept drift: the relationship between your features and churn shifts over time. Maybe all the original customers came in on a 12-month subscription, and then the renewal price rose.
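
You can check for this kind of drift directly, before blaming the model. A rough sketch, reusing the hypothetical `snapshots` frame and placeholder feature names from the question's sketch:

```python
from scipy.stats import ks_2samp

# 1) Does the monthly churn rate itself drift?
print(snapshots.groupby("snapshot_month")["churned"].mean())

# 2) Do key feature distributions shift between the training period and later months?
train = snapshots[snapshots["snapshot_month"] <= "2014-03"]
later = snapshots[snapshots["snapshot_month"] > "2014-03"]
for month, grp in later.groupby("snapshot_month"):
    res = ks_2samp(train["monthly_spend"], grp["monthly_spend"])  # placeholder feature
    print(month, "KS statistic:", round(res.statistic, 3), "p-value:", round(res.pvalue, 4))
```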

I've tried splitting off a Test set from the Train set, and that usually shows good model performance on the Test set but terrible performance on the subsequent months. (That's probably my real question, but I figured I'd start with getting the Train/Test question settled.)

OK, well that's useful, actionable information you should pay attention to. That's a good thing, not a bad thing. Start digging into it and tell us more details: how exactly did you split the test set? First n rows? By customer id? By join date? By alphabetical last name? Username? By subscriber price? Randomly stratified? Chances are you naively chose some split criterion that introduced bias. Try different split criteria and show us what results you get. Tell us more (add it to your question details above): the more information you give us, the more help we can be.
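
The single most telling experiment is scoring the same model under a random split and under a strictly forward-in-time split; a large gap between the two numbers is the signature of a leaky or biased split (or of drift), not of a good model. A sketch, again with the placeholder frame and feature names from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

feature_cols = ["tenure_months", "monthly_spend", "logins_last_30d"]  # placeholders
X, y = snapshots[feature_cols], snapshots["churned"]

def fit_and_score(X_tr, y_tr, X_te, y_te):
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
    clf.fit(X_tr, y_tr)
    return average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])

# 1) Random stratified split: rows from the same months land on both sides.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
print("random split:", fit_and_score(X_tr, y_tr, X_te, y_te))

# 2) Temporal split: train strictly on the past, score strictly on the future.
past = snapshots["snapshot_month"] <= "2014-03"
print("temporal split:", fit_and_score(X[past], y[past], X[~past], y[~past]))
```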
