Solved – Survival Model for Predicting Churn – Time-varying predictors

churn, predictive-models, survival

I am looking to build a predictive model for churn using a discrete-time survival model fitted to a person-period training dataset (one row for each customer and each discrete period they were at risk, with an event indicator equal to 1 if the churn happened in that period, else 0).

  • I am fitting the model using ordinary
    logistic regression, following the
    technique from Singer and
    Willett.
  • The churn of a customer can happen
    anywhere during a month, but it is
    only at the end of the month that we
    know about it (i.e. sometime during
    that month they left). 24 months is
    being used for training.
  • The time variable being used is the
    origin time of the sample: all
    customers active as of 12/31/2008 receive t=0 as of Jan 2009 (not the classical way to do it, but, I believe, the usual way when building a predictive model rather than a traditional statistical one). A
    covariate used is the tenure of the
    customer at that point in time.
  • There are a series of covariates that
    were constructed – some that do not
    change across the rows of the dataset
    (for a given customer) and some that
    do.

  • These time-varying covariates are the
    issue, and they are causing me to
    question a survival model for churn
    prediction (compared with a regular
    classifier that predicts churn in the
    next x months based on current
    snapshot data). The time-varying ones describe activity in the prior month and are expected to be important triggers.
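The person-period layout described above can be sketched roughly as follows. This is only an illustration: the field names (`id`, `tenure_at_origin`, `churn_period`) and the helper `to_person_period` are hypothetical and not from the original post.

```python
def to_person_period(customers, n_periods):
    """Expand one record per customer into one row per customer-period at risk.

    Each customer is a dict with hypothetical keys:
      id, tenure_at_origin, churn_period (1-based period of churn, or None
      if the customer was still active at the end of the window).
    """
    rows = []
    for c in customers:
        churned_at = c["churn_period"]
        # A customer is at risk up to and including the churn period,
        # or for the whole window if no churn was observed (censored).
        last = n_periods if churned_at is None else min(churned_at, n_periods)
        for t in range(1, last + 1):
            rows.append({
                "id": c["id"],
                "period": t,
                "tenure_at_origin": c["tenure_at_origin"],
                "event": int(churned_at == t),
            })
    return rows


# Made-up example: "a" churns in period 3, "b" is still active after 4 periods.
example = to_person_period(
    [
        {"id": "a", "tenure_at_origin": 10, "churn_period": 3},
        {"id": "b", "tenure_at_origin": 2, "churn_period": None},
    ],
    n_periods=4,
)
```

Time-varying covariates would be added as further columns whose values differ across the rows of a given customer.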

The implementation of this predictive model, at least based on my current thinking, is to score the customer base at the end of each month, calculating the probability / risk of churn sometime during the next month. Then again for churn sometime within the next 3 months, and again for churn sometime within the next 6 months. For the 3 and 6 month churn probabilities, I would be using the estimated survival curve.
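For reference, the multi-month probabilities follow from the monthly hazards through the survival curve: the probability of surviving k periods is the product of (1 − hazard) over those periods, and the probability of churning within k periods is one minus that. A minimal sketch, with made-up hazard values and a hypothetical helper name `churn_within`:

```python
def churn_within(hazards, horizons=(1, 3, 6)):
    """Probability of churning within each horizon, given per-period hazards.

    hazards[k] is the conditional probability of churning in future period
    k + 1, given survival up to the start of that period.
    """
    surv = 1.0
    survival = []
    for h in hazards:
        surv *= 1.0 - h          # S(k) = S(k-1) * (1 - h_k)
        survival.append(surv)
    # Horizons longer than the supplied hazard sequence are skipped.
    return {k: 1.0 - survival[k - 1] for k in horizons if k <= len(survival)}


# Made-up monthly hazards for one customer over the next six months.
monthly_hazards = [0.02, 0.03, 0.03, 0.04, 0.04, 0.05]
risk_by_horizon = churn_within(monthly_hazards)
```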

The problem:

When it comes to scoring, how can I incorporate time-varying predictors? It seems that I can only score with time-invariant predictors or, to include those that are time-varying, I have to make them time-invariant by setting them to their value “right now”.

Does anyone have experience or thoughts on this use of a survival model?

Update based on @JVM comment:

The issue is not with estimating the model, interpreting coefficients, plotting the hazard/survival curves for interesting covariate values using the training data, etc. The issue is in using the model to forecast risk for a given customer. Say at the end of this month, I want to score everyone who is still an active customer with this model. I want to forecast that risk estimate out x periods (risk of closing the account by the end of next month, risk of closing the account by the end of two months from now, etc.). If there are time-varying covariates, their values are unknown in any future period, so how can the model be used?

Final Update:

A person-period data set will have an entry for each person and each time period they are at risk. Say there are J time periods (maybe j = 1…24 for 24 months). Let's say I construct a discrete-time survival model where, for simplicity, we just treat time T as linear and have two covariates X and Z, where X is time-invariant, meaning it is constant in every period for the ith person, and Z is time-varying, meaning that each record for the ith person can take on a different value. For example, X may be the customer's gender and Z might be how much they were worth to the company in the prior month. The model for the logit of the hazard for the ith person in the jth time period is:

$logit(h(t_{ij}))=\alpha_{0}+\alpha_{1}T_{j}+\beta_{1}X_{i}+\beta_{2}Z_{ij}$

So the issue is that, when using time-varying covariates and forecasting (into the as-yet-unseen future) with new data, the $Z_{ij}$ are unknown.

The only solutions I can think of are:

  • Don't use time-varying covariates like Z. This would greatly weaken the model's ability to predict churn, though, since, for example, seeing a decrease in Z would tell us the customer is disengaging and perhaps preparing to leave.
  • Use time-varying covariates but lag them (as Z was above), which allows us to forecast out as many periods as we have lagged the variable (again, thinking of the model scoring new current data).
  • Use time-varying covariates but keep them constant in the forecast (so the model is fitted on varying data, but for prediction we hold them constant and simulate how changes in these values, if later actually observed, would impact the risk of churning).
  • Use time-varying covariates but impute their future values based on a forecast from known data, e.g. forecast the $Z_{ij}$ for each customer.
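To make the last two options concrete, here is a rough sketch of scoring forward with the logit model above, once with Z held at its current value and once with an assumed future path for Z. The coefficient values, the Z paths, and the helper names `hazard` and `forecast_risk` are invented for illustration only; nothing here is estimated from data.

```python
import math


def hazard(coef, t, x, z):
    """Discrete-time hazard: inverse logit of a0 + a1*T + b1*X + b2*Z."""
    eta = coef["a0"] + coef["a1"] * t + coef["b1"] * x + coef["b2"] * z
    return 1.0 / (1.0 + math.exp(-eta))


def forecast_risk(coef, t_next, x, z_path):
    """Cumulative churn probability after each future period.

    z_path[k] is the value of Z assumed for the k-th future period, and
    t_next is the value of the time variable T in the first future period.
    """
    surv = 1.0
    risks = []
    for k, z in enumerate(z_path):
        surv *= 1.0 - hazard(coef, t_next + k, x, z)
        risks.append(1.0 - surv)
    return risks


# Hypothetical coefficients (alpha_0, alpha_1, beta_1, beta_2).
coef = {"a0": -3.0, "a1": 0.02, "b1": 0.3, "b2": -0.01}
z_now = 120.0

# Hold Z constant at its latest observed value for six future periods.
risk_constant = forecast_risk(coef, t_next=25, x=1, z_path=[z_now] * 6)

# Plug in a forecast path for Z instead (here an invented declining path).
risk_forecast = forecast_risk(coef, t_next=25, x=1,
                              z_path=[110, 100, 90, 80, 70, 60])
```

With a lagged covariate, the first few entries of `z_path` would be values that are already observed, and only the later entries would need to be held constant or forecast.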

Best Answer

Thank you for the clarification, B_Miner. I don't do a lot of forecasting myself, so take what follows with a pinch of salt. Here is what I would do as at least a first cut at the data.

  • First, formulate and estimate a model that explains your TVCs. Do all of the cross-validation, error checking, etc., to make sure you have a decent model for the data.
  • Second, formulate and estimate a survival model (of whatever flavor). Do all of the cross-validation, error checking, to make sure this model is reasonable as well.
  • Third, settle on a method of using the forecasts from the TVCs model as the basis of forecasting risks of churn and whatever else you want. Once again, verify that the predictions are reasonable using your sample.
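As one very simple illustration of the first step, a time-varying covariate could be forecast with a pooled first-order autoregression, z_t = a + b·z_{t−1}, fitted by least squares. This is only an assumed stand-in for whatever TVC model suits the data, and the function names `fit_ar1` and `forecast_z` are hypothetical.

```python
def fit_ar1(histories):
    """Pooled least-squares fit of z_t = a + b * z_{t-1}.

    histories is a list of per-customer sequences of Z, oldest value first.
    """
    pairs = [(h[i - 1], h[i]) for h in histories for i in range(1, len(h))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def forecast_z(a, b, z_last, steps):
    """Iterate the fitted recursion forward from the last observed value."""
    path = []
    z = z_last
    for _ in range(steps):
        z = a + b * z
        path.append(z)
    return path


# Made-up history for one customer; the forecast path would then feed the
# survival model as the assumed future values of Z.
a, b = fit_ar1([[8.0, 5.0, 3.5, 2.75]])
z_future = forecast_z(a, b, z_last=2.75, steps=2)
```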

Once you have a model that you think is reasonable, I would suggest bootstrapping the data as a way to incorporate the error in the first TVC model into the second model. Basically, apply steps 1-3 N times, each time taking a bootstrap sample from the data and producing a set of forecasts. When you have a reasonable number of forecasts, summarize them in any way you think is appropriate for your task; e.g., provide mean risk of churn for each individual or covariate profile of interest as well as 95% confidence intervals.
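A bare-bones sketch of that bootstrap loop might look like the following. The callable `fit_and_forecast` is a hypothetical placeholder for steps 1-3 (fit the TVC model, fit the survival model, produce a forecast) and is assumed to return a single number, such as the forecast risk for one covariate profile. Resampling whole customers rather than individual person-period rows is my assumption here, made so that each customer's rows stay together.

```python
import random


def bootstrap_forecasts(customers, fit_and_forecast, n_boot=200, seed=0):
    """Refit the whole pipeline on n_boot resamples of customers."""
    rng = random.Random(seed)
    n = len(customers)
    draws = []
    for _ in range(n_boot):
        # Sample customers with replacement, keeping the sample size fixed.
        sample = [customers[rng.randrange(n)] for _ in range(n)]
        draws.append(fit_and_forecast(sample))
    return draws


def summarize(draws, lower=0.025, upper=0.975):
    """Mean and percentile interval of the bootstrap forecasts."""
    ordered = sorted(draws)
    last = len(ordered) - 1

    def pick(q):
        return ordered[int(round(q * last))]

    return {
        "mean": sum(ordered) / len(ordered),
        "lower": pick(lower),
        "upper": pick(upper),
    }
```

For example, passing a function that refits both models on the resampled customers and returns the 3-month churn risk for a profile of interest would give a mean and a 95% percentile interval for that risk.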
