I want to build a predictive model for churn and plan to use a discrete-time survival model fitted to a person-period training dataset (one row for each customer and each discrete period they were at risk, with an event indicator equal to 1 if the churn happened in that period, else 0).
- I am fitting the model with ordinary logistic regression, using the technique from Singer and Willett.
- The churn of a customer can happen anywhere during a month, but it is only at the end of the month that we know about it (i.e. sometime during that month they left). 24 months are being used for training.
- The time variable is measured from the origin time of the sample: all customers active as of 12/31/2008 receive t=0 as of Jan 2009 (not the classical way to do it, but I believe it is the way to go when building a predictive model versus a traditional statistical one). A covariate used is the tenure of the customer at that point in time.
- A series of covariates was constructed: some that do not change across the rows of the dataset (for a given customer) and some that do.
- These time-varying covariates are the issue and are what is causing me to question a survival model for churn prediction (compared to a regular classifier that predicts churn in the next x months based on current snapshot data). The time-varying ones describe activity the month prior and are expected to be important triggers.
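For concreteness, here is a minimal sketch of how such a person-period dataset could be built. The function name, the tuple layout and the toy customers are hypothetical and only illustrate the structure described above; covariates are left out.

```python
def to_person_period(customers, n_periods):
    """Expand one record per customer into one row per period at risk.

    customers: list of (customer_id, churn_period) tuples. churn_period is
    the 1-based period in which churn was observed, or None if the customer
    was still active at the end of the window (censored).
    """
    rows = []
    for customer_id, churn_period in customers:
        # A customer is at risk up to the churn period, or the whole window.
        last = churn_period if churn_period is not None else n_periods
        for period in range(1, last + 1):
            event = 1 if period == churn_period else 0
            rows.append(
                {"customer_id": customer_id, "period": period, "event": event}
            )
    return rows


# Toy example: customer A churns in month 3, customer B never churns.
customers = [("A", 3), ("B", None)]
rows = to_person_period(customers, n_periods=24)
```

Customer A contributes three rows (the last with `event` equal to 1) and customer B contributes 24 rows, all with `event` equal to 0.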
The implementation of this predictive model, at least based on my current thinking, is to score the customer base at the end of each month, calculating the probability / risk of churn sometime during the next month. Then again for the next 1,2 or 3 months. Then for the next 1,2,3,4,5,6 months. For the 3 and 6 month churn probability, I would be using the estimated survival curve.
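To make the 3- and 6-month calculation explicit: in a discrete-time model the survival curve is the running product of one minus the per-period hazards, so the probability of churning within k months is one minus that product. A minimal sketch, with made-up hazard values and a hypothetical function name:

```python
def cumulative_churn(hazards):
    """Return P(churn within 1..k periods) for each k, given per-period hazards."""
    survival = 1.0
    out = []
    for h in hazards:
        survival *= 1.0 - h          # S(k) = S(k-1) * (1 - h_k)
        out.append(1.0 - survival)   # P(churn by k) = 1 - S(k)
    return out


# Made-up predicted hazards for the next six months for one customer.
hazards = [0.02, 0.03, 0.03, 0.04, 0.04, 0.05]
churn = cumulative_churn(hazards)
churn_3m, churn_6m = churn[2], churn[5]
```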
The problem:
When it comes to scoring, how can I incorporate time-varying predictors? It seems like I can only score with time-invariant predictors, or, to include those that are time-varying, I have to make them time invariant by setting them to the value “right now”.
Does anyone have experience or thoughts on this use of a survival model?
Update based on @JVM comment:
The issue is not with estimating the model, interpreting coefficients, plotting the hazard/survival plots of interesting covariate values using the training data etc. The issue is in using the model to forecast risk for a given customer. Say at the end of this month, I want to score everyone who is still an active customer with this model. I want to forecast that risk estimate out x periods (risk of closing the account at the end of next month, risk of closing the account at the end of two months from now, etc.). If there are time-varying covariates, their values are unknown in any future period, so how can the model be used?
Final Update:
A person-period data set will have an entry for each person and each time period they are at risk. Say there are J time periods (maybe j = 1…24 for 24 months). Let's say I construct a discrete-time survival model where, for simplicity, we just treat time T as linear and have two covariates X and Z, where X is time-invariant, meaning it is constant in every period for the ith person, and Z is time-varying, meaning that each record for the ith person can take on a different value. For example, X may be the customer's gender and Z might be how much they were worth to the company in the prior month. The model for the logit of the hazard for the ith person in the jth time period is:
$logit(h(t_{ij}))=\alpha_{0}+\alpha_{1}T_{j}+\beta_{1}X_{i}+\beta_{2}Z_{ij}$
So the issue is that, when using time-varying covariates and forecasting (into the yet unseen future) with new data, the $Z_{ij}$ are unknown.
The only solutions I can think are:
- Don't use time varying covariates like Z. This would greatly weaken the model to predict the event of churning though since, for example, seeing a decrease in Z would tell us the customer is disengaging and perhaps preparing to leave.
- Use time varying covariates but lag them (like Z was above) which allows us to forecast out however many periods we have lagged the variable (again, thinking of the model scoring new current data).
- Use time varying covariates but keep them as constants in the forecast (so the model was fitted for varying data but for prediction we leave them constant and simulate how changes in these values, if later actually observed, will impact risk of churning).
- Use time varying covariates but impute their future values based on a forecast from known data, e.g. forecast the $Z_{ij}$ for each customer.
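The last two options differ only in which path of Z is fed into the fitted model. A minimal sketch of that, assuming the logit model above has already been fitted; the coefficient values, the function names and the Z values are all hypothetical:

```python
import math


def hazard(t, x, z, coef):
    """Per-period hazard: inverse logit of alpha0 + alpha1*T + beta1*X + beta2*Z."""
    eta = (coef["alpha0"] + coef["alpha1"] * t
           + coef["beta1"] * x + coef["beta2"] * z)
    return 1.0 / (1.0 + math.exp(-eta))


def churn_within(periods, x, z_path, coef):
    """P(churn within the given future periods) for an assumed path of Z."""
    survival = 1.0
    for t, z in zip(periods, z_path):
        survival *= 1.0 - hazard(t, x, z, coef)
    return 1.0 - survival


# Hypothetical coefficients, not estimated from any real data.
coef = {"alpha0": -3.0, "alpha1": 0.02, "beta1": 0.3, "beta2": -0.01}
periods = [1, 2, 3]   # next three months after the scoring date
z_now = 120.0         # last observed value of Z for this customer

# Option 3: hold Z constant at its current value.
frozen = churn_within(periods, x=1, z_path=[z_now] * 3, coef=coef)

# Option 4: plug in a forecast path for Z (here an assumed decline).
declining = churn_within(periods, x=1, z_path=[110.0, 100.0, 90.0], coef=coef)
```

With a negative coefficient on Z, the declining path gives a higher 3-month churn probability than the frozen one, which is the kind of what-if comparison the third option describes.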
Best Answer
Thank you for the clarification, B_Miner. I don't do a lot of forecasting myself, so take what follows with a pinch of salt. Here is what I would do as at least a first cut at the data.
Once you have a model that you think is reasonable, I would suggest bootstrapping the data as a way to incorporate the error in the first TVC model into the second model. Basically, apply steps 1-3 N times, each time taking a bootstrap sample from the data and producing a set of forecasts. When you have a reasonable number of forecasts, summarize them in any way you think is appropriate for your task; e.g., provide mean risk of churn for each individual or covariate profile of interest as well as 95% confidence intervals.
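A rough sketch of that bootstrap loop is below. The `fit_and_forecast` argument is a hypothetical placeholder for the whole refit-and-forecast procedure run on one resampled dataset, returning a single forecast risk; the toy stand-in used here (the churn share in the sample) is only there to make the example runnable. Whole customers are resampled, so that all person-period rows of a customer stay together.

```python
import random
import statistics


def bootstrap_forecasts(customer_ids, fit_and_forecast, n_boot=200, seed=0):
    """Percentile bootstrap: mean forecast plus a 95% interval."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Resample customers with replacement, same size as the original set.
        sample = [rng.choice(customer_ids) for _ in customer_ids]
        draws.append(fit_and_forecast(sample))
    draws.sort()
    lower = draws[int(0.025 * (n_boot - 1))]
    upper = draws[int(0.975 * (n_boot - 1))]
    return statistics.mean(draws), lower, upper


# Toy data: 100 customers, every fifth one churned (20% churn).
churned = {f"c{i}": (i % 5 == 0) for i in range(100)}


def toy_fit_and_forecast(sample_ids):
    # Stand-in for refitting the models and forecasting on the resample.
    return sum(churned[c] for c in sample_ids) / len(sample_ids)


mean_risk, lower, upper = bootstrap_forecasts(list(churned), toy_fit_and_forecast)
```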