Regression – Difference Between CoxPH and Logistic Regression: Data Preparation for Each Model

cox-model logistic regression survival

I'm working on a research project whose objective is to predict each customer's probability of churning in the next month. We have a dataset of monthly records for each customer, with variables including (the list below is not exhaustive):

month: month
customer_id: customer ID
tenure: number of months the customer has stayed
gender: whether the customer is male or female
churn: whether the customer churned or not

A part of the dataset looks like:

   month    customer_id  tenure  gender  churn
1  2022-01  1            6       1       1
2  2022-01  2            15      1       0
3  2022-01  3            12      0       0
4  2022-02  2            16      1       0
5  2022-02  3            13      0       0
5  2022-02  4            0       1       0
6  2022-03  2            17      1       0
7  2022-03  3            14      0       1
8  2022-03  4            1       1       0

Currently, I have problems with model selection and data preparation.

Problem 1: should I choose a CoxPH model (Cox proportional hazards model) or a logistic regression model?

CoxPH: the tenure variable can be treated as the time to event (churn), and we can easily determine whether a record is censored. Then, with the survival function $S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}$, we calculate the probability that a customer with covariates $x$ survives (does not churn) beyond time $t$.

Logistic regression: logistic regression also seems suitable for this case. Tenure would be an explanatory variable and churn the target variable.

Problem 2: how should I prepare data for a model?

If we choose Cox regression, we need to select only one row (maybe the last one) for each individual customer. That would look like:

   month    customer_id  tenure  gender  churn
1  2022-01  1            6       1       1
6  2022-03  2            17      1       0
7  2022-03  3            14      0       1
8  2022-03  4            1       1       0

If we choose logistic regression, we fit the model with all data rows (every month for every customer).

Am I thinking correctly about the problems?

Best Answer

As time and censoring are important, this is clearly a survival-model situation. You have to decide what you want to choose as time = 0 for the model.

If you want to model tenure as an outcome, then you would effectively set time = 0 to the time that each individual started as a customer by using tenure as the (potentially censored) outcome in a survival model, as you propose for a Cox model. If no covariate values change with time and no customer departs and returns, then you can use just the last observed tenure value along with a censoring indicator as the outcome in a Cox (or other proportional-hazards) model.
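As a rough illustration of that last step, here is a minimal sketch in plain Python that collapses the monthly records from the question into one row per customer. The names `last_row_per_customer`, `duration` and `event` are illustrative choices, not from the original post, and the sketch assumes that covariates are constant over time and that no customer leaves and returns.

```python
# Monthly records copied from the question:
# (month, customer_id, tenure, gender, churn)
records = [
    ("2022-01", 1, 6, 1, 1),
    ("2022-01", 2, 15, 1, 0),
    ("2022-01", 3, 12, 0, 0),
    ("2022-02", 2, 16, 1, 0),
    ("2022-02", 3, 13, 0, 0),
    ("2022-02", 4, 0, 1, 0),
    ("2022-03", 2, 17, 1, 0),
    ("2022-03", 3, 14, 0, 1),
    ("2022-03", 4, 1, 1, 0),
]


def last_row_per_customer(rows):
    """Keep each customer's latest monthly record.

    The tenure on that record becomes the (possibly censored) duration,
    and its churn flag becomes the event indicator (0 = censored).
    """
    latest = {}
    for month, customer_id, tenure, gender, churn in rows:
        # "YYYY-MM" strings sort chronologically, so string comparison works.
        if customer_id not in latest or month > latest[customer_id]["month"]:
            latest[customer_id] = {
                "month": month,
                "customer_id": customer_id,
                "duration": tenure,  # assumed name for the time-to-event column
                "gender": gender,
                "event": churn,      # assumed name for the censoring indicator
            }
    return [latest[cid] for cid in sorted(latest)]


cox_rows = last_row_per_customer(records)
for row in cox_rows:
    print(row)
```

The four resulting rows match the one-row-per-customer table in the question. Since the stated goal is a next-month prediction for customers who are still active, the quantity of interest from a fitted Cox model would be the conditional probability $1 - S(t+1 \mid x) / S(t \mid x)$ for a customer with current tenure $t$, rather than $S(t \mid x)$ itself.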

You might, however, want to consider time = 0 as some fixed calendar date. See this answer and the linked reference to a thesis that used that approach instead for modeling insurance-customer churn. Then you could use tenure prior to that starting date as a predictor.

That's your choice depending on just what you want to model.

If you only have a small number of possible event times (e.g., monthly data over a year or so), you probably should be using discrete-time survival analysis. That can be set up as a logistic regression based on data for each individual at each at-risk time (to handle censoring; you evidently have data in that format already) and that includes time as a modeled covariate. This answer provides several links for study and to tools for setting up such data.
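To make the discrete-time setup concrete, here is a dependency-free sketch that fits a logistic regression to the person-period rows from the question, with tenure as the time covariate. Several things here are assumptions for illustration only: the linear effect of tenure, the hand-rolled gradient ascent (a stand-in for whatever statistics library you would normally use), and the helper names `features`, `hazard` and `fit`. Nine rows are far too few for meaningful estimates, so only the structure matters.

```python
import math

# Person-period rows from the question: one row per customer per at-risk month.
# Columns: (tenure in months, gender, churn during that month)
rows = [
    (6, 1, 1), (15, 1, 0), (12, 0, 0),
    (16, 1, 0), (13, 0, 0), (0, 1, 0),
    (17, 1, 0), (14, 0, 1), (1, 1, 0),
]


def features(tenure, gender):
    # Intercept, time (tenure in years; linear only, for brevity), gender.
    return [1.0, tenure / 12.0, float(gender)]


def hazard(beta, x):
    """Discrete-time hazard: P(churn this month | still a customer)."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, data):
    total = 0.0
    for tenure, gender, churn in data:
        p = hazard(beta, features(tenure, gender))
        total += math.log(p) if churn else math.log(1.0 - p)
    return total


def fit(data, steps=5000, lr=0.05):
    """Maximize the logistic log-likelihood by plain gradient ascent."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for tenure, gender, churn in data:
            x = features(tenure, gender)
            err = churn - hazard(beta, x)
            for j in range(len(beta)):
                grad[j] += err * x[j]
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


beta = fit(rows)

# Customer 2 has tenure 17 in the latest month, so next month's row would
# have tenure 18 (gender 1). The fitted hazard for that row is the
# predicted next-month churn probability.
p_next = hazard(beta, features(18, 1))
print("coefficients:", beta)
print("next-month churn probability for customer 2:", p_next)
```

A convenient property of this formulation is that the fitted hazard for a customer's next at-risk month is directly the next-month churn probability, conditional on the customer still being active. With more data, the time term could be made more flexible (for example, one indicator per month or a spline) instead of the linear term assumed here.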

Finally, this will be most reliable if the "churn" is an active event, like the refusal to renew an insurance policy. If it's just that you haven't seen the customer in a long time, at which point you call it a "churn," then you might need to model this more subtly.
