I'm working on a research project whose objective is to predict each customer's probability of churning in the next month. We have a dataset of monthly records for each customer, with variables including (the list below is not exhaustive):

- `month`: month
- `customer_id`: customer ID
- `tenure`: number of months the customer has stayed
- `gender`: whether the customer is a male or a female
- `churn`: whether the customer churned or not

A part of the dataset looks like:

| | month | customer_id | tenure | gender | … | churn |
|---|---|---|---|---|---|---|
| 1 | 2022-01 | 1 | 6 | 1 | … | 1 |
| 2 | 2022-01 | 2 | 15 | 1 | … | 0 |
| 3 | 2022-01 | 3 | 12 | 0 | … | 0 |
| 4 | 2022-02 | 2 | 16 | 1 | … | 0 |
| 5 | 2022-02 | 3 | 13 | 0 | … | 0 |
| 5 | 2022-02 | 4 | 0 | 1 | … | 0 |
| 6 | 2022-03 | 2 | 17 | 1 | … | 0 |
| 7 | 2022-03 | 3 | 14 | 0 | … | 1 |
| 8 | 2022-03 | 4 | 1 | 1 | … | 0 |

Currently, I have problems with model selection and data preparation.

**Problem 1: should I choose a CoxPH model (Cox proportional hazards model) or a logistic regression model?**

**CoxPH**: the `tenure` variable can be considered as the *time to event* (churn), and we can also easily determine whether a record is censored. Then, with the survival function $S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}$, we can calculate the probability that a customer survives (does not churn) beyond time $t$.
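A minimal sketch of that calculation, assuming the baseline survival $S_0(t)$ and the coefficients $\beta$ have already been estimated. The baseline values, the coefficient, and the helper name `survival_prob` below are hypothetical and only illustrate the formula. The next-month churn probability follows from the survival function as $1 - S(t+1 \mid x) / S(t \mid x)$.

```python
import math
from typing import Sequence


def survival_prob(baseline_survival: float, x: Sequence[float], beta: Sequence[float]) -> float:
    """Evaluate S(t | x) = S0(t) ** exp(x . beta) for one customer."""
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    return baseline_survival ** math.exp(linear_predictor)


# Hypothetical baseline survival S0(t) at tenure t (in months); not estimated from real data.
baseline = {12: 0.80, 13: 0.78}
beta = [0.3]  # hypothetical coefficient for gender
x = [1.0]     # covariates of one customer (gender = 1)

s12 = survival_prob(baseline[12], x, beta)
s13 = survival_prob(baseline[13], x, beta)

# Probability of churning during the next month, given survival up to month 12.
next_month_churn = 1.0 - s13 / s12
print(s12, s13, next_month_churn)
```

With all covariates at zero the function returns the baseline survival itself, which is a quick sanity check on any fitted model.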

**Logistic regression**: logistic regression also seems suitable for this case. `tenure` would be an explanatory variable and `churn` would be the target variable.

**Problem 2: how should I prepare data for a model?**

If we choose Cox regression, we need to select only one row (maybe the last one) for each individual customer. So that would look like:

| | month | customer_id | tenure | gender | … | churn |
|---|---|---|---|---|---|---|
| 1 | 2022-01 | 1 | 6 | 1 | … | 1 |
| 6 | 2022-03 | 2 | 17 | 1 | … | 0 |
| 7 | 2022-03 | 3 | 14 | 0 | … | 1 |
| 8 | 2022-03 | 4 | 1 | 1 | … | 0 |
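A minimal sketch of this reduction, using only the columns shown in the example table and the standard library. The helper name `last_row_per_customer` is made up for illustration, and the sketch assumes the `month` values are `YYYY-MM` strings (so they sort chronologically) and that each customer has at most one row per month.

```python
# Monthly records from the example table: (month, customer_id, tenure, gender, churn).
records = [
    ("2022-01", 1, 6, 1, 1),
    ("2022-01", 2, 15, 1, 0),
    ("2022-01", 3, 12, 0, 0),
    ("2022-02", 2, 16, 1, 0),
    ("2022-02", 3, 13, 0, 0),
    ("2022-02", 4, 0, 1, 0),
    ("2022-03", 2, 17, 1, 0),
    ("2022-03", 3, 14, 0, 1),
    ("2022-03", 4, 1, 1, 0),
]


def last_row_per_customer(rows):
    """Keep the most recent monthly record for each customer_id."""
    latest = {}
    # "YYYY-MM" strings sort chronologically, so later months overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r[0]):
        latest[row[1]] = row
    return [latest[customer_id] for customer_id in sorted(latest)]


cox_rows = last_row_per_customer(records)
for month, customer_id, tenure, gender, churn in cox_rows:
    # tenure is the (possibly censored) duration; churn == 0 marks a censored customer.
    print(month, customer_id, tenure, gender, churn)
```

With pandas, the same reduction could be written as `df.sort_values("month").groupby("customer_id").tail(1)`.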

If we choose logistic regression, we fit the model with all data rows (every month for every customer).

Am I thinking correctly about the problems?

## Best Answer

As time and censoring are important, this is clearly a survival-model situation. You have to decide what you want to choose as `time = 0` for the model.

If you want to model `tenure` as an outcome, then you would effectively set `time = 0` to the time that each individual started as a customer by using `tenure` as the (potentially censored) outcome in a survival model, as you propose for a Cox model. If no covariate values change with time and no customer departs and returns, then you can use just the last observed `tenure` value along with a censoring indicator as the outcome in a Cox (or other proportional-hazards) model.

You might, however, want to consider `time = 0` as some fixed calendar date. See this answer and the linked reference to a thesis that used that approach instead for modeling insurance-customer churn. Then you could use tenure *prior to that starting date* as a predictor.

That's your choice depending on just what you want to model.

If you only have a small number of possible event times (e.g., monthly data over a year or so), you probably should be using discrete-time survival analysis. That can be set up as a logistic regression based on data for each individual at each at-risk time (to handle censoring; you evidently have data in that format already) and that includes time as a modeled covariate. This answer provides several links for study and to tools for setting up such data.

Finally, this will be most reliable if the "churn" is an active event, like the refusal to renew an insurance policy. If it's just that you haven't seen the customer in a long time, at which point you call it a "churn," then you might need to model this more subtly.
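To make the discrete-time idea concrete, here is a minimal sketch on made-up person-period data (one row per customer per at-risk month; the customer labels and values are hypothetical). It computes the empirical monthly hazard and the implied survival curve. A logistic regression on the same rows with a separate indicator for each month and no other covariates would reproduce these hazards; in practice you would fit the logistic model with time and your other predictors instead.

```python
from collections import defaultdict

# Hypothetical person-period rows: (customer_id, tenure_month, churn).
person_period = [
    ("a", 1, 0), ("a", 2, 0), ("a", 3, 1),
    ("b", 1, 0), ("b", 2, 1),
    ("c", 1, 0), ("c", 2, 0), ("c", 3, 0),  # censored after month 3
    ("d", 1, 1),
]

at_risk = defaultdict(int)
events = defaultdict(int)
for _, month, churn in person_period:
    at_risk[month] += 1
    events[month] += churn

# Discrete-time hazard: P(churn in month t | still a customer at the start of month t).
hazard = {t: events[t] / at_risk[t] for t in sorted(at_risk)}

# Survival through month t is the running product of (1 - hazard).
survival = {}
running = 1.0
for t, h in hazard.items():
    running *= 1.0 - h
    survival[t] = running

for t in hazard:
    print(t, hazard[t], survival[t])
```

Censored customers simply stop contributing rows after their last observed month, which is how this data layout handles censoring.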