Survival Analysis – How to Design a Survival Analysis Study

churn, cox-model, kaplan-meier, predictive-models, survival

I'm struggling to understand what the correct start date would be for my analysis.

I have cross-sectional data for an insurance company and the goal is to perform survival analysis to understand churn behaviour. Each record contains information for one insurance policy (so if the policy gets renewed, the record is the same, but if the terms change, then the record is terminated and a new one is generated).

The two alternatives that I've seen in research concerning this topic are:

  1. The first option would be to only consider records that correspond to policies bought after the start of the study (calendar time). The $T_0$ for each individual would be the start of the coverage. This is what's suggested in Gustafsson's thesis and this other paper.

  2. The second option would be to start the observation for each individual at the first renewal date after the calendar date set for the beginning of the study. This is how Fu and Wang address the problem, although they use a panel data approach.

Any clarification and comments regarding what approach makes more sense given the task at hand would be greatly appreciated!
Thanks!

Best Answer

The danger in Option 2, as described in this question, is that your data only include those who chose to renew at least once.

The limitation to Option 1, as described in this question, is that you throw away information about those who already had policies at the calendar date of the study.

The thesis you cite, however, uses a third option: it analyzes data over a fixed calendar-time window, from "31.01.2005,* time of origin $t_0$, and the last at the date 31.12.2007" (page 30), not a customer-specific starting time. It includes only customers with policies in effect at that calendar-fixed start time, treating them as left truncated. From page 9 of the thesis:

"Left truncation is present if subjects have been exposed to the risk of having the event [churn] before participating in the study, e.g. if a customer had a focus product before time of origin."

All cases in that analysis are potentially right censored: a customer who had not churned by the end date of 31.12.2007 is known only to have survived at least that long. The thesis took the prior length of customer relationships into account by including that prior relationship time, measured as of the calendar-fixed start time, as a covariate.
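As a rough illustration of that fixed-window design, the sketch below turns one policy's dates into a study record. It is not code from the thesis: the function name `to_study_record`, the field names, and the convention that `policy_end=None` means "still in force" are all assumptions made for the example. The start date is the one quoted above; the footnote explains why 01.01.2005 may have been intended.

```python
from datetime import date

# Calendar-fixed study window (start date as quoted from the thesis).
STUDY_START = date(2005, 1, 31)
STUDY_END = date(2007, 12, 31)


def to_study_record(policy_start, policy_end=None):
    """Map one policy onto the fixed calendar window.

    Hypothetical helper: policy_end=None is assumed to mean the policy
    had not been terminated when the data were extracted.
    Returns None for policies not in force at STUDY_START.
    """
    # Option-1-style policies (bought after the window opens) are excluded here.
    if policy_start > STUDY_START:
        return None
    # Policies that churned before the window opened are never observed.
    if policy_end is not None and policy_end <= STUDY_START:
        return None

    churned = policy_end is not None and policy_end <= STUDY_END
    exit_date = policy_end if churned else STUDY_END  # right censor at STUDY_END

    return {
        # Exposure before the window, carried as a covariate.
        "prior_tenure_days": (STUDY_START - policy_start).days,
        # Follow-up time measured from the calendar-fixed origin.
        "duration_days": (exit_date - STUDY_START).days,
        # 1 = churn observed inside the window, 0 = right censored.
        "event": int(churned),
    }


churner = to_study_record(date(2003, 1, 31), date(2006, 1, 31))
survivor = to_study_record(date(2004, 1, 31))
late_buyer = to_study_record(date(2005, 6, 1))
```

Here `churner` has an observed event after 365 days of follow-up, `survivor` is right censored at the end of the window, and `late_buyer` is dropped because the policy was not in force at the study start.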

So there isn't a single "correct" answer to what the starting time should be. If your data could be worked into panel-type data as analyzed in that thesis, you could similarly use a fixed calendar date as the starting time. Alternatively, you could choose customer-specific first-policy starting times, but then you would have to account for the corresponding implications for data truncation and censoring (succinctly summarized in Section 2.5.2 of the thesis, and described in detail by Klein and Moeschberger), and decide how to take things like the date of first policy and subsequent customer history into account.
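To make the truncation issue concrete for customer-specific starting times, here is a minimal sketch of a product-limit (Kaplan-Meier) estimate with delayed entry. It assumes each policy is described by a hypothetical `(entry, exit, event)` triple on the tenure time scale, where `entry` is the tenure at which the policy first came under observation. A policy counts as at risk at time `t` only if `entry < t <= exit`, so customers are not credited with survival over periods in which their churn could not have been recorded. This is the standard textbook adjustment, not necessarily what any of the cited studies did.

```python
def km_left_truncated(records):
    """Product-limit survival estimate with delayed entry.

    records: list of (entry, exit, event) tuples on the tenure scale,
             with entry < exit; event is 1 for churn, 0 for censoring.
             (Tuple layout is an assumption for this sketch.)
    Returns a list of (time, survival) pairs at each distinct churn time.
    The estimate is conditional on surviving to the earliest entry time.
    """
    event_times = sorted({exit_ for _, exit_, event in records if event})
    survival = 1.0
    curve = []
    for t in event_times:
        # Risk set: already under observation and not yet exited at t.
        at_risk = sum(1 for entry, exit_, _ in records if entry < t <= exit_)
        churned = sum(1 for _, exit_, event in records if event and exit_ == t)
        survival *= 1.0 - churned / at_risk
        curve.append((t, survival))
    return curve


# Two policies observed from tenure 0, two first observed at tenure 3.
records = [(0, 2, 1), (0, 4, 0), (3, 5, 1), (3, 6, 0)]
curve = km_left_truncated(records)
```

In this toy data the two late entrants are excluded from the risk set at tenure 2 and only join it afterwards. Ignoring the entry times would instead count them as at risk from tenure 0 and overstate early survival.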

Which choice is better probably depends on details specific to your subject matter and how you propose to use the results of your study. My sense (as someone with the same homeowner's insurance policy for 35 years and auto policy for nearly 50) is that customer-specific starting dates and covariate values as of those starting dates might put too much emphasis on ancient history that isn't relevant to current-day business decisions.


*I suspect that's a typo and was supposed to be 01.01.2005, as the study seems to evaluate 36 months of data.
