Solved – Is oversampling done for Cox regression data?

cox-model, oversampling, sampling, survival

I have a dataset of about 48,000 people, about 40,000 of whom die before the end of the study and get failed = 1; the remaining 8,000 have failed = 0 because they are either lost to follow-up or survive beyond the study period, and I have no further information about them.

Because there is an imbalance in the distribution of data points between the two classes (failed = 1 and failed = 0), does oversampling/undersampling have to be done before performing the Cox regression? In most practical cases the number of deaths is far greater than the number of survivors, but I have not come across any models where such methods are employed.

In this case, for instance, would it be valid to take about 1,000 from each of the two classes and perform the Cox regression on those 2,000 data points?

Best Answer

Original Answer: There is no benefit to under- or over-sampling either group

New Conclusion, see edit at end: There can be a benefit of sampling, but not random sampling of failures and non-failures.

You are forgetting the point of a Cox analysis. It is not to analyze who dies and who does not die; rather it is to study effects of covariates on the timing of death. A Cox analysis would be valid even if everyone died. If $Y$ is a failure time and $t$ is the time scale of the study, Cox studies the hazard rate, defined as:

$$ \lambda (t) = \lim_{\Delta t \to 0} \frac{ Pr ( t \leq Y < t + \Delta t \thinspace| \thinspace Y \ge t)}{\Delta t} $$

The fundamental unit is not the individual but the risk set at each failure time: those individuals who have not yet failed prior to that time, i.e. who have $Y \ge t$. Thus at the first failure, the risk set consists of the failure and all the others in the sample. The Cox partial likelihood compares covariate values for the failure and the others in the risk set. So an imbalance is built into the Cox analysis, but it is not a shortage of non-failures, just the reverse: the failure (or few failures) at each distinct failure time $t^*$ is compared to all those who have not yet failed by that time.
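To make the risk-set comparison concrete, here is a minimal Python sketch of the Cox log partial likelihood on toy data (the function name, the toy data, and the assumption of no tied failure times are mine, not part of the answer). At each failure, the failure's covariate is compared against everyone still at risk:

```python
import math

# Toy data: (time, failed, x) per subject; x is a single covariate.
data = [(2.0, 1, 0.5), (3.0, 0, 1.0), (4.0, 1, -0.3),
        (5.0, 1, 0.8), (6.0, 0, 0.1)]

def cox_log_partial_likelihood(beta, data):
    """Log partial likelihood: at each failure time, compare the failure's
    covariate to everyone still at risk (those with time >= failure time)."""
    ll = 0.0
    for t_i, failed, x_i in data:
        if not failed:
            continue  # censored subjects contribute only through risk sets
        risk_set = [x for (t, _, x) in data if t >= t_i]
        ll += beta * x_i - math.log(sum(math.exp(beta * x) for x in risk_set))
    return ll
```

Note that censored subjects never contribute a term of their own; they matter only as members of the risk sets of failures that occur before their censoring time, which is why "class balance" is not the relevant concept here.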

In some, perhaps most, studies, there is a shortage of failures, with most individuals still alive at the end of follow-up. In such studies, it pays to sample non-failures at each risk set. This is called "risk-set sampling"; see Langholz and Goldstein, 1996. You don't have this problem.
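Risk-set sampling can be sketched as follows (a hypothetical illustration, not code from the reference): for each failure, keep the case plus a small number of controls drawn at random from the others still at risk at that failure time.

```python
import random

def risk_set_sample(data, m, seed=0):
    """For each failure, keep the case plus m controls sampled without
    replacement from the others still at risk at the failure time.
    data: list of (time, failed, subject_id) tuples."""
    rng = random.Random(seed)
    matched_sets = []
    for t_i, failed, sid in data:
        if not failed:
            continue
        controls = [s for (t, _, s) in data if t >= t_i and s != sid]
        chosen = rng.sample(controls, min(m, len(controls)))
        matched_sets.append((t_i, sid, chosen))  # one matched set per failure
    return matched_sets
```

The sampled matched sets are then analyzed with a conditional (stratified) likelihood, one stratum per matched set, rather than with the full-cohort partial likelihood.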

There might be a benefit to sampling for two reasons, but not to under- or over-sampling each group. The first reason is economic: studying all individuals may be too costly. This could happen, for example, if you need to examine physical records for each subject. A second reason for sampling arises from your very large number of failures: if you split your sample, you can explore and develop models on one part and validate them on the other.
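The split-sample strategy is just a random partition of subjects into a development set and a validation set; a minimal sketch (function name and split fraction are mine):

```python
import random

def split_sample(ids, frac_dev=0.5, seed=42):
    """Randomly split subject ids into a development set (for model
    exploration) and a held-out validation set (for checking the model)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac_dev)
    return shuffled[:cut], shuffled[cut:]
```

With roughly 48,000 subjects, even a 50/50 split leaves ample failures in each half for both exploration and validation.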

Reference

Langholz, Bryan, and Larry Goldstein. 1996. Risk set sampling in epidemiologic cohort studies. Statistical Science 11(1): 35-53. Available at: http://projecteuclid.org/euclid.ss/1032209663

Edit December 8:

In the original version, I stated that there were two reasons in which sampling might be beneficial: to reduce economic burden and as part of a modeling strategy. I would add computational burden as a reason, a likely one with a sample as large as yours.

I didn't say how you might do the sampling. I would suggest a two-stage procedure: 1) sample failures; and 2) sample risk sets for those failures. Therefore the Langholz reference may well be relevant to your study.
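The two-stage procedure above can be sketched like this (a hypothetical illustration under my own naming; stage 1 samples failures, stage 2 samples the risk set of each sampled failure):

```python
import random

def two_stage_sample(data, n_failures, m_controls, seed=1):
    """Stage 1: randomly sample n_failures of the failures.
    Stage 2: for each sampled failure, draw m_controls from those still
    at risk at its failure time.
    data: list of (time, failed, subject_id) tuples."""
    rng = random.Random(seed)
    failures = [rec for rec in data if rec[1] == 1]
    stage1 = rng.sample(failures, min(n_failures, len(failures)))
    matched_sets = []
    for t_i, _, sid in stage1:
        at_risk = [s for (t, _, s) in data if t >= t_i and s != sid]
        controls = rng.sample(at_risk, min(m_controls, len(at_risk)))
        matched_sets.append({"time": t_i, "case": sid, "controls": controls})
    return matched_sets
```

Because only a subset of failures enters the analysis, the resulting matched sets are far smaller than the full cohort, which is what reduces the economic and computational burden.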

In Stata, the command sttocc can perform risk-set sampling. Apparently the Epi package in R can also do so.

As part of a modeling strategy, you might stratify the entire sample into natural groups, such as gender. You can fit separate models to each or treat them as strata for the Cox analysis. In either case, the risk sets will be smaller, and the computational burden will be lessened.

Added, off topic: Even with exact dates for entry and exit, you are likely to have heavily tied observation times. In that case, you might consider discrete-time or grouped-data models. See, e.g.

Jenkins, S. P. (1995). Easy estimation methods for discrete-time duration models. Oxford Bulletin of Economics and Statistics 57(1): 129-138.

Prentice, R. and L. Gloeckler. (1978). Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34: 57-67.

There are Stata-specific examples at https://www.iser.essex.ac.uk/resources/survival-analysis-with-stata
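The discrete-time approach in the Jenkins reference rests on expanding the data to person-period form: one record per interval a subject is at risk, with the event indicator equal to 1 only in the final interval of a failure; an ordinary binary-response regression (e.g. with a complementary log-log link) on the expanded data then estimates the discrete-time hazard. A minimal sketch of the expansion (my own function name and record layout):

```python
def person_period(subjects):
    """Expand (id, last_interval, failed) records into person-period rows
    (id, interval, event): one row per interval the subject is at risk,
    with event = 1 only in the final interval of a subject who failed."""
    rows = []
    for sid, last, failed in subjects:
        for j in range(1, last + 1):
            rows.append((sid, j, int(bool(failed) and j == last)))
    return rows
```

A subject who fails in interval 3 contributes three rows with events 0, 0, 1; a subject censored in interval 2 contributes two rows, both with event 0.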
