Survival Analysis – Choosing Time Origin When No Specific Event or Study Begin Can Be Selected


I'm currently struggling with the choice of time origin for survival analysis in my data.

My data come from an ongoing clinical database of patients who all have the same genetic disease. It contains several variables: the birth date of each participant, binary variables for a number of different clinical manifestations of the disease, a second variable for each manifestation giving the age at diagnosis, the date of enrollment in the database, and the date of the last record entry.

My intention was to do a prognostic model using Cox regressions where one of the disease manifestations would be the outcome. The binary variables of relevant clinical manifestations of the disease would have been used as predictors.

I intended to create a "time-to-event" variable for which the beginning of the follow-up would be the date of enrollment in the database and the end would be either the age of diagnosis of the outcome if the event happens or the date of the last record entry if censored.

What I didn't realize is that this absolutely doesn't work. Since it's an ongoing database recruiting patients of every age with the disease, many of them had been diagnosed with the outcome well before enrolling in the database, so I can't use the date of enrollment as the beginning of follow-up.

Since it's my first time working with survival analysis outside of a class context, I'm struggling to find an acceptable time origin. Usually, it's the moment when the study begins, or a specific event that applies to all participants, but I can't seem to find one for this specific data.

I'm guessing it wouldn't be a good decision to choose either birth date or a specific age before the outcome is first diagnosed among participants, but I'm too much of a beginner to know what problems could arise from this.

As of now, I haven't found good information on the subject, as time of origin always seems to be a given, so if any of you could help me, or point me towards good information on the subject, that would be appreciated!

Best Answer

With a genetic disease, the risk is presumably present from birth, or from some later developmental landmark, like puberty, after which the risk becomes noticeable. From that perspective, you could use your understanding of the subject matter to choose such a time as the time origin (time = 0) for your study. In principle there's nothing wrong with that. The event or right-censored survival times then become the elapsed time since that time origin (e.g., years since birth or since puberty).
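As a minimal sketch of that conversion, assuming birth is chosen as the time origin: the function below turns a birth date, a last-record date, and an optional age at diagnosis into an exit age and an event indicator. The function and argument names (`age_time_scale`, `dx_age`) and the 365.25-day year are illustrative assumptions, not part of the original data description.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length, good enough for ages


def years_between(start: date, end: date) -> float:
    """Elapsed time between two dates, in years."""
    return (end - start).days / DAYS_PER_YEAR


def age_time_scale(birth: date, last_record: date, dx_age=None):
    """Return (exit_age, event) with birth as the time origin.

    dx_age is the age at diagnosis of the outcome, or None if the
    outcome was never recorded (hypothetical argument name).
    """
    censor_age = years_between(birth, last_record)
    if dx_age is not None and dx_age <= censor_age:
        return dx_age, 1   # event observed at the diagnosis age
    return censor_age, 0   # right-censored at the last record entry


# Example: diagnosed at 30, versus never diagnosed and last seen at 40
diagnosed = age_time_scale(date(1980, 1, 1), date(2020, 1, 1), dx_age=30.0)
censored = age_time_scale(date(1980, 1, 1), date(2020, 1, 1))
```

This only sets the time scale. It does not yet account for the fact that patients enter observation at their enrollment age rather than at birth.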

This then becomes a standard situation with left-truncated survival times. From Klein and Moeschberger (2nd edition), Section 9.4:

The most common situation, where left-truncated data arises, is when the event time $X$ is the age of the subject and persons are not observed from birth but rather from some other time $V$ corresponding to their entry into the study.
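One way to make this concrete is through the risk set. Here $V_i$ is the entry age of subject $i$, as in the quote, and $T_i$ is notation introduced only for this illustration: the age at which subject $i$ leaves observation, through either the event or censoring. A subject then counts as at risk at age $t$ only between entry and exit:

$$R(t) = \{\, i : V_i < t \le T_i \,\}$$

Without truncation the condition would simply be $t \le T_i$; the extra requirement $V_i < t$ keeps a subject out of the risk set at ages before they entered the study.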

Left-truncated survival times need special care because an individual who enters after time = 0 "was not at risk for an observable [event]" until the study entry time (Therneau and Grambsch, Section 3.7.3). Such data can be handled with the Surv(startTime, stopTime, event) counting-process data format in the R survival package, which treats startTime as the left-truncation time in a Cox model. This format also handles the time-varying covariates that you have.
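To illustrate the data layout only (not the model fit), here is a rough sketch that builds counting-process rows on the age scale for one patient. It assumes birth as the time origin, one outcome and a single binary predictor manifestation, and it drops patients whose outcome was diagnosed before enrollment, since they were never observed while at risk. That exclusion rule, the function name `counting_process_rows`, and all argument names are assumptions made for this example.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length


def counting_process_rows(birth, enrolled, last_record,
                          outcome_age=None, predictor_age=None):
    """Build (start, stop, event, predictor) rows on the age scale.

    start of the first row is the enrollment age (left truncation).
    The follow-up is split at predictor_age so that the binary
    predictor is 0 before its diagnosis and 1 afterwards.
    """
    entry = (enrolled - birth).days / DAYS_PER_YEAR
    exit_age = (last_record - birth).days / DAYS_PER_YEAR
    event = 0

    if outcome_age is not None:
        if outcome_age <= entry:
            return []  # outcome predates enrollment: no at-risk follow-up
        if outcome_age <= exit_age:
            exit_age, event = outcome_age, 1
    if exit_age <= entry:
        return []      # no follow-up time after enrollment

    if predictor_age is None or predictor_age >= exit_age:
        return [(entry, exit_age, event, 0)]   # predictor never seen in follow-up
    if predictor_age <= entry:
        return [(entry, exit_age, event, 1)]   # predictor present at entry
    # Predictor appears during follow-up: split into two intervals
    return [(entry, predictor_age, 0, 0),
            (predictor_age, exit_age, event, 1)]


# Enrolled at age 20, predictor diagnosed at 25, outcome diagnosed at 35
rows = counting_process_rows(date(1980, 1, 1), date(2000, 1, 1),
                             date(2020, 1, 1),
                             outcome_age=35.0, predictor_age=25.0)
```

Each resulting row corresponds to one (startTime, stopTime, event) triple in the format described above, with the predictor value that held during that interval.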

I worry a bit about your outcome criterion:

My intention was to do a prognostic model using Cox regressions where one of the disease manifestations would be the outcome. The binary variables of relevant clinical manifestations of the disease would have been used as predictors.

With multiple manifestations of the disease you have to be very careful. This might better be modeled as a multiple-risk model instead of focusing on one particular manifestation as the outcome. Work closely with clinical experts about just what should be modeled as outcome(s) and try to get some more experienced statistical help to deal with the problems raised by left truncation and multiple potential outcomes.