Survival Analysis – How to Perform Causal Survival Analysis in Large Prospective Cohorts Treating Incident Events as Exposure

cox-model r survival

I'm currently working on a large prospective cohort with basic demographic characteristics and various socioeconomic factors collected at baseline. The cohort was followed up for about 10 years from baseline entry, and disease incidence and cause-specific deaths were recorded. The aim is to estimate the causal effect of incident diabetes (excluding prevalent cases) on cancer risk. I wonder how to implement this in my modelling: should I treat incident diabetes as a time-updated variable, or as a fixed variable (so the index time for the exposure group would be the date of diabetes diagnosis)? I plan to use the Cox proportional hazards regression model, and the software is R.

I have tried treating incident diabetes as a fixed variable (dichotomous, YES/NO). Specifically, for the reference group (i.e., those without incident diabetes during follow-up), the 'study entry date' was defined as 'time zero', while for the exposure group (i.e., those who developed diabetes during follow-up), the 'diabetes diagnosis date' was defined as 'time zero'. Event status was 1 if they developed cancer during follow-up, and 0 if not. I used age (from time zero) as the underlying time-scale in the Cox proportional hazards regression model and adjusted for a set of potential confounders. The results were okay and consistent with my expectations overall, but I think there are at least three problems with this method:

1. Because incident diabetes was diagnosed during follow-up, the average follow-up time for the exposure group was obviously shorter than for the reference group. Since cancer is a rare event, I wonder whether this would influence the effect estimation.
2. How should I treat individuals who developed cancer before a diabetes diagnosis? Exclude them from the analyses or put them in the reference group? If the former, would it result in an overestimation of the effect? If the latter, would it result in an underestimation of the effect (because of reverse causality)?
3. The exposure group is obviously older on average than the reference group (at time zero). Even though I used age as the time-scale, would that cause any bias? Do I need to further adjust for age in my Cox model?

Best Answer

A few general points about this study:

Time-varying vs fixed variable: Treating incident diabetes as a fixed variable simplifies the model but does not capture the fact that exposure status changes during follow-up. If individuals develop diabetes at different times during the follow-up period, treating it as fixed from the study's start misclassifies their exposure status for part of the follow-up. Treating incident diabetes as a time-varying variable allows exposure status to change over time, so that each person contributes unexposed time at risk before diagnosis and exposed time at risk after it.
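As a rough illustration of what "time-varying" means for the data layout, here is a minimal sketch in plain Python (standard library only). The function name `split_follow_up` and the toy values are hypothetical, and times are assumed to be on one common scale such as years since cohort entry. Each person is cut into an unexposed interval before diagnosis and an exposed interval after it.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float, int, int]  # (start, stop, diabetes, cancer_event)

def split_follow_up(entry: float, exit_: float, cancer: bool,
                    diabetes_time: Optional[float]) -> List[Row]:
    """Return counting-process rows for one person (hypothetical helper)."""
    # Never diagnosed, or diagnosed only at/after the end of follow-up:
    # the whole follow-up is unexposed person-time.
    if diabetes_time is None or diabetes_time >= exit_:
        return [(entry, exit_, 0, int(cancer))]
    # A diagnosis at or before entry is a prevalent case, excluded in the question.
    if diabetes_time <= entry:
        raise ValueError("prevalent diabetes: excluded by design")
    # Unexposed until diagnosis (no event in that interval), exposed afterwards.
    return [
        (entry, diabetes_time, 0, 0),
        (diabetes_time, exit_, 1, int(cancer)),
    ]

rows_a = split_follow_up(0.0, 10.0, False, None)  # never diabetic, censored at 10
rows_b = split_follow_up(0.0, 8.0, True, 3.0)     # diabetes at 3, cancer at 8
rows_c = split_follow_up(0.0, 4.0, True, 6.0)     # cancer at 4, before any diabetes
```

In R, this is the same long layout that `survival::tmerge` builds and that `coxph` accepts as `Surv(tstart, tstop, event)`. Note that the third person simply contributes an unexposed interval ending in a cancer event.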

Time-Zero Definitions: Defining time zero differently for the reference and exposure groups may lead to immortal time bias, which occurs when the exposure group has a period during which the outcome (cancer) cannot occur by design. If cancer occurrence is being measured from the diagnosis of diabetes forward, there is a period for those who will develop diabetes where they cannot be counted as a case, which artificially lowers the risk in this group. To avoid this, time zero should typically be defined consistently across all participants to ensure that all individuals are followed up from the same starting point (e.g., age at entry into the cohort or the study start date).
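To see how much the handling of pre-diagnosis person-time can matter, here is a small sketch with invented numbers (six hypothetical people, times in years since entry). It compares crude cancer rate ratios under three allocations: time-varying exposure, counting all follow-up of ever-diabetic people as exposed, and dropping the pre-diagnosis time altogether (as when time zero is moved to the diagnosis date). It uses crude rates, not a Cox model, and the values are made up purely for illustration.

```python
# (exit_time, diabetes_time or None, cancer_at_exit) -- invented toy values.
# Every diagnosed person here is diagnosed before their exit time.
cohort = [
    (10.0, None, False),
    (6.0, None, True),
    (10.0, 4.0, False),
    (8.0, 5.0, True),
    (10.0, None, False),
    (9.0, 6.0, False),
]

def rate_ratio(cohort, mode):
    """Crude rate ratio (exposed vs unexposed) under a given person-time allocation."""
    person_time = {0: 0.0, 1: 0.0}
    events = {0: 0, 1: 0}
    for exit_time, dm_time, cancer in cohort:
        if dm_time is None:
            person_time[0] += exit_time
            events[0] += int(cancer)
        elif mode == "time_varying":
            # Pre-diagnosis time is unexposed, post-diagnosis time is exposed.
            person_time[0] += dm_time
            person_time[1] += exit_time - dm_time
            events[1] += int(cancer)
        elif mode == "ever_exposed":
            # All follow-up counted as exposed, including event-free pre-diagnosis time.
            person_time[1] += exit_time
            events[1] += int(cancer)
        elif mode == "drop_pre_diagnosis":
            # Time zero moved to diagnosis; pre-diagnosis time is discarded.
            person_time[1] += exit_time - dm_time
            events[1] += int(cancer)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return (events[1] / person_time[1]) / (events[0] / person_time[0])

rr_time_varying = rate_ratio(cohort, "time_varying")
rr_ever_exposed = rate_ratio(cohort, "ever_exposed")
rr_drop = rate_ratio(cohort, "drop_pre_diagnosis")
```

With these toy values the three ratios come out at roughly 3.42, 0.96 and 2.17, so the same six people give quite different answers depending only on where the pre-diagnosis years are put.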

Event Status: It's appropriate to use binary coding for the event status. However, for individuals who do not develop cancer, their data should be censored at the last follow-up time they were known to be cancer-free.

Underlying Time-Scale: Using age as the time scale is a common approach because it adjusts for age automatically, which is a strong risk factor for many chronic diseases. However, ensure that this does not conflict with your definition of time zero and that age is appropriately accounted for at the start of risk time.

Potential Confounders: It is crucial to adjust for potential confounders in the analysis. Ensure that the set of confounders you're adjusting for includes all known risk factors for cancer that could also be associated with diabetes. It is good practice to identify the potential confounders by using a causal diagram such as a DAG. This will help to ensure that you don't make errors in selecting the set of potential confounders to use. A good, free tool for this is DAGitty. For some examples of using a DAG to reduce bias, see: How do DAGs help to reduce bias in causal inference?

Cox Proportional Hazards Assumption: If the assumption is violated, consider stratification or time-varying covariates as potential solutions.

Regarding your specific questions:

1. Because incident diabetes was diagnosed during follow-up, the average follow-up time for the exposure group was obviously shorter than for the reference group. Since cancer is a rare event, I wonder whether this would influence the effect estimation.

Yes, the fact that incident diabetes was diagnosed during follow-up and that the exposure group has a shorter average follow-up time than the reference group could indeed influence the effect estimation in several ways:

Differential Follow-Up: If the group with incident diabetes has a shorter follow-up time, there is less opportunity for these individuals to develop cancer within the study period. This could lead to an underestimation of the association between diabetes and cancer risk due to differential opportunity for the event to occur (cancer).

Time-Related Bias: The shorter follow-up for the exposure group can introduce bias, especially if you are not using diabetes as a time-dependent covariate. This can lead to time-related biases such as immortal time bias, where individuals are not at risk of the event during the "immortal time" period — the time before they develop diabetes.

Competing Risks: Individuals with diabetes may have a higher mortality from causes other than cancer, which could result in a lower observed incidence of cancer simply because the risk of death from other causes removes them from the risk set before they can develop cancer (competing risks).

Severity and Treatment of Diabetes: Individuals who develop diabetes during the follow-up may be at different stages of the disease and may have different treatment regimes, which can influence cancer risk and further complicate comparisons.

To mitigate these issues, consider the following adjustments:

  • Use diabetes as a time-dependent variable: This method accounts for the fact that diabetes status changes over time and that the risk of developing cancer changes after the diagnosis of diabetes.

  • Ensure that individuals are censored appropriately at the time of their last follow-up. This will account for differing lengths of follow-up and the time of cancer diagnosis relative to diabetes diagnosis.

  • Statistical techniques such as weighting individuals by their follow-up time can help adjust for the differences in the lengths of follow-up.

  • Competing Risk Analysis: Consider using methods that account for competing risks, such as the Fine-Gray Subdistribution Hazard Model (Fine & Gray, 1999), especially if mortality from causes other than cancer is high.
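To make the competing-risks point concrete, here is a minimal sketch in plain Python. It is not the Fine-Gray regression model itself but its nonparametric counterpart, the Aalen-Johansen cumulative incidence estimate, compared with the naive "1 minus Kaplan-Meier" that treats deaths from other causes as censoring. The status coding (0 = censored, 1 = cancer, 2 = death from another cause) and the toy data are assumptions for illustration.

```python
def cumulative_incidence(times, status, cause=1):
    """Aalen-Johansen cumulative incidence of `cause`.

    status: 0 = censored, 1 = cancer, 2 = death from another cause (assumed coding).
    Returns a list of (time, cumulative incidence) at each distinct observed time.
    """
    data = sorted(zip(times, status))
    n_at_risk = len(data)
    event_free = 1.0  # probability of no event of any type just before t
    cif = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [s for (tt, s) in data if tt == t]
        d_cause = sum(1 for s in tied if s == cause)
        d_any = sum(1 for s in tied if s != 0)
        cif += event_free * d_cause / n_at_risk
        event_free *= 1.0 - d_any / n_at_risk
        out.append((t, cif))
        n_at_risk -= len(tied)
        i += len(tied)
    return out

def one_minus_km(times, status, cause=1):
    """Naive estimate: competing events recoded as censoring, giving 1 - Kaplan-Meier."""
    recoded = [s if s == cause else 0 for s in status]
    return cumulative_incidence(times, recoded, cause)

# Invented toy data: six people, two cancers, two competing deaths, two censored.
times = [1, 2, 3, 4, 5, 6]
status = [2, 1, 0, 2, 1, 0]

final_aj = cumulative_incidence(times, status)[-1][1]
final_naive = one_minus_km(times, status)[-1][1]
```

On this toy data the naive estimate (0.6) is clearly higher than the Aalen-Johansen estimate (7/18, about 0.39), because the naive version pretends that people who died of other causes could still go on to develop cancer. For the regression version in R, `cmprsk::crr` and `survival::finegray` are the usual starting points.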

I would suggest conducting sensitivity analyses where you simulate what the effect of diabetes on cancer risk might be under different assumptions about the relationship between diabetes, follow-up time, and the risk of cancer.

2. How should I treat individuals who developed cancer before a diabetes diagnosis? Exclude them from the analyses or put them in the reference group? If the former, would it result in an overestimation of the effect? If the latter, would it result in an underestimation of the effect (because of reverse causality)?

When considering individuals who developed cancer before a diabetes diagnosis in a prospective cohort study, there are a few methodological options:

Exclude Pre-Diagnosis Cancer Cases: If your primary interest is in understanding the impact of diabetes on subsequent cancer risk, then individuals who had cancer occurrences before the onset of diabetes do not fit this exposure-outcome relationship.

Excluding these participants is generally appropriate because they do not meet the criteria for being at risk for the outcome (cancer) post-exposure (diabetes). It is unlikely to result in an overestimation of the effect; rather, it helps to ensure that the temporal sequence is correct (diabetes before cancer).

Include Pre-Diagnosis Cancer Cases in the Reference Group: By including them in the reference group, you are considering these individuals as non-exposed to the risk factor (diabetes) at the time of their cancer diagnosis. This approach could lead to an underestimation of the effect of diabetes on cancer risk because it might include individuals who have an underlying biological process that could eventually lead to diabetes after cancer diagnosis (reverse causality). The presence of cancer and its treatment may influence glucose metabolism, which could later contribute to the diagnosis of diabetes. Thus, their inclusion could dilute the contrast in risk between the non-diabetes and the post-diabetes cancer incidence.

Considering these points, it is generally preferable to exclude individuals who developed cancer before a diabetes diagnosis when the aim is to study the effect of diabetes on subsequent cancer risk. This exclusion criterion should be clearly stated in the study methodology to maintain transparency.

However, in excluding these individuals, you also have to be mindful of the potential for selection bias. If the occurrence of cancer is related to the likelihood of developing diabetes or being diagnosed with it (for example, due to increased medical surveillance), the exclusion of these cases could bias the results. This could happen if the fact of having cancer influences the probability of diabetes diagnosis or vice versa.

To handle this I would suggest that you conduct sensitivity analyses including these individuals in your reference group to see how much it affects your estimates. This would provide a range of estimates across different reasonable analytical scenarios. Also, assuming that you are going to write up your results, address these issues in the limitations part of the discussion, explaining the potential for reverse causality or selection bias.

It is important to align your methodological approach with your research question and consider the biological plausibility and temporal sequence of the exposure-outcome relationship. Reporting the results from both approaches and discussing the potential biases can provide a comprehensive understanding of the relationship between diabetes and cancer risk in your cohort.
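A minimal sketch of how the two analysis sets for such a sensitivity analysis could be built, in plain Python. The record layout, field names and values are hypothetical (times in years since cohort entry, `None` meaning never observed), and the exposed/reference labels follow the fixed-variable setup described in the question.

```python
# Invented toy records.
people = [
    {"id": 1, "diabetes_time": None, "cancer_time": 7.0},
    {"id": 2, "diabetes_time": 3.0, "cancer_time": 8.0},
    {"id": 3, "diabetes_time": 6.0, "cancer_time": 4.0},  # cancer before diabetes
    {"id": 4, "diabetes_time": 5.0, "cancer_time": None},
]

def cancer_before_diabetes(p):
    """True if both events were observed and cancer came first."""
    return (p["cancer_time"] is not None
            and p["diabetes_time"] is not None
            and p["cancer_time"] < p["diabetes_time"])

# Scenario A: exclude people whose cancer preceded their diabetes diagnosis.
excluded_analysis = [p["id"] for p in people if not cancer_before_diabetes(p)]

# Scenario B: keep them, but in the reference (unexposed) group.
reference_analysis = [
    (p["id"],
     "reference" if p["diabetes_time"] is None or cancer_before_diabetes(p)
     else "exposed")
    for p in people
]
```

Fitting the same model to both sets and reporting the two estimates side by side gives the range mentioned above.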

3. The exposure group is obviously older on average than the reference group (at time zero). Even though I used age as the time-scale, would that cause any bias? Do I need to further adjust for age in my Cox model?

Using age as the timescale in the Cox model is a well-established method to control for age as a confounder because it aligns the entry time for all individuals to their age at baseline and ends at the age at the event or censoring. This inherently adjusts for age, as the analysis compares individuals at the same chronological age.
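To illustrate what "comparing individuals at the same chronological age" means, here is a small sketch of how risk sets are formed when age is the timescale with delayed entry. The helper `risk_set` and the four people are hypothetical. At each event age, only people who have already entered and not yet left follow-up at that age are compared.

```python
def risk_set(people, event_age):
    """IDs under observation at `event_age`: entry_age < event_age <= exit_age."""
    return [pid for pid, entry_age, exit_age in people
            if entry_age < event_age <= exit_age]

# (id, age at start of observation, age at event or censoring) -- invented values
people = [
    ("A", 50.0, 60.0),
    ("B", 58.0, 66.0),
    ("C", 62.0, 70.0),
    ("D", 55.0, 59.0),
]

at_59 = risk_set(people, 59.0)  # C has not entered yet
at_65 = risk_set(people, 65.0)  # A and D have already left
```

In R, this corresponds to passing entry and exit ages to `Surv(age_entry, age_exit, event)`, where the column names are placeholders. Someone who enters late (for example at the age of diabetes diagnosis) is only compared with people who are under observation at that same age.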

However, if the exposure group is older on average than the reference group at their respective 'time zero' (age at diabetes diagnosis for the exposed and age at study entry for the unexposed), this could introduce bias if age has a non-linear effect on the outcome or if other age-related factors that could influence cancer risk are not evenly distributed across the study participants.

Despite using age as the timescale, there could still be residual confounding due to age-related factors if age effects vary over the age range. If the effect of age on the risk of cancer is not proportional across the age span of your cohort, this could lead to biased estimates. Age might have a different impact on cancer risk at younger ages compared to older ages, which might not be fully accounted for by simply using age as the timescale.

Also, if age interacts with other covariates that affect cancer risk and if the age distribution is significantly different between your exposure groups, this could lead to biased estimates if these interactions are not considered.

Finally, if the hazard ratio for diabetes changes with age, this could violate the proportionality assumption of the Cox model and potentially bias your results. To address this issue, you may consider:

  • stratification by age to try to ensure that the comparison between the diabetes and non-diabetes groups is made within strata of similar age distributions. This approach is particularly useful if you suspect that the hazard ratio varies with age.

  • if there are reasons to believe that the effect of diabetes on cancer risk changes with age, you can include an interaction term between diabetes status and a function of age (e.g., linear, polynomial) to allow the effect to vary with age.

  • even though using age as the timescale inherently adjusts for age, if there is substantial age imbalance between groups, you could further adjust for age by including it as a covariate in the model. This could be done by including age at time zero or by using a more complex function of age if the relationship is non-linear.

I would also suggest that you conduct sensitivity analyses with different methods of age adjustment to see how robust your findings are to these methods.

References: Fine, J. P., & Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94(446), 496-509.
