I have a question I would like to pose to the community. I have recently been asked to provide statistical analysis for a tumor marker prognostic study. I have primarily used these two references to guide my analysis:
- McShane LM, et al. Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst. 2005 Aug 17; 97(16):1180-4.
- Simon RM, et al. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May; 12(3):203-14. Epub 2011 Feb 15.
I have summarized the study and my analyses below. I would appreciate any comments, suggestions, or criticisms.
Study background:
Some patients with cancer X experience early relapse after treatment. The clinical prognostic score currently used by doctors does not do a good job of predicting clinical outcome in these patients. It would therefore be useful to identify biological prognostic markers that add value above and beyond this standard score. The goal of this study is to discover such a biomarker.
Study methods:
Pre-selection of candidate biomarkers
Twelve biomarkers associated with cancer X were identified in a previous study. We attempted to validate the association between these 12 candidates and cancer X in an independent sample of patients/tumors, described below.
Univariate validation of pre-selected candidate biomarkers
Levels of these biomarkers were measured in a set of 220 patients/tumors.
[Note: I have masked the data and made them available for public download as a *.csv file. The file has the following columns: “ID”, a unique identifier for each patient; “PS”, the prognostic score for each patient, with 1 indicating a good prognosis and 2 indicating a bad prognosis; “m1” to “m12”, levels of each tumor marker; “time”, in months; and “event”, where 0 indicates that the observation is censored and 1 indicates that treatment failure occurred.]
Univariable Cox regression models with time to death as the dependent variable were built for each of the 12 biomarkers (n = 220 observations, number of events = 91).
Marker  Risk  LCI   UCI   pValue
1       0.93  0.86  1.02  0.1088
2       0.93  0.88  0.99  0.0215
3       0.99  0.92  1.05  0.6528
4       0.93  0.87  1.00  0.0468
5       0.93  0.88  0.98  0.0055
6       0.97  0.92  1.01  0.1202
7       0.91  0.83  0.99  0.0297
8       0.98  0.90  1.07  0.6972
9       0.99  0.92  1.06  0.7841
10      1.01  0.91  1.11  0.9149
11      0.96  0.87  1.05  0.3837
12      0.90  0.83  0.97  0.0047
Using a Bonferroni-corrected threshold of p < 0.05/12 ≈ 0.0042, none of the results was significant; the smallest p value (marker 12, p = 0.0047) falls just above the threshold.
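For anyone who wants to check the multiplicity correction, here is a minimal sketch using only the Python standard library. The p values are copied from the table above. The Holm step-down procedure is added purely as a comparison; it was not part of the analysis described in this post.

```python
# p values copied from the univariable Cox table (markers m1..m12)
p_values = {
    "m1": 0.1088, "m2": 0.0215, "m3": 0.6528, "m4": 0.0468,
    "m5": 0.0055, "m6": 0.1202, "m7": 0.0297, "m8": 0.6972,
    "m9": 0.7841, "m10": 0.9149, "m11": 0.3837, "m12": 0.0047,
}

alpha = 0.05
n_tests = len(p_values)

# Bonferroni: compare every p value with alpha / number of tests
bonferroni_threshold = alpha / n_tests
bonferroni_hits = [m for m, p in p_values.items() if p <= bonferroni_threshold]

# Holm step-down (comparison only, not used in the post): sort the p values,
# compare the k-th smallest (k = 0, 1, ...) with alpha / (n_tests - k),
# and stop at the first one that fails
holm_hits = []
for k, (marker, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1])):
    if p > alpha / (n_tests - k):
        break
    holm_hits.append(marker)

print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print("Bonferroni hits:", bonferroni_hits)
print("Holm hits:", holm_hits)
```

Because the smallest p value already misses the first Holm threshold (which equals the Bonferroni threshold), both procedures give the same answer here.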
Multivariable analyses
It was decided to fit a model to the data by inputting all 12 biomarkers at once into a stepwise Cox regression algorithm using ten-fold cross-validation. After building ten models on the ten different training sets, time-dependent ROC curves were built to allow selection of optimal cutoff points to identify two groups of patients, “high” and “low” risk. Cut points that minimized “1 – TP + FP” were selected. Each of the ten models was then used to make predictions for the patients in its corresponding validation group. These patients were classified into “high” and “low” risk groups and plotted on a single, cross-validated Kaplan-Meier curve.
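To make the cut-point rule concrete, here is a simplified sketch. It assumes TP and FP are the true- and false-positive rates, and it treats the outcome as a plain 0/1 status at a fixed time horizon, so it ignores the censoring that a time-dependent ROC curve handles. The function name `best_cutoff` and the toy data are hypothetical and are not taken from the masked data set.

```python
def best_cutoff(scores, events):
    """Return (cutoff, criterion) minimizing 1 - TPR + FPR.

    scores: risk scores, higher means higher predicted risk
    events: 1 if the patient had the event by the chosen horizon, else 0
    Patients with score >= cutoff are called "high" risk.
    """
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, events) if s >= cutoff and e == 1)
        fp = sum(1 for s, e in zip(scores, events) if s >= cutoff and e == 0)
        criterion = 1 - tp / n_pos + fp / n_neg
        # Strict "<" keeps the lowest cutoff when several are tied
        if best is None or criterion < best[1]:
            best = (cutoff, criterion)
    return best


# Hypothetical toy data: five non-events followed by four events
scores = [0.05, 0.1, 0.2, 0.3, 0.5, 0.4, 0.7, 0.8, 0.9]
events = [0, 0, 0, 0, 0, 1, 1, 1, 1]

cutoff, criterion = best_cutoff(scores, events)
print(cutoff, criterion)
```

Minimizing 1 - TPR + FPR is the same as maximizing TPR - FPR (Youden's J), so the selected cut point is the one with the largest vertical distance between the ROC curve and the diagonal. In the cross-validated setting, the cut point would be chosen on each training set only and then applied unchanged to the held-out fold.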
Conclusions
The confidence intervals of the high and low risk curves overlapped substantially, suggesting that the identified biomarkers were not useful prognostic markers. Our study therefore has not identified any significant univariate or multivariate associations between these markers and patient prognosis.
Questions for the community
Have I gone about analyzing my data in the correct manner?
If you had been the statistician on this study, would you have done something differently?
Prior to performing the validation analyses, sample size and power calculations were not performed to determine the number of samples to include and the detectable effect size. I would like to perform these analyses now to guide future studies. Can someone tell me how to do this?
What I am really interested in is whether these biomarkers provide predictive information above and beyond the clinical prognostic score. From what I understand, this would entail making three different models: (1) a model with clinical covariates only, (2) a biomarker model with biomarker covariates only, and (3) a biomarker/clinical model based on both types of covariates. So far I have made models 1 (not shown above; it was unable to differentiate between high and low risk patients in our sample either) and 2 (shown above). Because 1 and 2 were not significant, I didn’t make model 3. Should I do this anyway?
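On the sample size question, one possible starting point is Schoenfeld's approximation for the number of events needed in a two-group proportional hazards comparison. The sketch below assumes a marker dichotomized into "high" and "low" groups, a two-sided test, and proportional hazards; the function names and default values are hypothetical. For a continuous marker, the `prop_high * (1 - prop_high)` term is usually replaced by the variance of the marker, which I have not shown here.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, prop_high=0.5):
    """Events needed to detect `hazard_ratio` (Schoenfeld approximation).

    prop_high is the assumed fraction of patients in the "high" marker group.
    """
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return z_total ** 2 / (prop_high * (1 - prop_high) * log(hazard_ratio) ** 2)


def detectable_hazard_ratio(n_events, alpha=0.05, power=0.80, prop_high=0.5):
    """Smallest hazard ratio above 1 detectable with `n_events` events."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return exp(z_total / sqrt(n_events * prop_high * (1 - prop_high)))


# Events needed for a hazard ratio of 2 at the unadjusted 5% level
print(required_events(2.0))

# Hazard ratio detectable with the 91 events in this study,
# using the Bonferroni-adjusted level 0.05 / 12
print(detectable_hazard_ratio(91, alpha=0.05 / 12))
```

Note that the calculation is driven by the number of events rather than the number of patients, so a future study would also need an assumption about the expected event rate to translate events into a sample size.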
Any additional comments about analytical concerns would be greatly appreciated! Please feel free to download the masked data and have a look yourself.
Best Answer
You have nicely described the problem and have set it up well in a number of ways. I wasn't clear on the definition of "prognostic score", but it is very unlikely that a 2-level score is clinically helpful. It is important to adjust for all pertinent available clinical variables, based on expert opinion when choosing them. Here are some opportunities for improvement:
rms
package)