Yes, these are survival analysis/event history analysis data.
The beginning of time in survival analysis is rarely calendar time; it is usually the first day the individual was observed in the study. This affects your interpretation: intervention/treatment effects are understood to act on person-time (e.g., on the hazard function in terms of "days since start of observation," "days since diagnosis," or "days since treatment," depending on your study design), rather than on the hazard function in calendar time (i.e., you are not trying to estimate the change in hazard due to treatment on June 3rd, 2014).
If you only followed people for 6 months: that's 180 days; unless everyone experienced readmission by 180 days, there should be some right censoring, and the survival curve should not plummet to 0 at 180 days.
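To see concretely why right censoring keeps the curve above zero, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit computation (the subject times and censoring flags below are made up; a real analysis would use R's `survival::survfit`):

```python
# Kaplan-Meier product-limit estimator (toy illustration, made-up data).
# Each subject is (time_in_days, event), with event=1 for readmission and
# event=0 for right censoring at the end of the 180-day follow-up.
subjects = [(30, 1), (75, 1), (120, 1), (180, 0), (180, 0), (180, 0)]

def kaplan_meier(data):
    """Return [(time, survival_probability)] at each distinct event time."""
    times = sorted({t for t, e in data if e == 1})
    surv, curve = 1.0, []
    for t in times:
        at_risk = sum(1 for u, _ in data if u >= t)   # still under observation
        events = sum(1 for u, e in data if u == t and e == 1)
        surv *= 1 - events / at_risk                   # product-limit step
        curve.append((t, surv))
    return curve

print(kaplan_meier(subjects))
# The three subjects censored at 180 days keep S(t) at 0.5, not 0:
# censored subjects leave the risk set without forcing the curve down.
```

If every subject instead had `event=1` by day 180, the running product would reach 0 there, which is the plummet described above.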
> Would it be correct to manually censor some patients to limit the time of the study to some interval predefined by me?
Rather than formally censoring the survival times of those patients for Kaplan-Meier and log-rank analyses, keep all the original data but set the limits of the plot to the range you desire, e.g. `xlim = c(0, 5*365)` with TCGA time values in days, if you don't want to display beyond 5 years.
> I have conducted Cox regressions for each gene individually.
That's not generally a good idea, even though people often do that. It doesn't allow you to find associations with outcome that are only seen when you take other gene expression levels into account. It's a particular problem with Cox and logistic regressions, as omitting any outcome-associated predictor might bias the magnitude of other coefficients toward 0. And if you started with all ~20,000 genes, you have pre-selected predictors based on your outcomes in a way that will make all attempts at defining "significance" downstream questionable.
If you must find a small set of genes that together are associated with outcome, it's best to start with your entire set of candidate genes and apply LASSO to all of them.
Be aware that the set of genes selected by LASSO for a particular situation will vary greatly from data set to data set (or even resamples from the same data set), as many threads on this site describe and as demonstrated in Chapter 6 of Statistical Learning with Sparsity. That can be OK for a predictive model but you can't say that those are the "important" genes.
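One way to see why the selected set is so unstable: LASSO works by soft-thresholding, so any gene whose estimated effect sits near the penalty threshold flips in or out of the model from one resample to the next. A tiny sketch with made-up numbers, using the closed-form LASSO solution that holds for an orthonormal design (`glmnet()` handles the general case, but the thresholding intuition carries over):

```python
# Soft-thresholding, the operator at the heart of LASSO coordinate descent.
# Toy orthonormal design (columns scaled so x_j'x_j / n = 1); numbers made up.

def soft_threshold(rho, lam):
    """Shrink rho toward 0 by lam; set to exactly 0 if |rho| <= lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

x1 = [1, 1, 1, 1]          # "gene1": strong predictor
x2 = [1, -1, 1, -1]        # "gene2": weak predictor
y = [3 * a + 0.5 * b for a, b in zip(x1, x2)]
n = len(y)

lam = 1.0                   # penalty level
for name, x in [("gene1", x1), ("gene2", x2)]:
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n  # unpenalized coefficient
    print(name, soft_threshold(rho, lam))
# gene1 keeps a shrunken coefficient (3.0 -> 2.0); gene2's 0.5 is zeroed out.
# In a resample where gene2's estimate drifts above lam, it gets "selected".
```

Genes with effects far above the threshold are kept consistently; the many genes hovering near it account for the data-set-to-data-set churn in the selected set.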
Also, I'm skeptical of any gene-based survival model that doesn't incorporate the critical clinical characteristics known to be associated with outcome. There's a risk that you will just identify genes that are surrogates for fundamentally important clinical variables.
> I have previously log-transformed, scaled, and centered gene expression values.
Log transformation is often a good start with gene-expression data. I'm not sure that further pre-scaling is a good idea. On a log2 scale, for example, you can interpret a Cox model coefficient as "the change in log-hazard per doubling of expression." You lose that if you further scale the expression values; you are stuck with "change in log-hazard per standard-deviation change in the log2-transformed expression." I think the internal pre-scaling needed for penalization in LASSO is done automatically by `glmnet()`, with coefficients then returned on the original scales where they are easier to interpret. (The `coxph()` function does that silently.)
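To make the doubling interpretation concrete, a small numeric sketch (the coefficient value is made up for illustration):

```python
import math

# Cox model fit on log2(expression): the coefficient beta is the change in
# log-hazard per doubling of expression.  beta here is a made-up example.
beta = 0.35
hazard_ratio_per_doubling = math.exp(beta)        # hazard ratio for 2x expression
hazard_ratio_per_fourfold = math.exp(beta * 2)    # two doublings compound

print(round(hazard_ratio_per_doubling, 3))
print(round(hazard_ratio_per_fourfold, 3))
# If the log2 values are standardized first, beta instead means "per SD of
# log2 expression", and this doubling interpretation is lost.
```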
> Although each one of the genes was significantly associated with the survival rate on its own, after this step the p-value for the log-rank test is 1. Am I doing something wrong?
The outputs from `glmnet()` and from the `coxph()` function (forced to use the initial values and not iterate) are the same, as they should be. The `coxph()` function can't do the log-rank, score, or Wald tests reliably if, as in this circumstance, you don't let it come to its own optimal solution. That is OK, as you really shouldn't be doing that simple type of inference on a set of predictors selected by LASSO based on their associations with outcome. The `selectiveInference` package provides for inference after LASSO selection:

> The functions compute p-values and selection intervals that properly account for the inherent selection carried out by the procedure

but I haven't used it for Cox models.
You can, however, use that Cox model to make predictions in a way that you can't with `glmnet()`, which was the main issue in the question to which you linked. I don't have experience with the `glmpath` package, linked from the page you cite; it might provide additional functionality.
> The number of events ranged from 8 to 160 in several analyses.
This is your real limit. You typically need 10-20 events per predictor to fit a Cox model without overfitting (unless you penalize as with LASSO). See Frank Harrell's course notes and book on this and on the issues with predictor selection noted above.
With only 8 events you can barely contemplate fitting a single predictor reliably. With 160 events you might be able to handle 10-15. With LASSO, you might expect a similar number of predictors to be returned with non-zero coefficients at the optimal penalty.
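The events-per-variable arithmetic behind those numbers is simple enough to sketch, using the 10-20 events-per-predictor rule of thumb cited above:

```python
# Rough events-per-variable (EPV) budget for an unpenalized Cox model.
# epv_low/epv_high bracket the 10-20 events-per-predictor rule of thumb.
def predictor_budget(n_events, epv_low=10, epv_high=20):
    """Return (conservative, optimistic) counts of supportable predictors."""
    return n_events // epv_high, n_events // epv_low

print(predictor_budget(8))     # (0, 0): not even one predictor is safe
print(predictor_budget(160))   # (8, 16): roughly the 10-15 noted above
```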
For comparison, the gene-expression panels used clinically in breast cancer start at 16 cancer-related genes and go up to about 70 or 80, depending on the test. So think hard about whether you can accomplish what you wish, even with TCGA data.
Best Answer
The starting time of the study is immaterial: it's just an origin for the clock. What you want to consider are the states in which the subjects can be found and the ages at which they transition from one to another. In this situation a minimum set of states would be
(This framework will allow multiple "endpoint" states to be modeled.)
The multistate analysis supposes there is a transition probability from some of these states to others. The relevant ones would be
The Nelson-Aalen estimator can be generalized to estimate the rates of these transitions. It's a simple estimator, summing the ratios of events occurring to the numbers of people at risk for them to occur. The conclusion of the recent TAS article Two Pitfalls in Survival Analyses of Time-Dependent Exposure is that if you get your multistate model wrong, you will miscount the number of people at risk in various states and that will bias the results. Its message is clear: get the multistate model right. If the study truly is prospective--that is, if you identify people with the gene at birth and follow them--then there is no question about the right model. Similarly, if enrollment in the study is independent of the presence of the gene, there will be no bias. Otherwise, this framework calls out for incorporating the study selection probabilities into the model and shows how to account for deaths and prior disease before enrollment was possible.
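The Nelson-Aalen computation itself is just the running sum described above; here is a minimal pure-Python sketch on made-up single-transition data (a real multistate analysis would use R's `survival` machinery, with a separate risk set per transition):

```python
# Nelson-Aalen estimator of the cumulative hazard: at each event time,
# add (number of events) / (number at risk).  Toy data: (time, event),
# with event=0 meaning right-censored; all values made up.
data = [(2, 1), (3, 0), (5, 1), (5, 1), (8, 0), (9, 1)]

def nelson_aalen(data):
    """Return [(time, cumulative_hazard)] at each distinct event time."""
    times = sorted({t for t, e in data if e == 1})
    H, curve = 0.0, []
    for t in times:
        at_risk = sum(1 for u, _ in data if u >= t)  # risk set at time t
        events = sum(1 for u, e in data if u == t and e == 1)
        H += events / at_risk
        curve.append((t, H))
    return curve

print(nelson_aalen(data))
# Miscounting the risk set -- the pitfall the cited article warns about --
# distorts every ratio in this sum.
```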
This paper also illustrates a nice tool for analyzing these subtleties: the Lexis diagram. (Look at the figures at the end of this rather technical paper.) I believe these diagrams can be produced with the `Epi` package in R. You might find them helpful for discussions with your colleagues about the appropriate model to adopt.
ASA members and people with university library privileges probably already have online access to this article: it's worth reading.