Generally speaking, $\exp(\hat\beta_1)$ is the ratio of the hazards between two individuals whose values of $x_1$ differ by one unit when all other covariates are held constant. The parallel with other linear models is that in Cox regression the hazard function is modeled as $h(t)=h_0(t)\exp(\beta'x)$, where $h_0(t)$ is the baseline hazard. This is equivalent to saying that $\log(\text{group hazard}/\text{baseline hazard})=\log\big(h(t)/h_0(t)\big)=\sum_i\beta_ix_i$. Then, a unit increase in $x_i$ is associated with a $\beta_i$ increase in the log hazard rate.
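As a quick numeric illustration of that interpretation (the coefficient value here is made up, not from any fitted model):

```python
import math

# Hypothetical Cox coefficient for covariate x1 (illustrative only)
beta1 = 0.405

# Hazard ratio per one-unit increase in x1, all other covariates held constant
hr = math.exp(beta1)                # ≈ 1.50

# A two-unit difference in x1 multiplies the hazard by exp(2 * beta1)
hr_two_units = math.exp(2 * beta1)  # ≈ 2.25

print(round(hr, 2), round(hr_two_units, 2))
```

Note that differences in $x_1$ act multiplicatively on the hazard: a two-unit difference gives $\exp(\beta_1)^2$, not $2\exp(\beta_1)$.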
The regression coefficient thus quantifies the log of the hazard ratio between the treatment group and the control or placebo group, accounting for the covariates included in the model; the exponentiated coefficient is interpreted as a relative risk (assuming no time-varying coefficients).
In the case of logistic regression, the regression coefficient reflects the log of the odds ratio, hence the interpretation as a k-fold increase in the odds of the outcome (which approximates a k-fold increase in risk only when the event is rare). So yes, the interpretation of hazard ratios shares some resemblance with the interpretation of odds ratios.
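To see why "k-fold increase in risk" is only an approximation for an odds ratio, here is a small sketch (with made-up numbers) converting an odds ratio back to probabilities:

```python
def risk_from_or(p0, odds_ratio):
    """Apply an odds ratio to a baseline risk p0 and return the new risk."""
    odds0 = p0 / (1 - p0)
    odds1 = odds0 * odds_ratio
    return odds1 / (1 + odds1)

# Rare outcome: the odds ratio is close to the risk ratio
p1 = risk_from_or(0.01, 2.0)
print(round(p1 / 0.01, 2))   # close to 2

# Common outcome: the implied risk ratio is well below the odds ratio of 2
p2 = risk_from_or(0.40, 2.0)
print(round(p2 / 0.40, 2))
```

With a 40% baseline risk, an odds ratio of 2 corresponds to roughly a 1.4-fold increase in risk, not 2-fold.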
Be sure to check Dave Garson's website where there is some good material on Cox Regression with SPSS.
Would it be correct to manually censor some patients to limit the time of the study to some interval predefined by me?
Rather than formally censoring the survival times of those patients for Kaplan-Meier and log-rank analyses, keep all the original data but just set the limits of the plot to the range that you desire, e.g., xlim=c(0, 5*365) with TCGA time values in days, if you don't want to display beyond 5 years.
I have conducted Cox regressions for each gene individually.
That's not generally a good idea, even though people often do that. It doesn't allow you to find associations with outcome that are only seen when you take other gene expression levels into account. It's a particular problem with Cox and logistic regressions, as omitting any outcome-associated predictor might bias the magnitude of other coefficients toward 0. And if you started with all ~20,000 genes, you have pre-selected predictors based on your outcomes in a way that will make all attempts at defining "significance" downstream questionable.
If you must find a small set of genes that together are associated with outcome, it's best to start with your entire set of candidate genes and let LASSO select among all of them.
Be aware that the set of genes selected by LASSO for a particular situation will vary greatly from data set to data set (or even resamples from the same data set), as many threads on this site describe and as demonstrated in Chapter 6 of Statistical Learning with Sparsity. That can be OK for a predictive model but you can't say that those are the "important" genes.
Also, I'm skeptical of any gene-based survival model that doesn't incorporate the critical clinical characteristics known to be associated with outcome. There's a risk that you will just identify genes that are surrogates for fundamentally important clinical variables.
I have previously log-transformed, scaled, and centered gene expression values.
Log transformation is often a good start with gene-expression data. I'm not sure that further pre-scaling is a good idea. In a log2 scale, for example, you can interpret a Cox model coefficient as "the change in log-hazard per doubling of expression."
You lose that if you further scale the expression values; you are stuck with "change in log-hazard per unit standard-deviation change in the log2-transformed expression." I think that internal pre-scaling to allow penalization in LASSO is done automatically by glmnet(), with coefficients then returned on the original scales where they are easier to interpret. (The coxph() function does that silently.)
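A small numeric sketch of how pre-scaling changes the interpretation (both numbers below are hypothetical): if $\beta$ is the coefficient on log2 expression, $\exp(\beta)$ is the hazard ratio per doubling; after dividing log2 expression by its standard deviation $s$, the fitted coefficient for the same association becomes $\beta s$, a hazard ratio per one-SD change.

```python
import math

beta_log2 = 0.25  # hypothetical Cox coefficient on log2-scale expression
sd_log2 = 1.8     # hypothetical SD of log2 expression in the cohort

# On the log2 scale: hazard ratio per doubling of expression
hr_per_doubling = math.exp(beta_log2)

# After standardizing (dividing log2 expression by its SD), the coefficient
# for the same association is beta_log2 * sd_log2: a HR per one-SD change
beta_scaled = beta_log2 * sd_log2
hr_per_sd = math.exp(beta_scaled)

print(round(hr_per_doubling, 3), round(hr_per_sd, 3))
```

The association is unchanged; only the unit attached to the reported coefficient differs, and "per doubling" is usually the easier unit to communicate.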
Although each one of the genes was significantly associated with the survival rate on its own, after this step the p-value for the log-rank test is 1. Am I doing something wrong?
The outputs from glmnet() and from the coxph() function (forced to use the initial values and not iterate) are the same, as they should be. The coxph() function can't do the log-rank, score, or Wald tests reliably if, as in this circumstance, you don't let it come to its own optimal solution. That is OK, as you really shouldn't be doing that simple type of inference on a set of predictors selected by LASSO based on their associations with outcome. The selectiveInference package provides for inference after LASSO selection:
The functions compute p-values and selection intervals that properly account for the inherent selection carried out by the procedure
but I haven't used it for Cox models.
You can, however, use that Cox model to make predictions in a way that you can't with glmnet(), which was the main issue in the question to which you linked. I don't have experience with the glmpath package, linked from the page you cite; it might provide additional functionality.
The number of events ranged from 8 to 160 in several analyses.
This is your real limit. You typically need 10-20 events per predictor to fit a Cox model without overfitting (unless you penalize as with LASSO). See Frank Harrell's course notes and book on this and on the issues with predictor selection noted above.
With only 8 events you can barely even contemplate fitting a single predictor reliably. With 160 events you might be able to handle 10-15. With LASSO, you might expect to have similar numbers of predictors returned with non-zero coefficients at the optimal penalty.
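The events-per-predictor guideline above lends itself to a quick back-of-the-envelope check; the event counts below are the ones from the question, and 15 events per predictor is just a midpoint of the 10-20 rule of thumb:

```python
def max_predictors(n_events, events_per_predictor=15):
    """Rough upper bound on candidate predictors for an unpenalized Cox model."""
    return n_events // events_per_predictor

for events in (8, 160):
    print(events, "events ->", max_predictors(events),
          "predictors at 15 events per predictor")
```

With 8 events the bound is zero, i.e., not even one predictor is comfortably supported, while 160 events support roughly 10, consistent with the range stated above.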
For comparison, the gene-expression panels used clinically in breast cancer start at 16 cancer-related genes and go up to about 70 or 80, depending on the test. So think hard about whether you can accomplish what you wish, even with TCGA data.
Best Answer
Usually, age at baseline is used as a covariate (because it is often associated with disease/death), but it can be used as your time scale as well. I think it is used in some longitudinal studies, because you need to have enough people at risk along the time scale, but I can't remember the details -- I just found these slides about Analysing cohort studies assuming a continuous time scale, which discuss cohort studies. In the interpretation, you should replace event time by age, and you might include age at diagnosis as a covariate. This would make sense when you study age-specific mortality of a particular disease (as illustrated in these slides).
Maybe this article is interesting, since it contrasts the two approaches (time-on-study vs. chronological age): Time Scales in Cox Model: Effect of Variability Among Entry Ages on Coefficient Estimates. Here is another paper:
Cheung, YB, Gao, F, and Khoo, KS (2003). Age at diagnosis and the choice of survival analysis methods in cancer epidemiology. Journal of Clinical Epidemiology, 56(1), 38-43.
But there are certainly better papers.