Survival Analysis – Questions Related to Lasso Cox Regression

[cox-model] [lasso] [survival]

I am doing several survival analyses on TCGA (The Cancer Genome Atlas) data. As this is my first time doing this kind of analysis, I have a question about it.

To study the influence of the expression of several genes on patients' survival, I have conducted Cox regressions for each gene individually. To do so, I have previously log-transformed, scaled, and centered gene expression values. I then wondered whether I could assess the predictive power of those genes in a multivariable Cox regression model. However, in one particular case I obtained 30 genes significantly associated with survival, so I wondered whether I could add a variable selection step with Lasso Cox regression using the glmnet R package.

After cross-validating lambda, I obtained the genes with non-zero coefficients, and now I would like to measure the predictive performance of that subset of genes, or find some other way to assess its quality. To do this, I took the coefficients reported by glmnet at lambda.min, introduced them manually into the coxph function from the survival R package through the init argument, and prevented any additional fitting with iter.max = 0, as described in this previous post: Cox regression with lasso regression. However, although each of the genes was significantly associated with survival on its own, after this step the p-value for the log-rank test is near 1. Am I doing something wrong? The number of events ranged from 8 to 160 in several analyses. Would it be better to try other approaches to this problem? If so, could you please give me some suggestions?

Edit: Added outputs from glmnet and coxph.

genes = c("HRK.8739", "NKAIN3.286183", "KLK15.55554", "NKX2.4.644524", "ADIG.149685", "DMRTC1.63947", "HTR3B.9177",
            "TMPRSS7.344805", "PSKH2.85481", "TH.7054", "PSG3.5671", "ADAM2.2515")

final_clin = final_clin[!final_clin$new_death <= 0,]
b = final_clin[, c(genes)]
x = data.matrix(b)
y = Surv(final_clin$new_death, final_clin$death_event)
coxlasso <- glmnet(x = x, y = y, family = 'cox')
plot(coxlasso, label = T)
cv.lasso_OS <- cv.glmnet(x, y, family="cox", standardize=T, alpha=1, nfolds=10, parallel=T)
cv.lasso_OS

Call:  cv.glmnet(x = x, y = y, nfolds = 10, parallel = T, family = "cox", standardize = T, alpha = 1) 

Measure: Partial Likelihood Deviance 

     Lambda Index Measure     SE Nonzero
min 0.01255    20   10.93 0.4086       4
1se 0.07351     1   11.33 0.4002       0

(coefs = coef(cv.lasso_OS, s = cv.lasso_OS$lambda.min))

12 x 1 sparse Matrix of class "dgCMatrix"
                     1
HRK.8739        .         
NKAIN3.286183   0.03159785
KLK15.55554     0.27252634
NKX2.4.644524   .         
ADIG.149685     .         
DMRTC1.63947   -0.19942332
HTR3B.9177     -0.18328332
TMPRSS7.344805  .         
PSKH2.85481     0.25614089
TH.7054         .         
PSG3.5671       .         
ADAM2.2515     -0.09113397

coefs = coefs[coefs != 0]  # keep only the genes with non-zero coefficients
control = coxph.control(iter.max = 0)  # freeze the fit at the supplied coefficients
surv_model = coxph(Surv(new_death, death_event) ~ NKAIN3.286183 + KLK15.55554 + DMRTC1.63947
                   + HTR3B.9177 + PSKH2.85481 + ADAM2.2515, data = final_clin,
                   control = control, init = coefs)
summary(surv_model)

Call:
coxph(formula = Surv(new_death, death_event) ~ NKAIN3.286183 + 
KLK15.55554 + DMRTC1.63947 + HTR3B.9177 + PSKH2.85481 + ADAM2.2515, 
data = final_clin, init = coefs, control = control)

  n= 531, number of events= 160 

                  coef exp(coef) se(coef)      z Pr(>|z|)   
NKAIN3.286183  0.03160   1.03210  0.09904  0.319  0.74970   
KLK15.55554    0.27253   1.31328  0.09161  2.975  0.00293 **
DMRTC1.63947  -0.19942   0.81920  0.10511 -1.897  0.05780 . 
HTR3B.9177    -0.18328   0.83253  0.10148 -1.806  0.07090 . 
PSKH2.85481    0.25614   1.29193  0.09451  2.710  0.00673 **
ADAM2.2515    -0.09113   0.91290  0.10356 -0.880  0.37884   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

              exp(coef) exp(-coef) lower .95 upper .95
NKAIN3.286183    1.0321     0.9689    0.8500     1.253
KLK15.55554      1.3133     0.7615    1.0974     1.572
DMRTC1.63947     0.8192     1.2207    0.6667     1.007
HTR3B.9177       0.8325     1.2012    0.6824     1.016
PSKH2.85481      1.2919     0.7740    1.0735     1.555
ADAM2.2515       0.9129     1.0954    0.7452     1.118

Concordance= 0.69  (se = 0.024 )
Likelihood ratio test= 0  on 6 df,   p=1
Wald test            = 0  on 6 df,   p=1
Score (logrank) test = 2.64  on 6 df,   p=0.9

Best Answer

Would it be correct to manually censor some patients to limit the time of the study to some interval predefined by me?

Rather than formally censoring the survival times of those patients for Kaplan-Meier and log-rank analyses, keep all the original data and just set the limits of the plot to the range you want, e.g., xlim = c(0, 5 * 365) with TCGA time values in days, if you don't want to display beyond 5 years.
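For example (a minimal sketch, assuming final_clin holds new_death in days and death_event as in your code):

# Keep all events in the fit; restrict only the displayed time range
library(survival)
km <- survfit(Surv(new_death, death_event) ~ 1, data = final_clin)
plot(km, xlim = c(0, 5 * 365), xlab = "Days", ylab = "Survival probability")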

I have conducted Cox regressions for each gene individually.

That's not generally a good idea, even though people often do it. It doesn't let you find associations with outcome that only appear when other gene expression levels are taken into account. It's a particular problem with Cox and logistic regressions, in which omitting any outcome-associated predictor can bias the magnitudes of the other coefficients toward 0, even if the omitted predictor is uncorrelated with them. And if you started with all ~20,000 genes, you have pre-selected predictors based on their associations with outcome in a way that makes all downstream attempts at defining "significance" questionable.
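Here is a small simulated sketch of that attenuation (an assumed setup, not your data): two independent predictors each have a true log-hazard ratio of 1, and omitting one pulls the fitted coefficient of the other toward 0.

library(survival)
set.seed(2)
n  <- 2000
z1 <- rnorm(n)
z2 <- rnorm(n)
tt <- rexp(n, rate = exp(z1 + z2))  # true log-hazard ratio of 1 for each predictor
ev <- rep(1, n)                     # no censoring, to keep the sketch simple
coef(coxph(Surv(tt, ev) ~ z1 + z2)) # both estimates near 1
coef(coxph(Surv(tt, ev) ~ z1))      # attenuated toward 0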

If you must find a small set of genes that together is associated with outcome, it's best to run LASSO on your entire set of candidate genes rather than on a subset pre-filtered by individual significance.

Be aware that the set of genes selected by LASSO in a particular situation will vary greatly from data set to data set (or even between resamples of the same data set), as many threads on this site describe and as Chapter 6 of Statistical Learning with Sparsity demonstrates. That can be OK for a predictive model, but you can't say that those are the "important" genes.
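You can see that instability directly by refitting the LASSO on bootstrap resamples (a sketch using the x and y built in your code):

# How often is each gene selected at lambda.min across bootstrap resamples?
library(glmnet)
set.seed(1)
selected <- replicate(20, {
  i  <- sample(nrow(x), replace = TRUE)
  cv <- cv.glmnet(x[i, ], y[i], family = "cox", alpha = 1)
  b  <- as.numeric(coef(cv, s = "lambda.min"))
  rownames(coef(cv, s = "lambda.min"))[b != 0]
}, simplify = FALSE)
sort(table(unlist(selected)) / 20, decreasing = TRUE)  # selection proportions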

Also, I'm skeptical of any gene-based survival model that doesn't incorporate the critical clinical characteristics known to be associated with outcome. There's a risk that you will just identify genes that are surrogates for fundamentally important clinical variables.

I have previously log-transformed, scaled, and centered gene expression values.

Log transformation is often a good start with gene-expression data; I'm not sure that further pre-scaling is a good idea. On a log2 scale, for example, you can interpret a Cox model coefficient as "the change in log-hazard per doubling of expression."

You lose that if you further scale the expression values; you are stuck with "the change in log-hazard per standard-deviation change in log2-transformed expression." The internal pre-scaling needed for LASSO penalization is done automatically by glmnet() (standardize = TRUE is the default), with coefficients then returned on the original scale, where they are easier to interpret. (I believe the coxph() function does similar internal rescaling silently.)
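To illustrate (a sketch, assuming x holds log2-transformed but not pre-scaled expression values):

fit_log2   <- glmnet(x, y, family = "cox")         # standardize = TRUE is the default
fit_scaled <- glmnet(scale(x), y, family = "cox")  # pre-scaled by hand
# fit_log2 coefficients: change in log-hazard per doubling of expression
# fit_scaled coefficients: change per SD of log2 expression, harder to interpret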

although each of the genes was significantly associated with survival on its own, after this step the p-value for the log-rank test is near 1. Am I doing something wrong?

The outputs from glmnet() and from coxph() (forced to use the initial values and not iterate) are the same, as they should be. But coxph() can't compute the likelihood-ratio, Wald, or score (log-rank) tests reliably when, as here, you don't let it reach its own optimal solution: the statistics are anchored at the initial values you supplied rather than at the usual null of all coefficients equal to 0, which is why they come out near 0 with p-values near 1. That is OK, as you really shouldn't be doing that simple type of inference on a set of predictors that LASSO selected on the basis of their associations with outcome anyway. The selectiveInference package provides for inference after LASSO selection:

The functions compute p-values and selection intervals that properly account for the inherent selection carried out by the procedure

but I haven't used it for Cox models.
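For reference, the package's documented calling pattern for Cox models looks roughly like this. This is a sketch only, which I haven't verified; the lambda scaling between glmnet and fixedLassoInf is a common stumbling block, so treat the details (including the illustrative penalty value) as assumptions to check against the package documentation.

library(glmnet)
library(selectiveInference)
n    <- nrow(x)
# fixedLassoInf() wants the lasso solution without glmnet's standardization,
# and its lambda is on the sum (not mean) loss scale, hence the division by n
gfit <- glmnet(x, y, family = "cox", standardize = FALSE, thresh = 1e-10)
lam  <- n * 0.01  # an illustrative penalty value, not a recommendation
beta <- as.numeric(coef(gfit, s = lam / n, exact = TRUE, x = x, y = y))
out  <- fixedLassoInf(x, final_clin$new_death, beta, lam,
                      status = final_clin$death_event, family = "cox")
out  # selection-adjusted p-values and intervals for the selected genes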

You can, however, use that Cox model to make predictions in a way that you can't with glmnet(), which was the main issue in the question you linked to. I don't have experience with the glmpath package, linked from the page you cite; it might provide additional functionality.
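For example (a sketch using the surv_model object from your code):

# Risk scores and predicted survival curves from the frozen coxph() fit
lp  <- predict(surv_model, type = "lp")                  # linear predictor per patient
fit <- survfit(surv_model, newdata = final_clin[1:2, ])  # e.g., two patients of interest
plot(fit, xlab = "Days", ylab = "Predicted survival probability")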

The number of events ranged from 8 to 160 in several analyses.

This is your real limit. You typically need 10-20 events per candidate predictor to fit a Cox model without overfitting (unless you penalize, as with LASSO). See Frank Harrell's course notes and book for this and for the issues with predictor selection noted above.

With only 8 events you can barely contemplate fitting even a single predictor reliably. With 160 events you might be able to handle 10-15. With LASSO, you might expect a similar number of predictors to be returned with non-zero coefficients at the optimal penalty.

For comparison, the gene-expression panels used clinically in breast cancer start at 16 cancer-related genes and go up to about 70 or 80, depending on the test. So think hard about whether you can accomplish what you wish, even with TCGA data.