Solved – Clarification for LASSO-based Cox model using glmnet

cox-model, cross-validation, glmnet, survival

I am trying to find a variable signature associated with a clinical characteristic. Specifically, I want to build a prognostic model from multivariable gene-expression data. I have "Time (survival years)" and "Status (dead/alive)" for a set of individuals, and I also have expression data for those same individuals.

Based on my reading about statistical models and posts 1, 2 and 3, I decided to use a multivariable Cox proportional-hazards model. While reading about how to select a few of the best markers, I found "backward" selection and the "LASSO" method. Further reading suggested LASSO could be a good choice and that it can be implemented with the glmnet package in R.

I divided my data into 60-40 (TRAIN-TEST) proportions, planning to select the variables on the 60% and then assess their success as prognostic markers on the remaining 40%.
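For reference, the split can be done along these lines (a minimal sketch; md and mdsurv are the objects used in the code below, assumed to be row-aligned, and the mdTrain/mdTest names are just for illustration):

set.seed(1)                                   # for a reproducible split
n <- nrow(md)
trainIdx <- sample(seq_len(n), size = floor(0.6 * n))
mdTrain <- md[trainIdx, ];  mdsurvTrain <- mdsurv[trainIdx, ]
mdTest  <- md[-trainIdx, ]; mdsurvTest  <- mdsurv[-trainIdx, ]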

The training data has 457 subjects (patients) and 612 covariates (continuous, log-transformed expression values). The survival data are right censored, with ~15% dead cases (Status = 1). I have no idea what a good number of predictors would be.

For the TRAIN data, I used the code below.

library(glmnet)
library(survival)

md      # expression data (patients x genes, log-transformed)
mdsurv  # survival data: OS (years) and Dead (0 or 1)

## Apply right-censoring at 5 years
eventT <- mdsurv$OS
obsLen <- 5                                # censoring time in years
censT <- rep(obsLen, nrow(mdsurv))         # one censoring time per subject
obsT <- pmin(eventT, censT)
status <- eventT <= censT                  # follow-up ended within 5 years
mdsurv$Status <- as.numeric(status)

## Patients alive at last follow-up within 5 years were censored,
## not events, so reset their status to 0
keep <- mdsurv$Dead == 0 & mdsurv$Status == 1
mdsurv$Status[keep] <- 0

## LASSO Cox fit; the response must be a Surv object (or a
## two-column matrix with "time" and "status" columns)
mdt <- as.matrix(md)
y <- Surv(obsT, mdsurv$Status)
cvfit <- cv.glmnet(mdt, y, family = "cox", alpha = 1)
plot(cvfit)
c <- coef(cvfit, s = "lambda.min")
I am getting 16 genes (covariates) with non-zero coefficients from coef() at the lambda.min cutoff, and zero at lambda.1se. When I do univariate Cox analysis, the HR value for some of those genes is less than one. Should I still use those genes on my test subjects, or build a model using only the ones with HR >= 1?
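For reference, the non-zero genes can be read off the fit along these lines (a sketch using cvfit from the code above):

cmin <- as.matrix(coef(cvfit, s = "lambda.min"))   # sparse matrix -> dense
selected <- rownames(cmin)[cmin[, 1] != 0]         # covariates with non-zero coefficients
length(selected)                                   # 16 at lambda.min in my case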

Thanks.
P.S. Edited with new information.

Best Answer

First, you should not be worried by the predictors for which hazard ratios (HR) are < 1. That simply means that increased expression of those predictors is associated with better outcome rather than worse outcome. There is no reason to remove them from further use.

Second, with about 70 events in your training set you would typically have power to fit a standard Cox model with 5 unpenalized predictors without overfitting. (The usual rule of thumb is about 15 events per predictor being considered.) LASSO has identified 16 predictors, but it has also penalized their regression coefficients to lower magnitudes than they would have in a standard Cox model, precisely to avoid overfitting. If you wish to use a training/test setup (see below for why that might not be wise), then you should use the penalized coefficients provided by LASSO to evaluate performance. Do not simply use those predictors as the basis of a standard Cox model, as you would then lose the protection against overfitting.
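If you do keep the train/test setup, evaluating the penalized model on the held-out set could look roughly like this (a sketch: mdTest and mdsurvTest are hypothetical hold-out objects with the same 5-year censoring applied, i.e. obsT and Status constructed as in your training code, and cvfit is your training fit):

library(survival)
## penalized linear predictor from the LASSO fit (no refitting)
lpTest <- drop(predict(cvfit, newx = as.matrix(mdTest),
                       s = "lambda.min", type = "link"))
## Harrell's C-index on the held-out data (higher lp = higher risk)
concordance(Surv(mdsurvTest$obsT, mdsurvTest$Status) ~ lpTest,
            reverse = TRUE)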

Third, with so few events you typically lose too much information by setting aside separate training and test sets. A more powerful approach can be to develop the model on all of your data to obtain a LASSO model. You then demonstrate the quality of your model-building process on multiple bootstrap samples of the data.

See this page and its links for an overview of how you could proceed. Essentially, after you have your model, you repeat the entire model-building process (cross-validation to choose lambda, then a model with the LASSO predictor selections and coefficients based on that value of lambda) on multiple bootstrapped samples drawn from all the original data. You then evaluate the performance of each of those models on the full original data set. Averaging the performance estimates over the models developed on bootstrapped samples provides a useful measure of the quality of the model-building process. Note that with LASSO you will typically get different predictors selected for each bootstrap sample, but this procedure will still document the quality of your modeling approach.
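A sketch of that loop, under the assumption that x holds the full expression matrix and y the corresponding Surv object for all patients (the names, and the choice of 100 resamples, are purely illustrative):

library(glmnet)
library(survival)
set.seed(1)
B <- 100                                    # number of bootstrap resamples
cstat <- numeric(B)
n <- nrow(x)
for (b in seq_len(B)) {
  idx <- sample(seq_len(n), replace = TRUE)               # resample patients
  fit_b <- cv.glmnet(x[idx, ], y[idx], family = "cox", alpha = 1)
  lp <- drop(predict(fit_b, newx = x, s = "lambda.min"))  # score ALL original data
  cstat[b] <- concordance(y ~ lp, reverse = TRUE)$concordance
}
mean(cstat)   # average C-index of models built on bootstrap resamples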
