Solved – Problem with BootCV for coxph in pec after feature selection with glmnet (lasso)

cox-modelglmnetlassor

I am attempting to evaluate the prediction error of a coxph model that was built after feature selection with glmnet.

In the preprocessing stage I used na.omit (dataset) to remove NAs.
I reconstructed all my factor variables into binary variables with dummies (using model.matrix) I then used glmnet lasso to fit a cox model and select the best performing features.
Then I fit a coxph model using only these feature.

When I try to evaluate the model using pec and a bootstrap I get an error that the prediction matrix has wrong dimensions.

In the original dataset there are 313 variables.
In the coxph model (glmnet.cox in my code) there are 313 df.
However when I run pec the prediction matrix is noted as having 318 variables and therefore doesn't fit the testset which has 313 variables (+the status and time variables).

It seems that pec does something to the variables while retraining the coxph model on the bootstrap samples. I tried setting singular.ok=FALSE in the coxph model without avail.

Here is a summary of key variables (prior to restructuring for glmnet)
Followed by the code I ran and the errors:

> summary (trainSet)
     time              status         Age_at_Dx       CREATININE   
 Min.   :  0.1429   Min.   :0.0000   Min.   :14.81   Min.   :0.400  
 1st Qu.: 22.0714   1st Qu.:1.0000   1st Qu.:52.51   1st Qu.:0.800  
 Median : 52.9286   Median :1.0000   Median :63.79   Median :0.900  
 Mean   : 96.5415   Mean   :0.7826   Mean   :61.01   Mean   :1.042  
 3rd Qu.:154.6071   3rd Qu.:1.0000   3rd Qu.:73.01   3rd Qu.:1.125  
 Max.   :437.7143   Max.   :1.0000   Max.   :87.23   Max.   :6.000  
 Performance_Status    ALBUMIN          Cyto             WBC        
 Min.   :0.0000     Min.   :0.70   Min.   : 1.000   Min.   :  0.60  
 1st Qu.:1.0000     1st Qu.:3.00   1st Qu.: 4.000   1st Qu.:  3.20  
     Median :1.0000     Median :3.60   Median : 4.000   Median :  9.20  
 Mean   :0.9728     Mean   :3.48   Mean   : 6.802   Mean   : 28.35  
 3rd Qu.:1.0000     3rd Qu.:4.00   3rd Qu.:11.000   3rd Qu.: 33.10  
 Max.   :4.0000     Max.   :5.20   Max.   :17.000   Max.   :373.00  
  PRKCD_pT507             HGB            Maxblast          CD19       
 Min.   :-2.429605   Min.   : 5.400   Min.   : 0.00   Min.   : 0.000  
  1st Qu.:-0.627005   1st Qu.: 8.500   1st Qu.:24.00   1st Qu.: 1.000  
 Median :-0.013117   Median : 9.550   Median :40.00   Median : 2.000  
 Mean   : 0.006432   Mean   : 9.689   Mean   :43.74   Mean   : 6.312  
 3rd Qu.: 0.559782   3rd Qu.:10.800   3rd Qu.:65.25   3rd Qu.: 5.000  
 Max.   : 2.963658   Max.   :14.300   Max.   :98.00   Max.   :98.000  
     GAPDH                CD74              TP53               Fli1          
 Min.   :-2.391021   Min.   :-1.7902   Min.   :-1.64244   Min.   :-3.723143  
 1st Qu.:-0.662164   1st Qu.:-0.5502   1st Qu.:-0.53685   1st Qu.:-0.559783  
 Median : 0.003236   Median :-0.1546   Median :-0.10928   Median :-0.002154  
 Mean   : 0.015759   Mean   : 0.0235   Mean   :-0.02285   Mean   :-0.016555  
 3rd Qu.: 0.632472   3rd Qu.: 0.4759   3rd Qu.: 0.29869   3rd Qu.: 0.554521  
 Max.   : 3.512741   Max.   : 3.7226   Max.   : 5.39242   Max.   : 4.415324  
      LDH              SQSTM0            BM_BLAST         CCND3         
 Min.   :    8.4   Min.   :-1.56639   Min.   : 0.00   Min.   :-2.44772  
 1st Qu.:  562.8   1st Qu.:-0.48762   1st Qu.:31.00   1st Qu.:-0.59042  
 Median :  952.5   Median :-0.05746   Median :46.50   Median :-0.08412  
 Mean   : 1617.9   Mean   :-0.02272   Mean   :50.64   Mean   :-0.02506  
 3rd Qu.: 1596.8   3rd Qu.: 0.33718   3rd Qu.:72.00   3rd Qu.: 0.48353  
 Max.   :36708.0   Max.   : 3.59456   Max.   :98.00   Max.   : 3.58761  
      GAB2                BAX               ABS_BLST           PRKAA1_2       
 Min.   :-3.201564   Min.   :-3.230737   Min.   :     0.0   Min.   :-1.90671  
 1st Qu.:-0.640279   1st Qu.:-0.593174   1st Qu.:    99.8   1st Qu.:-0.66896  
 Median : 0.007018   Median :-0.106097   Median :  1836.5   Median :-0.10586  
 Mean   :-0.013284   Mean   :-0.007051   Mean   : 15024.1   Mean   :-0.03531  
  3rd Qu.: 0.635609   3rd Qu.: 0.588098   3rd Qu.: 11178.0   3rd Qu.: 0.44906  
 Max.   : 2.396419   Max.   : 3.439942   Max.   :358080.0   Max.   : 3.60037  
    RPS6KB1              SPP1               MAPT                ARC          
 Min.   :-2.05490   Min.   :-1.39728   Min.   :-1.826101   Min.   :-2.67081  
 1st Qu.:-0.65736   1st Qu.:-0.55249   1st Qu.:-0.573470   1st Qu.:-0.63908  
 Median :-0.12007   Median :-0.15156   Median :-0.046306   Median :-0.07845  
 Mean   :-0.06115   Mean   : 0.02759   Mean   : 0.009604   Mean   :-0.01082  
 3rd Qu.: 0.49055   3rd Qu.: 0.35195   3rd Qu.: 0.495406   3rd Qu.: 0.63638  
  Max.   : 3.54709   Max.   : 4.44919   Max.   : 4.063568   Max.   : 2.52810  
  FOXO1_pT24_FOXO3_pT32      ATG7               SMAD5           RPS6_pS235_236     
  Min.   :-2.2506       Min.   :-2.916018   Min.   :-2.321250   Min.   :-3.229642  
  1st Qu.:-0.5590       1st Qu.:-0.621408   1st Qu.:-0.480924   1st Qu.:-0.633143  
  Median :-0.1281       Median : 0.003795   Median : 0.003186   Median :-0.003345  
  Mean   :-0.0121       Mean   :-0.029199   Mean   : 0.010663   Mean   :-0.015028  
  3rd Qu.: 0.4415       3rd Qu.: 0.590721   3rd Qu.: 0.502824   3rd Qu.: 0.660224  
  Max.   : 3.1239       Max.   : 2.757525   Max.   : 2.238310   Max.   : 2.816105  
      LSD1             HDAC2               SFN           
  Min.   :-4.5434   Min.   :-2.40682   Min.   :-2.082592  
  1st Qu.:-0.5654   1st Qu.:-0.49165   1st Qu.:-0.554862  
  Median : 0.0477   Median : 0.02828   Median :-0.116248  
  Mean   :-0.0165   Mean   : 0.04641   Mean   : 0.006255  
  3rd Qu.: 0.6044   3rd Qu.: 0.47826   3rd Qu.: 0.506527  
  Max.   : 1.8736   Max.   : 4.22724   Max.   : 3.733445



  ##-----------------------------------------------------------------------------------         -------------------
   ##                                    FEATURE SELECTION BY GLMNET-LASSO
   ##----------------------------------------------------------------------------------     ---------------------
   ## 
   ## This function takes a predictor matrix and a Surv obj and uses the glmnet lasso      regularization 
   ## method to select the most predictive features and create a coxph object
   ## predict_matrix is the 'dataset' reformated to replace categorical variables to     binary with dummy

#creat Y (survival matrix) for glmnet
  surv_obj <- Surv(trainSet$time,trainSet$status) 


 ## tranform categorical variables into binary variables with dummy for trainSet
  predict_matrix <- model.matrix(~ ., data=trainSet, 
                                 contrasts.arg = lapply (trainSet[,sapply(trainSet,     is.factor)], contrasts, contrasts=FALSE))

  ## remove the status/time variables from the predictor matrix (x) for glmnet
  predict_matrix <- subset (predict_matrix, select=c(-time,-status))

  ## create a glmnet cox object using lasso regularization
  glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")

  # find lambda for which dev.ratio is max 
  max.dev.index <- which.max(glmnet.obj$dev.ratio) 
      optimal.lambda <- glmnet.obj$lambda[max.dev.index] 

  # take beta for optimal lambda 
  optimal.beta  <- glmnet.obj$beta[,max.dev.index] 

  # find non zero beta coef 
  nonzero.coef <- abs(optimal.beta)>0 
  selectedBeta <- optimal.beta[nonzero.coef] 

  # take only covariates for which beta is not zero 
  selectedVar   <- predict_matrix[,nonzero.coef] 

 # create a dataframe for trainSet with time, status and selected variables in binary      representation for evaluation in pec
 reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))

  # create coxph object with pre-defined coefficients 
   glmnet.cox <- coxph(surv_obj ~ selectedVar,data = reformat_dataSet,model=TRUE,      singular.ok=FALSE,init=selectedBeta,iter=0)


 ##------------------------------------------------------------------------------------   -------------------
  ##                                    MODEL PERFORMANCE
  ##-----------------------------------------------------------------------------------    --------------------
  ## 
  set.seed(17743)
  pec.f <- as.formula(Hist(time,status) ~ 1)

   ## Brier score for the cox-glmnet model
   brierGlmnet <- pec(list(glmnet.cox), data= reformat_dataSet , splitMethod="BootCV",     B=50, formula = pec.f)

Here is the result:

Prediction matrix has wrong dimensions:
368 rows and 318 columns.           --->(but there are only 313 predicting vars     in the dataset)
 But requested are predicted probabilities for
 138 subjects (rows) in newdata and 315 time points (columns) --- (315=313 predicting     vars+status+time)
This may happen when some covariate values are missing in newdata!?
Warning in FUN(1:2[[2L]], ...) :

Oddly, when I call print on the glmnet.cox object I get that it has 313 variables and not as pec outputs (318)

print(glmnet.cox)
Likelihood ratio test=0  on 313 df, p=1  n= 368, number of events= 288

Best Answer

pec accepts only R-objects for which a predictSurvProb method exists and glmnet is not such an object.

Currently, predictSurvProb methods are available for the following R-objects:
matrix
aalen, cox.aalen from library(timereg)
mfp from library(mfp)
phnnet, survnnet from library(survnnet)
rpart (from library(rpart))
coxph, survfit from library(survival)
cph, psm from library(rms)
prodlim from library(prodlim)
glm from library(stats)

For calculating the brier score for glmnet one needs to use the peperr package with the c060 library that wraps glmnet as an object suitable for peperr.

peperr_glmnet_noerror <- peperr(response=Surv(time, status), x=x, 
                    fit.fun=fit.glmnet, args.fit=list(family="cox"),
                    complexity=complexity.glmnet,args.complexity=list(family="cox"),
                    indices=resample.indices(n=length(time), method="boot", sample.n=10))

To get the integrated brier score for the entire model it seems one needs to use the ipec function but I still need to research that.

Many thanks to Thomas Hielscher who authored the c060 package and was extremely kind to help me with this.

Best Answer

Related Solutions

Solved – Using LASSO from lars (or glmnet) package in R for variable selection

Solved – how to work with time-dependent data in Lasso Cox regression in glmnet R package

Related Question