Classification – Is LASSO Regularized Method Correct in R?

classification, cross-validation, glmnet, lasso, r

I am using CT scans to classify lung cancer into one of two types (Adenocarcinoma vs. Squamous carcinoma; we can abbreviate them A and B). I am applying LASSO-penalized logistic regression to a data set containing 756 CT-derived texture features and 12 radiologist-identified categorical (yes/no) features. The latter are supported by the literature for relevance, whereas the former are computer-generated with no prior proof of being useful. I have 107 cases, so my final data frame (df) is 107 x 768:

i) Texture features (mathematical quantities, n = 756) are continuous variables, scaled and centered. Their names are stored in ‘texVars’.

ii) Semantic features (qualitative features subjectively assessed by an experienced radiologist, n = 12). These are categorical binary inputs of the yes/no type. Their names are stored in ‘semVars’.

Following comments from the community on my original (very different) model (Is my high dimensional data logistic regression workflow correct?), I performed my logistic regression development in the following steps:

1) Feature selection: I used principal components analysis (PCA) to reduce the texture feature space from 756 to 30 dimensions, and kept the 4 most relevant (from the literature) semantic features. This gave me 34 final features. I used the following commands:

library(caret)
trans <- preProcess(df[, texVars], method = c("BoxCox", "center", "scale", "pca"), thresh = .95)  # PCA fitted on the texture columns only
pcs <- predict(trans, df[, texVars])   # 107 x 30 data frame of principal component scores (PC1-PC30)
neodf.sem <- df[, c("Tumour", "AirBronchogram", "Cavity", "GroundglassComponent", "Shape")]  # outcome + 4 semantic features (most relevant from prior knowledge)
neodf.tex <- cbind(Tumour = df$Tumour, pcs)   # outcome + 30 PCA scores
neodf2 <- cbind(neodf.sem, pcs)               # outcome + 4 semantic features + 30 PCA scores
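As a quick sanity check (not part of the original workflow, just a sketch using the objects created above), the dimensions can be confirmed as follows:

dim(pcs)        # expect 107 x 30 (principal component scores)
dim(neodf.sem)  # expect 107 x 5 (outcome + 4 semantic features)
dim(neodf2)     # expect 107 x 35 (outcome + 34 final features)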

2) Model development (LASSO) and penalty-term tuning with 10-fold cross-validation using cv.glmnet. Deviance was used as the measure of model quality. Using this method, I developed a model incorporating only semantic features, a second model incorporating only texture features, and a third model incorporating both semantic and texture features. Here are the commands:

# Converting to model.matrix for glmnet
xall <- model.matrix(Tumour ~ ., neodf2)[, -1]
xtex <- model.matrix(Tumour ~ ., neodf.tex)[, -1]
xsem <- model.matrix(Tumour ~ ., neodf.sem)[, -1]
y <- neodf2$Tumour
library(glmnet)
grid <- 10^seq(10, -2, length = 100)   # candidate lambda values

lasso.all <- cv.glmnet(xall,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance") 
lasso.tex <- cv.glmnet(xtex,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance")
lasso.sem <- cv.glmnet(xsem,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance")
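The odds-ratio tables in the results below were obtained by exponentiating the non-zero coefficients at the selected penalty. The exact extraction code is not shown above; a minimal sketch of one way to do it:

# Sketch: odds ratios from the non-zero coefficients at lambda.min
cf <- as.matrix(coef(lasso.sem, s = "lambda.min"))
or <- data.frame(Feature = rownames(cf), OddsRatio = exp(cf[, 1]))
or[cf[, 1] != 0, ]   # keep only terms that were not shrunk to zero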

3) Testing model classification accuracy on the entire dataset. The following is the backbone of the bootstrap used to generate 95% confidence intervals for predictive accuracy:

pred <- predict(lasso.all, newx = xall, s = "lambda.min", type = "class")
tabl <- table(pred,y)
sum(diag(prop.table(tabl)))
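The bootstrap wrapper itself is not shown; a minimal sketch of how it could look, simply resampling cases and reusing the already-fitted lasso.all (i.e. without refitting the model in each resample):

# Sketch: percentile-bootstrap 95% CI for classification accuracy (model not refitted)
set.seed(1)
acc <- replicate(1000, {
  idx <- sample(nrow(xall), replace = TRUE)
  p   <- predict(lasso.all, newx = xall[idx, ], s = "lambda.min", type = "class")
  mean(p == as.character(y[idx]))
})
quantile(acc, c(0.025, 0.975))   # percentile bootstrap confidence interval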

4) As an alternative to classification accuracy for assessing model performance, I computed the ROC area under the curve (AUC) on the entire dataset and compared AUCs from the different models using DeLong's method (pROC package).

The results are interesting:

 =================================================

LR MODEL BASED ON SEMANTIC FEATURES ALONE:
lasso.sem$lambda.min
0.01

[Plot: cross-validated lambda vs. binomial deviance (semantic features only)]

             Feature          Odds Ratio
1                (Intercept)  0.1292604
2      AirBronchogramPresent  0.1145378
3              CavityPresent 35.4350358
4 GroundglassComponentAbsent  4.3657928
5                 ShapeOvoid  2.4752881

AUC: 0.84


=================================================    

LR MODEL BASED ON TEXTURE FEATURES ALONE:

lasso.tex$lambda.min
1e+10   

[Plot: cross-validated lambda vs. binomial deviance (texture features only). Note how the 95% CIs all overlap.]

   Feature OddsRatio
1 (Intercept) 0.6461538



============================================================

LR MODEL BASED ON TEXTURE + SEMANTIC FEATURES:

lasso.all$lambda.min
0.05 

[Plot: cross-validated lambda vs. binomial deviance (texture + semantic features)]

                          Feature   OddsRatio
1                (Intercept)        0.3136489
2                       PC23        0.9404430
3                       PC27        0.8564001
4      AirBronchogramPresent        0.2691959
5              CavityPresent        6.7422427
6 GroundglassComponentAbsent        2.0514275
7                 ShapeOvoid        1.5974378

AUC: 0.88

[Plot: log(lambda) vs. coefficients; the dashed vertical line shows the cross-validated optimum lambda]

Having rejected the texture-only model, which contains only the intercept, I was left with two models: semantic-only and combined texture + semantic. I created ROC curves for both and compared them using DeLong's method:

library(pROC)
pred.sem <- predict(lasso.sem, newx = xsem, s = "lambda.min")   # linear predictor scores
pred.all <- predict(lasso.all, newx = xall, s = "lambda.min")

roc.sem <- roc(y, as.numeric(pred.sem), ci = TRUE, boot.n = 1000, conf.level = 0.95, boot.stratified = FALSE)

roc.all <- roc(y, as.numeric(pred.all), ci = TRUE, boot.n = 1000, conf.level = 0.95, boot.stratified = FALSE)
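The DeLong comparison that produced the output below was presumably run with pROC::roc.test, along these lines:

# Presumed call: DeLong test for two correlated ROC curves
roc.test(roc.sem, roc.all, method = "delong", paired = TRUE)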

Outputs of ROC analysis are:

data:  roc.sem and roc.all
Z = -2.1212, p-value = 0.0339
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8369963   0.8809524 

This shows that the combined model's ROC curve is significantly better, despite the modest improvement in AUC (0.84 vs. 0.88).
My questions are:

a) Is my methodology airtight from a publication point of view now? Apologies in advance for any gross errors in my presentation of this problem.

b) What formal inference can be drawn from the texture model being intercept-only?

c) If the texture model is useless, how do its variables become useful once added to the semantic features, yielding higher overall accuracy in the combined model? Perhaps that means the effect of the texture features alone is too small to be detected in this small dataset but becomes apparent when combined with a stronger predictor (i.e., the semantic features).

Any further comments are welcome.

Best Answer

Answering in reverse order:

c) Changes in apparent significance as other variables are added to or removed from a model are common in regression. Your suggested explanation is one possibility. See this discussion as one of several on this site that examine this phenomenon.

b) "Formal inference" in the way you might hope is difficult in LASSO. What you can say is what you found: the texture-only model found no predictor variables that performed better in terms of deviance than the null, intercept-only, model upon 10-fold cross validation. That would seem to be evidence against the usefulness of the texture variables on their own.

a) One should be cautious in judging whether methodology is "airtight from a publication point of view." That depends too much on the audience, the journal, the luck of the draw in editors and referees, and the quality of statistical review. You have taken some wise steps: decreasing dimensionality based on subject-matter knowledge (semantic features) or without reference to the outcome variable (principal components of texture features), using LASSO as a principled way to select predictors, with cross-validation to choose the penalty, and using well established software for pre-processing and analyzing the data.

The main difficulty I see is with interpretation. "Significance" is problematic in at least 2 ways. First, as with any variable-selection technique, the particular features selected by LASSO can change substantially among bootstrap samples from the same data, raising the issue of whether the selected features are in any way the "true" predictors. That might be less of an issue in your texture-only model with orthogonal predictors from PCA, but could arise in your last, combined, model in which there might be correlations of principal components of the textures with the semantic variables.
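If you want to gauge that instability yourself, one rough sketch (illustrative only, reusing the xall and y objects from your question) is to refit the cross-validated LASSO on bootstrap resamples and tabulate how often each feature is retained:

# Sketch: selection frequency of each feature across bootstrap refits
set.seed(1)
sel <- replicate(200, {
  idx <- sample(nrow(xall), replace = TRUE)
  fit <- cv.glmnet(xall[idx, ], y[idx], alpha = 1, family = "binomial")
  as.numeric(coef(fit, s = "lambda.min")[-1, ] != 0)
})
rownames(sel) <- colnames(xall)
sort(rowMeans(sel), decreasing = TRUE)   # proportion of resamples in which each feature is selected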

Second, although LASSO is well accepted as a useful way to select predictor variables, significance testing in LASSO does not yet seem to have generally agreed-upon methods. Lockhart et al. have proposed a covariance test statistic in the context of linear regression, with some indications that it could also work in logistic regression, which is available in the covTest package in R. There seems, however, to be substantial unresolved discussion about this approach, at least to the eyes of someone not well trained in mathematical statistics.

A few other issues to consider:

  1. The CIs in the cross-validation plot are based on the number of folds, so you shouldn't worry about the overlap. The choice of penalty based on minimum cross-validation deviance is a useful heuristic, but it shouldn't be thought of as a statistical significance test. In that respect it is similar to using the Akaike Information Criterion to select the best among competing models without claiming "significance" for the result.

  2. Accuracy is, perhaps counter-intuitively, not a very good way to gauge the success of a logistic regression. AUROC is certainly better, and there are others that you can consider. See, for example, this Cross Validated page.

  3. You might want to examine the order of entry of the variables as the penalty is relaxed in your final model, for example with the coefficient-path plot sketched below.
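One way to do that (a sketch using the fitted object from your question) is the coefficient-path plot of the underlying glmnet fit, which labels the variables as they enter while the penalty decreases:

plot(lasso.all$glmnet.fit, xvar = "lambda", label = TRUE)  # coefficient paths, labelled by variable
abline(v = log(lasso.all$lambda.min), lty = 2)             # cross-validated optimum penalty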