Classification – Is LASSO Regularized Method Correct in R?

classification, cross-validation, glmnet, lasso, r

I am using CT scans to classify lung cancer into one of two types (Adenocarcinoma vs. Squamous carcinoma; we can abbreviate them A and B). I am applying LASSO-penalized logistic regression to a data set containing 756 CT-derived texture features and 12 radiologist-identified categorical (yes/no) features. The latter are supported by the literature for relevance, whereas the former are computer-generated with no prior proof of being useful. I have 107 cases, so my final data frame (df) is 107 x 768:

i) Texture features (mathematical quantities, n = 756) are continuous variables, scaled and centered. Their names are stored in ‘texVars’.

ii) Semantic features (qualitative features subjectively assessed by an experienced radiologist, n = 12). These are categorical binary inputs of the yes/no type. Their names are stored in ‘semVars’.

Following comments from the community on my original (very different) model (Is my high dimensional data logistic regression workflow correct?), I performed my logistic regression development in the following steps:

1) Feature selection: I used principal components analysis (PCA) to reduce the texture feature space from 756 to 30 dimensions, and kept the 4 most relevant (from the literature) semantic features. This gave me 34 final features. I used the following commands:

library(caret)
trans <- preProcess(df[, texVars], method = c("BoxCox", "center", "scale", "pca"), thresh = .95)  # PCA fitted on the texture columns only
pcs <- predict(trans, df[, texVars])   # 107 x 30 data frame of principal component scores (PC1-PC30)
neodf.sem <- df[, c("Tumour", "AirBronchogram", "Cavity", "GroundglassComponent", "Shape")]  # outcome + 4 semantic features (most relevant from prior knowledge)
neodf.tex <- cbind(Tumour = df$Tumour, pcs)   # outcome + 30 PCA scores
neodf2 <- cbind(neodf.sem, pcs)               # outcome + 4 semantic features + 30 PCA scores
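As a quick sanity check (not part of the original workflow, just a sketch using the objects created above), the dimensions can be confirmed as follows:

dim(pcs)        # expect 107 x 30 (principal component scores)
dim(neodf.sem)  # expect 107 x 5 (outcome + 4 semantic features)
dim(neodf2)     # expect 107 x 35 (outcome + 34 final features)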

2) Model development (LASSO) and penalty-term tuning with 10-fold cross-validation using cv.glmnet. Deviance was used as the measure of model quality. Using this method, I developed a model incorporating only semantic features, a second model incorporating only texture features, and a third model incorporating both semantic and texture features. Here are the commands:

# Converting to model.matrix for glmnet
xall <- model.matrix(Tumour ~ ., neodf2)[, -1]
xtex <- model.matrix(Tumour ~ ., neodf.tex)[, -1]
xsem <- model.matrix(Tumour ~ ., neodf.sem)[, -1]
y <- neodf2$Tumour
library(glmnet)
grid <- 10^seq(10, -2, length = 100)   # candidate lambda values

lasso.all <- cv.glmnet(xall,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance") 
lasso.tex <- cv.glmnet(xtex,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance")
lasso.sem <- cv.glmnet(xsem,y,alpha=1,lambda=grid, family="binomial", type.measure="deviance")
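The odds-ratio tables in the results below were obtained by exponentiating the non-zero coefficients at the selected penalty. The exact extraction code is not shown above; a minimal sketch of one way to do it:

# Sketch: odds ratios from the non-zero coefficients at lambda.min
cf <- as.matrix(coef(lasso.sem, s = "lambda.min"))
or <- data.frame(Feature = rownames(cf), OddsRatio = exp(cf[, 1]))
or[cf[, 1] != 0, ]   # keep only terms that were not shrunk to zero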

3) Testing model classification accuracy on the entire dataset. The following is the backbone of the bootstrap used to generate 95% confidence intervals for predictive accuracy:

pred <- predict(lasso.all, newx = xall, s = "lambda.min", type = "class")
tabl <- table(pred,y)
sum(diag(prop.table(tabl)))
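The bootstrap wrapper itself is not shown; a minimal sketch of how it could look, simply resampling cases and reusing the already-fitted lasso.all (i.e. without refitting the model in each resample):

# Sketch: percentile-bootstrap 95% CI for classification accuracy (model not refitted)
set.seed(1)
acc <- replicate(1000, {
  idx <- sample(nrow(xall), replace = TRUE)
  p   <- predict(lasso.all, newx = xall[idx, ], s = "lambda.min", type = "class")
  mean(p == as.character(y[idx]))
})
quantile(acc, c(0.025, 0.975))   # percentile bootstrap confidence interval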

4) As an alternative to classification accuracy for assessing model performance, I computed the ROC area under the curve (AUC) on the entire dataset and compared AUCs from the different models using DeLong's method (pROC package).

The results are interesting:

 =================================================

LR MODEL BASED ON SEMANTIC FEATURES ALONE:
lasso.sem$lambda.min
0.01

[Plot: cross-validated lambda vs. binomial deviance (semantic features only)]

             Feature          Odds Ratio
1                (Intercept)  0.1292604
2      AirBronchogramPresent  0.1145378
3              CavityPresent 35.4350358
4 GroundglassComponentAbsent  4.3657928
5                 ShapeOvoid  2.4752881

AUC: 0.84


=================================================    

LR MODEL BASED ON TEXTURE FEATURES ALONE:

lasso.tex$lambda.min
1e+10   

[Plot: cross-validated lambda vs. binomial deviance (texture features only). Note how the 95% CIs all overlap.]

   Feature OddsRatio
1 (Intercept) 0.6461538



============================================================

LR MODEL BASED ON TEXTURE + SEMANTIC FEATURES:

lasso.all$lambda.min
0.05 

[Plot: cross-validated lambda vs. binomial deviance (texture + semantic features)]

                          Feature   OddsRatio
1                (Intercept)        0.3136489
2                       PC23        0.9404430
3                       PC27        0.8564001
4      AirBronchogramPresent        0.2691959
5              CavityPresent        6.7422427
6 GroundglassComponentAbsent        2.0514275
7                 ShapeOvoid        1.5974378

AUC: 0.88

[Plot: log(lambda) vs. coefficients; the dashed vertical line shows the cross-validated optimum lambda]

Having rejected the texture-only model, which contains only the intercept, I was left with two models: semantic-only and combined texture + semantic. I created ROC curves for both and compared them using DeLong's method:

library(pROC)
pred.sem <- predict(lasso.sem, newx = xsem, s = "lambda.min")   # linear predictor scores
pred.all <- predict(lasso.all, newx = xall, s = "lambda.min")

roc.sem <- roc(y, as.numeric(pred.sem), ci = TRUE, boot.n = 1000, conf.level = 0.95, boot.stratified = FALSE)

roc.all <- roc(y, as.numeric(pred.all), ci = TRUE, boot.n = 1000, conf.level = 0.95, boot.stratified = FALSE)
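The DeLong comparison that produced the output below was presumably run with pROC::roc.test, along these lines:

# Presumed call: DeLong test for two correlated ROC curves
roc.test(roc.sem, roc.all, method = "delong", paired = TRUE)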

Outputs of ROC analysis are:

data:  roc.sem and roc.all
Z = -2.1212, p-value = 0.0339
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8369963   0.8809524 

This shows that the combined model's ROC curve is significantly better, despite the modest improvement in AUC (0.84 vs. 0.88).
My questions are:

a) Is my methodology airtight from a publication point of view now? Apologies in advance for any gross errors in my presentation of this problem.

b) What formal inference can be drawn from the texture model being intercept-only?

c) If the texture model is useless, how do its variables become useful once added to the semantic features, yielding higher overall accuracy in the combined model? Perhaps that means the effect of the texture features alone is too small to be detected in this small dataset but becomes apparent when combined with a stronger predictor (i.e., the semantic features).

Any further comments are welcome.

Best Answer

Answering in reverse order:

c) Changes in apparent significance as other variables are added to or removed from a model are common in regression. Your suggested explanation is one possibility. See this discussion as one of several on this site that examine this phenomenon.

b) "Formal inference" in the way you might hope is difficult in LASSO. What you can say is what you found: the texture-only model found no predictor variables that performed better in terms of deviance than the null, intercept-only, model upon 10-fold cross validation. That would seem to be evidence against the usefulness of the texture variables on their own.

a) One should be cautious in judging whether methodology is "airtight from a publication point of view." That depends too much on the audience, the journal, the luck of the draw in editors and referees, and the quality of statistical review. You have taken some wise steps: decreasing dimensionality based on subject-matter knowledge (semantic features) or without reference to the outcome variable (principal components of texture features), using LASSO as a principled way to select predictors, with cross-validation to choose the penalty, and using well established software for pre-processing and analyzing the data.

The main difficulty I see is with interpretation. "Significance" is problematic in at least 2 ways. First, as with any variable-selection technique, the particular features selected by LASSO can change substantially among bootstrap samples from the same data, raising the issue of whether the selected features are in any way the "true" predictors. That might be less of an issue in your texture-only model with orthogonal predictors from PCA, but could arise in your last, combined, model in which there might be correlations of principal components of the textures with the semantic variables.
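If you want to gauge that instability yourself, one rough sketch (illustrative only, reusing the xall and y objects from your question) is to refit the cross-validated LASSO on bootstrap resamples and tabulate how often each feature is retained:

# Sketch: selection frequency of each feature across bootstrap refits
set.seed(1)
sel <- replicate(200, {
  idx <- sample(nrow(xall), replace = TRUE)
  fit <- cv.glmnet(xall[idx, ], y[idx], alpha = 1, family = "binomial")
  as.numeric(coef(fit, s = "lambda.min")[-1, ] != 0)
})
rownames(sel) <- colnames(xall)
sort(rowMeans(sel), decreasing = TRUE)   # proportion of resamples in which each feature is selected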

Second, although LASSO is well accepted as a useful way to select predictor variables, significance testing in LASSO does not yet seem to have generally agreed-upon methods. Lockhart et al. have proposed a covariance test statistic in the context of linear regression, with some indications that it could also work in logistic regression, which is available in the covTest package in R. There seems, however, to be substantial unresolved discussion about this approach, at least to the eyes of someone not well trained in mathematical statistics.

A few other issues to consider:

  1. The CIs in the cross-validation plot are based on the number of folds, so you shouldn't worry about the overlap. The choice of penalty based on minimum cross-validation deviance is a useful heuristic, but it shouldn't be thought of as a statistical significance test. In that respect it is similar to using the Akaike Information Criterion to select the best among competing models without claiming "significance" for the result.

  2. Accuracy is, perhaps counter-intuitively, not a very good way to gauge the success of a logistic regression. AUROC is certainly better, and there are others that you can consider. See, for example, this Cross Validated page.

  3. You might want to examine the order of entry of the variables as the penalty is relaxed in your final model, for example with the coefficient-path plot sketched below.
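One way to do that (a sketch using the fitted object from your question) is the coefficient-path plot of the underlying glmnet fit, which labels the variables as they enter while the penalty decreases:

plot(lasso.all$glmnet.fit, xvar = "lambda", label = TRUE)  # coefficient paths, labelled by variable
abline(v = log(lasso.all$lambda.min), lty = 2)             # cross-validated optimum penalty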