Solved – An R function that performs LASSO regression on multiple imputed datasets and pools the results together

coefficient of variation, lasso, multiple-imputation, pooling

I have a dataset with 283 observations of 60 variables. My outcome variable (Diagnosis) is dichotomous and can be either of two diseases. I am comparing two diseases that often show considerable overlap, and I am trying to find the features that can help differentiate them from each other. I understand that LASSO logistic regression is the best solution for this problem; however, it cannot be run on an incomplete dataset.

So I imputed my missing data with the MICE package in R and found that approximately 40 imputations is appropriate for the amount of missing data that I have.

Now I want to perform LASSO logistic regression on all 40 imputed datasets, but I am stuck at the part where I need to pool the results from these 40 datasets.

The with() function from MICE does not work with glmnet.

Impute database with missing values using MICE package:
R> imp <- mice(WMT1, m = 40)
Fit regular logistic regression on imputed data:
R> imp.fit <- glm.mids(Diagnosis ~ ., data = imp,
                       family = binomial)
Pool the results of all the 40 imputed datasets:
R> summary(pool(imp.fit), 2)

The above seems to work fine with logistic regression using glm(), but when I try the same approach to perform LASSO regression, I get an error.

First perform cross validation to find optimal lambda value:
R> CV <- cv.glmnet(Diagnosis~., data = imp, family = "binomial", alpha = 1,
nlambda = 100)

When I try to perform cross-validation, I get this error message:
Error in as.data.frame.default(data) :
cannot coerce class ‘"mids"’ to a data.frame

Can somebody help me with this problem?

Best Answer

There is both a technical and a conceptual problem here.

Technically, glm.mids() is designed as part of the mice package to work directly with multiply imputed datasets of class mids. The cv.glmnet() function from the glmnet package, in contrast, is only designed to handle a single dataset at a time. It has no way to handle a mids object, hence the error message.

To run cv.glmnet() on the individual imputed datasets, you can use the complete() function in mice to pull the imputed datasets out of a mids object one at a time or, as shown on this page, put them all into a single large data frame for subsetting into individual datasets. Then run cv.glmnet(), choose a penalty value, and get coefficients separately for each dataset.
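As a rough sketch of that per-dataset loop (not a definitive recipe): the data frame `dat` below is simulated and purely hypothetical, standing in for the asker's `WMT1`. The choice of `m = 5` and of `lambda.1se` as the penalty are assumptions made to keep the example small. Note that cv.glmnet() from glmnet takes a predictor matrix and a response rather than a formula, so model.matrix() is used to build the matrix.

```r
library(mice)
library(glmnet)

set.seed(1)

# Hypothetical stand-in data: binary outcome, numeric predictors, some NAs
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$Diagnosis <- factor(rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2)))
dat$x2[sample(n, 30)] <- NA
dat$x3[sample(n, 30)] <- NA

# Small number of imputations to keep the sketch fast (the question uses 40)
imp <- mice(dat, m = 5, printFlag = FALSE)

# Fit a cross-validated LASSO separately on each completed dataset
lasso_coefs <- lapply(seq_len(imp$m), function(i) {
  d <- mice::complete(imp, i)                       # i-th completed dataset
  x <- model.matrix(Diagnosis ~ ., data = d)[, -1]  # drop intercept column
  cv <- cv.glmnet(x, d$Diagnosis, family = "binomial", alpha = 1)
  # Coefficients at one possible penalty choice (assumption: lambda.1se)
  as.matrix(coef(cv, s = "lambda.1se"))[, 1]
})

# Rows = intercept and predictors, columns = imputations
coef_mat <- do.call(cbind, lasso_coefs)
```

Each column of `coef_mat` is the LASSO fit for one imputation; how to combine the columns is the conceptual problem discussed next.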

Conceptually, how do you properly combine the results from LASSO on the different imputations into a single final model? Each imputed dataset will typically provide a different set of predictors having non-zero coefficients. Putting the different imputations together in the context of LASSO is not an easy problem.

Some suggestions are to use group LASSO, or to work with the predictors having non-zero coefficients in all cases. Those pages have links to further details. It's not clear to me that any of these solutions is very satisfying. There is also the issue of how you intend to use your model: for example, what happens if some of the selected variables are missing when predicting future cases?
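To illustrate the second suggestion, here is a minimal sketch that reads "in all cases" as "in every imputed dataset" (my interpretation, not something the linked pages are confirmed to say). The coefficient matrix is made up for illustration, with one row per predictor and one column per imputation; in practice it would come from the per-imputation LASSO fits.

```r
# Made-up LASSO coefficients: rows = predictors, columns = imputations
coef_mat <- matrix(
  c( 0.8,  0.7,  0.9,
     0.0,  0.2,  0.0,
    -0.5, -0.4, -0.6,
     0.0,  0.0,  0.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("x1", "x2", "x3", "x4"), paste0("imp", 1:3))
)

# Proportion of imputations in which each predictor was selected
sel_freq <- rowMeans(coef_mat != 0)

# Predictors with a non-zero coefficient in every imputation
always_selected <- names(sel_freq)[sel_freq == 1]
```

Looking at `sel_freq` rather than only at `always_selected` also shows which predictors are selected in most, but not all, imputations. That is one way to see how unstable the selection is.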

With only 60 variables in your model, you could consider doing ridge regression rather than LASSO in cv.glmnet() by setting alpha=0 in the function call. This will provide penalized coefficients for all 60 of your predictor variables for each imputed data set, which can then easily be combined to give means and standard deviations of coefficient values across the imputations. As this seems to be an initial exploratory study, the results from ridge regression on the multiply-imputed datasets might better point the way to designing prospective studies on the best way to make this clinical distinction.
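A minimal sketch of the ridge approach, again on simulated, hypothetical data rather than the asker's `WMT1`: `m = 5` and `lambda.min` are assumptions chosen for brevity. The standard deviation computed here only describes the spread of the coefficients across imputations; it is not a pooled standard error in the sense of Rubin's rules.

```r
library(mice)
library(glmnet)

set.seed(2)

# Hypothetical stand-in data with missing values in one predictor
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$Diagnosis <- factor(rbinom(n, 1, plogis(dat$x1 - dat$x2)))
dat$x2[sample(n, 40)] <- NA

imp <- mice(dat, m = 5, printFlag = FALSE)

# Ridge regression (alpha = 0) on each completed dataset;
# sapply() collects the coefficient vectors as columns of a matrix
ridge_coefs <- sapply(seq_len(imp$m), function(i) {
  d <- mice::complete(imp, i)
  x <- model.matrix(Diagnosis ~ ., data = d)[, -1]
  cv <- cv.glmnet(x, d$Diagnosis, family = "binomial", alpha = 0)
  as.matrix(coef(cv, s = "lambda.min"))[, 1]
})

# Mean and between-imputation SD of each coefficient
pooled <- data.frame(
  mean = rowMeans(ridge_coefs),
  sd   = apply(ridge_coefs, 1, sd)
)
```

Because ridge keeps every predictor in the model, each row of `pooled` is defined for all imputations, which is what makes this simple averaging possible.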