Solved – Combining LASSO coefficients across imputed datasets

lasso, multiple-imputation

I am using the LASSO with multiply imputed datasets, and I am not sure how I should combine the coefficients obtained on the different imputed datasets. I could simply average them (as I would if I had computed the coefficients using ordinary least squares), but since the set of variables with non-zero coefficients is not necessarily the same across imputed datasets, a variable would end up in my model even if it is selected in only one of them. Another idea would be to take the average, but keep a variable only if its coefficient is non-zero in at least half (or some other proportion) of the imputed datasets. I would be really thankful for any suggestions or references on how to do this. (I do not know if this matters, but I am selecting the LASSO tuning parameter on each imputed dataset using cross-validation.)
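To make the two candidate rules concrete, here is a minimal sketch in plain Python. The function name `combine_lasso_coefs`, the `min_fraction` parameter, and the example numbers are all made up for illustration. It assumes the coefficients have already been fitted on each imputed dataset, and that zeros are kept in the average even when the threshold is applied (averaging only the non-zero estimates would be a different rule).

```python
def combine_lasso_coefs(coefs, min_fraction=0.5):
    """Pool LASSO coefficients from M imputed datasets.

    coefs: list of M lists, each holding the p coefficients fitted on one
           imputed dataset (hypothetical input, fitted elsewhere).
    min_fraction: keep a variable only if it is non-zero in at least this
           share of the datasets; 0.0 reduces to plain averaging.
    """
    m = len(coefs)
    p = len(coefs[0])
    pooled = []
    for j in range(p):
        column = [coefs[i][j] for i in range(m)]
        selected = sum(1 for b in column if b != 0.0)
        if selected > 0 and selected / m >= min_fraction:
            # Assumption: zeros stay in the mean, so rarely selected
            # variables are shrunk towards zero.
            pooled.append(sum(column) / m)
        else:
            pooled.append(0.0)
    return pooled


# Made-up coefficients: 4 imputed datasets, 3 variables.
coefs = [
    [1.2, 0.0, 0.0],
    [1.0, 0.3, 0.0],
    [0.8, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
plain = combine_lasso_coefs(coefs, min_fraction=0.0)        # simple average
thresholded = combine_lasso_coefs(coefs, min_fraction=0.5)  # "at least half" rule
print(plain)
print(thresholded)
```

With plain averaging the second variable survives with a small coefficient although it was selected only once; with the "at least half" rule it is dropped.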

Best Answer

I am by no means an expert, but found this while looking into this problem for my own work.

https://www.biostat.wisc.edu/sites/default/files/tr_217.pdf

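In a nutshell, the authors use the group lasso (reference below), where a "group" consists of the coefficients of the same variable across the imputed datasets. Because the group penalty sets whole groups to zero at once, a variable is either selected in every imputed dataset or dropped from all of them.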

ftp://ftp.stat.math.ethz.ch/Manuscripts/buhlmann/lukas-sara-peter.pdf
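As a rough sketch of the grouping idea described above, one way to write such a criterion for a linear model with M imputed datasets and p variables is shown below. The notation is my own, and the exact scaling and weights used in the linked report may differ.

$$
\min_{\beta^{(1)},\dots,\beta^{(M)}} \;
\sum_{m=1}^{M} \frac{1}{2n} \left\| y^{(m)} - X^{(m)} \beta^{(m)} \right\|_2^2
\; + \; \lambda \sum_{j=1}^{p} \sqrt{\sum_{m=1}^{M} \left( \beta_j^{(m)} \right)^2}
$$

Here $X^{(m)}$ and $y^{(m)}$ are the m-th imputed dataset, $\beta_j^{(m)}$ is the coefficient of variable j in that dataset, and $\lambda$ is a single tuning parameter shared by all datasets. The penalty acts on the Euclidean norm of each group $(\beta_j^{(1)},\dots,\beta_j^{(M)})$, so the whole group is shrunk to exactly zero together. Since the selected set is then the same in every dataset, averaging the $\beta_j^{(m)}$ over m would be one natural way to get a single pooled coefficient, though that last step is an assumption on my part.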

I have also seen it suggested somewhere that you simply include the zeros in the average, as with any other estimate. I'll post a reference if I find one.
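To illustrate what averaging the zeros in does to a single coefficient, here is a small sketch with made-up numbers (the values and variable names are hypothetical, not taken from any reference). It compares the mean over all imputed datasets with the mean over only those datasets where the variable was selected.

```python
# Hypothetical coefficients for ONE variable across five imputed datasets;
# the LASSO set it to exactly zero in three of them.
estimates = [0.40, 0.0, 0.0, 0.50, 0.0]

nonzero = [b for b in estimates if b != 0.0]

# Zeros treated like any other estimate: the pooled value is shrunk
# towards zero in proportion to how often the variable was dropped.
mean_with_zeros = sum(estimates) / len(estimates)

# For comparison only: mean over the datasets that selected the variable.
mean_nonzero_only = sum(nonzero) / len(nonzero)

# Share of imputed datasets in which the variable was selected.
selection_share = len(nonzero) / len(estimates)

print(mean_with_zeros, mean_nonzero_only, selection_share)
```

Note that the mean with zeros equals the non-zero mean multiplied by the selection share, so the pooled coefficient is non-zero whenever the variable is selected at least once.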
