Solved – lasso with many missing values and categorical variables

lasso, missing-data, r

I have a dataset with many missing values and a mix of continuous and categorical variables. I want to use something like the group lasso for feature selection. The outcome is probably binary (0/1), so a group-lasso logistic regression seems the most sensible choice.

My problem is the very large number of missing values. Deleting incomplete rows is not an option.

Is there an R implementation that can be used like the lasso and that handles missing values and categorical variables at the same time?

A possible solution has been proposed here but it does not refer to any R package.

Best Answer

Multiple imputation provides a way to deal with the missing values; the R packages Hmisc and mice both implement it. You can then run the lasso on each of the imputed data sets (which no longer contain missing values) and identify the predictors that are selected most frequently across them. Having both categorical and continuous variables should not be a problem for any of the R lasso packages, but be sure to normalize the variables before applying the lasso, so that differences in scale among the variables (and the resulting scale-dependent differences in regression coefficients) do not lead to erroneous results.
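A minimal sketch of that workflow, assuming mice for the imputation and glmnet for a lasso logistic fit (the answer names Hmisc and mice but no particular lasso package, so glmnet is my choice here). The toy data frame `dat`, its column names, and the number of imputations are hypothetical and only there to make the example runnable.

```r
# Sketch only: impute with mice, fit a lasso logistic regression on each
# completed data set with glmnet, then count how often each term is kept.
# `dat` and its columns (x1, x2, g, y) are made-up illustration data.
library(mice)
library(glmnet)

set.seed(1)
n <- 200
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1))  # binary 0/1 outcome

# Make about 20% of each predictor missing
for (v in c("x1", "x2", "g")) dat[[v]][sample(n, 40)] <- NA

# Multiple imputation: m completed copies of the data
m <- 5
imp <- mice(dat, m = m, printFlag = FALSE, seed = 1)

# Lasso on each completed data set; return the names of nonzero terms
selected <- lapply(seq_len(m), function(i) {
  d <- mice::complete(imp, i)
  # model.matrix dummy-codes the factor; drop the intercept column
  x <- model.matrix(y ~ ., data = d)[, -1]
  # alpha = 1 is the lasso; glmnet standardizes x by default
  fit <- cv.glmnet(x, d$y, family = "binomial", alpha = 1)
  b <- as.matrix(coef(fit, s = "lambda.1se"))[-1, 1]
  names(b)[b != 0]
})

# Selection frequency of each term across the m imputations
term_names <- colnames(model.matrix(y ~ ., data = mice::complete(imp, 1)))[-1]
freq <- table(factor(unlist(selected), levels = term_names))
print(freq)
```

Note that glmnet penalizes each dummy column separately, so individual levels of a factor can be kept or dropped on their own. If a factor's levels should enter or leave the model together, as in the group lasso the question mentions, a group-lasso package such as grpreg or gglasso could take glmnet's place in the loop; I have not shown that variant here.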

For more details, other suggestions, and references, see the earlier discussion "How to handle with missing values in order to prepare data for feature selection with LASSO?".
