Solved – How to handle missing values in order to prepare data for feature selection with LASSO

data-imputation, lasso, r, spss

My situation:

  • small sample size: 116
  • binary outcome variable
  • long list of explanatory variables: 44
  • explanatory variables did not come from the top of my head; their choice was based on the literature.
  • most cases in the sample and most variables have missing values.

Approach to feature selection chosen: LASSO

R's glmnet package won't let me run the glmnet routine, apparently due to the existence of missing values in my data set. There seem to be various methods for handling missing data, so I would like to know:

  • Does LASSO impose any restriction in terms of the method of imputation that I can use?
  • What would be the best bet for imputation method? Ideally, I need a method that I could run on SPSS (preferably) or R.
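For context, a minimal sketch of the workflow I have in mind in R, assuming a data frame `mydata` with a binary `outcome` column (the object names are hypothetical, and `mice` is just one imputation package among several):

```r
# Sketch: glmnet requires a complete numeric matrix, so missing values
# must be imputed (or the rows dropped) before fitting the LASSO.
library(mice)    # multiple imputation by chained equations
library(glmnet)  # penalized regression

imp <- mice(mydata, m = 5, seed = 1)       # create 5 imputed data sets
completed <- complete(imp, action = 1)     # take the first completed set
x <- model.matrix(outcome ~ ., data = completed)[, -1]  # drop intercept column
y <- completed$outcome
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 is LASSO
```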

UPDATE1: It became clear from some of the answers below that I have to deal with more basic issues before considering imputation methods. I would like to add new questions here regarding that. On the answer suggesting coding as a constant value plus a new indicator variable in order to deal with 'not applicable' values, and the use of the group LASSO:

  • Would you say that if I use group LASSO, I would be able to apply the approach suggested for continuous predictors to categorical predictors as well? If so, I assume it would be equivalent to creating a new category – I am wary that this may introduce bias.
  • Does anyone know if R's glmnet package supports group LASSO? If not, would anyone suggest another package that does, in combination with logistic regression? Several options mentioning group LASSO can be found in the CRAN repository; any suggestions of the most appropriate for my case? Maybe SGL?

This is a follow-up on a previous question of mine (How to select a subset of variables from my original long list in order to perform logistic regression analysis?).

OBS: I am not a statistician.

Best Answer

When a continuous predictor $x$ contains 'not applicable' values it's often useful to code it using two variables:

$$ x_1=\begin{cases} c & \text{when $x$ is not applicable}\\ x & \text{otherwise} \end{cases} $$

where $c$ is a constant, &

$$ x_2=\begin{cases} 1 & \text{when $x$ is not applicable}\\ 0 & \text{otherwise} \end{cases} $$

Suppose the linear predictor for the response is given by

$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$

which resolves to

$$\eta = \beta_0 + \beta_1 x_1 + \ldots$$

when $x$ is measured, or to

$$\eta = \beta_0 + \beta_1 c + \beta_2 + \ldots$$

when $x$ is 'not applicable'. The choice of $c$ is arbitrary, & does not affect the estimates of the intercept $\beta_0$ or the slope $\beta_1$; $\beta_2$ describes the effect of $x$'s being 'not applicable' compared to when $x=c$.
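The coding above can be sketched in R, assuming `NA` marks 'not applicable' in a numeric vector `x` (a sketch under that assumption, with `c` taken as 0):

```r
# Two-variable coding for a 'not applicable' continuous predictor:
# x2 indicates non-applicability; x1 carries the value, with the
# arbitrary constant c substituted where x is not applicable.
c0 <- 0
x2 <- as.numeric(is.na(x))     # 1 when x is 'not applicable', else 0
x1 <- ifelse(is.na(x), c0, x)  # constant c where not applicable, else x
```

Both variables then enter the model together, matching the linear predictor $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$ above.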

This isn't a suitable approach when the response varies according to an unknown value of $x$: the variability of the 'missing' group will be inflated, & estimates of other predictors' coefficients biased owing to confounding. Better to impute missing values.

Use of LASSO introduces two problems:

  1. The choice of $c$ affects the results as the amount of shrinkage applied depends on the magnitudes of the coefficient estimates.
  2. You need to ensure that $x_1$ & $x_2$ are either both in or both out of the model selected.

You can solve both of these by instead using the group LASSO with a group comprising $x_1$ & $x_2$: the $L_1$-norm penalty is applied to the $L_2$-norm of the orthonormalized matrix $\left[\vec{x_1}\ \vec{x_2}\right]$. (Categorical predictors are the poster child for group LASSO—you'd just code 'not applicable' as a separate level, as often done in unpenalized regression.) See Meier et al (2008), JRSS B, 70, 1, "The group lasso for logistic regression" & grplasso.
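A sketch of fitting such a group with the grplasso package (the data objects `x1`, `x2`, `y` are hypothetical; consult the package documentation for your own setup):

```r
# Sketch: group LASSO logistic regression with grplasso,
# grouping x1 and x2 so they enter or leave the model together.
library(grplasso)

X <- cbind(1, x1, x2)               # design matrix with intercept column
index <- c(NA, 1, 1)                # NA = unpenalized intercept; group 1 = {x1, x2}
# Start from a fraction of the smallest lambda that zeroes all groups:
lam <- 0.5 * lambdamax(X, y = y, index = index, model = LogReg())
fit <- grplasso(X, y = y, index = index, lambda = lam, model = LogReg())
```

Here `y` is assumed to be coded 0/1, as `LogReg()` requires.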