When a continuous predictor $x$ contains 'not applicable' values it's often useful to code it using two variables:
$$
x_1=\begin{cases}
c & \text{when $x$ is not applicable}\\
x & \text{otherwise}
\end{cases}
$$
where $c$ is a constant, &
$$
x_2=\begin{cases}
1 & \text{when $x$ is not applicable}\\
0 & \text{otherwise}
\end{cases}
$$
Suppose the linear predictor for the response is given by
$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$
which resolves to
$$\eta = \beta_0 + \beta_1 x_1 + \ldots$$
when $x$ is measured, or to
$$\eta = \beta_0 + \beta_1 c + \beta_2 + \ldots$$
when $x$ is 'not applicable'. The choice of $c$ is arbitrary & does not affect the estimates of the intercept $\beta_0$ or the slope $\beta_1$; $\beta_2$ describes the effect of $x$'s being 'not applicable' compared to when $x=c$.
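As a concrete illustration, a minimal R sketch of this coding, taking $c=0$ (the variable names are mine, not from the question):

```r
x  <- c(1.2, NA, 3.4, NA, 2.2, 0.7)  # NA marks 'not applicable'
c0 <- 0                              # the arbitrary constant c
x1 <- ifelse(is.na(x), c0, x)
x2 <- as.numeric(is.na(x))
# e.g. fit <- glm(y ~ x1 + x2 + ...): beta2 then compares the
# 'not applicable' cases with cases where x = c0
```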
This isn't a suitable approach when the response in fact varies with an unknown (truly missing) value of $x$: the variability of the 'missing' group will be inflated, & the estimates of other predictors' coefficients will be biased owing to confounding. In that case it's better to impute the missing values.
Use of LASSO introduces two problems:
- The choice of $c$ affects the results as the amount of shrinkage applied depends on the magnitudes of the coefficient estimates.
- You need to ensure that $x_1$ & $x_2$ are either both in or both out of the model selected.
You can solve both of these by instead using the group LASSO with a group comprising $x_1$ & $x_2$: the $L_1$-norm penalty is applied to the $L_2$-norm of the coefficients of the orthonormalized matrix $\left[\vec{x_1}\ \vec{x_2}\right]$. (Categorical predictors are the poster child for the group LASSO: you'd just code 'not applicable' as a separate level, as is often done in unpenalized regression.) See Meier et al. (2008), "The group lasso for logistic regression", JRSS B, 70(1), & the R package grplasso.
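Here's a minimal sketch of that grouping with the grplasso package; the simulated data, the choice $c=0$, & the factor 0.5 on the maximal lambda are all illustrative assumptions:

```r
library(grplasso)

set.seed(1)
n <- 200
x <- rnorm(n)
x[sample(n, 40)] <- NA                 # 'not applicable' cases
x1 <- ifelse(is.na(x), 0, x)           # c = 0 here
x2 <- as.numeric(is.na(x))
z  <- rnorm(n)                         # some other predictor
y  <- rbinom(n, 1, plogis(0.5 * x1 + x2 - 0.3 * z))

X <- cbind(1, x1, x2, z)               # first column = intercept
index <- c(NA, 1, 1, 2)                # NA = unpenalized; x1 & x2 share group 1

lam <- 0.5 * lambdamax(X, y, index = index, model = LogReg())
fit <- grplasso(X, y, index = index, lambda = lam, model = LogReg())
coef(fit)                              # x1 & x2 enter or leave together
```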
I believe what you want to do is perform data imputation. Here is a good, quick (16-page) PDF on imputation from Columbia.
Generally, if you have a large enough data set and your NAs/NaNs account for only ~10% of your data, you can simply remove the affected rows. If removing data won't work for you, then you should look into imputation. Simple approaches include replacing each missing value with the column mean or, if the distribution is heavily skewed, the median. A better approach is to perform regression or nearest-neighbor imputation on the column to predict the missing values. Then continue with your analysis/model.
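A quick sketch of the two simple versions in R, with made-up numbers:

```r
x <- c(2.1, NA, 3.7, 4.0, NA, 2.8)                       # toy column with NAs
x_mean   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)   # mean imputation
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x) # median, for skewed data
```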
Another approach would be to build a random forest classifier. Random forest models can deal with missing data naturally by ignoring them when deciding splits. Berkeley has a good write-up on random forests. If you choose to go down this road, there is also a good paper discussing NAs in tree-based models: "An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data" by Ding and Simonoff.
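If you do this in R, note that the randomForest package won't accept NAs in the predictors directly; a minimal sketch using its rfImpute() helper (the iris-based toy data is mine, not from the original answer):

```r
library(randomForest)

set.seed(42)
ir <- iris
ir[sample(nrow(ir), 10), 1] <- NA        # punch some holes in a toy data set
imputed <- rfImpute(Species ~ ., data = ir)  # fill NAs via RF proximities
fit <- randomForest(Species ~ ., data = imputed)
```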
If you are using Python, the SciPy library has interpolation functions that produce data points within a range of known discrete data points. This is another way to fill in missing data.
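For comparison, a minimal sketch of the same idea in R, using base approx() for linear interpolation (toy numbers; in Python the analogous tool would be scipy.interpolate.interp1d):

```r
t <- 1:6
v <- c(2.0, 2.4, NA, 3.1, 3.5, 4.0)                 # one interior value missing
v_filled <- approx(t[!is.na(v)], v[!is.na(v)], xout = t)$y  # linear fill
```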
Hope this helps!
Best Answer
Multiple imputation of the missing data provides a way to deal with the missing values; the R packages Hmisc and mice provide methods. You could then perform lasso on each of the imputed data sets (which now have no missing data), and determine the predictor variables that are most frequently returned. There should be no problems with having both categorical and continuous variables in your data with any of the R packages for lasso, but be sure to normalize the variables before you apply lasso so that differences in scaling among the variables (and thus scale-dependent differences in regression coefficients) don't lead to erroneous results. For more details, other suggestions, and references, see the earlier discussion How to handle with missing values in order to prepare data for feature selection with LASSO?.
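A sketch of that workflow with mice and glmnet; the data frame mydata, the continuous response y, and m = 20 imputations are all placeholders:

```r
library(mice)
library(glmnet)

imp <- mice(mydata, m = 20, printFlag = FALSE)   # 'mydata': your data frame

selected <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)                          # i-th completed data set
  X <- scale(model.matrix(y ~ . - 1, data = d))  # normalize the predictors
  cv <- cv.glmnet(X, d$y, alpha = 1)             # lasso, lambda chosen by CV
  as.numeric(as.vector(coef(cv, s = "lambda.min"))[-1] != 0)
})

rowMeans(selected)  # how often each predictor is selected across imputations
```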