When a continuous predictor $x$ contains 'not applicable' values it's often useful to code it using two variables:
$$
x_1=\left\{
\begin{array}{ll}
c & \text{when $x$ is not applicable}\\
x & \text{otherwise}
\end{array}
\right.
$$
where $c$ is a constant, &
$$
x_2=\left\{
\begin{array}{ll}
1 & \text{when $x$ is not applicable}\\
0 & \text{otherwise}
\end{array}
\right.
$$
Suppose the linear predictor for the response is given by
$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$
which resolves to
$$\eta = \beta_0 + \beta_1 x_1 + \ldots$$
when $x$ is measured, or to
$$\eta = \beta_0 + \beta_1 c + \beta_2 + \ldots$$
when $x$ is 'not applicable'. The choice of $c$ is arbitrary, & does not affect the estimates of the intercept $\beta_0$ or the slope $\beta_1$; $\beta_2$ describes the effect of $x$'s being 'not applicable' compared to when $x=c$.
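A minimal numeric sketch of this invariance (simulated data & ordinary least squares via NumPy; all names & values are illustrative): fitting with two different choices of $c$ leaves $\hat\beta_0$ & $\hat\beta_1$ unchanged, while $\hat\beta_2$ shifts by exactly $\hat\beta_1(c - c')$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate: x observed for 80 cases, 'not applicable' for 20
n_obs, n_na = 80, 20
x = rng.normal(size=n_obs)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n_obs)
y_na = rng.normal(loc=3.0, scale=0.5, size=n_na)  # response when x is N/A
y = np.concatenate([y_obs, y_na])

def fit(c):
    """OLS fit of y on (1, x1, x2) using the constant c for N/A cases."""
    x1 = np.concatenate([x, np.full(n_na, c)])
    x2 = np.concatenate([np.zeros(n_obs), np.ones(n_na)])
    X = np.column_stack([np.ones(len(y)), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_c0 = fit(0.0)
b_c5 = fit(5.0)

# Intercept & slope agree; only beta2 changes, by beta1 * (0 - 5)
print(b_c0[:2], b_c5[:2])
```

This works because the $x_2$ indicator absorbs any change in $c$: the N/A rows only pin down the combination $\beta_0 + \beta_1 c + \beta_2$, so $\beta_0$ & $\beta_1$ are determined by the observed rows alone.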
This isn't a suitable approach when the response varies according to an unknown value of $x$ (i.e. when $x$ is missing rather than genuinely not applicable): the variability of the 'missing' group will be inflated, & estimates of other predictors' coefficients biased owing to confounding with the missingness indicator. Better to impute missing values.
Use of LASSO introduces two problems:
- The choice of $c$ affects the results as the amount of shrinkage applied depends on the magnitudes of the coefficient estimates.
- You need to ensure that $x_1$ & $x_2$ are either both in or both out of the model selected.
You can solve both of these by instead using the group LASSO with a group comprising $x_1$ & $x_2$: the $L_1$-norm penalty is applied to the $L_2$-norm of the coefficients of the orthonormalized matrix $\left[\vec{x}_1\ \vec{x}_2\right]$, so the two columns are penalized, & selected, as a unit. (Categorical predictors are the poster child for group LASSO: you'd just code 'not applicable' as a separate level, as is often done in unpenalized regression.) See Meier et al. (2008), "The group lasso for logistic regression", JRSS B, 70, 1, & the grplasso R package.
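To see why grouping fixes the all-in-or-all-out problem, here's a hedged Python sketch of the group LASSO via proximal gradient descent (toy data, no within-group orthonormalization as grplasso would do; everything here is illustrative, not grplasso's implementation). The key step is the *block* soft-threshold, which zeroes a whole group's coefficients together:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: column 0 is an ordinary predictor; columns 1 & 2 play the
# role of the (x1, x2) pair coding a 'not applicable' predictor.
n = 100
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] + rng.normal(size=n)   # the (x1, x2) group is pure noise

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Proximal gradient for 0.5*||y - Xb||^2 + lam * sum_g ||b_g||_2."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / L      # gradient step
        for g in groups:
            g = list(g)
            norm = np.linalg.norm(z[g])
            # block soft-threshold: the whole group shrinks, or dies, together
            z[g] = 0.0 if norm == 0 else max(0.0, 1 - lam / (L * norm)) * z[g]
        b = z
    return b

b = group_lasso(X, y, groups=[(0,), (1, 2)], lam=50.0)
print(b)   # columns 1 & 2 are either both zero or both nonzero
```

Because the penalty acts on the group's joint $L_2$-norm, $x_1$ & $x_2$ can never be selected separately, & (after orthonormalization, omitted above) the result no longer depends on the arbitrary choice of $c$.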
Best Answer
`preProcess` does not return imputed values; it simply sets up the whole imputation model based on the provided data. So you need to run `predict` (which also requires the `RANN` package), but even if you do so with your artificial data you'll get an error:

```
Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point
```

as the imputation cannot work in rows where both your predictors are `NA`.
Here's a demonstration with only 20 rows, for clarity and easy inspection:
When viewing the result, keep in mind that the methods
`"center"` and `"scale"`
have been automatically added to your preprocessing, even if you did not invoke them explicitly:
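For comparison, the same centre/scale + k-NN imputation workflow can be sketched in Python, with scikit-learn's `KNNImputer` standing in for caret's `knnImpute` (all data here are made up for illustration; note no row is entirely missing, which is the case caret's error message above complains about):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# 20 rows, 2 predictors, with a few scattered NAs
X = rng.normal(loc=[10.0, 100.0], scale=[2.0, 20.0], size=(20, 2))
X[3, 0] = np.nan
X[7, 1] = np.nan
X[12, 0] = np.nan

# caret centres & scales before k-NN imputation; do the same here.
# StandardScaler ignores NaNs when fitting and passes them through transform.
scaler = StandardScaler()
Z = scaler.fit_transform(X)
Z_imp = KNNImputer(n_neighbors=5).fit_transform(Z)   # fill NAs from neighbours
X_imp = scaler.inverse_transform(Z_imp)              # back on the original scale

print(np.round(X_imp, 2))
```

Observed entries pass through unchanged; only the `NA` cells are filled, using the five nearest rows in the standardized space.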