R – Should Missing Data Be Imputed Before or After Feature Selection?

feature selection, mice, missing data, multiple-imputation, r

Will the results of the feature selection be biased if I perform the feature selection before imputing missing data?

I have a large data set of 20,000 samples and 130 variables. The data set consists of binary, numeric, and ordinal variables. The outcome variable is binary.

I want to do two things:

1) Feature selection to determine the most important variables
2) Build a predictive model with SVM, Random Forest, and Logistic Regression.

The complete-case data set contains 70% of the original data (i.e., if I keep only samples with no missing values, I'm left with 70% of the samples).

I am using MICE in R to impute the missing data.
Following some guidelines I found in this paper, I plan to impute 30 data sets. (I estimate the fraction of missing information using the percentage of incomplete cases, which is 30%. This is where the 30 comes from.)
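The rule of thumb described above (one imputed data set per percent of incomplete cases) can be sketched in base R. The data frame `df` and its columns are made up for illustration and stand in for the real data; this only shows the arithmetic, not a recommendation for a particular value of m.

```r
# Toy data frame standing in for the real data (hypothetical values).
set.seed(1)
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y  = rbinom(100, 1, 0.5)
)
df$x1[sample(100, 20)] <- NA   # make 20 values of x1 missing
df$x2[sample(100, 15)] <- NA   # make 15 values of x2 missing

# Percentage of rows with at least one missing value.
pct_incomplete <- 100 * mean(!complete.cases(df))

# Rule of thumb from the question: number of imputations is roughly
# the percentage of incomplete cases (rounded up, at least 1).
m <- max(1L, as.integer(ceiling(pct_incomplete)))
```

With 30% incomplete cases, as in the question, this gives m = 30.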

This is computationally intensive and will take too long. If I take only the top 10 predictors and impute this smaller data set, I will be able to impute my 30 data sets as desired in a reasonable amount of time.

I cannot assume the data are Missing Completely at Random (MCAR). Most variables are Missing at Random (MAR) where the missing values can be modeled from existing data.

Will the results of the feature selection be biased because of missingness in the data?

Best Answer

You should consider a couple of factors before you make your decision.

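  1. Having a lot of covariates in the imputation model makes the MAR assumption more plausible for each covariate, because the missingness is more likely to be explained by variables you actually observed.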

  2. Will you be using the imputations at a later time for a different analysis? If so, you might want to keep everything.

  3. It takes a while to run MICE, but once you have done it (correctly), you never need to touch it again. It also takes no effort to run: you can just let it run overnight. Checking the result (for convergence and validity) takes some time, but it's not like you need to sit around for hours watching it.
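A minimal sketch of the run, check, and save workflow from point 3, assuming the mice package is installed. It uses the small `nhanes` example data that ships with mice rather than the asker's data, and the values of `m`, `maxit`, and the output file name are illustrative choices, not recommendations.

```r
library(mice)

# nhanes is a small example data set bundled with mice.
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Convergence check: trace plots of the mean and SD of the imputed
# values across iterations. The streams should mix with no clear trend.
plot(imp)

# Validity check: compare the distribution of imputed bmi values
# with the observed ones.
densityplot(imp, ~ bmi)

# Save the imputations so the slow step never has to be rerun.
out_file <- file.path(tempdir(), "imp.rds")  # hypothetical location
saveRDS(imp, out_file)
```

A later session can pick up the saved object with `readRDS(out_file)` and go straight to the analysis.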

I recently did imputation with about 90 covariates. It took a while to set up, but I'm glad I did. I imputed covariates I didn't intend to put in my analysis model, and I ended up actually using them in a related analysis later.

So, I would recommend doing variable selection after imputation. If you had 1000 covariates, I would say do variable selection first, but 130 is not that many.
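One simple way to do selection after imputation is to run the selection in each imputed data set and count how often each variable is kept. The sketch below assumes the mice package is installed and uses simulated data; the variable names, the missingness mechanism, and the choice of backward AIC selection via `step()` are all illustrative assumptions, not the method from the question or this answer.

```r
library(mice)

# Simulated data: y depends on x1 and x2 only; x3 and x4 are noise.
set.seed(42)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - 1.0 * dat$x2))

# MAR missingness: the chance that x2 is missing depends on observed x1.
dat$x2[runif(n) < plogis(-1.5 + dat$x1)] <- NA
# x3 is missing completely at random in about 15% of rows.
dat$x3[runif(n) < 0.15] <- NA

# Impute first, with every variable (including the outcome) in the model.
imp <- mice(dat, m = 5, seed = 42, printFlag = FALSE)

# Then select variables separately in each completed data set.
predictors <- setdiff(names(dat), "y")
selected <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  full <- glm(y ~ ., data = d, family = binomial)
  best <- step(full, trace = 0)  # backward selection by AIC
  predictors %in% attr(terms(best), "term.labels")
})
rownames(selected) <- predictors

# Proportion of imputed data sets in which each predictor was kept.
selection_freq <- rowMeans(selected)
print(selection_freq)
```

A variable kept in most or all imputed data sets is a stable choice; one that is kept in only a few is sensitive to the imputed values and deserves a closer look.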

As for the bias issue, I'm not sure how to answer this. Hopefully somebody else can answer it better. If you truly have MAR data, then I think you will be OK. However, if the data are MNAR, the missing values differ systematically from the observed ones in ways the observed data cannot explain. In that case, variable selection before imputation may yield different results than selection after imputation.
