I have a data set with 50,000 observations and 21 features, 14 of which are ordinal (survey items answered on the scale: very satisfied, satisfied, neutral, not satisfied). The other features are mostly continuous.
How do I deal with missing values? Can I handle one feature at a time, or do I have to take the other features into account as well? Which methods would be suitable here?
I have tried to find some good literature on this.
Feel free to suggest books, papers, or R packages; I use R for the analysis.
Best Answer
One common way to deal with missing values is (multiple) imputation. A popular R package is mice, which by default uses predictive mean matching for continuous variables, multinomial logistic regression for unordered categorical variables, and a proportional odds model for ordered factors. Many other methods are also available in the package. The mice package uses the chained equations approach: each incomplete variable is imputed from a model conditional on the other variables, so all variables can be imputed in a single call. Since you have only 21 variables in your data set, it probably makes sense to use all variables as predictors for the imputation. See also this paper from the author of mice for further information.

Below you can find an example of how an imputation could be done with the mice package:
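As a minimal sketch of that workflow (an illustration, not necessarily the code the answer originally showed), the following simulates a small hypothetical data set with two ordinal survey items and two continuous features, imputes it with mice, and pools a simple regression across the imputed data sets. The toy data, the variable names (q1, q2, x1, x2), the 10% missingness rate, and the analysis model are all assumptions made for illustration.

```r
library(mice)

set.seed(123)
n <- 200

# Hypothetical toy data: two ordinal survey items and two continuous features
lev <- c("not satisfied", "neutral", "satisfied", "very satisfied")
dat <- data.frame(
  q1 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  q2 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  x1 = rnorm(n),
  x2 = runif(n)
)

# Set 10% of each column to missing, completely at random (assumption)
for (v in names(dat)) {
  dat[sample(n, size = 20), v] <- NA
}

# Chained-equations imputation with m = 5 imputed data sets.
# No method is specified, so mice picks its defaults per column type;
# by default every other column serves as a predictor.
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

# Inspect which imputation method was chosen for each column
print(imp$method)

# Extract the first completed data set
completed <- mice::complete(imp, 1)

# Fit the analysis model on each imputed data set and pool the results
fit <- with(imp, lm(x1 ~ x2))
pooled <- pool(fit)
print(summary(pooled))
```

Storing the survey items as ordered factors matters here: it is what lets mice treat them as ordinal rather than as unordered categories. For a real analysis, the number of imputations and the predictor set would need to be chosen with the actual data in mind.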