Solved – Missing values in a large data set

machine learning, missing data, r, references

I have a data set with 50,000 observations and 21 features, 14 of which are ordinal (survey responses on the scale: very satisfied, satisfied, neutral, not satisfied). The other features are mostly continuous.

How do I deal with missing values? Can I treat one feature at a time, or do I have to take the other features into consideration as well? Which methods would be suitable here?

I have tried to find some good literature.

Feel free to suggest books, papers or R packages. I use R for the analysis.

Best Answer

One common way to deal with missing values is (multiple) imputation. A popular R package is mice, which chooses a default imputation method by variable type: predictive mean matching for numeric variables, logistic regression for binary factors, multinomial (polytomous) logistic regression for unordered factors with more than two levels, and a proportional odds model for ordered factors. Many other methods are available in the package. mice uses the chained equations approach, i.e. each incomplete variable is imputed from the other variables, and all variables can be imputed in a single call. Since you have only 21 variables in your data set, it probably makes sense to use all variables as predictors for the imputation rather than treating one feature at a time. See also this paper from the author of mice for further information.
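
Because the survey items in the question are ordinal, it can help to check which method mice would pick for each column before imputing. The following is only a minimal sketch on made-up data: the data frame `toy` and the columns `q1`, `q2` and `income` are hypothetical, and it assumes mice version 3.0 or later, where `make.method()` is available.

```r
library(mice)

set.seed(1)
n <- 200
lev <- c("not satisfied", "neutral", "satisfied", "very satisfied")

# Hypothetical toy data: two ordinal survey items and one continuous variable
toy <- data.frame(
  q1 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  q2 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  income = rnorm(n)
)
toy$q1[sample(n, 20)] <- NA      # q2 is left complete on purpose
toy$income[sample(n, 20)] <- NA

# Default method per column; columns without missing values get ""
meth <- make.method(toy)
print(meth)

# The vector can be edited before it is passed on, e.g. meth["q1"] <- "pmm"
imp_toy <- mice(toy, m = 2, method = meth, printFlag = FALSE, seed = 1)
done <- complete(imp_toy, 1)     # first completed data set
```

Coding the survey items as ordered factors is what lets mice treat them as ordinal; a plain `as.factor()` yields unordered factors, which are imputed with the multinomial model instead.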

Below is an example of how an imputation could be done with the mice package:

# Example data
set.seed(123) # Make the simulated data reproducible
N <- 1000 # 50,000 in your case
x1 <- as.factor(sample(1:4, N, replace = TRUE))
x2 <- as.factor(sample(1:4, N, replace = TRUE))
x3 <- as.factor(sample(1:4, N, replace = TRUE))
x4 <- as.factor(sample(1:4, N, replace = TRUE))
x5 <- as.factor(sample(1:4, N, replace = TRUE))
x6 <- as.factor(sample(1:4, N, replace = TRUE))
x7 <- as.factor(sample(1:4, N, replace = TRUE))
x8 <- as.factor(sample(1:4, N, replace = TRUE))
x9 <- as.factor(sample(1:4, N, replace = TRUE))
x10 <- as.factor(sample(1:4, N, replace = TRUE))
x11 <- as.factor(sample(1:4, N, replace = TRUE))
x12 <- as.factor(sample(1:4, N, replace = TRUE))
x13 <- as.factor(sample(1:4, N, replace = TRUE))
x14 <- as.factor(sample(1:4, N, replace = TRUE))
x15 <- rnorm(N)
x16 <- rnorm(N)
x17 <- rnorm(N)
x18 <- rnorm(N)
x19 <- rnorm(N)
x20 <- rnorm(N)
x21 <- rnorm(N)
data <- data.frame(x1, x2, x3, x4, x5, x6, x7, 
               x8, x9, x10, x11, x12, x13, x14, 
               x15, x16, x17, x18, x19, x20, x21)
# Set 50 randomly chosen values in each column to NA
for (i in seq_len(ncol(data))) {
  data[sample(nrow(data), 50), i] <- NA
}


# Imputation with mice package
library("mice")

# Multiple imputation producing 5 completed data sets (m = 1 would be a single imputation)
imp <- mice(data, m = 5)
# Wide format: the 5 imputed versions of each variable are placed side by side
data_imp <- complete(imp, action = "repeated")
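
With multiple imputation, the usual next step is to fit the analysis model on each completed data set and combine the results, rather than analysing a single filled-in table. A minimal sketch of that workflow, assuming a simple linear model as the analysis of interest (the data frame `toy2` and the variables `y` and `x` are made up for illustration):

```r
library(mice)

set.seed(1)
n <- 300

# Hypothetical toy data with missing values in the predictor x
toy2 <- data.frame(x = rnorm(n))
toy2$y <- 0.5 * toy2$x + rnorm(n)
toy2$x[sample(n, 30)] <- NA

imp2 <- mice(toy2, m = 5, printFlag = FALSE, seed = 1)

# Fit the same model on each of the 5 completed data sets
fit <- with(imp2, lm(y ~ x))

# Combine the 5 sets of estimates using Rubin's rules
pooled <- pool(fit)
print(summary(pooled))

# A single completed data set can still be extracted if needed
first <- complete(imp2, 1)
```

The pooled standard errors account for the uncertainty introduced by the imputation, which is the main reason to prefer m > 1 over a single imputation.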