Data Imputation in R with NAs in only one variable (categorical)

Tags: data-preprocessing, data-imputation, missing-data, r

I have a data frame with 44,353 entries and 17 variables (4 categorical + 13 continuous). Of all the variables, only one categorical variable (with 52 levels) has NAs.

The numbers of levels in the four categorical variables are 1601, 6, 52 and 15.

When I use the missForest package, it throws an error saying it cannot handle categorical predictors with more than 53 categories.

Please suggest an imputation method in R with the best accuracy. Also, since the variable to be imputed is categorical, I would prefer to avoid methods that use regression techniques to impute values.

Best Answer

Do you need to impute NAs?

First, I would ask whether you really need to impute the missing values at all. If you intend to use the imputed set to train another model, you might as well just add NA as a level. In my experience this is the simplest solution when you have NAs in a categorical variable, especially when the NAs actually mean something, which is quite common. But even if they do not, it is easy, especially for random forests, to ignore that level if it is not predictive.

This will add NA as a level in the factor.

dataset$varWithNAs <- addNA(dataset$varWithNAs)
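
To see the effect on a small toy factor (a minimal sketch; the vector here is made up):

f <- factor(c("a", "b", NA, "a"))
levels(f)     # "a" "b"       -- NA is not a level yet
f <- addNA(f)
levels(f)     # "a" "b" NA    -- NA is now an explicit level
table(f)      # the NA level is now counted like any other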

Dummy encoding large categorical features

Regarding the issue of too many levels, it seems the factor with 1601 levels is your main problem. That is really a lot of levels, and it is hard to give direct advice since little is said about the variable. What you can always do in the case of too many levels is transform the variable into many boolean (true/false) variables.

I'll give you an example.

# sample() is random, so your x1 will differ unless you set a seed
dataset <- data.frame(x1 = sample(c('a','b','c'), 10, replace = TRUE))
#     x1
# 1   c
# 2   b
# 3   a
# 4   a
# 5   b
# 6   c
# 7   a
# 8   a
# 9   b
# 10  c

You could use the caret package to create dummy variables for your factor levels.

library(caret)
dummyObj <- dummyVars(~x1, dataset)
dummyset <- predict(dummyObj, dataset)
#    x1.a x1.b x1.c
# 1     0    0    1
# 2     0    1    0
# 3     1    0    0
# 4     1    0    0
# 5     0    1    0
# 6     0    0    1
# 7     1    0    0
# 8     1    0    0
# 9     0    1    0
# 10    0    0    1

In your case this will make your feature vector quite a lot wider, but it is actually what is done internally in many models, especially linear ones, before training (although not in random forests, which is why you get this error). If you look at e.g. the glm function, it transforms the dataset into dummy variables using the model.matrix function, which does the same thing but adds an intercept term. Removing that intercept term gives you the same answer. And since model.matrix lives in the stats package, you don't need to install anything.

model.matrix(~ x1 - 1, dataset) # -1 removes the intercept
#    x1a x1b x1c
# 1    0   0   1
# 2    0   1   0
# 3    1   0   0
# 4    1   0   0
# 5    0   1   0
# 6    0   0   1
# 7    1   0   0
# 8    1   0   0
# 9    0   1   0
# 10   0   0   1
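
One caveat if you dummy-encode the 1601-level factor this way: the dense matrix gets large (44,353 × 1601). As a sketch of one workaround, which is my suggestion rather than part of the original answer, the Matrix package can build the same encoding in sparse form:

library(Matrix)
# same dummy encoding as model.matrix(~ x1 - 1, dataset),
# but stored as a sparse matrix, which matters with ~1600 levels
sparseset <- sparse.model.matrix(~ x1 - 1, dataset)

Note that not every modelling function accepts a sparse matrix, so check what your downstream model supports before going this route.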

If you find that your dataset now has too many features, you should resort to the options Michael M gave in his answer to reduce the feature space. Chances are you have levels that never occur, or several that are very similar in meaning and can be combined, etc. Of course, doing this manually is tedious when you have so many levels, though the rare-level case can be automated, as sketched below.
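
For example, here is a minimal base-R sketch that lumps rare levels into one catch-all level; the column name bigFactor and the cutoff of 30 are hypothetical:

tab <- table(dataset$bigFactor)
rare <- names(tab)[tab < 30]   # levels seen fewer than 30 times
# reassigning several levels to the same label merges them
levels(dataset$bigFactor)[levels(dataset$bigFactor) %in% rare] <- "other"

The forcats package offers the same idea ready-made (e.g. fct_lump_min).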
