Solved – Fast missing data imputation in R for big data that is more sophisticated than simply imputing the means

data-imputation, large data, r

I need a package for missing data imputation in R. Because I am dealing with big data, the number of missing entries can also be high. Packages that impute with the mean or median are fast, of course, but more sophisticated packages that impute using regression or PCA take too long when many values are missing. I tried missMDA and missForest, but, as I said, they seem to take forever. There is a package named FastImputation, but I could not figure out how to use it when I have no patterns from training data. Can anyone suggest packages that impute quickly?

Best Answer

I used mice (multiple imputation by chained equations). It is fairly fast and quite simple. I used it on 3000 observations with ca. 10 variables, and it finished in 10 minutes on an old computer. Further, I believe it is one of the best multiple-imputation packages out there. It can impute using regression, among other methods.

You need to create a data frame containing the variable you want to impute plus every variable that might predict its values (so every variable in your model, and possibly others as well). The mice package will impute every missing value in that data frame.

The simplest way of imputing is shown below. It gives you a data frame Datimp that contains the five imputed datasets plus the original data.

library(mice)
#m=5 number of multiple imputations
#maxit=10 number of iterations. 10-20 is sufficient.
imp <- mice(Dat1, m=5, maxit=10, printFlag=TRUE) 
Datimp <- complete(imp, "long", include=TRUE)
write.table(Datimp, "C:/.../impute1.txt",
            sep="\t", dec=",", row.names=FALSE)

A better way to do this is:

library(mice)
# Subset the variables you want to impute or use as predictors for imputation.
Dat1 <- subset(Dat, select=c(id, faculty, gender, age, job, salary))
# Dry run (maxit=0) to obtain the default predictor matrix and methods.
ini <- mice(Dat1, maxit=0, printFlag=FALSE)
pred <- ini$predictorMatrix
# Variables you do not want to use as predictors but want to keep in the
# dataset (they can't be added later).
pred[, c("id", "faculty")] <- 0
meth <- ini$method
# Choose an imputation method for each variable. Here I don't want these
# variables to be imputed, so I choose "" (empty, no method).
meth[c("id", "faculty", "gender", "age", "job")] <- ""
imp <- mice(Dat1, m=5, maxit=10, printFlag=TRUE, predictorMatrix=pred,
            method=meth, seed=2345)
Datimp <- complete(imp, "long", include=TRUE)
write.table(Datimp, "C:/.../impute1.txt",
            sep="\t", dec=",", row.names=FALSE)

See if your imputations were any good:

library(lattice)
# Long format including the original data (.imp == 0), so m + 1 = 6 copies.
com <- complete(imp, "long", include=TRUE)
# Blue = observed values, red = values that were originally missing.
col <- rep(c("blue","red")[1+as.numeric(is.na(imp$data$salary))], 6)
stripplot(salary~.imp, data=com, jitter.data=TRUE, factor=0.8, col=col, pch=20,
          xlab="Imputation number", cex=0.25)
densityplot(~salary|factor(.imp), data=com, plot.points=FALSE, xlab="Salary")

# Observed vs. imputed densities within each of the m = 5 imputations.
long <- complete(imp, "long")
long$.imp <- factor(long$.imp, labels=paste("Imputation", 1:5))
long <- cbind(long, salary.na=is.na(imp$data$salary))
densityplot(~salary|.imp, data=long, groups=salary.na, plot.points=FALSE, ref=TRUE,
            xlab="Salary", scales=list(y=list(draw=FALSE)),
            par.settings=simpleTheme(col.line=c("blue","red")),
            auto.key=list(columns=2, text=c("Observed","Imputed")))

Finally, and importantly: you can't just save your new dataset and treat the imputed values as ordinary observed values. Instead, fit the model on each imputed dataset and pool the results (pooled regression, pooled lmer, ...), so that the uncertainty of the imputed values is taken into account.

fit1 <- with(imp, lm(salary ~ gender, na.action=na.omit))
summary(est <- pool(fit1))
pool.r.squared(fit1,adjusted=FALSE)
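
To make the pooling step less of a black box, here is a minimal base-R sketch of Rubin's rules for a single coefficient. The five estimates and squared standard errors are made-up illustrative numbers, not output from the models above, and the variable names are hypothetical. pool() does this kind of combination for you, along with the degrees-of-freedom calculations that are left out here.

```r
# Hypothetical per-imputation results for one coefficient (m = 5).
q_hat <- c(2.0, 2.2, 1.8, 2.1, 1.9)       # coefficient estimates
u_hat <- c(0.04, 0.05, 0.03, 0.04, 0.04)  # squared standard errors

m <- length(q_hat)
pooled_est  <- mean(q_hat)   # pooled point estimate
within_var  <- mean(u_hat)   # average within-imputation variance
between_var <- var(q_hat)    # between-imputation variance
# Total variance: within + between, inflated for the finite number of imputations.
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_est, se = pooled_se))
```

The between-imputation term is what a single saved dataset would ignore, which is why treating imputed values as observed understates the standard error.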

Related Solutions

Time Series – Multiple Imputation for Missing Count Data in Panel Study

You can use the Amelia package to impute the data (full disclosure: I am one of the authors of Amelia). The package vignette has an extended example of how to use it to impute missing data.

It seems as though you have units which are district-gender-ageGroup observed at the monthly level. First you create a factor variable for each type of unit (that is, one level for each district-gender-ageGroup). Let's call this group. Then, you would need a variable for time, which is probably a running month index over your data (for example, counting January 2003 as month 1, this variable would be 13 in January of 2004). Call this variable time. Amelia will allow you to impute based on the time trends with the following commands:

library(Amelia)
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE)

The ts and cs arguments simply denote the time and unit variables. The splinetime argument controls how flexibly time is used to impute the missing data. Here, a 2 means that the imputation will use a quadratic function of time; higher values are more flexible. The intercs argument tells Amelia to use a separate time trend for each district-gender-ageGroup unit. This adds many parameters to the model, so if you run into trouble, you can set it to FALSE while debugging.

In any event, this will get you imputations that use the time information in your data. Since the counts are bounded below at zero, you can use the bounds argument to force the imputations to respect that logical bound.
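
As a rough sketch of how such a bound could be specified: bounds takes a three-column matrix with one row per bounded variable, of the form c(column number, lower bound, upper bound). The toy data frame and the column name count are assumptions made for illustration, and so is the use of Inf as the upper bound (a large finite value would be a cautious alternative).

```r
# Toy stand-in for the real data; 'count' is a hypothetical outcome name.
toy <- data.frame(
  time  = rep(1:3, times = 2),
  group = factor(rep(c("a", "b"), each = 3)),
  count = c(4, NA, 7, 0, 2, NA)
)

# One row per bounded variable: column number, lower bound, upper bound.
count_col <- which(names(toy) == "count")
bds <- matrix(c(count_col, 0, Inf), nrow = 1, ncol = 3)
print(bds)
```

The matrix would then be passed as bounds = bds in the amelia() call.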

EDIT: How to create group/time variables

The time variable might be the easiest to create, because you just need to count from 2002 (assuming that is the lowest year in your data):

my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)

The group variable is slightly harder, but a quick way to create it is with the paste function:

my.data$group <- with(my.data, 
                      as.factor(paste(District, Gender, AgeGroup, sep = ".")))
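
A quick check of both constructions on a toy data frame may help. The rows below are made up; the column names follow the ones used above.

```r
# Made-up rows; column names follow the ones used in the answer.
toy <- data.frame(
  District = c("North", "North", "South"),
  Gender   = c("F", "M", "F"),
  AgeGroup = c("0-14", "15-64", "65+"),
  Month    = c(1, 12, 1),
  Year     = c(2002, 2002, 2003)
)

# Running month index, with January 2002 as month 1.
toy$time <- toy$Month + 12 * (toy$Year - 2002)
# One factor level per district-gender-ageGroup combination.
toy$group <- with(toy, as.factor(paste(District, Gender, AgeGroup, sep = ".")))

print(toy[, c("time", "group")])
```

Here January 2002 maps to 1, December 2002 to 12 and January 2003 to 13, so the index keeps increasing by one per month across year boundaries.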

With these variables created, you want to remove the original variables from the imputation. To do that you can use the idvars argument:

a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE,
                idvars = c("District", "Gender", "Month", "Year", "AgeGroup"))

Solved – How to know which imputation is best for impute the dataset from Multiple imputation by using mice

If you want a single imputed dataset to work with, you should use single imputation instead. But many authors recommend multiple imputation, with the estimates pooled using Rubin's rules, which take into account both the between-imputation and within-imputation variances. A rule of thumb for choosing the number of imputations is one imputation per percent of incomplete data (White et al., 2011).
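
As a hedged sketch of that rule of thumb, the snippet below derives the number of imputations from the percentage of incomplete rows in a small made-up data frame. Reading "percent of incomplete data" as the percentage of rows with at least one missing value, and never going below 5 imputations, are both assumptions made here.

```r
# Made-up data: 3 of 10 rows have at least one missing value.
dat <- data.frame(
  age    = c(23, 35, NA, 41, 29, 52, 38, 44, 31, 27),
  salary = c(30, NA, 45, 50, NA, 62, 48, 51, 39, 33)
)

# Percentage of rows that are incomplete.
pct_incomplete <- 100 * sum(!complete.cases(dat)) / nrow(dat)
# One imputation per percent of incomplete rows, with a floor of 5 (assumption).
m <- max(5, ceiling(pct_incomplete))
print(c(pct_incomplete = pct_incomplete, m = m))
```

The resulting value would be passed as the m argument of mice().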
