Solved – Multiple Imputation and “Conditional Missing” values

Tags: mice, multiple-imputation, r

I am a lonely peon currently researching and playing around with multiple imputation. I am using MICE in R to impute random missing data; however, I run into a problem when attempting to account for conditional or structured NAs in a dataset.

I'll provide a simplistic dataset in an attempt to illustrate my meaning:

TestData <- data.frame(Condition     = c(1, 1, 1, 1, 2, NA, 2, 2),
                       Dependent1    = c(1, NA, 2, 3, NA, NA, NA, NA),
                       Dependent2    = c(1, 12, 44, 1, NA, NA, NA, NA),
                       Dependent3    = c(NA, 2, 3, 5, NA, NA, NA, NA),
                       UnaffiliatedQ = c(1, NA, 3, 2, 27, NA, 32, 35))

TestData$Condition <- factor(TestData$Condition,
                             levels = c(1, 2),
                             labels = c("Yes", "No"))

In my example, the variable "Condition" is a gatekeeper question that determines whether a respondent needs to answer the next three questions (Dependent1 to Dependent3). If a respondent answers "No", he/she never sees those three questions, and they are recorded as NA. These values are not technically missing; they are structurally not applicable.
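To make the distinction concrete, here is a minimal base R sketch that separates the two kinds of NA in the toy data. The helper names (`dep_cols`, `structural`, `to_impute`) are made up for this illustration and are not part of mice.

```r
# Toy data from the question, rebuilt so this snippet runs on its own
TestData <- data.frame(
  Condition     = factor(c(1, 1, 1, 1, 2, NA, 2, 2),
                         levels = c(1, 2), labels = c("Yes", "No")),
  Dependent1    = c(1, NA, 2, 3, NA, NA, NA, NA),
  Dependent2    = c(1, 12, 44, 1, NA, NA, NA, NA),
  Dependent3    = c(NA, 2, 3, 5, NA, NA, NA, NA),
  UnaffiliatedQ = c(1, NA, 3, 2, 27, NA, 32, 35)
)

dep_cols <- c("Dependent1", "Dependent2", "Dependent3")

# Rows where the gatekeeper answer is known
is_no  <- !is.na(TestData$Condition) & TestData$Condition == "No"
is_yes <- !is.na(TestData$Condition) & TestData$Condition == "Yes"

# Structural NAs: the question was never shown, so there is nothing to impute
structural <- is.na(TestData[dep_cols]) & is_no

# Genuinely missing: the question was shown but not answered
to_impute <- is.na(TestData[dep_cols]) & is_yes

colSums(structural)
colSums(to_impute)
```

Row 6, where Condition itself is NA, falls into neither group: whether its Dependent values are structural or genuinely missing depends on what gets imputed for Condition.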

I've come to ask what Cross Validated (CV) would do in this type of situation. If I impute the NA value in the Condition variable along with those in Dependent1, Dependent2, and Dependent3, how do I ensure that I don't end up with imputed Dependent values that violate the constraint, i.e., values for respondents whose Condition is "No"?

I've thought of possible solutions, but none that I think would be valid or a good idea (e.g., coding a structured missing value like -999, or subsetting the data frame based on the conditional answers). I've looked around for literature or walkthroughs for situations like this; however, I've come up empty-handed.

The other alternative is that I've simply been running down the rabbit hole of multiple imputation and this is not the correct use of it. I posted on SO a few days ago, and was told to post here 🙂

I appreciate your thoughts and help all!

Best Answer

I suggest imputing in two steps: impute Condition first, and then impute Dependent1-3 only for the subset with Condition = Yes.

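# I extended your data a bit to get more cases.
set.seed(1)  # added so the random extension is reproducible
TestData <- data.frame(Condition = c(1,1,1,1,2,NA,2,2, sample(1:2, 500, replace = TRUE)),
                   Dependent1 = c(1,NA,2,3,NA,NA,NA,NA, sample(1:3, 500, replace = TRUE)),
                   Dependent2 = c(1,12,44,1,NA,NA,NA,NA, sample(1:44, 500, replace = TRUE)),
                   Dependent3 = c(NA,2,3,5,NA,NA,NA,NA, sample(1:5, 500, replace = TRUE)),
                   UnaffiliatedQ = c(1,NA,3,2,27,NA,32,35, sample(1:35, 500, replace = TRUE)))
# Assumption: the extension should follow the skip pattern, so "No" (2) rows get no Dependent1-3
TestData[which(TestData$Condition == 2), 2:4] <- NA
# Rows 10-25: Condition and Dependent1-3 set to missing
TestData[10:25, 1:4] <- NA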

TestData$Condition <- factor(TestData$Condition,
                         levels = c(1,2),
                         labels = c("Yes","No"))



library("mice")

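# Step 1: Imputation of Condition (with UnaffiliatedQ as predictor)
# m = 1 gives a single completed dataset; for multiple imputation use m > 1
# and repeat the remaining steps for each completed dataset
imp <- mice(TestData[ , c("Condition", "UnaffiliatedQ")], m = 1)
data_imp <- complete(imp)

# Assign the factor directly so the level order ("Yes", "No") is preserved
TestData$Condition <- data_imp$Condition
TestData$UnaffiliatedQ <- as.numeric(data_imp$UnaffiliatedQ)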

# Subset of TestData (Condition == Yes)
TestData_sub <- TestData[TestData$Condition == "Yes", ]

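# Step 2: Imputation of Dependent1-3
# Condition (column 1) is constant ("Yes") in this subset, so it is left out of the imputation model
imp_2 <- mice(TestData_sub[ , -1], m = 1)
data_imp_2 <- complete(imp_2)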

TestData$Dependent1[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent1)
TestData$Dependent2[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent2)
TestData$Dependent3[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent3)
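The code above produces a single completed dataset (m = 1). As a rough sketch of how the same two-step idea could be repeated to get several completed datasets, the following wraps it in a helper. The function `two_step_impute`, the simulated `toy` data, and the seed handling are all assumptions made for this illustration, not something mice provides; only `mice()` and `complete()` are actual mice calls.

```r
library(mice)

# Simulated data following the skip pattern (illustrative only)
set.seed(1)
n <- 300
dep_cols <- c("Dependent1", "Dependent2", "Dependent3")
toy <- data.frame(
  Condition     = factor(sample(c("Yes", "No"), n, replace = TRUE),
                         levels = c("Yes", "No")),
  Dependent1    = sample(1:3, n, replace = TRUE),
  Dependent2    = sample(1:44, n, replace = TRUE),
  Dependent3    = sample(1:5, n, replace = TRUE),
  UnaffiliatedQ = sample(1:35, n, replace = TRUE)
)
toy[toy$Condition == "No", dep_cols] <- NA  # skipped questions (structural NA)
toy[1:20, c("Condition", dep_cols)] <- NA   # gatekeeper answer missing
toy[21:40, "Dependent1"] <- NA              # genuinely missing answers
toy[41:50, "UnaffiliatedQ"] <- NA

# Hypothetical helper: returns a list of m completed data frames
two_step_impute <- function(data, dep_cols, m = 5, seed = 123) {
  gate_cols <- c("Condition", "UnaffiliatedQ")
  # Step 1: impute the gatekeeper (and UnaffiliatedQ) m times
  imp1 <- mice(data[, gate_cols], m = m, printFlag = FALSE, seed = seed)
  lapply(seq_len(m), function(i) {
    out <- data
    out[, gate_cols] <- mice::complete(imp1, i)
    yes <- out$Condition == "Yes"
    # Step 2: impute Dependent1-3 only among (observed or imputed) "Yes" rows
    imp2 <- mice(out[yes, c(dep_cols, "UnaffiliatedQ")], m = 1,
                 printFlag = FALSE, seed = seed + i)
    out[yes, dep_cols] <- mice::complete(imp2, 1)[, dep_cols]
    # "No" rows stay not applicable
    out[!yes, dep_cols] <- NA
    out
  })
}

completed <- two_step_impute(toy, dep_cols, m = 3)
```

Each element of `completed` has a fully imputed Condition, imputed Dependent1-3 for the "Yes" rows, and NA for the "No" rows. Because the result is a plain list rather than a mids object, any analysis would have to be run on each data frame and the results combined separately; whether the usual pooling rules remain appropriate for this nested setup is something I have not verified here.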