Combining regression models based on missing data patterns

ensemble learning, machine learning, missing data, regression

I have a dataset that contains a few patterns of missingness. The training set is complete and contains all input features, while the test set has complete observations for the dependent variable but a few patterns of missing data in the input features. Below, I provide an example:

Example dataset

library(dplyr)

# Generate a fake test set with several missing data patterns
subset1 <- data.frame(yield = rnorm(500, 1000, 100)) %>%
  mutate(var1 = yield * rnorm(500, 100, 5))

subset2 <- data.frame(yield = rnorm(500, 1000, 100)) %>%
  mutate(var2 = yield * rnorm(500, 100, 10),
         var3 = -yield * rnorm(500, 100, 4))

subset3 <- data.frame(yield = rnorm(500, 1000, 100)) %>%
  mutate(var1 = yield * rnorm(500, 100, 5),
         var3 = -yield * rnorm(500, 100, 4))

test <- plyr::rbind.fill(subset1, subset2, subset3)

# The train set has complete observations
train <- data.frame(yield = rnorm(500, 1000, 100)) %>%
  mutate(var1 = yield * rnorm(500, 100, 5),
         var2 = yield * rnorm(500, 100, 10),
         var3 = -yield * rnorm(500, 100, 4))

mice::md.pattern(plyr::rbind.fill(test, train))

[Figure: missing data pattern from mice::md.pattern()]

Note that the first row is the train set and the other rows are the test set patterns; I want to use the completely observed set as my training set.

I want to perform regression on this dataset, but I am wary of using imputation unless I have a strong justification for it. I could, of course, just remove the rows that contain missing data, but this would discard a very large portion of the dataset. I have encountered the concept of reduced modelling, which identifies the different patterns of missing data in the test set and trains a separate model for each pattern on the corresponding complete subset of features. From what I understood of the concept, this is what I tried:

library(caret)
library(ggplot2)

# Look at the missing data patterns in the test set
pattern <- mice::md.pattern(test, plot = FALSE)

# initiate a list to collect a plot for each pattern
plot_list <- list()

for (i in 1:(nrow(pattern) - 1)) {
  # get the columns observed in this pattern (the last row and last column
  # of the md.pattern() output are summaries, so drop them)
  cols_with_no_missing <- names(which(pattern[i, -ncol(pattern)] == 1))

  # subset the test set to rows complete in these columns (note that
  # na.omit() also keeps rows from other patterns that are complete here)
  subset_df <- test %>%
    select(all_of(cols_with_no_missing)) %>%
    na.omit()

  # train a model using only these columns
  dat <- train %>%
    select(all_of(cols_with_no_missing))
  
  mod <- train(yield ~ ., 
               data = dat,
               method = 'lm',
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10))
  
  # get some model statistics on the test subset
  obspred <- data.frame(obs = subset_df$yield,
                        pred = predict(mod, newdata = subset_df))

  n_obs <- nrow(obspred)
  # R² of the test-set predictions; summary(mod)$r.squared would report the
  # training R² instead, which does not match the plot below
  rsq <- cor(obspred$obs, obspred$pred)^2
  
  # Create a plot
  plt <- obspred %>%
    ggplot(aes(x = obs, y = pred)) + 
    geom_point() + 
    geom_abline(slope = 1) +
    annotate("text", x = -Inf, y = Inf, 
             label = paste(paste('n',n_obs,sep='='),
                           paste('R²', round(rsq,2),sep='='),
                           paste('yield',paste(cols_with_no_missing[-1], collapse = ' + '),sep=' ~ '),
                           sep='\n'), 
             hjust = -0.1, vjust = 1, size = 4)

  plot_list[[i]] <- plt
}

# compare the models' test performance
library(patchwork)
wrap_plots(plot_list) + plot_layout(axes = 'collect')

[Figure: test performance, observed vs. predicted yield for each subset's predictions]

My Questions

  1. What are the consequences of using such an approach, and are there more appropriate ways of doing something similar?
  2. Can you create some sort of ensemble model that selects the appropriate sub-model based on the available features? If so, how would you incorporate each sub-model's accuracy (which varies depending on the available features) into it?

My actual dataset is much bigger than this example (~90,000 observations with 25 features), so in reality I will use random forest regression, but (I think??) the principle should be similar.
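
For instance, I assume the only change needed in the loop above would be the model-fitting line, swapping caret's 'lm' method for 'rf' (the test-set R² calculation works for any model):

# same loop as above, with only the model-fitting line changed;
# caret's 'rf' method wraps the randomForest package
mod <- train(yield ~ .,
             data = dat,
             method = 'rf',
             trControl = trainControl(method = 'repeatedcv',
                                      number = 10,
                                      repeats = 10))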

Best Answer

  1. I would use an XGBoost regressor. Section 3.4 of the paper "XGBoost: A Scalable Tree Boosting System" by Chen & Guestrin (2016) lays out how they do "sparsity-aware split finding." They explicitly note that the matrix of inputs $X$ may be sparse due to the "presence of missing values in the data." What they do is calculate "a default direction in each tree node," and then "when a value is missing in the sparse matrix $X$, the instance is classified into the default direction." Algorithm 3 in the paper shows how XGBoost calculates the default direction of each node. A minimal sketch of this in R follows this list.

  2. Yes, you could build an ensemble. For example, you could fit a decision tree or LASSO sub-model for each predictor on its own. On the training and cross-validation data you would drop cases listwise where there is missingness, which costs little here since there is only one $x$ variable per sub-model. You'd have a sub-model for each of your $p$ predictors, along with some estimate of its out-of-sample error on your holdout set. You could then average all of the sub-models for which a given case has data, using the inverses of the errors to weight each model; the second sketch below illustrates this. I am not sure how this would perform; you could simulate data to see. Again, I would stick with XGBoost.
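
Here is a minimal sketch of the XGBoost approach on the example data. It assumes the xgboost R package and illustrative, untuned hyperparameters; xgboost accepts the NAs in the test matrix directly and routes them along each node's learned default direction:

library(xgboost)

features <- c("var1", "var2", "var3")

# mark NAs as missing in the DMatrix; no imputation is needed
dtrain <- xgb.DMatrix(data = as.matrix(train[features]),
                      label = train$yield, missing = NA)
dtest  <- xgb.DMatrix(data = as.matrix(test[features]), missing = NA)

# illustrative hyperparameters, not tuned values
mod <- xgb.train(params = list(objective = "reg:squarederror",
                               eta = 0.1, max_depth = 4),
                 data = dtrain, nrounds = 200)

# missing test values follow each node's learned default direction
pred <- predict(mod, dtest)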
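
And here is a minimal sketch of the inverse-error-weighted ensemble. It is only an illustration under stated assumptions: a plain lm() stands in for the decision tree or LASSO sub-models, and caret's repeated-CV RMSE stands in for the holdout error estimate:

predictors <- c("var1", "var2", "var3")

# one single-predictor sub-model per feature, fit on the complete train set
sub_models <- lapply(predictors, function(v) lm(reformulate(v, "yield"), data = train))
names(sub_models) <- predictors

# estimate each sub-model's out-of-sample RMSE via repeated cross-validation
sub_rmse <- sapply(predictors, function(v) {
  caret::train(reformulate(v, "yield"), data = train, method = "lm",
               trControl = caret::trainControl(method = "repeatedcv",
                                               number = 10, repeats = 10))$results$RMSE
})

# for each test case, average the sub-models whose predictor is observed,
# weighting each prediction by the inverse of its estimated error
ensemble_pred <- apply(test[predictors], 1, function(row) {
  avail <- predictors[!is.na(row)]
  preds <- sapply(avail, function(v)
    predict(sub_models[[v]], newdata = setNames(data.frame(row[[v]]), v)))
  weights <- 1 / sub_rmse[avail]
  sum(weights * preds) / sum(weights)
})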
