Data Imputation – How to Determine the Best Method to Use

data-imputation

I am trying to learn different imputation methods for dealing with missing values. I have a dataset of marathon segment splits (5K, 10K, …) in seconds, with identifiers (age, gender, country) as columns and individuals as rows. Each cell in a segment split column contains either a float (the number of seconds taken to reach that segment) or NaN. I am attempting to predict marathon performances based on the first few splits. Below is some sample data:

      Age M/F Country      5K     10K     15K     20K    Half  Official Time
2323   38   M     CHI  1300.0  2568.0  3834.0  5107.0  5383.0        10727.0
2324   23   M     USA  1286.0  2503.0  3729.0  4937.0  5194.0        10727.0
2325   36   M     USA  1268.0  2519.0  3775.0  5036.0  5310.0        10727.0
2326   37   M     POL  1234.0  2484.0  3723.0  4972.0  5244.0        10727.0
2327   32   M     DEN     NaN  2520.0  3782.0  5046.0  5319.0        10728.0

What imputation method should I use here and, more generally, how should I determine what imputation method to use for a given data set?

I've looked at this answer, but I'm not sure how to proceed from it.

Best Answer

You have to think about why the data might be missing. There are generally 3 types of missing data:

Missing completely at random (MCAR): This sort of missingness happens when, say, someone drops a test tube in the lab. "The data are missing for reasons that are unrelated to any characteristics or responses for the subject, including the value of the missing value, were it to be known," writes Frank Harrell in chapter 3 of his book *Regression Modeling Strategies*. Your missing data might be MCAR if the instruments recording the splits malfunctioned at random.

Missing at random (MAR): This sort of missingness occurs when the probability that the data are missing depends on variables that were actually measured. In your example, perhaps the Danish runners were given faulty instrumentation.

Informative missing: Again, from Frank Harrell: "The tendency for a variable to be missing is a function of data that are not available, including the case when data tend to be missing if their true values are systematically higher or lower." So if your Danish runner ran a 5K split that was unusually fast or unusually slow, and the instrumentation was for some reason sensitive to that, then the missingness would be informative.
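You can never fully verify these assumptions from the observed data alone, but you can at least look at how the missingness relates to the variables you did observe. A quick dplyr sketch, assuming d is your data frame with the columns shown above:

library(dplyr)

# Proportion of missing 5K splits by country -- a crude check of whether
# missingness is associated with observed variables (suggestive of MAR
# rather than MCAR). No summary like this can rule out informative
# missingness, though.
d %>%
  group_by(Country) %>%
  summarise(n = n(), prop_missing_5k = mean(is.na(`5K`))) %>%
  arrange(desc(prop_missing_5k))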

If the missingness is MCAR or MAR, then multiple imputation is helpful. You can use something like MICE or predictive mean matching (side note: Frank implements this in his companion R package Hmisc; see aregImpute) to use the information that is available -- including the outcome -- to impute the missing values. If the data are informatively missing, then any imputation risks biasing the estimates we make.
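As a rough sketch of what multiple imputation with predictive mean matching could look like using the mice package (assuming your data frame is called d and, as in your sample, the missing values sit only in the numeric split columns):

library(dplyr)
library(mice)

# Convert the character identifiers (M/F, Country) to factors so the
# imputation models can use them as predictors.
d_prepped = d %>% mutate(across(where(is.character), as.factor))

# Multiple imputation with predictive mean matching: m = 5 completed data
# sets, each missing split imputed from the other columns, including the
# outcome (Official Time).
imp = mice(d_prepped, m = 5, method = "pmm", seed = 1)

# One completed data set; in practice you would fit your model on each of
# the m data sets and pool the results (see with() and pool() in mice).
completed = complete(imp, 1)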

Often, you will see people impute missing data with the column mean. That is an attractive approach because it is simple, and it gives you a complete data set when you need one. But there are many problems associated with it. I would encourage you to read section 3.4 of Frank's book to learn more about the downsides of imputing with the column mean.
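For reference, column-mean imputation really is just a one-liner, which is a big part of its appeal:

# Fill missing 5K splits with the column mean. Simple, but it shrinks the
# variance of the column and distorts its relationships with the other
# variables.
d$`5K`[is.na(d$`5K`)] = mean(d$`5K`, na.rm = TRUE)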

EDIT:

Without knowing more about how the data were collected, there isn't much more you can do. You can assume the data are MAR and just do predictive mean matching, but you risk some bias. Maybe that matters, maybe it doesn't.

Another approach is to build a model. I suspected that the time to reach a given distance is, for the most part, linear in that distance. Here is a plot I made using some of the data you linked me to.

[![Split time in seconds against distance for a sample of runners; the relationship is approximately linear][1]][1]
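If you want to draw something similar yourself, a rough ggplot2 sketch (using the long-format raw_data frame built in the code further down, with bib, distance, and time columns) might look like this:

library(ggplot2)

# Split time against distance for a random subset of runners; each line is
# one runner. The roughly straight lines are what motivate a linear mixed
# effects model below.
set.seed(1)
raw_data %>%
  filter(bib %in% sample(unique(bib), 50)) %>%
  ggplot(aes(distance, time, group = bib)) +
  geom_line(alpha = 0.3) +
  labs(x = "Distance (rescaled)", y = "Split time (seconds)")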

Because the relationship is linear, we can probably fit a mixed effects model, adjusting for things like age and sex and adding a random slope for each person (country of origin might also be an important predictor). We can then use the model to predict the missing times.

Here is some R code to get you started in that direction.


library(tidyverse)
library(lubridate)
library(lme4)

# Read in the data from...wherever
d = read_csv("~/Desktop/marathon_results_2015.csv") %>% 
    select(-X1)


# Clean up the data so that we have 5 columns:
# bib (serves as ID), age, sex, distance, and time (time is the outcome here)
# Rescale distance (onto a 0-1 scale) and age (to decades) so the model converges more easily
# Also, convert the time into seconds 
raw_data = d %>% 
  select(Bib, Age, sex=`M/F`, `5K`:`40K`) %>% 
  janitor::clean_names() %>% 
  mutate_at(vars(matches('k')), ~period_to_seconds(hms(.x))) %>% 
  gather(distance,time, -bib, -sex, -age) %>% 
  mutate(distance = as.numeric(str_extract(distance, "(\\d)+"))) %>% 
  mutate(distance = distance/max(distance, na.rm=T), age = age/10)


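# Split the long data into rows with a missing value (these are the rows we
# will impute by prediction) and complete rows used to fit the model.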
missingness = raw_data %>% filter(if_any(everything(), is.na))

model_data = raw_data %>% 
             anti_join(missingness, by = c('bib','distance')) 
  

# Run a mixed effects model.
# Remove the intercept because we know that at time=0 distance=0
model = lmer(time ~ distance:age + distance:sex + (distance - 1 | bib) - 1, data = model_data)

preds = predict(model, newdata = model_data, allow.new.levels = T)
# Save predictions to compare with training data.
model_data$preds = preds


# Now predict the missing data
predict(model, newdata=missingness, allow.new.levels=T)
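
If you want to fold those predictions back into the data set, one possible continuation, using only the objects defined above, is:

# Replace the missing times with the model's predictions and recombine
# with the complete rows to get a fully imputed long-format data set.
imputed = missingness
imputed$time = predict(model, newdata = imputed, allow.new.levels = TRUE)

full_data = bind_rows(select(model_data, -preds), imputed)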


I've been a bit sloppy with my data cleaning so you might want to do a better job in that respect, but this should be a good start.

  [1]: https://i.stack.imgur.com/mjzdp.png
