Solved – Dealing with MNAR data and imputation

missing-data multiple-imputation

I have a large dataset with large amounts of missing data.

My data involves particular cognitive tests and I would like to see how they are related to academic attainment controlling for SES and IQ.

I would also like to impute missing values using multiple imputation.

I have been looking at the relationship between the amount of missing data each participant has and how well they do, both on the cognitive tests and in academic achievement. There is no relationship between how much data a participant is missing and their cognitive test performance, but there is a relationship between how much data they are missing and their school performance, SES, and IQ. This could be for many reasons, but I assume it means my data are not missing at random?

If I ran multiple imputation on this dataset and included the correlated variables in the imputation model, would that account for the relationship? If not, how might I deal with this situation?

Many thanks!

Best Answer

The terminology might be getting in the way here. If the data that you have explain the probability that other data are missing, then your data might be "missing at random" (MAR) in the technical sense, even if they are not "missing completely at random" (MCAR). In that case multiple imputation is a reasonable way to proceed.

As Gelman and Hill put it in a chapter on missing data:

A more general assumption, missing at random, is that the probability a variable is missing depends only on available information. Thus, if sex, race, education, and age are recorded for all the people in the survey, then “earnings” is missing at random if the probability of nonresponse to this question depends only on these other, fully recorded variables.
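To make the MAR case concrete, here is a minimal simulation sketch in Python (standard library only). Everything in it is an assumption for illustration: the data are simulated, `x` stands in for a fully recorded variable such as SES, `y` for an outcome with gaps, and the imputation step is a simplified bootstrap-based regression imputation rather than a full MICE implementation. In a real analysis you would likely use a dedicated multiple-imputation package instead.

```python
import math
import random
import statistics

random.seed(0)

# Simulated data (all values are made up for illustration).
# x: a fully observed covariate (think SES); y: an outcome with gaps.
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.0 * xi + random.gauss(0, 1) for xi in x]

# MAR mechanism: the chance that y is missing depends only on observed x.
# Participants with low x are more likely to have y missing.
def p_missing(xi):
    return 1 / (1 + math.exp(2 * xi))

observed = [random.random() > p_missing(xi) for xi in x]
obs_pairs = [(xi, yi) for xi, yi, o in zip(x, y, observed) if o]

def fit_line(xs, ys):
    """Least-squares slope, intercept and residual SD of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    sd = math.sqrt(sum(r * r for r in resid) / (len(xs) - 2))
    return slope, intercept, sd

def impute_once(rng):
    """One completed dataset: refit the regression on a bootstrap sample
    of observed cases, then draw each missing y from the fitted line
    plus residual noise."""
    boot = [rng.choice(obs_pairs) for _ in obs_pairs]
    slope, intercept, sd = fit_line([p[0] for p in boot],
                                    [p[1] for p in boot])
    return [yi if o else intercept + slope * xi + rng.gauss(0, sd)
            for xi, yi, o in zip(x, y, observed)]

m = 20  # number of imputed datasets
rng = random.Random(1)
estimates = [statistics.fmean(impute_once(rng)) for _ in range(m)]

true_mean = statistics.fmean(y)
complete_case_mean = statistics.fmean([p[1] for p in obs_pairs])
mi_mean = statistics.fmean(estimates)  # pooled point estimate (Rubin's rules)

print(f"true mean of y:      {true_mean:.2f}")
print(f"complete-case mean:  {complete_case_mean:.2f}")
print(f"pooled imputed mean: {mi_mean:.2f}")
```

Because missingness here depends only on the observed `x`, dropping incomplete cases biases the mean of `y`, while imputing from a model that includes `x` should bring the estimate back close to the truth. The same logic is why the variables that predict missingness belong in the imputation model.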

Data are not "missing at random" if missingness depends either on unobserved predictors or on the values of the missing data points themselves. Unfortunately, no statistical test can establish MAR from the observed data alone, because the assumption concerns values you never get to see. If data are missing not at random (MNAR), there are ways to proceed, but you have to model the missingness mechanism explicitly. That might require some specific expertise and experience.
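The three mechanisms can be written compactly. The notation below is an assumption on my part (it is the usual textbook convention, not something from the question): $R$ is the missingness indicator, $Y_{\text{obs}}$ the observed values, and $Y_{\text{mis}}$ the values that are missing.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}})
\end{align*}
```

Written this way, it is easier to see why MAR cannot be checked from the data: telling MAR apart from MNAR requires knowing how $R$ varies with $Y_{\text{mis}}$, which is exactly the part that was not recorded.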
