Solved – A data set with missing values in multiple variables

data-imputation, dataset, missing-data, modeling, self-study

I'm trying to analyze a data set from the health field, but I'm not sure how to proceed with the missing values.

Objective: to fit a model with a discrete response in order to study the influence of certain variables on that response.

The scenario is as follows:

  • About 100,000 observations
  • 20 variables
  • A variable with 95% missing data
  • A variable with 53% missing data
  • A variable with 52% missing data
  • Two variables with 2% missing data

Initially I thought about discarding the variables with many missing values, but they seem important for my analysis. The second option would be to work only with observations that have complete data, but I don't know whether this could cause problems for my analysis.

I have no experience with imputation; is there a sensible way to proceed in this case?

EDIT: Is it viable in this case, with this number of variables, to use a multivariate technique like principal components?

EDIT2: My data set is basically the medical record of every patient who was admitted to a hospital. The missing data are not random; simply, not all information was collected for all patients.

Best Answer

@Tim gave a nice response. To add to that, the best thinking about dealing with missing values (MVs) began with Donald Rubin and Roderick Little in their book Statistical Analysis with Missing Data, now in its 3rd edition. They originated the classifications into MAR, MCAR, etc. To their several books I would add Paul Allison's highly readable Sage book Missing Data, which remains one of the best, most accessible treatments of this topic in the literature.

A number of commonly used, bad heuristics have emerged over the years for dealing with missing data, many of which still see use today because they are easily implemented "solutions." These include ones already mentioned, such as discretizing the variable and creating a junk category labelled "Missing" or "NA" (not available, unknown) into which all missing values for that variable are tossed, as well as, for continuous variables, plugging the missing values with a constant -- e.g., the arithmetic mean. Second, for regression models, some recommend using dummy variables (0,1) indicating the presence (absence) of an MV. The dummy is intended to "capture" the overall impact of the MVs on the model while also appropriately adjusting the parameters. These are all bad ideas because, in the first case, a heterogeneous mix of information is lumped into a single category while, in the second case, a potentially large spike at a single value (the mean) is introduced into an otherwise typically smooth distribution for a predictor.
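To make these heuristics easy to recognize (not to recommend them), here is a minimal sketch of what they typically look like in pandas; the column names are purely hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 47.0, np.nan],   # continuous variable with MVs
    "smoker": ["yes", None, "no", None, "yes"],  # categorical variable with MVs
})

# 1) Junk category: lump all missing categorical values into a "Missing" bucket
df["smoker_cat"] = df["smoker"].fillna("Missing")

# 2) Constant plug: replace missing continuous values with the mean,
#    creating a spike at a single value in the predictor's distribution
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# 3) Missingness dummy: a 0/1 indicator intended to "capture" the MV effect
df["age_missing"] = df["age"].isna().astype(int)
```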

The least biasing of all the options for imputation are regression models. In an American Statistician paper (for which I no longer have a reference, sorry), dummy variables for MVs in regression were shown not only to fail to capture the effects of missing values but also to produce biased parameter estimates. The paper based these conclusions on a comparison of the various MV options against full-information data. The author's recommendation was, assuming the magnitude or volume of missing information wasn't too large, to use the least biasing solution -- full-information modeled imputation based on the data available after deleting the observations containing MVs. Of course, this demands an answer to "what is too much?" Here there are no firm benchmarks, only experiential, subjective heuristics and rules of thumb without firm theoretical motivation, which means it is up to the analyst to decide. Just so, @Discipulus' rule of thumb is to work with variables containing 50% or fewer MVs, certainly a reasonable heuristic. In the OP's case, that would exclude the two variables containing more than 50% MVs, variables that are described as "important" to the analysis. That said, it is safe to assume that 95% MVs qualifies as "too much."
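A hedged sketch of this kind of modeled imputation: fit a regression on the complete cases only, then predict the missing values of one variable from predictors that are fully observed. The function and column names below are illustrative, not a definitive recipe:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, target_col, predictor_cols):
    """Impute one column from fully observed predictors via regression."""
    observed = df[df[target_col].notna()]   # complete cases for this variable
    missing = df[df[target_col].isna()]
    model = LinearRegression().fit(observed[predictor_cols], observed[target_col])
    df.loc[df[target_col].isna(), target_col] = model.predict(missing[predictor_cols])
    return df

# e.g. df = regression_impute(df, "blood_pressure", ["age", "weight"])
```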

If it's thought that there are not too many MVs, then use some variant of multiple imputation to plug them. Here too, there are many bad methods to choose from, including, e.g., "sorted hot deck" multiple imputation, where observations are sorted across a string of fully observed variables and the fully observed value closest to the observation with missing information along that sort string is used as a plug. In general, all of these "mechanistic" solutions to plugging MVs are to be rejected in deference to model-based multiple imputation.
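One way to sketch model-based multiple imputation in Python is scikit-learn's IterativeImputer, a MICE-style chained-equations imputer; drawing several completed data sets with sample_posterior=True approximates proper multiple imputation, though a dedicated package (e.g., statsmodels MICE or the R package mice) would be the more complete route. This is an assumption-laden sketch, not the answer's own method:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, n_imputations=5):
    """Return a list of completed copies of the numeric data X."""
    completed = []
    for seed in range(n_imputations):
        # sample_posterior=True draws from the predictive distribution,
        # so each completed data set differs, as in multiple imputation
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        completed.append(imputer.fit_transform(X))
    return completed
```

The downstream model would then be fit on each completed data set and the estimates pooled.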

In an ASA workshop taught by Rubin, several "best" practices were discussed for dealing with multiple variables containing MVs in a data set. First, rank the variables by their frequency or percent of missing information and begin the process of imputation, one variable at a time, with those containing the lightest or least amount of MVs. Then retain and use these newly plugged variables in the model-building process for each subsequent variable. Use every variable available to you in building the imputation models, including the target or dependent variable(s) and excluding only the lower-ranked variables that still contain MVs.
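A small sketch of that ordering, assuming the hypothetical regression_impute helper from the earlier sketch and illustrative column names: rank the columns by missing fraction, impute the lightest one first, and let each newly completed column serve as a predictor for the next.

```python
def impute_in_order(df, cols_with_mvs, fully_observed_cols):
    """Impute columns one at a time, lightest missingness first."""
    ordered = sorted(cols_with_mvs, key=lambda c: df[c].isna().mean())
    predictors = list(fully_observed_cols)
    for col in ordered:
        df = regression_impute(df, col, predictors)
        predictors.append(col)  # reuse the newly plugged column downstream
    return df
```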

The key metric in building and evaluating model-based imputation is the comparison of the pre-imputation means and std devs (based on the full information after deleting MVs) with the post-imputation or plugged-value means and std devs. If the imputation was successful, little or no (significant) difference should be observed in these marginals. An important note of caution: this metric, and multiple imputation in general, are intended to evaluate the preservation of overall or unconditional marginals. This means that the actual values assigned to each MV field have a high likelihood of being "wrong" for that observation if compared with the full but unavailable information. For instance, in a head-to-head comparison based on a sample possessing both self-reported survey information (the actual values) and the imputations made by a leading vendor of geo-demographic information (the imputed values), imputed fields such as head of household age and income were wrong nearly 80% of the time at the level of the individual observation. Even after partitioning these fields into high and low groups based on median splits, the imputations were still wrong more than 50% of the time. However, the marginals were recovered more or less accurately.
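A small check in the spirit of that metric might look like the sketch below: compare the mean and standard deviation of a variable before imputation (complete cases only) with those after imputation. The column name in the usage comment is hypothetical.

```python
def compare_marginals(pre_series, post_series):
    """Compare mean/std of the observed values with the imputed data set."""
    pre = pre_series.dropna()  # complete cases only
    return {
        "pre_mean": pre.mean(),  "post_mean": post_series.mean(),
        "pre_std":  pre.std(),   "post_std":  post_series.std(),
    }

# e.g. compare_marginals(df_raw["income"], df_imputed["income"])
```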

One final note: imputation can be appropriate for features (predictors or independent variables) but is not recommended for the target or dependent variable.