Hot deck imputation: validity of double imputation and selection of deck variables for a regression

data-imputation, missing data, multiple-imputation

Background:

I have a data set containing 212 observations with a lot of missing values. Most of the IVs and DVs are categorical (the DVs are ordinal). There are 3 DVs and about 30 IVs. My intention was to run an ordinal logistic regression. Listwise deletion keeps only 42 observations, so I decided to use hot deck imputation to fill in the missing values. I chose similar variables as the deck variables during the hot deck imputation (the deck variables should always be categorical and, as far as I know, there should be a maximum of 5 deck variables).

Here are my queries:

1) When I imputed via hot deck once, 169 of the observations were filled in completely. If I use these imputed values for a second hot deck imputation, then all 212 observations will be filled in completely. But I am not sure whether it is valid to use imputed values for a further imputation. Can anyone advise?

2) Someone suggested (from his experience) that I use the 3 DVs as the background or deck variables for imputing all the DVs and all the IVs, because that will probably help my regression results. What do you think of this suggestion?

3) If almost all of the values of a continuous IV (with very few exceptions) are 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90 etc., then isn't it better to impute them via hot deck rather than via EM (as the hot deck will impute a variable with its existing values only)?

Best Answer

Hot deck is often a good way to obtain sensible imputations, because the imputed values are draws from the observed data. However, filling in a single value for each missing entry produces standard errors and p-values that are too low. For correct statistical inference, you could use multiple imputation. It is easy to combine hot deck imputation with multiple imputation. The most popular technique for doing this is known as predictive mean matching, and it has been implemented on a variety of platforms.
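To make the idea concrete, here is a minimal sketch of predictive mean matching combined with multiple imputation, using only the Python standard library. It is an illustration under assumptions, not the implementation found in any particular package: the function names (`fit_line`, `pmm_impute`, `pool_means`), the single predictor, the choice of k = 3 donors, the 20 imputations and the simulated data are all made up for the example. The bootstrap step is a simplification standing in for the parameter draw that full implementations typically use.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def pmm_impute(x, y, k=3, rng=random):
    """Return a copy of y with each None replaced by an observed donor value."""
    obs = [i for i, v in enumerate(y) if v is not None]
    # Bootstrap the observed cases so the fitted line differs between imputations
    # (assumption: a simple stand-in for drawing the regression parameters).
    boot = [rng.choice(obs) for _ in obs]
    a, b = fit_line([x[i] for i in boot], [y[i] for i in boot])
    filled = list(y)
    for i, v in enumerate(y):
        if v is None:
            target = a + b * x[i]
            # Donors: the k observed cases whose predicted value is closest.
            donors = sorted(obs, key=lambda j: abs(a + b * x[j] - target))[:k]
            filled[i] = y[rng.choice(donors)]  # hot deck draw from real values
    return filled


def pool_means(completed):
    """Pool the mean of y over m completed data sets with Rubin's rules."""
    m = len(completed)
    estimates = [statistics.fmean(d) for d in completed]
    within = statistics.fmean([statistics.variance(d) / len(d) for d in completed])
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total ** 0.5


# Simulated data (hypothetical): y is recorded in steps of 0.1, about 30% missing.
rng = random.Random(0)
x = [rng.random() for _ in range(60)]
y = [round(0.2 + 0.6 * xi + rng.gauss(0, 0.1), 1) for xi in x]
y_missing = [None if rng.random() < 0.3 else v for v in y]

completed = [pmm_impute(x, y_missing, k=3, rng=rng) for _ in range(20)]
pooled_mean, pooled_se = pool_means(completed)
print(pooled_mean, pooled_se)
```

Every imputed value is copied from an observed case, so a variable recorded in steps of 0.1 keeps that form. The pooled standard error adds the between-imputation variance to the usual within-imputation variance, which is the part a single hot deck imputation leaves out.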
