I have 60,000 observations, and around 45% of the values are missing; the missingness appears to be random. Can I simply use listwise or pairwise deletion, or do I have to use imputation? If imputation is recommended, which method is best?
Missing Data Methods – Should Missing Values be Handled by Imputation or Deletion?
data-imputation, missing-data
Related Solutions
When you use pairwise deletion to estimate a covariance matrix, it means that for any pair of variables you use all observations that are missing on neither of the two variables.
So if you had a data matrix
#| A B C
-----------
1| 1 1 NA
2| 2 NA 2
3| NA 3 3
4| 4 4 4
When calculating the covariance between columns A and B you would use rows (observations) 1 and 4, and when calculating the covariance between A and C you would only use rows 2 and 4.
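As an illustration (in Python rather than the answer's notation), pandas' `DataFrame.cov` applies exactly this pairwise rule, so it reproduces the covariances described above, while listwise deletion on the same matrix leaves only row 4 and cannot estimate any covariance:

```python
import numpy as np
import pandas as pd

# The data matrix from the answer; NA marks a missing value.
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [1, np.nan, 3, 4],
    "C": [np.nan, 2, 3, 4],
})

# pandas computes covariances pairwise: each entry uses every row
# where *both* variables are observed.
pairwise = df.cov()
print(pairwise.loc["A", "B"])  # rows 1 and 4 only -> 4.5
print(pairwise.loc["A", "C"])  # rows 2 and 4 only -> 2.0

# Listwise deletion keeps only rows complete on all variables;
# here that is the single row 4, so every covariance is undefined.
listwise = df.dropna().cov()
print(listwise.isna().all().all())  # True
```

This also shows why pairwise deletion salvages more information than listwise deletion, at the cost of each covariance being estimated from a different subset of rows.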
So in the case of bivariate regression (simple linear regression with a single predictor), pairwise deletion is equivalent to listwise deletion, since only one pair of variables is involved.
It's generally unwise to throw away information, which is what you do with complete-cases analysis or by throwing out predictors.
One of the advantages of multiple imputation instead of a single imputation of missing data is that the result incorporates the variability introduced by the imputation process while in principle using all the available information. Coefficients associated with the variable having 30% missing values thus may have larger standard errors than coefficients from variables with few missing values, but there is no a priori reason to omit such a variable. It might be worse to omit such a variable, as information in the cases having values for that variable might improve the imputations for other variables. Even if for some reason you don't keep it as a predictor variable, it can still be included as part of the imputation process.
The link above provides a simple introduction to the process of generating and using the multiple sets of imputations. You draw the imputations from a probability distribution, perform your regressions on each of the imputation sets, and then pool the results among the sets. With this number of predictors it might be best to do the imputations first and then do feature selection if feature selection is really necessary. With only 25 predictors you might be better off doing a ridge regression that keeps all the predictors, with appropriate penalization, and tends to treat collinear predictors together.
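A minimal hand-rolled sketch of that generate-analyze-pool cycle, using NumPy and SciPy rather than mice (the resampling imputation here is only a placeholder; a real analysis would draw from a fitted imputation model):

```python
import numpy as np
from scipy.stats import linregress

def pool_estimates(estimates, variances):
    """Combine per-imputation results with Rubin's rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)              # pooled point estimate
    within = np.mean(variances)             # average within-imputation variance
    between = np.var(estimates, ddof=1)     # between-imputation variance
    total = within + (1 + 1 / m) * between  # total variance of q_bar
    return q_bar, total

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
miss = rng.random(200) < 0.3                # ~30% of x missing at random
x_obs = np.where(miss, np.nan, x)

slopes, variances = [], []
for _ in range(5):                          # m = 5 imputation sets
    x_imp = x_obs.copy()
    # Placeholder imputation: resample observed values with fresh
    # randomness each pass, so the sets differ.
    x_imp[miss] = rng.choice(x_obs[~miss], size=miss.sum())
    fit = linregress(x_imp, y)              # regression on this imputed set
    slopes.append(fit.slope)
    variances.append(fit.stderr ** 2)

slope, var = pool_estimates(slopes, variances)
```

The pooled variance is the within-imputation variance plus an inflation term for between-imputation spread, which is how the result "incorporates the variability introduced by the imputation process."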
The mice package in R provides the tools that you need. The chained-equation approach makes it straightforward to deal with imputations of several variables at a time. You should devote some effort to setting up the structure of the imputations in a way that makes sense based on your understanding of the subject matter.
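If you work in Python instead of R, scikit-learn's `IterativeImputer` plays an analogous chained-equations role; a sketch on synthetic data (note it returns a single completed data set per fit, so for proper multiple imputation you would repeat it with `sample_posterior=True` and different random states):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]                  # correlated columns aid imputation
mask = rng.random(X.shape) < 0.2    # ~20% of entries missing at random
X_mis = X.copy()
X_mis[mask] = np.nan

# Each variable with missing values is modelled in turn from the
# others, cycling until the fills stabilise (chained equations).
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X_mis)
```

Observed entries pass through unchanged; only the masked cells are filled in from the conditional models.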
Two warnings. First, if one of your predictors is really "missing not at random" (MNAR) in the technical sense, then you will need to use special care and develop a joint model of the outcome variable and the predictor. It's possible, however, to think that data are MNAR when they really might be MAR, as this question illustrates. MAR only requires "given the observed data, [missingness] does not depend on the unobserved data". So consider carefully whether your predictor really threatens to be MNAR.
Second, you should think about how you will be using this model for prediction. If there are some predictors that are likely to be missing in many cases going forward, not just frequently omitted from your present data set, and you are going to be making predictions on a case-by-case basis, then you have to consider carefully how you would make your predictions in such cases and whether that variable should be included in your model.
Best Answer
It depends.
According to this nice article (Tsikriktsis, "A review of techniques for treating missing data in OM survey research," 2005), if more than 10% of the data are missing, the best solution is