Solved – Missing data and feature selection

My data has 1,785,000 records and 271 features. I'm trying to reduce the number of features used to build the model.

Q1. While exploring the data I found that some features are almost entirely missing: for example, only 25 records have a value for a given feature and all the other records are missing. I thought such features are not informative enough and that it is better to eliminate them. Am I right? If so, at what level can I do that? I mean, if 90%, 80%, etc. of a feature's values are missing, at what point can I decide to get rid of it? (Take into consideration that the dependent variable is N/Y and only 1.157% of the whole data belongs to Y.)

Q2. For each individual in the dataset, there are 64 trait_type features listed, each of which can take one of the values [1, 3, or 5]. My question is: if some trait_type takes only the value [5] or is missing for all the records, does it have any value, or can we again eliminate that feature?

Thank you

Best Answer

  1. I don't think eliminating data is a good idea. Let me ask you this: how do you know whether the features or variables you are trying to eliminate can be ignored? They could play vital roles in your model. So I would consider imputation if I were you. The paper suggested by @Ben is good, but this is also a great paper on missing data and multiple imputation. It will also answer, or at least guide you on, how to deal with imputing the dependent variable.
  2. What do the values (1, 3, and 5) mean? My only experience with imputation is with SAS PROC MI: if the observed values of a variable are all identical, as in your case (value = 5), PROC MI will automatically drop that variable and not impute it.
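
Before dropping or imputing anything, it may help to measure how much of each feature is missing and whether the missingness itself is related to the outcome. The following is only a rough sketch in Python with pandas, not something from the original question or answer. The function name `summarize_features`, the toy column names, and the 0/1 coding of the N/Y target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Per-feature missingness summary (hypothetical helper).

    Assumes `target` is coded 0/1, so its mean is the share of Y cases.
    """
    features = df.drop(columns=[target])
    y = df[target]
    rows = []
    for name in features.columns:
        col = features[name]
        observed = col.notna()
        rows.append({
            "feature": name,
            # share of records with no value for this feature
            "missing_frac": 1.0 - observed.mean(),
            # number of distinct non-missing values (1 means constant)
            "n_unique_observed": col.nunique(dropna=True),
            # Y rate among records where the feature is present / absent
            "y_rate_when_observed": y[observed].mean() if observed.any() else np.nan,
            "y_rate_when_missing": y[~observed].mean() if (~observed).any() else np.nan,
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "trait_a": [5, 5, np.nan, 5, np.nan, 5],   # only ever 5 or missing
    "trait_b": [1, 3, 5, np.nan, 1, 3],
    "rare":    [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "y":       [0, 0, 0, 0, 0, 1],
})

summary = summarize_features(toy, target="y")
constant_features = summary.index[summary["n_unique_observed"] <= 1].tolist()
print(summary)
print(constant_features)
```

In this toy data the `rare` column is missing in most rows, yet its presence lines up exactly with the Y case, which is the kind of situation where dropping a feature purely on its missing fraction would lose information. A column whose observed values are all identical carries nothing through its values, but a missing/observed indicator built from it could still be worth checking in the same way.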
