Solved – Missing data and feature selection

feature selection, missing data

My data set has 1,785,000 records and 271 features. I'm trying to reduce the number of features used to build the model.

Q1. While exploring the data I found that some features are almost entirely missing: for example, only 25 records have a value for one feature, and all the other records are missing it. I thought such features are not informative enough and that it is better to eliminate them. Am I right? If so, at what level of missingness can I do that? That is, if 90%, 80%, etc. of a feature's values are missing, at what point can I decide to get rid of it? (Note that the dependent variable is N/Y, and only 1.157% of the data belongs to Y.)

Q2. For each individual in the data set, 64 trait_type features are listed, each of which can take one of the values 1, 3, or 5. My question is: if a trait_type is either 5 or missing for every record, does it have any value, or can we again eliminate that feature?

Thank you

Best Answer

  1. I don't think eliminating data is a good idea. Let me ask you this: how do you know whether the features you are trying to eliminate can be ignored? They could play vital roles in your model. So, I would consider imputation if I were you. The paper suggested by @Ben is good, but this is also a great paper on missing data and multiple imputation. It also covers how to deal with imputing the dependent variable.
  2. What do the values (1, 3, and 5) mean? My imputation experience is only with SAS PROC MI. If the observed values of a variable are all identical, as in your case (value = 5), PROC MI will automatically drop that variable and not impute it.
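
One way to check whether a feature can be ignored, before dropping anything, is to profile it. The sketch below, in plain Python on made-up toy data, reports for each feature the fraction missing, the number of distinct observed values, and the Y rate among rows where the feature is present versus missing. The function name `profile_features`, the column names, and the use of `None` for missing values are assumptions for illustration, not taken from the original data.

```python
def profile_features(columns, target, positive="Y"):
    """Summarise missingness per feature.

    columns: dict mapping feature name -> list of values (None = missing)
    target:  list of class labels, same length as every column
    """
    n = len(target)
    report = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        n_present = len(observed)
        n_missing = n - n_present

        # Count positives separately for rows with and without a value.
        pos_present = sum(
            1 for v, y in zip(values, target) if v is not None and y == positive
        )
        pos_missing = sum(
            1 for v, y in zip(values, target) if v is None and y == positive
        )

        report[name] = {
            "missing_fraction": n_missing / n,
            # 1 means the observed values are all identical (e.g. always 5).
            "distinct_observed": len(set(observed)),
            "y_rate_present": pos_present / n_present if n_present else None,
            "y_rate_missing": pos_missing / n_missing if n_missing else None,
        }
    return report


# Hypothetical toy data: 8 records, 3 features, 1 positive record.
columns = {
    "trait_01": [5, None, 5, 5, None, 5, None, 5],  # only ever 5 or missing
    "trait_02": [1, 3, 5, None, 1, 3, 5, 1],
    "rare_flag": [None, None, None, 1, None, None, None, None],
}
target = ["N", "N", "N", "Y", "N", "N", "N", "N"]

report = profile_features(columns, target)
for name, stats in report.items():
    print(name, stats)
```

In this toy data `rare_flag` is missing for 7 of 8 records, yet its one observed value coincides with the only Y record, which is the kind of pattern a fixed missingness cutoff would discard. `trait_01` has a single distinct observed value, so only its missingness pattern could carry any information. With a Y rate near 1%, rates computed from a handful of rows (such as 25) are very noisy, so they are a prompt for closer inspection rather than evidence.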