Handle NaN/Missing values in Machine Learning (on HiggsML competition data)

machine-learning, missing-data

I'm currently learning machine learning methods and trying to work on the recent HiggsML competition, in which data scientists could submit their own algorithms for solving this problem with ML. The data set has many missing values (NaNs), represented as -999.0, and I wonder how to deal with them (or whether to deal with them at all, see the second bullet point).

  • Obviously, a trivial and naive option is to simply drop the affected columns, but this reduces accuracy and throws away a lot of data.
  • Another approach is to leave them set to a constant (here -999.0, or alternatively 0), but I wonder whether this distorts the fitted model's predictions on the test data, since many data points end up clustered at that one value.
  • Lastly, I found an approach called MICE imputation, which, as I understand it, uses the whole data set to estimate the missing values. Since these are detector measurements, this approach seems feasible to me, but I have no experience with it yet.

To end with a question: so far I have simply used the -999 approach, but wouldn't it be better to approximate the missing values? Are there default approaches that one typically uses?

Best Answer

For each variable containing missing values, add an indicator variable for whether that variable is missing. Then impute the missing values in whatever way you desire (good imputation helps but is not necessary). This captures all the information in the problem and is usually better than naive imputation alone (which can weaken the signal in the variable).
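A minimal sketch of that recipe in Python, assuming the features sit in a pandas DataFrame with the competition's -999.0 sentinel (the column names and values below are only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the HiggsML features; -999.0 marks missing entries.
X = pd.DataFrame({
    "DER_mass_MMC":       [125.3, -999.0, 98.7, 112.4],
    "PRI_jet_leading_pt": [45.2, 60.1, -999.0, -999.0],
})

X = X.replace(-999.0, np.nan)          # treat the sentinel as true missingness
feature_cols = list(X.columns)

# One indicator column per feature that has any missing values.
for col in feature_cols:
    if X[col].isna().any():
        X[col + "_was_missing"] = X[col].isna().astype(int)

# Impute the original columns however you like (median is a simple default).
X[feature_cols] = SimpleImputer(strategy="median").fit_transform(X[feature_cols])
print(X)
```

scikit-learn can also do both steps at once via `SimpleImputer(add_indicator=True)` inside a Pipeline, which makes it easy to apply exactly the same transformation to the test set.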

Adding the missingness indicator variables mimics dropping the rows that have missing observations. If the best fit is obtained by dropping those observations, a flexible machine learning algorithm can do so adaptively by leveraging the indicator variable. However, the algorithm might instead decide to use all observations, including those with missing values. It also lets the algorithm leverage the missing/imputed values for some variables and ignore them for others.

Adding good imputation of the missing values allows the machine learning algorithm to leverage an (imputed) signal from the missing values. A flexible ML algorithm would be implicitly imputing these values anyway, but imputing as a preprocessing step makes its job easier (and also allows less flexible ML algorithms to work well).
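If you want something closer to the MICE idea mentioned in the question, one option (a sketch with default-ish settings, not a tuned recommendation) is scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to expose IterativeImputer
from sklearn.impute import IterativeImputer

# Toy feature matrix with NaN already substituted for the -999.0 sentinel (values made up).
X = np.array([[125.3, 45.2],
              [np.nan, 60.1],
              [98.7, np.nan],
              [112.4, np.nan]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)   # each feature is regressed on the others, iteratively
print(X_imputed)
```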

The -999 approach is not terrible if your ML algorithm is very flexible. Because the missing values are all grouped together at one value, the algorithm can isolate them when fitting. However, if you use something like linear regression or a GAM, this will likely perform quite poorly. It also does not let you leverage imputation, which is what would allow a GLM/GAM to work well.
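A rough, self-contained toy illustrating that last point (not a benchmark): a tree ensemble can split the -999 rows off and still fit the clean points, whereas a linear model's single coefficient gets pulled by the huge sentinel value.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
y = (x[:, 0] > 0).astype(int)            # true signal: the sign of the feature
x[rng.random(2000) < 0.3, 0] = -999.0    # inject the sentinel into ~30% of rows

# The tree model can isolate the -999 group in its own splits and still fit the
# clean rows; the logistic regression's coefficient is distorted by the sentinel.
print("trees :", GradientBoostingClassifier().fit(x, y).score(x, y))
print("logit :", LogisticRegression(max_iter=1000).fit(x, y).score(x, y))
```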