Solved – What machine learning techniques can, once trained, generate prediction despite some missing inputs

machine learningmissing datapredictionpredictive-modelssupervised learning

I have a training set where the inputs & outputs are all present, but I suspect that in the data where I want to do prediction, I will occasionally encounter scenarios where a small fraction of the input features are missing. Are there any machine learning methods that, once learning is complete, can provide reasonable prediction amidst missing inputs like this? If it matters, I'm looking for real-valued predictions (ideally multivariate, as I have 2 outputs to predict per input set).

Best Answer

Substituting by the mean value is problematic and can lead to poor results. A principled way to tackle this problem is described in this paper. The idea is to formulate the problem in a probabilistic model which allows treating the missing components as hidden variables, and use the EM algorithm to estimate them. The paper also explains why is not recommendable to use the mean value.

If your model is a graphical model, then you can just integrate over the missing components. This gives you the most likely output compatible with the values of the observed components, averaged over all possible combinations of the missing values.

Related Question