I have a training set where the inputs & outputs are all present, but I suspect that in the data where I want to do prediction, I will occasionally encounter scenarios where a small fraction of the input features are missing. Are there any machine learning methods that, once learning is complete, can provide reasonable prediction amidst missing inputs like this? If it matters, I'm looking for real-valued predictions (ideally multivariate, as I have 2 outputs to predict per input set).
Solved – What machine learning techniques can, once trained, generate predictions despite some missing inputs
machine-learning, missing-data, prediction, predictive-models, supervised-learning
Related Solutions
You have no choice but to impute the values or to change models. A good choice is aregImpute in the Hmisc package. I think it's less computationally heavy than rfImpute, which is what is slowing you down. Here is the first example from the package documentation (there are others):
library(Hmisc)

# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)
x1[1:m]^2
a$imputed$x2
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
  tab <- table(u)
  as.numeric(names(tab)[tab == max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3, data=d)  # complete-case analysis
summary(fcc)   # SEs are larger than from multiple imputation
You mention that you have many new observations with missing values on the independent variables. Even so, if each new observation is missing only one or two of its variables and your total number of variables is not tiny, simply filling the holes with the median or mean (if the variables are continuous) could work.
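As a minimal sketch of that idea in Python (the training matrix and values are invented for illustration): learn column medians from the complete training data, then fill prediction-time gaps with them.

```python
import numpy as np

# Hypothetical training matrix: rows are observations, columns are features.
X_train = np.array([[1.0, 10.0, 0.5],
                    [2.0, 12.0, 0.7],
                    [3.0, 11.0, 0.6],
                    [4.0, 13.0, 0.9]])

# Column medians learned once from the (complete) training data.
medians = np.median(X_train, axis=0)

def fill_missing(x, medians):
    """Replace NaN entries of a new observation with training medians."""
    x = np.asarray(x, dtype=float).copy()
    mask = np.isnan(x)
    x[mask] = medians[mask]
    return x

x_new = np.array([np.nan, 12.5, np.nan])
print(fill_missing(x_new, medians))  # NaNs replaced by column medians 2.5 and 0.65
```

The key point is that the fill values are statistics of the training set, so the trained model sees inputs on the scale it was fitted to.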
Another thing that could be interesting is a quick variable importance analysis. The random forest implementation in R calculates two importance measures and the corresponding plots:
varImpPlot(yourRandomForestModel) # the model must be trained with importance=TRUE
You can then experiment with training the model on just the "important" variables, until prediction accuracy starts to suffer compared with the full model. For instance, you might keep only the variables with few missing values. This could help you reduce the size of your problem.
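The same idea can be sketched without the randomForest package. Here is a minimal permutation-importance example in Python; the synthetic data and the least-squares stand-in model are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2 (an assumption for this example).
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Fit a least-squares linear model as a stand-in for any trained model.
coef, *_ = np.linalg.lstsq(np.c_[np.ones(500), X], y, rcond=None)

def predict(Z):
    return np.c_[np.ones(len(Z)), Z] @ coef

base_mse = np.mean((predict(X) - y) ** 2)

# Permutation importance: shuffle one column at a time and measure
# how much the prediction error grows relative to the baseline.
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(np.mean((predict(Xp) - y) ** 2) - base_mse)

print(np.argsort(importance)[::-1])  # features ranked most to least important
```

Variables whose permutation barely hurts accuracy are candidates to drop, which shrinks both the model and the missing-data problem.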
11000 is an invalid class if your classes are mutually exclusive and you have only five classes: 10000, 01000, 00100, 00010, and 00001.
11000 is no longer an invalid class once you define your class space to be all binary strings of length 5 (2^5 = 32 possibilities). Ultimately it's up to you to decide what types of classes you would like to classify. Likewise for inputs: you can use however many inputs you would like, as long as you are consistent between training and test/prediction.
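To make the mutually exclusive case concrete, here is a small Python sketch of one-hot encoding (the five class labels 0–4 are an assumption for illustration):

```python
# One-hot encoding for five mutually exclusive classes: exactly one
# position is 1, so a vector like 11000 can never be produced.
def one_hot(label, n_classes=5):
    vec = [0] * n_classes
    vec[label] = 1
    return vec

print(one_hot(1))  # [0, 1, 0, 0, 0]

# If classes may instead co-occur (multi-label), every binary string of
# length 5 is a valid target, giving 2**5 = 32 possible combinations.
print(2 ** 5)  # 32
```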
Your dataset sounds very abstract, and that could be what is clouding your thinking. If you're just starting out with machine learning and neural networks, begin with a well-known dataset so that you only need to think about the algorithm. A good example is the MNIST handwritten digits dataset: all the classes are the digits 0-9 (10 in total), so there is no confusion about classes falling outside that set.
Google's Tensorflow library has a good MNIST tutorial here: https://www.tensorflow.org/versions/r0.7/tutorials/mnist/pros/index.html.
Best Answer
Substituting the mean value is problematic and can lead to poor results. A principled way to tackle this problem is described in this paper. The idea is to formulate the problem as a probabilistic model that treats the missing components as hidden variables, and to use the EM algorithm to estimate them. The paper also explains why using the mean value is not advisable.
If your model is a graphical model, then you can simply integrate (marginalize) over the missing components. This gives you the most likely output compatible with the values of the observed components, averaged over all possible combinations of the missing values.
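As a minimal illustration of marginalizing over missing inputs, here is a Python sketch that fits a joint Gaussian over inputs and outputs (an assumption made for this example, not the paper's EM procedure) and conditions only on whatever inputs are observed; the data-generating coefficients are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 3 inputs, 2 real-valued outputs,
# modeled as jointly Gaussian (an assumption for illustration).
n = 2000
X = rng.normal(size=(n, 3))
Y = X @ np.array([[1.0, 0.0], [0.5, 2.0], [0.0, -1.0]]) \
    + 0.1 * rng.normal(size=(n, 2))
Z = np.hstack([X, Y])              # joint vector (x1, x2, x3, y1, y2)
mu = Z.mean(axis=0)
Sigma = np.cov(Z, rowvar=False)

def predict(x_partial):
    """Conditional mean of (y1, y2) given the observed components of x.

    Missing inputs are NaN; they are integrated out automatically,
    because we condition the joint Gaussian only on observed indices.
    """
    obs = [i for i, v in enumerate(x_partial) if not np.isnan(v)]
    tgt = [3, 4]                   # indices of the two outputs
    S_oo = Sigma[np.ix_(obs, obs)]
    S_to = Sigma[np.ix_(tgt, obs)]
    resid = np.asarray(x_partial, dtype=float)[obs] - mu[obs]
    return mu[tgt] + S_to @ np.linalg.solve(S_oo, resid)

# Prediction still works with x2 missing: the model averages over it.
print(predict([1.0, np.nan, -0.5]))
```

This matches the question's setting (multivariate real-valued outputs, a few inputs missing at prediction time), and the same conditioning logic generalizes to richer graphical models.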