This is actually true, as the following simulated example in R shows:
library(mvtnorm)
sigma <- matrix(c(1, 0, 0, 1), ncol = 2)
x1 <- rmvnorm(n = 500, mean = c(0, 0), sigma = sigma, method = "chol")
x2 <- rmvnorm(n = 500, mean = c(3, 0), sigma = sigma, method = "chol")
x3 <- rmvnorm(n = 500, mean = c(1.5, 3), sigma = sigma, method = "chol")
x4 <- rmvnorm(n = 500, mean = c(-2.5, 3), sigma = sigma, method = "chol")
x5 <- rmvnorm(n = 500, mean = c(-4, -2), sigma = sigma, method = "chol")
data <- data.frame(rbind(x1, x2, x3, x4, x5))
data$class <- rep(1:5, each = 500)
Visualize the data:
library(ggplot2)
qplot(data[, 1], data[, 2], colour = factor(data$class))
Let's fit the first model, check its accuracy, and plot the predicted classes:
library(e1071)
fit1 <- naiveBayes(factor(class) ~ ., data, laplace = 0)
data$predicted <- predict(fit1, data[, 1:2], type = "class")
sum(data$predicted == data$class) / length(data$predicted)
[1] 0.9228
qplot(data[, 1], data[, 2], colour = data$predicted)
Now change the labels and repeat the same steps for the second model, this time a binary classification:
data2 <- data[, c("X1", "X2", "class")]  # drop predicted so it is not used as a feature
data2$class <- c(rep(2, 500), rep(1, 500), rep(2, 1000), rep(1, 500))
qplot(data2[, 1], data2[, 2], colour = factor(data2$class))
fit2 <- naiveBayes(factor(class) ~ ., data2, laplace = 0)
data2$predicted <- predict(fit2, data2[, 1:2], type = "class")
sum(data2$predicted == data2$class) / length(data2$predicted)
qplot(data2[, 1], data2[, 2], colour = data2$predicted)
The underlying reason is that fitting a separate distribution for each class adds flexibility, so the model can capture decision regions with different shapes.
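To make that point concrete, here is a minimal hand-rolled sketch of the Gaussian naive Bayes decision rule (the helper name manual_nb_predict is made up for illustration, not part of e1071): each class contributes its own per-feature normal density, and the class with the highest log-posterior score wins. Because every class carries its own densities, the induced decision regions can differ in shape from class to class.

```r
# Minimal Gaussian naive Bayes sketch: assumes the first two columns of
# `train` are the features and `train$class` holds the labels.
manual_nb_predict <- function(train, test_x) {
  classes <- unique(train$class)
  # One column of scores per class: sum of per-feature log densities + log prior
  scores <- sapply(classes, function(k) {
    sub <- train[train$class == k, 1:2]
    log_lik <- dnorm(test_x[, 1], mean(sub[, 1]), sd(sub[, 1]), log = TRUE) +
               dnorm(test_x[, 2], mean(sub[, 2]), sd(sub[, 2]), log = TRUE)
    log_lik + log(nrow(sub) / nrow(train))
  })
  classes[max.col(scores)]  # class with the highest score, row by row
}

# Tiny usage example on two well-separated classes
set.seed(1)
train <- data.frame(x1 = c(rnorm(50, -2), rnorm(50, 2)),
                    x2 = c(rnorm(50, 0), rnorm(50, 0)),
                    class = rep(1:2, each = 50))
manual_nb_predict(train, data.frame(x1 = c(-2, 2), x2 = c(0, 0)))
# the first test point should fall in class 1, the second in class 2
```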
Best Answer
Assuming the data are missing completely at random (cf. @whuber's comment), an ensemble learning technique as described in the following paper might be worth trying:
The general idea is to train multiple classifiers, each on a subset of the variables that make up your dataset (as in Random Forests), but to build the classification rule using only the classifiers trained on the non-missing features. Be sure to check what the authors call the "distributed redundancy" assumption (p. 3 in the preprint linked above): there must be some roughly balanced redundancy in your feature set.
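The general idea can be sketched as follows. This is not the paper's exact scheme, just an illustration of the mechanism; the function names train_ensemble and predict_ensemble are made up. One naive Bayes model is trained per feature subset, and at prediction time only the models whose features are all observed cast a vote:

```r
library(e1071)

# Train one naive Bayes model per feature subset (feature_sets is a list
# of column-name vectors, e.g. list("x1", "x2", c("x1", "x2"))).
train_ensemble <- function(train_x, train_y, feature_sets) {
  lapply(feature_sets, function(fs)
    list(features = fs,
         model = naiveBayes(train_x[, fs, drop = FALSE], factor(train_y))))
}

# Predict a single row: models whose features contain an NA are skipped,
# and the remaining models decide by majority vote.
predict_ensemble <- function(ensemble, newrow) {
  votes <- unlist(lapply(ensemble, function(m) {
    if (any(is.na(newrow[, m$features]))) return(NULL)  # feature missing: skip
    as.character(predict(m$model, newrow[, m$features, drop = FALSE]))
  }))
  names(sort(table(votes), decreasing = TRUE))[1]  # majority vote
}

# Usage: with x1 missing, only the model trained on x2 alone votes
set.seed(42)
train <- data.frame(x1 = c(rnorm(100, 0), rnorm(100, 5)),
                    x2 = c(rnorm(100, 0), rnorm(100, 5)))
y <- rep(1:2, each = 100)
ens <- train_ensemble(train, y, list("x1", "x2", c("x1", "x2")))
predict_ensemble(ens, data.frame(x1 = NA, x2 = 0))
```

The "distributed redundancy" assumption matters here: the vote is only trustworthy if the class signal is spread roughly evenly across the feature subsets, so that dropping the models touching a missing feature still leaves informative voters.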