Solved – When should I not use an ensemble classifier

bagging, boosting, classification, ensemble-learning

In general, in a classification problem where the goal is to accurately predict out-of-sample class membership, when should I not use an ensemble classifier?

This question is closely related to Why not always use ensemble learning?. That question asks why we don't use ensembles all the time. I want to know if there are cases in which ensembles are known to be worse (not just "not better and a waste of time") than a non-ensemble equivalent.

And by "ensemble classifier" I'm specifically referring to classifiers like AdaBoost and random forests, as opposed to, e.g., a roll-your-own boosted support vector machine.

Best Answer

The model that is closest to the true data-generating process will always be best and will beat most ensemble methods. So if the data come from a linear process, lm() will be much superior to random forests, e.g.:

    set.seed(1234)
    p = 10
    N = 1000

    # covariates
    x = matrix(rnorm(N * p), ncol = p)

    # coefficients
    b = round(rnorm(p), 2)

    # linear data-generating process with Gaussian noise
    y = x %*% b + rnorm(N)

    # 50/50 train/test split
    train = sample(N, N / 2)
    data = cbind.data.frame(y, x)
    colnames(data) = c("y", paste0("x", 1:p))

    # linear model
    fit1 = lm(y ~ ., data = data[train, ])
    summary(fit1)
    yPred1 = predict(fit1, data[-train, ])
    round(mean(abs(yPred1 - data[-train, "y"])), 2)  # out-of-sample MAE: 0.79

    # random forest on the same data
    library(randomForest)
    fit2 = randomForest(y ~ ., data = data[train, ], ntree = 1000)
    yPred2 = predict(fit2, data[-train, ])
    round(mean(abs(yPred2 - data[-train, "y"])), 2)  # out-of-sample MAE: 1.33
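
The same point carries over to the classification setting the question asks about. Here is a minimal sketch of the classification analogue, reusing x, b, N, p, and train from the code above (the names yC, dataC, fit3, and fit4 are just illustrative): the labels are generated from a logistic-linear model, so glm() with a binomial family matches the true data-generating process.

    # classification analogue: labels drawn from a logistic-linear model,
    # so logistic regression matches the true data-generating process
    # (reuses x, b, N, p, and train from the code above)
    yC = factor(rbinom(N, 1, plogis(x %*% b)))
    dataC = cbind.data.frame(yC, x)
    colnames(dataC) = c("y", paste0("x", 1:p))

    # logistic regression
    fit3 = glm(y ~ ., data = dataC[train, ], family = binomial)
    pred3 = ifelse(predict(fit3, dataC[-train, ], type = "response") > 0.5, "1", "0")
    mean(pred3 == dataC[-train, "y"])  # out-of-sample accuracy

    # random forest classifier on the same data
    fit4 = randomForest(y ~ ., data = dataC[train, ], ntree = 1000)
    pred4 = predict(fit4, dataC[-train, ])
    mean(pred4 == dataC[-train, "y"])

One would expect the glm() accuracy to come out at least as high here: the forest has to approximate a linear decision boundary with many axis-aligned splits, while the logistic model only has to estimate p + 1 coefficients.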