I am new to machine learning and R.
I know there is an R package called caretEnsemble, which can conveniently stack models in R. However, this package seems to have some problems when dealing with multi-class classification tasks.
For now, I wrote some code to try to stack the models manually, and here is the example I worked on:
library(caret)
set.seed(123)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
set.seed(62433)
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm", verbose = FALSE)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
#Accuracy
#0.7926829
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
Now I've got three models: modelFitRF, modelFitGBM and modelFitLDA, and three predicted vectors corresponding to these three models based on the test set.
Then I will create a data frame to contain these predicted vectors and the original dependent variable from the test set:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = FALSE)
And then I just used this data frame as a new training set to create a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
#Accuracy
#0.804878
Considering that stacking models usually should improve the accuracy of the predictions, I'd like to believe this might be the right way to stack the models. However, I have doubts, because here I used predDF, which was created from the predictions of the three models on the test set.
I am not sure whether I should use the results from the test set and then apply them back to the same test set to get the final predictions.
(I am referring to this block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = FALSE)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
Best Answer
What you're doing here is what I refer to as "Holdout Stacking" (sometimes also called Blending, though that term is also used for regular Stacking), where you use a holdout set to generate the training data for the metalearning algorithm (i.e. predDF). I use the term Holdout Stacking to differentiate it from regular Stacking (or "Super Learning"), where you generate cross-validated predicted values from the base learners to create the training data for the metalearner algorithm (in your case, a Random Forest), rather than using a holdout set (your testing frame).

The problem here is not how you're doing the stacking, but how you're evaluating the results. Once you've used the testing frame to generate the predDF frame, you have to throw that data away and not use it for model evaluation. In your example, you are also using the testing frame to evaluate the performance of the base models and the ensemble learner.

To fix this, just partition off another chunk of your data. You should have three datasets: training, validation and testing. Use the validation set to create predDF (also known as the "level one" dataset in stacking terminology). Then evaluate your base learners and your ensemble on the testing set to get a better idea of how the ensemble compares to the individual learners.

Lastly, as a suggestion, I'd recommend trying a GLM for the metalearning algorithm, because in my experience GLMs tend to perform better than tree-based models as metalearners, though that is not always the case.
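To make the three-way split concrete, here is one possible sketch of how the original example could be restructured along these lines. The split proportions, the variable names validation, predValDF and predTestDF, and the use of a GLM metalearner are my own choices, not part of the original code:

```r
library(caret)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)

# Split into roughly 60% training, 20% validation, 20% testing
set.seed(123)
inTrain <- createDataPartition(adData$diagnosis, p = 0.6)[[1]]
training <- adData[inTrain, ]
holdout  <- adData[-inTrain, ]
inVal      <- createDataPartition(holdout$diagnosis, p = 0.5)[[1]]
validation <- holdout[inVal, ]
testing    <- holdout[-inVal, ]

# Fit the base learners on the training set only
set.seed(62433)
modelFitRF  <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm", verbose = FALSE)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")

# Level-one data: base-model predictions on the validation set
predValDF <- data.frame(
  predRF    = predict(modelFitRF,  newdata = validation),
  predGBM   = predict(modelFitGBM, newdata = validation),
  predLDA   = predict(modelFitLDA, newdata = validation),
  diagnosis = validation$diagnosis
)

# Metalearner trained on the validation-set predictions (GLM, per the suggestion above)
modelStack <- train(diagnosis ~ ., data = predValDF, method = "glm")

# Evaluate base models and the ensemble on the untouched testing set
predTestDF <- data.frame(
  predRF  = predict(modelFitRF,  newdata = testing),
  predGBM = predict(modelFitGBM, newdata = testing),
  predLDA = predict(modelFitLDA, newdata = testing)
)
combPred <- predict(modelStack, newdata = predTestDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
```

Note that the testing set is touched exactly once, at the very end, so the final accuracy estimate is not contaminated by the data used to train the metalearner.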
If you're specifically looking for multiclass support in Stacking, it will be available soon in the h2o R package. If you don't need multiclass, then you can check out either the SuperLearner or h2o packages to do stacking more easily than writing it all out by hand. See the SuperLearner() or h2o.stackedEnsemble() functions to do Stacking with one line of code.
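For reference, here is a rough sketch of what regular (cross-validated) Stacking looks like with h2o, assuming a version that includes h2o.stackedEnsemble() and reusing the training and testing frames from the example above:

```r
library(h2o)
h2o.init()

# Assumes `training` and `testing` data frames as in the example above
train_h2o <- as.h2o(training)
test_h2o  <- as.h2o(testing)
y <- "diagnosis"
x <- setdiff(names(train_h2o), y)

# Base learners must share identical CV folds and keep their
# cross-validated predictions (these become the "level one" data)
rf  <- h2o.randomForest(x = x, y = y, training_frame = train_h2o,
                        nfolds = 5, fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE, seed = 1)
gbm <- h2o.gbm(x = x, y = y, training_frame = train_h2o,
               nfolds = 5, fold_assignment = "Modulo",
               keep_cross_validation_predictions = TRUE, seed = 1)

# The ensemble trains a metalearner on the CV predictions of the base models
ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train_h2o,
                                base_models = list(rf, gbm))

# Evaluate the ensemble on the held-out test frame
h2o.performance(ensemble, newdata = test_h2o)
```

Because the metalearner is trained on cross-validated predictions rather than a separate holdout, no validation set is needed and the test frame remains untouched until the final evaluation.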