Machine Learning – How to Improve Stacking and Meta-Learning Performance

machine-learning · meta-learning · stacking

I have a dataset of around 10,000 rows with 500 features; the response variable is binary (a binary classification task).

I split the features into 5 equally sized groups (based on subject-matter expertise) and trained 3 different models (RandomForest, XGBoost, SVM) on each group.

So now I have 15 different models. I then trained a single (RandomForest) model on the outputs of those 15 (the probability outputs, not hard predictions).

The results surprised me: the meta-learner (the single second-level RF model) did no better than the average individual "base" learner. In fact, some base-level models did better than the meta-learner. I'm wondering why this could be.

I'm not very experienced with stacking, so I thought perhaps there are some general tips/strategies/techniques that I'm missing.

General notes:

  1. Each of the base models does significantly better than baseline.
  2. I give the meta-learner access to some of the base features in addition to the predictions (probability outputs) from the base learners. I don't give the meta-learner access to ALL base features because I've found that even base learners trained on all 500 features don't outperform base learners trained on 100 of the 500 features, so I don't think the algorithms can handle all the features at once (perhaps there aren't enough rows for a single learner to learn the relationships between all the features?).
  3. I looked at the inter-model agreement among the base learners. As mentioned, this is a binary classification task, and each base learner achieves 65-70% accuracy (baseline is around 50%). I then looked at how much the base models overlap in which predictions they get right and wrong (presumably, if they all get exactly the same examples right, there is nothing for the meta-learner to combine, since their outputs carry the same information). The base models get the same examples right and wrong 70-80% of the time. I don't know whether that is high or low, but perhaps it is part of the key; if so, please explain in detail, since a 20-30% difference still seems like a good amount of potential improvement from combining them (unless that is just random noise). A minimal sketch of this agreement check is shown below the list.
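
To make the agreement check concrete, here is a minimal sketch of the kind of computation I do (the arrays are random stand-ins, not my actual data or code):

```python
# Sketch of the pairwise right/wrong agreement check; all data are stand-ins.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_models, n_examples = 15, 2000
y_true = rng.integers(0, 2, size=n_examples)                   # stand-in labels
base_preds = rng.integers(0, 2, size=(n_models, n_examples))   # stand-in hard predictions

# For each model, mark which examples it gets right.
correct = base_preds == y_true                                 # shape (n_models, n_examples)

# Fraction of examples on which each pair of models is right/wrong together.
pairwise_agreement = [
    np.mean(correct[i] == correct[j])
    for i, j in combinations(range(n_models), 2)
]
print(f"mean pairwise right/wrong agreement: {np.mean(pairwise_agreement):.2f}")
```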

Info about my train/test splitting:

  1. I am using nested cross validation to generate the level 1 predictions.
  2. I am also using nested cross validation to train, test and evaluate the level 2 model.
  3. I am working with time series data, so I organized my cross validation to avoid training on "future" data while validating on "past" data. The diagram below shows the general structure of my outer nested CV folds; my inner CV folds look the same. A simplified sketch of how the level-1 out-of-fold predictions are generated appears below the diagram.
    [diagram of the outer nested CV fold structure]
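
To clarify how the level-1 predictions are produced, here is a simplified, single-level sketch using scikit-learn's TimeSeriesSplit for one base model on one feature group (my actual setup is nested and follows the fold layout in the diagram; all data below are random stand-ins):

```python
# Rough sketch of generating out-of-fold level-1 probabilities with
# forward-chaining (time-ordered) splits for a single base model.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows, n_features = 10_000, 100            # one feature group of ~100 columns
X = rng.normal(size=(n_rows, n_features))   # stand-in for one feature group
y = rng.integers(0, 2, size=n_rows)         # stand-in binary target

oof_proba = np.full(n_rows, np.nan)         # level-1 predictions for this model
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof_proba[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

# Rows before the first validation fold never receive a prediction with
# forward-chaining splits, so they are dropped before training the meta-learner.
mask = ~np.isnan(oof_proba)
```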

Best Answer

There's a lot of useful material and practical recommendations for stacking from the Kaggle community. These include:

  • Often simple ensembling models (e.g. a simple average of predicted probabilities, a weighted average, or a logistic regression on the logits, possibly regularized towards a simple average) perform a lot better than fancier second-level models. This is especially the case when there is little data (and with time series data it is hard to know how much data 10,000 rows really is, because of the dependence over time and across observations at the same time point). XGBoost would probably not be my first choice for a stacking model unless there is a lot of data. A minimal sketch of such a simple second-level model follows this list.
  • Giving the ensembling model access to base features can be valuable when there is a lot of data and when, depending on some features, the predictions of one model should be trusted more than those of another (the latter is, of course, hard to know up front). However, this often makes it too easy to overfit without adding value, so I would consider not doing it.
  • Ensembling models also need good hyperparameter choices and often need to be heavily regularized to avoid overfitting. How much so? Hard to say without a good validation set-up, and if you do not yet have one for the second level, this needs thought. A form of time-wise splits like you seem to be using would be a sensible approach: e.g. you could use what you show in red as the validation data on which you fit your ensembling model, and then validate by going even further into the future (perhaps you are already doing that?).
  • Have a look at the chapter on this in Kaggle GM Abhishek Thakur's book (page 272 onwards) or, if you have easy access, the excellent sections on ensembling and its validation schemes in the "How to Win a Data Science Competition" Coursera course, as well as what various Kagglers have written on ensembling (e.g. here, or simply by looking at the forum discussions and winning solutions posted for a Kaggle competition that resembles your particular problem).
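
To illustrate the first bullet, here is a minimal sketch of a simple, regularized second-level model fit on out-of-fold base-model probabilities and validated on a later time slice (all data and names below are stand-ins, not a prescription for your exact setup):

```python
# Minimal sketch of a simple, regularized second-level (stacking) model.
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_rows, n_models = 10_000, 15
oof_proba = rng.uniform(0.01, 0.99, size=(n_rows, n_models))  # stand-in level-1 probabilities
y = rng.integers(0, 2, size=n_rows)                           # stand-in labels

# Time-wise split: fit the stacker on earlier rows, validate on later rows.
split = int(0.8 * n_rows)
X_train, X_valid = logit(oof_proba[:split]), logit(oof_proba[split:])
y_train, y_valid = y[:split], y[split:]

# Strong L2 regularization (small C) to keep the stacker from overfitting;
# regularizing towards an exact equal-weight average would need a custom penalty.
stacker = LogisticRegression(C=0.1, max_iter=1000)
stacker.fit(X_train, y_train)
print("stacker accuracy:", accuracy_score(y_valid, stacker.predict(X_valid)))

# Baseline to beat: a plain average of the predicted probabilities.
avg_pred = (oof_proba[split:].mean(axis=1) > 0.5).astype(int)
print("simple average accuracy:", accuracy_score(y_valid, avg_pred))
```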

Why am I emphasizing Kaggle (and similar data science competition settings) so much? Firstly, because a lot of ensembling/stacking gets done there. Secondly, because strong incentives exist there to ensure that ensembling is not overfit and performs well on unseen test data (while practitioners sometimes fool themselves into believing overfit results that were evaluated in an unreliable manner). Of course, there are also incentives to do things that would not work in practice but might work well in the competition (like exploiting target leaks).

70-80% agreement between models that perform comparably (although a 65% vs. 70% difference in accuracy seems large) actually sounds like a promising scenario for ensembling, in the sense that this is on the low side for models trained on the same data. It reflects the reasonable diversity of the models you chose (I would expect much more similar results if you had used, say, XGBoost and LightGBM). Models that are too similar in nature, e.g. multiple XGBoosts with slightly different hyperparameter values, are usually much less valuable. Perhaps even more diversity could be gained by adding other model families, e.g. a kNN classifier, logistic regression (both might require some good feature engineering) and, depending on the details (which determine whether there is any hope of this, e.g. high-cardinality categorical features, text/image inputs, being able to feed the time series nicely into an LSTM), neural networks of some form (e.g. an LSTM).
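
Purely as an illustration (whether these model families are feasible depends on the details above), widening the base-learner pool might look something like this:

```python
# Illustrative sketch of a more diverse base-learner pool; scaling matters for
# kNN and logistic regression, hence the pipelines.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

base_learners = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
}
# Each of these would be trained per feature group with the same out-of-fold
# scheme as before, and their predicted probabilities fed to the stacker.
```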
