The best generic way (more on that later) to compare the effectiveness of different classifiers on your data set is with receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). See this paper. Briefly, the metric depends only on the false positive and true positive rates (fpr/tpr), so you don't need to take special care to build a 50/50 positive/negative data set and deal with the associated class-imbalance problems.
To make ROC curves in sklearn, first have your classifiers return the predicted probabilities:
from sklearn import svm

classifier1 = svm.SVC(kernel='linear', probability=True)
probas_1 = classifier1.fit(X_train, y_train).predict_proba(X_test)
Repeat for classifier2, classifier3, etc.
Next, use sklearn functions to find the fpr, tpr, and auc:
# Compute the ROC curve and the area under it
from sklearn.metrics import roc_curve, auc

fpr1, tpr1, thresholds1 = roc_curve(y_test, probas_1[:, 1])
roc_auc1 = auc(fpr1, tpr1)
Repeat this for each classifier you are comparing, then plot:
# Plot the ROC curves
import matplotlib.pyplot as pl

pl.clf()
pl.plot(fpr1, tpr1, label='SVC (area = %0.2f)' % roc_auc1)
pl.plot(fpr2, tpr2, label='KNN(3) (area = %0.2f)' % roc_auc2)
pl.plot(fpr3, tpr3, label='KNN(21) (area = %0.2f)' % roc_auc3)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()
Each point in ROC space corresponds to a classifier and a threshold setting (in other words, you might be better off counting > 65% as the positive class rather than the default > 50%). The perfect classifier is the point (0, 1), the upper-left corner. The dashed diagonal represents the random classifiers, where the probability of predicting positive is fixed regardless of the input (the lower left is 0%, the upper right is 100%). Meaningful classifiers therefore lie above and to the left of the dashed line.
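One way to pick a non-default threshold from the ROC curve is Youden's J statistic, which selects the point farthest above the random-classifier diagonal. A minimal sketch, using synthetic data and logistic regression as stand-ins for your own data and classifier:

```python
# Sketch: choosing a non-default operating threshold from a ROC curve.
# The data and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, random_state=0)
probas = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, probas)
# Youden's J: maximize tpr - fpr, i.e. the vertical distance above the diagonal
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]
y_pred = (probas >= best_threshold).astype(int)
```

The chosen threshold will generally differ from 0.5 whenever the score distribution or class balance makes a different cut-off advantageous.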
You can also collect an AUC for each cross-validation fold. The best classifier will then be the one whose distribution of AUC values has the highest mean or median.
All that being said, you really want to look at your own cost/benefit function to decide which classifier to use. For your particular problem, quantify what you gain from a true positive and a true negative, and quantify whatever bad things happen when you get a false positive or a false negative. The best operating point (which classifier, and where to set the threshold) will depend on the application as seen through your cost/benefit.
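Once you have quantified the four outcomes, picking the operating threshold is a small optimization. A sketch with entirely hypothetical payoff numbers (replace them with your application's actual costs and benefits):

```python
# Sketch: choosing a threshold that maximizes expected value under a
# hypothetical cost/benefit assignment for the four outcomes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

benefit_tp, benefit_tn = 10.0, 1.0   # hypothetical payoffs
cost_fp, cost_fn = -2.0, -20.0       # hypothetical penalties

X, y = make_classification(n_samples=500, random_state=0)
probas = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

def expected_value(threshold):
    pred = probas >= threshold
    tp = np.sum(pred & (y == 1)); tn = np.sum(~pred & (y == 0))
    fp = np.sum(pred & (y == 0)); fn = np.sum(~pred & (y == 1))
    return tp * benefit_tp + tn * benefit_tn + fp * cost_fp + fn * cost_fn

best_threshold = max(np.linspace(0.05, 0.95, 19), key=expected_value)
```

Note how an expensive false negative (here -20) pushes the optimal threshold down, so the classifier flags positives more readily.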
The Kaggle community has produced a lot of useful material and practical recommendations for stacking. These include:
- Often simple ensembling models (e.g. a simple average of predicted probabilities, a weighted average of predicted probabilities, or a logistic regression on logits, possibly regularized towards the simple average) perform a lot better than fancier models fit on the second level. This is especially true when there is little data (with time series in particular, it is hard to know how much data 10,000 rows really is, because of the dependence over time and across different observations at the same time). XGBoost would probably not be my first choice for a stacking model unless there is a lot of data.
- Giving the ensembling model access to the base features can be valuable when there is a lot of data and when, depending on some features, the predictions of one model should be trusted more than those of another (the latter is, of course, hard to know up-front). However, this often makes it too easy to overfit without adding value, so I would consider not doing it.
- Ensembling models also need good hyperparameter choices and often need to be heavily regularized to avoid overfitting. How much so? Hard to say without a good validation set-up; if you do not yet have one for them, this needs thought. A form of time-wise split like you seem to be using would be a sensible approach - e.g. you could use what you show in red as the validation data on which you fit your ensembling models, and then validate by going even further into the future (perhaps you are already doing that?).
- Have a look at the chapter on this in Kaggle GM Abhishek Thakur's book (page 272 onwards) or - if you have easy access - there are excellent sections on ensembling and the validation schemes for it in the "How to Win a Data Science Competition" Coursera course, as well as what various Kagglers have written on ensembling (e.g. here, or simply by looking at the forum discussions and winning solutions posted for a Kaggle competition that resembles your particular problem).
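The first recommendation above - simple averaging of predicted probabilities - can be sketched in a few lines; the models, data, and the 0.7 weight below are all illustrative placeholders (in practice the weight would be tuned on validation data):

```python
# Sketch: simple and weighted averaging of predicted probabilities
# from two diverse base models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

p1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
p2 = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

p_avg = (p1 + p2) / 2                 # simple average
w = 0.7                               # hypothetical weight, tune on validation data
p_weighted = w * p1 + (1 - w) * p2    # weighted average
```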
Why am I emphasizing Kaggle (and similar data science competition settings) so much? Firstly, because a lot of ensembling/stacking gets done there. Secondly, because strong incentives exist there to ensure that ensembling is not overfit and performs well on the unseen test data (whereas practitioners sometimes fool themselves into believing in overfit results that are evaluated in an unreliable manner). Of course, there are also incentives to do things that would not work in practice but might work well in the competition (like exploiting target leaks).
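The time-wise validation scheme described in the bullets above can be sketched as follows; the synthetic rows here merely stand in for data that is actually ordered by time, and the window boundaries are arbitrary:

```python
# Sketch: time-wise stacking validation. Base models are fit on the past,
# the (regularized) ensembler on a later window, and everything is
# evaluated on a window even further into the future.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=900, random_state=0)
past, mid, future = slice(0, 500), slice(500, 700), slice(700, 900)

base1 = LogisticRegression(max_iter=1000).fit(X[past], y[past])
base2 = RandomForestClassifier(random_state=0).fit(X[past], y[past])

def stack_features(rows):
    # Second-level inputs: base-model probabilities on the given window
    return np.column_stack([base1.predict_proba(X[rows])[:, 1],
                            base2.predict_proba(X[rows])[:, 1]])

# Fit the ensembler on the middle window ...
ensembler = LogisticRegression(C=0.5).fit(stack_features(mid), y[mid])
# ... and validate it further into the future.
val_accuracy = ensembler.score(stack_features(future), y[future])
```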
70-80% agreement between models that perform comparably (although a difference in accuracy of 65 vs. 70% seems large) actually sounds like a promising scenario for ensembling, in the sense that this is on the low side for models trained on the same data. It reflects the reasonable diversity of the models you chose (I'd expect much more similar results if you used, e.g., XGBoost and LightGBM). Models that are too similar in nature - e.g. just multiple XGBoosts with slightly different hyperparameter values - are usually much less valuable. Perhaps even more diversity could be achieved with additional models, e.g. a kNN classifier or logistic regression (both might require some good feature engineering), and, depending on the details (which determine whether there's any hope of doing this - e.g. high-cardinality categorical features, some inputs being text/images, being able to feed the time series nicely into an LSTM, etc.), neural networks of some form (e.g. LSTM-type).
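Measuring that agreement is straightforward: count the fraction of test rows on which two models make identical predictions. A sketch with illustrative models and data:

```python
# Sketch: pairwise agreement between two models' predictions, as a
# quick diagnostic of model diversity before ensembling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred_a = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
pred_b = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).predict(X_te)

agreement = np.mean(pred_a == pred_b)  # fraction of identical predictions
```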
Best Answer
I'd try visualizing the data in 3d, using a generalized scatterplot matrix, or using PCA (if you know what it is) to project the data to 2d, and then look for structure.
With such a low-dimensional dataset, it should be easy to see visually which classifier will be best; scikit-learn's comparison of different classifiers is a useful reference.
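The PCA projection itself is two lines; a sketch using the iris dataset as a stand-in (plotting omitted, since the projection is the key step):

```python
# Sketch: project a dataset to 2d with PCA so class structure can be
# inspected visually in a scatter plot.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
# X_2d can now be scatter-plotted, colored by y, to reveal structure.
```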