Solved – Is it possible to combine predictions to improve overall prediction quality?

boosting, machine learning, prediction

This is a binary classification problem. The metric being minimised is the log loss (or cross-entropy). I also track accuracy, just for my own information. It is a large, very balanced data set. Very naive prediction techniques get about 50% accuracy and 0.693 log loss. The best I've been able to scrape out is 52.5% accuracy and 0.6915 log loss. Since we are trying to minimise the log loss, we always get a set of probabilities (the predict_proba functions in sklearn and keras). That's all background; now the question.

Let's say I can use two different techniques to create two different sets of predictions with comparable accuracy and log-loss metrics. For example, I can use two different groups of the input features to produce two sets of predictions that are both about 52% accurate with a log loss below 0.692. The point is that both sets of predictions show there is some predictive power. Another example: I could use logistic regression to produce one set of predictions and a neural net to produce the other.
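For concreteness, here is roughly what that setup looks like in sklearn. This is only an illustrative sketch: the synthetic data and the arbitrary split into two feature groups stand in for my real data and feature groups.

```python
# Illustrative sketch: two models fit on two feature groups, each producing
# its own probability predictions (synthetic data, arbitrary feature split).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Two arbitrary feature groups standing in for the real ones.
cols_a, cols_b = np.arange(10), np.arange(10, 20)

p1 = LogisticRegression().fit(X_train[:, cols_a], y_train).predict_proba(X_val[:, cols_a])[:, 1]
p2 = LogisticRegression().fit(X_train[:, cols_b], y_train).predict_proba(X_val[:, cols_b])[:, 1]

print(log_loss(y_val, p1), accuracy_score(y_val, (p1 > 0.5).astype(int)))
print(log_loss(y_val, p2), accuracy_score(y_val, (p2 > 0.5).astype(int)))
```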

Here are the first 10 for each set, for example:

p1 = [0.49121362 0.52067905 0.50230295 0.49511673 0.52009695 0.49394751 0.48676686 0.50084939 0.48693237 0.49564188 ...]
p2 = [0.4833959  0.49700296 0.50484381 0.49122147 0.52754993 0.51766402 0.48326918 0.50432501 0.48721228 0.48949306 ...]

I'm thinking there should be a way to combine the two sets of predictions into one to increase the overall predictive power. Is there?

I had started trying some things. For example, I treated the absolute distance of a prediction from 0.5 (abs(p - 0.5)) as a signal strength, and for each sample I kept whichever of p1 and p2 had the greater signal. This accomplished what I wanted, but only by a slim margin, and in another instance it didn't seem to help at all. Interestingly, it didn't seem to destroy the predictive power either.
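In case it is unclear, here is a minimal sketch of that picking rule (the function name is just illustrative):

```python
# "Stronger signal wins": for each sample, keep whichever prediction
# lies further from 0.5, i.e. has the larger abs(p - 0.5).
import numpy as np

def combine_by_signal(p1, p2):
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return np.where(np.abs(p1 - 0.5) >= np.abs(p2 - 0.5), p1, p2)

# With the first few values from above, this picks
# [0.4833959, 0.52067905, 0.50484381, 0.49122147, 0.52754993].
print(combine_by_signal(
    [0.49121362, 0.52067905, 0.50230295, 0.49511673, 0.52009695],
    [0.4833959, 0.49700296, 0.50484381, 0.49122147, 0.52754993],
))
```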

Best Answer

Short answer: Yes.

Long answer: This is one of many examples of a technique known as "stacking". While you can, of course, decide on some manual way to combine the two predictions, it usually works even better to train a third model on the outputs of the first two models (or of even more models). This can further improve the accuracy. To avoid re-using the same data, one part of the data set is often used to train the first-level models and a different part to train the model that combines their predictions.
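For illustration, here is a minimal stacking sketch (on synthetic data) using sklearn's built-in StackingClassifier, which implements a closely related variant: instead of a separate held-out split, it fits the combining model on out-of-fold predict_proba outputs of the base models, which likewise avoids re-using the same rows at both levels.

```python
# Minimal stacking sketch: a logistic regression and a small neural net
# as base models, with another logistic regression combining their
# out-of-fold probability predictions (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("nn", MLPClassifier(max_iter=500, random_state=0))],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # feed probabilities, not hard labels
    cv=5,  # meta-model is trained on out-of-fold predictions
)
stack.fit(X_train, y_train)
print(log_loss(y_val, stack.predict_proba(X_val)[:, 1]))
```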

See e.g. here for an example.