Solved – Extending the idea of Bootstrapping to Train Test splits of a Dataset used to learn a Classifier in Machine Learning

bootstrap, classification, machine-learning, model-evaluation, train

In Machine Learning the standard practice for learning a Classifier (e.g. fitting a Logistic Regression model) and then validating its performance is to split the original/available Dataset randomly into a train and a test dataset, typically with 70% of the data used for training and 30% for validation. Using the training dataset you fit a model, possibly using k-fold cross-validation on the training dataset, and then make predictions on the Test (Unseen / Out-of-Sample) dataset. You measure the performance of your predictions on the Test dataset and report it. That's it.
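For concreteness, here is a minimal sketch of that single-split workflow with scikit-learn. The synthetic data from make_classification, the 70/30 split, and the random seeds are placeholder assumptions standing in for the reader's own dataset and choices.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the real dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # One particular 70/30 train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42
    )

    # Fit on the training set, evaluate once on the held-out test set
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Test AUC for this particular split: {test_auc:.3f}")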

The problem with this approach is that the particular way you split the original dataset into Training and Test datasets may influence the measured performance of your model, and may lead to wrong conclusions about which algorithm and which hyper-parameter values are best for the data at hand.

You can verify this by repeatedly splitting a dataset into training and test sets at random: you will observe differences in the measured performance of the algorithms / hyper-parameter settings from split to split.

And while k-fold cross-validation should alleviate the problem of relying on one particular subset of the original dataset for training, it does nothing to mitigate the bias introduced by one particular choice of Test dataset (out of all the possible splits).

I was thinking that one way to deal with this problem would be to split the original dataset randomly into Train and Test 1,000 times, each time fit a model on that Train dataset, evaluate its performance on the complementary Test dataset, record the performance, and continue. In the end, report the distribution of the performance measure (e.g. AUC), i.e. its mean and standard deviation.
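A rough sketch of that procedure, again assuming scikit-learn and a synthetic dataset in place of the real one (the 1,000 repeats and 70/30 ratio follow the numbers above; seeds are arbitrary):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    n_repeats = 1000
    aucs = []
    for seed in range(n_repeats):
        # A fresh random 70/30 split for every repeat
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=seed, stratify=y
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

    aucs = np.array(aucs)
    print(f"AUC over {n_repeats} splits: "
          f"mean={aucs.mean():.3f}, std={aucs.std():.3f}")

For what it's worth, this scheme is sometimes called repeated random sub-sampling or Monte Carlo cross-validation, and scikit-learn's ShuffleSplit / StratifiedShuffleSplit generate exactly these splits.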

I find this idea analogous to the Bootstrapping method, in the sense that you resample the train and test datasets 1,000 times from the same initial sample at random, although without replacement.

Bootstrapping theory claims that the distribution of an estimator over resamples can be generalized to the population. Could we make an analogous claim for the distribution of the estimator of the performance metric we compute on each split?

Your advice will be appreciated.

Best Answer

I use exactly this method to estimate the performance of my predictor and to find an optimal threshold (for a binary classifier) after hyper-parameter tuning, typically when using gradient boosting machines, but I still use cross-validation for the parameter tuning itself.

I have also had success using this method to determine the coefficients of a logistic regression: do, say, 1,000 train-test splits and train the model on each, then construct a histogram for each coefficient. In some cases I would use the mean, in others the mode, to get "averaged" coefficients that provide the most "robust" solution (see the sketch below). I have not tried it, but I suspect one could also do hyper-parameter tuning this way.
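A sketch of that coefficient-collection idea, under the same assumptions as before (synthetic data, arbitrary seeds); the mean is used here, but one could equally inspect the histograms and take the mode per coefficient:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    coefs = []
    for seed in range(1000):
        # Only the training part of each split is used to fit the coefficients
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.30, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        coefs.append(model.coef_.ravel())

    coefs = np.vstack(coefs)            # shape: (n_splits, n_features)
    mean_coefs = coefs.mean(axis=0)     # "averaged" coefficients across splits
    print("Per-feature coefficient means:", np.round(mean_coefs, 3))
    print("Per-feature coefficient stds: ", np.round(coefs.std(axis=0), 3))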

The idea of the bootstrap, as I understand it, is to get an estimate of the variability in the data, and this method addresses the same problem. Especially with smaller datasets, results can vary considerably from one train-test split to the next, so if you try to select, for example, a classification threshold for a binary classifier's probability score based on a single train-test split, you are on shaky ground. Essentially, once you deploy the model, the first batch of data your estimator sees might differ a lot from the test set you used to optimize the threshold. Repeated train-test splits "simulate" the variability of incoming data, and based on the distribution of thresholds you can then select a more robust one.
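A sketch of the threshold-selection part, with the caveat that the per-split criterion here (maximizing Youden's J on the ROC curve) and the gradient boosting defaults are illustrative assumptions, not the answerer's exact recipe; 200 repeats are used only to keep the example quick:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    thresholds = []
    for seed in range(200):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=seed, stratify=y
        )
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        scores = model.predict_proba(X_te)[:, 1]
        # Pick the threshold maximizing tpr - fpr (Youden's J) on this test set
        fpr, tpr, thr = roc_curve(y_te, scores)
        thresholds.append(thr[np.argmax(tpr - fpr)])

    thresholds = np.array(thresholds)
    print(f"Median optimal threshold: {np.median(thresholds):.3f} "
          f"(IQR {np.percentile(thresholds, 25):.3f}"
          f"-{np.percentile(thresholds, 75):.3f})")

The distribution of these per-split thresholds is what you would then summarize (median, interquartile range, or a histogram) to pick a threshold that is robust to the variability of incoming data.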
