Solved – Prediction intervals for machine learning algorithms

boosting · bootstrap · confidence interval · machine learning · supervised learning

I want to know whether the process described below is valid/acceptable, and what justification, if any, exists for it.

The idea: supervised learning algorithms make no assumptions about the underlying structure/distribution of the data, and in the end they output only point estimates. I would like to quantify the uncertainty of those estimates somehow. Now, the ML model-building process is inherently random (e.g. in the sampling for cross-validation during hyperparameter tuning, and in the subsampling in stochastic GBM), so a modeling pipeline will give me a different output for the same predictors under each different seed. My (naive) idea is to run this process over and over again to obtain a distribution of the prediction, from which I can hopefully make statements about the uncertainty of the predictions.

If it matters, the datasets I work with are typically very small (~200 rows).

Does this make sense?

To clarify, I'm not actually bootstrapping the data in the traditional sense (i.e. I'm not resampling the data). The same dataset is used in every iteration; I'm just exploiting the randomness in cross-validation and stochastic GBM. A minimal sketch of what I mean is below.
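Something like the following, assuming the gbm package; the data frame `dat`, response `y`, and new observations `newdat` are hypothetical placeholders:

```r
library(gbm)

n_reps <- 100
preds  <- numeric(n_reps)

for (i in seq_len(n_reps)) {
  set.seed(i)  # only the seed changes: same data every iteration
  fit  <- gbm(y ~ ., data = dat, distribution = "gaussian",
              n.trees = 2000,
              bag.fraction = 0.5,  # stochastic GBM: random subsampling
              cv.folds = 5)        # CV randomness used to pick n.trees
  best <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  preds[i] <- predict(fit, newdata = newdat, n.trees = best)
}

# spread of the prediction across seeds
quantile(preds, c(0.025, 0.975))
```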

Best Answer

To me it seems as good an approach as any for quantifying the uncertainty in the predictions. Just make sure to repeat all modeling steps (for a GBM, that would be the parameter tuning) from scratch in every bootstrap resample. It could also be worthwhile to bootstrap the importance rankings to quantify the uncertainty in those rankings.
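As a hedged sketch of that advice (not your exact pipeline), again assuming the gbm package and the hypothetical `dat`, `y`, and `newdat` from above, each bootstrap resample re-tunes `n.trees` from scratch and also records the importance ranking:

```r
library(gbm)

n_boot     <- 200
boot_preds <- numeric(n_boot)
boot_ranks <- vector("list", n_boot)

for (b in seq_len(n_boot)) {
  set.seed(b)
  idx  <- sample(nrow(dat), replace = TRUE)  # bootstrap resample
  boot <- dat[idx, ]

  # repeat the whole pipeline inside the resample: here, re-tune n.trees by CV
  fit  <- gbm(y ~ ., data = boot, distribution = "gaussian",
              n.trees = 2000, bag.fraction = 0.5, cv.folds = 5)
  best <- gbm.perf(fit, method = "cv", plot.it = FALSE)

  boot_preds[b]   <- predict(fit, newdata = newdat, n.trees = best)
  boot_ranks[[b]] <- summary(fit, n.trees = best, plotit = FALSE)$var
}

# percentile interval for the prediction; boot_ranks shows rank stability
quantile(boot_preds, c(0.025, 0.975))
```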

I have found that sometimes the intervals do not contain the actual prediction, especially when estimating a probability. Increasing the minimum number of observations in each terminal node usually solves that, at least in the data I have worked with.
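In the gbm package, for example, that knob is `n.minobsinnode` (default 10); a hypothetical call raising it might look like:

```r
library(gbm)

# larger terminal nodes -> smoother, better-calibrated probability estimates
fit <- gbm(y ~ ., data = dat, distribution = "bernoulli",
           n.trees = 2000, n.minobsinnode = 30, cv.folds = 5)
```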

Conformal prediction seems like a useful approach for quantifying the confidence in predictions on new data. I have only scratched the surface so far, and others are probably better suited to give an opinion on it.
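For the simplest (split) variant, no special package is needed; a sketch under the same hypothetical `dat`/`y`/`newdat` assumptions: fit on one half of the data, use absolute residuals on the other half as nonconformity scores, and take their corrected quantile as a symmetric interval half-width.

```r
library(gbm)

set.seed(1)
alpha <- 0.1  # target 90% coverage

# split: fit on one half, calibrate on the other
train_idx <- sample(nrow(dat), floor(nrow(dat) / 2))
fit   <- gbm(y ~ ., data = dat[train_idx, ],
             distribution = "gaussian", n.trees = 500)
calib <- dat[-train_idx, ]

# nonconformity scores: absolute residuals on the calibration half
scores <- abs(calib$y - predict(fit, newdata = calib, n.trees = 500))

# finite-sample-corrected quantile of the scores
n_cal <- length(scores)
q <- unname(quantile(scores, min(1, ceiling((1 - alpha) * (n_cal + 1)) / n_cal)))

pred <- predict(fit, newdata = newdat, n.trees = 500)
c(lower = pred - q, upper = pred + q)
```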

There is some crude R code in my reply to this post about finding a GBM prediction interval.

Hope this helps!