EDIT: As pointed out in the comments below, this gives confidence intervals for the predictions and not, strictly speaking, prediction intervals. I was a bit trigger-happy with my reply and should have given this some extra thought.
Feel free to ignore this answer, or try to build on the code to get the prediction intervals.
I have used the simple bootstrap for creating prediction intervals a few times but there may be other (better) ways.
Consider the oil data in the caret package and suppose we want to generate partial dependencies and 95% intervals for the effect of Stearic on Palmitic. Below is just a simple example, but you can play around with it to suit your needs. Make sure the gbm package is up to date so that plot.gbm accepts the grid.levels argument.
library(caret)
data(oil)
#train the gbm using just the defaults.
tr <- train(Palmitic ~ ., method = "gbm", data = fattyAcids, verbose = FALSE)
#Points to be used for prediction. Use the quartiles here just for illustration
x.pt <- quantile(fattyAcids$Stearic, c(0.25, 0.5, 0.75))
#Generate the partial dependencies at the selected points. To get predictions instead, use predict() rather than plot()
p <- plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)
#Bootstrap the process to get prediction intervals
library(boot)
bootfun <- function(data, indices) {
data <- data[indices,]
#As before, just the defaults in this example. Palmitic is the first variable, hence data[,1]
tr <- train(data[,-1], data[,1], method = "gbm", verbose = FALSE)
# ... other steps, e.g. using the oneSE rule etc ...
#Return partial dependencies (or predictions)
plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)$y
#or predict(tr$finalModel, newdata = ..., n.trees = ...)
}
#Perform the bootstrap; this can be very time-consuming. Just 99 replicates here, but we usually want more, e.g. 500. Consider using the parallel option
b <- boot(data = fattyAcids, statistic = bootfun, R = 99)
#Get the 95% intervals from the boot object as the 2.5th and 97.5th percentiles
lims <- t(apply(b$t, 2, FUN = function(x) quantile(x, c(0.025, 0.975))))
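To inspect the result, the point estimates and the bootstrap limits can be put side by side. A minimal sketch, assuming p and lims from above and that the grid returned by return.grid = TRUE has columns Stearic and y:
#Combine the partial dependencies with the bootstrap percentile limits
ci <- data.frame(Stearic = p$Stearic,
                 estimate = p$y,
                 lower = lims[, 1],
                 upper = lims[, 2])
ci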
This is one way to do it that at least tries to account for the uncertainty arising from tuning the gbm. A similar approach has been used in http://onlinelibrary.wiley.com/doi/10.2193/2006-503/abstract
Sometimes the point estimate is outside the interval, but modifying the tuning grid (e.g., increasing the number of trees and/or the interaction depth) usually solves that.
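For example, something along these lines; the grid values are only placeholders, and depending on your caret version n.minobsinnode may or may not be a tuning parameter:
#Wider tuning grid; the values here are only for illustration
grd <- expand.grid(n.trees = c(500, 1000, 2000),
                   interaction.depth = 1:3,
                   shrinkage = 0.01,
                   n.minobsinnode = 10)
tr <- train(Palmitic ~ ., data = fattyAcids, method = "gbm",
            tuneGrid = grd, verbose = FALSE)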
Hope this helps!
Hard to say without a reproducible example but some pointers based on what I can understand from the code:
- For plot.gbm you need to pass a gbm object as well as the variable to plot. In your case something like p <- plot(gbmFit$finalModel, i.var = ..., grid.levels = x.pt, ...), where i.var is the variable you want partial dependencies for (see the sketch after this list). The grid length (continuous.resolution) defaults to 100, see ?plot.gbm, and I think that is why you get 100 points. Supplying grid.levels should override this if plot.gbm is called correctly.
- 2-fold CV is not likely to give a good estimate of performance, and with 75 data points I would use the bootstrap (or maybe leave-one-out CV) and restrict the trees to depth 1-3 (or so) unless you have very strong prior knowledge that you need trees of depth 20 or 21.
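Putting the two points together, a rough sketch; gbmFit and x.pt follow the question, dat, y and x1 are placeholders for your data, and the tuning/resampling settings are arbitrary (n.minobsinnode requires a newer caret):
#Sketch only: dat, y and x1 are placeholders
library(caret)
ctrl <- trainControl(method = "boot", number = 100)   #bootstrap instead of 2-fold CV
grd <- expand.grid(n.trees = c(500, 1000),
                   interaction.depth = 1:3,           #shallow trees
                   shrinkage = 0.01,
                   n.minobsinnode = 10)
gbmFit <- train(y ~ ., data = dat, method = "gbm",
                trControl = ctrl, tuneGrid = grd, verbose = FALSE)
x.pt <- quantile(dat$x1, c(0.25, 0.5, 0.75))
p <- plot(gbmFit$finalModel, i.var = "x1", grid.levels = x.pt, return.grid = TRUE)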
The book Applied Predictive Modeling by Max Kuhn and Kjell Johnson is a great go-to source for tuning gbm ensembles (and tuning predictive models in general).
Also note that, as pointed out in the comments in the above-mentioned thread, this code gives confidence intervals for the predicted values rather than prediction intervals.
Hope this helps!
Best Answer
In general, prediction intervals are considered better than point estimates. While it's great to have a good estimate of what a stock price will be tomorrow, it's much better to be able to give a range of values that the stock price is very likely to be in.
That being said, it's generally more difficult to produce reliable prediction intervals than to merely produce point estimates that have good prediction properties. For example, in many cases we can show that with non-constant variance, we can still produce a consistent estimator of the mean of the new value even if we ignore the non-constant variance issue. However, we definitely need a reliable estimate of the variance function to produce prediction intervals.
I've heard of people just treating this as another level of a machine learning problem: the first level is to produce a function $\hat f(x_i)$ that estimates $E[y_i \mid x_i]$, the mean of the values, and the second level is to produce a function $\hat V(x_i)$ that estimates $E[(y_i - \hat y_i)^2 \mid x_i]$, the variance of the errors given the inputs. In theory, this should work (given enough data and a stable underlying function), but in practice, it must be handled with a lot of care, as variance estimates are inherently much less stable than mean estimates. In short, you should expect to need much more data to accurately estimate $\hat V(x_i)$ than $\hat f(x_i)$.
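As a rough illustration of that two-level idea, here is a sketch in R; gbm is used only because it appears above, train.dat and new.dat are placeholders, the tuning settings are arbitrary, and in practice the residuals should come from out-of-fold predictions rather than the training fit (which will underestimate the variance):
library(gbm)
#Level 1: estimate the conditional mean E[y | x]
fit.mean <- gbm(y ~ ., data = train.dat, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
mu.hat <- predict(fit.mean, newdata = train.dat, n.trees = 1000)
#Level 2: estimate the conditional variance from the squared residuals
train.dat$sq.resid <- (train.dat$y - mu.hat)^2
fit.var <- gbm(sq.resid ~ . - y, data = train.dat, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
#Approximate 95% intervals for new data, assuming roughly normal errors
mu.new <- predict(fit.mean, newdata = new.dat, n.trees = 1000)
v.new  <- pmax(predict(fit.var, newdata = new.dat, n.trees = 1000), 0)
cbind(lower = mu.new - 1.96 * sqrt(v.new),
      upper = mu.new + 1.96 * sqrt(v.new))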
So there's definitely nothing about prediction intervals that is "cheating" compared with just producing point estimates. It's just harder to do. As an empirical example, in the M4 forecasting competition, only 2 of the 15 methods that produced 95% prediction intervals had nearly correct coverage; most of the other prediction intervals had coverage in the 80-90% range (see slide 35 in the link).