Machine Learning – Understanding Random Forests Confidence Intervals and Prediction Accuracy

coverage-probability · prediction-interval · random-forest

This is a short simulation to check the coverage of the random forest confidence intervals introduced in the following paper when they are used as predictive intervals:

S. Wager, T. Hastie and B. Efron. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research, 15, pp 1625-1651 (2014)

The data is simulated from the process described in Section 4.3 of

J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics. 19(1), pp 1-67 (1991)

There are six independent predictors $X_1,\dots,X_6$, each uniformly distributed on $[0,1]$, and an independent random variable $\epsilon$ with a $\text{N}(0,1)$ distribution. The response variable is defined as
$$
Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 -1/2)^2 + 10 X_4 + 5 X_5 + \epsilon.
$$

# Simulate Friedman's (1991) benchmark: p uniform predictors, of which only
# the first five enter the response, plus standard normal noise.
friedman <- function(n, p = 6) {
    X <- matrix(runif(n*p), nrow = n, ncol = p, dimnames = list(1:n, paste0("x_", 1:p)))
    y <- 10*sin(pi*X[, 1]*X[, 2]) + 20*(X[, 3] - 0.5)^2 + 10*X[, 4] + 5*X[, 5] + rnorm(n)
    data.frame(cbind(y, X))
}

We generate a training sample of size 1,000 and a test sample of size 100,000.

set.seed(42)

n <- 10^3
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

library(ranger)

# keep.inbag = TRUE is required for the infinitesimal jackknife standard errors
rf <- ranger(y ~ ., data = training, num.trees = 10^3, keep.inbag = TRUE)

pred <- predict(rf, data = test, type = "se", se.method = "infjack")

y_hat <- pred$predictions

Lower <- y_hat - 1.96 * pred$se
Upper <- y_hat + 1.96 * pred$se

mean((Lower <= test$y) & (test$y <= Upper))  # effective coverage on the test set

We set a nominal coverage of 95%, but the simulation yields an effective coverage of approximately 20.2%. The question: why is the effective predictive coverage so low?

Notes:

We're computing the $1-\alpha$ level confidence intervals as described in their footnote (page 3): $\hat{y}\pm z_\alpha \hat{\sigma}$.
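For reference, a minimal sketch of the same construction at a general level, assuming $z_\alpha$ here denotes the two-sided standard normal quantile (for $\alpha = 0.05$ this reduces to the 1.96 used above):

# General (1 - alpha) interval of the form y_hat +/- z * sigma_hat;
# alpha = 0.05 reproduces the 1.96 multiplier used above.
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
Lower <- y_hat - z * pred$se
Upper <- y_hat + z * pred$se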

As pointed out by usεr11852 in the comments below, the effective coverage decreases as the number of trees in the forest increases. Examples:

num.trees =  50 => effective coverage = 0.94855
num.trees = 100 => effective coverage = 0.76876 
num.trees = 150 => effective coverage = 0.68959 
num.trees = 200 => effective coverage = 0.56038 
num.trees = 250 => effective coverage = 0.55393 
num.trees = 300 => effective coverage = 0.32304 
num.trees = 350 => effective coverage = 0.55413 
num.trees = 400 => effective coverage = 0.26372 
num.trees = 450 => effective coverage = 0.26232 
num.trees = 500 => effective coverage = 0.23139 
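A minimal sketch of this sweep, reusing the training and test sets from above (exact coverages will vary with the seed):

# Sweep num.trees and report the effective coverage of the 95% intervals.
for (B in seq(50, 500, by = 50)) {
  rf_B <- ranger(y ~ ., data = training, num.trees = B, keep.inbag = TRUE)
  p_B <- predict(rf_B, data = test, type = "se", se.method = "infjack")
  covered <- (p_B$predictions - 1.96 * p_B$se <= test$y) &
             (test$y <= p_B$predictions + 1.96 * p_B$se)
  cat("num.trees =", B, "=> effective coverage =", mean(covered), "\n")
}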

Best Answer

The way the OP is written and the results are evaluated, this question appears to confuse a prediction interval with a confidence interval.

The prediction interval gives an interval estimate of a random variable, while a confidence interval provides an interval estimate of a parameter. The difference between the two is conceptually as large as the difference between the standard deviation of a variable and the standard error of its mean.
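To make the analogy concrete, here is a tiny illustrative sketch (not part of the original answer) contrasting the two scales on simulated normal data:

# The standard deviation measures the spread of individual observations
# (prediction-interval scale); the standard error measures the uncertainty
# of the estimated mean (confidence-interval scale).
x <- rnorm(10^4, mean = 5, sd = 2)
sd(x)                    # about 2
sd(x) / sqrt(length(x))  # about 0.02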

EDIT (closing logical gap): The confusion often arises in relation to model predictions, because they can be viewed both as a guess for a single random variable and as an estimate of a conditional mean. END EDIT.

Jackknife estimates provide uncertainties for an estimated parameter, so the method you tried gives a confidence interval for the conditional mean, while your OP asks for a prediction interval.

This would also explain why more trees (hence more stable estimates of the conditional mean) lead to shorter intervals.
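One way to check this interpretation: since the true conditional mean is known in this simulation, we can ask whether the infinitesimal jackknife intervals cover it. A sketch (not part of the original answer), reusing the objects from the question above:

# True conditional mean E[Y | X] of the Friedman process on the test set
mu <- 10 * sin(pi * test$x_1 * test$x_2) + 20 * (test$x_3 - 0.5)^2 +
      10 * test$x_4 + 5 * test$x_5
mean((Lower <= mu) & (mu <= Upper))  # coverage of the conditional mean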

Two approaches that I can think of:

Quantile random forest

I wanted to give you an example of how to use a quantile random forest to produce (conceptually slightly too narrow) prediction intervals, but instead of the nominal 80% coverage, I end up with about 90% coverage, see also @Andy W's answer and @Zen's comment. The same happens with different parametrizations. I cleaned up the code a little bit, so here it is anyway.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
        10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

# quantreg = TRUE stores the terminal-node observations needed for quantile prediction
rf <- ranger(y ~ ., data = training, quantreg = TRUE)

pred <- predict(rf, data = test, quantiles = c(0.1, 0.9), type = "quantiles")

y_hat <- pred$predictions

Lower <- y_hat[, 1]
Upper <- y_hat[, 2]

mean((Lower <= test$y) & (test$y <= Upper))  # 0.91616
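For a nominal 95% interval one can request the 2.5% and 97.5% quantiles from the same fitted forest, without refitting (a variant not in the original answer):

# Nominal 95% interval from the same quantile forest; no refit is needed.
pred95 <- predict(rf, data = test, quantiles = c(0.025, 0.975), type = "quantiles")
mean((pred95$predictions[, 1] <= test$y) & (test$y <= pred95$predictions[, 2]))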

Generic prediction intervals

There is a generic method that I use from time to time that works with any regression technique. It is based on two models: the first models the conditional mean; the second models the absolute residuals of the first and, as such, provides a model for the conditional standard deviation of the response. The prediction interval is then constructed in the same way as in your example. Like the quantile approach above, it ignores model uncertainty. A second issue is that the loss function used in the second model is rather arbitrary.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

rf_mean <- ranger(y ~ ., data = training)
# rf_mean$predictions are out-of-bag, so the residuals are honest
rf_sd <- ranger(abs(rf_mean$predictions - training$y) ~ ., data = training)

pred_mean <- predict(rf_mean, data = test)$predictions
pred_sd <- predict(rf_sd, data = test)$predictions

Lower <- pred_mean - 1.96 * pred_sd
Upper <- pred_mean + 1.96 * pred_sd

mean((Lower <= test$y) & (test$y <= Upper))  # 89%
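As a possible remedy for the arbitrary normal multiplier, one could calibrate it empirically from the out-of-bag standardized residuals instead of assuming 1.96. A sketch of this variant (not in the original answer):

# Replace 1.96 with the empirical 95% quantile of the absolute
# standardized out-of-bag residuals.
k <- quantile(abs(training$y - rf_mean$predictions) / rf_sd$predictions, 0.95)
Lower2 <- pred_mean - k * pred_sd
Upper2 <- pred_mean + k * pred_sd
mean((Lower2 <= test$y) & (test$y <= Upper2))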

Interestingly, the coverage of this method is very similar to that of the quantile random forest, which might indicate that Friedman's data is easy to predict?
