Machine Learning – Understanding Random Forests Confidence Intervals and Prediction Accuracy

coverage-probability · prediction-interval · random-forest

This is a short simulation to check the coverage of the random forest confidence intervals introduced in the following paper when they are used as predictive intervals:

S. Wager, T. Hastie and B. Efron. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research, 15, pp 1625-1651 (2014)

The data is simulated from the process described in Section 4.3 of

J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics. 19(1), pp 1-67 (1991)

There are six independent predictors $X_1,\dots,X_6$, each uniformly distributed on $[0,1]$, and an independent random variable $\epsilon$ with a $\text{N}(0,1)$ distribution. The response variable is defined as
$$
Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 -1/2)^2 + 10 X_4 + 5 X_5 + \epsilon.
$$

# Simulate Friedman's (1991) benchmark: p uniform predictors, of which only
# the first five enter the response, plus standard normal noise.
friedman <- function(n, p = 6) {
    X <- matrix(runif(n*p), nrow = n, ncol = p, dimnames = list(1:n, paste0("x_", 1:p)))
    y <- 10*sin(pi*X[, 1]*X[, 2]) + 20*(X[, 3] - 0.5)^2 + 10*X[, 4] + 5*X[, 5] + rnorm(n)
    data.frame(cbind(y, X))
}

We generate a training sample of size 1,000 and a test sample of size 100,000.

set.seed(42)

n <- 10^3
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

library(ranger)

# keep.inbag = TRUE is required for the infinitesimal jackknife standard errors
rf <- ranger(y ~ ., data = training, num.trees = 10^3, keep.inbag = TRUE)

pred <- predict(rf, data = test, type = "se", se.method = "infjack")

y_hat <- pred$predictions

Lower <- y_hat - 1.96 * pred$se
Upper <- y_hat + 1.96 * pred$se

mean((Lower <= test$y) & (test$y <= Upper))  # effective coverage on the test set

We set a nominal coverage of 95%, but the simulation yields an effective coverage of approximately 20.2%. The question: why is the effective predictive coverage so low?

Notes:

We're computing the $1-\alpha$ level confidence intervals as described in their footnote (page 3): $\hat{y}\pm z_\alpha \hat{\sigma}$.
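For reference, a minimal sketch of the same construction at a general level, assuming $z_\alpha$ here denotes the two-sided standard normal quantile (for $\alpha = 0.05$ this reduces to the 1.96 used above):

# General (1 - alpha) interval of the form y_hat +/- z * sigma_hat;
# alpha = 0.05 reproduces the 1.96 multiplier used above.
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
Lower <- y_hat - z * pred$se
Upper <- y_hat + z * pred$se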

As pointed out by usεr11852 in the comments below, the effective coverage decreases as the number of trees in the forest increases. Examples:

num.trees =  50 => effective coverage = 0.94855
num.trees = 100 => effective coverage = 0.76876 
num.trees = 150 => effective coverage = 0.68959 
num.trees = 200 => effective coverage = 0.56038 
num.trees = 250 => effective coverage = 0.55393 
num.trees = 300 => effective coverage = 0.32304 
num.trees = 350 => effective coverage = 0.55413 
num.trees = 400 => effective coverage = 0.26372 
num.trees = 450 => effective coverage = 0.26232 
num.trees = 500 => effective coverage = 0.23139 
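A minimal sketch of this sweep, reusing the training and test sets from above (exact coverages will vary with the seed):

# Sweep num.trees and report the effective coverage of the 95% intervals.
for (B in seq(50, 500, by = 50)) {
  rf_B <- ranger(y ~ ., data = training, num.trees = B, keep.inbag = TRUE)
  p_B <- predict(rf_B, data = test, type = "se", se.method = "infjack")
  covered <- (p_B$predictions - 1.96 * p_B$se <= test$y) &
             (test$y <= p_B$predictions + 1.96 * p_B$se)
  cat("num.trees =", B, "=> effective coverage =", mean(covered), "\n")
}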

Best Answer

The way the OP is written and the results are evaluated, this question appears to confuse a prediction interval with a confidence interval.

The prediction interval gives an interval estimate of a random variable, while a confidence interval provides an interval estimate of a parameter. The difference between the two is conceptually as large as the difference between the standard deviation of a variable and the standard error of its mean.
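To make the analogy concrete, here is a tiny illustrative sketch (not part of the original answer) contrasting the two scales on simulated normal data:

# The standard deviation measures the spread of individual observations
# (prediction-interval scale); the standard error measures the uncertainty
# of the estimated mean (confidence-interval scale).
x <- rnorm(10^4, mean = 5, sd = 2)
sd(x)                    # about 2
sd(x) / sqrt(length(x))  # about 0.02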

EDIT (closing logical gap): The confusion often arises in relation to model predictions, because they can be viewed both as a guess for a single random variable and as an estimate of a conditional mean. END EDIT.

Jackknife estimates provide uncertainties for an estimated parameter, so the method you tried gives a confidence interval for the conditional mean, while your OP asks for a prediction interval.

This would also explain why more trees (hence more stable estimates of the conditional mean) lead to shorter intervals.
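One way to check this interpretation: since the true conditional mean is known in this simulation, we can ask whether the infinitesimal jackknife intervals cover it. A sketch (not part of the original answer), reusing the objects from the question above:

# True conditional mean E[Y | X] of the Friedman process on the test set
mu <- 10 * sin(pi * test$x_1 * test$x_2) + 20 * (test$x_3 - 0.5)^2 +
      10 * test$x_4 + 5 * test$x_5
mean((Lower <= mu) & (mu <= Upper))  # coverage of the conditional mean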

Two approaches that I can think of:

Quantile random forest

I wanted to give you an example of how to use a quantile random forest to produce (conceptually slightly too narrow) prediction intervals, but instead of the nominal 80% coverage, I end up with about 90% coverage, see also @Andy W's answer and @Zen's comment. The same happens with different parametrizations. I cleaned up the code a little bit, so here it is anyway.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
        10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

# quantreg = TRUE stores the terminal-node observations needed for quantile prediction
rf <- ranger(y ~ ., data = training, quantreg = TRUE)

pred <- predict(rf, data = test, quantiles = c(0.1, 0.9), type = "quantiles")

y_hat <- pred$predictions

Lower <- y_hat[, 1]
Upper <- y_hat[, 2]

mean((Lower <= test$y) & (test$y <= Upper))  # 0.91616
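For a nominal 95% interval one can request the 2.5% and 97.5% quantiles from the same fitted forest, without refitting (a variant not in the original answer):

# Nominal 95% interval from the same quantile forest; no refit is needed.
pred95 <- predict(rf, data = test, quantiles = c(0.025, 0.975), type = "quantiles")
mean((pred95$predictions[, 1] <= test$y) & (test$y <= pred95$predictions[, 2]))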

Generic prediction intervals

There is a generic method that I use from time to time that works with any regression technique. It is based on two models: the first models the conditional mean; the second models the absolute residuals of the first and, as such, provides a model for the conditional standard deviation of the response. The prediction interval is then constructed in the same way as in your example. Like the quantile approach above, it ignores model uncertainty. A second issue is that the loss function used in the second model is rather arbitrary.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

rf_mean <- ranger(y ~ ., data = training)
# rf_mean$predictions are out-of-bag, so the residuals are honest
rf_sd <- ranger(abs(rf_mean$predictions - training$y) ~ ., data = training)

pred_mean <- predict(rf_mean, data = test)$predictions
pred_sd <- predict(rf_sd, data = test)$predictions

Lower <- pred_mean - 1.96 * pred_sd
Upper <- pred_mean + 1.96 * pred_sd

mean((Lower <= test$y) & (test$y <= Upper))  # 89%
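As a possible remedy for the arbitrary normal multiplier, one could calibrate it empirically from the out-of-bag standardized residuals instead of assuming 1.96. A sketch of this variant (not in the original answer):

# Replace 1.96 with the empirical 95% quantile of the absolute
# standardized out-of-bag residuals.
k <- quantile(abs(training$y - rf_mean$predictions) / rf_sd$predictions, 0.95)
Lower2 <- pred_mean - k * pred_sd
Upper2 <- pred_mean + k * pred_sd
mean((Lower2 <= test$y) & (test$y <= Upper2))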

Interestingly, the coverage of this method is very similar to that of the quantile random forest, which might indicate that Friedman's data is easy to predict?
