It is exactly as you suspect -- the fact that leaf nodes contain means over some set of training objects means that any regression tree model tightens the response distribution and makes extrapolation impossible. An ensemble of course does not help with that, and in fact makes the situation worse.
The naive solution (and a dangerous one, because of overfitting) is to wrap the model in some kind of classical regression which would rescale the response to its desired distribution.
The better solution is one of the model-in-leaf tree models, for instance MOB in the party package. The idea here is that the partitioning of the feature space should stop when the problem is reduced not to a single value (as in a regular tree) but to a simple relation (say, a linear one) between the response and some predictors. Such a relation can then be resolved by fitting a simple model which won't disturb the distribution or trim extreme values, and which is able to extrapolate.
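For instance, a minimal sketch with lmtree from partykit (the successor to party); the data and variable names here are invented for illustration:

library(partykit)
set.seed(1)
# Toy data: the slope of y on x changes with the partitioning variable z
d <- data.frame(x = runif(500, 0, 10), z = runif(500))
d$y <- with(d, ifelse(z < 0.5, 1 + 2 * x, 8 - x) + rnorm(500))
# Fit a linear model of y on x in each leaf, partitioning on z
fit <- lmtree(y ~ x | z, data = d)
# Unlike a mean-in-leaf tree, the leaf models can extrapolate beyond
# the observed range of x:
predict(fit, newdata = data.frame(x = 15, z = 0.25))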
The t matrix is the one to use in the way you describe. Eqs. 4 through 7 in the Dong & Peng paper that Joe_74 references correspond to the elements of the same names in the mipo object (see the mice documentation), so t is the correct variance-covariance matrix for the pooled regression coefficients qbar that you're actually using. ubar and b only matter here in that they were used to compute t.
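For reference, those four quantities are just Rubin's pooling rules: with m imputations, per-imputation coefficient estimates qhat_i, and their estimated covariance matrices U_i,

qbar = (1/m) * sum_i qhat_i                               (pooled coefficients)
ubar = (1/m) * sum_i U_i                                  (within-imputation variance)
b    = (1/(m-1)) * sum_i (qhat_i - qbar)(qhat_i - qbar)'  (between-imputation variance)
t    = ubar + (1 + 1/m) * b                               (total variance of qbar)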
Presumably you'll be using more than one predictor, so here's an MWE for that which should be easy to modify.
library(mice)  # note: this MWE assumes an older mice, where pool() exposes qbar and t directly
set.seed(500)
# Toy data: 20 cases, with 5 missing values in x1 to give mice something to impute
dat <- data.frame(y = runif(20, 0, .5), x1 = c(runif(15), rep(NA, 5)), x2 = runif(20, 0.5, 1))
imp <- mice(dat)                        # impute the missing x1 values
impMods <- with(imp, lm(y ~ x1 + x2))   # fit the regression on each imputed dataset
pooledMod <- pool(impMods)              # pool the fits with Rubin's rules
# Generate some hypothetical cases we want predictions for
newCases <- data.frame(x1=c(4,7), x2=c(-6,0))
# Tack on the column of 1's for the intercept
newCases <- cbind(1, newCases)
# Generating the actual predictions is simple: sums of values times coefficients
yhats <- rowSums( sweep(newCases, 2, pooledMod$qbar, `*`) )
# Take each new case x and compute the standard quadratic form sqrt(x' t x)
# with the pooled vcov matrix t to get the prediction's standard error
predErr <- apply(newCases, 1, function(X) sqrt(t(X) %*% pooledMod$t %*% X))
# Finally, put together a plot-worthy table of predictions with upper and lower bounds
# (I'm just assuming normality here rather than using t-distribution critical values)
results <- data.frame(yhats, lwr=yhats-predErr*1.96, upr=yhats+predErr*1.96)
This is partly a response to @Sashikanth Dareddy (since it will not fit in a comment) and partly a response to the original post.
Remember what a prediction interval is: it is an interval or set of values where we predict that future observations will lie. Generally a prediction interval has two main pieces that determine its width: a piece representing the uncertainty about the predicted mean (or other parameter), which is the confidence-interval part, and a piece representing the variability of the individual observations around that mean. The confidence-interval part is fairly robust thanks to the Central Limit Theorem, and in the case of a random forest the bootstrapping helps as well. But the prediction interval depends completely on the assumptions about how the data is distributed given the predictor variables; the CLT and bootstrapping have no effect on that part.
The prediction interval should be wider wherever the corresponding confidence interval would also be wider. Other things that affect its width are assumptions about equal variance or not; these have to come from the knowledge of the researcher, not from the random forest model.
A prediction interval does not make sense for a categorical outcome (you could do a prediction set rather than an interval, but most of the time it would probably not be very informative).
We can see some of the issues around prediction intervals by simulating data where we know the exact truth. Consider the following data:
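(The original simulation code is not shown here; the following sketch matches the description below -- true mean 10 when both predictors are 0, normal noise with sd 1 -- with the nonzero slopes chosen arbitrarily for illustration.)

set.seed(1)
x1 <- rep(0:1, each = 500)                      # two binary predictors,
x2 <- rep(0:1, each = 250, length.out = 1000)   # all 4 combinations present
y  <- 10 + 5 * x1 + 10 * x2 + rnorm(1000)       # mean 10 at x1 = x2 = 0, sd 1
dat <- data.frame(x1, x2, y)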
This particular data follows the assumptions for a linear regression and is fairly straightforward for a random forest fit. We know from the "true" model that when both predictors are 0 the mean is 10, and we also know that the individual points follow a normal distribution with a standard deviation of 1. This means that the 95% prediction interval based on perfect knowledge for these points would be from 8 to 12 (well, actually from 8.04 to 11.96, but rounding keeps it simpler). Any estimated prediction interval should be wider than this (not having perfect information adds width to compensate) and should include this range.
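The endpoints of that perfect-knowledge interval are just normal quantiles:

10 + c(-1, 1) * qnorm(0.975) * 1   # 8.04 11.96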
Let's look at the intervals from regression:
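(Again a sketch, continuing from the simulated data above:)

fit.lm <- lm(y ~ x1 + x2, data = dat)       # correctly specified model
newdat <- expand.grid(x1 = 0:1, x2 = 0:1)   # the 4 possible new cases
predict(fit.lm, newdata = newdat, interval = "prediction")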
We can see there is some uncertainty in the estimated means (the confidence-interval piece), and that gives us a prediction interval that is wider than, but includes, the 8 to 12 range.
Now let's look at the intervals based on the individual predictions of the individual trees. We should expect these to be wider, since the random forest does not benefit from the assumptions that the linear regression does (assumptions which we know to be true for this data):
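(A sketch using the randomForest package, again continuing from above:)

library(randomForest)
fit.rf <- randomForest(y ~ x1 + x2, data = dat)
# predict.all = TRUE returns every individual tree's prediction for each new case
tree.preds <- predict(fit.rf, newdata = newdat, predict.all = TRUE)$individual
t(apply(tree.preds, 1, quantile, probs = c(0.025, 0.975)))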
The intervals are wider than the regression prediction intervals, but they don't cover the entire range. They do include the true values and therefore may be legitimate as confidence intervals, but they only predict where the mean (predicted value) is, not the added piece for the distribution around that mean. For the first case, where x1 and x2 are both 0, the intervals don't go below 9.7; this is very different from the true prediction interval, which goes down to 8. If we generate new data points then there will be several points (much more than 5%) that fall in the true and regression intervals but don't fall in the random forest intervals.
To generate a prediction interval you will need to make some strong assumptions about the distribution of the individual points around the predicted means. Then you could take the predictions from the individual trees (the bootstrapped confidence-interval piece) and generate a random value from the assumed distribution with that center. The quantiles of those generated values may form the prediction interval (but I would still test it; you may need to repeat the process several more times and combine the results).
Here is an example of doing this by adding normal deviations (since we know the original data used a normal distribution) to the predictions, with the standard deviation based on the estimated MSE from that tree:
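(A sketch; note that randomForest does not report a per-tree MSE directly, so here fit.rf$mse, the running out-of-bag MSE as trees are added, stands in for it:)

sim <- tree.preds
for (j in seq_len(ncol(sim))) {
  # jitter tree j's predictions with normal noise, sd from the OOB MSE
  sim[, j] <- sim[, j] + rnorm(nrow(sim), 0, sqrt(fit.rf$mse[j]))
}
# Empirical 95% interval from the jittered draws
t(apply(sim, 1, quantile, probs = c(0.025, 0.975)))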
These intervals contain the ones based on perfect knowledge, so they look reasonable. But they will depend greatly on the assumptions made (the assumptions are valid here because we used our knowledge of how the data was simulated; they may not be as valid in real data cases). I would still repeat the simulations several times, on data that looks more like your real data (but simulated so you know the truth), before fully trusting this method.