Using an RMSE with derived confidence interval to generate a prediction interval for an estimate

Tags: boosting, cart, mse, prediction interval, random forest

Previous questions have asked about creating prediction intervals for estimates derived from random forests or boosted regression trees, much as is easily done with linear regression models.

A comment on this question described the RMSE as an estimate of the standard deviation of the residual error, supporting use of the RMSE in construction of an interval around a prediction (or estimate) from a BRT or RF.

Am I right in thinking that CART methods relax the requirement for homoscedasticity? If so, it would seem that using the RMSE calculated across the full range of residuals would lead to inappropriately wide intervals in some regions, and too narrow intervals in others. It would then seem the only way to estimate an interval would be through bootstrapping (BRT) or accessing trees' individual predictions (RF).
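The concern about a single global RMSE can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the question: the model is taken to predict the mean perfectly, and the residual SD is a hypothetical linear function of one predictor `x`. It measures how often a fixed $\pm 1.96 \times$ RMSE band contains the residual in the low-noise and high-noise halves of the data.

```python
import math
import random


def regional_coverage(n=20000, z=1.96, seed=0):
    """Coverage of a global +/- z*RMSE band in two regions of x.

    Assumption (illustrative only): residual SD = 0.2 + 1.8 * x,
    with x uniform on [0, 1) and a model that predicts the mean exactly.
    """
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    # Heteroscedastic residuals: noise grows with x.
    residuals = [rng.gauss(0.0, 0.2 + 1.8 * x) for x in xs]

    # One RMSE computed across the full range of residuals.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    half_width = z * rmse

    def coverage(lo, hi):
        # Fraction of residuals in [lo, hi) that fall inside the band.
        region = [r for x, r in zip(xs, residuals) if lo <= x < hi]
        return sum(abs(r) <= half_width for r in region) / len(region)

    return rmse, coverage(0.0, 0.5), coverage(0.5, 1.0)


rmse, low_cov, high_cov = regional_coverage()
print(f"global RMSE = {rmse:.2f}")
print(f"coverage where noise is small: {low_cov:.3f}")
print(f"coverage where noise is large: {high_cov:.3f}")
```

In this toy setting the band is too wide where the noise is small and too narrow where it is large, which is the behaviour the question anticipates.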

That same question ("Confidence interval of RMSE") attracted advice on constructing a confidence interval for the standard deviation of the residuals, based on a chi-squared statistic around the RMSE and assuming the residuals are normally distributed with zero mean.
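A minimal sketch of that chi-squared construction, assuming $n$ independent, zero-mean, normal residuals so that $n \cdot \mathrm{RMSE}^2 / \sigma^2$ follows a chi-squared distribution with $n$ degrees of freedom. To stay within the standard library, the chi-squared quantile uses the Wilson-Hilferty approximation, which is a substitution made here rather than part of the original advice; an exact quantile function would be preferable for small $n$. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, df):
    """Approximate chi-squared quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def sd_confidence_interval(rmse, n, level=0.95):
    """Two-sided CI for the residual SD, given the RMSE of n residuals.

    Uses df = n (not n - 1) because the residual mean is assumed to be
    exactly zero rather than estimated from the data.
    """
    alpha = 1.0 - level
    sum_sq = n * rmse ** 2
    lower = math.sqrt(sum_sq / chi2_quantile(1.0 - alpha / 2.0, n))
    upper = math.sqrt(sum_sq / chi2_quantile(alpha / 2.0, n))
    return lower, upper


sd_low, sd_high = sd_confidence_interval(rmse=2.0, n=100)
print(f"95% CI for SD: ({sd_low:.3f}, {sd_high:.3f})")
```

The interval is asymmetric around the RMSE because the chi-squared distribution is skewed.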

How would such an interval on the SD be used? Would using the upper end of the CI for the SD, $\hat{SD}_u$, as the value in an interval such as $\hat{x} \pm z\hat{SD}_u$ give a valid, if conservative, interval? Could you still attribute a specific confidence level (e.g. 95%) to such an interval, given that it has 'nested' confidences?
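One way to probe the 'nested confidence' question empirically is to simulate it. The sketch below assumes the idealised case (independent, zero-mean, normal, homoscedastic residuals with true SD 1) and again uses the Wilson-Hilferty approximation for the chi-squared quantile. It estimates how often $\pm z\hat{SD}_u$ contains a new residual when $\hat{SD}_u$ is the upper end of a 95% CI for the SD. The function name and default settings are hypothetical.

```python
import math
import random
from statistics import NormalDist


def conservative_coverage(n=50, trials=4000, level=0.95, seed=1):
    """Empirical coverage of +/- z * SD_upper for a new residual."""
    rng = random.Random(seed)
    norm = NormalDist()
    z = norm.inv_cdf(0.5 + level / 2.0)

    # Lower chi-squared quantile with df = n (Wilson-Hilferty approximation);
    # dividing by it gives the UPPER end of the CI for the SD.
    c = 2.0 / (9.0 * n)
    z_low = norm.inv_cdf((1.0 - level) / 2.0)
    chi2_low = n * (1.0 - c + z_low * math.sqrt(c)) ** 3

    hits = 0
    for _ in range(trials):
        # Residuals used to estimate the SD.
        residuals = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sum_sq = sum(r * r for r in residuals)
        sd_upper = math.sqrt(sum_sq / chi2_low)
        # An independent new residual to be covered by the interval.
        new_residual = rng.gauss(0.0, 1.0)
        hits += abs(new_residual) <= z * sd_upper
    return hits / trials


print(f"empirical coverage: {conservative_coverage():.3f}")
```

Under these assumptions the empirical coverage comes out above the nominal 95%, which suggests the label is better read as a lower bound than as an exact confidence level. The simulation says nothing about heteroscedastic or non-normal residuals.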

Best Answer

CART, as I understand it, makes no homoscedasticity assumption. If anything, it presumes that the variance of each component is independent of the variance of all the other components. It doesn't account for correlations between variables either.

The normality assumption is problematic. It is convenient but not necessarily true. There is often hand-waving about the "law of large numbers", but the real world, in my personal opinion, likes to frustrate such things.

Have you considered using quantile regression forests for your estimate, or is that part of the problem?
