First, on the use of the word probability, frequentists don't have a problem with using the word probability when predicting something where the random piece has not taken place yet. We don't like the word probability for a confidence interval because the true parameter is not changing (we are assuming it is a fixed, though unknown, value) and the interval is fixed because it is based on data that we have already collected. For example if our data comes from a random sample of adult male humans and x is their height and y is their weight and we fit the general regression model then we don't use probability when talking about the confidence intervals. But if I want to talk about what is the probability of a 65 inch tall male chosen at random from all 65 inch tall males having a weight within a certain interval, then it is fine to use probability in that context (because the random selection has not yet been made, so probability makes sense).
So I would say that the answer to the bonus question is "Yes". If we knew enough information, then we could compute the probability of seeing a y value within an interval (or find an interval with the desired probability).
For your statement labeled "1." I would say that it is OK if you use a word like "approximate" when talking about the interval or probability. Like you mention in the bonus question, we can decompose the uncertainty into a piece about the center of the prediction and a piece about the randomness around the true mean. When we combine these to cover all our uncertainty (and assuming we have the model/normality correct) we have an interval that will tend to be too wide (though can be too narrow as well), so the probability of a new randomly chosen point falling into the prediction interval is not going to be exactly 95%. You can see this by simulation. Start with a known regression model with all the parameters known. Choose a sample (across many x values) from this relationship, fit a regression, and compute the prediction interval(s). Now generate a large number of new data points from the true model again and compare them to the prediction intervals. I did this a few times using the following R code:
x <- 1:25
y <- 5 + 3*x + rnorm(25, 0, 5)
plot(x,y)
fit <- lm(y~x)
tmp <- predict(fit, data.frame(x=1:25), interval='prediction')
sapply( 1:25, function(x){
y <- rnorm(10000, 5+3*x, 5)
mean( tmp[x,2] <= y & y <= tmp[x,3] )
})
I ran the above code a few times (around 10, but I did not keep careful count) and most of the time the proportion of new values falling in the intervals ranged in the 96% to 98% range. I did have one case where the estimated standard deviation was very low that the proportions were in the 93% to 94% range, but all the rest were above 95%. So I would be happy with your statement 1 with the change to "approximately 95%" (assuming all the assumptions are true, or close enough to be covered in the approximately).
Similarly, statement 2 needs an "approximately" or similar, because to cover our uncertainty we are capturing on average more than the 95%.
The CART, as I understand it, does not have homoscedasticity assumptions. If anything it presumes that the variance of each component is independent from the variance of all other components. It doesn't account for correlations in variables either.
The normality assumption is problematic. It is convenient but not necessarily true. There is often hand-waving about "law of large numbers" but the real world, impo, likes to frustrate such things.
Have you considered using quantile regression forests for your estimate, or is that part of the problem?
Best Answer
Try the nnetpredint package. https://cran.r-project.org/web/packages/nnetpredint/
I’ve met the same problem and I also want to construct a prediction confidence interval to the neural networks. So I tried to develop the nnetpredint (R package), using the method from these related papers, which use the Jacobian matrix (first order derivative of the training datasets with gradient function) to estimate model errors instead of the Hessian matrix. The manual is here and the method has the function interface to the models trained by nnet, neuralnet and RSNNS packages:
The example for nnet package is here. The method nnetPredInt takes the model weights, nodes number, training datasets, etc. as input and compute the prediction interval for the new datasets.
The keys to the estimation methods:
Use the first order Taylor expansion to expand the f(x) at each weight parameters. And calculate the gradient vector/ Jacobian matrix from the training datasets.
References:
De Veaux R. D., Schumi J., Schweinsberg J., Ungar L. H., 1998, "Prediction intervals for neural networks via nonlinear regression", Technometrics 40(4): 273-282.
Chryssolouris G., Lee M., Ramsey A., "Confidence interval prediction for neural networks models",IEEE Trans. Neural Networks, 7 (1), 1996, pp. 229-232
And also check out this paper for detailed maths. http://cdn.intechopen.com/pdfs-wm/14915.pdf Confidence Intervals for Neural Networks and Applications to Modeling Engineering Materials