The essence of this I do take to be statistical, rather than a pure programming problem.
As you don't use weights in your manual calculation, the lack of agreement is at first sight not surprising, but there are other problems too, some but not all trivial.
Random number generation. Although you carefully set seed, note also that Stata's random number generator changed in Stata 14 (http://www.stata.com/help.cgi?whatsnew13to14 #15). Below I am using Stata 10.1 on an old computer. This just affects which numbers are generated and should be separate from getting the same answers out of regress and a hand calculation. You should be able to reproduce what I am doing with version control.
Using double precision consistently. regress uses double precision, so hand calculations should follow along.
Using weights that make sense. Your weights by construction have expectation zero, but Stata ignores negative weights, typically about half of them. We could adjust for that by comparing like with like, but it's cleaner to generate positive weights directly.
Using the same formula. This is what interests you.
The code here doesn't follow yours. In some cases I have used slightly more direct routes. Note that scalars hold more precision than locals. The number of decimal places here is ridiculous for every purpose except the present purpose of reproducing a calculation.
. clear
. set seed 12345
. drawnorm x e, n(1000)
(obs 1000)
. gen y = 5+(2*x)+e
. gen double w = runiform()
. reg y x
Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 1, 998) = 3713.72
Model | 3846.76283 1 3846.76283 Prob > F = 0.0000
Residual | 1033.75252 998 1.03582417 R-squared = 0.7882
-------------+------------------------------ Adj R-squared = 0.7880
Total | 4880.51536 999 4.88540076 Root MSE = 1.0178
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.963036 .0322124 60.94 0.000 1.899824 2.026248
_cons | 5.015532 .0322162 155.68 0.000 4.952313 5.078752
------------------------------------------------------------------------------
. di %23.18f e(rmse)
1.017754475398205700
. predict double err, res
. gen double sqerr = err^2
. su sqerr, meanonly
. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r))
rmse 1.017754475398205700
. reg y x [aw=w]
(sum of wgt is 4.9129e+02)
Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 1, 998) = 3666.47
Model | 3913.22661 1 3913.22661 Prob > F = 0.0000
Residual | 1065.16666 998 1.06730126 R-squared = 0.7860
-------------+------------------------------ Adj R-squared = 0.7858
Total | 4978.39327 999 4.98337665 Root MSE = 1.0331
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.984769 .0327783 60.55 0.000 1.920447 2.049092
_cons | 5.005039 .0326721 153.19 0.000 4.940925 5.069153
------------------------------------------------------------------------------
. di %23.18f e(rmse)
1.033102736537657100
. predict double werr, res
. su w, meanonly
. scalar wsum = r(sum)
. gen double sqwerr = (w/wsum) * e(N) * werr^2
. su sqwerr, meanonly
. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r))
rmse 1.033102736537657300
I regard Stata code as self-evident, much as presumably R users feel about R. What may help is that r(sum), e(N) and e(df_r) are saved scalar results accessible after running commands, and that res is the option that instructs predict to produce plain residuals, not predicted values.
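For anyone who wants to check the aweight arithmetic outside Stata, here is a minimal Python sketch of the same formula (the residuals and weights below are made up for illustration, not the dataset above): rescale the weights to sum to the number of observations, divide the weighted sum of squared residuals by the residual degrees of freedom, and take the square root.

```python
import math

def weighted_rmse(residuals, weights, n_params=2):
    # Stata-style analytic weights: rescale weights to sum to N,
    # then divide the weighted residual SS by the residual df.
    n = len(residuals)
    wsum = sum(weights)
    df_r = n - n_params
    ss = sum((w / wsum) * n * e * e for e, w in zip(residuals, weights))
    return math.sqrt(ss / df_r)

# Tiny made-up example (illustrative numbers only):
res = [0.5, -1.2, 0.3, 0.9, -0.4]
w = [0.2, 0.7, 0.4, 0.9, 0.1]
print(weighted_rmse(res, w))
```

With equal weights this collapses to the unweighted root MSE, which is a quick sanity check on the rescaling.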
The very minor discrepancy in the second case I trust to be trivial.
If compatibility with R or anything else continues to worry you, I'd recommend a test suite that does not depend on either program's random number generator.
The credible interval gets larger the further you extrapolate from the data. In this case, the data density is approximately the same throughout, so that broadening is not observed, and the credible interval is dominated by the estimate of the assumed additive noise contaminating the observations ($\sigma_\mathrm{nois}$). If you extended the plot, say from -100 to +100, I suspect you would see the broadening.
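To make the mechanism explicit (this is the standard decomposition for GP regression, not anything specific to your implementation): the predictive variance at a test input $x_*$ splits into two parts,

$$\operatorname{Var}[y_*] = \sigma_f^2(x_*) + \sigma_\mathrm{nois}^2,$$

where $\sigma_f^2(x_*)$ is the posterior uncertainty about the latent function and $\sigma_\mathrm{nois}^2$ is the estimated observation noise. Near dense data $\sigma_f^2(x_*)$ is small, so the band width is roughly constant at about $2 \times 1.96\,\sigma_\mathrm{nois}$; far from the data $\sigma_f^2(x_*)$ grows back toward the prior variance, which is the broadening you would see.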
The width of the band looks about right for a 95% credible interval: it has about as many data points lying outside it as would reasonably be expected. If you expect the noise to depend on the attribute/feature, you need a heteroscedastic Gaussian process model.
By confidence interval, you probably mean a prediction interval. The two terms are often confused, but yes, there is a difference.
Point predictions and prediction intervals offer different kinds of information.
Point predictions predict a single number. You can assess their quality using the rmse, or many other accuracy KPIs, such as the mse, the mae or even the mape. Which KPI is most meaningful really depends on your loss function: the mse penalizes large errors much more than the mae does.
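For concreteness, here is a small pure-Python sketch of these KPIs (the data are made up; in practice you would use a library implementation):

```python
import math

def mse(actual, pred):
    # mean squared error
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    # root mean squared error
    return math.sqrt(mse(actual, pred))

def mae(actual, pred):
    # mean absolute error
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    # mean absolute percentage error; undefined if any actual is zero
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

actual = [10.0, 12.0, 8.0, 11.0]
pred = [9.0, 14.0, 8.5, 10.0]
print(mse(actual, pred), rmse(actual, pred), mae(actual, pred), mape(actual, pred))
```

Note how the single error of 2 dominates the mse far more than it dominates the mae, which is the loss-function point above.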
If your Bayesian method gives you an entire posterior distribution, you can extract a point forecast from it in various ways, e.g., by taking the mean, or the median, or the mode. Fun fact: if you use the rmse or the mse, take the mean - and if you use the mae, take the median. There won't be much of a difference if the posterior distribution is symmetric, but if it's asymmetric, this can indeed make a difference.
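That fun fact can be checked numerically. The sketch below uses a made-up right-skewed sample standing in for a posterior, and scans a grid of candidate point forecasts under each loss:

```python
import statistics

# Made-up right-skewed "posterior sample" (illustrative only).
posterior = [1, 1, 2, 2, 3, 3, 4, 6, 14]

def exp_sq_loss(point):
    # average squared error if we forecast `point`
    return sum((x - point) ** 2 for x in posterior) / len(posterior)

def exp_abs_loss(point):
    # average absolute error if we forecast `point`
    return sum(abs(x - point) for x in posterior) / len(posterior)

mean = statistics.mean(posterior)      # 4
median = statistics.median(posterior)  # 3

# Scan candidate forecasts on a grid from 0 to 20 in steps of 0.01.
grid = [i / 100 for i in range(2001)]
best_sq = min(grid, key=exp_sq_loss)   # grid minimizer of squared loss
best_abs = min(grid, key=exp_abs_loss) # grid minimizer of absolute loss

print(best_sq, mean)    # squared loss is minimized at the mean
print(best_abs, median) # absolute loss is minimized at the median
```

Because the sample is skewed, the two optimal point forecasts differ (4 versus 3 here), which is exactly the asymmetric case described above.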
Whether you actually want a point prediction or a prediction interval depends on what you want to do with your predictions. Point predictions are easier for non-statisticians to grasp. PIs allow you to do scenario analyses; they tell you how sure you are about a prediction (if they are well calibrated).