Solved – the best way to represent uncertainty from linear interpolation

error, interpolation

A little background to this question:

Part of my job is to conduct flood risk appraisals to help determine the viability of flood defence construction. There is a standardised way to do this, which uses flood depth data from a computer model and depth-damage data gathered from large scale flood events. It is understood that there is significant uncertainty in the raw data, but it still allows us to estimate the relative value of different defences.

I have linearly interpolated between points on a flood damage-probability graph and integrated with respect to probability to find an average annual damage. To quantify the uncertainty in the interpolation, I have calculated an upper (blue dashed line in the figure) and lower (red) bound on the interpolated curve. Although this gives quite extreme values, it allows comparison of the relative uncertainty between different data sets.

[Figure: damage-probability curve showing the linear interpolation with its upper (blue dashed) and lower (red) bounds]
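For concreteness, here is a minimal sketch of the calculation in Python, using made-up probabilities and damages rather than the actual model output: a trapezoidal estimate of the average annual damage, bracketed by left and right Riemann sums as the upper and lower bounds.

    import numpy as np

    # Illustrative (made-up) data: annual exceedance probabilities and modelled damages.
    # Probabilities are sorted in increasing order; damages are non-increasing.
    p = np.array([0.001, 0.01, 0.05, 0.10, 0.20, 0.33])        # annual exceedance probability
    damage = np.array([900e3, 600e3, 250e3, 120e3, 40e3, 0.0])  # hypothetical damage values

    dp = np.diff(p)

    # Trapezoidal estimate of the average annual damage (area under the curve).
    aad_trapz = np.sum((damage[:-1] + damage[1:]) / 2 * dp)

    # Upper and lower bounds from the left and right Riemann sums
    # (valid because damage is non-increasing in probability).
    aad_upper = np.sum(damage[:-1] * dp)  # left Riemann sum
    aad_lower = np.sum(damage[1:] * dp)   # right Riemann sum

    print(aad_lower, aad_trapz, aad_upper)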

EDIT: The value of a defence is its benefit, which is the cost avoided. For example, a defence with a 1% standard of protection has a benefit equal to the area under the curve between 1% and 33% (the x-axis limit). This means the total area under the curve is never calculated: defences normally have a standard of protection between 0.1% and 20%, so outlying values do not contribute to the uncertainty in the benefit.
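Continuing the arrays from the sketch above, the benefit of a hypothetical defence with a 1% standard of protection would then be the area under the curve restricted to that probability range (0.01 happens to be a data point in the illustrative arrays, so no extra interpolation is needed):

    # Area under the curve between p = 0.01 and the x-axis limit p = 0.33.
    mask = p >= 0.01
    benefit = np.sum((damage[mask][:-1] + damage[mask][1:]) / 2 * np.diff(p[mask]))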

My questions are:

  1. Is this an acceptable way of calculating the uncertainty of the method used?

  2. Given that there is also significant random error in the data points, is there a better way to represent the total uncertainty? For example, a standard deviation assuming a normal distribution.

Best Answer

There are a few issues here, but I fear that the interpolation error of integration may be the least of them.

To start, if this work is done in a regulatory context one must use whatever procedures are required. So if there are regulatory requirements for how to estimate integration error then those take practical precedence. What follows ignores such regulations. First we'll examine the situation when the cost estimates have little or no error at the indicated event probabilities, then discuss how uncertainties in the costs and in other aspects of the modeling affect the estimated values of defenses at specified standards of protection.

If costs are known precisely

What you have displayed are the limits of the possible cost-probability curve if you know that the cost curve is non-increasing with increasing x-axis (probability) values. In this case that seems to be a reasonable assumption, as lower-probability events (higher flood stage) should have costs at least as large as higher-probability events (low or no flood stage).

The integrals estimated by the blue and red lines are Riemann sums, with the blue line providing a left Riemann sum and the red line a right Riemann sum. Your proposal to use those as outer limits for the interpolation error itself seems quite appropriate.

In general for smooth curves, the trapezoidal rule (which you presumably are using for your integration estimate, and is the average of the left and right Riemann sums) has a defined relationship between the integration error and a value of the second derivative of the cost-probability curve somewhere within the limits of integration. So if you can assume reasonable limits for the second-derivative values, that could set limits to the integration error. For the direction of systematic error, if you can assume that the curve is convex you know that your interpolation will over-estimate the true area.
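As a reminder, the standard single-interval form of that relationship, writing the cost curve as a twice-differentiable function $C(p)$ over one interpolation interval $[p_0, p_1]$ of width $\Delta p$, is

$$\int_{p_0}^{p_1} C(p)\,dp \;=\; \frac{C(p_0) + C(p_1)}{2}\,\Delta p \;-\; \frac{(\Delta p)^3}{12}\,C''(\xi), \qquad \xi \in (p_0, p_1),$$

so a bound $|C''(p)| \le M$ on the interval limits that segment's interpolation error to $M(\Delta p)^3/12$, and $C'' > 0$ (convexity) makes the trapezoidal term an over-estimate of the true area.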

Convexity and limits on second-derivative values might, however, not be good assumptions for flood costs. For example, there could be a fairly fast jump in costs as flood stage reached the level of first-story floors, and then another jump as flood stage reached the level of second-story floors. So convexity and assumptions about limits on second-derivative values would be questionable. That could also make it risky to try to fit a smooth curve to the set of data pairs and calculate the area analytically from the equation for the smooth curve.

So if the costs are known precisely then the limits you propose seem to be the best you can get in general for the limits of the area under the curve between the specified x-axis limits.

With uncertainty in the costs

Estimating integration error due to interpolation does not deal with an additional source of uncertainty: the uncertainty in the y-axis cost estimates. A statistician would want to see error estimates for each of those cost values and would want you to take those error estimates into account to get a better measure of the actual error in your estimate of the value of the integral.

This is a particular problem with flood damage prediction. The x-axis probabilities typically represent the probability per year of a flood that is greater than a specified level (stage) above normal water levels. The damage associated with a particular stage may be affected by other aspects of the flood besides its level, such as its velocity or duration, adding additional uncertainty to damage estimates at any flood stage. This report compares different approaches to estimating flood risk; some simply ignore the uncertainties in stage and cost, some sample from probability distributions for certain of these estimates, and some model with event-based catastrophe models instead of continuous probability estimates.

This answer shows how errors in the estimates of y-axis values affect the estimates of integrals interpolated by the trapezoidal rule, in situations where the errors of the values at different event probabilities are uncorrelated. This simply follows the rules for variance of a weighted sum.
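Concretely, the composite trapezoidal estimate over points $p_0 < p_1 < \dots < p_n$ with costs $C_i$ is a weighted sum of the $C_i$, so with uncorrelated cost errors of variance $\sigma_i^2$:

$$\widehat{A} \;=\; \sum_{i=0}^{n-1} \frac{C_i + C_{i+1}}{2}\,(p_{i+1}-p_i) \;=\; \sum_{i=0}^{n} w_i\, C_i, \qquad \operatorname{Var}\big(\widehat{A}\big) \;=\; \sum_{i=0}^{n} w_i^2\, \sigma_i^2,$$

where $w_0 = (p_1 - p_0)/2$, $w_n = (p_n - p_{n-1})/2$, and $w_i = (p_{i+1} - p_{i-1})/2$ for the interior points.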

To get an idea of the relative contributions of interpolation and imprecise cost estimates to overall integration error, consider the integral under the curve between two event stages with cumulative probabilities (x-axis values) $p_0$ and $p_1$ ($p_0 < p_1$; $p_1-p_0=\Delta p$), having associated costs of $C_0$ and $C_1$ ($C_0 > C_1$).

Interpolation error. The trapezoidal rule gives an area of $(C_0 + C_1)\Delta p/2$. The upper limit of area given by the left Riemann sum is $C_0 \Delta p$, so the upper bound exceeds the trapezoidal estimate by $(C_0-C_1)\Delta p/2$.

Errors in $C_i$. If the variances of the $C_i$ values are $\sigma_i^2$ and the $C_i$ are uncorrelated, then the trapezoidal interpolation has an associated variance of $(\sigma_0^2 + \sigma_1^2)\Delta p^2/4$, or a standard error of $\sqrt{(\sigma_0^2 + \sigma_1^2)}\Delta p/2$.

So if the curve is relatively steep compared to the errors in the cost estimates, the interpolation error of the area will dominate. Areas over comparatively flatter parts of the curve, perhaps typical of higher-probability stages, may have integration errors dominated by the errors in the cost estimates. Similar application of the formula for the variance of a weighted sum provides the errors in the left and right Riemann sums that you propose as upper and lower estimates of the integration error per se.
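A small self-contained sketch (hypothetical costs and standard errors, not real data) makes that comparison explicit interval by interval, using the two single-interval formulas above:

    import numpy as np

    # Hypothetical exceedance probabilities, costs, and standard errors of the costs.
    p     = np.array([0.01, 0.05, 0.10, 0.20])
    C     = np.array([600e3, 250e3, 120e3, 40e3])
    sigma = np.array([150e3, 80e3, 40e3, 20e3])

    dp = np.diff(p)

    # Half-width of the interpolation bracket around each trapezoidal area: (C0 - C1) * dp / 2
    interp_half_width = (C[:-1] - C[1:]) * dp / 2

    # Standard error of each interval's area from uncorrelated cost errors:
    # sqrt(sigma0^2 + sigma1^2) * dp / 2
    cost_se = np.sqrt(sigma[:-1] ** 2 + sigma[1:] ** 2) * dp / 2

    for i, (hw, se) in enumerate(zip(interp_half_width, cost_se)):
        print(f"interval {i}: interpolation half-width {hw:,.0f}, cost-error SE {se:,.0f}")

Whichever term is larger in a given interval indicates whether tighter interpolation (more modeled stages) or better cost estimates would do more to reduce the overall uncertainty there.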

Other uncertainties in modeling

The full cost/benefit comparison involves more than estimating the annual benefit of installing the defense. This report shows how discount rates (to translate future gains into net present values) and expected project life are also incorporated into the calculation. There is the possibility that past history of flood probabilities does not adequately represent the future. Furthermore, future development (or abandonment) of structures in the flood plain will affect the value of installing the defense. Uncertainties in any of these estimates will add to the uncertainty in the full cost/benefit comparison.
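As a rough illustration of how an annual benefit feeds into such a comparison (the discount rate and project life below are assumed for the example, not taken from the cited report):

    # Present value of a constant annual benefit over an assumed project life.
    annual_benefit = 50e3    # e.g. the average annual damage avoided, from the integral above
    discount_rate  = 0.035   # assumed annual discount rate
    project_life   = 50      # assumed project life in years

    present_value = sum(annual_benefit / (1 + discount_rate) ** t
                        for t in range(1, project_life + 1))
    print(f"Present value of benefits: {present_value:,.0f}")

Any uncertainty in those assumptions propagates into the present value on top of the uncertainties discussed above.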
