My problem (question at the end) is to calculate confidence interval (CI) (NOT prediction interval) of the response of a nonlinear model.
I am working with R but this question is not R-specific.
I want to model some data after the following equation (model):
Y ~ a * X^b/b
First, I estimate the parameters a and b through nonlinear regression (using R's "nls()"), which yields estimates and error on the corresponding estimate.
Nonlinear regression model
model: Y ~ (A * X^B/B)
data: data.frame(X = X, Y = Y)
A B
7.4154 0.6041
residual sum-of-squares: 88983
Then I calculate 95% CI for a and b (using confint(nlm <- nls(Y ~ A * X^B/B, start=list(A=1,B=1)))
> confint(nlm)
Waiting for profiling to be done...
2.5% 97.5%
A 1.21719414 11.549562
B 0.08583486 1.482389
In order to calculate 95%CI for Y, given some fixed, certain value of X, my first idea was to propagate uncertainties on a and b to Y through the model equation. This yields some value for 95% CI of Y, given X.
I then came accross the "propagate" package that proposes to calculate 95%CI of Y, given X, "based on asymptotic normality" (citation from "http://127.0.0.1:22638/library/propagate/html/predictNLS.html"). However this method yields a VERY different 95%CI.
My question is: Why aren't these two CI equal ?
A worked example (with some random equation that just crossed my mind):
Values needed for error propagation : A, CI(A), B, CI(B), X, CI(X) :
Parameters (A & B)' estimates and 95%CI were calculated from
confint(nlm <- nls(Y ~ A * X^B/B, start=list(A=1,B=1))
X was then fixed at 30 for the sake of the argument, and considered error-free.
A B X
value 7.415380 0.6041404 30
95% CI 5.166184 0.6982769 0
The general formula for uncertainties propagation (works for sd, se, ci95%) is :
Y=f(Ai | i = 1 to n)
=> delta(Y) = sqrt( sum( ( dY/dAi * delta(Ai) )^2 ) )
The equation being Y = A * X^B/B
Partial derivatives are then:
dF/dA = X^B/B
dF/dB = A * (X^B * log(X))/B - A * X^B/B^2
dF/dX = A * (X^(B - 1) * B)/B
Then
dF/dA = 12.9196498927581
dF/dB = 167.269472901412
dF/dX = 1.92930443474376
This yields
Y = 95.8041099173585 +- 134.526084150286
However, when using the predictNLS() function from "propagate" R package:
predictNLS(nlm, newdata=data.frame(X=30), interval = "confidence")$summary
Propagating predictor value #1 ...
Prop.Mean.1 Prop.Mean.2 Prop.sd.1 Prop.sd.2 Prop.2.5%
95.80411 102.8339 20.89399 24.86949 51.89104
Prop.97.5% Sim.Mean Sim.sd Sim.Median Sim.MAD Sim.2.5%
153.7767 93.5643 1712.894 97.85209 21.98703 -117.3541
Sim.97.5%
210.3916
Which yields
Y = 95.80411 +- (153.7767-51.89104)/2
=> Y = 95.80411 +- 50.94283
Obviously I must have missed / misunderstood some essential information about CI of response variable, because I believe the person who coded the predictNLS() function must be way more knowledgeable than me about it.
Thanks in advance for your explanations.
Best Answer
I haven't investigated why the two CIs are unequal because I think both are wrong. The common problem is that the estimated parameters are likely correlated, perhaps heavily so, but neither procedure appears to account for that. (In the realistic examples shown below, the correlation ranges from -99.3% to -99.8%.)
To find confidence bands, start with the model in the alternative form
$$y = \exp(\alpha + \beta \log(x)) + \varepsilon.$$
The error term is $\varepsilon$ and $(\alpha,\beta)$ is the model parameter to be estimated. As in the question I will use nonlinear least squares to fit the model. (Taking logarithms of both sides will be vain, because they cannot simplify the right hand side due to the additive error term. Indeed, as the examples below indicate, it is possible--and perfectly OK--for observed values of $y$ to be negative.)
One output of the fitting procedure will be the variance-covariance matrix of the parameter estimates,
$$\mathbb{V} = \pmatrix{\sigma^2_\alpha & \rho \sigma_\alpha\sigma_\beta \\ \rho \sigma_\alpha\sigma_\beta & \sigma^2_\beta}.$$
The square roots of the diagonal elements, $\sigma_\alpha$ and $\sigma_\beta,$ are the standard errors of the estimated parameters $\hat\alpha$ and $\hat\beta,$ respectively. $\rho$ estimates the correlation coefficient of those estimates.
The predicted value at any (positive) number $x_0$ is
$$\hat{y}(x_0) = \exp(\hat\alpha + \hat\beta\log(x_0)) = e^\hat\alpha x_0^\hat\beta = \frac{\hat A}{\hat\beta} x_0^\hat\beta$$
where $\hat A=\hat\beta e^\hat\alpha,$ showing this really is the intended model as formulated in the question. Taking logarithms (which now is possible because there are no error terms in the equation) gives
$$\log(\hat y(x_0)) = \hat\alpha + \hat\beta \log(x_0) = (1, \log(x_0))\ (\hat\alpha,\hat\beta)^\prime.$$
Thus the standard error of the logarithm of the estimated response is
$$SE(x_0) = SE(\log(\hat y(x_0))) = \sqrt{(1, \log(x_0))\mathbb{V}(1, \log(x_0))^\prime}.$$
Being able to formulate the SE in this way was the whole point of the initial reparameterization of the model: the parameters $\alpha$ and $\beta$ (or, rather, their estimates) enter linearly into the calculation of the standard error of $\hat y.$ There is no need to expand Taylor series or "propagate error."
To construct a $100(1-a)\%$ confidence interval for $\log(y(x_0)),$ do as usual and set the endpoints at $$\log(\hat y(x_0)) \pm Z_{a/2}\, SE(x_0)$$ where $\Phi(Z_{a/2}) = a/2$ is the $a/2$ percentage point of the standard Normal distribution. Because this procedure aims to enclose the true response with $100(1-a)\%$ probability, this coverage property is preserved upon taking antilogarithms. That is,
By doing this for a sequence of values of $x_0$ you can construct confidence bands for the regression. If all is well (that is, all model assumptions are accurate and there are enough data to assure the sampling distribution of $(\hat\alpha,\hat\beta)$ is approximately Normal), you can hope that $100(1-a)\%$ of these bands envelop the graph of the true response function $x\to \exp(\alpha + \beta\log(x)).$
To illustrate, I generated $20$ datasets having properties similar to those in the problem: the true $\alpha$ and $\beta$ are close to the estimates reported in the question and the error variance $\operatorname{Var}(\varepsilon)$ was set to make the sum of squares of residuals close to that reported in the question (near 90,000). I used the foregoing technique to fit this model to each dataset and then for each one plotted (a) the data, (b) the $90\%$ confidence band and, for reference, (c) the graph of the true response function. The latter is colored red wherever it lies beyond the confidence band. The test of this approach is that about $10\%,$ or two of the $20,$ panels ought to show red portions: and that's exactly what happened (in iterations 10 and 16).
For details, consult this
R
code that generated and plotted the simulations.