Solved – How to calculate confidence intervals for power function ($y=ax^b$) in R using NLS

Tags: bootstrap, confidence-interval, nls, power-law, r

I hope you can help me get some confidence in my confidence interval… I am trying to get the confidence interval for a particular (threshold) point on a predicted curve.

I find the confidence interval (C.I.) looks funny, so I would like to know whether what I am doing is correct… Thank you for any advice you can give me!

#Here is the data to which I fit a nonlinear least squares regression (power function)
y <-c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626, 0.9278573, 0.6548638,
0.6995722, 0.8410211, 1.0000000, 0.8512973, 0.7917790, 0.5722589, 0.6918069, 0.7117084,
0.5279689, 0.6121315, 0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613, 0.3951223, 0.3595973,
0.4617549, 0.5106482, 0.5939970, 0.5340835, 0.6036920, 0.5181577, 0.3892170, 0.3667581)

x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629, 0.70833333,
0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000, 0.85000000, 0.52397260,
0.66870629, 0.50328947, 0.52192982, 0.40691489, 0.36016949, 0.37500000, 0.35915493, 
0.53968254, 0.36334197, 0.23486842, 0.19615385, 0.35961538, 0.30039267, 0.26424870,
0.54838710, 0.23363874, 0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 
0.31587838, 0.24401914, 0.28813559, 0.23181818, 0.29272959)

plot(x,y, xlim=c(0,1), ylim=c(0,1))
m1 <- nls(y~a*x^b, start=list(a=2,b=0.2))
summary(m1)
curve((0.86888*x^0.43042), add=TRUE)
points(0.2276312,0.4595148, col='red', pch=17) # this is the threshold value I  
                                               # previously calculated

Now I would like to know the confidence interval of the threshold:

#lower CI
0.86888-(2*0.04578) # coefficient a minus 2*S.E.
0.43042-(2*0.06333) # exponent b minus 2*S.E.
curve((0.77732*x^0.30376), add=TRUE) # plot the lower C.I.
0.77732*0.2276312^0.30376 # evaluate the lower-C.I. curve at the threshold's
                          # x coordinate to get the y coordinate
points(0.2276312,0.4958526, col='red', pch="-", cex=1.5) # plotting the CI of threshold value

#Upper CI
0.86888+(2*0.04578) # coefficient a plus 2*S.E.
0.43042+(2*0.06333) # exponent b plus 2*S.E.
curve((0.96044*x^0.55708), add=TRUE) # plot upper C.I.
0.96044*0.2276312^0.55708 # evaluate the upper-C.I. curve at the threshold's
                          # x coordinate to get the y coordinate
points(0.2276312,0.4211113, col='red', pch="-", cex=1.5) # plot the CI of threshold value
lines(c(0.2276312,0.2276312),c(0.4211113,0.4958526), col="red")

# then check whether correct
confint(m1) # more or less, but slightly different, don't understand why?
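(For reference: `confint` for an `nls` fit profiles the likelihood, whereas adding $\pm 2\,$S.E. to each parameter separately is a Wald-type calculation that also ignores the correlation between $\hat{a}$ and $\hat{b}$, which is why the numbers differ. A hedged sketch of a delta-method interval for the fitted value at the threshold's $x$ coordinate, which does use the full parameter covariance, might look like this:)

```r
# Delta-method interval for the fitted value a*x0^b at x0, using the full
# covariance of (a, b) from the nls fit rather than per-parameter +/- 2*SE.
# (x and y are the data given above, repeated here for completeness.)
y <- c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626,
       0.9278573, 0.6548638, 0.6995722, 0.8410211, 1.0000000, 0.8512973,
       0.7917790, 0.5722589, 0.6918069, 0.7117084, 0.5279689, 0.6121315,
       0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
       0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613,
       0.3951223, 0.3595973, 0.4617549, 0.5106482, 0.5939970, 0.5340835,
       0.6036920, 0.5181577, 0.3892170, 0.3667581)
x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629,
       0.70833333, 0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000,
       0.85000000, 0.52397260, 0.66870629, 0.50328947, 0.52192982, 0.40691489,
       0.36016949, 0.37500000, 0.35915493, 0.53968254, 0.36334197, 0.23486842,
       0.19615385, 0.35961538, 0.30039267, 0.26424870, 0.54838710, 0.23363874,
       0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 0.31587838,
       0.24401914, 0.28813559, 0.23181818, 0.29272959)
m1 <- nls(y ~ a * x^b, start = list(a = 2, b = 0.2))
x0 <- 0.2276312                        # the threshold's x coordinate
a.hat <- coef(m1)[["a"]]
b.hat <- coef(m1)[["b"]]
est <- a.hat * x0^b.hat                # fitted value at x0
g <- c(x0^b.hat, est * log(x0))        # gradient of a*x0^b wrt (a, b)
se <- drop(sqrt(t(g) %*% vcov(m1) %*% g))  # delta-method standard error
c(lower = est - 2 * se, upper = est + 2 * se)
```

Note this is an interval for the curve's height at a fixed $x_0$; it does not propagate the uncertainty in the threshold's location itself, which is what the answer below tackles.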

[Plot: the fitted curve with the black lower/upper C.I. curves and the red interval at the threshold]

To me, a C.I. that looks like this would make more sense…

curve((0.96044*x^0.30376), col="green",add=TRUE)
curve((0.77732*x^0.55708), col="green",add=TRUE)

[Plot: the alternative (green) C.I. curves]

but that would mean mixing the coefficient of the upper C.I. with the exponent of the lower C.I., and vice versa.

My question is then: are the black confidence intervals correct (and thus the threshold's red confidence interval)?

Best Answer

Analysis

According to comments to the question, the "threshold" is determined this way:

I calculate where the derivative of the curve equals the derivative of the secant, which is the straight line formed by joining the lowest and highest points on the curve.

The "lowest" and "highest" points are at the extremes of the $x$ values and are derived from the data.
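For a power curve this threshold has a closed form: setting the derivative $f'(t) = ab\,t^{b-1}$ equal to the secant slope $s$ gives $t = (s/(ab))^{1/(b-1)}$. A quick sketch, plugging in the fitted values reported in the question:

```r
# Closed-form threshold for y = a*x^b: solve a*b*t^(b-1) = s for t,
# where s is the slope of the secant through the curve's endpoints.
a <- 0.86888; b <- 0.43042          # fitted values from summary(m1)
x.min <- 0.09514925; x.max <- 1.0   # extremes of the observed x
f <- function(x) a * x^b
s <- (f(x.max) - f(x.min)) / (x.max - x.min)  # secant slope
t.hat <- (s / (a * b))^(1 / (b - 1))
t.hat  # about 0.42
```

This value falls inside the "Original X" bootstrap interval reported below, as it should.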

Proposed solution

Because the secant is determined by the extremes of $x$, how they are determined is critical.

I therefore suggest estimating the bias in the estimation of the threshold, $t$, as well as a confidence interval for it, using a semi-parametric bootstrap. In this method the model is:

  • The x values, $x_i$, are drawn independently and randomly from a distribution. (It looks like they tend to lie between $0$ and $1$.) This is the "parametric" part of the bootstrap. We will need to explore how the results depend on assumptions about this distribution.

  • The y values, $y_i = a x_i^b$, are computed using the fitted parameters $\hat{a}$ and $\hat{b}$ from the original data in place of the unknown values $a$ and $b$, respectively.

  • To these y values are added "errors" drawn independently and with replacement from the residuals of the original fit. This is the "nonparametric" part of the bootstrap: we have assumed that the vertical deviations of the original data from the correct curve are independent and identically distributed and we have taken the residuals from the original fit to be a representative sample of that distribution.

  • The model is fit to the $(x_i, y_i)$ pairs and a threshold for them is computed as described previously.

The distribution of these thresholds is the bootstrap distribution. Any deviation between its central location and our original estimate of the threshold, $\hat{t}$, gives us an additive bias correction factor. The spread of this distribution lets us erect approximate confidence intervals around the (bias-corrected) estimate.
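In code, given the vector of bootstrap thresholds and the original estimate, the bias correction and percentile interval amount to the following (a synthetic normal sample stands in here for the actual `sim()` output, purely for illustration):

```r
# Bias correction and approximate 95% CI from a bootstrap distribution.
set.seed(17)
t.hat <- 0.42                                     # original threshold estimate
threshold <- rnorm(1000, mean = 0.45, sd = 0.03)  # stand-in for sim() output
bias <- mean(threshold) - t.hat                   # additive bias estimate
t.corrected <- t.hat - bias                       # bias-corrected estimate
ci <- quantile(threshold, c(0.025, 0.975)) - bias # interval, shifted by -bias
```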

The bootstrap distribution is estimated by performing a large (but necessarily finite) number of simulations from the model. Usually a thousand or so iterations are enough to give good information.

Example scenarios

Here are results for three distributions of $X$: the original one (in which the $x_i$ never vary), a uniform distribution, and a Beta distribution that approximates the distribution of the original data. $10^4$ iterations were used.

[Figure: bootstrap histograms of the threshold for the three scenarios]

Each plot marks the estimated threshold $\hat{t}$ with a vertical red line. Because that line is centered on the "Original X" histogram, we deduce there is little bias and can take the extreme values of this histogram as approximate confidence limits for $t$: roughly from $0.412$ to $0.433$ (95% confidence).

The second and third plots show how exquisitely sensitive $\hat{t}$ is to the distribution of the $x_i$, and how biased an estimator it might be: notice how much wider the range of thresholds is on their horizontal axes. In the second plot, where the smallest $x_i$ is likely close to $0$ and the largest $x_i$ might be substantially less than $1$, the reference slope ("secant") is relatively high, pushing the estimated threshold to the left, where the curve is steeper. The bias is large, approximately $+0.12$, merely because the largest $x_i$ in the dataset is $1$ and the smallest, $0.095$, is fairly far from zero.

In the third plot the bias has disappeared (!) but the range of estimated thresholds is still wide. A symmetric 95% confidence interval for $t$ is $[0.33, 0.52]$ (based on $10^4$ iterations). I suspect the lack of bias may be a bit of luck, because the $x_i$ are centered fairly close to the original $\hat{t}$. This can be confirmed by drawing the $x_i$ from other distributions, such as a Beta(4,1) distribution (with mean $0.8$): the bias is now $-0.21$.

This is about as far as we can reasonably take the analysis without knowing more about how the $x_i$ are determined. The results are interesting insofar as they reveal how important it is to understand how the data arise: there is a huge range of answers available here, all of which are consistent with the numbers, but only some of which could be appropriate for what actually occurred during this study.


Code

R code to produce these results follows. (The figures above used $10^4$ iterations; increase `n.iter` from its default of $10^3$ to reproduce them.)

y <-c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626, 0.9278573, 0.6548638,
      0.6995722, 0.8410211, 1.0000000, 0.8512973, 0.7917790, 0.5722589, 0.6918069, 0.7117084,
      0.5279689, 0.6121315, 0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
      0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613, 0.3951223, 0.3595973,
      0.4617549, 0.5106482, 0.5939970, 0.5340835, 0.6036920, 0.5181577, 0.3892170, 0.3667581)

x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629, 0.70833333,
       0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000, 0.85000000, 0.52397260,
       0.66870629, 0.50328947, 0.52192982, 0.40691489, 0.36016949, 0.37500000, 0.35915493, 
       0.53968254, 0.36334197, 0.23486842, 0.19615385, 0.35961538, 0.30039267, 0.26424870,
       0.54838710, 0.23363874, 0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 
       0.31587838, 0.24401914, 0.28813559, 0.23181818, 0.29272959)
#
# Fit a curve y = a*x^b to the data using nonlinear least squares,
# compute the "threshold," and return its value.
# (Called by `sim`.)
#
fit <- function(x, y, a=1, b=0.5) {
  m <- nls(y ~ a * x^b, start=list(a=a, b=b))
  coeff <- m$m$getPars()
  f <- function(x) coeff[1] * x^coeff[2]
  fp <- function(x) coeff[1] * coeff[2] * x^(coeff[2]-1)
  x.min <- min(x)
  x.max <- max(x)
  secant <- (f(x.max) - f(x.min)) / (x.max - x.min)
  threshold <- uniroot(function(x) fp(x) - secant, c(0, 10^6))$root
  return(threshold)
}
#
# Perform a bootstrap simulation with the specified model:
# `n` is the number of iterations
# `f` is the original estimated curve (as a vectorized function)
# `e` is the array of residuals to the initial fit (it defines an empirical
#     distribution of residuals)
# `alpha` and `beta` are parameters for a Beta distribution of the x[i]
# `x` is the array of x[i] to use when not simulating the x[i].
#
sim <- function(n = 1000, f, e, alpha=1, beta=1, x=NULL) {
  replicate(n, {
    if (is.null(x)) {
      x.0 <- rbeta(length(e), alpha, beta)     
    } else {
      x.0 <- x
    }
    y.0 <- f(x.0) + sample(e, length(e), replace=TRUE)
    fit(x.0, y.0)
  })
}
#
# Obtain the initial fit.
#
m1 <- nls(y ~ a * x^b, start=list(a=1, b=0.5))
#
# Compute the residuals `e` and the curve function `f`.
#
e <- residuals(m1)
coeff <- m1$m$getPars()
f <- function(x) coeff[1] * x^coeff[2]
#
# Prepare to bootstrap and plot several scenarios.
# Each scenario should take 2-3 seconds for n.iter=1000.
#
par(mfrow=c(1,3))
n.iter <- 10^3
b <- seq(0.15, 0.65, 0.025)
set.seed(17)
#
# Scenario 1: keep the x values fixed.
#
threshold <- sim(n.iter, f, e, x=x)
hist(threshold, main="Original X", freq=FALSE)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025, 0.975)) # A symmetric CI
#
# Scenario 2: vary the x values uniformly in [0,1].
#
threshold <- sim(n.iter, f, e, alpha=1, beta=1)
hist(threshold, main="X ~ Uniform(0,1)", freq=FALSE, breaks=b)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025,0.5, 0.975))
#
# Scenario 3: vary the x values according to Beta(4,4).
#
threshold <- sim(n.iter, f, e, alpha=4, beta=4)
hist(threshold, main="X ~ Beta(4,4)", freq=FALSE, breaks=b)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025, 0.975))