I hope you can help me get some confidence in my confidence interval… I am trying to get a confidence interval for a particular (threshold) point on a predicted curve.
I find that the confidence interval (C.I.) looks funny, so I would like to know whether what I am doing is correct. Thank you for any advice you can give me!
#Here is the data to which I fit a nonlinear least squares regression (power function)
y <-c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626, 0.9278573, 0.6548638,
0.6995722, 0.8410211, 1.0000000, 0.8512973, 0.7917790, 0.5722589, 0.6918069, 0.7117084,
0.5279689, 0.6121315, 0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613, 0.3951223, 0.3595973,
0.4617549, 0.5106482, 0.5939970, 0.5340835, 0.6036920, 0.5181577, 0.3892170, 0.3667581)
x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629, 0.70833333,
0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000, 0.85000000, 0.52397260,
0.66870629, 0.50328947, 0.52192982, 0.40691489, 0.36016949, 0.37500000, 0.35915493,
0.53968254, 0.36334197, 0.23486842, 0.19615385, 0.35961538, 0.30039267, 0.26424870,
0.54838710, 0.23363874, 0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574,
0.31587838, 0.24401914, 0.28813559, 0.23181818, 0.29272959)
plot(x,y, xlim=c(0,1), ylim=c(0,1))
m1 <- nls(y~a*x^b, start=list(a=2,b=0.2))
summary(m1)
curve((0.86888*x^0.43042), add=TRUE)
points(0.2276312,0.4595148, col='red', pch=17) # this is the threshold value I
# previously calculated
Now I would like to know the confidence interval of the threshold:
#lower CI
0.86888-(2*0.04578) # intercept minus 2* S.E.
0.43042-(2*0.06333) # exponent minus 2*S.E.
curve((0.77732*x^0.30376), add=TRUE) # plot the lower C.I.
0.77732*0.2276312^0.30376 # Now I enter the x coordinate of threshold point in the
# Confidence interval function, to get the y coordinate
points(0.2276312,0.4958526, col='red', pch="-", cex=1.5) # plotting the CI of threshold value
#Upper CI
0.86888+(2*0.04578) # intercept plus 2* S.E.
0.43042+(2*0.06333) # exponent plus 2*S.E.
curve((0.96044*x^0.55708), add=TRUE) # plot upper C.I.
0.96044*0.2276312^0.55708 # Now I enter the x coordinate of threshold point in the
# Confidence interval function, to get the y coordinate
points(0.2276312,0.4211113, col='red', pch="-", cex=1.5) # plot the CI of threshold value
lines(c(0.2276312,0.2276312),c(0.4211113,0.4958526), col="red")
# then check whether correct
confint(m1) # more or less, but slightly different, don't understand why?
To me, a C.I. that looks like this would make more sense…
curve((0.96044*x^0.30376), col="green",add=TRUE)
curve((0.77732*x^0.55708), col="green",add=TRUE)
but that would mean mixing the intercept of the upper C.I. with the exponent of the lower C.I., and vice versa.
My question then is: are the black confidence intervals correct (and thus the threshold's red confidence interval too)?
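As for the discrepancy noted at confint(m1): confint applied to an nls fit profiles the likelihood, which generally gives asymmetric intervals, while estimate ± 2 × S.E. is the symmetric Wald approximation. A small sketch with simulated data (not the data above) shows the two side by side:

```r
# Sketch: why confint() differs from estimate +/- 2*SE on an nls fit.
# confint() profiles the likelihood (asymmetric intervals); +/- 2*SE is
# the symmetric Wald approximation. Simulated stand-in data.
set.seed(42)
x <- runif(40, 0.1, 1)
y <- 0.87 * x^0.43 + rnorm(40, sd = 0.08)
m <- nls(y ~ a * x^b, start = list(a = 1, b = 0.5))
est <- coef(m)
se <- summary(m)$coefficients[, "Std. Error"]
wald <- cbind(lower = est - 2 * se, upper = est + 2 * se)  # Wald interval
prof <- confint(m)   # profile-likelihood interval; need not be symmetric
print(wald)
print(prof)
```

The two sets of limits are close when the model is nearly linear in its parameters near the optimum, and drift apart as the likelihood surface becomes more curved.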
Best Answer
Analysis
According to comments on the question, the "threshold" is determined this way: the "lowest" and "highest" points on the fitted curve are taken at the extremes of the $x$ values, a secant line is drawn between them, and the threshold is the $x$ at which the curve's tangent is parallel to that secant.
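Under that definition the threshold has a closed form for the power curve, since $f'(t) = \hat{a}\hat{b}\,t^{\hat{b}-1}$; setting this equal to the secant slope $s$ and solving gives $t = (s/(\hat{a}\hat{b}))^{1/(\hat{b}-1)}$. A sketch using the fitted values reported in the question:

```r
# Threshold for the fitted power curve: the x at which the tangent is
# parallel to the secant through the points at the extreme x values.
a <- 0.86888; b <- 0.43042        # fitted values from the question
x.lo <- 0.09514925; x.hi <- 1.0   # smallest and largest x in the data
f <- function(x) a * x^b
s <- (f(x.hi) - f(x.lo)) / (x.hi - x.lo)  # slope of the secant
t.hat <- (s / (a * b))^(1 / (b - 1))      # solves a*b*t^(b-1) = s
t.hat  # about 0.42
```

This value of about $0.42$ is consistent with the center of the "Original X" bootstrap histogram discussed below.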
Proposed solution
Because the secant is determined by the extremes of $x$, how they are determined is critical.
I therefore suggest estimating the bias in the estimation of the threshold, $t$, as well as a confidence interval for it, using a semi-parametric bootstrap. In this method the model is:
1. The x values, $x_i$, are drawn independently and randomly from a distribution. (It looks like they tend to lie between $0$ and $1$.) This is the "parametric" part of the bootstrap. We will need to explore how the results depend on assumptions about this distribution.
2. The y values, $y_i = a x_i^b$, are computed using the fitted parameters $\hat{a}$ and $\hat{b}$ from the original data in place of the unknown values $a$ and $b$, respectively.
3. To these y values are added "errors" drawn independently and with replacement from the residuals of the original fit. This is the "nonparametric" part of the bootstrap: we have assumed that the vertical deviations of the original data from the correct curve are independent and identically distributed, and we have taken the residuals from the original fit to be a representative sample of that distribution.
4. The model is fit to the $(x_i, y_i)$ pairs and a threshold for them is computed as described previously.
The distribution of these thresholds is the bootstrap distribution. Any deviation between its central location and our original estimate of the threshold, $\hat{t}$, gives us an additive bias correction factor. The spread of this distribution lets us erect approximate confidence intervals around the (bias-corrected) estimate.
The bootstrap distribution is estimated by performing a large (but necessarily finite) number of simulations from the model. Usually a thousand or so iterations are enough to give good information.
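The procedure above might be sketched as follows. This is a minimal version, not the original code; it handles only the "Original X" scenario (the $x_i$ held fixed), uses simulated stand-in data, and assumes the threshold rule described earlier (tangent parallel to the secant through the extreme-$x$ points):

```r
# Minimal sketch of the semi-parametric bootstrap described above, for
# the "Original X" scenario (x_i held fixed). Substitute the real x, y.
set.seed(17)
x <- runif(40, 0.1, 1)
y <- 0.87 * x^0.43 + rnorm(40, sd = 0.08)

threshold <- function(a, b, x.lo, x.hi) {
  s <- (a * x.hi^b - a * x.lo^b) / (x.hi - x.lo)  # secant slope
  (s / (a * b))^(1 / (b - 1))                     # x where f'(x) = s
}

fit <- nls(y ~ a * x^b, start = list(a = 1, b = 0.5))
a.hat <- coef(fit)[["a"]]; b.hat <- coef(fit)[["b"]]
res <- residuals(fit)
t.hat <- threshold(a.hat, b.hat, min(x), max(x))

n.iter <- 1000
t.boot <- replicate(n.iter, {
  # Fitted curve plus errors resampled from the original residuals:
  y.star <- a.hat * x^b.hat + sample(res, replace = TRUE)
  f <- try(nls(y.star ~ a * x^b, start = list(a = a.hat, b = b.hat)),
           silent = TRUE)
  if (inherits(f, "try-error")) NA
  else threshold(coef(f)[["a"]], coef(f)[["b"]], min(x), max(x))
})

bias <- mean(t.boot, na.rm = TRUE) - t.hat             # additive bias estimate
ci <- quantile(t.boot, c(0.025, 0.975), na.rm = TRUE)  # approximate 95% CI
```

Extending this to the other scenarios amounts to redrawing x inside the replicate loop (e.g. from a uniform or Beta distribution) before computing y.star.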
Example scenarios
Here are results for three distributions of $X$: the original one (in which the $x_i$ never vary), a uniform distribution, and a Beta distribution that approximates the distribution of the original data. $10^4$ iterations were used.
Each plot marks the estimated threshold $\hat{t}$ with a vertical red line. Because that line is centered on the "Original X" histogram, we deduce there is little bias and can take the extreme values of this histogram as approximate confidence limits for $t$: roughly from $0.412$ to $0.433$ (95% confidence).
The second and third plots show how exquisitely sensitive $\hat{t}$ is to the distribution of the $x_i$ and how biased an estimator it might be: notice how much wider the range of thresholds is on their horizontal axes. In the second plot, where it is likely the smallest $x_i$ is close to $0$ and the largest $x_i$ might be substantially less than $1$, the reference slope ("secant") is relatively high, pushing the estimated threshold to the left where the curve has greater slopes. The bias is large, approximately $+0.12$, merely because the largest $x_i$ in the dataset is $1$ and the smallest, $0.095$, is fairly far from zero.

In the third plot the bias has disappeared (!) but the range of estimated thresholds is still wide. A symmetric 95% confidence interval for $t$ is $[0.33, 0.52]$ (based on $10^4$ iterations). I suspect the lack of bias may be a bit of luck, because the $x_i$ are centered fairly close to the original $\hat{t}$. This can be confirmed by drawing the $x_i$ from other distributions, such as a Beta(4,1) distribution (centered near $2/3$): the bias is now $-0.21$.

This is about as far as we can reasonably take the analysis without knowing more about how the $x_i$ are determined. The results are interesting insofar as they reveal how important it is to understand how the data arise: there is a huge range of answers available here, all of which are consistent with the numbers, but only some of which could be appropriate for what actually occurred during this study.
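One plausible way to obtain the "Beta distribution that approximates the distribution of the original data" is a method-of-moments fit; the parameters actually used are not shown here, so treat this as an assumption:

```r
# Method-of-moments fit of a Beta distribution to the x values from the
# question (one plausible choice for the approximating Beta; the exact
# parameters used in the plots are not shown).
x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629,
       0.70833333, 0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000,
       0.85000000, 0.52397260, 0.66870629, 0.50328947, 0.52192982, 0.40691489,
       0.36016949, 0.37500000, 0.35915493, 0.53968254, 0.36334197, 0.23486842,
       0.19615385, 0.35961538, 0.30039267, 0.26424870, 0.54838710, 0.23363874,
       0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 0.31587838,
       0.24401914, 0.28813559, 0.23181818, 0.29272959)
m <- mean(x); v <- var(x)
k <- m * (1 - m) / v - 1       # common factor in the moment equations
alpha <- m * k; beta <- (1 - m) * k
c(alpha = alpha, beta = beta)  # Beta(alpha, beta) matching mean and variance of x
# x.star <- rbeta(length(x), alpha, beta)  # draw bootstrap x values from it
```

Any other defensible fitting rule (e.g. maximum likelihood via fitdistr in MASS) would serve equally well for this exploratory purpose.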
Code
R code to produce these results follows. (Change n.iter to obtain other than $10^3$ iterations.)