Solved – Confidence interval for values for a fitted line

confidence intervaljmpr

I'm using JMP to analyze some sample data to make predictions about the population. My sample is from a destructive QC test, so I obviously want to minimize my sample. I have a response (my Y) and a known factor (a very strong and consistent correlation that is measurable by non-destructive means) but the exact relationship between them varies from lot to lot (the slope and y offset vary).

So, in JMP, I am fitting a line and then showing the "confidence limits for an individual predicted value" which I believe gives me an indicator of how the population is likely to behave. So I'm using that plot to make disposition decisions. I want to automate this process, perhaps using R, but I'm a total novice at R. I could do the math if I was just dealing with a mean and standard deviation, but I don't know how to do it with a fit line and a known factor. Can someone please give me either the general information on how to get the confidence limits around the line, or else tell me how to do the whole thing in R?

Thankss much.

Best Answer

If your using linear regression I would recommend using the rms package in R. It is very easy to use and has lots of nice features.

Here's an example:

# Load package (remember to install.packages("rms") or this will fail the first time)
library(rms)

# Get a dataset to experiment with
data(mtcars)
mtcars$am <- factor(mtcars$am, levels=0:1, labels=c("Automatic", "Manual"))

# The rms package needs this work properly
dd <- datadist(mtcars)
options(datadist="dd")

# Do the regression
f <- ols(mpg~wt, data=mtcars, x=T, y=T)

# Plot regular mean confidence interval
p <- Predict(f, wt=seq(2.5, 4, by=.001), conf.type="mean")
plot(p, ylim=c(10, 30), col="lightblue")

# Plot wide confidence interval
p <- Predict(f, wt=seq(2.5, 4, by=.001), conf.type="individual")
plot(p, ylim=c(10, 30), col="lightblue")

Gives this output:

Plain line

Now usually you want to test the linearity assumption:

# Try the model with a restricted cubic spline
f <- ols(mpg~rcs(wt, 3), data=mtcars, x=T, y=T)
anova(f)

Gives this output:

> anova(f)
                Analysis of Variance          Response: mpg 

 Factor     d.f. Partial SS MS         F     P     
 wt          2   922.04230  461.021149 65.54 <.0001
  Nonlinear  1    74.31705   74.317047 10.56 0.0029
 REGRESSION  2   922.04230  461.021149 65.54 <.0001
 ERROR      29   204.00489    7.034651

And if you plot the graphs with the same code as a bove you get this picture:

Regression with a spline

If you want to make your formula more complicated just add that variable:

f <- ols(mpg~rcs(wt, 3)+am, data=mtcars, x=T, y=T)
p <- Predict(f, wt=seq(2.5, 4, by=.001), am=levels(mtcars$am), conf.type="individual")
plot(p)

I don't know anything about JMP, it shouldn't be too difficult but I recommend learning R because it gives you an incredible freedom.

Hope this helped.

Related Solutions

Solved – Sample size for binomial confidence interval

(1) Yes.

(2) Yes. There are only $n+1$ possible outcomes for a binomial random variable, so it is possible to look at what happens for each possible outcome - in fact this is faster than simulating lots and lots of outcomes!

Let $X$ be the number of "successes" among the $n$ customers and let $\hat{p}=X/n$. The confidence interval is $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, so the halfwidth is $z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. Thus we want to compute $P(z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\leq 0.005)$. In R, we can do this as follows:

target.halfWidth<-0.005

p<-0.016 #true proportion
n.vec<-seq(from=1000, to=3000, by=100) #number of samples

# Vector to store results
prob.hw<-rep(NA,length(n.vec))

# Loop through desired sample size options
for (i in 1: length(n.vec))
{
n<-n.vec[i]

# Look at all possible outcomes
x<-0:n
p.est<-x/n

# Compute halfwidth for each option
halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)

# What is the probability that the halfwidth is less than 0.005?
prob.hw[i]<-sum({halfWidth<=target.halfWidth}*dbinom(x,n,p))
}

# Plot results
plot(n.vec,prob.hw,type="b")
abline(0.95,0,col=2)

# Get the minimal n required
n.vec[min(which(prob.hw>=0.95))]

The answer is $n=2200$ in this case as well.

Finally, it is usually a good idea to verify that the asymptotic normal approximation interval actually gives the desired coverage. In R, we can compute the coverage probability (i.e. the actual confidence level) as:

p<-0.016
n<-2200
x<-0:n
p.est<-x/n
halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)
# Coverage probability
sum({abs(p-p.est)<=halfWidth}*dbinom(x,n,p))

Different $p$ give different coverages. For $p$ around $0.015$, the actual confidence level of the nominal $90\%$ interval seems to be about $89\%$ in general, which I presume is fine for your purposes.

(3) When you sample from a finite population, the number of successes is not binomial but hypergeometric. If the population is large compared to your sample size, the binomial works just fine as an approximation. If you sample 1000 out of 5000, say, it does not. Have a look at confidence intervals for proportions based on the hypergeometric distribution!

Answers to additional questions:

Let $(p_L,p_U)$ be the confidence interval.

1) In that case you are no longer computing $P(p_L-p_U\leq0.01)$ but $$P\Big(p_L-p_U\leq0.01~\mbox{and}~p\in(p_L,p_U)\Big),$$ i.e. the probability that the length of intervals that actually contain $p$ is at most 0.01. This may be an interesting quantity, depending on what you're interested in...

2) Maybe, but probably not. If the population size is large compared to the sample size you don't need it, and if it's not then the binomial distribution is not appropriate to begin with!

3) Sprop seems to contain confidence intervals based on the hypergeometric intervals, so that should work just fine.

Solved – Confidence interval of RMSE

I might be able to give an answer to your question under certain conditions.

Let $x_{i}$ be your true value for the $i^{th}$ data point and $\hat{x}_{i}$ the estimated value. If we assume that the differences between the estimated and true values have

mean zero (i.e. the $\hat{x}_{i}$ are distributed around $x_{i}$)
follow a Normal distribution
and all have the same standard deviation $\sigma$

in short:

$$\hat{x}_{i}-x_{i} \sim \mathcal{N}\left(0,\sigma^{2}\right),$$

then you really want a confidence interval for $\sigma$.

If the above assumptions hold true $$\frac{n\mbox{RMSE}^{2}}{\sigma^{2}} = \frac{n\frac{1}{n}\sum_{i}\left(\hat{x_{i}}-x_{i}\right)^{2}}{\sigma^{2}}$$ follows a $\chi_{n}^{2}$ distribution with $n$ (not $n-1$) degrees of freedom. This means

\begin{align} P\left(\chi_{\frac{\alpha}{2},n}^{2}\le\frac{n\mbox{RMSE}^{2}}{\sigma^{2}}\le\chi_{1-\frac{\alpha}{2},n}^{2}\right) = 1-\alpha\\ \Leftrightarrow P\left(\frac{n\mbox{RMSE}^{2}}{\chi_{1-\frac{\alpha}{2},n}^{2}}\le\sigma^{2}\le\frac{n\mbox{RMSE}^{2}}{\chi_{\frac{\alpha}{2},n}^{2}}\right) = 1-\alpha\\ \Leftrightarrow P\left(\sqrt{\frac{n}{\chi_{1-\frac{\alpha}{2},n}^{2}}}\mbox{RMSE}\le\sigma\le\sqrt{\frac{n}{\chi_{\frac{\alpha}{2},n}^{2}}}\mbox{RMSE}\right) = 1-\alpha. \end{align}

Therefore, $$\left[\sqrt{\frac{n}{\chi_{1-\frac{\alpha}{2},n}^{2}}}\mbox{RMSE},\sqrt{\frac{n}{\chi_{\frac{\alpha}{2},n}^{2}}}\mbox{RMSE}\right]$$ is your confidence interval.

Here is a python program that simulates your situation

from scipy import stats
from numpy import *
s = 3
n=10
c1,c2 = stats.chi2.ppf([0.025,1-0.025],n)
y = zeros(50000)
for i in range(len(y)):
    y[i] =sqrt( mean((random.randn(n)*s)**2))

print "1-alpha=%.2f" % (mean( (sqrt(n/c2)*y < s) & (sqrt(n/c1)*y > s)),)

Hope that helps.

If you are not sure whether the assumptions apply or if you want to compare what I wrote to a different method, you could always try bootstrapping.

Best Answer

Related Solutions

Solved – Sample size for binomial confidence interval

Solved – Confidence interval of RMSE

Related Question