Solved – Using Bootstrap to estimate confidence interval of the standard deviation

bootstrap, mean, standard deviation

I am trying to compare two different methods of estimating confidence intervals: a parametric approach, which assumes normally distributed data so that the standardized sample mean follows a t-distribution (i.e. the formulas given here: Wikipedia: Normal Distribution), and bootstrapping.

The procedure is rather simple: For every sample size $5 < N <200$, I generate a sample with $N$ normally distributed random numbers.
I then calculate the confidence interval using the parametric formulas for every such sample.
Then, I do the same with bootstrap: For each sample, I draw 1000 sub-samples with replacement, calculate their mean and standard deviation, sort these values, and cut off the top and bottom 2.5%, which should give me the 95%-confidence interval ($\alpha=5\%$).
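The bootstrap step described above can be sketched roughly as follows, in Python since that is the language the question mentions. This is not the asker's code: the function name `percentile_bootstrap_ci` is hypothetical, each resample is assumed to have the same size as the original sample, and only the standard library is used.

```python
import random
import statistics


def percentile_bootstrap_ci(sample, stat, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for `stat` (hypothetical helper, illustration only)."""
    rng = rng or random.Random()
    n = len(sample)
    # Draw n_boot resamples of size n with replacement and evaluate the statistic.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    # Cut off the lowest and highest alpha/2 share of the sorted estimates.
    k = round(alpha / 2 * n_boot)
    return estimates[k], estimates[n_boot - k - 1]


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]  # N = 100 standard normal draws
print("mean CI:", percentile_bootstrap_ci(data, statistics.fmean, rng=rng))
print("sd CI:  ", percentile_bootstrap_ci(data, statistics.stdev, rng=rng))
```

With 1000 resamples and $\alpha=5\%$, this drops the 25 smallest and 25 largest estimates and reports the remaining extremes as the interval.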

For each sample size, I then plot the width of the confidence interval, both for the one that I got from the parametric approach and the bootstrap. I also calculate their difference as: [(Width of Parametric CI) – (Width of Bootstrap CI)] / (Width of Parametric CI) * 100, to get a percentage.
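As a small illustration of the percentage formula, here is a sketch in Python. The function name and the two example intervals are made up for demonstration and are not results from the question.

```python
def relative_width_difference(parametric_ci, bootstrap_ci):
    """[(parametric width) - (bootstrap width)] / (parametric width) * 100."""
    width_param = parametric_ci[1] - parametric_ci[0]
    width_boot = bootstrap_ci[1] - bootstrap_ci[0]
    return (width_param - width_boot) / width_param * 100


# Hypothetical intervals: widths 10 and 8, so the bootstrap CI is 20% narrower.
print(relative_width_difference((0.0, 10.0), (1.0, 9.0)))
```

A positive value means the bootstrap interval is narrower than the parametric one; a negative value means it is wider.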

I would expect that for increasing sample size, the difference between the two methods vanishes. As shown in the graphic below, this is indeed the case for the mean of the sample.
However, for the standard deviation, the parametric model and the bootstrap do not really seem to converge to the same value, or at least they converge much more slowly than for the mean.

I wonder: is there a reason why it works for the mean, but not for the standard deviation? Or did I simply make a mistake when implementing all this (if desired, I can post the Python source code that was used to generate the graphic)?

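[Figure: confidence-interval width versus sample size for the parametric and bootstrap methods, together with their percentage difference, for the mean and for the standard deviation]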

Thanks in advance for all your ideas and suggestions 🙂

Best Answer

I would say that since the sample mean is $\overline{X}=\frac{\sum{X_i}}{n}$, its convergence rate is $1/n$, which is also the convergence rate of the sample variance. But the convergence rate of the sample standard deviation is $1/\sqrt{n}$.
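A statement about rates like this can be probed empirically. The following Python sketch (standard library only; all function names are hypothetical) compares the width of a percentile bootstrap interval at sample sizes $n$ and $4n$. If a width shrinks like $1/\sqrt{n}$, the ratio should come out near 0.5; if it shrinks like $1/n$, near 0.25. It uses a single seed and normal data, so treat it as a rough check rather than a proof.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def bootstrap_ci_width(sample, stat, n_boot=1000, alpha=0.05, rng=None):
    """Width of the percentile bootstrap CI for `stat`."""
    rng = rng or random.Random()
    n = len(sample)
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    k = round(alpha / 2 * n_boot)
    return estimates[n_boot - k - 1] - estimates[k]


def width_ratio(stat, n_small=100, factor=4, seed=0):
    """CI width at sample size factor * n_small divided by the width at n_small."""
    rng = random.Random(seed)
    small = [rng.gauss(0.0, 1.0) for _ in range(n_small)]
    large = [rng.gauss(0.0, 1.0) for _ in range(n_small * factor)]
    return (bootstrap_ci_width(large, stat, rng=rng)
            / bootstrap_ci_width(small, stat, rng=rng))


for name, stat in [("mean", mean), ("sd", sd)]:
    print(name, round(width_ratio(stat), 3))
```

Averaging the ratio over several seeds would give a steadier picture than one run.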

Related Solutions

Solved – Coverage probabilities of the basic bootstrap confidence Interval

The terminology is probably not used consistently, so the following is only how I understand the original question. From my understanding, the normal CIs you computed are not what was asked for. Each set of bootstrap replicates gives you one confidence interval, not many. The way to compute different CI-types from the results of a set of bootstrap replicates is as follows:

B    <- 999                  # number of replicates
muH0 <- 100                  # for generating data: true mean
sdH0 <- 40                   # for generating data: true sd
N    <- 200                  # sample size
DV   <- rnorm(N, muH0, sdH0) # simulated data: original sample

Since I want to compare the calculations against the results from package boot, I first define a function that will be called for each replicate. Its arguments are the original sample, and an index vector specifying the cases for a single replicate. It returns $M^{\star}$, the plug-in estimate for $\mu$, as well as $S_{M}^{2\star}$, the plug-in estimate for the variance of the mean $\sigma_{M}^{2}$. The latter will be required only for the bootstrap $t$-CI.

> getM <- function(orgDV, idx) {
+     bsM   <- mean(orgDV[idx])                       # M*
+     bsS2M <- (((N-1) / N) * var(orgDV[idx])) / N    # S^2*(M)
+     c(bsM, bsS2M)
+ }

> library(boot)                                       # for boot(), boot.ci()
> bOut <- boot(DV, statistic=getM, R=B)
> boot.ci(bOut, conf=0.95, type=c("basic", "perc", "norm", "stud"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL : 
boot.ci(boot.out = bOut, conf = 0.95, type = c("basic", "perc", "norm", "stud"))

Intervals : 
Level      Normal            Basic         Studentized        Percentile    
95%   ( 95.6, 106.0 )   ( 95.7, 106.2 )  ( 95.4, 106.2 )   ( 95.4, 106.0 )  
Calculations and Intervals on Original Scale

Without using package boot you can simply use replicate() to get a set of bootstrap replicates.

boots <- t(replicate(B, getM(DV, sample(seq_along(DV), replace=TRUE))))

But let's stick with the results from boot.ci() to have a reference.

boots   <- bOut$t                     # estimates from all replicates
M       <- mean(DV)                   # M from original sample
S2M     <- (((N-1)/N) * var(DV)) / N  # S^2(M) from original sample
Mstar   <- boots[ , 1]                # M* for each replicate
S2Mstar <- boots[ , 2]                # S^2*(M) for each replicate
biasM   <- mean(Mstar) - M            # bias of estimator M

The basic, percentile, and $t$-CI rely on the empirical distribution of bootstrap estimates. To get the $\alpha/2$ and $1 - \alpha/2$ quantiles, we find the corresponding indices into the sorted vector of bootstrap estimates (note that boot.ci() will do a more complicated interpolation to find the empirical quantiles when the indices are not natural numbers).

> (idx <- trunc((B + 1) * c(0.05/2, 1 - 0.05/2))) # indices for sorted vector of estimates
[1] 25 975

> (ciBasic <- 2*M - sort(Mstar)[idx])          # basic CI, printed as (upper, lower)
[1] 106.21826  95.65911

> (ciPerc <- sort(Mstar)[idx])                 # percentile CI
[1] 95.42188 105.98103

For the $t$-CI, we need the bootstrap $t^{\star}$ estimates to calculate the critical $t$-values. For the standard normal CI, the critical value will just be the $z$-value from the standard normal distribution.

# standard normal CI with bias correction
> zCrit   <- qnorm(c(0.025, 0.975))   # z-quantiles from std-normal distribution
> (ciNorm <- M - biasM + zCrit * sqrt(var(Mstar)))
[1] 95.5566 106.0043

> tStar <- (Mstar-M) / sqrt(S2Mstar)  # t*
> tCrit <- sort(tStar)[idx]           # t-quantiles from empirical t* distribution
> (ciT  <- M - tCrit * sqrt(S2M))     # studentized t-CI, printed as (upper, lower)
[1] 106.20690  95.44878

In order to estimate the coverage probabilities of these CI-types, you will have to run this simulation many times. Just wrap the code into a function, return a list with the CI-results, and run it with replicate() as demonstrated in this gist.
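The gist itself is not reproduced here. As a rough idea of what such a coverage simulation involves, here is a sketch in Python (the language the original question mentions) rather than R, using only the standard library. The function names, the smaller sample size, and the replicate counts are assumptions chosen to keep the run short; only the basic CI for the mean is covered.

```python
import random


def basic_bootstrap_ci(sample, n_boot=199, alpha=0.05, rng=None):
    """Basic bootstrap CI for the mean: 2*M minus the upper/lower quantile of M*."""
    rng = rng or random.Random()
    n = len(sample)
    m = sum(sample) / n
    m_star = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_boot))
    # Same index rule as idx above, shifted to 0-based indexing.
    lo_i = round((n_boot + 1) * alpha / 2) - 1
    hi_i = round((n_boot + 1) * (1 - alpha / 2)) - 1
    return 2 * m - m_star[hi_i], 2 * m - m_star[lo_i]


def estimate_coverage(true_mean=100.0, true_sd=40.0, n=30, n_sim=200, seed=1):
    """Share of simulated samples whose basic CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lo, hi = basic_bootstrap_ci(sample, rng=rng)
        hits += lo <= true_mean <= hi
    return hits / n_sim


print(estimate_coverage())
```

The returned proportion is itself only an estimate; more simulated samples (`n_sim`) and more replicates per sample (`n_boot`) make it more stable at the cost of run time.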

Solved – How to calculate the confidence interval of a mean in a non-normally distributed sample

First of all, I would check whether the mean is an appropriate index for the task at hand. If you are looking for "a typical or central value" of a skewed distribution, the mean might point you to a rather non-representative value. Consider the log-normal distribution:

x <- rlnorm(1000)
plot(density(x), xlim=c(0, 10))
abline(v=mean(x), col="red")
abline(v=mean(x, tr=.20), col="darkgreen")
abline(v=median(x), col="blue")

Mean (red), 20% trimmed mean (green), and median (blue) for the log-normal distribution

The mean (red line) is rather far away from the bulk of the data. 20% trimmed mean (green) and median (blue) are closer to the "typical" value.

The results depend on the type of your "non-normal" distribution (a histogram of your actual data would be helpful). If it is not skewed, but has heavy tails, your CIs will be very wide.

In any case, I think that bootstrapping indeed is a good approach, as it also can give you asymmetrical CIs. The R package simpleboot is a good start:

library(simpleboot)
# 20% trimmed mean bootstrap
b1 <- one.boot(x, mean, R=2000, tr=.2)
boot.ci(b1, type=c("perc", "bca"))

... gives you the following result:

# The bootstrap trimmed mean:
> b1$t0
[1] 1.144648

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
Intervals : 
Level     Percentile            BCa          
95%   ( 1.062,  1.228 )   ( 1.065,  1.229 )  
Calculations and Intervals on Original Scale
