Solved – Why bootstrap to calculate the standard error

bootstrapr

Can you please tell me the advantage of bootstrapping in the example below:

sampleOne   <- function(x) sample(x, replace = TRUE)
sampleMany  <- function(x, n) replicate(n, sampleOne(x), simplify = FALSE)
listMeans <- function(x, n) lapply(sampleMany(x, n), mean)
bootData <- function(x,n) do.call(rbind, listMeans(x,n))
sampleSize <- 100000
numBoots <- 1000

# Left Skewed distribution # shape1 = a and shape2 = b
set.seed(400)
popSkewLeft <- rbeta(sampleSize, shape1 = 5, shape2 = 1)
hist(popSkewLeft)
skewLeftbootData <- bootData(popSkewLeft, numBoots)
(populationMean <- mean(popSkewLeft))# Mean = a/(a+b)  = (5)/(5+1) = 0.8333333
(bootMean <- mean(skewLeftbootData))

(populationSd <- sd(popSkewLeft)) #sd = sqrt(ab/((a+b)^2 (a+b+1))) = sqrt((5*1)/((5+1)^2*(5+1+1))) = 0.140859
(bootSd <- sd(skewLeftbootData) * sqrt(sampleSize))

I created a left skewed population as can be seen from the above code. I also calculated the sample standard deviation. the population standard deviation was calculated using the beta distribution equation.

The sample standard deviation and the booted standard error * the square root of the sample size are almost the same. As a matter of fact the sample standard deviation is closer to the population parameter.

Best Answer

There won't be any advantage to bootstrapping the SE in your example because you have a very large sample size. The distribution of means of that sample size is going to be normal, not skewed, because of the central limit theorem [CLT] (try hist(skewLeftbootData)). Even if it were skewed the SE is going to be so small because of N that the SE is not going to be appreciably skewed anyway. Then you're using for proof the backward calculation of an SD based on the SE of the bootstrap distribution calculated through conventional means. Even if the bootstrap distribution were skewed you've just tossed out one of the reasons you might do bootstrap in this case.

Bootstrapping would be more compelling if you had substantially smaller sample (say 12) and calculated your SE as the middle 67% of the bootstrapped data by cutoffs of the sorted bootstrap distribution. Then you would see that that is a different estimate than an SE calculated from the conventional SD. You also wouldn't then calculate a bootstrapped SD based on the cut offs.

Related Solutions

Bootstrap – Coverage Probabilities of the Basic Bootstrap Confidence Interval

The terminology is probably not used consistently, so the following is only how I understand the original question. From my understanding, the normal CIs you computed are not what was asked for. Each set of bootstrap replicates gives you one confidence interval, not many. The way to compute different CI-types from the results of a set of bootstrap replicates is as follows:

B    <- 999                  # number of replicates
muH0 <- 100                  # for generating data: true mean
sdH0 <- 40                   # for generating data: true sd
N    <- 200                  # sample size
DV   <- rnorm(N, muH0, sdH0) # simulated data: original sample

Since I want to compare the calculations against the results from package boot, I first define a function that will be called for each replicate. Its arguments are the original sample, and an index vector specifying the cases for a single replicate. It returns $M^{\star}$, the plug-in estimate for $\mu$, as well as $S_{M}^{2\star}$, the plug-in estimate for the variance of the mean $\sigma_{M}^{2}$. The latter will be required only for the bootstrap $t$-CI.

> getM <- function(orgDV, idx) {
+     bsM   <- mean(orgDV[idx])                       # M*
+     bsS2M <- (((N-1) / N) * var(orgDV[idx])) / N    # S^2*(M)
+     c(bsM, bsS2M)
+ }

> library(boot)                                       # for boot(), boot.ci()
> bOut <- boot(DV, statistic=getM, R=B)
> boot.ci(bOut, conf=0.95, type=c("basic", "perc", "norm", "stud"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL : 
boot.ci(boot.out = bOut, conf = 0.95, type = c("basic", "perc", "norm", "stud"))

Intervals : 
Level      Normal            Basic         Studentized        Percentile    
95%   ( 95.6, 106.0 )   ( 95.7, 106.2 )  ( 95.4, 106.2 )   ( 95.4, 106.0 )  
Calculations and Intervals on Original Scale

Without using package boot you can simply use replicate() to get a set of bootstrap replicates.

boots <- t(replicate(B, getM(DV, sample(seq(along=DV), replace=TRUE))))

But let's stick with the results from boot.ci() to have a reference.

boots   <- bOut$t                     # estimates from all replicates
M       <- mean(DV)                   # M from original sample
S2M     <- (((N-1)/N) * var(DV)) / N  # S^2(M) from original sample
Mstar   <- boots[ , 1]                # M* for each replicate
S2Mstar <- boots[ , 2]                # S^2*(M) for each replicate
biasM   <- mean(Mstar) - M            # bias of estimator M

The basic, percentile, and $t$-CI rely on the empirical distribution of bootstrap estimates. To get the $\alpha/2$ and $1 - \alpha/2$ quantiles, we find the corresponding indices to the sorted vector of bootstrap estimates (note that boot.ci() will do a more complicated interpolation to find the empirical quantiles when the indices are not natural numbers).

(idx <- trunc((B + 1) * c(0.05/2, 1 - 0.05/2)) # indices for sorted vector of estimates
[1] 25 975

> (ciBasic <- 2*M - sort(Mstar)[idx])          # basic CI
[1] 106.21826  95.65911

> (ciPerc <- sort(Mstar)[idx])                 # percentile CI
[1] 95.42188 105.98103

For the $t$-CI, we need the bootstrap $t^{\star}$ estimates to calculate the critical $t$-values. For the standard normal CI, the critical value will just be the $z$-value from the standard normal distribution.

# standard normal CI with bias correction
> zCrit   <- qnorm(c(0.025, 0.975))   # z-quantiles from std-normal distribution
> (ciNorm <- M - biasM + zCrit * sqrt(var(Mstar)))
[1] 95.5566 106.0043

> tStar <- (Mstar-M) / sqrt(S2Mstar)  # t*
> tCrit <- sort(tStar)[idx]           # t-quantiles from empirical t* distribution
> (ciT  <- M - tCrit * sqrt(S2M))     # studentized t-CI
[1] 106.20690  95.44878

In order to estimate the coverage probabilities of these CI-types, you will have to run this simulation many times. Just wrap the code into a function, return a list with the CI-results and run it with replicate() like demonstrated in this gist.

Solved – Determining the confidence interval for a non-normal distribution

If you want to just do a simple bootstrap of a median CI then all you need is a median function that accepts indices.

med <- function(y, indices) median(y[indices])

Then you can just...

b <- boot(dat, med, 1000)
boot.ci(b)

(you might want to plot(b) to examine your boostrap)

But in your case, assuming your simple exponential model described the data well, I would just be tempted to let the entire simulation speak for itself in a figure. Plot your data distribution and overlay your simulation. If you use density curves your simulation can have a much higher N than the data.

densDat <- density(dat)
plot(densDat)
densSim <- density(sim)
lines(densSim, col = 'red')

(You might want to adjust the bw argument in density to smooth the curve.)

Best Answer

Related Solutions

Bootstrap – Coverage Probabilities of the Basic Bootstrap Confidence Interval

Solved – Determining the confidence interval for a non-normal distribution

Related Question