Why bootstrap to calculate the standard error?

bootstrap, r

Can you please tell me the advantage of bootstrapping in the example below:

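# Draw one bootstrap resample: same size as x, with replacement
sampleOne   <- function(x) sample(x, replace = TRUE)
# Draw n bootstrap resamples, returned as a list
sampleMany  <- function(x, n) replicate(n, sampleOne(x), simplify = FALSE)
# Mean of each resample
listMeans <- function(x, n) lapply(sampleMany(x, n), mean)
# Stack the resample means into an n x 1 matrix
bootData <- function(x, n) do.call(rbind, listMeans(x, n))
sampleSize <- 100000  # number of observations drawn
numBoots <- 1000      # number of bootstrap resamples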

# Left Skewed distribution # shape1 = a and shape2 = b
set.seed(400)
popSkewLeft <- rbeta(sampleSize, shape1 = 5, shape2 = 1)
hist(popSkewLeft)
skewLeftbootData <- bootData(popSkewLeft, numBoots)
(populationMean <- mean(popSkewLeft))  # theoretical mean = a/(a+b) = 5/(5+1) = 0.8333333
(bootMean <- mean(skewLeftbootData))

(populationSd <- sd(popSkewLeft))  # theoretical sd = sqrt(ab/((a+b)^2 (a+b+1))) = sqrt((5*1)/((5+1)^2*(5+1+1))) = 0.140859
(bootSd <- sd(skewLeftbootData) * sqrt(sampleSize))  # bootstrap SE of the mean, rescaled by sqrt(n) to the SD scale

I created a left-skewed population, as can be seen from the code above. I also calculated the sample standard deviation. The population standard deviation was calculated from the formula for the beta distribution.

The sample standard deviation and the bootstrapped standard error multiplied by the square root of the sample size are almost the same. As a matter of fact, the sample standard deviation is closer to the population parameter.

Best Answer

There won't be any advantage to bootstrapping the SE in your example because you have a very large sample size. The distribution of means from samples of that size is going to be approximately normal, not skewed, because of the central limit theorem (CLT); try hist(skewLeftbootData). Even if it were skewed, the SE is so small at this N that any skew would not appreciably affect it. You then offer as proof an SD back-calculated from the SE of the bootstrap distribution, and that SE was itself computed by conventional means (as the standard deviation of the bootstrap means). So even if the bootstrap distribution were skewed, you've just tossed out one of the reasons you might bootstrap in this case.
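The large-sample point can be checked numerically. The following is only a minimal sketch: the names n, B, bootMeans, seBoot, seConv and skew are illustrative and not from the question, and n is assumed smaller than the question's 100000 purely so that it runs quickly. It compares the bootstrap SE of the mean with the conventional sd/sqrt(n) and computes a moment-based skewness for the raw data and for the bootstrap means.

```r
set.seed(400)
n <- 10000   # assumed sample size (smaller than in the question, for speed)
B <- 1000    # number of bootstrap resamples
x <- rbeta(n, shape1 = 5, shape2 = 1)   # left-skewed data, as in the question

# Bootstrap distribution of the sample mean
bootMeans <- replicate(B, mean(sample(x, replace = TRUE)))

seBoot <- sd(bootMeans)      # bootstrap SE of the mean
seConv <- sd(x) / sqrt(n)    # conventional SE of the mean

# Simple moment-based skewness (hypothetical helper, base R only)
skew <- function(v) mean(((v - mean(v)) / sd(v))^3)

c(bootstrap = seBoot, conventional = seConv)
c(dataSkew = skew(x), bootMeansSkew = skew(bootMeans))
```

With a sample this large, the two SE values should agree closely, and the skewness of the bootstrap means should be near zero even though the data themselves are clearly left-skewed. Calling hist(bootMeans) shows the same thing graphically.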

Bootstrapping would be more compelling if you had a substantially smaller sample (say 12) and took as your SE-style interval roughly the middle 68% of the bootstrap distribution, using cutoffs from the sorted bootstrap values. You would then see that this gives a different estimate than an SE calculated from the conventional SD. You also wouldn't then back-calculate a bootstrapped SD from the cutoffs.
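The small-sample suggestion might look something like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the names n, B, bootCut, convCut and seConv are hypothetical, and the cutoffs are taken as the 16th and 84th percentiles, which enclose roughly the middle two thirds of the bootstrap distribution, the same coverage as mean +/- 1 SE under normality.

```r
set.seed(400)
n <- 12      # small sample, as suggested in the answer
B <- 5000    # number of bootstrap resamples (assumed value)
x <- rbeta(n, shape1 = 5, shape2 = 1)

# Bootstrap distribution of the sample mean
bootMeans <- replicate(B, mean(sample(x, replace = TRUE)))

# Percentile cutoffs taken directly from the bootstrap distribution
bootCut <- quantile(bootMeans, probs = c(0.16, 0.84))

# Conventional interval: sample mean +/- 1 SE, symmetric by construction
seConv  <- sd(x) / sqrt(n)
convCut <- mean(x) + c(-1, 1) * seConv

rbind(bootstrap = bootCut, conventional = convCut)

# Distance from the sample mean to each bootstrap cutoff;
# these need not be equal when the bootstrap distribution is skewed
c(lower = unname(mean(x) - bootCut[1]), upper = unname(bootCut[2] - mean(x)))
```

The conventional interval is forced to be symmetric around the sample mean, whereas the percentile cutoffs can sit at unequal distances from it. How large that difference is depends on the particular sample drawn.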