Solved – Understanding bootstrap method for confidence interval of correlation coefficients

bootstrap, confidence interval, correlation, python, sampling

Please correct me where I'm wrong:

My understanding of bootstrapping is that it is a way to estimate the sampling distribution of some statistic (mean, standard error, Pearson's correlation coefficient, etc.), given only one sample. So if I want to estimate the mean of a population using bootstrap methods, I generate many bootstrap samples, compute the mean of each of these bootstrap samples, and then use the distribution of those values to deduce where the unknown population mean is likely to fall and compute a confidence interval for it.

But how are the bootstrap samples generated? There is a scikit bootstrap module, and I see that it has a bootstrap method to compute a confidence interval for a given statistic: see the first function, ci.

The first argument is the empirical data, which should be an array that the statistic of interest can be computed on. How is this empirical data used to generate the bootstrap samples?

To extend this question, if I want to compute a 95% confidence interval for the Pearson correlation coefficient between two random variables x and y, and I pass data = [(x1,y1), (x2,y2), ... (xi,yi), ... (xn,yn)] to the implementation of bootstrap CI, does that mean that (x1, ..., xn) and (y1, ..., yn) are generated independently of each other for each bootstrap sample that is generated?

Best Answer

The short answer is that, at least in the simple cases, the observations are sampled with replacement. Imagine writing each of the n data values on one face of an n-sided die and rolling the die n times; the n values you roll form one bootstrap sample.
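
To make the die analogy concrete, here is a minimal sketch assuming NumPy and a small made-up sample. The variable names, the number of resamples, and the choice of a simple percentile interval are assumptions for illustration; this is not necessarily what the scikit module does internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Made-up sample, purely for illustration.
data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9])
n = len(data)

# One bootstrap sample: "roll the n-sided die n times", i.e. draw n
# indices uniformly from 0..n-1 with replacement.
idx = rng.integers(0, n, size=n)
boot_sample = data[idx]

# Repeat many times and record the statistic (here the mean) each time.
n_boot = 5000
boot_means = np.array([
    data[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
])

# Percentile interval: the middle 95% of the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, ci_high)
```

Because the draws are with replacement, some observations typically appear more than once in `boot_sample` and others not at all.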

If you're trying to bootstrap a correlation, you resample the data in pairs $(x_i,y_i)$. If you think of your data as two columns, each row is an observation, and you resample the observations (rows).

Here's an example:

[Image: example of resampling $(x_i,y_i)$ pairs with replacement; not available]
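
A minimal Python sketch of resampling pairs, assuming NumPy and simulated data (the data, the helper `pearson_r`, and the percentile interval are illustrative assumptions). For contrast, it also resamples x and y with separate index draws, which is the scheme the question asks about and which breaks the pairing.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility

# Simulated correlated data, purely for illustration.
n = 200
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

n_boot = 2000
r_paired = np.empty(n_boot)
r_independent = np.empty(n_boot)
for b in range(n_boot):
    # Paired resampling: one set of row indices used for both x and y,
    # so each (x_i, y_i) stays together.
    idx = rng.integers(0, n, size=n)
    r_paired[b] = pearson_r(x[idx], y[idx])

    # Independent resampling (shown only for contrast): a second,
    # separate index draw for y destroys the pairing.
    idx_y = rng.integers(0, n, size=n)
    r_independent[b] = pearson_r(x[idx], y[idx_y])

r_hat = pearson_r(x, y)
paired_ci = np.percentile(r_paired, [2.5, 97.5])
independent_ci = np.percentile(r_independent, [2.5, 97.5])
print(r_hat, paired_ci, independent_ci)
```

The paired interval sits around the sample correlation, while the independently resampled values cluster near zero, because resampling the columns separately throws away the dependence you are trying to measure.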

More generally, think of a matrix of data where the observations (rows) are resampled.
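
The matrix view can be sketched as a small helper that resamples whole rows and applies an arbitrary statistic. The function name `bootstrap_rows`, its parameters, and the example data are hypothetical, and the interval is again a plain percentile interval.

```python
import numpy as np

def bootstrap_rows(data, statistic, n_boot=2000, alpha=0.05, seed=None):
    """Percentile bootstrap interval, resampling whole rows of `data`.

    `statistic` is any function mapping an (n, p) array to a scalar.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    stats = np.array([
        # Fancy indexing with repeated row indices keeps each row intact.
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Usage with made-up data: three columns, with columns 0 and 2 correlated.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]

def corr_0_2(rows):
    """Correlation between column 0 and column 2 of the given rows."""
    return np.corrcoef(rows[:, 0], rows[:, 2])[0, 1]

low, high = bootstrap_rows(X, corr_0_2, seed=3)
print(low, high)
```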

(This is not a suitable resampling scheme for every situation, though. There are a plethora of bootstrap schemes.)
