Solved – Bootstrap confidence intervals do not include the observed mean. Can this be

bootstrap, confidence interval, correlation

I have data on survival and fecundity (fruit production) for a bunch of plants grown in multiple experiments. There are multiple plants per genotype, so I take the trait mean per genotype.

I am interested in the (Spearman's rank) correlation between genotype means for these variables, and the uncertainty around that. To estimate confidence intervals I have used a non-parametric bootstrap to re-estimate genotype means and correlations. However, I find that the bootstrap distribution is substantially lower than the observed values. In most cases the observed statistic is beyond the upper CI of the bootstrap distribution.

My first thought is that I made a mistake in the code, but I have checked multiple times and cannot find anything.

My question is: can this be something real in that case? For example some kind of weird bias in the observed data that doesn't permeate to the resamples? If so, would it be valid to report the mean of the bootstrap distribution, or do something else?

If the answer is 'no', then I know I must have made a coding error and will keep looking.

Additional details:

  1. I am using the cor function in R to calculate correlations, with method='spearman' and use='complete.obs'.
  2. Survival is a proportion. Fecundity is count data, and has a lot (~27%) of missing data (plants that did not survive to reproduce).
  3. I do not see any evidence for outliers in the plots.
  4. I don't think this is an artefact of small sample size (there are >400 genotypes, with 18 plants per genotype)
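
For reference, the call described in item 1 might look like the minimal sketch below. The vectors `surv_mean` and `fec_mean` are made-up genotype means for illustration, not the real data.

```r
# Hypothetical genotype means; NA marks a genotype with no fecundity data
surv_mean <- c(0.90, 0.40, 0.75, 0.10, 0.60)
fec_mean  <- c(12.0,  3.5,   NA,  1.0,  8.0)

# use = "complete.obs" drops any pair with a missing value before ranking,
# so only the four complete pairs contribute to the correlation
rho <- cor(surv_mean, fec_mean, method = "spearman", use = "complete.obs")
print(rho)
```

Note that with this option a genotype contributes nothing to the correlation if either of its means is missing.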

Best Answer

Although seeing details of your procedure would help, as others have noted in comments, this phenomenon certainly can happen when a sample estimate of a parameter value is biased from the population value. This may explain your results on bootstrapping correlation coefficient values.

In such a case, following the idea that the bootstrapped samples are to the original sample as the original sample is to the population, there will be a bias between the mean of the parameter estimates from your bootstrap resamples and the estimate from your original sample. (This can be a useful way to estimate the bias itself.) Yet the confidence intervals from the bootstrapped samples will be placed about the mean of the bootstrapped estimates, which is now doubly biased with respect to the population value. With sufficient bias, the confidence limits from the bootstrapped samples will all fall on one side or the other of the estimate from the original sample.

In your case, the sample correlation coefficient can be a biased estimate of the population value.
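
To make this concrete, here is a rough simulation sketch in R. It is not your procedure, only one scenario in which the effect appears. Everything in it is an assumption made for illustration: the normally distributed traits (rather than a proportion and a count), the values of `rho_true` and `noise_sd`, the helper names `genotype_cor` and `boot_once`, and the choice to resample plants within each genotype.

```r
set.seed(1)

n_geno   <- 400   # genotypes (the question mentions >400)
n_plants <- 18    # plants per genotype, as in the question
rho_true <- 0.9   # assumed correlation between true genotype values
noise_sd <- 3     # assumed plant-to-plant noise within a genotype

# Hypothetical true genotype values for the two traits
g_surv <- rnorm(n_geno)
g_fec  <- rho_true * g_surv + sqrt(1 - rho_true^2) * rnorm(n_geno)

# Plant-level measurements: genotype value plus independent noise
geno <- rep(seq_len(n_geno), each = n_plants)
surv <- g_surv[geno] + rnorm(length(geno), sd = noise_sd)
fec  <- g_fec[geno]  + rnorm(length(geno), sd = noise_sd)

# Statistic: Spearman correlation between per-genotype means
genotype_cor <- function(s, f, g) {
  cor(as.numeric(tapply(s, g, mean)), as.numeric(tapply(f, g, mean)),
      method = "spearman")
}
theta_pop <- cor(g_surv, g_fec, method = "spearman")  # noise-free target
theta_hat <- genotype_cor(surv, fec, geno)            # observed estimate

# One replicate: resample plants with replacement within each genotype
rows_by_geno <- split(seq_along(geno), geno)
boot_once <- function() {
  idx <- unlist(lapply(rows_by_geno, function(i) sample(i, replace = TRUE)),
                use.names = FALSE)
  genotype_cor(surv[idx], fec[idx], geno[idx])
}
theta_star <- replicate(500, boot_once())

bias_hat <- mean(theta_star) - theta_hat            # bootstrap bias estimate
ci_pct   <- quantile(theta_star, c(0.025, 0.975))   # percentile interval
theta_bc <- theta_hat - bias_hat                    # simple bias correction

print(c(population = theta_pop, observed = theta_hat,
        boot_mean = mean(theta_star), corrected = theta_bc))
print(ci_pct)
```

Under these assumptions, noise in the genotype means pulls the observed correlation below the noise-free value, and resampling adds a second layer of noise that pulls the bootstrap correlations lower still. The percentile interval can then sit entirely below the observed statistic. The correction `theta_hat - bias_hat` is only a first-order adjustment and should not be expected to recover the population value exactly.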

There is a specific example of how this plays out, for the biased plug-in estimate of Shannon entropy, on this page.

There is extensive explanation, with some links to relevant literature, here.