Solved – Bootstrap confidence intervals do not include the observed mean. Can this be

Tags: bootstrap, confidence interval, correlation

I have data on survival and fecundity (fruit production) for a bunch of plants grown in multiple experiments. There are multiple plants per genotype, so I take the trait mean per genotype.

I am interested in the (Spearman's rank) correlation between genotype means for these variables, and the uncertainty around that. To estimate confidence intervals I have used a non-parametric bootstrap to re-estimate genotype means and correlations. However, I find that the bootstrap distribution is substantially lower than the observed values. In most cases the observed statistic is beyond the upper CI of the bootstrap distribution.

My first thought is that I made a mistake in the code, but I have checked multiple times and cannot find anything.

My question is: can this be something real in that case? For example some kind of weird bias in the observed data that doesn't permeate to the resamples? If so, would it be valid to report the mean of the bootstrap distribution, or do something else?

If the answer is 'no', then I know I must have made a coding error and will keep looking.

Additional details:

I am using the cor function in R to calculate correlations, with method='spearman' and use='complete.obs'.
Survival is a proportion. Fecundity is count data, and has a lot (~27%) of missing data (plants that did not survive to reproduce).
I do not see any evidence for outliers in the plots.
I don't think this is an artefact of small sample size (there are >400 genotypes, with 18 plants per genotype)

Best Answer

Although seeing details of your procedure would help, as others have noted in comments, this phenomenon certainly can happen when a sample estimate of a parameter value is biased from the population value. This may explain your results on bootstrapping correlation coefficient values.

In such a case, following the idea that the bootstrapped samples are to the original sample as the original sample is to the population, there will be a bias between the mean parameter value determined from your re-sampled bootstrap estimates and the value from your original sample. (This can be a useful way to estimate the bias itself.) Yet the confidence intervals from the bootstrapped samples will be placed about the mean of the bootstrapped samples, which is now doubly biased with respect to the population. With sufficient bias, the confidence limits from the bootstrapped samples will all be on one side or the other of the parameter value from the original sample.
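A minimal sketch of the bias-estimation idea, written in Python rather than the R used in the question. It uses the plug-in variance (dividing by n) as a stand-in for a biased statistic; the function names, the simulated normal data and the sample size are assumptions made for illustration, not part of the original analysis.

```python
import random
import statistics

def plugin_variance(xs):
    """Plug-in variance (divides by n), a downward-biased estimator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bootstrap_bias(xs, stat, n_boot=2000, seed=1):
    """Return the observed statistic and mean(bootstrap stats) - observed."""
    rng = random.Random(seed)
    observed = stat(xs)
    boot = [stat(rng.choices(xs, k=len(xs))) for _ in range(n_boot)]
    return observed, statistics.fmean(boot) - observed

# Hypothetical data: 20 draws from a standard normal distribution.
data_rng = random.Random(0)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(20)]

observed, bias_hat = bootstrap_bias(sample, plugin_variance)
corrected = observed - bias_hat  # simple bias-corrected estimate

print(f"observed  {observed:.4f}")
print(f"bias_hat  {bias_hat:.4f}")
print(f"corrected {corrected:.4f}")
```

Because the bootstrap replicates are on average smaller than the observed value, the estimated bias comes out negative and the corrected estimate moves back up, away from the centre of the bootstrap distribution.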

In your case, the sample correlation coefficient can be a biased estimate of the population value.

There is a specific example of how this plays out, for the biased plug-in estimate of Shannon entropy, on this page.

There is extensive explanation, with some links to relevant literature, here.

Related Solutions

Solved – Biased bootstrap: is it okay to center the CI around the observed statistic

In the setup given by the OP the parameter of interest is the Shannon entropy $$\theta(\mathbf{p}) = - \sum_{i = 1}^{50} p_i \log p_i,$$ which is a function of the probability vector $\mathbf{p} \in \mathbb{R}^{50}$. The estimator based on $n$ samples ($n = 100$ in the simulation) is the plug-in estimator $$\hat{\theta}_n = \theta(\hat{\mathbf{p}}_n) = - \sum_{i=1}^{50} \hat{p}_{n,i} \log \hat{p}_{n,i}.$$ The samples were generated using the uniform distribution for which the Shannon entropy is $\log(50) = 3.912.$ Since the Shannon entropy is maximized in the uniform distribution, the plug-in estimator must be downward biased. A simulation shows that $\mathrm{bias}(\hat{\theta}_{100}) \simeq -0.28$ whereas $\mathrm{bias}(\hat{\theta}_{500}) \simeq -0.05$. The plug-in estimator is consistent, but the $\Delta$-method does not apply for $\mathbf{p}$ being the uniform distribution, because the derivative of the Shannon entropy is 0. Thus for this particular choice of $\mathbf{p}$, confidence intervals based on asymptotic arguments are not obvious.

The percentile interval is based on the distribution of $\theta(\mathbf{p}_n^*)$ where $\mathbf{p}_n^*$ is the estimator obtained from sampling $n$ observations from $\hat{\mathbf{p}}_n$. Specifically, it is the interval from the 2.5% quantile to the 97.5% quantile for the distribution of $\theta(\mathbf{p}_n^*)$. As the OP's bootstrap simulation shows, $\theta(\mathbf{p}_n^*)$ is clearly also downward biased as an estimator of $\theta(\hat{\mathbf{p}}_n)$, which results in the percentile interval being completely wrong.
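The setup described above can be reproduced with a short simulation. This is a sketch in Python under the stated settings (50 categories, uniform probabilities, n = 100); the seed, the number of bootstrap replicates and the index-based quantiles are arbitrary choices made here, not taken from the OP's code.

```python
import math
import random
from collections import Counter

K, N, B = 50, 100, 2000  # categories, sample size, bootstrap replicates

def plugin_entropy(draws):
    """Plug-in Shannon entropy of a list of category labels."""
    n = len(draws)
    return -sum((c / n) * math.log(c / n) for c in Counter(draws).values())

rng = random.Random(42)
sample = [rng.randrange(K) for _ in range(N)]  # uniform over K categories
theta_hat = plugin_entropy(sample)

# Resample the observed labels with replacement and recompute the entropy.
boot = sorted(plugin_entropy(rng.choices(sample, k=N)) for _ in range(B))

# Percentile interval: 2.5% and 97.5% empirical quantiles of the replicates.
lo, hi = boot[B * 25 // 1000], boot[B * 975 // 1000 - 1]

print(f"true entropy        {math.log(K):.3f}")
print(f"plug-in estimate    {theta_hat:.3f}")
print(f"percentile interval ({lo:.3f}, {hi:.3f})")
```

The plug-in estimate falls below log(50), and the bootstrap replicates fall lower still, so the percentile interval ends up below the estimate it is supposed to describe.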

For the basic (and normal) interval, the roles of the quantiles are interchanged. This implies that the interval does seem to be reasonable (it covers 3.912), though intervals extending beyond 3.912 are not logically meaningful. Moreover, I don't know if the basic interval will have the correct coverage. Its justification is based on the following approximate distributional identity:

$$\theta(\mathbf{p}_n^*) - \theta(\hat{\mathbf{p}}_n) \overset{\mathcal{D}}{\simeq} \theta(\hat{\mathbf{p}}_n) - \theta(\mathbf{p}),$$ which might be questionable for (relatively) small $n$ like $n = 100$.
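The interchange of quantiles follows directly from that identity: the basic interval is $(2\hat{\theta} - q_{0.975},\; 2\hat{\theta} - q_{0.025})$, where $q$ denotes quantiles of the bootstrap replicates. A small Python sketch, with made-up replicate values and hypothetical helper names, shows how the two intervals relate:

```python
def quantile(sorted_xs, q):
    """Simple empirical quantile by index (no interpolation)."""
    idx = min(len(sorted_xs) - 1, max(0, int(q * len(sorted_xs))))
    return sorted_xs[idx]

def percentile_interval(boot, alpha=0.05):
    """Bootstrap quantiles used directly as interval limits."""
    s = sorted(boot)
    return quantile(s, alpha / 2), quantile(s, 1 - alpha / 2)

def basic_interval(theta_hat, boot, alpha=0.05):
    """Reflect the bootstrap quantiles around the observed estimate."""
    lo, hi = percentile_interval(boot, alpha)
    return 2 * theta_hat - hi, 2 * theta_hat - lo

# Hypothetical values: replicates that all sit below the observed estimate.
theta_hat = 3.63
boot = [3.30 + 0.001 * i for i in range(200)]

lo_p, hi_p = percentile_interval(boot)
lo_b, hi_b = basic_interval(theta_hat, boot)
print("percentile:", (round(lo_p, 3), round(hi_p, 3)))
print("basic:     ", (round(lo_b, 3), round(hi_b, 3)))
```

When the replicates are biased downward, the percentile interval lies below the estimate and the basic interval, which has the same width, lies above it.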

The OP's last suggestion of a standard error based interval $\theta(\hat{\mathbf{p}}_n) \pm 1.96\hat{\mathrm{se}}_n$ will not work either because of the large bias. It might work for a bias-corrected estimator, but then you first of all need correct standard errors for the bias-corrected estimator.

I would consider a likelihood interval based on the profile log-likelihood for $\theta(\mathbf{p})$. I'm afraid that I don't know any simple way to compute the profile log-likelihood for this example except that you need to maximize the log-likelihood over $\mathbf{p}$ for different fixed values of $\theta(\mathbf{p})$.

Solved – Three questions about the article “Ditch p-values. Use Bootstrap confidence intervals instead”

1 They don’t mean what people think they mean

Am I right that this is not a p-value (which is the probability to see this or more extreme value of a test statistic)? Is it a correct procedure for a statistical testing? I have a gut feeling that it is a wrong situation to apply hypothesis testing, but I can not formally answer why.

One could argue that, technically speaking, it is a p-value. But it is a rather meaningless p-value. There are two ways to look at it as a meaningless p-value:

Neyman and Pearson suggest that, in order to compute the p-value, you choose the region where the likelihood ratio (between the null hypothesis and alternative hypothesis) is the highest. You count observations as 'extreme' when a deviation from the null hypothesis would mean more likelihood to make that extreme observation.

This is not the case with the US citizen example. If the null hypothesis 'Robert is a US citizen' is false, then the observation 'Robert is a US senator' is in no way more likely. So from the viewpoint of Neyman's and Pearson's approach to hypothesis testing, this is a very bad type of calculation for a p-value.
From the viewpoint of Fisher's approach to hypothesis testing, you have a measurement of some effect and the point of the p-value is to quantify the statistical significance. It is useful as an expression of the precision of an experiment.

The p-value quantifies how good/accurate the experiment is in the quantification of the deviation. Statistically speaking, effects will always occur to some extent due to random fluctuations in the measurements. An observation is seen as statistically significant when it is a fluctuation large enough that there is only a low probability of observing such a seeming effect when there is actually no effect (when the null hypothesis is true). Experiments that have a high probability of showing an effect while there is actually no effect are not very useful. We use p-values to express this probability.

By reporting p-values researchers can show that their experiments have sufficiently small noise and sufficiently large sample size, such that the observed effects are statistically significant (unlikely to be just noise).

Fisher's p-values are an expression of the noise and random fluctuations, they are a sort of expression of signal/noise ratio. The advice is to only reject a hypothesis when an effect is sufficiently large compared to the noise level.

Even though there is no alternative hypothesis in Fisher's viewpoint, when we express a p-value then this is done for the measurement of some effect as a deviation relative to a null (no effect) hypothesis. There must be some sense of a direction that can be considered to be an effect or a deviation.

In the case of the experiment with US citizenship, the measurement of 'Robert is a US senator' has nothing to do with the measurement of some effect or a deviation from the null hypothesis. Expressing a p-value for it is meaningless.

The example with US citizenship may be a bit weird and wrong. However, it is not meant to be correct. The point is to show that a p-value on its own is not very meaningful. What we also need to consider is the power of a test (and that is missing in the example with US citizenship). A low p-value might be nice, but what if the p-value would be just as low, or even lower, for an alternative explanation? If you have a bad hypothesis test then you could 'reject a hypothesis' based on a (crappy) low p-value while actually no alternative hypothesis is better suited.

Example 1: Say you have two jars one with 50% gold and 50% silver coins, the other with 75% gold and 25% silver coins. You take 10 coins out of one jar, and they are all silver, which jar do we have? We could say that the prior odds were 1:1 and the posterior odds are 1:1024. We can say that the jar is very likely the one with 50:50 gold:silver, but both hypotheses are unlikely when we observe 10 silver coins and maybe we should mistrust our model.
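The arithmetic behind those odds can be checked in a few lines of Python. The sketch assumes the ten draws are independent (drawing with replacement, or from a very large jar), which the example does not state explicitly; the variable names are made up for illustration.

```python
from fractions import Fraction

N_DRAWS = 10

# Probability of drawing a silver coin from each jar.
p_silver_5050 = Fraction(1, 2)   # jar with 50% gold, 50% silver
p_silver_7525 = Fraction(1, 4)   # jar with 75% gold, 25% silver

# Likelihood of ten silver coins in a row under each hypothesis
# (assumes independent draws).
lik_5050 = p_silver_5050 ** N_DRAWS
lik_7525 = p_silver_7525 ** N_DRAWS

prior_odds = Fraction(1, 1)                       # 1:1 between the jars
posterior_odds = prior_odds * lik_5050 / lik_7525  # in favour of the 50:50 jar

print("likelihood, 50:50 jar:", float(lik_5050))
print("likelihood, 75:25 jar:", float(lik_7525))
print("posterior odds for the 50:50 jar:", posterior_odds)
```

The odds favour the 50:50 jar by 1024 to 1, yet even under that jar the observation has a probability of only 1/1024, which is the reason to mistrust the model as a whole.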

Example 2: Say you have data that follow a quadratic curve y = a + c x^2, but you fit them with a straight line y = a + b x. When we fit the model we find that the p-value is extremely low for a zero slope (no effect), since the data do not match a flat line (as they follow a quadratic curve). But does that mean that we should reject the hypothesis that the coefficient b is zero? The discrepancy, the low p-value, is not because the null hypothesis is false, but because our entire model is false (that is the actual conclusion when the p-value is low: the null hypothesis and/or the statistical model is false).

2 They rely on hidden assumptions

It seems to be wrong, but the question is: can we say that non-parametric tests also rely on some regular statistical distributions? Not only they have assumptions, but also, technically, their statistics also follow some distributions

The point of non-parametric tests is that we make no assumptions about the data. But the statistic that we compute may follow some distribution.

Example: We wonder whether one sample is larger than another sample. Let's say that the samples are paired. Then without knowing anything about the distribution we can just count which of the pairs is larger. Independent of the distribution of the population from which the sample has been taken, this sign statistic will follow a binomial distribution.
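A minimal sketch of such a sign test in Python, with made-up paired measurements. Dropping ties and doubling the one-sided tail for a two-sided p-value are conventions chosen here, not something specified in the text.

```python
from math import comb

def sign_test_p_value(pairs):
    """Two-sided exact sign test on (a, b) pairs; ties are dropped."""
    wins = sum(1 for a, b in pairs if a > b)
    losses = sum(1 for a, b in pairs if a < b)
    n = wins + losses
    k = max(wins, losses)
    # Under the null, the number of wins is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired measurements (first sample, second sample).
pairs = [(5.1, 4.8), (6.0, 5.2), (4.9, 5.0), (7.3, 6.1), (5.5, 5.1),
         (6.2, 5.9), (5.8, 5.0), (6.6, 6.0), (5.4, 5.6), (7.0, 6.2)]

p_value = sign_test_p_value(pairs)
print(p_value)
```

Only the signs of the differences enter the calculation, so any strictly increasing transformation of the measurements (a log, for instance) leaves the p-value unchanged.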

So the point of non-parametric tests is not that the statistic that is being computed has no distribution, but that the distribution of the statistic is independent from the distribution of the data.

The point of this "They rely on hidden assumptions" is correct. However, it is a bit harsh and sketches the assumptions in a limited sense (as if assumptions are only simplifications to make computations easy).

Indeed many models are simplifications. But I would say that the parametric distributions are still useful, even when we have much more computation power nowadays and simplifications are not necessary. The reason is that parametric distributions are not always simplifications.

On the one hand: Bootstrapping or other simulations can approach the same result as a computation, and when the computation makes assumptions, approximations and simplifications then the bootstrapping may even do better.
On the other hand: The parametric distribution, if it is true, gives you information that bootstrapping can't give you. When you have only a small amount of data, you can't get a proper estimate of p-values or confidence intervals. With parametric distributions you can fill the gap.

Example: if you have ten samples from a distribution, then you might estimate the quantile at multiples of 10%, but you won't be able to estimate smaller quantiles. If you know that the distribution can be approximated by some distribution (based on theory and previous knowledge such assumptions might not be bad) then you can use a fit with the parametric distribution to interpolate and extrapolate the ten samples to other quantiles.
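As a rough illustration in Python, with ten made-up observations and a normal distribution chosen as the assumed parametric family (the text does not name one), a fitted distribution gives quantiles that the raw sample cannot:

```python
from statistics import NormalDist

# Hypothetical sample of ten observations.
data = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 6.8, 7.2]

# Fit a normal distribution by its sample mean and standard deviation.
# The normal family is an assumption made for this sketch.
fitted = NormalDist.from_samples(data)

q01 = fitted.inv_cdf(0.01)  # 1% quantile, below anything observed
q99 = fitted.inv_cdf(0.99)  # 99% quantile, above anything observed

print(f"sample range: {min(data)} to {max(data)}")
print(f"fitted 1% and 99% quantiles: {q01:.2f}, {q99:.2f}")
```

The extrapolated quantiles are only as trustworthy as the distributional assumption, which is exactly the trade-off described above.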

Example 2: The representation of parametric tests as being only useful for making calculations easier is a straw man argument. It is not true because it is far from the only reason. The main reason why people use parametric tests is that they are more powerful. Compare for instance the parametric t-test with the non-parametric Mann-Whitney U test. The choice for the former is not because the computation is easier, but because it can be more powerful.

3 They detract from the real questions

Can we, based on confidence intervals, say, what is an expected value? Is in this situation a clear decision? I always thought that confidence intervals are not necessarily symmetric, but I started to doubt here.

No, confidence intervals do not give full information. You should instead compute some cost function that quantifies all considerations in the decision (requiring the full distribution).

But confidence intervals may be a reasonable indication. The step from a single point estimate to a range is a big difference and adds an entire new dimension to the representation.

Your criticism here is also exactly the point of the author of the blogpost. You criticize the confidence intervals not giving full information. But the means 0.08 for action A and 0.001 for action B have even less information than the confidence intervals, and that is what the author is pointing out.

This third point is more a matter of point estimates versus interval estimates. Maybe p-values promote the use of point estimates, but it is a bit far-fetched to use that as criticism against p-values. The example is not even a case that is about p-values; it is about a Bayesian posterior for two situations.