Solved – Why does weighted bootstrap have awful coverage even in toy example

bootstrap, r, selection-bias, survey-sampling, weights

I'm interested in using the weighted bootstrap to correct for selection bias with a known form. I simulated a very simple example where the underlying data, $X$, are $N(0,1)$ and we are calculating a sample mean. However, there is selection bias such that every $X>0$ gets sampled, while only 5% of $X<0$ get sampled. Thus, the sample mean is biased.
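For reference, the size of that bias is easy to work out. Writing $S$ for the selection indicator, with $P(S=1 \mid X>0)=1$ and $P(S=1 \mid X<0)=0.05$, the mean among sampled units is

$$E[X \mid S=1] = \frac{E[X\,\mathbf{1}\{X>0\}] + 0.05\,E[X\,\mathbf{1}\{X<0\}]}{0.5 + 0.05 \times 0.5} = \frac{0.95/\sqrt{2\pi}}{0.525} \approx 0.72,$$

using $E[X\,\mathbf{1}\{X>0\}] = -E[X\,\mathbf{1}\{X<0\}] = 1/\sqrt{2\pi}$ for a standard normal. So the naive sample mean converges to about $0.72$ rather than $0$.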

To correct for this, I resampled with replacement using inverse-probability-of-selection weights. I did this with the boot package, passing its weights argument a value of 1 for each $X>0$ and 20 for each $X<0$. I computed confidence intervals using either the percentile method or the raw empirical quantiles of the bootstrap replicates. Even though the resampled means were unbiased, both types of confidence interval had awful coverage (~53%).

As a sanity check, I set the parameter that controls the ratio of selection probabilities (eta below) to 1, so that there is no selection bias and we are just doing plain-vanilla resampling. Lo and behold, the coverage is then fine (96.5%). Since the bootstrap works for these data without selection, I don't think the excellent answers here apply (e.g., issues with non-pivotality, asymptotics not having kicked in, etc.).

Also, although I realize other methods of CI construction (e.g., BCa or basic) are sometimes better, I can't use them in this case: they are centered on the original sample estimate, which is obviously biased here, so it can't serve as a benchmark.

My code is below:

##### Helper Fn: Extract CI Limits from boot.ci #####
# returns a list with one c(lower, upper) entry per estimated parameter
# n.ests: how many parameters were estimated?
get_boot_CIs = function(boot.res, type, n.ests) {
  bootCIs = lapply( 1:n.ests, function(x) boot.ci(boot.res, type = type, index = x) )

  # the "[[4]]" index accesses the CI matrix for the requested type (e.g., $percent);
  # its 4th and 5th entries are the lower and upper confidence limits
  bootCIs = lapply( 1:n.ests, function(x) c( bootCIs[[x]][[4]][4],
                                             bootCIs[[x]][[4]][5] ) )
  return(bootCIs)
}


##### Helper Fn: Check CI coverage #####
covers = function( truth, lo, hi ) {
  return( (lo <= truth) & (hi >= truth) )
}


library(boot)

sim.reps = 200    # number of simulated datasets
boot.reps = 500   # bootstrap replicates per dataset
n = 5000          # initial sample size before selection

# sanity check: using eta = 1 gives 96.5% coverage for both methods


for ( i in 1:sim.reps ) {

    x = rnorm(n)

    # sample values non-randomly
    # positive ones have an eta-fold higher chance of being sampled
    eta = 20
    weight = rep(1, length(x))
    weight[x < 0] = eta

    # indicator for whether we sample each observation
    keep = rbinom(n=n, size=1, prob=1/weight)

    d = data.frame(x, weight)
    d = d[ keep == 1, ]

    # the naive (unweighted) sample mean will be > 0:
    # mean(d$x)

    boot.res = boot( data = d,
                     parallel = "multicore",
                     R = boot.reps,
                     weights = d$weight,
                     statistic = function(original, indices) {
                       b = original[indices, ]
                       mean(b$x)
                     } )

    # in case there is some weird problem with computing the boot CI
    tryCatch( {
      percCIs = get_boot_CIs(boot.res, "perc", n.ests = 1)
    }, error = function(err) {
      percCIs <<- list( c(NA, NA) )  # one c(lo, hi) entry, matching n.ests = 1
    } )

    # raw empirical quantiles of the bootstrap replicates
    qlo = quantile(boot.res$t[,1], 0.025)
    qhi = quantile(boot.res$t[,1], 0.975)

    rows = data.frame( Method = c( "PercBT",
                                   "JustQuantiles" ),

                       # point estimate: both CI methods use the same
                       # mean of the bootstrap replicates
                       XBarWtd = rep( mean( boot.res$t[,1] ), 2 ),

                       Lo = c( percCIs[[1]][1],
                               qlo ),

                       Hi = c( percCIs[[1]][2],
                               qhi ),

                       Coverage = c( covers( 0, percCIs[[1]][1], percCIs[[1]][2] ),
                                     covers( 0, qlo, qhi ) ) )

    if ( i == 1 ) res = rows
    else res = rbind(res, rows)
}

# see the underwhelming results
library(dplyr)
# note: calling this vector "vars" would shadow dplyr::vars, so name it something else
stat.vars = c("XBarWtd", "Lo", "Hi", "Coverage")
res %>% group_by(Method) %>% summarise_at( stat.vars, mean )

And the disappointing results:

  Method        XBarWtd      Lo     Hi Coverage
  <fct>           <dbl>   <dbl>  <dbl>    <dbl>
1 JustQuantiles 0.00430 -0.0333 0.0423    0.53 
2 PercBT        0.00430 -0.0341 0.0430    0.535

Best Answer

The weights argument of the boot function expects importance resampling weights, not inverse-probability-of-selection weights. When no weights are specified, boot uses a uniform resampling weight of $n^{-1}$ for each unit, where $n$ is the sample size. Importance weights are resampling probabilities intended to improve the efficiency of Monte Carlo estimates of bootstrap quantiles; see, e.g., the reference "On importance resampling for the bootstrap" for a description of importance weights and their intended purpose.

As for your problem: if you look at the boot function's source code, you will see the following:

if (!is.null(weights)) 
  weights <- t(apply(matrix(weights, n, length(R), 
                            byrow = TRUE), 2L, normalize, strata))

And digging a little deeper, you find that

boot:::normalize

normalizes the weights to sum to 1. So, essentially, when you provide a vector of 1's and 20's, it becomes a vector of 1/sum(vector)'s and 20/sum(vector)'s, and boot then draws the weight-20 units 20 times as often as the weight-1 units. It is a happy coincidence that this is precisely the correction needed to make your point estimate of the mean unbiased: the oversampling recenters the bootstrap distribution at the right value. But the spread of that distribution is not an estimate of the sampling variability of an inverse-probability-weighted mean, so quantile-based intervals built from it have no particular reason to attain nominal coverage. This is not the intended use of the weights argument, so it makes sense that the other estimated quantities come out wrong.
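To make the normalization concrete, here is a toy illustration of its essential effect (the actual call inside boot also handles strata):

w = c(1, 1, 1, 20, 20)
w / sum(w)  # resampling probabilities: the 20's are drawn 20x as often as the 1's
# [1] 0.02325581 0.02325581 0.02325581 0.46511628 0.46511628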

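If the goal is an IPW-corrected mean, one standard alternative (a minimal sketch, not from the original post, and worth checking by simulation before trusting its coverage) is to resample rows uniformly and apply the inverse-probability weights inside the statistic, so each bootstrap replicate is itself a weighted mean:

library(boot)

set.seed(1)
n = 5000
x = rnorm(n)
eta = 20
weight = rep(1, n)
weight[x < 0] = eta                              # inverse-probability-of-selection weights
keep = rbinom(n = n, size = 1, prob = 1/weight)  # selection step
d = data.frame(x, weight)[keep == 1, ]

# no weights argument: ordinary uniform resampling of the selected rows;
# the correction happens inside the statistic instead
boot.res = boot( data = d,
                 R = 500,
                 statistic = function(original, indices) {
                   b = original[indices, ]
                   weighted.mean(b$x, w = b$weight)  # IPW point estimate
                 } )

boot.ci(boot.res, type = "perc", index = 1)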