Solved – Why does weighted bootstrap have awful coverage even in toy example

bootstrap, r, selection-bias, survey-sampling, weights

I'm interested in using the weighted bootstrap to correct for selection bias with a known form. I simulated a very simple example where the underlying data, $X$, are $N(0,1)$ and we are calculating a sample mean. However, there is selection bias such that every $X>0$ gets sampled, while only 5% of $X<0$ get sampled. Thus, the sample mean is biased.
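For reference, the size of that bias is easy to work out. Writing $S$ for the selection indicator, with $P(S=1 \mid X>0)=1$ and $P(S=1 \mid X<0)=0.05$, the mean among sampled units is

$$E[X \mid S=1] = \frac{E[X\,\mathbf{1}\{X>0\}] + 0.05\,E[X\,\mathbf{1}\{X<0\}]}{0.5 + 0.05 \times 0.5} = \frac{0.95/\sqrt{2\pi}}{0.525} \approx 0.72,$$

using $E[X\,\mathbf{1}\{X>0\}] = -E[X\,\mathbf{1}\{X<0\}] = 1/\sqrt{2\pi}$ for a standard normal. So the naive sample mean converges to about $0.72$ rather than $0$.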

To correct for this, I resampled with replacement using inverse-probability-of-selection weights. I did this with the boot package, passing its weights argument a value of 1 for each $X>0$ and 20 for each $X<0$. I computed confidence intervals using either the percentile method or the raw empirical quantiles of the bootstrap replicates. Even though the resampled means were unbiased, both types of confidence interval had awful coverage (~53%).

As a sanity check, I set the parameter that controls the ratio of selection probabilities (eta below) to 1, so that there is no selection bias and we are just doing plain-vanilla resampling. Lo and behold, the coverage is then fine (96.5%). Since the bootstrap works for these data without selection, I don't think the excellent answers here apply (e.g., issues with non-pivotality, asymptotics not having kicked in, etc.).

Also, although I realize other methods of CI construction (e.g., BCa or basic) are sometimes better, I can't use them in this case: they are centered on the original sample estimate, which is obviously biased here, so it can't serve as a benchmark.

My code is below:

##### Helper Fn: Extract CI Limits from boot.ci #####
# returns a list with one c(lower, upper) entry per estimated parameter
# n.ests: how many parameters were estimated?
get_boot_CIs = function(boot.res, type, n.ests) {
  bootCIs = lapply( 1:n.ests, function(x) boot.ci(boot.res, type = type, index = x) )

  # the "[[4]]" index accesses the CI matrix for the requested type (e.g., $percent);
  # its 4th and 5th entries are the lower and upper confidence limits
  bootCIs = lapply( 1:n.ests, function(x) c( bootCIs[[x]][[4]][4],
                                             bootCIs[[x]][[4]][5] ) )
  return(bootCIs)
}


##### Helper Fn: Check CI coverage #####
covers = function( truth, lo, hi ) {
  return( (lo <= truth) & (hi >= truth) )
}


library(boot)

sim.reps = 200    # number of simulated datasets
boot.reps = 500   # bootstrap replicates per dataset
n = 5000          # initial sample size before selection

# sanity check: using eta = 1 gives 96.5% coverage for both methods


for ( i in 1:sim.reps ) {

    x = rnorm(n)

    # sample values non-randomly
    # positive ones have an eta-fold higher chance of being sampled
    eta = 20
    weight = rep(1, length(x))
    weight[x < 0] = eta

    # indicator for whether we sample each observation
    keep = rbinom(n=n, size=1, prob=1/weight)

    d = data.frame(x, weight)
    d = d[ keep == 1, ]

    # the naive (unweighted) sample mean will be > 0:
    # mean(d$x)

    boot.res = boot( data = d,
                     parallel = "multicore",
                     R = boot.reps,
                     weights = d$weight,
                     statistic = function(original, indices) {
                       b = original[indices, ]
                       mean(b$x)
                     } )

    # in case there is some weird problem with computing the boot CI
    tryCatch( {
      percCIs = get_boot_CIs(boot.res, "perc", n.ests = 1)
    }, error = function(err) {
      percCIs <<- list( c(NA, NA) )  # one c(lo, hi) entry, matching n.ests = 1
    } )

    # raw empirical quantiles of the bootstrap replicates
    qlo = quantile(boot.res$t[,1], 0.025)
    qhi = quantile(boot.res$t[,1], 0.975)

    rows = data.frame( Method = c( "PercBT",
                                   "JustQuantiles" ),

                       # point estimate: both CI methods use the same
                       # mean of the bootstrap replicates
                       XBarWtd = rep( mean( boot.res$t[,1] ), 2 ),

                       Lo = c( percCIs[[1]][1],
                               qlo ),

                       Hi = c( percCIs[[1]][2],
                               qhi ),

                       Coverage = c( covers( 0, percCIs[[1]][1], percCIs[[1]][2] ),
                                     covers( 0, qlo, qhi ) ) )

    if ( i == 1 ) res = rows
    else res = rbind(res, rows)
}

# see the underwhelming results
library(dplyr)
# note: calling this vector "vars" would shadow dplyr::vars, so name it something else
stat.vars = c("XBarWtd", "Lo", "Hi", "Coverage")
res %>% group_by(Method) %>% summarise_at( stat.vars, mean )

And the disappointing results:

  Method        XBarWtd      Lo     Hi Coverage
  <fct>           <dbl>   <dbl>  <dbl>    <dbl>
1 JustQuantiles 0.00430 -0.0333 0.0423    0.53 
2 PercBT        0.00430 -0.0341 0.0430    0.535

Best Answer

The weights argument of the boot function expects importance resampling weights, not inverse-probability-of-selection weights. When no weights are specified, boot uses a uniform resampling weight of $n^{-1}$ for each unit, where $n$ is the sample size. Importance weights are resampling probabilities intended to improve the efficiency of Monte Carlo estimates of bootstrap quantiles; see, e.g., the reference "On importance resampling for the bootstrap" for a description of importance weights and their intended purpose.

As for your problem: if you look at the boot function's source code, you will see the following:

if (!is.null(weights)) 
  weights <- t(apply(matrix(weights, n, length(R), 
                            byrow = TRUE), 2L, normalize, strata))

And digging a little deeper, you find that

boot:::normalize

normalizes the weights to sum to 1. So, essentially, when you provide a vector of 1's and 20's, it becomes a vector of 1/sum(vector)'s and 20/sum(vector)'s, and boot then draws the weight-20 units 20 times as often as the weight-1 units. It is a happy coincidence that this is precisely the correction needed to make your point estimate of the mean unbiased: the oversampling recenters the bootstrap distribution at the right value. But the spread of that distribution is not an estimate of the sampling variability of an inverse-probability-weighted mean, so quantile-based intervals built from it have no particular reason to attain nominal coverage. This is not the intended use of the weights argument, so it makes sense that the other estimated quantities come out wrong.
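To make the normalization concrete, here is a toy illustration of its essential effect (the actual call inside boot also handles strata):

w = c(1, 1, 1, 20, 20)
w / sum(w)  # resampling probabilities: the 20's are drawn 20x as often as the 1's
# [1] 0.02325581 0.02325581 0.02325581 0.46511628 0.46511628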

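If the goal is an IPW-corrected mean, one standard alternative (a minimal sketch, not from the original post, and worth checking by simulation before trusting its coverage) is to resample rows uniformly and apply the inverse-probability weights inside the statistic, so each bootstrap replicate is itself a weighted mean:

library(boot)

set.seed(1)
n = 5000
x = rnorm(n)
eta = 20
weight = rep(1, n)
weight[x < 0] = eta                              # inverse-probability-of-selection weights
keep = rbinom(n = n, size = 1, prob = 1/weight)  # selection step
d = data.frame(x, weight)[keep == 1, ]

# no weights argument: ordinary uniform resampling of the selected rows;
# the correction happens inside the statistic instead
boot.res = boot( data = d,
                 R = 500,
                 statistic = function(original, indices) {
                   b = original[indices, ]
                   weighted.mean(b$x, w = b$weight)  # IPW point estimate
                 } )

boot.ci(boot.res, type = "perc", index = 1)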