bootstrap – Calculating Bootstrap Confidence Intervals and P-Values for Finite Populations

bootstrap, computational-statistics, finite-population, p-value, resampling

I am comparing the difference of medians between two groups of sample sizes $n_1$ and $n_2$. I would like to confirm that my bootstrap approach for a finite population size, without pooling the sample data, correctly provides a distribution function of the differences between samples. Below, I provide examples of the approaches I've looked at. Approach 1 is provided as a reference (assuming a large population). I would like to confirm that approach 3 is sound, while better understanding how to interpret the differences in results between approaches 2 and 3.

Assuming a large population, I can compute the distribution of medians for each group using bootstrapping with replacement. To check if the observed difference is due to random error, use the following approach:

Approach 1, assume large population

  1. pool the samples from the two groups into a list of length $n_1 + n_2$,
  2. shuffle the pool,
  3. split the pool into "simulated" groups–cutting the shuffled list into new lists of sizes $n_1$ and $n_2$,
  4. compute the median of each simulated group,
  5. compute the difference of the medians between the two groups,
  6. repeat steps 2-5 many times to build a set of median differences, and
  7. use the resulting cumulative distribution function of the median differences to understand the probability of observing various effect sizes due to chance (i.e., bin and count the results, then divide the counts by the total number of resamples).
    A similar example of this approach is in A.B. Downey's Think Stats (pg 105).
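As a minimal sketch of approach 1 (the variable names are my own, and the data are the survey responses used later in the toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_deltas(group_1, group_2, n_resamples=10_000):
    """Steps 2-6: shuffle the pooled data, split, and collect median differences."""
    pooled = np.concatenate([group_1, group_2])
    n1 = len(group_1)
    deltas = []
    for _ in range(n_resamples):
        rng.shuffle(pooled)                       # step 2: shuffle the pool
        sim_1, sim_2 = pooled[:n1], pooled[n1:]   # step 3: split into simulated groups
        deltas.append(np.median(sim_2) - np.median(sim_1))  # steps 4-5
    return np.array(deltas)

deltas = permutation_deltas([1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7, 2, 3, 4, 5])
# step 7: one-sided tail probability of seeing the observed difference (4 - 3 = 1) by chance
p_value = np.mean(deltas >= 1)
```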

Now, for a finite population size, A.C. Davison and D.V. Hinkley's "Bootstrap Methods and Their Application" provides methods to modify the resample size when bootstrapping statistics that estimate a population quantity, where the population has a known, finite size (pg 92). For example, given a finite population of size $N$, we can adjust the resample size upwards to $n'$, where $n'=(n-1)/(1-n/N)$. (As the sample size approaches the population size, we have more certainty in the estimate. By adjusting the resample size upwards as $n$ approaches $N$, we tighten the test statistic's distribution to reflect this increased certainty.)

I think that my above steps for shuffling a pool break down, because I'm now working with $n_1'$ and $n_2'$ sample sizes. So I went with the following approach:

Approach 2, fixed population

  1. compute $n_1'$ and $n_2'$,
  2. bootstrap the median test statistic for group 1 and group 2 many times,
  3. calculate the differences in medians between the groups (using the medians from step 2), and
  4. use the empirical cumulative distribution function of the resulting differences to explore the probabilities of observing given differences between the medians.
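As a sketch of approach 2 (my own variable names; for simplicity I round the adjusted sizes rather than alternating them as in the full code below):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_deltas_unpooled(group_1, group_2, N1, N2, n_resamples=10_000):
    """Approach 2: bootstrap each group separately at its adjusted resample size."""
    n1_prime = round((len(group_1) - 1) / (1 - len(group_1) / N1))
    n2_prime = round((len(group_2) - 1) / (1 - len(group_2) / N2))
    return np.array([
        np.median(rng.choice(group_2, size=n2_prime, replace=True))
        - np.median(rng.choice(group_1, size=n1_prime, replace=True))
        for _ in range(n_resamples)
    ])

deltas = bootstrap_deltas_unpooled(
    [1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7, 2, 3, 4, 5], N1=15, N2=20
)
```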

Is approach 2 correct? (It is similar to "Bootstrap sampling for ratio of means with uneven sample sizes".) This second approach feels different from the first, since I'm not pooling the data. My understanding is that by pooling, I'm testing whether the two samples could have been generated by the same underlying population. Approach 2 doesn't seem to accomplish this, since I'm not mixing the data before distributing it between the two samples.

Approach 3

My intuition is to do somewhat of a hybrid:

  1. pool groups 1 and 2, then
  2. resample (with replacement) from that pooled group two new groups of sizes $n_1'$ and $n_2'$, and then
  3. use steps 4 through 7 of approach 1.

If I weren't adjusting the group sizes for the finite population, I would shuffle the pooled data into new groups (without replacement) as in approach 1. Since I am instead resampling with replacement, how should I interpret the results? Is it still correct to think of the resulting distribution of deltas (fig_bsed_pool_deltas in the code below) as the probability of observing a given delta due to random error? Or is this a misapplication of the technique? One thing that bothers me is that I pool the data but then use the original group sizes, rather than setting the population of each group to the sum of population_size_1 and population_size_2.

For reference, here is a toy example with Python code implementing approach 3.

Suppose I'm at a middle school where I give the same lecture to both class 1 and class 2, with respective class sizes of 15 and 20 students. I suspect that class 2 likes the course better, since I teach that class after I have had my coffee. To assess attitude in the two classes, I survey 5 students in class 1 and 10 students in class 2. The responses from class 1 are {1,2,3,4,5}. The responses from class 2 are {2,3,4,5,6,7,2,3,4,5}. I want to know if the attitudes in the two classes taught by this teacher differ, say by more than a certain value x. (In this example, I happen to have ordered categorical responses–say a survey response from 1 to 7.)

Set up and define the inputs:

import numpy as np
import plotly.graph_objects as go

responses_1 = [1,2,3,4,5] #median is 3
responses_2 = [2,3,4,5,6,7,2,3,4,5] #median is 4
population_size_1 = 15
population_size_2 = 20
sam_pop_ratio_1 = len(responses_1)/population_size_1
sam_pop_ratio_2 = len(responses_2)/population_size_2

Approach 3:

def bootstrap_medians_pooled_approach(input_array_1, len_input_array_1, sam_pop_ratio_1, \
    input_array_2, len_input_array_2, sam_pop_ratio_2, \
    n_resamples):

    #sample 1
    adjusted_n_1 = (len_input_array_1 - 1)/(1 - sam_pop_ratio_1)
    ##some considerations for having a decimal adjusted_n_1
    base_adjusted_n_1 = int(adjusted_n_1)
    fraction_adjusted_n_1 = adjusted_n_1 - base_adjusted_n_1
    #create an array of sample 1 resample sizes
    ##alternate sizes to account for the fractional part of the adjustment
    adjusted_n_array_1 = [base_adjusted_n_1 + \
        int(np.random.choice([0,1], \
        p = [1 - fraction_adjusted_n_1, fraction_adjusted_n_1])) \
        for x in range(n_resamples)]

    #sample 2 (same setup as above for sample 1)
    adjusted_n_2 = (len_input_array_2 - 1)/(1 - sam_pop_ratio_2)
    base_adjusted_n_2 = int(adjusted_n_2)
    fraction_adjusted_n_2 = adjusted_n_2 - base_adjusted_n_2
    adjusted_n_array_2 = [base_adjusted_n_2 + \
        int(np.random.choice([0,1], \
        p = [1 - fraction_adjusted_n_2, fraction_adjusted_n_2])) \
        for x in range(n_resamples)]

    pooled_array = input_array_1 + input_array_2

    #create lists of resampled medians for group 1 and group 2
    medians_1 = [np.median(np.random.choice(pooled_array, size = x, replace = True)) \
        for x in adjusted_n_array_1]
    medians_2 = [np.median(np.random.choice(pooled_array, size = x, replace = True)) \
        for x in adjusted_n_array_2]

    #return the differences in medians
    return [m2 - m1 for m1, m2 in zip(medians_1, medians_2)]

n_resamples = 10000
bs_pool_delta = bootstrap_medians_pooled_approach(responses_1, len(responses_1), \
    sam_pop_ratio_1, \
    responses_2, len(responses_2), sam_pop_ratio_2, \
    n_resamples)

#visualize the distribution of delta results
fig_bsed_pool_deltas = go.Figure()
fig_bsed_pool_deltas.add_trace(go.Histogram(x = bs_pool_delta))
fig_bsed_pool_deltas.show()

#explore the chance that a delta of a given size or larger might be observed by random chance
deltas = [0.25 * x for x in range(-28, 28)]
bsed_p_values_pool = [np.mean([d >= delta for d in bs_pool_delta]) for delta in deltas]
fig_ps_bs = go.Figure()
fig_ps_bs.add_trace(go.Scatter(x = deltas, y = bsed_p_values_pool))
fig_ps_bs.show()

Best Answer

  • Pooling the data is only allowed if you can reasonably make the assumption of equal distributions: not only should the medians be equal under the null hypothesis, but the other distribution parameters, like the variance, should be the same as well.

    By pooling the groups you will get a more precise estimate of the distribution of the statistic, because you are using a more precise estimate of the empirical distribution of the data (an estimate that improves when we have more datapoints).

  • Approach 2, without pooling the data, also works if the two groups have different distributions.

    With this method you do have to think about the interpretation of the distribution. Example with two beta distributions shifted such that their medians are 0:

    [figure: example with shifted beta distributions]

    I have chosen the parameters to create a difficult situation on purpose. Here the sampling distribution of the experiment has some skewness and the right tail is stretched out further than the left tail.

    I also chose a random seed such that the outcome is far in the left tail. This situation shows that the bootstrap does mimic the skewness of the distribution, but as a hypothesis test, one should consider shifting the bootstrapped distribution to be centered around zero, instead of around the observed median. The probability that the bootstrapped sample has a median of zero or larger is different from the probability that the sampling distribution takes the observed value or smaller.

Example code:

set.seed(2)
n = 31

### create some data from distributions with zero median
alpha=0.25
beta=2
x = rbeta(n,alpha,beta)-qbeta(0.5,alpha,beta)
y = rbeta(n,beta,alpha)-qbeta(0.5,beta,alpha)

### order the datapoints 
x = x[order(x)]
y = y[order(y)]

### bootstrapping based probability distribution of sampled medians 
k = 1:n
m = (n-1)/2
p = (1/n)*(k/n)^m*((n-k)/n)^m*factorial(n)/factorial(m)^2


### create tables for convolution
mS = outer(x,y,"-") # domain 
mP = outer(p,p,"*") # probabilities
### compute an estimate for density of median(x)-median(y)
f = density(mS, weights=mP, n=2/0.005, bw = 0.005, kernel = "rectangular" , from = -1, to = 1)
brks = seq(-1,1,0.005)

#### creating sampling distribution estimates
#### based on repeating the experiment 
experiment = function() {
  x = rbeta(n,alpha,beta)-qbeta(0.5,alpha,beta)
  y = rbeta(n,beta,alpha)-qbeta(0.5,beta,alpha)
  return(median(x)-median(y))
}
m_sample = replicate(10^5, experiment())


### plot histogram 

hist(m_sample, breaks = brks, xlim = c(-0.1,0.25), freq = 0, main = "estimate for density of median(x)-median(y) \n density curve based on bootstrap \n histogram based on re-sampling true distribution" , ylim =c(0,25))
lines(f)


### plotting other stuff

lines(c(1,1)*(median(x)-median(y)),c(0,25),lty=2,col =2)
text((median(x)-median(y)),15,"observed value",col =2,srt=90,pos =4)
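The shift suggested in the answer can be sketched as follows (this is my own Python illustration with normal data, not the answerer's beta-distribution code): recenter the bootstrapped distribution of deltas at zero, then read off the tail probability at the observed delta.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=31)
y = rng.normal(0.5, 1.0, size=31)
observed = np.median(y) - np.median(x)

# bootstrap the difference of medians (centered near the observed delta)
boot = np.array([
    np.median(rng.choice(y, size=len(y), replace=True))
    - np.median(rng.choice(x, size=len(x), replace=True))
    for _ in range(10_000)
])

# naive tail: P(bootstrapped delta <= 0), read directly off the bootstrap distribution
p_naive = np.mean(boot <= 0)
# shifted: recenter the bootstrap distribution around zero,
# then ask P(delta >= observed) under that null-centered distribution
p_shifted = np.mean(boot - observed >= observed)
```

The two tail probabilities differ whenever the bootstrap distribution is skewed, which is exactly the situation the answer's beta-distribution example was constructed to show.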