Solved – Stratified bootstrapping and confidence intervals

bootstrap, r, resampling, stratification

Here's a toy data set that replicates my problem. I am interested in the confidence interval for an empirical distribution composed of the scores from each school, taken in the proportions in which student "A" has scores from those schools.

set.seed(1)
rows = 50
df <- data.frame(student = sample(LETTERS[1:3],rows,replace=TRUE),
                 school = sample(c("F","G"),rows,replace=TRUE),
                 score = sample(1:10,rows,replace=TRUE,prob = c(rep(0.05,7),rep(0.2167,3)))
                 )
head(df)
student school score
1       A      F     3
2       B      G     9
3       B      F     9
4       C      F     1
5       A      F    10
6       C      F     8

In this example: student "A" has 3 scores from school "G" and 9 scores from school "F":

> df[df$student=="A",]
   student school score
1        A      F     3
5        A      F    10
10       A      F    10
11       A      G     1
12       A      F     6
22       A      G    10
24       A      F     8
25       A      F     7
27       A      G    10
34       A      F    10
38       A      F    10
47       A      F     8

How do I generate bootstrap samples that draw 12 scores in the same school proportions as student "A" (9 from school "F" and 3 from school "G")? I need to calculate the CI of the expected score of an average student whose scores come from the schools in student "A"'s proportions.

I looked through the help for the boot function in the "boot" package. There is an example of a stratified bootstrap, but I don't understand what stype is doing. I understand stype="i", but I don't understand what happens with stype="w" or "f", or how to use them.

Best Answer

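stype tells boot what the second argument of your statistic function represents: resampled indices ("i", the default), frequencies ("f"), or weights ("w"). It matters when you want to compute a statistic from frequencies or weights rather than from indices. In your case, I don't think it applies.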
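To make the three stype values concrete, here is a minimal sketch, assuming only the boot package. The vector x holds student "A"'s 12 scores as printed in the question; the names mean_i, mean_f and mean_w are made up for illustration. Each function computes the same mean, just from a different kind of second argument.

```r
library(boot)

# Student "A"'s scores, copied from the question's printed output
x <- c(3, 10, 10, 1, 6, 10, 8, 7, 10, 10, 10, 8)

# stype = "i" (default): second argument is a vector of resampled indices
mean_i <- function(d, i) mean(d[i])

# stype = "f": second argument counts how often each observation was drawn
mean_f <- function(d, f) sum(d * f) / sum(f)

# stype = "w": second argument holds the frequencies rescaled to weights;
# dividing by sum(w) keeps the statistic valid however they are normalised
mean_w <- function(d, w) sum(d * w) / sum(w)

set.seed(42)
b_i <- boot(x, mean_i, R = 200)
set.seed(42)
b_f <- boot(x, mean_f, R = 200, stype = "f")
set.seed(42)
b_w <- boot(x, mean_w, R = 200, stype = "w")

# The observed statistic is the plain sample mean in all three cases
c(b_i$t0, b_f$t0, b_w$t0)
```

For a plain mean the three forms are interchangeable, which is why the default stype="i" is enough here. The "f" and "w" forms are mainly useful when the statistic is naturally written in terms of counts or weights.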

Most likely you need to split the data.frame by student first and apply boot to each group; this ensures each resample has the same number of observations as the original A/B/C group. Within each group, you use the strata argument (here, the school) to keep the same school proportions. Below I apply a function to get the mean:

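library(boot)
stat = function(d,i)mean(d[i,"score"])  # mean score of the resampled rows

# Not stratified as written; add strata = factor(g$school) to fix the school proportions
bo = by(df,df$student,function(g)boot(g,stat,R=100))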

Then, to get the confidence intervals:

lapply(bo,boot.ci,type="basic")
$A
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
FUN(boot.out = X[[i]], type = "basic")

Intervals : 
Level      Basic         
95%   ( 5.964,  9.539 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable

$B
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
FUN(boot.out = X[[i]], type = "basic")

Intervals : 
Level      Basic         
95%   ( 6.026,  8.565 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable

$C
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
FUN(boot.out = X[[i]], type = "basic")

Intervals : 
Level      Basic         
95%   ( 7.348,  8.812 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
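
As a check on the stratification step, here is a minimal sketch for student "A" only, assuming the 12 rows printed in the question. The data frame a and the function stat_check are made up for illustration. The statistic also returns the number of resampled rows from school "G", so you can confirm that every replicate keeps the 9 "F" / 3 "G" split.

```r
library(boot)

# Student "A"'s rows, copied from the question's printed output
a <- data.frame(
  school = c("F", "F", "F", "G", "F", "G", "F", "F", "G", "F", "F", "F"),
  score  = c(3, 10, 10, 1, 6, 10, 8, 7, 10, 10, 10, 8)
)

# Returns the mean score and, as a check, the count of "G" rows in the resample
stat_check <- function(d, i) {
  c(mean(d[i, "score"]), sum(d[i, "school"] == "G"))
}

set.seed(123)
# strata makes boot resample within each school separately,
# so each replicate has 9 rows from "F" and 3 rows from "G"
bo_a <- boot(a, stat_check, R = 1000, strata = factor(a$school))

range(bo_a$t[, 2])                        # "G" count across all replicates
boot.ci(bo_a, type = "perc", index = 1)   # percentile CI for the mean score
```

R = 1000 is used here instead of 100 only to make the interval less noisy; the value is a choice for this sketch, not something the question requires.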