Solved – Confidence interval for the difference of two means using boot package in R

bootstrap, confidence interval, r

I have two samples, one of size 52, and one of size 31, that are obtained at different times. I'd like to get a 95% bootstrap confidence interval for the difference between the means of the populations that these samples represent. I've been trying to use the "boot" package in R, and I'm getting an error that I can't figure out. I was hoping someone here could help me out.

This is what my data look like (in a data frame named "totalData"):

      X              samplingTime
1  -0.29              initial
2   0.3               initial 
           ....
52  -1.2              initial
53   0.7              final
54  -1.2              final
           ....
83   1.52             final

This is what I did to get my bootstrap CI:

meanDiff = function(dataFrame, indexVector) { 
    m1 = mean(subset(dataFrame[, 1], dataFrame$samplingTime == "initial"))
    m2 = mean(subset(dataFrame[, 1], dataFrame$samplingTime == "final"))
    m = m1 - m2
    return(m)
}

totalBoot = boot(totalData, meanDiff, R = 10000, strata = totalData[,2])
totalBootCI = boot.ci(totalBoot)

and in the last line I get the error:

Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t.o, : estimated 
adjustment 'w' is infinite.

I'd very much appreciate any comments.

Thanks!

Best Answer

If you look at totalBoot$t, you will see that all the returned values are identical. The problem is that your statistic function (meanDiff) never uses its index argument, so it does not actually resample the data. The help page for boot says:

When sim = "parametric", the first argument to statistic must be the data. ... In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.
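To see the symptom for yourself, here is a minimal sketch. The data are simulated stand-ins for totalData (the values are made up, only the group sizes 52 and 31 come from the question), and meanDiffNoIndex is just a renamed copy of the statistic from the question, which accepts the index vector but ignores it:

```r
library(boot)

set.seed(1)
# Simulated stand-in for totalData (assumption: values are invented for illustration)
totalData <- data.frame(
  X = c(rnorm(52, mean = 0), rnorm(31, mean = 0.5)),
  samplingTime = factor(rep(c("initial", "final"), times = c(52, 31)),
                        levels = c("initial", "final"))
)

# Same logic as the question's statistic: indexVector is accepted but never used
meanDiffNoIndex <- function(dataFrame, indexVector) {
  m1 <- mean(dataFrame[dataFrame$samplingTime == "initial", 1])
  m2 <- mean(dataFrame[dataFrame$samplingTime == "final", 1])
  m1 - m2
}

brokenBoot <- boot(totalData, meanDiffNoIndex, R = 200,
                   strata = totalData$samplingTime)

# Every replicate equals the observed statistic, so there is no spread at all
nDistinct <- length(unique(as.vector(brokenBoot$t)))
replicateSd <- sd(brokenBoot$t)
```

With no variation among the replicates, boot.ci has nothing to build an interval from, which is consistent with the BCa error in the question.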

If you redefine your meanDiff as

meanDiff = function(dataFrame, indexVector) { 
    m1 = mean(subset(dataFrame[indexVector, 1], dataFrame[indexVector, 2] == "initial"))
    m2 = mean(subset(dataFrame[indexVector, 1], dataFrame[indexVector, 2] == "final"))
    m = m1 - m2
    return(m)
}

then it should work. Alternatively (not that it matters), I prefer:

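meanDiff = function(x, w) {
    # Group means of the resampled values, named by samplingTime
    y <- tapply(x[w, 1], x[w, 2], mean)
    # Index by name: positional y[1] - y[2] depends on the order of the factor levels
    y[["initial"]] - y[["final"]]
}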
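Putting the pieces together, the whole workflow might look roughly like the sketch below. The data are simulated (an assumption, since the real totalData is not shown in full), R = 2000 is an arbitrary choice, and only the percentile interval is requested; other interval types can be asked for through the type argument of boot.ci.

```r
library(boot)

set.seed(42)
# Simulated stand-in for totalData (assumption: values are invented for illustration)
totalData <- data.frame(
  X = c(rnorm(52, mean = 0), rnorm(31, mean = 0.5)),
  samplingTime = factor(rep(c("initial", "final"), times = c(52, 31)),
                        levels = c("initial", "final"))
)

# The statistic now subsets by indexVector, so each call sees a resampled data set
meanDiff <- function(dataFrame, indexVector) {
  resampled <- dataFrame[indexVector, ]
  mean(resampled$X[resampled$samplingTime == "initial"]) -
    mean(resampled$X[resampled$samplingTime == "final"])
}

# strata keeps the resampling within each group, so group sizes stay 52 and 31
totalBoot <- boot(totalData, meanDiff, R = 2000,
                  strata = totalData$samplingTime)

totalBootCI <- boot.ci(totalBoot, conf = 0.95, type = "perc")

# Columns 4 and 5 of the percent matrix hold the lower and upper limits
ciLimits <- totalBootCI$percent[4:5]
```

Passing strata as a factor keeps the two samples from being mixed during resampling, which matches the design in the question (two separate samples taken at different times).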

Related Solutions

Solved – Bootstrapping a t-test in R

I've never used the boot package. Bootstrapping is simple enough that you can code it from scratch. Below, I use t.test() with the defaults; you can set var.equal=TRUE, alternative="greater", etc., if you'd like. I set the seed, so your results should be identical to mine if you run the code unchanged. For the q-q plot against the t-distribution, I used the df that corresponds to equal variances (22 + 22 - 2 = 42), which won't quite match the bootstrap (where each iteration will have a different effective df). Under the null, p-values should be uniformly distributed, but yours clearly aren't. I'm not sure I'd draw any conclusions from that, though.

library(car)
white_matter <- read.table(text="   Control Patient
1   0.3329  0.3306
2   0.3458  0.3375
3   0.3500  0.3874
4   0.3680  0.3485
5   0.3421  0.3548
6   0.3403  0.3876
7   0.3447  0.3755
8   0.3330  0.3644
9   0.3450  0.3206
10  0.3764  0.3587
11  0.3646  0.3570
12  0.3482  0.3423
13  0.3734  0.3583
14  0.3436  0.3457
15  0.3348  0.3770
16  0.3553  0.3419
17  0.3281  0.3416
18  0.3567  0.3703
19  0.3390  0.3525
20  0.3287  0.3596
21  0.3603  0.3519
22  0.3533  0.3443", header=T)

set.seed(1315)
B      <- 1000
n      <- nrow(white_matter)   # 22 observations per group
t.vect <- numeric(B)           # bootstrapped t statistics
p.vect <- numeric(B)           # bootstrapped p-values
for(i in 1:B){
  # resample each group independently, with replacement
  boot.c <- sample(white_matter$Control, size=n, replace=TRUE)
  boot.p <- sample(white_matter$Patient, size=n, replace=TRUE)
  ttest  <- t.test(boot.c, boot.p)
  t.vect[i] <- ttest$statistic
  p.vect[i] <- ttest$p.value
}

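dev.new()   # portable replacement for windows(), which exists only on Windows
  qqPlot(t.vect, distribution="t", df=42)

[Figure: q-q plot of the bootstrapped t statistics against a t distribution with 42 df]

dev.new()
  qqPlot(p.vect, distribution="unif")

[Figure: q-q plot of the bootstrapped p-values against a uniform distribution]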

Solved – Bootstrapped confidence interval for the difference in means for paired data

The first method is not a resampling test that I'm aware of in the literature. It seems that your goal, by resampling $X$ and $Y$ independently, is to generate data under the null hypothesis. This approach is inefficient because it ignores the pairing in the design.

The preferred resampling method for generating data under the null hypothesis is the permutation test. Permutation testing for paired data is done by randomly negating the $X-Y$ differences; i.e. replacing them with $Y-X$. Here, the between-pair differences are preserved, but the within-pair differences are only preserved if the paired mean difference is 0.
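As a rough illustration of the sign-flipping idea, here is a minimal sketch in base R. The paired data x and y are simulated (an assumption, not data from the question), B = 5000 is an arbitrary number of random sign assignments, and the "+ 1" in the p-value is one common Monte Carlo convention that counts the observed statistic itself:

```r
set.seed(7)
# Hypothetical paired measurements (assumption: invented for illustration)
n <- 20
x <- rnorm(n, mean = 10, sd = 2)
y <- x + rnorm(n, mean = 0.5, sd = 1)

d   <- x - y      # within-pair differences
obs <- mean(d)    # observed mean difference

B <- 5000
permMeans <- replicate(B, {
  # Randomly negate each difference, i.e. swap X - Y for Y - X within a pair
  signs <- sample(c(-1, 1), size = n, replace = TRUE)
  mean(signs * d)
})

# Two-sided Monte Carlo p-value, counting the observed statistic itself
pValue <- (sum(abs(permMeans) >= abs(obs)) + 1) / (B + 1)
```

Only the signs of the differences change from one permutation to the next; their magnitudes, and therefore the pairing, stay intact.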

The second example is a proper description of a paired bootstrap.
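The second example from the question is not reproduced here, so the following is only a sketch of one common form of paired bootstrap: resample whole pairs with replacement (equivalently, resample the within-pair differences) and take percentile limits. The data are simulated and the names x, y, and bootMeans are illustrative assumptions.

```r
set.seed(7)
# Hypothetical paired measurements (assumption: invented for illustration)
n <- 20
x <- rnorm(n, mean = 10, sd = 2)
y <- x + rnorm(n, mean = 0.5, sd = 1)

B <- 5000
bootMeans <- replicate(B, {
  # Draw pair indices with replacement so each x stays with its own y
  idx <- sample.int(n, size = n, replace = TRUE)
  mean(x[idx] - y[idx])
})

# 95% percentile interval for the mean paired difference
ci <- quantile(bootMeans, probs = c(0.025, 0.975))
```

Because the same index vector is applied to both x and y, the pairing is preserved in every bootstrap sample, unlike resampling the two columns independently.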
