Solved – Bootstrapping and comparing mean distributions

bootstrap, statistical significance, t-test

Is the following a reasonable approach to assess the statistical significance of the difference between two groups?

For each group
1) Subsample with replacement
2) Take the mean of the subsample
3) Repeat 10,000 times to build a distribution of means
4) Carry out a t-test to assess the difference between those two distributions

(i.e. bootstrapping to build a distribution of means)

The two datasets are very different in size (~100 vs. 100,000). The alternative approach would be to subsample from each to build two equally sized datasets, and then use a t-test on those two samples. The problem I have with this is I'm not sure if the smaller of the two sets is normally distributed (while the larger is), which may invalidate the t-test assumptions?

Best Answer

This is not how you would run a simulation-based test (and what you need here is not a bootstrap test). What you want to do is mix all the data together, randomly redivide them into two new groups of the original sizes, find the mean of each group, take the difference, and plot it. Repeat this many times, 10,000 for instance. You can then find a p-value by counting all the results that are as or more extreme than your observed result (the original difference in means) and dividing by 10,000. This is a non-parametric counterpart of the t-test called a permutation test. You could instead use a t-test for the difference in means, but it makes more assumptions about the data than the permutation test does.
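The permutation test described above can be sketched in R roughly as follows. This is a minimal illustration, not the answerer's own code: the two groups are simulated stand-ins for the asker's data (sizes 100 and 100,000), and `perm_test` is a hypothetical helper name.

```r
# Permutation test for a difference in means (illustrative sketch).
# ASSUMPTION: the data are simulated placeholders, not the asker's data.
set.seed(42)
small <- rexp(100, rate = 1)                # hypothetical small group
large <- rnorm(100000, mean = 1, sd = 1)    # hypothetical large group

perm_test <- function(x, y, n_perm = 10000) {
  observed <- mean(x) - mean(y)
  pooled   <- c(x, y)                       # mix all the data together
  n_all    <- length(pooled)
  n_x      <- length(x)
  total    <- sum(pooled)
  perm_diffs <- replicate(n_perm, {
    idx <- sample.int(n_all, n_x)           # randomly relabel n_x values as "group x"
    sx  <- sum(pooled[idx])
    # mean of the relabelled group minus mean of everything else
    sx / n_x - (total - sx) / (n_all - n_x)
  })
  # two-sided p-value: share of permuted differences at least as extreme
  p_value <- mean(abs(perm_diffs) >= abs(observed))
  list(observed = observed, perm_diffs = perm_diffs, p_value = p_value)
}

res <- perm_test(small, large)
res$p_value
```

To look at the result, `hist(res$perm_diffs)` followed by `abline(v = res$observed)` shows where the observed difference falls in the permutation distribution. Note that the unequal group sizes pose no problem here, because each relabelling keeps the original sizes.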

Related Solutions

Solved – How to compare bootstrapped regression slopes

Bootstrapping is done to get a more robust picture of the sampling distribution than that which is assumed by large sample theory. When you bootstrap, there is effectively no limit to the number of 'bootsamples' you take; in fact, you get a better approximation to the sampling distribution the more bootsamples you take. It is common to use $B=10{,}000$ bootsamples, although there is nothing magical about that number. Furthermore, you don't run a test on the bootsamples; you have an estimate of the sampling distribution, so use it directly. Here's an algorithm:

1) take a bootsample of one data set by sampling $n_1$ boot-observations with replacement. [Regarding the comments below, one relevant question is what constitutes a valid 'boot-observation' to use for your bootsample. In fact, there are several legitimate approaches; I will mention two that are robust and allow you to mirror the structure of your data: When you have observational data (i.e., the data were sampled on all dimensions), a boot-observation can be an ordered n-tuple (e.g., a row from your data set). For example, if you have one predictor variable and one response variable, you would sample $n_1$ $(x,y)$ ordered pairs. On the other hand, when working with experimental data, predictor variable values were not sampled, but experimental units were assigned to intended levels of each predictor variable. In a case like this, you can sample $n_{1j}$ $y$ values from within each of the $j$ levels of your predictor variable, then pair those $y$ values with the corresponding value of that predictor level. In this manner, you would not sample over $X$.]
2) fit your regression model and store the slope estimate (call it $\hat\beta_1$)
3) take a bootsample of the other data set by sampling $n_2$ boot-observations with replacement
4) fit the other regression model and store the slope estimate (call it $\hat\beta_2$)
5) form a statistic from the two estimates (suggestion: use the slope difference $\hat\beta_1-\hat\beta_2$)
6) store the statistic and dump the other info so as not to waste memory
7) repeat steps 1 - 6, $B=10{,}000$ times
8) sort the bootstrapped sampling distribution (bsd) of slope differences
9) compute the percentage of the bsd that lies beyond 0 (whichever is smaller, the right tail % or the left tail %)
10) multiply this percentage by 2
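The steps above might look something like this in R. It is a sketch under stated assumptions: the two data sets `d1` and `d2` are simulated placeholders, `boot_slope` is a hypothetical helper, rows are resampled as $(x,y)$ pairs (the observational-data case), and $B$ is reduced to 2,000 only to keep the run short.

```r
# Bootstrap test for a difference in regression slopes (illustrative sketch).
# ASSUMPTION: d1 and d2 are simulated stand-ins for the two real data sets.
set.seed(1)
n1 <- 60
n2 <- 80
d1 <- data.frame(x = runif(n1))
d1$y <- 1 + 2.0 * d1$x + rnorm(n1, sd = 0.5)
d2 <- data.frame(x = runif(n2))
d2$y <- 1 + 1.0 * d2$x + rnorm(n2, sd = 0.5)

# Steps 1-2 (and 3-4): resample rows with replacement, refit, keep the slope
boot_slope <- function(d) {
  b <- d[sample.int(nrow(d), replace = TRUE), ]
  coef(lm(y ~ x, data = b))[["x"]]
}

B <- 2000  # the text suggests 10,000; smaller here for speed
# Steps 5-7: keep only the slope difference from each iteration
slope_diff <- replicate(B, boot_slope(d1) - boot_slope(d2))

# Steps 8-10: tail proportions on either side of 0, smaller one doubled
left    <- mean(slope_diff <= 0)
right   <- mean(slope_diff >= 0)
p_value <- min(1, 2 * min(left, right))

# A percentile interval for the slope difference comes from the same draws
ci <- quantile(slope_diff, c(0.025, 0.975))
p_value
ci
```

Taking `mean()` of a logical comparison gives the tail proportion directly, so the explicit sort in step 8 is not needed in code; it only helps when inspecting the distribution by eye.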

The logic of this algorithm as a statistical test is fundamentally similar to classical tests (e.g., t-tests), but you are not assuming that the data or the resulting sampling distributions have any particular distribution. (For example, you are not assuming normality.) The primary assumption you are making is that your data are representative of the population you sampled from / want to generalize to. That is, the sample distribution is similar to the population distribution. Note that, if your data are not related to the population you're interested in, you are flat out of luck.

Some people worry about using, e.g., a regression model to determine the slope if you're not willing to assume normality. However, this concern is mistaken. The Gauss-Markov theorem tells us that the estimate is unbiased (i.e., centered on the true value), so it's fine. The lack of normality simply means that the true sampling distribution may be different from the theoretically posited one, and so the p-values are invalid. The bootstrapping procedure gives you a way to deal with this issue.
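A quick simulation can make the unbiasedness point concrete. The sketch below assumes a simple one-predictor model with deliberately skewed, mean-zero errors; all values (`true_slope`, the sample size, the error distribution) are made up for illustration.

```r
# OLS slope stays centered on the true value even with non-normal errors.
# ASSUMPTION: all settings below are arbitrary illustrative choices.
set.seed(7)
n <- 30
x <- runif(n)                   # predictor values, held fixed across simulations
true_slope <- 2

slopes <- replicate(2000, {
  e <- rexp(n, rate = 1) - 1    # skewed errors with mean zero
  y <- 1 + true_slope * x + e
  cov(x, y) / var(x)            # OLS slope for simple linear regression
})

mean(slopes)                    # should land close to true_slope
```

The average of the simulated slopes sits near the true slope, while a histogram of `slopes` would show whether their distribution departs from the normal shape that the classical p-values rely on.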

Two other issues regarding bootstrapping: First, if the classical assumptions are met, bootstrapping is less efficient (i.e., has less power) than a parametric test. Second, bootstrapping works best when you are exploring near the center of a distribution: means and medians are good, quartiles not so good, and bootstrapping the min or max necessarily fails. Regarding the first point, you may not need to bootstrap in your situation; regarding the second point, bootstrapping the slope is perfectly fine.
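To see why bootstrapping an extreme fails, consider the sample maximum. A bootstrap sample can never exceed the observed maximum, and it reproduces that maximum exactly whenever the largest observation is drawn at least once, which happens with probability $1-(1-1/n)^n$ (about 0.63 for large $n$). A small sketch with a made-up sample:

```r
# Bootstrap distribution of the maximum piles up on the observed maximum.
# ASSUMPTION: x is an arbitrary simulated sample used only for illustration.
set.seed(3)
x <- runif(50)
B <- 5000
boot_max <- replicate(B, max(sample(x, replace = TRUE)))

prop_at_max <- mean(boot_max == max(x))  # share of bootsamples hitting the observed max
theory      <- 1 - (1 - 1/50)^50         # probability the largest value is drawn
c(simulated = prop_at_max, theoretical = theory)
```

Because most of the bootstrap mass sits on a single point, the resulting "sampling distribution" says very little about how the true maximum would vary across samples.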

Solved – Bootstrapping a t-test in R

I've never used the boot package. Bootstrapping is so trivial you can just code it from scratch. Below, I just use t.test() with the defaults; you can choose var.equal=T, alternative="greater", etc., if you'd like. I set the seed, so your results would be identical, if you don't do anything different. For the qq-plot for the t-distribution, I used the df that corresponds to equal variances, which won't quite match the bootstrap (where each iteration will have a different effective df). Under the null, p-values should be uniformly distributed, but yours clearly aren't. I'm not sure I'd draw any conclusions from that, though.

library(car)  # provides qqPlot()
white_matter <- read.table(text="   Control Patient
1   0.3329  0.3306
2   0.3458  0.3375
3   0.3500  0.3874
4   0.3680  0.3485
5   0.3421  0.3548
6   0.3403  0.3876
7   0.3447  0.3755
8   0.3330  0.3644
9   0.3450  0.3206
10  0.3764  0.3587
11  0.3646  0.3570
12  0.3482  0.3423
13  0.3734  0.3583
14  0.3436  0.3457
15  0.3348  0.3770
16  0.3553  0.3419
17  0.3281  0.3416
18  0.3567  0.3703
19  0.3390  0.3525
20  0.3287  0.3596
21  0.3603  0.3519
22  0.3533  0.3443", header=T)

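set.seed(1315)
B      <- 1000
n.c    <- length(white_matter$Control)  # 22 controls
n.p    <- length(white_matter$Patient)  # 22 patients
t.vect <- numeric(B)                    # bootstrapped t statistics
p.vect <- numeric(B)                    # bootstrapped p-values
for(i in 1:B){
  # resample each group separately, with replacement
  boot.c <- sample(white_matter$Control, size=n.c, replace=TRUE)
  boot.p <- sample(white_matter$Patient, size=n.p, replace=TRUE)
  # Welch t-test (the t.test() default) on the two bootsamples
  ttest  <- t.test(boot.c, boot.p)
  t.vect[i] <- ttest$statistic
  p.vect[i] <- ttest$p.value
}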

dev.new()  # open a new plot window (windows() works only on Windows)
  qqPlot(t.vect, distribution="t", df=42)


dev.new()
  qqPlot(p.vect, distribution="unif")

