Solved – How to compare bootstrapped regression slopes

bootstrap, regression, statistical significance

Let us assume I have two data sets, each consisting of $n$ observations of data pairs of an independent variable $x$ and a dependent variable $y$. Let us further assume I want to generate a distribution of regression slopes for each data set by bootstrapping the observations (resampling with replacement) $N$ times and calculating the regression $y = a + bx$ each time. How do I compare the two distributions in order to say that the slopes are significantly different? A U-test for the difference between the medians of the two distributions would depend heavily on $N$; that is, the more often I repeat the bootstrapping, the more significant the difference would appear. How do I have to calculate the overlap between the distributions to determine whether there is a significant difference?

Best Answer

Bootstrapping is done to get a more robust picture of the sampling distribution than the one assumed by large-sample theory. When you bootstrap, there is effectively no limit to the number of 'bootsamples' you take; in fact, you get a better approximation to the sampling distribution the more bootsamples you take. It is common to use $B=10,000$ bootsamples, although there is nothing magical about that number. Furthermore, you don't run a test on the bootsamples; you have an estimate of the sampling distribution, so use it directly. Here's an algorithm (a code sketch follows the list):

  1. take a bootsample of one data set by sampling $n_1$ boot-observations with replacement. [One relevant question is what constitutes a valid 'boot-observation' to use for your bootsample. In fact, there are several legitimate approaches; I will mention two that are robust and allow you to mirror the structure of your data. When you have observational data (i.e., the data were sampled on all dimensions), a boot-observation can be an ordered $n$-tuple (e.g., a row from your data set). For example, if you have one predictor variable and one response variable, you would sample $n_1$ $(x,y)$ ordered pairs. On the other hand, when working with experimental data, the predictor variable values were not sampled; rather, experimental units were assigned to intended levels of each predictor variable. In a case like this, you can sample $n_{1j}$ $y$ values from within each of the $j$ levels of your predictor variable, then pair those $y$s with the corresponding value of that predictor level. In this manner, you would not sample over $X$. Both schemes are sketched in code after the list.]
  2. fit your regression model and store the slope estimate (call it $\hat\beta_1$)
  3. take a bootsample of the other data set by sampling $n_2$ boot-observations with replacement
  4. fit the other regression model and store the slope estimate (call it $\hat\beta_2$)
  5. form a statistic from the two estimates (suggestion: use the slope difference $\hat\beta_1-\hat\beta_2$)
  6. store the statistic and dump the other info so as not to waste memory
  7. repeat steps 1-6 $B=10,000$ times
  8. sort the bootstrapped sampling distribution of slope differences
  9. compute the percentage of the bootstrapped sampling distribution that lies beyond 0 (whichever is smaller, the right-tail percentage or the left-tail percentage)
  10. multiply this percentage by 2; this is your two-tailed p-value
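
To make the steps concrete, here is a minimal Python sketch of the algorithm for the observational-data case, where each boot-observation is an $(x,y)$ pair. The function names, the NumPy routines, and the simulated example data are my own illustrative choices, not part of the original recipe:

```python
import numpy as np

def boot_slope(x, y, rng):
    """Steps 1/3: resample (x, y) pairs with replacement; steps 2/4: fit and return the slope."""
    idx = rng.integers(0, len(x), size=len(x))
    slope, _intercept = np.polyfit(x[idx], y[idx], 1)  # deg=1 returns [slope, intercept]
    return slope

def bootstrap_slope_difference(x1, y1, x2, y2, B=10_000, seed=0):
    """Bootstrapped sampling distribution of beta1_hat - beta2_hat and a two-tailed p-value."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(B)
    for b in range(B):                                                # step 7: repeat B times
        diffs[b] = boot_slope(x1, y1, rng) - boot_slope(x2, y2, rng)  # steps 1-6
    diffs.sort()                                                      # step 8
    tail = min(np.mean(diffs <= 0.0), np.mean(diffs >= 0.0))          # step 9: smaller tail beyond 0
    return diffs, 2.0 * tail                                          # step 10

# Usage with simulated data (purely illustrative):
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, 50); y1 = 1.0 + 0.5 * x1 + rng.normal(0, 1, 50)
x2 = rng.uniform(0, 10, 60); y2 = 1.0 + 0.8 * x2 + rng.normal(0, 1, 60)
diffs, p = bootstrap_slope_difference(x1, y1, x2, y2)
print(p)
```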
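
For the experimental-data variant mentioned in step 1, where $X$ is fixed by design, a hypothetical sketch of resampling $y$ within each predictor level (again, the function name and NumPy usage are my assumptions) would replace `boot_slope` above:

```python
import numpy as np

def boot_slope_within_levels(levels, y, rng):
    """Resample y with replacement within each fixed predictor level,
    keep X as designed, then fit the regression and return the slope."""
    x_boot, y_boot = [], []
    for level in np.unique(levels):
        y_level = y[levels == level]
        y_boot.append(rng.choice(y_level, size=len(y_level), replace=True))
        x_boot.append(np.full(len(y_level), level))
    slope, _intercept = np.polyfit(np.concatenate(x_boot), np.concatenate(y_boot), 1)
    return slope
```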

The logic of this algorithm as a statistical test is fundamentally similar to that of classical tests (e.g., t-tests), but you are not assuming that the data or the resulting sampling distributions have any particular distribution. (For example, you are not assuming normality.) The primary assumption you are making is that your data are representative of the population you sampled from / want to generalize to; that is, that the sample distribution is similar to the population distribution. Note that, if your data are not related to the population you're interested in, you are flat out of luck.

Some people worry about using, e.g., a regression model to determine the slope if you're not willing to assume normality. However, this concern is mistaken. The Gauss-Markov theorem tells us that the estimate is unbiased (i.e., centered on the true value), so using the regression slope is fine. The lack of normality simply means that the true sampling distribution may differ from the theoretically posited one, so the usual p-values are invalid. The bootstrapping procedure gives you a way to deal with this issue.
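
As a quick illustration of the unbiasedness point (my own simulation, not from the original answer), the average OLS slope over many samples with strongly skewed errors still lands on the true slope:

```python
import numpy as np

rng = np.random.default_rng(1)
true_slope, n, n_sims = 0.5, 40, 5000

estimates = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 10, n)
    errors = rng.exponential(scale=1.0, size=n) - 1.0  # skewed, but mean-zero
    y = 2.0 + true_slope * x + errors
    estimates[i], _ = np.polyfit(x, y, 1)

print(np.mean(estimates))  # close to 0.5 despite the non-normal errors
```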

Two other issues regarding bootstrapping are worth noting. First, if the classical assumptions are met, bootstrapping is less efficient (i.e., has less power) than a parametric test. Second, bootstrapping works best when you are exploring near the center of a distribution: means and medians are good, quartiles not so good, and bootstrapping the min or max will necessarily fail. Regarding the first point, you may not need to bootstrap in your situation; regarding the second point, bootstrapping the slope is perfectly fine.
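
To see why bootstrapping the max fails, consider this hypothetical sketch: each bootsample contains the observed maximum with probability $1-(1-1/n)^n \approx 63\%$, so the bootstrap distribution of the max piles up on a single value instead of approximating a sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0, 1, 100)

boot_maxes = np.array([
    rng.choice(data, size=len(data), replace=True).max()
    for _ in range(10_000)
])

# Roughly 63% of bootsamples reproduce the observed sample maximum exactly,
# so the bootstrap "distribution" of the max is degenerate.
print(np.mean(boot_maxes == data.max()))
```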