Bootstrap Methods – Bootstrapping Hierarchical/Multilevel Data (Resampling Clusters)

bootstrapfixed-effects-modelr

I am producing a script for creating bootstrap samples from the cats dataset (from the -MASS- package).

Following the Davidson and Hinkley textbook [1] I ran a simple linear regression and adopted a fundamental non-parametric procedure for bootstrapping from iid observations, namely pairs resampling.

The original sample is in the form:

Bwt   Hwt

2.0   7.0
2.1   7.2

...

1.9    6.8

Through an univariate linear model we want to explain cats hearth weight through their brain weight.

The code is:

library(MASS)
library(boot)


##################
#   CATS MODEL   #
##################

cats.lm <- glm(Hwt ~ Bwt, data=cats)
cats.diag <- glm.diag.plots(cats.lm, ret=T)


#######################
#   CASE resampling   #
#######################

cats.fit <- function(data) coef(glm(data$Hwt ~ data$Bwt)) 
statistic.coef <- function(data, i) cats.fit(data[i,]) 

bootl <- boot(data=cats, statistic=statistic.coef, R=999)

Suppose now that there exists a clustering variable cluster = 1, 2,..., 24 (for instance, each cat belongs to a given litter). For simplicity, suppose that data are balanced: we have 6 observations for each cluster. Hence, each of the 24 litters is made up of 6 cats (i.e. n_cluster = 6 and n = 144).

It is possible to create a fake cluster variable through:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

I have two related questions:

How to simulate samples in accordance with the (clustered) dataset strucure? That is, how to resample at the cluster level? I would like to sample the clusters with replacement and to set the observations within each selected cluster as in the original dataset (i.e. sampling with replacenment the clusters and without replacement the observations within each cluster).

This is the strategy proposed by Davidson (p. 100).
Suppose we draw B = 100 samples. Each of them should be composed by 24 possibly recurrent clusters (e.g. cluster = 3, 3, 1, 4, 12, 11, 12, 5, 6, 8, 17, 19, 10, 9, 7, 7, 16, 18, 24, 23, 11, 15, 20, 1), and each cluster should contain the same 6 observations of the original dataset. How to do that in R? (either with or without the -boot- package.) Do you have alternative suggestions for proceeding?

The second question concerns the initial regression model. Suppose I adopt a fixed-effects model, with cluster-level intercepts. Does it change the resampling procedure adopted?

[1] Davidson, A. C., Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge University press.

Best Answer

Resampling the whole clusters has been known in survey statistics for as long as any resampling methods have been used there at all (which is, since mid 1960s), so it is a well established method. See my collection of links at http://www.citeulike.org/user/ctacmo/tag/survey_resampling. Whether boot can do this or not, I don't know; I use survey package when I need to work with survey bootstraps, although the last time I checked, it did not have all the functionality I needed (like some small sample corrections, as far as I can recall).

I don't think applying a particular model such as fixed effects changes things much, but, IMO, the residual bootstrap makes a lot of strong assumptions (the residuals are i.i.d., the model is correctly specified). Every one of them is easily broken, and the cluster structure surely breaks the i.i.d. assumption.

There's been some econometrics literature on wild cluster bootstrap. They pretended they worked in vacuum without all those fifty years of survey statistics research into the topic, so I am not sure as to what to make of it.