Regression – Choosing the Right Bootstrapped Regression Model

bootstrap, logistic, multiple regression, regression

I have a binary logistic regression model with one DV (disease: yes/no) and five predictors (demographics [age, gender, tobacco smoking (yes/no)], an ordinal medical index, and one randomized treatment [yes/no]). I have also modeled all the two-way interaction terms. The main variables are centered, and there is no sign of multicollinearity (all VIFs < 2.5).

I have some questions:

  1. Is bootstrapping advantageous over my single model? If so,

  2. which bootstrapped model should I choose? I wanted to see whether bootstrapping algorithms create new samples randomly or follow a fixed procedure, so I ran several attempts of 1000 resamples each (giving me several bootstrapped models, each based on 1000 trials). However, the coefficients of the bootstrapped model differ between runs, even though the number of trials is always 1000. So which one should I choose for my report? Some changes are tiny and don't affect the significance of my coefficients, but some make certain coefficients non-significant (only those with p-values close to 0.05 in the original model, which change to 0.06, for example).

  3. Should I choose a higher number, like 10,000? How can I determine this limit?

  4. Again, should I bootstrap in the first place? If its results vary each time, can I rely on them?

  5. Do you have any other ideas in mind that can help me with my case?

Many many thanks.

Best Answer

Bootstrapping is a resampling method to estimate the sampling distribution of your regression coefficients and therefore calculate the standard errors/confidence intervals of your regression coefficients. This post has a nice explanation. For a discussion of how many replications you need, see this post.

  1. The nonparametric bootstrap repeatedly draws random samples of your observations with replacement (i.e. some observations are drawn once, others multiple times, and some not at all), then fits the logistic regression and stores the coefficients. This is repeated $R$ times, so with $R = 10'000$ you end up with 10'000 sets of regression coefficients, which can then be used to calculate confidence intervals. Because a pseudo-random number generator is used, you can set the seed to an arbitrary number to ensure that you get exactly the same results each time (see example below). To get really stable estimates, I would suggest more than 1000 replications, maybe 10'000. You could run the bootstrap several times and see whether the estimates change much between 1000 and 10'000 replications. In plain English: increase the number of replications until the results converge. If your bootstrap estimates differ from the observed, single model, this could indicate that the observed model does not appropriately reflect the structure of your sample. The function boot in R, for example, reports the "bias", which is the difference between the regression coefficients of your single model and the mean of the bootstrap samples.
  2. When performing the bootstrap, you are not interested in a single bootstrap sample, but in the distribution of statistics (e.g. regression coefficients) over the, say, 10'000 bootstrap samples.
  3. I'd say 10'000 is better than 1000. With modern computers, this shouldn't pose a problem. In the example below, it took my PC around 45 seconds to draw 10'000 samples. This varies with your sample size, of course: the bigger your sample, the more iterations you need to ensure that every observation is taken into account.
  4. What do you mean by "the results vary each time"? Recall that in every bootstrap step, the observations are drawn anew with replacement, so you're likely to end up with slightly different regression coefficients because the resampled observations differ. But as I've said: you are not really interested in the result of a single bootstrap sample. When your number of replications is high enough, the bootstrap should yield very similar confidence intervals and point estimates every time.
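The resampling loop described in point 1 can also be written out by hand. This is a minimal hand-rolled sketch (my own illustration, not part of the original answer), using the same UCLA admissions data and model as the boot example below:

```r
# Hand-rolled nonparametric bootstrap: resample rows with replacement,
# refit the model, and store the coefficients each time.
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)

set.seed(1)          # fix the RNG so the run is reproducible
nboot <- 1000        # number of bootstrap replications

boot.coefs <- replicate(nboot, {
  idx <- sample(nrow(mydata), replace = TRUE)  # some rows drawn several times, some not at all
  coef(glm(admit ~ gre + gpa + rank, data = mydata[idx, ], family = "binomial"))
})

# Simple percentile 95% CIs from the empirical distribution of each coefficient
apply(boot.coefs, 1, quantile, probs = c(0.025, 0.975))
```

The boot package below does the same job more conveniently and additionally offers bias-corrected ("bca") intervals, which are generally preferable to the raw percentile intervals shown here.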

Here is an example in R:

library(boot)

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

head(mydata)

mydata$rank <- factor(mydata$rank)

my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

summary(my.mod)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Set up the non-parametric bootstrap

logit.bootstrap <- function(data, indices) {

  d <- data[indices, ]
  fit <- glm(admit ~ gre + gpa + rank, data = d, family = "binomial")

  return(coef(fit))
}

set.seed(12345) # seed for the RNG to ensure that you get exactly the same results as here

logit.boot <- boot(data=mydata, statistic=logit.bootstrap, R=10000) # 10'000 samples

logit.boot

Bootstrap Statistics :
        original        bias    std. error
t1* -3.989979073 -7.217244e-02 1.165573039
t2*  0.002264426  4.054579e-05 0.001146039
t3*  0.804037549  1.440693e-02 0.354361032
t4* -0.675442928 -8.845389e-03 0.329099277
t5* -1.340203916 -1.977054e-02 0.359502576
t6* -1.551463677 -4.720579e-02 0.444998099

# Calculate confidence intervals (Bias corrected ="bca") for each coefficient

boot.ci(logit.boot, type="bca", index=1) # intercept
95%   (-6.292, -1.738 )  
boot.ci(logit.boot, type="bca", index=2) # gre
95%   ( 0.0000,  0.0045 ) 
boot.ci(logit.boot, type="bca", index=3) # gpa
95%   ( 0.1017,  1.4932 )
boot.ci(logit.boot, type="bca", index=4) # rank2
95%   (-1.3170, -0.0369 )
boot.ci(logit.boot, type="bca", index=5) # rank3
95%   (-2.040, -0.629 )
boot.ci(logit.boot, type="bca", index=6) # rank4
95%   (-2.425, -0.698 )

The bootstrap output displays the original regression coefficients ("original") and their bias, which is the difference between the original coefficients and the mean of the bootstrapped ones. It also gives the standard errors. Note that they are a bit larger than the original standard errors.
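The bias column can be reproduced directly from the boot object of the example above:

```r
# Bias = mean of the bootstrap replicates minus the original estimates.
# logit.boot$t is the 10'000 x 6 matrix of bootstrapped coefficients,
# logit.boot$t0 holds the coefficients of the original model.
colMeans(logit.boot$t) - logit.boot$t0
```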

Of the available confidence-interval types, the bias-corrected and accelerated ("bca") intervals are usually preferred. They are given on the original (log-odds) scale. For confidence intervals for the odds ratios, just exponentiate the confidence limits.
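For instance, for gpa the odds-ratio interval is obtained by exponentiating the limits from the example above (the `$bca` component of the boot.ci object stores the lower and upper limits in its last two columns):

```r
# Odds-ratio CI for gpa: exponentiate the bca limits, which are on the log-odds scale
ci.gpa <- boot.ci(logit.boot, type = "bca", index = 3)
exp(ci.gpa$bca[4:5])   # lower and upper 95% limits as odds ratios
```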
