Solved – Logistic regression with bootstrap, how to interpret high standard errors and choose coefficient

bootstrap, logistic, r, regression

I am attempting to bootstrap a logistic regression in R. The problem is that I get high SEs. I'm not sure what to do about this or what it means. Does it mean that bootstrapping does not work well for my particular data? Here is my code:

get.coeffic = function(data, indices){
  data    = data[indices,]
  mylogit = glm(F~B+D, data=data, family="binomial")
  return(mylogit$coefficients)
}

Call:
boot(data = Pres, statistic = logit.bootstrap, R = 1000)

Bootstrap Statistics :
       original      bias    std. error
t1* -10.8609610 -23.0604501  338.048398
t2*   0.2078474   0.4351766    6.387781

I also want to know how bootstrapping helps with my final regression model. That is, which regression coefficients do I use in the final model?

> fit <- glm(F ~ B + D , data = President, family = "binomial")
> summary(fit)
Call:
glm(formula = F ~ B + D, family = "binomial", data = President)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7699  -0.5073   0.1791   0.8147   1.2836  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -14.57829    8.98809  -1.622   0.1048  
B             0.15034    0.14433   1.042   0.2976  
D             0.13385    0.08052   1.662   0.0965 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23.508  on 16  degrees of freedom
Residual deviance: 14.893  on 14  degrees of freedom
AIC: 20.893

Number of Fisher Scoring iterations: 5

Best Answer

I don't follow your code: you call your data different things in different places (Pres in the boot() call, President in the glm() fit), and the function you define is get.coeffic while the statistic you pass to boot() is logit.bootstrap, which I don't see defined anywhere. Setting that aside, I'm not sure there is a big problem with your model other than the fact that you don't have much data (I gather N = 17, which is pretty small). I don't think your standard errors would be that problematic if you had a more typical sample size.
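For what it's worth, a consistent version of the bootstrap setup might look like the sketch below. The data frame name (President) and the toy data are placeholders I've made up so the snippet runs on its own; substitute your real data:

```r
library(boot)  # ships with R as a recommended package
set.seed(1)

# toy stand-in for the real 17-row President data set
President <- data.frame(B = rnorm(17), D = rnorm(17))
President$F <- rbinom(17, 1, plogis(President$B + President$D))

# statistic function: refit the logit model on each resample
get.coeffic <- function(data, indices) {
  d <- data[indices, ]
  coef(glm(F ~ B + D, data = d, family = "binomial"))
}

# the function defined above must be the one passed to statistic =
boot.out <- boot(data = President, statistic = get.coeffic, R = 1000)
boot.out
```

With only 17 rows, expect glm() warnings about fitted probabilities of 0 or 1 on many resamples; that is itself a symptom of how little data there is.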

Moreover, your model seems impressively good to me for a logistic regression with so few data to work with. The reason neither variable is significant is that they are correlated with each other (r = .49). Correlated predictors inflate SEs, but that wouldn't be a problem if you had more data. As it is, your SEs are about one third larger than they would have been if your predictors were perfectly uncorrelated:

1/(1-.49^2)
# [1] 1.315963

That means the model doesn't know which of the two variables should be given credit for predicting the response. Nonetheless, there is good predictive ability amongst those variables somewhere, as can be seen by their combined significance:

1-pchisq(23.508-14.893, 2)
# [1] 0.01346718
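The same likelihood-ratio test can be computed with lower.tail = FALSE instead of the 1 - form; a small sketch, with the deviances just copied from the summary output above:

```r
# likelihood-ratio test of the full model against the intercept-only model
null.dev  <- 23.508  # null deviance, 16 df
resid.dev <- 14.893  # residual deviance, 14 df
p <- pchisq(null.dev - resid.dev, df = 16 - 14, lower.tail = FALSE)
p
# [1] 0.01346718
```

If the fitted model object is still around, anova(update(fit, . ~ 1), fit, test = "Chisq") should give the same comparison without retyping the deviances.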

As far as bootstrapping goes, it is used to estimate the sampling distribution of a statistic without relying on normality assumptions. It may help you to read this excellent CV thread: Explaining to laypeople why bootstrapping works.
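On your second question: bootstrapping doesn't hand you a different set of coefficients. You would still report the coefficients from the original glm() fit; what the bootstrap adds is confidence intervals for them that don't lean on the normal approximation. A minimal sketch with boot.ci() (again with made-up toy data standing in for your President data frame):

```r
library(boot)
set.seed(1)

# toy stand-in for the real data
d <- data.frame(B = rnorm(17), D = rnorm(17))
d$F <- rbinom(17, 1, plogis(d$B + d$D))

coef.fun <- function(data, indices) {
  coef(glm(F ~ B + D, data = data[indices, ], family = "binomial"))
}

bo <- boot(d, coef.fun, R = 999)

# percentile confidence interval for the coefficient on B
# (index = 2 because index 1 is the intercept)
boot.ci(bo, type = "perc", index = 2)
```

With data this small and coefficients this unstable, expect the percentile intervals to be very wide, which is the honest answer your huge bootstrap SEs are already giving you.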