Solved – Logistic Regression sample size & bootstrapping

Tags: bootstrap, confidence interval, logistic, sample-size

The data for this example can be retrieved here so that you can reproduce these estimates. It is the low birth weight dataset: http://www.umass.edu/statdata/statdata/data/

There are 59 1's and 130 0's for the outcome variable.

I have a sample size of 189. I run a logistic regression analysis and get these results:

low <- read.delim(file = file.choose(), header = TRUE)
low <- within(low, {
  RACE <- factor(RACE, levels = 1:3, labels = c("White", "Black", "Hispanic"))
})
low.out <- glm(LOW ~ AGE + FTV + RACE + LWT, data = low, family = binomial)
summary(low.out)


                  Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)      1.295        1.071     1.209      0.227
    AGE             -0.024        0.034    -0.706      0.480
    FTV             -0.049        0.167    -0.295      0.768
    RACEBlack        1.004        0.498     2.016      0.044 *
    RACEHispanic     0.433        0.362     1.196      0.232
    LWT             -0.014        0.007    -2.178      0.029 *

So if I wanted to know the probability of a low birth weight baby for a Black woman who is 30 years old, weighed 108 pounds at her last menstrual period, and had 1 physician visit during the first trimester, I would calculate it as follows. First, the linear predictor:

$$1.295-0.024(30)-0.049(1)+1.004(1)+0.433(0)-0.014(108)=0.018.$$

Then, as a probability, $\exp(0.018)/(1+\exp(0.018)) = 0.5045$, i.e. $50.45\%$.
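A minimal sketch of letting R do this calculation via `predict()` on the fitted object `low.out` from above (the values in `new.obs` mirror the worked example; the variable name `new.obs` is just illustrative):

```r
# New observation matching the worked example: Black, age 30, LWT 108, FTV 1
new.obs <- data.frame(AGE  = 30,
                      FTV  = 1,
                      RACE = factor("Black", levels = c("White", "Black", "Hispanic")),
                      LWT  = 108)

predict(low.out, newdata = new.obs, type = "link")      # linear predictor (~0.018)
predict(low.out, newdata = new.obs, type = "response")  # probability (~0.5045)

# Equivalently, by hand from the coefficient table, using the logistic CDF:
plogis(1.295 - 0.024*30 - 0.049*1 + 1.004*1 + 0.433*0 - 0.014*108)
```

Small differences from the hand calculation can arise because the printed table rounds the coefficients to three decimals.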

If the model says the probability that this woman will have a low birth weight baby is 50.45%, somebody might question this on the grounds that the sample is only 189.

I only have a sample of 189 and, let's say, I can't get any more data. How do I convince a layperson that the results/estimates are robust?

Could you perhaps do bootstrapping? If I understand correctly, you could resample repeatedly and randomly from the sample, say 10,000 times, and calculate standard errors and confidence intervals of the regression coefficients, which would make one more confident in the estimates and results. Thereafter, you could get the predicted probabilities and their 95% confidence intervals. If so, how would I do the bootstrapping in R for this example?
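The procedure described above (case resampling, refit, collect coefficients and predicted probabilities) could be sketched with the `boot` package roughly as follows. This assumes the data frame `low` from the question; `boot.fn` and `new.obs` are illustrative names, and `R = 10000` matches the replicate count suggested in the text:

```r
library(boot)

# Statistic: refit the model on a resampled dataset and return the
# coefficients plus the predicted probability for the worked example.
boot.fn <- function(data, idx) {
  fit <- glm(LOW ~ AGE + FTV + RACE + LWT,
             data = data[idx, ], family = binomial)
  new.obs <- data.frame(AGE  = 30,
                        FTV  = 1,
                        RACE = factor("Black", levels = levels(data$RACE)),
                        LWT  = 108)
  c(coef(fit), predict(fit, newdata = new.obs, type = "response"))
}

set.seed(1)                               # for reproducibility
boot.out <- boot(low, boot.fn, R = 10000) # 10,000 case resamples

apply(boot.out$t, 2, sd)                  # bootstrap SEs (coefficients + probability)
boot.ci(boot.out, type = "perc", index = 7)  # percentile 95% CI for the probability
                                             # (statistic 7 = the predicted probability,
                                             #  after the 6 coefficients)
```

Note that with only 59 events, some resamples may be poorly behaved (e.g. a rarely observed factor level nearly absent), so the bootstrap distribution is worth inspecting rather than trusting blindly.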

Best Answer

Compared to many other situations, such as in NLP (natural language processing), 189 samples with 4 features is not bad.

Besides, the example you gave is a typical one for your sample (you have presumably seen many similar cases). That is an intuitive reason why your prediction should not be "so wrong".

I think bootstrapping won't help in this case because it doesn't introduce any new information into the sample. If you were able to introduce additional information and create "virtual samples", that would be helpful. However, it seems dangerous here unless there is already proven medical evidence to justify such an approach.

Finally, I have the impression that it is the variable selection procedure that makes your regression model good or bad. Carefully choosing the right variables to include in the regression model can make the results more convincing. (Perhaps this is something you have already done, since your model has fewer variables than the original file.)
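As an aside, conventional confidence intervals from the single fitted model already quantify the uncertainty the questioner is worried about, without resampling. A sketch, assuming the `low.out` object from the question (the 1.96 multiplier gives an approximate Wald 95% interval on the link scale, back-transformed with `plogis()`):

```r
confint(low.out)  # profile-likelihood 95% CIs for the coefficients

# Approximate 95% CI for the predicted probability of the worked example:
new.obs <- data.frame(AGE  = 30,
                      FTV  = 1,
                      RACE = factor("Black", levels = c("White", "Black", "Hispanic")),
                      LWT  = 108)
pr <- predict(low.out, newdata = new.obs, type = "link", se.fit = TRUE)
plogis(pr$fit + c(-1.96, 1.96) * pr$se.fit)  # CI on the probability scale
```

With a sample of 189, that interval will likely be wide, which is itself the honest answer to give a layperson: the point estimate of 50.45% comes with substantial uncertainty.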