Hypothesis Testing – Wilcoxon Test vs Bootstrapping vs Alternative Methods

Tags: bootstrap, hypothesis-testing, nonparametric, r, treatment-effect

A colleague has developed a treatment to "prevent falls" in cognitively impaired psychiatric patients. Since this would be a very useful treatment in this population, we especially do not want to make a Type II error (i.e., fail to reject the null when we should reject it).

Since the data are not normally distributed, another colleague evaluated the full data set using (appropriately, I believe) the Wilcoxon test and did not find significance. There may be valid methodological reasons for this, which I may follow up on later in another question.

I was concerned about committing a Type II error, and obtained some preliminary data, which I have below. These data reflect pre/post scores (# of "falls") for the same patients (no control group), so should be considered "paired" and not independent:

pre <- c(9,8,37,12,8,3,4,4,3,5,4,8,4,8,9,11,2,4,0,0,5,12,10,2,8,3,0,22,1,0,0,5,0,3,1,5)

post <- c(10,8,6,4,5,2,4,4,2,2,1,7,2,1,3,9,2,2,0,0,6,16,4,3,4,7,0,10,3,0,0,4,0,1,1,5)

When I ran a bootstrapping procedure on this preliminary data (adapted from Crawley, The R Book, p. 385)

# bootstrap the mean of the pre scores: 10,000 resamples drawn with replacement
preBoot <- numeric(10000)
for (i in 1:10000) { preBoot[i] <- mean(sample(pre, replace=TRUE)) }
# 2.5% and 97.5% quantiles of the bootstrapped means
quantile(preBoot, c(0.025, 0.975))

and compared the post mean to the bootstrap estimate of the sampling distribution of the pre mean, I found that the treatment did have a significant beneficial effect. To evaluate significance, I simply took the 0.025 and 0.975 quantiles of the bootstrapped means; is this correct, or am I confusing what I would do with a normal distribution with what I should do with the distribution of the sample estimates of the mean?
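In code, the comparison I describe amounts to checking whether the observed post mean falls outside this bootstrap interval for the pre mean; a minimal sketch, reusing the preBoot vector from the loop above:

# 95% percentile interval for the bootstrapped pre mean
preCI <- quantile(preBoot, c(0.025, 0.975))
# the comparison I made: is the observed post mean below the lower limit?
mean(post)
mean(post) < preCI[1]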

Also, using wilcox.test in R on the preliminary data, i.e.,

wilcox.test(pre, post, paired=T, exact=F)

shows this to be significant.

Before I go further, I would like to know: did I use the bootstrapping procedure correctly, and is this a legitimate test for this type of data?

Are there other tests we should consider, and what would be the best way to report this? I am especially interested in methods that would allow us to obtain confidence intervals.

Also, I see that in this previous question, Wilcoxon one-tailed test, the response advised to "keep in mind that it's generally not advisable to use one-tailed tests", but if I'm interested specifically in fewer falls after the treatment intervention, wouldn't a one-tailed test be appropriate?
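If a one-tailed test is appropriate here, my understanding is that the direction would be specified through the alternative argument of wilcox.test; a minimal sketch (with paired data the test is on pre minus post, so alternative="greater" corresponds to fewer falls after treatment):

# one-sided paired Wilcoxon test: pre counts tend to exceed post counts
wilcox.test(pre, post, paired=TRUE, alternative="greater", exact=FALSE)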

Additional Info Update:
I just found a wonderful review of the analysis of count data by Neal Alexander, "Review: analysis of parasite and other skewed counts", accessed via PubMed (http://www.ncbi.nlm.nih.gov/pubmed/22943299), which discusses the issues I've been facing in a very accessible manner (and it is free online). Others reading this question may also find it quite helpful.

I'm still digesting this info. This probably belongs in a new question, but in essence, I believe that in my field (clinical psychology) the standard way of looking at these data would be a Wilcoxon test, with perhaps a square root transformation and a t-test running in second place. Most people currently don't use R, and so don't seem to be aware of or use bootstrapping, which I actually believe would be better than either of the above two methods. If anybody has further information, or information to the contrary, I would appreciate it.

Best Answer

  • You have a paired-data situation, as you mentioned; however, you treated the samples as if they were independent. You should run the bootstrap on the differences between the pre and post measurements of each patient, and then check whether the resulting interval contains zero (see the sketch after this list).
  • Although it's not generally advisable, applying a one-tailed specification is reasonable in the situation as you describe it.
  • The Wilcoxon signed-rank test is a correct method, as is the sign test for the median difference. Moreover, you can consider transforming your count data by taking square roots and then performing a t-test (also shown in the sketch below).
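A minimal sketch of the bootstrap on paired differences and of the square-root-transformed t-test, assuming the pre and post vectors from the question (object names here are only illustrative):

# bootstrap the mean of the paired differences: resample the per-patient
# differences with replacement, not the raw pre and post scores separately
diffs <- pre - post
diffBoot <- numeric(10000)
for (i in 1:10000) { diffBoot[i] <- mean(sample(diffs, replace=TRUE)) }

# 95% percentile interval for the mean difference; check whether it contains zero
quantile(diffBoot, c(0.025, 0.975))

# square-root transformation of the counts followed by a paired t-test
t.test(sqrt(pre), sqrt(post), paired=TRUE)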