Solved – extremely right-skewed data: bootstrap or Mann-Whitney?

Tags: bootstrap, minitab, skewness, wilcoxon-mann-whitney-test

I am an old dog (DB guy) trying to learn new tricks (stats) and was hoping someone here could tell me if this is a good approach:

I have to analyse extremely right-skewed counts of events over a period of observation, pre- vs. post-intervention. I have N = 1213 pre-intervention and N = 1138 post-intervention observations. Minitab reports skewness of 5.3 and 7.3 for the two datasets. I would like to figure out the % change due to our intervention.

I proposed the following: transform the non-zero event counts with log10 to reduce the skew, and use a Mann-Whitney test to estimate the median change between pre and post. In addition, I calculate the probability of zero-event counts separately. In Minitab, Levene's test on the two sets of log10 values (pre vs. post) gave P = 0.220, so I assume the variances are similar. The Mann-Whitney test of $\eta_1 = \eta_2$ vs. $\eta_1 > \eta_2$ gave an estimated difference of 0.4301 with P = 0. Back-transforming from the log scale gave me a 5.85% improvement after the intervention. The log-transformed data were still not considered normal (p < 0.005).
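For reference, the rough R equivalent of that step would be something like this (just a sketch, not exactly what Minitab does; `pre` and `post` stand in for my raw event-count vectors):

```r
# Log10-transform the non-zero counts only, as in my Minitab workflow.
log_pre  <- log10(pre[pre > 0])
log_post <- log10(post[post > 0])

# One-sided Mann-Whitney / Wilcoxon rank-sum test; conf.int = TRUE also
# returns the Hodges-Lehmann estimate of the shift on the log10 scale.
mw <- wilcox.test(log_pre, log_post, alternative = "greater", conf.int = TRUE)
mw$estimate      # estimated shift in log10 units
10^mw$estimate   # back-transformed multiplicative change
```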

But a teammate said I should be doing a bootstrap of the mean over the entire dataset, including the zero-event counts. So I bootstrapped the mean in R, and this gave over 50% improvement. That does not feel right.
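The bootstrap I ran was roughly the following (again a sketch, not my exact script; `pre` and `post` are the full count vectors, zeros included):

```r
# Percentile bootstrap of the relative change in the mean, zeros included.
set.seed(1)
boot_change <- replicate(10000, {
  m_pre  <- mean(sample(pre,  replace = TRUE))
  m_post <- mean(sample(post, replace = TRUE))
  (m_pre - m_post) / m_pre   # relative improvement, assuming fewer events is better
})
mean(boot_change)                        # point estimate of the % improvement
quantile(boot_change, c(0.025, 0.975))   # 95% percentile interval
```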

What is the correct representation of the difference?

Best Answer

If you have count data and are interested in the percentage change due to the intervention, you might want to run a negative binomial regression with the counts as the outcome variable and the intervention condition as the predictor. If $\beta_1$ is the coefficient on the intervention variable, then $\exp(\beta_1)$ is the ratio of mean outcomes for the treatment category vs. the control category, and $100 \times (\exp(\beta_1) - 1)$ is the corresponding percentage change in your mean outcome.
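In R this could be fit with `MASS::glm.nb`, for example; the sketch below assumes a data frame `dat` with a count column `events` and a 0/1 intervention indicator `post` (both placeholder names):

```r
library(MASS)

# Negative binomial regression: counts as outcome, intervention as predictor.
fit <- glm.nb(events ~ post, data = dat)

b1 <- coef(fit)["post"]
exp(b1)               # ratio of mean counts, post vs. pre
100 * (exp(b1) - 1)   # percentage change in the mean count
confint(fit)          # profile-likelihood CIs for the coefficients
```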

Negative binomial regression is designed to model count data with right skew. If you have many zeros, you can try a zero-inflated negative binomial regression. The benefit is that the model is built for the data you have and produces a result on the scale you want (percentage change).
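If the zeros do turn out to dominate, one option is `pscl::zeroinfl` with `dist = "negbin"` (same placeholder names as above; the `| 1` part models the excess zeros with an intercept only):

```r
library(pscl)

# Zero-inflated negative binomial: a count model plus a logit model for
# the probability of a structural (excess) zero.
zfit <- zeroinfl(events ~ post | 1, data = dat, dist = "negbin")
summary(zfit)
exp(coef(zfit)["count_post"])   # rate ratio from the count part of the model
```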

Another thing to note is that a standard t-test on the raw values would not necessarily be invalid: with samples this large, the central limit theorem makes the test statistic approximately t-distributed even though the raw counts are far from normal.
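For comparison, that check is a one-liner on the same hypothetical data frame:

```r
# Welch two-sample t-test on the raw counts.
t.test(events ~ post, data = dat)
```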
