Solved – Appropriateness of Wilcoxon signed rank test

hypothesis-testing

I've poked around a bit in the Cross Validated archives and haven't found an answer to my question. My question is the following: Wikipedia gives three assumptions that need to hold for the Wilcoxon signed rank test (slightly modified for my questions):

Let Zi = Xi-Yi for i=1,…,n.

  1. The differences Zi are assumed to be independent.

  2. (a.) Each Zi comes from the same continuous population, and (b.) each Zi is symmetric about a common median;

  3. The values which Xi and Yi represent are ordered…so the comparisons 'greater than', 'less than', and 'equal to' are useful.

The documentation for ?wilcox.test in R, however, seems to indicate that (2.b) is actually something that is tested by the procedure:

"…if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution … of x – y (in the paired two sample case) is symmetric about mu is performed."

This sounds to me as though the test is performed for the null hypothesis that "Z is distributed symmetrically around median mu=SomeMu", such that rejection of the null could be either a rejection of the symmetry or a rejection that the mu around which Z is symmetric is SomeMu.

Is this a correct understanding of the R documentation for wilcox.test? The reason this is important, of course, is that I am conducting a number of paired-difference tests on some before-and-after data ("X" and "Y" above). The "before" and "after" data individually are highly skewed, but the differences are not skewed nearly as much (although still skewed somewhat). By that I mean that the "before" or "after" data considered alone has skewness ~7 to 21 (depending on the sample I am looking at), while the "differences" data has skewness ~= 0.5 to 5. Still skewed, but not nearly as much.

If having skewness in my "differences" data will cause the Wilcoxon test to give me false/biased results (as the Wikipedia article seems to indicate), then skewness could be a big concern. If, however, the Wilcoxon tests are actually testing whether the differences distribution is "symmetric around mu=SomeMu" (as ?wilcox.test seems to indicate) then this is less of a concern.
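For reference, a small sketch in R with made-up counts (not my actual data) showing the two equivalent ways of calling the paired test:

```r
# Made-up paired counts (not the real data) to illustrate the call.
set.seed(1)
before <- rpois(200, lambda = 6)
after  <- rpois(200, lambda = 5)

# Paired two-sample form: per ?wilcox.test, this tests whether the
# distribution of before - after is symmetric about mu (default 0).
wilcox.test(before, after, paired = TRUE)

# Equivalent one-sample form on the differences:
wilcox.test(before - after)
```

Both calls give identical statistics and p-values, since the paired form is computed on the differences.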

Thus my questions are:

  1. Which interpretation above is correct? Is skewness in my "differences" distribution going to bias my Wilcoxon test?

  2. If skewness is a concern: "How much skewness is a concern?"

  3. If the Wilcoxon signed rank tests seem grossly inappropriate here, any suggestions for what I should use?
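For what it's worth, here is a quick simulation sketch of question 1 (made-up distributions, not my data): draw differences from a skewed distribution whose median is exactly zero and check how often the test rejects at the 5% level.

```r
# Differences from a shifted exponential: the median is exactly 0,
# but the distribution is skewed (asymmetric about 0).
set.seed(42)
reps <- 2000
pvals <- replicate(reps, {
  z <- rexp(50) - log(2)      # median of Exp(1) is log(2)
  wilcox.test(z, mu = 0)$p.value
})
mean(pvals < 0.05)  # rejection rate under a median-zero but skewed null
```

If the rejection rate sits well above 0.05 even though the median of the differences is zero, that suggests skewness alone can push the signed-rank test toward rejection.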

Thanks so much. If you have any further suggestions about how I might do this analysis I am more than happy to hear them (although I can also open another thread for that purpose). Also, this is my first question on Cross Validated; if you have suggestions/comments on how I asked this question, I am open to that as well!


A little background: I am analyzing a dataset that contains observations on what I'll call "errors in firm production." I have an observation on errors occurring in the production process before and after a surprise inspection, and one of the goals of the analysis is to answer the question, "does the inspection make a difference in the observed number of errors?"

The data set looks something like this:

ID, errorsBefore, errorsAfter, size_large, size_medium, typeA, typeB, typeC, typeD
0123,1,1,1,0,1,1,1,0 
2345,1,0,0,0,0,1,1,0
6789,2,1,0,1,0,1,0,0
1234,8,8,0,0,1,0,0,0

There are roughly 4000 observations. The other variables are categorical observations that describe characteristics of the firms. Size can be small, medium, or large, and each firm is one and only one of those. Firms can be any or all of the "types."

I was asked to run some simple tests to see if there were statistically significant differences in observed error rates before and after the inspections for all firms and various sub-groupings (based on size and type). T-tests were out because the data was severely skewed both before and after, for example, in R the before data looked something like this:

summary(errorsBefore)
# Min.  1st Qu.  Median   Mean  3rd Qu.    Max
# 0.000  0.000    4.000  12.00    13.00  470.0

(These are made up — I'm afraid I can't post the actual data or any actual manipulations of it due to proprietary/privacy issues — my apologies!)

The paired differences were more centralized but still not very well fit by a normal distribution — far too peaked. Differences data looked something like this:

summary(errorsBefore-errorsAfter)
# Min.   1st Qu.  Median   Mean  3rd Qu.    Max
# -110.0  -2.000   0.000  0.005   2.000   140.0

It was suggested that I use a Wilcoxon signed rank test, and after a brief perusal of ?wilcox.test and Wikipedia, and here, this seems like the test to use. Considering the assumptions above, I believe (1) is fine given the data generating process. Assumption (2.a) is not strictly true for my data, but the discussion here: Alternative to the Wilcoxon test when the distribution isn't continuous? seemed to indicate that this wasn't too much of a concern. Assumption (3) is fine. My only concern (I believe) is Assumption (2.b).

One additional note, some years later: I eventually took an excellent non-parametric stats course and spent a lot of time on the rank-sum tests. Embedded in assumption (2.a), "Each Zi comes from the same continuous population", is the idea that both samples must come from populations with equal variance; this turns out to be extremely important, practically speaking. If you have concerns about differing variance in your populations (from which you draw the samples), you should be concerned about using WMW.

Best Answer

The R documentation has misled you in stating "...if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution ... of x - y (in the paired two sample case) is symmetric about mu is performed."

The test determines whether the RANK-TRANSFORMED values of $z_i = x_i - y_i$ are symmetric around the median you specify in your null hypothesis (I assume you'd use zero). Skewness is not a problem, since the signed-rank test, like most nonparametric tests, is "distribution free." The price you pay for these tests is often reduced power, but it looks like you have a large enough sample to overcome that.
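As a concrete sketch of that rank transformation (a minimal toy example, assuming no zero or tied differences):

```r
# Toy differences with no zeros and no ties.
z <- c(1.2, -0.4, 2.5, -3.1, 0.7, 1.9, -0.2)
r <- rank(abs(z))                 # rank the absolute differences
V <- sum(r[z > 0])                # sum the ranks of the positive ones
V                                 # 18
unname(wilcox.test(z)$statistic)  # also 18: the reported V statistic
```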

A "what the hell" alternative to the signed-rank test might be to try a simple transformation like $\ln(x_i)$ and $\ln(y_i)$ on the off chance that these measurements might roughly follow a lognormal distribution, so the logged values should look "bell curvish". Then you could use a t test and convince yourself (and your boss who only took Business Stats) that the signed-rank test is working. If this works, there's a bonus: the t test on means for lognormal data is a comparison of medians for the original, untransformed, measurements.
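A sketch of that transformation in R. Since error counts include zeros, log1p (i.e. log(1 + x)) stands in for a plain log here; that is a variant of the suggestion above, not an exact implementation, and the data are simulated:

```r
# Hypothetical paired counts; log1p because counts can be zero.
set.seed(7)
before <- rpois(200, 8)
after  <- rpois(200, 7)
t.test(log1p(before), log1p(after), paired = TRUE)
```

Worth checking hist(log1p(before)) first; if the logged values don't look roughly bell-shaped, the lognormal story (and the median interpretation) doesn't apply.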

Me? I'd do both, and anything else I could cook up (likelihood ratio test on Poisson counts by firm size?). Hypothesis testing is all about determining whether evidence is convincing, and some folks take a heap of convincin'.
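One hedged sketch of that Poisson idea, using glm on simulated counts (firm IDs and covariates like size are omitted for brevity):

```r
# Simulated before/after counts; a likelihood-ratio test of whether
# the inspection period changes the Poisson rate.
set.seed(3)
n <- 200
counts <- c(rpois(n, 8), rpois(n, 6))
period <- factor(rep(c("before", "after"), each = n))
fit0 <- glm(counts ~ 1, family = poisson)
fit1 <- glm(counts ~ period, family = poisson)
anova(fit0, fit1, test = "LRT")
```

With real error counts this skewed, overdispersion is likely, so family = quasipoisson or a negative binomial (MASS::glm.nb) would be worth checking as well.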