Solved – Appropriateness of Wilcoxon signed rank test

hypothesis-testing

I've poked around a bit in the Cross Validated archives and haven't found an answer to my question. My question is the following: Wikipedia gives three assumptions that need to hold for the Wilcoxon signed rank test (slightly modified for my questions):

Let Zi = Xi-Yi for i=1,…,n.

  1. The differences Zi are assumed to be independent.

  2. (a.) Each Zi comes from the same continuous population, and (b.) each Zi is symmetric about a common median;

  3. The values which Xi and Yi represent are ordered…so the comparisons 'greater than', 'less than', and 'equal to' are useful.

The documentation for ?wilcox.test in R, however, seems to indicate that (2.b) is actually something that is tested by the procedure:

"…if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution … of x – y (in the paired two sample case) is symmetric about mu is performed."

This sounds to me as though the test is performed for the null hypothesis that "Z is distributed symmetrically around median mu=SomeMu", such that rejection of the null could be either a rejection of the symmetry or a rejection that the mu around which Z is symmetric is SomeMu.

Is this a correct understanding of the R documentation for wilcox.test? The reason this is important, of course, is that I am conducting a number of paired-difference tests on some before-and-after data ("X" and "Y" above). The "before" and "after" data individually are highly skewed, but the differences are not skewed nearly as much (although still skewed somewhat). By that I mean that the "before" or "after" data considered alone has skewness ~7 to 21 (depending on the sample I am looking at), while the "differences" data has skewness ~= 0.5 to 5. Still skewed, but not nearly as much.

If having skewness in my "differences" data will cause the Wilcoxon test to give me false/biased results (as the Wikipedia article seems to indicate), then skewness could be a big concern. If, however, the Wilcoxon tests are actually testing whether the differences distribution is "symmetric around mu=SomeMu" (as ?wilcox.test seems to indicate) then this is less of a concern.
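For reference, a small sketch in R with made-up counts (not my actual data) showing the two equivalent ways of calling the paired test:

```r
# Made-up paired counts (not the real data) to illustrate the call.
set.seed(1)
before <- rpois(200, lambda = 6)
after  <- rpois(200, lambda = 5)

# Paired two-sample form: per ?wilcox.test, this tests whether the
# distribution of before - after is symmetric about mu (default 0).
wilcox.test(before, after, paired = TRUE)

# Equivalent one-sample form on the differences:
wilcox.test(before - after)
```

Both calls give identical statistics and p-values, since the paired form is computed on the differences.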

Thus my questions are:

  1. Which interpretation above is correct? Is skewness in my "differences" distribution going to bias my Wilcoxon test?

  2. If skewness is a concern: "How much skewness is a concern?"

  3. If the Wilcoxon signed rank tests seem grossly inappropriate here, any suggestions for what I should use?
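For what it's worth, here is a quick simulation sketch of question 1 (made-up distributions, not my data): draw differences from a skewed distribution whose median is exactly zero and check how often the test rejects at the 5% level.

```r
# Differences from a shifted exponential: the median is exactly 0,
# but the distribution is skewed (asymmetric about 0).
set.seed(42)
reps <- 2000
pvals <- replicate(reps, {
  z <- rexp(50) - log(2)      # median of Exp(1) is log(2)
  wilcox.test(z, mu = 0)$p.value
})
mean(pvals < 0.05)  # rejection rate under a median-zero but skewed null
```

If the rejection rate sits well above 0.05 even though the median of the differences is zero, that suggests skewness alone can push the signed-rank test toward rejection.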

Thanks so much. If you have any further suggestions about how I might do this analysis I am more than happy to hear them (although I can also open another thread for that purpose). Also, this is my first question on Cross Validated; if you have suggestions/comments on how I asked this question, I am open to that as well!


A little background: I am analyzing a dataset that contains observations on what I'll call "errors in firm production." I have an observation on errors occurring in the production process before and after a surprise inspection, and one of the goals of the analysis is to answer the question, "does the inspection make a difference in the observed number of errors?"

The data set looks something like this:

ID, errorsBefore, errorsAfter, size_large, size_medium, typeA, typeB, typeC, typeD
0123,1,1,1,0,1,1,1,0 
2345,1,0,0,0,0,1,1,0
6789,2,1,0,1,0,1,0,0
1234,8,8,0,0,1,0,0,0

There are roughly 4000 observations. The other variables are categorical observations that describe characteristics of the firms. Size can be small, medium, or large, and each firm is one and only one of those. Firms can be any or all of the "types."

I was asked to run some simple tests to see if there were statistically significant differences in observed error rates before and after the inspections for all firms and various sub-groupings (based on size and type). T-tests were out because the data was severely skewed both before and after, for example, in R the before data looked something like this:

summary(errorsBefore)
# Min.  1st Qu.  Median   Mean  3rd Qu.    Max
# 0.000  0.000    4.000  12.00    13.00  470.0

(These are made up — I'm afraid I can't post the actual data or any actual manipulations of it due to proprietary/privacy issues — my apologies!)

The paired differences were more centralized but still not very well fit by a normal distribution — far too peaked. Differences data looked something like this:

summary(errorsBefore-errorsAfter)
# Min.   1st Qu.  Median   Mean  3rd Qu.    Max
# -110.0  -2.000   0.000  0.005   2.000   140.0

It was suggested that I use a Wilcoxon signed rank test, and after a brief perusal of ?wilcox.test and Wikipedia, and here, this seems like the test to use. Considering the assumptions above, I believe (1) is fine given the data generating process. Assumption (2.a) is not strictly true for my data, but the discussion here: Alternative to the Wilcoxon test when the distribution isn't continuous? seemed to indicate that this wasn't too much of a concern. Assumption (3) is fine. My only concern (I believe) is Assumption (2.b).

One additional note, some years later: I eventually took an excellent non-parametric stats course and spent a lot of time on the rank-sum tests. Embedded in assumption (2.a), "Each Zi comes from the same continuous population", is the idea that both samples must come from populations with equal variance; this turns out to be extremely important, practically speaking. If you have concerns about differing variance in your populations (from which you draw the samples), you should be concerned about using WMW.

Best Answer

The R documentation has misled you in stating "...if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution ... of x - y (in the paired two sample case) is symmetric about mu is performed."

The test determines whether the RANK-TRANSFORMED values of $z_i = x_i - y_i$ are symmetric around the median you specify in your null hypothesis (I assume you'd use zero). Skewness is not a problem, since the signed-rank test, like most nonparametric tests, is "distribution free." The price you pay for these tests is often reduced power, but it looks like you have a large enough sample to overcome that.
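As a concrete sketch of that rank transformation (a minimal toy example, assuming no zero or tied differences):

```r
# Toy differences with no zeros and no ties.
z <- c(1.2, -0.4, 2.5, -3.1, 0.7, 1.9, -0.2)
r <- rank(abs(z))                 # rank the absolute differences
V <- sum(r[z > 0])                # sum the ranks of the positive ones
V                                 # 18
unname(wilcox.test(z)$statistic)  # also 18: the reported V statistic
```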

A "what the hell" alternative to the signed-rank test might be to try a simple transformation like $\ln(x_i)$ and $\ln(y_i)$ on the off chance that these measurements might roughly follow a lognormal distribution, so the logged values should look "bell curvish". Then you could use a t test and convince yourself (and your boss who only took Business Stats) that the signed-rank test is working. If this works, there's a bonus: the t test on means for lognormal data is a comparison of medians for the original, untransformed, measurements.
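A sketch of that transformation in R. Since error counts include zeros, log1p (i.e. log(1 + x)) stands in for a plain log here; that is a variant of the suggestion above, not an exact implementation, and the data are simulated:

```r
# Hypothetical paired counts; log1p because counts can be zero.
set.seed(7)
before <- rpois(200, 8)
after  <- rpois(200, 7)
t.test(log1p(before), log1p(after), paired = TRUE)
```

Worth checking hist(log1p(before)) first; if the logged values don't look roughly bell-shaped, the lognormal story (and the median interpretation) doesn't apply.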

Me? I'd do both, and anything else I could cook up (likelihood ratio test on Poisson counts by firm size?). Hypothesis testing is all about determining whether evidence is convincing, and some folks take a heap of convincin'.
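One hedged sketch of that Poisson idea, using glm on simulated counts (firm IDs and covariates like size are omitted for brevity):

```r
# Simulated before/after counts; a likelihood-ratio test of whether
# the inspection period changes the Poisson rate.
set.seed(3)
n <- 200
counts <- c(rpois(n, 8), rpois(n, 6))
period <- factor(rep(c("before", "after"), each = n))
fit0 <- glm(counts ~ 1, family = poisson)
fit1 <- glm(counts ~ period, family = poisson)
anova(fit0, fit1, test = "LRT")
```

With real error counts this skewed, overdispersion is likely, so family = quasipoisson or a negative binomial (MASS::glm.nb) would be worth checking as well.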