Mann-Whitney Test – How to Perform Mann-Whitney Test with Tied Ranks

hypothesis-testing, nonparametric, ties, wilcoxon-mann-whitney-test

I have two data sets with PDFs roughly like this:

$$
p(x) = \left\{
\begin{array}{lr}
0.75 & x = 0\\
\text{Lomax}(x) & x > 0
\end{array}
\right.
$$

i.e. it's continuous, except there is a huge number of points at $x=0$.

I wish to tell whether the values in one set are lower than those in the other. However, there are a lot of ties in my data, which makes this hard. It was suggested to me that I do something like: bootstrap repeatedly, and if a 95% confidence interval has the x's less than the y's (as determined by the Mann-Whitney U), declare $X<Y$.

It's not really clear from the definition of U what I do about ties though. wilcox.test in R seems to break them evenly, e.g. when comparing $(0,0)$ to $(0,0)$ it says $U=2$. (This behavior seems to contradict its documentation though which says it returns "the number of all pairs (x[i], y[j]) for which y[j] is not greater than x[i].")
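For concreteness, this tie behaviour is easy to reproduce:

```r
# All four (x[i], y[j]) pairs are ties at zero, and wilcox.test uses
# midranks, which amounts to counting each tied pair as 1/2: U = 4 * 0.5 = 2.
res <- suppressWarnings(wilcox.test(c(0, 0), c(0, 0)))
unname(res$statistic)  # 2
```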

Is breaking them evenly the right thing to do here? I'm worried since I have such a huge fraction of ties that my intuition about a "95% confidence interval" doesn't actually correspond to 95% of my samples.

Best Answer

Answer to the question

First of all, it is important to notice that the quantities $P(X\leq Y)$ and $P(X\lt Y)$ are different, given that the variables are not continuous.

Let $X$ and $Y$ be two independent random variables whose distribution is a mixture of a discrete and a continuous distribution such that $P(X=0)=p_1>0$ and $P(Y=0)=p_2>0$. Then by the law of total probability we have that

\begin{eqnarray*} P(X\leq Y)&=&P(X\leq Y \vert Y=0)P(Y=0)+P(X\leq Y \vert Y>0)P(Y>0)\\ &=& P(X=0)P(Y=0)+P(X\leq Y \vert Y>0)[1-P(Y=0)]\\ &=& p_1p_2 +P(X\leq Y \vert Y>0)(1-p_2). \end{eqnarray*}

Now, conditioning on whether $X=0$, we have $P(X\leq Y \vert Y>0)=p_1+P(X\leq Y\vert X>0,Y>0)(1-p_1)$. Combining the two steps gives
$$\theta = P(X\leq Y) = p_1 + P(X\leq Y\vert X>0,Y>0)(1-p_1)(1-p_2),$$
an expression in terms of quantities that we can estimate. Note that

\begin{eqnarray*} P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy, \end{eqnarray*}

where $F_X$ is the CDF of the continuous part of $X$ and $f_Y$ is the PDF of the continuous part of $Y$ (in your case, a Lomax distribution).
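As a quick numerical check of this identity, with arbitrary (hypothetical) Lomax parameters rather than fitted ones, the integral can be compared against a Monte Carlo estimate:

```r
# Lomax CDF and PDF
plomax <- function(x, alpha, lambda) 1 - (1 + x/lambda)^(-alpha)
dlomax <- function(x, alpha, lambda) (alpha/lambda) * (1 + x/lambda)^(-(alpha + 1))

# P(X <= Y) for X ~ Lomax(3, 1), Y ~ Lomax(2, 2) via the integral identity
p.int <- integrate(function(y) plomax(y, 3, 1) * dlomax(y, 2, 2), 0, Inf)$value

# Monte Carlo check using inverse-CDF sampling
set.seed(1)
n <- 1e5
x <- 1 * ((1 - runif(n))^(-1/3) - 1)
y <- 2 * ((1 - runif(n))^(-1/2) - 1)
p.mc <- mean(x <= y)
c(p.int, p.mc)  # both estimate P(X <= Y)
```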

Now, how do we estimate the parameters? I am going to use nonlinear least squares, fitting the model CDF to the empirical CDF. This method works in your case given the large sample size. Below is R code for conducting this estimation using a simulated sample of size $n=1000$.

rm(list=ls())
p0 = 0.75
alpha0 = 3
lambda0 = 1

# Function for simulating a Lomax variable
rlomax = function(n,alpha,lambda) return( lambda*( (1-runif(n))^(-1/alpha) - 1 ))

# Simulated data, X and Y
set.seed(1)
ns = 1000
simx = simy = rep(0,ns)

for(i in 1:ns){
  u = runif(1)
  if(u < p0) simx[i] = 0
  else       simx[i] = rlomax(1, alpha0, lambda0)
}

for(i in 1:ns){
  u = runif(1)
  if(u < p0) simy[i] = 0
  else       simy[i] = rlomax(1, alpha0, lambda0)
}

hist(simx,col="red")
hist(simy,add=T,col="blue")

# Distribution function of the mixture

FM = function(x, p, alpha, lambda){
  temp = x
  for(i in 1:length(x)){
    if(x[i] == 0) temp[i] = p
    if(x[i] > 0)  temp[i] = p + (1-p)*( 1 - (1 + x[i]/lambda)^(-alpha) )
  }
  return(temp)
}


ecdfdatx = ecdf(simx)(sort(simx))
ecdfdaty = ecdf(simy)(sort(simy))

Datax = data.frame(sort(simx),ecdfdatx)
Datay = data.frame(sort(simy),ecdfdaty)

# Fit for the first data set

nls_fitx = nls(ecdfdatx ~ FM(sort(simx),p,alpha,lambda), data=Datax, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fitx
plot(ecdf(simx))
lines(sort(simx), predict(nls_fitx), col = "red")

# Fit for the second data set

nls_fity = nls(ecdfdaty ~ FM(sort(simy),p,alpha,lambda), data=Datay, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fity
plot(ecdf(simy))
lines(sort(simy), predict(nls_fity), col = "red")

With this code we obtain estimators of the parameters $(p_1,\alpha_X,\lambda_X,p_2,\alpha_Y,\lambda_Y)$. The remaining step consists of calculating $P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy$.

# remaining quantity

px.h = coef(nls_fitx)[1]
py.h = coef(nls_fity)[1]
ax.h = coef(nls_fitx)[2]
ay.h = coef(nls_fity)[2]
lx.h = coef(nls_fitx)[3]
ly.h = coef(nls_fity)[3]

# Lomax PDF
dlomax = function(x,alpha,lambda) return(alpha*(1+x/lambda)^(-(alpha+1))/lambda)

# Lomax CDF
plomax = function(x,alpha,lambda) return(1-(1+x/lambda)^(-alpha) )

tempf = function(x) plomax(x,ax.h,lx.h)*dlomax(x,ay.h,ly.h)

p.l = integrate(tempf,0,Inf)$value

# Estimator of theta
px.h + p.l*(1-px.h)*(1-py.h)

Similarly, an estimator of $P(X<Y)$ can be calculated. The event $\{X<Y\}$ requires $Y>0$, and given $Y>0$ it occurs whenever $X=0$; on the continuous part the strict and non-strict inequalities coincide, so $P(X<Y)=p_1(1-p_2)+P(X\leq Y\vert X>0,Y>0)(1-p_1)(1-p_2)$:

# Estimator of theta2
px.h*(1-py.h) + p.l*(1-px.h)*(1-py.h)
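As a sanity check for the simulated example, where $X$ and $Y$ are i.i.d. with $p_1=p_2=0.75$ and identical Lomax parts (so $P(X\leq Y\vert X>0,Y>0)=1/2$ by symmetry and continuity), the true values are available in closed form:

```r
# True values under the simulation settings (X, Y i.i.d., P(X = 0) = p):
# P(X <= Y) = p + 0.5*(1 - p)^2              (ties at zero count)
# P(X <  Y) = p*(1 - p) + 0.5*(1 - p)^2      (X = 0 < Y, plus both positive)
p <- 0.75
theta.true  <- p + 0.5 * (1 - p)^2
theta2.true <- p * (1 - p) + 0.5 * (1 - p)^2
c(theta.true, theta2.true)  # 0.78125 and 0.21875
```

Note that their difference, $p^2=0.5625$, is exactly $P(X=Y)$, the probability that both observations are zero.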

Note that the quantity $P(X\leq Y)$ depends on the point masses $p_1$ and $p_2$, and therefore it may be misleading. For instance, if $X$ and $Y$ are i.i.d. and $p_1,p_2\approx 1$, we have that $P(X\lt Y)\approx 0$ while $P(X \leq Y)\approx 1$.

My conclusion: The stress-strength coefficient is not what you need for comparing the performance of both companies.

How to solve the problem?

I think this problem can be seen as a decision problem. You have two companies providing a programming service and you want to decide which one is better. Consider the hypothetical case where one of the companies produces a large proportion of code with zero errors, but when a piece of its code does contain errors, the number of errors is likely to be large. Is this better than a company with a lower proportion of zero-error code but smaller positive error counts?

A naive toy example. Suppose that your decision rule is based on the estimated proportions of zero-error code, as follows:

  1. Estimate $p_1$ and $p_2$. If $\hat{p}_1/\hat{p}_2\in(0.9,1.1)$, then proceed to estimate $\theta = P(X<Y)$ using only the positive values. If a $95\%$ confidence interval for $\theta$ contains the value $0.5$, then there is no criterion for choosing one of the companies. If this value is not contained in the confidence interval, then choose Company X if $P(X<Y)>0.5$ or Company Y if $P(X<Y)<0.5$.
  2. If the ratio of estimators $\hat{p}_1/\hat{p}_2<0.9$, then choose Company Y.
  3. If the ratio of estimators $\hat{p}_1/\hat{p}_2>1.1$, then choose Company X.

This (naive, I repeat) criterion favours companies that produce more zero-error code, and selects one based on the stress-strength coefficient of the continuous part when the companies seem to produce a similar proportion of zero-error code.
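A minimal sketch of this rule in code (`choose_company` and its arguments are hypothetical names; `theta.ci` stands for a 95% confidence interval for $P(X<Y)$ computed from the positive values only):

```r
# Naive decision rule sketched above.
# p1.hat, p2.hat: estimated proportions of zero-error code for X and Y
# theta.ci: 95% CI for theta = P(X < Y), positive values only
choose_company <- function(p1.hat, p2.hat, theta.ci) {
  r <- p1.hat / p2.hat
  if (r < 0.9) return("Y")  # Y produces clearly more zero-error code
  if (r > 1.1) return("X")  # X produces clearly more zero-error code
  # similar zero proportions: compare the continuous parts
  if (theta.ci[1] > 0.5) return("X")  # P(X < Y) significantly above 0.5
  if (theta.ci[2] < 0.5) return("Y")  # P(X < Y) significantly below 0.5
  "no criterion for choosing"
}

choose_company(0.80, 0.78, c(0.55, 0.70))  # "X"
```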

In order to conduct a proper analysis, one would need to select an appropriate loss function, based on expert opinion, in order to arrive at a reasonable selection criterion. This would require more effort and I think it falls outside the scope of this site, but I hope this answer gives you some help.

Some references of possible interest:

Statistical Decision Theory and Bayesian Analysis

Bayesian Theory

Bayesian Decision Analysis: Principles and Practice

It would also help to check the literature on software quality control and see the criteria adopted by some companies.
