Answer to the question
First of all, it is important to notice that the quantities $P(X\leq Y)$ and $P(X\lt Y)$ differ here, because the variables are not purely continuous: they have an atom at zero.
Let $X$ and $Y$ be two independent random variables whose distribution is a mixture of a discrete and a continuous distribution such that $P(X=0)=p_1>0$ and $P(Y=0)=p_2>0$. Then by the law of total probability we have that
\begin{eqnarray*}
P(X\leq Y)&=&P(X\leq Y \vert Y=0)P(Y=0)+P(X\leq Y \vert Y>0)P(Y>0)\\
&=& P(X=0)P(Y=0)+P(X\leq Y \vert Y>0)[1-P(Y=0)]\\
&=& p_1p_2 +P(\{X\leq Y\} \cap (\{X=0\} \cup \{X>0\}) \vert Y>0)(1-p_2).
\end{eqnarray*}
Now, $P(\{X\leq Y\} \cap (\{X=0\} \cup \{X>0\}) \vert Y>0)=p_1+P(X\leq Y\vert X>0,Y>0)(1-p_1)$, since $X=0$ implies $X\leq Y$ whenever $Y>0$. With this, we obtain an expression for $\theta=P(X\leq Y)$ in terms of quantities that we can estimate. Note that
\begin{eqnarray*}
P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy,
\end{eqnarray*}
where $F_X$ is the CDF of the continuous part of $X$ and $f_Y$ is the PDF of the continuous part of $Y$ (in your case, a Lomax distribution).
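Putting the pieces together, $\theta$ can be written compactly as
\begin{eqnarray*}
\theta &=& p_1p_2+(1-p_2)\left[p_1+(1-p_1)P(X\leq Y\vert X>0,Y>0)\right]\\
&=& p_1+(1-p_1)(1-p_2)\int_0^{\infty}F_X(y)f_Y(y)dy,
\end{eqnarray*}
which is the expression estimated at the end of the code below.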
Now, how to estimate the parameters? I am going to use nonlinear least squares between the CDFs and the empirical CDFs. This method works in your case given the large sample size. Below is R code for conducting this estimation using a simulated sample of size $n=1000$.
rm(list=ls())
p0 = 0.75
alpha0 = 3
lambda0 = 1
# Function for simulating a Lomax variable
rlomax = function(n,alpha,lambda) return( lambda*( (1-runif(n))^(-1/alpha) - 1 ))
# Simulated data, X and Y
set.seed(1)
ns = 1000
simx = simy = rep(0,ns)
for(i in 1:ns){
u = runif(1)
if(u<p0) simx[i] = 0
else simx[i] = rlomax(1,alpha0,lambda0)
}
for(i in 1:ns){
u = runif(1)
if(u<p0) simy[i] = 0
else simy[i] = rlomax(1,alpha0,lambda0)
}
hist(simx,col="red")
hist(simy,add=T,col="blue")
# Distribution function of the mixture
FM = function(x,p,alpha,lambda){
temp = x
for(i in 1:length(x)){
if(x[i]==0) temp[i]=p
if(x[i]>0) temp[i] = p + (1-p)*( 1-(1+x[i]/lambda)^(-alpha) )
}
return(temp)
}
ecdfdatx = ecdf(simx)(sort(simx))
ecdfdaty = ecdf(simy)(sort(simy))
Datax = data.frame(sort(simx),ecdfdatx)
Datay = data.frame(sort(simy),ecdfdaty)
# Fit for the first data set
nls_fitx = nls(ecdfdatx ~ FM(sort(simx),p,alpha,lambda), data=Datax, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fitx
plot(ecdf(simx))
lines(sort(simx), predict(nls_fitx), col = "red")
# Fit for the second data set
nls_fity = nls(ecdfdaty ~ FM(sort(simy),p,alpha,lambda), data=Datay, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fity
plot(ecdf(simy))
lines(sort(simy), predict(nls_fity), col = "red")
With this code we obtain estimators of the parameters $(p_1,\alpha_X,\lambda_X,p_2,\alpha_Y,\lambda_Y)$. The remaining step consists of calculating $P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy$.
# remaining quantity
px.h = coef(nls_fitx)[1]
py.h = coef(nls_fity)[1]
ax.h = coef(nls_fitx)[2]
ay.h = coef(nls_fity)[2]
lx.h = coef(nls_fitx)[3]
ly.h = coef(nls_fity)[3]
# Lomax PDF
dlomax = function(x,alpha,lambda) return(alpha*(1+x/lambda)^(-(alpha+1))/lambda)
# Lomax CDF
plomax = function(x,alpha,lambda) return(1-(1+x/lambda)^(-alpha) )
tempf = function(x) plomax(x,ax.h,lx.h)*dlomax(x,ay.h,ly.h)
p.l = integrate(tempf,0,Inf)$value
# Estimator of theta
px.h + p.l*(1-px.h)*(1-py.h)
Similarly, since $P(X=Y)=p_1p_2$ (the continuous parts tie with probability zero), the estimator of $P(X\lt Y)=P(X\leq Y)-P(X=Y)$ can be calculated as follows
# Estimator of theta2
px.h*(1-py.h) + p.l*(1-px.h)*(1-py.h)
Then the quantity $P(X\leq Y)$ depends on the probabilities $p_1$ and $p_2$ and therefore this quantity may be misleading. For instance, if $X$ and $Y$ are i.i.d. and $p_1,p_2\approx 1$, we have that $P(X\lt Y)\approx 0$ and $P(X \leq Y)\approx 1$.
My conclusion: The stress-strength coefficient is not what you need for comparing the performance of both companies.
How to solve the problem?
I think this problem can be seen as a decision problem. You have two companies providing a programming service and you want to decide which one is better. Consider the hypothetical case where one of the companies produces a large proportion of codes with zero errors, but when a code does contain errors, the number of errors is likely to be large. Is this better than a company with a lower proportion of zero-error codes but smaller positive error counts?
A toy naive example. Suppose that your decision rule is based on the estimated proportions of 0-error codes as follows:
- Estimate $p_1$ and $p_2$. If $\hat{p}_1/\hat{p}_2\in(0.9,1.1)$, then proceed to estimate $\theta = P(X<Y)$ using only the positive quantities. If a $95\%$ confidence interval for $\theta$ contains the value $0.5$, then there is no criterion for choosing one of the companies. If this value is not contained in the confidence interval, then choose Company X if $P(X<Y)>0.5$ or Company Y if $P(X<Y)<0.5$.
- If the ratio of estimators $\hat{p}_1/\hat{p}_2<0.9$, then choose Company Y.
- If the ratio of estimators $\hat{p}_1/\hat{p}_2>1.1$, then choose Company X.
This (naive, I repeat) criterion favours companies that produce more 0-error codes and, when the companies seem to produce a similar proportion of 0-error codes, selects one based on the stress-strength coefficient of the continuous part.
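As a rough sketch of this rule (the estimates and confidence interval passed in here are hypothetical placeholders, not outputs of the code above):

```r
# Naive decision rule sketch. p1.hat and p2.hat are estimated
# proportions of 0-error codes; theta.hat and theta.ci are a point
# estimate and 95% CI for theta = P(X < Y) on the positive parts.
choose_company <- function(p1.hat, p2.hat, theta.hat, theta.ci) {
  ratio <- p1.hat / p2.hat
  if (ratio < 0.9) return("Company Y")
  if (ratio > 1.1) return("Company X")
  # Similar proportions of 0-error codes: fall back on theta
  if (theta.ci[1] <= 0.5 && 0.5 <= theta.ci[2]) return("no clear choice")
  if (theta.hat > 0.5) "Company X" else "Company Y"
}

choose_company(0.80, 0.70, NA, c(NA, NA))       # ratio > 1.1 -> "Company X"
choose_company(0.75, 0.76, 0.6, c(0.55, 0.65))  # CI excludes 0.5 -> "Company X"
```

The point of writing it out is only to make the branching explicit; the thresholds 0.9, 1.1 and 0.5 are as arbitrary as in the prose version.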
In order to conduct a proper analysis, one would need to select a proper loss function based on expert opinion in order to come up with a reasonable selection criterion. This would require more effort and I think it would fall outside the scope of this site, but I hope this answer gives you some help.
Some references of possible interest:
Statistical Decision Theory and Bayesian Analysis
Bayesian Theory
Bayesian Decision Analysis: Principles and Practice
It would also help to check the literature on software quality control and see the criteria adopted by some companies.
I'm not sure what the basis is for your colleague's claim -- but they should support the claims they make before you accept them as true -- there's an astonishing amount of misinformed folklore about. (How do they know that this is true? Do you have good reason to think it must be true in your case?)
Both tests assume$^\dagger$ continuous distributions and both are impacted by ties (however, it's relatively easy to deal with ties in the Mann-Whitney and some software will do so automatically).
--
$\dagger$ Edit: To support my claim of the assumption of continuity in respect of the Mann-Whitney (since whuber says I am wrong on this point, I had better justify it), I refer to the beginning of Mann and Whitney (1947):
1. Summary. Let $x$ and $y$ be two random variables with continuous cumulative distribution functions $f$ and $g$.
So for Mann and Whitney's version of the test, they do explicitly assume continuity - and not idly, since they do rely on it in their derivation. However, it's possible (as I mention later) to deal with ties in the Mann-Whitney by working out the distribution of the test statistic at the null under the pattern of ties, or by correctly computing the effect of ties on the variance of the statistic under the normal approximation (what's usually referred to as the 'adjustment for ties').
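For concreteness (a standard result, not specific to your data): with samples of sizes $n$ and $m$, $N=n+m$, and tie groups of sizes $t_j$, the tie-adjusted null variance of the rank-sum statistic $W$ under the normal approximation is
$$\operatorname{Var}(W)=\frac{nm}{12}\left[(N+1)-\frac{\sum_j\left(t_j^3-t_j\right)}{N(N-1)}\right],$$
which reduces to the familiar $nm(N+1)/12$ when there are no ties.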
--
For both tests, if the effect of ties is not properly dealt with, both kinds of error rate are impacted: their type I error rates are lowered, and lowering the significance level necessarily lowers power ($=1-\beta$).
It's not 100% clear to me which test might be the most impacted, nor under what circumstances, but offhand I'd have expected the greater sensitivity to generally go with the KS test* - and this is even before one 'adjusts' the Mann-Whitney for ties (i.e. if you used the normal approximation and used the variance for the no-ties case).
*(personally, I'd use simulation suited to the specific instance to see what the properties would be under the sorts of conditions you see, at those sample sizes.)
Below is an illustration of the impact on the distribution of p-values under identical population distributions with a moderate level of ties$^\ddagger$, with sample sizes of 33 and 67, under the default settings in R (which for the Mann-Whitney, at this sample size, uses the normal approximation with correct calculation of the variance in the presence of ties):
For the tests to work 'as advertised' under the null, these distributions should look close to uniform. As you see, the Mann-Whitney (at least when properly calculating the variance of the sum of the ranks under the presence of ties, as here) is indeed very close to uniform. Since (as we can see) for the Kolmogorov-Smirnov test the proportion of p-values below $\alpha$ will be much smaller than $\alpha$, the test is highly conservative, with corresponding effects on power. [If anything, the effect is somewhat stronger than I'd have anticipated.]
$\ddagger\,$(the impact on the variance of the test statistic is fairly small in percentage terms)
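That kind of check is easy to run directly. A sketch (assuming, purely for illustration, that ties arise from rounding normal draws to one decimal place; this need not match the tie pattern behind the illustration above):

```r
# Null distribution of p-values for the Mann-Whitney and
# Kolmogorov-Smirnov tests when ties are induced by rounding.
# Both samples come from the same population, so p-values should
# be close to uniform if a test behaves as advertised.
set.seed(1)
nsim <- 2000
p.mw <- p.ks <- numeric(nsim)
for (i in 1:nsim) {
  x <- round(rnorm(33), 1)  # rounding to 1 d.p. creates many ties
  y <- round(rnorm(67), 1)
  p.mw[i] <- suppressWarnings(wilcox.test(x, y)$p.value)
  p.ks[i] <- suppressWarnings(ks.test(x, y)$p.value)
}
# Rejection rates at the 5% level: near 0.05 for the tie-adjusted
# Mann-Whitney, smaller for the conservative KS test
mean(p.mw < 0.05)
mean(p.ks < 0.05)
```

Calling `hist(p.mw)` and `hist(p.ks)` then reproduces the kind of comparison described above.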
Further, if your interest lies in a location-shift alternative, the Mann-Whitney would have greater power against that alternative to start with, so even if it did lose more power as a result of the discreteness (which I doubt), it may still have more power afterward.
You don't say how heavily tied your data are, nor in what sort of pattern. If both tests are more impacted than you're prepared to accept, you can work with the permutation distribution of either test statistic for your data (or with the permutation distribution of some other statistic, including a difference in sample medians if you wish).
In spite of many books (especially in some particular areas of application) stating that it is, the Mann-Whitney is not actually a test for a difference in medians. However, if you additionally assume that the population distributions are the same under the null, and restrict the alternative to a location shift, then it's a test for a difference in any reasonable location measure - population medians, population lower quartiles, even population means (if they exist).
Indeed, one needn't restrict oneself to location-shift alternatives. Assuming identical distributions under the null against an alternative that will move medians (or any other measure of location) will work; so, for example, it would work perfectly well that way as a test of medians under an assumption of a scale shift. We must keep in mind, however, that the Mann-Whitney is a far more general test than that, and that when we rely on an assumption to make it a test for medians or whatever, we do actually lean on that assumption for the conclusion to mean what we want it to.
In short, which test do I trust?
Don't simply trust what anyone says (including me!) - unless they have solid evidence (I haven't brought any that's directly relevant to your situation, and none relating to power, because I haven't seen your pattern of ties and I am not 100% sure whether you're only interested in location shifts).
What kind of data do you have (what are you measuring, how are you measuring it, and how do ties arise)? What are you interested in finding out? Why do you mention medians?
Use simulation to find out how any tests you contemplate behave in circumstances similar to yours, and decide for yourself whether there's a problem to worry about. For both tests, see what the impact of ties is on the test, both under the null and under alternatives you care about, and then the case of the Mann-Whitney, see the effect of the adjustment for ties, and compare it with dealing with the exact permutation distribution (or in large samples like yours, with the randomization distribution). For the KS you can look at the exact permutation distribution as well.
Best Answer
Most of the work on non-parametrics was originally done assuming that there was an underlying continuous distribution, in which ties would be impossible (if measured accurately enough). The theory can then be based on the distributions of order statistics (which are a lot simpler without ties) or other formulas. In some cases the statistic works out to be approximately normal, which makes things really easy. When ties are introduced, either because the data were rounded or are naturally discrete, the standard assumptions do not hold. The approximation may still be good enough in some cases, but not in others, so often the easiest thing to do is just give a warning that these formulas don't work with ties.
There are tools for some of the standard non-parametric tests that have worked out the exact distribution when ties are present. The exactRankTests package for R is one example.
One simple way to deal with ties is to use randomization tests like permutation tests or bootstrapping. These don't worry about asymptotic distributions, but use the data as it is, ties and all (note that with a lot of ties, even these techniques may have low power).
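A minimal sketch of such a permutation test, using the rank-sum statistic on simulated count data (Poisson here, purely for illustration):

```r
# Permutation test based on the rank-sum statistic; ties need no
# special treatment because each permutation keeps the observed
# tie pattern intact.
set.seed(1)
x <- rpois(30, 2)   # heavily tied count data
y <- rpois(40, 2)
z <- c(x, y)
n.x <- length(x)
r <- rank(z)                   # midranks handle the ties
obs <- sum(r[1:n.x])           # observed rank sum for the x sample
perm <- replicate(5000, sum(r[sample(length(z), n.x)]))
# Two-sided permutation p-value
p.perm <- mean(abs(perm - mean(perm)) >= abs(obs - mean(perm)))
p.perm
```

The same scheme works with any statistic (e.g. a difference in sample medians): replace the rank sum with that statistic, recomputed on each permuted split of the pooled data.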
There was an article a few years back (I thought in The American Statistician, but I am not finding it) that discussed the idea of ties and some of the things that you can do with them. One point is that it depends on what question you are asking: what to do with ties can be very different in a superiority test vs. a non-inferiority test.