Hypothesis Testing – How to Test if One Empirical CDF is to the Left or Right of Another

Tags: boxplot, cumulative distribution function, hypothesis testing

I am currently working with a box plot (shown below) that consists of two boxes per value of one of the independent variables (call it $x$). The other independent variable is indicated by the two boxes (call it $y$). The blue box represents the dependent variable (call it $z$) under the condition $y = 1$, while the red box represents $z$ under the condition $y = 2$. The bottom and top whiskers represent the 10th and 90th percentiles respectively, while the bottom and top edges of a box represent the 25th and 75th percentiles respectively. The median is marked in black.

My hypothesis is that, when $y = 1$ (blue box), the empirical CDF of $z$ is to the right of the empirical CDF of $z$ when $y = 2$ (red box) "in general" (on average or otherwise) for every value of $x$. This relationship (although not particularly strong) can be seen in the plot below. However, I am not sure how to phrase this precisely in terms of a statistical test.

One possibility I considered was a two-sample Kolmogorov-Smirnov test for each value of $x$, but I am not sure how helpful this would be. Another possibility relies on the fact that the data were generated in pairs, i.e., each value of $z$ when $y = 1$ can be matched to a specific value of $z$ when $y = 2$: I could subtract the value of $z$ when $y = 2$ from the corresponding value of $z$ when $y = 1$, and then check whether the differences are always (or mostly) positive. Any suggestions would be appreciated.

[Figure: box plots of $z$ for each value of $x$, with a blue box for $y = 1$ and a red box for $y = 2$]

Best Answer

Maybe you're interested in whether sample y stochastically dominates sample x. If so, you might want to look directly at ECDF plots, and do some formal tests.
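For reference, here is a sketch of the most common (first-order) definition. It writes $F_x$ and $F_y$ for the population CDFs behind samples x and y; this notation is assumed here and does not come from the question. Under it, y stochastically dominates x if

$$F_y(t) \le F_x(t) \quad \text{for all } t, \qquad \text{with } F_y(t) < F_x(t) \text{ for at least one } t.$$

In words, the CDF of y lies on or below (equivalently, on or to the right of) the CDF of x everywhere.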

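Here are summaries and ECDF plots of two fictitious samples, x and y. (These names denote the two samples themselves, not the variables $x$ and $y$ of the question.)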

summary(x); length(x);  sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.067  14.628  21.012  21.807  28.297  53.943 
[1] 30        # sample size
[1] 10.56207  # sample SD

summary(y); length(y);  sd(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.81   25.30   29.27   29.25   32.56   45.47 
[1] 30
[1] 8.098928

plot(ecdf(x), col="blue", main="ECDFs", xlab="values")
lines(ecdf(y), col="brown")

[Figure: ECDF plots of x (blue) and y (brown)]

Because the ECDF of y (brown) plots to the right of, and therefore below, the ECDF of x (blue), it seems that the values of y are generally larger than the values of x.
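One way to check this visual impression numerically is to evaluate both ECDFs on the pooled sample values and look at the sign of their difference. The following is a minimal sketch, assuming the samples are regenerated as in Note (2) below; the names `Fx`, `Fy`, `grid`, `d` and `frac_above` are introduced here for illustration only.

```r
# Regenerate the fictitious samples (same calls as in Note (2))
set.seed(2022)
x <- rgamma(30, shape = 4, rate = 1/5)
y <- rgamma(30, shape = 5, rate = 1/5) + 7

Fx <- ecdf(x)            # step function: ECDF of x
Fy <- ecdf(y)            # step function: ECDF of y

grid <- sort(c(x, y))    # both ECDFs only change at observed values
d    <- Fx(grid) - Fy(grid)

frac_above <- mean(d >= 0)  # share of grid points where ECDF of x is at or above ECDF of y
frac_above
max(d)                      # largest amount by which ECDF of x exceeds ECDF of y
min(d)                      # most negative gap (0 if the ECDFs never cross)
```

If `min(d)` is 0, the ECDF of x never falls below that of y in the sample, which is the sample analogue of y lying entirely to the right. A small negative value indicates that the curves cross somewhere.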

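A two-sample Kolmogorov-Smirnov test rejects the null hypothesis that the two samples come from the same distribution, with a P-value below 5%. Because the test as run here is two-sided, it establishes only that the distributions differ; the direction of the difference is read from the ECDF plot. The test statistic $D$ is the maximum vertical distance between the two ECDF plots.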

ks.test(x,y)

        Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.43333, p-value = 0.006548
alternative hypothesis: two-sided
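The output above is for a two-sided alternative. For a directional question, `ks.test` also accepts a one-sided `alternative` argument. Note that for this function the alternative refers to the CDFs, not to the values: `alternative = "greater"` means that the CDF of x lies above (hence to the left of) the CDF of y. A minimal sketch, assuming the samples are regenerated as in Note (2) below; the name `ks_one_sided` is introduced here for illustration:

```r
# Regenerate the fictitious samples (same calls as in Note (2))
set.seed(2022)
x <- rgamma(30, shape = 4, rate = 1/5)
y <- rgamma(30, shape = 5, rate = 1/5) + 7

# One-sided alternative: the CDF of x lies above the CDF of y,
# i.e. x tends to take smaller values than y.
ks_one_sided <- ks.test(x, y, alternative = "greater")
ks_one_sided
```

A small P-value here supports the claim that the CDF of x is above that of y at some point. It does not by itself show that the ordering holds at every point, so the ECDF plot remains a useful companion.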

When two samples are not of the same shape (including the same variability), a two-sample Wilcoxon rank sum test is said to be a test of stochastic dominance (rather than of a difference in medians).

boxplot(x, y, horizontal=TRUE, col=c("skyblue2", "wheat"))

[Figure: horizontal box plots of x and y]

wilcox.test(x,y)

        Wilcoxon rank sum test

data:  x and y
W = 236, p-value = 0.001292
alternative hypothesis: true location shift is not equal to 0
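The Wilcoxon test can also be run with a one-sided alternative, and its statistic has a direct interpretation. In R, `W` counts the (x, y) pairs in which the x value is the larger one, so with no ties the proportion of pairs with x below y is $1 - W/(n_x n_y)$. With the W = 236 reported above, that is $1 - 236/900 \approx 0.74$. A minimal sketch, assuming the samples are regenerated as in Note (2) below; the names `w_one_sided` and `p_hat` are introduced here for illustration:

```r
# Regenerate the fictitious samples (same calls as in Note (2))
set.seed(2022)
x <- rgamma(30, shape = 4, rate = 1/5)
y <- rgamma(30, shape = 5, rate = 1/5) + 7

# One-sided alternative: x tends to take smaller values than y.
# (Unlike ks.test, the direction here refers to the values, not the CDFs.)
w_one_sided <- wilcox.test(x, y, alternative = "less")
w_one_sided

# Proportion of all 30 * 30 (x, y) pairs in which the x value is the smaller one;
# an estimate of P(X < Y)
p_hat <- mean(outer(x, y, "<"))
p_hat
```

An estimate of $P(X < Y)$ well above 1/2 is one concrete way to express "y is generally larger than x", although it is a weaker statement than the ECDF of y lying to the right of the ECDF of x everywhere.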

Notes: (1) Technically speaking, there are several different types of 'stochastic dominance' with somewhat different definitions. You may be interested in googling that. Perhaps start here.

(2) The fictitious samples used in the above discussion were sampled in R as follows:

set.seed(2022)
x = rgamma(30, 4, 1/5)
y = rgamma(30, 5, 1/5) + 7
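
On the pairing mentioned in the question: the samples above were generated independently, so they cannot illustrate a paired analysis. The following is a minimal sketch using hypothetical paired data. The names `z1` and `z2`, the seed, and the size of the simulated effect are all invented for illustration. It takes the within-pair differences and applies a sign test and a Wilcoxon signed-rank test, both one-sided.

```r
set.seed(1)                                   # arbitrary seed for this illustration
n  <- 30
z2 <- rgamma(n, shape = 4, rate = 1/5)        # hypothetical z under y = 2
z1 <- z2 + rnorm(n, mean = 3, sd = 5)         # hypothetical matched z under y = 1

d <- z1 - z2                                  # within-pair differences
mean(d > 0)                                   # share of pairs with z1 > z2

# Sign test: is a positive difference more likely than a negative one?
sign_test <- binom.test(sum(d > 0), sum(d != 0), p = 0.5,
                        alternative = "greater")
sign_test

# Wilcoxon signed-rank test: also uses the sizes of the differences
signed_rank <- wilcox.test(z1, z2, paired = TRUE, alternative = "greater")
signed_rank
```

The sign test addresses "mostly positive" most directly, because it uses only the signs of the differences. The signed-rank test is usually more powerful but assumes that the differences are symmetrically distributed about their center.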