Solved – Statistical test for two distributions where only 5-number summary is known

descriptive statistics, distributions, hypothesis testing, nonparametric

I have two distributions where only the 5-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) and sample size are known. Contrary to the question here, not all data points are available.

Is there any non-parametric statistical test which allows me to check whether the underlying distributions of the two are different?

Best Answer

Under the null hypothesis that the distributions are the same and both samples are obtained randomly and independently from the common distribution, we can work out the sizes of all $5\times 5$ (deterministic) tests that can be made by comparing one letter value to another. Some of these tests appear to have reasonable power to detect differences in distributions.


Analysis

The original definition of the $5$-letter summary of any ordered batch of numbers $x_1 \le x_2 \le \cdots \le x_n$ is the following [Tukey EDA 1977]:

  • For any number $m = (i + (i+1))/2$ in $\{(1+2)/2, (2+3)/2, \ldots, (n-1+n)/2\}$ define $x_m = (x_i + x_{i+1})/2.$

  • Let $\bar{i} = n+1-i$.

  • Let $m = (n+1)/2$ and $h = (\lfloor m \rfloor + 1)/2.$

  • The $5$-letter summary is the set $\{X^{-} = x_1, H^{-}=x_h, M=x_m, H^{+}=x_\bar{h}, X^{+}=x_n\}.$ Its elements are known as the minimum, lower hinge, median, upper hinge, and maximum, respectively.

For example, in the batch of data $(-3, 1, 1, 2, 3, 5, 5, 5, 5, 7, 13, 21)$ we may compute that $n=12$, $m=13/2$, and $h=7/2$, whence

$$\begin{aligned} X^{-} &= -3, \\ H^{-} &= x_{7/2} = (x_3+x_4)/2 = (1+2)/2 = 3/2, \\ M &= x_{13/2} = (x_6+x_7)/2 = (5+5)/2 = 5, \\ H^{+} &= x_{\overline{7/2}} = x_{19/2} = (x_9+x_{10})/2 = (5+7)/2 = 6, \\ X^{+} &= x_{12} = 21. \end{aligned}$$
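As a concrete illustration, here is a minimal Python sketch of this definition (the function name and the handling of half-integer positions are my own reading of the rules above); it reproduces the letter values just computed:

```python
def five_letter_summary(batch):
    """Tukey 5-letter summary: minimum, lower hinge, median, upper hinge, maximum.
    A half-integer position i + 1/2 is read as the average of x_i and x_{i+1}."""
    x = sorted(batch)
    n = len(x)

    def letter(pos):                      # pos is 1-based and may be a half-integer
        i = int(pos)
        if pos == i:
            return x[i - 1]
        return (x[i - 1] + x[i]) / 2      # x_{(i + (i+1))/2} = (x_i + x_{i+1}) / 2

    m = (n + 1) / 2                       # median position
    h = (int(m) + 1) / 2                  # hinge position (floor(m) + 1) / 2
    return {"min": x[0], "lower hinge": letter(h), "median": letter(m),
            "upper hinge": letter(n + 1 - h), "max": x[-1]}

print(five_letter_summary([-3, 1, 1, 2, 3, 5, 5, 5, 5, 7, 13, 21]))
# {'min': -3, 'lower hinge': 1.5, 'median': 5.0, 'upper hinge': 6.0, 'max': 21}
```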

The hinges are close to (but usually not exactly the same as) the quartiles. If quartiles are used, note that in general they will be weighted arithmetic means of two of the order statistics and thereby will lie within one of the intervals $[x_i, x_{i+1}]$ where $i$ can be determined from $n$ and the algorithm used to compute the quartiles. In general, when $q$ is in an interval $[i, i+1]$ I will loosely write $x_q$ to refer to some such weighted mean of $x_i$ and $x_{i+1}$.
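To see this concretely with the example batch, one can check where a typical quartile algorithm lands relative to the order statistics. The snippet below (my own, using NumPy's default linear-interpolation percentile, which is just one of several common quartile rules) shows each quartile is a weighted mean of two adjacent order statistics and is close to, but not equal to, the corresponding hinge:

```python
import numpy as np

x = np.sort([-3, 1, 1, 2, 3, 5, 5, 5, 5, 7, 13, 21])
q1 = np.percentile(x, 25)    # NumPy's default: linear interpolation between order statistics
q3 = np.percentile(x, 75)

# Each quartile falls in the interval [x_i, x_{i+1}] bracketing its fractional position.
print(q1, x[2] <= q1 <= x[3])    # 1.75 True  (between x_3 = 1 and x_4 = 2; the lower hinge is 3/2)
print(q3, x[8] <= q3 <= x[9])    # 5.5  True  (between x_9 = 5 and x_10 = 7; the upper hinge is 6)
```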

With two batches of data $(x_i, i=1,\ldots, n)$ and $(y_j, j=1,\ldots,m),$ there are two separate five-letter summaries. We can test the null hypothesis that both are iid random samples of a common distribution $F$ by comparing one of the $x$-letters $x_q$ to one of the $y$-letters $y_r$. For instance, we might compare the upper hinge of $x$ to the lower hinge of $y$ in order to see whether $x$ is significantly less than $y$. This leads to a definite question: how to compute this chance,

$${\Pr}_F(x_q \lt y_r).$$

For fractional $q$ and $r$ this is not possible without knowing $F$. However, because $x_q \le x_{\lceil q \rceil} $ and $y_{\lfloor r \rfloor} \le y_r,$ we have a fortiori

$${\Pr}_F(x_q \lt y_r) \le {\Pr}_F(x_{\lceil q \rceil} \lt y_{\lfloor r \rfloor}).$$

We can thereby obtain universal (independent of $F$) upper bounds on the desired probabilities by computing the right hand probability, which compares individual order statistics. The general question in front of us is

What is the chance that the $q^\text{th}$ smallest of $n$ values will be less than the $r^\text{th}$ smallest of $m$ values drawn iid from a common distribution?

Even this does not have a universal answer unless we rule out the possibility that probability is too heavily concentrated on individual values: in other words, we need to assume that ties are not possible. This means $F$ must be a continuous distribution. Although this is an assumption, it is a weak one and it is non-parametric.


Solution

The distribution $F$ plays no role in the calculation, because upon re-expressing all values by means of the probability transform $F$, we obtain new batches

$$X^{(F)} = F(x_1) \le F(x_2) \le \cdots \le F(x_n)$$

and

$$Y^{(F)} = F(y_1) \le F(y_2) \le \cdots \le F(y_m).$$

Moreover, this re-expression is monotonic and increasing: it preserves order and in so doing preserves the event $x_q \lt y_r.$ Because $F$ is continuous, these new batches are drawn from a Uniform$[0,1]$ distribution. Under this distribution--and dropping the now superfluous "$F$" from the notation--we easily find that $x_q$ has a Beta$(q, n+1-q)$ = Beta$(q, \bar{q})$ distribution:

$$\Pr(x_q\le x) = \frac{n!}{(n-q)!(q-1)!}\int_0^x t^{q-1}(1-t)^{n-q}dt.$$
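This claim is easy to verify numerically; the following sketch (mine, assuming NumPy and SciPy are available) compares the empirical distribution of a uniform order statistic with the stated Beta distribution:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
n, q = 8, 5                                # the q-th smallest of n Uniform[0,1] draws
sims = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, q - 1]

for x in (0.25, 0.5, 0.75):
    print(x, (sims <= x).mean(), beta.cdf(x, q, n + 1 - q))
# empirical and Beta(q, n+1-q) probabilities agree to two or three decimal places
```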

Similarly the distribution of $y_r$ is Beta$(r, m+1-r)$. By performing the double integration over the region $x_q \lt y_r$ we can obtain the desired probability,

$$\Pr(x_q \lt y_r) = \frac{\Gamma (m+1) \Gamma (n+1) \Gamma (q+r)\, _3\tilde{F}_2(q,q-n,q+r;\ q+1,m+q+1;\ 1)}{\Gamma (r) \Gamma (n-q+1)}$$

Because all values $n, m, q, r$ are integral, all the $\Gamma$ values are really just factorials: $\Gamma(k) = (k-1)! = (k-1)(k-2)\cdots(2)(1)$ for integral $k\ge 1.$ The little-known function $_3\tilde{F}_2$ is a regularized hypergeometric function. In this case it can be computed as a rather simple alternating sum of length $n-q+1$, normalized by some factorials:

$$\Gamma(q+1)\Gamma(m+q+1)\ {_3\tilde{F}_2}(q,q-n,q+r;\ q+1,m+q+1;\ 1) \\ =\sum_{i=0}^{n-q}(-1)^i \binom{n-q}{i} \frac{q(q+r)\cdots(q+r+i-1)}{(q+i)(1+m+q)(2+m+q)\cdots(i+m+q)} \\ = 1 - \frac{\binom{n-q}{1}q(q+r)}{(1+q)(1+m+q)} + \frac{\binom{n-q}{2}q(q+r)(1+q+r)}{(2+q)(1+m+q)(2+m+q)} - \cdots.$$

This has reduced the calculation of the probability to nothing more complicated than addition, subtraction, multiplication, and division. The computational effort scales as $O((n-q)^2).$ By exploiting the symmetry

$$\Pr(x_q \lt y_r) = 1 - \Pr(y_r \lt x_q)$$

the new calculation scales as $O((m-r)^2),$ allowing us to pick the easier of the two sums if we wish. This will rarely be necessary, though, because $5$-letter summaries tend to be used only for small batches, rarely exceeding $n, m \approx 300.$
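As a sketch of how this calculation might be coded (the function name is mine; exact rational arithmetic avoids any rounding concerns), the prefactor in the probability formula and the alternating sum combine as follows:

```python
from fractions import Fraction
from math import comb, factorial

def prob_less(n, q, m, r):
    """P(x_q < y_r): the q-th smallest of n values falls below the r-th smallest of
    m values, both samples drawn iid from the same continuous distribution."""
    # Prefactor Gamma(m+1) Gamma(n+1) Gamma(q+r) / [Gamma(r) Gamma(n-q+1) Gamma(q+1) Gamma(m+q+1)]
    pre = Fraction(factorial(m) * factorial(n) * factorial(q + r - 1),
                   factorial(r - 1) * factorial(n - q) * factorial(q) * factorial(m + q))
    # Alternating sum equal to Gamma(q+1) Gamma(m+q+1) 3F~2(q, q-n, q+r; q+1, m+q+1; 1)
    total = Fraction(0)
    for i in range(n - q + 1):
        num, den = Fraction(q), Fraction(q + i)
        for k in range(i):
            num *= q + r + k            # q (q+r)(q+r+1) ... (q+r+i-1)
            den *= m + q + 1 + k        # (q+i)(1+m+q)(2+m+q) ... (i+m+q)
        total += (-1) ** i * comb(n - q, i) * num / den
    return pre * total

print(float(prob_less(8, 1, 12, 1)))    # 0.4: the minimum of 8 values beats the minimum of 12
print(float(prob_less(8, 3, 12, 1)))    # about 0.0491, the (3,1) entry of the table below
```

The nested loop makes the $O((n-q)^2)$ effort explicit, and the symmetry noted above amounts to exchanging the roles of the two batches and subtracting from one, i.e. computing `1 - prob_less(m, r, n, q)` when that sum is shorter.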


Application

Suppose the two batches have sizes $n=8$ and $m=12$. The relevant order statistics for $x$ and $y$ are $1,3,5,7,8$ and $1,3,6,9,12,$ respectively. Here is a table of the chance that $x_q \lt y_r$ with $q$ indexing the rows and $r$ indexing the columns:

q\r  1        3        6        9        12
1    0.4      0.807    0.9762   0.9987   1.
3    0.0491   0.2962   0.7404   0.9601   0.9993
5    0.0036   0.0521   0.325    0.7492   0.9856
7    0.0001   0.0032   0.0542   0.3065   0.8526
8    0.       0.0004   0.0102   0.1022   0.6

A simulation of 10,000 iid sample pairs from a standard Normal distribution gave results close to these.
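A sketch of such a check (my own code, not necessarily the simulation used above) tallies how often each relevant order statistic of one simulated batch falls below each order statistic of the other:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 8, 12, 10_000
qs, rs = [1, 3, 5, 7, 8], [1, 3, 6, 9, 12]      # the order statistics compared in the table

x = np.sort(rng.standard_normal((reps, n)), axis=1)
y = np.sort(rng.standard_normal((reps, m)), axis=1)

# Empirical estimate of P(x_q < y_r) for every (q, r) pair in the table.
for q in qs:
    print(q, [round((x[:, q - 1] < y[:, r - 1]).mean(), 4) for r in rs])
```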

To construct a one-sided test at size $\alpha,$ such as $\alpha = 5\%,$ to determine whether the $x$ batch is significantly less than the $y$ batch, look for values in this table close to or just under $\alpha$. Good choices are at $(q,r)=(3,1),$ where the chance is $0.0491,$ at $(5,3)$ with a chance of $0.0521$, and at $(7,6)$ with a chance of $0.0542.$ Which one to use depends on your thoughts about the alternative hypothesis. For instance, the $(3,1)$ test compares the lower hinge of $x$ to the smallest value of $y$ and finds a significant difference when that lower hinge is the smaller one. This test is sensitive to an extreme value of $y$; if there is some concern about outlying data, this might be a risky test to choose. On the other hand the test $(7,6)$ compares the upper hinge of $x$ to the median of $y$. This one is very robust to outlying values in the $y$ batch and moderately robust to outliers in $x$. However, it compares middle values of $x$ to middle values of $y$. Although this is probably a good comparison to make, it will not detect differences in the distributions that occur only in either tail.

Being able to compute these critical values analytically helps in selecting a test. Once one (or several) tests are identified, their power to detect changes is probably best evaluated through simulation. The power will depend heavily on how the distributions differ. To get a sense of whether these tests have any power at all, I conducted the $(5,3)$ test with the $y_j$ drawn iid from a Normal$(1,1)$ distribution: that is, its median was shifted by one standard deviation. In a simulation the test was significant $54.4\%$ of the time: that is appreciable power for datasets this small.
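A power calculation of this sort might be sketched as follows (my code; the $(5,3)$ test rejects, concluding the $x$ batch is lower, when the fifth smallest $x$ falls below the third smallest $y$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, reps = 8, 12, 100_000

x = np.sort(rng.normal(0.0, 1.0, size=(reps, n)), axis=1)
y = np.sort(rng.normal(1.0, 1.0, size=(reps, m)), axis=1)   # median shifted up by one SD

reject = x[:, 5 - 1] < y[:, 3 - 1]     # the (5, 3) comparison
print(reject.mean())                   # roughly 0.54, consistent with the power quoted above
```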

Much more can be said, but all of it is routine stuff about conducting two-sided tests, how to assess effect sizes, and so on. The principal point has been demonstrated: given the $5$-letter summaries (and sizes) of two batches of data, it is possible to construct reasonably powerful non-parametric tests to detect differences in their underlying populations and in many cases we might even have several choices of test to select from. The theory developed here has a broader application to comparing two populations by means of appropriately selected order statistics from their samples (not just those approximating the letter summaries).

These results have other useful applications. For instance, a boxplot is a graphical depiction of a $5$-letter summary. Thus, along with knowledge of the sample sizes shown by the boxplots, we have available a number of simple tests (based on comparing parts of one box and whisker to another) to assess the significance of visually apparent differences in those plots.