Probability Distributions – Violating an Order Statistic Inequality

Tags: pr.probability, probability-distributions, simulation, st.statistics

[Edit: for posterity, I'm adding two small comments to the code explaining how to fix it, in light of Iosif Pinelis' answer below. Look for "Should be:" to find the corrections.]

Suppose we draw a sample of $n$ i.i.d. observations from some continuous distribution $F$ on $\mathbb{R}$. Sort these observations from smallest to largest and call them $x_1,\ldots,x_n$. We consider the quantile of each observation under the true distribution, $q_i = F(x_i)$, giving $q_1,\ldots,q_n$. We are interested in the probability that the $q_i$ are all simultaneously close to $i/n$.

The DKW inequality shows that
$$ \Pr\left[\max_{i=1,\ldots,n}\left|q_i-\frac{i}{n}\right|>\varepsilon\right] \leq 2 e^{-2n\varepsilon^2} $$
Note that this result holds for all $n$ and there are no hidden constants.
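
Inverting the bound: to guarantee a failure probability of at most $\alpha$, it suffices to take $\varepsilon=\sqrt{\ln(2/\alpha)/(2n)}$. Here is a minimal sketch of this inversion (the helper name dkw_epsilon is just illustrative):

import numpy as np

def dkw_epsilon(n, alpha):
    # Solve 2*exp(-2*n*eps^2) = alpha for eps; note that alpha is the
    # probability of *violating* the bound, not of satisfying it.
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

print(dkw_epsilon(20, 0.01))  # ~0.364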

It is extremely straightforward to run a Monte Carlo simulation of this process. The uniform distribution on $[0,1]$ is particularly convenient to use since each observation in the sample equals its own quantile (i.e., $x_i=q_i$). I've shared some Python code below; 1 million Monte Carlo trials take about 10 seconds.

My simulation appears to violate the DKW inequality. Question: What's going on?

# Double-check the DKW Inequality
import numpy as np

################
# Set parameters
# 
# "prob_target" is the probability of *satisfying* the bound
# (i.e., that the gap is less than epsilon)
prob_target = 0.99
# We consider the CDF of "n" samples
n=20
# We are going to use "num_trials" Monte Carlo runs, where each run
# consists of "n" draws from a Uniform distribution
num_trials=1000000
print(f"Parameters: n={n}, # trials={num_trials}, target probability={prob_target}")

#######################
# Compute DKW threshold
epsilon_DKW = np.sqrt(np.log(2.0 / prob_target) / (2.0 * float(n)))
#   Should be:
# epsilon_DKW = np.sqrt(np.log(2.0 / (1-prob_target)) / (2.0 * float(n)))
# Double-check that we actually computed the intended probability;
# hopefully prob_target == prob_DKW
prob_DKW = 2 * np.exp(-2 * n * (epsilon_DKW**2))
#   Should be:
# prob_DKW = 1 - 2 * np.exp(-2 * n * (epsilon_DKW**2))
assert np.isclose(prob_target, prob_DKW)
print(f"Computed parameters: epsilon_DKW ={epsilon_DKW:.6f}, prob_DKW={prob_DKW:.6f}")

########################
# Monte Carlo simulation
disparity_list = np.zeros(num_trials)
ecdf = np.arange(1, n+1)/n  # = [1/n, 2/n, ..., n/n]
for trial in range(num_trials):
    data = np.random.uniform(size=n)
    data = np.sort(data)
    quantiles = data  # ...because uniform distribution on [0,1]
    worst_disparity = np.max(np.abs(quantiles-ecdf))

    disparity_list[trial] = worst_disparity
# Compute fraction of trials with disparity below the DKW bound
prob_true = np.sum(disparity_list < epsilon_DKW) / num_trials
# Compute the *actual* bound such that prob_DKW fraction of
# trials have disparity below the bound
epsilon_true = np.quantile(disparity_list, prob_DKW)

###############
# Print results
print(f"Measured threshold:  epsilon_best={epsilon_true:.6f}")
if prob_true < prob_DKW:
    print("\nWe have a problem.")
    print(f"DKW promises success probability at least {prob_DKW:.6f}, ")
    print(f"but we only observe probability           {prob_true:.6f}")
    print(f"The required epsilon is {epsilon_true / epsilon_DKW:.6f}x larger than DKW!")
else:
    print(f"DKW Inequality works!  Success probability={prob_true:.6f}>={prob_DKW:.6f}")

Here's the result of the run:

Parameters: n=20, # trials=1000000, target probability=0.99
Computed parameters: epsilon_DKW =0.132589, prob_DKW=0.990000
Measured threshold:  epsilon_best=0.335025

We have a problem.
DKW promises success probability at least 0.990000, 
but we only observe probability           0.333689
The required epsilon is 2.526787x larger than DKW!
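
For reference, applying the two "Should be:" corrections yields the threshold DKW actually promises; a minimal check, using the same parameters as above:

import numpy as np

n, prob_target = 20, 0.99
# Corrected inversion: the tail bound 2*exp(-2*n*eps^2) must equal the
# *failure* probability 1 - prob_target, not prob_target itself.
epsilon_fixed = np.sqrt(np.log(2.0 / (1.0 - prob_target)) / (2.0 * n))
print(epsilon_fixed)  # ~0.364, above the measured epsilon_best of 0.335025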

Best Answer

The DKW inequality can be stated as follows: \begin{equation} p_n(z):=P(\|S_n\|_\infty>z)\le 2e^{-2z^2} \tag{1}\label{1} \end{equation} for $z>0$, where \begin{equation} S_n(t):=\frac1{\sqrt n}\,\sum_{i=1}^n Y_i(t),\quad Y_i(t):=1(X_i\le t)-t, \end{equation} and the $X_i$'s are iid random variables (r.v.'s) each uniformly distributed on $[0,1]$.
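
To connect this with the statement in the question: $\|S_n\|_\infty=\sqrt n\,\sup_{t\in[0,1]}|F_n(t)-t|$, where $F_n$ is the empirical CDF of the $X_i$'s, so setting $z=\varepsilon\sqrt n$ in \eqref{1} gives \begin{equation} P\Bigl(\sup_{t\in[0,1]}|F_n(t)-t|>\varepsilon\Bigr)\le 2e^{-2n\varepsilon^2}, \end{equation} and since $\max_i|q_i-\tfrac in|\le\sup_t|F_n(t)-t|$, the form used in the question follows.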

You seem to say that your simulations appear to suggest that the constant factor $2$ in the exponent of $e^{-2z^2}$ in the bound in \eqref{1} must be replaced by a constant factor that is about $2.5$ times worse than the constant factor $2$.


This suggestion is extremely unlikely to be true, given the very long and rich history of the DKW inequality. Apparently, this history begins with Kolmogorov and Smirnov, who showed that
\begin{equation} p_n(z)\to p(z):=2\sum_{k=1}^\infty(-1)^{k-1}e^{-2k^2z^2} \end{equation} for each $z>0$ as $n\to\infty$. Note that $p(z)\le 2e^{-2z^2}$ for all $z>0$ and $p(z)\sim 2e^{-2z^2}$ as $z\to\infty$. This strongly suggests that $2$ is the right (and best possible) constant factor in the exponent of $e^{-2z^2}$ in \eqref{1}.
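
As a quick numerical illustration (a small sketch, not part of the argument), one can compare partial sums of $p(z)$ against $2e^{-2z^2}$:

import numpy as np

def kolmogorov_tail(z, terms=100):
    # Partial sum of p(z) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2*k^2*z^2);
    # the alternating series converges very quickly for moderate z.
    k = np.arange(1, terms + 1)
    return 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * z**2))

for z in (0.5, 1.0, 1.5):
    print(z, kolmogorov_tail(z), 2.0 * np.exp(-2.0 * z**2))
# p(z) stays below 2*exp(-2*z^2) and approaches it as z grows.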

One may also note here that the iid random functions $Y_i$ are zero-mean, each with the covariance function $[0,1]^2\ni(s,t)\mapsto EY_i(s)Y_i(t)=\min(s,t)-st$. So, the mean and covariance functions of each random function $Y_i$ are the same as those of the Brownian bridge. By an appropriate functional central limit theorem, it is at least plausible that the random function $S_n$ converges in an appropriate sense to a Gaussian process. Since the distribution of a Gaussian process is determined by its mean and covariance functions, it follows that the random function $S_n$ converges in an appropriate sense to a Brownian bridge $B^0$. So, it is likely that \begin{equation} p_n(z)\to P(\|B^0\|_\infty>z)=p(z); \tag{2}\label{2} \end{equation} the latter equality is the classical formula for the distribution of the supremum of the absolute value of a Brownian bridge. Of course, \eqref{2} is not only plausible, but true and well known. In particular, in view of (say) Talagrand's Theorem 1.1 with $\mathcal F=\{1_{[0,t]}\colon t\in[0,1]\}$, $\|S_n-B^0\|_\infty$ converges to $0$ in probability as $n\to\infty$.
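
One can also check \eqref{2} numerically by discretizing a Brownian bridge on a grid; a rough sketch (the grid size and trial count are arbitrary choices, and the discretization slightly underestimates the supremum):

import numpy as np

rng = np.random.default_rng(0)
m, trials, z = 2000, 20000, 1.0
t = np.arange(1, m + 1) / m
sups = np.empty(trials)
for i in range(trials):
    # Brownian motion on the grid, pinned into a bridge via W(t) - t*W(1)
    w = np.cumsum(rng.normal(scale=np.sqrt(1.0 / m), size=m))
    sups[i] = np.max(np.abs(w - t * w[-1]))
print(np.mean(sups > z))  # should be close to p(1) ≈ 0.27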

So, again, it is extremely unlikely that the DKW inequality is false. It is much more likely that the problem is with the software you were using and/or with your code, which I cannot read.


Below is an image of a Mathematica notebook confirming your inequality:

[Image: Mathematica notebook numerically verifying the DKW bound]
