Can the mean of an empirical CDF be different than .5?


I was asked to convert a series of values into the corresponding series of percentiles, computed with respect to the empirical CDF built from that same series.

Using R, I wrote:

toPercTS <- function(aSeries) {
  # Build the empirical CDF of the series.
  ECDF <- ecdf(aSeries)
  # ecdf() returns a vectorized function, so it can be evaluated
  # at every point of the series at once, without an explicit loop.
  ECDF(aSeries)
}

The function ecdf returns the empirical cdf of a series. A detail surprised me:

> max(toPercTS(someSeries))
[1] 1
> min(toPercTS(someSeries))
[1] 0.00990099
> mean(toPercTS(someSeries))
[1] 0.5049505

The mean is slightly greater than .5 whereas I thought it would be exactly equal to 0.5, by construction. Where am I wrong?

Best Answer

Suppose we have data $x_1, \dots, x_n$ where each $x_i$ is an iid realization of some random variable $X \sim f$. Then the ECDF is $$ \hat F_n(x) = \frac 1n \sum_{i=1}^n \mathbf 1(x_i \leq x). $$

Your question is whether $\frac 1n \sum_i \hat F_n(x_i) \stackrel ?= 0.5$.

Let's first assume that all of the datapoints are unique, and WLOG let's assume that they're sorted so that $x_1 < \dots < x_n$.

For some $i$, think about the sum $\sum_j \mathbf 1(x_j \leq x_i)$. Suppose $i=5$. Then we know $x_1 < \dots < x_4 < x_5 < x_6 < \dots < x_n$, so the first 5 terms of the sum are 1 and the rest are 0. In general, this sum counts the number of datapoints less than or equal to $x_i$, which, since they are sorted and unique, is $i$. Hence $\hat F_n(x_i) = i/n$, and using $\sum_{i=1}^n i = \frac{n(n+1)}2$ we have $$ \frac 1n \sum_i \hat F_n(x_i) = \frac 1{n^2} \sum_i i = \frac{n+1}{2n} $$

so it's very close to, but not quite equal to, 1/2 (in the continuous case, where ties almost surely do not occur). The gap is $\frac 1{2n}$, which shrinks as $n$ grows.
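The counting step above can be checked numerically. This is only a minimal sketch, assuming distinct values; the seed and `n = 10` are arbitrary choices made for illustration. Evaluated at the sorted data, the ECDF should return $1/n, 2/n, \dots, n/n$:

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
n <- 10                     # arbitrary sample size
x <- rnorm(n)               # continuous draws, so ties are almost surely absent
Fn <- ecdf(x)

# At the i-th smallest point, exactly i observations are <= it,
# so the ECDF value there should be i / n.
vals <- Fn(sort(x))
stopifnot(isTRUE(all.equal(vals, seq_len(n) / n)))
```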

If the $x_i$ are not all unique then this does not necessarily hold. Suppose we have $n = 3$ with $x_1 = 1$ and $x_2 = x_3 = 2$. Then $\hat F_n(x_1) = \frac 13$ and $\hat F_n(x_2) = \hat F_n(x_3) = 1$, so $$ \frac 1n \sum_i \hat F_n(x_i) = \frac{1 + 3 + 3}{3^2} = \frac 79 \neq \frac{3 + 1}{2 \times 3} = \frac 23. $$
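The tied example can be reproduced in R. The sketch below also checks a generalisation that is not part of the answer above and is offered only as an added observation: if the distinct values occur with multiplicities $m_1, \dots, m_K$, the same counting argument gives a mean of $\frac 12 + \frac{\sum_k m_k^2}{2n^2}$. The variable names are illustrative.

```r
x <- c(1, 2, 2)                     # the tied example from the text
n <- length(x)
e <- ecdf(x)(x)                     # ECDF values: 1/3, 1, 1

tied_mean   <- mean(e)              # should equal 7/9
unique_mean <- (n + 1) / (2 * n)    # the no-ties formula, 2/3

# Added observation (an assumption of this sketch, not from the answer):
# with tie-group sizes m_k, the mean equals 1/2 + sum(m_k^2) / (2 * n^2).
m <- as.vector(table(x))            # group sizes: 1 and 2
general <- 0.5 + sum(m^2) / (2 * n^2)

stopifnot(isTRUE(all.equal(tied_mean, 7 / 9)))
stopifnot(isTRUE(all.equal(tied_mean, general)))
stopifnot(tied_mean > unique_mean)
```

Since every $m_k \geq 1$, we have $\sum_k m_k^2 \geq n$, so under this identity ties can only push the mean above $\frac{n+1}{2n}$, never below it.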

An example in R (taking advantage of how the resulting elements of x are almost surely unique):

n <- 24
x <- rnorm(n)
e <- ecdf(x)(x)
mean(e)            # 0.5208333
(n + 1) / (2 * n)  # 0.5208333
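The same formula appears to account for the numbers in the question. The reported minimum, 0.00990099, equals $1/101$, which suggests (an inference, since the question does not state the length) that `someSeries` had 101 distinct values. Under that assumption, any 101 distinct values yield the same ECDF values, so a stand-in series is enough:

```r
n <- 101                    # assumed length of someSeries, inferred from min = 1/n
x <- seq_len(n)             # stand-in data: any n distinct values behave the same
e <- ecdf(x)(x)

min(e)                      # 1/101, matching the reported minimum
max(e)                      # 1
mean(e)                     # (n + 1) / (2 * n) = 102/202

stopifnot(isTRUE(all.equal(min(e), 1 / 101)))
stopifnot(isTRUE(all.equal(mean(e), (n + 1) / (2 * n))))
```

Here $102/202 \approx 0.5049505$, which agrees with the mean reported in the question.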