[Math] Definition and use of Empirical Cumulative Distribution Function (ECDF)

probability-distributions probability-theory statistical-inference

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with a common cdf $F(t)$. Then the empirical cdf is defined as

$$F_n(t) = \frac { \text{number of elements in the sample } \le \space t}{n}$$

My first question: is the sample (the one in the numerator) made up of random variables, or of something else?

Also, can someone explain the significance of the empirical cdf in comparison to the cumulative distribution function $F(t) = P(X \le t)$?

Best Answer

Sometimes one says that a histogram based on a large sample size gives a good idea about the shape of the population density function. (But information is lost in binning, and a modern 'density estimator' usually works better.)

In somewhat the same way, the empirical cumulative distribution function (ECDF) of a large sample is a good estimator of the population CDF.
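To make "good estimator" a little more concrete, the ECDF can be written as an average of indicator variables. The following is a sketch of standard facts; the indicator notation $\mathbf{1}\{\cdot\}$ is introduced here and does not appear in the question.

$$F_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le t\}, \qquad E[F_n(t)] = F(t), \qquad \operatorname{Var}[F_n(t)] = \frac{F(t)\,\bigl(1 - F(t)\bigr)}{n}.$$

So, for each fixed $t$, $F_n(t)$ is an unbiased estimator of $F(t)$ with variance shrinking like $1/n$. The numerator counts how many of the random variables $X_1, \ldots, X_n$ are at most $t$, which makes $F_n(t)$ itself a random variable; plugging in the observed data gives a number.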

The following R program samples 3000 observations from $\mathsf{Gamma}(5, 1)$ (shape 5, rate 1) to illustrate @Clement C's comment. The figure below shows the histogram (at left) along with the known population density (dotted) and a density estimator. At right, the population CDF (thin light green) is superimposed on the ECDF (heavy black) of the sample. A larger sample would show a better fit, but the fit might then be too close to distinguish the population and sample curves.

 x = rgamma(3000, 5, 1)   # generate random sample from Gamma(shape 5, rate 1)
 par(mfrow=c(1,2))        # two panels in one graph
   hist(x, prob=TRUE, col="wheat")          # histogram on the density scale
     lines(density(x), lwd=2, col="blue")   # density estimator
     curve(dgamma(x, 5, 1), lty="dotted", lwd=2, col="red", add=TRUE)  # pop PDF
   plot.ecdf(x)           # empirical CDF
     curve(pgamma(x, 5, 1), col="green", add=TRUE)  # pop CDF
 par(mfrow=c(1,1))        # returns to default single panel

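[Figure: histogram of the sample with the density estimator and population density (left); ECDF with the population CDF superimposed (right); n = 3000]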
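The closeness of the two curves can also be checked numerically. The sketch below is only an illustration: the seed, the evaluation point `t0 = 4`, and the helper name `Fn_by_hand` are arbitrary choices, not part of the program above. It computes the ECDF at one point straight from the definition, compares it with base R's `ecdf()`, and then finds the largest vertical gap between the ECDF and the population CDF.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rgamma(3000, 5, 1)          # same population and sample size as above

# ECDF at a point t, straight from the definition:
# the fraction of sample values that are <= t
Fn_by_hand <- function(t) mean(x <= t)

Fn <- ecdf(x)                    # base R returns the ECDF as a step function
t0 <- 4                          # illustrative evaluation point
c(by_hand = Fn_by_hand(t0), ecdf = Fn(t0), population = pgamma(t0, 5, 1))

# Largest vertical gap between the ECDF and the population CDF.
# The ECDF is a step function, so check just after and just before each jump.
n  <- length(x)
Fx <- pgamma(sort(x), 5, 1)      # population CDF at the ordered sample values
D  <- max(pmax((1:n) / n - Fx, Fx - (0:(n - 1)) / n))
D
```

Rerunning with `rgamma(100, 5, 1)` should typically give a noticeably larger gap `D`, in line with the remark that a larger sample shows a better fit.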

If you have access to R, you can try other population distributions and sample sizes. The same program as above, except with a sample of size $n = 100$, was used to produce the figure below. Roughly speaking, the ECDF gives a better estimate of the CDF than a histogram gives of the PDF. A 'nonparametric bootstrap' procedure uses the sample ECDF in place of the unknown population CDF.

[Figure: the same two panels for a sample of size n = 100]
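As a rough illustration of the bootstrap remark: drawing values with replacement from the observed sample is the same as drawing i.i.d. observations from the ECDF. The sketch below makes several assumptions that are not in the text above: the statistic (the median), the number of resamples `B`, the seed, and the simple percentile interval are all chosen here only for illustration.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
x <- rgamma(100, 5, 1)           # a sample of size n = 100, as in the second figure

B <- 2000                        # number of bootstrap resamples (arbitrary choice)

# Resampling x with replacement = sampling i.i.d. from the ECDF of x
med_star <- replicate(B, median(sample(x, replace = TRUE)))

median(x)                             # point estimate of the population median
quantile(med_star, c(0.025, 0.975))   # simple percentile bootstrap interval
qgamma(0.5, 5, 1)                     # true median, known only because the data are simulated
```

In a real application the population CDF is unknown, so the last line would not be available; the resampled medians stand in for the sampling distribution that cannot be observed directly.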