[Math] Empirical Distribution Function Understanding

probability distributions, probability theory

I'm studying this topic by myself and I'm pretty sure there will be some big misunderstanding on my part, so please be patient with me.

Given the sample $X_1,\ldots, X_N$, iid with distribution $F$, the Empirical (Cumulative) Distribution Function (EDF) is the random probability measure $F_N:\mathbb{R}\rightarrow [0,1]$, such that
$$F_N(x)=\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$$

where $I$ is the indicator function.

My problems are about the definition itself. Besides the explanations, examples are also welcome. I just want to get a good understanding about what is going on. Here go my doubts:

1) $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$ is a sum of functions, not a real value in $[0,1]$. Should it be $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)(x)$, or something else? How should I interpret and calculate $I(X_i \leq x)$?

2) What exactly is a sample? At the moment, I think it is a vector $(X_1(\omega),\ldots,X_N(\omega))$, where $\omega$ is some fixed outcome (that for some reason no one ever mentions) and the $X_i$ are random "random variables" from the set of random variables with distribution $F$. Could someone give a formal definition of a sample?

3) In my understanding, $X_i\leq x=\{\omega\in\Omega: \ X_i(\omega)\leq x\}$, so it is necessary to evaluate $X_i(\omega)$ for some $\omega$ to calculate $I(X_i\leq x)$. I have already seen the definition $I(X_i\leq x)=1 \textrm{ if }X_i\leq x \textrm{, or }0\textrm{ otherwise}$, but how am I supposed to know whether $X_i\leq x$ without some $\omega\in\Omega$? And why is the sample space never mentioned?

4) $F_N$ is called a distribution, but it is also said to be a probability measure, and I have also read that it is a random variable. After all, what is it?

5) If $F_N$ is a probability measure, the function should be $F_N:\Sigma\rightarrow[0,1]$, where $\Sigma$ is a $\sigma$-algebra over $\mathbb{R}$, but that is not the case. How can this be explained?

PS: if there is some extra detail you want to bring up because it is relevant, please do. I need to understand this, but it is really hard to do it by myself.

Thank you very much.

Best Answer

Let us denote our probability space by $(\Omega,\mathcal{F},P)$ and let $X_1,X_2,\ldots,X_n$ be a sequence of i.i.d. random variables defined on $\Omega$.

You're correct that $\{X_i\leq x\}$ is shorthand notation for $\{\omega\in\Omega\mid X_i(\omega)\leq x\}$, which is a subset of $\Omega$ that belongs to $\mathcal{F}$ (since $X_i$ is a random variable). Furthermore, $I(X_i\leq x)$ is the indicator function of the set $\{X_i\leq x\}\subseteq\Omega$, and by definition it is a function defined on $\Omega$ (in fact it is a random variable, since the set belongs to $\mathcal{F}$): $$ \begin{aligned} I(X_i\leq x)(\omega)&= \begin{cases} 1, & \text{if }\omega\in \{X_i\leq x\},\\ 0, & \text{otherwise} \end{cases} \\ &= \begin{cases} 1, & \text{if }X_i(\omega)\leq x,\\ 0, & \text{otherwise.} \end{cases} \end{aligned} $$

Therefore, $\frac1n \sum_{i=1}^n I(X_i\leq x)$ is also a random variable for each fixed $n$ and each fixed $x$.

A sample in this context just denotes a sequence of i.i.d. random variables $X_1,\ldots,X_n$. An outcome of this sample corresponds to a fixed $\omega$, and $X_1(\omega),\ldots,X_n(\omega)$ would be an outcome or observation of the sample $X_1,\ldots,X_n$.
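As a small illustration (a toy example chosen here, not taken from the question), consider two tosses of a fair coin, with $X_i$ equal to $1$ if toss $i$ shows heads and $0$ otherwise: $$ \Omega=\{HH,HT,TH,TT\},\qquad P(\{\omega\})=\tfrac14,\qquad X_1(HT)=1,\quad X_2(HT)=0. $$ The sample is the pair of functions $(X_1,X_2)$, while the outcome for $\omega=HT$ is the pair of numbers $(1,0)$. The indicators are evaluated at the same $\omega$: $$ I(X_1\leq 0)(HT)=0,\qquad I(X_2\leq 0)(HT)=1,\qquad \frac12\sum_{i=1}^2 I(X_i\leq 0)(HT)=\frac12. $$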

The empirical distribution function $F_n(x)=\frac1n \sum_{i=1}^n I(X_i\leq x)$ is indeed a random variable for each fixed $x$, and we can evaluate it in the following way: $$ (F_n(x))(\omega)=\frac1n\sum_{i=1}^n I(X_i(\omega)\leq x), $$ i.e. for a fixed outcome $\omega\in\Omega$, $(F_n(x))(\omega)$ is the number of observations that are less than or equal to $x$, divided by $n$, based on the outcome $X_1(\omega),X_2(\omega),\ldots,X_n(\omega)$.
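The evaluation for a fixed outcome can be sketched in R. The numbers in `x_obs` are made up for illustration, and `edf_at` is a hypothetical helper name, not a built-in function:

```r
# Hypothetical observed outcome X_1(omega), ..., X_5(omega) for one fixed omega
x_obs <- c(2.3, 0.7, 1.5, 3.1, 0.7)

# (F_n(x))(omega): fraction of observations that are <= x
edf_at <- function(x, obs) mean(obs <= x)

edf_at(1.5, x_obs)   # 3 of the 5 observations are <= 1.5, so 0.6
edf_at(0.5, x_obs)   # no observation is <= 0.5, so 0
edf_at(3.1, x_obs)   # all observations are <= 3.1, so 1

# R's built-in ecdf() returns the same step function
Fn <- ecdf(x_obs)
Fn(1.5)
```

Once $\omega$ is fixed, $x\mapsto (F_n(x))(\omega)$ is an ordinary step function from $\mathbb{R}$ to $[0,1]$, which is what `ecdf()` returns.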

Now suppose we have an infinite sample of i.i.d. variables $X_1,X_2,\ldots$. Since $I(X_1\leq x), I(X_2\leq x),\ldots$ are i.i.d. with mean $P(X_1\leq x)=F(x)$, the strong law of large numbers gives that, for every fixed $x$, the sequence of random variables $F_1(x), F_2(x),F_3(x),\ldots$ converges almost surely to the value $F(x)$ of the true CDF: $$ F_n(x)\to F(x)\;\;\text{almost surely as } n\to\infty. $$
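A quick simulation can make this convergence visible. The sketch below assumes a standard normal population and the evaluation point $x=0.5$; both are arbitrary choices for illustration, as is the seed:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x0 <- 0.5                        # fixed evaluation point x
X  <- rnorm(100000)              # one outcome of X_1, ..., X_100000 (assumed N(0,1))
n  <- c(10, 100, 1000, 10000, 100000)

# F_n(x0) computed from the first n observations of the same outcome
Fn_x0 <- sapply(n, function(k) mean(X[1:k] <= x0))

# Compare with the true value F(x0) = pnorm(x0)
cbind(n, Fn_x0, F_x0 = pnorm(x0), abs_error = abs(Fn_x0 - pnorm(x0)))
```

The absolute error need not decrease at every step, but it should become small for large $n$.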

Related Solutions

[Math] Definition and use of Empirical Cumulative Distribution Function (ECDF)

Sometimes one says that a histogram based on a large sample size gives a good idea about the shape of the population density function. (But information is lost in binning, and a modern 'density estimator' usually works better.)

In somewhat the same way an empirical cumulative distribution function (ECDF) of a large sample is a good estimator of the population CDF.

The following R program samples 3000 observations from $\mathrm{Gamma}(5, 1)$ to illustrate @Clement C's comment. The figure below shows the histogram (at left) along with the known population density (dotted) and a density estimator. At right, the CDF (thin light green) is superimposed on the ECDF (heavy black) of the sample. A larger sample would show better fit, but perhaps too good to see distinctions between population and sample curves.

 x = rgamma(3000, 5, 1)   # generate random sample
 par(mfrow=c(1,2))        # two panels in one graph
   hist(x, prob=TRUE, col="wheat")
     lines(density(x), lwd=2, col="blue")  # density estimator
     curve(dgamma(x, 5, 1), lty="dotted", lwd=2, col="red", add=TRUE)  # pop density
   plot(ecdf(x))          # empirical CDF
     curve(pgamma(x, 5, 1), col="green", add=TRUE)  # pop CDF
 par(mfrow=c(1,1))        # returns to default single panel

If you have access to R, you can try other population distributions and sample sizes. The same program as above, but with a sample of size $n = 100$, was used to produce the figure below. Roughly speaking, the ECDF gives a better estimate of the CDF than a histogram gives of the PDF. A 'nonparametric bootstrap' procedure uses the sample ECDF in place of the unknown population CDF.
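The bootstrap remark can be illustrated with a minimal sketch. It assumes the same $\mathrm{Gamma}(5,1)$ population with $n=100$, and estimates the standard error of the sample mean; the seed, the number of resamples `B`, and the choice of statistic are arbitrary:

```r
set.seed(1)                       # arbitrary seed
x <- rgamma(100, 5, 1)            # observed sample of size n = 100
B <- 2000                         # number of bootstrap resamples (arbitrary choice)

# Drawing n values from the ECDF is the same as resampling the data
# with replacement, because the ECDF puts mass 1/n on each observation.
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

sd(boot_means)                    # bootstrap estimate of the standard error of the mean
sd(x) / sqrt(length(x))           # formula-based estimate, for comparison
```

The two estimates should be close here. The bootstrap version becomes more useful for statistics that have no simple standard-error formula.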

Empirical distribution function

For every $i\in\{1,\dots,n\}$ there is a $k\in\{1,\dots,n\}$ such that $\xi_i=\xi_{(k)}$ where $\xi_{(k)}$ denotes the $k$-th order statistic.

Observe that, if $i\neq j$, we have $\chi(\xi_i\leq\xi_j)\stackrel{a.s.}{=}\chi(\xi_i<\xi_j)$ because $F$ is a continuous distribution, so ties occur with probability zero.

Actually the continuity of $F$ allows you to assume that: $$\xi_{(1)}<\xi_{(2)}<\cdots<\xi_{(n)}\tag1$$where $<$ replaces $\leq$.

Based on this it is not difficult to find that $nF_n(\xi_i)=k$ where $k$ is the integer that satisfies $\xi_i=\xi_{(k)}$.

Your expression for $F_n^{-1}(u)$ is okay and again applying $(1)$ we find: $$F^{-1}_n\left(\frac{i}{n}\right)=\xi_{(i)}$$
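Both identities can be checked numerically on a small sample without ties. The values in `xi` are made up, and `Fn_inv` is a hypothetical helper that implements the generalized inverse $\inf\{x\mid F_n(x)\geq u\}$ at $u=i/n$; it searches only the sample points, which suffices because $F_n$ jumps only there:

```r
xi <- c(2.4, 0.3, 1.7, 5.0, 0.9)   # hypothetical sample without ties
n  <- length(xi)

# n * F_n(xi_i): number of observations <= xi_i, kept as an integer count
count_le <- sapply(xi, function(t) sum(xi <= t))
count_le                            # 4 1 3 5 2
rank(xi)                            # same values: xi_i is the k-th order statistic

# Generalized inverse at u = i/n: smallest sample point x with n * F_n(x) >= i
Fn_inv <- function(i) min(xi[count_le >= i])
sapply(1:n, Fn_inv)                 # 0.3 0.9 1.7 2.4 5.0
sort(xi)                            # the order statistics xi_(1), ..., xi_(n)
```

Working with integer counts instead of the fractions $i/n$ avoids floating-point comparisons.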

You are correct.

In general, if $F$ is a CDF and $\Phi:(0,1)\to\mathbb R$ is defined by: $$\Phi(u)=\inf\{x\in\mathbb R\mid F(x)\geq u\}$$then it can be deduced that: $$u\leq F(x)\iff \Phi(u)\leq x$$

Based on that we find for $\eta$ uniformly distributed on $(0,1)$:$$P(\Phi(\eta)\leq x)=P(\eta\leq F(x))=F(x)$$
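This is the basis of inverse transform sampling, which can be sketched in R. The example assumes $F$ is the $\mathrm{Exp}(1)$ CDF (an arbitrary choice), for which $\Phi$ coincides with R's quantile function `qexp`; the seed and the grid are also arbitrary:

```r
set.seed(1)                        # arbitrary seed
eta <- runif(10000)                # eta uniform on (0, 1)
x   <- qexp(eta, rate = 1)         # Phi(eta), with Phi the quantile function of Exp(1)

# If P(Phi(eta) <= t) = F(t), the ECDF of x should be close to pexp on a grid
grid    <- seq(0, 8, by = 0.01)
max_gap <- max(abs(ecdf(x)(grid) - pexp(grid, rate = 1)))
max_gap                            # largest gap between ECDF and F on the grid
```

A small `max_gap` is consistent with $\Phi(\eta)$ having distribution function $F$.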
