[Math] Empirical Distribution Function Understanding

probability distributions, probability theory

I'm studying this topic by myself and I'm pretty sure there will be some big misunderstanding on my part, so please be patient with me.

Given the sample $X_1,\ldots, X_N$, iid with distribution $F$, the Empirical (Cumulative) Distribution Function (EDF) is the random probability measure $F_N:\mathbb{R}\rightarrow [0,1]$, such that
$$F_N(x)=\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$$

where $I$ is the indicator function.

My problems are about the definition itself. Besides the explanations, examples are also welcome. I just want to get a good understanding about what is going on. Here go my doubts:

1) $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$ is a sum of functions, not a real value in $[0,1]$. Should it be $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)(x)$ or something else? How should I interpret and calculate $I(X_i \leq x)$?

2) What exactly is a sample? At the moment, I think it is a vector $(X_1(\omega),\ldots,X_N(\omega))$, where $\omega$ is some fixed event (that for some reason no one ever mentions) and the $X_i$ are random "random variables" from the set of random variables with distribution $F$. Could someone give a formal definition of a sample?

3) In my understanding, $\{X_i\leq x\}=\{\omega\in\Omega: \ X_i(\omega)\leq x\}$, so it is necessary to evaluate $X_i(\omega)$ for some $\omega$ to calculate $I(X_i\leq x)$. I have already seen the definition $I(X_i\leq x)=1 \textrm{ if }X_i\leq x \textrm{, or }0\textrm{ otherwise}$, but how am I supposed to know whether $X_i\leq x$ without some $\omega\in\Omega$? And why is the set of events never mentioned?

4) $F_N$ is called a distribution, but it is also said to be a probability measure, and I have also read that it is a random variable. So what is it, after all?

5) If $F_N$ is a probability measure, the function should be $F_N:\Sigma\rightarrow[0,1]$, where $\Sigma$ is a $\sigma$-algebra over $\mathbb{R}$, but that is not the case. How can that be explained?

PS: if there is any extra detail you want to bring up because it is relevant, please do. I need to understand this, but it is really hard to do on my own.

Thank you very much.

Best Answer

Let us denote our probability space by $(\Omega,\mathcal{F},P)$ and let $X_1,X_2,\ldots,X_n$ be a sequence of i.i.d. random variables defined on $\Omega$.

You're correct that $\{X_i\leq x\}$ is shorthand notation for $\{\omega\in\Omega\mid X_i(\omega)\leq x\}$, which is a subset of $\Omega$ that belongs to $\mathcal{F}$ (since $X_i$ is a random variable). Furthermore, $I(X_i\leq x)$ is the indicator function of the set $\{X_i\leq x\}\subseteq\Omega$, and by definition it is a function defined on $\Omega$ (in fact it is a random variable, since the set belongs to $\mathcal{F}$):
$$ \begin{aligned} I(X_i\leq x)(\omega)&= \begin{cases} 1,\quad \text{if }\omega\in \{X_i\leq x\},\\ 0,\quad \text{otherwise} \end{cases} \\ &= \begin{cases} 1,\quad\text{if }X_i(\omega)\leq x,\\ 0,\quad\text{otherwise}. \end{cases} \end{aligned} $$

Therefore, $\frac1n \sum_{i=1}^n I(X_i\leq x)$ is also a random variable for each fixed $n$.

A sample in this context just denotes a sequence of i.i.d. random variables $X_1,\ldots,X_n$. An outcome of this sample corresponds to a fixed $\omega\in\Omega$, and $X_1(\omega),\ldots,X_n(\omega)$ is then an outcome, or observation, of the sample $X_1,\ldots,X_n$.
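To make the role of $\omega$ concrete, here is a minimal Python sketch on a toy probability space. The space (two fair die rolls), the names `OMEGA`, `X1`, `X2`, and `indicator_leq`, and the chosen outcome are all assumptions made for illustration, not part of the question.

```python
from itertools import product

# Hypothetical finite probability space: two fair die rolls.
# Each outcome omega is a pair (a, b); all 36 outcomes are equally likely.
OMEGA = list(product(range(1, 7), repeat=2))

# Random variables are just functions defined on OMEGA.
def X1(omega):
    return omega[0]

def X2(omega):
    return omega[1]

def indicator_leq(X, x):
    """Return the random variable I(X <= x), which is itself a function of omega."""
    return lambda omega: 1 if X(omega) <= x else 0

I1 = indicator_leq(X1, 3)  # the random variable I(X1 <= 3)

omega = (2, 5)                        # one fixed outcome
observation = (X1(omega), X2(omega))  # observed sample for that outcome
value = I1(omega)                     # I(X1 <= 3) evaluated at omega

# The event {X1 <= 3} is a subset of OMEGA, and its probability is P(X1 <= 3).
event = [w for w in OMEGA if X1(w) <= 3]
prob = len(event) / len(OMEGA)
```

Here `I1` only produces a number once it is given an `omega`, which is why the indicator is a random variable and not a fixed value.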

The empirical distribution function $F_n(x)=\frac1n \sum_{i=1}^n I(X_i\leq x)$ is indeed a random variable for each fixed $x$, and we can evaluate it in the following way: $$ (F_n(x))(\omega)=\frac1n\sum_{i=1}^n I(X_i(\omega)\leq x), $$ i.e. for a fixed outcome $\omega\in\Omega$, $(F_n(x))(\omega)$ is the number of observations among $X_1(\omega),X_2(\omega),\ldots,X_n(\omega)$ that are less than or equal to $x$, divided by $n$.
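For one fixed outcome, evaluating the empirical distribution function comes down to counting. The sketch below assumes a made-up list of observed values standing in for $X_1(\omega),\ldots,X_n(\omega)$; the function name `ecdf` and the data are hypothetical.

```python
def ecdf(observations, x):
    """Fraction of observations that are <= x, i.e. (F_n(x))(omega)."""
    n = len(observations)
    return sum(1 for obs in observations if obs <= x) / n

# Hypothetical observed sample X_1(omega), ..., X_5(omega) for one fixed omega.
observed = [2.1, 0.4, 3.7, 2.1, 5.0]

# Evaluate the step function at a few points x.
values = [ecdf(observed, x) for x in (0.0, 0.4, 2.1, 4.0, 5.0)]
# values == [0.0, 0.2, 0.6, 0.8, 1.0]
```

A different outcome $\omega$ would give a different list `observed` and hence a different step function, which is the sense in which $F_n$ is random.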

Now suppose we have an infinite sample of i.i.d. random variables $X_1,X_2,\ldots$. For every fixed $x$, the indicators $I(X_1\leq x), I(X_2\leq x),\ldots$ are i.i.d. with mean $P(X_i\leq x)=F(x)$, so by the strong law of large numbers the sequence of random variables $F_1(x), F_2(x),F_3(x),\ldots$ converges almost surely to the value $F(x)$ of the true CDF: $$ F_n(x)\to F(x)\;\;\text{almost surely as } n\to\infty. $$
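This convergence can be checked numerically at a single point $x$. The sketch below assumes, purely for illustration, that $F$ is the Exponential(1) distribution, so $F(x)=1-e^{-x}$ for $x\geq 0$; the seed, the sample sizes, and the helper names are likewise arbitrary choices.

```python
import math
import random

def ecdf_at(sample, x):
    """Empirical distribution function of `sample`, evaluated at x."""
    return sum(1 for s in sample if s <= x) / len(sample)

def true_cdf(x):
    """CDF of the Exponential(1) distribution (assumed true F for this demo)."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

rng = random.Random(0)  # fixed seed so the run is reproducible
x = 1.0
sample = [rng.expovariate(1.0) for _ in range(100_000)]

# |F_n(x) - F(x)| for increasing n, using the first n draws of one outcome.
errors = {
    n: abs(ecdf_at(sample[:n], x) - true_cdf(x))
    for n in (10, 100, 1_000, 10_000, 100_000)
}

for n, err in errors.items():
    print(f"n = {n:>7}: |F_n(x) - F(x)| = {err:.4f}")
```

The error need not shrink monotonically along one simulated path, but it should be small for large $n$, in line with the almost sure convergence stated above.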
