I'm studying this topic by myself and I'm pretty sure there will be some big misunderstanding on my part, so please be patient with me.
Given the sample $X_1,\ldots, X_N$, iid with distribution $F$, the Empirical (Cumulative) Distribution Function (EDF) is the random probability measure $F_N:\mathbb{R}\rightarrow [0,1]$, such that
$$F_N(x)=\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$$
where $I$ is the indicator function.
My problems are about the definition itself. Besides explanations, examples are also welcome. I just want to get a good understanding of what is going on. Here are my doubts:
1) $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)$ is a sum of functions, not a real value in $[0,1]$. Should it be $\frac{1}{N}\sum_{i=1}^NI(X_i \leq x)(x)$ or something else? How should I interpret and calculate $I(X_i \leq x)$?
2) What exactly is a sample? At the moment, I think it is a vector $(X_1(\omega),\ldots,X_N(\omega))$, where $\omega$ is some fixed event (that for some reason no one ever mentions) and the $X_i$ are random "random variables" from the set of random variables with distribution $F$. Could someone give a formal definition of a sample?
3) In my understanding, $X_i\leq x=\{\omega\in\Omega: \ X_i(\omega)\leq x\}$, so it is necessary to evaluate $X_i(\omega)$ for some $\omega$ to calculate $I(X_i\leq x)$. I have already seen the definition $I(X_i\leq x)=1 \textrm{ if }X_i\leq x \textrm{, or }0\textrm{ otherwise}$, but how am I supposed to know whether $X_i\leq x$ without some $\omega\in\Omega$? And why is the set of events never mentioned?
4) $F_N$ is called a distribution, but they say it is a probability measure, and I have also read that it is a random variable. So what is it, after all?
5) If $F_N$ is a probability measure, the function should be $F_N:\Sigma\rightarrow[0,1]$, where $\Sigma$ is a $\sigma$-algebra over $\mathbb{R}$, but that is not the case. How can that be explained?
PS: if there is some extra detail you want to bring up because it is relevant, please do. I need to understand this, but it is really hard to do it by myself.
Thank you very much.
Best Answer
Let us denote our probability space by $(\Omega,\mathcal{F},P)$ and let $X_1,X_2,\ldots,X_n$ be a sequence of i.i.d. random variables defined on $\Omega$.
You're correct that $\{X_i\leq x\}$ is shorthand notation for $\{\omega\in\Omega\mid X_i(\omega)\leq x\}$, which is a subset of $\Omega$ that belongs to $\mathcal{F}$ (since $X_i$ is a random variable). Furthermore, $I(X_i\leq x)$ is the indicator function of the set $\{X_i\leq x\}\subseteq\Omega$, and by definition it is a function defined on $\Omega$ (in fact it is a random variable, since the set belongs to $\mathcal{F}$): $$ \begin{aligned} I(X_i\leq x)(\omega)&= \begin{cases} 1,\quad \text{if }\omega\in \{X_i\leq x\},\\ 0,\quad \text{otherwise} \end{cases} \\ &= \begin{cases} 1,\quad\text{if }X_i(\omega)\leq x,\\ 0,\quad\text{otherwise}. \end{cases} \end{aligned} $$
Therefore, for each fixed $x$ and each fixed $n$, $\frac1n \sum_{i=1}^n I(X_i\leq x)$ is also a random variable, being a finite linear combination of random variables.
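To make the role of $\omega$ concrete, here is a minimal sketch on a toy probability space that is my own assumption and not part of the question: $\Omega$ is the set of results of two fair dice rolls, and $X_1$, $X_2$ read off the first and second roll. The helper names `X`, `indicator` and `F_n_at` are made up for illustration. The point is that $I(X_i\leq x)$ and $\frac1n\sum_i I(X_i\leq x)$ are functions that wait for an $\omega$ before they return a number.

```python
from fractions import Fraction
from itertools import product

# Assumed toy probability space: all results of two fair dice rolls.
# Each outcome omega is a pair (a, b) with a, b in {1, ..., 6}.
OMEGA = list(product(range(1, 7), repeat=2))

def X(i):
    """Random variable X_i: maps omega to its i-th coordinate (i = 1 or 2)."""
    return lambda omega: omega[i - 1]

def indicator(rv, x):
    """I(rv <= x), returned as a function on OMEGA."""
    return lambda omega: 1 if rv(omega) <= x else 0

def F_n_at(x, rvs):
    """F_n(x) for fixed x, returned as a function on OMEGA (a random variable)."""
    n = len(rvs)
    return lambda omega: Fraction(sum(indicator(rv, x)(omega) for rv in rvs), n)

sample = [X(1), X(2)]        # the sample: a list of random variables
F2_at_3 = F_n_at(3, sample)  # still a function of omega, not a number

print(F2_at_3((2, 5)))  # only X_1(omega) = 2 is <= 3, prints 1/2
print(F2_at_3((1, 3)))  # both values are <= 3, prints 1
print(F2_at_3((4, 6)))  # neither value is <= 3, prints 0
```

Nothing numeric exists until an outcome such as `(2, 5)` is supplied, which is exactly the $\omega$ that the usual notation suppresses.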
A sample in this context just denotes a sequence of i.i.d. random variables $X_1,\ldots,X_n$. An outcome of this sample corresponds to a fixed $\omega$, and $X_1(\omega),\ldots,X_n(\omega)$ would be an outcome or observation of the sample $X_1,\ldots,X_n$.
The empirical distribution function $F_n(x)=\frac1n \sum_{i=1}^n I(X_i\leq x)$ is indeed a random variable for each fixed $x$, and we can evaluate it in the following way: $$ (F_n(x))(\omega)=\frac1n\sum_{i=1}^n I(X_i(\omega)\leq x), $$ i.e. for a fixed outcome $\omega\in\Omega$, $(F_n(x))(\omega)$ is the number of observations that are less than or equal to $x$, divided by $n$, based on the outcome $X_1(\omega),X_2(\omega),\ldots,X_n(\omega)$.
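In practice one only sees the observed values $X_1(\omega),\ldots,X_n(\omega)$, and the evaluation above reduces to counting. The following is a small sketch of that computation. The function name `ecdf` and the five data values are made up for illustration.

```python
def ecdf(observations):
    """Build x -> F_n(x) from one observed outcome X_1(omega), ..., X_n(omega)."""
    n = len(observations)

    def F_n(x):
        # Count the observations that are less than or equal to x, then divide by n.
        return sum(1 for obs in observations if obs <= x) / n

    return F_n

# Hypothetical observed values (one fixed omega), chosen only for illustration.
observed = [2.1, 0.4, 3.7, 2.1, 1.5]
F_5 = ecdf(observed)

print(F_5(0.0))   # no observation is <= 0.0, prints 0.0
print(F_5(2.0))   # 0.4 and 1.5 are <= 2.0, prints 0.4
print(F_5(2.1))   # 0.4, 1.5 and both copies of 2.1, prints 0.8
print(F_5(10.0))  # every observation is <= 10.0, prints 1.0
```

Once the observed values are fixed, `F_5` is an ordinary non-decreasing step function from $\mathbb{R}$ to $[0,1]$, jumping at the observed values. A different outcome $\omega$ would give a different step function.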
Now suppose we have an infinite sample of i.i.d. variables $X_1,X_2,\ldots$. For every fixed $x$, the indicators $I(X_1\leq x), I(X_2\leq x),\ldots$ are i.i.d. with mean $P(X_1\leq x)=F(x)$, so by the strong law of large numbers the random variables $F_1(x), F_2(x),F_3(x),\ldots$ converge almost surely to the value $F(x)$ of the true CDF: $$ F_n(x)\to F(x)\;\;\text{almost surely as } n\to\infty. $$
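This convergence can be illustrated with a simulation. The sketch below assumes, purely for illustration, that the $X_i$ are Uniform$(0,1)$, so that $F(x)=x$ on $[0,1]$. The seed, the point $x=0.3$ and the helper name `ecdf_at` are arbitrary choices of mine. A simulation shows a single outcome $\omega$ and proves nothing, but the error $|F_n(x)-F(x)|$ should typically become small as $n$ grows.

```python
import random

def ecdf_at(x, observations):
    """F_n(x): fraction of the observations that are less than or equal to x."""
    return sum(1 for obs in observations if obs <= x) / len(observations)

# Assumption for illustration: X_i ~ Uniform(0, 1), so the true CDF is F(x) = x.
rng = random.Random(0)  # fixed seed so the run is reproducible
x = 0.3
true_value = x

# One long realisation X_1(omega), X_2(omega), ...; F_n uses its first n terms.
draws = [rng.random() for _ in range(100_000)]

for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = ecdf_at(x, draws[:n])
    print(n, estimate, abs(estimate - true_value))
```

Using prefixes of one long sequence, rather than a fresh sample for each $n$, mirrors the statement above: the same $\omega$ is held fixed while $n$ increases.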