[Math] Is there some connection between Kernel density estimation and Empirical distribution function

probability distributions, statistics

This wiki page says

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample.

and gives this formula

${\displaystyle {\widehat {F}}_{n}(t)={\frac {{\mbox{number of elements in the sample}}\leq t}{n}}={\frac {1}{n}}\sum _{i=1}^{n}\mathbf {1} _{X_{i}\leq t},}$

Another wiki page says

kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable

and gives this formula

${\displaystyle {\widehat {f}}_{h}(x)={\frac {1}{n}}\sum _{i=1}^{n}K_{h}(x-x_{i})={\frac {1}{nh}}\sum _{i=1}^{n}K{\Big (}{\frac {x-x_{i}}{h}}{\Big )},}$

This post says

the pdf is the first derivative of the cdf for a continuous random variable

Question

Is there some connection between Kernel density estimation and Empirical distribution function, such as the former is the derivative of the latter for a continuous random variable? If yes, what is the derivation?

Best Answer

Not precisely.
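One way to make "not precisely" concrete is the following sketch. The symbols $\mathcal{K}$ and $\widetilde{F}_h$ are notation introduced here for illustration; they do not appear in the pages quoted in the question. The ECDF is a step function, so $\widehat{F}_n'(t)=0$ at every $t$ that is not a data value, and $\widehat{F}_n$ is not differentiable at the data values. The KDE therefore cannot be the derivative of the ECDF itself. It is, however, the derivative of a kernel-smoothed version of the ECDF. For a continuous kernel $K$, let $\mathcal{K}(u)=\int_{-\infty}^{u}K(v)\,dv$ and define

$$\widetilde{F}_h(x)=\frac{1}{n}\sum_{i=1}^{n}\mathcal{K}\Big(\frac{x-x_i}{h}\Big).$$

Then, by the chain rule,

$$\frac{d}{dx}\widetilde{F}_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\Big(\frac{x-x_i}{h}\Big)=\widehat{f}_h(x),$$

and, because $K$ integrates to one, $\widetilde{F}_h(x)\to\widehat{F}_n(x)$ as $h\to 0$ at every $x$ that is not a data value. For the uniform kernel $K(u)=\tfrac{1}{2}\mathbf{1}_{|u|\leq 1}$, the KDE is a symmetric difference quotient of the ECDF,

$$\widehat{f}_h(x)=\frac{\widehat{F}_n(x+h)-\widehat{F}_n(x-h)}{2h},$$

provided $x-h$ is not a data value.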

About histograms, KDEs and ECDFs.

(1) Roughly speaking, a histogram (on a density scale, so that the bar areas sum to one) can be viewed as an estimate of the density function. A KDE is a more sophisticated method of density estimation. Generally speaking, one cannot reconstruct the exact data values from either a histogram or a KDE.

(2) By contrast an empirical CDF (ECDF) retains exact information about all of the data. An ECDF is made as follows: (a) sort the data from smallest to largest, (b) make a stair-step function that begins at 0 below the minimum and increases by $1/n$ at each data value, where $n$ is the sample size. If $k$ values are tied then the increase is $k/n$ at the tied value.
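Steps (a) and (b) can be sketched in R as follows. This is a minimal illustration on made-up toy data; the names `vals`, `jumps`, `heights`, and `my_ecdf` are hypothetical helpers introduced here, and the result is compared with R's built-in `ecdf()`.

```r
x <- c(3.1, 1.4, 2.7, 1.4, 5.0)         # toy data with one tie (1.4 appears twice)
n <- length(x)

vals    <- sort(unique(x))              # (a) distinct data values, smallest to largest
jumps   <- as.vector(table(x)) / n      # (b) jump of k/n at a value that occurs k times
heights <- cumsum(jumps)                # ECDF value at each distinct data value

# Evaluate the hand-built step function at arbitrary points t
my_ecdf <- function(t) {
  idx <- findInterval(t, vals)          # number of distinct data values <= t
  c(0, heights)[idx + 1]                # value is 0 below the minimum
}

t_grid <- c(0, 1.4, 2, 2.7, 3.1, 4.9, 5, 6)
cbind(t = t_grid, by_hand = my_ecdf(t_grid), builtin = ecdf(x)(t_grid))
```

The two columns should agree, and the jump locations in `vals` are exactly the observed data values, which is why the ECDF loses no information about the sample.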

Thus the ECDF approximates the CDF of the distribution, with increasingly accurate approximations for samples of increasing size. Generally speaking an ECDF gives a better approximation to the population CDF than a histogram gives for the density function. (Information is lost in binning data to make a histogram.)
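The remark about binning can be checked with a short sketch. It assumes a simulated Gamma sample (an arbitrary illustrative choice) and uses `hist(..., plot = FALSE)` to inspect what a histogram actually stores; `bar_areas` and `total_area` are names introduced here.

```r
set.seed(930)                                # arbitrary seed, for reproducibility only
x <- rgamma(100, shape = 5, rate = 1/6)      # simulated sample (assumed for illustration)

h <- hist(x, plot = FALSE)                   # compute the bins without drawing
bar_areas  <- h$density * diff(h$breaks)     # bar height times bar width
total_area <- sum(bar_areas)                 # equals 1 on the density scale

total_area
h$counts                                     # only bin counts survive, not the data values
```

Any two samples with the same bin counts give the same histogram, whereas the ECDF changes whenever a single observation changes.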

[By suitable manipulation (a kind of numerical integration), information in a KDE could be used to make a function that imitates the population CDF, but it does not use the actual data values. In my experience, this is rarely done.]
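As a rough sketch of that manipulation, assume a Gaussian kernel with the bandwidth returned by `bw.nrd0()` (the default bandwidth rule of `density()`). In that case the integral of the KDE has a closed form, so no numerical integration is needed. The helper names `kde` and `kde_cdf` are introduced here for illustration.

```r
set.seed(930)
obs <- rgamma(100, shape = 5, rate = 1/6)    # simulated sample (assumed for illustration)
h   <- bw.nrd0(obs)                          # default bandwidth rule used by density()

# Gaussian KDE and its exact integral, a smooth estimate of the CDF
kde     <- function(t) sapply(t, function(s) mean(dnorm(s, mean = obs, sd = h)))
kde_cdf <- function(t) sapply(t, function(s) mean(pnorm(s, mean = obs, sd = h)))

# A central difference of the smooth CDF recovers the KDE at a test point
t0  <- 30
eps <- 1e-5
deriv_at_t0 <- (kde_cdf(t0 + eps) - kde_cdf(t0 - eps)) / (2 * eps)
c(derivative = deriv_at_t0, kde = kde(t0))

# Smooth CDF (red) against the ECDF (blue) and the population CDF (dotted)
plot(ecdf(obs), main = "n = 100", col = "blue")
curve(kde_cdf(x), add = TRUE, col = "red", lwd = 2)
curve(pgamma(x, 5, 1/6), add = TRUE, lwd = 2, lty = "dotted")
```

Note that `density()` itself evaluates the estimate on a grid with a binned approximation, so its output can differ slightly from the exact `kde()` values computed here.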

Graphical illustrations.

(1) A sample of size $n = 100$ from $$\mathsf{Gamma}(\text{shape} = \alpha = 5,\,\text{rate} = \lambda = 1/6)$$ is simulated. The figure shows a density histogram (blue bars), the default KDE from R statistical software (red curve), and the population density function (black).

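set.seed(930)
x <- rgamma(100, shape = 5, rate = 1/6)   # n = 100 draws from Gamma(shape 5, rate 1/6)
summary(x)
hist(x, prob = TRUE, ylim = c(0, 0.035),
     col = "skyblue2", main = "n = 100")  # density-scale histogram
rug(x)                                    # tick marks below x-axis
lines(density(x), lwd = 2, lty = "dotted", col = "red")  # default KDE
curve(dgamma(x, 5, 1/6), add = TRUE)      # population density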

(2) Sampling from the same distribution, we show the ECDF for a sample of size $n = 20,$ so that the steps are easy to see.

set.seed(2019)
x <- rgamma(20, shape = 5, rate = 1/6)
plot(ecdf(x), main = "n = 20", col = "blue")   # ECDF of the sample
rug(x)                                         # tick marks at the data values
curve(pgamma(x, 5, 1/6), add = TRUE, lwd = 2)  # population CDF

Related Solutions

[Math] Empirical Distribution Function Understanding

Let us denote our probability space by $(\Omega,\mathcal{F},P)$ and let $X_1,X_2,\ldots,X_n$ be a sequence of i.i.d. random variables defined on $\Omega$.

You're correct that $\{X_i\leq x\}$ is shorthand notation for $\{\omega\in\Omega\mid X_i(\omega)\leq x\}$, which is a subset of $\Omega$ that belongs to $\mathcal{F}$ (since $X_i$ is a random variable). Furthermore, $I(X_i\leq x)$ is the indicator function of the set $\{X_i\leq x\}\subseteq\Omega$, and by definition it is a function defined on $\Omega$ (in fact it is a random variable, since the set belongs to $\mathcal{F}$): $$ \begin{aligned} I(X_i\leq x)(\omega)&= \begin{cases} 1,\quad \text{if }\omega\in \{X_i\leq x\},\\ 0,\quad \text{otherwise} \end{cases} \\ &= \begin{cases} 1,\quad\text{if }X_i(\omega)\leq x,\\ 0,\quad\text{otherwise}. \end{cases} \end{aligned} $$

Therefore, $\frac1n \sum_{i=1}^n I(X_i\leq x)$ is also a random variable for each fixed $n$.

A sample in this connection just denotes a sequence of i.i.d. random variables $X_1,\ldots,X_n$. An outcome of this sample corresponds to a fixed $\omega$, and $X_1(\omega),\ldots,X_n(\omega)$ would be an outcome or observation of the sample $X_1,\ldots,X_n$.

The empirical distribution function $F_n(x)=\frac1n \sum_{i=1}^n I(X_i\leq x)$ is indeed a random variable, and we can evaluate it in the following way: $$ (F_n(x))(\omega)=\frac1n\sum_{i=1}^n I(X_i(\omega)\leq x), $$ i.e. for a fixed outcome $\omega\in\Omega$, $(F_n(x))(\omega)$ is the number of observations that are less than or equal to $x$, divided by $n$, based on the outcome $X_1(\omega),X_2(\omega),\ldots,X_n(\omega)$.

Now suppose we have an infinite sample of i.i.d. variables $X_1,X_2,\ldots$. For every fixed $x$, the indicators $I(X_i\leq x)$ are i.i.d. with mean $F(x)$, so by the strong law of large numbers the sequence of random variables $F_1(x), F_2(x), F_3(x),\ldots$ converges almost surely to the value $F(x)$ of the true CDF: $$ F_n(x)\to F(x)\;\;\text{almost surely as } n\to\infty. $$
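A small simulation can illustrate this convergence at one fixed point. The sketch below assumes, purely for illustration, that the $X_i$ follow a Gamma distribution with shape 5 and rate 1/6 and takes $x = 30$; neither choice comes from the question.

```r
set.seed(2019)
x0     <- 30                                    # fixed evaluation point (arbitrary choice)
F_true <- pgamma(x0, shape = 5, rate = 1/6)     # population CDF at x0

obs <- rgamma(1e5, shape = 5, rate = 1/6)       # one long i.i.d. sample X_1, X_2, ...
n   <- c(10, 100, 1000, 1e4, 1e5)

# F_n(x0) = proportion of the first n observations that are <= x0
F_n <- sapply(n, function(k) mean(obs[1:k] <= x0))

data.frame(n = n, F_n = F_n, abs_error = abs(F_n - F_true))
```

Each row corresponds to one fixed outcome $\omega$ (one simulated sample path), evaluated at increasing $n$; the absolute error typically shrinks at a rate of roughly $1/\sqrt{n}$.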

[Math] Definition and use of Empirical Cumulative Distribution Function (ECDF)

Sometimes one says that a histogram based on a large sample size gives a good idea about the shape of the population density function. (But information is lost in binning, and a modern 'density estimator' usually works better.)

In somewhat the same way an empirical cumulative distribution function (ECDF) of a large sample is a good estimator of the population CDF.

The following R program samples 3000 observations from $\mathsf{Gamma}(\text{shape} = 5,\,\text{rate} = 1)$ to illustrate @Clement C's comment. The figure below shows the histogram (at left) along with the known population density (dotted) and a density estimator. At right, the CDF (thin light green) is superimposed on the ECDF (heavy black) of the sample. A larger sample would show a better fit, but perhaps one too good for the population and sample curves to be told apart.

x <- rgamma(3000, shape = 5, rate = 1)     # generate random sample
par(mfrow = c(1, 2))                       # two panels in one graph
hist(x, prob = TRUE, col = "wheat")
lines(density(x), lwd = 2, col = "blue")   # density estimator
curve(dgamma(x, 5, 1), lty = "dotted", lwd = 2, col = "red", add = TRUE)  # pop density
plot.ecdf(x)                               # empirical CDF
curve(pgamma(x, 5, 1), col = "green", add = TRUE)  # pop CDF
par(mfrow = c(1, 1))                       # returns to default single panel

If you have access to R, you can try other population distributions and sample sizes. The same program as above, except with a sample of size $n = 100$, was used to produce the figure below. Roughly speaking, the ECDF gives a better estimate of the CDF than a histogram gives of the PDF. A 'nonparametric bootstrap' procedure uses the sample ECDF in place of the unknown population CDF.
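To make the bootstrap remark concrete, here is a minimal sketch. It assumes the statistic of interest is the sample median and uses a simple percentile interval; both are illustrative choices, as are the names `B`, `boot_medians`, and `ci`.

```r
set.seed(2019)
x <- rgamma(100, shape = 5, rate = 1)     # observed sample; population treated as unknown

B <- 2000                                 # number of bootstrap resamples (arbitrary choice)

# Drawing n values from the ECDF is the same as sampling the data with replacement
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

ci <- quantile(boot_medians, c(0.025, 0.975))   # percentile interval for the median
ci
```

Because the ECDF puts mass $1/n$ on each observation, `sample(x, replace = TRUE)` is exactly a draw of $n$ i.i.d. values from the ECDF.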
