[Math] Is there some connection between Kernel density estimation and Empirical distribution function

probability distributions, statistics

This wiki page says

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample.

and gives this formula

${\displaystyle {\widehat {F}}_{n}(t)={\frac {{\mbox{number of elements in the sample}}\leq t}{n}}={\frac {1}{n}}\sum _{i=1}^{n}\mathbf {1} _{X_{i}\leq t},}$

Another wiki page says

kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable

and gives this formula

${\displaystyle {\widehat {f}}_{h}(x)={\frac {1}{n}}\sum _{i=1}^{n}K_{h}(x-x_{i})={\frac {1}{nh}}\sum _{i=1}^{n}K{\Big (}{\frac {x-x_{i}}{h}}{\Big )},}$

This post says

the pdf is the first derivative of the cdf for a continuous random variable

question

Is there a connection between kernel density estimation and the empirical distribution function? For example, is the former the derivative of the latter for a continuous random variable? If so, what is the derivation?

Best Answer

Not precisely.

About histograms, KDEs and ECDFs.

(1) Roughly speaking, a histogram (on a density scale, so that the sum of the areas of the bars is unity) can be viewed as an estimate of the density function. A KDE is a more sophisticated method of density estimation. Generally speaking, one cannot reconstruct the exact values of the data from either a histogram or a KDE.
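As a rough check of the two density estimates described in (1), the sketch below verifies that the bar areas of a density histogram sum to one, and writes the Gaussian KDE out from its definition, $\hat f_h(t) = \frac{1}{nh}\sum_{i} K\big((t - x_i)/h\big)$ with $K$ the standard normal density. The helper name `kde_hand` is made up for this illustration, and the comparison assumes the defaults of R's `density()` (Gaussian kernel, bandwidth from `bw.nrd0`).

```r
set.seed(930)
x <- rgamma(100, shape = 5, rate = 1/6)

# Density histogram: bar areas (height * width) sum to 1.
h_out <- hist(x, plot = FALSE)
area_hist <- sum(h_out$density * diff(h_out$breaks))

# Gaussian KDE written out from its definition (hypothetical helper):
#   f_hat(t) = (1 / (n * h)) * sum_i K((t - x_i) / h),  with K = dnorm.
kde_hand <- function(t, data, h) {
  sapply(t, function(ti) mean(dnorm((ti - data) / h)) / h)
}

h <- bw.nrd0(x)     # bandwidth rule that density() uses by default
d <- density(x)     # default: Gaussian kernel, evaluated on a grid

# density() works from a binned approximation, so agreement is close
# but not exact.
max_diff <- max(abs(kde_hand(d$x, x, h) - d$y))

# The KDE is itself a density: it integrates to (essentially) 1.
area_kde <- integrate(function(t) kde_hand(t, x, h),
                      min(x) - 10 * h, max(x) + 10 * h)$value

c(area_hist = area_hist, max_diff = max_diff, area_kde = area_kde)
```

Both estimates smooth the data in some way (bins for the histogram, a bandwidth for the KDE), which is why neither lets you read the original values back off the curve.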

(2) By contrast an empirical CDF (ECDF) retains exact information about all of the data. An ECDF is made as follows: (a) sort the data from smallest to largest, (b) make a stair-step function that begins at 0 below the minimum and increases by $1/n$ at each data value, where $n$ is the sample size. If $k$ values are tied then the increase is $k/n$ at the tied value.
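Steps (a) and (b) can be sketched directly in R and compared with the built-in `ecdf()`. The data vector here is made up for illustration (it contains one tie), and `F_hand` is a hypothetical name.

```r
# Build an ECDF by hand and compare it with R's built-in ecdf().
# Made-up data; the value 3.1 appears twice (a tie with k = 2).
x <- c(4.2, 3.1, 7.5, 3.1, 5.8)
n <- length(x)

# (a) Sort the data: distinct values in increasing order, with tie counts.
vals   <- sort(unique(x))          # 3.1, 4.2, 5.8, 7.5
counts <- as.vector(table(x))      # 2, 1, 1, 1 (same order as vals)

# (b) Step function: 0 below the minimum, then a jump of k/n at each
#     value that occurs k times. stepfun() is right-continuous by default.
heights <- cumsum(counts) / n      # 0.4, 0.6, 0.8, 1.0
F_hand  <- stepfun(vals, c(0, heights))

t_grid <- c(0, 3.1, 4, 4.2, 6, 7.5, 10)
rbind(by_hand = F_hand(t_grid), built_in = ecdf(x)(t_grid))
```

Because the jump locations are the data values themselves and the jump sizes are the tie counts over $n$, the sorted sample can be read back from the step function.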

Thus the ECDF approximates the CDF of the distribution, with increasingly accurate approximations for samples of increasing size. Generally speaking an ECDF gives a better approximation to the population CDF than a histogram gives for the density function. (Information is lost in binning data to make a histogram.)
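A small simulation can illustrate the improving approximation. The sketch below computes the largest vertical gap between the ECDF and the population CDF for a few sample sizes; the seed, the sample sizes, and the helper name `sup_dist` are arbitrary choices for this illustration.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Largest vertical distance between the ECDF of a Gamma(5, 1/6) sample
# of size n and the population CDF (hypothetical helper).
sup_dist <- function(n) {
  x  <- sort(rgamma(n, shape = 5, rate = 1/6))
  Fx <- pgamma(x, shape = 5, rate = 1/6)
  # The ECDF jumps from (i - 1)/n to i/n at the i-th sorted value, so
  # the largest gap occurs on one side of some jump.
  max(pmax(abs((1:n) / n - Fx), abs((0:(n - 1)) / n - Fx)))
}

sizes <- c(20, 100, 10000)
D <- sapply(sizes, sup_dist)
data.frame(n = sizes, sup_distance = D)
```

The gap typically shrinks at roughly the rate $1/\sqrt{n}$, although for any single pair of small samples the ordering is not guaranteed.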

[By suitable manipulation (a kind of numerical integration), the information in a KDE could be used to make a function that imitates the population CDF, but, unlike the ECDF, the result does not preserve the exact data values. In my experience, this is rarely done.]
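To make the bracketed remark concrete, here is a minimal sketch assuming the Gaussian kernel that R's `density()` uses by default. Integrating the KDE from $-\infty$ to $t$ gives $\frac{1}{n}\sum_{i} \Phi\big((t - x_i)/h\big)$, where $\Phi$ is the standard normal CDF. That is the ECDF formula with each indicator step $\mathbf{1}_{x_i \le t}$ replaced by a smooth ramp, so the KDE is the derivative of a smoothed ECDF rather than of the ECDF itself (whose derivative is zero everywhere except at the jumps). The names `cdf_num` and `cdf_smooth` are hypothetical.

```r
set.seed(2019)
x <- rgamma(20, shape = 5, rate = 1/6)
d <- density(x)          # default Gaussian KDE on a grid of 512 points
m <- length(d$x)

# Numerical integration of the KDE: cumulative trapezoid rule on the grid.
dx <- diff(d$x)
cdf_num <- c(0, cumsum(dx * (d$y[-m] + d$y[-1]) / 2))

# Closed form for a Gaussian kernel: average of normal CDFs centred at
# the data values (hypothetical helper).
cdf_smooth <- function(t, data, h) {
  sapply(t, function(ti) mean(pnorm((ti - data) / h)))
}

# The two versions agree up to grid and truncation error.
gap_num <- max(abs(cdf_num - cdf_smooth(d$x, x, d$bw)))

# With a tiny bandwidth the smoothed CDF matches the ECDF, compared here
# midway between adjacent sorted data values (away from the jumps).
xs    <- sort(x)
t_mid <- (xs[-1] + xs[-length(xs)]) / 2
gap_ecdf <- max(abs(cdf_smooth(t_mid, x, 1e-6) - ecdf(x)(t_mid)))

c(gap_num = gap_num, gap_ecdf = gap_ecdf)
```

To see the smoothing, one could overlay `lines(d$x, cdf_num, col = "red")` on `plot(ecdf(x))`: the red curve follows the staircase but rounds off every step.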

Graphical illustrations.

(1) We simulate a sample of size $n = 100$ from $\mathsf{Gamma}(\text{shape} = \alpha = 5,\,\text{rate} = \lambda = 1/6)$. The figure shows a density histogram (blue bars), the default KDE from R statistical software (dotted red curve), and the population density function (black).

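set.seed(930)
x <- rgamma(100, shape = 5, rate = 1/6)    # n = 100 draws from Gamma(5, 1/6)
summary(x)
hist(x, prob = TRUE, ylim = c(0, 0.035),
     col = "skyblue2", main = "n = 100")   # histogram on a density scale
rug(x)                                     # tick marks below x-axis
lines(density(x), lwd = 2, lty = "dotted", col = "red")   # default KDE
curve(dgamma(x, 5, 1/6), add = TRUE)       # population density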

[Figure: density histogram of the sample (n = 100) with the default KDE (red) and the population density (black).]

(2) Sampling from the same distribution, we show the ECDF for a sample of size $n = 20,$ so that the steps are easy to see.

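set.seed(2019)
x <- rgamma(20, shape = 5, rate = 1/6)          # n = 20 draws, same distribution
plot(ecdf(x), main = "n = 20", col = "blue")    # ECDF as a step function
rug(x)                                          # tick marks at the data values
curve(pgamma(x, 5, 1/6), add = TRUE, lwd = 2)   # population CDF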

[Figure: ECDF of the sample (n = 20, blue) with the population CDF (black).]