Solved – Probability distribution estimation — why normalize by bin width

density function, kernel-smoothing

This is from a typical introduction to kernel density estimation.

Suppose we want to estimate the probability density function $p(x)$ from a set of samples $x_1, x_2, \ldots, x_N$. The simplest approach is to use the histogram of the samples.

Divide the sample space into a number of bins and approximate the
density at the center of each bin by the fraction of points in the
training data that fall into the corresponding bin.

Suppose we have divided the sample space into $K$ bins of width $h$. The histogram of the given data is computed as:

$$
H(k) = \# \hspace{1mm} \mbox{of} \hspace{1mm} x_i \hspace{1mm} \mbox{in bin } k
$$

This can be converted into a probability density as

$$
P(k) = \frac{H(k)}{N}
$$

For a given sample $x$, let $p_k(x)$ be the probability that $x$ falls in bin $k$. From above, we have

$$
p_k(x) = P(k)
$$

Why is the above then further normalized by the bin width $h$, as follows?

$$
p_k(x) = \frac{\# \hspace{1mm} \mbox{of} \hspace{1mm} x_i \hspace{1mm} \mbox{in the same bin as } x}{N h}
$$

Best Answer

The expression $P(k) = \frac{H(k)}{N}$ is the empirical relative frequency of bin $k$, that is, the fraction of the samples that falls into bin $k$. We think of it as empirical probability mass.

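Let bin $k$ be the interval $[x_1, x_2]$ with center $x_k$ (here $x_1$ and $x_2$ denote the bin endpoints, not the first two samples). The estimated density curve $p(x)$ should be such that the area under the curve over this interval is approximately $P(k)$. That area can be approximated by the area of a trapezoid, especially if the bin width $h = x_2 - x_1$ is small, and for small $h$ the average of the two endpoint values is close to the value at the center. Applying the rule for the area of a trapezoid, we have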

$$ \frac{H(k)}{N}\approx (x_2-x_1)\frac {p(x_2) + p(x_1)}{2} \approx hp(x_k) $$

$$\Rightarrow \hat p(x_k)\approx\frac{H(k)}{Nh}$$
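As a small worked example with made-up numbers (chosen only for illustration, not taken from the question), take $N = 100$ samples, bin width $h = 0.1$, and $H(k) = 30$ samples in bin $k$:

$$ P(k) = \frac{30}{100} = 0.3, \qquad \hat p(x_k) \approx \frac{30}{100 \times 0.1} = 3 $$

The density estimate is $3$, and multiplying it by the bin width recovers the probability mass: $3 \times 0.1 = 0.3$.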

Since we started by remarking that $H(k)/N$ is an empirical probability, this result also helps to see why the value of a density is not itself a probability: it has to be multiplied by an interval width to give one.
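The effect of dividing by $h$ can also be checked numerically. The following is a minimal sketch, not part of the original answer: the helper name `histogram_density`, the uniform samples on $[0, 0.5]$ (true density $2$), and the bin counts are all assumptions made for illustration.

```python
import random

def histogram_density(samples, lo, hi, num_bins):
    """Equal-width histogram on [lo, hi]: return (h, mass, density) per bin."""
    h = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        if lo <= x <= hi:
            # Bin index of x; the right edge hi is assigned to the last bin.
            k = min(int((x - lo) / h), num_bins - 1)
            counts[k] += 1
    n = len(samples)
    mass = [c / n for c in counts]           # P(k) = H(k) / N
    density = [c / (n * h) for c in counts]  # p_hat(x_k) = H(k) / (N h)
    return h, mass, density

# Illustrative data: uniform on [0, 0.5], so the true density is 2 there.
random.seed(0)
samples = [random.uniform(0.0, 0.5) for _ in range(10_000)]

results = {}
for num_bins in (10, 20):
    h, mass, density = histogram_density(samples, 0.0, 0.5, num_bins)
    results[num_bins] = (h, mass, density)
    area = sum(d * h for d in density)  # area under the histogram density
    print(f"K={num_bins} h={h:.3f} sum P(k)={sum(mass):.3f} area={area:.3f} "
          f"mean P(k)={sum(mass) / num_bins:.3f} "
          f"mean density={sum(density) / num_bins:.3f}")
```

Halving the bin width halves the average mass per bin (from about 0.10 to about 0.05), while the density estimates stay near 2 in both cases and the area under the histogram stays at 1. Without the division by $h$, the estimate would depend on the arbitrary choice of bin width.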
