Solved – Probability distribution estimation — why normalize by bin width

density function, kernel-smoothing

This is from a typical introduction to kernel density estimation.

Suppose we want to estimate the probability density function $p(x)$ given a set of samples $x_1, x_2, \ldots, x_N$. The simplest method is based on the histogram of the samples.

Divide the sample space into a number of bins and approximate the
density at the center of each bin by the fraction of points in the
training data that fall into the corresponding bin.

Suppose we have divided the sample space into $K$ bins of width $h$. The histogram of the given data is computed as:

$$
H(k) = \# \hspace{1mm} \mbox{of} \hspace{1mm} x_i \hspace{1mm} \mbox{in bin } k
$$

This can be converted into an empirical probability for each bin as

$$
P(k) = \frac{H(k)}{N}
$$

For a given sample $x$, let $p_k(x)$ be the probability that $x$ falls in bin $k$. From above, we have

$$
p_k(x) = P(k)
$$

Why is the above then further normalized by the bin width $h$, as follows?

$$
p_k(x) = \frac{\# \hspace{1mm} \mbox{of} \hspace{1mm} x_i \hspace{1mm} \mbox{in same bin as } x}{N h}
$$

Best Answer

The expression $P(k) = \frac{H(k)}{N}$ is the empirical relative frequency of bin $k$, which we think of as the empirical probability mass that falls into that bin.

Let bin $k$ be the interval $[x_1, x_2]$ with center $x_k$ (here $x_1$ and $x_2$ denote the endpoints of the bin, not samples). Then the estimated density curve $p(x)$ should be such that the area under the curve over this interval is approximately $P(k)$. This area can be approximated by a trapezoid, especially if the bin width $h=x_2-x_1$ is small. Applying the rule for the area of a trapezoid, we have

$$ \frac{H(k)}{N}\approx (x_2-x_1)\frac {p(x_2) + p(x_1)}{2} \approx hp(x_k) $$

$$\Rightarrow \hat p(x_k)\approx\frac{H(k)}{Nh}$$

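Since we started by remarking that $H(k)/N$ is an empirical probability, this result also helps to see why the density itself does not give probabilities: a density value has to be multiplied by an interval width (that is, integrated) to yield a probability.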
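To make the distinction concrete, here is a minimal Python sketch (standard library only) of the histogram estimate described above. The function name `histogram_density`, the uniform test data, and the bin width are assumptions chosen for illustration, not part of the original question. It computes both $P(k) = H(k)/N$ and $\hat p = H(k)/(Nh)$ and checks that the first sums to 1 while the second integrates to 1.

```python
import random

def histogram_density(samples, lo, h, num_bins):
    """Return (counts, mass, density) for equal-width bins starting at lo."""
    n = len(samples)
    counts = [0] * num_bins
    for x in samples:
        k = int((x - lo) // h)              # index of the bin containing x
        if 0 <= k < num_bins:
            counts[k] += 1
    mass = [c / n for c in counts]            # P(k) = H(k) / N
    density = [c / (n * h) for c in counts]   # p_hat = H(k) / (N h)
    return counts, mass, density

# Illustrative data: uniform on [0, 0.5], whose true density is 2 there.
random.seed(0)
samples = [random.uniform(0.0, 0.5) for _ in range(1000)]

h = 0.1
counts, mass, density = histogram_density(samples, lo=0.0, h=h, num_bins=5)

total_mass = sum(mass)                    # bin probabilities sum to 1
total_area = sum(d * h for d in density)  # density times width sums to 1

print(total_mass, total_area)
print(density)  # values near 2: a density can exceed 1, a probability cannot
```

Halving $h$ roughly halves each $P(k)$ but leaves the density values near 2, which is why only the version divided by $h$ estimates $p(x)$.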

Related Solutions

Density Function – Estimating PDF of Continuous Distribution from Limited Data Points

What you are looking for is kernel density estimation. An internet search for these terms returns numerous hits, and the Wikipedia article on it should get you started. If you have R at your disposal, the function `density` provides what you need:

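```r
# Plot a density-scaled histogram of x and overlay a kernel density estimate.
histAndDensity <- function(x, ...)
{
  retval <- hist(x, freq = FALSE, ...)           # freq = FALSE: bar areas sum to 1
  lines(density(x, na.rm = TRUE), col = "red")   # overlay the KDE curve
  invisible(retval)                              # return the histogram object silently
}
```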
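For readers without R, the idea behind a kernel density estimate can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reimplementation of R's `density`: the function name `gaussian_kde`, the sample values, and the hand-picked bandwidth `h` are all made up for the example, whereas R selects a bandwidth automatically by default.

```python
import math

def gaussian_kde(samples, h):
    """Return a function x -> estimated density (Gaussian kernel, bandwidth h)."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def p_hat(x):
        # Average of Gaussian bumps of width h centred on each sample.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

    return p_hat

# Made-up sample values, for illustration only.
samples = [1.2, 1.9, 2.1, 2.4, 3.3, 3.8, 4.0]
p_hat = gaussian_kde(samples, h=0.5)

# Riemann-sum check that the estimate integrates to about 1 over [-5, 10].
dx = 0.01
grid = [-5.0 + i * dx for i in range(1501)]
area = sum(p_hat(x) for x in grid) * dx
print(area)
```

Note the same $1/(Nh)$ factor as in the histogram estimate: each sample contributes mass $1/N$, spread over a width proportional to $h$.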

Data Visualization – Histogram and Distribution Fitting for Datasets with Unequal and Open-Ended Intervals

There are lots of possible data sets that could generate these summary bins, so it's impossible to be exact, but you can make reasonable guesses.

One way to get subinterval estimates is to create a function that gives the number of people at each income level. The easiest, and perhaps the best (simplest assumptions), is to connect known points and interpolate between them. You don't really have known points, so I used the points (x = median, y = intervalCount/intervalWidth). There's not much difference between the mean and median in this set, which suggests the data values are pretty well-behaved in each interval.

Once you have such a function, you can integrate it between any two points to get any subinterval counts.
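The interpolate-then-integrate idea can be sketched as follows. The three (median, count/width) points are hypothetical values invented for the example, not the survey data discussed here, and the helper names `interpolate` and `integrate` are likewise assumptions.

```python
def interpolate(points, x):
    """Piecewise-linear interpolation through (x, y) points sorted by x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x is outside the range of the known points")

def integrate(points, a, b, steps=1000):
    """Trapezoid-rule integral of the interpolant from a to b."""
    dx = (b - a) / steps
    total = 0.5 * (interpolate(points, a) + interpolate(points, b))
    for i in range(1, steps):
        total += interpolate(points, a + i * dx)
    return total * dx

# Hypothetical points: x = interval median, y = intervalCount / intervalWidth.
points = [(10.0, 5.0), (30.0, 8.0), (50.0, 4.0)]

# Estimated number of people with income between 20 and 40.
estimated = integrate(points, 20.0, 40.0)
print(estimated)
```

Because the y-values are counts per unit width, the integral over a subinterval has units of a count, which is exactly the reason for dividing by interval width in the first place.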

Connected Line Plot

I left out the 0-0 interval because the value is literally off the chart and 1000+ because it has no real width.

Since the data is obviously not any traditional distribution, a local smoother is a decent way to smooth it out. Here's a spline smoother:

Spline Smoother Plot

It does better at the tail, but is perhaps too smooth at the beginning.

The 100-119 interval looks high in both populations. It could be due to a propensity for people to round up to 100 when answering the survey.

As far as truth in graphics goes, it is best to just plot the data that you have, which is the intervals. It might be useful to show the means/medians, but they only depart from the middle for the high ranges, which might be worth separate study.

Income bins

We can try to double our bin count by considering the medians. Theoretically, the median divides each interval into two intervals with equal population (two bars of equal area but possibly different heights). However, the breakdown is not so obvious due to possible ties and fractional medians. Here it is with interval widths of (median-lo) and (hi-median+1), where each full interval width is (hi-lo+1):

Income Half Bins
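The half-bin construction can be sketched like this. The helper `half_bins` and the numbers passed to it are hypothetical, and the sketch assumes exactly half of the interval's count lies on each side of the median, ignoring the ties and fractional medians mentioned above.

```python
def half_bins(lo, hi, median, count):
    """Split the integer-valued interval [lo, hi] at its median into two bars.

    Widths follow the text: (median - lo) and (hi - median + 1).
    Returns a list of (left_edge, width, height), where height = count / width.
    Assumes lo < median <= hi and an even split of the count (no ties).
    """
    left_width = median - lo
    right_width = hi - median + 1
    half = count / 2.0
    return [
        (lo, left_width, half / left_width),
        (median, right_width, half / right_width),
    ]

# Hypothetical interval 100-119 with median 105 and 400 respondents.
bars = half_bins(lo=100, hi=119, median=105, count=400)
for left_edge, width, height in bars:
    print(left_edge, width, height)
```

Both bars have the same area (half the count), so a median close to the lower edge produces a tall, narrow left bar and a short, wide right bar.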
