Probability for a bin in a binned histogram

histogram probability

This question is very basic, but I cannot figure out the error in my thinking. According to the author of the book "Pattern Recognition and Machine Learning", we can estimate the probability density function of a distribution in the form of a histogram as follows:

"simply divide $n_i$ (the number of observations in bin $i$) by the total number $N$ of observations and by the width $\Delta_i$ of the bins to obtain probability values given by"

$p(x) = \frac{n_i}{N\Delta_i}$

I simply cannot get my head around how this gives me the probability for a specific bin. $N\Delta_i$ is basically the total area of all the histogram bars. To calculate the relative frequency of one specific bin, why do we ignore the width $\Delta_i$ in the numerator? Including it would give the area of one bin divided by the whole area.

Best Answer

One style of histogram of a sample has a vertical axis called Density, scaled so that the total area of the histogram bars is unity $(1).$ Thus, suppose you have a large sample from a population with density function $f_X(x).$ Then the histogram will tend to imitate the shape of $f_X(x).$ That is, the area of a histogram bar with base $(a,b]$ of width $\Delta = b-a$ will approximate $P(a < X \le b) = \int_a^b f_X(x)\, dx.$
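To spell out the scaling in the notation of the question, here is a short worked derivation. The symbol $h_i$ for the bar height is introduced here for illustration only; it is not from the book.

```latex
\begin{align*}
h_i &= \frac{n_i}{N\Delta_i}
  && \text{(height of bar $i$: a density, per unit of $x$)} \\
h_i\,\Delta_i &= \frac{n_i}{N}
  && \text{(area of bar $i$: the relative frequency of bin $i$)} \\
\sum_i h_i\,\Delta_i &= \frac{1}{N}\sum_i n_i = 1
  && \text{(total area of all bars)}
\end{align*}
```

So the width is not ignored: the formula gives a height, and multiplying that height by $\Delta_i$ recovers the bin probability $n_i/N$.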

For example, suppose x is a sample of size $n = 1000$ from a population distributed $\mathsf{Gamma}(\mathrm{shape}=3, \mathrm{rate}=1/5).$ Then we might have one of the two histograms shown below, each along with the density function $f_X(x)$ of $\mathsf{Gamma}(3,1/5).$ In this example, the population mean is $\mu = 15$ and the population standard deviation is $\sigma = \sqrt{75} \approx 8.660.$ (Using R, where parameter prob=T of function hist plots a density histogram, and parameter br suggests the number of bins.)

set.seed(2022)
x = rgamma(1000, 3, .2)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.236   8.673  13.495  15.013  19.939  53.914 
sd(x)
[1] 8.488901

par(mfrow=c(1,2))
 hist(x, prob=T, br=5, ylim=c(0,.06), col="skyblue2")
  curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
 hist(x, prob=T, ylim=c(0,.06), col="skyblue2")
  curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
par(mfrow=c(1,1))

In the left panel, the bin with base $(10,20]$ of width $\Delta = 10$ contains $432$ observations, has height $.0432,$ and thus area $0.432.$ According to the density function, the probability within this interval is $0.4386.$

diff(pgamma(c(10,20), 3,.2))
[1] 0.4385731

In the right panel, the bin with base $(5,10]$ of width $\Delta = 5$ contains $248$ observations, has height $.0496,$ and thus area $0.248.$ According to the density function, the probability within this interval is $0.243.$

diff(pgamma(c(5,10), 3,.2))
[1] 0.2430222

Note: In R, some details of a particular histogram can be listed by making a non-plotted histogram (parameter plot=F). Plotting arguments such as prob and ylim are not used when nothing is drawn, so they are omitted here. For the first histogram above, we have the following partial printout:

hist(x, br=5, plot=F)

$breaks
[1]  0 10 20 30 40 50 60

$counts
[1] 321 432 196  37  12   2

$density
[1] 0.0321 0.0432 0.0196 0.0037 0.0012 0.0002

$mids
[1]  5 15 25 35 45 55

...
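As a check on the printout above, the density values can be rebuilt by hand from the counts and breaks. This is a minimal sketch: the counts and breaks are copied from the printout, and the variable names are illustrative.

```r
# Counts and breaks copied from the printout above
counts <- c(321, 432, 196, 37, 12, 2)
breaks <- seq(0, 60, by = 10)

N      <- sum(counts)            # total number of observations
widths <- diff(breaks)           # bin widths, Delta_i
dens   <- counts / (N * widths)  # bar heights, n_i / (N * Delta_i)
area   <- dens * widths          # bar areas, n_i / N

print(dens)        # should reproduce $density
print(sum(area))   # total area of the bars

# Area of the (10, 20] bar vs. the Gamma(shape = 3, rate = 0.2) probability
area_10_20 <- area[2]
prob_10_20 <- diff(pgamma(c(10, 20), shape = 3, rate = 0.2))
print(c(bar_area = area_10_20, probability = prob_10_20))
```

The height of a bar is a density; only its area is directly comparable to a probability such as the one returned by pgamma.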

Related Solutions

Solved – Confidence interval of histogram probability density function estimator

In part an answer depends on whether you condition on the total number of observations or not.

If you do condition on $N$, your observations are multinomial (so each bin count is marginally binomial), and an appropriate CI can be derived from the binomial.

If you don't condition on $N$, your observations are Poisson, and an appropriate CI can be derived from that.
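A rough sketch of both cases for a single bin, using base R's exact intervals. The bin count, sample size, and bin width below are illustrative assumptions, not values from the original question. The intervals concern the average density over the bin and ignore any bias from binning.

```r
# Illustrative values (assumptions): one bin of a histogram
n_i   <- 432    # observations in the bin
N     <- 1000   # total observations
delta <- 10     # bin width

# Conditioning on N: n_i ~ Binomial(N, p_i).
# binom.test() returns the exact (Clopper-Pearson) interval for p_i.
ci_p <- as.numeric(binom.test(n_i, N, conf.level = 0.95)$conf.int)
ci_density_binom <- ci_p / delta   # interval for p_i / delta

# Not conditioning on N: n_i ~ Poisson(lambda_i).
# poisson.test() returns an exact interval for the expected count lambda_i.
ci_lambda <- as.numeric(poisson.test(n_i, conf.level = 0.95)$conf.int)

print(ci_density_binom)
print(ci_lambda)
```

The Poisson interval is left on the count scale here; how to rescale it to a density depends on how the total is modelled, which this sketch does not attempt.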

Solved – Histogram with uniform vs non-uniform Bins

When is a uniform-bin histogram better than a non-uniform bin one?

This requires some kind of identification of what we'd seek to optimize. Many people try to optimize average integrated mean square error (AIMSE), but in many cases I think that somewhat misses the point of doing a histogram; it often (to my eye) 'oversmooths'. For an exploratory tool like a histogram I can tolerate a good deal more roughness, since the roughness itself gives me a sense of the extent to which I should "smooth" by eye. I tend to at least double the usual number of bins from such rules, sometimes a good deal more. I tend to agree with Andrew Gelman on this; indeed, if my interest were really in getting a good AIMSE, I probably shouldn't be considering a histogram anyway.
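As an illustration of "doubling the usual number of bins", here is a small sketch using the bin-count rules that ship with R (nclass.Sturges, nclass.scott, and nclass.FD in grDevices). The choice of the islands data and of the Sturges rule as the baseline is an assumption made for the example.

```r
x <- islands^(1/3)   # islands data from R's datasets package

# Bin counts suggested by three standard rules
k <- c(Sturges = nclass.Sturges(x),
       Scott   = nclass.scott(x),
       FD      = nclass.FD(x))
print(k)

# Ask for roughly twice as many bins as the Sturges rule suggests.
# hist() treats a single number as a suggestion and picks "pretty" breaks,
# so the actual number of bins may differ.
h <- hist(x, breaks = 2 * k[["Sturges"]], plot = FALSE)
print(length(h$counts))
```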

So we need a criterion.

Let me start by discussing some of the options for non-equal-width histograms:

There are some approaches that do more smoothing (fewer, wider bins) in areas of lower density and have narrower bins where the density is higher - such as "equal-area" or "equal count" histograms. Your edited question seems to consider the equal count possibility.

The histogram function in R's lattice package can produce approximately equal-area bars:

library("lattice")
histogram(islands^(1/3))  # equal width
histogram(islands^(1/3),breaks=NULL,equal.widths=FALSE)  # approx. equal area

[Figure: comparison of equal-width and equal-area histograms]

That dip just to the right of the leftmost bin is even clearer if you take fourth roots; with equal-width bins you can't see it unless you use 15 to 20 times as many bins, and then the right tail looks terrible.

There's an equal-count histogram here, with R-code, which uses sample-quantiles to find the breaks.

For example, on the same data as above, here's 6 bins with (hopefully) 8 observations each:

[Figure: equal-count histogram]

ibr=quantile(islands^(1/3),0:6/6)
hist(islands^(1/3),breaks=ibr,col=5,main="")
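To see what the quantile-based breaks actually produce, the same call can be run without plotting and the result inspected. This is a small sketch reusing the code above; the variable names x, h, and widths are illustrative.

```r
x   <- islands^(1/3)
ibr <- quantile(x, 0:6/6)            # breaks at the sample sextiles

h <- hist(x, breaks = ibr, plot = FALSE)
widths <- diff(h$breaks)

print(h$counts)                      # observations per bin
print(round(widths, 3))              # unequal bin widths
print(sum(h$density * widths))       # bar areas still sum to 1
```

With unequal widths, hist plots densities by default, so it is the bar areas rather than the bar heights that reflect the (roughly equal) counts.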

This CV question points to a paper by Denby and Mallows, a version of which is downloadable from here, which describes a compromise between equal-width bins and equal-area bins.

It also addresses the questions you had to some extent.

You could perhaps consider the problem as one of identifying the breaks in a piecewise-constant Poisson process. That would lead to work like this. There's also the related possibility of looking at clustering/classification type algorithms on (say) Poisson counts, some of which would yield a number of bins. Clustering has been used on 2D histograms (images, in effect) to identify regions that are relatively homogeneous.

If we had an equal-count histogram and some criterion to optimize, we could then try a range of counts per bin and evaluate the criterion in some way. The Wand paper mentioned here [paper, or working paper pdf] and some of its references (e.g. the Sheather et al. papers) outline "plug in" bin width estimation based on kernel smoothing ideas to optimize AIMSE; broadly speaking, that kind of approach should be adaptable to this situation, though I don't recall seeing it done.
