Probability for a bin in a binned histogram

histogram probability

This question is very basic, but I cannot find the error in my thinking. According to the author of the book "Pattern Recognition and Machine Learning", we can estimate the probability density function of a distribution from a histogram as follows:

"simply divide $n_i$ (the number of observations in bin $i$) by the total number $N$ of observations and by the width $\Delta_i$ of the bins to obtain probability values given by"

$p(x) = \frac{n_i}{N\Delta_i}$

I simply cannot get my head around how this gives me the probability for a specific bin. $N\Delta_i$ is basically the total area of all the histogram bars. To calculate the relative frequency of one specific bin, why do we leave the width $\Delta_i$ out of the numerator? Including it would give the area of one bin divided by the whole area.

Best Answer

One style of histogram of a sample has a vertical axis called Density, scaled so that the total area of the histogram bars is unity $(1).$ Thus, suppose you have a large sample from a population with density function $f_X(x).$ Then the histogram will tend to imitate the shape of $f_X(x).$ That is, the area of a histogram bar with base $(a,b]$ of width $\Delta = b-a$ will approximate $P(a < X \le b) = \int_a^b f_X(x)\, dx.$
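As a minimal sketch of the scaling described above, the following R code rebuilds the bar heights as $n_i/(N\Delta_i)$ and checks that the bar areas are the relative frequencies $n_i/N$. The sample `y` and the names `widths`, `dens` and `areas` are illustrative assumptions, not part of the original example.

```r
# Illustrative sample; any numeric vector would do.
set.seed(1)
y <- rexp(500, rate = 1)

h      <- hist(y, plot = FALSE)   # compute the bins without drawing
widths <- diff(h$breaks)          # Delta_i for each bin
N      <- length(y)

dens  <- h$counts / (N * widths)  # bar heights: n_i / (N * Delta_i)
areas <- dens * widths            # bar areas: relative frequencies n_i / N

all.equal(dens, h$density)        # heights agree with R's density scale
all.equal(sum(areas), 1)          # the bar areas sum to 1
```

The width appears in the denominator of the height precisely so that it cancels when the height is multiplied by the width: the probability of a bin is the bar's area, not its height.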

For example, suppose x is a sample of size $n = 1000$ from a population distributed $\mathsf{Gamma}(\mathrm{shape}=3, \mathrm{rate}=1/5).$ Then we might have one of the two histograms shown below, each along with the density function $f_X(x)$ of $\mathsf{Gamma}(3,1/5).$ In this example, the population mean is $\mu = 15$ and the population standard deviation is $\sigma = \sqrt{75} \approx 8.660.$ (Using R, where parameter prob=T of function hist plots a density histogram, and parameter br suggests the number of bins.)

set.seed(2022)
x = rgamma(1000, 3, .2)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.236   8.673  13.495  15.013  19.939  53.914 
sd(x)
[1] 8.488901

par(mfrow=c(1,2))
 hist(x, prob=T, br=5, ylim=c(0,.06), col="skyblue2")
  curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
 hist(x, prob=T, ylim=c(0,.06), col="skyblue2")
  curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
par(mfrow=c(1,1))

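[Figure: two density histograms of x, left with br=5 and right with the default bins, each overlaid with the Gamma(3, 1/5) density curve.]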

In the left panel, the bin with base $(10,20]$ of width $\Delta = 10$ contains $432$ observations, has height $.0432,$ and thus area $0.432.$ According to the density function, the probability within this interval is $0.4386.$

diff(pgamma(c(10,20), 3,.2))
[1] 0.4385731

In the right panel, the bin with base $(5,10]$ of width $\Delta = 5$ contains $248$ observations, has height $.0496,$ and thus area $0.248.$ According to the density function, the probability within this interval is $0.243.$

diff(pgamma(c(5,10), 3,.2))
[1] 0.2430222
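The two comparisons above can be reproduced side by side. This is only a sketch: the bar heights and widths are typed in as quoted in the text rather than recomputed from the sample, and the names `height`, `width`, `area` and `exact` are illustrative.

```r
# Bar heights and widths as quoted in the text (not recomputed from x).
height <- c(left = 0.0432, right = 0.0496)
width  <- c(left = 10,     right = 5)
area   <- height * width   # empirical estimate of P(a < X <= b)

# Exact probabilities under Gamma(shape = 3, rate = 0.2).
exact <- c(left  = diff(pgamma(c(10, 20), shape = 3, rate = 0.2)),
           right = diff(pgamma(c(5, 10),  shape = 3, rate = 0.2)))

# Areas, exact probabilities, and their differences in one table.
round(cbind(area, exact, error = area - exact), 4)
```

In both panels the bar area differs from the exact probability by less than 0.01, even though the two bar heights are on a similar scale and the bin widths differ by a factor of two.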

Note: In R, some details of a particular histogram can be listed by making a non-plotted histogram (parameter plot=F). For the first histogram above, we have the following partial printout:

hist(x, br=5, plot=F)

$breaks
[1]  0 10 20 30 40 50 60

$counts
[1] 321 432 196  37  12   2

$density
[1] 0.0321 0.0432 0.0196 0.0037 0.0012 0.0002

$mids
[1]  5 15 25 35 45 55

...
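The link between the printed `$counts` and `$density` is the formula from the question, $n_i/(N\Delta_i)$. A short sketch, with the counts and breaks copied by hand from the printout above (the variable names are illustrative):

```r
# Counts and breaks copied from the printout above.
breaks <- seq(0, 60, by = 10)
counts <- c(321, 432, 196, 37, 12, 2)

N      <- sum(counts)             # 1000 observations in total
widths <- diff(breaks)            # every bin has width 10 here

dens     <- counts / (N * widths) # bar heights: n_i / (N * Delta_i)
rel_freq <- dens * widths         # bar areas: n_i / N

dens            # same values as $density in the printout
sum(rel_freq)   # bar areas add up to 1
```

Dividing by $\Delta_i$ gives the height of a bar; multiplying that height by $\Delta_i$ again gives the bin's relative frequency $n_i/N$, which is the estimated probability of the bin.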