[Math] How does a “relative frequency histogram” become a “probability density curve”?

probability distributions, statistics

Suppose I rolled two dice and took the sum, $25$ times, then plotted the results in the histogram below.

If I add up all the heights, I get the total $25$, as expected:
$$1+1+1+3+2+7+2+1+4+2+1=25$$
No issues so far.

Next, if I want a relative frequency histogram, I just need to scale the heights of the bars by $1/25$. If I now add up all the heights, I get $1$.
No issues here either.
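
The scaling described above can be checked with a minimal Python sketch. It assumes the eleven heights in the sum above belong, in order, to the totals $2$ through $12$ (the histogram itself is not reproduced here), and that each bar has width $1$.

```python
# Bar heights taken from the sum in the question; assumed to be the
# counts for the totals 2, 3, ..., 12 in that order.
counts = [1, 1, 1, 3, 2, 7, 2, 1, 4, 2, 1]

n = sum(counts)                      # total number of rolls
rel_freq = [c / n for c in counts]   # scale every height by 1/n
bin_width = 1                        # assumed: one bar per possible total

total_height = sum(rel_freq)
total_area = sum(h * bin_width for h in rel_freq)

print(n, round(total_height, 10), round(total_area, 10))
```

With a bin width of $1$ the sum of the heights and the sum of the areas are the same number, which is why this particular example cannot distinguish the two.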

In this Khan Academy video, and everywhere else, they say that the "area" under a relative frequency histogram equals $1$. I don't see why this is true, and it is throwing me off completely. I only see that the heights add up to $1$. Maybe I'm missing something… Any help?

EDIT: In the histogram above the bin width is $1$, so it may not be a good example. Please also consider general histograms like the one below:

Best Answer

There seem to be two competing definitions of a relative frequency histogram, and some sources slip silently from one definition to the other. This seems to be the case in the video you watched, in which at 3:32 we can see some kind of histogram being replaced by another kind of histogram with approximately the same area but approximately twice the total bar height.

You can define a kind of histogram in which by definition the sum of areas of the bars is $1,$ and the bars are scaled so that the area of each bar is the probability that a randomly chosen observation is in that bar. Here are some course notes that define this kind of histogram as a scaled relative frequency histogram (while using the term relative frequency histogram for a chart in which the bar heights add up to $1$). On the other hand, this answer on stats.SE defines a relative frequency histogram as one in which the areas of the bars add up to $1$; in fact, that answer defines an ordinary frequency histogram as one in which the areas (not heights!) of the bars add up to the total number of observations.

And then you have sources such as this web page, in which they explicitly say that relative frequency is the number of observations (of a particular subset of values) divided by the total number of observations, and then claim that the area under this histogram is always $1,$ which is a dubious statement. The figure actually drawn on that web page for the "relative" histogram has scaled the height of each bar to be twice the number of observations divided by the total number of observations in order to make the total area come out to $1.$

So you are justified in being a bit confused, because different sources use the same words for different things, and some sources even contradict themselves.

The key thing is to find a source of instruction that is clear and consistent about the meaning of each thing it shows you. A good source, when it wants the total area of a histogram to be $1,$ will clearly define the construction of the histogram in a way that forces the total area of the bars to be $1.$ A good way to do this is to set the area of each bar to the fraction of the observations that are in the range of that bar along the horizontal axis. If we do this, and if we make sure the bottom of each bar is exactly the part of the horizontal axis covered by its range, we get a histogram that resembles a density plot.
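
The construction just described can be sketched in Python. This is only an illustration under stated assumptions: the data, the bin edges, and the helper name `density_histogram` are all hypothetical, and the bins are deliberately unequal to show that the heights need not add up to $1$ even though the areas do.

```python
from bisect import bisect_right

def density_histogram(data, edges):
    """Return (heights, widths) so that each bar's area equals the
    fraction of observations in its bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        # Observations outside the covered range are ignored.
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    n = sum(counts)
    widths = [b - a for a, b in zip(edges, edges[1:])]
    # height = (fraction of observations in the bin) / (bin width)
    heights = [c / n / w for c, w in zip(counts, widths)]
    return heights, widths

# Hypothetical observations and unequal bin edges.
data = [0.2, 0.7, 1.1, 1.4, 1.9, 2.5, 3.0, 3.8, 4.5, 5.9]
edges = [0, 1, 2, 4, 6]

heights, widths = density_histogram(data, edges)
areas = [h * w for h, w in zip(heights, widths)]

print(sum(heights))  # not 1 in general
print(sum(areas))    # 1, up to floating-point rounding
```

Dividing each fraction by the bin width is what forces the total area to be $1$ regardless of how the bins are chosen.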

Related Solutions

[Math] Making a Histogram When Given Cumulative Relative Frequency

Let $X(i)$ be the $i$-th cumulative relative frequency and $r(i)$ the $i$-th relative frequency. The steps to calculate the $r(i)$ are shown below:

i    X(i)    r(i)
1    X(1)    r(1) = X(1)
2    X(2)    r(2) = X(2) - X(1)
3    X(3)    r(3) = X(3) - X(2)
4    X(4)    r(4) = X(4) - X(3)
...
n    X(n)    r(n) = X(n) - X(n-1)
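
The differencing steps above can be written as a short Python sketch. The function name `relative_from_cumulative` and the sample values are hypothetical, chosen only to illustrate the rule.

```python
def relative_from_cumulative(X):
    """Recover relative frequencies from cumulative ones:
    r(1) = X(1), and r(i) = X(i) - X(i-1) for i >= 2."""
    if not X:
        return []
    return [X[0]] + [b - a for a, b in zip(X, X[1:])]

# Hypothetical cumulative relative frequencies for four bins.
X = [0.10, 0.35, 0.75, 1.00]
r = relative_from_cumulative(X)

print([round(v, 10) for v in r])
```

The recovered values $r(i)$ are the bar heights of the relative frequency histogram, and they add up to the last cumulative value $X(n)$.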

[Math] the Difference between Frequency and Density in a Histogram

Illustrations:

Suppose $X_1, X_2, \dots, X_{100}$ is a random sample of size $n = 100$ from a normal distribution with mean $\mu=100$ and standard deviation $\sigma=15.$ Also, we have bins (intervals) of equal width, which we use to make a histogram.

The vertical scale of a 'frequency histogram' shows the number of observations in each bin. Optionally, we can also put numerical labels atop each bar that show how many individuals it represents.

The vertical scale of a 'density histogram' shows units that make the total area of all the bars add to $1.$ This makes it possible to show the density curve of the population using the same vertical scale.

From above, we know that the tallest bar has 30 observations, so this bar accounts for relative frequency $\frac{30}{100} = 0.3$ of the observations. The width of this bar is $10.$ So its density is $0.03$ and its area is $0.03(10) = 0.3.$ The density curve of the distribution $\mathsf{Norm}(100, 15)$ is also shown superimposed on the histogram. The area beneath this density curve is also $1.$ (By definition, the area beneath a density function is always $1.)$ Optionally, I have added tick marks below the histogram to show the locations of the individual observations.

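Definitions: If the frequency of the $i$th bar is $f_i,$ then its relative frequency is $r_i = f_i/n,$ where $n$ is the sample size. Its density is $d_i = r_i/w_i,$ where $w_i$ is its width. Ordinarily, you should make a frequency or relative frequency histogram only if each bar has the same width; a density histogram, in which areas rather than heights represent proportions, remains meaningful when the widths differ.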
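
These definitions can be illustrated with a minimal Python sketch. Only the tallest bar ($30$ observations, width $10$, $n = 100$) comes from the text above; the remaining bar counts are hypothetical values chosen so that the total is $100$.

```python
# Hypothetical bar frequencies f_i; only the tallest bar (30) is
# taken from the text, the others are made up to total 100.
f = [2, 8, 17, 30, 24, 13, 6]
w = [10] * len(f)              # equal bin widths of 10, as in the text

n = sum(f)                                   # sample size
r = [fi / n for fi in f]                     # relative frequency r_i = f_i / n
d = [ri / wi for ri, wi in zip(r, w)]        # density d_i = r_i / w_i

tallest = max(range(len(f)), key=lambda i: f[i])
print(r[tallest], d[tallest])                # relative frequency and density of the tallest bar
print(sum(di * wi for di, wi in zip(d, w)))  # total area of the density histogram
```

The relative frequencies add up to $1$ as heights, while the densities add up to $1$ only after each is multiplied by its bar width.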

Notes: (1) Another type of histogram (that you did not ask about) would be a 'relative frequency' histogram with relative frequencies (not densities) on the vertical scale. (2) The sample mean of the data shown is $\bar X =102.98$ and the sample standard deviation is $S = 15.37.$ (3) These histograms were made using R statistical software.
