Solved – the purpose of multiplying by the difference between the midpoints of two bins in this recipe

data visualization, histogram, r

While browsing for information on how I might plot a fitted normal curve over a histogram, I found the following:

http://www.statmethods.net/graphs/density.html

There is a line I don't fully understand, though I recognize that it really does work:

yfit <- yfit*diff(h$mids[1:2])*length(x)

Here, yfit initially holds values of the pdf of the fitted normal distribution, evaluated at regular intervals along the x-axis; length(x) is the number of observations in the vector x from which the histogram was prepared; and diff(h$mids[1:2]) is the distance along the x-axis between the midpoints of the first and second bars of that histogram. After this statement runs, yfit holds its original values multiplied by those two terms.
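For context, the surrounding recipe can be sketched roughly as follows. This is a minimal reconstruction rather than a verbatim copy of the linked page: the data (R's built-in mtcars$mpg), the number of breaks, and the 40-point grid xfit are assumptions chosen for illustration.

```r
# Illustrative data: fuel economy from R's built-in mtcars data set (an assumption).
x <- mtcars$mpg

# Draw a frequency histogram and keep the returned object, which holds
# the bin breaks, midpoints, counts and densities.
h <- hist(x, breaks = 10, xlab = "Miles per gallon",
          main = "Histogram with normal curve")

# Evaluate the fitted normal density on a regular grid over the data range.
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))

# Rescale from density to expected count per bin:
# density * bin width * number of observations.
yfit <- yfit * diff(h$mids[1:2]) * length(x)

# Overlay the rescaled curve on the histogram.
lines(xfit, yfit, col = "blue", lwd = 2)
```

Without the rescaling line, the curve would sit on the density scale (values well below 1) and look almost flat against bars whose heights are counts.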

I understand that multiplying by length(x) makes sense, since it turns values of a probability density function into numbers of observations around each respective value, bearing in mind that a continuous pdf is being used here and the number of observations expected at any single point is zero.

I don't understand why it is necessary to multiply by diff(h$mids[1:2]) to get the right outcome in the graph, although I can confirm that it does get the right outcome.

Does anyone have an explanation?

Best Answer

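dnorm returns a probability density, that is, probability per unit of x, not a probability. Each histogram bar covers an interval whose width is diff(h$mids[1:2]) (the bin width, given that all bins are equally wide). The probability of an observation falling in a bin is therefore approximately an area: height (yfit from dnorm) times base (diff(h$mids[1:2])). Multiplying that probability by the number of observations, length(x), gives the expected frequency of occurrence in the bin, which is the scale on which the histogram is drawn. This follows from the classical formula: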

$\text{prob} = \frac{\text{freq. of occurrence}}{\text{total possible occurrences}}$
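Rearranged for the frequency and written with symbols, the same relation reads as below. The notation is introduced here only for illustration: $n$ is length(x), $w$ is the bin width diff(h$mids[1:2]), $f$ is the fitted normal density returned by dnorm, and $m_i$ is the midpoint of bin $i$.

```latex
\[
  \mathbb{E}[\text{count}_i]
  \;=\; n \int_{m_i - w/2}^{\,m_i + w/2} f(t)\,dt
  \;\approx\; n \, w \, f(m_i)
\]
```

The approximation treats the density as roughly constant across a bin, so it is closer for narrow bins than for wide ones.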

Here is the mapping to the classical formula:

yfit             =  yfit * diff(h$mids[1:2])  *  length(x)
freq occurrence  =  probability area          *  total occurrences
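The mapping can be checked numerically. The following is a small sketch, assuming the built-in mtcars$mpg data and equally wide bins; the names n, w, m, s and expected are introduced here for illustration only.

```r
x <- mtcars$mpg                        # illustrative data (an assumption)
h <- hist(x, breaks = 10, plot = FALSE)

n <- length(x)                         # total occurrences
w <- diff(h$mids[1:2])                 # bin width; valid because bins are equally wide
m <- mean(x)
s <- sd(x)

# The histogram's own density obeys the same relation exactly:
# count = density * bin width * number of observations.
stopifnot(isTRUE(all.equal(h$density * w * n, as.numeric(h$counts))))

# Expected counts under the fitted normal, evaluated at the bin midpoints.
expected <- dnorm(h$mids, mean = m, sd = s) * w * n

# Side-by-side comparison of observed and expected counts per bin.
print(data.frame(mid = h$mids, observed = h$counts,
                 expected = round(expected, 2)))

# Equivalent alternative: draw the histogram on the density scale
# (freq = FALSE) and overlay the unscaled normal density.
hist(x, breaks = 10, freq = FALSE)
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```

The last two lines show the other way around the problem: with freq = FALSE the bars themselves are densities, so no multiplication by the bin width or by length(x) is needed.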