Statistics – Deriving the Mode of Grouped Data

statistics

A formula to calculate the mode of grouped data is given in my textbook:

$\text{Mode} = l + \dfrac{(f_1 - f_0)h}{2f_1 - f_0 - f_2}$

where $l = $ lower limit of the modal class,

$h = $ size of the class interval,

$f_1 = $ frequency of the modal class,

$f_0 = $ frequency of the class preceding the modal class,

$f_2 =$ frequency of the class succeeding the modal class.

Can you please explain the derivation of this formula? It is not given in my textbook. Thanks.

Best Answer

The following is not a rigorous derivation (a derivation would require a lot of assumptions about what makes one estimator better than another), but is an attempt to "make sense" of the formula so that you can more easily remember and use it.

Consider a bar graph with a bar for each of the classes of data. Then $f_1$ is the height of the bar of the modal class, $f_0$ is the height of the bar on the left of it, and $f_2$ is the height of the bar on the right of it.

The quantity $f_1 - f_0$ measures how far the modal class's bar "sticks up" above the bar on its left. The quantity $f_1 - f_2$ measures how far the modal class's bar "sticks up" above the bar on its right.

Now, observe that $$ \frac{f_1 - f_0}{2f_1 - f_0 - f_2} + \frac{f_1 - f_2}{2f_1 - f_0 - f_2} = \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} + \frac{f_1 - f_2}{(f_1 - f_0) + (f_1 - f_2)} = 1 $$ So if we want to divide an interval of width $h$ into two pieces, where the ratio of sizes of those two pieces is $(f_1 - f_0) : (f_1 - f_2)$, the first piece will have width $\frac{f_1 - f_0}{2f_1 - f_0 - f_2} h$.

This is what the formula for estimating the mode does. It splits the width of the modal bar into two pieces whose ratio of widths is $(f_1 - f_0) : (f_1 - f_2)$, and it says the mode is at the line separating those two pieces, that is, at a distance $\frac{f_1 - f_0}{2f_1 - f_0 - f_2} h$ from the left edge of that bar, $l$.
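One common way to arrive at this ratio geometrically is sketched below. It is offered as an illustration and is not taken from the textbook; the parameter $t$ and the corner coordinates are notation introduced here, and equal class widths $h$ are assumed. On the histogram, draw one line from the top-right corner of the left bar, $(l, f_0)$, to the top-right corner of the modal bar, $(l + h, f_1)$, and another from the top-left corner of the modal bar, $(l, f_1)$, to the top-left corner of the right bar, $(l + h, f_2)$. Writing $x = l + th$ with $0 \le t \le 1$, the two lines cross where

$$ f_0 + (f_1 - f_0)\,t = f_1 - (f_1 - f_2)\,t \quad\Longrightarrow\quad t = \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} = \frac{f_1 - f_0}{2f_1 - f_0 - f_2}, $$

so the crossing point lies at $x = l + \dfrac{(f_1 - f_0)h}{2f_1 - f_0 - f_2}$, which matches the textbook formula.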

If $f_1 - f_0 = f_1 - f_2,$ that is, the modal bar is equally far above the bars on both its left and right, then this formula estimates the mode right in the middle of the modal class: $$ l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} h = l + \frac12 h. $$ But if the height of the bar on the left is closer to the modal bar's height than the height of the bar on the right is, then the estimated mode is to the left of the centerline of the modal class. In the extreme case where the bar on the left is exactly the height of the modal bar, and both are taller than the bar on the right, that is, when $f_1 - f_0 = 0$ but $f_1 - f_2 > 0$, the formula estimates the mode at $l$ exactly, that is, at the left edge of the modal bar. In the other extreme case, where the bar on the left is shorter but the bar on the right is the same height as the modal bar ($f_1 - f_0 > 0$ but $f_1 - f_2 = 0$), the formula estimates the mode at $l + h$, that is, at the right edge of the modal bar.
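The cases just described can be checked numerically. The following is a minimal R sketch of the formula; the function name `grouped_mode`, its argument names, and the example class $[20, 30)$ with its frequencies are all made up for illustration.

```r
# Grouped-data mode estimate: l + (f1 - f0) / (2*f1 - f0 - f2) * h
# Function and argument names are illustrative, not from the textbook.
grouped_mode <- function(l, h, f0, f1, f2) {
  denom <- 2 * f1 - f0 - f2
  # If both neighbours are as tall as the modal bar, the ratio is 0/0.
  if (denom == 0) stop("f0 = f1 = f2: the formula is undefined")
  l + (f1 - f0) / denom * h
}

# Hypothetical modal class [20, 30) with different neighbouring frequencies
grouped_mode(l = 20, h = 10, f0 = 5,  f1 = 12, f2 = 5)   # symmetric: midpoint, 25
grouped_mode(l = 20, h = 10, f0 = 12, f1 = 12, f2 = 5)   # left bar ties: left edge, 20
grouped_mode(l = 20, h = 10, f0 = 5,  f1 = 12, f2 = 12)  # right bar ties: right edge, 30
grouped_mode(l = 20, h = 10, f0 = 9,  f1 = 12, f2 = 5)   # left bar closer: 23, left of midpoint
```

The explicit check on the denominator is a design choice of this sketch: when all three bars have the same height the formula gives $0/0$, and stopping with an error seemed safer than silently returning `NaN`.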

Related Solutions

[Math] Grouped data median, using lower class limit or lower class boundary

After some consideration, I think using the lower class boundary makes more sense than using the lower class limit. For example, take this data:

Class  Frequency
 1       1
 2       1
 3       1
 4       1

From the data we can see, without any calculation, that the median is 2.5. Using the formula mentioned above, $\frac{n}{2} = 2$, so the class containing the median is class 2. Taking $L_m$ to be the lower boundary of that class, 1.5, gives

$\text{median} = 1.5 + \left[ \frac{2 - 1}{1}\right] \times 1 = 2.5$

With the lower limit, 2, the same formula would give $2 + \left[ \frac{2 - 1}{1}\right] \times 1 = 3$, which does not make sense. Now change the classes to

Class Frequency
 1-2    1
 3-4    1
 5-6    1
 7-8    1

Using the lower boundary as above, we get

$\text{median} = 2.5 + \left[ \frac{2 - 1}{1}\right] \times 2 = 4.5$

However, using the lower class limit, 3, we would get $3 + \left[ \frac{2 - 1}{1}\right] \times 2 = 5$.
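The comparison can be reproduced with a short R sketch. It assumes the usual grouped-median formula, $L_m + \frac{n/2 - F}{f_m} \times c$, where $F$ is the cumulative frequency before the median class, $f_m$ is the frequency of the median class, and $c$ is the class width. The function name `grouped_median` and its arguments are hypothetical.

```r
# Grouped-data median: lower + (n/2 - F) / f_m * width, where F is the
# cumulative frequency before the median class (assumed standard formula).
grouped_median <- function(lower, width, freq) {
  n <- sum(freq)
  cum <- cumsum(freq)
  m <- which(cum >= n / 2)[1]                 # index of the median class
  F_before <- if (m == 1) 0 else cum[m - 1]   # cumulative frequency before it
  lower[m] + (n / 2 - F_before) / freq[m] * width
}

freq <- c(1, 1, 1, 1)
grouped_median(lower = c(0.5, 1.5, 2.5, 3.5), width = 1, freq)  # boundaries, classes 1..4: 2.5
grouped_median(lower = c(0.5, 2.5, 4.5, 6.5), width = 2, freq)  # boundaries, classes 1-2..7-8: 4.5
grouped_median(lower = c(1, 3, 5, 7), width = 2, freq)          # limits, classes 1-2..7-8: 5
```

Passing the vector of lower edges explicitly makes the choice between boundaries and limits visible at the call site, which is the only thing that differs between the last two calls.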

[Math] Mode of a frequency distribution with unequal class length

Here is an outline of what I intend to do:

(1) 'Reconstruct' the original data by using R to spread the observations in each interval at random within the interval. Here is a density histogram (intervals of equal length) of one such reconstruction.

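 # Spread each group's observations uniformly at random within its interval
 x = c(runif(32,0,20), runif(45,20,50), runif(15,50,70), runif(8,70,100))
 hist(x, prob=TRUE, col="wheat")   # density histogram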

(2) Use a modern density estimator to 'smooth' this histogram, and determine the location of the highest point of the density estimator, which is a reasonable estimate of the mode of the reconstructed data. For this reconstruction, the mode is 22.4.

 hist(x, prob=TRUE, col="wheat")
 dxy = density(x);  dx = dxy$x; dy = dxy$y   # (x,y) components of 'smooth'
 lines(dxy, col="blue")                      # overlay the density estimate
 dx[which.max(dy)]                           # x-value at which 'smooth' has its max
 ## 22.36885   # estimated mode

(3) Of course, each random reconstruction of the data will be somewhat different. Repeat steps (1) and (2) 2000 times and keep track of the 2000 modes produced. The median of these estimated modes was 23.6. Take this value to be a reasonable estimator of the mode of the distribution from which the original data were sampled.
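Steps (1) to (3) could be put together roughly as follows. This is only a sketch: the simulation code actually used is not shown here, so the helper names `reconstruct` and `kde_mode` and the seed are assumptions, and the resulting median will differ somewhat from 23.6 from run to run.

```r
# One random 'reconstruction' of the grouped data, as in step (1)
reconstruct <- function() {
  c(runif(32, 0, 20), runif(45, 20, 50), runif(15, 50, 70), runif(8, 70, 100))
}

# Location of the highest point of the kernel density estimate, as in step (2)
kde_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

set.seed(2024)  # arbitrary seed, only so that a run can be repeated
modes <- replicate(2000, kde_mode(reconstruct()))

median(modes)   # typical mode estimate across the 2000 reconstructions
summary(modes)  # spread of the estimates
```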

However, these estimated modes were quite variable (mainly because so much information was lost when the original data were summarized into four groups of unequal lengths). Below is a boxplot of the 2000 mode estimates. (Note: The histogram and density-estimator curve in the figure above happen to be for the last of the 2000 reconstructions of the data in my simulation.)

I doubt that this is anything like the method you were expected to use, but I believe this is a responsible approach to solving the problem. (Certainly better than the approaches I initially suggested in my Comment an hour ago. Maybe I should delete the Comment now, but that seems like cheating.)
