[Math] Mode of a frequency distribution with unequal class length

statistics

How can I find the mode for a grouped frequency distribution with unequal class lengths? I have to find the mode for the following problem:

$$\begin{array}{c|c}
\text{Marks} & \text{# of Students} \\ \hline
\text{0 – 20} & 32 \\ \hline
\text{20 – 50} & 45 \\ \hline
\text{50 – 70} & 15 \\ \hline
\text{70 – 100} & 8 \\ \hline
\end{array}$$

For equal class lengths, we use the formula
$$\text{Mode} = l+\frac{(f_0-f_{-1})}{2f_0-f_{-1}-f_{+1}}W_o$$
where
$l$ is the lower class boundary of the modal class,
$f_0$ is the frequency of the modal class,
$f_{-1}$ is the frequency of the class preceding the modal class,
$f_{+1}$ is the frequency of the class following the modal class,
$W_{o}$ is the class width of the modal class

But how should I proceed for the example above, where the class lengths are unequal?

Best Answer

Here is an outline of what I intend to do:

(1) 'Reconstruct' the original data by using R to spread the observations in each interval at random within the interval. Here is a density histogram (intervals of equal length) of one such reconstruction.

 # spread each class's count uniformly at random over its interval
 x = c(runif(32,0,20), runif(45,20,50), runif(15,50,70), runif(8,70,100))
 hist(x, prob=TRUE, col="wheat")

(2) Use a kernel density estimator (R's `density`) to 'smooth' this histogram, and determine the location of the highest point of the density estimate, which is a reasonable estimate of the mode of the reconstructed data. For this reconstruction, the estimated mode is about 22.4.

 hist(x, prob=TRUE, col="wheat")
 dxy = density(x)               # kernel density estimate of the reconstructed data
 lines(dxy, col="blue")
 dx = dxy$x; dy = dxy$y         # (x,y) components of 'smooth'
 dx[which.max(dy)]              # x-value at which 'smooth' has its max
 ## 22.36885   # estimated mode

(3) Of course, each random reconstruction of the data will be somewhat different. Repeat steps (1) and (2) 2000 times and keep track of the 2000 modes produced. The median of these estimated modes was 23.6. Take this value to be a reasonable estimator of the mode of the distribution from which the original data were sampled.
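Steps (1) to (3) can be sketched as a single loop. This is a minimal sketch rather than the exact script behind the numbers quoted above: the helper name `one_mode` and the seed are assumptions made here for illustration, so the median it produces will differ somewhat from run to run and from the value reported.

```r
# Sketch: repeat the random reconstruction and density-peak search many times.
set.seed(1)  # assumption: arbitrary seed, only to make the sketch reproducible

one_mode <- function() {
  # Step (1): spread each class's observations uniformly within its interval
  x <- c(runif(32, 0, 20), runif(45, 20, 50),
         runif(15, 50, 70), runif(8, 70, 100))
  # Step (2): location of the highest point of the kernel density estimate
  d <- density(x)
  d$x[which.max(d$y)]
}

# Step (3): 2000 reconstructions, one estimated mode each
modes <- replicate(2000, one_mode())
median(modes)   # a typical estimated mode across reconstructions
summary(modes)  # spread of the estimates
```

Calling `boxplot(modes)` afterwards gives a picture of how variable the individual estimates are.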

However, these estimated modes were quite variable (mainly because so much information was lost in the original summary of the data into four groups of unequal lengths). Below is a boxplot of the 2000 mode estimates. (Note: The histogram and density-estimator curve in the figure above happen to be for the last of the 2000 reconstructions of the data in my simulation.)

I doubt that this is anything like the method you were expected to use, but I believe this is a responsible approach to solving the problem. (Certainly better than the approaches I initially suggested in my Comment an hour ago. Maybe I should delete the Comment now, but that seems like cheating.)
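As a quick cross-check on why the class lengths matter, the counts can be converted to frequency densities (students per mark), which is what a density histogram plots. The sketch below is not part of the simulation above; the variable names are chosen here for illustration.

```r
# Class boundaries and counts from the question
breaks <- c(0, 20, 50, 70, 100)
counts <- c(32, 45, 15, 8)

widths <- diff(breaks)      # class lengths: 20, 30, 20, 30
dens   <- counts / widths   # students per mark in each class

data.frame(lower = head(breaks, -1), upper = breaks[-1],
           count = counts, width = widths, density = round(dens, 3))

which.max(counts)  # class with the most students (20-50)
which.max(dens)    # class with the most students per mark (0-20)
```

The class with the largest count (20 to 50) is not the class with the largest density (0 to 20, at 1.6 students per mark against 1.5), so raw counts alone can point at the wrong modal class when class lengths differ.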

Related Solutions

[Math] Class Limits, boundaries, midpoint, relative frequency

Here's the tally of your numbers:

{{65, 8}, {75, 6}, {45, 5}, {70, 5}, {90, 4}, {50, 3}, {55, 3}, {80, 
  3}, {85, 3}, {95, 3}, {15, 2}, {30, 2}, {60, 2}, {68, 2}, {120, 
  2}, {125, 2}, {10, 1}, {28, 1}, {33, 1}, {40, 1}, {46, 1}, {52, 
  1}, {58, 1}, {73, 1}, {78, 1}, {82, 1}, {99, 1}, {100, 1}, {105, 
  1}, {115, 1}, {137, 1}, {140, 1}, {145, 1}, {200, 1}}

And here's the histogram with bin width = 1, thus replicating the above tally:

[Figure: histogram of the tallied values with bin width 1]

The mean is 73.7, the quartiles are {54.3, 70, 90}. There are many more summary statistics that can be computed.
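The mean and median can be checked directly from the tally. This is a small sketch that expands the {value, count} pairs listed above into raw observations; the vector names are chosen here for illustration. Quartiles depend on the interpolation convention, so R's default `quantile` need not reproduce the lower quartile quoted above.

```r
# {value, count} pairs copied from the tally, in the same order
values <- c(65, 75, 45, 70, 90, 50, 55, 80, 85, 95,
            15, 30, 60, 68, 120, 125,
            10, 28, 33, 40, 46, 52, 58, 73, 78, 82,
            99, 100, 105, 115, 137, 140, 145, 200)
counts <- c(8, 6, 5, 5, 4, 3, 3, 3, 3, 3,
            2, 2, 2, 2, 2, 2,
            rep(1, 18))

obs <- rep(values, counts)  # expand the tally into individual observations

length(obs)          # number of observations
round(mean(obs), 1)  # mean, to one decimal place
median(obs)          # middle value
quantile(obs)        # quartiles under R's default convention (type 7)
```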

What method are you using to classify (i.e., partition the x-axis) or cluster the data?

Statistics – Deriving the Mode of Grouped Data

The following is not a rigorous derivation (a derivation would require a lot of assumptions about what makes one estimator better than another), but is an attempt to "make sense" of the formula so that you can more easily remember and use it.

Consider a bar graph with a bar for each of the classes of data. Then $f_1$ is the height of the bar of the modal class, $f_0$ is the height of the bar on the left of it, and $f_2$ is the height of the bar on the right of it.

The quantity $f_1 - f_0$ measures how far the modal class's bar "sticks up" above the bar on its left. The quantity $f_1 - f_2$ measures how far the modal class's bar "sticks up" above the bar on its right.

Now, observe that $$ \frac{f_1 - f_0}{2f_1 - f_0 - f_2} + \frac{f_1 - f_2}{2f_1 - f_0 - f_2} = \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} + \frac{f_1 - f_2}{(f_1 - f_0) + (f_1 - f_2)} = 1 $$ So if we want to divide an interval of width $h$ into two pieces, where the ratio of sizes of those two pieces is $(f_1 - f_0) : (f_1 - f_2)$, the first piece will have width $\frac{f_1 - f_0}{2f_1 - f_0 - f_2} h$.

This is what the formula for estimating the mode does. It splits the width of the modal bar into two pieces whose ratio of widths is $(f_1 - f_0) : (f_1 - f_2)$, and it says the mode is at the line separating those two pieces, that is, at a distance $\frac{f_1 - f_0}{2f_1 - f_0 - f_2} h$ from the left edge of that bar, $l$.

If $f_1 - f_0 = f_1 - f_2,$ that is, the modal bar is equally far above the bars on both its left and right, then this formula estimates the mode right in the middle of the modal class: $$ l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} h = l + \frac12 h. $$ But if the height of the bar on the left is closer to the modal bar's height, then the estimated mode is to the left of the centerline of the modal class. In the extreme case where the bar on the left is exactly the height of the modal bar, and both are taller than the bar on the right, that is, when $f_1 - f_0 = 0$ but $f_1 - f_2 > 0$, the formula estimates the mode at $l$ exactly, that is, at the left edge of the modal bar. In the other extreme case, where the bar on the left is shorter but the bar on the right is the same height as the modal bar ($f_1 - f_0 > 0$ but $f_1 - f_2 = 0$), the formula estimates the mode at $l + h$, that is, at the right edge of the modal bar.
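These cases are easy to check numerically. The sketch below wraps the formula in a small R function; the function name `grouped_mode` and the frequencies are made up for illustration, with a modal class running from 20 to 30.

```r
# Grouped-data mode formula in the notation used above:
# l = left edge of the modal class, h = its width,
# f1 = modal frequency, f0 / f2 = frequencies of the left / right neighbours.
grouped_mode <- function(l, h, f0, f1, f2) {
  l + (f1 - f0) / (2 * f1 - f0 - f2) * h
}

# Hypothetical frequencies for a modal class with l = 20 and h = 10
grouped_mode(l = 20, h = 10, f0 = 6,  f1 = 10, f2 = 6)   # symmetric neighbours: 25
grouped_mode(l = 20, h = 10, f0 = 10, f1 = 10, f2 = 6)   # left bar ties the modal bar: 20
grouped_mode(l = 20, h = 10, f0 = 6,  f1 = 10, f2 = 10)  # right bar ties the modal bar: 30
grouped_mode(l = 20, h = 10, f0 = 8,  f1 = 10, f2 = 4)   # left bar closer in height: 22.5
```

In the last call the pieces are in the ratio $(10 - 8) : (10 - 4) = 1 : 3$, so the estimate sits a quarter of the way across the class, to the left of the centerline.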
