[Math] calculation of median of grouped data

data analysis, median, statistics

While calculating the median of grouped data with total frequency $N$, which value should be matched against the cumulative frequencies to find the median class: $\frac N2$ or $\frac{N+1}{2}$ (it seems both are used)? I think $\frac{N+1}{2}$ should be taken, since for a list of values (i.e. ungrouped data) a fractional value of $\frac{N+1}{2}$ indicates that the average of the $\frac N2$-th and $\left(\frac N2 + 1\right)$-th values gives the median.

Now the second part of my question: while calculating the median of grouped data, if the value of $\frac{N+1}{2}$ (or $\frac N2$) is a fraction, say 50.5, and there is a cumulative frequency of 50, what should we do? Should we take two median classes, the one with cumulative frequency 50 and the one following it, calculate a median from each using the formula $L + \frac{\frac N2 - C}{f} \times w$, and take their average as the final median? Or do something else? What is the correct procedure in this kind of situation?

EDIT:

So, here is a specific problem regarding the second part of my question:

We have to find out the median score from the following frequency distribution table:

Score                :  0-10    10-20    20-30    30-40    40-50
Number of students   :   4        3        5        6        7
Cumulative frequency :   4        7        12       18       25

Here the intervals are of the type $(a, b]$.

Now, $N=25 \implies \frac N2 = 12.5$, which means that we have to look for the interval which covers the 12th and 13th items. Looking at the cumulative frequencies, we see that the 3rd interval (i.e. 20-30) covers the 12th item, while the 4th interval (i.e. 30-40) covers the 13th item. If we are supposed to take both intervals as median classes for the sake of using the formula
$\text{median} = L + \frac{\frac N2 - C}{f} \times w$, then we will end up with two medians. We could take the average of these as the required median, though. I want to know the correct procedure here.

Note 1:

I am only concerned with using the above formula, not with any other method of finding the median of grouped data. There is a variation of the above formula where $\frac{N+1}{2}$ is used instead of $\frac N2$; the first part of my question refers to this confusion as well.

Note 2:

In the formula,

L = lower boundary of the median class
N = total frequency
C = cumulative frequency of the class preceding the median class
f = frequency of the median class
w = width of the median class i.e. upper boundary - lower boundary

Note 3:

If we consider the interval 20-30 as the median class and use the above formula, then the median will be

$20 + \frac{\frac{25}{2} - 7}{5} \times 10 = 31$

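Considering the interval 30-40 as the median class instead, the formula gives $30 + \frac{\frac{25}{2} - 12}{6} \times 10 \approx 30.83$, which is close to the value above but not identical to it (note also that 31 lies outside the interval 20-30). I am not sure how close the two results will be for every problem of this type; if they always agreed, we could take either of the two intervals as the median class.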

Note 4:

I don't know whether there is a rule for this kind of situation saying that we have to select the cumulative frequency (and hence the corresponding interval as the median class) which is nearest to the value of $\frac N2$. Under such a rule we would have to take the interval 20-30 as the median class in this example. It would be enough if anyone could confirm such a rule.

Best Answer

Because this is essentially a duplicate, I address a few issues that do not explicitly overlap the related question or answer:

If a class has cumulative relative frequency 0.5 (that is, cumulative frequency exactly $N/2$), then the median is at the boundary between that class and the next larger one.
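
To see why, here is a short check in the question's notation, under the assumptions that the classes are contiguous, that some class $k$ has cumulative frequency exactly $N/2$, and that the next class is not empty. The subscripted symbols ($L_k$, $f_k$, $w_k$, $C_k$) are introduced here only for illustration. Taking class $k$ itself as the median class:

$$L_k + \frac{\frac N2 - C_{k-1}}{f_k} \times w_k = L_k + \frac{C_k - C_{k-1}}{f_k} \times w_k = L_k + w_k,$$

since $C_k - C_{k-1} = f_k$. Taking the next class $k+1$ instead:

$$L_{k+1} + \frac{\frac N2 - C_k}{f_{k+1}} \times w_{k+1} = L_{k+1} + 0 = L_{k+1}.$$

Both results are the shared boundary $L_k + w_k = L_{k+1}$, so in this particular case the choice between the two classes does not change the answer.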

If $N$ is large (really the only case where this method is generally successful), there is little difference between $N/2$ and $(N+1)/2$ in the formula. All references I checked use $N/2$.
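
The size of the gap can be made explicit. As a sketch, assuming both $N/2$ and $(N+1)/2$ fall in the same median class, the two variants of the formula differ by

$$\left(L + \frac{\frac{N+1}{2} - C}{f} \times w\right) - \left(L + \frac{\frac N2 - C}{f} \times w\right) = \frac{w}{2f}.$$

For the table in the question with 30-40 as the median class ($w = 10$, $f = 6$), that is $10/12 \approx 0.83$. In a hypothetical table with the same class widths but every frequency one hundred times larger, it would be $10/1200 \approx 0.008$.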

Before computers were widely available, large datasets were customarily reduced to categories (classes) and plotted as histograms. Then the histograms were used to approximate the mean, variance, median, and other descriptive measures. Nowadays, it is best just to use a statistical computer package to find exact values of all measures.

One remaining application is to try to recover the descriptive measures from grouped data or from a histogram published in a journal. These are cases in which the original data are no longer available.

This procedure to approximate the sample median from grouped data assumes that the data are distributed roughly uniformly throughout the median interval. It then uses interpolation to approximate the median. (By contrast, in methods to approximate the sample mean and sample variance from grouped data, one assumes that all observations are concentrated at their class midpoints.)
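
The interpolation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `grouped_median` and its arguments are made up for this example, the classes are assumed to be contiguous, the $N/2$ variant of the formula is used, and the median class is taken to be the first class whose cumulative frequency reaches $N/2$.

```python
from itertools import accumulate


def grouped_median(boundaries, freqs):
    """Approximate the median of grouped data by linear interpolation.

    boundaries: class boundaries in increasing order (len(freqs) + 1 values)
    freqs:      frequency of each class
    Assumes the data are spread uniformly within the median class.
    """
    if len(boundaries) != len(freqs) + 1:
        raise ValueError("need exactly one more boundary than classes")
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")

    half = n / 2                      # the N/2 variant of the formula
    cum = list(accumulate(freqs))     # cumulative frequencies

    # Median class: first class whose cumulative frequency reaches N/2.
    k = next(i for i, c in enumerate(cum) if c >= half)

    lower = boundaries[k]                        # L
    width = boundaries[k + 1] - boundaries[k]    # w
    c_prev = cum[k - 1] if k > 0 else 0          # C
    return lower + (half - c_prev) / freqs[k] * width


# Table from the question: N = 25, median class 30-40.
print(round(grouped_median([0, 10, 20, 30, 40, 50], [4, 3, 5, 6, 7]), 2))  # 30.83
```

With a hypothetical variant of the table whose last frequency is 6 instead of 7 ($N = 24$), the class 20-30 has cumulative frequency exactly $N/2 = 12$, and the function returns the boundary value 30.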