[Math] Median estimated from grouped data with a single class

descriptive statistics, median, statistics

Given the formula for grouped median:

$Median = L_m + \left [ \frac { \frac{n}{2} - F_{m-1} }{f_m} \right ] \times c$

Where:

  • $L_m$: lower boundary of median class
  • $c$ : size of the median class
  • $F_{m-1}$ : cumulative frequency of the class before median class
  • $f_m$ : frequency of the median class
  • $n$ : total number of observations

Example: What should the median be for the following:
– 100, 100, 100, 100, 100, 100, 100, 100, 100, 100 (a repeat of 100 ten times)?

Calculation:

Using a bin/class size of 0.5:

$L_m$ = 100

$c$ = 0.5

$F_{m-1}$ = 0 (there is no class before the median class)

$f_m$ = 10

$n$ = 10

$Median = 100 + \left [ \frac{5 - 0}{10} \right ] \times 0.5 = 100.25$
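The arithmetic can be checked with a short Python sketch of the grouped-median formula. The function name `grouped_median` and its parameter names are made up for illustration; they simply mirror $L_m$, $c$, $F_{m-1}$, $f_m$ and $n$ from the question.

```python
def grouped_median(lower, width, cum_before, freq, n):
    """Interpolated median estimate: L_m + ((n/2 - F_{m-1}) / f_m) * c.

    lower      -- L_m, lower boundary of the median class
    width      -- c, width of the median class
    cum_before -- F_{m-1}, cumulative frequency before the median class
    freq       -- f_m, frequency of the median class
    n          -- total number of observations
    """
    return lower + ((n / 2 - cum_before) / freq) * width


# Ten observations, all equal to 100, in the single class [100, 100.5).
estimate = grouped_median(lower=100, width=0.5, cum_before=0, freq=10, n=10)
print(estimate)  # 100.25
```

The result matches the hand calculation: with all ten values in one class, the formula lands exactly halfway through the class.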

Best Answer

When you group data into intervals, information is lost. So assumptions are made in order to make reasonable estimates of the sample mean, median, etc.

The assumption of this formula for estimating the median from grouped data is that the data are spread roughly uniformly throughout the interval. Clearly, this assumption is not met in your situation because all ten of the $100$'s lie at the lower endpoint of the interval. The idea of the formula is to estimate the median by interpolation, putting the estimate somewhere within the interval. In your case the estimated value $100.25$ is in the middle of the 'median interval' (the interval known to contain the median).

If you were trying to contrive a situation in which the estimate is even farther from the truth, you could put your ten $100$'s at the left end of an interval $[100, 120).$ With no other data, your estimate of the median would then be $110.$
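A rough sketch of how the error grows with the class width, assuming the same ten values of $100$ and a single class that always starts at $100$. The helper `interpolated_median` and the list of widths are hypothetical, chosen only to reproduce the two cases discussed here ($0.5$ and $20$) plus one in between.

```python
from statistics import median


def interpolated_median(lower, width, cum_before, freq, n):
    # Grouped-median formula: L_m + ((n/2 - F_{m-1}) / f_m) * c
    return lower + ((n / 2 - cum_before) / freq) * width


raw = [100] * 10           # the ten observations from the question
true_median = median(raw)  # median of the ungrouped data

# Hypothetical class widths; each class starts at 100 and holds all ten values.
widths = [0.5, 5, 20]
errors = []
for c in widths:
    est = interpolated_median(lower=100, width=c, cum_before=0,
                              freq=len(raw), n=len(raw))
    errors.append(est - true_median)
    print(f"[100, {100 + c}): estimate = {est}, error = {est - true_median}")
```

In this contrived setup the estimate always sits at the midpoint of the class, so the error is half the class width.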

There is nothing wrong with the formula, provided the assumption of data spread evenly throughout the interval is close to the truth. But any formula for estimating the median from grouped data will have to depend on assumptions. All that can be said for sure is that the median lies somewhere in the median interval. You have to recognize that the information lost in grouping data into intervals cannot be precisely recovered (unless the original data are saved and used).


Note: By contrast, the assumption usually made when trying to estimate the sample mean from grouped data is that each observation lies precisely at the midpoint of the interval that contains it. This idea gives rise to the formula $\bar X \approx \frac 1 n \sum_{j=1}^k f_j m_j,$ where there are $k$ intervals (usually of equal width), with midpoints $m_j$ and frequencies $f_j,$ and $n = \sum_{j=1}^k f_j$ is the total number of observations.
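As a minimal sketch of the midpoint rule for the grouped mean, the function below (the name `grouped_mean` is an assumption, not a library routine) weights each class midpoint by its frequency. It is applied to the single class $[100, 100.5)$ from the question, whose midpoint is $100.25$.

```python
def grouped_mean(midpoints, freqs):
    """Estimate the sample mean as (1/n) * sum of f_j * m_j."""
    n = sum(freqs)  # total number of observations
    return sum(f * m for f, m in zip(freqs, midpoints)) / n


# Single class [100, 100.5) holding all ten observations.
mean_estimate = grouped_mean(midpoints=[100.25], freqs=[10])
print(mean_estimate)  # 100.25
```

Here the mean estimate overshoots the true mean of $100$ by the same $0.25$ as the median estimate, for the same reason: the data sit at the left endpoint rather than where the method assumes.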
