[Math] Is percentile affected by extreme values

statistics

A coworker asked me how to calculate a 25th percentile. I gave him an answer, but then I became unsure whether I had worked it out correctly. The problem is that our sample size will tend to be quite small, so strictly speaking there's no point in calculating a percentile at all. However, we need to set up a semi-scientific computation anyway. 🙂

My question is as follows. If we assume, with a huge amount of goodwill, that the following sample:

$$
s_1 := \{ 72, 88, 100 \}
$$

has a 25th percentile of 80 (the midpoint of 72 and 88), or even 76 (a quarter of the way from 72 to 88), should the value of that percentile be affected if we increase the maximum value, as in the following sample?

$$
s_2 := \{ 72, 88, 200 \}
$$

Best Answer

With a sample size of three, it is meaningless to ask for a "percentile".

By definition, the 25th percentile is "any value relative to which 25% of the observed values are lower (or greater)".

With a sample size of three, the share of the sample lying below (or above) any given value can only be (approximately) 0, 33, 67, or 100%.
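To make the 0/33/67/100 figures concrete, here is a minimal Python sketch that lists the shares of a three-element sample that can lie below a cut-off. The variable names are illustrative only, and it assumes the cut-off does not coincide with a sample value.

```python
from fractions import Fraction

sample = [72, 88, 100]
n = len(sample)

# A cut-off can have 0, 1, ..., n of the sample values below it,
# so the only achievable shares are k/n.
achievable = [Fraction(k, n) for k in range(n + 1)]
percent = [round(float(f) * 100) for f in achievable]

print(percent)  # [0, 33, 67, 100]
```

None of these shares equals 25%, which is why any 25th percentile reported for such a sample comes from an interpolation convention rather than from the data alone.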


In general, it is very difficult to get exactly 25% of the values above or below the line (for starters, the sample size would have to be divisible by 4). So in practice there are several different methods for computing percentiles, and there is no "one right answer". (The methods generally agree for large sample sizes.) Irrespective of the method, as long as your percentile line is drawn sufficiently far from the extreme values, your percentile will not depend on them. This is generally the case when the percentile $P$ and the number of samples $N$ are such that $P/100 \cdot N$ and $(100-P)/100 \cdot N$ are both much bigger than 1.
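As a rough illustration of the large-sample case, the sketch below replaces the maximum of a 100-element sample with an outlier and recomputes the 25th percentile. It assumes Python 3.8+ and uses `statistics.quantiles` with its `"inclusive"` method, which is just one of the several conventions mentioned above; the data and the names `base` and `stretched` are made up for the example.

```python
from statistics import quantiles

N, P = 100, 25
base = list(range(1, N + 1))       # 1, 2, ..., 100
stretched = base[:-1] + [10_000]   # same data, maximum replaced by an outlier

# Both quantities from the condition above are much bigger than 1 here.
below = P / 100 * N           # 25.0
above = (100 - P) / 100 * N   # 75.0

# quantiles(..., n=4) returns [Q1, median, Q3]; index 0 is the 25th percentile.
q1_base = quantiles(base, n=4, method="inclusive")[0]
q1_stretched = quantiles(stretched, n=4, method="inclusive")[0]

print(below, above, q1_base, q1_stretched)
```

Under this convention the two 25th percentiles come out identical, because the interpolation only touches values near the 25th and 26th ranks and never looks at the maximum.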

When the sample size is too small:

  • If $P/100 \cdot N$ is too close to 1 (or even less than it), then the lower extreme value will likely affect the percentile line
  • If $(100-P)/100 \cdot N$ is too close to 1 (or even less than it), then the upper extreme value will likely affect the percentile line

Whether this effect shows up, and how strongly, depends on the method you use to compute the percentile line.
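For the two samples in the question, this can be checked with a short Python sketch. It assumes Python 3.8+ and compares the two conventions that the standard library's `statistics.quantiles` happens to offer (`"inclusive"` and `"exclusive"`); other tools use other conventions and may give different numbers. The helper `quartiles_by_method` is a hypothetical name introduced only for this example.

```python
from statistics import quantiles

s1 = [72, 88, 100]
s2 = [72, 88, 200]  # same sample with the maximum raised


def quartiles_by_method(sample):
    """Return {method: (Q1, Q3)} for the two methods statistics.quantiles supports."""
    out = {}
    for method in ("inclusive", "exclusive"):
        q1, _median, q3 = quantiles(sample, n=4, method=method)
        out[method] = (q1, q3)
    return out


r1 = quartiles_by_method(s1)
r2 = quartiles_by_method(s2)

for method in r1:
    print(method, "s1:", r1[method], "s2:", r2[method])
# inclusive s1: (80.0, 94.0) s2: (80.0, 144.0)
# exclusive s1: (72.0, 100.0) s2: (72.0, 200.0)
```

Here $P/100 \cdot N = 0.75$ and $(100-P)/100 \cdot N = 2.25$. With either of these two methods the 25th percentile is the same for both samples, so raising the maximum does not move it, while the 75th percentile does change. The two methods disagree with each other, though (80 versus 72), and the exclusive one lands exactly on the minimum, which is the lower-extreme effect described in the first bullet.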
