Solved – How to classify based on percentile ranking when most scores are the same

normalization, quantiles, r

I am dealing with a simple dataset of test scores. It was an easy test: 98 of the 100 people got a perfect score, one person scored 2%, and one person scored 3%.

Here's what it looks like in R:

test_scores <- c(rep(100, 98), 2, 3)   # 98 perfect scores plus two low outliers
zscores <- scale(test_scores)          # standardize: (x - mean) / sd
percentile <- pnorm(zscores) * 100     # map z-scores to Normal percentiles
plot(percentile)

[plot of percentile: the 98 perfect scores all sit near the 55.6th percentile; the two low scores sit near the 0th]

Now, say someone asks, "I got 100%, so what percentile did I score?"

The knee-jerk response should be the 100th percentile, but wouldn't any answer from the 3rd to the 100th percentile be equally accurate?

Maybe we can consult the plot we produced above — but wait! According to the y-axis, if you score 100% you are in the 55.6th percentile!

(Printing percentile confirms these values.)

If this is the case, maybe the answer is: "You got 100%, so you are in the 56th percentile. Anyone who got 100% gets an 'F' on this test."
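One way to see the tension concretely: the pnorm(scale(...)) recipe assumes the scores are Normally distributed, while the empirical percentile rank simply counts how many scores fall at or below yours. A quick sketch in base R (the helper emp_pct is my own illustrative function, not a standard one):

```r
test_scores <- c(rep(100, 98), 2, 3)

# Normal-theory percentile: assumes scores are Normally distributed
normal_pct <- pnorm(scale(test_scores)) * 100
normal_pct[1]                        # a perfect score lands near 55.6

# Empirical percentile rank: fraction of scores at or below a given score
emp_pct <- function(x, scores) mean(scores <= x) * 100
emp_pct(100, test_scores)            # 100: every score is <= 100
emp_pct(3, test_scores)              # 2: only the 2 and the 3 are <= 3
```

The two definitions agree reasonably well on well-behaved data, but on this data they disagree wildly for the very same score.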

This is an extreme case I cooked up, but it still feels troublesome in less extreme forms. What if you are asked to rank the above test scores along these lines:

  • 100th–98th percentile → "Very Superior",
  • 97th–90th percentile → "Superior",
  • 89th–75th percentile → "Average",
  • ad nauseam

You could make an argument that any perfect score could justifiably fall into any of several bins.

Is there any reconciliation for this situation? When do you reach the point of 'enough is enough' and throw out classification based on percentiles?

Or, why isn't there a set of assumptions about the distribution of the data that must be checked before percentiles are used?

Best Answer

I don't know if it has been formally proved, but I believe that all classification and ranking algorithms can "fail" (i.e. produce unacceptable or paradoxical results) when applied to data that is sufficiently pathological. The only way to avoid these pathological cases is to impose strong assumptions or constraints on the input data.

For example, consider vector data $X = \{x_1, x_2, \dots, x_n\}$, where each component $x_i$ is bounded: $a < x_i < b$. Clustering aims to summarize a set of such points with a much smaller set of $k$ cluster representatives. A pathological data set would be one whose points take only the extreme values of each $x_i$ — the cross product of vectors with $x_i \in \{a, b\}$. In other words, the data consists only of points at the vertices of the $n$-dimensional hypercube. In this case, no $k$-cluster algorithm will be successful, though some might fail worse than others. This is true even though clustering (in general) makes very few assumptions about the data.

Though they are not always stated explicitly, percentile ranking in the context of student grading makes strong assumptions about the distribution of student scores and about their proper interpretation as letter grades (ordinal categories). It assumes:

  • there are few ties compared to the total size of the data;
  • the data set is not small $(n > 10)$;
  • there is significant variation in scores relative to the full scale;
  • the scores are distributed across the range, not overly concentrated on any one value.

These assumptions would be met if the scores were Normally distributed with variance at least one quarter of the score scale (approximately); for a 100-point scale, that means a standard deviation of at least 5. Other distributions (e.g. Beta) can also meet these assumptions.
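These assumptions can be turned into rough pre-flight checks before ranking. A sketch in base R — the function name percentile_ok and the tie-fraction cutoff of 0.5 are my own illustrative choices, not established standards; only the $n > 10$ and sd-of-5 thresholds come from the assumptions above:

```r
# Illustrative pre-flight checks before percentile ranking
percentile_ok <- function(scores, scale_max = 100) {
  n <- length(scores)
  tie_frac <- 1 - length(unique(scores)) / n     # fraction of scores lost to ties
  c(
    large_enough  = n > 10,
    few_ties      = tie_frac < 0.5,              # arbitrary cutoff for "few ties"
    enough_spread = sd(scores) >= scale_max / 20 # sd of at least 5 on a 100-point scale
  )
}

percentile_ok(c(rep(100, 98), 2, 3))
# few_ties is FALSE here: 98 of the 100 scores are identical,
# even though the two outliers inflate the sd enough to pass the spread check
```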

Because of these strong assumptions, the number of pathological data sets is quite large. In addition to your case, all of the following are pathological data sets:

  • $\{100, 50_1,50_2,\dots, 50_{98}, 0\}$
  • $\{100, 99_1,99_2,\dots, 99_{98}, 98\}$
  • $\{0_1,0_2,\dots, 0_{100}\}$
  • $\{50_1,50_2,\dots, 50_{100}\}$
  • $\{100, 90, 95\} ; n = 3$
  • and so on
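The all-tied cases above fail even mechanically: with zero variance, scale() divides by a standard deviation of zero, so the Normal-percentile recipe from the question returns NaN for every student. A quick check in base R:

```r
constant_scores <- rep(50, 100)            # everyone scores 50
sd(constant_scores)                        # 0: no variation at all
pct <- pnorm(scale(constant_scores)) * 100 # z-scores are 0/0, i.e. NaN
all(is.nan(pct))                           # TRUE
```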

I believe the ethical approach is not to perform percentile ranking whenever these assumptions are violated. This is better than doing percentile ranking and then trying to modify the results to fit intuition; those modifications would be arbitrary "fudge factors" that, in my opinion, amount to putting "lipstick on a pig".

It would be better to apply a set of classification heuristics that map scores to grades, where the heuristics make fewer assumptions about the distribution of the data.
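For instance, a fixed-cutoff heuristic ignores the score distribution entirely. A sketch using base R's cut() — the 60/70/80/90 cutoffs are arbitrary placeholders, not a recommendation:

```r
# Map raw scores to letter grades with fixed cutoffs -- no distributional assumptions
grade <- function(scores) {
  cut(scores,
      breaks = c(-Inf, 60, 70, 80, 90, Inf),
      labels = c("F", "D", "C", "B", "A"),
      right  = FALSE)                     # e.g. 90 and above is an "A"
}

table(grade(c(rep(100, 98), 2, 3)))
# the 98 perfect scores all get an "A"; the 2 and the 3 both get an "F"
```

On the pathological data from the question this produces the intuitive answer, precisely because it never asks where a score sits relative to the other scores.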

Of course, bureaucracies that administer grading systems will not like "missing data" in the reported scores. I believe it is our duty to fight for change when the "System" is broken, as it is in this case.
