Solved – Making a measure of ranking accuracy

ranking

Say I wanted to make a measure of how well a formula ranks a set of contestants in a running race. The inputs to the formula would be factors such as height, body weight, diet, etc.

I had thought that the following would be reasonable. At the end of each race we take, for each person, the absolute value of the difference between their actual rank (as the person finished in the race) and their predicted rank (as dictated by the formula). This is summed over every person in the race. Finally, the sum is divided by the number of people in the race; this last step is so that we can directly compare the ranking accuracy of large races to small ones. Call this number the "rank-accuracy". Lower values mean better predictions, and zero means a perfect ranking.

I then decided to double-check the last step, the "divide by the number of entrants" part. To do this I simulated random race results and random predictions and measured the average rank-accuracy for a variety of race sizes (sz). The results are as follows:

sz =  2 0.4992
sz =  3 0.8879
sz =  4 1.2500
sz =  5 1.6003
sz =  6 1.9424
sz =  7 2.2885
sz =  8 2.6230
sz =  9 2.9623
sz = 10 3.2983
sz = 11 3.6319
sz = 12 3.9764

As you can see, my attempt to make the rank-accuracy similar for all race sizes appears to have failed. For purely random predictions the average rank-accuracy grows roughly linearly with race size, so it is clearly easier to get a low rank-accuracy if the race is small.
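The simulation code itself is not shown here. A minimal sketch that produces numbers of this kind, assuming each prediction is a uniformly random permutation of the finishing order, might look like the following. The function names, trial count, and seed are illustrative choices, not part of the original experiment. Under that assumption the expected mean absolute rank difference is (n^2 - 1) / (3n), which grows roughly like n/3 and agrees with the table above.

```python
import random


def rank_accuracy(actual, predicted):
    """Mean absolute difference between actual and predicted ranks."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def simulate(size, trials=5000, seed=0):
    """Average rank-accuracy of random predictions for one race size.

    Assumption: a random prediction is a uniform shuffle of the ranks.
    Shuffling only the prediction is enough, because the result depends
    only on how the two rankings are matched up.
    """
    rng = random.Random(seed)
    actual = list(range(1, size + 1))
    total = 0.0
    for _ in range(trials):
        predicted = actual[:]
        rng.shuffle(predicted)
        total += rank_accuracy(actual, predicted)
    return total / trials


def expected_random(size):
    """Closed-form chance level: (n^2 - 1) / (3n)."""
    return (size * size - 1) / (3 * size)


if __name__ == "__main__":
    for size in range(2, 13):
        print(f"sz = {size:2d}  simulated {simulate(size):.4f}"
              f"  closed form {expected_random(size):.4f}")
```

One possible reading of this: dividing the rank-accuracy by `expected_random(size)` would put the chance level at 1 for every race size, although the rank correlation suggested in the answer is the more standard tool.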

So my question is – what measure of rank accuracy could I use to make large and small races more directly comparable?

Best Answer

It seems you are reinventing the wheel here. To compare the ranks given by your scoring function to the true ranks, you can just use the Spearman rank correlation. This method simply converts your scores to numeric ranks and computes the ordinary (Pearson) correlation between those ranks and the true ranks. It penalizes large errors more heavily than small ones, so if your predictions have first and last place swapped, that will result in a worse association than if second and third place are swapped, for example. You can also compute a confidence interval on the Spearman correlation to determine whether your scoring method is doing significantly better than random. Races with small N will have wider confidence intervals, since it is easier to rank only 2 items correctly by chance alone, but that's a feature and not a bug. The actual point estimate shouldn't be particularly affected by N, though: randomly generated scores will, on average, have near-zero correlation with the true ranks, regardless of N.
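To make this concrete, here is a minimal pure-Python sketch of the Spearman correlation for a single race, assuming there are no ties and that a higher score means a better predicted finish. The helper names are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would also handle ties and report a p-value.

```python
def ranks_from_scores(scores):
    """Convert scores to ranks; rank 1 goes to the highest score.

    Assumption: no two contestants have the same score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def spearman(actual_ranks, predicted_ranks):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(actual_ranks)
    if n < 2:
        raise ValueError("need at least two contestants")
    d2 = sum((a - p) ** 2 for a, p in zip(actual_ranks, predicted_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    actual = [1, 2, 3, 4, 5]
    # Second and third place swapped: a small error.
    print(spearman(actual, [1, 3, 2, 4, 5]))
    # First and last place swapped: a much larger penalty.
    print(spearman(actual, [5, 2, 3, 4, 1]))
    # Scores from a (hypothetical) formula, converted to predicted ranks.
    print(spearman([1, 3, 2], ranks_from_scores([9.1, 7.5, 8.2])))
```

In the five-runner example, swapping second and third place gives a correlation of 0.9, while swapping first and last gives -0.6, which illustrates the heavier penalty for large errors.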
