Elo Rating – How to Prove the Validity of Elo Rating or PageRank for a Specific Dataset

Tags: goodness-of-fit, ranking, rating

I have a set of players. They play against each other (pairwise), and pairs of players are chosen randomly. In any game, one player wins and the other loses. Each player plays a limited number of games (some play more games, some fewer). So I have data on who won against whom, and how many times. Now I assume that every player has a rating that determines their probability of winning.

I want to check whether this assumption is actually true. Of course, I can apply the Elo rating system or the PageRank algorithm to calculate a rating for every player. But calculating ratings does not prove that they (the ratings) actually exist or that they mean anything.

In other words, I want a way to prove (or at least to check) that the players really do have different strengths. How can I do it?

ADDED

To be more specific, I have 8 players and only 18 games. So there are many pairs of players who never played against each other, and many pairs that played each other only once. As a consequence, I cannot estimate the probability of a win for a given pair of players. I also see, for example, that there is a player who won 6 games out of 6. But maybe that is just a coincidence.

Best Answer

You need a probability model.

The idea behind a ranking system is that a single number adequately characterizes a player's ability. We might call this number their "strength" (because "rank" already means something specific in statistics). We would predict that player A will beat player B when strength(A) exceeds strength(B). But this statement is too weak because (a) it is not quantitative and (b) it does not account for the possibility of a weaker player occasionally beating a stronger player. We can overcome both problems by supposing the probability that A beats B depends only on the difference in their strengths. If this is so, then we can re-express all the strengths, if necessary, so that the difference in strengths equals the log odds of a win.

Specifically, this model is

$$\mathrm{logit}(\Pr(A \text{ beats } B)) = \lambda_A - \lambda_B$$

where, by definition, $\mathrm{logit}(p) = \log(p) - \log(1-p)$ is the log odds and I have written $\lambda_A$ for player A's strength, etc.
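
For intuition, here is a minimal Python sketch of the inverse relationship (the helper name `win_probability` is just illustrative): a strength gap of 1 corresponds to odds of $e \approx 2.72$, i.e. about a 73% chance of winning, while equal strengths give exactly 50%.

```python
import math

def win_probability(strength_a: float, strength_b: float) -> float:
    """Inverse logit of the strength difference: Pr(A beats B)."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b)))

print(win_probability(1.0, 0.0))  # gap of 1 -> ~0.731
print(win_probability(0.0, 0.0))  # equal    -> 0.5
```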

This model has as many parameters as players (but there is one less degree of freedom, because it can only identify relative strengths, so we would fix one of the parameters at an arbitrary value). It is a kind of generalized linear model (in the Binomial family, with logit link).
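
To make the GLM connection concrete, here is a sketch of how such a fit might look in Python with `statsmodels`; the game records are made up for illustration, and fixing player 0's strength at zero is one arbitrary way to remove the extra degree of freedom.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical game records: (winner_index, loser_index) among 4 players.
games = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 1), (2, 3)]
n_players = 4

# One row per game: +1 in the winner's column, -1 in the loser's column,
# so the linear predictor is lambda_winner - lambda_loser.
X = np.zeros((len(games), n_players))
for row, (winner, loser) in enumerate(games):
    X[row, winner] = 1.0
    X[row, loser] = -1.0
X = X[:, 1:]             # drop player 0's column: this fixes lambda_0 = 0
y = np.ones(len(games))  # each row is coded from the winner's side

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)  # estimated strengths lambda_1..lambda_3 relative to player 0
print(fit.bse)     # their standard errors
```

This is exactly a Bradley–Terry model; dedicated packages exist for it, but any logistic-regression routine fed this design matrix will do.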

The parameters can be estimated by Maximum Likelihood. The same theory provides a means to erect confidence intervals around the parameter estimates and to test hypotheses (such as whether the strongest player, according to the estimates, is significantly stronger than the estimated weakest player).

Specifically, the likelihood of a set of games is the product

$$\prod_{\text{all games}}{\frac{\exp(\lambda_{\text{winner}} - \lambda_{\text{loser}})}{1 + \exp(\lambda_{\text{winner}} - \lambda_{\text{loser}})}}.$$

After fixing the value of one of the $\lambda$, the estimates of the others are the values that maximize this likelihood. Thus, varying any of the estimates reduces the likelihood from its maximum. If it is reduced too much, it is not consistent with the data. In this fashion we can find confidence intervals for all the parameters: they are the limits in which varying the estimates does not overly decrease the log likelihood. General hypotheses can similarly be tested: a hypothesis constrains the strengths (such as by supposing they are all equal), this constraint limits how large the likelihood can get, and if this restricted maximum falls too far short of the actual maximum, the hypothesis is rejected.
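
Here is a sketch of that machinery in Python with `scipy` (the game data are again hypothetical). It pins $\lambda_0 = 0$, maximizes the log likelihood over the remaining strengths, and then performs the likelihood-ratio test of the hypothesis that all strengths are equal, under which every game is won with probability $1/2$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Hypothetical results as (winner, loser) index pairs.
games = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 1), (2, 3)]
n_players = 4

def neg_log_likelihood(free_strengths):
    """Negative log likelihood; lambda_0 is pinned at 0 for identifiability."""
    lam = np.concatenate(([0.0], free_strengths))
    diffs = np.array([lam[w] - lam[l] for w, l in games])
    # log Pr(winner beats loser) = diff - log(1 + exp(diff))
    return -np.sum(diffs - np.logaddexp(0.0, diffs))

# Unrestricted maximum-likelihood fit over the free strengths.
full = minimize(neg_log_likelihood, np.zeros(n_players - 1))

# Under "all strengths equal" every game has probability 1/2,
# so the restricted negative log likelihood is n * log(2).
null_nll = len(games) * np.log(2.0)

# Likelihood-ratio test; the chi-squared reference distribution with
# n_players - 1 degrees of freedom is an asymptotic approximation.
lr_stat = 2.0 * (null_nll - full.fun)
print("LR statistic:", lr_stat, "p-value:", chi2.sf(lr_stat, df=n_players - 1))
```

Profile confidence intervals come from the same object: vary one $\lambda$, re-maximize over the rest, and keep the values whose log likelihood stays within $\chi^2_{1,\,0.95}/2 \approx 1.92$ of the maximum.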


In this particular problem there are 18 games and 7 free parameters. In general that is too many parameters: there is so much flexibility that the parameters can be quite freely varied without changing the maximum likelihood much. Thus, applying the ML machinery is likely to prove the obvious, which is that there likely are not enough data to have confidence in the strength estimates.
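
A small simulation can make this concrete. Assuming (purely for illustration) true strengths spread between 0 and 2, we can generate many datasets of 18 random games among 8 players and see how often the likelihood-ratio test above detects any difference at all; with 7 free parameters and only 18 observations, both the power and the chi-squared approximation itself deserve suspicion.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)
true_strengths = np.linspace(0.0, 2.0, 8)  # assumed spread, for illustration

def simulate_games(n_games=18):
    """Random pairings; the winner is drawn from the logistic model."""
    games = []
    for _ in range(n_games):
        a, b = rng.choice(8, size=2, replace=False)
        p_a = 1.0 / (1.0 + np.exp(-(true_strengths[a] - true_strengths[b])))
        games.append((a, b) if rng.random() < p_a else (b, a))
    return games

def lr_test_p(games):
    def nll(free):
        lam = np.concatenate(([0.0], free))
        d = np.array([lam[w] - lam[l] for w, l in games])
        return -np.sum(d - np.logaddexp(0.0, d))
    # Note: with so few games some strengths are barely identified
    # (unplayed or undefeated players), so the fit is approximate.
    full = minimize(nll, np.zeros(7))
    lr = 2.0 * (len(games) * np.log(2.0) - full.fun)
    return chi2.sf(max(lr, 0.0), df=7)

p_values = [lr_test_p(simulate_games()) for _ in range(200)]
print("rejection rate at the 5% level:", np.mean(np.array(p_values) < 0.05))
```

How often the test rejects under these assumed strengths is an empirical question the simulation answers; the general point is that 18 games will give far less power than the same strength spread would with hundreds of games.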
