I am currently building a recommender system and using ranking metrics, specifically NDCG@k, to verify its performance. Today I was experimenting and realized that I might be calculating the NDCG@k score wrong, so I was hoping to get some validation from the community on the correct way to implement it.
Here is the code I am using for the calculation:
import numpy as np

def dcg_at_k(r, k, method=0):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values. Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asarray(r, dtype=float)[:k]  # np.asfarray was removed in NumPy 2.0
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.
def ndcg_at_k(r, k=20, method=1):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values. Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max
Here is how I have been calculating the NDCG@k score for my recommender system. Suppose I have a catalog of 10 items, and some user u in the test set has interacted with items 4 and 8, and I want to compute NDCG@5 to see whether those items appear in the recommendations.
When I compute the top 5 recommendations I will have a ranking score for each of the 10 items let's say something like:
(1, 0.90)
(4, 0.85)
(3, 0.80)
(2, 0.75)
(5, 0.70)
(6, 0.65)
(7, 0.60)
(8, 0.55)
(9, 0.50)
(10, 0.45)
So my top 5 recommendations are [1, 4, 3, 2, 5]. What I have done from here is check whether any of the recommended items are in the test set, which gives this boolean list:
[0, 1, 0, 0, 0]
and I would calculate the NDCG@5 as:
r = [0, 1, 0, 0, 0]
ndcg_at_k(r, 5, method=1)
0.6309
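For reference, the 0.6309 can be traced by hand. This is a quick arithmetic check, not part of the original code:

```python
import numpy as np

# DCG@5 of [0, 1, 0, 0, 0] with method=1: the only hit sits at position 2,
# so it is discounted by 1/log2(2 + 1).
dcg = 1.0 / np.log2(3)
# IDCG@5 comes from sorted([0, 1, 0, 0, 0], reverse=True) = [1, 0, 0, 0, 0]:
# the ideal ranking puts the single hit at position 1, weight 1/log2(2) = 1.
idcg = 1.0 / np.log2(2)
ndcg = dcg / idcg
print(round(ndcg, 4))  # 0.6309
```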
After looking at this, it felt off to me: from the metric's perspective there is no information that the user was also interested in item 8, so the score does not penalize missing it. I also noticed this line in the code:
dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
which takes what seems like it should be the full list of all items and only THEN cuts it off inside the calculation. So I tried this as well:
Again based on my rankings:
[1, 4, 3, 2, 5, 6, 7, 8, 9, 10]
and the boolean values would be:
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
where the NDCG@5 is now:
r = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
ndcg_at_k(r, 5, method=1)
0.3868
Which is the proper calculation?
Best Answer
In "plain" language
The Discounted Cumulative Gain for k shown recommendations ($DCG@k$) sums the relevance of the shown items for the current user (cumulative), while penalizing relevant items placed at later positions (discounted).
The Normalized Discounted Cumulative Gain for k shown recommendations ($NDCG@k$) divides this score by the maximum possible value of $DCG@k$ for the current user, i.e. what $DCG@k$ would be if the items were ranked by their true relevance (which is unknown to the recommender model). This maximum is called the Ideal Discounted Cumulative Gain ($IDCG@k$). The division normalizes the score across different users.
$NDCG@k=\frac{DCG@k}{IDCG@k}$
Hence, to calculate $IDCG@k$ (and therefore $NDCG@k$), one needs to know all relevant items for the current user in the test set. So your second call, passing the entire ranking, is correct.
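To see the difference concretely, here is a self-contained re-run of both of your calls (the two functions are condensed copies of the ones in the question, so the snippet runs on its own):

```python
import numpy as np

def dcg_at_k(r, k, method=1):
    # DCG over the top-k slice of the relevance list.
    r = np.asarray(r, dtype=float)[:k]
    if r.size == 0:
        return 0.
    if method == 0:
        return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
    elif method == 1:
        return np.sum(r / np.log2(np.arange(2, r.size + 2)))
    raise ValueError('method must be 0 or 1.')

def ndcg_at_k(r, k=20, method=1):
    # IDCG is computed from the FULL list passed in, then truncated to k.
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max

# First call: only the top-5 slice, so the metric never sees item 8
# and the ideal ranking [1, 0, 0, 0, 0] contains a single hit.
top5_only = ndcg_at_k([0, 1, 0, 0, 0], 5)
# Second call: full ranking, so IDCG@5 knows TWO items are relevant
# and the ideal top-5 is [1, 1, 0, 0, 0].
full_rank = ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 1, 0, 0], 5)

print(round(top5_only, 4))  # 0.6309
print(round(full_rank, 4))  # 0.3869
```

The numerator (DCG@5) is identical in both calls; only the denominator changes, which is exactly the penalty for item 8 never making the top 5.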
Formulas
Let $rel_i$ be the true relevance of the recommendation at position $i$ for the current user.
The traditional method to calculate DCG (corresponding to method=1 in your code) is
$DCG@k=\sum_{i=1}^{k}\frac{rel_i}{\log_2(i+1)}=rel_1+\sum_{i=2}^{k}\frac{rel_i}{\log_2(i+1)}$
An alternative method, which puts more emphasis on highly relevant items, is
$DCG@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)}$
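This exponential-gain variant is not in the question's code; a minimal sketch (the function name is illustrative) could look like:

```python
import numpy as np

def dcg_exp_at_k(r, k):
    # Exponential gain: (2^rel - 1) / log2(i + 1). For binary relevance it
    # reduces to the standard formula, since 2^1 - 1 = 1 and 2^0 - 1 = 0.
    r = np.asarray(r, dtype=float)[:k]
    if r.size == 0:
        return 0.0
    return np.sum((2.0 ** r - 1.0) / np.log2(np.arange(2, r.size + 2)))

# With graded relevance, highly relevant items dominate the score:
# (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^1-1)/log2(4)
print(dcg_exp_at_k([3, 2, 1], 3))
```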
The parameter method=0 in your code corresponds to
$DCG@k=rel_1+\sum_{i=2}^{k}\frac{rel_i}{\log_2(i-1+1)}=rel_1+\sum_{i=2}^{k}\frac{rel_i}{\log_2(i)}$
resulting in the weights mentioned in the docstring of your code. I do not know why one would give the same weight of 1 to the first two items while discounting the rest; it might be based on how the recommendations are displayed?
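The two weight sequences from the docstring can be reproduced directly (a quick check, assuming k=5):

```python
import numpy as np

k = 5
positions = np.arange(1, k + 1)

# method=1: standard discount 1/log2(i+1)
# -> approximately [1.0, 0.6309, 0.5, 0.4307, 0.3869]
w1 = 1.0 / np.log2(positions + 1)

# method=0: weight 1 for positions 1 and 2, then 1/log2(i)
# -> approximately [1.0, 1.0, 0.6309, 0.5, 0.4307]
w0 = np.ones(k)
w0[1:] = 1.0 / np.log2(positions[1:])

print(np.round(w1, 4))
print(np.round(w0, 4))
```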
$IDCG@k$ is calculated by sorting the ranking by the true (unknown) relevance in descending order and then applying the formula for $DCG@k$ (just like in your code).
Sources
The Wikipedia page for Discounted Cumulative Gain is quite helpful (note that $DCG@k$ is called $DCG_p$ there) and contains many useful links, e.g. to the Stanford handout mentioned in your code and the paper A Theoretical Analysis of NDCG Ranking Measures by Wang et al.