Solved – Proper way to use NDCG@k score for recommendations

metric · ranking · recommender-system · validation

Currently I am building a recommender system and using ranking metrics to verify its performance, specifically the NDCG@k score. Today I was experimenting and realized that I might be calculating the NDCG@k score incorrectly, so I was hoping to get some validation from the community on the correct way to implement it.

Here is the code I am using for the calculation:

import numpy as np

def dcg_at_k(r, k, method=0):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asarray(r, dtype=float)[:k]  # np.asfarray was removed in NumPy 2.0
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.

def ndcg_at_k(r, k=20, method=1):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf

    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max
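As a quick sanity check of the code above, here is a compact stand-in for the method=1 path (a sketch, not the exact functions): a relevant item in the first position should give a perfect score of 1.0, and pushing it to a later position should lower the score.

```python
import numpy as np

def ndcg_check(r, k):
    """Compact re-implementation of ndcg_at_k with method=1, for sanity checking."""
    def dcg(x):
        x = np.asarray(x, dtype=float)[:k]
        # positions 1..k are discounted by log2(position + 1)
        return float(np.sum(x / np.log2(np.arange(2, x.size + 2))))
    ideal = dcg(sorted(r, reverse=True))  # best possible ordering
    return dcg(r) / ideal if ideal > 0 else 0.0

ndcg_check([1, 0, 0], 3)  # relevant item first -> 1.0
ndcg_check([0, 0, 1], 3)  # relevant item last  -> 1/log2(4) = 0.5
```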

Here is how I have been calculating the NDCG@k score for my recommender system. Suppose I have a catalog of 10 items, and user u in the test set has interacted with items 4 and 8. I want to compute NDCG@5 to see whether those items appear in the recommendations.

When I compute the top 5 recommendations I will have a ranking score for each of the 10 items, say something like:

(1, 0.90)
(4, 0.85)
(3, 0.80)
(2, 0.75)
(5, 0.70)
(6, 0.65)
(7, 0.60)
(8, 0.55)
(9, 0.50)
(10, 0.45)

So my top 5 recommendations are [1, 4, 3, 2, 5]. What I have done from here is check whether each recommended item is in the test set, giving a boolean list:

[0, 1, 0, 0, 0]

and I would calculate the NDCG@5 as:

r = [0, 1, 0, 0, 0]
ndcg_at_k(r, 5, method=1)
0.6309

After looking at it, this felt off to me because the metric now has no information that the user was also interested in item 8, and the score did not penalize that. I also noticed this line in the code:

dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)

which takes what seems like it should be the full list of all items and THEN cuts it off inside the calculation. So I tried this as well:

Again based on my rankings:

[1, 4, 3, 2, 5, 6, 7, 8, 9, 10]

and the boolean values would be:

[0, 1, 0, 0, 0, 0, 0, 1, 0, 0]

where the NDCG@5 is now:

r = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
ndcg_at_k(r, 5, method=1)
0.3868
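Both numbers can be reproduced with a compact stand-in for the method=1 path (a sketch, not the exact functions above), which makes the difference between the two calls concrete: the DCG@5 term is identical, but the ideal DCG changes once the second relevant item is visible.

```python
import numpy as np

def ndcg(r, k):
    """Compact re-implementation of ndcg_at_k with method=1."""
    def dcg(x):
        x = np.asarray(x, dtype=float)[:k]
        return float(np.sum(x / np.log2(np.arange(2, x.size + 2))))
    ideal = dcg(sorted(r, reverse=True))  # sort FULL list, then truncate to k
    return dcg(r) / ideal if ideal > 0 else 0.0

a = ndcg([0, 1, 0, 0, 0], 5)                    # IDCG sees 1 relevant item
b = ndcg([0, 1, 0, 0, 0, 0, 0, 1, 0, 0], 5)    # IDCG sees both relevant items
# a ≈ 0.6309, b ≈ 0.3868: same DCG@5, larger IDCG in the second call
```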

Which is the proper calculation?

Best Answer

In "plain" language

The Discounted Cumulative Gain for k shown recommendations ($DCG@k$) sums the relevance of the shown items for the current user (cumulative), while penalizing relevant items placed at later positions (discounted).

The Normalized Discounted Cumulative Gain for k shown recommendations ($NDCG@k$) divides this score by the maximum possible value of $DCG@k$ for the current user, i.e. what $DCG@k$ would be if the items were ranked by their true relevance (which is unknown to the recommender model). This maximum is called the Ideal Discounted Cumulative Gain ($IDCG@k$), and dividing by it normalizes the score across different users.

$NDCG@k=\frac{DCG@k}{IDCG@k}$

Hence, to calculate $IDCG@k$, and therefore $NDCG@k$, one needs to know all relevant items for the current user in the test set. So your second call, passing the relevance of the entire ranking, is correct.
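In code, this means building the binary relevance vector over the full ranking before scoring. A sketch of that step under the setup in the question (the item IDs and a compact method=1 NDCG are illustrative stand-ins for your functions):

```python
import numpy as np

ranked_items = [1, 4, 3, 2, 5, 6, 7, 8, 9, 10]  # model ranking, best first
test_items = {4, 8}                              # user's held-out interactions

# Relevance over the FULL ranking, not just the top k, so the ideal
# DCG accounts for every relevant item of this user.
r = [1 if item in test_items else 0 for item in ranked_items]

def ndcg_full(r, k):
    def dcg(x):
        x = np.asarray(x, dtype=float)[:k]
        return float(np.sum(x / np.log2(np.arange(2, x.size + 2))))
    ideal = dcg(sorted(r, reverse=True))
    return dcg(r) / ideal if ideal > 0 else 0.0

score = ndcg_full(r, 5)  # ≈ 0.3868, matching the second calculation
```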

Formulas

Let $rel_i$ be the true relevance of the recommendation at position $i$ for the current user.

The traditional method to calculate DCG (corresponding to method=1 in your code) is

$DCG@k=\sum_{i=1}^{k}\frac{rel_i}{log_2(i+1)}=rel_1+\sum_{i=2}^{k}\frac{rel_i}{log_2(i+1)}$

An alternative method to calculate DCG, putting more emphasis on highly relevant items, is

$DCG@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{log_2(i+1)}$
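A sketch of this exponential-gain variant (it is not in your code); note that with binary relevance $2^1-1=1$ and $2^0-1=0$, so it coincides with the linear-gain formula and only differs once you have graded relevance:

```python
import numpy as np

def dcg_exp(r, k):
    """DCG@k with exponential gain 2**rel - 1 (sketch of the alternative formula)."""
    r = np.asarray(r, dtype=float)[:k]
    return float(np.sum((2.0 ** r - 1.0) / np.log2(np.arange(2, r.size + 2))))

dcg_exp([0, 1, 0, 0, 0], 5)  # binary relevance: same as linear gain, ≈ 0.6309
dcg_exp([3, 2], 2)           # graded relevance: (2**3-1)/1 + (2**2-1)/log2(3)
```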

The parameter method=0 in your code corresponds to

$DCG@k=rel_1+\sum_{i=2}^{k}\frac{rel_i}{log_2(i-1+1)}=rel_1+\sum_{i=2}^{k}\frac{rel_i}{log_2(i)}$

resulting in the weights mentioned in the code's docstring. I do not know why someone would want to give the same weight of 1 to the first two items while discounting the rest; it might be based on how the recommendations are displayed.

$IDCG@k$ is calculated by sorting the ranking by the true (unknown) relevance in descending order and then applying the formula for $DCG@k$ (just like in your code).
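For the user in the question, this ideal reordering looks as follows (a sketch using the example's relevance vector):

```python
import numpy as np

# Two relevant items somewhere in the full catalogue ranking.
r = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]

ideal = sorted(r, reverse=True)              # [1, 1, 0, 0, 0, ...]
ideal_k = np.asarray(ideal, dtype=float)[:5]  # truncate AFTER sorting
idcg = float(np.sum(ideal_k / np.log2(np.arange(2, ideal_k.size + 2))))
# idcg = 1/log2(2) + 1/log2(3) ≈ 1.6309, the denominator of NDCG@5
```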

Sources

The Wikipedia page for Discounted Cumulative Gain is quite helpful (note that $DCG@k$ is called $DCG_p$ there) and contains many useful links, e.g. to the Stanford handout mentioned in your code and the paper "A Theoretical Analysis of NDCG Type Ranking Measures" by Wang et al.
