I am currently running tests comparing XGBoost and LightGBM on their ability to rank items, reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.
I have been able to successfully reproduce the benchmarks reported in that work. Now I want to make sure that I am implementing my own version of the NDCG metric correctly, and that I am understanding the ranking problem correctly.
My questions are:
- When creating the validation scores for the test set using NDCG: there is a test.group file that says the first X rows belong to group 0, and so on. To get the recommendations for a group, do I take the predicted values and the known relevance scores, and sort that list by descending predicted value within each group?
- To get the final NDCG score from the lists created above, do I compute the NDCG score per group and then take the mean over all groups? Is this the same evaluation methodology that XGBoost/LightGBM use in their evaluation phase?
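Concretely, here is a minimal sketch of the procedure I have in mind for a single group (toy scores and a helper name of my own):

```python
import numpy as np

# Toy group: (known relevance, predicted score) pairs for one query.
group = [(2.0, -0.22), (1.0, 0.11), (0.0, -0.60), (1.0, -0.45)]

# Question 1: sort by predicted score, descending, keeping the relevance labels.
ranked = [rel for rel, score in sorted(group, key=lambda t: t[1], reverse=True)]

def dcg(r, k):
    # Discounted cumulative gain over the first k relevance labels.
    r = np.asarray(r, dtype=float)[:k]
    return np.sum((2 ** r - 1) / np.log2(np.arange(2, r.size + 2)))

# NDCG for this one group; question 2 would then average this over all groups.
ndcg = dcg(ranked, 10) / dcg(sorted(ranked, reverse=True), 10)
```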
Here is my methodology for evaluating the test set after the model has finished training.
For the final iteration when I run LightGBM, I obtain these values on the validation set:
[500] valid_0's ndcg@1: 0.513221 valid_0's ndcg@3: 0.499337 valid_0's ndcg@5: 0.505188 valid_0's ndcg@10: 0.523407
My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.
Here is my python code for calculating ndcg:
import numpy as np

def dcg_at_k(r, k):
    # Discounted cumulative gain over the first k relevance scores in r.
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg
After I get the predictions for the test set for a particular group (GROUP-0), I have these (known relevance, predicted score) tuples:
query_id predict
0 0 (2.0, -0.221681199441)
1 0 (1.0, 0.109895548348)
2 0 (1.0, 0.0262799346312)
3 0 (0.0, -0.595343431322)
4 0 (0.0, -0.52689043426)
5 0 (0.0, -0.542221350664)
6 0 (1.0, -0.448015576024)
7 0 (1.0, -0.357090949646)
8 0 (0.0, -0.279677741045)
9 0 (0.0, 0.2182200869)
NOTE: Group-0 actually has about 112 rows; the table above is truncated.
I then sort each group's list of tuples in descending order of predicted score, which yields a list of relevance scores per query:

def get_recommendations(x):
    # Sort by predicted score (second tuple element), descending,
    # and keep only the relevance labels.
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

relevance = evaluation.groupby('query_id').predict.apply(get_recommendations)
query_id
0 [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1 [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2 [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
Finally, for each query id I calculate the NDCG score on its relevance list, and then take the mean of the NDCG scores over all query ids:

relevance.apply(lambda x: ndcg_at_k(x, 10)).mean()

The value I obtain is ~0.497193.
Best Answer
I happened across this myself, and finally dug into the code to figure it out.
The difference is the handling of a missing IDCG, i.e. a query whose ideal DCG is zero because it contains no relevant documents. Your code returns 0 in that case, while LightGBM treats it as a 1.
The following code produced matching results for me:
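A sketch of that adjustment, keeping the question's functions otherwise unchanged (the only difference is the `return 1.` branch):

```python
import numpy as np

def dcg_at_k(r, k):
    # Discounted cumulative gain over the first k relevance scores in r.
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum((2 ** r - 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        # LightGBM scores a query with no relevant documents as a perfect 1,
        # rather than 0 as in the question's version.
        return 1.
    return dcg_at_k(r, k) / idcg
```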