I am currently running tests comparing XGBoost and LightGBM on their ability to rank items, reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.
I have been able to successfully reproduce the benchmarks reported in that work. Now I want to make sure that I am implementing my own version of the NDCG metric correctly, and that I am understanding the ranking problem correctly.
My questions are:
- When creating the validation scores for the test set using NDCG: there is a test.group file that says the first X rows belong to group 0, and so on. To get the recommendations for a group, do I take the predicted values and the known relevance scores, and sort that list by descending predicted value within each group?
- To get the final NDCG score from the lists created above, do I compute the NDCG score per group and then take the mean over all groups? Is this the same evaluation methodology that XGBoost/LightGBM use in their evaluation phase?
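Concretely, here is a minimal sketch of the procedure I have in mind for a single group (toy scores and a helper name of my own):

```python
import numpy as np

# Toy group: (known relevance, predicted score) pairs for one query.
group = [(2.0, -0.22), (1.0, 0.11), (0.0, -0.60), (1.0, -0.45)]

# Question 1: sort by predicted score, descending, keeping the relevance labels.
ranked = [rel for rel, score in sorted(group, key=lambda t: t[1], reverse=True)]

def dcg(r, k):
    # Discounted cumulative gain over the first k relevance labels.
    r = np.asarray(r, dtype=float)[:k]
    return np.sum((2 ** r - 1) / np.log2(np.arange(2, r.size + 2)))

# NDCG for this one group; question 2 would then average this over all groups.
ndcg = dcg(ranked, 10) / dcg(sorted(ranked, reverse=True), 10)
```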
Here is my methodology for evaluating the test set after the model has finished training.
For the final iteration when I run LightGBM, I obtain these values on the validation set:
[500] valid_0's ndcg@1: 0.513221 valid_0's ndcg@3: 0.499337 valid_0's ndcg@5: 0.505188 valid_0's ndcg@10: 0.523407
My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.
Here is my python code for calculating ndcg:
import numpy as np

def dcg_at_k(r, k):
    # Discounted cumulative gain over the first k relevance scores in r.
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg
After I get the predictions for the test set for a particular group (GROUP-0), I have these (known relevance, predicted score) tuples:
query_id predict
0 0 (2.0, -0.221681199441)
1 0 (1.0, 0.109895548348)
2 0 (1.0, 0.0262799346312)
3 0 (0.0, -0.595343431322)
4 0 (0.0, -0.52689043426)
5 0 (0.0, -0.542221350664)
6 0 (1.0, -0.448015576024)
7 0 (1.0, -0.357090949646)
8 0 (0.0, -0.279677741045)
9 0 (0.0, 0.2182200869)
NOTE: Group-0 actually has about 112 rows; the table above is truncated.
I then sort each group's list of tuples in descending order of predicted score, which yields a list of relevance scores per query:

def get_recommendations(x):
    # Sort by predicted score (second tuple element), descending,
    # and keep only the relevance labels.
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

relevance = evaluation.groupby('query_id').predict.apply(get_recommendations)
query_id
0 [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1 [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2 [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
Finally, for each query id I calculate the NDCG score on its relevance list, and then take the mean of the NDCG scores over all query ids:

relevance.apply(lambda x: ndcg_at_k(x, 10)).mean()

The value I obtain is ~0.497193.
Best Answer
I happened across this myself, and finally dug into the code to figure it out.
The difference is the handling of a missing IDCG, i.e. a query whose ideal DCG is zero because it contains no relevant documents. Your code returns 0 in that case, while LightGBM treats it as a 1.
The following code produced matching results for me:
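A sketch of that adjustment, keeping the question's functions otherwise unchanged (the only difference is the `return 1.` branch):

```python
import numpy as np

def dcg_at_k(r, k):
    # Discounted cumulative gain over the first k relevance scores in r.
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum((2 ** r - 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        # LightGBM scores a query with no relevant documents as a perfect 1,
        # rather than 0 as in the question's version.
        return 1.
    return dcg_at_k(r, k) / idcg
```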