Does it make sense to measure recall in recommender systems?

Tags: precision-recall, recommender-system

Assume I've built a recommender system that (given, say, movie ratings from many users) produces a list of 10 recommended movies for each user to watch. Imagine that I also have a large pool of movie items, together with a log of user ratings and the movies those users actually decided to watch. I want to use this data set to evaluate my system.

I've seen in the literature that these "suggest some good items" tasks are usually evaluated using precision, recall and F1-scores (e.g. see [1]). I guess that I should be interested, in particular, in "precision at 10". However, I'm not quite sure how one is supposed to compute these measures (or whether they make any sense) in the scenario I've described above.

Apparently, the preferred thing to do is to randomly split the data into a "training" and a "testing" part, and then feed the training data to my algorithm so that it can come up with a list of 10 predictions for each user.
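In code, I picture that split roughly like this (a sketch only; the interaction format and names are my own placeholders):

```python
import random
from collections import defaultdict

def split_interactions(interactions, test_fraction=0.2, seed=0):
    """Randomly split each user's watched movies into train and test parts.

    `interactions` is assumed to be an iterable of (user_id, movie_id) pairs.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, movie in interactions:
        by_user[user].append(movie)

    train, test = {}, {}
    for user, movies in by_user.items():
        rng.shuffle(movies)
        cut = max(1, int(len(movies) * (1 - test_fraction)))
        train[user], test[user] = movies[:cut], movies[cut:]
    return train, test
```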

Now precision sort of makes sense: I can check how many of the 10 predictions are actually found among the movies the user watched in the test data.

However, for recall, if the user watched a lot of movies in the test data, say 50 or so, there is no way to obtain a "good" recall score, simply because my system was constrained to produce only 10 movies, so I would get at most 10/50 = 0.2 recall.
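To make that concrete, here is a minimal sketch of precision- and recall-at-10 for a single user (the function and variable names are placeholders, not from any particular library):

```python
def precision_recall_at_k(recommended, watched_in_test, k=10):
    """Compute precision@k and recall@k for a single user.

    `recommended` is the ranked list produced by the recommender;
    `watched_in_test` is the set of movies the user actually watched
    in the held-out data.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(watched_in_test))
    precision = hits / k
    recall = hits / len(watched_in_test) if watched_in_test else 0.0
    return precision, recall

# With 50 watched movies in the test data and only 10 recommendation slots,
# recall can never exceed 10/50 = 0.2, even if every single guess is a hit.
p, r = precision_recall_at_k(list(range(10)), set(range(50)), k=10)
print(p, r)  # 1.0 0.2
```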

Alternatively, if I constrain the test to guessing only the "next 10 watched" movies of the user (so that a "perfect recall" is possible), then precision and recall will always be exactly the same number: whenever the number of recommended items equals the number of items relevant to the user, precision and recall coincide.
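Writing the two metrics out makes this explicit (here $R_k$ is the top-$k$ recommendation list and $T$ is the set of relevant test-set movies):

$$
\text{precision@}k = \frac{|R_k \cap T|}{k},
\qquad
\text{recall@}k = \frac{|R_k \cap T|}{|T|},
$$

so whenever $k = |T|$ the two denominators coincide and the two numbers are identical.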

Am I doing something wrong? Or do these metrics simply not make much sense in the considered scenario?

Best Answer

In the case of a "top-N" recommender system, it is helpful to construct an "unbiased" test data set (e.g. by adding a thousand random unwatched/unrated movies to the list of watched movies from the holdout data set for a given user) and then score the resulting test data set with the model. Once this is done for a bunch of users, one can calculate a "precision vs recall" curve and a "recall-at-N vs N" curve (as well as sensitivity/specificity and lift curves), which can be used to judge the quality of a given model. The paper Performance of Recommender Algorithms on Top-N Recommendation Tasks by Cremonesi et al. has more details.
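A minimal sketch of that protocol, assuming predictions come from some placeholder score(user, movie) function (not a real library API) and following the "one held-out item against 1000 random unwatched items" setup:

```python
import random

def recall_at_n(score, held_out_by_user, all_movies, seen_by_user,
                n=10, n_random=1000, seed=0):
    """Estimate recall@N with the 'held-out item vs. random items' protocol.

    For each held-out watched movie, rank it against `n_random` movies the
    user never interacted with; count a hit if it lands in the top N.
    """
    rng = random.Random(seed)
    hits = total = 0
    for user, held_out in held_out_by_user.items():
        # Movies the user never watched/rated: candidates for the random fill.
        unseen = list(set(all_movies) - set(seen_by_user[user]))
        for movie in held_out:
            candidates = rng.sample(unseen, n_random) + [movie]
            ranked = sorted(candidates, key=lambda m: score(user, m),
                            reverse=True)
            hits += movie in ranked[:n]
            total += 1
    return hits / total if total else 0.0
```

Sweeping n over a range of values then gives the "recall-at-N vs N" curve mentioned above.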

If a given model includes time dynamics, then the split between training and test data should be done along the time dimension (not entirely randomly).
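For example, assuming each logged interaction carries a timestamp (a made-up record layout, just to illustrate), a time-based split could look like this:

```python
def temporal_split(interactions, cutoff_time):
    """Split (user, movie, timestamp) records at a fixed point in time.

    Everything before `cutoff_time` is used for training and everything at or
    after it for testing, so the model never trains on 'future' behaviour.
    """
    train = [(u, m) for u, m, t in interactions if t < cutoff_time]
    test = [(u, m) for u, m, t in interactions if t >= cutoff_time]
    return train, test
```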
