Solved – How to find the similarity between movie preferences (in the form of a probability vector)of two users

correspondence-analysiscosine distancecosine similarityrecommender-systemsimilarities

I am working on recommender systems, and using some methodology I have got a probability of each user liking a movie. To elaborate, say user $u_1$ has the following distribution for movie preferences over $8$ movies:

m1     m2    m3   m4    m5    m6    m7   m8
0.12   0.2   0    0.15  0.15  0.3   0    0.08

This shows the preference of $u_1$ towards movies. So $u_1$ prefers movie $m_6$ the most , $m_4$ and $m_5$ equally, and so on. A $0$ means that $u_1$ hasn't seen movies $m_3$ and $m_7$.

Similarly I have the probability values for another user $u_2$ on these 8 movies. I need to know how to compute the similarity between these two probability vectors, because this helps me find the most similar users to a given user, and this is what helps me in collaborative filtering.

I am aware of one method called cosine similarity, but I'm not sure if it's the best way to find similarity in this context. Are there any other methods to find the similarity between two users?

Best Answer

Q1: What's the similarity between movie preferences of two users?

A1: Depends. You first need to define the aspect of similarity that you are looking for. In my view, in order to know this aspect, before that you need to specify your objective of knowing this similarity score.

Do you need this similarity measure to create clusters of similarly minded users, and then use movies that they watch to recommend them on others in the same cluster that haven't watched yet?

Let's say that $S$ is the scores random variable, $U$ is the users random variable, and $M$ is the movies random variable. For any user $u_i$ and any movie $m_j$, what we want is knowing this scores expectation: $$ \mathbb{E}\big[S|U=u_i, M=m_j\big] $$

If movie $m_j$ is unseen by user $u_i$, then movie $m_j$ is a candidate for being recommended to $u_i$. If the scores expectation for movie $m_j$ is enough, i.e. greater than some threshold $t$, then we shall recommend it to user $u_i$.

But how can we find $\mathbb{E}\big[S|U=u_i, M=m_j\big]$? We are still in the past where user $u_i$ hasn't seen $m_j$.

Therefore we need to find its estimation $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$.


Q2: How can we estimate $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$?

A2: We need to look at the behaviour of user $u_i$ in relation to past movies that $u_i$ has watched, and then look at how similar movie $m_j$ is to the past movies, and based on this similarity decide the score that $u_i$ would give it if he watches it.

But we can do even better by looking at similarly behaving users $u_a,u_b, \ldots$ to $u_i$ in order to enhance our prediction of $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$.

We can do even better by also looking at general world events (call it context maybe?). E.g. certain recent events can make certain users end up wanting to watch different things.

To cut it short, try this:

  1. Represent the all users as $n$ dimensional vectors. The components of such vectors could be movie ratings, or it could be other things like how they review movies, or even how they click and browser the movie streaming website. User representation is very general, what you mention above (e.g. movie ratings) is only a special case of user representation.
  2. Represent each movie as an $m$ dimensional vector. There are many ways to do this. Simply looking at the list of user rankings for the movie is only a special case of representing the movie. You can look at things such as, movie length, distribution of sound, distribution of visual effects, names of characters that play in the movie, etc. (You can even go as extreme as putting the whole movie in such that each dimension represents the byte value! But this could be too extreme).
  3. Create a dataset such that each user-movie-score triple is represented as a vector. This will give us an $n+m+1$ dimensional vector (recall that $n$ is the user representation, and $m$ is the movie representation). The score is just one number, so it adds just $+1$ to the dimensionality. Say that you have $1000$ users, then you will have $1000$ rows, where each row is a $n+m+1$ dimensional vector. You may also add the context information to each row in the dataset (e.g. augmenting a representation of the context as a $c$ dimensional vector, so each row becomes $n+m+c+1$ dimensional vector).
  4. Train Random Forests on this dataset and ask it to analyze the $n+m$ components (user-movie components) in order to predict the last component (the movie scores). This will give you a regression model.

Then, in order to use this model to predict scores, repeat the same steps 1 to 3, except for pairing users against movies that they haven't seen. Then plug these vectors against the Random Forests regression model to get an expected score for the test user-movie tuple. Finally, if this score is large enough, then recommend the test movie to the test user.

Why is this nice?

  • Random Forests is non-parametric and can identify non-linear patterns.
  • Easy to distribute.

Q3: Can you use other regression methods?

A3: Yes. You need to choose the algorithm that seems to work best in your domain. I suggested Random Forests because I think it's a nice baseline to start with, but you can try fancier ones such as deep learning algorithms along with different methods of representing your users and their movies.

Related Question