I am working on recommender systems, and using some methodology I have obtained, for each user, a probability of liking each movie. To elaborate, say user $u_1$ has the following distribution of movie preferences over $8$ movies:
| m1 | m2 | m3 | m4 | m5 | m6 | m7 | m8 |
|------|-----|----|------|------|-----|----|------|
| 0.12 | 0.2 | 0 | 0.15 | 0.15 | 0.3 | 0 | 0.08 |
This shows the preference of $u_1$ towards movies. So $u_1$ prefers movie $m_6$ the most, $m_4$ and $m_5$ equally, and so on. A $0$ means that $u_1$ hasn't seen movies $m_3$ and $m_7$.
Similarly I have the probability values for another user $u_2$ on these 8 movies. I need to know how to compute the similarity between these two probability vectors, because this helps me find the most similar users to a given user, and this is what helps me in collaborative filtering.
I am aware of one method called cosine similarity, but I'm not sure if it's the best way to find similarity in this context. Are there any other methods to find the similarity between two users?
Best Answer
Q1: What's the similarity between movie preferences of two users?
A1: Depends. You first need to define the aspect of similarity that you are looking for. In my view, to pin down that aspect, you first need to specify what you want to achieve with this similarity score.
Do you need this similarity measure to create clusters of like-minded users, and then recommend the movies that some users in a cluster have watched to others in the same cluster who haven't watched them yet?
Let's say that $S$ is the score random variable, $U$ is the user random variable, and $M$ is the movie random variable. For any user $u_i$ and any movie $m_j$, what we want to know is the expected score: $$ \mathbb{E}\big[S|U=u_i, M=m_j\big] $$
If movie $m_j$ is unseen by user $u_i$, then movie $m_j$ is a candidate for being recommended to $u_i$. If the expected score for movie $m_j$ is high enough, i.e. greater than some threshold $t$, then we recommend it to user $u_i$.
But how can we find $\mathbb{E}\big[S|U=u_i, M=m_j\big]$? User $u_i$ hasn't seen $m_j$ yet, so no score has been observed for this pair.
Therefore we need to find an estimate of it, $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$.
Q2: How can we estimate $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$?
A2: We need to look at how user $u_i$ behaved towards the movies that $u_i$ has already watched, then look at how similar movie $m_j$ is to those movies, and use this similarity to decide the score that $u_i$ would give $m_j$ after watching it.
But we can do even better by also looking at users $u_a,u_b, \ldots$ who behave similarly to $u_i$, in order to improve our estimate $\mathbb{\widehat E}\big[S|U=u_i, M=m_j\big]$.
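As a rough illustration of using similar users, here is a minimal sketch of a similarity-weighted estimate. It assumes cosine similarity as the similarity measure and a plain weighted average as the estimator; neither choice is prescribed above. The vector `u1` is taken from the question, while `u2`, `u3` and the helper names are hypothetical. Note that treating unseen movies as $0$ inside the cosine is a simplification.

```python
from math import sqrt

def cosine_similarity(p, q):
    """Cosine of the angle between two preference vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sqrt(sum(a * a for a in p))
    norm_q = sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

def estimate_score(target, others, movie_idx):
    """Similarity-weighted average of the other users' scores for one movie.

    Users who have not seen the movie (score 0) are skipped.
    Returns None if no other user has seen it.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other in others:
        if other[movie_idx] == 0:
            continue
        w = cosine_similarity(target, other)
        weighted_sum += w * other[movie_idx]
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else None

u1 = [0.12, 0.2, 0.0, 0.15, 0.15, 0.3, 0.0, 0.08]    # from the question
u2 = [0.10, 0.25, 0.05, 0.10, 0.20, 0.25, 0.05, 0.0]  # hypothetical user
u3 = [0.30, 0.0, 0.30, 0.0, 0.0, 0.05, 0.30, 0.05]    # hypothetical user

# Estimated score of u1 for the unseen movie m3 (index 2).
estimate_m3 = estimate_score(u1, [u2, u3], movie_idx=2)
print(cosine_similarity(u1, u2), cosine_similarity(u1, u3), estimate_m3)
```

Because `u2` is much closer to `u1` than `u3` is, the estimate for $m_3$ lands nearer to the score given by `u2`. Any other similarity between probability vectors could be swapped in for `cosine_similarity` without changing the rest.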
We can do even better by also looking at general world events (call it context maybe?). E.g. certain recent events can make certain users end up wanting to watch different things.
To cut it short, try this:
Then, to use this model to predict scores, repeat the same steps 1 to 3, but pair users with movies that they haven't seen. Feed the resulting vectors into the Random Forests regression model to get an expected score for each test user-movie pair. Finally, if this score is large enough, recommend the test movie to the test user.
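The prediction step could look roughly like the sketch below, assuming scikit-learn's `RandomForestRegressor` and assuming each user-movie pair has already been turned into a fixed-length feature vector. The feature layout, the synthetic data and the threshold value are all placeholders made up for illustration, not part of the recipe above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set: one row per (user, movie) pair
# that the user HAS scored. The 6 features are hypothetical.
X_train = rng.random((200, 6))
# Made-up scores in [0, 1], used only so the model has something to fit.
y_train = 0.6 * X_train[:, 0] + 0.4 * X_train[:, 3]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Feature vectors for (user, movie) pairs the user has NOT seen yet.
X_unseen = rng.random((5, 6))
predicted_scores = model.predict(X_unseen)

# Recommend only the pairs whose predicted score exceeds the threshold t.
t = 0.5  # assumed threshold; tune it for your data
recommend = predicted_scores > t
print(predicted_scores, recommend)
```

With real data, `X_train` and `y_train` would come from the observed user-movie scores, and `t` would be picked on a held-out set.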
Why is this nice?
Q3: Can you use other regression methods?
A3: Yes. You need to choose the algorithm that seems to work best in your domain. I suggested Random Forests because I think it's a nice baseline to start with, but you can try fancier ones such as deep learning algorithms along with different methods of representing your users and their movies.