However: with pure vanilla SVD you might have problems recreating the original matrix, let alone predicting values for missing items. The useful rule of thumb in this area is to calculate the average rating per movie and subtract it from each observed user/movie rating, that is, to remove the movie bias. Then it is recommended you run SVD, and of course you have to store these bias values somewhere in order to recreate the ratings or predict unknown values. I'd read Simon Funk's post on SVD for recommendations - he invented an incremental SVD approach during the Netflix competition.
http://sifter.org/~simon/journal/20061211.html
I guess demeaning the matrix $A$ before SVD makes sense, since SVD's close cousin PCA also works in a similar way. In terms of incremental computation, Funk told me that if you do not demean, the first gradient direction dominates the rest of the computation. I've seen this firsthand: without demeaning, things simply do not work.
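As a minimal sketch of that demeaning step (a hypothetical toy matrix with 0 marking missing ratings, and an arbitrary rank $k$; this is plain truncated SVD on the centered matrix, not Funk's incremental algorithm):

```python
import numpy as np

# Hypothetical toy ratings matrix: rows = users, cols = movies, 0 = missing.
R = np.array([[5., 4., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
mask = R > 0  # observed entries

# Per-movie mean computed over observed ratings only (the "movie bias").
movie_mean = R.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)

# Demean observed entries; leave missing entries at 0 (the neutral value after centering).
R_centered = np.where(mask, R - movie_mean, 0.0)

# Truncated SVD on the centered matrix, keeping the k largest singular values.
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + movie_mean  # add the bias back

print(np.round(R_hat, 2))
```

The stored `movie_mean` vector is exactly the bias you need to keep around to reconstruct ratings later.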
Q1: What's the similarity between movie preferences of two users?
A1: It depends. You first need to define the aspect of similarity you are looking for, and in my view, to know that aspect you first need to specify your objective in computing this similarity score.
Do you need this similarity measure to cluster like-minded users, and then recommend movies watched by some members of a cluster to other members who haven't watched them yet?
Let's say that $S$ is the scores random variable, $U$ is the users random variable, and $M$ is the movies random variable. For any user $u_i$ and any movie $m_j$, what we want to know is the expected score:
$$
\mathbb{E}\big[S|U=u_i, M=m_j\big]
$$
If movie $m_j$ is unseen by user $u_i$, then movie $m_j$ is a candidate for being recommended to $u_i$. If the expected score for movie $m_j$ is high enough, i.e. greater than some threshold $t$, then we shall recommend it to user $u_i$.
But how can we find $\mathbb{E}\big[S|U=u_i, M=m_j\big]$? User $u_i$ hasn't seen $m_j$ yet, so this score is unobserved.
Therefore we need to find an estimate of it, $\widehat{\mathbb{E}}\big[S|U=u_i, M=m_j\big]$.
Q2: How can we compute the estimate $\widehat{\mathbb{E}}\big[S|U=u_i, M=m_j\big]$?
A2: We need to look at the behaviour of user $u_i$ in relation to the movies that $u_i$ has watched in the past, then look at how similar movie $m_j$ is to those past movies, and based on this similarity estimate the score that $u_i$ would give $m_j$ after watching it.
But we can do even better by also looking at users $u_a, u_b, \ldots$ who behave similarly to $u_i$, in order to improve our estimate $\widehat{\mathbb{E}}\big[S|U=u_i, M=m_j\big]$.
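To make the neighbor idea concrete, here is a minimal sketch (the toy matrix and function names are made up) that estimates the score as a cosine-similarity-weighted average over other users who have already rated the movie:

```python
import numpy as np

# Hypothetical ratings matrix (0 = unseen); we estimate user 0's score for movie 2.
R = np.array([[5., 4., 0., 1.],
              [4., 5., 4., 1.],
              [1., 2., 5., 5.],
              [5., 4., 5., 2.]])

def cosine(a, b):
    # Cosine similarity restricted to the movies both users have rated.
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    a, b = a[both], b[both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, user, movie):
    # Similarity-weighted average of the scores other users gave this movie.
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, movie] == 0:
            continue
        w = cosine(R[user], R[other])
        num += w * R[other, movie]
        den += abs(w)
    return num / den if den else 0.0

print(round(predict(R, 0, 2), 2))
```

Here users 1 and 3, who rate most like user 0, both gave movie 2 a high score, so the estimate comes out high as well.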
We can do better still by also looking at general world events (call it context, maybe?). E.g. certain recent events can make certain users want to watch different things.
To cut it short, try this:
- Represent each user as an $n$-dimensional vector. The components of such a vector could be movie ratings, or other things such as how they review movies, or even how they click and browse the movie streaming website. User representation is very general; what you mention above (e.g. movie ratings) is only a special case.
- Represent each movie as an $m$-dimensional vector. There are many ways to do this; simply using the list of user ratings for the movie is only a special case. You can look at things such as movie length, distribution of sound, distribution of visual effects, names of the actors in the movie, etc. (You could even go as extreme as encoding the whole movie, with each dimension representing a byte value - but that is probably too extreme.)
- Create a dataset in which each user-movie-score triple is represented as a vector. This gives an $(n+m+1)$-dimensional vector (recall that $n$ is the user representation and $m$ is the movie representation). The score is a single number, so it adds just $1$ to the dimensionality. Say you have $1000$ observed user-movie-score triples; then you have $1000$ rows, each an $(n+m+1)$-dimensional vector. You may also add the context information to each row (e.g. augmenting each row with a $c$-dimensional representation of the context, making it an $(n+m+c+1)$-dimensional vector).
- Train a Random Forest on this dataset, using the first $n+m$ components (the user-movie features) to predict the last component (the score). This gives you a regression model.
Then, to use this model to predict scores, repeat steps 1 to 3, except pairing users with movies they haven't seen. Feed these vectors to the Random Forest regression model to get an expected score for each test user-movie pair. Finally, if this score is large enough, recommend the test movie to the test user.
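A sketch of these steps with scikit-learn's `RandomForestRegressor`, using randomly generated user/movie vectors and a made-up scoring rule purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical representations: n-dim user vectors, m-dim movie vectors.
n_users, n_movies, n, m = 50, 30, 8, 6
users = rng.normal(size=(n_users, n))
movies = rng.normal(size=(n_movies, m))

# Synthetic observed (user, movie, score) triples; the scoring rule is invented.
X, y = [], []
for _ in range(400):
    u, v = rng.integers(n_users), rng.integers(n_movies)
    score = users[u, :3] @ movies[v, :3] + rng.normal(scale=0.1)
    X.append(np.concatenate([users[u], movies[v]]))  # n+m features per row
    y.append(score)
X, y = np.array(X), np.array(y)

# Regress the score on the concatenated user-movie features.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predict the score for an unseen user-movie pair; recommend if above threshold t.
t = 1.0
pair = np.concatenate([users[0], movies[5]])
pred = model.predict(pair[None, :])[0]
print("recommend" if pred > t else "skip")
```

The context vector mentioned above would simply be concatenated as $c$ extra feature columns per row.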
Why is this nice?
- Random Forests are non-parametric and can identify non-linear patterns.
- Easy to distribute.
Q3: Can you use other regression methods?
A3: Yes. You need to choose the algorithm that works best in your domain. I suggested Random Forests because I think they're a nice baseline to start with, but you can try fancier methods such as deep learning, along with different ways of representing your users and their movies.
Best Answer
$\DeclareMathOperator*{\argmin}{arg\,min}$ Ok, when you say SVD, presumably you're talking about truncated SVD (where you only keep the $k$ biggest singular values). There are two different ways to look at the truncated SVD of a matrix $X$. One is the standard definition:
$$
\tilde{X} = U_k \Sigma_k V_k^T,
$$
where $\Sigma_k$ keeps only the $k$ largest singular values and $U_k$, $V_k$ hold the corresponding singular vectors.
This is all fine and dandy (and easy to implement in R or matlab), but it doesn't make sense when talking about matrices with missing values. However, there's an interesting property of the $k$-truncated SVD: it's the best $k$-rank approximation to the original! That is:
$$
\tilde{X} = \argmin_{B \,:\, \operatorname{rank}(B) = k} \|X - B\|_F.
$$
This property seems easy to generalize to the missing value case. Basically you're looking for a $k$-rank matrix that minimizes the element-wise mean squared error across the known entries of the original matrix. That is, when you're training the system, you ignore all of the missing values. (For tips on how you might actually go about finding a $k$-rank approximation, here are some places to look).
Then, once you've come up with a suitably "close" $k$-rank approximation to the original, you use it to fill in the missing values. That is, if $X_{ij}$ was missing, then you fill in $\tilde{X}_{ij}$. Tada! You are now done.
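For completeness, a minimal sketch of one way to find such a $k$-rank approximation over the known entries only (plain SGD on the factors, in the spirit of Funk's approach; the toy matrix, learning rate, and regularization are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix with missing entries (np.nan = missing).
X = np.array([[5., 4., np.nan, 1.],
              [4., np.nan, 4., 1.],
              [1., 2., 5., np.nan],
              [np.nan, 4., 5., 2.]])
known = ~np.isnan(X)
k, lr, reg = 2, 0.02, 0.01

# Factor X ~ U V^T; train only on the known entries, ignoring the missing ones.
U = 0.1 * rng.normal(size=(X.shape[0], k))
V = 0.1 * rng.normal(size=(X.shape[1], k))
pairs = list(zip(*np.where(known)))
for epoch in range(2000):
    for i, j in pairs:
        err = X[i, j] - U[i] @ V[j]
        u_old = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])

# The reconstruction fills in every entry, including the missing ones.
X_tilde = U @ V.T
print(np.round(X_tilde, 2))
```

The entries of `X_tilde` at the `nan` positions are exactly the filled-in predictions described above.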