Machine Learning – Applying SVD to Collaborative Filtering Problems Explained

machine-learning, recommender-system, svd

In collaborative filtering, we have values that are not filled in. If a user did not watch a movie, then we have to put an 'NA' there.

If I am going to take an SVD of this matrix, then I have to put some number there – say 0. Now if I factorize the matrix, I have a method for finding similar users (by finding out which users are closer together in the reduced-dimensional space). But the predicted preference itself – of a user for an item – will be zero, because that's what we entered in the unknown cells.
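For concreteness, here is a rough numpy sketch of what I am doing (the toy ratings matrix and the choice of $k$ are made up for illustration):

```python
import numpy as np

# Toy user x movie ratings; 0 stands in for the 'NA' (unwatched) entries.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Truncated SVD of the zero-filled matrix.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]          # each row = one user in the reduced space

# Finding "similar" users: cosine similarity between rows of user_factors.
unit = user_factors / np.linalg.norm(user_factors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Reconstruction: the entries we zero-filled stay close to 0,
# so they are not useful directly as predicted preferences.
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```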

So I am stuck on the problem of collaborative filtering vs. SVD. They seem to be almost the same thing, but not quite.

What is the difference between them, and what happens when I apply SVD to a collaborative filtering problem? I tried it, and the results seem acceptable in terms of finding nearby users, which is great – but how?

Best Answer

$\DeclareMathOperator*{\argmin}{arg\,min}$ OK, when you say SVD, presumably you're talking about truncated SVD (where you keep only the $k$ biggest singular values). There are two different ways to look at the truncated SVD of a matrix. One is the standard definition:

First you do the SVD: $\underset{n\times m}{X} = \underset{n\times n}{U} \underset{n\times m}{\Sigma} \underset{m\times m}{V^T}$, where $U$ and $V$ are rotation matrices, and $\Sigma$ has the singular values along its diagonal. Then you pick the top $k$ singular values, zero out the rest, and hack off the irrelevant rows and columns to make a $k$-rank approximation to the original: $X \approx \tilde{X} = \underset{n\times k}{\tilde{U}} \underset{k\times k}{\tilde{\Sigma}} \underset{k\times m}{\tilde{V}^T}$.
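In code, that looks something like this (a small numpy illustration of the definition above; the matrix and the value of $k$ are arbitrary):

```python
import numpy as np

n, m, k = 6, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))

# Full SVD: U is n x n, s holds the min(n, m) singular values, Vt is m x m.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Keep the top k singular values and "hack off" the irrelevant rows/columns.
U_k  = U[:, :k]            # n x k
S_k  = np.diag(s[:k])      # k x k
Vt_k = Vt[:k, :]           # k x m

X_tilde = U_k @ S_k @ Vt_k                # rank-k approximation of X
print(np.linalg.matrix_rank(X_tilde))     # -> k
```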

This is all fine and dandy (and easy to implement in R or MATLAB), but it doesn't make sense for matrices with missing values. However, there's an interesting property of the $k$-truncated SVD: it's the best $k$-rank approximation to the original! That is:

$ \tilde{X} = \argmin_{B : \operatorname{rank}(B)=k} \displaystyle\sum\limits_{i,j} (X_{ij} - B_{ij})^2$

This property seems easy to generalize to the missing-value case. Basically, you're looking for a $k$-rank matrix that minimizes the element-wise mean squared error across the known entries of the original matrix. That is, when you're training the system, you ignore all of the missing values. (In practice, such a $k$-rank approximation is usually found by iterative optimization, e.g. gradient descent or alternating least squares on the known entries.)
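Here is a minimal sketch of one way to do this – plain gradient descent on the squared error over the known entries only. The rank, learning rate, and step count are arbitrary choices for illustration, not a recommendation:

```python
import numpy as np

def low_rank_complete(X, known, k, steps=5000, lr=0.01, seed=0):
    """Find W (n x k) and H (k x m) so that W @ H fits X on the known entries.

    X     : ratings matrix (values in the unknown cells are ignored)
    known : boolean mask, True where X is observed
    """
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(n, k))
    H = 0.1 * rng.normal(size=(k, m))
    for _ in range(steps):
        E = (W @ H - X) * known      # residuals, but only on the known entries
        W -= lr * (E @ H.T)          # gradient step for W
        H -= lr * (W.T @ E)          # gradient step for H
    return W, H
```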

Then, once you've come up with a suitably "close" $k$-rank approximation to the original, you use it to fill in the missing values. That is, if $X_{ij}$ was missing, then you fill in $\tilde{X}_{ij}$. Tada! You are now done.
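Continuing the sketch above, filling in the missing values is just reading entries off the low-rank reconstruction (again on made-up data):

```python
import numpy as np

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
known = R > 0                             # treat 0 as "missing" in this toy example

W, H = low_rank_complete(R, known, k=2)   # function from the sketch above
R_filled = np.where(known, R, W @ H)      # keep observed ratings, fill in the rest
```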
