Solved – Predicting with Restricted Boltzmann Machines for Collaborative Filtering

machine-learning, recommender-system

Restricted Boltzmann Machines (RBMs) were used in the Netflix competition to improve the prediction of user ratings for movies based on collaborative filtering.

I think I understand how to use RBMs as a generative model after obtaining the weights that maximize the likelihood of the data (in this case, of the visible units). However, the Netflix dataset had a very large number of missing ratings, so Salakhutdinov, Mnih and Hinton decided to use a separate RBM for each user, with weights shared between users who rated the same movie. Training proceeds as usual, but I don't understand how to actually fill in the missing movie ratings for a given user.


The paper linked above has a section titled "Making Predictions" in which the following equations are supposed to predict a rating for a new movie q:

$$p(v_{q}^{k}=1|V) \propto \sum_{h_1,\dots,h_p} \exp(-E(v_{q}^{k},V,h)) \\ \propto \Gamma_{q}^{k} \prod_{j=1}^{F}\sum_{h_j \in \{0,1\}}\exp\Big(\sum_{il}v_{i}^{l}h_{j}W_{ij}^{l} + v_{q}^{k}h_{j}W_{qj}^{k}+h_jb_j\Big)$$

This seems to compute the probability that the visible unit for rating $k$ of movie $q$ is active, given the ratings $V$ of a single user. But I'm not sure how this helps to infer a rating for a movie that wasn't rated by a particular user, since that user's RBM has no visible units or weights for unrated movies.
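For reference, here is how I currently read that formula as a computation (a NumPy sketch; all shapes and variable names are my own assumptions): because each $h_j$ is binary, the sum over hidden configurations factorizes into a product of $(1+\exp(\cdot))$ terms, and normalizing over the $K$ rating values of movie $q$ gives a distribution.

```python
import numpy as np

# Assumed shapes: W[i, l, j] is the weight between rating unit l of movie i
# and hidden unit j; b_h are hidden biases; b_v[q, k] is the visible bias,
# so Gamma_q^k = exp(v_q^k * b_q^k) collects the terms not involving h.
rng = np.random.default_rng(0)
n_movies, K, F = 5, 5, 3
W = rng.normal(0.0, 0.1, size=(n_movies, K, F))
b_h = np.zeros(F)
b_v = np.zeros((n_movies, K))

def score(V, q, k):
    """Unnormalized p(v_q^k = 1 | V): summing each binary h_j out of the
    exponential turns the sum over hidden configurations into a product
    of (1 + exp(x_j)) factors."""
    x = sum(W[i, l] for i, l in V.items())  # sum_{i,l} v_i^l W_{ij}^l, shape (F,)
    x = x + W[q, k] + b_h                   # add the v_q^k term and the hidden bias
    return np.exp(b_v[q, k]) * np.prod(1.0 + np.exp(x))

V = {0: 4, 2: 1}  # observed ratings: movie index -> rating index (0..K-1)
scores = np.array([score(V, q=3, k=k) for k in range(K)])
probs = scores / scores.sum()  # p(v_q^k = 1 | V), normalized over k
```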

In this PDF (page 27), the author describes prediction based on the reconstruction phase:

$$\sum_k p(v_{i}^{k}=1 \mid h)\times k$$ which is very different from the previous equation.

Best Answer

The inputs for missing movies are all zeros.

In a vanilla RBM, once you go up to the hidden layer and then come back to the visible layer, you'll get reconstructions for all movies, not just the ones that the current user has interacted with. In the training process it's really important to ignore those reconstructions so that they don't affect your weight matrix and visible-layer bias in the update step. In this context, "ignore" means setting the value to 0. If you don't set them to 0, you'll update more weights than you should.
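A minimal CD-1 sketch of that masking step (all names here are my own; I use a plain sigmoid on the visible units for brevity, whereas the paper uses a softmax over the $K$ rating units of each movie):

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, K, F = 6, 5, 4
W = rng.normal(0.0, 0.01, size=(n_movies * K, F))  # flattened (movie, rating) units
b_v = np.zeros(n_movies * K)
b_h = np.zeros(F)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# One-hot ratings; the units of unrated movies stay all zero.
ratings = {0: 3, 2: 0, 5: 4}          # movie index -> rating index
v0 = np.zeros(n_movies * K)
mask = np.zeros(n_movies * K)
for i, r in ratings.items():
    v0[i * K + r] = 1.0
    mask[i * K:(i + 1) * K] = 1.0     # keep only the rated movies' units

# CD-1: up, down (masked), up again.
h0 = sigmoid(v0 @ W + b_h)
v1 = sigmoid(W @ h0 + b_v) * mask     # "ignore" = set unrated reconstructions to 0
h1 = sigmoid(v1 @ W + b_h)

W_before = W.copy()
lr = 0.1
W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
b_v += lr * (v0 - v1)                 # zero for every unrated movie
b_h += lr * (h0 - h1)
```

With the mask in place, the weight rows and visible biases of unrated movies receive exactly zero gradient, which is the point of the "ignore" step.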

On the other hand, to get predictions, you need to use the full matrix so that you obtain reconstructions for every movie.

As you pointed out, you just calculate the activations for all visible units (using the softmax function), pick the five rating units for the specific movie, and then apply the last formula, which gives the expected rating for movie $i$:

$p_i = \sum_{k} p(v_{i}^{k} = 1 | h) k$

If your ratings are natural numbers, you can simply round() the result.
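Putting those two steps together (a hypothetical example; the activation values are made up):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Made-up down-pass activations for one movie's five rating units.
activations = np.array([0.1, 0.4, 1.2, 2.0, 0.3])
p = softmax(activations)                       # p(v_i^k = 1 | h) for k = 1..5
expected = float(np.sum(p * np.arange(1, 6)))  # p_i = sum_k p(v_i^k = 1 | h) * k
predicted_rating = round(expected)             # round to a natural-number rating
```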