Solved – Training and testing an autoencoder on very sparsely populated data

autoencoders, machine-learning, recommender-system, unsupervised-learning, validation

I am exploring the possibility of using a deep autoencoder neural net to build a recommender system. As a first step, I am testing the model's performance on the MovieLens data set, the traditional benchmark for this task. I have pivoted the data set into a matrix of size U × M, where U is the number of users and M is the number of movies. Each cell R_u,m holds the rating (1-5) that user u gave to movie m, provided they rated it; otherwise the cell is 0. Since each user has rated only a very small proportion of all movies, the matrix is predominantly zeros.
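
A minimal sketch of that pivot with pandas, assuming the ratings file uses the standard MovieLens column names (userId, movieId, rating):

```python
import pandas as pd

# Assumes the standard MovieLens ratings layout:
# columns userId, movieId, rating (plus a timestamp we ignore).
ratings = pd.read_csv("ratings.csv")

# Pivot into a U x M matrix; unrated cells become 0.
R = ratings.pivot_table(index="userId", columns="movieId",
                        values="rating", fill_value=0)
R = R.to_numpy(dtype="float32")  # shape (U, M)
```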

I am training the autoencoder by feeding the data set into it in batches of rows, encoding each row into a dimension k < M, reconstructing it back to dimension M, and using the original input as the label in backprop, but calculating the loss only on the non-zero values of the input.
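
To make that concrete, here is one way the masked loss could look in PyTorch, reusing R from above; the layer sizes, activation, and optimizer settings are placeholder choices, not details from my actual model:

```python
import torch
import torch.nn as nn

class AutoRec(nn.Module):
    """Simple autoencoder: M -> k -> M."""
    def __init__(self, n_movies, k):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_movies, k), nn.Sigmoid())
        self.decoder = nn.Linear(k, n_movies)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def masked_mse(pred, target):
    # Only the observed (non-zero) ratings contribute to the loss.
    mask = (target != 0).float()
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)

model = AutoRec(n_movies=R.shape[1], k=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(torch.from_numpy(R),
                                     batch_size=64, shuffle=True)
for epoch in range(20):
    for batch in loader:          # batch: rows of the rating matrix
        loss = masked_mse(model(batch), batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```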

The idea is that the model will reconstruct the original non-zero values accurately, and hence will populate all the original zero values with accurate predicted ratings.

It learns quite quickly to what seems to be an impressive loss value, but of course this could be pure overfitting. I now want to test on unseen data, but I am having difficulty figuring out how to divide the data set into training and testing sets.

Any recommendations on how to correctly train and test a model on this type of data?

What first came to mind is the following:

  • All rows in the original matrix have at least 20 non-zero values
  • In each row, set 10 of these non-zero values, chosen at random, to 0; call this new matrix input
  • Feed this input matrix into the model for training, still calculating the loss only on the non-zero values
  • For testing, feed original into the model, but now calculate the loss only on the 10 values per row that were randomly set to 0 (see the sketch below)
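
A sketch of this hold-out scheme, reusing R from above. Note that it feeds the masked matrix at evaluation time as well, since feeding original would put the held-out ratings back into the input:

```python
import numpy as np

rng = np.random.default_rng(0)

def hold_out(R, n_hold=10):
    """Zero out n_hold observed ratings per row; return the masked input
    and a boolean mask marking the held-out cells for evaluation."""
    X = R.copy()
    test_mask = np.zeros_like(R, dtype=bool)
    for u in range(R.shape[0]):
        observed = np.flatnonzero(R[u])     # indices of rated movies
        held = rng.choice(observed, size=n_hold, replace=False)
        X[u, held] = 0
        test_mask[u, held] = True
    return X, test_mask

X_train, test_mask = hold_out(R)

# Train on X_train (loss on its non-zero cells), then evaluate on the
# held-out cells only, e.g.:
#   pred = model(torch.from_numpy(X_train)).detach().numpy()
#   test_rmse = np.sqrt(((pred - R) ** 2)[test_mask].mean())
```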

Best Answer

It seems there is a similar case here.

The author of that answer makes a proposal very similar to yours, and I would probably do it the same way as well.

The key takeaway is to make sure you don't zero out everything in some row or column. You have already identified this caveat for the rows. Make sure your columns also retain at least some non-zero values (perhaps 10 is enough to get a feeling for someone's movie taste).
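
For instance, a quick sanity check along these lines, reusing the R and test_mask arrays from the sketches in the question; the threshold of 10 is just the rough figure above:

```python
import numpy as np

# Count the ratings each movie (column) retains after masking, and flag
# movies that would be left with too little training signal.
remaining = ((R != 0) & ~test_mask).sum(axis=0)
thin_columns = np.flatnonzero(remaining < 10)
print(f"{thin_columns.size} movies have fewer than 10 training ratings")
```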

If you don't know anything about a movie or a user, you can't expect the autoencoder to know the user's tastes (though it might predict the general population's taste).

Best of luck with your algorithm!