Solved – How to set preferences for ALS implicit feedback in Collaborative Filtering

recommender-system

I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, userId and productId. I have no product ratings, just information about which products users have bought, that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a Rating object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for trainImplicit says: Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs.

When I set the rating / preference to 1, as in:

import java.io.File
import scala.util.Random
import org.apache.spark.mllib.recommendation.Rating

val rnd = new Random
val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()

And then train ALS on the training split:

val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a large error given that the preferences take values of 0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

So my question is: what value should I set the rating to in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (with the ALS.trainImplicit method)?

Update

With:

val alpha = 40
val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

Which is still a big error, I guess. I also get a strange baseline improvement, because the baseline model simply predicts the mean (1), so its RMSE is 0 and the percentage improvement divides by zero. A sketch of how alpha and lambda are passed to trainImplicit is given below.
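For reference, here is a minimal sketch (my own assumption about the call, not the exact code behind the runs above) of how alpha and lambda can be passed to ALS.trainImplicit, using the overload that takes (ratings, rank, iterations, lambda, alpha) and the training split from earlier:

val alpha = 40.0    // confidence weight attached to observed interactions
val lambda = 0.01   // regularization parameter
// Train on the training split only; rank and numIter as in the runs above.
val model = ALS.trainImplicit(training, rank, numIter, lambda, alpha)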

Best Answer

Let's start with an example: say you have data on the transaction details of customers in a store, so you know who bought what and when. Clearly you don't have rating data, but you do have the understanding that if a person buys an item multiple times, they probably like it (or would rate it highly if given the chance). Thus an implicit preference/rating could be "how many times someone buys something", as sketched below.
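A minimal sketch of that idea, assuming the raw purchase log is a CSV of (userId, productId) lines (the file name and field layout are assumptions for illustration):

import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey
import org.apache.spark.mllib.recommendation.Rating

// Count purchases per (user, product) and use the count as the implicit preference.
val purchases = sc.textFile("purchases.csv").map { line =>
  val fields = line.split(",")
  ((fields(0).toInt, fields(1).toInt), 1.0)
}
val countRatings = purchases
  .reduceByKey(_ + _)                                   // number of purchases per pair
  .map { case ((user, product), count) => Rating(user, product, count) }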

You can also bring in a time component, e.g. "A" buys chips 5 times this week, but "B" bought them 5 times last week. If your preference is "number of purchases in the last month", you miss the information that "B" may be less likely to buy chips again than "A". So you can apply a time decay when aggregating the counts, for example as in the sketch below.
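One possible way to do that (imports as in the previous sketch), assuming each purchase record carries a third column with the purchase day; the file name, column layout, and half-life are assumptions for illustration:

// Weight each purchase by an exponential time decay before summing,
// so older purchases contribute less to the implicit preference.
val today = 16000.0          // "current" day number, assumed for illustration
val halfLife = 30.0          // a purchase loses half its weight every 30 days
val decayedRatings = sc.textFile("purchases_with_time.csv").map { line =>
  val fields = line.split(",")
  val ageInDays = today - fields(2).toDouble
  val weight = math.pow(0.5, ageInDays / halfLife)
  ((fields(0).toInt, fields(1).toInt), weight)
}.reduceByKey(_ + _)
  .map { case ((user, product), pref) => Rating(user, product, pref) }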

In the RDD[Rating], you can provide these values directly. What trainImplicit() does is learn to predict a preference p between 0 and 1 that depends on how high the implicit preference you supplied is. So when you use the model to predict, you should expect values roughly in the range 0 to 1, not the counts that you provided in the RDD[Rating] (see the sketch below).
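For instance, a hypothetical way to use those scores, assuming a trained model and the count-based countRatings RDD from the first sketch above:

// Rank products for one user by the predicted preference score instead of
// comparing the score directly to the raw purchase counts.
val userId = 42              // arbitrary example user id (an assumption)
val candidates = countRatings.map(_.product).distinct().map(p => (userId, p))
val topTen = model.predict(candidates)
  .sortBy(-_.rating)         // higher score means stronger predicted preference
  .take(10)
topTen.foreach(r => println(s"product ${r.product} -> score ${r.rating}"))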
