If you are going to populate the entire user × article matrix with dwell-times, you are going to run into sparsity issues very quickly.
Also, a simple average of dwell-times is prone to many problems: for example, what if you have very few records, or if one user left her browser open for a month?
Step #1: Filling in the blanks
From my experience dealing with user dwell-time, the number of users who spend $t$ seconds viewing a site decreases sharply as $t$ increases.
I found that modelling user dwell-time with an exponential distribution is a good approximation.
Taking a Bayesian approach, with a Gamma distribution as the prior on each site's mean dwell-time, we get a familiar formula:
Harmonic mean:
$$\frac{n+m}{\frac{m}{b}+\frac{1}{t_1}+\dots+\frac{1}{t_n}}$$
where $t_1,\dots,t_n$ are the $n$ observed dwell-times, $b$ is the bias you introduce and $m$ is its strength.
For example, setting $b=3, m=2$ is like assuming two fictional users viewed a site for 3 seconds when we have no data for that user × article combination.
Also note that this formula is much more immune to outliers, since it assumes an exponential distribution (and not a Gaussian distribution, like the arithmetic mean does).
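For concreteness, here is a minimal Python sketch of that estimate for a single user × article cell (the function name and the example numbers are just for illustration):

```python
def smoothed_dwell_time(times, b=3.0, m=2.0):
    """Regularized harmonic-mean estimate of dwell-time.

    times -- observed dwell-times (seconds) for one user x article cell
    b     -- the bias: dwell-time assumed for the fictional observations
    m     -- the strength of the bias: number of fictional observations
    """
    n = len(times)
    return (n + m) / (m / b + sum(1.0 / t for t in times))

# With no data the estimate falls back to the bias:
print(smoothed_dwell_time([]))               # 3.0
# One user leaving the browser open for a month barely moves it,
# unlike an arithmetic mean:
print(smoothed_dwell_time([4, 5, 2592000]))  # ~4.5
```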
Step #2: Populating the matrix
Dwell-times are positive, and they have natural bounds (for example, a maximum of one day).
However, after the matrix factorization, any numeric value can appear in the matrix cells, including negative values.
The common practice is to populate the user × article matrix with
$$\operatorname{logit}(t)$$
where logit is the inverse of the sigmoid function (so $t$ must first be scaled into $(0,1)$, e.g. by dividing by the upper bound).
Then, when interpolating the dwell-time for a user $i$ and article $j$, we use
$$\operatorname{sigmoid}(\langle\vec{u_i},\vec{a_j}\rangle)$$
instead of the dot product alone.
This way we can be certain that the end result will be bounded to a range that makes sense.
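A small numpy sketch of that transform; the one-day bound and the helper names are assumptions for illustration, not part of any library:

```python
import numpy as np

MAX_T = 24 * 3600.0  # assumed upper bound on dwell-time: one day, in seconds

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def to_cell(t):
    """Dwell-time in seconds -> value stored in the user x article matrix."""
    p = np.clip(t / MAX_T, 1e-6, 1.0 - 1e-6)  # squeeze into (0, 1)
    return logit(p)

def predict_dwell(u_i, a_j):
    """Interpolated dwell-time from the latent vectors of user i and article j."""
    return sigmoid(np.dot(u_i, a_j)) * MAX_T  # always within (0, MAX_T)
```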
Let's start with an example: say you have data on the transaction details of customers in a store, i.e. who bought what and when. Clearly you don't have rating data, but you do have the understanding that if a person buys an item multiple times, he probably likes it (or would rate it highly if given the chance). Thus an implicit preference/rating could be "how many times someone buys something".
You can also bring in the time component, e.g. "A" buys chips 5 times this week, but "B" bought them 5 times last week. If your preference is "# purchases in the last month", you'd miss the information that "B" may be less likely to buy chips again than "A". So you can add a time-decay when aggregating the counts, as sketched below.
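A rough sketch of such a decay (the 30-day half-life and the record layout are made up for the example):

```python
from collections import defaultdict
from datetime import datetime

HALF_LIFE_DAYS = 30.0  # assumed: a purchase loses half its weight every 30 days

def decayed_counts(transactions, now):
    """transactions: iterable of (user, item, purchase_datetime) tuples.
    Returns {(user, item): time-decayed purchase count}."""
    prefs = defaultdict(float)
    for user, item, when in transactions:
        age_days = (now - when).total_seconds() / 86400.0
        prefs[(user, item)] += 0.5 ** (age_days / HALF_LIFE_DAYS)
    return prefs

# "A" bought chips 5 times this week, "B" bought them 5 times a year ago:
now = datetime(2015, 6, 1)
tx = ([("A", "chips", datetime(2015, 5, 30))] * 5 +
      [("B", "chips", datetime(2014, 6, 1))] * 5)
print(decayed_counts(tx, now))  # A's count stays near 5, B's shrinks toward 0
```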
In the RDD<Rating> you can provide these values directly. What trainImplicit() does is learn a value between 0 and 1 (p), which depends on how high the implicit preference is. So when you use the model to predict, you should expect values in the range 0 to 1, and not the counts that you provided in the RDD<Rating>.
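For concreteness, a sketch of that flow with PySpark's MLlib ALS (the counts and hyper-parameters below are placeholders, not recommendations):

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="implicit-als-sketch")

# (user, item, implicit preference) -- e.g. time-decayed purchase counts
ratings = sc.parallelize([
    Rating(0, 10, 4.8),  # user 0 bought item 10 about 5 times, recently
    Rating(1, 10, 0.3),  # user 1 bought item 10 a long time ago
    Rating(0, 11, 1.0),
])

model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)

# Predictions are confidence-like scores, roughly in the 0-1 range,
# not the raw counts that went into the RDD.
print(model.predict(0, 10))
```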
Best Answer
These low-rank approximations are quite hard to interpret. Moreover, once you've got your $X$ and $Y$ such that $X Y^T \approx Q$, you can apply a unitary transformation $U$ to them to obtain $X_* = X U$ and $Y_* = Y U$, which leads to $X_* Y_*^T = X U U^T Y^T = X Y^T$.
This means that there are many possible solutions (each $X U$ for a different $U$ is as good as any other). This also harms interpretability in my opinion: suppose there is some "true" or "interpretable" factorization that learns low-rank representations of items and users where the components of a user's/item's vector mean something like "how much comedy there is in this film" or "how much this user loves comedies", and these components are distinct and uncorrelated. With the above result, what you're most likely to get is a vector whose components are a mix of those desirable, interpretable components. Unfortunately, you can't "unmix" them because you don't know the mixing matrix $U$.
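A quick numpy check of that non-uniqueness (random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # user factors
Y = rng.normal(size=(50, 5))   # item factors

# Any orthogonal U yields an equally good factorization:
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))
X_star, Y_star = X @ U, Y @ U

print(np.allclose(X @ Y.T, X_star @ Y_star.T))  # True: same reconstruction,
                                                # but the components are mixed
```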
If you seek interpretability, you can try topic modeling techniques (collaborative filtering is mostly topic modeling where documents are users and words are items being recommended), in particular LDA. I'm not sure, though, whether it'll work well with implicit feedback.
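If you do want to try it, here is a minimal sketch using scikit-learn's LDA on a user × item count matrix (the shapes and parameters are arbitrary):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows = users ("documents"), columns = items ("words"),
# entries = implicit counts such as number of purchases.
counts = np.random.default_rng(0).poisson(0.3, size=(200, 50))

lda = LatentDirichletAllocation(n_components=8, random_state=0)
user_topics = lda.fit_transform(counts)  # per-user topic mixture, rows sum to 1
item_topics = lda.components_            # per-topic item weights

print(user_topics.shape, item_topics.shape)  # (200, 8) (8, 50)
```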
Efficient online updates are indeed troublesome. In general, you might want to re-train your model from scratch on all current data each night or once a week. In the meanwhile, you can do something like an online update each time you observe $q_{ui}$:
$$ x_u \leftarrow x_u - \alpha \frac{\partial l_{ui}}{\partial x_u} $$ $$ y_i \leftarrow y_i - \alpha \frac{\partial l_{ui}}{\partial y_i} $$
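As a sketch, assuming a plain squared loss with L2 regularization for $l_{ui}$ (the actual loss depends on your model), one such step could look like:

```python
import numpy as np

def online_update(x_u, y_i, q_ui, alpha=0.01, lam=0.1):
    """One SGD step on l_ui = (q_ui - x_u.y_i)^2 + lam * (|x_u|^2 + |y_i|^2).

    Returns updated copies of the user and item factor vectors."""
    err = q_ui - x_u @ y_i
    grad_x = -2.0 * err * y_i + 2.0 * lam * x_u  # d l_ui / d x_u
    grad_y = -2.0 * err * x_u + 2.0 * lam * y_i  # d l_ui / d y_i
    return x_u - alpha * grad_x, y_i - alpha * grad_y
```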
Both approaches are hacky and are not guaranteed to work. It'd be better to resort to some model that was designed with online updates in mind.