Solved – Collaborative filtering and implicit ratings; normalization

algorithms, normalization, recommender-system

I would like to use the time a user spends viewing an article as an implicit rating of how much the user likes the article.

My question is how to normalize this information across all users.

At the moment, I'm subtracting the user-specific mean from the time spent, and dividing by the user-specific standard deviation.
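For concreteness, that per-user normalization might look like the sketch below. The nested-dict layout, the function name `zscore_per_user`, and the sample numbers are made up for illustration, and the population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev


def zscore_per_user(dwell_times):
    """Normalize each user's dwell times by that user's own mean and std.

    dwell_times: dict mapping user -> dict mapping article -> seconds.
    Returns the same nested shape with z-scores in place of seconds.
    """
    normalized = {}
    for user, by_article in dwell_times.items():
        values = list(by_article.values())
        mu = mean(values)
        sigma = pstdev(values)  # population std; 0.0 when all values are equal
        if sigma == 0:
            # No spread for this user, so every article maps to 0.
            normalized[user] = {article: 0.0 for article in by_article}
        else:
            normalized[user] = {
                article: (t - mu) / sigma for article, t in by_article.items()
            }
    return normalized


# Hypothetical sample data: seconds spent per article.
example = {
    "alice": {"a1": 10.0, "a2": 20.0, "a3": 30.0},
    "bob": {"a1": 100.0, "a2": 300.0},
}
scores = zscore_per_user(example)
```

The resulting scores are centred on zero for each user but are not confined to any fixed interval.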

Is this the right way to go about it? It doesn't seem so, as the ratings can still take any values.

Maybe I should scale the ratings into some interval (like $[1, 10]$) afterwards?

Best Answer

If you are going to populate the entire user × article matrix with dwell times, you are going to run into sparsity issues very quickly.

Also, a simple average of dwell times is prone to many problems. For example, what if you have very few records, or one user left her browser open for a month?

Step #1: Filling in the blanks

In my experience with user dwell time, the number of users who spend $t$ seconds viewing a site decreases sharply as $t$ increases.

I have found that modelling user dwell time with an exponential distribution is a good approximation.

Taking a Bayesian approach, with a Gamma distribution as the prior on the mean of each site's dwell time, we get a familiar formula:

Harmonic mean:

$$\frac{n+m}{\frac{m}{b}+\frac{1}{t_1}+\dots+\frac{1}{t_n}}$$

where $t_i$ is the $i$-th observed dwell time (of $n$ in total), $b$ is the bias you introduce, and $m$ is its strength.

For example, setting $b=3, m=2$ is like assuming two fictional users each viewed a site for 3 seconds when we have no data for that user × article combination.

Note that this formula is much more robust to outliers, since it assumes the exponential distribution (and not the Gaussian distribution, as the arithmetic mean does).
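A minimal sketch of the smoothed harmonic mean, assuming dwell times in seconds; the function name `smoothed_harmonic_mean` and the sample values are hypothetical, and the defaults simply reuse the $b=3, m=2$ example.

```python
def smoothed_harmonic_mean(times, b=3.0, m=2.0):
    """Harmonic mean of `times` with m pseudo-observations of value b.

    Computes (n + m) / (m / b + sum(1 / t_i)) for the n observed times.
    """
    if b <= 0 or m < 0:
        raise ValueError("b must be positive and m must be non-negative")
    if any(t <= 0 for t in times):
        raise ValueError("dwell times must be positive")
    n = len(times)
    denominator = m / b + sum(1.0 / t for t in times)
    if denominator == 0:
        # Only happens when there are no observations and no prior (m == 0).
        raise ValueError("need at least one observation or m > 0")
    return (n + m) / denominator


# With no data, the estimate falls back to the prior value b.
no_data = smoothed_harmonic_mean([])

# Hypothetical sample: one reader left the tab open for about a month.
one_month = 30 * 24 * 60 * 60
with_outlier = smoothed_harmonic_mean([10.0, 12.0, one_month])
arithmetic = (10.0 + 12.0 + one_month) / 3
```

Here `with_outlier` stays in the range of a few seconds, while `arithmetic` is dominated by the single month-long visit.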

Step #2: Populating the matrix

Times are positive, and they have bounds that make sense (for example, a maximum of one day).

However, after the matrix factorization, any numeric value can appear in the matrix cells, including negative terms.

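The common practice is to populate the user × article matrix with $$\operatorname{logit}(t)$$ where logit is the inverse of the sigmoid function and $t$ is the dwell time rescaled into $(0, 1)$, since logit is only defined there.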

Then, when estimating the dwell time for a user $i$ and an article $j$, we use:

$$\operatorname{sigmoid}(\langle \vec{u}_i, \vec{a}_j \rangle)$$

instead of only using the dot product.

This way we can be certain that the end result is bounded within a range that makes sense.
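A rough sketch of the transform in both directions. The rescaling by a one-day cap (`MAX_DWELL`), the clipping constant `EPS`, the helper names, and the factor vectors are all assumptions for illustration; the factor vectors would normally come from whatever matrix factorization you run.

```python
import math

MAX_DWELL = 24 * 60 * 60  # assumed upper bound: one day, in seconds
EPS = 1e-6                # keeps the rescaled time strictly inside (0, 1)


def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_matrix_value(seconds):
    """Rescale a dwell time into (0, 1), then map it to the real line."""
    p = min(max(seconds / MAX_DWELL, EPS), 1.0 - EPS)
    return logit(p)


def predict_seconds(user_vec, article_vec):
    """Dot product of the factor vectors, squashed back into [0, MAX_DWELL]."""
    score = sum(u * a for u, a in zip(user_vec, article_vec))
    return MAX_DWELL * sigmoid(score)


# Round trip for a 10-minute visit.
cell = to_matrix_value(600.0)
recovered = MAX_DWELL * sigmoid(cell)

# Hypothetical factor vectors with an extreme dot product.
extreme = predict_seconds([10.0, -3.0], [5.0, 2.0])
```

However large or negative the dot product gets, the prediction cannot leave the chosen range.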
