Like Karl Broman said in his answer, a Bayesian approach would likely be a lot better than using confidence intervals.
The Problem With Confidence Intervals
Why might using confidence intervals not work too well? One reason is that if you don't have many ratings for an item, then your confidence interval is going to be very wide, so the lower bound of the confidence interval will be small. Thus, items without many ratings will end up at the bottom of your list.
Intuitively, however, you probably want items without many ratings to be near the average item, so you want to wiggle your estimated rating of the item toward the mean rating over all items (i.e., you want to push your estimated rating toward a prior). This is exactly what a Bayesian approach does.
Bayesian Approach I: Normal Distribution over Ratings
One way of moving the estimated rating toward a prior is, as in Karl's answer, to use an estimate of the form $w*R + (1-w)*C$:
- $R$ is the mean over the ratings for the items.
- $C$ is the mean over all items (or whatever prior you want to shrink your rating to).
- Note that the formula is just a weighted combination of $R$ and $C$.
- $w = \frac{v}{v+m}$ is the weight assigned to $R$, where $v$ is the number of reviews for the item and $m$ is some kind of constant "threshold" parameter.
- Note that when $v$ is very large, i.e., when we have a lot of ratings for the current item, then $w$ is very close to 1, so our estimated rating is very close to $R$ and we pay little attention to the prior $C$. When $v$ is small, however, $w$ is very close to 0, so the estimated rating places a lot of weight on the prior $C$.
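As a concrete illustration, the shrinkage estimate above can be sketched in a few lines of Python. The names (`shrunk_rating`, `m`, `global_mean`) are my own, and `m = 10` is an arbitrary choice of the threshold parameter, not a recommendation:

```python
def shrunk_rating(ratings, global_mean, m=10):
    """Estimate an item's rating, shrunk toward the global mean.

    ratings: list of ratings for this item
    global_mean: mean rating over all items (the prior C)
    m: threshold parameter -- roughly how many ratings an item
       needs before its own mean dominates the prior
    """
    v = len(ratings)
    if v == 0:
        return global_mean          # no data: fall back entirely on the prior
    R = sum(ratings) / v            # item's own mean rating
    w = v / (v + m)                 # weight on the item's own mean
    return w * R + (1 - w) * global_mean
```

With `m = 10`, an item with only two 5-star ratings is still pulled most of the way toward the global mean, exactly the behavior the paragraph above describes.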
This estimate can, in fact, be given a Bayesian interpretation as the posterior estimate of the item's mean rating when individual ratings come from a normal distribution centered around that mean.
However, assuming that ratings come from a normal distribution has two problems:
- A normal distribution is continuous, but ratings are discrete.
- Ratings for an item don't necessarily follow a unimodal Gaussian shape. For example, maybe your item is very polarizing, so people tend to either give it a very high rating or give it a very low rating.
Bayesian Approach II: Multinomial Distribution over Ratings
So instead of assuming a normal distribution for ratings, let's assume a multinomial distribution. That is, given some specific item, there's a probability $p_1$ that a random user will give it 1 star, a probability $p_2$ that a random user will give it 2 stars, and so on.
Of course, we have no idea what these probabilities are. As we get more and more ratings for this item, we can guess that $p_1$ is close to $\frac{n_1}{n}$, where $n_1$ is the number of users who gave it 1 star and $n$ is the total number of users who rated the item, but when we first start out, we have nothing. So we place a Dirichlet prior $Dir(\alpha_1, \ldots, \alpha_k)$ on these probabilities.
What is this Dirichlet prior? We can think of each $\alpha_i$ parameter as being a "virtual count" of the number of times some virtual person gave the item $i$ stars. For example, if $\alpha_1 = 2$, $\alpha_2 = 1$, and all the other $\alpha_i$ are equal to 0, then we can think of this as saying that two virtual people gave the item 1 star and one virtual person gave the item 2 stars. So before we even get any actual users, we can use this virtual distribution to provide an estimate of the item's rating.
[One way of choosing the $\alpha_i$ parameters would be to set $\alpha_i$ equal to the overall proportion of votes of $i$ stars. (Note that the $\alpha_i$ parameters aren't necessarily integers.)]
Then, once actual ratings come in, simply add their counts to the virtual counts of your Dirichlet prior. Whenever you want to estimate the rating of your item, simply take the mean over all of the item's ratings (both its virtual ratings and its actual ratings).
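The whole Dirichlet-multinomial recipe above fits in a few lines. This is a sketch under my own naming (`dirichlet_rating`, `counts`, `alpha`); `counts[i]` is the number of actual `(i+1)`-star ratings and `alpha[i]` is the corresponding virtual count from the prior:

```python
def dirichlet_rating(counts, alpha):
    """Posterior mean rating: average over virtual plus actual ratings.

    counts: actual number of ratings per star level (index 0 = 1 star)
    alpha:  Dirichlet prior parameters, i.e. virtual counts per star level
    """
    # Posterior "counts" are just prior virtual counts plus observed counts.
    posterior = [a + c for a, c in zip(alpha, counts)]
    total = sum(posterior)
    # Expected star value, where level i corresponds to (i + 1) stars.
    return sum((i + 1) * p for i, p in enumerate(posterior)) / total
```

With a uniform prior (`alpha = [1, 1, 1, 1, 1]`) and no real ratings, the estimate starts at 3 stars and then moves toward the empirical distribution as actual ratings accumulate.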
It's easy to think of the following 'workaround', which adapts a multi-level rating system to the 'upvote/downvote' solution discussed in the linked article:
Let's say you have the popular 5 star rating system. So we have a number of votes, each having a value of: 1, 2, 3, 4 or 5.
To 'convert' these ratings to up/down votes, use the following rule. For a rating of:
- 1 star: add 0.00 to up votes and 1.00 to down votes (i.e., a full down vote)
- 2 stars: add 0.25 to up votes and 0.75 to down votes
- 3 stars: add 0.50 to up votes and 0.50 to down votes
- 4 stars: add 0.75 to up votes and 0.25 to down votes
- 5 stars: add 1.00 to up votes and 0.00 to down votes (i.e., a full up vote)
After we reduce the 5 star ratings to up/down ratings, we can proceed with the usual score calculations described in Evan Miller's article.
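As a sketch, the conversion rule plus the Wilson score lower bound (the ranking score from Evan Miller's article) might look like this in Python. The function names are mine, and `z = 1.96` corresponds to the usual 95% confidence level:

```python
import math

def to_up_down(stars):
    """Convert a list of 1-5 star ratings to fractional (up, down) totals."""
    # 1 star contributes 0.0 up votes, 5 stars contributes 1.0 up votes.
    up = sum((s - 1) / 4 for s in stars)
    down = len(stars) - up
    return up, down

def wilson_lower_bound(up, down, z=1.96):
    """Lower bound of the Wilson score interval for the 'up' proportion."""
    n = up + down
    if n == 0:
        return 0.0
    phat = up / n
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / denom
```

Ranking items by `wilson_lower_bound(*to_up_down(stars))` then behaves as the article describes: an item with many positive ratings outranks one with the same average but fewer ratings.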
I am not a statistician or mathematician, so I would love to hear from other people whether this makes sense or not.
Best Answer
One way to cast your problem would be to treat it as a Bayesian estimation problem.
Basically, this means placing a prior on your mean and updating the mean with each new observation over time.
A practical, yet theoretically disputable way to achieve this is to compute the mean as a function of the mean found in the corpus and the actual observations you have for this item. More precisely, in the recommender system setting, this could mean that you initialize the mean to the mean of the category of the item you're dealing with (in your example "statistics books" probably) and then update it each time a user gives a rating to this particular item.
You can design a clever update rule that has statistical foundations or rely on common sense to quickly produce a basic update rule like this one:
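One common-sense rule of this kind (my own sketch, not taken from the answer) is to seed the item's running mean with the category mean, counted as some number of virtual ratings, and then fold in each real rating as it arrives. The class and parameter names, and the choice of `prior_weight = 5`, are illustrative assumptions:

```python
class RunningRating:
    """Running mean seeded with a category mean as virtual ratings."""

    def __init__(self, category_mean, prior_weight=5):
        # Treat the category mean as `prior_weight` virtual ratings.
        self.total = category_mean * prior_weight
        self.count = prior_weight

    def update(self, rating):
        """Fold in one new user rating and return the updated mean."""
        self.total += rating
        self.count += 1
        return self.mean()

    def mean(self):
        return self.total / self.count
```

Before any ratings arrive, the estimate equals the category mean; each update nudges it toward the observed ratings, with the prior's pull fading as real ratings accumulate.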
When dealing with this kind of problem in general, I recommend reading the work of Koren et al. on the Netflix Prize challenge. They gained quite a bit of performance by using unsupervised learning on user and content variables; the idea of using the category mean is a similar, yet naive, cousin.