Confidence Interval – How to Find Confidence Intervals for Ratings

Tags: confidence-interval, estimation

Evan Miller's "How Not to Sort by Average Rating" proposes using the lower bound of a confidence interval to get a sensible aggregate "score" for rated items. However, it's working with a Bernoulli model: ratings are either thumbs up or thumbs down.

What's a reasonable confidence interval to use for a rating model which assigns a discrete score of $1$ to $k$ stars, assuming that the number of ratings for an item might be small?

I think I can see how to adapt the centre of the Wilson and Agresti-Coull intervals as

$$\tilde{p} = \frac{\sum_{i=1}^n{x_i} + z_{\alpha/2}^2\; p_0}{n + z_{\alpha/2}^2}$$

where either $p_0 = \frac{k+1}{2}$ or (probably better) it's the average rating over all items. However, I'm not sure how to adapt the width of the interval. My (revised) best guess would be

$$\tilde{p} \pm \frac{z_{\alpha/2}}{\tilde{n}} \sqrt{\frac{\sum_{i=1}^n{(x_i - \tilde{p})^2} + z_{\alpha/2}(p_0-\tilde{p})^2}{\tilde{n}}}$$

with $\tilde{n} = n + z_{\alpha/2}^2$, but I can't justify it with more than hand-waving as an analogy to Agresti-Coull, taking that as

$$\text{Estimate}(\bar{X}) \pm \frac{z_{\alpha/2}}{\tilde{n}} \sqrt{\text{Estimate}(\text{Var}(X))}$$

Are there standard confidence intervals which apply? (Note that I don't have subscriptions to any journals or easy access to a university library; by all means give proper references, but please supplement with the actual result!)
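For concreteness, here is the adapted estimate and interval transcribed into code. This is just a direct transcription of the hand-wavy formulas above, not a justified procedure; the function name and default $z$ value are illustrative only.

```python
import math

def adapted_interval(ratings, p0, z=1.96):
    """Sketch of the Agresti-Coull-style interval proposed above.

    ratings -- list of star ratings (integers 1..k) for one item
    p0      -- prior mean rating, e.g. (k+1)/2 or the global average
    z       -- normal quantile z_{alpha/2} (1.96 for ~95%)
    """
    n = len(ratings)
    n_tilde = n + z**2
    # Adapted centre: the rating total plus z^2 pseudo-observations at p0.
    p_tilde = (sum(ratings) + z**2 * p0) / n_tilde
    # Adapted width, using the question's (hand-wavy) variance estimate.
    var_est = (sum((x - p_tilde)**2 for x in ratings)
               + z * (p0 - p_tilde)**2) / n_tilde
    half_width = z / n_tilde * math.sqrt(var_est)
    return p_tilde - half_width, p_tilde + half_width
```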

Best Answer

As Karl Broman said in his answer, a Bayesian approach would likely be a lot better than using confidence intervals.

The Problem With Confidence Intervals

Why might using confidence intervals not work too well? One reason is that if you don't have many ratings for an item, then your confidence interval is going to be very wide, so the lower bound of the confidence interval will be small. Thus, items without many ratings will end up at the bottom of your list.

Intuitively, however, you probably want items without many ratings to be treated as near-average, so you want to nudge your estimated rating of the item toward the mean rating over all items (i.e., you want to push your estimated rating toward a prior). This is exactly what a Bayesian approach does.

Bayesian Approach I: Normal Distribution over Ratings

One way of moving the estimated rating toward a prior is, as in Karl's answer, to use an estimate of the form $w*R + (1-w)*C$:

  • $R$ is the mean of the ratings for the item.
  • $C$ is the mean rating over all items (or whatever prior you want to shrink your rating toward).
  • Note that the formula is just a weighted combination of $R$ and $C$.
  • $w = \frac{v}{v+m}$ is the weight assigned to $R$, where $v$ is the number of ratings for the item and $m$ is some kind of constant "threshold" parameter.
  • Note that when $v$ is very large, i.e., when we have a lot of ratings for the current item, then $w$ is very close to 1, so our estimated rating is very close to $R$ and we pay little attention to the prior $C$. When $v$ is small, however, $w$ is very close to 0, so the estimated rating places a lot of weight on the prior $C$.
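As a sketch, the weighted estimate above is a one-liner to compute. The function name and the value of $m$ here are purely illustrative; $m$ is the tuning parameter you'd choose for your application.

```python
def shrunk_rating(item_ratings, global_mean, m=5.0):
    """Weighted (shrunk) rating estimate: w*R + (1-w)*C.

    item_ratings -- list of ratings for this item
    global_mean  -- C, the mean rating over all items (the prior)
    m            -- threshold parameter: roughly, how many ratings an
                    item needs before its own mean dominates the prior
    """
    v = len(item_ratings)
    if v == 0:
        return global_mean  # no data: fall back entirely on the prior
    R = sum(item_ratings) / v
    w = v / (v + m)
    return w * R + (1 - w) * global_mean
```

With $v = m$ the weight is exactly $\frac{1}{2}$, so the estimate sits halfway between the item's mean and the prior.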

This estimate can, in fact, be given a Bayesian interpretation as the posterior estimate of the item's mean rating when individual ratings come from a normal distribution centered around that mean.

However, assuming that ratings come from a normal distribution has two problems:

  • A normal distribution is continuous, but ratings are discrete.
  • Ratings for an item don't necessarily follow a unimodal Gaussian shape. For example, maybe your item is very polarizing, so people tend to either give it a very high rating or give it a very low rating.

Bayesian Approach II: Multinomial Distribution over Ratings

So instead of assuming a normal distribution for ratings, let's assume a multinomial distribution. That is, given some specific item, there's a probability $p_1$ that a random user will give it 1 star, a probability $p_2$ that a random user will give it 2 stars, and so on.

Of course, we have no idea what these probabilities are. As we get more and more ratings for this item, we can guess that $p_1$ is close to $\frac{n_1}{n}$, where $n_1$ is the number of users who gave it 1 star and $n$ is the total number of users who rated the item, but when we first start out, we have nothing. So we place a Dirichlet prior $Dir(\alpha_1, \ldots, \alpha_k)$ on these probabilities.

What is this Dirichlet prior? We can think of each $\alpha_i$ parameter as being a "virtual count" of the number of times some virtual person gave the item $i$ stars. For example, if $\alpha_1 = 2$, $\alpha_2 = 1$, and all the other $\alpha_i$ are equal to 0, then we can think of this as saying that two virtual people gave the item 1 star and one virtual person gave the item 2 stars. So before we even get any actual users, we can use this virtual distribution to provide an estimate of the item's rating.

[One way of choosing the $\alpha_i$ parameters would be to set $\alpha_i$ equal to the overall proportion of votes of $i$ stars. (Note that the $\alpha_i$ parameters aren't necessarily integers.)]

Then, once actual ratings come in, simply add their counts to the virtual counts of your Dirichlet prior. Whenever you want to estimate the rating of your item, simply take the mean over all of the item's ratings (both its virtual ratings and its actual ratings).
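Putting this together, here is a minimal sketch of the Dirichlet-multinomial update. The function name is mine, and the $\alpha_i$ values in any real use would come from your overall vote proportions as suggested above.

```python
def dirichlet_mean_rating(counts, alphas):
    """Posterior mean star rating under a Dirichlet prior.

    counts -- observed ratings: counts[i] users gave (i+1) stars
    alphas -- Dirichlet 'virtual counts', same length as counts
    """
    # Posterior counts are just virtual counts plus observed counts.
    posterior = [c + a for c, a in zip(counts, alphas)]
    total = sum(posterior)
    # Mean rating: average star level weighted by posterior counts.
    return sum((i + 1) * p for i, p in enumerate(posterior)) / total
```

Note that with no actual ratings this returns the prior's mean rating, and as real ratings accumulate they swamp the virtual counts, so the estimate converges to the item's empirical mean.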
