Ordinal Data – Is Amazon’s Average Rating Misleading? Insights from Mean and Likert Data

likert, mean, ordinal-data

If I understand correctly, book ratings on a 1-5 scale are Likert scores. That is, a 3 for me may not mean the same thing as a 3 for someone else. It's an ordinal scale IMO. One shouldn't really average ordinal data, but one can certainly take the mode, median, and percentiles.

So is it 'okay' to bend the rules, given that a much larger share of the population understands means than understands the statistics above? Although the research community strongly objects to averaging Likert-scale data, is it fine to do this for the masses (practically speaking)? Is taking the average in this case even misleading to begin with?

It seems unlikely that a company like Amazon would fumble basic statistics, but if they haven't, what am I missing here? Can we claim that the ordinal scale is a convenient approximation to an interval scale, and use that to justify taking the mean? On what grounds?

Best Answer

Benefits of using the mean to summarise the central tendency of a 5-point rating

As @gung mentioned, I think there are often very good reasons for taking the mean of a five-point item as an index of central tendency. I have already outlined these reasons here.

To paraphrase:

  1. The mean is easy to calculate
  2. The mean is intuitive and well understood
  3. The mean is a single number
  4. Other indices often yield a similar rank ordering of objects

Why the mean is good for Amazon

Think about the goals of Amazon in reporting the mean. They might be aiming to

  • provide an intuitive and understandable rating for an item
  • ensure user acceptance of the rating system
  • ensure that people understand what the rating means so they can use it appropriately to inform purchasing decisions

Amazon provides some sort of rounded mean, frequency counts for each rating option, and the sample size (i.e., the number of ratings). This information is presumably enough for most people to appreciate both the general sentiment regarding the item and how much confidence to place in the rating (e.g., a 4.5 from 20 ratings is more likely to be accurate than a 4.5 from 2 ratings; an item with ten 5-star ratings and one 1-star rating with no comment might still be a good item).
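
To make that concrete, here is a minimal sketch (my own illustration, not Amazon's code; the function name and the rating counts are made up) of how the mean and its standard error fall out of the per-star frequency counts. The standard error is what makes a 4.5 based on 20 ratings more trustworthy than a 4.5 based on 2:

    from math import sqrt

    def summarise_ratings(counts):
        """Summarise a 1-5 star rating from per-star frequency counts.

        `counts` maps each star value (1-5) to the number of ratings at that
        value. Returns the sample size, the mean, and the standard error of
        the mean, which shrinks roughly as 1/sqrt(n) as ratings accumulate.
        Assumes at least two ratings in total.
        """
        n = sum(counts.values())
        mean = sum(star * k for star, k in counts.items()) / n
        variance = sum(k * (star - mean) ** 2 for star, k in counts.items()) / (n - 1)
        return n, mean, sqrt(variance / n)

    # Hypothetical item: ten 5-star ratings and one 1-star rating, as in the
    # example above.
    n, mean, se = summarise_ratings({1: 1, 2: 0, 3: 0, 4: 0, 5: 10})
    print(f"n={n}, mean={mean:.2f}, standard error={se:.2f}")
    # n=11, mean=4.64, standard error=0.36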

You could even see the mean as a democratic option. Many elections are effectively decided by which candidate gets the highest mean on a two-point scale. Similarly, if you take the view that each person who submits a review gets a vote, then the mean is a summary that weights each person's vote equally.
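
The analogy rests on a simple identity: if each vote is coded as 0 or 1, a candidate's mean score is exactly their share of the vote. A tiny illustration with made-up votes:

    votes = [1, 0, 1, 1, 0, 1, 1]   # 1 = vote for the candidate, 0 = against
    print(sum(votes) / len(votes))  # mean on a 0/1 scale = vote share (5/7 ≈ 0.71)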

Are differences in scale use really a problem?

A wide range of rating biases is documented in the psychological literature (for a review, see Saal et al., 1980), such as central tendency bias, leniency bias, and strictness bias. In addition, some raters will be more arbitrary and others more reliable. Some may even systematically lie, giving fake positive or fake negative reviews. All of this creates various forms of error when trying to estimate the true mean rating for an item.

However, if you were to take a random sample of the population, such biases would tend to cancel out, and with a sufficiently large sample of raters you would still recover something close to the true mean.

Of course, you don't get a random sample on Amazon, and there is a risk that the particular set of raters an item attracts is systematically biased towards being more lenient, more strict, and so on. That said, I think Amazon's users appreciate that user-submitted ratings come from an imperfect sample. I also think it's quite likely that, with a reasonable sample size, many of the response-bias differences would start to wash out.
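
To see how that washing out works, here is a toy simulation (everything in it is hypothetical: the true score, the set of rater biases, and the assumption that lenient and strict raters are equally common). With only a handful of raters the sample mean bounces around; as the number of ratings grows it settles near the true score:

    import random

    random.seed(1)

    true_score = 3.8                       # hypothetical "true" quality of the item
    biases = [-1.0, -0.5, 0.0, 0.5, 1.0]   # strict ... lenient rater tendencies

    def observed_rating(true_score, bias):
        """One rater's rating: the true score shifted by the rater's personal
        bias, rounded to a whole star and clipped to the 1-5 scale."""
        return min(5, max(1, round(true_score + bias)))

    for n in (5, 50, 500, 5000):
        ratings = [observed_rating(true_score, random.choice(biases)) for _ in range(n)]
        print(f"n={n:5d}  sample mean={sum(ratings) / n:.2f}")

In this toy setup the biases are symmetric, so they cancel in expectation; if lenient raters outnumbered strict ones, or ratings piled up against the top of the scale, some bias would remain however many ratings were collected, which is the imperfect-sample caveat above.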

Possible advances beyond the mean

In terms of improving the accuracy of the rating, I wouldn't challenge the general concept of the mean; rather, I think there are other ways of estimating the true population mean rating for an item (i.e., the mean rating that would be obtained if a large representative sample were asked to rate the item):

  • Weight raters based on their trustworthiness
  • Use a Bayesian rating system that estimates an item's rating as a weighted sum of the average rating across all items and the mean for the specific item, with the weight on the specific item increasing as its number of ratings increases (see the sketch after this list)
  • Adjust each rater's ratings for their general rating tendency across items (e.g., a 5 from someone who typically gives 3s would be worth more than a 5 from someone who typically gives 4s)
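
Here is a minimal sketch of the Bayesian-style option in the second bullet (my own illustration, not Amazon's actual formula; the global mean and prior strength are made-up numbers): treat the catalogue-wide mean as a prior worth some number of pseudo-ratings, so items with few ratings are pulled towards it while items with many ratings are dominated by their own data.

    def shrunk_rating(item_sum, item_count, global_mean, prior_strength):
        """Shrinkage estimate of an item's rating: the global mean acts as a
        prior worth `prior_strength` pseudo-ratings."""
        return (item_sum + prior_strength * global_mean) / (item_count + prior_strength)

    global_mean, prior_strength = 3.9, 10   # hypothetical catalogue-wide values

    # Two items that both average 4.5, one with 2 ratings and one with 200.
    print(shrunk_rating(item_sum=2 * 4.5, item_count=2,
                        global_mean=global_mean, prior_strength=prior_strength))    # 4.0
    print(shrunk_rating(item_sum=200 * 4.5, item_count=200,
                        global_mean=global_mean, prior_strength=prior_strength))    # ~4.47

The larger the prior strength, the more ratings an item needs before its own data dominates; choosing that value is the main design decision in such a system.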

Thus, if accuracy in rating were Amazon's primary goal, I think it should endeavour to increase the number of ratings per item and adopt some of the strategies above. Such approaches might be particularly relevant when creating "best-of" rankings. For the humble rating on a product page, however, the simple sample mean may well better meet the goals of simplicity and transparency.

References

  • Saal, F.E., Downey, R.G., & Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413.