Ordinal Data – Is Amazon’s Average Rating Misleading? Insights from Mean and Likert Data

likert, mean, ordinal-data

If I understand correctly, book ratings on a 1-5 scale are Likert scores. That is, a 3 for me may not mean the same thing as a 3 for someone else. It's an ordinal scale IMO. One shouldn't really average ordinal data, but one can certainly take the mode, median, and percentiles.

So is it 'okay' to bend the rules, given that a much larger share of the population understands means than understands the statistics above? Although the research community strongly objects to averaging Likert-scale data, is it fine to do this for the masses (practically speaking)? Is taking the average in this case even misleading to begin with?

It seems unlikely that a company like Amazon would fumble basic statistics, but if they haven't, what am I missing here? Can we claim that the ordinal scale is a convenient approximation to an interval scale, and use that to justify taking the mean? On what grounds?

Best Answer

Benefits of using the mean to summarise the central tendency of a 5-point rating

As @gung mentioned, I think there are often very good reasons for taking the mean of a five-point item as an index of central tendency. I have already outlined these reasons here.

To paraphrase:

  1. The mean is easy to calculate
  2. The mean is intuitive and well understood
  3. The mean is a single number
  4. Other indices often yield a similar rank ordering of objects

Why the mean is good for Amazon

Think about the goals of Amazon in reporting the mean. They might be aiming to

  • provide an intuitive and understandable rating for an item
  • ensure user acceptance of the rating system
  • ensure that people understand what the rating means so they can use it appropriately to inform purchasing decisions

Amazon provides some sort of rounded mean, frequency counts for each rating option, and the sample size (i.e., the number of ratings). This information is presumably enough for most people to appreciate both the general sentiment regarding the item and how much confidence to place in the rating (e.g., a 4.5 from 20 ratings is more likely to be accurate than a 4.5 from 2 ratings; an item with ten 5-star ratings and one 1-star rating with no comment might still be a good item).
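
To make that concrete, here is a minimal sketch (my own illustration, not Amazon's code; the function name and the rating counts are made up) of how the mean and its standard error fall out of the per-star frequency counts. The standard error is what makes a 4.5 based on 20 ratings more trustworthy than a 4.5 based on 2:

    from math import sqrt

    def summarise_ratings(counts):
        """Summarise a 1-5 star rating from per-star frequency counts.

        `counts` maps each star value (1-5) to the number of ratings at that
        value. Returns the sample size, the mean, and the standard error of
        the mean, which shrinks roughly as 1/sqrt(n) as ratings accumulate.
        Assumes at least two ratings in total.
        """
        n = sum(counts.values())
        mean = sum(star * k for star, k in counts.items()) / n
        variance = sum(k * (star - mean) ** 2 for star, k in counts.items()) / (n - 1)
        return n, mean, sqrt(variance / n)

    # Hypothetical item: ten 5-star ratings and one 1-star rating, as in the
    # example above.
    n, mean, se = summarise_ratings({1: 1, 2: 0, 3: 0, 4: 0, 5: 10})
    print(f"n={n}, mean={mean:.2f}, standard error={se:.2f}")
    # n=11, mean=4.64, standard error=0.36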

You could even see the mean as a democratic option. Many elections are effectively decided by which candidate gets the highest mean on a two-point scale. Similarly, if you take the view that each person who submits a review gets a vote, then the mean is a summary that weights each person's vote equally.
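
The analogy rests on a simple identity: if each vote is coded as 0 or 1, a candidate's mean score is exactly their share of the vote. A tiny illustration with made-up votes:

    votes = [1, 0, 1, 1, 0, 1, 1]   # 1 = vote for the candidate, 0 = against
    print(sum(votes) / len(votes))  # mean on a 0/1 scale = vote share (5/7 ≈ 0.71)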

Are differences in scale use really a problem?

A wide range of rating biases is documented in the psychological literature (for a review, see Saal et al., 1980), such as central tendency bias, leniency bias, and strictness bias. In addition, some raters will be more arbitrary and others more reliable. Some may even systematically lie, giving fake positive or fake negative reviews. All of this creates various forms of error when trying to estimate the true mean rating for an item.

However, if you were to take a random sample of the population, such biases would tend to cancel out, and with a sufficiently large sample of raters you would still recover something close to the true mean.

Of course, you don't get a random sample on Amazon, and there is a risk that the particular set of raters an item attracts is systematically biased towards being more lenient, more strict, and so on. That said, I think Amazon's users appreciate that user-submitted ratings come from an imperfect sample. I also think it's quite likely that, with a reasonable sample size, many of the response-bias differences would start to wash out.
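
To see how that washing out works, here is a toy simulation (everything in it is hypothetical: the true score, the set of rater biases, and the assumption that lenient and strict raters are equally common). With only a handful of raters the sample mean bounces around; as the number of ratings grows it settles near the true score:

    import random

    random.seed(1)

    true_score = 3.8                       # hypothetical "true" quality of the item
    biases = [-1.0, -0.5, 0.0, 0.5, 1.0]   # strict ... lenient rater tendencies

    def observed_rating(true_score, bias):
        """One rater's rating: the true score shifted by the rater's personal
        bias, rounded to a whole star and clipped to the 1-5 scale."""
        return min(5, max(1, round(true_score + bias)))

    for n in (5, 50, 500, 5000):
        ratings = [observed_rating(true_score, random.choice(biases)) for _ in range(n)]
        print(f"n={n:5d}  sample mean={sum(ratings) / n:.2f}")

In this toy setup the biases are symmetric, so they cancel in expectation; if lenient raters outnumbered strict ones, or ratings piled up against the top of the scale, some bias would remain however many ratings were collected, which is the imperfect-sample caveat above.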

Possible advances beyond the mean

In terms of improving the accuracy of the rating, I wouldn't challenge the general concept of the mean; rather, I think there are other ways of estimating the true population mean rating for an item (i.e., the mean rating that would be obtained if a large representative sample were asked to rate the item):

  • Weight raters based on their trustworthiness
  • Use a Bayesian rating system that estimates an item's rating as a weighted sum of the average rating across all items and the mean for the specific item, with the weight on the specific item increasing as its number of ratings increases (see the sketch after this list)
  • Adjust each rater's ratings for their general rating tendency across items (e.g., a 5 from someone who typically gives 3s would be worth more than a 5 from someone who typically gives 4s)
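
Here is a minimal sketch of the Bayesian-style option in the second bullet (my own illustration, not Amazon's actual formula; the global mean and prior strength are made-up numbers): treat the catalogue-wide mean as a prior worth some number of pseudo-ratings, so items with few ratings are pulled towards it while items with many ratings are dominated by their own data.

    def shrunk_rating(item_sum, item_count, global_mean, prior_strength):
        """Shrinkage estimate of an item's rating: the global mean acts as a
        prior worth `prior_strength` pseudo-ratings."""
        return (item_sum + prior_strength * global_mean) / (item_count + prior_strength)

    global_mean, prior_strength = 3.9, 10   # hypothetical catalogue-wide values

    # Two items that both average 4.5, one with 2 ratings and one with 200.
    print(shrunk_rating(item_sum=2 * 4.5, item_count=2,
                        global_mean=global_mean, prior_strength=prior_strength))    # 4.0
    print(shrunk_rating(item_sum=200 * 4.5, item_count=200,
                        global_mean=global_mean, prior_strength=prior_strength))    # ~4.47

The larger the prior strength, the more ratings an item needs before its own data dominates; choosing that value is the main design decision in such a system.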

Thus, if accuracy in rating were Amazon's primary goal, I think it should endeavour to increase the number of ratings per item and adopt some of the strategies above. Such approaches might be particularly relevant when creating "best-of" rankings. For the humble rating on a product page, however, the simple sample mean may well better meet the goals of simplicity and transparency.

References

  • Saal, F.E., Downey, R.G., & Lahey, M.A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413.