If I understand correctly, book ratings on a 1-5 scale are Likert scores. That is, a 3 for me may not necessarily be a 3 for someone else. It's an ordinal scale IMO. One shouldn't really average ordinal scales but can definitely take the mode, median and percentiles.
So is it 'okay' to bend the rules because most of the population understands means better than the statistics above? Although the research community strongly rebukes taking averages of Likert-scale data, is it fine to do this for a mass audience (practically speaking)? Is taking the average in this case even misleading to start with?
It seems unlikely that a company like Amazon would fumble basic statistics, but if it hasn't, then what am I missing here? Can we claim that the ordinal scale is a convenient approximation to an interval scale to justify taking the mean? On what grounds?
Best Answer
Benefits of using the mean to summarise central tendency of a 5 point rating
As @gung mentioned, I think there are often very good reasons for taking the mean of a five-point item as an index of central tendency. I have already outlined these reasons here.
To paraphrase:
Why the mean is good for Amazon
Think about the goals of Amazon in reporting the mean. They might be aiming to
Amazon provides some sort of rounded mean, frequency counts for each rating option, and the sample size (i.e., number of ratings). This information is presumably enough for most people to appreciate both the general sentiment regarding the item and how much confidence to place in the rating (e.g., a 4.5 with 20 ratings is more likely to be accurate than a 4.5 with 2 ratings; an item with ten 5-star ratings and one 1-star rating with no comments might still be a good item).
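To make the point about sample size concrete, here is a minimal sketch of the kind of summary described above: a rounded mean, per-star counts, and the number of ratings, plus a standard error as one rough indication of how much the mean might move with more ratings. The function name `summarise` and the example data are made up for illustration; this is not a description of how Amazon actually computes anything.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, stdev


def summarise(ratings):
    """Summarise a list of 1-5 star ratings (hypothetical helper)."""
    n = len(ratings)
    counts = Counter(ratings)
    # The standard error of the mean shrinks roughly with sqrt(n);
    # it is undefined for a single rating.
    se = stdev(ratings) / sqrt(n) if n > 1 else float("nan")
    return {
        "mean": round(mean(ratings), 1),
        "median": median(ratings),
        "counts": {star: counts.get(star, 0) for star in range(1, 6)},
        "n": n,
        "standard_error": se,
    }


# Illustrative data only: the same mean of 4.5 from 2 and from 20 ratings.
few = [4, 5]
many = [4, 5] * 10
# Ten 5-star ratings and a single 1-star rating.
mostly_five = [5] * 10 + [1]

for label, data in [("few", few), ("many", many), ("mostly_five", mostly_five)]:
    print(label, summarise(data))
```

Both `few` and `many` report the same rounded mean, but the standard error for `many` is much smaller, which is the intuition behind trusting a 4.5 from 20 ratings more than a 4.5 from 2. The per-star counts also show at a glance that the `mostly_five` item has one outlying rating rather than mixed opinions.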
You could even see the mean as a democratic option. Many elections are decided based on which candidate gets the highest mean on a two-point scale. Similarly, if you take the argument that each person who submits a review gets a vote, then you can see the mean as a form that weights each person's vote equally.
Are differences in scale use really a problem?
There is a wide range of rating biases known in the psychological literature (for a review, see Saal et al., 1980), such as central tendency bias, leniency bias, and strictness bias. Also, some raters will be more arbitrary and some will be more reliable. Some may even systematically lie, giving fake positive or fake negative reviews. This will create various forms of error when trying to calculate the true mean rating for an item.
However, if you were to take a random sample of the population, such biases would cancel out, and with a sufficient sample size of raters, you would still get the true mean.
Of course, you don't get a random sample on Amazon, and there is the risk that the particular set of raters you get for an item is systematically biased to be more lenient or strict and so on. That said, I think users of Amazon would appreciate that user-submitted ratings come from an imperfect sample. I also think it's quite likely that, with a reasonable sample size, the majority of response bias differences would start to disappear in many cases.
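The cancelling-out argument can be illustrated with a small simulation. This is only a sketch under strong assumptions that are mine, not taken from the literature cited above: each rater has an additive, normally distributed leniency/strictness bias, the item's "true" score is 3, and ratings are rounded and clipped to the 1-5 range. The names `simulate_mean` and `spread_of_means` and all parameter values are hypothetical.

```python
import random
from statistics import mean, stdev


def simulate_mean(true_score, n_raters, rng, bias_mean=0.0):
    """Mean rating from n_raters simulated raters (illustrative model only)."""
    ratings = []
    for _ in range(n_raters):
        # Assumed rater-specific leniency (+) or strictness (-).
        bias = rng.gauss(bias_mean, 1.0)
        # Assumed random, non-systematic error.
        noise = rng.gauss(0.0, 0.5)
        raw = true_score + bias + noise
        # Force the result onto the 1-5 star scale.
        ratings.append(min(5, max(1, round(raw))))
    return mean(ratings)


def spread_of_means(n_raters, rng, replications=200):
    """Standard deviation of the sample mean across repeated samples."""
    return stdev(simulate_mean(3.0, n_raters, rng) for _ in range(replications))


rng = random.Random(0)

# Random sample of raters: biases average to zero in the population.
print("spread with 5 raters:  ", spread_of_means(5, rng))
print("spread with 500 raters:", spread_of_means(500, rng))

# Self-selected, lenient sample: the bias no longer averages to zero,
# so adding raters does not bring the mean back to the true score.
print("lenient sample mean:   ", simulate_mean(3.0, 20000, rng, bias_mean=0.5))
```

In this toy model the sample mean settles near the true score as the number of raters grows when biases are centred on zero, whereas a systematically lenient pool of raters stays off-target no matter how many ratings are collected. That mirrors the distinction drawn above between random rater differences and a non-random sample.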
Possible advances beyond the mean
In terms of improving the accuracy of the rating, I wouldn't challenge the general concept of the mean, but rather I think there are other ways of estimating the true population mean rating for an item (i.e., the mean rating that would be obtained were a large representative sample asked to rate the item).
Thus, if accuracy in rating was the primary goal of Amazon, I think it should endeavour to increase the number of ratings per item and adopt some of the above strategies. Such approaches might be particularly relevant when creating "best-of" rankings. However, for the humble rating on the page, it may well be that the sample mean better meets the goals of simplicity and transparency.
References