Benefits of using the mean to summarise central tendency of a 5 point rating
As @gung mentioned, I think there are often very good reasons for taking the mean of a five-point item as an index of central tendency. I have already outlined these reasons here.
To paraphrase:
- The mean is easy to calculate
- The mean is intuitive and well understood
- The mean is a single number
- Other indices often yield a similar rank ordering of objects
Why the mean is good for Amazon
Think about the goals of Amazon in reporting the mean. They might be aiming to:
- provide an intuitive and understandable rating for an item
- ensure user acceptance of the rating system
- ensure that people understand what the rating means so they can use it appropriately to inform purchasing decisions
Amazon provides a rounded mean, frequency counts for each rating option, and the sample size (i.e., the number of ratings). This information is presumably enough for most people to appreciate both the general sentiment regarding the item and the confidence to place in that rating (e.g., a 4.5 with 20 ratings is more likely to be accurate than a 4.5 with 2 ratings; an item with ten 5-star ratings and one 1-star rating with no comments might still be a good item).
You could even see the mean as a democratic option. Many elections are decided by which candidate gets the higher mean on a two-point scale. Similarly, if you take the view that each person who submits a review gets a vote, then the mean is a rating that weights each person's vote equally.
Are differences in scale use really a problem?
A wide range of rating biases is documented in the psychological literature (for a review, see Saal et al., 1980), such as central tendency bias, leniency bias, and strictness bias. Also, some raters will be more arbitrary and some more reliable. Some may even systematically lie, giving fake positive or fake negative reviews. These biases will create various forms of error when trying to estimate the true mean rating for an item.
However, if you were to take a random sample of the population, such biases would tend to cancel out, and with a sufficient sample size of raters you would still recover the true mean.
Of course, you don't get a random sample on Amazon, and there is the risk that the particular set of raters an item attracts is systematically biased to be more lenient, more strict, and so on. That said, I think users of Amazon appreciate that user-submitted ratings come from an imperfect sample. I also think it quite likely that, with a reasonable sample size, the majority of response-bias differences would wash out in many cases.
Possible advances beyond the mean
In terms of improving the accuracy of the rating, I wouldn't challenge the general concept of the mean, but rather I think there are other ways of estimating the true population mean rating for an item (i.e., the mean rating that would be obtained were a large representative sample asked to rate the item).
- Weight raters based on their trustworthiness
- Use a Bayesian rating system that estimates the mean rating as a weighted sum of the average rating for all items and the mean from the specific item, and increase the weighting for the specific item as the number of ratings increases
- Adjust a rater's ratings for any general rating tendency they show across items (e.g., a 5 from someone who typically gives 3s would be worth more than a 5 from someone who typically gives 4s)
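The Bayesian suggestion above can be sketched as a simple shrinkage estimator. This is a minimal illustration with made-up numbers; the function name and the choice of prior weight are my own, and whether Amazon uses anything like this is unknown:

```python
def bayesian_rating(item_ratings, global_mean, prior_weight):
    """Estimate an item's rating by shrinking its mean toward the global mean.

    `prior_weight` acts like a pseudo-count of ratings at the global mean,
    so the item's own mean dominates only as its number of ratings grows.
    (Illustrative formula, not Amazon's actual algorithm.)
    """
    n = len(item_ratings)
    if n == 0:
        return global_mean
    item_mean = sum(item_ratings) / n
    return (prior_weight * global_mean + n * item_mean) / (prior_weight + n)

# Two 5-star ratings barely move the estimate away from the global mean...
few = bayesian_rating([5, 5], global_mean=3.5, prior_weight=10)    # 3.75
# ...while twenty of them largely override the prior.
many = bayesian_rating([5] * 20, global_mean=3.5, prior_weight=10)  # 4.5
```

With only two ratings the estimate stays close to 3.5, which matches the intuition above that a 4.5 from 2 ratings deserves less confidence than a 4.5 from 20.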
Thus, if accuracy in rating were Amazon's primary goal, I think it should endeavour to increase the number of ratings per item and adopt some of the above strategies. Such approaches might be particularly relevant when creating "best-of" rankings. However, for the humble rating on the product page, it may well be that the sample mean better meets the goals of simplicity and transparency.
References
- Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413.
If you can assume independence and approximate normality of the observed values around their own population curve, you could construct an asymptotic chi-squared statistic to test for any form of deviation from the WHO charts by summing the squared $Z$ values. Since the parameters are all determined outside the sample, the chi-squared statistic should have $n$ degrees of freedom. [However, rejection could as easily imply an underestimate of the variation about the model as an issue with the location of the curve.]
If you want a test for a directional shift (an overall tendency to be larger or an overall tendency to be smaller), you could instead sum the $Z$ values and (under the same assumptions) compare the sum with a $N(0,n)$ distribution.
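As a concrete sketch, both tests can be computed directly from the z-scores. The helper name and inputs are hypothetical; `chart_mean` and `chart_sd` stand in for the age-specific parameters read off the published WHO tables:

```python
from scipy import stats

def who_deviation_tests(observed, chart_mean, chart_sd):
    """Test a set of measurements against externally fixed chart parameters.

    Assumes the z-scores are independent and approximately standard
    normal under the null. Returns (omnibus p-value, directional p-value).
    """
    z = [(x - m) / s for x, m, s in zip(observed, chart_mean, chart_sd)]
    n = len(z)
    # Omnibus test: sum of squared z-values ~ chi-squared with n df
    # (n df, not n-1, because no parameters are estimated from the sample)
    chi2_stat = sum(v * v for v in z)
    p_omnibus = stats.chi2.sf(chi2_stat, df=n)
    # Directional test: sum of z-values ~ N(0, n) under the null
    z_sum = sum(z)
    p_shift = 2 * stats.norm.sf(abs(z_sum) / n ** 0.5)
    return p_omnibus, p_shift
```

For example, four measurements each one chart-SD above the curve give a clearly significant directional test while the omnibus test is less decisive, since the directional test pools the consistent signs.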
[It would also be possible to test for general deviations by goodness-of-fit testing of the set of $Z$ values against a standard normal, but that would - naturally - tend to be much more directly sensitive to the assumption of normality. I wouldn't advise this approach.]
If you instead assume symmetry and independence, you could test directional shift with a sign test and consistency more broadly with a runs test.
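A stdlib-only sketch of both of those tests, computed on the signs of the z-scores (the function names are my own; the runs test uses the usual Wald-Wolfowitz normal approximation):

```python
from math import comb

def sign_test(z_values):
    """Exact two-sided sign test: under symmetry, P(above curve) = 1/2.

    Counts observations above the reference curve (z > 0); zeros dropped.
    """
    signs = [v > 0 for v in z_values if v != 0]
    n, k = len(signs), sum(signs)
    tail = min(k, n - k)
    # Two-sided binomial p-value for p = 1/2 (clamped at 1 when k = n/2)
    p = sum(comb(n, i) for i in range(tail + 1)) / 2 ** (n - 1)
    return min(p, 1.0)

def runs_test(z_values):
    """Wald-Wolfowitz runs test on the signs of the z-values.

    Too few runs suggests systematic stretches above/below the curve.
    Returns the normal-approximation z statistic (large |z| = reject).
    """
    signs = [v > 0 for v in z_values if v != 0]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need observations on both sides of the curve")
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mu) / var ** 0.5
```

A long stretch of same-sign deviations gives a large negative runs-test statistic (too few runs), flagging the kind of sustained departure the paragraph above describes.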
Note that if your curves are obtained by following individuals through time, the assumption of independence might not be tenable; more suitable models would be needed for that.
Combining males and females even though they would have different growth curves
-- this should be possible for the chi-square test and the sum-of-$Z$ test I mentioned, and it should also work with the sign test.
The runs test would not, however -- you couldn't just jam the two series together; you'd need to combine the two test statistics. If the series are large enough to use the normal approximation, you could add the two runs counts: the hypothesized mean of the total number of runs is the sum of the component means, and the variance is the sum of the variances.
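That combination rule can be sketched as follows (hypothetical helper; the component means and variances come from the per-series runs-test formulas, and independence of the two series is assumed):

```python
def combined_runs_z(runs_a, mu_a, var_a, runs_b, mu_b, var_b):
    """Combine two independent runs counts (e.g., males and females).

    Under the normal approximation the total number of runs has mean
    mu_a + mu_b and variance var_a + var_b, because the two series
    are independent, so a single z statistic can be formed.
    """
    total_runs = runs_a + runs_b
    total_mu = mu_a + mu_b
    total_var = var_a + var_b
    return (total_runs - total_mu) / total_var ** 0.5
```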
Best Answer
It is widespread practice, in a school of thought called Classical Test Theory (CTT, as opposed to Item Response Theory, IRT), to treat so-called Likert scales as quasi-metric.
Usually you are expected to word the response options so that they feel equidistant, and you are supposed to show that the items you add up are somehow of the same kind and not apples and pears (think internal consistency; think factor analysis).
If your scale has internal consistency, is composed of a number of items, and is otherwise "reasonable", then it is common practice to compute t-tests and do all sorts of other things reserved for metric variables.
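As an illustration of that practice (the responses below are made up), one might sum each respondent's items into a scale score and compare two groups with an ordinary t-test:

```python
from scipy import stats

# Hypothetical data: each row is one respondent's answers to four
# 5-point Likert items; the scale score is the simple item sum.
group_a = [[4, 5, 4, 4], [3, 4, 4, 3], [5, 5, 4, 5], [4, 4, 3, 4]]
group_b = [[2, 3, 2, 3], [3, 3, 2, 2], [2, 2, 3, 3], [3, 4, 3, 2]]

scores_a = [sum(row) for row in group_a]  # [17, 14, 19, 15]
scores_b = [sum(row) for row in group_b]  # [10, 10, 10, 12]

# Treat the summed scores as (quasi-)metric and compare the groups
t, p = stats.ttest_ind(scores_a, scores_b)
```

Whether this is strictly legitimate is exactly the CTT-versus-IRT debate; the point is only that, once the items are summed, the machinery for metric variables is routinely applied.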
It may not be mathematically correct, but it has stood the test of time, and the results are often surprisingly similar to those gained from the more complicated and less used IRT, even if the latter may be the future in a computerized age.
To read more, search for "Likert-scale" and "Classical Test Theory".