Solved – Measure spread of non-normal distribution

descriptive statistics, inference, measurement

I have sentiment data for customer reviews drawn from a larger population of reviews. Each product has a number of customer reviews. Each review has a sentiment (opinion/feeling) between 0 and 1 where 0 is very negative and 1 is very positive.

Customers seldom write reviews when they are indifferent; reviews tend to be either positive or negative. So the distribution of sentiments is not normal but closer to bimodal: there are lots of negative reviews and lots of positive reviews but not much in the middle.

How can I measure the spread of the data and compare it between different products? For example, a product can have a mean sentiment of 0.8 when most of its reviews are around that value. Another product can also have a mean of 0.8 but have mostly very positive reviews and some very negative ones. The latter product would have a larger spread of sentiment. The products with the largest spread of sentiment are likely marketed wrongly, so it would be important to identify them: people may buy them thinking they will do something they don't.

I assume standard deviation is out of the picture since the distribution isn't normal or t-distributed. Are there other measures of spread for this kind of bimodal distribution?

Best Answer

If you look at introductions to the beta distribution (e.g. the Wikipedia article on it) you will see that standard deviations (SDs) can certainly be defined for distributions that are bounded (e.g. confined to the interval $[0, 1]$) and may be far from normal in shape (e.g. bimodal and in your case possibly U-shaped). The standard deviation is also well defined for many distributions that can be far from symmetric such as the Poisson or the lognormal. So, any idea that the SD is defined and useful only for near-normal distributions is unfounded.
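As a small illustration of that point, here is a sketch that evaluates the standard closed-form mean and SD of a beta distribution with shape parameters $a$ and $b$. The helper name `beta_mean_sd` and the two parameter choices are mine, picked only to contrast a U-shaped case with a bell-like one; nothing here is fitted to your data.

```python
import math

def beta_mean_sd(a, b):
    """Mean and SD of a Beta(a, b) distribution on [0, 1] (closed-form formulas)."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)

# Beta(0.5, 0.5) is U-shaped: most of the mass sits near 0 and 1.
u_mean, u_sd = beta_mean_sd(0.5, 0.5)

# Beta(5, 5) is unimodal and roughly bell-shaped around 0.5.
bell_mean, bell_sd = beta_mean_sd(5, 5)

print(f"U-shaped:  mean={u_mean:.3f}, SD={u_sd:.3f}")      # SD is about 0.354
print(f"Bell-like: mean={bell_mean:.3f}, SD={bell_sd:.3f}")  # SD is about 0.151
```

Both distributions have mean 0.5, and the SD is perfectly well defined for each; it is simply larger for the U-shaped one, which is what you would want a measure of spread to report.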

What is certain is that you can't carry over ideas that work only for normal distributions (e.g. that the SD is the distance between the mean and each inflection point of the density function), and rules of thumb from normal distributions about the fraction of observations within $\pm 1, \pm 2, \dots$ SD of the mean may break down.
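To make the breakdown of those rules of thumb concrete, here is a sketch using the U-shaped Beta(0.5, 0.5) distribution (also known as the arcsine distribution), whose CDF has the closed form $F(x) = \frac{2}{\pi}\arcsin\sqrt{x}$. The choice of this particular distribution is an assumption for illustration; your review data need not follow it.

```python
import math

def arcsine_cdf(x):
    """CDF of Beta(0.5, 0.5) on [0, 1]; inputs outside [0, 1] are clamped."""
    x = min(max(x, 0.0), 1.0)
    return (2 / math.pi) * math.asin(math.sqrt(x))

mean = 0.5
sd = math.sqrt(0.125)  # SD of Beta(0.5, 0.5), about 0.354

def fraction_within(k):
    """Probability mass within k SDs of the mean."""
    return arcsine_cdf(mean + k * sd) - arcsine_cdf(mean - k * sd)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)

print(f"within 1 SD: {within_1sd:.3f}")  # about 0.50, versus about 0.68 for a normal
print(f"within 2 SD: {within_2sd:.3f}")  # 1.00, versus about 0.95 for a normal
```

So for this U-shaped case only about half the mass lies within one SD of the mean, and all of it lies within two. The SD is still a valid summary; it is the normal-based interpretation that no longer applies.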

Your data may well be messier than the beta distribution and even show small gaps, bumps and spikes. But I see no barrier to your using the SD as a descriptive measure, nor indeed to also using the interquartile range or IQR. Just keep plotting your data so that you get a sense of how any measure works with your data and any circumstances where it is misleading.
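A minimal sketch of computing both measures per product, using only Python's standard library. The two score lists are made up to mimic the scenario in the question (same mean of 0.8, very different spread), and `spread_summary` is a hypothetical helper name.

```python
import statistics

def spread_summary(scores):
    """Return the mean, sample SD and interquartile range of a list of scores."""
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "iqr": q3 - q1,
    }

# Hypothetical product A: reviews clustered around 0.8.
product_a = [0.70, 0.75, 0.78, 0.80, 0.80, 0.80, 0.82, 0.82, 0.85, 0.88]

# Hypothetical product B: mostly very positive, with two very negative reviews.
product_b = [0.05, 0.10, 0.95, 0.95, 0.95, 1.00, 1.00, 1.00, 1.00, 1.00]

summary_a = spread_summary(product_a)
summary_b = spread_summary(product_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Both products have a mean of 0.8, but product B shows a much larger SD and a larger IQR. Note that `statistics.quantiles` uses its default interpolation method here; other quantile conventions will give slightly different IQR values on small samples, which is one more reason to plot the data alongside any summary.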

You already have a strong sense that a given mean can correspond to different distributions. That is also going to be true for SDs.
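Here is a small constructed example of that last point: two made-up sets of scores that share the same mean and the same (population) SD yet have quite different shapes. The numbers are chosen by hand purely for illustration.

```python
import statistics

# Hypothetical product C: clearly bimodal, half the reviews at 0.3 and half at 0.7.
bimodal = [0.3] * 4 + [0.7] * 4

# Hypothetical product D: most reviews at 0.5, plus one very low and one very high.
peaked_with_extremes = [0.1] + [0.5] * 6 + [0.9]

for name, scores in (("C", bimodal), ("D", peaked_with_extremes)):
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD, so the two match exactly in theory
    print(f"{name}: mean={mean:.3f}, SD={sd:.3f}")
```

Both sets have mean 0.5 and SD 0.2, so neither number alone distinguishes a split audience from a mostly neutral one with a couple of extreme reviews. A histogram or dot plot of each product's scores would separate them immediately.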
