Solved – Measure spread of non normal distribution

descriptive statistics, inference, measurement

I have sentiment data for customer reviews drawn from a larger population of reviews. Each product has a number of customer reviews. Each review has a sentiment (opinion/feeling) between 0 and 1 where 0 is very negative and 1 is very positive.

Customers seldom write reviews when they are indifferent; their reviews tend to be either positive or negative. So the distribution of sentiments is not normal but closer to bimodal: there are lots of negative reviews and lots of positive reviews, but not many in the middle.

How can I measure the spread of the data and compare it between different products? For example, one product can have a mean sentiment of 0.8 because most of its reviews are around that value. Another product can also have a mean of 0.8 but have many very positive reviews and some very negative ones. The latter product would have a larger spread of sentiment. The products with the largest spread of sentiment are likely marketed wrongly, so it would be important to identify them: people may buy them thinking they will do something they don't.

I assume standard deviation is out of the picture since the distribution isn't normal or t-distributed. Are there other measures of spread for this kind of bimodal distribution?

Best Answer

If you look at introductions to the beta distribution (e.g. this Wikipedia article) you will see that standard deviations (SDs) can certainly be defined for distributions that are bounded (e.g. confined to the interval $[0, 1]$) and may be far from normal in shape (e.g. bimodal and in your case possibly U-shaped). The standard deviation is also well defined for many distributions that can be far from symmetric such as the Poisson or the lognormal. So, any idea that the SD is defined and useful only for near-normal distributions is unfounded.
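As a minimal sketch of this point, the R snippet below compares the closed-form SD of a U-shaped beta distribution with the SD of a simulated sample. The parameters `a = b = 0.2` are an illustrative assumption, not a fit to any review data.

```r
# Sketch: SD of a U-shaped Beta(a, b) on [0, 1], theory versus simulation.
# a = b = 0.2 is an illustrative choice (assumption), not fitted to real data.
set.seed(1)
a <- 0.2
b <- 0.2

# Closed-form variance of Beta(a, b): a*b / ((a + b)^2 * (a + b + 1))
sd_theory <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))

# SD of a large simulated sample from the same distribution
x <- rbeta(1e5, a, b)
sd_sample <- sd(x)

print(c(theory = sd_theory, sample = sd_sample))
```

The two values should agree closely, which is all that "well defined" needs to mean here: the SD exists and can be estimated even though the density is far from bell-shaped.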

What is certain is that you can't carry over ideas which work only for normal distributions (e.g. that the SD is the distance between the mean and each inflection on the density function). Likewise, rules of thumb from normal distributions about the fraction of observations within $\pm 1, \pm 2, \dots$ SD of the mean may break down.
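To make the breakdown of the normal rules of thumb concrete, here is a small sketch that computes the probability within $\pm 1$ and $\pm 2$ SD of the mean for a U-shaped beta distribution and for a normal distribution. The choice of Beta(0.2, 0.2) and the helper name `within_k_sd` are assumptions made for illustration only.

```r
# Sketch: coverage of mean +/- k SD for a U-shaped beta versus a normal.
# Beta(0.2, 0.2) is an illustrative assumption.
a <- 0.2
b <- 0.2
m <- a / (a + b)                                  # mean of Beta(a, b)
s <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))      # SD of Beta(a, b)

# Probability that a Beta(a, b) value lies within k SDs of its mean,
# clamping the interval to the support [0, 1] (hypothetical helper)
within_k_sd <- function(k) {
  pbeta(min(1, m + k * s), a, b) - pbeta(max(0, m - k * s), a, b)
}

beta_cov <- sapply(1:2, within_k_sd)
norm_cov <- sapply(1:2, function(k) pnorm(k) - pnorm(-k))

print(rbind(beta = beta_cov, normal = norm_cov))
```

For this beta distribution the $\pm 1$ SD interval holds far less than the familiar 68%, while the $\pm 2$ SD interval already covers the whole of $[0, 1]$.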

Your data may well be messier than the beta distribution and even show small gaps, bumps and spikes. But I see no barrier to your using the SD as a descriptive measure, nor indeed to also using the interquartile range or IQR. Just keep plotting your data so that you get a sense of how any measure works with your data and any circumstances where it is misleading.

You already have a strong sense that a given mean can correspond to different distributions. That is also going to be true for SDs.
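A minimal sketch of such a comparison follows, using simulated data. The data frame `reviews`, its columns `product` and `sentiment`, and the two beta distributions (both with mean 0.8) are assumptions invented for the illustration.

```r
# Sketch: two hypothetical products with the same mean sentiment (0.8)
# but very different spread. All data here are simulated assumptions.
set.seed(2)
n <- 5000
reviews <- data.frame(
  product   = rep(c("A", "B"), each = n),
  sentiment = c(rbeta(n, 8, 2),        # A: concentrated around 0.8
                rbeta(n, 0.4, 0.1))    # B: piled up near 0 and near 1
)

# Mean, SD and IQR of a numeric vector
summarise <- function(x) c(mean = mean(x), sd = sd(x), iqr = IQR(x))

# One row per product, one column per summary
stats <- t(sapply(split(reviews$sentiment, reviews$product), summarise))
print(round(stats, 3))
```

The means are nearly equal while the SDs differ greatly. The SD and the IQR need not rank products the same way, which is one reason to keep plotting the data alongside any summary.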

Theory

The K function estimates the mean proportion of points within a distance $d$ of a typical point. For a uniform distribution on the interval $[0,1]$, the true proportion can be computed and (asymptotically in the sample size) equals $1 - (1-d)^2$. The appropriate one-dimensional version of the L function subtracts this value from K to show deviations from uniformity. We might therefore consider normalizing any batch of data to have a unit range and examining its L function for deviations around zero.
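The uniform reference value can be checked numerically. The sketch below estimates K from a large uniform sample and compares it with $1 - (1-d)^2$; the sample size and the distances in `d` are arbitrary choices for the illustration, and no rescaling is applied because the sample range is already close to 1.

```r
# Sketch: empirical K for uniform data versus the reference 1 - (1 - d)^2.
set.seed(3)
x <- runif(2000)

# All distinct pairwise distances |x_i - x_j|
d_pairs <- as.vector(dist(x))

# Empirical K: proportion of pairs within a given distance
K_hat <- ecdf(d_pairs)

d <- c(0.05, 0.1, 0.2, 0.3)
K_true <- 1 - (1 - d)^2

# L is the difference; it should be near zero for uniform data
print(cbind(d, K_hat = K_hat(d), K_true, L_hat = K_hat(d) - K_true))
```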

Worked Examples

To illustrate, I have simulated $999$ independent samples of size $64$ from a uniform distribution and plotted their (normalized) L functions for shorter distances (from $0$ to $1/3$), thereby creating an envelope to estimate the sampling distribution of the L function. (Plotted points well within this envelope cannot be significantly distinguished from uniformity.) Over this I have plotted the L functions for samples of the same size from a U-shaped distribution, a mixture distribution with four obvious components, and a standard Normal distribution. The histograms of these samples (and of their parent distributions) are shown for reference, using line symbols to match those of the L functions.
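The envelope can also be summarized numerically rather than by overplotting. The sketch below is a variant, not the plotting code used for the figure: it computes pointwise 2.5% and 97.5% quantiles of the simulated L functions on a grid and flags where a U-shaped sample falls outside them. The helper `L_fun`, the grid, and the quantile levels are assumptions; a pointwise band like this is not a simultaneous test.

```r
# Sketch: pointwise simulation envelope for the one-dimensional L function.
set.seed(4)

# Hypothetical helper: L function of a sample after scaling to unit range
L_fun <- function(x) {
  d <- as.vector(dist(x)) / diff(range(x))   # pairwise distances in [0, 1]
  K <- ecdf(d)
  function(y) K(y) - (1 - (1 - y)^2)
}

n <- 64
grid <- seq(0, 1/3, length.out = 50)

# Each column holds one simulated uniform sample's L values on the grid
sims <- replicate(999, L_fun(runif(n))(grid))

# Row 1: lower 2.5% bound, row 2: upper 97.5% bound, per grid point
envelope <- apply(sims, 1, quantile, probs = c(0.025, 0.975))

# Compare a U-shaped sample against the envelope
L_beta <- L_fun(rbeta(n, 0.2, 0.2))(grid)
outside <- L_beta < envelope[1, ] | L_beta > envelope[2, ]
print(mean(outside))   # fraction of grid points outside the band
```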

The sharp separated spikes of the U-shaped distribution (dashed red line, leftmost histogram) create clusters of closely spaced values. This is reflected by a very large slope in the L function at $0$. The L function then decreases, eventually becoming negative to reflect the gaps at intermediate distances.

The sample from the normal distribution (solid blue line, rightmost histogram) is fairly close to uniformly distributed. Accordingly, its L function does not depart from $0$ quickly. However, by distances of $0.10$ or so, it has risen sufficiently above the envelope to signal a slight tendency to cluster. The continued rise across intermediate distances indicates the clustering is diffuse and widespread (not confined to some isolated peaks).

The initial large slope for the sample from the mixture distribution (middle histogram) reveals clustering at small distances (less than $0.15$). By dropping to negative levels, it signals separation at intermediate distances. Comparing this to the U-shaped distribution's L function is revealing: the slopes at $0$, the amounts by which these curves rise above $0$, and the rates at which they eventually descend back to $0$ all provide information about the nature of the clustering present in the data. Any of these characteristics could be chosen as a single measure of "evenness" to suit a particular application.
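As one possible way to turn these characteristics into numbers, the sketch below extracts a finite-difference slope near $0$, the peak height, and the distance at which the peak occurs. The helper `L_fun`, the step `h`, and the choice of these three summaries are assumptions for illustration; other choices may suit a given application better.

```r
# Sketch: scalar summaries of an L function for a U-shaped sample.
set.seed(5)

# Hypothetical helper: L function of a sample after scaling to unit range
L_fun <- function(x) {
  d <- as.vector(dist(x)) / diff(range(x))   # pairwise distances in [0, 1]
  K <- ecdf(d)
  function(y) K(y) - (1 - (1 - y)^2)
}

x <- rbeta(64, 0.2, 0.2)
L <- L_fun(x)

grid <- seq(0, 1/3, length.out = 200)
vals <- L(grid)
h <- 0.02                                    # small step for the slope (assumption)

summary_stats <- c(
  slope0 = L(h) / h,                # approximate slope at 0, since L(0) = 0 without ties
  peak   = max(vals),               # largest excess over uniformity
  d_peak = grid[which.max(vals)]    # distance at which that excess occurs
)
print(round(summary_stats, 3))
```

Computing the same summaries for each batch of data gives a simple table for comparing them, under the caveat that each number discards most of the curve.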

These examples show how an L-function can be examined to evaluate departures of the data from uniformity ("evenness") and how quantitative information about the scale and nature of the departures can be extracted from it.

(One can indeed plot the entire L function, extending to the full normalized distance of $1$, to assess large-scale departures from uniformity. Ordinarily, though, assessing the behavior of the data at smaller distances is of greater importance.)

Software

R code to generate this figure follows. It starts by defining functions to compute K and L. It creates a capability to simulate from a mixture distribution. Then it generates the simulated data and makes the plots.

Ripley.K <- function(x, scale) {
  # Arguments:
  # x is an array of data.
  # scale (optional) is the divisor used to rescale the pairwise distances;
  #   by default it is the range of the data, so that distances lie in [0,1].
  #
  # Return value:
  # A function that calculates Ripley's K for any value between 0 and 1.
  #
  x.pairs <- outer(x, x, function(a,b) abs(a-b))  # All pairwise distances
  x.pairs <- x.pairs[lower.tri(x.pairs)]          # Distances between distinct pairs
  if(missing(scale)) scale <- max(x.pairs)        # Largest distance = range of the data
  x.pairs <- x.pairs / scale                      # Rescale distances to [0,1]
  #
  # The built-in `ecdf` function returns the proportion of values in `x.pairs` that
  # are less than or equal to its argument.
  #
  return (ecdf(x.pairs))
}
#
# The one-dimensional L function.
# It merely subtracts 1 - (1-y)^2 from `Ripley.K(x)(y)`.  
# Its argument `x` is an array of data values.
#
Ripley.L <- function(x) {function(y) Ripley.K(x)(y) - 1 + (1-y)^2}
#-------------------------------------------------------------------------------#
set.seed(17)
#
# Create mixtures of random variables.
#
rmixture <- function(n, p=1, f=list(runif), factor=10) {
  q <- ceiling(factor * abs(p) * n / sum(abs(p)))
  x <- as.vector(unlist(mapply(function(y,f) f(y), q, f)))
  sample(x, n)
}
dmixture <- function(x, p=1, f=list(dunif)) {
  z <- matrix(unlist(sapply(f, function(g) g(x))), ncol=length(f))
  z %*% (abs(p) / sum(abs(p)))
}
p <- rep(1, 4)
fg <- lapply(p, function(q) {
  v <- runif(1,0,30)
  list(function(n) rnorm(n,v), function(x) dnorm(x,v), v)
  })
f <- lapply(fg, function(u) u[[1]]) # For random sampling
g <- lapply(fg, function(u) u[[2]]) # The density functions
v <- sapply(fg, function(u) u[[3]]) # The parameters (for reference)
#-------------------------------------------------------------------------------#
#
# Study the L function.
#
n <- 64                # Sample size
alpha <- beta <- 0.2   # Beta distribution parameters

layout(matrix(c(rep(1,3), 3, 4, 2), 2, 3, byrow=TRUE), heights=c(0.6, 0.4))
#
# Display the L functions over an envelope for the uniform distribution.
#
plot(c(0,1/3), c(-1/8,1/6), type="n", 
     xlab="Normalized Distance", ylab="Total Proportion",
     main="Ripley L Functions")
invisible(replicate(999, {
  plot(Ripley.L(x.unif <- runif(n)), col="#00000010", add=TRUE)
}))
abline(h=0, lwd=2, col="White")
#
# Each of these lines generates a random set of `n` data according to a specified
# distribution, calls `Ripley.L`, and plots its values.
#
plot(Ripley.L(x.norm <- rnorm(n)), col="Blue", lwd=2, add=TRUE)
plot(Ripley.L(x.beta <- rbeta(n, alpha, beta)), col="Red", lwd=2, lty=2, add=TRUE)
plot(Ripley.L(x.mixture <- rmixture(n, p, f)), col="Green", lwd=2, lty=3, add=TRUE)
#
# Display the histograms.
#
n.breaks <- 24
h <- hist(x.norm, main="Normal Sample", breaks=n.breaks, xlab="Value")
curve(dnorm(x)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, col="Blue")
h <- hist(x.beta, main=paste0("Beta(", alpha, ",", beta, ") Sample"), 
          breaks=n.breaks, xlab="Value")
curve(dbeta(x, alpha, beta)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, lty=2, col="Red")
h <- hist(x.mixture, main="Mixture Sample", breaks=n.breaks, xlab="Value")
curve(dmixture(x, p, g)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, lty=3, col="Green")

Related Solutions

Solved – Building a summary for values drawn from a bimodal distribution

Solved – measure of ‘evenness’ of spread
