Solved – measure of ‘evenness’ of spread

Tags: descriptive statistics, measurement, standard deviation, variance

I searched the web but couldn't find anything helpful.

I'm basically looking for a way to measure how 'evenly' a value is distributed. As in, an 'evenly' distributed distribution like X:
[Image: distribution X]

and an 'unevenly' distributed distribution Y of roughly the same mean and standard deviation:
[Image: distribution Y]

Is there an evenness measure m such that m(X) > m(Y)? If there isn't, what would be the best way to construct one?

(Images are screenshots from Khan Academy)

Best Answer

A standard, powerful, well-understood, theoretically well-established, and frequently implemented measure of "evenness" is the Ripley K function and its close relative, the L function. Although these are typically used to evaluate two-dimensional spatial point configurations, the analysis needed to adapt them to one dimension (which usually is not given in references) is simple.

Theory

The K function estimates the mean proportion of points within a distance $d$ of a typical point. For a uniform distribution on the interval $[0,1]$, the true proportion can be computed and (asymptotically in the sample size) equals $1 - (1-d)^2$. The appropriate one-dimensional version of the L function subtracts this value from K to show deviations from uniformity. We might therefore consider normalizing any batch of data to have a unit range and examining its L function for deviations around zero.
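As a quick sanity check of the reference value $1 - (1-d)^2$, here is a minimal simulation sketch. It is not part of the original answer, and the names `n.sim`, `u`, `v`, and `d` are illustrative choices. It draws pairs of independent uniform values and compares the observed proportion of pairs within distance $d$ to the formula.

```r
set.seed(1)
n.sim <- 1e5                      # Number of simulated pairs (arbitrary choice)
u <- runif(n.sim)
v <- runif(n.sim)
d <- c(0.05, 0.1, 0.25, 0.5)      # Distances at which to compare
# Observed proportion of pairs no farther apart than each distance
empirical   <- sapply(d, function(y) mean(abs(u - v) <= y))
theoretical <- 1 - (1 - d)^2
round(cbind(d, empirical, theoretical), 3)
```

The two columns should agree to within simulation error, which is what justifies subtracting $1 - (1-d)^2$ from K.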

Worked Examples

To illustrate, I have simulated $999$ independent samples of size $64$ from a uniform distribution and plotted their (normalized) L functions for shorter distances (from $0$ to $1/3$), thereby creating an envelope to estimate the sampling distribution of the L function. (Plotted points well within this envelope cannot be significantly distinguished from uniformity.) Over this I have plotted the L functions for samples of the same size from a U-shaped distribution, a mixture distribution with four obvious components, and a standard Normal distribution. The histograms of these samples (and of their parent distributions) are shown for reference, using line symbols to match those of the L functions.

The sharp separated spikes of the U-shaped distribution (dashed red line, leftmost histogram) create clusters of closely spaced values. This is reflected by a very large slope in the L function at $0$. The L function then decreases, eventually becoming negative to reflect the gaps at intermediate distances.

The sample from the normal distribution (solid blue line, rightmost histogram) is fairly close to uniformly distributed. Accordingly, its L function does not depart from $0$ quickly. However, by distances of $0.10$ or so, it has risen sufficiently above the envelope to signal a slight tendency to cluster. The continued rise across intermediate distances indicates the clustering is diffuse and widespread (not confined to some isolated peaks).

The initial large slope for the sample from the mixture distribution (middle histogram) reveals clustering at small distances (less than $0.15$). By dropping to negative levels, it signals separation at intermediate distances. Comparing this to the U-shaped distribution's L function is revealing: the slopes at $0$, the amounts by which these curves rise above $0$, and the rates at which they eventually descend back to $0$ all provide information about the nature of the clustering present in the data. Any of these characteristics could be chosen as a single measure of "evenness" to suit a particular application.

These examples show how an L-function can be examined to evaluate departures of the data from uniformity ("evenness") and how quantitative information about the scale and nature of the departures can be extracted from it.
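One way to collapse the L function into a single number, sketched here under assumptions of my own, is to take its largest absolute deviation from $0$ over the short distances of interest and compare it to the same statistic for simulated uniform samples. The helper names `L.fun` and `max.dev`, the cutoff `d.max = 1/3`, and the grid size are hypothetical choices, not something prescribed by the answer above.

```r
# Compact one-dimensional L function for data rescaled to unit range.
L.fun <- function(x) {
  d <- as.vector(dist(x)) / diff(range(x))   # Pairwise distances scaled to [0,1]
  K <- ecdf(d)
  function(y) K(y) - (1 - (1 - y)^2)
}
# Hypothetical summary: largest absolute deviation of L from 0 up to `d.max`.
max.dev <- function(x, d.max = 1/3, n.grid = 101) {
  y <- seq(0, d.max, length.out = n.grid)
  max(abs(L.fun(x)(y)))
}
set.seed(17)
n <- 64
null.dev <- replicate(999, max.dev(runif(n)))  # Reference values under uniformity
obs.dev  <- max.dev(rbeta(n, 0.2, 0.2))        # A U-shaped sample, as in the example
# Monte Carlo p-value: how often a uniform sample deviates at least this much
p.value <- (1 + sum(null.dev >= obs.dev)) / (1 + length(null.dev))
c(observed = obs.dev, p.value = p.value)
```

Smaller values of this statistic indicate a more even spread, so its negative (or reciprocal) would serve if a measure that increases with evenness is wanted.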

(One can indeed plot the entire L function, extending to the full normalized distance of $1$, to assess large-scale departures from uniformity. Ordinarily, though, assessing the behavior of the data at smaller distances is of greater importance.)

Software

R code to generate this figure follows. It starts by defining functions to compute K and L, then defines helpers to simulate from (and evaluate the density of) a mixture distribution. Finally it generates the simulated data and makes the plots.

Ripley.K <- function(x, scale) {
  # Arguments:
  # x is an array of data.
  # scale (optional, not supplied below) is the distance that gets rescaled to 1.
  #
  # Return value:
  # A function that calculates Ripley's K for any value between 0 and 1 (or `scale`).
  #
  x.pairs <- outer(x, x, function(a,b) abs(a-b))  # All pairwise distances
  x.pairs <- x.pairs[lower.tri(x.pairs)]          # Distances between distinct pairs
  if(missing(scale)) scale <- max(x.pairs)        # Rescale distances to [0,1]
  x.pairs <- x.pairs / scale
  #
  # The built-in `ecdf` function returns the proportion of values in `x.pairs` that
  # are less than or equal to its argument.
  #
  return (ecdf(x.pairs))
}
#
# The one-dimensional L function.
# It merely subtracts 1 - (1-y)^2 from `Ripley.K(x)(y)`.  
# Its argument `x` is an array of data values.
#
Ripley.L <- function(x) {function(y) Ripley.K(x)(y) - 1 + (1-y)^2}
#-------------------------------------------------------------------------------#
set.seed(17)
#
# Create mixtures of random variables.
#
rmixture <- function(n, p=1, f=list(runif), factor=10) {
  q <- ceiling(factor * abs(p) * n / sum(abs(p)))
  x <- as.vector(unlist(mapply(function(y,f) f(y), q, f)))
  sample(x, n)
}
dmixture <- function(x, p=1, f=list(dunif)) {
  z <- matrix(unlist(sapply(f, function(g) g(x))), ncol=length(f))
  z %*% (abs(p) / sum(abs(p)))
}
p <- rep(1, 4)
fg <- lapply(p, function(q) {
  v <- runif(1,0,30)
  list(function(n) rnorm(n,v), function(x) dnorm(x,v), v)
  })
f <- lapply(fg, function(u) u[[1]]) # For random sampling
g <- lapply(fg, function(u) u[[2]]) # The density functions
v <- sapply(fg, function(u) u[[3]]) # The parameters (for reference)
#-------------------------------------------------------------------------------#
#
# Study the L function.
#
n <- 64                # Sample size
alpha <- beta <- 0.2   # Beta distribution parameters

layout(matrix(c(rep(1,3), 3, 4, 2), 2, 3, byrow=TRUE), heights=c(0.6, 0.4))
#
# Display the L functions over an envelope for the uniform distribution.
#
plot(c(0,1/3), c(-1/8,1/6), type="n", 
     xlab="Normalized Distance", ylab="Total Proportion",
     main="Ripley L Functions")
invisible(replicate(999, {
  plot(Ripley.L(x.unif <- runif(n)), col="#00000010", add=TRUE)
}))
abline(h=0, lwd=2, col="White")
#
# Each of these lines generates a random set of `n` data according to a specified
# distribution, calls `Ripley.L`, and plots its values.
#
plot(Ripley.L(x.norm <- rnorm(n)), col="Blue", lwd=2, add=TRUE)
plot(Ripley.L(x.beta <- rbeta(n, alpha, beta)), col="Red", lwd=2, lty=2, add=TRUE)
plot(Ripley.L(x.mixture <- rmixture(n, p, f)), col="Green", lwd=2, lty=3, add=TRUE)
#
# Display the histograms.
#
n.breaks <- 24
h <- hist(x.norm, main="Normal Sample", breaks=n.breaks, xlab="Value")
curve(dnorm(x)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, col="Blue")
h <- hist(x.beta, main=paste0("Beta(", alpha, ",", beta, ") Sample"), 
          breaks=n.breaks, xlab="Value")
curve(dbeta(x, alpha, beta)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, lty=2, col="Red")
h <- hist(x.mixture, main="Mixture Sample", breaks=n.breaks, xlab="Value")
curve(dmixture(x, p, g)*n*mean(diff(h$breaks)), add=TRUE, lwd=2, lty=3, col="Green")

Related Solutions

Solved – How does one measure the non-uniformity of a distribution

If you have not only the frequencies but the actual counts, you can use a $\chi^2$ goodness-of-fit test for each data series. In particular, you wish to use the test for a discrete uniform distribution. This gives you a good test, which allows you to find out which data series are likely not to have been generated by a uniform distribution, but does not provide a measure of uniformity.
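As a minimal sketch of the test described above, with made-up counts for five categories: R's built-in `chisq.test`, when given a single vector of counts and no `p` argument, tests against equal probabilities for all categories.

```r
counts <- c(18, 22, 20, 25, 15)   # Made-up counts, for illustration only
fit <- chisq.test(counts)          # Default null: all categories equally likely
fit$statistic                      # Chi-squared statistic
fit$parameter                      # Degrees of freedom (number of categories - 1)
fit$p.value                        # Small values suggest non-uniformity
```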

There are other possible approaches, such as computing the entropy of each series - the uniform distribution maximizes the entropy, so if the entropy is suspiciously low you would conclude that you probably don't have a uniform distribution. That works as a measure of uniformity in some sense.
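A sketch of the entropy idea, assuming the data are counts over at least two categories. The helper `norm.entropy` is a hypothetical name; it divides the Shannon entropy by its maximum possible value, $\log k$ for $k$ categories, so that a perfectly even series scores $1$.

```r
# Hypothetical helper: Shannon entropy of a vector of counts, divided by log(k).
norm.entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                       # Treat 0 * log(0) as 0
  -sum(p * log(p)) / log(length(counts))
}
norm.entropy(c(20, 20, 20, 20, 20))   # Perfectly even: 1, up to rounding
norm.entropy(c(96, 1, 1, 1, 1))       # Concentrated in one category: much lower
```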

Another suggestion would be to use a measure like the Kullback-Leibler divergence, which measures how much one distribution differs from another (here, from the uniform distribution).
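For a discrete series, the Kullback-Leibler divergence from the observed proportions to the uniform distribution can be sketched as follows. The helper name `kl.from.uniform` is hypothetical. For $k$ categories this quantity equals $\log k$ minus the Shannon entropy, so it is $0$ for a perfectly even series and grows as the series becomes less even.

```r
# Hypothetical helper: KL divergence from observed proportions to the uniform
# distribution over the same categories.
kl.from.uniform <- function(counts) {
  p <- counts / sum(counts)
  u <- 1 / length(counts)             # Uniform probability of each category
  p <- p[p > 0]                       # Treat 0 * log(0) as 0
  sum(p * log(p / u))
}
kl.from.uniform(c(20, 20, 20, 20, 20))  # Perfectly even: 0, up to rounding
kl.from.uniform(c(96, 1, 1, 1, 1))      # Far from even: clearly positive
```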

Solved – Standard measure of clumpiness

As an example, suppose you have an ordered set in which each position has an equal probability of being any of the lowercase letters in the alphabet. In this case I will make the ordered set contain $1000$ elements.

# generate a possible sequence of letters
s <- sample(x = letters, size = 1000, replace = TRUE)

It turns out that if each position of the ordered set independently follows a uniform distribution over the lowercase letters of the alphabet, then the distance between consecutive occurrences of the same letter follows a geometric distribution with parameter $p=1/26$. In light of this information, let's compute the distance between consecutive occurrences of the same letter.
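To spell out the claim (the symbol $D$ for the distance is my notation): after a given letter occurs, each later position independently matches it with probability $p = 1/26$, so the next match is at distance $k$ exactly when the first $k-1$ positions fail to match and the $k$-th one matches:

$$P(D = k) = (1-p)^{k-1}\,p, \qquad k = 1, 2, 3, \ldots$$

R's `dgeom` counts the number of failures before the first success, so $P(D = k)$ corresponds to `dgeom(k - 1, prob = p)`.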

# find the distance between consecutive occurrences of the same letter
d <- vector(mode = 'list', length = length(unique(letters)))
for(i in 1:length(unique(letters))) {
    d[[i]] <- diff(which(s == letters[i]))
}
d.flat <- unlist(x = d)

Let's look at a histogram of the distances between occurrences of the same letter and compare it to the probability mass function associated with the geometric distribution mentioned above.

hist(x = d.flat, prob = TRUE, main = 'Histogram of Distances', xlab = 'Distance',
     ylab = 'Probability')
x <- range(d.flat)
x <- x[1]:x[2]
y <- dgeom(x = x - 1, prob = 1/26)
points(x = x, y = y, pch = '.', col = 'red', cex = 2)

The red dots represent the actual probability mass function of the distance we would expect if each of the positions of the ordered set followed a uniform distribution over the letters and the bars of the histogram represent the empirical probability mass function of the distance associated with the ordered set.

[Image: histogram of distances, with the geometric probability mass function overlaid as red dots]

Hopefully the image above is convincing that the geometric distribution is appropriate.

Again, if each position of the ordered set follows a uniform distribution over the letters, we would expect the distance between occurrences of the same letter to follow a geometric distribution with parameter $p=1/26$. So how similar are the expected distribution of the distances and the empirical distribution of the differences? The Bhattacharyya Distance between two discrete distributions is $0$ when the distributions are exactly the same and tends to $\infty$ as the distributions become increasingly different.
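For two discrete distributions with probabilities $p_i$ and $q_i$ on the same support, the Bhattacharyya Distance is $-\ln \sum_i \sqrt{p_i q_i}$. A minimal vectorized sketch follows; the helper name `bhattacharyya` and the toy vectors `p1` and `q1` are illustrative assumptions.

```r
# Hypothetical helper: Bhattacharyya distance between two probability vectors
# defined over the same support (each is assumed to sum to 1).
bhattacharyya <- function(p, q) -log(sum(sqrt(p * q)))

p1 <- c(0.5, 0.3, 0.2)
q1 <- c(0.2, 0.3, 0.5)
bhattacharyya(p1, p1)   # Identical distributions: 0, up to rounding
bhattacharyya(p1, q1)   # Different distributions: positive
```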

How does d.flat from above compare to the expected geometric distribution in terms of Bhattacharyya Distance?

b.dist <- 0
for(i in x) {
    b.dist <- b.dist + sqrt((sum(d.flat == i) / length(d.flat)) * dgeom(x = i - 1,
              prob = 1/26))
}
b.dist <- -1 * log(x = b.dist)

The Bhattacharyya Distance between the expected geometric distribution and the empirical distribution of the distances is about $0.026$, which is fairly close to $0$.

EDIT:

Rather than simply stating that the Bhattacharyya Distance observed above ($0.026$) is fairly close to $0$, I think this is a good example of when simulation comes in handy. The question now is the following: How does the Bhattacharyya Distance observed above compare to typical Bhattacharyya Distances observed if each position of the ordered set is uniform over the letters? Let's generate $10,000$ such ordered sets and compute each of their Bhattacharyya Distances from the expected geometric distribution.

gen.bhat <- function(set, size) {
    new.seq <- sample(x = set, size = size, replace = TRUE)
    d <- vector(mode = 'list', length = length(unique(set)))
    for(i in 1:length(unique(set))) {
        d[[i]] <- diff(which(new.seq == set[i]))
    }
    d.flat <- unlist(x = d)
    x <- range(d.flat)
    x <- x[1]:x[2]
    b.dist <- 0
    for(i in x) {
        b.dist <- b.dist + sqrt((sum(d.flat == i) / length(d.flat)) * dgeom(x = i -1,
                  prob = 1/length(unique(set))))
    }
    b.dist <- -1 * log(x = b.dist)
    return(b.dist)
}
dist.bhat <- replicate(n = 10000, expr = gen.bhat(set = letters, size = 1000))

Now we may compute the probability of observing the Bhattacharyya Distance observed above, or one more extreme, if the ordered set was generated in such a way that each of its positions follows a uniform distribution over the letters.

p <- ifelse(b.dist <= mean(dist.bhat), sum(dist.bhat <= b.dist) / length(dist.bhat),
            sum(dist.bhat > b.dist) / length(dist.bhat))

In this case, the probability turns out to be about $0.38$.

For completeness, the following image is a histogram of the simulated Bhattacharyya Distances. I think it's important to realize that you will never observe a Bhattacharyya Distance of exactly $0$ because the ordered set has finite length: above, the distance between any two occurrences of a letter is at most $999$, whereas the geometric distribution assigns positive probability to every positive integer.