Solved – Standard measure of clumpiness

descriptive statistics

I have a lot of data and I want to do something which seems very simple. In this large set of data, I am interested in how much a specific element clumps together. Let's say my data is an ordered set like this: {A,C,B,D,A,Z,T,C…}. Let's say I want to know whether the A's tend to be found right next to each other, as opposed to being randomly (or more evenly) distributed throughout the set. This is the property I am calling "clumpiness".

Now, is there some simple measurement of data "clumpiness"? That is, some statistic that will tell me how far from randomly distributed the As are? And if there isn't a simple way to do this, what would the hard way be, roughly? Any pointers greatly appreciated!

Best Answer

As an example, suppose you have an ordered set in which each position has an equal probability of being any of the lowercase letters in the alphabet. In this case I will make the ordered set contain $1000$ elements.

# generate a possible sequence of letters
s <- sample(x = letters, size = 1000, replace = TRUE)

It turns out that if each position of the ordered set follows a uniform distribution over the lowercase letters of the alphabet, then the distance between two consecutive occurrences of the same letter follows a geometric distribution (on $1, 2, 3, \dots$) with parameter $p=1/26$. In light of this information, let's compute the distance between consecutive occurrences of the same letter.

# find the distance between occurrences of the same letters
d <- vector(mode = 'list', length = length(unique(letters)))
for(i in 1:length(unique(letters))) {
    d[[i]] <- diff(which(s == letters[i]))
}
d.flat <- unlist(x = d)

Let's look at a histogram of the distances between occurrences of the same letter and compare it to the probability mass function associated with the geometric distribution mentioned above.

hist(x = d.flat, prob = TRUE, main = 'Histogram of Distances', xlab = 'Distance',
     ylab = 'Probability')
x <- range(d.flat)
x <- x[1]:x[2]
y <- dgeom(x = x - 1, prob = 1/26)
points(x = x, y = y, pch = '.', col = 'red', cex = 2)

The red dots represent the actual probability mass function of the distance we would expect if each of the positions of the ordered set followed a uniform distribution over the letters and the bars of the histogram represent the empirical probability mass function of the distance associated with the ordered set.

[Figure: histogram of the distances, with the geometric probability mass function overlaid as red dots]

Hopefully the image above is convincing that the geometric distribution is appropriate.
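
To see why the geometric distribution arises: after one occurrence of a letter, each later position matches it independently with probability $p = 1/26$, so the gap equals $k$ with probability $(1-p)^{k-1}p$ for $k = 1, 2, \dots$. R's dgeom counts the failures before the first success, which is why the code above evaluates it at x - 1. A minimal check of that convention (the variable names here are only illustrative):

```r
p <- 1 / 26
k <- 1:2000                               # gap sizes; the tail beyond 2000 is negligible
pmf.manual <- (1 - p)^(k - 1) * p         # P(gap = k) written out directly
pmf.dgeom <- dgeom(x = k - 1, prob = p)   # dgeom counts the k - 1 non-matching positions
all.equal(pmf.manual, pmf.dgeom)          # the two parameterizations agree
sum(k * pmf.dgeom)                        # mean gap, approximately 1 / p = 26
```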

Again, if each position of the ordered set follows a uniform distribution over the letters, we would expect the distance between occurrences of the same letter to follow a geometric distribution with parameter $p=1/26$. So how similar are the expected distribution of the distances and the empirical distribution of the distances? The Bhattacharyya Distance between two discrete distributions is $0$ when the distributions are exactly the same and tends to $\infty$ as the distributions become increasingly different.

How does d.flat from above compare to the expected geometric distribution in terms of Bhattacharyya Distance?

b.dist <- 0
for(i in x) {
    b.dist <- b.dist + sqrt((sum(d.flat == i) / length(d.flat)) * dgeom(x = i - 1,
              prob = 1/26))
}
b.dist <- -1 * log(x = b.dist)

The Bhattacharyya Distance between the expected geometric distribution and the empirical distribution of the distances is about $0.026$, which is fairly close to $0$.
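
For reference, the quantity computed by the loop above is $D_B(P, Q) = -\ln \sum_x \sqrt{P(x)\,Q(x)}$. Here is a vectorized sketch of the same calculation, assuming both distributions are supplied as probability vectors over a common support (bhat.dist is a hypothetical helper name, not part of base R):

```r
# Bhattacharyya distance between two discrete distributions given as
# probability vectors over the same support (hypothetical helper).
bhat.dist <- function(p, q) {
    stopifnot(length(p) == length(q), all(p >= 0), all(q >= 0))
    bc <- sum(sqrt(p * q))   # Bhattacharyya coefficient, between 0 and 1
    -log(bc)
}

p <- c(0.5, 0.3, 0.2)
bhat.dist(p = p, q = p)                   # identical distributions: distance of 0 (up to rounding)
bhat.dist(p = c(1, 0), q = c(0, 1))       # no overlap at all: distance is Inf
```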

EDIT:

Rather than simply stating that the Bhattacharyya Distance observed above ($0.026$) is fairly close to $0$, I think this is a good example of when simulation comes in handy. The question now is the following: How does the Bhattacharyya Distance observed above compare to typical Bhattacharyya Distances observed if each position of the ordered set is uniform over the letters? Let's generate $10,000$ such ordered sets and compute each of their Bhattacharyya Distances from the expected geometric distribution.

gen.bhat <- function(set, size) {
    new.seq <- sample(x = set, size = size, replace = TRUE)
    d <- vector(mode = 'list', length = length(unique(set)))
    for(i in 1:length(unique(set))) {
        d[[i]] <- diff(which(new.seq == set[i]))
    }
    d.flat <- unlist(x = d)
    x <- range(d.flat)
    x <- x[1]:x[2]
    b.dist <- 0
    for(i in x) {
        b.dist <- b.dist + sqrt((sum(d.flat == i) / length(d.flat)) * dgeom(x = i -1,
                  prob = 1/length(unique(set))))
    }
    b.dist <- -1 * log(x = b.dist)
    return(b.dist)
}
dist.bhat <- replicate(n = 10000, expr = gen.bhat(set = letters, size = 1000))

Now we may compute the probability of observing the Bhattacharyya Distance observed above, or one more extreme, if the ordered set was generated in such a way that each of its positions follows a uniform distribution over the letters.

p <- ifelse(b.dist <= mean(dist.bhat), sum(dist.bhat <= b.dist) / length(dist.bhat),
            sum(dist.bhat > b.dist) / length(dist.bhat))

In this case, the probability turns out to be about $0.38$.
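
If only large distances count as evidence against the uniform model, a one-sided Monte Carlo p-value is a common alternative to the two-branch rule above. The sketch below is an assumption about how one might do that, not the calculation used in this answer. mc.pvalue is a hypothetical helper, and the 1 added to numerator and denominator is the usual correction that keeps the estimate away from exactly $0$.

```r
# One-sided Monte Carlo p-value: share of simulated statistics at least as
# large as the observed one, counting the observed value itself once.
mc.pvalue <- function(observed, simulated) {
    (1 + sum(simulated >= observed)) / (1 + length(simulated))
}

# Toy call with made-up numbers; seven of the nine simulated values are >= 3.
mc.pvalue(observed = 3, simulated = 1:9)
```

With the objects defined above, the call would be mc.pvalue(observed = b.dist, simulated = dist.bhat).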

For completeness, the following image is a histogram of the simulated Bhattacharyya Distances. I think it's important to realize that you will never observe a Bhattacharyya Distance of $0$ because the ordered set has finite length. Above, the distance between any two occurrences of a letter is at most $999$, whereas the geometric distribution puts positive probability on every positive integer.

[Figure: histogram of the simulated Bhattacharyya Distances]
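
To tie this back to the original question about a single element such as the A's, the same idea can be applied to one letter at a time. The sketch below is only an illustration under stated assumptions: gap.bhat is a hypothetical helper, the geometric parameter is set to the element's observed frequency rather than to $1/26$, and the clumped sequence is generated artificially by repeating each drawn letter in a run of random length.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Bhattacharyya distance between the empirical gap distribution of one
# element and a geometric distribution whose p is the element's observed
# frequency (hypothetical helper; needs at least two occurrences).
gap.bhat <- function(s, element) {
    pos <- which(s == element)
    gaps <- diff(pos)
    p <- length(pos) / length(s)
    k <- min(gaps):max(gaps)
    emp <- tabulate(gaps, nbins = max(gaps))[k] / length(gaps)
    -log(sum(sqrt(emp * dgeom(x = k - 1, prob = p))))
}

# No clumping: every position is uniform over the letters.
s.random <- sample(x = letters, size = 30000, replace = TRUE)

# Artificial clumping: each drawn letter is repeated in a run of 1 to 5.
runs <- sample(x = letters, size = 10000, replace = TRUE)
s.clumped <- rep(x = runs, times = sample(x = 1:5, size = 10000, replace = TRUE))

d.random <- gap.bhat(s = s.random, element = 'a')
d.clumped <- gap.bhat(s = s.clumped, element = 'a')
c(random = d.random, clumped = d.clumped)
```

The clumped sequence should give a noticeably larger distance because its gaps pile up at $1$. To judge whether a particular value is unusually large, compare it against simulated values as in the EDIT above.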

Related Solutions

Solved – Tukey’s Hinges: Grouping Data

You are going to have problems dividing a set of data into four equal parts if the number of pieces of data is not a multiple of $4$.

One approach might be to duplicate some of the data. So if you have $4n$ data points, just divide into four sets of $n$ points by rank. If you have $4n-1$ points, duplicate the median, including it in both the second and third sets, so again you have four sets of $n$ points by rank. If you have $4n-2$ points, duplicate the first and third quartile points, including the first quartile in both the first and second sets and including the third quartile in the third and fourth sets, so again you have four sets of $n$ points by rank. And if you have $4n-3$ points, duplicate the median and the first and third quartile points, including them in the relevant sets and again you have four sets of $n$ points by rank. There are other approaches.

In your example with $18$ data points, that would give four equally sized subsets:

1st group: $1, 2, 2.5, 2.5, 2.5$
2nd group: $2.5, 3, 3, 4, 5$
3rd group: $5, 6, 7, 7.5, 7.5$
4th group: $7.5, 8, 9, 9, 10$
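
The duplication rule can be sketched in a few lines of R. This is one possible implementation, not the only one: hinge.groups is a hypothetical helper, and the $18$ data points are reconstructed from the four groups listed above by dropping the duplicated quartile points, so they are an assumption about the original data.

```r
# Split sorted data into four equally sized groups, sharing the median
# and/or quartile points between neighbouring groups when the number of
# points is not a multiple of 4 (hypothetical helper).
hinge.groups <- function(x) {
    x <- sort(x)
    N <- length(x)
    h <- ceiling(N / 2)                        # size of each half (median shared if N is odd)
    halves <- list(x[1:h], x[(N - h + 1):N])
    n <- ceiling(h / 2)                        # size of each group
    groups <- list()
    for (half in halves) {
        groups[[length(groups) + 1]] <- half[1:n]
        groups[[length(groups) + 1]] <- half[(h - n + 1):h]
    }
    groups
}

# Data reconstructed from the groups above (an assumption, see lead-in).
x <- c(1, 2, 2.5, 2.5, 2.5, 3, 3, 4, 5, 5, 6, 7, 7.5, 7.5, 8, 9, 9, 10)
g <- hinge.groups(x)
g
```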

Quantiles (note the change from r to n) are difficult to define easily. Wikipedia gives 10 estimate types while Eric Langford gives 15 methods in the Journal of Statistics Education.
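
R's own quantile function illustrates the point: it implements nine definitions, selected by the type argument, and they need not agree on small samples. A small illustration with made-up data:

```r
x <- c(1, 3, 4, 7, 9, 12, 15)   # made-up data, for illustration only

# First quartile under each of the nine definitions offered by quantile().
q1 <- sapply(X = 1:9, FUN = function(type) {
    quantile(x = x, probs = 0.25, type = type)
})
names(q1) <- paste0('type', 1:9)
q1
```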

Solved – Quantify the difference between two samples

I feel you can fit a linear model and run ANOVA with Tukey's HSD procedure. In R:

# sample data
> t <- structure(list(test = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
     2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("T1", "T2", "T3"), class = "factor"), 
     student = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 5L, 
     4L, 5L, 3L, 6L, 6L, 6L), .Label = c("S1", "S2", "S3", "S4", 
     "S5", "S6"), class = "factor"), score = c(8L, 6L, 9L, 8L, 
     3L, 5L, 5L, 9L, 1L, 9L, 3L, 1L, 9L, 5L, 3L)), .Names = c("test", 
     "student", "score"), class = "data.frame", row.names = c(NA, -15L))

# fit the model and run ANOVA
> m <- aov(score~test+student,t)
> summary(m)
            Df Sum Sq Mean Sq F value Pr(>F)  
test         2  41.20  20.600   6.341 0.0268 *
student      5  57.66  11.532   3.550 0.0644 .
Residuals    7  22.74   3.249                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is a significant difference among tests (p=0.02683), and an almost significant one among students (p=0.06444). Now, which tests are different, and by how much? Run Tukey's HSD procedure:

> hsd <- TukeyHSD(m, which="test", ordered=TRUE) # can also compare students with which="student"
> hsd
   Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = score ~ test + student, data = t)

$test
      diff        lwr    upr     p adj
T2-T3  1.4 -1.9571998 4.7572 0.4752507
T1-T3  4.0  0.6428002 7.3572 0.0235300
T1-T2  2.6 -0.7571998 5.9572 0.1245249

The difference between T1 and T3 is statistically significant (p=0.0235300). The difference you got is 4.0, and a 95% confidence interval is [0.6428002, 7.3572]. Between the other tests there is no significant difference.

You can even plot pairwise test comparisons:

plot(hsd)
