Measuring the Shannon entropy of an ordered sequence

Tags: biology, combinatorics, entropy, information theory, music-theory

I have 927 unique sequences of the numbers 1, 2 and 3, all of which sum to 12 and represent every possible one-octave scale on the piano, each number being the interval between successive notes in half-steps (i.e., adjacent keys). For example, the Major Scale is { 2, 2, 1, 2, 2, 2, 1 }, while the hexatonic (6-note) Blues Scale is { 3, 2, 1, 1, 3, 2 }. It can help mentally to add a zero at the beginning to represent the root note.
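(For concreteness: a cumulative sum turns an interval sequence into the familiar half-step offsets above the root. The little helper below is purely illustrative and not part of my generator.)

```python
from itertools import accumulate

def intervals_to_degrees(intervals):
    """Convert an interval sequence into half-step offsets above the root (0)."""
    return [0] + list(accumulate(intervals))

print(intervals_to_degrees([2, 2, 1, 2, 2, 2, 1]))  # Major Scale -> [0, 2, 4, 5, 7, 9, 11, 12]
print(intervals_to_degrees([3, 2, 1, 1, 3, 2]))     # Blues Scale -> [0, 3, 5, 6, 7, 10, 12]
```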

I'm trying to decide how to categorize them, and believe Shannon entropy is a natural place to begin, with one problem: the standard Shannon entropy formula returns the same figure regardless of the order of the sequence (as we would expect), whereas this is a case where the order is supremely relevant. (Music theory and information theory have a lot in common, though the connection appears to be largely unexplored.)

Let's take a simple example: Both of these are valid scales, though neither has a name, to my knowledge:

{ 1,2,1,2,1,2,1,2 }
{ 2,2,1,2,1,1,1,2 }

Since each sequence contains four 1s and four 2s, the symbol distribution is uniform over {1, 2} in both cases, and we don't need a calculator to see that the Shannon entropy is identical for the two: 1 bit per interval, or 8 bits across the 8 intervals:

$$H = -\sum_{x \in \{1,2\}} \frac{1}{2} \log_{2}\frac{1}{2} = 1 \text{ bit per symbol}$$
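As a quick sanity check, something like the following (plain Python, not tied to anything above) computes the per-symbol entropy from the empirical frequencies and, as expected, cannot tell the two sequences apart:

```python
from collections import Counter
from math import log2

def symbol_entropy(seq):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(symbol_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # 1.0 bit per interval
print(symbol_entropy([2, 2, 1, 2, 1, 1, 1, 2]))  # 1.0 bit per interval
```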

However, I believe (I may be very wrong here) that there is significantly more information contained in the second sequence, given that the first can be communicated as "{1,2}, four times" while the second is considerably more arbitrary. (I believe there is a similar debate in bioinformatics, where one can locate long spans of simple repeating nucleotide sequences that some consider "junk DNA," though others dispute that adjective.)

How does one account for order when measuring surprisal? If I were to take, say, The Great Gatsby and sort the ~261,000 characters (as in letters, etc., not fictional people!) in ASCII order, I don't think it would contain the same amount of information as the original text.

One strategy I tried was a rolling measure of probability, where the algorithm is initially unaware of the total frequency of each number but updates its probability estimates as it goes. (I suppose this is like measuring the accumulating surprisal of a die you cannot see and whose number of sides you don't know.) This did not produce very intuitive or useful results.
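For what it's worth, a minimal sketch of the kind of rolling estimate I mean is below; unlike my actual attempt it assumes the alphabet size of 3 is known up front, and the add-one smoothing is an arbitrary choice rather than anything principled:

```python
from math import log2

def rolling_surprisal(seq, alphabet_size=3):
    """Accumulate -log2 p(x), where p is re-estimated from the counts seen so far,
    with add-one smoothing so unseen symbols still get a finite probability."""
    counts = {}
    total_bits = 0.0
    for t, x in enumerate(seq):
        p = (counts.get(x, 0) + 1) / (t + alphabet_size)
        total_bits += -log2(p)
        counts[x] = counts.get(x, 0) + 1
    return total_bits

print(rolling_surprisal([1, 2, 1, 2, 1, 2, 1, 2]))
print(rolling_surprisal([2, 2, 1, 2, 1, 1, 1, 2]))
```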

Another thought I had was to divide each sequence into "words" that are repeating sub-sequences of at least two digits, then measure the entropy of each word and … I'm not sure what to do with them. Summing $H[\{1,2\}]$ over the four repeated words just reproduces the same total as the unsegmented sequence. I'm a bit out of my depth here, both in reducing sequences to their most efficient sub-sequences and in knowing what to do with them afterward.
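The decomposition half of that idea is at least easy to sketch, checking whether a sequence is just a shorter "word" repeated (my real attempt was fuzzier than this):

```python
def shortest_repeating_word(seq):
    """Return the shortest prefix whose repetition reproduces seq
    (the whole sequence if no shorter period exists)."""
    n = len(seq)
    for k in range(1, n + 1):
        if n % k == 0 and seq == seq[:k] * (n // k):
            return seq[:k]
    return seq

print(shortest_repeating_word([1, 2, 1, 2, 1, 2, 1, 2]))  # [1, 2] -> "{1,2}, four times"
print(shortest_repeating_word([2, 2, 1, 2, 1, 1, 1, 2]))  # whole sequence; no shorter period
```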

I've read Shannon's 1948 paper twice, including the discussion of the relative frequencies of letters in a given language, which seems relevant but is, in fact, not useful in this case, where every possible ordered combination is present and each interval is added independently of those preceding it, subject only to the constraint that the running sum cannot exceed 12.

Which is all to say, how does one quantify the entropy of an ordered sequence, when the order is of extreme relevance but each item is independent of the others? Below I'll include how I generated the 927 sequences for anyone interested, but it's ancillary to the question, I think.

***

First, I wrote a simple recursive algorithm that begins with an empty set and calls itself three times after adding 1, 2 or 3, terminating when the sum is 12 and tossing any that go over. Happy to provide code or pseudo-code, but I think it's intuitive.
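Since a sketch is cheap, here is roughly what I mean, though not necessarily the exact code I ran:

```python
def generate_scales(target=12, intervals=(1, 2, 3)):
    """Recursively build every sequence of 1s, 2s and 3s that sums to `target`."""
    scales = []

    def extend(prefix, remaining):
        if remaining == 0:
            scales.append(prefix)          # landed exactly on the octave
            return
        for step in intervals:
            if step <= remaining:          # toss any branch that would overshoot
                extend(prefix + [step], remaining - step)

    extend([], target)
    return scales

scales = generate_scales()
print(len(scales))  # 927
```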

To check myself, I generated all the combinations of [1, 2, 3] that add to 12 (there are 19) using a coin-change algorithm, and then calculated the distinct permutations of each (treating identical digits as indistinguishable). The permutation counts also sum to 927, ranging from 1 for the sequence of twelve 1s (the Chromatic Scale) or four 3s (a sparse diminished scale) up to 140 possible scales for the intervals { 1, 1, 1, 2, 2, 2, 3 }, a multiset whose orderings include the Harmonic Minor Scale.
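The cross-check itself is just multinomial coefficients; something along these lines reproduces both the 140 and the 927 (the brute-force partition enumeration below stands in for whichever coin-change routine you prefer):

```python
from collections import Counter
from math import factorial

def distinct_permutations(multiset):
    """Number of distinct orderings of a multiset: n! / (k1! * k2! * ...)."""
    n = factorial(len(multiset))
    for count in Counter(multiset).values():
        n //= factorial(count)
    return n

# The 19 multisets of 1s, 2s and 3s summing to 12, enumerated directly.
partitions = [[1] * a + [2] * b + [3] * c
              for c in range(5) for b in range(7) for a in range(13)
              if a + 2 * b + 3 * c == 12]

print(len(partitions))                                    # 19
print(distinct_permutations([1, 1, 1, 2, 2, 2, 3]))       # 140
print(sum(distinct_permutations(p) for p in partitions))  # 927
```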

One more thought

Regarding the above example of four 1s and four 2s: if I were flipping an unbiased coin eight times and coding Heads as "1" and Tails as "2", I would be more surprised, in the everyday sense of the word, to get perfectly alternating results than to get the second sequence with its small clusters. So it's possible that "surprisal" is the wrong frame of mind here, as opposed to information.

Best Answer

You're starting to think of Kolmogorov complexity, which is an (in general uncomputable) measure of "how hard it is to describe" the sequence. It depends entirely on what is allowed to be used to describe sequences (computer programs, in the standard formulation).
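If you want a computable stand-in, off-the-shelf compression is the usual crude proxy. On eight-symbol sequences the container overhead swamps everything, but stretch the patterns out and the effect is obvious; a rough sketch, with nothing canonical about zlib:

```python
import random
import zlib

random.seed(0)
repetitive = bytes([1, 2] * 400)                              # "(1, 2), four hundred times"
arbitrary = bytes(random.choice([1, 2]) for _ in range(800))  # same length, same alphabet

print(len(zlib.compress(repetitive, 9)))  # small: the pattern is easy to describe
print(len(zlib.compress(arbitrary, 9)))   # much larger: ~1 bit per symbol has to be paid
```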

Shannon entropy fundamentally describes how much information per character arrives in a stream drawn from the specified probability distribution. Serial correlations and so on are specifically not accounted for (you may of course treat multi-interval blocks as the characters, and then your second sequence does have higher entropy).
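A quick sketch of that parenthetical, treating overlapping pairs of intervals as the characters (any block length would do):

```python
from collections import Counter
from math import log2

def block_entropy(seq, k=2):
    """Shannon entropy (bits per block) over the overlapping length-k blocks of seq."""
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    counts = Counter(blocks)
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(block_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # ~0.99 bits: only (1,2) and (2,1) occur
print(block_entropy([2, 2, 1, 2, 1, 1, 1, 2]))  # ~1.95 bits: four distinct pairs occur
```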

Another point: surprisal in the sense of "the intervals form an unexpected mathematical sequence" is drastically different from surprisal in the sense of "this is a dissonant interval relative to the previous ones", but you're into atonality, so that may not matter.

Quantifying dissonance is also heavy math, but in the direction of the lattice of rationals (abstract algebra), not information theory.