Why does the variance of the number of occurrences of a subsequence in a random sequence depend on the subsequence

combinatorics, discrete-mathematics, probability

Question

Why does the variance of the number of times we see the subsequences $[0, 0, 0, 1]$ and $[1, 0, 1, 0]$ in a random sequence of 1024 bits differ from one another, and from the binomial distribution $B(1021, 2^{-4})$?

How can I calculate the probability and variance of seeing a particular subsequence?

Background

While exploring different ways of evaluating the output of a random number generator, I considered checking whether the number of occurrences of a particular subsequence in a string of bits falls within a confidence interval.

As I played around with this method, I found some surprising results. I generated sequences of 1024 bits and counted the number of times a specific subsequence of length four occurred. I expected that out of the 1021 sliding window positions, the distribution of the number of matches would follow a binomial distribution of $B(1021,2^{-4})$, which would have a mean of 63.8 and a variance of 59.8.
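For reference, the binomial figures quoted above can be checked in a couple of lines:

```python
# Mean and variance of the expected binomial B(1021, 2^-4)
n, p = 1021, 2**-4
mean, var = n * p, n * p * (1 - p)
print(mean, var)  # 63.8125 59.82421875
```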

However, the results I observed were quite different:

Simulation Chart Output

Other Thoughts

The differences in variance seem to be related to the fact that the window matches are not independent. If you see the subsequence $[0, 0, 0, 1]$, then shifting the sliding window by one, two, or three places guarantees you will not see that subsequence again, whatever the next bits are. For the subsequence $[1, 0, 1, 0]$, however, you can shift the window by two places and still have a chance of seeing the same subsequence again.

But this would lead me to believe that the subsequence $[1, 0, 1, 0]$ is more likely, since you would have more chances of seeing it in a sequence. That is not the case: the mean counts are the same, and only the variances differ.
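The overlap behaviour described above can be checked mechanically. The helper below (my own sketch; the function name is made up) tests whether a pattern can match again $d$ positions after a match, i.e. whether its suffix of length $L-d$ equals its prefix of the same length:

```python
def can_overlap_at(pattern, d):
    """True if a match of `pattern` can be followed by another match
    d positions later, i.e. its suffix of length len-d equals its
    prefix of the same length."""
    return pattern[d:] == pattern[:len(pattern) - d]

for pat in [(0, 0, 0, 1), (1, 0, 1, 0)]:
    shifts = [d for d in range(1, len(pat)) if can_overlap_at(pat, d)]
    print(pat, shifts)  # (0,0,0,1) allows no overlap; (1,0,1,0) overlaps at shift 2
```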

Code

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Generate 100,000 random 1024 bit sequences
random_array = np.random.choice(a=[False, True], size=(100000, 1024))

# Generate a 4 element sliding window view
sliding_window_view = np.lib.stride_tricks.sliding_window_view(
    random_array, 4, axis=1
)  # requires np >= 1.20

# Count the number of sliding windows that equal [1, 0, 1, 0]
number_of_1010 = np.all(sliding_window_view == [1, 0, 1, 0], axis=2).sum(axis=1)
# Count the number of sliding windows that equal [0, 0, 0, 1]
number_of_0001 = np.all(sliding_window_view == [0, 0, 0, 1], axis=2).sum(axis=1)
# Generate my expected distribution
expected = np.random.binomial(1021, 2**-4, size=100000)

# Show density plot (seaborn removed the old `bw` argument in 0.11;
# `bw_adjust` scales the default bandwidth instead)
sns.kdeplot(
    number_of_1010,
    bw_adjust=0.5,
    label=(
        f"[1, 0, 1, 0]; mean {np.mean(number_of_1010):.2f}; var"
        f" {np.var(number_of_1010):.2f}"
    ),
)
sns.kdeplot(
    number_of_0001,
    bw_adjust=0.5,
    label=(
        f"[0, 0, 0, 1]; mean {np.mean(number_of_0001):.2f}; var"
        f" {np.var(number_of_0001):.2f}"
    ),
)
sns.kdeplot(
    expected,
    bw_adjust=0.5,
    label=(
        f"B(1021, 2^-4); mean {np.mean(expected):.2f}; var"
        f" {np.var(expected):.2f}"
    ),
)
plt.legend()
plt.show()

Best Answer

Short answer: The reason the variance for counting $0001$ is smaller is that the occurrences of $0001$ are not independent. For example, if the substring at indices 10 to 13 is $0001$, then it is impossible for the substring at 11 to 14 to be $0001$, so the events of $0001$ occurring at those places are negatively correlated.

For the variance of the number of $1010$ occurrences, there are actually some positive correlations, since a $1010$ at spots 10-13 makes it much more likely that a $1010$ occurs at 12-15. There are still negative correlations as well, but it turns out the positives outweigh the negatives, so the variance here is greater than that of the binomial distribution.

Long answer: Letting $N_{0001}$ be the number of occurrences of $0001$, we can write $$ N_{0001}=\sum_{i=1}^{1021}X_i, $$ where $X_i$ is an indicator random variable, equal to one when the $i^{th}$ window is $0001$, and zero otherwise. Then $$ \text{Var}(N_{0001})=\sum_{i=1}^{1021} \text{Var}(X_i)+2\sum_{i<j}\text{Cov}(X_i,X_j). $$ Note that the variance of an indicator random variable with probability $p$ is $p(1-p)$, so the contribution of the first sum is $1021\cdot 2^{-4}(1-2^{-4})$. This is exactly the variance of the corresponding binomial distribution.

For the sum of the covariances, you can show that each $\text{Cov}(X_i,X_j)\le 0$. Whenever $|i-j|\ge 4$, $X_i$ and $X_j$ are actually independent, since they depend on disjoint sets of coin flips, meaning $\text{Cov}(X_i,X_j)=0$ in those cases. But when $i$ and $j$ are close together, $X_i$ and $X_j$ are negatively correlated, since $X_i$ occurring implies $X_j$ cannot occur. This is because occurrences of $0001$ cannot overlap. To be precise, whenever $|i-j|\le 3$, $$ \text{Cov}(X_i,X_j)=E[X_iX_j]-E[X_i]E[X_j]=P(X_i\cap X_j)-P(X_i)P(X_j)=\color{red}0-(2^{-4})(2^{-4}) $$

Therefore, the second sum has a negative contribution, so the variance is less.
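Plugging in the numbers makes this concrete. The short sketch below (variable names are my own) evaluates the decomposition for $0001$ with $m = 1021$ window positions and $p = 2^{-4}$:

```python
m, p = 1021, 2**-4                       # window positions; P(window matches)
binom_var = m * p * (1 - p)              # diagonal term, ~59.82

# For offsets d = 1, 2, 3 occurrences of 0001 cannot overlap, so
# P(both windows match) = 0 and each of the (m - d) pairs at offset d
# contributes covariance 0 - p^2.
cov_sum = sum((m - d) * (0 - p * p) for d in range(1, 4))

var_0001 = binom_var + 2 * cov_sum
print(var_0001)  # ~35.94, well below the binomial 59.82
```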

Now, let us apply this method to compute the variance of $N_{1010}$. It is still true that $\text{Cov}(X_i,X_j)$ is zero when $|i-j|\ge 4$. It is also still negative when $|i-j|=1$ or $|i-j|=3$, since $1010$ substrings cannot overlap at those offsets. But when $|i-j|=2$, we have $$ \text{Cov}(X_i,X_{i+2})=P(X_i\cap X_{i+2})-P(X_i)P(X_{i+2})=\color{blue}{2^{-6}}-(2^{-4})(2^{-4})>0 $$ So in this case, the occurrences are positively correlated. So, the summation $\sum_{i<j} \text{Cov}(X_i,X_j)$ has some positive and some negative terms. Apparently, the net effect is that the positives outweigh the negatives, since your simulations show the variance of $N_{1010}$ is greater than for the binomial case. I would not have been able to predict this (and I got this wrong when I first wrote this answer), so I cannot really say what the intuition is here.
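The net effect can be computed exactly rather than predicted. This sketch (my own variable names) plugs the three covariance values for $1010$ into the same decomposition:

```python
m, p = 1021, 2**-4
binom_var = m * p * (1 - p)

# Offsets 1 and 3: 1010 cannot overlap itself, so P(both) = 0.
# Offset 2: both windows match iff the six bits read 101010, so P(both) = 2^-6.
cov = {1: 0 - p * p, 2: 2**-6 - p * p, 3: 0 - p * p}

var_1010 = binom_var + 2 * sum((m - d) * c for d, c in cov.items())
print(var_1010)  # ~67.79, above the binomial 59.82
```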
