Question
Why does the variance of the number of times we see the subsequences $[0, 0, 0, 1]$ and $[1, 0, 1, 0]$ in a random sequence of 1024 bits differ from each other and from the binomial distribution $B(1021, 2^{-4})$?
How can I calculate the probability and variance of seeing a particular subsequence?
Background
While exploring different ways of evaluating the output of a random number generator, I considered checking whether the number of occurrences of a particular subsequence in a string of bits falls within a confidence interval.
As I played around with this method, I found some surprising results. I generated sequences of 1024 bits and counted the number of times a specific subsequence of length four occurred. I expected that, across the 1021 sliding-window positions, the number of matches would follow a binomial distribution $B(1021, 2^{-4})$, which has a mean of 63.8 and a variance of 59.8.
However, the results I observed were quite different: the variance for $[0, 0, 0, 1]$ came out noticeably smaller, and the variance for $[1, 0, 1, 0]$ noticeably larger, than the binomial prediction.
Other Thoughts
The differences in variance seem to be related to the fact that the subsequences are not independent. For example, if you see the subsequence $[0, 0, 0, 1]$, then shifting the sliding window by one, two, or three places guarantees you will not see that subsequence again, no matter what the next values are. However, for the subsequence $[1, 0, 1, 0]$, you can shift the window by two places and still have a chance of seeing the same subsequence again.
But this reasoning would lead me to believe that the subsequence $[1, 0, 1, 0]$ is more likely, since there would be more chances of seeing it in a sequence. Yet this is not the case.
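One way to see why the means agree even though the variances do not: by linearity of expectation, each of the sliding-window positions matches any fixed length-4 pattern with probability $2^{-4}$, regardless of overlaps, so the expected count is identical for both patterns. The sketch below (my own illustration, not part of the original experiment) brute-forces every bit string of a short length, where the means come out exactly equal while the variances differ:

```python
from itertools import product

def count_pattern(bits, pattern):
    """Count sliding-window occurrences of pattern in bits."""
    k = len(pattern)
    return sum(bits[i:i + k] == pattern for i in range(len(bits) - k + 1))

n = 12  # short enough that enumerating all 2**12 strings is cheap
stats = {}
for pattern in ((0, 0, 0, 1), (1, 0, 1, 0)):
    counts = [count_pattern(s, pattern) for s in product((0, 1), repeat=n)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    stats[pattern] = (mean, var)
    print(pattern, mean, var)
```

Both patterns give a mean of exactly $9/16$ (nine windows, each matching with probability $2^{-4}$), but the exhaustive variance for $(0,0,0,1)$ is smaller than for $(1,0,1,0)$, mirroring the simulation below.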
Code
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate 100,000 random 1024 bit sequences
random_array = np.random.choice(a=[False, True], size=(100000, 1024))
# Generate a 4 element sliding window view
sliding_window_view = np.lib.stride_tricks.sliding_window_view(
    random_array, 4, axis=1
)  # requires np >= 1.20
# Count the number of sliding windows that equal [1, 0, 1, 0]
number_of_1010 = np.all(sliding_window_view == [1, 0, 1, 0], axis=2).sum(axis=1)
# Count the number of sliding windows that equal [0, 0, 0, 1]
number_of_0001 = np.all(sliding_window_view == [0, 0, 0, 1], axis=2).sum(axis=1)
# Generate my expected distribution
expected = np.random.binomial(1021, 2**-4, size=100000)
# Show density plot
sns.kdeplot(
    number_of_1010,
    bw_adjust=0.5,  # seaborn >= 0.11; older versions used bw=0.5
    label=(
        f"[1, 0, 1, 0]; mean {np.mean(number_of_1010):.2f}; var"
        f" {np.var(number_of_1010):.2f}"
    ),
)
sns.kdeplot(
    number_of_0001,
    bw_adjust=0.5,
    label=(
        f"[0, 0, 0, 1]; mean {np.mean(number_of_0001):.2f}; var"
        f" {np.var(number_of_0001):.2f}"
    ),
)
sns.kdeplot(
    expected,
    bw_adjust=0.5,
    label=(
        f"B(1021, 2^-4); mean {np.mean(expected):.2f}; var"
        f" {np.var(expected):.2f}"
    ),
)
plt.legend()
plt.show()
Best Answer
Short answer: The reason the variance for counting $0001$ is smaller is because the occurrences of $0001$ are not independent. For example, if the substring at indices 10 to 13 is $0001$, then that makes it impossible for the substring at 11 to 14 to be $0001$, so the events of $0001$ occurring at those places are negatively correlated.
For the variance of the number of $1010$ occurrences, there are actually some positive correlations, since a $1010$ at spots 10-13 makes it much more likely that a $1010$ occurs at 12-15. There are still negative correlations as well, but it turns out the positives outweigh the negatives, so the variance here is greater than that of the binomial distribution.
Long answer: Letting $N_{0001}$ be the number of occurrences of $0001$, we can write $$ N_{0001}=\sum_{i=1}^{1021}X_i, $$ where $X_i$ is an indicator random variable, equal to one when the $i^{th}$ subsequence is $0001$, and zero otherwise. Then $$ \text{Var}(N_{0001})=\sum_{i=1}^{1021} \text{Var}(X_i)+2\sum_{i<j}\text{Cov}(X_i,X_j). $$ Note that the variance of an indicator random variable with probability $p$ is $p(1-p)$, so the contribution of the first sum is $1021\cdot 2^{-4}(1-2^{-4})$. This is exactly the variance of the corresponding binomial distribution.
For the sum of the covariances, you can show that each $\text{Cov}(X_i,X_j)\le 0$. Whenever $|i-j|\ge 4$, $X_i$ and $X_j$ are actually independent, since they depend on disjoint sets of coin flips, meaning $\text{Cov}(X_i,X_j)=0$ in those cases. But when $i$ and $j$ are close together, $X_i$ and $X_j$ are negatively correlated, since $X_i$ occurring implies $X_j$ cannot occur. This is because occurrences of $0001$ cannot overlap. To be precise, whenever $|i-j|\le 3$, $$ \text{Cov}(X_i,X_j)=E[X_iX_j]-E[X_i]E[X_j]=P(X_i\cap X_j)-P(X_i)P(X_j)=\color{red}0-(2^{-4})(2^{-4}) $$
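The red zero (and the blue $2^{-6}$ that appears below) can be checked by brute force: enumerate all bit strings just long enough to contain two windows at offset $d$, and count how often the pattern occurs at both positions. A quick sketch, with `joint_prob` being a helper name of my own choosing:

```python
from itertools import product

def joint_prob(pattern, d):
    """P(pattern occurs at position 0 AND at position d), by enumeration."""
    k = len(pattern)
    n = k + d  # number of bits needed to cover both windows
    hits = sum(
        bits[:k] == pattern and bits[d:d + k] == pattern
        for bits in product((0, 1), repeat=n)
    )
    return hits / 2 ** n

for d in (1, 2, 3):
    print(d, joint_prob((0, 0, 0, 1), d), joint_prob((1, 0, 1, 0), d))
```

For $0001$ the joint probability is zero at every offset $d\in\{1,2,3\}$, confirming that occurrences cannot overlap; for $1010$ it is zero at offsets 1 and 3 but $2^{-6}$ at offset 2 (the string $101010$).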
Therefore, the second sum has a negative contribution, so the variance is less.
Now, let us apply this method to compute the variance of $N_{1010}$. It is still true that $\text{Cov}(X_i,X_j)$ is zero when $|i-j|\ge 4$. It is also still negative when $|i-j|=1$ or $|i-j|=3$, since $1010$ substrings cannot overlap at those offsets. But when $|i-j|=2$, we have $$ \text{Cov}(X_i,X_{i+2})=P(X_i\cap X_{i+2})-P(X_i)P(X_{i+2})=\color{blue}{2^{-6}}-(2^{-4})(2^{-4})>0 $$ So in this case, the occurrences are positively correlated. So, the summation $\sum_{i<j} \text{Cov}(X_i,X_j)$ has some positive and some negative terms. Apparently, the net effect is that the positives outweigh the negatives, since your simulations show the variance of $N_{1010}$ is greater than for the binomial case. I would not have been able to predict this (and I got this wrong when I first wrote this answer), so I cannot really say what the intuition is here.
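In fact, the decomposition above pins down the exact variances, so no simulation is needed. There are $1021-d$ pairs at each offset $d$, each contributing $P(\text{joint at offset } d)-2^{-8}$ to the covariance sum. A short sketch (the helper name `exact_variance` is mine) plugging in the joint probabilities derived above:

```python
def exact_variance(n_windows, p, joint):
    """Var(N) = n*p*(1-p) + 2 * sum_d (n - d) * (P_joint(d) - p**2).

    joint maps each overlapping offset d to P(pattern at i AND at i+d);
    offsets >= pattern length are independent and contribute nothing.
    """
    var = n_windows * p * (1 - p)
    for d, pj in joint.items():
        var += 2 * (n_windows - d) * (pj - p * p)
    return var

p = 2 ** -4
n = 1021
print(exact_variance(n, p, {1: 0, 2: 0, 3: 0}))        # [0, 0, 0, 1]: about 35.94
print(exact_variance(n, p, {1: 0, 2: 2 ** -6, 3: 0}))  # [1, 0, 1, 0]: about 67.79
```

So the exact variances are $9201/256\approx 35.94$ for $0001$ and $17353/256\approx 67.79$ for $1010$, one well below and one well above the binomial value of $15315/256\approx 59.82$, which should match what the simulations report.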