Solved – Could any equation have predicted the results of this simulation?

Tags: distributions, histogram, probability, r, simulation

Suppose there is a coin that has a 5% chance of landing on HEADS and a 95% chance of landing on TAILS. Based on a computer simulation, I want to find out the following:

  • The average number of flips before observing HEADS, TAILS, HEADS (note: not the probability, but the number of flips)

Using the R programming language, I tried to write this simulation. It keeps flipping a coin until HTH appears and counts the number of flips until this happens; it then repeats the same process 10,000 times:

results <- list()

for (j in 1:10000) {

  response_i <- ''
  i <- 1

  while (response_i != 'HTH') {
    # draw three flips at a time and collapse them into a string such as "HTT"
    response_i <- sample(c("H", "T"), 3, replace = TRUE,
                         prob = c(0.05, 0.95))
    response_i <- paste(response_i, collapse = '')

    iteration_i <- i
    if (response_i == 'HTH') {
      # record the winning attempt and which iteration it occurred on
      run_i <- data.frame(response_i, iteration_i)
      results[[j]] <- run_i
    }
    i <- i + 1
  }
}

data <- do.call('rbind', results)

We can now see a histogram of this data:

hist(data$iteration_i, breaks = 500, main = "Number of Flips Before HTH")

[Histogram: "Number of Flips Before HTH"]

We can also see the summary of this data:

summary(data$iteration_i)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   119.0   288.0   413.7   573.0  3346.0

My Questions:

  • Could any "mathematical equation" have predicted the results of this simulation in advance? Could any "formula" have shown that the average number of flips to get HTH would have been 413? Can Markov Chains be used to solve this problem?

  • Based on the "skewed" shape of this histogram, is the "arithmetic mean" (i.e. mean = sum(x_i)/n) a "faithful" representation of the "true mean"? Looking at the above histogram, we can clearly see that you are more likely to see HTH before 437 iterations than after 437 iterations, e.g. (on 100,000 simulations, the new average is 418):

    nrow(data[which(data$iteration_i < 418), ])
    63184

    nrow(data[which(data$iteration_i > 418), ])
    36739

For such distributions, is there a better method to find out the "expectation" of this experiment?

Thanks!

Best Answer

At any given point in the game, you're $3$ or fewer "perfect flips" away from winning.

For example, suppose you've flipped the following sequence so far: $$ HTTHHHTTTTTTH $$

You haven't won yet, but you could win in two more flips if those two flips are $TH$. In other words, your last flip was $H$ so you have made "one flip" worth of progress toward your goal.

Since you mentioned Markov Chains, let's describe the "state" of the game by how much progress you have made toward the desired sequence $HTH$. At every point in the game, your progress is either $0$, $1$, or $2$--if it reaches $3$, then you have won. So we'll label the states $0$, $1$, $2$. (And if you want, you can say that there's an "absorbing state" called "state $3$".)

You start out in state $0$, of course.

You want to know the expected number of flips, from the starting point, state $0$. Let $E_i$ denote the expected number of flips, starting from state $i$.

At state $0$, what can happen? You can either flip $H$, and move to state $1$, or you flip $T$ and remain in state $0$. But either way, your "flip counter" goes up by $1$. So: $$ E_0 = p (1 + E_1) + (1-p)(1 + E_0), $$ where $p = P(H)$, or equivalently $$ E_0 = 1 + p E_1 + (1-p) E_0. $$ The "$1+$" comes from incrementing your "flip counter".

At state $1$, you want $T$, not $H$. But if you do get an $H$, at least you don't go back to the beginning--you still have an $H$ that you can build on next time. So: $$ E_1 = 1 + p E_1 + (1-p) E_2. $$

At state $2$, you either flip $H$ and win, or you flip $T$ and go all the way back to the beginning. $$ E_2 = 1 + (1-p) E_0. $$

Now solve the three linear equations for the three unknowns.

In particular you want $E_0$. I get $$ E_0 = \left( \frac{1}{p} \right) \left( \frac{1}{p} + \frac{1}{1-p} + 1 \right), $$ which for $p=1/20$ gives $E_0 = 441 + 1/19 \approx 441.0526$. (So the mean is not $413$. In my own simulations I do get results around $441$ on average, at least if I do around $10^5$ or $10^6$ trials.)
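If you would rather not do the algebra by hand, the same system can be handed to a linear solver. Here is a minimal sketch in R (since that is the language you are using); `A` and `b` are just my names for the system rewritten in matrix form $AE = b$ with $E = (E_0, E_1, E_2)$:

p <- 0.05
# Move the unknowns to the left-hand side of each equation:
#     p*E0 -     p*E1            = 1
#           (1-p)*E1 - (1-p)*E2  = 1
# -(1-p)*E0            +     E2  = 1
A <- matrix(c(       p,    -p,        0,
                     0, 1 - p, -(1 - p),
              -(1 - p),     0,        1),
            nrow = 3, byrow = TRUE)
b <- rep(1, 3)
E <- solve(A, b)
E[1]                                  # 441.0526
(1 / p) * (1 / p + 1 / (1 - p) + 1)   # agrees with the closed form above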

In case you are interested, our three linear equations come from the Law of Total Expectation.
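Explicitly, for any partition of the sample space into events $A_1, A_2, \dots$, the law states $$ E[X] = \sum_k P(A_k) \, E[X \mid A_k]. $$ Here we partition on the outcome of the next flip, which is $H$ with probability $p$ and $T$ with probability $1-p$; that is exactly where the terms $p(1 + E_\cdot)$ and $(1-p)(1 + E_\cdot)$ in each equation come from.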

This is really the same as the approach in Stephan Kolassa's answer, but it is a little more efficient because we don't need as many states. For example, there is no real difference between $TTT$ and $HTT$--either way, you're back at the beginning. So we can "collapse" those sequences together, instead of treating them as separate states.
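To make the Markov-chain connection fully explicit: put the one-step transition probabilities among the transient states $0, 1, 2$ into a matrix $Q$ (the absorbing state $3$ is left out). Standard theory for absorbing chains then says the expected numbers of steps to absorption are $t = (I - Q)^{-1} \mathbf{1}$. A sketch of that computation in R, with my own variable names:

p <- 0.05
# One-step transition probabilities among the transient states 0, 1, 2:
Q <- matrix(c(1 - p,     p,     0,   # from 0: T stays at 0, H moves to 1
                  0,     p, 1 - p,   # from 1: H stays at 1, T moves to 2
              1 - p,     0,     0),  # from 2: T goes back to 0 (H absorbs)
            nrow = 3, byrow = TRUE)
# Expected steps to absorption: solve (I - Q) %*% steps = 1
steps <- solve(diag(3) - Q, rep(1, 3))
steps[1]  # 441.0526, the expected number of flips from state 0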

Simulation code (two ways, sorry for using Python instead of R):

# Python 3
import random

def one_trial(p=0.05):
    # p = P(Heads)
    state = 0 # states are 0, 1, 2, representing the progress made so far
    flip_count = 0 # number of flips so far
    while True:
        flip_count += 1
        if state == 0: # empty state
            state = random.random() < p
            # 1 if H, 0 if T
        elif state == 1: # 'H'
            state += random.random() >= p
            # state 1 (H) if flip H, state 2 (HT) if flip T
        else: # state 2, 'HT'
            if random.random() < p: # HTH, game ends!
                return flip_count
            else: # HTT, back to empty state
                state = 0

def slow_trial(p=0.05):
    sequence = ''
    while sequence[-3:] != 'HTH':
        if random.random() < p:
            sequence += 'H'
        else:
            sequence += 'T'
    return len(sequence)

N = 10**5
print(sum(one_trial() for _ in range(N)) / N)
print(sum(slow_trial() for _ in range(N)) / N)