Problem about the probability of observing a particular DNA sequence, related to Beta-Binomial conjugacy

Tags: expected-value, probability, random, statistics

I have some questions about the following problem from *Introduction to Probability* by Joseph K. Blitzstein and Jessica Hwang.

Problem statement

A DNA sequence can be represented as a sequence of letters, where the
“alphabet” has 4 letters: A,C,T,G. Suppose such a sequence is
generated randomly, where the letters are independent and the
probabilities of A,C,T,G are p1, p2, p3, p4 respectively.

Assume that the pj are unknown. Suppose we treat p2 as a Unif(0, 1)
r.v. before observing any data, and that then the first 3 letters
observed are “CAT”. Given this information, what is the probability
that the next letter is C?

Let C be the event that the first 3 letters observed are “CAT.” Then using a form of Bayes' rule,

$$f(p_2 \mid C) = \frac{P(C \mid P_2 = p_2)\,f(p_2)}{P(C)} = \frac{p_2\left(\frac{1-p_2}{3}\right)^2 (1)}{1/108}$$

where I substituted $p_2\left(\frac{1-p_2}{3}\right)^2$ for $P(C \mid P_2 = p_2)$, since by symmetry I assumed the three remaining letters are equally likely, each with probability $\frac{1-p_2}{3}$, once $p_2$ is known. I calculated $P(C)$ as follows.

$$P(C) = \int_{0}^{1} P(C \mid p_2)\,f(p_2)\,dp_2 = \int_{0}^{1} p_2\left(\frac{1-p_2}{3}\right)^2 (1)\,dp_2 = \frac{1}{108}$$

Finally, I calculated the probability of the next letter being C given that the first three letters are 'CAT' as follows.

$$P(\text{next letter} = C) = \int_{0}^{1} P(\text{next letter} = C \mid p_2)\,f(p_2 \mid C)\,dp_2 \\
P(\text{next letter} = C) = \int_{0}^{1} p_2\,f(p_2 \mid C)\,dp_2 \\
P(\text{next letter} = C) = 108\int_{0}^{1} \left(p_2\left(\frac{1-p_2}{3}\right)^2\right)\left(p_2\left(\frac{1-p_2}{3}\right)^2\right) dp_2 = \frac{108}{81}\int_{0}^{1} p_2^2(1-p_2)^4\,dp_2 = \frac{4}{315}$$
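(A quick symbolic sanity check of these two integrals, a sketch using sympy with my own helper names, confirms the arithmetic, so my mistake must be conceptual rather than computational.)

```python
# Sanity check of the two integrals above (a sketch using sympy,
# assuming it is installed; the variable names are my own).
import sympy as sp

p2 = sp.symbols('p2', positive=True)

# P(C) = int_0^1 p2 * ((1 - p2)/3)^2 * 1 dp2
P_C = sp.integrate(p2 * ((1 - p2) / 3)**2, (p2, 0, 1))
print(P_C)  # 1/108

# My posterior predictive: the posterior density is the likelihood
# times the uniform prior, divided by P(C).
posterior = p2 * ((1 - p2) / 3)**2 / P_C
pred = sp.integrate(p2 * ((1 - p2) / 3)**2 * posterior, (p2, 0, 1))
print(pred)  # 4/315
```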

The correct answer is 2/5. Can someone explain what I am doing wrong?

Best Answer

Your computation runs into trouble almost from the very start; let me illustrate with an example.

Suppose I have a large bag containing candies of four different colors: red, green, blue, and yellow. Without assuming anything about the frequencies of each color in the bag, I draw three candies at random. You are solely interested in estimating the proportion $p_r$ of red candies in the bag, based on the sample I draw.

Now, suppose I tell you that in the three candies I've drawn, there is exactly one red candy. How do you represent this information with respect to the parameter $p_r$?

Now, instead suppose I tell you that in the three candies I've drawn, there is one red, one yellow, and one blue candy. Is the likelihood function for $p_r$ any different?

Finally, suppose I tell you that the candies I drew, in order, are blue, red, and yellow, respectively. Is this any more informative about $p_r$ than the previous example?

In fact, all three situations are equivalent with respect to the likelihood of $p_r$, because you don't care that the other candies are blue or yellow--all that matters is that they are not red. The situation would be precisely the same if, instead of blue, yellow, and green, there were only two types of candies: red and not red. Essentially, you would be colorblind to every color besides red. This would not be the case if you were interested in estimating all of the parameters of the underlying multinomial distribution, but here you are specifically asked about one color--or in your question, one DNA base.
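To make this concrete, here is a short symbolic check (a sketch using sympy; $q_b$ and $q_y$ are hypothetical symbols I introduce for the conditional shares of blue and yellow within the non-red probability mass). The likelihood of the exact ordered sequence is proportional, as a function of $p_r$, to the likelihood of "exactly one red in three draws":

```python
# The likelihoods of "exactly one red in three draws" and of the exact
# ordered sequence blue, red, yellow are proportional in p_r (a sketch
# using sympy; q_b, q_y are hypothetical shares of blue and yellow
# within the non-red mass, assumed not to depend on p_r).
import sympy as sp

p_r, q_b, q_y = sp.symbols('p_r q_b q_y', positive=True)

L_count = 3 * p_r * (1 - p_r)**2                     # Binomial(3, p_r), one success
L_seq = (q_b * (1 - p_r)) * p_r * (q_y * (1 - p_r))  # blue, then red, then yellow

# The ratio contains no p_r, so both events carry the same
# information about p_r.
print(sp.simplify(L_seq / L_count))  # q_b*q_y/3
```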

Therefore, we will reframe your question: the order of the observed bases is irrelevant; all that matters is how many of the observed bases are "C". This count is a binomial random variable $X_2$ with parameters $n$ and $p_2$, where $p_2$ is the parameter to be estimated.

Consequently, you observed $X_2 = 1$ with $n = 3$, and $$f(p_2 \mid X_2 = 1) = \frac{\Pr[X_2 = 1 \mid p_2]\,f(p_2)}{\Pr[X_2 = 1]}.$$

The prior, $f(p_2)$, is uniform on $[0,1]$, thus $f(p_2) = 1$ for $0 \le p_2 \le 1$. The likelihood of the observed value conditional on $p_2$ is $$\Pr[X_2 = 1 \mid p_2] = \binom{3}{1} p_2^1 (1 - p_2)^{3-1} = 3 p_2 (1-p_2)^2.$$

The marginal probability is $$\Pr[X_2 = 1] = \int_{p_2 = 0}^1 \Pr[X_2 = 1 \mid p_2] f(p_2) \, dp_2 = \int_{p_2 = 0}^1 3 p_2 (1-p_2)^2 \, dp_2 = \frac{1}{4}.$$

Consequently, $$f(p_2 \mid X_2 = 1) = 12 p_2 (1-p_2)^2, \quad 0 \le p_2 \le 1.$$

This is the posterior distribution of $p_2$ based on the observed data and uniform prior; it is a $\text{Beta}(2,3)$ density, exactly as Beta-Binomial conjugacy predicts. The posterior predictive distribution is clearly Bernoulli: it is a random variable, say $B$, whose value is $1$ if the next letter is "C", and $0$ otherwise. We compute $$\begin{align} \Pr[B = 1 \mid X_2 = 1] &= \int_{p_2 = 0}^1 \Pr[B = 1 \mid p_2]\,f(p_2 \mid X_2 = 1) \, dp_2 \\ &= \int_{p_2 = 0}^1 p_2 \cdot 12 p_2 (1-p_2)^2 \, dp_2 \\ &= 12 \cdot \frac{2! \, 2!}{5!} = \frac{2}{5} \end{align}$$ as claimed.
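If you want to convince yourself numerically, a direct Monte Carlo simulation of the original problem reproduces the answer (a sketch; splitting $1 - p_2$ equally among A, T, G is a simulation convenience only and, by the argument above, does not affect the result):

```python
# Monte Carlo check of the posterior predictive (a sketch; the equal
# split of 1 - p2 among A, T, G is a convenience that does not change
# the answer, per the "colorblind" argument above).
import random

hits = total = 0
for _ in range(1_000_000):
    p2 = random.random()                    # Unif(0,1) prior draw for p2
    q = (1 - p2) / 3
    seq = random.choices("CATG", weights=[p2, q, q, q], k=4)
    if seq[:3] == ["C", "A", "T"]:          # condition on observing "CAT"
        total += 1
        hits += seq[3] == "C"               # is the next letter "C"?

print(hits / total)  # approx 0.4 = 2/5
```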

Note that you did obtain the correct posterior distribution, but through flawed reasoning: the assumption that the frequencies of the other letters are equal need not, and should not, be made. The main problem, however, is your calculation of the posterior predictive: given $p_2$, the probability that the next letter is "C" is just $p_2$, yet you integrated the full three-letter likelihood $p_2\left(\frac{1-p_2}{3}\right)^2$ against the posterior, which is why you obtained $4/315$ instead of $2/5$.
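As a closing remark, the Beta-Binomial conjugacy from your title gives the whole answer in one line: a $\text{Beta}(a, b)$ prior updated with $x$ successes in $n$ Bernoulli trials yields a $\text{Beta}(a + x,\, b + n - x)$ posterior, and the posterior predictive probability of a further success is the posterior mean. Here $\text{Unif}(0,1) = \text{Beta}(1,1)$, so the posterior is $\text{Beta}(2,3)$ with mean $2/5$. A sketch using scipy.stats, assuming it is available:

```python
# Beta-Binomial conjugacy shortcut (a sketch using scipy.stats):
# Beta(1,1) prior = Unif(0,1); observing x = 1 "C" in n = 3 letters
# gives the Beta(1 + x, 1 + n - x) = Beta(2, 3) posterior, whose mean
# is the posterior predictive probability that the next letter is "C".
from scipy import stats

a0, b0 = 1, 1
x, n = 1, 3
posterior = stats.beta(a0 + x, b0 + n - x)
print(posterior.mean())  # 0.4 = 2/5
```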
