DNA code is composed of a sequence of four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T).
Consider we have a random DNA chain (not closed) of length $M$. What is the probability of finding a certain gene of length $m$ inside our random DNA chain? Suppose $m<M$.
We assume that the four nucleotides have the same probability to appear in every position of the chain independently.
My homework assignment is to solve the above general problem for $M=100$ and $m=2$, $m=3$
We can formulate the problem as follows
- our DNA chain is a M-tuple $S=(s_1, \ldots ,s_M)$ with $s_i$'s chosen from the alphabet $\{ \text{ A, G, T, C } \}$
- And the gene is a m-tuple $T=(t_i, \ldots ,t_m)$
My work
I know that there are $4^M$ different M-length DNA chains and for every one of them our gene could be located at the positions $1,2,\ldots, M-m-1$.
Moreover due to our assumption
\begin{align}P(\{ \text{T is located at position } i\})&= \frac{1}{4^m} \implies\\ P(\{ \text{T is NOT located at position } i\})&= 1- \frac{1}{4^m}\end{align}
But I have no idea how to proceed! Are the events $\{ \text{T is NOT located at position } i\}$ independent or not? Should I take into account the form of the gene?
Any help will be appreciated, thank you
Best Answer
This seems more like the question "What is the probability of getting 5 heads in a row at some point when tossing a fair coin 100 times?". There's probably some niftier generating function approach; but here's my take...
Suppose $m = 3$. As we proceed along the string of nucleotides, we are transitioning from one trigram (3 character string) to another; for example if the string of nucleotides is:
$$AGCTACAACG...$$
we are looking at a sequence of $M-2$ trigrams (note that the last two characters of each trigram in the sequence are the same as the first two characters of the succeeding trigram):
$$AGC, GCT, CTA, TAC, ACA, CAA, AAC, ACG, ...$$
Suppose the pattern we seek is $ACG$. Then it must have preceded by one of the trigrams: $AAC, CAC, GAC, TAC$; and if we encounter one of those 4 trigrams, then in each case there is a $\frac{1}{4}$ chance that the next trigram will be $ACG$. This makes me think that we can model this as a Markov process.
So I think of this as there being 4 states: $ACG, xAC, xxA,$ and $xxx$ (for the last state, this is any trigram that does NOT end with "A", "AC", or "ACG"). Then the sequence of trigrams in the example I gave above is now the sequence of states:
$$xxx, xxx, xxA, xAC, xxA, xxA, xAC, ACG,...$$
The state transitions probabilities $P(\text{from state} | \text{to state})$ are: $$P(xxx | xxx)=3/4, P(xxx | xxA)=1/4$$ $$P(xxA | xxx)=2/4, P(xxA | xxA)=1/4, P(xxA | xAC)=1/4$$ $$P(xAC | xxx)=2/4, P(xAC | xxA)=1/4, P(xAC | ACG)=1/4$$ $$P(ACG | ACG) = 1$$
with all omitted transitions above having $P=0$. $ACG$ is an absorbing state; and we will want to know the probability that our system enters state $ACG$ at some point during our stream of $M-2$ trigrams.
We can then define the transition matrix:
$$T = \begin{bmatrix} \frac{3}{4} & \frac{1}{4} & 0 & 0\\ \frac{1}{2} & \frac{1}{4} & \frac{1}{4} & 0\\ \frac{1}{2} & \frac{1}{4} & 0 & \frac{1}{4}\\ 0 & 0 & 0 & 1\\ \end{bmatrix} $$
and initial state probabilities for the starting trigram / state being $xxx, xxA, xAC,$ and $ACG$ :
$$P=\begin{bmatrix} \frac{43}{64} & \frac{16}{64} & \frac{4}{64} & \frac{1}{64}\\ \end{bmatrix} $$
So $P_4 = \frac{1}{64}$ is the probability for a sequence of length $3$ being in the state $ACG$; and in general if
$$P' = P \cdot T^{M-3}$$
Then $P'_4$ is the probability that a string of $M$ nucleotides contains $ACG$ at least once. In particular, for $M=100$, this equals $\approx 0.796888$, so roughly %80.
The above will hold for any trigram having distinct letters; but things are slightly different for patterns with duplicate letters.