[Math] the probability of finding a certain gene of length $m$ inside our random DNA chain

combinatoricscombinatorics-on-wordsprobability

DNA code is composed of a sequence of four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T).

Consider we have a random DNA chain (not closed) of length $M$. What is the probability of finding a certain gene of length $m$ inside our random DNA chain? Suppose $m<M$.

We assume that the four nucleotides have the same probability to appear in every position of the chain independently.

My homework assignment is to solve the above general problem for $M=100$ and $m=2$, $m=3$

We can formulate the problem as follows

  • our DNA chain is a M-tuple $S=(s_1, \ldots ,s_M)$ with $s_i$'s chosen from the alphabet $\{ \text{ A, G, T, C } \}$
  • And the gene is a m-tuple $T=(t_i, \ldots ,t_m)$

My work

I know that there are $4^M$ different M-length DNA chains and for every one of them our gene could be located at the positions $1,2,\ldots, M-m-1$.

Moreover due to our assumption

\begin{align}P(\{ \text{T is located at position } i\})&= \frac{1}{4^m} \implies\\ P(\{ \text{T is NOT located at position } i\})&= 1- \frac{1}{4^m}\end{align}

But I have no idea how to proceed! Are the events $\{ \text{T is NOT located at position } i\}$ independent or not? Should I take into account the form of the gene?

Any help will be appreciated, thank you

Best Answer

This seems more like the question "What is the probability of getting 5 heads in a row at some point when tossing a fair coin 100 times?". There's probably some niftier generating function approach; but here's my take...

Suppose $m = 3$. As we proceed along the string of nucleotides, we are transitioning from one trigram (3 character string) to another; for example if the string of nucleotides is:

$$AGCTACAACG...$$

we are looking at a sequence of $M-2$ trigrams (note that the last two characters of each trigram in the sequence are the same as the first two characters of the succeeding trigram):

$$AGC, GCT, CTA, TAC, ACA, CAA, AAC, ACG, ...$$

Suppose the pattern we seek is $ACG$. Then it must have preceded by one of the trigrams: $AAC, CAC, GAC, TAC$; and if we encounter one of those 4 trigrams, then in each case there is a $\frac{1}{4}$ chance that the next trigram will be $ACG$. This makes me think that we can model this as a Markov process.

So I think of this as there being 4 states: $ACG, xAC, xxA,$ and $xxx$ (for the last state, this is any trigram that does NOT end with "A", "AC", or "ACG"). Then the sequence of trigrams in the example I gave above is now the sequence of states:

$$xxx, xxx, xxA, xAC, xxA, xxA, xAC, ACG,...$$

The state transitions probabilities $P(\text{from state} | \text{to state})$ are: $$P(xxx | xxx)=3/4, P(xxx | xxA)=1/4$$ $$P(xxA | xxx)=2/4, P(xxA | xxA)=1/4, P(xxA | xAC)=1/4$$ $$P(xAC | xxx)=2/4, P(xAC | xxA)=1/4, P(xAC | ACG)=1/4$$ $$P(ACG | ACG) = 1$$

with all omitted transitions above having $P=0$. $ACG$ is an absorbing state; and we will want to know the probability that our system enters state $ACG$ at some point during our stream of $M-2$ trigrams.

We can then define the transition matrix:

$$T = \begin{bmatrix} \frac{3}{4} & \frac{1}{4} & 0 & 0\\ \frac{1}{2} & \frac{1}{4} & \frac{1}{4} & 0\\ \frac{1}{2} & \frac{1}{4} & 0 & \frac{1}{4}\\ 0 & 0 & 0 & 1\\ \end{bmatrix} $$

and initial state probabilities for the starting trigram / state being $xxx, xxA, xAC,$ and $ACG$ :

$$P=\begin{bmatrix} \frac{43}{64} & \frac{16}{64} & \frac{4}{64} & \frac{1}{64}\\ \end{bmatrix} $$

So $P_4 = \frac{1}{64}$ is the probability for a sequence of length $3$ being in the state $ACG$; and in general if

$$P' = P \cdot T^{M-3}$$

Then $P'_4$ is the probability that a string of $M$ nucleotides contains $ACG$ at least once. In particular, for $M=100$, this equals $\approx 0.796888$, so roughly %80.

The above will hold for any trigram having distinct letters; but things are slightly different for patterns with duplicate letters.