Probability of finding 5-nucleotide long sequence in random sequence

combinatorics, probability

DNA code is composed of a sequence of four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). We assume (for simplicity) that $p(A)=p(T)=p(C)=p(G)=0.25$.

My question is: I have a random DNA chain of length $M=14$ whose position 1 is always a C. What is the probability of finding the sequence 'CTAGG' ($m=5$) somewhere within the chain?
Note that this sequence can't overlap with itself.

Best Answer

As in the previous question, you get a very good approximation of the probability of occurrence from the expected number of occurrences. We have a special chance of $4^{-4}=2^{-8}$ of getting CTAGG in the first position because of the fixed C, and then we have probability $4^{-5}=2^{-10}$ to get an occurrence in each of the remaining $14-5+1-1=9$ slots, for a total expected number of $2^{-8}+9\cdot2^{-10}=13\cdot2^{-10}$.
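The expected-count arithmetic above can be reproduced with exact fractions. This is only a minimal sketch; the variable names (`p_first`, `p_other`, `other_slots`) are illustrative and not part of the original answer.

```python
from fractions import Fraction

M, m = 14, 5  # chain length and pattern length from the question

# Start position 1: the C is already fixed, so only 4 letters must match.
p_first = Fraction(1, 4 ** (m - 1))
# Any other start position: all 5 letters must match.
p_other = Fraction(1, 4 ** m)
# Number of start positions other than position 1.
other_slots = (M - m + 1) - 1

expected = p_first + other_slots * p_other
print(expected)  # 13/1024
```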

To get the exact probability, we are again in the fortunate situation that at most two occurrences are possible, so we don't have to do a full inclusion–exclusion calculation and can just subtract out the double occurrences.

If the first occurrence uses the initial C, then the probability for it is $4^{-4}$, the probability for the second one is $4^{-5}$, and there are $14-2\cdot5+1=5$ possible positions for it, so the probability for this is $5\cdot2^{-18}$.

If the first occurrence doesn't use the initial C, then the probability for each occurrence is $4^{-5}$, and we can choose their positions in $\binom{13-2\cdot5+2}{2}=\binom{5}2=10$ ways, so the probability for this is $10\cdot2^{-20}=5\cdot2^{-19}$.
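The two position counts (5 and 10) can be confirmed by listing every pair of non-overlapping start positions. The following is a small sketch with illustrative names, assuming 1-based start positions as in the text.

```python
from fractions import Fraction

M, m = 14, 5
starts = range(1, M - m + 2)  # possible start positions 1..10

# Two occurrences cannot overlap, so the second starts at least m later.
pairs = [(i, j) for i in starts for j in starts if j >= i + m]
with_c = [p for p in pairs if p[0] == 1]     # first occurrence uses the fixed C
without_c = [p for p in pairs if p[0] != 1]  # neither occurrence uses it
print(len(with_c), len(without_c))  # 5 10

# Probability mass of the double occurrences in each case.
p_double = (len(with_c) * Fraction(1, 4 ** (2 * m - 1))
            + len(without_c) * Fraction(1, 4 ** (2 * m)))
print(p_double)  # 15/524288
```

Three occurrences would need at least 15 positions, which is why the list of pairs is all that has to be subtracted.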

Thus, the probability for at least one occurrence is

$$ \frac{13}{2^{10}}-\frac5{2^{18}}-\frac5{2^{19}}=\frac{13}{2^{10}}-\frac{15}{2^{19}}\approx 0.0127\;. $$

Since this differs from the answer provided by Jaroslaw Matlak, I wrote some code to check it.
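A check of this kind can be sketched as follows. This is not the code referred to above but an independent, minimal sketch: it counts all chains that contain the pattern with a small state machine (state = number of pattern letters currently matched) instead of enumerating all $4^{13}$ chains. The function name `prob_contains` and its parameters are hypothetical.

```python
from fractions import Fraction

def prob_contains(pattern, length, first, alphabet="ACGT"):
    """Exact probability that a uniform random chain of the given length,
    whose first letter is fixed to `first`, contains `pattern`."""
    m = len(pattern)

    def step(k, ch):
        # New matched length after reading ch with k letters matched:
        # the longest suffix of pattern[:k] + ch that is a prefix of pattern.
        s = pattern[:k] + ch
        for j in range(min(len(s), m), 0, -1):
            if s.endswith(pattern[:j]):
                return j
        return 0

    # counts[k] = number of prefixes ending in state k; state m is absorbing.
    counts = [0] * (m + 1)
    counts[step(0, first)] = 1
    for _ in range(length - 1):
        nxt = [0] * (m + 1)
        nxt[m] = counts[m] * len(alphabet)  # pattern already found
        for k in range(m):
            if counts[k]:
                for ch in alphabet:
                    nxt[step(k, ch)] += counts[k]
        counts = nxt
    return Fraction(counts[m], len(alphabet) ** (length - 1))

p = prob_contains("CTAGG", 14, "C")
print(p)  # 6641/524288
```

The result equals $\frac{13}{2^{10}}-\frac{15}{2^{19}}$, about $0.01267$, in agreement with the calculation above.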