Probability of having a specific k-mer in a randomised sequence of length 13

combinatorics, probability

DNA code is composed of a sequence of four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T).

Suppose we have a random DNA chain (not closed) of length 𝑀. What is the probability of finding a certain gene of length 𝑚 inside this chain? Assume 𝑚<𝑀.

We assume that the four nucleotides are equally likely to appear at every position of the chain, independently of the other positions.

My homework assignment is to solve the above general problem for 𝑀=13, 𝑚=6, where the gene is the specific base sequence CCTAGG.

There are 4^13 possible nucleotide sequences for a random chain of length 𝑀=13.

But I have no idea how to proceed from there.

Best Answer

The question cannot be answered in general because the probability depends on the gene. For example, the probability of finding CC in a chain of length $3$ is $\frac7{64}$ (the $7$ chains containing it are CCC, CCA, CCG, CCT, ACC, GCC and TCC) whereas the probability of finding AT in a chain of length $3$ is $\frac8{64}=\frac18$ (the $8$ chains containing it are AT* and *AT, where each * stands for any of the four bases).
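The two small counts above are easy to check by brute force. The following is a minimal sketch in Python, not part of the original argument; the function name `count_containing` is made up for illustration. It simply enumerates all $4^3=64$ chains and counts those that contain the gene.

```python
from itertools import product


def count_containing(pattern: str, length: int, alphabet: str = "ACGT") -> int:
    """Count the chains of the given length that contain `pattern` at least once.

    Brute force over all len(alphabet)**length chains, so only suitable
    for short chains such as the length-3 example.
    """
    return sum(
        pattern in "".join(chain)
        for chain in product(alphabet, repeat=length)
    )


if __name__ == "__main__":
    for gene in ("CC", "AT"):
        hits = count_containing(gene, 3)
        print(f"{gene}: {hits} of {4 ** 3} chains contain it")
```

The difference between the two counts comes from self-overlap: CCC contains CC at two positions but is only one chain, whereas AT cannot occupy both positions of a chain of length $3$ at once.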

In your concrete example of the gene CCTAGG in a chain of length $13$, you can get a very good approximation by calculating the expected number of occurrences of the gene in the chain. This is simply the number of slots in which it might occur, $M-m+1=13-6+1=8$, times the probability for it to occur in any one of these slots, which is $4^{-m}=4^{-6}=2^{-12}$. Thus we expect $8\cdot2^{-12}=2^{-9}$ occurrences of the gene, and since it's very unlikely to occur more than once, this is almost exactly the probability of occurrence.

The gene can't occur more than twice in a chain of length $13$, so to calculate the exact probability of occurrence we just have to subtract the cases where it occurs twice, since the expected value counts these twice. This is easy here because this particular gene has no potential for self-overlap: two occurrences together occupy $12$ of the $13$ positions, so there are only three possible pairs of starting positions, namely $(1,7)$, $(1,8)$ and $(2,8)$. In each case one base remains free to be chosen, for a total of $3\cdot4=12$ chains with two occurrences out of $4^{13}=2^{26}$ chains. Thus the probability of occurrence of this particular gene (and of any other gene of length $6$ that can't overlap with itself) is

$$ \frac1{2^9}-\frac{12}{2^{26}}=\frac1{2^9}-\frac3{2^{24}}\approx0.001953\;. $$
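This value can be cross-checked without enumerating all $4^{13}$ chains. The sketch below is an independent check rather than the method used above, and the name `occurrence_probability` is made up for illustration. It counts the chains that avoid the gene by tracking, base by base, how long a prefix of the gene is currently matched (the usual string-matching automaton), and then takes the complement using exact fractions.

```python
from fractions import Fraction


def occurrence_probability(pattern: str, length: int, alphabet: str = "ACGT") -> Fraction:
    """Exact probability that a uniformly random chain contains `pattern`."""
    m = len(pattern)

    def next_state(state: int, base: str) -> int:
        # Length of the longest prefix of `pattern` that is a suffix of
        # the currently matched prefix followed by `base`.
        text = pattern[:state] + base
        for k in range(len(text), 0, -1):
            if text.endswith(pattern[:k]):
                return k
        return 0

    # counts[s] = number of chains built so far that do not contain the
    # pattern and whose end matches exactly the first s bases of it.
    counts = [0] * m
    counts[0] = 1
    for _ in range(length):
        new_counts = [0] * m
        for state, ways in enumerate(counts):
            if ways == 0:
                continue
            for base in alphabet:
                target = next_state(state, base)
                if target < m:  # target == m would complete the pattern
                    new_counts[target] += ways
        counts = new_counts

    total = len(alphabet) ** length
    return Fraction(total - sum(counts), total)


if __name__ == "__main__":
    p = occurrence_probability("CCTAGG", 13)
    closed_form = Fraction(1, 2**9) - Fraction(3, 2**24)
    print(p, float(p), p == closed_form)
```

Because the automaton handles self-overlap automatically, the same function also reproduces the values $\frac7{64}$ for CC and $\frac18$ for AT in a chain of length $3$.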