Counting DNA sequences that contain all dinucleotides exactly once

biologycombinatoricspermutations

I'm trying to figure out how many unique DNA sequences minimally contain every possible dinucleotide sequence. I'm convinced that this is likely a solved combinatorics question, but I don't have the mathematical background to know how to ask it formally. Allow me to explain.

DNA has four nucleotide bases: A T G C.

A dinucleotide sequence is just any sequence to two nucleotides, so AT, GC, TT, etc.

There are of course 16 possible dinucleotide sequences. (Direction matters, so for instance GC and CG are different dinucleotides.) I'm searching for minimal sequences that contain every one of those 16 dinucleotide sequences exactly once. I know this one:

ACGTTAGCTGGATCCAA

It is 17 nucleotides long, and each of the 16 adjacent pairs of nucleotides is unique. This is what I'm looking for. I also know that I can make more such sequences by circular permutation of this sequence.

My question is, how would I count all such sequences? How would I construct a list of all such sequences?

Best Answer

The sequences you are constructing are more generally known as de Bruijn sequences. In general, if the alphabet has size $k$, then the number of cyclic sequences of length $k^n$ which contain each possible sequence of length $n$ as a possible subsequence is $$ \frac{(k!)^{k^{n-1}}}{k^n} $$ In the case $k=4$ and $n=2$, this gives $20{,}736$ cyclic sequences. This answers your first question. This question has some information about generating all sequences. After some debugging, I was able to use the code there to generate this text file with all of the sequences:

https://pastebin.com/FjLLnqre

Each this only lists each cyclic sequence once, so to get all linear sequeunces you would have to print all cyclic rotations of each sequence in the file.

Related Question