[Math] Calculating Shannon Entropy for DNA sequence

cryptography, entropy, information theory, probability

I'm following the formula on http://www.shannonentropy.netmark.pl/calculate to calculate the Shannon entropy of a string of nucleotides [nt]. Since there are 4 nt, I assigned each of them an equal probability P(nt) = 0.25. The equation I'm using is -sum([Pr(x)*log2(Pr(x)) for all x in X]), where X is the DNA sequence (e.g. ATCG).

So my question is this: in Shannon entropy, MUST the probabilities be based solely on the sequence itself, or can they be predetermined (i.e. nt_set = {A, T, C, G} and each P(nt) = 0.25)?

If I used predetermined probabilities, would that still be entropy and if not, what would I be calculating?

Best Answer

In Shannon Entropy, MUST the probability be based solely on the sequence itself or can the probabilities be predetermined?

Quite the contrary (if I understand you correctly): the probabilities must be predetermined. More precisely, Shannon entropy is defined in terms of a probabilistic model; it assumes that the probabilities are known. Hence it does not make much sense to speak of the entropy of a particular sequence, but rather of the entropy of a source that emits that kind of sequence (in a probabilistic sense).
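For reference, the definition can be written out as follows. The notation here is an assumption for illustration (it does not appear in the question): X is a random symbol emitted by the source, \mathcal{X} is its alphabet, and p(x) is the model probability of symbol x.

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
```

The sum runs over the symbols of the alphabet, weighted by the model probabilities, not over the positions of any one observed sequence.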

In your case, if you assume that you have 4 symbols, and that they are equiprobable and independent, then the entropy is log2(4) = 2 bits per symbol.
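A minimal Python sketch of that calculation, for illustration only. The function names and the example sequence are hypothetical, not taken from the linked calculator. It computes the entropy of the assumed uniform model and, for contrast, the value obtained by plugging in the symbol frequencies of one particular sequence, which is best read as an estimate of the source entropy under an independence assumption rather than a property of the sequence itself.

```python
import math
from collections import Counter


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution.

    `probs` is an iterable of probabilities summing to 1; zero
    probabilities contribute nothing and are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def empirical_distribution(seq):
    """Relative frequency of each symbol in `seq` (a plug-in estimate)."""
    counts = Counter(seq)
    n = len(seq)
    return {symbol: count / n for symbol, count in counts.items()}


# Predetermined model: four equiprobable, independent nucleotides.
uniform_model = {nt: 0.25 for nt in "ATCG"}
h_model = entropy_bits(uniform_model.values())

# Frequencies taken from one made-up sequence (A: 4/8, T: 2/8, C: 1/8, G: 1/8).
example_seq = "AAAATTCG"
h_empirical = entropy_bits(empirical_distribution(example_seq).values())

print(h_model)      # 2.0
print(h_empirical)  # 1.75
```

The two numbers differ because they describe two different models of the source: the first assumes the uniform distribution up front, the second fits a distribution to a single short sample.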
