How does the choice of alphabet impact the Shannon entropy of a sequence (if at all)?

Tags: entropy, probability distributions, statistics

MSA

The context for my problem is multiple sequence alignment (MSA), specifically column entropy. So basically:

  • finding $H$ for sequences like MKR--KK-RR---RRM, given the 1-letter code for amino acids, and
  • the change in entropy when a new sequence is added to the alignment (each column sequence is extended by one symbol).

Question

I can't quite convince myself whether $p(x)$ in the entropy [definition below] should be interpreted as:

  1. the frequency of a particular symbol [from the set of observed symbols] (M, R, K, -), as opposed to
  2. the probability of observing a symbol in any position [given the amino acid alphabet of 20 symbols].

In the context of MSA, people generally use the former approach (1), but to me it seems that it doesn't account for an important aspect: the underlying "alphabet", as the definition calls it.

Example. So if I have two sequences of the same length, $S_{1}:=$ K-K-P-RM and $S_{2}:=$ KK-RR-MM, where $S_{1}$ features 4 symbols out of 20 while $S_{2}$ features only 3, it seems wrong to compare their "entropies" $H$ because the sets over which the events are distributed are not the same: $\mathcal{X}_{1}=\{K,P,R,M\} \neq \mathcal{X}_{2}=\{K,R,M\}$. Yet that is what happens when we add "column entropies". Furthermore, I know that people also sometimes account for stereochemical properties of amino acids when calculating $H$ (some amino acids then become more likely to occur than others, as opposed to being equiprobable).

I think I might be conflating some things; maybe someone can state the difference between them for me. In particular:

  • How does the choice of alphabet affect the entropy of a sequence if not all symbols of the alphabet are observed in the sequence (as per the definition)?
  • If it does not, what is the extended entropy definition that would account for things like that?
  • Am I correct to say that the underlying distributions are different for sequences with different numbers of distinct symbols?

Shannon Entropy

For reference, the Wikipedia definition of entropy $H$:

Given a discrete random variable $X$, which takes values $x$ in the alphabet $\mathcal{X}$ and is distributed according to $p:\mathcal{X}\to[0,1]$:

$$\mathrm{H}(X) := -\sum_{x\in\mathcal{X}} p(x)\log p(x) = \mathbb{E}[-\log p(X)],$$

where $\Sigma$ denotes the sum over the variable's possible values.
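As an illustration only (and counting the gap symbol "-" as just another observed symbol), applying this definition under interpretation (1) to the example column MKR--KK-RR---RRM, with observed frequencies $p(M)=2/16$, $p(K)=3/16$, $p(R)=5/16$, $p(-)=6/16$, gives

$$H = -\left(\tfrac{2}{16}\log_2\tfrac{2}{16} + \tfrac{3}{16}\log_2\tfrac{3}{16} + \tfrac{5}{16}\log_2\tfrac{5}{16} + \tfrac{6}{16}\log_2\tfrac{6}{16}\right) \approx 1.88 \text{ bits}.$$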

Related answers that didn't help: Answer 1, Answer 2

Best Answer

it seems wrong to compare their "entropies" H because the sets over which the events are distributed are not the same: $\mathcal{X}_1=\{K,P,R,M\} \neq \mathcal{X}_2=\{K,R,M\}$.

You've misunderstood the probability space over which $p$ is defined. The column entropy at a particular index of an MSA is calculated over the distribution of residues at that index across the MSA. In other words, let's represent your MSA by an $m \times n$ matrix $S$, where each of the $m$ rows is a sequence $S_i$ of length $n$. You can think of $S$ as defining $n$ probability distributions $p_j$ over the (single!) alphabet $\mathcal{X}$ of all 20 amino acid symbols, where

$$p_j(x) := \frac{|\{i \in [m] : S_{i,j} = x\}|}{m}$$

Now, to compute the Shannon entropy of the $j$th column, $p_j$ is the probability mass function $p$ in the definition you provided and $\mathcal{X}$ is the alphabet. Note that symbols which never occur in the column simply have $p_j(x) = 0$ and contribute nothing to the sum (by the convention $0 \log 0 = 0$), so enlarging the alphabet with unobserved symbols does not change $H$.
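To make this concrete, here is a minimal Python sketch (my own illustration, not code from any particular library) of $p_j$ and the column entropy, iterating over the full 20-symbol alphabet; the gap symbol "-" is included as an extra symbol purely for illustration, since conventions for handling gaps differ.

```python
from collections import Counter
from math import log2

# 20 amino-acid one-letter codes; the gap '-' is appended only for
# illustration -- conventions for treating gaps in column entropy differ.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY") + ["-"]

def column_entropy(column: str, alphabet=ALPHABET) -> float:
    """Shannon entropy (base 2) of one MSA column.

    column: the j-th character of every sequence in the alignment,
            i.e. S[1][j], S[2][j], ..., S[m][j] read top to bottom.
    p_j(x) = (# rows with symbol x at column j) / m, summed over a fixed
    alphabet; unobserved symbols have p_j(x) = 0 and, by the 0*log(0) = 0
    convention, contribute nothing.
    """
    m = len(column)                    # number of sequences (rows)
    counts = Counter(column)
    h = 0.0
    for x in alphabet:
        p = counts[x] / m              # p_j(x)
        if p > 0:                      # zero-probability terms vanish
            h -= p * log2(p)
    return h

# Example column from the question: M=2, K=3, R=5, '-'=6 out of 16 rows.
print(column_entropy("MKR--KK-RR---RRM"))   # ~1.88 bits
```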

Also, this is indeed what's going on in the GitHub link you provided. Taken from a comment earlier in the code:

H ranges from $0$ (only one base/residue is present at that position) to $4.322$ (all $20$ residues are equally represented at that position)
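As a quick sanity check of that range, reusing the hypothetical column_entropy sketch above:

```python
# Fully conserved column -> H = 0
print(column_entropy("K" * 20))                  # 0.0

# All 20 residues equally represented -> H = log2(20)
print(column_entropy("ACDEFGHIKLMNPQRSTVWY"))    # ~4.322 bits
```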