Entropy and how it helps distinguish between distributions or quantify information

entropy, information-theory

I'm reading a book on information theory and it switches between two different definitions of entropy: one being "the measure of uncertainty of a random variable" and the other "the average number of bits required to describe the random variable." I'm coming away confused about what entropy actually is and how it's used, for example to say how different two distributions are.

The formal definition in the book is:
$H(X) = -\sum_{x \in X} p(x) \log p(x)$

I'm not really sure what this represents, but I want to understand it so I can see how entropy is used in quantities like the KL divergence.

Best Answer

Both definitions are accurate, although the first is more general because entropy can be defined in several ways. Entropy is generally used as a measure of the uncertainty we have about a particular event, where an uncertain event is one with several possible outcomes.

If you have an event (or random variable) with $M$ equiprobable outcomes, then $M$ can reasonably be used as a measure of the uncertainty of the event. If you observe the result of an uncertain event and need to store that result or transmit it to another party, then the entropy measures how efficiently you can do so.

As a simple example, if an event has $M = 10$ equiprobable outcomes, you can assign each outcome a unique digit between $0$ and $9$. After observing the actual outcome, you can send the result to the other party by sending just the digit that corresponds to it. You only need to send $1$ digit per outcome, so the entropy is $1$ digit per outcome (where a digit is an ordinary base-$10$ number).

If, on the other hand, you had $M = 20$ equiprobable outcomes, you would need to send $2$ digits per outcome. If you use digits of base $b$, the entropy can be shown to be $\log_b M$ per outcome (in the case of equiprobable outcomes). It is common to use base $2$, in which case the entropy is measured in bits.
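
To make the $\log_b M$ expression concrete, here is a minimal Python sketch (my own illustration, not from the book or the answer); the function name `equiprobable_entropy` and the particular values of $M$ are assumptions chosen for the example:

```python
import math

def equiprobable_entropy(M, base=2):
    """Entropy of M equiprobable outcomes, measured in units of the given base."""
    return math.log(M, base)

print(equiprobable_entropy(10, base=10))  # 1.0   -> exactly 1 decimal digit per outcome
print(equiprobable_entropy(20, base=10))  # ~1.301 decimal digits, so 2 whole digits suffice
print(equiprobable_entropy(20, base=2))   # ~4.322 bits per outcome
```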

In the case of observing a binary random variable $X$ that follows a distribution $p(x)$, it can be shown that if $n$ observations are made (with $n$ very large), then even though $2^n$ different sequences are possible, with very high probability the observed sequence turns out to be one of only $M = 2^{nH}$ roughly equiprobable sequences. These are called typical sequences. It then follows that $\log_2 2^{nH} = nH$ is the entropy of the observed sequence, where $H$ can be shown to evaluate to $-\sum_x p(x) \log_2 p(x)$.
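
The typical-sequence claim can be checked numerically. Below is a small Python sketch, assuming an illustrative Bernoulli source with $p = 0.2$ and $n = 100{,}000$ draws (values I chose, not from the answer): it computes $H = -\sum_x p(x) \log_2 p(x)$ and shows that $-\tfrac{1}{n}\log_2 P(\text{observed sequence})$ lands very close to $H$, which is exactly why the observed sequence is, with high probability, one of roughly $2^{nH}$ typical sequences rather than an arbitrary one of all $2^n$ possibilities.

```python
import math
import random

def entropy_bits(p):
    """Shannon entropy of a Bernoulli(p) source, in bits per symbol."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Illustrative parameters (my assumption, not from the answer).
p, n = 0.2, 100_000
H = entropy_bits(p)

# Draw one long sequence and compute -(1/n) * log2 P(sequence).
# The typical-sequence argument says this concentrates around H.
random.seed(0)
seq = [1 if random.random() < p else 0 for _ in range(n)]
log_prob = sum(math.log2(p) if x == 1 else math.log2(1 - p) for x in seq)

print(f"H = {H:.4f} bits/symbol")
print(f"-(1/n) log2 P(sequence) = {-log_prob / n:.4f}  (close to H)")
```

With $n$ this large the printed estimate typically agrees with $H$ to a couple of decimal places, and the agreement tightens as $n$ grows.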
