[Math] What does the $-\log[P(X)]$ term mean in the calculation of entropy

information theory, probability, probability theory

The entropy (average self-information) of a discrete random variable $X$ is calculated as:

$$
H(X)=E\bigl[-\log P(X)\bigr]
$$

What does the $-\log[P(X)]$ term mean? It seems to be something like "the self-information of each possible outcome of the random variable $X$".

And why do we use the log function to calculate it?

ADD 1

Well, below is my reasoning:

The root motivation is to quantify/measure the uncertainty contained in a random variable.

Intuitively, people tend to agree that there is some connection between uncertainty and probability. And, still intuitively, people would agree that:

  • the more probable an outcome is, the less uncertainty it carries;
  • conversely, the less probable an outcome is, the more uncertainty it carries.

So I think that if we want to measure the uncertainty of an outcome of a random variable, the measure function should satisfy:

  • the value of the uncertainty measure should be positive (a human instinct when counting);
  • the measure of an outcome's uncertainty should be a monotonically decreasing function of that outcome's probability;
  • for outcomes of independent experiments, the uncertainty should be additive; that is, for a joint probability $P(A)P(B)$, the total uncertainty should be the sum of $A$'s and $B$'s. (This is kind of instinctive, too.)

Then I come to the choice of $-\log[p(i)]$ as the measure of the uncertainty of each possible outcome, or the self-information of each outcome.
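
A quick check that this choice meets all three requirements (for any choice of logarithm base, which only fixes the unit):

$$
0 < p \le 1 \;\Rightarrow\; -\log p \ge 0,
\qquad
\frac{d}{dp}\bigl(-\log p\bigr) = -\frac{1}{p} < 0,
\qquad
-\log\bigl[P(A)P(B)\bigr] = -\log P(A) - \log P(B).
$$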

Then I treat the entropy as the weighted average of the self-information of all possible outcomes.
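
Below is a minimal sketch of this weighted-average view (using base-2 logs, so the unit is bits; the helper names are just for illustration):

```python
import math

def self_information(p, base=2):
    """Surprise of a single outcome that occurs with probability p."""
    return -math.log(p, base)

def entropy(probs, base=2):
    """Entropy = probability-weighted average of each outcome's self-information."""
    return sum(p * self_information(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits, i.e. less uncertainty
```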

I just read the book *Information Theory, Inference and Learning Algorithms* by MacKay. The author indeed gives an explanation similar to mine, and he names it the information content of each outcome. It is not difficult to see that entropy describes a random variable better than the information content of a single outcome does.

And it is a coincidence that the formula we intuitively found to measure the average information content of a random variable has a similar form to that of entropy in thermodynamics. Hence the name information entropy.

BTW I want to quote some words from Einstein…

"It is not so important where one settles down. The best thing is to
follow your instincts without too much reflection."

–Einstein to Max Born, March 3, 1920. AEA 8-146

ADD 2

Following my reasoning above, I tried to derive the entropy of a continuous random variable $Y$ in a similar way, but I got stuck. Details below.

Let $Y$'s p.d.f. be $f(y)$.

Then, if we strictly follow my previous reasoning, we should pick a small interval $I$, and the probability of $Y$ falling within interval $I$ is given by:

$$
P(y\in I)=\int_I f(y)\,dy
$$

Then the measure of uncertainty for $Y$ falling in interval $I$ should be:

$$
m(y\in I) = -\log\int_I f(y)\,dy
$$

Then, to get the entropy, we should take the expectation/average of this measure $m$, which is essentially $E[m(y\in I)]$, and it can be expanded as below:

$$
\int P(y\in I)\, m(y\in I)\,dI
=\int \left(\int_I f(y)\,dy\right)\left(-\log\int_I f(y)\,dy\right)dI
$$

I found myself stuck here because the interval $I$ is not strictly defined.
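
Below is a small numerical sketch of why the choice of interval matters (assuming, just for illustration, a standard normal density and equal-width bins): the discrete-style entropy of the binned variable keeps growing as the bins shrink.

```python
import numpy as np
from scipy.stats import norm

# Discretize a standard normal density into equal-width intervals and compute
# the discrete-style entropy of the binned variable for several bin widths.
for width in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + width, width)
    probs = np.diff(norm.cdf(edges))   # probability mass in each interval I
    probs = probs[probs > 0]           # drop empty bins before taking logs
    H = -np.sum(probs * np.log2(probs))
    print(f"bin width {width}: binned entropy ~ {H:.2f} bits")
# Each 10x reduction in bin width adds about log2(10) ~ 3.3 bits,
# so the binned entropy does not converge to a finite value.
```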

Then I found here the authoritative definition of the entropy of a continuous random variable:

$$
H(Y)=-\int{f(y)log[f(y)]dy}
$$

The p.d.f. $f(y)$ can certainly be $>1$, so $H(Y)$ can be negative, while in the discrete scenario $H(X)$ is always non-negative.
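
For example, if $Y$ is uniform on $[0,\tfrac12]$, then $f(y)=2$ on that interval, and

$$
H(Y)=-\int_0^{1/2} 2\log 2\,dy=-\log 2<0.
$$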

I cannot explain why this inconsistency is happening. For now, I can only regard it as a philosophical difficulty regarding continuity and discreteness.

Some of my personal feelings (which can be safely ignored):

In the discrete scenario, the concrete, countable outcomes provide the foothold for us to carry out our calculation. But in the continuous scenario, there is no such ready-made foothold (unless we can somehow make one). Without such a foothold, it feels like we just keep falling into the endless hollowness of the mind.

Could anyone shed some light?

ADD 3 – 4:23 PM 2/21/2022

We created mathematics to quantify the world. And here in probability we even try to quantify our mentality, while our mentality created mathematics in the first place. It's like an endless recursive fall, and it's really hard for one to settle down.

Best Answer

Easy illustrative example:

Take a fair coin. $P({\rm each\ result})=1/2$. By independence, $P({\rm each\ sequence\ of\ }n{\rm\ tosses})=1/2^n$. The surprise in each coin toss is the same. The surprise in $n$ tosses is $n\times$(surprise in one toss). The $\log$ does the trick. And the entropy is the mean surprise.
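
In symbols, using base-2 logs so that the unit is bits:

$$
-\log_2 P({\rm a\ given\ sequence\ of\ }n{\rm\ tosses})=-\log_2\frac{1}{2^n}=n,
\qquad
H=\sum_{\rm results}\frac12\Bigl(-\log_2\frac12\Bigr)=1{\rm\ bit\ per\ toss}.
$$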