Pattern Recognition and Machine Learning (Bishop) – Exercise 1.28

entropy, information-theory, machine-learning, pattern-recognition, statistical-inference

1.28
In Section 1.6, we introduced the idea of entropy $h(x)$ as the information gained on observing the value of a random variable $x$ having distribution $p(x)$.
We saw that, for independent variables $x$ and $y$ for which $p(x,y) = p(x) p(y)$, the entropy functions are additive, so that $h(x,y) = h(x) + h(y)$.
In this exercise, we derive the relation between $h$ and $p$ in the form of a function $h(p)$.
First show that $h(p^2) = 2 h(p)$, and hence by induction that $h(p^n) = n h(p)$ where $n$ is a positive integer.
Hence show that $h(p^{n/m}) = (n/m) h(p)$ where $m$ is also a positive integer.
This implies that $h(p^x) = x h(p)$ where $x$ is a positive rational number, and hence by continuity when it is a positive real number.
Finally, show that this implies $h(p)$ must take the form $h(p) \propto \ln p$.
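As a quick sanity check of where the exercise is heading (this is my own illustration, not part of the book's text), the conclusion $h(p) = -\ln p$ (up to a positive constant) can be verified numerically to satisfy both the additivity and the power properties:

```python
import math

# Candidate information function from the exercise's conclusion:
# h(p) = -ln(p); any positive multiple of the logarithm also works.
def h(p):
    return -math.log(p)

px, py = 0.3, 0.7  # probabilities of two independent events (arbitrary values)

# Additivity under independence: p(x, y) = p(x) p(y) implies
# h(p(x) p(y)) = h(p(x)) + h(p(y)).
assert math.isclose(h(px * py), h(px) + h(py))

# Power rule h(p^n) = n h(p) for a positive integer n.
n = 5
assert math.isclose(h(px ** n), n * h(px))
```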

(Original image here.)

I am having a few problems with this exercise, I will try to list them and perhaps you can help on some of them:

  1. Essentially, I don't understand what kind of function $h$ is, because the first two lines of the problem statement make me think that its input is a random variable $x$ and its output a real number $h(x)$. But then the problem uses the notation $h(x,y)$, implying that $h$ is now a function of 2 random variables $x$ and $y$, yielding a real number $h(x,y)$ as output. So what would be the correct or formal way to define $h$?
  2. In connection with the first item, the 5th line now reads $h(p)$… so if switching between 1 and 2 random variables as inputs wasn't enough, now the input is a probability density? $h$ must be a very robust function for sure… OK, I guess that so far the question is the same as in item 1, i.e., what is $h$?
  3. The first step of the problem is to show $h(p^2) = 2 \, h(p)$ – ignoring the issues I described previously – and I can't see how this is achieved using the two hypotheses about independent variables. I have a feeling that the idea is to do something like (warning: the following is nonsense): $h(p^2) = h(p,p) = h(p) + h(p) = 2\, h(p)$, where the first equality is wrong because I magically split $p^2$ into two input variables, the second is also wrong because I apply a hypothesis given for independent random variables to the densities of dependent ones, and the last one is OK (!!!). So, what would be a nicer way to carry out this first step?
  4. Finally, is the logarithm family really the only one that matches these requirements? I wonder, for example, what would happen if the argument function were used (like $\arg(z)$ in complex analysis), since it satisfies the same properties as the logarithm regarding the product and power of its inputs.
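Regarding item 4, one relevant observation (my own, not from the book) is that $\arg$ is only additive modulo $2\pi$: once the arguments sum past $\pi$, the result wraps around, so plain additivity $\arg(z_1 z_2) = \arg(z_1) + \arg(z_2)$ fails. A quick numerical illustration:

```python
import cmath
import math

# Two unit-modulus complex numbers whose arguments sum past pi.
z1 = cmath.exp(1j * 2.0)   # arg(z1) = 2.0
z2 = cmath.exp(1j * 2.0)   # arg(z2) = 2.0

# cmath.phase returns the principal argument in (-pi, pi],
# so arg(z1 * z2) wraps to 4.0 - 2*pi rather than 4.0.
lhs = cmath.phase(z1 * z2)
rhs = cmath.phase(z1) + cmath.phase(z2)

assert not math.isclose(lhs, rhs)            # additivity fails outright...
assert math.isclose(lhs, rhs - 2 * math.pi)  # ...it only holds modulo 2*pi
```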

Help in any of these items will be appreciated!

In case you are interested, there are many online sources for the book, e.g. here; the topic is covered on page 48 of the book (68 of the PDF), and the problem statement – picture uploaded – is on page 64 of the book (84 of the PDF).

EDIT: I've found this article on Wikipedia, in which my first two items are answered via a clearer notation for the functions and variables involved in the problem. Still looking for 3 and 4…

Best Answer

After some hours of research I've found a few sites which altogether answer these questions.

Regarding items 1 and 2, it looks like there is indeed a severe abuse of notation every time the author refers to the function $h$. This function seems to be the so-called self-information, and it is usually defined on probability events or on random variables. I find this article very clarifying in this respect.

Regarding item 4, from what I have seen, it seems that under certain conditions that a self-information function must satisfy, the logarithm is the only possible choice. The selected answer in this post was particularly useful, and so were the comments on the question. This topic is also discussed here, but I prefer the previous link.
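To make the final step of the exercise concrete (a sketch following the exercise's own outline, not a summary of the linked posts): once $h(p^x) = x\,h(p)$ holds for all real $x > 0$, the logarithmic form follows by expressing an arbitrary probability as a power of a fixed base probability:

```latex
\text{Given } h(p^x) = x\,h(p) \text{ for all } x > 0,\ \text{fix any } p_0 \in (0,1).
\text{Any } p \in (0,1) \text{ can be written as } p = p_0^{\,x} \text{ with } x = \frac{\ln p}{\ln p_0} > 0,
\text{so}\quad h(p) = h\!\left(p_0^{\,x}\right) = x\,h(p_0) = \frac{h(p_0)}{\ln p_0}\,\ln p \;\propto\; \ln p.
```

The constant $h(p_0)/\ln p_0$ is negative whenever $h$ is required to be non-negative on $(0,1)$, which recovers the usual $h(p) = -K \ln p$ with $K > 0$.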

Finally, I have not found an answer for item 3. Actually, I really think that this step is poorly formulated because of the imprecision in the definition of the function $h$. Nevertheless, the links I have provided as an answer to item 4 lead to the desired result.
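For what it is worth, one common reading of that step (an interpretation on my part, not something the book states explicitly) treats $h$ as a function of the probability *value* and considers two independent variables that happen to share the same probability:

```latex
\text{Let } x \text{ and } y \text{ be independent with } p(x) = p(y) = p,
\text{ so that } p(x,y) = p(x)\,p(y) = p^2. \text{ Additivity then gives}
\quad h(p^2) = h\bigl(p(x)\,p(y)\bigr) = h\bigl(p(x)\bigr) + h\bigl(p(y)\bigr) = 2\,h(p).
```

Under this reading the variables are genuinely independent (they merely have equal probabilities), which avoids the $h(p,p)$ issue described in item 3.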
