An event with unit density still has zero information, despite not being an event that is guaranteed to occur.

Tags: density-function, information-theory, machine-learning, probability-theory, random-variables

My textbook says the following in a section on information theory:

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

We would like to quantify information in a way that formalizes this intuition.

• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

To satisfy all three of these properties, we define the self-information of an event $\mathrm{x} = x$ to be

$$I(x) = -\log(P(x))$$

In this book, we always use $\log$ to mean the natural logarithm, with base $e$. Our definition of $I(x)$ is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability $\dfrac{1}{e}$. Other texts use base-$2$ logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When $\mathrm{x}$ is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning (Page 71).
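
For concreteness, here is a minimal Python sketch (my own, not from the book) that checks the three properties numerically:

```python
import math

def self_information(p):
    """Self-information in nats: I(x) = -log P(x)."""
    return -math.log(p)

# Likely events carry little information; a guaranteed event carries none.
print(self_information(0.999))        # ~0.001 nats
assert self_information(1.0) == 0.0   # guaranteed event: zero information

# Less likely events carry more information.
print(self_information(0.01))         # ~4.605 nats

# Independent events are additive: P(two heads) = 0.5 * 0.5, and the
# information of two heads is exactly twice that of one head.
one_head = self_information(0.5)
two_heads = self_information(0.5 * 0.5)
assert math.isclose(two_heads, 2 * one_head)

# One nat is the information of an event with probability 1/e,
# and bits are just nats rescaled by log(2).
assert math.isclose(self_information(1 / math.e), 1.0)
print(self_information(0.5) / math.log(2))  # 1.0 bit
```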

My question is about this last part:

When $\mathrm{x}$ is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.

  1. What is meant by "unit density"?
  2. What is meant by "an event with unit density"?
  3. What is meant by "an event with unit density still has zero information, despite not being an event that is guaranteed to occur"?

I would greatly appreciate it if people could please take the time to clarify this.

Best Answer

Since @harinsa requested an update in the comments, I will answer my own question methodically, using the reasoning I came up with to justify the textbook's claims:

  1. Unit density means $p(x) = 1$, where $p(\cdot)$ is the probability density function of a continuous random variable $\mathrm{x}$ and $x$ is a particular value it can take. For example, the standard uniform distribution on $[0, 1]$ has $p(x) = 1$ at every point of its support.

  2. An event with unit density has zero information: $$\begin{align} I(x) &= -\log(p(x)) \\ &= -\log(1) \\ &= 0 \end{align}$$
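
A concrete illustration of such an event (my own, using SciPy's `scipy.stats.uniform`): the standard uniform distribution has density exactly $1$ everywhere on its support, so every point in $(0, 1)$ has zero self-information:

```python
from scipy.stats import uniform  # Uniform(0, 1) by default

x = 0.3                    # any point in the support (0, 1)
print(uniform.pdf(x))      # 1.0  -> "unit density"
print(-uniform.logpdf(x))  # -0.0 -> I(x) = -log p(x) = 0 nats
```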

  3. This is despite it not being an event that is guaranteed to occur. In fact, it is an event that has probability $0$ of occurring: the probability of a continuous random variable taking any one specific value (as opposed to a value in an interval of positive width) is the integral of the density over a zero-width interval, which is zero: $$P(\mathrm{x} = x) = \int_x^x p(t)\, dt = 0 \ ,$$ even though $p(x) = 1$. (Here $P$ denotes a probability, whereas in the textbook the author uses $P(x)$ to denote a probability mass function.)
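
The same SciPy example (again my own, not from the textbook) shows this numerically: the probability of hitting the single point is the CDF evaluated over a zero-width interval, while an interval of positive width has positive probability:

```python
from scipy.stats import uniform

a = 0.3
# P(x = a): integrate the density over a zero-width interval.
print(uniform.cdf(a) - uniform.cdf(a))      # 0.0 -> probability zero, yet p(a) = 1
# An interval of positive width has positive probability.
print(uniform.cdf(0.4) - uniform.cdf(0.2))  # ~0.2
```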
