Solved – Motivating sigmoid output units in neural networks, starting from unnormalized log probabilities linear in $y$ and $z=w^Th+b$ and arriving at $\phi(z)$

deep learning, neural networks

Background: I'm studying chapter 6 of Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. In section 6.2.2.2 (pages 182–183), the use of the sigmoid to output $P(y=1|x)$ is motivated.

To summarize some of the material: they let $$z = w^Th+b$$ be the value of an output neuron before an activation is applied, where $h$ is the output of the previous hidden layer, $w$ is a vector of weights, and $b$ is a scalar bias. The input vector is denoted $x$ (of which $h$ is a function), and the network's output is $\hat y=\phi(z)$, where $\phi$ is the sigmoid function. The book wishes to define a probability distribution over $y$ using the value $z$. From the second paragraph of page 183:

We omit the dependence on $x$ for the moment to discuss how to define a
probability distribution over $y$ using the value $z$. The sigmoid can be motivated by constructing an unnormalized probability distribution $\tilde P(y)$, which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of $z$:
\begin{align}
\log\tilde P(y) &= yz \\
\tilde P(y) &= \exp(yz) \\
P(y) &= \frac{\exp(yz)}{\sum_{y'=0}^1 \exp(y'z) } \\
P(y) &= \phi((2y-1)z)
\end{align}
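
The chain above is purely algebraic, so it can be checked numerically. Here is a minimal sketch (mine, not from the book; assumes NumPy, and the names `sigmoid` and `z` are my own) confirming that normalizing $\exp(yz)$ really yields $\phi((2y-1)z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.7  # any real-valued logit z = w^T h + b

# Unnormalized probabilities P~(y) = exp(y * z) for y in {0, 1}
unnorm = np.exp(np.array([0.0, 1.0]) * z)

# Normalize over both outcomes
P = unnorm / unnorm.sum()

# Closed form from the book: P(y) = sigmoid((2y - 1) * z)
for y in (0, 1):
    assert np.isclose(P[y], sigmoid((2 * y - 1) * z))
print(P)  # e.g. [0.15446527 0.84553473]
```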

Questions: I'm confused about two things, particularly the first:

  1. Where is the initial assumption coming from? Why is the unnormalized log probability linear in $y$ and $z$? Can someone give me some intuition on how the authors started with $\log\tilde P(y) = yz$?
  2. How does the last line follow?

Best Answer

There are only two possible outcomes, since $y \in \{0, 1\}$. This is very important, because it changes the meaning of the multiplication $yz$. There are two possible cases:

\begin{align} \log\tilde P(y=1) &= z \\ \log\tilde P(y=0) &= 0 \\ \end{align}

It is also important to notice that the unnormalized log probability for $y=0$ is constant. This follows directly from the main assumption, since $yz=0$ whenever $y=0$, and applying any deterministic function to a constant produces a constant. This property simplifies the final formula when we normalize over all possible outcomes, because we only need the unnormalized probability for $y=1$; for $y=0$ it is always the same constant. And since the network's output is an unnormalized log probability, we need only a single output, because the other one is constant by assumption.
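
To make the "one output suffices" point concrete, here is a small sketch (mine, not from the original answer; assumes NumPy): a two-class softmax whose logit for $y=0$ is pinned at the constant $0$ collapses to a sigmoid of the single free logit $z$.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -0.8  # the network's single output: log P~(y=1)

# Logits [log P~(y=0), log P~(y=1)] = [0, z]; the first entry is constant
probs = softmax(np.array([0.0, z]))

assert np.isclose(probs[1], sigmoid(z))   # P(y=1) = sigmoid(z)
assert np.isclose(probs[0], sigmoid(-z))  # P(y=0) = sigmoid(-z)
```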

Next, we apply exponentiation to the unnormalized log probabilities in order to obtain the unnormalized probabilities.

\begin{align} \tilde P(y=1) &= e ^ z \\ \tilde P(y=0) &= e ^ 0 = 1 \end{align}

Next, we normalize by dividing each unnormalized probability by the sum of all unnormalized probabilities.

\begin{align} P(y=1) = \frac{e ^ z}{1 + e ^ z} \\ P(y=0) = \frac{1}{1 + e ^ z} \end{align}

We are interested only in $P(y=1)$, because that is the probability the sigmoid output represents. The resulting expression doesn't look like a sigmoid at first glance, but the two are equal, and it is easy to show.

\begin{align} P(y=1) = \frac{e ^ z}{1 + e ^ z} = \frac{1}{\frac{e ^ z + 1}{e ^ z}} = \frac{1}{1 + \frac{1}{e ^ z}} = \frac{1}{1 + e ^ {-z}} \end{align}

The last statement can be confusing at first, but it is just a compact way to show that the final probability function is a sigmoid (written $\sigma$ here). The factor $(2y−1)$ converts $y=0$ to $-1$ and leaves $y=1$ unchanged at $1$.

$$ P(y) = \sigma((2y - 1)z) = \begin{cases} \sigma(z) = \frac{1}{1 + e ^ {-z}} = \frac{e ^ z}{1 + e ^ z} & \text{when } y = 1 \\ \sigma(-z) = \frac{1}{1 + e ^ {-(-z)}} = \frac{1}{1 + e ^ z} & \text{when } y = 0 \\ \end{cases} $$

As we can see, this is just a compact way to express the relation between $\sigma$ and $P(y)$.
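
As a final sanity check (again a sketch of mine, assuming NumPy), the identity $\sigma(-z) = 1 - \sigma(z)$ guarantees that the two branches above form a valid Bernoulli distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in np.linspace(-5.0, 5.0, 11):
    # P(y=1) + P(y=0) = sigmoid(z) + sigmoid(-z) = 1 for every z
    assert np.isclose(sigmoid(z) + sigmoid(-z), 1.0)
```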
