Note that $P(x)=\int_{-\infty}^xp(t)\,\mathrm{d}t$ is the probability that the original variable will be less than $x$. Now suppose a new variable is defined so that the original variable equals $g$ of the new one. If $g$ is a monotonically increasing function, the new variable is less than $x$ exactly when the original variable is less than $g(x)$, so the CDF of the new variable is $P(g(x))$.
By the chain rule, the corresponding probability density is $P'(g(x))g'(x)=p(g(x))\left|g'(x)\right|$, since $g'(x)>0$.
If $g$ is monotonically decreasing, the new variable is less than $x$ exactly when the original variable is greater than $g(x)$, so the CDF of the new variable is $1-P(g(x))$.
By the chain rule, the corresponding probability density is $-P'(g(x))g'(x)=p(g(x))\left|g'(x)\right|$, since $g'(x)<0$.
The situation is more complicated if $g$ is not monotonic; there we need to sum the expressions above over all the points $x$ at which $g$ takes the value in question.
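To make the monotone case concrete, here is a small numerical sanity check (my own sketch, not part of the original argument): take a standard normal as the original variable, use the increasing map $g(x)=x^3$, and compare a histogram of the transformed samples against $p(g(x))\left|g'(x)\right|$.

```python
# Sketch: numerically verify the change-of-variables density (assumed setup, not from the post).
# Original variable Y ~ N(0, 1) with density p; new variable X defined by Y = g(X), g(x) = x**3,
# so X = cbrt(Y) and g'(x) = 3 * x**2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(size=1_000_000)          # samples of the original variable
x = np.cbrt(y)                          # samples of the transformed variable

hist, edges = np.histogram(x, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = norm.pdf(centers**3) * np.abs(3 * centers**2)   # p(g(x)) * |g'(x)|

print(np.max(np.abs(hist - predicted)))  # small: only Monte Carlo and binning noise remains
```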
Let me try to explain with the following three-step reasoning process.
To measure the difference between two probability values
Intuitively, what is the best way to measure the difference between two probability values?
The probability that a person's death is related to a car accident is about $\frac{1}{77}$, while the odds of being struck by lightning are about $\frac{1}{700,000}$. Their numerical difference (in terms of L1/L2 distance, i.e. by subtraction) is around 1%. Do you consider the two events similarly likely? Most people would consider them very different: the first kind of event is rare but significant and worth paying attention to, while most of us do not worry about the second kind in our daily lives.
Overall, the sun shines 72% of the time in San Jose, and about 66% of the time on the sunny side (bay side) of San Francisco. The two sunshine probabilities differ numerically by about 6%. Do you consider the difference significant? For some, it might be; but for me, both places get plenty of sunshine, and there is little material difference.
The takeaway is that we need to measure the difference between individual probability values not by subtraction, but by some quantity related to their ratio $\frac{p_k}{q_k}$.
But there are problems with using the raw ratio as the measurement. One problem is that it can vary a lot, especially for rare events. It is not uncommon to assess a certain probability as 1% one day and declare it to be 2% the next. Taking a simple ratio of this probability to the probability of some other event would make the measurement change by 100% between the two days. For this reason, the log of the ratio, $\log\left(\frac{p_k}{q_k}\right)$, is used to measure the difference between an individual pair of probability values.
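To put numbers on the two examples above (a quick sketch of mine, not from the original answer): subtraction makes the car-accident/lightning gap and the San Jose/San Francisco gap look comparable, while the log-ratio separates them by two orders of magnitude.

```python
import math

car, lightning = 1 / 77, 1 / 700_000
print(car - lightning)            # ~0.013: "about 1%", looks negligible
print(math.log(car / lightning))  # ~9.1:   the two risks are wildly different

sj_sun, sf_sun = 0.72, 0.66
print(sj_sun - sf_sun)            # 0.06:   the same order as the gap above
print(math.log(sj_sun / sf_sun))  # ~0.09:  on the log-ratio scale the gap is tiny
```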
To measure the difference between two probability distributions
The goal of your question is to measure the distance between two probability distributions, not between two individual probability values. For a probability distribution, we are dealing with multiple probability values at once. To most people, it should make sense to first compute the difference at each point and then take the average, weighted by the probability values themselves (i.e. the terms $p_k \log\left(\frac{p_k}{q_k}\right)$), as the distance between the two distributions.
This leads to our first formula for measuring distribution differences.
$$ D_{KL}(p \Vert q) = \sum_{k=1}^n p_k \log\left( \frac{p_k}{q_k} \right). $$
This distance measure, called the KL-divergence (note that it is not a metric), is usually much better suited than L1/L2 distances, especially in the realm of machine learning. I hope that by now you agree that the KL-divergence is a natural measure of the difference between probability distributions.
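As a minimal sketch of the formula (the function name and the test distributions are my own), here is a direct NumPy implementation; it also shows that the measure is not symmetric, which is one reason it is not a metric.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), using the convention 0 * log(0) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                   # skip zero-probability terms
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))   # ~0.085
print(kl_divergence(q, p))   # ~0.092 (not symmetric)
```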
Finally, the cross-entropy measure
There are two technical facts one needs to be aware of.
First, the KL-divergence and the cross-entropy are related by the following formula.
$$ D_{KL}(p \Vert q) = H(p, q) - H(p). $$
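This identity is a one-line rearrangement once the standard definitions $H(p,q)=-\sum_{k=1}^n p_k \log q_k$ (cross-entropy) and $H(p)=-\sum_{k=1}^n p_k \log p_k$ (entropy) are written out:
$$ D_{KL}(p \Vert q) = \sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k = -H(p) + H(p, q) = H(p, q) - H(p). $$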
Second, in ML practice, we often pass the ground-truth label as the $p$ parameter and the model's inference outputs as the $q$ parameter. And in the majority of cases, our training algorithms are based on gradient descent. If both of these assumptions hold (as they usually do), the $H(p)$ term is a constant that does not affect the training result, and hence can be dropped to save computation. In that case $H(p,q)$, the cross-entropy, can be used in place of $D_{KL}(p \Vert q)$.
If the assumptions are violated, you need to abandon the cross-entropy formula and revert to the KL-divergence.
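As a quick numerical sketch of this point (my own example, using a one-hot label): the cross-entropy and the KL-divergence differ only by the constant $H(p)$, and for a one-hot $p$ that constant is zero, so the two losses coincide exactly.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(q[m])))

p = [0.0, 1.0, 0.0]                           # one-hot ground-truth label
for q in ([0.2, 0.5, 0.3], [0.1, 0.8, 0.1]):  # two hypothetical model outputs
    h_pq = cross_entropy(p, q)
    d_kl = h_pq - entropy(p)                  # D_KL(p || q) = H(p, q) - H(p)
    print(h_pq, d_kl)                         # identical, since H(p) = 0 for one-hot p
```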
I think I can now end my wordy explanation. I hope it helps.
Best Answer
The question was posted a long time ago, but it may be useful for anyone else working through Bishop's book to note that both forms of the softmax function are equivalent since, with the convention that the reference class has $\eta_M=0$ (so that $\exp{\{\eta_M\}}=1$), \begin{equation}1+\sum_{j=1}^{M-1}\exp{\{\eta_{j}\}}={\sum_{j=1}^M\exp{\{\eta_{j}\}}}\end{equation}
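A quick numerical check of this identity (my own sketch, relying on the convention that the reference class has $\eta_M=0$):

```python
import numpy as np

eta = np.array([0.3, -1.2, 2.0, 0.0])   # M = 4 natural parameters; the last (reference) one is 0
lhs = 1.0 + np.exp(eta[:-1]).sum()      # 1 + sum_{j=1}^{M-1} exp(eta_j)
rhs = np.exp(eta).sum()                 # sum_{j=1}^{M} exp(eta_j)
print(np.isclose(lhs, rhs))             # True, so both softmax denominators agree
```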