It seems to be closely related to the concept of Kullback–Leibler divergence (see Kullback and Leibler, 1951). In their article, Kullback and Leibler discuss the mean information for discriminating between two hypotheses (defined as $I_{1:2}(E)$ in eqs. 2.2–2.4) and cite pp. 18–19 of Shannon and Weaver's *The Mathematical Theory of Communication* (1949) and p. 76 of Wiener's *Cybernetics* (1948).
EDIT:
Additional aliases include the Kullback–Leibler information measure, the relative information measure, cross-entropy, I-divergence, and Kerridge inaccuracy.
To encode an event occurring with probability $p$, you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
So with an optimal encoding, the average length of an encoded message is
$$
\sum_i p_i \log_2(\tfrac{1}{p_i}),
$$
that is, the Shannon entropy of the original probability distribution.
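As a quick check of this formula, here is a minimal Python sketch (the function name `shannon_entropy` is my own, not from any library):

```python
import math

def shannon_entropy(p):
    # Average optimal code length in bits: sum_i p_i * log2(1 / p_i).
    # Zero-probability symbols contribute nothing (0 * log2(1/0) is taken as 0).
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit per symbol
print(shannon_entropy([0.25] * 4))   # 2.0 bits per symbol
```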
However, if for a probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is
$$
\sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}),
$$
which is the cross-entropy, and it is greater than $\sum_i p_i \log_2(\tfrac{1}{p_i})$ whenever $Q \neq P$.
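A small sketch of this mismatched-code case, again in plain Python with made-up example distributions:

```python
import math

def cross_entropy(p, q):
    # Average code length in bits when symbols drawn from p are encoded
    # with a code that is optimal for q: sum_i p_i * log2(1 / q_i).
    return sum(p_i * math.log2(1.0 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.8, 0.2]   # the actual source distribution (hypothetical values)
q = [0.5, 0.5]   # the distribution the code was designed for

print(cross_entropy(p, p))   # ~0.72 bits: the entropy of p, the best achievable
print(cross_entropy(p, q))   # 1.0 bit: longer, because the code is tuned to q, not p
```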
As an example, consider an alphabet of four letters (A, B, C, D) in which A and B have the same frequency and C and D do not appear at all, so the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then if we want to encode it optimally, we encode A as 0 and B as 1, and we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we keep the same probability distribution $P$ and encode it according to a distribution $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$ in which all letters are equally probable, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
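The same example in code, to make the numbers concrete (a sketch, not a general-purpose implementation):

```python
import math

# P puts all mass on A and B; Q is uniform over A, B, C, D.
P = [0.5, 0.5, 0.0, 0.0]
Q = [0.25, 0.25, 0.25, 0.25]

entropy_P = sum(p * math.log2(1.0 / p) for p in P if p > 0)
cross_PQ = sum(p * math.log2(1.0 / q) for p, q in zip(P, Q) if p > 0)

print(entropy_P)             # 1.0 bit per letter with the code tuned to P (A -> 0, B -> 1)
print(cross_PQ)              # 2.0 bits per letter with the code tuned to Q (A -> 00, B -> 01, ...)
print(cross_PQ - entropy_P)  # 1.0 bit of overhead: this difference is D_KL(P || Q)
```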
Best Answer
Minimizing the cross-entropy is often used as a learning objective in generative models, where $p$ is the true distribution and $q$ is the learned distribution.
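For instance, when the training target for one example is a single observed class, $p$ is a one-hot distribution and the cross-entropy reduces to $-\log$ of the probability the model assigns to that class. A minimal sketch with made-up numbers:

```python
import math

# Hypothetical single training example: p is the one-hot "true" distribution
# from the data, q is the model's current predicted distribution.
p = [0.0, 1.0, 0.0]   # the observed outcome is class 1
q = [0.2, 0.7, 0.1]   # model probabilities (made-up values)

loss = sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)
print(loss)           # ~0.51 bits, i.e. -log2 q[1]; the loss shrinks as q[1] -> 1
```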
The cross-entropy of $p$ and $q$ is equal to the entropy of $p$ plus the KL divergence between $p$ and $q$:
$$
H(p, q) = H(p) + D_{KL}(p \| q)
$$
You can think of $H(p)$ as a constant because $p$ comes directly from the training data and is not learned by the model, so only the KL divergence term matters for optimization. The motivation for KL divergence as a distance between probability distributions is that it tells you how many extra bits, on average, are needed to encode samples from $p$ when you use a code optimized for the approximation $q$ instead of one optimized for $p$.
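A quick numerical check of this decomposition, with arbitrary example distributions:

```python
import math

def H(p):            # entropy of p, in bits
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def H_cross(p, q):   # cross-entropy of p and q, in bits
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def D_KL(p, q):      # KL divergence D_KL(p || q), in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # stand-in for the data distribution
q = [0.5, 0.3, 0.2]  # stand-in for the model's approximation

print(H_cross(p, q))        # ~1.28 bits
print(H(p) + D_KL(p, q))    # same value: H(p, q) = H(p) + D_KL(p || q)
```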
Note that KL divergence isn't a proper distance metric. For one thing, it is not symmetric in $p$ and $q$. If you need a true distance metric for probability distributions, you will have to use something else. But if you are using the word "distance" informally, then you can use KL divergence.
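To see the asymmetry concretely, a tiny sketch with arbitrary distributions:

```python
import math

def D_KL(p, q):
    # KL divergence D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(D_KL(p, q))   # ~0.53 bits
print(D_KL(q, p))   # ~0.74 bits: a different value, so D_KL is not symmetric
```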