It seems to be closely related to the concept of Kullback–Leibler divergence (see Kullback and Leibler, 1951). In their article, Kullback and Leibler discuss the mean information for discriminating between two hypotheses (defined as $I_{1:2}(E)$ in eqs. 2.2–2.4) and cite pp. 18–19 of Shannon and Weaver's *The Mathematical Theory of Communication* (1949) and p. 76 of Wiener's *Cybernetics* (1948).
EDIT:
Additional aliases include the Kullback–Leibler information measure, the relative information measure, cross-entropy, I-divergence, and Kerridge inaccuracy.
To encode an event occurring with probability $p$, you need at least $\log_2(1/p)$ bits (why? see my answer to "What is the role of the logarithm in Shannon's entropy?").
So with an optimal encoding, the average length of an encoded message is
$$
\sum_i p_i \log_2(\tfrac{1}{p_i}),
$$
that is, the Shannon entropy of the original probability distribution.
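As a quick sanity check, here is a minimal sketch in plain Python of the formula above (the function name is my choice; terms with $p_i = 0$ are skipped since they contribute nothing):

```python
import math

def shannon_entropy(p):
    """Average optimal code length in bits: sum_i p_i * log2(1/p_i)."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5 bits per symbol
```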
However, if for a probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is
$$
\sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}),
$$
the cross-entropy of $P$ and $Q$, which by Gibbs' inequality is at least $\sum_i p_i \log_2(\tfrac{1}{p_i})$, with equality exactly when $P = Q$.
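Continuing the sketch (again, the function name is my choice, not a standard API), cross-entropy just swaps in the code lengths that are optimal for $Q$:

```python
import math

def cross_entropy(p, q):
    """Average bits per symbol when data distributed as p is encoded
    with a code that is optimal for q: sum_i p_i * log2(1/q_i)."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(cross_entropy(p, q))  # ~1.737 bits, versus H(p) = 1 bit
```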
As an example, consider an alphabet of four letters (A, B, C, D) in which A and B occur equally often and C and D do not appear at all, so the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then if we want to encode it optimally, we encode A as 0 and B as 1, and we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we keep the same distribution $P$ and encode it according to a distribution in which all letters are equally probable, $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10, and D as 11).
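Plugging the numbers from this example into a self-contained snippet (mirroring the sketches above) reproduces both figures:

```python
import math

P = [0.5, 0.5, 0.0, 0.0]      # frequencies of A, B, C, D
Q = [0.25, 0.25, 0.25, 0.25]  # uniform coding distribution

entropy = sum(p * math.log2(1 / p) for p in P if p > 0)
cross = sum(p * math.log2(1 / q) for p, q in zip(P, Q) if p > 0)
print(entropy)  # 1.0 bit per letter  (optimal code: A -> 0, B -> 1)
print(cross)    # 2.0 bits per letter (code: A -> 00, B -> 01, C -> 10, D -> 11)
```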
Best Answer
If the data is $x^n = x_1 \ldots x_n$, that is, an $n$-sequence from a sample space $\mathcal{X}$, the empirical point probabilities are
$$\hat{p}(x) = \frac{1}{n}|\{ i \mid x_i = x\}| = \frac{1}{n} \sum_{i=1}^n \delta_x(x_i)$$
for $x \in \mathcal{X}$. Here $\delta_x(x_i)$ is one if $x_i = x$ and zero otherwise. That is, $\hat{p}(x)$ is the relative frequency of $x$ in the observed sequence.

The entropy of the probability distribution given by the empirical point probabilities is
$$H(\hat{p}) = - \sum_{x \in \mathcal{X}} \hat{p}(x) \log \hat{p}(x) = - \sum_{x \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^n \delta_x(x_i) \log \hat{p}(x) = -\frac{1}{n} \sum_{i=1}^n \log\hat{p}(x_i).$$
The latter identity follows by interchanging the two sums and noting that
$$\sum_{x \in \mathcal{X}} \delta_x(x_i) \log\hat{p}(x) = \log\hat{p}(x_i).$$

From this we see that
$$H(\hat{p}) = - \frac{1}{n} \log \hat{p}(x^n)$$
with $\hat{p}(x^n) = \prod_{i=1}^n \hat{p}(x_i)$, and, using the terminology from the question, this is the empirical entropy of the empirical probability distribution. As pointed out by @cardinal in a comment, $- \frac{1}{n} \log p(x^n)$ is the empirical entropy of a given probability distribution with point probabilities $p$.
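To make the identity concrete, here is a small sketch (base-2 logarithms and all names are my choices; the answer above leaves the log base unspecified) that checks $H(\hat{p}) = -\frac{1}{n}\log\hat{p}(x^n)$ on a toy sequence:

```python
import math
from collections import Counter

def empirical_entropy(seq):
    """-(1/n) * sum_i log2(p_hat(x_i)), with p_hat the relative frequencies."""
    n = len(seq)
    p_hat = {x: c / n for x, c in Counter(seq).items()}
    return -sum(math.log2(p_hat[x]) for x in seq) / n

seq = "AABAB"
# Plug-in entropy of the empirical distribution: -sum_x p_hat(x) * log2 p_hat(x)
n = len(seq)
h = -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
assert abs(empirical_entropy(seq) - h) < 1e-12  # the two formulas agree
```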