To encode an event occurring with probability $p$ you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
So in an optimal encoding, the average length of an encoded message is
$$
\sum_i p_i \log_2(\tfrac{1}{p_i}),
$$
that is, the Shannon entropy of the original probability distribution.
However, if for a probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is
$$
\sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}),
$$
which is the cross entropy; it is greater than $\sum_i p_i \log_2(\tfrac{1}{p_i})$ whenever $P \neq Q$.
As an example, consider an alphabet of four letters (A, B, C, D), where A and B occur with the same frequency and C and D do not appear at all. So the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then, to encode it optimally, we encode A as 0 and B as 1, and we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we keep the same probability distribution $P$ and encode it according to a distribution in which all letters are equally probable, $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
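Here is a minimal sketch (plain Python; the helper names `entropy` and `cross_entropy` are my own) that reproduces the 1 bit vs. 2 bits per letter from this example:

```python
from math import log2

def entropy(p):
    # Shannon entropy: sum_i p_i * log2(1 / p_i), skipping zero-probability letters
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # average code length when letters follow P but the code is optimal for Q
    return sum(pi * log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [1/2, 1/2, 0, 0]          # true frequencies of A, B, C, D
Q = [1/4, 1/4, 1/4, 1/4]      # distribution the code was designed for

print(entropy(P))             # 1.0 bit per letter (A -> 0, B -> 1)
print(cross_entropy(P, Q))    # 2.0 bits per letter (A -> 00, B -> 01, C -> 10, D -> 11)
```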
Because of the limiting density of discrete points, the interpretation of
$$S = -\sum_x p(x)\ln p(x)$$
cannot be generalized to
$$S= -\int dx (p(x)\ln p(x))$$
This is because the direct generalization (replacing $p_i$ with the interval probability $p(x)dx$) leads to
$$S= -\int dx p(x)\ln (p(x)dx) = -\int dx p(x)\ln (p(x)) -\int dx p(x)\ln (dx) $$
Clearly, the $\ln dx$ term blows up as $dx \to 0$.
Intuitively, since $p(x)\,dx \to 0$, the reasoning of using fewer bits to encode something that is more likely to happen no longer holds. So we need to find another way to interpret $S= -\int dx\, p(x)\ln (p(x)dx)$, and the choice is the KL divergence.
Say we have a uniform distribution $q(x)$ on the same state space; then we have $$KL(p(x)\Vert q(x)) = \int dx\, p(x) \ln \left(\frac{p(x)dx}{q(x)dx}\right)$$
Since $q(x)$ is just a constant, we effectively keep the form of $S= -\int dx\, (p(x)\ln (p(x)dx))$, while at the same time obtaining a well-defined quantity for the continuous distribution $p(x)$.
So from the KL divergence, the entropy of a continuous distribution $p(x)$ can be interpreted as follows:
if we use a code built for the uniform distribution to encode $p(x)$, how many bits are unnecessary on average.
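A hedged sketch of that interpretation (Python with SciPy; the triangular density $p(x)=2x$ on $[0,1]$ is my own toy example), computing $KL(p \Vert q)$ against a uniform $q$ by numerical integration:

```python
import numpy as np
from scipy.integrate import quad

def p(x):
    return 2.0 * x                     # triangular density on [0, 1]

def q(x):
    return 1.0                         # uniform density on [0, 1]

# KL(p || q) = int p(x) log2(p(x) / q(x)) dx, guarding the p(x) = 0 endpoint
integrand = lambda x: p(x) * np.log2(p(x) / q(x)) if p(x) > 0 else 0.0
kl_bits, _ = quad(integrand, 0.0, 1.0)

print(kl_bits)  # ~0.28 bits wasted per sample by a code built for the uniform distribution
```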
Best Answer
There is no interpretation of differential entropy which would be as meaningful or useful as that of entropy. The problem with continuous random variables is that their values typically have 0 probability, and therefore would require an infinite number of bits to encode.
If you look at the limit of discrete entropy by measuring the probability of intervals $[n\varepsilon, (n + 1)\varepsilon)$, you end up with
$$-\int p(x) \log_2 p(x) \, dx - \log_2 \varepsilon$$
and not the differential entropy. This quantity is in a sense more meaningful, but it will diverge to infinity as we take smaller and smaller intervals. That makes sense, since we need more and more bits to encode which of the many intervals the value of our random variable falls into.
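A quick numerical check (Python with NumPy/SciPy; the standard normal is my own choice of example) that the binned entropy behaves like the differential entropy plus $-\log_2 \varepsilon$, and therefore diverges as $\varepsilon \to 0$:

```python
import numpy as np
from scipy.stats import norm

def binned_entropy(eps, span=10.0):
    # Shannon entropy (bits) of a standard normal discretized into bins of width eps
    edges = np.arange(-span, span + eps, eps)
    probs = np.diff(norm.cdf(edges))   # probability mass of each interval
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

h_diff = 0.5 * np.log2(2 * np.pi * np.e)   # differential entropy of N(0, 1), about 2.05 bits

for eps in [0.5, 0.1, 0.01, 0.001]:
    print(eps, binned_entropy(eps), h_diff - np.log2(eps))
# the two columns agree closely, and both grow without bound as eps shrinks
```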
A more useful quantity to look at for continuous distributions is the relative entropy (also Kullback-Leibler divergence). For discrete distributions:
$$D_\text{KL}[P || Q] = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}.$$
It measures the number of extra bits used when the true distribution is $P$, but we use $-\log_2 Q(x)$ bits to encode $x$. We can take the limit of the relative entropy and arrive at
$$D_\text{KL}[p \mid\mid q] = \int p(x) \log_2 \frac{p(x)}{q(x)} \, dx,$$
because $\log_2 \varepsilon$ will cancel. For continuous distributions this corresponds to the number of extra bits used in the limit of infinitesimally small bins. For both continuous and discrete distributions, this is always non-negative.
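As a numerical illustration (Python with SciPy; the two Gaussians are my own example, not from the answer), the binned relative entropy indeed converges to the continuous value as the bins shrink, since the $\log_2 \varepsilon$ terms cancel:

```python
import numpy as np
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

def binned_kl(eps, span=15.0):
    # D_KL (bits) between the discretized versions of p and q with bin width eps
    edges = np.arange(-span, span + eps, eps)
    P = np.diff(p.cdf(edges))
    Q = np.diff(q.cdf(edges))
    mask = (P > 0) & (Q > 0)
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

def gauss_kl_bits(m1, s1, m2, s2):
    # closed-form KL(N(m1, s1^2) || N(m2, s2^2)), converted from nats to bits
    nats = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
    return nats / np.log(2)

for eps in [1.0, 0.1, 0.01]:
    print(eps, binned_kl(eps))
print("continuous:", gauss_kl_bits(0.0, 1.0, 1.0, 2.0))   # ~0.64 bits
```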
Now, we could think of differential entropy as the negative relative entropy between $p(x)$ and an unnormalized density $\lambda(x) = 1$,
$$-\int p(x) \log_2 p(x) \, dx = -D_\text{KL}[p \mid\mid \lambda].$$
Its interpretation would be the difference in the number of bits obtained by using $-\log_2 \int_{n\varepsilon}^{(n + 1)\varepsilon} p(x) \, dx$ bits to encode the $n$-th interval instead of $-\log_2 \varepsilon$ bits. Even though the former would be optimal, this difference can now be negative, because $\lambda$ is cheating (by not integrating to 1) and therefore might assign fewer bits on average than theoretically possible.
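A small check (Python with NumPy; the narrow normal is my own example) that this quantity can indeed be negative: a density concentrated on a scale much smaller than the unit width of $\lambda$ has negative differential entropy.

```python
import numpy as np

def normal_diff_entropy_bits(sigma):
    # differential entropy of N(mu, sigma^2) in bits: 0.5 * log2(2 * pi * e * sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(normal_diff_entropy_bits(1.0))   # about  2.05 bits
print(normal_diff_entropy_bits(0.1))   # about -1.27 bits, i.e. -D_KL[p || lambda] is negative
```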
See Sergio Verdu's talk for a great introduction to relative entropy.