The Kullback-Leibler divergence is not a metric proper, since it is not symmetric and it does not satisfy the triangle inequality. So the "roles" played by the two distributions are different, and it is important to assign these roles according to the real-world phenomenon under study.
When we write (the OP has calculated the expression using base-2 logarithms)
$$\mathbb K\left(P||Q\right) = \sum_{i}\log_2 (p_i/q_i)p_i $$
we consider the $P$ distribution to be the "target distribution" (usually considered to be the true distribution), which we approximate by using the $Q$ distribution.
Now,
$$\sum_{i}\log_2 (p_i/q_i)p_i = \sum_{i}\log_2 (p_i)p_i-\sum_{i}\log_2 (q_i)p_i = -H(P) - E_P(\log_2(Q))$$
where $H(P)$ is the Shannon entropy of distribution $P$ and $-E_P(\log_2(Q))$ is called the "cross-entropy of $P$ and $Q$", which is also non-symmetric.
Writing
$$\mathbb K\left(P||Q\right) = H(P,Q) - H(P)$$
(here too, the order in which we write the distributions matters, since the cross-entropy is also not symmetric) lets us see that the KL divergence reflects an increase in entropy over the unavoidable entropy of distribution $P$.
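As a minimal numerical sketch of this decomposition (in Python, with two made-up example distributions chosen purely for illustration), the identity can be checked directly:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) of a discrete distribution, in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -E_P[log2 Q], in bits."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """KL divergence computed directly from its definition, in bits."""
    return np.sum(p * np.log2(p / q))

# Two small example distributions, made up here purely for illustration.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# K(P||Q) = H(P, Q) - H(P), up to floating-point rounding.
print(kl_divergence(p, q))               # direct definition
print(cross_entropy(p, q) - entropy(p))  # cross-entropy minus entropy
```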
So, no, the KL divergence is better not interpreted as a "distance measure" between distributions, but rather as a measure of the entropy increase caused by using an approximation to the true distribution rather than the true distribution itself.
So we are in Information Theory land. To hear it from the masters (Cover & Thomas):

"...if we knew the true distribution $P$ of the random variable, we could construct a code with average description length $H(P)$. If, instead, we used the code for a distribution $Q$, we would need $H(P) + \mathbb K (P||Q)$ bits on the average to describe the random variable."
The same wise people say:

"...it is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. Nonetheless, it is often useful to think of relative entropy as a “distance” between distributions."
But this latter view is useful mainly when one attempts to minimize the KL divergence in order to optimize some estimation procedure. For interpreting its numerical value per se, it is not useful, and one should prefer the "entropy increase" approach.
For the specific distributions of the question (always using base-2 logarithms)
$$ \mathbb K\left(P||Q\right) = 0.49282,\;\;\;\; H(P) = 1.9486$$
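Plugging these numbers into the decomposition above, the implied cross-entropy and relative overhead are
$$H(P,Q) = H(P) + \mathbb K\left(P||Q\right) = 1.9486 + 0.49282 \approx 2.4414, \qquad \frac{\mathbb K\left(P||Q\right)}{H(P)} = \frac{0.49282}{1.9486} \approx 0.25$$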
In other words, you need about 25% more bits to describe the situation if you are going to use $Q$ while the true distribution is $P$. This means longer code words, more time to write them, more memory, more time to read them, a higher probability of mistakes, etc. It is no accident that Cover & Thomas say that the KL divergence (or "relative entropy") "measures the inefficiency caused by the approximation."
Best Answer
Intuitive understanding is somewhat subjective, but I can at least offer my perspective:
Kullback-Leibler divergence is a concept from Information Theory. It tells you how much longer, on average (measured in bits), your messages are going to be if you use a suboptimal coding scheme.
For every probability distribution, there is a lower bound on the average message length, and that is the entropy of the distribution. For the distribution $P$ from your Wikipedia example, it is
$$ - \sum_x P(x) \cdot \log_2 P(x) \approx 1.462 $$
That is, if you were to record realisations of random variables from that probability distribution, e.g. in a computer file, or transmit them over a limited-bandwidth channel, you'd need, on average, at least $1.462$ bits per realisation, no matter how sophisticated your coding is. Since in that distribution the case $x = 2$ is three times as probable as $x = 3$, it makes sense to use a shorter code for encoding the event $x=2$ than for encoding $x=3$. You could, for example, use the following encoding:

x = 1 → 10
x = 2 → 0
x = 3 → 110

The average message length with this code is $1.68$ bits, which is (of course!) more than the theoretical lower bound, but still better than an equal-length code, e.g.:

x = 1 → 00
x = 2 → 01
x = 3 → 10
which would need $2$ bits per event. You can construct more complex codes to encode sequences of events, but no matter what you do, you won't be able to beat the information-theoretical lower bound.
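As a minimal sketch (assuming the Wikipedia example's distribution $P = (0.36, 0.48, 0.16)$ and the codeword lengths of the two codes above), these averages and the entropy bound can be checked in Python:

```python
import numpy as np

# Assumed distribution from the Wikipedia example referenced above:
# P(x=1) = 0.36, P(x=2) = 0.48, P(x=3) = 0.16.
p = np.array([0.36, 0.48, 0.16])

# Theoretical lower bound on the average message length: the entropy of P.
entropy_p = -np.sum(p * np.log2(p))

# Codeword lengths of the two example codes above.
variable_length = np.array([2, 1, 3])   # e.g. x=1 -> "10", x=2 -> "0", x=3 -> "110"
equal_length = np.array([2, 2, 2])      # e.g. x=1 -> "00", x=2 -> "01", x=3 -> "10"

print(entropy_p)                        # ~1.462 bits
print(np.sum(p * variable_length))      # 1.68 bits per event on average
print(np.sum(p * equal_length))         # 2.0 bits per event
```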
Now, for a different distribution, say $Q$, the best possible coding would be different, and other encodings only approximate it. The entropy of $Q$ from your example is $\approx 1.583$ bits. As approximations, both codes above are equally good, requiring on average $2$ bits per event, but more complex codes might do better.
However, what is better for encoding $Q$ is not necessarily better for encoding $P$. The Kullback-Leibler divergence tells you how many bits it costs you to use a coding optimised for transmitting/storing information on $Q$ when your true probability distribution is $P$. This measure cannot be negative. If it were, it would mean that you could beat the optimal coding for $P$ by using the coding optimised for $Q$ instead.
Indeed, the KL divergence $D_{KL}(P||P) = 0$ (easy to show, because $\log(p(x)/p(x)) = \log(1) = 0$) tells you that encoding the probability distribution $P$ with a code optimised for that distribution incurs zero extra cost.
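As a final minimal sketch (again assuming $P = (0.36, 0.48, 0.16)$ and a uniform $Q$ for the Wikipedia example), the divergences can be evaluated directly:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits, for discrete distributions on the same support."""
    return np.sum(p * np.log2(p / q))

# Assumed distributions: P from the Wikipedia example, Q uniform over the three outcomes.
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(p, q))   # extra bits per event when coding P with a code optimised for Q
print(kl_divergence(q, p))   # note the asymmetry: generally a different value
print(kl_divergence(p, p))   # 0.0: the code optimised for P costs nothing extra
```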