The paradox can be stated in a simpler form:
We know that $I(X;Y)=h(X)-h(X|Y)\ge 0$ holds also for continuous variables. Take the particular case $Y=X$; then the second term vanishes ($h(X|X)=0$) and we get
$$ I(X;X)= h(X)-h(X|X)=h(X) \ge 0$$
But this cannot be right: the differential entropy can be negative. So what is going on?
When dealing with continuous random variables where one is a function of the other, their conditional differential entropy may be non-zero (it doesn't matter whether it is positive or negative), which is not intuitive at all.
Your problem (and the problem with the above paradox) is the implicit assumption that the concept "zero entropy means no uncertainty" also applies to differential entropy. That's false. Despite the name, differential entropy is not really a type of entropy. It's false that $h(g(X)|X)=0$, it's false that $h(X|X)=0$, and it's false that $h(X)=0$ implies zero uncertainty. The fact that a differential entropy can be made negative by a mere change of scale suggests by itself that here zero differential entropy (conditional or not) has no special meaning. In particular, a uniform variable on $[0,1]$ has $h(X)=0$.
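To make this concrete, here is a minimal numerical sketch (assuming Python with NumPy/SciPy, which are not part of the original post): the differential entropy of a uniform variable on $[0,a]$ is $\log_2 a$ bits, so it is positive, zero, or negative depending only on the scale $a$.

```python
import numpy as np
from scipy import stats

# Differential entropy of Uniform[0, a] is log2(a) bits:
# positive for a > 1, exactly 0 for a = 1, negative for a < 1.
for a in [2.0, 1.0, 0.5]:
    # scipy returns differential entropy in nats; convert to bits
    h_bits = stats.uniform(loc=0.0, scale=a).entropy() / np.log(2)
    print(f"a = {a}: h(X) = {h_bits:+.3f} bits (closed form log2(a) = {np.log2(a):+.3f})")
```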
We know that $H(X)$ quantifies the amount of information that each observation of $X$ provides or, equivalently, the minimal number of bits that we need to encode $X$ ($L_X \to H(X)$, where $L_X$ is the optimal average codelength; first Shannon theorem).
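As a small illustration of $L_X \to H(X)$ (a sketch only, with an arbitrary dyadic toy distribution of my own choosing), the following builds a Huffman code and compares its average codelength with $H(X)$; for dyadic probabilities the two coincide exactly.

```python
import heapq
import numpy as np

# Toy source with dyadic probabilities (illustrative assumption, not from the post)
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

H_X = -sum(q * np.log2(q) for q in p.values())      # entropy in bits

# Standard Huffman construction: repeatedly merge the two least probable nodes
heap = [(q, i, sym) for i, (sym, q) in enumerate(p.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    q1, _, t1 = heapq.heappop(heap)
    q2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (q1 + q2, count, (t1, t2)))
    count += 1

codes = {}
def assign(tree, prefix=""):
    if isinstance(tree, tuple):                      # internal node
        assign(tree[0], prefix + "0")
        assign(tree[1], prefix + "1")
    else:                                            # leaf symbol
        codes[tree] = prefix or "0"
assign(heap[0][2])

L_X = sum(p[s] * len(codes[s]) for s in p)           # average codelength in bits
print(H_X, L_X, codes)                               # here L_X = H_X = 1.75 bits
```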
The mutual information
$$I(X;Y)=H(X) - H(X \mid Y)$$
measures the reduction in uncertainty (or the "information gained") about $X$ when $Y$ is known.
It can be written as $$I(X;Y)=D(p_{X,Y}\mid \mid p_X \,p_Y)=D(p_{X\mid Y} \,p_Y \mid \mid p_X \,p_Y)$$
where $D(\cdot)$ is the Kullback–Leibler divergence or distance, or relative entropy... or information gain (this latter term is not used so much in information theory, in my experience).
So, they are the same thing. Granted, $D(\cdot)$ is not symmetric in its arguments, but don't let that confuse you. We are not computing $D(p_X \mid \mid p_Y)$, but $D(p_{X,Y}\mid \mid p_X \,p_Y)$, and this is symmetric in $X$ and $Y$.
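A quick numerical check (a Python sketch; the $2\times 2$ joint distribution below is a made-up example, not from the post) that $H(X)-H(X\mid Y)$ and $D(p_{X,Y}\mid \mid p_X \,p_Y)$ give the same number:

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])        # joint distribution, rows = x, cols = y
p_x = p_xy.sum(axis=1)                 # marginal of X
p_y = p_xy.sum(axis=0)                 # marginal of Y

H = lambda p: -np.sum(p * np.log2(p))  # entropy in bits (no zero entries here)

H_X = H(p_x)
H_X_given_Y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))
mi_from_entropies = H_X - H_X_given_Y

# D(p_{X,Y} || p_X p_Y)
mi_from_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

print(mi_from_entropies, mi_from_kl)   # the two values coincide
```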
A slightly different situation (to connect with this) arises when one is interested in the effect of knowing a particular value $Y=y$. In this case, because we are not averaging over $y$, the number of bits gained [*] would be $D(p_{X\mid Y=y} \mid \mid p_X)$... which depends on $y$.
[*] To be precise, that's actually the number of bits we waste when coding the conditioned source $X\mid Y=y$ as if we didn't know $Y$ (i.e., using the unconditioned distribution of $X$).
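To see the dependence on $y$, here is a sketch (reusing the same made-up joint distribution as above) that computes $D(p_{X\mid Y=y} \mid \mid p_X)$ for each value of $y$; averaging these per-$y$ gains over $p(y)$ recovers $I(X;Y)$.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])        # same illustrative joint as before
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

per_y_gain = []
for j, py in enumerate(p_y):
    p_x_given_y = p_xy[:, j] / py
    d = np.sum(p_x_given_y * np.log2(p_x_given_y / p_x))  # D(p_{X|Y=y} || p_X)
    per_y_gain.append(d)
    print(f"y = {j}: D(p_X|y || p_X) = {d:.4f} bits")

print("average over y:", np.dot(p_y, per_y_gain))          # equals I(X;Y)
```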
Best Answer
If one variable determines another, the quantity $H(X|Y)+H(Y|X)$ is not necessarily $0$. The value will only be $0$ if $X$ and $Y$ each determine the other.
For instance if $Y$ is a function of $X$ (i.e. $Y = f(X)$) then $H(Y|X) = 0$ but $H(X | Y)$ can take a value greater than $0$. So just because $X$ determines $Y$ does not mean $Y$ determines $X$.
For instance, if $X$ is uniform on the $2N$ values $\{\pm 1, \dots, \pm N\}$, $f(x) = |x|$ and $Y = f(X)$, then it's clear that $X$ determines $Y$, but because $Y$ does not determine $X$ the quantity $H(X|Y)+H(Y|X)$ will be non-zero.
\begin{align} H(X) &= \log_2 N + 1 \\ H(Y) &= \log_2 N \\ H(Y|X) &= 0 \\ H(X|Y) &= 1 \\ \end{align}
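A small sketch (with the illustrative choice $N=8$, not part of the original answer) that recovers these four values from the construction $Y=|X|$:

```python
import numpy as np
from collections import Counter

N = 8
xs = [x for x in range(-N, N + 1) if x != 0]       # the 2N equiprobable values of X
p = 1.0 / len(xs)

H_X = -sum(p * np.log2(p) for _ in xs)             # = log2(N) + 1

# Group the x values by y = |x|; each y has 2 preimages, so p(y) = 1/N
groups = Counter(abs(x) for x in xs)               # y -> number of x mapping to it
p_y = {y: c * p for y, c in groups.items()}
H_Y = -sum(q * np.log2(q) for q in p_y.values())   # = log2(N)

# Given Y = y, X is uniform over the preimages of y, so H(X|Y=y) = log2(#preimages) = 1
H_X_given_Y = sum(p_y[y] * np.log2(groups[y]) for y in groups)
H_Y_given_X = 0.0                                  # Y = |X| is a function of X

print(H_X, H_Y, H_Y_given_X, H_X_given_Y)          # log2(N)+1, log2(N), 0, 1
```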