I'm assuming you meant to say $\mathcal{X} = \mathcal{Y} = \{0,1\}$. If $Y=X$, i.e. the two random variables are equal, $I(X;Y)$ is fixed and $P$ is meaningless.
Your expansion doesn't really help because both $H(Y)$ and $H(Y|X)$ depend on $q$. Instead, let's look at $I(X;Y) = H(X) - H(X|Y)$ by the symmetry of mutual information. $H(X)$ is fixed for a fixed $p$, so we only need to maximize $H(X|Y)$ through $q$ to minimize $I(X;Y)$.
If $q = 0.5$, observing $Y$ doesn't give any information about $X$, so our best guess is equivalent to flipping a $p$-biased coin, using the outcome as our guess and ignoring $Y$ altogether. So, for $q=0.5$, $H(X|Y) = H(X)$, which is the maximum because conditioning can't increase entropy. You can plug in the joint distribution of $X,Y$ to verify this intuitive explanation.
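If you want to check this numerically, here is a minimal sketch in Python (the value of $p$ is just a made-up example, and I'm assuming the channel in your question is a BSC with crossover probability $q$): it builds the joint distribution of $X$ and $Y$ and evaluates $I(X;Y)$, which you can see is minimized at $q=0.5$.

```python
import numpy as np

def entropy_bits(probs):
    """H(P) = -sum p*log2(p), with the convention 0*log(0) = 0."""
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0
    return -np.sum(probs[nz] * np.log2(probs[nz]))

def mutual_information(p, q):
    """I(X;Y) for X ~ Bernoulli(p) passed through a BSC with crossover probability q."""
    # Joint P(X=x, Y=y): the channel flips X with probability q.
    joint = np.array([[(1 - p) * (1 - q), (1 - p) * q],
                      [p * q,             p * (1 - q)]])
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return (entropy_bits(joint.sum(axis=1)) + entropy_bits(joint.sum(axis=0))
            - entropy_bits(joint.ravel()))

p = 0.3  # hypothetical input bias
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"q = {q:4.2f} -> I(X;Y) = {mutual_information(p, q):.4f} bits")
# The values are symmetric around q = 0.5, where I(X;Y) drops to 0.
```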
Given a channel output $Y_i=y_i$, we compute $P_{X_i|Y_i}(x_i|y_i)$ for $x_i=0$ and $x_i=1$ via Bayes' rule, i.e.,
$$
P_{X_i|Y_i}(x_i|y_i)= \frac{P_{Y_i|X_i}(y_i|x_i)P_{X_i}(x_i)}{P_{Y_i}(y_i)}.
$$
Since $P_{X_i}(0)=P_{X_i}(1)=\frac{1}{2}$ (the common assumption in this context), we have that
$$P_{X_i|Y_i}(x_i|y_i)=c\cdot P_{Y_i|X_i}(y_i|x_i)$$
for $x_i=0, 1$, where $c$ is some constant.
Since $P_{X_i|Y_i}(0|y_i)+P_{X_i|Y_i}(1|y_i)=1$ and the likelihood $P_{Y_i|X_i}(y_i|x_i)$ is specified by the channel (you can see more details in the last part of my reply), we can find $c$ without computing $P_{Y_i}(y_i)$.
However, a smarter choice of $c$ is
$$c=(P_{Y_i|X_i}(y_i|0)+P_{Y_i|X_i}(y_i|1))^{-1}.$$
From here, you should be able to sense why the posterior probability $P_{X_i|Y_i}(x_i|y_i)$ is called the normalized likelihood. To end the first part, I would like to point out this fact:
$$
\frac{P_{X_i|Y_i}(0|y_i)}{P_{X_i|Y_i}(1|y_i)}=\frac{P_{Y_i|X_i}(y_i|0)}{P_{Y_i|X_i}(y_i|1)}.
$$
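As a concrete illustration, here is a small Python sketch (the likelihood numbers are made up) that turns the two channel likelihoods into posteriors exactly as above, i.e., by normalizing them to sum to one under the uniform prior; it also checks the posterior ratio against the likelihood ratio.

```python
def posterior_from_likelihoods(lik0, lik1):
    """P(X=0|y) and P(X=1|y) from the likelihoods P(y|X=0) and P(y|X=1),
    assuming the uniform prior P(X=0) = P(X=1) = 1/2."""
    c = 1.0 / (lik0 + lik1)          # the normalization constant from above
    return c * lik0, c * lik1

# Hypothetical likelihoods for some received value y
lik0, lik1 = 0.12, 0.03
p0, p1 = posterior_from_likelihoods(lik0, lik1)
print(p0, p1)                # 0.8, 0.2 -- they sum to 1
print(p0 / p1, lik0 / lik1)  # both ratios equal 4.0, matching the fact above
```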
Regarding your second question:
> The author also writes: "from the point of view of decoding, all that matters is the likelihood ratio". While this seems intuitively plausible to me, is there a simple way to see that this is true in general from the formula of Bayes' theorem?
Looking at the decoding problem, our goal is to find a codeword $\mathbf{x}$ in your codebook $\mathcal{C}$ such that $P_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y})$ is maximized. For simplicity, I consider memoryless binary-input channels. As codewords are equally likely (by assumption), the decoding task for any given channel output $\mathbf{y}$ is
$$
\arg\max_{\mathbf{x}\in\mathcal{C}}P_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y})
=\arg\max_{\mathbf{x}\in\mathcal{C}}P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})=\arg\max_{\mathbf{x}\in\mathcal{C}}\prod_{i=1}^n P_{Y_i|X_i}(y_i|x_i).
$$
To solve the above optimization problem, we may compare $P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})$ for all pairs of codewords $\mathbf{x}_1$ and $\mathbf{x}_2$: we know $\mathbf{x}_1$ is more likely to be transmitted if
$$
\prod_{i=1}^n\frac{P_{Y_i|X_i}(y_i|x_{1,i})}{P_{Y_i|X_i}(y_i|x_{2,i})}\ge 1.
$$
From this comparison, you should realize why the decoding only relies on the likelihood ratios.
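If it helps, here is a minimal sketch of such a decoder in Python. The channel and the two-codeword repetition codebook are made up (I use a BSC as a stand-in for whatever memoryless channel you actually have); the decoder picks the codeword maximizing the product of per-symbol likelihoods, which is the same as comparing the product of likelihood ratios against 1.

```python
import math

def symbol_likelihood(y, x, eps=0.1):
    """P(Y_i = y | X_i = x) for a BSC with crossover probability eps
    (a stand-in for whatever memoryless channel is actually used)."""
    return 1 - eps if y == x else eps

def ml_decode(y, codebook):
    """Return the codeword maximizing prod_i P(y_i | x_i); log-likelihoods
    are summed so the product does not underflow for long codewords."""
    return max(codebook,
               key=lambda x: sum(math.log(symbol_likelihood(yi, xi))
                                 for yi, xi in zip(y, x)))

codebook = [(0, 0, 0, 0), (1, 1, 1, 1)]   # hypothetical length-4 repetition code
y = (1, 1, 0, 1)                          # received word with a single flip
print(ml_decode(y, codebook))             # (1, 1, 1, 1)
```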
Your last question should no longer be a question. Indeed, the equality $$P_{Y_i|X_i}(y_i|0)+P_{Y_i|X_i}(y_i|1)=1$$ is in general not true, and we really don't mean this (I know the name "normalized likelihood" might be confusing). So, forget about this equality.
Next, how do we compute the likelihood $P_{Y_i|X_i}(y_i|x_i)$? Well, it is given to you once the channel model is determined since $P_{Y_i|X_i}(y_i|x_i)$ is exactly the channel transition probability.
For a given channel output $Y_i=y_i$ of an AWGNC where $Y_i=X_i+N_i$ and $N_i\sim\mathcal{N}(0, \sigma^2)$, we have $P_{Y_i|X_i}(y_i|x_i)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-x_i)^2}{2\sigma^2}\right)$, i.e., the density of $\mathcal{N}(x_i, \sigma^2)$ evaluated at $y_i$. The other cases would need to be specified more precisely before we can discuss them further. I hope the above helps. :)
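For the AWGN case above, a short Python sketch (assuming the inputs $0$ and $1$ are sent unmodulated, as in $Y_i = X_i + N_i$, with made-up values for $\sigma$ and $y_i$) that evaluates the two Gaussian likelihoods and the resulting posteriors might look like this:

```python
import math

def awgn_likelihood(y, x, sigma):
    """P_{Y_i|X_i}(y|x): Gaussian density with mean x and variance sigma^2."""
    return math.exp(-(y - x) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

sigma = 0.5    # hypothetical noise standard deviation
y = 0.8        # hypothetical channel output
lik0 = awgn_likelihood(y, 0.0, sigma)
lik1 = awgn_likelihood(y, 1.0, sigma)
c = 1.0 / (lik0 + lik1)        # the same normalization as in the first part
print(c * lik0, c * lik1)      # posteriors P(X_i=0|y_i) and P(X_i=1|y_i)
```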
According to R. G. Gallager, "Information theory and reliable communication", Wiley (1968):
DMC = discrete memoryless channel.