For simplicity, in what follows, assume that we are talking about finite discrete random variables.
Intuitively speaking, Fano's inequality is about guessing a random variable $X$, whose value is unknown to us, by observing another random variable $Y$ that is jointly distributed with $X$. If we denote our random guess by $\hat X$, then the mathematical translation of this last sentence is $\mathbb{P}(\hat X = \hat x | Y=y, X=x) = \mathbb{P}(\hat X = \hat x | Y = y)$. Mathematically speaking, if $X \to Y \to \hat X$ is a Markov chain, Fano's inequality gives a lower bound on the probability of the error event as follows
$$\mathbb{P}(X \ne \hat X) \ge \frac{H(X|Y) - 1}{\log_2(|\mathcal{X}|-1)},$$
where $\mathcal{X}$ denotes the range of $X$.
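For concreteness, here is a quick numerical sanity check: a minimal sketch (assuming a small, made-up joint pmf `P` over a ternary $X$ and binary $Y$, and using NumPy) that evaluates $H(X|Y)$ and the resulting Fano lower bound.

```python
import numpy as np

# Hypothetical joint pmf P(X=x, Y=y), purely for illustration.
# Rows index x, columns index y; the entries sum to 1.
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.10, 0.20]])

Py = P.sum(axis=0)            # marginal of Y
Px_given_y = P / Py           # column y holds P(X = . | Y = y)

# Conditional entropy H(X|Y) in bits
H_X_given_Y = -np.sum(P * np.log2(Px_given_y))

# Fano's lower bound on P(X != X_hat) for any estimator X_hat = g(Y)
card_X = P.shape[0]
fano_bound = (H_X_given_Y - 1) / np.log2(card_X - 1)
print(f"H(X|Y) = {H_X_given_Y:.4f} bits, Fano bound: P(error) >= {fano_bound:.4f}")
```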
Considering the intuition behind Fano's inequality as a motivation, the following question seems natural. Suppose that we know the joint distribution $\mathbb{P}(X = x, Y = y)$. Given that the event $Y = y$ has occurred, what would be the "best" strategy for guessing $X$? Here, "best" means achieving the smallest $\mathbb{P}(\hat X \ne X)$.
Given $Y = y$, my intuition is to deterministically set $\hat X = \arg \underset{x \in \mathcal{X}}{\max} \mathbb{P}(X = x | Y = y)$ (breaking ties arbitrarily). Consequently, $\hat X = f(Y)$ for some deterministic function $f$. Now, if $\hat Z = g(Y)$ is any other random variable obtained from a deterministic function $g$ of $Y$, I can show that $\mathbb{P}(\hat Z \ne X) \ge \mathbb{P}(\hat X \ne X)$ as follows:
\begin{align*}
\mathbb{P}(\hat X \ne X) &= \sum_{y\in\mathcal{Y}} \mathbb{P}(\hat X \ne X, Y = y) \\
&= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne f(y), Y = y)
= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne f(y)| Y = y) \mathbb{P}(Y=y) \\
&= \sum_{y\in\mathcal{Y}} (1 - \mathbb{P}(X = f(y)| Y = y)) \mathbb{P}(Y=y) \\
&\leq \sum_{y\in\mathcal{Y}} (1 – \mathbb{P}(X = g(y)| Y = y)) \mathbb{P}(Y=y) \\
&= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne g(y)| Y = y) \mathbb{P}(Y=y)
= \sum_{y\in\mathcal{Y}}\mathbb{P}(X \ne g(y), Y = y) \\
&= \sum_{y\in\mathcal{Y}}\mathbb{P}(\hat Z \ne X, Y = y) \\
&= \mathbb{P}(\hat Z \ne X).
\end{align*}
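As a numerical sanity check of this argument, the following sketch (reusing the hypothetical joint pmf from the snippet above) computes the MAP rule $f$ and brute-forces every deterministic rule $g : \mathcal{Y} \to \mathcal{X}$, confirming that none achieves a smaller error probability.

```python
import itertools
import numpy as np

# Same hypothetical joint pmf as above (rows: x, columns: y).
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.10, 0.20]])
n_x, n_y = P.shape

def error_prob(g):
    """P(X != g(Y)) for a deterministic rule g, given as the tuple (g(y_0), ..., g(y_{n_y-1}))."""
    return 1.0 - sum(P[g[y], y] for y in range(n_y))

# MAP rule: f(y) = argmax_x P(X=x | Y=y), equivalently argmax_x P(X=x, Y=y)
f = tuple(np.argmax(P[:, y]) for y in range(n_y))

# Brute force over every deterministic rule g : Y -> X
best = min(error_prob(g) for g in itertools.product(range(n_x), repeat=n_y))
print(f"MAP error = {error_prob(f):.4f}, best deterministic error = {best:.4f}")
assert np.isclose(error_prob(f), best)
```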
This shows that the choice of $\hat X$ is optimal in this sense. Here is the main question.
Question
I think that the above argument could be extended to any $\hat Z$ which is not necessarily deterministic, provided $X \to Y \to \hat Z$ forms a Markov chain, i.e., $\hat Z$ depends on $X$ only through $Y$. How should I go about this?
Best Answer
Following the hint given by @stochasticboy321, I managed to answer the question. Here it is.
\begin{align*}
\mathbb{P}(X \ne \hat Z) &= \sum_{z \in \mathcal{Z}} \mathbb{P}(X \ne z, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}( X \ne z, Y = y, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}( Y = y, \hat Z = z, X \ne z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y)\mathbb{P}(X \ne z | Y = y, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = z | Y = y, \hat Z = z)) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = z | Y = y)) \tag{Markov}\\
&\ge \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = f(y) | Y = y)) \tag{$\hat X$} \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \sum_{z \in \mathcal{Z}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \sum_{z \in \mathcal{Z}} \mathbb{P}(\hat Z = z, Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \mathbb{P}(Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y), Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne \hat X, Y = y) \\
&= \mathbb{P}(X \ne \hat X).
\end{align*}
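As a numerical illustration of this result (again only a sketch, with the same hypothetical joint pmf as before), one can sample random conditional kernels $Q(z|y)$, each standing in for a randomized $\hat Z$ satisfying $X \to Y \to \hat Z$, and check that none of them beats the MAP error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical joint pmf as above (rows: x, columns: y).
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.10, 0.20]])
n_x, n_y = P.shape
Py = P.sum(axis=0)
Px_given_y = P / Py

# Error of the deterministic MAP rule f(y) = argmax_x P(X=x | Y=y)
map_error = 1.0 - sum(P[:, y].max() for y in range(n_y))

# Randomized estimators Z with X -> Y -> Z: draw kernels Q(z|y) on the alphabet of X
for _ in range(1000):
    Q = rng.dirichlet(np.ones(n_x), size=n_y).T   # Q[z, y] = P(Z=z | Y=y)
    err = np.sum(Py * Q * (1.0 - Px_given_y))     # sum_{y,z} P(y) Q(z|y) (1 - P(X=z|Y=y))
    assert err >= map_error - 1e-12
print(f"MAP error = {map_error:.4f}; no sampled randomized rule did better")
```

Of course, this is only a spot check over finitely many random kernels; it is the inequality chain above that guarantees the conclusion for every such $\hat Z$.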