Probability – Fano’s Inequality and the Best Estimator Explained

Tags: conditional probability, entropy, information theory, probability, random variables

For simplicity, in what follows, assume that we are talking about finite discrete random variables.

Intuitively speaking, Fano's inequality is about guessing the value of a random variable $X$, which we do not observe directly, by observing another random variable $Y$ that has a joint distribution with $X$. If we denote our guess by $\hat X$, then the mathematical translation of this last sentence is $\mathbb{P}(\hat X = \hat x | Y=y, X=x) = \mathbb{P}(\hat X = \hat x | Y = y)$, i.e., the guess depends on $X$ only through $Y$. Mathematically speaking, if $X \to Y \to \hat X$ is a Markov chain, Fano's inequality gives the following lower bound on the probability of error:

$$\mathbb{P}(X \ne \hat X) \ge \frac{H(X|Y) - 1}{\log_2(|\mathcal{X}|-1)},$$

where $\mathcal{X}$ denotes the range of $X$.
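
As a quick numerical illustration, here is a minimal Python sketch that evaluates both sides of this inequality on a small, made-up joint pmf (the distribution below is purely a hypothetical example, not anything from the discussion above) and checks that the bound holds.

```python
# Minimal sanity check of Fano's inequality on a small, hypothetical joint pmf.
# The distribution p[(x, y)] below is made up purely for illustration.
from math import log2

p = {
    (0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.05,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.25,
}

xs = sorted({x for x, _ in p})          # range of X
ys = sorted({y for _, y in p})          # range of Y
p_y = {y: sum(p[(x, y)] for x in xs) for y in ys}   # marginal of Y

# Conditional entropy H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), in bits.
h_x_given_y = -sum(p[(x, y)] * log2(p[(x, y)] / p_y[y])
                   for x in xs for y in ys if p[(x, y)] > 0)

# Smallest achievable error: guess the most likely x for each observed y.
p_err_best = 1.0 - sum(max(p[(x, y)] for x in xs) for y in ys)

# Fano's lower bound in the form quoted above (this form needs |X| >= 3 here,
# so that log2(|X| - 1) > 0).
fano_bound = (h_x_given_y - 1.0) / log2(len(xs) - 1)

print(f"H(X|Y) = {h_x_given_y:.4f}, best error = {p_err_best:.4f}, "
      f"Fano bound = {fano_bound:.4f}")
assert p_err_best >= fano_bound
```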

Considering the intuition behind Fano's inequality as motivation, the following question seems natural. Suppose that we know the joint distribution $\mathbb{P}(X = x, Y = y)$. Given that the event $Y = y$ has occurred, what is the "best" strategy for guessing $X$? Here, "best" means achieving the smallest possible $\mathbb{P}(\hat X \ne X)$.

Given $Y = y$, my intuition is to deterministically define $\hat X = f(y)$, where $f(y) = \arg \underset{x \in \mathcal{X}}{\max}\, \mathbb{P}(X = x | Y = y)$; consequently, $\hat X = f(Y)$. Now, if $\hat Z = g(Y)$ is any other deterministic estimator, I can show that $\mathbb{P}(\hat Z \ne X) \ge \mathbb{P}(\hat X \ne X)$ as follows:

\begin{align*}
\mathbb{P}(\hat X \ne X) &= \sum_{y\in\mathcal{Y}} \mathbb{P}(\hat X \ne X, Y = y) \\
&= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne f(y), Y = y)
= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne f(y)| Y = y) \mathbb{P}(Y=y) \\
&= \sum_{y\in\mathcal{Y}} (1 - \mathbb{P}(X = f(y)| Y = y)) \mathbb{P}(Y=y) \\
&\leq \sum_{y\in\mathcal{Y}} (1 - \mathbb{P}(X = g(y)| Y = y)) \mathbb{P}(Y=y) \\
&= \sum_{y\in\mathcal{Y}} \mathbb{P}(X \ne g(y)| Y = y) \mathbb{P}(Y=y)
= \sum_{y\in\mathcal{Y}}\mathbb{P}(X \ne g(y), Y = y) \\
&= \sum_{y\in\mathcal{Y}}\mathbb{P}(\hat Z \ne X, Y = y) \\
&= \mathbb{P}(\hat Z \ne X).
\end{align*}

This shows that the choice of $\hat X$ is optimal, in this sense, among all deterministic estimators of the form $g(Y)$. Here is the main question.
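
To double-check this conclusion numerically, here is a small Python sketch that brute-forces every deterministic decoder $g : \mathcal{Y} \to \mathcal{X}$ on a made-up joint pmf (the same kind of hypothetical example as in the earlier sketch) and confirms that none of them beats the MAP rule $f$.

```python
# Brute-force check, on a small hypothetical joint pmf, that the MAP rule
# f(y) = argmax_x P(X = x | Y = y) minimizes P(X != g(Y)) over all
# deterministic decoders g.  The pmf is made up purely for illustration.
from itertools import product

p = {
    (0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.05,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.25,
}
xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})

def error_of(g):
    """P(X != g(Y)) for a deterministic decoder given as a dict y -> x."""
    return sum(q for (x, y), q in p.items() if x != g[y])

# MAP decoder: pick the most likely x for each observed y
# (maximizing p(x, y) over x is equivalent to maximizing p(x | y)).
f_map = {y: max(xs, key=lambda x: p[(x, y)]) for y in ys}

# Exhaustive search over all |X|^|Y| deterministic decoders.
best = min(error_of(dict(zip(ys, choice)))
           for choice in product(xs, repeat=len(ys)))

print(f"MAP error = {error_of(f_map):.4f}, best deterministic error = {best:.4f}")
assert abs(error_of(f_map) - best) < 1e-12
```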

Question

I think that the above argument can be extended to any other $\hat Z$ which is not necessarily deterministic but has a joint distribution with $Y$ such that $X \to Y \to \hat Z$ is a Markov chain. How should I go about this?

Best Answer

Following the hint given by @stochasticboy321, I managed to answer the question. Here it is.

\begin{align*}
\mathbb{P}(X \ne \hat Z) &= \sum_{z \in \mathcal{Z}} \mathbb{P}(X \ne z, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}( X \ne z, Y = y, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}( Y = y, \hat Z = z, X \ne z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y)\mathbb{P}(X \ne z | Y = y, \hat Z = z) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = z | Y = y, \hat Z = z)) \\
&= \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = z | Y = y)) \tag{Markov}\\
&\ge \sum_{z \in \mathcal{Z}}\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) (1 - \mathbb{P}(X = f(y) | Y = y)) \tag{$\hat X$} \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \sum_{z \in \mathcal{Z}} \mathbb{P}(Y = y) \mathbb{P}(\hat Z = z | Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \sum_{z \in \mathcal{Z}} \mathbb{P}(\hat Z = z, Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y) | Y = y) \mathbb{P}(Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne f(y), Y = y) \\
&= \sum_{y \in \mathcal{Y}} \mathbb{P}(X \ne \hat X, Y = y) \\
&= \mathbb{P}(X \ne \hat X).
\end{align*}
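
As a numerical sanity check of this argument, the following Python sketch samples random stochastic decoders $\mathbb{P}(\hat Z = z | Y = y)$ on a made-up joint pmf (the same hypothetical setup as in the earlier sketches) and verifies that none of them achieves a smaller error than the deterministic MAP rule.

```python
# Monte-Carlo check, on a small hypothetical joint pmf, that randomly drawn
# stochastic decoders P(Zhat = z | Y = y) never beat the deterministic MAP
# rule.  The pmf and the way decoders are sampled are illustrative only.
import random

p = {
    (0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.05,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.25,
}
xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})
p_y = {y: sum(p[(x, y)] for x in xs) for y in ys}

# Error of the MAP rule: 1 - sum_y max_x p(x, y).
map_error = 1.0 - sum(max(p[(x, y)] for x in xs) for y in ys)

def random_decoder():
    """Draw a random conditional pmf q[y][z] = P(Zhat = z | Y = y)."""
    q = {}
    for y in ys:
        w = [random.random() for _ in xs]
        q[y] = {z: wi / sum(w) for z, wi in zip(xs, w)}
    return q

def error_of(q):
    """P(X != Zhat) = sum_{y,z} P(Y=y) P(Zhat=z|Y=y) (1 - P(X=z|Y=y))."""
    return sum(p_y[y] * q[y][z] * (1.0 - p[(z, y)] / p_y[y])
               for y in ys for z in xs)

random.seed(0)
assert all(error_of(random_decoder()) >= map_error - 1e-12 for _ in range(10_000))
print(f"All sampled stochastic decoders have error >= MAP error = {map_error:.4f}")
```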
