Solved – MLE and Cross Entropy for Conditional Probabilities

cross entropy, maximum likelihood, neural networks

I'm trying to understand the relationship between maximum likelihood estimation for a function of the type $p(y^{(i)}|x^{(i)};\theta)$ and the related cross entropy minimization.

For a single variable this is straightforward. I am using the notation from "Deep Learning" by Goodfellow et al.

$\hat{\theta}_{ML} = \operatorname{argmax}_{\theta} \frac{1}{N}\sum_{i=1}^{N} \log p_{model}(x^{(i)};\theta)$. Writing the empirical distribution of the data as $\hat{p}_{data}$, we can rewrite the function inside the argmax as $E_{\hat{p}_{data}}[\log p_{model}(x;\theta)]$. Maximizing this function is equivalent to minimizing the cross entropy $H(\hat{p}_{data},p_{model})$. OK, easy enough.
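For concreteness, here is a small numerical sketch of that single-variable equivalence (my own toy example, not from the book): for a discrete variable, the sample average of $\log p_{model}$ is exactly $-H(\hat{p}_{data}, p_{model})$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.choice(3, size=1000, p=[0.2, 0.5, 0.3])   # draws from p_data
p_model = np.array([0.25, 0.45, 0.30])                  # a candidate model distribution

# empirical distribution p_hat_data over the three outcomes
p_hat = np.bincount(samples, minlength=3) / len(samples)

avg_log_lik = np.mean(np.log(p_model[samples]))          # (1/N) sum_i log p_model(x_i)
cross_entropy = -np.sum(p_hat * np.log(p_model))         # H(p_hat_data, p_model)

print(avg_log_lik, -cross_entropy)                       # identical up to float error
```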

I'm confused how this generalizes to the case where we are maximizing the likelihood of a model of the type $p_{model}(y^{(i)}|x^{(i)};\theta)$ as we might do in a supervised learning framework.

Trying to follow the same line of reasoning, we should be able to derive a statement that maximizing the likelihood $\hat{\theta}_{ML} = \operatorname{argmax}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p_{model}(y^{(i)}|x^{(i)};\theta)$ is equivalent to minimizing the cross entropy between $\hat{p}_{data}(Y|X)$ and $p_{model}(Y|X)$. However, $H(\hat{p}_{data}(Y|X),p_{model}(Y|X))$ is a random variable w.r.t. $X$, so the analogy doesn't hold.

My best guess is that maximizing the likelihood in this scenario is equivalent to minimizing the expected cross entropy between $\hat{p}_{data}(Y|X)$ and $p_{model}(Y|X)$, taken over the empirical distribution of $X$. Written out, this would be $E_{\hat{p}_{data}(X)}[H(\hat{p}_{data}(Y|X),p_{model}(Y|X))]$.
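As a sanity check of that guess, here is a quick numerical sketch under a toy setup of my own (the distributions and variable names are assumptions for illustration): the average conditional log-likelihood matches minus the expected conditional cross entropy, where the outer expectation is over $\hat{p}_{data}(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
x = rng.choice(2, size=N, p=[0.6, 0.4])                  # two possible "contexts" x
p_true = np.array([[0.7, 0.3],                           # p_data(y|x=0)
                   [0.2, 0.8]])                          # p_data(y|x=1)
y = np.array([rng.choice(2, p=p_true[xi]) for xi in x])  # sample y given x

p_model = np.array([[0.65, 0.35],                        # p_model(y|x=0)
                    [0.25, 0.75]])                       # p_model(y|x=1)

# (1/N) sum_i log p_model(y_i | x_i)
avg_cond_log_lik = np.mean(np.log(p_model[x, y]))

# E_{p_hat(x)}[ H(p_hat(y|x), p_model(y|x)) ]
expected_cond_ce = 0.0
for xv in range(2):
    mask = x == xv
    p_hat_x = mask.mean()                                # p_hat_data(x = xv)
    p_hat_y = np.bincount(y[mask], minlength=2) / mask.sum()
    expected_cond_ce += p_hat_x * (-np.sum(p_hat_y * np.log(p_model[xv])))

print(avg_cond_log_lik, -expected_cond_ce)               # agree up to float error
```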

Any insight would be greatly appreciated.

Best Answer

Actually, I realized that defining the conditional cross-entropy in my previous answer is not even needed; we can directly draw the equivalence between conditional MLE maximization and classical cross-entropy minimization:
\begin{eqnarray}
\hat{\theta}_{ML} &=& \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p_{model}(y^{(i)}|x^{(i)};\theta) \nonumber \\
&=& \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} - \log p_{model}(y^{(i)}|x^{(i)};\theta) \nonumber \\
&=& \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left[ - \log p_{model}(y^{(i)}|x^{(i)};\theta) - \log p(x^{(i)}) \right] \nonumber \\
&=& \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left[ - \log p_{model}(y^{(i)}|x^{(i)};\theta) - \log p_{model}(x^{(i)}|\theta) \right] \nonumber \\
&=& \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} - \log \left( p_{model}(y^{(i)}|x^{(i)};\theta)\, p_{model}(x^{(i)}|\theta) \right) \nonumber \\
&=& \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} - \log p_{model}(y^{(i)}, x^{(i)}|\theta) \nonumber \\
&\approx& \arg\min_{\theta} \mathbb{E}_{p_{data}(x,y)}\left[ -\log p_{model}(y,x|\theta) \right] \nonumber \\
&=& \arg\min_{\theta} \mathbb{E}_{p_{data}(x,y)}\left[ -\log p_{model(\theta)}(y,x) \right] \nonumber
\end{eqnarray}
In the third step we subtracted the logarithm of the true (and unknown) probability of drawing the sample $x^{(i)}$, which does not change the minimizer, as this term is independent of $\theta$. In the fourth step we defined $p_{model}(x^{(i)}|\theta) \equiv p(x^{(i)})$. We can do this because our model is not actually modeling the probability of the sample $x^{(i)}$, so this artificial definition has no impact on the model; it is interpreted as "we model the probability of drawing $x^{(i)}$ as the true probability of this event (which is unknown, but we don't care)". The last step is just a reformulation for those who would be irritated by conditioning on $\theta$, and the resulting term is simply cross-entropy minimization.
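As a rough numerical illustration of the argument (a toy model of my own, not part of the original answer): the average negative conditional log-likelihood and the joint negative log-likelihood $-\frac{1}{N}\sum_i \log\left(p_{model}(y^{(i)}|x^{(i)};\theta)\,p(x^{(i)})\right)$ differ only by a $\theta$-independent constant, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
x = rng.choice(2, size=N, p=[0.6, 0.4])
y = np.where(rng.random(N) < np.where(x == 0, 0.3, 0.8), 1, 0)   # p_data(y=1|x) is 0.3 or 0.8

p_hat_x = np.bincount(x, minlength=2) / N        # stands in for the unknown p(x)

def neg_cond_ll(theta):
    """-1/N sum_i log p_model(y_i|x_i; theta) for the model p(y=1|x) = theta[x]."""
    p1 = theta[x]
    return -np.mean(np.log(np.where(y == 1, p1, 1 - p1)))

def joint_neg_ll(theta):
    """-1/N sum_i log( p_model(y_i|x_i; theta) * p_hat(x_i) )."""
    return neg_cond_ll(theta) - np.mean(np.log(p_hat_x[x]))

for theta in [np.array([0.3, 0.8]), np.array([0.5, 0.5]), np.array([0.25, 0.9])]:
    gap = joint_neg_ll(theta) - neg_cond_ll(theta)
    print(theta, round(gap, 6))                  # same theta-independent gap every time
```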