Machine Learning – Negative Log Likelihood vs Cross-Entropy

cross-entropy, likelihood, machine-learning

I'm developing some machine learning code, and I'm using the softmax function in the output layer.
My loss function minimizes the Negative Log Likelihood (NLL) of the network's output.
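For context, here is a minimal sketch of the setup I mean, assuming NumPy and hypothetical names (`logits` and `targets` are placeholders of my own, not anything from a specific library):

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def nll_loss(probs, targets):
    # Negative log likelihood of the correct class, averaged over the batch.
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), targets]).mean()

# Toy example: 3 samples, 4 classes.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 2.5, 0.3,  0.0],
                   [1.0, 1.0, 1.0,  1.0]])
targets = np.array([0, 1, 3])

print(nll_loss(softmax(logits), targets))
```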

I'm trying to understand why NLL is defined the way it is, but I seem to be missing a piece of the puzzle.

From what I've googled, NLL is equivalent to cross-entropy; the only difference is in how people interpret each of them.
The former comes from the need to maximize some likelihood (maximum likelihood estimation, MLE), and the latter comes from information theory.

However, when I go to the Cross-Entropy page on Wikipedia, what I find is:

Wikipedia - Relation between cross-entropy and NLL

Question 1: Why do they raise the estimated outcome to the power of (N * the training outcome)?

Question 2: Why do they divide the whole formula by N? Is it just for convenience, like taking the log of the likelihood?

This is what I've got so far:

Softmax function


Calculations

Thank you for your time, and excuse me if the question is too easy, but I just can't wrap my mind around it.
Math is not my forte, but I'm working on it 🙂

Best Answer

The Wikipedia page has left out a few steps. When they say "the likelihood of the training set", they just mean "the probability of the training set given some parameter values":

$$ L(\theta | x_1, ..., x_n) = p(x_1, ..., x_n | \theta) $$

The data points $x_k$ in the training set are assumed to be conditionally independent given $\theta$, so that

$$ p(x_1, ..., x_n | \theta) = \prod_{k=1}^n p(x_k | \theta) $$

If the training set were $[2,2,1]$, then the likelihood would be $p(2|\theta) p(2|\theta) p(1|\theta) = p(2|\theta)^2 p(1|\theta)$. This suggests a simplification:

$$ \prod_{k=1}^n p(x_k | \theta) = \prod_i p(i | \theta)^{n_i} $$
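As a quick numerical check of this grouping step, here is a small sketch using the toy training set $[2,2,1]$ and a made-up distribution `q` over three classes (the values of `q` are arbitrary, chosen only for illustration):

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])   # assumed values of p(i | theta) for classes 1, 2, 3
train = [2, 2, 1]               # toy training set of class labels

# Product over individual samples: p(2|theta) * p(2|theta) * p(1|theta)
per_sample = np.prod([q[x - 1] for x in train])

# Product over classes: p(i|theta)^{n_i}, with n_i the count of class i in the data
counts = np.array([train.count(i) for i in (1, 2, 3)])
per_class = np.prod(q ** counts)

print(per_sample, per_class)    # both equal 0.5 * 0.5 * 0.2 = 0.05
```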

where $n_i$ is the number of times $i$ occurs in the training data. The Wikipedia page defines $q_i = p(i | \theta)$ and $p_i = n_i/N$, so the likelihood is

$$ \prod_{k=1}^n q_{x_k} = \prod_i q_i^{n_i} = \prod_i q_i^{N p_i} $$

At the end, they take the logarithm of this likelihood and divide it by $N$:

$$ \frac{1}{N} \log \prod_i q_i^{N p_i} = \frac{1}{N} \sum_i N p_i \log q_i = \sum_i p_i \log q_i $$

Negating this gives the cross-entropy $H(p, q) = -\sum_i p_i \log q_i$ between the empirical distribution $p$ and the model distribution $q$. Dividing by $N$ just removes the dependence on the size of the training set; it does not change where the minimum is.
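A final sanity check, with the same made-up numbers as above, that $-\frac{1}{N}$ times the log-likelihood is exactly $-\sum_i p_i \log q_i$:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])            # assumed model distribution q_i
train = np.array([2, 2, 1])              # toy training set
N = len(train)

counts = np.array([(train == i).sum() for i in (1, 2, 3)])
p = counts / N                            # empirical distribution p_i = n_i / N

log_likelihood = np.sum(counts * np.log(q))   # log of prod_i q_i^{n_i}
nll_over_N = -log_likelihood / N              # NLL divided by N
cross_entropy = -np.sum(p * np.log(q))        # -sum_i p_i log q_i

print(nll_over_N, cross_entropy)          # the two values agree
```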