Without loss of generality, let's assume binary classification. For simplicity and illustration, let's assume that there is only one feature and it takes only one value (that is, it's a constant). Since effectively there are no covariates, there is only one parameter to estimate here, the probability $p$ of the positive class. Given data, which effectively consists of only $y$ in this case, learning or training becomes identical to the problem of parameter estimation for binomial distribution, for which any standard statistics textbook would contain some derivation like this:
Likelihood $\displaystyle L(p) = {n \choose k} p^k (1-p)^{n-k}$, where $k$ is the number of positives among $n$ observations. Take the log and set the derivative to zero, $\displaystyle
\frac{\partial \log L(p)}{\partial p}=0$. Solving it gives $\hat{p} = \frac{k}{n}$.
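As a quick numerical check (a minimal sketch with hypothetical numbers $n=100$, $k=37$), maximizing the binomial log-likelihood on a grid recovers $\hat p = k/n$:

```python
import numpy as np

# Hypothetical data: n = 100 trials, k = 37 positives.
n, k = 100, 37

# Binomial log-likelihood in p, dropping the constant n-choose-k term,
# which does not depend on p and so does not affect the maximizer.
p = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_lik = k * np.log(p) + (n - k) * np.log(1 - p)

p_hat = p[np.argmax(log_lik)]
print(p_hat)  # approximately k / n = 0.37
```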
Now, allow $n \rightarrow \infty$, and let the true but unknown probability of the positive class be $\pi$. The likelihood becomes $\displaystyle L(p) = {n \choose n\pi} p^{n\pi} (1-p)^{n(1-\pi)}$. Repeating the same steps as above, which is legitimate despite $n \rightarrow \infty$, gives $\hat{p} = \pi$. Perfect calibration, achieved through likelihood maximization.
Allowing for covariates means one has to model $p(y=1|x)$ (say, as $1/\left(1+\exp{(-(\beta_0+\beta^T x))}\right)$, as in logistic regression). That model can be imperfect, in which case the likelihood is maximized only over a particular functional family (say, the logistic one above; this is also known as a "parametric restriction" in some contexts), not over all possible families, hence giving potentially miscalibrated probabilities. If we allowed all possible functional families for $p(y=1|x)$, the likelihood would be truly maximized and perfect calibration achieved, in the same way as the toy example above shows. But that would understandably require infinite data, since it amounts to a parametric model with infinitely many parameters.
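To make the misspecification point concrete, here is a small sketch (the data and the step-shaped true $p(y=1|x)$ are invented for illustration). Fitting a logistic model by gradient ascent on the Bernoulli log-likelihood gives probabilities that are calibrated on average, but not conditionally on $x$, because the true conditional probability is not in the logistic family:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the true p(y=1|x) depends on |x|, so it is NOT in the
# logistic family -- the model below is misspecified by construction.
n = 5000
x = rng.normal(size=n)
p_true = np.where(np.abs(x) > 1, 0.9, 0.1)
y = rng.binomial(1, p_true)

# Maximize the Bernoulli likelihood over the logistic family only,
# via plain gradient ascent on the average log-likelihood.
b0, b1, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(b0 + b1 * x)))
    b0 += lr * np.mean(y - p_hat)        # score equation for the intercept
    b1 += lr * np.mean((y - p_hat) * x)  # score equation for the slope

# On average the MLE is calibrated: mean prediction matches the base rate.
print(p_hat.mean(), y.mean())

# But conditionally it is not: for |x| > 1 the true probability is 0.9,
# while the fitted logistic model predicts roughly the base rate,
# because no monotone function of x can capture a |x|-shaped truth.
mask = np.abs(x) > 1
print(p_hat[mask].mean(), y[mask].mean())
```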
I think your intuition misses the fact that the likelihood depends on the true probabilities through the exponents above, so maximizing it brings the estimated probabilities close to the true ones, as opposed to close to 1.
Best Answer
TL;DR: to compare different neural network configurations that could have generated the observed training dataset
Before talking about neural networks (NNs) let's interpret what the likelihood tells us.
For any dataset $\mathcal D=\{x_n\}_n$ of observed variables with an associated probability distribution $p$, i.e., $x_n\sim p({\boldsymbol \theta})$ (where $\boldsymbol\theta$ are the parameters that govern our distribution $p$), we can think of the likelihood of $p$ as a value that tells us how plausible it is, in terms of the observed data, that a given vector $\boldsymbol{\hat\theta}$ generated the observed variables.
Mathematically, we define the likelihood function as the joint probability of the observed variables given the parameters of the distribution, and we usually assume that the observed variables are independent and identically distributed (i.i.d), i.e.,
$$ p(x_1, \ldots, x_N\vert{\boldsymbol\theta}) = \prod_{n=1}^Np(x_n\vert\boldsymbol\theta) $$
Before moving on with NNs, let us consider an example to illustrate this point.
Consider the following dataset of Bernoulli-distributed random variables: `D = [1, 1, 1, 1, 0, 1, 1, 1]`. Since the probability mass function (pmf) of a Bernoulli-distributed random variable is given by
$$ p(x\vert\mu) = \mu^x (1 - \mu)^{1-x}, $$
we can see that our likelihood will take the form
$$ \mathcal L = \prod_{n=1}^8 p(x_n\vert\mu) = \mu^{\sum_{n=1}^8x_n} (1 - \mu)^{8 - \sum_{n=1}^8x_n} $$
Coming back to your analogy of what the likelihood tells us, we would like to find the parameter $\mu$ that is most likely to have generated the dataset `D`. The bigger $\mathcal L$ is, the more likely it is that a given $\mu$ generated `D`. Since we know that $\mu$ can only take values from $0$ to $1$, we can expect $\mathcal L$ to be bigger for values of $\mu$ around $1$ than for values around $0$, and indeed this is the case, as plotting $\mathcal L$ against $\mu$ confirms.
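As a sanity check, one can evaluate $\mathcal L$ on a grid of $\mu$ values (a minimal numpy sketch using the dataset above):

```python
import numpy as np

D = np.array([1, 1, 1, 1, 0, 1, 1, 1])

def likelihood(mu):
    # Product of the Bernoulli pmf over all 8 observations.
    return np.prod(mu ** D * (1 - mu) ** (1 - D))

mus = np.linspace(0.001, 0.999, 999)
L = np.array([likelihood(m) for m in mus])

mu_best = mus[np.argmax(L)]
print(mu_best)  # close to the sample mean 7/8 = 0.875
```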
The likelihood is a product of many probabilities, and in many cases such products tend quickly to zero, which makes them computationally impractical to work with. To counter this, we work with the log-likelihood instead, so that we sum over log-probabilities. Since the log-transformation of a function does not change where its maximum is in its domain, maximizing the log-likelihood yields the same value of $\mu$ as maximizing the likelihood itself.
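A quick illustration of the underflow problem (with hypothetical probabilities): the raw product of many probabilities collapses to zero in floating point, while the sum of log-probabilities stays perfectly usable:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 hypothetical i.i.d. probabilities, each strictly between 0 and 1.
probs = rng.uniform(0.1, 0.9, size=100_000)

print(np.prod(probs))         # underflows to 0.0
print(np.sum(np.log(probs)))  # finite log-likelihood, safe to compare
```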
If we wish to find the value of $\mu$ that maximizes the likelihood, we can simply take the derivative of the log-likelihood with respect to $\mu$, set it to zero and solve (hint: the answer will be the sample mean). Furthermore, since
$$ \hat x = \arg\max_x f(x) = \arg\min_x -f(x) $$
we can see that finding where the maximum log-likelihood occurs is the same as finding where the minimum negative log-likelihood occurs (the smaller the negative log-likelihood, the better the fit).
Introducing neural networks
What does a NN have to do with any of this? Consider a dataset $\mathcal{D}=\{(t_n, {\bf x}_n)\vert t_n\in\mathbb{R}, {\bf x}_n\in\mathbb{R}^M\}_{n=1}^N$ and a neural network architecture of the form $f: \mathbb{R}^M\to\mathbb{R}$ (it can be a feed-forward NN, a convolutional NN, an RNN, or any neural network architecture that returns a single real number). We assume that
$$ t_n\vert{\bf x}_n \sim \mathcal{N}(t_n \vert f({\bf x}_n, W), \sigma^2) $$
where $W$ are the weights of the neural network.
If we assume that we have trained $K$ configurations of NNs, we can look for the best (in terms of the training dataset) NN by comparing their likelihoods or, equivalently, their negative log-likelihoods, as these tell us how likely each configuration is to have generated the observed training dataset.
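As a minimal sketch of this comparison (with invented data and, in place of trained NNs, two hand-picked linear stand-ins for $f({\bf x}, W)$): the configuration with the smaller Gaussian negative log-likelihood is the one more likely to have generated the data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical regression data generated by t = 2x + Gaussian noise.
x = rng.normal(size=200)
t = 2.0 * x + rng.normal(scale=0.5, size=200)

def gaussian_nll(t, mean, sigma=0.5):
    # Negative log of prod_n N(t_n | mean_n, sigma^2).
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + (t - mean) ** 2 / (2 * sigma ** 2))

# Two "trained configurations": stand-ins for f(x, W) with different weights.
f1 = lambda x: 2.0 * x  # close to the generating process
f2 = lambda x: 0.5 * x  # poor fit

print(gaussian_nll(t, f1(x)))  # smaller NLL: the better configuration
print(gaussian_nll(t, f2(x)))  # larger NLL: less likely to have generated t
```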