Solved – Negative loss while training Gaussian Mixture Density Networks

Tags: density function, gaussian mixture distribution, machine learning, maximum likelihood, neural networks

In classification problems, the usual negative log-likelihood loss function

$L(\theta)=\sum_{i=1}^N -\log(p(y_i|x_i,\theta))$

is always non-negative, since the $y_i$'s are discrete random variables and therefore $p(y_i|x_i,\theta) \leq 1$ for all $\theta$. In particular, $L(\theta)$ is bounded below by zero, so its infimum is finite.

However, if the $y_i$'s are continuous random variables (as is the case for Gaussian Mixtures), $p(y|x,\theta)$ is a pdf, so it may take values greater than 1 (as long as it integrates to 1 w.r.t. $y$). Therefore, if we make no assumptions on the family of pdfs $p(y|x,\theta)$, it is not obvious to me that $L(\theta)$ is necessarily lower bounded. Is it?
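To make the concern concrete, here is a tiny numeric check in Python (my own illustration, with made-up numbers): a Gaussian with a small standard deviation has a density value above 1 at its mean, so the corresponding negative log-likelihood term for a single sample is already negative.

```python
import numpy as np

# Density of a Gaussian with a small standard deviation, evaluated at its mean.
# The peak value is 1 / (sigma * sqrt(2*pi)), which exceeds 1 once sigma < 0.4 or so.
sigma = 0.1
y, mu = 0.0, 0.0
pdf = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(pdf)           # ~3.99: a perfectly valid density value greater than 1
print(-np.log(pdf))  # ~-1.38: this sample's contribution to L(theta) is negative
```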

Moreover, let us suppose that $p(y|x,\theta)$ is a Gaussian Mixture whose parameters are provided by a Mixture Density Network, that is:

$p(y|x,\theta)=\sum_{k=1}^K \alpha_k(x)\,\mathcal{N}(y\,|\,\mu_k(x),\sigma^2_k(x))$,

where $K$ is the number of components in the mixture and $\alpha_k(x)$, $\mu_k(x)$ and $\sigma^2_k(x)$ are the outputs of a neural network that takes $x$ as input (I index the components by $k$ to avoid clashing with the data index $i$ above). In this case, $\theta$ collects all the parameters of the neural network to be optimized.

(For more details on Mixture Density Networks, see Bishop, 1994, available at https://publications.aston.ac.uk/373/1/NCRG_94_004.pdf.)
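In case it helps, here is a minimal NumPy sketch of the corresponding loss (the function name `mixture_nll` and the array names/shapes are my own assumptions for illustration; `alpha`, `mu`, `sigma` stand for the network outputs $\alpha_k(x)$, $\mu_k(x)$, $\sigma_k(x)$ evaluated on a batch of inputs):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_nll(y, alpha, mu, sigma):
    """Negative log-likelihood of a 1-D Gaussian mixture with per-sample parameters.

    y     : (N,)   targets
    alpha : (N, K) mixing coefficients produced by the network (each row sums to 1)
    mu    : (N, K) component means produced by the network
    sigma : (N, K) component standard deviations produced by the network (positive)
    """
    # log N(y | mu_k, sigma_k^2), kept in log space for numerical stability
    log_norm = (
        -0.5 * ((y[:, None] - mu) / sigma) ** 2
        - np.log(sigma)
        - 0.5 * np.log(2 * np.pi)
    )
    # log p(y|x) = logsumexp_k( log alpha_k + log N_k ), one value per sample
    log_p = logsumexp(np.log(alpha) + log_norm, axis=1)
    return -np.sum(log_p)
```

Summed over the data this matches $L(\theta)$ above; the log-sum-exp is just the standard numerically stable way of evaluating the log of a mixture.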

Is $L(\theta)$ lower bounded for this particular family of densities $p(y|x,\theta)$? How can one prove (or at least build a strong intuition for) whether it is?

These questions came to my mind while I was trying to reproduce the toy example in Section 5 of the paper mentioned above. During training, my loss keeps decreasing and takes negative values after about 1000 iterations. Since I do not have a lower bound for the loss, I have no idea whether the final loss is reasonable. Moreover, I am not able to reproduce the results in the paper (for instance, for $x=0.5$ I get a unimodal distribution instead of a trimodal one).

Thank you in advance.

Best Answer

The loss is not lower bounded, and the problem is in fact ill-posed: one of the mixture components can collapse onto a single data point, with its mean $\mu_k(x)$ matching the target $y_i$ while its variance $\sigma^2_k(x)$ shrinks towards zero. That component's density at $y_i$ then blows up, which drives the loss to arbitrarily negative values (i.e. to $-\infty$).
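To see this concretely, here is a small numerical sketch of the collapse (a toy construction of my own, not taken from the paper): pin one component's mean exactly on a data point and shrink its standard deviation, and the total negative log-likelihood diverges to $-\infty$, even though every other data point is still covered by the second component.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
y = rng.normal(size=50)              # toy targets

def nll(y, alpha, mu, sigma):
    # same mixture negative log-likelihood as above, with shared parameters
    log_norm = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return -np.sum(logsumexp(np.log(alpha) + log_norm, axis=1))

alpha = np.array([0.5, 0.5])         # two components with fixed weights
mu = np.array([y[0], 0.0])           # first component centred exactly on y[0]

for s in [1.0, 1e-2, 1e-4, 1e-8]:
    sigma = np.array([s, 1.0])       # shrink only the collapsed component
    print(s, nll(y, alpha, mu, sigma))   # the loss keeps decreasing without bound
```

Only the term for the pinned data point blows up (its density grows like $1/\sigma$), while the remaining terms stay bounded, so the sum can be made as negative as you like.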
