Solved – Training with a max-margin ranking loss converges to a useless solution

loss-functions, machine-learning, neural-networks, ranking

I want to train a neural network to predict real-valued scores $s(x)$ for items $x$.

I use a dataset of examples $(x_p,X_n)$ where $x_p$ is a "positive" item that should be assigned a higher score than all "negative" items in $X_n$. I train the network with stochastic gradient descent minimizing a max-margin loss:

$L(x_p, X_n) = \max\bigl(0,\; 1 + \max_{x \in X_n} s(x) - s(x_p)\bigr)$

The function $s$ is a multi-layer perceptron with parameters $\theta$.
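For concreteness, here is a minimal sketch of this loss in PyTorch (the question doesn't name a framework, so PyTorch, the batch layout, and the margin default of 1 are my assumptions):

```python
import torch

def max_margin_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss on the hardest negative: max(0, margin + max_neg - pos).

    pos_score:  shape (batch,)        -- s(x_p)
    neg_scores: shape (batch, n_neg)  -- s(x) for x in X_n
    """
    hardest_neg = neg_scores.max(dim=1).values
    return torch.clamp(margin + hardest_neg - pos_score, min=0.0).mean()
```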

The problem is that this approach often converges to a degenerate solution: starting from a loss above 1, gradient descent shrinks the weights $\theta$ further and further. As a result, the scores $s(x)$ of both the positive and the negative instances become smaller, and so does their difference, and with it the loss. The loss then converges to 1 and the model predicts almost the same score for every item.

Instead, I would want the network to learn to separate positive and negative items by predicting scores such that $s(x_p) > 1 + s(x_n)$ for all $x_n \in X_n$.

Playing around with my data, I made the following observations:

  • When training on a very small subset, it works as expected.
  • The more data I add, the more often training fails and converges to the degenerate solution.
  • Whether that happens or not heavily depends on how the weights were initialized (i.e. on the random seed).

Is there anything I'm missing here that would make the training more stable? Some sort of regularization? Or is there an issue with my setup?

Best Answer

I'm here a year too late, but try adding a softmax layer:

$$ \operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

This constrains the output scores to sum to 1, which pushes the model to "choose" a ranking rather than letting all scores collapse to the same value.
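One way to implement this suggestion, again as a PyTorch sketch (keeping the hinge loss on the normalized scores and putting the positive in column 0 are assumptions on my part):

```python
import torch
import torch.nn.functional as F

def margin_loss_with_softmax(pos_score, neg_scores, margin=1.0):
    """Same hinge loss as above, but on softmax-normalized scores.

    Normalizing each group {x_p} ∪ X_n to sum to 1 removes the degenerate
    direction in which all raw scores shrink together.
    """
    # Column 0 holds the positive item's score; shape (batch, 1 + n_neg).
    scores = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
    probs = F.softmax(scores, dim=1)
    hardest_neg = probs[:, 1:].max(dim=1).values
    # Note: with probabilities in [0, 1], a margin of 1 is only reachable
    # in the limit; a smaller margin may be more practical here.
    return torch.clamp(margin + hardest_neg - probs[:, 0], min=0.0).mean()
```

Since the normalized scores lie in $[0, 1]$, a margin of 1 is only attainable in the limit, so a smaller margin may be worth trying with this variant.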
