Solved – How to combine a particular loss function into a DNN with another loss/objective

Tags: deep-learning, loss-functions, machine-learning, neural-networks

I'm training a fully-connected layer with a custom loss $L_1$ to perform dimensionality reduction. This loss is a function of the weights and pushes the network toward a solution with certain properties. In the current setting, I know how to train it with SGD as a stand-alone layer: given this loss, I know how to update the weights so as to minimize objective $L_1$. However, in the end I want to use this layer for classification, so I would like to combine $L_1$ with a cross-entropy loss in an end-to-end fashion. Is it possible to do that?

While a simple fully-connected layer will reduce dimensionality, it won't necessarily find a solution whose weights have the properties driven by this loss. So I'd like to combine both: a fully-connected layer that also finds a solution with said properties by minimizing this loss.

An alternative interpretation: since loss $L_1$ wants to learn a set of weights for a feed-forward layer and cross-entropy $L_2$ measures classification performance, maybe $L_1$ should be a regularization term (as it is a function of the weights only), applied only to the weights in the layers of interest. But if that's the case and I include it as a regularization term, how do I handle the specific update rule for the weights in this layer (obtained by differentiating the loss with respect to the weights)?

Questions:

  1. Is it possible to combine these two losses in an end-to-end fashion such that the gradients affect each other?
  2. Ideally, this dimensionality reduction layer would precede a prediction layer, thus forming some sort of skip-connection. How would that affect my update rule when using such "regularization"? Should I backprop the gradients with respect to $L_1$ through the prediction layer as well?

Example

Consider the PCA loss, which may be defined as:

$$\max_W{\mathbb{E}_\mathbf{x}[\text{tr}{(W^T\mathbf{xx}^TW)}]}.$$

This objective yields the following update rule for the weights $W$:

$$W^{(t+1)} = W^{(t)} + \eta \mathbf{x}\mathbf{x}^T W^{(t)}.$$
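As a sanity check, the update rule above can be sketched in a few lines of numpy. This is a minimal illustration, not the asker's actual setup: the dimensions, learning rate, and the QR re-orthonormalization step (added because the objective is unbounded, so $W$ would otherwise grow without limit) are all my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, eta = 8, 3, 0.01          # hypothetical sizes and step size
W = rng.standard_normal((d_in, d_out)) * 0.1

for _ in range(200):
    x = rng.standard_normal((d_in, 1))  # one sample drawn per step
    # ascent direction: d/dW tr(W^T x x^T W) = 2 x x^T W (the factor 2 folds into eta)
    W = W + eta * (x @ x.T) @ W
    # re-orthonormalize the columns so the unbounded objective doesn't blow up
    W, _ = np.linalg.qr(W)
```

Without the QR step, some form of normalization (or a variant such as Oja's rule) is needed to keep the weights bounded.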

As a stand-alone layer, updates based on the gradients $\frac{\partial L_1}{\partial W}$ are applied directly to the fully-connected layer, as shown in this horrible diagram I tried to make below:

Stand-alone linear layer with the PCA objective

OK, but now I'd like to insert it before the prediction layer, for instance:

Stacked linear layer and the PCA objective. Is it possible to fit them together?

How do I update the weights $W$ now? And should I backpropagate the gradients $\frac{\partial L_1}{\partial W}$ to the regular fully-connected layer that precedes the inserted layer (depicted as the first rectangle in the diagram)? Erratum: it just occurred to me that the second diagram is wrong; the gradients w.r.t. loss $L_1$ should be used to update the second layer itself (which already receives the gradient from the last fully-connected layer). So I guess it should backpropagate only the regular gradient to the first layer, shouldn't it?

Maybe I'm looking at this from the wrong angle, and this layer should only be the last layer. But then how would I combine it with cross-entropy?

Best Answer

I think your network architecture is better understood as:

[Diagram: the dimension-reduction layer feeds both the PCA loss and the subsequent prediction layer.]

Gradients should flow back to the dimension-reduction layer from both the PCA loss and the cross-entropy loss on the predictions. If you're using a graph-based system like TensorFlow, Theano, or Caffe, this happens automatically. Note that your update for the PCA loss should actually be $$W_{t+1}=W_t + \eta \tilde{\Sigma}W_{t},$$ where $$\tilde{\Sigma}=\mathbb{E}[\mathbf{xx}^T]\approx\frac{1}{b}\sum^{b-1}_{i=0}\mathbf{x}_i\mathbf{x}_i^T,$$ which is just gradient ascent with the expectation taken over a mini-batch. Alternatively, you could precompute a covariance matrix from all (or a large portion of the very redundant) available data and treat $\tilde{\Sigma}$ as a constant. So there is no need for a custom update.
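Concretely, one combined step might look like the numpy sketch below: the gradient on $W$ is the sum of the cross-entropy gradient (backpropagated through a hypothetical prediction layer $V$) and the PCA ascent term built from the mini-batch covariance. All names, shapes, and the weighting $\lambda$ are illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d_in, d_red, n_cls = 32, 10, 4, 3   # hypothetical batch and layer sizes
lam, eta = 0.1, 0.05                   # hypothetical PCA weight and step size

X = rng.standard_normal((b, d_in))     # mini-batch of inputs
y = rng.integers(0, n_cls, size=b)     # class labels

W = rng.standard_normal((d_in, d_red)) * 0.1   # dimension-reduction layer
V = rng.standard_normal((d_red, n_cls)) * 0.1  # prediction layer (assumed linear)

# forward pass and softmax
H = X @ W
logits = H @ V
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
P[np.arange(b), y] -= 1.0              # dCE/dlogits = softmax - one_hot

# cross-entropy gradients via the chain rule
gV = H.T @ P / b
gW_ce = X.T @ (P @ V.T) / b

# PCA term: mini-batch covariance estimate, gradient-*ascent* direction
Sigma = X.T @ X / b
gW_pca = Sigma @ W

# one combined step: descend on cross-entropy, ascend on the PCA objective
V -= eta * gV
W -= eta * (gW_ce - lam * gW_pca)
```

In a framework with autodiff, the subtraction of `lam * gW_pca` is exactly what you get by adding $-\lambda\,\text{tr}(W^T\tilde{\Sigma}W)$ to the total loss; no manual bookkeeping is required.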

However, let's assume you do want a custom update expressed as a custom gradient (for example, a reduced-variance estimate of the gradient). Then you would only need to overwrite the gradient flowing back from the PCA loss into the reduction layer, and most graph-based frameworks make this fairly easy to do. If your custom update cannot be understood as an alternative gradient (although most can), then you could use ADMM, alternating between optimizing the PCA loss with the custom update and applying gradient descent on the cross-entropy loss. However, since your PCA loss is unbounded from above (unless there are other constraints on the magnitude of $W$), I'd recommend against this.
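The alternating scheme mentioned above can be sketched as follows. This is a simplified block-alternation in the spirit of the answer, not full ADMM (which would also carry dual variables); the QR step addresses the unboundedness concern, and all sizes and step counts are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
b, d_in, d_red, n_cls = 32, 10, 4, 3   # hypothetical sizes
eta = 0.05

X = rng.standard_normal((b, d_in))
y = rng.integers(0, n_cls, size=b)
W = rng.standard_normal((d_in, d_red)) * 0.1
V = rng.standard_normal((d_red, n_cls)) * 0.1
Sigma = X.T @ X / b                    # fixed mini-batch covariance estimate

for step in range(100):
    if step % 2 == 0:
        # custom PCA update, with QR to keep the magnitude of W bounded
        W, _ = np.linalg.qr(W + eta * Sigma @ W)
    else:
        # one gradient-descent step on the cross-entropy loss
        H = X @ W
        logits = H @ V
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        P[np.arange(b), y] -= 1.0
        V -= eta * H.T @ P / b
        W -= eta * X.T @ (P @ V.T) / b
```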