I would suggest having some held-out data that forms a validation dataset. You can compute your loss function on the validation dataset periodically (it would probably be too expensive after each iteration, so after each epoch seems to make sense) and stop training once the validation loss has stabilized.
If you're in a purely online setting where you don't have any data ahead of time, I suppose you could compute the average loss over the examples in each epoch and wait for that average to converge, but of course that risks overfitting...
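To make the held-out idea concrete, here is a minimal sketch of per-epoch validation with early stopping; the toy data, learning rate, and patience threshold are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 with noise, split into train and held-out validation.
X = rng.uniform(-1, 1, size=500)
y = 2 * X + 1 + rng.normal(0, 0.1, size=500)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

w, b = 0.0, 0.0
lr = 0.05
best_val, patience, bad_epochs = np.inf, 3, 0

for epoch in range(200):
    # One SGD pass over the training data.
    for xi, yi in zip(X_train, y_train):
        err = (w * xi + b) - yi
        w -= lr * err * xi
        b -= lr * err

    # Validation loss once per epoch (cheaper than once per iteration).
    val_loss = np.mean((w * X_val + b - y_val) ** 2)
    if val_loss < best_val - 1e-6:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:  # validation loss has stabilized
        break
```

The patience counter is one common way to decide the loss has "stabilized": stop once it has failed to improve for a few consecutive epochs.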
It looks like Vowpal Wabbit (an online learning system that implements SGD amongst other optimizers) uses a technique called Progressive Cross-Validation which is similar to using a holdout set, but allows you to use more data while training the model, see:
http://hunch.net/~jl/projects/prediction_bounds/progressive_validation/coltfinal.pdf
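The core of progressive validation is simple: score each example *before* training on it, then fold that loss into a running average, so every example serves as both test and training data. A sketch with a toy learner (not VW's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stream of examples: y = 3x with noise.
xs = rng.uniform(-1, 1, size=2000)
ys = 3 * xs + rng.normal(0, 0.1, size=2000)

w = 0.0
lr = 0.1
progressive_sq_loss = 0.0

for x, y in zip(xs, ys):
    pred = w * x                         # predict BEFORE training on this example
    progressive_sq_loss += (pred - y) ** 2
    w -= lr * (pred - y) * x             # then do the SGD update on it

avg_progressive_loss = progressive_sq_loss / len(xs)
```

Every example contributes to the loss estimate while it was still unseen, which is what makes the bound in the paper above work out.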
Vowpal Wabbit has an interesting approach: it computes error metrics after each example, but prints the diagnostics with an exponential backoff, so at first you get frequent updates (to help diagnose early problems) and then less frequent updates as time goes on.
Vowpal Wabbit displays two error metrics, the average progressive loss overall, and the average progressive loss since the last time the diagnostics were printed. You can read some details about the VW diagnostics below:
https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information
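A simplified sketch of that exponential-backoff reporting scheme (real VW output has more columns and different formatting; this only shows the two averages described above):

```python
def run_stream(losses):
    """Report (example count, average loss overall, average loss since
    last report) at example counts 1, 2, 4, 8, ..."""
    total, since_last, n_since = 0.0, 0.0, 0
    next_print = 1
    rows = []
    for t, loss in enumerate(losses, start=1):
        total += loss
        since_last += loss
        n_since += 1
        if t == next_print:
            rows.append((t, total / t, since_last / n_since))
            since_last, n_since = 0.0, 0
            next_print *= 2  # exponential backoff
    return rows

rows = run_stream([1.0] * 100)
for t, avg, avg_since in rows:
    print(t, avg, avg_since)
```

Comparing the "since last" column against the overall column is a quick way to see whether the model is still improving on recent examples.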
This is a good question that unfortunately went unanswered for a long time. It seems a partial answer was given here just a couple of months after you asked, which basically argues that correlation is useful when the outputs are very noisy, and MSE otherwise. First of all, let's look at the formulas for both.
$$MSE(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2$$
$$R(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \bar{y})(\hat{y_i} - \hat{\bar{y}})}
{\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2} \sqrt{\sum ^n _{i=1}(\hat{y_i} - \hat{\bar{y}})^2}} $$
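Plugging a small made-up example into both formulas (and checking the hand-rolled $R$ against `np.corrcoef`):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

# MSE: mean squared distance from the diagonal y = y_hat.
mse = np.mean((y - y_hat) ** 2)

# Pearson correlation, written out term by term from the formula above.
yc = y - y.mean()
pc = y_hat - y_hat.mean()
r = (yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum())
```
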
A few things to note: in the case of linear regression we know that $\hat{\bar{y}} = \bar{y}$ because the regressor is unbiased, so the formula simplifies a little, but in general we can't make this assumption about ML algorithms. More broadly, it is interesting to think of the scatter plot in $\mathbb{R}^2$ of the pairs $\{ (y_i, \hat{y_i}) \}$: correlation tells us how strong the linear relationship between the two is in this plot, while MSE tells us how far from the diagonal each point is. Looking at the counterexamples on the Wikipedia page for correlation, you can see there are many relationships between the two that won't be captured.
I think correlation generally tells us similar things to $R^2$, but with directionality, so correlation is somewhat more descriptive in that case. Under another interpretation, $R^2$ doesn't rely on a linearity assumption and merely tells us the percentage of the variation in $y$ that is explained by our model; in other words, it compares the model's predictions to the naive baseline of guessing the mean for every point. The formula for $R^2$ is:
$$R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i-\hat{y_i})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}$$
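On the same sort of made-up example, this $R^2$ and the squared correlation $R^2 = R \cdot R$ need not agree for an arbitrary predictor (they coincide for OLS with an intercept, which is why the two uses of the symbol get conflated):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

# R^2 as 1 - SS_res / SS_tot, straight from the formula above.
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot            # 0.85

# Squared Pearson correlation: a different number for this predictor.
r_squared = np.corrcoef(y, y_hat)[0, 1] ** 2
```
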
So how does $R$ compare to $R^2$? It turns out that $R$ is more robust to rescaling of one of its inputs: $R^2$ is homogeneous of degree 0 only in both inputs jointly, whereas $R$ is homogeneous of degree 0 in either input separately. It's a little less clear what this implies for machine learning, but it might mean that the model class of $\hat{y}$ can be a bit more flexible under correlation. That said, under some additional assumptions the two measures are equal, and you can read more about it here: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/.
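The homogeneity claim is easy to check numerically: rescaling $\hat{y}$ alone leaves $R$ untouched but can destroy $R^2$ (toy numbers again):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

def r2(y, p):
    return 1 - ((y - p) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# R is homogeneous of degree 0 in y_hat alone: scaling changes nothing.
r_before = np.corrcoef(y, y_hat)[0, 1]
r_after = np.corrcoef(y, 10 * y_hat)[0, 1]

# R^2 is not: the rescaled predictions are far from y, so it collapses.
r2_before = r2(y, y_hat)
r2_after = r2(y, 10 * y_hat)
```
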
Finally, a last important thing to note is that $R$ and $R^2$ do not measure goodness of fit around the $y=\hat{y}$ line. It is possible (although odd) for a predictor to be linearly shifted away from the $y=\hat{y}$ line and still have an $R^2$ of one, but its predictions would still be "bad". In this case MSE would be more informative than $R^2$ for finding the better predictor, though this is perhaps more of a pathological case than a real issue with using $R$ and $R^2$ as metrics.
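A concrete instance of that pathological case, with made-up numbers: a predictor shifted off the diagonal has perfect correlation but a large MSE.

```python
import numpy as np

y = np.arange(10.0)
y_hat = y + 5.0        # perfectly correlated with y, but shifted off y = y_hat

r = np.corrcoef(y, y_hat)[0, 1]    # correlation of 1
mse = np.mean((y - y_hat) ** 2)    # 25.0: MSE exposes the constant shift
```
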
Best Answer
I think "decoupled function" is meant in the sense of this preprint: https://arxiv.org/pdf/1805.08479.pdf where they discuss multivariate functions being "decoupled" into the sum of univariate functions.
If you have $N$ data points, your loss function should be written as the sum (or average) of $N$ univariate functions, which are functions only of the predicted value at that point. For example, sum of squared errors is decoupled, since the total loss is just the sum of the loss at each point.
This would be violated if you have some loss function which considers explicit dependence between the observations. The condition is sort of like assuming your data are IID - it's the IID assumption which allows us to work with decoupled loss functions.
For example, suppose you tried to build a model with this loss function: $$\sum_i\sum_j (\hat{y_i} - \hat{y_j})^2,$$ which minimizes the differences between all predictions and says nothing else about what they should be. In order to calculate the gradient of the loss with respect to $\hat{y_i}$, you'd also need to know the value of every $\hat{y_j}$. So per-example backprop isn't possible.
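To see the coupling concretely, compare the per-example gradients on some made-up predictions; for the coupled loss above, the analytic derivative works out to $\partial L/\partial \hat{y}_k = 4(n\hat{y}_k - \sum_j \hat{y}_j)$:

```python
import numpy as np

y = np.array([1.5, 2.5, 3.5])
y_hat = np.array([1.0, 2.0, 4.0])
n = len(y_hat)

# Decoupled loss (sum of squared errors): the gradient for example i
# depends only on example i.
grad_decoupled = 2 * (y_hat - y)

# Coupled loss sum_i sum_j (y_hat[i] - y_hat[j])**2: the gradient for
# example k drags in every other prediction via the sum term.
grad_coupled = 4 * (n * y_hat - y_hat.sum())
```

Each entry of `grad_decoupled` can be computed from a single example, while every entry of `grad_coupled` requires the whole batch of predictions at once.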