Gradient of Cost Functions with Respect to Activations of Last Layer of Network (aka the predictions $\hat{y}$)

gradient descent, neural networks, partial derivative

Problem Statement:

I am coding the backpropagation algorithm (Rumelhart et al., 1986 and Ch. 2: How the Backpropagation Algorithm Works) from scratch. I need the partial derivatives of the mean squared error (MSE) and binary cross entropy (BCE) cost functions with respect to each prediction $\hat{y}_{j}$ for a single sample ${x \in \mathbb{R}^{n_x}}$. I believe I made a mistake with the BCE derivative, but I am not sure where. Please help! Note the following about the target $y$ and the prediction $\hat{y}$:

(1) $y \in \mathbb{R}^{n_y}$ for regression problems (i.e., ${y=\langle y_{1}, y_{2}, \dots, y_{n_{y}}\rangle}$). The vector whose elements are the partial derivatives of the MSE cost function with respect to each element of $\hat{y}$ is therefore the gradient of the MSE with respect to $\hat{y}$, denoted ${\nabla_{\hat{y}} C_{\text{MSE}} = \langle \frac{\partial C_{\text{MSE}}}{\partial \hat{y}_{1}}, \frac{\partial C_{\text{MSE}}}{\partial \hat{y}_{2}}, \dots, \frac{\partial C_{\text{MSE}}}{\partial \hat{y}_{n_y}} \rangle}$.

(2) $y \in \{0, 1\}$ for a binary classification problem (i.e., $y = 0$ or $y=1$),

(3) $\hat{y} \in (0, 1)$ for a binary classification problem (i.e., the open interval that is the range of the sigmoid activation function ${\sigma(z) = \frac{1}{1 + e^{-z}}}$)

\begin{aligned}
C_{\text{MSE Sample}} & = \frac{1}{n_{y}} \sum_{j=1}^{n_{y}}{ (\hat{y}_j - y_j)^{2} } \\
C_{\text{BCE Sample}} & = -1(y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})) \\
% C_{\text{BCE Batch}} & = \frac{1}{m} \sum_{j=1}^{m} C_{\text{BCE Sample}}
\end{aligned}
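To make the two per-sample definitions concrete, here is a minimal NumPy sketch that simply evaluates them. The function names `mse_sample` and `bce_sample` are illustrative only and do not appear in the original post; the BCE version assumes a scalar target in $\{0, 1\}$ and a prediction strictly between 0 and 1.

```python
import numpy as np


def mse_sample(y_hat, y):
    """Per-sample MSE: mean of squared differences over the n_y outputs."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)


def bce_sample(y_hat, y):
    """Per-sample BCE for a target y in {0, 1} and a prediction in (0, 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# Three-output regression example.
mse_value = mse_sample([0.12, 0.35, 0.61], [0.11, 0.01, 0.59])
print(mse_value)

# Binary classification: only one of the two log terms is active per target.
print(bce_sample(0.9, 1.0))  # equals -ln(0.9)
print(bce_sample(0.9, 0.0))  # equals -ln(1 - 0.9)
```

Note that for a binary target exactly one of the two logarithm terms contributes, which is why the BCE derivative reduces to a single fraction for each value of $y$.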

What I Have Done:

\begin{aligned}
\frac{\partial C_{\text{MSE Sample}}} {\partial \hat{y}_{j}} & = \frac{1} {n_{y}}
\left(\frac{\partial} {\partial \hat{y}_{j}} (\hat{y}_{j} - y_{j})^{2}\right) \\
& = \frac{2}{n_{y}} (\hat{y}_{j} - y_{j})
\end{aligned}

and

\begin{aligned}
\frac{\partial C_{\text{BCE Sample}}}{\partial \hat{y}} & = \frac{\partial}{\partial \hat{y}} \left[ -1(y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})) \right] \\
& = -1 \left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right)
\end{aligned}
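One way to sanity-check the two closed-form derivatives above, independently of any framework, is to compare them with central finite differences. The following is a minimal sketch under that assumption; the helper `central_diff` and the other function names are hypothetical and not part of the original post.

```python
import numpy as np


def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)


def mse_grad(y_hat, y):
    # Closed form: (2 / n_y) * (y_hat_j - y_j)
    return (2.0 / y.shape[-1]) * (y_hat - y)


def bce(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def bce_grad(y_hat, y):
    # Closed form: -(y / y_hat - (1 - y) / (1 - y_hat))
    return -(y / y_hat - (1 - y) / (1 - y_hat))


def central_diff(f, x, eps=1e-6):
    """Numerical gradient of a scalar-valued f at a 1-D point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


# MSE: vector-valued prediction, scalar cost.
y_hat = np.array([0.12, 0.35, 0.61])
y = np.array([0.11, 0.01, 0.59])
mse_ok = np.allclose(mse_grad(y_hat, y),
                     central_diff(lambda v: mse(v, y), y_hat))

# BCE: a single prediction and a single binary target.
p = np.array([0.3])
t = np.array([1.0])
bce_ok = np.allclose(bce_grad(p, t),
                     central_diff(lambda v: bce(v, t)[0], p))

print(mse_ok, bce_ok)
```

If both checks pass, the per-sample formulas themselves are consistent with the cost definitions, which suggests looking for any discrepancy in how the costs are aggregated rather than in the calculus.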

I think I must be making some mistake in the BCE derivation or implementation because it does not match up with the TensorFlow reference implementation shown below.

Validating My Implementation:

My MSE implementation gives the correct results; however, my BCE implementation does not give the correct results. The reference implementation uses TensorFlow. I have included my implementation below with examples.

Note:

My implementation of the derivative of the binary cross entropy cost function does not include the numerical-stability and optimization measures that the TensorFlow implementation applies. I have not investigated those differences closely enough to say whether they explain the mismatch.

```python
# Imports
from typing import Optional

import numpy as np

from tensorflow.keras.losses import binary_crossentropy, mean_squared_error
import tensorflow as tf


class MeanSquaredError:
    """Mean squared error cost (loss) function.

    The predictions are the activations of the network. The order of
    arguments in `gradient` follows the "four fundamental equations
    behind backpropagation" in Nielsen (Ch. 2, 2015); the gradient
    computed here is the one that appears in equation BP1a of that
    resource.
    """

    def gradient(
            self, inputs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
        """Computes the gradient with respect to all activations (preds).

        This is a vectorized function: each element of the result is
        the partial derivative of the cost with respect to the j^{th}
        activation of the l^{th} (here, output) layer.

        MSE = (1/dims) * sum_j (pred_j - true_j)^{2}
        dMSE/dPred_j = (2/dims) * (pred_j - true_j)

        Args:
            inputs: Targets, predictions vectors.

        Returns:
            Vector (gradient) of values.
        """

        targets, predictions = inputs
        return (2 / targets.shape[-1]) * (predictions - targets)

    def __call__(
            self,
            inputs: tuple[np.ndarray, np.ndarray],
            axis: Optional[int] = None) -> np.float64:
        """Compute cost given inputs.

        Args:
            inputs: Targets and predictions vectors.

        Return:
            Scalar cost.
        """

        targets, predictions = inputs
        return np.mean(np.square(targets - predictions), axis=axis)

class BinaryCrossEntropy:
    """Binary cross entropy loss (cost) function."""

    def __init__(self, from_logits: bool = False):
        """Initializes sigmoid function for binary cross entropy.

        Args:
            from_logits: True if predictions are raw logits; False if
                they are probabilities (i.e., a sigmoid activation was
                already applied). Defaults to False.
        """

        self.sigmoid = lambda t: 1 / (1 + np.exp(-t))
        self.from_logits = from_logits

    def gradient(self, inputs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
        """Derivative with respect to each activation (prediction).

        If `from_logits` is True, the predictions are first passed
        through the sigmoid; the derivative returned is still taken
        with respect to the sigmoid output, not the logits.

        Args:
            inputs: Targets, predictions vectors. Unless `from_logits`
                is True, predictions must already be probabilities.

        Returns:
            Vector (gradient) of values.
        """
        targets, predictions = inputs

        if self.from_logits:
            predictions = self.sigmoid(predictions)

        return -1 * ((targets/predictions) - ((1-targets) / (1-predictions)))

    def __call__(self,
                 inputs: tuple[np.ndarray, np.ndarray],
                 axis: Optional[int] = None) -> np.ndarray:
        """Compute cost given inputs.

        Args:
            inputs: Targets and predictions vectors. 
                Assumes predictions are not from logits.

        Return:
            Scalar cost.
        """

        targets, predictions = inputs

        if self.from_logits:
            predictions = self.sigmoid(predictions)

        return -1 * np.mean(targets * np.log(predictions) + (1 - targets) * np.log(1 - predictions), axis=axis)

# MSE gradient example

# Instantiate cost function objects
mse = MeanSquaredError()
bce = BinaryCrossEntropy()
sigmoid = lambda t: 1 / (1 + np.exp(-t))

# Validate MSE grad
a_L_np = np.array([0.12, 0.35, 0.61])
y_true_np = np.array([0.11, 0.01, 0.59])
a_L_tf = tf.Variable(a_L_np)
y_true_tf = tf.constant(y_true_np)

# tf gradient context
with tf.GradientTape() as tape:
    C = mean_squared_error(y_true=y_true_tf, y_pred=a_L_tf)

dC_daL = tape.gradient(C, a_L_tf)
print('-- MSE -- ')
print('tf gradient tape:', dC_daL.numpy())

# My implementation
dC_daL_np = mse.gradient((y_true_np, a_L_np))
print('mse.gradient:', dC_daL_np)
print()

#### BCE ####
y_true = tf.constant(np.array([0., 1., 0., 0.]))
y_pred_logits = np.array([-18.6, 0.51, 2.94, -12.8])
y_pred_proba = tf.Variable(sigmoid(y_pred_logits))

with tf.GradientTape() as tape:
    C = binary_crossentropy(y_true, y_pred_proba)

print('-- BCE --')
dC_dProbaActivation = tape.gradient(C, y_pred_proba)
print('tf gradient tape:', dC_dProbaActivation.numpy())
dC_dProbaActivationMine = bce.gradient((y_true, y_pred_proba))
print('bce.gradient:', dC_dProbaActivationMine.numpy())

#### Outputs ####
# -- MSE -- 
# tf gradient tape: [0.00666667 0.22666667 0.01333333]
# mse.gradient: [0.00666667 0.22666667 0.01333333]

# -- BCE --
# tf gradient tape: [ 0.         -0.40012383  4.97895166  0.25000067]
# bce.gradient: [ 1.00000001 -1.60049558 19.91584631  1.00000276]
```

Best Answer

The loss function (binary cross-entropy) for one example should be $$ \phi = - [t \log(\hat{y})+ (1-t) \log(1-\hat{y})] $$ where $t=0$ or $t=1$. The gradient reads $$ \frac{\partial \phi}{\partial \hat{y}} = - \left[ \frac{t}{\hat{y}} - \frac{1-t}{1-\hat{y}} \right] \tag{1} $$ I suspect the difference comes from the fact that TensorFlow applies a mean operation and your implementation does not, so the TensorFlow gradient is $1/N$ times the quantity in (1). Note: I also suspect a copy error for the first term. In Matlab I get `0.250000002089597 -0.400123894703066 4.978961578063771 0.250000690193143`.
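The $1/N$ scaling described in this answer can be sketched in NumPy as follows. The function `bce_mean_grad` is a hypothetical helper, not code from the question or from TensorFlow. The optional `eps` argument is an assumption on my part: as far as I know, Keras clips predictions to $[\epsilon, 1-\epsilon]$ (with a default $\epsilon$ of `1e-7`) before taking logs, which would give a zero gradient for any prediction outside that range and could account for the `0.` in the first TensorFlow entry. Treat that part as a plausible explanation rather than a confirmed one.

```python
import numpy as np


def bce_mean_grad(targets, predictions, eps=None):
    """Gradient of the mean BCE (over the last axis) w.r.t. each prediction.

    eps is an assumed clipping threshold: predictions outside
    [eps, 1 - eps] are treated as clipped and get a zero gradient.
    """
    n = targets.shape[-1]
    per_element = -(targets / predictions
                    - (1 - targets) / (1 - predictions))
    grad = per_element / n  # the mean contributes the 1/N factor
    if eps is not None:
        inside = (predictions >= eps) & (predictions <= 1 - eps)
        grad = np.where(inside, grad, 0.0)
    return grad


def sigmoid(t):
    return 1 / (1 + np.exp(-t))


# Same inputs as the BCE example in the question.
y_true = np.array([0., 1., 0., 0.])
y_pred = sigmoid(np.array([-18.6, 0.51, 2.94, -12.8]))

unclipped = bce_mean_grad(y_true, y_pred)
clipped = bce_mean_grad(y_true, y_pred, eps=1e-7)
print('mean-scaled gradient:', unclipped)
print('with assumed clipping:', clipped)
```

Without clipping, the result should agree with the Matlab values quoted above; with the assumed clipping, the first entry becomes zero because $\sigma(-18.6)$ is smaller than `1e-7`. Small remaining differences from the TensorFlow output are expected, since this sketch does not reproduce TensorFlow's internal arithmetic.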
