Problem Statement:
I am coding the backpropagation algorithm (Rumelhart et al., 1986, and Ch. 2: How the Backpropagation Algorithm Works) from scratch. I need the partial derivatives of the mean squared error (MSE) and binary cross entropy (BCE) cost functions with respect to each prediction $\hat{y}_{j}$ for a single sample ${x \in \mathbb{R}^{n_x}}$. I believe I made a mistake with the BCE derivative, but I am not sure where. Please help! Note the following about the target $y$ and the prediction $\hat{y}$:
(1) $y \in \mathbb{R}^{n_y}$ for regression problems (i.e., ${y=\langle y_{1}, y_{2}, \dots, y_{n_{y}}\rangle}$). The vector whose elements are the partial derivatives of the MSE cost function with respect to each element of $\hat{y}$ is the gradient of the MSE with respect to $\hat{y}$, denoted ${\nabla_{\hat{y}} C_{\text{MSE}} = \langle
\frac{\partial C_{\text{MSE}}} {\partial \hat{y}_{1}},
\frac{\partial C_{\text{MSE}}} {\partial \hat{y}_{2}},
\dots,
\frac{\partial C_{\text{MSE}}} {\partial \hat{y}_{n_y}}
\rangle}$.
(2) $y \in \{0, 1\}$ for a binary classification problem (i.e., $y = 0$ or $y=1$),
(3) $\hat{y} \in (0, 1)$ for a binary classification problem (i.e., the open interval produced by the sigmoid activation function ${\sigma(z) = \frac{1}{1 + e^{-z}}}$).
$$
\begin{aligned}
C_{\text{MSE Sample}} & = \frac{1}{n_{y}} \sum_{j=1}^{n_{y}}{ (\hat{y}_j - y_j)^{2} } \\
C_{\text{BCE Sample}} & = -1(y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})) \\
% C_{\text{BCE Batch}} & = \frac{1}{m} \sum_{j=1}^{m} C_{\text{BCE Sample}}
\end{aligned}
$$
What I Have Done:
$$
\begin{aligned}
\frac{\partial C_{\text{MSE Sample}}} {\partial \hat{y}_{j}} & = \frac{1} {n_{y}}
\left(\frac{\partial} {\partial \hat{y}_{j}} (\hat{y}_{j} - y_{j})^{2}\right) \\
& = \frac{2}{n_{y}} (\hat{y}_{j} - y_{j})
\end{aligned}
$$
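One way to sanity-check the MSE result without TensorFlow is a central finite difference. The sketch below is only an illustration: the helper names (`mse_sample`, `mse_grad`, `central_difference`) and the step size `h` are assumptions made here, not part of the original implementation. The vectors are the ones used in the MSE validation example.

```python
import numpy as np


def mse_sample(y_hat, y):
    """Per-sample MSE: mean of the squared errors over the n_y outputs."""
    return np.mean((y_hat - y) ** 2)


def mse_grad(y_hat, y):
    """Analytic gradient from the derivation: (2 / n_y) * (y_hat - y)."""
    return (2.0 / y.shape[-1]) * (y_hat - y)


def central_difference(f, x, h=1e-6):
    """Approximates df/dx_j by perturbing one element of x at a time."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


y_hat = np.array([0.12, 0.35, 0.61])
y = np.array([0.11, 0.01, 0.59])

analytic = mse_grad(y_hat, y)
numeric = central_difference(lambda v: mse_sample(v, y), y_hat)

print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric))
```

If the two vectors agree, the $\frac{2}{n_y}$ factor is being applied consistently with the $\frac{1}{n_y}$ in the cost.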
and
$$
\begin{aligned}
\frac{\partial C_{\text{BCE Sample}}}{\partial \hat{y}} & = -1 \frac{\partial}{\partial \hat{y}} \left(y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})\right) \\
& = -1 \left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right)
\end{aligned}
$$
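The per-sample BCE derivative can be checked the same way, independently of any library. This is a minimal sketch under assumptions: the function names and the test points in `cases` are made up for illustration, and the predictions are kept away from 0 and 1 so the logarithms stay finite.

```python
import numpy as np


def bce_sample(y_hat, y):
    """Per-sample BCE: -(y ln(y_hat) + (1 - y) ln(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def bce_grad(y_hat, y):
    """Analytic derivative: -(y / y_hat - (1 - y) / (1 - y_hat))."""
    return -(y / y_hat - (1 - y) / (1 - y_hat))


def central_difference(y_hat, y, h=1e-6):
    """Numerical derivative of the per-sample BCE with respect to y_hat."""
    return (bce_sample(y_hat + h, y) - bce_sample(y_hat - h, y)) / (2 * h)


# (target, prediction) pairs chosen only for illustration.
cases = [(1.0, 0.62), (0.0, 0.95), (0.0, 0.05)]

for y, y_hat in cases:
    analytic = bce_grad(y_hat, y)
    numeric = central_difference(y_hat, y)
    print(y, y_hat, analytic, np.isclose(analytic, numeric))
```

Agreement here would suggest that the per-sample derivative itself is fine, and that any mismatch with a reference implementation comes from somewhere else, such as a reduction over samples.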
I think I must be making some mistake in the BCE derivation or implementation because it does not match up with the TensorFlow reference implementation shown below.
Validating My Implementation:
My MSE implementation gives the correct results; however, my BCE implementation does not give the correct results. The reference implementation uses TensorFlow. I have included my implementation below with examples.
Note:
My implementation of the derivative of the binary cross entropy cost function does not account for the numerical optimizations that the TensorFlow implementation makes. However, I have not investigated those differences enough to say how much they contribute to the mismatch.
```python
# Imports
from typing import Optional

import numpy as np
import tensorflow as tf
from tensorflow.keras.losses import binary_crossentropy, mean_squared_error
class MeanSquaredError:
    """Mean squared error cost (loss) function.

    The predictions are the activations of the network. The order of
    arguments in the `derivative` was based on
    `Four fundamental equations behind backpropagation` from
    Nielsen (Ch. 2, 2015). Similarly, the gradient calculation in BP1a
    is described in the same resource.
    """

    def gradient(
            self, inputs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
        """Computes the gradient with respect to all activations (preds).

        This is a vectorized function and is called on each element of
        an activation vector in order to compute the partial derivative
        of the cost with respect to the j^{th} activation for the
        l^{th} layer.

        MSE = (1/dims) * (pred - true)^{2}
        dMSE/dPred = (2/dims) * (pred - true)

        Args:
            inputs: Targets, predictions vectors.

        Returns:
            Vector (gradient) of values.
        """
        targets, predictions = inputs
        return (2 / targets.shape[-1]) * (predictions - targets)

    def __call__(
            self,
            inputs: tuple[np.ndarray, np.ndarray],
            axis: Optional[int] = None) -> np.float64:
        """Compute cost given inputs.

        Args:
            inputs: Targets and predictions vectors.

        Returns:
            Scalar cost.
        """
        targets, predictions = inputs
        return np.mean(np.square(targets - predictions), axis=axis)
class BinaryCrossEntropy:
    """Binary cross entropy loss (cost) function."""

    def __init__(self, from_logits: bool = False):
        """Initializes sigmoid function for binary cross entropy.

        Args:
            from_logits: True if predictions are logits, False if they
                are probabilities (i.e., the sigmoid activation function
                has already been applied). Assumes not from logits.
        """
        self.sigmoid = lambda t: 1 / (1 + np.exp(-t))
        self.from_logits = from_logits

    def gradient(self, inputs: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
        """Derivative with respect to a single activation (same as derivative).

        Should there be a from logits check here??

        Args:
            inputs: Targets, predictions vectors. Presumably, the
                predictions here also have to be probabilities.

        Returns:
            Vector (gradient) of values.
        """
        targets, predictions = inputs
        if self.from_logits:
            predictions = self.sigmoid(predictions)
        # Per-sample derivative; no averaging over samples is applied here.
        return -1 * ((targets / predictions) - ((1 - targets) / (1 - predictions)))

    def __call__(self,
                 inputs: tuple[np.ndarray, np.ndarray],
                 axis: Optional[int] = None) -> np.ndarray:
        """Compute cost given inputs.

        Args:
            inputs: Targets and predictions vectors.
                Assumes predictions are not from logits.

        Returns:
            Scalar cost.
        """
        targets, predictions = inputs
        if self.from_logits:
            predictions = self.sigmoid(predictions)
        return -1 * np.mean(
            targets * np.log(predictions)
            + (1 - targets) * np.log(1 - predictions),
            axis=axis)
# MSE gradient example
# Instantiate cost function objects
mse = MeanSquaredError()
bce = BinaryCrossEntropy()
sigmoid = lambda t: 1 / (1 + np.exp(-t))

# Validate MSE grad
a_L_np = np.array([0.12, 0.35, 0.61])
y_true_np = np.array([0.11, 0.01, 0.59])
a_L_tf = tf.Variable(a_L_np)
y_true_tf = tf.constant(y_true_np)

# tf gradient context
with tf.GradientTape() as tape:
    C = mean_squared_error(y_true=y_true_tf, y_pred=a_L_tf)
dC_daL = tape.gradient(C, a_L_tf)
print('-- MSE -- ')
print('tf gradient tape:', dC_daL.numpy())

# My implementation
dC_daL_np = mse.gradient((y_true_np, a_L_np))
print('mse.gradient:', dC_daL_np)
print()

#### BCE ####
y_true = tf.constant(np.array([0., 1., 0., 0.]))
y_pred_logits = np.array([-18.6, 0.51, 2.94, -12.8])
y_pred_proba = tf.Variable(sigmoid(y_pred_logits))
with tf.GradientTape() as tape:
    C = binary_crossentropy(y_true, y_pred_proba)
print('-- BCE --')
dC_dProbaActivation = tape.gradient(C, y_pred_proba)
print('tf gradient tape:', dC_dProbaActivation.numpy())
dC_dProbaActivationMine = bce.gradient((y_true, y_pred_proba))
print('bce.gradient:', dC_dProbaActivationMine.numpy())

#### Outputs ####
# -- MSE --
# tf gradient tape: [0.00666667 0.22666667 0.01333333]
# mse.gradient: [0.00666667 0.22666667 0.01333333]
# -- BCE --
# tf gradient tape: [ 0. -0.40012383 4.97895166 0.25000067]
# bce.gradient: [ 1.00000001 -1.60049558 19.91584631 1.00000276]
```
Best Answer
The loss function (binary cross-entropy) for one example should be $$ \phi = - [t \log(\hat{y})+ (1-t) \log(1-\hat{y})] $$ where $t=0$ or $t=1$. The gradient reads $$ \frac{\partial \phi}{\partial \hat{y}} = - \left[ \frac{t}{\hat{y}} - \frac{1-t}{1-\hat{y}} \right] \tag{1} $$ I suspect the difference comes from the fact that TensorFlow applies a mean over the $N$ examples and your implementation does not, so the TensorFlow gradient is $1/N$ times the quantity in (1); here $N = 4$. Note: I also suspect a copy error in the first entry of the TensorFlow output. Evaluating (1) divided by $N$ in Matlab gives
`0.250000002089597 -0.400123894703066 4.978961578063771 0.250000690193143`
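The $1/N$ explanation can be reproduced in NumPy without TensorFlow. The sketch below rests on assumptions. The names `bce_grad_per_sample` and `bce_grad_clipped_mean` are made up for illustration. The clipping scheme in the second function (clip predictions to $[\epsilon, 1-\epsilon]$ with $\epsilon = 10^{-7}$, then take $\log(\hat{y} + \epsilon)$) is a guess at what the Keras loss does internally, not something confirmed here. It is included because it offers a possible alternative reading of the first entry of the TensorFlow output: a prediction below $\epsilon$ would get a zero gradient through the clip.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_grad_per_sample(t, p):
    """Equation (1): derivative of one example's BCE w.r.t. its prediction."""
    return -(t / p - (1 - t) / (1 - p))


def bce_grad_clipped_mean(t, p, eps=1e-7):
    """Gradient of the mean BCE under an ASSUMED clip-then-log(p + eps) scheme."""
    # A clip has zero slope wherever the input falls outside [eps, 1 - eps].
    inside = (p >= eps) & (p <= 1 - eps)
    pc = np.clip(p, eps, 1 - eps)
    per_sample = -(t / (pc + eps) - (1 - t) / (1 - pc + eps))
    return inside * per_sample / t.size


t = np.array([0., 1., 0., 0.])
p = sigmoid(np.array([-18.6, 0.51, 2.94, -12.8]))

print("equation (1):      ", bce_grad_per_sample(t, p))
print("equation (1) / N:  ", bce_grad_per_sample(t, p) / t.size)
print("assumed clip, / N: ", bce_grad_clipped_mean(t, p))
```

The second line should agree with the Matlab values above. If the clipping assumption holds, the third line should land close to the gradient tape output in the question, including the zero in the first position.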