Solved – Gradient descent with Binary Cross-Entropy for single layer perceptron

cross entropy, gradient descent, neural networks, perceptron

I'm implementing a single layer perceptron for binary classification in Python. I'm using the binary cross-entropy loss function and gradient descent.

Gradient descent is not converging, so maybe I'm doing something wrong. Here is what I did:

We have a vector of weights $W = \begin{pmatrix} w_1, \dots, w_M \end{pmatrix}$, a matrix of $N$ samples $X = \begin{bmatrix} x_{1,1} & \dots & x_{N,1} \\ \vdots & \ddots & \vdots \\ x_{1,M} & \dots & x_{N,M} \end{bmatrix}$, with each column representing a sample, a sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ as the activation function, and a vector of $N$ targets $Y = \begin{pmatrix} y_1, \dots, y_N \end{pmatrix}$.

On the forward step we have $V = WX$ and the output is $Z = \sigma(V)$.
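For concreteness, here is a minimal toy setup of those shapes in NumPy (the sizes and random values are purely illustrative, not my real data):

import numpy as np

M, N = 3, 5                       # M features, N samples (illustrative sizes)
W = np.random.randn(1, M)         # weight row vector W, shape (1, M)
X = np.random.randn(M, N)         # each column of X is one sample, shape (M, N)
Y = np.random.randint(0, 2, N)    # binary targets, shape (N,)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

V = W @ X           # pre-activations, shape (1, N)
Z = sigmoid(V)      # predicted probabilities in (0, 1), shape (1, N)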

On the backpropagation step we update the weights as $W(n+1) = W(n) - \eta \frac{\partial L}{\partial W}$, where $L$ is the binary cross-entropy loss function: $L(Y, Z) = -\frac{1}{N}\sum_{k = 1}^{N} \left[ y_k \log(z_k) + (1 - y_k) \log(1 - z_k) \right]$. In matrix notation this function can be rewritten as $L(Y,Z) = -\frac{1}{N} \left( Y (\log(Z))^T + (1_N - Y) (\log(1_N - Z))^T \right)$.
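As a quick sanity check that the matrix form agrees with the element-wise sum, continuing the toy example above (throwaway code, not part of my class):

z = Z.flatten()     # treat the outputs as a vector of length N
ones = np.ones(N)

# Element-wise sum form of the binary cross-entropy
L_sum = -(1 / N) * np.sum(Y * np.log(z) + (1 - Y) * np.log(1 - z))

# Matrix form: Y (log Z)^T + (1_N - Y)(log(1_N - Z))^T
L_mat = -(1 / N) * (Y @ np.log(z).T + (ones - Y) @ np.log(ones - z).T)

assert np.isclose(L_sum, L_mat)   # both give the same scalar loss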

I think everything is correct so far, so maybe I messed up the derivatives.

Applying the chain rule: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \frac{\partial Z}{\partial V} \frac{\partial V}{\partial W}$

$\frac{\partial V}{\partial W}$ is straightforward: $\frac{\partial V}{\partial W} = X$.

$\frac{\partial Z}{\partial V} = \sigma '(V)$

$\frac{\partial L}{\partial Z} = -\frac{1}{N} \left( Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y)\begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix}^T \right)$

So finally, $W(n+1) = W(n) - \eta \left( -\frac{1}{N} \left( Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y)\begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix}^T \right) \sigma '(V) X \right)$
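Before showing the code, here is a small finite-difference check on the toy example above that can be used to test whether this gradient expression is consistent (purely illustrative, not part of my class):

def bce(W_, X_, Y_):
    z_ = sigmoid(W_ @ X_).flatten()
    return -(1 / Y_.size) * np.sum(Y_ * np.log(z_) + (1 - Y_) * np.log(1 - z_))

# Analytic gradient following the chain rule above
z = sigmoid(W @ X).flatten()                      # shape (N,)
dL_dz = -(1 / N) * (Y / z + (1 - Y) / (z - 1))    # dL/dZ, shape (N,)
grad_analytic = (dL_dz * z * (1 - z)) @ X.T       # (dL/dZ * sigma'(V)) X^T, shape (M,)

# Numerical gradient via central differences
eps = 1e-5
grad_numeric = np.zeros(M)
for i in range(M):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, i] += eps
    W_minus[0, i] -= eps
    grad_numeric[i] = (bce(W_plus, X, Y) - bce(W_minus, X, Y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric))   # True if the derivation is consistent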

Are those derivatives correct?

Here's my code:

# Activation function
def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))

# Activation function derivative
def d_sigmoid(self, x): # derivative of sigmoid
    return self.sigmoid(x)*(1 - self.sigmoid(x))

# Binary Cross-Entropy loss function
def loss(self, Z, Y):
    Z = Z.flatten() # reshape matrix to vector
    return np.asscalar((-1/len(Z)) * (np.dot(Y, np.log(Z + (1.e-10))) + np.dot((1 - Y), np.log(1 - Z + (1.e-10))))) 

# Gradient of loss function
def gradient_loss(self, Z, Y):
    Z = Z.flatten() # reshape matrix to vector
    dL = np.dot(Y, 1/Z) + np.dot((1 - Y), -1/(1 - Z))
    return np.asscalar((-1/len(Z)) * dL)

# Forward step
def forward(self, X):
    V = self.W @ X
    Z = self.sigmoid(V)
    print("Z {}:".format(Z))
    return V,Z

# Train neural network
def train(self, X, Y):
    n = 0
    print("Epoch {}:".format(n))
    V, Z = self.forward(X)
    print("loss: {}".format(self.loss(Z,Y)))
    # Perform a gradient descent algorithm
    while self.loss(Z, Y) > 0.1 and n < 10000:
        n = n + 1
        W_new = self.W - self.rate * self.gradient_loss(Z, Y) * self.d_sigmoid(V) @ X.transpose()
        self.W = W_new
        print(self.W.shape)

        print("Epoch {}:".format(n))
        V, Z = self.forward(X)
        print("loss: {}".format(self.loss(Z,Y)))

Best Answer

You have an error in the cross-entropy differentiation. Taking $N=1$ for simplicity, it should be: $$\frac{\partial L}{\partial z}=-y\frac{1}{z}-(1-y)\frac{1}{z-1}$$ But in your formulation there is a $1$ (i.e. $1_N$) before $\frac{1}{z-1}$, which doesn't belong there. Also, be careful with the sigmoid derivative, which you didn't write explicitly: $\sigma'(v)=\sigma(v)(1-\sigma(v))$.
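As a side note, once you plug in $\sigma'(v)=\sigma(v)(1-\sigma(v))=z(1-z)$, the product $\frac{\partial L}{\partial z}\,\sigma'(v)$ simplifies to $\frac{1}{N}(z-y)$, so the whole gradient becomes $\frac{1}{N}(Z-Y)X^T$. A minimal sketch of the resulting vectorized update, assuming $W$ has shape $(1, M)$, $X$ has shape $(M, N)$ with one sample per column, and $Y$ has shape $(N,)$ (illustrative names, not the poster's exact class):

import numpy as np

def train_step(W, X, Y, rate):
    Z = 1 / (1 + np.exp(-(W @ X)))              # forward pass, shape (1, N)
    dL_dV = (Z - Y.reshape(1, -1)) / Y.size     # dL/dZ * sigma'(V) collapses to (Z - Y) / N
    grad = dL_dV @ X.T                          # gradient w.r.t. W, shape (1, M)
    return W - rate * grad                      # gradient descent update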
