Solved – How to apply Cross Entropy on Rectified Linear Units

deep learning, entropy, machine learning, neural networks

I am currently getting started with Machine Learning. However, I am having trouble deriving the formula, and I cannot understand how to apply the Cross Entropy (CE) to Rectified Linear Units (ReLU).

I have also tried searching online, but most results either cover the topic in only a few sentences or use the sigmoid as the example (maybe this is too obvious for them?).

The best I could find is this website, which teaches how to apply the CE error function to sigmoid units. So I tried learning from there and deriving my own version. Here is how I started.

Given

cross entropy
$$C = -\frac{1}{n} \sum_x [y\ln{a} + (1-y)\ln{(1-a)}]$$

activation function
$$a=\sigma(z)$$

weighted sum input
$$z = \sum_j w_j x_j + b$$

where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output.
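To make this concrete for myself, here is a tiny NumPy sketch of the setup above (the data, weights, and variable names are purely illustrative, not from the tutorial):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    # C = -1/n * sum_x [ y*ln(a) + (1-y)*ln(1-a) ]
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# toy data: 4 training examples, 3 features (values are made up)
X = np.array([[0.2, 0.7, 0.1],
              [0.9, 0.4, 0.3],
              [0.1, 0.8, 0.6],
              [0.5, 0.2, 0.9]])
y = np.array([0.0, 1.0, 1.0, 0.0])

w = np.zeros(3)      # weights
b = 0.0              # bias

z = X @ w + b        # weighted sum input: z = sum_j w_j x_j + b
a = sigmoid(z)       # activation: a = sigma(z)
print(cross_entropy(a, y))   # ~0.693 (= ln 2) for all-zero weights
```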

In this case, my activation function will be
$$\sigma(z) = \max(0,z)$$

Computing the derivative of CE with respect to a weight:
$$\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x\bigg(\frac{y}{\sigma(z)} - \frac{(1-y)}{1-\sigma(z)}\bigg)\sigma'(z)x_j$$

OK, the derivative of ReLU is (note that it is not differentiable at 0, so I use 0 there instead):
$$\sigma'(z) = \left\{
\begin{array}{lr}
1 & : z > 0\\
0 & : z \leq 0
\end{array}
\right.
$$
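Here is the same ReLU and the 0-at-0 derivative convention as a small NumPy sketch, just so I have something to check against:

```python
import numpy as np

def relu(z):
    # sigma(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_prime(z):
    # 1 for z > 0, 0 for z <= 0 (using 0 at the non-differentiable point z = 0)
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(relu_prime(z))  # [0. 0. 1.]
```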

When $z > 0$, we have $\sigma(z) = z$ and $\sigma'(z) = 1$, so the term inside the sum becomes
$$\frac{y}{z} - \frac{1-y}{1-z} = \frac{y-z}{z(1-z)},$$

which is undefined at $z = 1$.

When $z \leq 0$, $\sigma(z) = 0$, and the same expression gives
$$\frac{y-0}{0(1-0)},$$

which is undefined as well.
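A quick numerical check (with made-up numbers) shows the same breakdown: as soon as the ReLU output is 0 (or reaches 1), the expression divides by zero:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return np.where(z > 0, 1.0, 0.0)

# a single training example with made-up values
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, 0.3])
b = 0.0
y = 1.0

z = w @ x + b   # 0.45
a = relu(z)     # 0.45

# the per-example term of dC/dw_j: (y/a - (1-y)/(1-a)) * sigma'(z) * x_j
grad = -(y / a - (1 - y) / (1 - a)) * relu_prime(z) * x
print(grad)     # finite here, but...

z = -0.5        # any non-positive z gives a = 0
a = relu(z)
grad = -(y / a - (1 - y) / (1 - a)) * relu_prime(z) * x
print(grad)     # division by zero -> nan/inf, exactly the problem above
```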

Can someone shed some light?

Best Answer

I think the short answer here is that it's not a good idea to use ReLU activations on the output layer in combination with a cross-entropy loss. Read on for details!

The cross-entropy is a "cost" function that attempts to compute the difference between two probability distribution functions. If your neural network's output does not fit the criteria for representing a probability distribution function, then the cross-entropy is going to work erratically.

What are these criteria? Traditionally, you want each of the categories in your distribution to be represented using a probability value, such that

  • each probability value is between 0 and 1
  • the sum of all probability values equals 1.

Most often when using a cross-entropy loss in a neural network context, the output layer of the network is activated using a softmax (or the logistic sigmoid, which is a special case of the softmax for just two classes) $$ s(\vec{z})_j = \frac{\exp(z_j)}{\sum_i\exp(z_i)} $$ which forces the output of the network to satisfy these two representation criteria. In particular, the softmax ensures that each output of the network is restricted to the open interval (0, 1), which in turn ensures that you don't get undefined mathematical quantities like $\log(0)$ or $\frac{1}{1-z}$ for $z=1$.
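For concreteness, here is a minimal NumPy sketch of the softmax (not tied to any particular framework) that you can use to check both criteria:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is mathematically unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, -1.0, 0.5])   # arbitrary real-valued network outputs
s = softmax(z)

print(s)          # every entry lies strictly between 0 and 1
print(s.sum())    # the entries sum to 1
```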

Using a ReLU output activation function with a cross-entropy loss is problematic because the ReLU activation does not generate values that can, in general, be interpreted as probabilities, whereas the cross-entropy requires its inputs to be interpreted as probabilities.
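A small sketch with made-up numbers shows the failure directly: ReLU outputs need not sum to 1, can exceed 1, and can be exactly 0, which sends the cross-entropy to infinity:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

z = np.array([2.0, -1.0, 0.5])   # raw output-layer values (made up)
a = relu(z)                      # [2.  0.  0.5]
y = np.array([0.0, 1.0, 0.0])    # one-hot target

print(a.sum())                   # 2.5 -- the outputs do not sum to 1
print(a > 1.0)                   # some "probabilities" exceed 1
print(-np.sum(y * np.log(a)))    # log(0) -> the cross-entropy is inf (with a warning)
```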