Solved – Does ReLU have an advantage over sigmoid activation with cross-entropy as the error function

deep learning, neural networks

I'm learning CNNs and I see that http://neuralnetworksanddeeplearning.com/chap3.html#what_does_the_cross-entropy_mean_where_does_it_come_from
introduces the cross-entropy cost function $C=-\frac{1}{n}\sum_x{[y\ln a + (1-y)\ln (1-a)]}$, which can be used as the error function if we adopt the sigmoid function as the activation. This is beneficial because the gradient is no longer proportional to the derivative of the sigmoid, which helps prevent the vanishing gradient problem.
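For reference, that cancellation can be made explicit for a single output neuron and a single training example (notation as in the linked chapter, with $a = \sigma(z)$):

$\frac{\partial C}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)}, \qquad \sigma'(z) = a(1-a), \qquad \frac{\partial C}{\partial z} = \frac{\partial C}{\partial a}\,\sigma'(z) = a - y.$

The $\sigma'(z)$ factor cancels; with the quadratic cost one instead gets $\frac{\partial C}{\partial z} = (a-y)\,\sigma'(z)$, which becomes tiny when the sigmoid saturates.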

But I see that in image classification ReLU is used as the activation to prevent the vanishing gradient problem, as pointed out in:
What are the advantages of ReLU over sigmoid function in deep neural network?
and
http://neuralnetworksanddeeplearning.com/chap3.html#other_models_of_artificial_neuron

So is there anything wrong with my observation? What is the advantage of ReLU, given that the gradient in the sigmoid + cross-entropy model is not proportional to the derivative of the activation?

Thanks in advance!

Best Answer

Cross-entropy + ReLU does not seem to be a good idea. Check that answer.

The cross-entropy is a "cost" function that attempts to compute the difference between two probability distribution functions.

(...)

Traditionally, you want each of the categories in your distribution to be represented using a probability value, such that

  • each probability value is between 0 and 1
  • the sum of all probability values equals 1

This is not the case with ReLU.
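A minimal NumPy sketch (hypothetical logits and a one-hot target, not taken from the linked answer) of why this matters: ReLU outputs are not confined to [0, 1] and do not sum to 1, so plugging them into the cross-entropy formula is not meaningful, while a softmax (or a sigmoid in the binary case) yields a proper distribution.

    import numpy as np

    # Hypothetical raw outputs (logits) of a final layer for 3 classes,
    # and a one-hot target vector.
    z = np.array([2.0, -1.0, 0.5])
    y = np.array([1.0, 0.0, 0.0])

    # ReLU outputs: non-negative, but not bounded by 1 and not summing to 1,
    # so -sum(y * log(a)) is not a well-defined cross-entropy; a zero output
    # for the true class would even make log(a) diverge.
    a_relu = np.maximum(z, 0.0)
    print(a_relu, a_relu.sum())          # [2.  0.  0.5], sums to 2.5

    # Softmax turns the same logits into a proper probability distribution,
    # so the cross-entropy loss is well defined.
    a_soft = np.exp(z) / np.exp(z).sum()
    print(a_soft, a_soft.sum())          # each value in (0, 1), sums to 1
    print(-np.sum(y * np.log(a_soft)))   # valid cross-entropy loss

In practice this is why ReLU is used in the hidden layers while the output layer still uses sigmoid or softmax paired with cross-entropy.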