Solved – Does neuron saturation matter only in the last layer, or in all layers?

neural networks

In Chapter 3 of the Neural Networks and Deep Learning book, the text repeatedly suggests that the learning slowdown caused by neuron saturation depends only on the activation function of the output layer and the cost function, for example:

"When should we use the cross-entropy instead of the quadratic cost?
In fact, the cross-entropy is nearly always the better choice,
provided the output neurons are sigmoid neurons."

and,

"This shows that if the output neurons are linear neurons then the
quadratic cost will not give rise to any problems with a learning
slowdown. In this case the quadratic cost is, in fact, an appropriate
cost function to use."

However, it's unclear to me why saturation is only a problem for the output layer. If there are previous hidden layers with sigmoid activations and a quadratic cost function, wouldn't the gradient for those previous layers also have a problem with saturation?
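For reference, the backpropagation equations from Chapter 2 of the same book (with $\delta^l$ denoting the error in layer $l$) are what make me think so:

$$\delta^L = \nabla_a C \odot \sigma'(z^L), \qquad \delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l).$$

The $\sigma'(z^l)$ factor for the hidden layers is there regardless of which cost function is used, so it looks to me like changing the cost can only remove the $\sigma'(z^L)$ factor at the output.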

Best Answer

It seems to me the author didn't mean that output-layer saturation is the only cause of learning slowdown.

Sigmoid activation functions in hidden layers can certainly cause vanishing gradients, but the slowdown caused by a saturated sigmoid in the output layer can be avoided by using the cross-entropy loss.
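To spell out why (my own sketch of the chapter's argument): for a single sigmoid output neuron with activation $a = \sigma(z)$ and target $y$, the quadratic cost $C = \tfrac{1}{2}(a - y)^2$ gives

$$\frac{\partial C}{\partial z} = (a - y)\,\sigma'(z),$$

which is tiny whenever the neuron saturates, because $\sigma'(z) \approx 0$. The cross-entropy cost $C = -[y \ln a + (1 - y)\ln(1 - a)]$ gives $\partial C/\partial a = (a - y)/[a(1 - a)]$, and since $\sigma'(z) = a(1 - a)$,

$$\frac{\partial C}{\partial z} = \frac{a - y}{a(1 - a)}\,\sigma'(z) = a - y,$$

so the $\sigma'(z)$ factor cancels and the output layer keeps learning even when saturated. (Similarly, a linear output neuron has $a = z$, so the quadratic cost already gives $\partial C/\partial z = a - y$ with no slowdown.) No such cancellation is available for the $\sigma'(z^l)$ factors in the hidden layers.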

I think the discussion of the output layer and saturation in that chapter is aimed at answering the question "When should we use the cross-entropy instead of the quadratic cost?" The answer is that a sigmoid output goes well with the cross-entropy loss, and a linear output goes well with the quadratic loss.
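As a small numerical illustration (my own code, not from the book) of how much that pairing matters, compare the two output-layer gradients for a badly saturated sigmoid neuron:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # A sigmoid output neuron that starts out badly saturated:
    # the target is y = 0, but z is large and positive, so a = sigmoid(z) is close to 1.
    z, y = 5.0, 0.0
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)

    # dC/dz for the quadratic cost C = (a - y)^2 / 2:
    grad_quadratic = (a - y) * sigma_prime

    # dC/dz for the cross-entropy cost C = -[y*ln(a) + (1 - y)*ln(1 - a)]:
    grad_cross_entropy = a - y

    print("sigma'(z):          ", sigma_prime)        # about 0.0066, nearly zero
    print("quadratic gradient: ", grad_quadratic)     # about 0.0066, learning crawls
    print("cross-entropy grad: ", grad_cross_entropy) # about 0.99, learning is fast

The quadratic-cost gradient is suppressed by the $\sigma'(z)$ factor, while the cross-entropy gradient stays proportional to the error $a - y$, which is exactly why the chapter pairs sigmoid outputs with the cross-entropy cost.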
