Solved – Help understanding Vanishing and Exploding Gradients

Tags: distributions, gradient descent, machine learning, neural networks

I am following deeplearning.ai's videos on Coursera and have a couple of questions regarding vanishing and exploding gradients. The following is one of Prof. Andrew Ng's lecture slides:

Slide 1

From what Prof. Ng says in the lectures, having large weights leads to large gradients and having small weights leads to small gradients, and both of these are troublesome for training. My questions are:

1.) Assuming my activation function is Sigmoid (say g(x)), having very large weights (and hence large inputs x) will lead to small gradients and slow learning through Gradient Descent.
If x is close to 0, then my gradients will be very large. Can this lead to overstepping in Gradient Descent (akin to a large Learning Rate)? I am guessing this is not the case, as the gradient is due to the nature of the cost function while the learning rate is a parameter we set.

2.) If my activation function is ReLU, the slope will be constant for any x>0 and 0 for x<0. So, where is the notion of exploding or vanishing gradients coming from for ReLU? I thought ReLUs solve the vanishing and exploding gradient problem.

Thanks in advance.

Best Answer

1.) Assuming my activation function is Sigmoid (say g(x)), having very large weights (and hence large inputs x) will lead to small gradients and slow learning through Gradient Descent. If x is close to 0, then my gradients will be very large. Can this lead to overstepping in Gradient Descent (akin to a large Learning Rate)? I am guessing this is not the case, as the gradient is due to the nature of the cost function while the learning rate is a parameter we set.

"If, x is close to 0, then my gradients will be very large" --> "very" is very relative here. It will be 0.25 if x=0.

2) If my activation function is ReLU, the slope will be constant for any x>0 and 0 for x<0. So, where is the notion of exploding or vanishing gradients coming from for ReLU? I thought ReLUs solve the vanishing and exploding gradient problem.

You can still lose the signal with ReLUs because the gradient is exactly 0 for negative inputs, so a unit whose pre-activation stays negative passes nothing back during backpropagation.
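A quick sketch of that point (again my own, not from the answer): the ReLU derivative is 1 for positive pre-activations and 0 otherwise, so negative pre-activations backpropagate no gradient at all.

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU(z) = max(0, z): 1 for z > 0, 0 for z <= 0
    # (the value at z = 0 is a convention; 0 is a common choice).
    return (z > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu_grad(pre_activations))   # [0. 0. 0. 1. 1.]
# Units stuck with negative pre-activations backpropagate exactly 0,
# so the signal through them is lost ("dying ReLU").
```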

EDIT:

Alright. I understand how the vanishing gradients occur. But how is it ever possible to get "exploding" gradients if my gradients are never >1?

Consider the following neural network sketch:

[Sketch of a small feed-forward network with weights $w_{i,j}^{(l)}$, hidden activations $a_i^{(l)}$, and output $o$]

For example, the partial derivative of the loss with respect to the weight $w_{1,1}^{(1)}$ doesn't just rely on the derivative of the ReLU function. That is, the partial derivative would be

$$\begin{aligned} \frac{\partial l}{\partial w_{1,1}^{(1)}} = {} & \frac{\partial l}{\partial o} \cdot \frac{\partial o}{\partial a_{1}^{(2)}} \cdot \frac{\partial a_{1}^{(2)}}{\partial a_{1}^{(1)}} \cdot \frac{\partial a_{1}^{(1)}}{\partial w_{1,1}^{(1)}} \\ & + \frac{\partial l}{\partial o} \cdot \frac{\partial o}{\partial a_{2}^{(2)}} \cdot \frac{\partial a_{2}^{(2)}}{\partial a_{1}^{(1)}} \cdot \frac{\partial a_{1}^{(1)}}{\partial w_{1,1}^{(1)}} \end{aligned}$$
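Each cross-layer factor such as $\frac{\partial a_1^{(2)}}{\partial a_1^{(1)}}$ is a weight times a ReLU derivative, so the backpropagated gradient is a sum of products of weights across layers, and the weights can be larger than 1. Here is a hedged numeric sketch of my own, assuming a deep stack of scalar ReLU layers whose pre-activations are all positive (so every ReLU derivative is 1) and which all share the same weight w:

```python
import numpy as np

depth = 50  # number of layers the gradient is backpropagated through

for w in (0.8, 1.0, 1.2):
    # With every ReLU derivative equal to 1, the gradient reaching the first
    # layer scales like the product of the layer weights: w ** depth.
    grad_factor = w ** depth
    print(f"w = {w}: gradient factor after {depth} layers = {grad_factor:.3e}")

# w = 0.8: gradient factor after 50 layers = 1.427e-05   (vanishing)
# w = 1.0: gradient factor after 50 layers = 1.000e+00
# w = 1.2: gradient factor after 50 layers = 9.100e+03   (exploding)
```

So even though each ReLU derivative is at most 1, the weight factors in the chain rule can make the overall gradient explode when the weights are large, or vanish when they are small.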