Solved – How do CNNs avoid the vanishing gradient problem

deep learning, gradient descent, machine learning, optimization

I have been reading a lot about convolutional neural networks and was wondering how they avoid the vanishing gradient problem. I know that deep belief networks stack single-level auto-encoders or other pre-trained shallow networks and can thus avoid this problem, but I don't know how it is avoided in CNNs.

According to Wikipedia:

"despite the above-mentioned "vanishing gradient problem," the
superior processing power of GPUs makes plain back-propagation
feasible for deep feedforward neural networks with many layers."

I don't understand why GPU processing would remove this problem.

Best Answer

The vanishing gradient problem forces us to use small learning rates with gradient descent, which then needs many small steps to converge. This is a problem if you have a slow computer that takes a long time for each step. If you have a fast GPU that can perform many more steps in a day, it is less of a problem.

There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinear units to rectified linear units. If you consider a simple neural network whose error $E$ depends on weight $w_{ij}$ only through $y_j$, where

$$y_j = f\left( \sum_iw_{ij}x_i \right),$$

its gradient is

\begin{align} \frac{\partial}{\partial w_{ij}} E &= \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \\ &= \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_k w_{kj} x_k\right) x_i. \end{align}
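As a concrete check of the chain rule above, here is a minimal sketch (not part of the original answer): a single sigmoid unit with a squared-error loss, comparing the analytic gradient against a finite-difference estimate. The inputs, weights, target, and the loss itself are arbitrary illustrative choices.

```python
import numpy as np

# Single unit y = f(sum_i w_i * x_i) with squared error E = (y - t)^2 / 2.
# All values below (x, w, t) are made up for illustration.

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([0.5, -1.2, 2.0])   # inputs x_i
w = np.array([0.3, 0.8, -0.5])   # weights w_i (one output unit, so the j index is dropped)
t = 1.0                          # target

u = w @ x
y = sigmoid(u)
dE_dy = y - t                                       # derivative of squared error w.r.t. y
grad = dE_dy * sigmoid(u) * (1 - sigmoid(u)) * x    # analytic: dE/dy * f'(u) * x_i

# Finite-difference check of dE/dw_0
eps = 1e-6
E = lambda w_: 0.5 * (sigmoid(w_ @ x) - t) ** 2
numeric = (E(w + np.array([eps, 0, 0])) - E(w - np.array([eps, 0, 0]))) / (2 * eps)

print(grad[0], numeric)          # the two numbers should agree closely
```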

If $f$ is the logistic sigmoid function, $f'$ will be close to zero whenever the input is large in magnitude, whether strongly positive or strongly negative. If $f$ is a rectified linear unit,

\begin{align} f(u) = \max\left(0, u\right), \end{align} the derivative is zero only for negative inputs and 1 for positive inputs, so gradients pass through active units without shrinking. Another important contribution comes from properly initializing the weights. This paper looks like a good source for understanding the challenges in more detail (although I haven't read it yet):

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf (Glorot & Bengio, 2010, "Understanding the difficulty of training deep feedforward neural networks")
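To illustrate why the derivative of the nonlinearity matters in a deep stack, here is a small, purely illustrative sketch (not from the answer or the paper): it backpropagates a unit gradient through many randomly initialized layers and compares how much of it reaches the first layer with sigmoid versus ReLU activations. The depth, width, and the Glorot-style $1/\sqrt{n}$ weight scale are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def first_layer_grad_norm(nonlinearity, depth=30, width=100):
    x = rng.standard_normal(width)
    weights, fprimes = [], []

    # Forward pass: store each layer's weights and f'(pre-activation).
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # roughly Glorot-style scale (assumed)
        u = W @ x
        if nonlinearity == "sigmoid":
            x = sigmoid(u)
            fprimes.append(x * (1.0 - x))          # sigmoid'(u), at most 0.25
        else:
            x = np.maximum(0.0, u)
            fprimes.append((u > 0).astype(float))  # ReLU'(u): exactly 0 or 1
        weights.append(W)

    # Backward pass: start from a unit gradient at the output and push it back.
    grad = np.ones(width)
    for W, fp in zip(reversed(weights), reversed(fprimes)):
        grad = W.T @ (grad * fp)

    return np.linalg.norm(grad)

print("sigmoid:", first_layer_grad_norm("sigmoid"))
print("relu:   ", first_layer_grad_norm("relu"))
```

With sigmoid units the per-layer factor $f'(u) \le 0.25$ compounds across the depth, so the gradient reaching the first layer typically ends up many orders of magnitude smaller than with ReLU units, which either block a component entirely or pass it through unchanged.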
