First, let's restate the problem of vanishing gradients. Suppose you have an ordinary multilayer perceptron with sigmoidal hidden units, trained by back-propagation. When there are many hidden layers, the error gradient weakens as it moves from the back of the network to the front, because the derivative of the sigmoid goes to zero as its input moves away from zero in either direction. The updates for layers near the front of the network therefore carry less and less information.
RNNs amplify this problem because they are trained by back-propagation through time (BPTT): effectively, the number of layers that back-propagation must traverse grows with the length of the sequence.
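To make this concrete, here is a small numpy sketch of my own (not from the original answer) that multiplies together the local sigmoid derivatives met along an unrolled backward pass; the pre-activation values are arbitrary.

```python
# My own illustration: multiply together the local sigmoid derivatives met
# along an unrolled backward pass. The pre-activation values are arbitrary;
# the sigmoid derivative is at most 0.25, so the product decays roughly
# geometrically and the earliest layers receive almost no signal.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
grad = 1.0                                  # error signal at the last layer / time step
for step in range(50):                      # 50 layers, or 50 unrolled time steps
    pre_activation = rng.normal(scale=2.0)
    local_derivative = sigmoid(pre_activation) * (1.0 - sigmoid(pre_activation))
    grad *= local_derivative                # chain rule: multiply local derivatives
    if step % 10 == 9:
        print(f"after {step + 1:2d} steps: |gradient| ~ {abs(grad):.3e}")
```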
The long short-term memory (LSTM) architecture avoids the problem of vanishing gradients by gating the flow of error through the network. This allows it to learn long-term (100+ step) dependencies between data points through its "error carousels."
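For intuition, here is a rough, hypothetical sketch of a single LSTM step; the weight matrices `W_*` and biases `b_*` are placeholders, not from any particular implementation. The point is only that the cell state is updated additively, which is what lets error flow across many time steps.

```python
# Rough, hypothetical sketch of one LSTM step. The key detail is the additive
# cell-state update, which lets error flow across many time steps without
# being repeatedly squashed by a saturating nonlinearity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c = params
    z = np.concatenate([x, h_prev])       # current input and previous hidden state
    f = sigmoid(W_f @ z + b_f)            # forget gate
    i = sigmoid(W_i @ z + b_i)            # input gate
    o = sigmoid(W_o @ z + b_o)            # output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c = f * c_prev + i * c_tilde          # additive update: the "error carousel"
    h = o * np.tanh(c)                    # gated output / new hidden state
    return h, c

# Tiny usage example with arbitrary shapes and random parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = ([rng.normal(size=(n_hid, n_in + n_hid)) * 0.1 for _ in range(4)]
          + [np.zeros(n_hid) for _ in range(4)])
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), params)
```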
A more recent trend in training neural networks is to use rectified linear units, which are more robust to the vanishing gradient problem. RNNs with rectified linear units and a sparsity penalty apparently work well.
See Advances In Optimizing Recurrent Networks.
Historically, neural network performance has depended heavily on optimization tricks and the selection of many hyperparameters. In the case of RNNs you would be wise to also implement RMSprop and Nesterov's accelerated gradient (a rough sketch of both updates is given after the reference below). Thankfully, recent developments in dropout training have made neural networks more robust to overfitting, and apparently there is some work on making dropout work with RNNs.
See On Fast Dropout and its Applicability to Recurrent Networks.
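For reference, here is a hedged sketch of what the RMSprop and Nesterov updates mentioned above look like. The step sizes, decay rate, and epsilon are illustrative defaults rather than recommendations, and `grad_fn` is a hypothetical callable that returns the gradient at a given point.

```python
# Hedged sketch of RMSprop and Nesterov's accelerated gradient updates.
import numpy as np

def rmsprop_update(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """Scale each weight's step by a running average of its squared gradient."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def nesterov_update(w, velocity, grad_fn, lr=1e-3, momentum=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point."""
    velocity = momentum * velocity - lr * grad_fn(w + momentum * velocity)
    return w + velocity, velocity

# Example on a toy quadratic loss L(w) = 0.5 * w**2, whose gradient is w.
w, cache = np.array([5.0]), np.zeros(1)
w, cache = rmsprop_update(w, grad=w, cache=cache)
v = np.zeros(1)
w, v = nesterov_update(w, v, grad_fn=lambda u: u)
```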
The vanishing gradient problem forces us to use small learning rates with gradient descent, which then needs many small steps to converge. This is a problem if you have a slow computer that takes a long time for each step; if you have a fast GPU that can perform many more steps in a day, it is less of a problem.
There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinear units to rectified linear units. If you consider a simple neural network whose error $E$ depends on weight $w_{ij}$ only through $y_j$, where
$$y_j = f\left( \sum_iw_{ij}x_i \right),$$
its gradient is
\begin{align}
\frac{\partial}{\partial w_{ij}} E
&= \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \\
&= \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_k w_{kj} x_k\right) x_i.
\end{align}
If $f$ is the logistic sigmoid function, $f'$ will be close to zero for large positive inputs as well as large negative ones. If $f$ is a rectified linear unit,
\begin{align}
f(u) = \max\left(0, u\right),
\end{align}
the derivative is zero only for negative inputs and 1 for positive inputs. Another important contribution comes from properly initializing the weights. This paper looks like a good source for understanding the challenges in more detail (although I haven't read it yet):
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
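As a quick numeric sanity check of the derivative argument above (the input values are arbitrary, chosen only for illustration):

```python
# The sigmoid derivative vanishes for inputs of large magnitude, while the
# ReLU derivative stays at 1 for any positive input.
import numpy as np

def sigmoid_deriv(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def relu_deriv(u):
    return 1.0 if u > 0 else 0.0

for u in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"u = {u:6.1f}   sigmoid'(u) = {sigmoid_deriv(u):.5f}   relu'(u) = {relu_deriv(u):.1f}")
```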
To my understanding, during backprop the skip connection's path passes the gradient as well. Conceptually this acts in a way similar to what synthetic gradients are meant to achieve.
Instead of waiting for the gradient to propagate back one layer at a time, the skip connection's path lets the gradient reach the early layers with greater magnitude by skipping some of the layers in between.
Personally, I have not observed either an improvement or a greater risk of exploding gradients from using skip connections.
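To illustrate the gradient-flow point numerically: in a one-dimensional residual block $y = x + F(x)$, the derivative is $1 + F'(x)$, so the constant 1 contributed by the skip path keeps the gradient alive even where $F'(x)$ is nearly zero. The choice of $F$ and the numbers below are made up purely for illustration.

```python
# Made-up one-dimensional example: a block y = x + F(x) with F(x) = tanh(w * x).
import numpy as np

x, w = 3.0, 0.01                               # a regime where F'(x) is tiny
f_prime = (1.0 - np.tanh(w * x) ** 2) * w      # derivative through the block alone
print("without skip:", f_prime)                # ~0.01
print("with skip:   ", 1.0 + f_prime)          # ~1.01
```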