Potentially relevant is a newer paper from the Langford and Schapire team which attempts to address more of this: Learning Deep ResNet Blocks Sequentially using Boosting Theory [2].
The parts of interest are (see Section 3):
The key difference is that boosting is an ensemble of estimated
hypotheses whereas ResNet is an ensemble of estimated feature
representations $\sum_{t=0}^T f_t(g_t(x))$. To solve this problem, we
introduce an auxiliary linear classifier $\mathbf{w}_t$ on top of each
residual block to construct a hypothesis module. Formally a
hypothesis module is defined as $$o_t(x) := \mathbf{w}_t^T g_t(x) \in \mathbb{R}$$
...
(where) $o_t(x) = \sum_{t'=0}^{t-1} \mathbf{w}_t^T f_{t'}(g_{t'}(x))$
The paper goes into much more detail around the construction of the weak module classifier $h_t(x)$ and how that integrates with their BoostResNet algorithm.
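To make this construction concrete, here is a minimal PyTorch sketch (the class name and the fully connected residual branch are illustrative assumptions, not the paper's code): each block computes $g_{t+1}(x) = g_t(x) + f_t(g_t(x))$, and an auxiliary linear classifier $\mathbf{w}_t$ maps the feature representation to a scalar hypothesis.

```python
import torch.nn as nn

class ResidualBlockWithAux(nn.Module):
    """One residual block f_t plus an auxiliary linear classifier w_t."""

    def __init__(self, dim):
        super().__init__()
        # f_t: the residual branch (this particular architecture is an assumption)
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # w_t: auxiliary linear classifier turning features into a scalar hypothesis
        self.w = nn.Linear(dim, 1, bias=False)

    def forward(self, g_t):
        g_next = g_t + self.f(g_t)   # g_{t+1}(x) = g_t(x) + f_t(g_t(x))
        o_next = self.w(g_next)      # o_{t+1}(x) = w_{t+1}^T g_{t+1}(x)
        return g_next, o_next
```

Training such blocks one at a time, with the auxiliary outputs serving as the weak hypotheses, is the sequential scheme the paper's title refers to.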
To add a bit more detail to this answer: all boosting algorithms can be written in the form [1] (pp. 5, 180, 185, ...):
$$F_T(x) := \sum_{t=0}^T \alpha_t h_t(x)$$
where $h_t$ is the $t^{\text{th}}$ weak hypothesis, for some choice of $\alpha_t$. Different boosting algorithms yield $\alpha_t$ and $h_t$ in different ways.
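In code, every member of this family shares the same accumulation loop; the algorithms differ only in how each round produces $(\alpha_t, h_t)$. A generic sketch (the `next_hypothesis` callback is a hypothetical placeholder):

```python
def boost(x, T, next_hypothesis):
    """Generic boosting skeleton: F_T(x) = sum_{t=0}^T alpha_t * h_t(x).

    next_hypothesis(t) is a hypothetical callback returning the pair
    (alpha_t, h_t) for round t, however the specific algorithm derives them.
    """
    F = 0.0
    for t in range(T + 1):
        alpha_t, h_t = next_hypothesis(t)
        F += alpha_t * h_t(x)   # accumulate the weighted weak hypothesis
    return F
```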
For example, AdaBoost [1] (p. 5) chooses $h_t$ to minimize the weighted error $\epsilon_t$, and sets $\alpha_t = \frac{1}{2} \log \frac{1- \epsilon_t}{\epsilon_t}$.
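Concretely, one AdaBoost round looks like this (a toy numpy sketch, assuming labels and predictions in $\{-1, +1\}$ and a normalized weight vector `D`; not a library API):

```python
import numpy as np

def adaboost_round(y, h, D):
    """One AdaBoost round on labels y, weak predictions h, sample weights D."""
    eps = np.sum(D * (h != y))             # weighted error eps_t of h_t
    alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = 1/2 log((1 - eps_t) / eps_t)
    D = D * np.exp(-alpha * y * h)         # up-weight the examples h_t misclassified
    return alpha, D / D.sum()              # renormalized weights for the next round
```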
In the gradient boosting setting [1] (p. 190), on the other hand, $h_t$ is selected to align with the negative functional gradient, i.e., to maximize $-\nabla\mathcal{L}(F_{t-1}(x)) \cdot h_t$, and $\alpha_t > 0$ is chosen as a step size (learning rate).
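For squared loss $\mathcal{L}(F) = \frac{1}{2}\sum_i (y_i - F(x_i))^2$ the negative gradient is just the residual, so one step reduces to fitting the residuals (a sketch; `fit_weak_learner` is a hypothetical stand-in for any regressor fit to the pseudo-residuals, returning its predictions on the training points):

```python
def gradient_boost_step(y, F_prev, fit_weak_learner, lr=0.1):
    """One gradient-boosting step for squared loss, on training predictions F_prev."""
    residual = y - F_prev             # -grad L(F_{t-1}): the negative gradient
    h_t = fit_weak_learner(residual)  # h_t chosen to align with the negative gradient
    return F_prev + lr * h_t          # F_t = F_{t-1} + alpha_t * h_t, with alpha_t = lr
```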
In [2], Lemma 3.2 shows that the output $F(x)$ of a depth-$T$ ResNet satisfies
$$F(x) \propto \sum_{t=0}^T h_t(x)$$
This completes the relationship between boosting and ResNet. The paper [2] proposes adding an auxiliary linear layer to bring it into the form $F_T(x) := \sum_{t=0}^T \alpha_t h_t(x)$, which leads to their BoostResNet algorithm and the discussion around it.
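The equivalence is easy to sketch by unrolling the residual recursion (a back-of-the-envelope version of the lemma, with $g_0(x)$ denoting the input representation):
$$g_{t+1}(x) = g_t(x) + f_t(g_t(x)) \quad\Longrightarrow\quad g_{T+1}(x) = g_0(x) + \sum_{t=0}^{T} f_t(g_t(x)),$$
so a single linear output layer $\mathbf{w}$ gives
$$F(x) = \mathbf{w}^T g_{T+1}(x) = \mathbf{w}^T g_0(x) + \sum_{t=0}^{T} \mathbf{w}^T f_t(g_t(x)) \propto \sum_{t=0}^{T} h_t(x)$$
with $h_t(x) := \mathbf{w}^T f_t(g_t(x))$. Note that the ensemble weights are all fixed at $\alpha_t = 1$ and every $h_t$ shares the same $\mathbf{w}$; the per-block auxiliary classifiers $\mathbf{w}_t$ are what relax this.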
[1] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012. pp. 5, 180, 189.
[2] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning Deep ResNet Blocks Sequentially using Boosting Theory. ICML 2018.
Best Answer
In short (written from my cellphone): it works because the gradient reaches every layer through a path with only a small number of layers in between that it needs to differentiate through.
If you pick a layer near the bottom of the stack, the skip connections give it a path to the output layer that passes through only a couple of other layers, so the gradient arriving there is far less degraded.
This mitigates the vanishing gradient problem, and therefore models can be built even deeper.
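This intuition can be made precise with one line of calculus on the residual recursion $g_{t+1} = g_t + f_t(g_t)$:
$$\frac{\partial g_{t+1}}{\partial g_t} = I + \frac{\partial f_t}{\partial g_t} \quad\Longrightarrow\quad \frac{\partial \mathcal{L}}{\partial g_t} = \frac{\partial \mathcal{L}}{\partial g_T}\prod_{t'=t}^{T-1}\left(I + \frac{\partial f_{t'}}{\partial g_{t'}}\right).$$
Expanding the product always leaves the bare term $\frac{\partial \mathcal{L}}{\partial g_T}$, so part of the error signal reaches layer $t$ without passing through any weight matrices at all, regardless of depth.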