Machine Learning – Are Residual Networks Related to Gradient Boosting?

deep-learning, gradient-descent, machine-learning, neural-networks, residual-networks

Recently we have seen the emergence of the Residual Neural Network (ResNet), in which
each layer consists of a computational module $c_i$ and a shortcut connection that preserves the input to the layer, so that the output of the $i$-th layer satisfies:
$$
y_{i+1} = c_i(y_i) + y_i
$$
Such a network extracts residual features and allows for much greater depth while being more robust to the vanishing-gradient problem, achieving state-of-the-art performance.
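For illustration, here is a minimal numpy sketch of such a block; the module $c_i$ is just a dense layer with a ReLU, a stand-in for the convolutional blocks in the actual paper.

```python
# A minimal sketch of a residual block y_{i+1} = c_i(y_i) + y_i in plain numpy.
# The module c_i is a hypothetical single dense layer with a ReLU, standing in
# for the convolutional blocks used in the actual ResNet paper.
import numpy as np

rng = np.random.default_rng(0)

def residual_block(y, W, b):
    """One residual layer: the module output is added to the unchanged input."""
    c = np.maximum(0.0, y @ W + b)   # c_i(y_i), the learned residual branch
    return c + y                     # shortcut connection preserves y_i

d = 8
y = rng.normal(size=(4, d))          # a batch of 4 inputs
for _ in range(3):                   # stack a few blocks
    W = rng.normal(scale=0.1, size=(d, d))
    b = np.zeros(d)
    y = residual_block(y, W, b)
print(y.shape)                       # (4, 8)
```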

Having delved into gradient boosting, a very powerful ensembling technique that also performs a form of gradient optimization on the residual of the loss, it is hard not to see some similarity.

I know they are similar but not the same: one major difference I noticed is that gradient boosting optimizes only the newly added term at each stage, while a residual network optimizes the entire network jointly, as the sketch below illustrates.
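To make concrete what I mean by optimizing only the additive term, here is a minimal least-squares gradient boosting sketch (scikit-learn regression trees are used purely for illustration): each stage fits a new weak learner to the current residuals and freezes everything fitted before, whereas a ResNet trains all of its blocks jointly by backpropagation.

```python
# Minimal least-squares gradient boosting sketch: each stage fits a new weak
# learner to the current residuals (the negative gradient of the squared loss)
# and freezes all previously fitted learners; only the new additive term is
# optimized, unlike a ResNet where all layers are trained jointly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
F = np.zeros_like(y)                          # F_0(x) = 0
trees = []
for t in range(100):
    residual = y - F                          # -dL/dF for L = 1/2 (y - F)^2
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += learning_rate * h.predict(X)         # F_t = F_{t-1} + alpha * h_t
    trees.append(h)

print(np.mean((y - F) ** 2))                  # training MSE after boosting
```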

I didn't see He et al. note this connection as part of the motivation in their original paper, so I was wondering what your insights on this topic are, and whether you can share any interesting resources.

Thank you.

Best Answer

A more recent paper from Langford and Schapire's team addresses much of this directly: Learning Deep ResNet Blocks Sequentially using Boosting Theory [2].

The parts of interest are (see Section 3):

The key difference is that boosting is an ensemble of estimated hypotheses whereas ResNet is an ensemble of estimated feature representations $\sum_{t=0}^T f_t(g_t(x))$. To solve this problem, we introduce an auxiliary linear classifier $\mathbf{w}_t$ on top of each residual block to construct a hypothesis module. Formally a hypothesis module is defined as $$o_t(x) := \mathbf{w}_t^T g_t(x) \in \mathbb{R}$$

...

(where) $o_t(x) = \sum_{{t'} = 0}^{t-1} \mathbf{w}_t^T f_{t'}(g_{t'}(x))$

The paper goes into much more detail about the construction of the weak module classifier $h_t(x)$ and how it integrates with their BoostResNet algorithm.
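To make the quoted construction concrete, here is a minimal numpy sketch of the hypothesis module $o_t(x) = \mathbf{w}_t^T g_t(x)$; the residual branches and the auxiliary weights below are random stand-ins rather than anything actually trained by BoostResNet.

```python
# Sketch of the hypothesis module o_t(x) = w_t^T g_t(x): an auxiliary linear
# classifier on top of the feature representation after each residual block.
# The residual branches f_t and the weights w_t are random stand-ins here,
# not anything trained by BoostResNet.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 4
x = rng.normal(size=d)

g = x                                      # g_0(x) = x
hypotheses = []
for t in range(1, T + 1):
    W = rng.normal(scale=0.1, size=(d, d))
    f = np.maximum(0.0, g @ W)             # f_{t-1}(g_{t-1}(x)), the residual branch
    g = g + f                              # g_t(x) = g_{t-1}(x) + f_{t-1}(g_{t-1}(x))
    w_t = rng.normal(size=d)               # auxiliary linear classifier for block t
    hypotheses.append(w_t @ g)             # o_t(x) = w_t^T g_t(x), a scalar hypothesis

print(hypotheses)                          # one scalar hypothesis module per block
```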


Adding a bit more detail to this answer: all boosting algorithms can be written in the form [1] (pp. 5, 180, 185):

$$F_T(x) := \sum_{t=0}^T \alpha_t h_t(x)$$

where $h_t$ is the $t^{th}$ weak hypothesis and $\alpha_t$ is its weight. Different boosting algorithms choose $\alpha_t$ and $h_t$ in different ways.

For example, AdaBoost [1] (p. 5) chooses $h_t$ to minimize the weighted error $\epsilon_t$ and sets $\alpha_t = \frac{1}{2} \log \frac{1- \epsilon_t}{\epsilon_t}$.
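As a concrete illustration of that rule, here is a toy sketch with decision stumps (not the book's pseudocode):

```python
# Toy AdaBoost sketch with decision stumps: each round picks h_t to minimize
# the weighted error eps_t, sets alpha_t = 0.5 * log((1 - eps_t) / eps_t),
# and reweights the examples so that mistakes get more attention.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.full(len(y), 1.0 / len(y))          # example weights D_t
F = np.zeros(len(y))                       # running ensemble F_T(x)
for t in range(10):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    eps = w[pred != y].sum()               # weighted error epsilon_t
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
    F += alpha * pred                      # F_t = F_{t-1} + alpha_t h_t
    w *= np.exp(-alpha * y * pred)         # upweight mistakes, downweight hits
    w /= w.sum()

print(np.mean(np.sign(F) == y))            # training accuracy of the ensemble
```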

In the gradient boosting setting [1] (p. 190), on the other hand, $h_t$ is selected to be maximally correlated with the negative gradient of the loss, i.e. to maximize $-\nabla\mathcal{L}(F_{t-1}(x)) \cdot h_t$, and $\alpha_t > 0$ is chosen separately (e.g. as a fixed learning rate or by line search).
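A toy sketch of that functional-gradient selection step, using a small hypothetical dictionary of threshold stumps as the weak hypotheses:

```python
# Toy sketch of the functional-gradient selection step: from a small
# (hypothetical) dictionary of threshold stumps, pick the h_t most correlated
# with the negative gradient of the loss at F_{t-1}, then step by alpha_t.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + 0.1 * rng.normal(size=n)

# Candidate weak hypotheses: +/-1 valued threshold stumps on X.
thresholds = np.linspace(-3, 3, 25)
candidates = [s * np.where(X > c, 1.0, -1.0) for c in thresholds for s in (1.0, -1.0)]

F = np.zeros(n)                            # F_0 = 0
alpha = 0.1                                # fixed learning rate for simplicity
for t in range(50):
    grad = F - y                           # gradient of 1/2 * sum (F - y)^2 in F
    h = max(candidates, key=lambda h: np.dot(-grad, h))   # most aligned with -grad
    F += alpha * h                         # F_t = F_{t-1} + alpha_t h_t

print(np.mean((F - y) ** 2))               # squared loss after 50 rounds
```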

Whereas in [2], Lemma 3.2 shows that the output $F(x)$ of a depth-$T$ ResNet satisfies

$$F(x) \propto \sum_{t=0}^T h_t(x)$$

This completes the relationship between boosting and ResNet. The paper [2] then proposes adding an auxiliary linear layer on top of each block to bring it into the form $F_T(x) := \sum_{t=0}^T \alpha_t h_t(x)$, which leads to their BoostResNet algorithm and the discussion around it.
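A quick numerical illustration of that Lemma 3.2 intuition (with random, untrained blocks as stand-ins): unrolling the identity shortcuts writes the final representation as a sum over blocks, so a linear readout of the ResNet output equals a sum of per-block terms.

```python
# Numerical illustration of the Lemma 3.2 intuition with random, untrained
# blocks: unrolling identity shortcuts writes the final representation as
#   g_T(x) = x + f_0(g_0(x)) + ... + f_{T-1}(g_{T-1}(x)),
# so a linear readout of the ResNet output equals a sum of per-block terms.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
x = rng.normal(size=d)
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(T)]
w = rng.normal(size=d)                     # final linear readout

g = x
branches = []
for W in Ws:
    f = np.maximum(0.0, g @ W)             # f_t(g_t(x))
    branches.append(f)
    g = g + f                              # g_{t+1} = g_t + f_t(g_t)

resnet_readout = w @ g
additive_form = w @ x + sum(w @ f for f in branches)
print(np.isclose(resnet_readout, additive_form))   # True: same value, written as a sum
```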

[1] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
[2] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning Deep ResNet Blocks Sequentially using Boosting Theory. ICML 2018.