Solved – Why is ResNet faster than VGG?

computer vision, deep learning

In Kaiming He's ResNet presentation, on slide 40 he says, "lower time complexity than VGG-16/19." Why is this the case, when ResNet is much deeper?

Best Answer

Updated in order to address @mrgloom's comment

In my original answer, I stated that VGG-16 has roughly 138 million parameters and ResNet has 25.5 million parameters, and that because of this it's faster, which is not true. The number of parameters reduces the amount of space required to store the network, but it doesn't mean that the network is faster. ResNet is faster than VGG, but for a different reason.

Also, as @mrgloom pointed out, computational speed may depend heavily on the implementation. Below I'll discuss a simple computational case. I'll also avoid counting FLOPs for activation functions and pooling layers, since they have relatively low cost.

First of all, the speed of a convolution depends on the size of the input. Let's say you have a grayscale (1 channel) 100x100 image and you apply one 3x3 convolution filter with stride 1 and 0 padding. This operation requires ~163k FLOPs, which is easy to calculate. If you know how convolution works, you know that from the 100x100 image you get a 98x98 output (using the setup described above). To compute each value of the 98x98 output you need 9 multiplications and 8 additions, which in total is 17 operations per value. Combining everything, you get 98 * 98 * 17 = 163,268. Now imagine you apply the same filter to a larger image, say 200x200. The image has 4 times the area and therefore you get roughly 4 times more FLOPs.
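To make the arithmetic concrete, here is a minimal sketch of that calculation in Python. The helper `conv2d_flops` is hypothetical (my own naming, not from any paper or library); it just counts the multiplications and additions per output value:

```python
def conv2d_flops(h, w, in_ch, out_ch, k=3, stride=1, pad=0):
    # Hypothetical helper: count FLOPs for one convolutional layer.
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    # Each output value takes k*k*in_ch multiplications and
    # k*k*in_ch - 1 additions.
    ops_per_value = 2 * k * k * in_ch - 1
    return out_h * out_w * out_ch * ops_per_value

print(conv2d_flops(100, 100, 1, 1))  # 163268, the ~163k above
print(conv2d_flops(200, 200, 1, 1))  # 666468, roughly 4x more
```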

Now, I'll start with the comparison between VGG-19 and ResNet-34, since that's the figure they use in the original paper.

In Figure 3, they break the architecture down into a set of blocks marked with different colors. At the end of each block, the height and width are reduced by a factor of two. In its first two layers, ResNet manages to reduce the height and width of the image by a factor of 4.

[figure: vgg-vs-resnet-block-1]

In VGG-19 you can see that the first two layers apply convolutions on top of the full 224x224 image, which is quite expensive. If you apply calculations similar to those above, you will find that the first layer does ~170M FLOPs while producing a 64x224x224 output from the 3x224x224 image. The second layer applies the same convolution, but with 64 input channels instead of 3, so its FLOP count should be close to 170M * (64 / 3); in fact, it's almost 3.7B FLOPs. This layer alone has roughly as many FLOPs as the whole ResNet-34. To avoid this computational problem, ResNet addresses the issue in the first layer: a 7x7 convolution with stride 2 reduces the number of rows and columns by a factor of 2 and uses only ~240M FLOPs, and the next max-pooling operation applies another reduction by a factor of 2.

In contrast, the four convolutional layers that VGG-19 applies before reaching the same 56x56 resolution make around 10B FLOPs (the sketch below reproduces these numbers).
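Plugging the layer shapes from the two papers into the same hypothetical helper reproduces these numbers (the 3x3 convolutions use padding 1, and ResNet's 7x7 uses stride 2 and padding 3):

```python
# Four VGG-19 layers before the image reaches 56x56:
vgg_first_four = (
    conv2d_flops(224, 224, 3, 64, pad=1)       # conv3-64,  ~0.17B
    + conv2d_flops(224, 224, 64, 64, pad=1)    # conv3-64,  ~3.70B
    + conv2d_flops(112, 112, 64, 128, pad=1)   # conv3-128, ~1.85B
    + conv2d_flops(112, 112, 128, 128, pad=1)  # conv3-128, ~3.70B
)
# ResNet's single 7x7/2 convolution over the same input:
resnet_conv1 = conv2d_flops(224, 224, 3, 64, k=7, stride=2, pad=3)

print(round(vgg_first_four / 1e9, 1))  # ~9.4, i.e. "around 10B"
print(round(resnet_conv1 / 1e6))       # ~235, i.e. the ~240M above
```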

Next, the convolutional filters in ResNet build up slowly. You can see that it uses fewer kernels per layer compared to VGG, but stacks more of them, alternating between convolution operations and non-linear activation functions. That's another thing pointed out by @mrgloom: they exploited the idea of using thinner but deeper networks.

[figure: vgg-vs-resnet-block-2]

The next ResNet layers follow the same strategy, trying to keep the network thinner and deeper.
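As a rough sketch of how much the thinner-but-deeper strategy saves, compare the 56x56 stage of both networks using the same hypothetical helper (layer shapes read from Figure 3: ResNet-34 stacks six 3x3 convolutions with 64 filters there, while VGG-19 uses four 3x3 convolutions with 256 filters):

```python
# ResNet-34 at 56x56: six thin 3x3 layers with 64 channels.
resnet_56 = 6 * conv2d_flops(56, 56, 64, 64, pad=1)        # ~1.4B total
# VGG-19 at 56x56: four wide 3x3 layers with 256 channels.
vgg_56 = (
    conv2d_flops(56, 56, 128, 256, pad=1)        # ~1.85B
    + 3 * conv2d_flops(56, 56, 256, 256, pad=1)  # ~3.70B each
)                                                # ~12.9B total
print(round(resnet_56 / 1e9, 1), round(vgg_56 / 1e9, 1))
```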

In addition, from Table 1 in the paper you can notice that the convolutional blocks for ResNet-50, ResNet-101, and ResNet-152 look a bit different: they are built from so-called bottleneck blocks.

This design was used to reduce the number of operations even further while allowing networks with a larger number of filters in the convolutional layers. The 1x1 convolutional layers reduce the channel depth before the 3x3 convolution is applied and scale it back up afterwards.
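A quick sketch with the same hypothetical helper shows why the bottleneck pays off. Take the 56x56 stage of ResNet-50 (shapes assumed from Table 1: 256 channels in and out, reduced to 64 inside the block) and compare it to a single full-width 3x3 convolution:

```python
# Bottleneck: 1x1 reduce -> 3x3 at reduced depth -> 1x1 expand.
bottleneck = (
    conv2d_flops(56, 56, 256, 64, k=1)          # reduce 256 -> 64,   ~0.10B
    + conv2d_flops(56, 56, 64, 64, k=3, pad=1)  # 3x3 on 64 channels, ~0.23B
    + conv2d_flops(56, 56, 64, 256, k=1)        # expand 64 -> 256,   ~0.10B
)
# The naive alternative: one 3x3 convolution at full width.
plain = conv2d_flops(56, 56, 256, 256, k=3, pad=1)  # ~3.70B

print(round(bottleneck / 1e9, 2), round(plain / 1e9, 2))  # ~0.44 vs ~3.70
```

So the bottleneck design touches the same 256-channel representation for roughly an eighth of the FLOPs, which is what makes the deeper ResNet variants affordable.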