Solved – Convolutional Neural Network Scale Sensitivity

Tags: computer vision, neural networks

For the sake of example, let's suppose we're building an age estimator based on a picture of a person. Below we have two people in suits, but the first is clearly younger than the second.


(source: tinytux.com)

There are plenty of features that imply this, for example the facial structure. However, the most telling feature is the ratio of head size to body size:


(source: wikimedia.org)

So suppose we've trained a CNN regressor to predict a person's age. In a lot of the age predictors I've tried, the above image of the kid seems to fool the prediction into thinking he's older, because of the suit and likely because they rely primarily on the face.

I'm wondering: just how well can a vanilla CNN architecture infer the ratio of head size to torso size?

Compared to a region-based CNN (R-CNN), which is able to get bounding boxes on the body and the head, will the vanilla CNN always perform worse?

Just before global flattening in the vanilla CNN (i.e. just after all the convolutions), each output has a corresponding receptive field, which should give it a sense of scale. I know that Faster R-CNN exploits this by making bounding-box proposals exactly at this stage, so that all prior convolutional filters automatically train across scales.
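The receptive field mentioned here can be worked out layer by layer with the standard recurrence: each layer with kernel size k and stride s grows the receptive field r by (k - 1) times the cumulative stride j, then multiplies j by s. A minimal sketch, using an illustrative layer stack rather than any particular published network:

```python
# Receptive-field arithmetic for a stack of conv/pool layers.
# For each layer with kernel size k and stride s:
#   r <- r + (k - 1) * j    (receptive field of one output unit)
#   j <- j * s              (cumulative stride, a.k.a. "jump")
# The layer list below is illustrative, not a specific network.

def receptive_field(layers):
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

# Three 3x3 convs (stride 1), each followed by 2x2 max pooling (stride 2).
layers = [(3, 1), (2, 2)] * 3
r, j = receptive_field(layers)
print(f"receptive field: {r}x{r} pixels, output stride: {j}")
```

So even this small stack sees a 22x22 patch per output unit; the deeper the stack, the larger (and coarser) the region each pre-flattening activation summarizes.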

So I would think that the vanilla CNN should be able to infer the ratio of head to torso size. Is this right? If so, is the only benefit of using a Faster R-CNN framework that it may have been pre-trained on detecting people?

Best Answer

CNNs are too large a class of models to answer this question in general. LeNet, AlexNet, ZFNet and VGG16 will behave very differently from GoogLeNet, which was built specifically to do most of what R-CNNs do within a single CNN architecture (you may know GoogLeNet under the name Inception, even though strictly speaking Inception is just the basic unit (subnetwork) upon which GoogLeNet is built). ResNets, finally, will behave differently again. And none of these architectures was built to classify age classes, but rather the 1000 ImageNet classes, which don't contain age classes for humans. One could use transfer learning (if you have enough training images) on one of the widely available pre-trained models above, and see how it performs. In general, however, the older architectures especially (let's say up to VGG16) have a hard time learning "global features" that require learning about "head" (already a complex feature), "torso" (another complex feature) and their ratio (which also requires that the two features stand in a certain spatial relationship). This kind of task is what Capsule Networks were supposed to be able to do.

Convnets were born to do exactly the opposite: be sensitive to local features, and relatively insensitive to their relative position and scale. A good convnet should recognize "white cat" whether the picture is a close-up or an American shot. Combining convolutional layers (which are sensitive to local features) with pooling layers (which remove part of the sensitivity to variations in scale or translation of the image) gives you an architecture which, in its most basic form, is not great at learning the kind of spatial relationships among objects you're looking for. There was an example somewhere (but I can't find it anymore) where, after splitting a cat image into rectangular non-overlapping tiles and putting them back together in a random order, the CNN would keep identifying the image as a cat. This indicates that CNNs are more sensitive to local features (textures or something like that) than to the spatial relationships among high-level features. See also the Capsule Networks paper for some discussion of this. Hinton also showed an example of this in a video about the limits of convnets.
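You can see the pooling half of this in a few lines of numpy: 2x2 max pooling discards where a feature sits inside each pooling window, so a small translation of the input can leave the pooled output bit-for-bit identical. A toy illustration (mine, not from any paper):

```python
import numpy as np

# 2x2 max pooling throws away the feature's position within each window,
# so shifting the feature inside its window changes nothing downstream.

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even sides."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0          # "feature" in the top-left corner of its window

b = np.zeros((4, 4))
b[1, 1] = 1.0          # same feature, shifted one pixel in each axis

print(max_pool_2x2(a))
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))
```

Stack several such layers and the network becomes progressively blinder to exactly the kind of precise spatial relationship (head position relative to torso) the question asks about.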

My wild guess is that one of the recent architectures would be perfectly capable (given enough data) of discerning men from children, but not because of a "threshold" on a metric relationship among high-level features such as "head" and "torso". It would learn some statistical regularity, maybe completely unnoticeable to humans, which separates adult images from child images in the training set.