Computer Vision – Understanding Translation Invariance in Convolutional Neural Networks

computer-vision · conv-neural-network · convolution · machine-learning

I don't have a computer vision background, yet when I read image processing and convolutional neural network articles and papers, I constantly run into the term translation invariance (or translation invariant). I also often read that the convolution operation provides translation invariance. What does this mean?

I have always interpreted it as meaning that if we transform an image in some way, the underlying concept of the image doesn't change. For example, if I rotate an image of, let's say, a tree, it's still a tree no matter what I do to that picture. And I consider all operations that transform an image (cropping, resizing, gray-scaling, colorizing, etc.) to work the same way. I have no idea if this is true, so I would be grateful if anyone could explain it to me.

Best Answer

You're on the right track.

Invariance means that you can recognize an object as an object, even when its appearance varies in some way. This is generally a good thing, because it preserves the object's identity, category, etc. across changes in the specifics of the visual input, like the relative positions of the viewer/camera and the object.

The image below contains many views of the same statue. You (and well-trained neural networks) can recognize that the same object appears in every picture, even though the actual pixel values are quite different.

Various kinds of invariance, demonstrated

Note that translation here has a specific meaning in vision, borrowed from geometry. It does not refer to any type of conversion, unlike, say, a translation from French to English or between file formats. Instead, it means that each point/pixel in the image has been moved the same amount in the same direction. Alternatively, you can think of the origin as having been shifted an equal amount in the opposite direction. For example, we can generate the 2nd and 3rd images in the first row from the first by moving each pixel 50 or 100 pixels to the right.
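To make the geometric meaning concrete, here is a minimal sketch (assuming NumPy, with a toy array standing in for an image) of translating a picture by moving every pixel the same amount:

```python
import numpy as np

# Toy 5x5 "image" with a single bright pixel at row 2, column 1.
img = np.zeros((5, 5))
img[2, 1] = 1.0

# Translate: move every pixel 2 columns to the right. np.roll wraps
# around at the border, which matches a pure translation as long as
# the object stays inside the frame.
shifted = np.roll(img, shift=2, axis=1)

print(np.argwhere(img))      # [[2 1]]
print(np.argwhere(shifted))  # [[2 3]]
```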


One can show that the convolution operator commutes with respect to translation. If you convolve $f$ with $g$, it doesn't matter if you translate the convolved output $f*g$, or if you translate $f$ or $g$ first, then convolve them. Wikipedia has a bit more.
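As a quick sanity check (not a proof), you can verify this numerically. The sketch below assumes NumPy and SciPy, and uses circular shifts with wrap-around boundaries so that the identity holds exactly, without boundary effects:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))  # the "image"
g = rng.standard_normal((3, 3))  # the "filter"

def translate(x, dx):
    # Circular shift along the column axis; stands in for translation
    # without any boundary bookkeeping.
    return np.roll(x, dx, axis=1)

# Translate f first, then convolve ...
a = convolve2d(translate(f, 2), g, mode="same", boundary="wrap")
# ... or convolve first, then translate the output f*g.
b = translate(convolve2d(f, g, mode="same", boundary="wrap"), 2)

print(np.allclose(a, b))  # True: convolution commutes with translation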

One approach to translation-invariant object recognition is to take a "template" of the object and convolve it with the image, evaluating the response at every possible location. A large response at some location suggests that an object resembling the template is present there. This approach is often called template matching.
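Here is a toy version of that idea, a sketch assuming SciPy (correlate2d is convolution with a flipped kernel, which is the form usually used for matching; real systems typically use normalized cross-correlation instead of this raw form):

```python
import numpy as np
from scipy.signal import correlate2d

# A small "plus-sign" template, planted somewhere in a larger image.
template = np.array([[0., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 0.]])

image = np.zeros((10, 10))
image[4:7, 5:8] = template  # object occupies rows 4-6, columns 5-7

# Slide the template over every location; a large response means the
# local patch resembles the template.
response = correlate2d(image, template, mode="same")

peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)  # (5, 6): the centre of the planted object
```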


Invariance vs. Equivariance

Santanu_Pattanayak's answer (here) points out that there is a difference between translation invariance and translation equivariance. Translation invariance means that the system produces exactly the same response, regardless of how its input is shifted. For example, a face-detector might report "FACE FOUND" for all three images in the top row. Equivariance means that the system works equally well across positions, but its response shifts with the position of the target. For example, a heat map of "face-iness" would have similar bumps at the left, center, and right when it processes the first row of images.

This is sometimes an important distinction, but many people call both phenomena "invariance", especially since it is usually trivial to convert an equivariant response into an invariant one (just disregard all the position information).
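A minimal sketch of that conversion, assuming NumPy and using a synthetic heat map as the equivariant response; this is essentially what global pooling layers do at the end of many CNN classifiers:

```python
import numpy as np

# A bump of "face-iness" in an otherwise empty heat map.
base = np.zeros((8, 8))
base[3, 1] = 1.0

# Equivariant responses: the bump shifts along with the face.
heatmaps = [np.roll(base, shift, axis=1) for shift in (0, 3, 6)]

# Discarding the position information (global max pooling) turns the
# equivariant maps into a single translation-invariant score.
print([hm.max() for hm in heatmaps])  # [1.0, 1.0, 1.0]
```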
