Solved – Pretext Task in Computer Vision

computer vision, machine learning

I am new to computer vision. I have been reading many papers and keep seeing the term "pretext task". Can anyone explain exactly what it means?
Thanks in advance.

Best Answer

A pretext task is used in self-supervised learning to generate useful feature representations, where "useful" is defined nicely in this paper:

By “useful” we mean a representation that should be easily adaptable for other tasks, unknown during training time.

This paper gives a very clear explanation of the relationship of pretext and downstream tasks:

Pretext Task: Pretext tasks are pre-designed tasks for networks to solve, and visual features are learned by learning objective functions of pretext tasks.

Downstream Task: Downstream tasks are computer vision applications that are used to evaluate the quality of features learned by self-supervised learning. These applications can greatly benefit from the pretrained models when training data are scarce.
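To make the pretext/downstream split concrete, here is a minimal sketch of one common way to evaluate learned features: freeze the encoder and fit only a linear classifier on a small labeled set. Everything here is an assumption for illustration. `frozen_encoder` is a hypothetical stand-in for a network pretrained on some pretext task, the data is synthetic, and accuracy is measured on the same toy data for brevity.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized least-squares classifier on frozen features."""
    targets = np.eye(num_classes)[labels]  # one-hot encode the labels
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

def probe_accuracy(encoder, weights, inputs, labels):
    """Accuracy of the linear probe; the encoder itself is never updated."""
    predictions = np.argmax(encoder(inputs) @ weights, axis=1)
    return float(np.mean(predictions == labels))

def frozen_encoder(inputs):
    # Hypothetical stand-in for a pretrained encoder: it only appends a
    # constant bias feature. A real encoder would be a trained network.
    return np.hstack([inputs, np.ones((len(inputs), 1))])

# Synthetic downstream task: two well-separated classes, 20 examples each.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.where(labels[:, None] == 0, 3.0, -3.0)
inputs = centers + rng.normal(scale=0.5, size=(40, 4))

weights = fit_linear_probe(frozen_encoder(inputs), labels, num_classes=2)
accuracy = probe_accuracy(frozen_encoder, weights, inputs, labels)
print(f"downstream accuracy with frozen features: {accuracy:.2f}")
```

The point of the design is that only `weights` is learned from labels. The better the pretext task shaped the encoder, the less labeled data the downstream classifier should need.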

A popular pretext task is minimizing reconstruction error in an autoencoder to learn lower-dimensional feature representations. Those representations can then be used for whatever downstream task you like. The idea is that if the decoder can come close to reconstructing the original input, the essential information must be present in the bottleneck layer, so you can use that lower-dimensional representation as a proxy for the full input.
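As a rough illustration of the reconstruction pretext task, the sketch below trains a tiny linear autoencoder with plain gradient descent in NumPy. It is a toy under stated assumptions: the data is synthetic (8 features generated from 2 hidden factors), the encoder and decoder are single weight matrices with no nonlinearity, and names such as `train_linear_autoencoder` are made up for this example. Real autoencoders for images are usually deep convolutional networks.

```python
import numpy as np

def train_linear_autoencoder(x, bottleneck_dim, steps=500, lr=0.1, seed=0):
    """Minimize mean squared reconstruction error with gradient descent."""
    rng = np.random.default_rng(seed)
    num_features = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(num_features, bottleneck_dim))
    w_dec = rng.normal(scale=0.1, size=(bottleneck_dim, num_features))
    losses = []
    for _ in range(steps):
        code = x @ w_enc              # bottleneck representation
        error = code @ w_dec - x      # reconstruction minus input
        losses.append(float(np.mean(error ** 2)))
        grad_out = 2.0 * error / error.size    # dLoss / dReconstruction
        grad_dec = code.T @ grad_out
        grad_enc = x.T @ (grad_out @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses

# Synthetic data: 8 observed features driven by 2 hidden factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
x = latent @ rng.normal(size=(2, 8))
x = (x - x.mean(axis=0)) / x.std(axis=0)   # standardize each feature

w_enc, w_dec, losses = train_linear_autoencoder(x, bottleneck_dim=2)
codes = x @ w_enc   # features you would hand to a downstream task
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

No labels appear anywhere in the training loop. The supervision signal is the input itself, which is what makes this a pretext task, and `codes` is the part you keep afterwards.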

Another pretext task in vision is image inpainting in context encoders, where the network tries to fill in blanked-out regions of an image based on the surrounding pixels.

Yet another is grayscale colorization, which, as the name suggests, tries to colorize a grayscale image. The idea is that to do this the network must represent the spatial layout of the image as well as some semantic knowledge. For example, coloring a grayscale school bus yellow captures a common regularity about school buses, as opposed to a city bus, which might be any color. So if your task were, say, classifying vehicles by type, you might do better by predicting from this learned representation, because it has encoded spatial and color information that correlates well with our semantic labeling of our environment.
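Both of these tasks get their training targets for free from the image itself. The sketch below shows how such (input, target) pairs could be built with NumPy. The function names are hypothetical, the mask is a single square filled with zeros, and the grayscale conversion uses the standard BT.601 luma weights. These are simplifying assumptions, and published methods may mask or convert color differently.

```python
import numpy as np

def make_inpainting_pair(image, top, left, size):
    """Blank out a square region; the target is the untouched image."""
    masked = image.copy()
    masked[top:top + size, left:left + size] = 0.0
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[top:top + size, left:left + size] = True
    return masked, image, mask

def make_colorization_pair(rgb_image):
    """Network input is the grayscale image; target is the original RGB."""
    luma_weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma
    gray = rgb_image @ luma_weights
    return gray, rgb_image

# A random array stands in for a real 32x32 RGB image with values in [0, 1).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

masked, inpaint_target, mask = make_inpainting_pair(image, top=8, left=8, size=16)
gray, color_target = make_colorization_pair(image)
print(masked.shape, int(mask.sum()), gray.shape)
```

A network would then be trained to map `masked` to `inpaint_target` (typically scoring only the pixels under `mask`) or `gray` to `color_target`, with no human annotation involved.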

Note that pretext tasks are not unique to computer vision, but since vision dominates a lot of active machine learning research these days, there are many good examples of pretext tasks that have been demonstrated to help in vision-related tasks. An interesting multimodal example is this paper, where they train a network to predict whether or not the input audio and video streams are temporally aligned. Using those features, they are able to perform cool tasks like sound-source localization, action recognition, and on/off-screen prediction (i.e., separating the audio associated with what is visible on screen from background audio coming from outside the visual frame).
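For the alignment task, labels can again be generated automatically: keep a clip as is for a positive example and shift its audio in time for a negative one. The sketch below is one plausible way to build such pairs, not the paper's actual pipeline. The function name, the per-frame feature arrays, and the use of `np.roll` (which wraps the shifted audio around the end of the clip) are all assumptions made for illustration.

```python
import numpy as np

def make_alignment_examples(video_frames, audio_frames, shift):
    """Return (video, audio, label) triples: 1 = aligned, 0 = misaligned."""
    if shift % len(audio_frames) == 0:
        raise ValueError("shift must move the audio off its original position")
    # Simplification: np.roll wraps around, so the clip length is unchanged.
    shifted_audio = np.roll(audio_frames, shift, axis=0)
    return [
        (video_frames, audio_frames, 1),
        (video_frames, shifted_audio, 0),
    ]

# Toy per-frame features standing in for real video and audio streams.
video = np.arange(10, dtype=float).reshape(10, 1)
audio = np.arange(10, dtype=float).reshape(10, 1) * 0.5

examples = make_alignment_examples(video, audio, shift=3)
for _, audio_stream, label in examples:
    print(label, audio_stream[:4, 0])
```

A binary classifier trained on such pairs has to learn what sounds go with what visual events, and that joint audio-visual representation is what gets reused downstream.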
