Solved – Deep learning: representation learning or classification

classification, deep learning, feature selection, graphical-model, neural networks

For classification, I have often heard about deep learning / deep neural networks as a form of representation learning. I am confused as to what "representation learning" means in this context. Which of the following is the case?

1) The output layer of the network gives a feature vector, with one output node per vector element. This feature vector is then passed into a classifier. As such, the output layer learns a better "representation" of the data than the original input, one that is more suitable for a classifier.

2) The output layer of the network gives a classification score for each class, with one output node per class. The score for each class is then the value of the respective output node.

Number 2 is how I have seen artificial neural networks used in the past, but number 1 seems to be what "learning representations" means. Which one is used in deep learning?
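For concreteness, the two readings can be sketched with a tiny two-layer network (the layer sizes here are arbitrary, made-up choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 inputs, 8 hidden units, 3 classes.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def hidden(x):
    """Reading (1): the penultimate activations as a learned feature vector."""
    return np.tanh(x @ W1 + b1)

def class_scores(x):
    """Reading (2): one softmax score per class at the output layer."""
    z = hidden(x) @ W2 + b2
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

x = rng.normal(size=4)
features = hidden(x)      # reading (1): an 8-dimensional representation
probs = class_scores(x)   # reading (2): 3 class probabilities summing to 1
```

The point of the question is whether the network's job is to produce `features` (for some external classifier) or `probs` (classification directly).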

Best Answer

In my opinion: it's both. This idea comes up many times in the highly cited convolutional neural network paper Gradient-Based Learning Applied to Document Recognition by Yann LeCun, Yoshua Bengio, Léon Bottou and Patrick Haffner.

The idea is that it is quite hard to hand-design a rich and complex feature hierarchy. For low-level features, we see that conv-nets learn edges or color blobs. This makes intuitive sense, and early computer vision did produce good-quality hand-crafted edge detectors. But composing these low-level features into richer, more complex ones is not a simple task to do by hand. Now imagine trying to design a 10-level feature hierarchy.

Instead, you can tie the representation-learning and classification tasks together, as is done in deep networks, and let the data drive the feature-learning mechanism.

Deep architectures are designed to learn a hierarchy of features from the data, as opposed to ad-hoc hand-crafted features designed by humans. Most importantly, the features are learned with the explicit objective of minimizing a loss function that measures the performance of the whole network. With hand-crafted features, by contrast, one does not know a priori how good the features are for the task at hand. In this way, the demand for high performance on the task drives the quality of the learned features; representation and classification become inextricably linked.
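The tying-together can be made concrete with a toy sketch (not from the answer itself): train a tiny network end-to-end on XOR with plain gradient descent, and note that the same backward pass that fits the output classifier also shapes the hidden-layer features. All sizes and hyperparameters here are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR labels

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer: the representation
W2, b2 = rng.normal(size=4), 0.0                # output layer: the classifier

def forward(X):
    h = np.tanh(X @ W1 + b1)                    # learned features
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))        # sigmoid classification score
    return h, p

losses, lr = [], 0.5
for step in range(2000):
    h, p = forward(X)
    losses.append(np.mean((p - y) ** 2))
    # Backprop through the whole pipeline: the classification loss
    # drives the hidden-layer feature weights W1 as well as W2.
    dp = 2 * (p - y) / len(y) * p * (1 - p)
    dW2, db2 = h.T @ dp, dp.sum()
    dh = np.outer(dp, W2) * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

h_learned, p_final = forward(X)
# h_learned is the representation that end-to-end training has shaped
# specifically to make this classification task easy.
```

Nothing in the update rule treats `W1` differently from `W2`: the hidden representation is simply whatever lowers the classification loss, which is the sense in which the two tasks are linked.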

This end-to-end training/classification pipeline has been one of the big ideas in the design of computer vision architectures.