Object detection: is deep learning the only way to go?

computer vision, conv-neural-network, machine learning, neural networks, object detection

It seems that deep learning based approaches are currently superior to the more "traditional" methods in the domain of object detection. Methods like YOLO, for example, seem to be doing something magical that can't be replicated by other methods. However, these deep neural networks are trained on hundreds of classes and require a huge amount of computational power. So I have been wondering: if we limit the object detection algorithm to a single class of objects (cats or bicycles, for example), can methods based on feature detectors (SIFT, SURF…) and machine learning algorithms (SVM, ANN…) perform similarly to deep learning algorithms?

Best Answer

Object detection is completely dominated by deep learning. Essentially, the choice is now between one-stage detectors (e.g., YOLO) when you want speed and two-stage detectors (e.g., Faster R-CNN) when you want accuracy. Check out any recent survey on object detection, like this one or this one.

"However, these deep neural networks are trained on hundreds of classes and require a huge amount of computational power."

This is often true, but not strictly so. Many networks focus exclusively on humans (detecting pedestrians or estimating human pose), for example. And modern hardware is quite capable of running high-speed detectors like YOLO9000 in at least near real time, even on cell phones (especially with network weight quantization).
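To make the quantization remark concrete, here is a minimal sketch of affine int8 weight quantization in plain Python. It is only an illustration of the general idea, not how any particular mobile framework implements it; the function names and the example weights are made up for this answer.

```python
def quantize_int8(weights):
    """Affine (asymmetric) quantization of a list of floats to int8 codes."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = round(-128 - w_min / scale)
    codes = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [-0.62, -0.05, 0.0, 0.13, 0.48, 1.27]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

Each weight is stored in one byte instead of four, and the reconstruction error stays within about half a quantization step (`scale / 2`), which is why quantized detectors can fit and run on phones with a modest accuracy cost.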

"If we limit the object detection algorithm to a single class of objects (cats or a bicycle for example) can methods based on feature detectors (SIFT, SURF...) and machine learning algorithms (SVM, ANN...) perform similarly to deep learning algorithms?"

It's possible, sure. One case where classical methods might do well is when there is very little training data available. Another is for objects that you can describe very well at multiple scales with a hand-crafted feature. However, I suspect that even with a relatively small amount of labeled training data, a deep CNN can do better, especially when it is trained via transfer learning or with low-shot/few-shot learning techniques.
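For reference, the classical pipeline being discussed looks roughly like the toy sketch below: a hand-crafted descriptor, a simple learned classifier, and a sliding window. To keep it dependency-free I have swapped in a HOG-style orientation histogram for SIFT/SURF and a perceptron for the SVM, and the "object" is just a synthetic vertical edge, so treat every name and number here as an assumption for illustration.

```python
import math


def orientation_histogram(patch, bins=8):
    """L2-normalized histogram of unsigned gradient orientations, weighted by magnitude."""
    hist = [0.0] * bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            hist[min(int(angle / math.pi * bins), bins - 1)] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Plain perceptron; labels are +1 (object) or -1 (background)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + lr * y * v for w, v in zip(weights, x)]
                bias += lr * y
    return weights, bias


def make_edge_patch(size, pos, vertical):
    """Synthetic patch with a single step edge (toy stand-in for real image data)."""
    return [[1.0 if (x if vertical else y) >= pos else 0.0 for x in range(size)]
            for y in range(size)]


# Training set: vertical edges are the "object", horizontal edges are background.
samples, labels = [], []
for pos in range(2, 6):
    samples.append(orientation_histogram(make_edge_patch(8, pos, True)))
    labels.append(1)
    samples.append(orientation_histogram(make_edge_patch(8, pos, False)))
    labels.append(-1)
weights, bias = train_perceptron(samples, labels)

# Test image: flat region, horizontal edge, vertical edge, side by side (8 x 24).
flat = [[0.0] * 8 for _ in range(8)]
horiz, vert = make_edge_patch(8, 4, False), make_edge_patch(8, 4, True)
image = [flat[y] + horiz[y] + vert[y] for y in range(8)]

# Sliding window with stride 8, so windows align with the three regions.
detections = []
for x0 in range(0, 24 - 8 + 1, 8):
    window = [row[x0:x0 + 8] for row in image]
    desc = orientation_histogram(window)
    if sum(w * v for w, v in zip(weights, desc)) + bias > 0:
        detections.append(x0)
print(detections)
```

This works nicely when a fixed descriptor separates the classes, as it does by construction here. The difficulty with real cats or bicycles is that no single hand-crafted descriptor captures the variation in pose, scale and appearance, which is exactly the part a CNN learns from data.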

Anecdotally, however, I know of a recent commercial/industrial object detection task in a production setting that targeted only a very small number of categories within a single class of objects (i.e., "fine-grained recognition and detection"). Even there, the starting point was a deep detector: RetinaNet, I believe.

As to your comment

"Do you think, then, that a combination of CNNs and image descriptors can be worth pursuing?"

I am not sure that it is worth looking into image descriptors (i.e., hand-crafted feature extraction), since CNNs already do this very well. What you should look into are ways to combine classical techniques with deep learning methods. Much useful knowledge from traditional computer vision has been forgotten and is waiting for rediscovery, I suspect. For example, take the recent "attention" work from NeurIPS 2019 by Ramachandran et al., which obtains state-of-the-art performance with fewer FLOPs and no convolutions at all by taking advantage of so-called self-attention. Yet this approach is essentially a form of learned non-local means and/or parameterized bilateral filtering, both of which are old computer vision techniques. In other words, while feature extraction may be done very well by deep models, I think there is still plenty of room for classical methods in other areas.
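One way to see the connection between self-attention and non-local means is the small numerical sketch below. It is my own illustration, not code from the paper: non-local means averages features with normalized Gaussian similarity weights, and dot-product attention reproduces exactly the same output if the queries and keys are hand-set as shown (in a real attention layer these maps are learned). The feature vectors and the bandwidth `h` are arbitrary assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_average(weights, values):
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


def non_local_means(features, h):
    """Each output is an average over ALL positions, weighted by exp(-||fi - fj||^2 / h^2)."""
    out = []
    for fi in features:
        logits = [-sum((a - b) ** 2 for a, b in zip(fi, fj)) / h ** 2 for fj in features]
        out.append(weighted_average(softmax(logits), features))
    return out


def attention(queries, keys, values):
    """Plain dot-product attention: softmax(q . k) weights applied to the values."""
    return [weighted_average(softmax([dot(q, k) for k in keys]), values) for q in queries]


# Hypothetical per-position feature vectors (e.g. flattened patches) and bandwidth.
features = [[0.0, 1.0], [0.1, 0.9], [2.0, -1.0], [2.1, -1.2]]
h = 0.8

# Hand-set projections: q_i . k_j = (2 x_i . x_j - ||x_j||^2) / h^2, which differs
# from -||x_i - x_j||^2 / h^2 only by a term constant in j, so softmax is unchanged.
queries = [[2 * a / h ** 2 for a in f] + [1 / h ** 2] for f in features]
keys = [f + [-dot(f, f)] for f in features]

nlm = non_local_means(features, h)
att = attention(queries, keys, features)
max_gap = max(abs(a - b) for ra, rb in zip(nlm, att) for a, b in zip(ra, rb))
print(max_gap)
```

The two outputs agree up to floating-point rounding. The learned version simply replaces the fixed Gaussian similarity with query, key and value projections that are trained end to end.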
