There's a lot of information about deep learning but what features should an algorithm have for it to be classified as deep or shallow? Is shallow only related to neural networks?
Solved – the difference between deep and shallow machine learning
algorithms, classification, deep learning, machine learning
Related Solutions
In my opinion: it's both. The idea is referenced many times in the highly cited article on convolutional neural networks, Gradient-Based Learning Applied to Document Recognition, by Yann LeCun, Yoshua Bengio, Leon Bottou and Patrick Haffner.
The idea is that it is quite hard to hand-design a rich and complex feature hierarchy. For low-level features, we see that conv-nets learn edges or color blobs. This makes intuitive sense, and from early computer vision methods we already have some good-quality hand-crafted edge detectors. But how to compose these features into richer and more complex ones is not a simple task to do by hand. Now imagine trying to design a 10-level feature hierarchy.
Instead what you can do is tie the representation learning and classification tasks together, as is done in deep networks. Now we allow the data to drive the feature learning mechanism.
Deep architectures are designed to learn a hierarchy of features from the data, as opposed to ad-hoc hand-crafted features designed by humans. Most importantly, the features are learned with the explicit objective of building a hierarchical feature representation that obtains low error on the loss function measuring the performance of our deep net. A priori, given some hand-crafted features, one does not know how good those features are for the task at hand. In this manner, the desired high performance on the task drives the quality of the learned features, and the two become inextricably linked.
This end-to-end training/classification pipeline has been a big idea when it comes to designing computer vision architectures.
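To make the end-to-end idea concrete, here is a minimal PyTorch sketch (my own illustration, not from the article): the convolutional layers that learn the feature hierarchy and the final classifier form a single model trained against one loss, so the features are shaped directly by the classification objective. The architecture, sizes, and data below are placeholder assumptions.

```python
import torch
import torch.nn as nn

# A small conv net: the feature extractor and the classifier are one model,
# so gradients from the classification loss also update the learned features.
class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(                 # learned feature hierarchy
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # low-level features (edges, blobs)
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # compositions of low-level features
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = SmallConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch of 32x32 RGB images; in practice this comes from a real dataset.
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))

logits = model(images)
loss = loss_fn(logits, labels)   # one loss drives both feature learning and classification
loss.backward()
optimizer.step()
```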
This is a great question and there's actually been some research tackling the capacity/depth issues you mentioned.
There's been a lot of evidence that depth in convolutional neural networks leads to learning richer and more diverse feature hierarchies. Empirically, the best-performing nets tend to be "deep": the Oxford VGG-Net had 19 layers, the Google Inception architecture is deep, the Microsoft Deep Residual Network has a reported 152 layers, and all of these obtained very impressive ImageNet benchmark results.
On the surface, it's a fact that higher-capacity models have a tendency to overfit unless you use some sort of regularizer. One way overfitting hurts very deep networks is that they rapidly reach very low training error within a small number of training epochs, meaning we cannot train the network for many passes through the dataset. A technique like Dropout, a stochastic regularization technique, allows us to train very deep nets for longer periods of time. This in effect allows us to learn better features and improve our classification accuracy, because we get more passes through the training data.
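As a rough sketch of how Dropout is used in practice (the layer sizes and dropout probability below are arbitrary placeholders, not from any specific paper), the dropout layers randomly zero activations during training and are disabled at evaluation time:

```python
import torch
import torch.nn as nn

# A deeper fully connected net with Dropout between layers.
# During training, Dropout randomly zeroes activations (here with probability 0.5),
# which slows down overfitting and lets us train for more epochs.
model = nn.Sequential(
    nn.Linear(784, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 10),
)

model.train()                # Dropout is active in training mode
x = torch.randn(128, 784)    # placeholder batch
train_logits = model(x)

model.eval()                 # Dropout is disabled at evaluation time
with torch.no_grad():
    eval_logits = model(x)
```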
With regards to your first question:
Why can you not just reduce the number of layers / nodes per layer in a deep neural network, and make it work with a smaller amount of data?
If we reduce the training set size, how does that affect generalization performance? A smaller training set may result in learning a less rich distributed feature representation, and this may hurt our ability to generalize. Ultimately, we want to generalize well, and having a larger training set allows us to learn a more diverse distributed feature hierarchy.
With regards to your second question:
Is there a fundamental "minimum number of parameters" that a neural network requires until it "kicks in"? Below a certain number of layers, neural networks do not seem to perform as well as hand-coded features.
Now let's add some nuance to the above discussion about the depth issue. Given the current state of the art, it appears that training a high-performance conv net from scratch requires some sort of deep architecture.
But there's been a string of results focused on model compression. This isn't a direct answer to your question, but it's related. Model compression asks the following question: given a high-performance model (in our case, let's say a deep conv net), can we compress the model, reducing its depth or even its parameter count, and retain the same performance?
We can view the high performance, high capacity conv net as the teacher. Can we use the teacher to train a more compact student model?
Surprisingly, the answer is: yes. There's been a series of results; a good article for the conv net perspective is the one by Rich Caruana and Jimmy Ba, Do Deep Nets Really Need to be Deep?. They are able to train a shallow model to mimic the deeper model with very little loss in performance. There has been more work on this topic as well, and I'm sure I'm missing some other good articles.
To me, these sorts of results call into question how much capacity these shallow models really have. In the Caruana and Ba article, they state the following possibility:
"The results suggest that the strength of deep learning may arise in part from a good match between deep architectures and current training procedures, and that it may be possible to devise better learning algorithms to train more accurate shallow feed-forward nets. For a given number of parameters, depth may make learning easier, but may not always be essential"
It's important to be clear: in the Caruana and Ba article, they are not training a shallow model from scratch, i.e. from just the class labels, to obtain state-of-the-art performance. Rather, they train a high-performance deep model, and from this model they extract the log probabilities for each datapoint. A shallow model is then trained to predict these log probabilities. So the shallow model is not trained on the class labels, but rather on these log probabilities.
Nonetheless, it's still quite an interesting result. While this doesn't provide a direct answer to your question, there are some interesting ideas here that are very relevant.
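For concreteness, here is a rough PyTorch sketch of that mimic-training setup. It follows the description above (regress the shallow student's outputs onto the deep teacher's outputs rather than training on class labels), but the specific architectures, loss, and data below are placeholder assumptions, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

# Assume `teacher` is an already-trained, high-performance deep model
# (here it is just a randomly initialized placeholder).
teacher = nn.Sequential(
    nn.Linear(784, 1200), nn.ReLU(),
    nn.Linear(1200, 1200), nn.ReLU(),
    nn.Linear(1200, 10),
)
teacher.eval()

# A shallow student: a single (wide) hidden layer.
student = nn.Sequential(
    nn.Linear(784, 4000), nn.ReLU(),
    nn.Linear(4000, 10),
)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mimic_loss = nn.MSELoss()        # regress the student's outputs onto the teacher's

x = torch.randn(256, 784)        # placeholder batch of inputs

with torch.no_grad():
    teacher_out = teacher(x)     # targets come from the teacher, not from class labels

student_out = student(x)
loss = mimic_loss(student_out, teacher_out)
loss.backward()
optimizer.step()
```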
Fundamentally: it's always important to remember that there is a difference between the theoretical "capacity" of a model and finding a good configuration of your model. The latter depends on your optimization methods.
Best Answer
Term "deep" is coming from Neural Networks domain. This term is "soft", and doesn't have exactly unified definition.
The first NN models similar to present ones date back to 1986, when the work of Rumelhart, Hinton and Williams popularized the backpropagation method and showed that NNs can approximate non-linear functions. In the 90s NNs were used, but they were frequently outperformed by other models, like SVMs. Researchers recognized that NNs could not handle very complex function approximation: adding additional layers led to a learning problem known as the "vanishing gradient problem" (Hochreiter, 1991). A lack of computational power and better results from strongly math-based models (probabilistic models, SVMs, etc.) put NNs out to pasture. However, quality improvements in the classic models were very slow.
In 2006 Hinton showed NNs with a higher number of layers, trained in a different way. Adding more layers made the NNs deeper. More powerful CPUs, and then new GPUs, made it possible to show that deep NNs outperform classic models. Additionally, the DNN solution was often simpler and more flexible, e.g. CNNs (first proposed in 1998) versus classic image processing. This encouraged researchers to focus on DNNs and to provide additional solutions to the vanishing gradient problem, such as using the ReLU activation function. The quality improvement from DNNs was so big that it is often seen as a revolution, and the term DL was born.
In original term "deep" means that NN contains more than 1 hidden layer. However often is identified with any present complex high quality solution.
The effects of learning in CNNs were illustrated by generating pictures of the layers' outputs, which showed that higher layers correspond to higher-level features: for example, identifying edges in pictures (low level) and identifying how edges form the shape of a face (high level). The main point is that the CNN achieves this "characterization" of features in an unsupervised way. This is the reason many researchers expect the term Deep Learning to correspond to models with the ability to achieve unsupervised hierarchical feature extraction. However, deep learning is usually identified with DNNs, which is wrong in my opinion. The nature of NNs is a flexibility that makes it easy to build deep models, but that is not a good reason to identify DL only with DNNs.
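As a rough, hedged sketch of how such layer-output pictures can be produced (the model and the chosen layers are placeholders, not a specific published setup), one can register forward hooks in PyTorch and inspect the captured feature maps:

```python
import torch
import torch.nn as nn

# Placeholder conv net; in practice this would be a trained model.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # lower layer: edge/blob-like features
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # higher layer: compositions of them
)

captured = {}

def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Capture the outputs of the first and second conv layers.
model[0].register_forward_hook(save_output("low_level"))
model[2].register_forward_hook(save_output("high_level"))

image = torch.randn(1, 3, 64, 64)   # placeholder input image
model(image)

# Each captured tensor has shape (1, channels, H, W); each channel can be
# rendered as a grayscale picture to inspect what that layer responds to.
for name, fmap in captured.items():
    print(name, fmap.shape)
```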