First, note that "deep learning" is a buzzword: there is no consensus on a formal definition in the research community, so debating the term does not lead anywhere. It's just a word.
That being said, convolutional nets are deep because they rely on multiple layers of feature extraction, as you said. They extract features from the input to predict an outcome.
What you refer to is a "generative" approach, i.e. the features are used to create the observation (a picture, not a class label). That is what made deep learning popular, but it is in no way limited to that.
This is a great question and there's actually been some research tackling the capacity/depth issues you mentioned.
There's been a lot of evidence that depth in convolutional neural networks has led to learning richer and more diverse feature hierarchies. Empirically we see the best performing nets tend to be "deep": the Oxford VGG-Net had 19 layers, the Google Inception architecture is deep, the Microsoft Deep Residual Network has a reported 152 layers, and these all are obtaining very impressive ImageNet benchmark results.
It is true that higher-capacity models tend to overfit unless you use some sort of regularizer. One way overfitting hurts very deep networks is that they rapidly reach very low training error within a small number of training epochs, i.e. we cannot train the network for a large number of passes through the dataset. A technique like Dropout, a stochastic regularization technique, allows us to train very deep nets for longer. This in effect lets us learn better features and improve classification accuracy, because we get more passes through the training data.
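To make the idea concrete, here is a minimal sketch of inverted dropout in plain Python. This is a toy illustration of the technique, not any particular library's implementation; the function name and signature are made up for this example:

```python
import random

def dropout(activations, p=0.5, train=True, rng=random):
    """Inverted dropout: during training, zero each unit with
    probability p and scale survivors by 1/(1-p) so the expected
    activation is unchanged. At test time, pass values through."""
    if not train or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]
```

At each training step a different random subset of units is silenced, which discourages units from co-adapting and acts as a regularizer, letting us keep training for more epochs before overfitting sets in.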
With regards to your first question:
Why can you not just reduce the number of layers / nodes per layer in a deep neural network, and make it work with a smaller amount of data?
If we reduce the training set size, how does that affect generalization performance? A smaller training set may result in learning a smaller distributed feature representation, which can hurt our ability to generalize, and generalizing well is ultimately what we want. A larger training set allows us to learn a more diverse distributed feature hierarchy.
With regards to your second question:
Is there a fundamental "minimum number of parameters" that a neural network requires until it "kicks in"? Below a certain number of layers, neural networks do not seem to perform as well as hand-coded features.
Now let's add some nuance to the discussion of depth above. Given the current state of the art, training a high-performance conv net from scratch appears to require some sort of deep architecture.
But there has been a string of results focused on model compression. This isn't a direct answer to your question, but it is related. Model compression asks the following question: given a high-performance model (in our case, say, a deep conv net), can we compress it, reducing its depth or even its parameter count, while retaining the same performance?
We can view the high performance, high capacity conv net as the teacher. Can we use the teacher to train a more compact student model?
Surprisingly, the answer is yes. There has been a series of results; a good article from the conv net perspective is Do Deep Nets Really Need to be Deep? by Rich Caruana and Jimmy Ba. They are able to train a shallow model to mimic the deeper model with very little loss in performance. There has been more work on this topic as well; I'm sure I'm missing some other good articles.
To me, these sorts of results call into question how much capacity shallow models really have. In the Caruana and Ba article, they state the following possibility:
"The results suggest that the strength of deep learning may arise in part from a good match between deep architectures and current training procedures, and that it may be possible to devise better learning algorithms to train more accurate shallow feed-forward nets. For a given number of parameters, depth may make learning easier, but may not always be essential."
It's important to be clear: in the Caruana and Ba article, they are not training a shallow model from scratch, i.e. from just the class labels, to obtain state-of-the-art performance. Rather, they first train a high-performance deep model, and from that model they extract log probabilities for each datapoint. A shallow model is then trained to predict these log probabilities, so the shallow model learns from the teacher's log probabilities rather than from the class labels.
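A minimal toy sketch of this mimic-learning setup: instead of hard labels, the student regresses the teacher's logits. Here the "student" is just a one-dimensional linear model fit by gradient descent, purely for illustration (Caruana and Ba's students are shallow-but-wide neural nets, and the function name below is made up):

```python
def train_student_on_logits(inputs, teacher_logits, lr=0.1, epochs=200):
    """Fit a linear 'student' y = w*x + b to the teacher's logits by
    minimizing mean squared error with plain gradient descent.
    Toy sketch of training on soft targets rather than class labels."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, t in zip(inputs, teacher_logits):
            err = (w * x + b) - t      # residual against teacher logit
            gw += 2.0 * err * x / n
            gb += 2.0 * err / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

The key point the sketch captures is the loss target: the student never sees class labels, only the real-valued logits the teacher produced, which carry more information per example than a hard label does.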
Nonetheless, it's still quite an interesting result. While this doesn't provide a direct answer to your question, there are some interesting ideas here that are very relevant.
Fundamentally: it's always important to remember that there is a difference between the theoretical "capacity" of a model and finding a good configuration of your model. The latter depends on your optimization methods.
Best Answer
In my opinion: it's both. The idea is referenced many times in the highly cited article on convolutional neural networks, Gradient-Based Learning Applied to Document Recognition, by Yann LeCun, Yoshua Bengio, Leon Bottou and Patrick Haffner.
The idea is that it is quite hard to hand-design a rich and complex feature hierarchy. For low level features, we see that conv-nets learn edges or color blobs. This makes intuitive sense and from early computer vision methods, we have some good quality hand-crafted edge feature detectors. But how to compose these features to form richer and more complex features is not a simple task to do by hand. And now imagine trying to design a 10-level feature hierarchy.
Instead what you can do is tie the representation learning and classification tasks together, as is done in deep networks. Now we allow the data to drive the feature learning mechanism.
Deep architectures are designed to learn a hierarchy of features from the data, as opposed to ad-hoc hand-crafted features designed by humans. Most importantly, the features are learned with the explicit objective of obtaining low error on a given loss function that measures the performance of our deep net. A priori, given some hand-crafted features, one does not know how good those features are for the task at hand. In this manner, the desired high performance on the task drives the quality of the learned features, and the two become inextricably linked.
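The mechanism can be sketched end to end in a few lines of plain Python: a toy one-hidden-layer net in which the hidden-layer weights (the learned "features") receive gradients from the very same classification loss as the output layer. This is an illustrative sketch under simplified assumptions (squared-error loss, per-example SGD), not any particular published architecture:

```python
import math
import random

def train_tiny_net(data, hidden=2, lr=0.5, epochs=2000, seed=0):
    """Train a toy net (2 inputs -> tanh hidden layer -> sigmoid output)
    end to end: the hidden 'feature' weights w, b are updated by
    backpropagating the same loss that scores the final prediction.
    Returns (initial_loss, final_loss) as mean squared error."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden)]
    b = [0.0] * hidden
    v = [rng.uniform(-1, 1) for _ in range(hidden)]   # output weights
    c = 0.0                                           # output bias

    def forward(x):
        h = [math.tanh(w[j][0] * x[0] + w[j][1] * x[1] + b[j])
             for j in range(hidden)]                  # learned features
        z = sum(v[j] * h[j] for j in range(hidden)) + c
        return h, 1.0 / (1.0 + math.exp(-z))          # sigmoid output

    def mean_loss():
        return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

    initial = mean_loss()
    for _ in range(epochs):
        for x, y in data:                             # per-example SGD
            h, o = forward(x)
            go = 2.0 * (o - y) * o * (1.0 - o)        # grad at output pre-activation
            for j in range(hidden):
                gh = go * v[j] * (1.0 - h[j] ** 2)    # backprop into feature j
                v[j] -= lr * go * h[j]
                w[j][0] -= lr * gh * x[0]
                w[j][1] -= lr * gh * x[1]
                b[j] -= lr * gh
            c -= lr * go
    return initial, mean_loss()
```

Nothing about the hidden units is specified by hand; what each unit comes to detect is driven entirely by the gradient of the task loss, which is the point being made above about data-driven feature learning.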
This end-to-end training/classification pipeline has been a big idea when it comes to designing computer vision architectures.