Solved – Why don’t people use deeper RBFs or RBF in combination with MLP

machine-learning, neural-networks, rbf-network

So, looking at radial basis function (RBF) neural networks, I've noticed that people only ever recommend using a single hidden layer, whereas with multilayer perceptron networks more layers are considered better.

Given that RBF networks can be trained with a version of backpropagation, is there any reason why deeper RBF networks wouldn't work, or why an RBF layer couldn't be used as the penultimate or first layer in a deep MLP network? (I was thinking of the penultimate layer, so it would essentially be trained on the features learned by the previous MLP layers.)
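To make the idea concrete, here is a minimal PyTorch-style sketch of what I have in mind: a Gaussian RBF layer dropped in as the penultimate layer of an otherwise ordinary MLP. The `RBFLayer` class, the layer sizes, and the initialisation are just my own illustrative assumptions, not an established design.

```python
import torch
import torch.nn as nn

class RBFLayer(nn.Module):
    """Gaussian RBF units: phi_j(x) = exp(-||x - c_j||^2 / (2 * s_j^2))."""
    def __init__(self, in_features, n_centers):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_features))
        self.log_sigmas = nn.Parameter(torch.zeros(n_centers))

    def forward(self, x):
        # x: (batch, in_features) -> squared distance to each center: (batch, n_centers)
        diff = x.unsqueeze(1) - self.centers.unsqueeze(0)
        dist2 = (diff ** 2).sum(-1)
        return torch.exp(-dist2 / (2 * torch.exp(self.log_sigmas) ** 2))

# RBF units sitting on top of features learned by ordinary MLP layers
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    RBFLayer(16, 32),        # penultimate RBF layer on the 16-d learned features
    nn.Linear(32, 1),
)

x = torch.randn(8, 20)
print(model(x).shape)  # torch.Size([8, 1])
```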

Best Answer

The fundamental problem is that RBFs a) are too nonlinear and b) do not perform dimension reduction.

Because of a), RBFs were always trained by k-means rather than gradient descent.
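For context, a minimal sketch of that classic two-stage procedure: k-means to place the centers, then a linear least-squares solve for the output weights. It assumes Gaussian basis functions, scikit-learn's `KMeans`, and a ridge-regularised output layer; the `fit_rbf_network` name, the fixed width, and the toy data are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_network(X, y, n_centers=10, width=1.0, ridge=1e-6):
    """Two-stage RBF training: k-means for the centers,
    then ridge-regularised least squares for the output weights."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    # Design matrix of Gaussian activations, shape (n_samples, n_centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * width ** 2))
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_centers), Phi.T @ y)
    return centers, w

# toy 1-D regression example
X = np.random.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()
centers, w = fit_rbf_network(X, y, n_centers=15)
```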

I would claim that the main success of deep NNs is conv nets, where one of the key parts is dimension reduction: although the input might be, say, 128x128x3 ≈ 50,000 values, each neuron has a restricted receptive field, and there are far fewer neurons in each successive layer. In an MLP, each neuron in a given layer represents a feature/dimension, so you are constantly reducing dimensionality in going from layer to layer.
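As a rough illustration of that shrinking representation (the particular channel counts and pooling choices below are arbitrary, just to show restricted receptive fields plus pooling reducing the size of the representation):

```python
import torch
import torch.nn as nn

# A 128x128x3 image has ~49k input values, but each conv unit only sees a
# small receptive field, and pooling shrinks the representation layer by layer.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16 x 64 x 64
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 32 x 32 x 32
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 64 x 16 x 16
)

x = torch.randn(1, 3, 128, 128)   # ~49k input values
out = net(x)
print(out.shape)                  # torch.Size([1, 64, 16, 16]), ~16k values
```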

Although one could make the RBF covariance matrix adaptive and thereby do dimension reduction, that makes the network even harder to train.
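A sketch of what an adaptive covariance could look like: each RBF unit gets a learnable linear map applied before the distance, i.e. a Mahalanobis-style Gaussian. The `MahalanobisRBFLayer` class, the low-rank parameterisation, and the sizes are my own illustration; the point is the extra `n_centers * proj_dim * in_features` parameters that now have to be trained by gradient descent.

```python
import torch
import torch.nn as nn

class MahalanobisRBFLayer(nn.Module):
    """RBF units with a learnable per-unit projection:
    phi_j(x) = exp(-||A_j (x - c_j)||^2).
    The projection A_j can reduce dimension, but it adds
    n_centers * proj_dim * in_features parameters to learn."""
    def __init__(self, in_features, n_centers, proj_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_features))
        self.A = nn.Parameter(torch.randn(n_centers, proj_dim, in_features) * 0.1)

    def forward(self, x):
        diff = x.unsqueeze(1) - self.centers.unsqueeze(0)        # (batch, n_centers, in)
        proj = torch.einsum('nki,bni->bnk', self.A, diff)        # (batch, n_centers, proj_dim)
        return torch.exp(-(proj ** 2).sum(-1))                   # (batch, n_centers)

layer = MahalanobisRBFLayer(in_features=16, n_centers=8, proj_dim=4)
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 8])
```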