Solved – Can a deep neural network approximate the multiplication function without normalization?

deep learning, machine learning, neural networks, regression

Let's say we want to do regression for the simple function f = x * y using a standard deep neural network.

I remember there is research (the universal approximation theorem) showing that a NN with one hidden layer can approximate any function, but when I tried it, without normalization the NN was unable to approximate even this simple multiplication. Only log-normalization of the data helped: m = x*y => ln(m) = ln(x) + ln(y).
But that looks like a cheat. Can a NN do this without log-normalization? The answer is (to me) obviously yes, so the question is rather: what type/configuration/layout should such a NN have?
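For illustration, here is a minimal sketch of the log trick from the question. In log space the target is an *exact linear* function of the inputs, so even a plain least-squares fit recovers it; the data range, sample size, and the use of `numpy.linalg.lstsq` as a stand-in for a trivially-linear "network" are my assumptions, not part of the original post:

```python
import numpy as np

rng = np.random.default_rng(1)

# m = x*y  =>  ln(m) = ln(x) + ln(y): linear in the log-transformed inputs.
x = rng.uniform(1, 1000, size=200)
y = rng.uniform(1, 1000, size=200)
m = x * y

# Fit a linear model on [ln(x), ln(y), 1] against ln(m).
A = np.column_stack([np.log(x), np.log(y), np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, np.log(m), rcond=None)
# Recovered weights are w ≈ [1, 1, 0], matching ln(m) = 1·ln(x) + 1·ln(y).

# Undo the log to get the product back.
pred = np.exp(A @ w)
```

A real network trained on the log-transformed data converges just as easily, for the same reason: it only has to learn an addition.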

Best Answer

The large gradients produced by an unnormalized multiplication target most likely force the net almost immediately into a state where all of its hidden nodes have zero gradient (for example, saturated or dead units — an artifact of how neural networks are implemented and trained). We can use two approaches:

  1. Divide by a constant. We simply divide everything by a constant c before learning and multiply back after. Note that for multiplication the output scale factor is c², since (x/c)·(y/c) = xy/c².

  2. Use log-normalization. It makes multiplication into addition:

    \begin{align} m &= x \cdot y\\ &\Rightarrow \\ \ln(m) &= \ln(x) + \ln(y) \end{align}
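As a rough sketch of approach 1, here is a tiny one-hidden-layer numpy MLP trained on data divided by a constant c, with the prediction multiplied back by c² afterwards (since (x/c)·(y/c) = xy/c²). The value of c, the network size, and the training hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs in [0, 100]: raw targets reach 10^4, which tends to destabilize
# training. c is the scaling constant (an assumed value for illustration).
c = 100.0
X = rng.uniform(0, 100, size=(512, 2))
y = X[:, 0] * X[:, 1]

# Approach 1: divide everything before learning...
Xs = X / c          # inputs now in [0, 1]
ys = y / c**2       # targets now in [0, 1]

# One hidden layer, tanh activation, plain full-batch gradient descent.
H = 32
W1 = rng.normal(0, 0.5, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, size=(H, 1)); b2 = np.zeros(1)
lr = 0.1

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, (h @ W2 + b2).ravel()

_, pred0 = forward(Xs)
loss0 = np.mean((pred0 - ys) ** 2)   # loss before training

for _ in range(3000):
    h, pred = forward(Xs)
    err = (pred - ys)[:, None]            # (N, 1) residuals
    gW2 = h.T @ err / len(Xs); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h**2)        # backprop through tanh
    gW1 = Xs.T @ dh / len(Xs); gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(Xs)
loss = np.mean((pred - ys) ** 2)      # loss after training

# ...and multiply back after: network output * c^2 approximates x * y.
approx = pred * c**2
```

On the scaled data the loss drops steadily, whereas the same net trained on the raw range tends to get stuck; the rescaling keeps activations and gradients in a range the network can handle.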