Solved – If each neuron in a neural network is basically a logistic regression function, why are multiple layers better?

logistic, neural networks

I'm going through Coursera's deeplearning.ai course (Week 3, video 1, "Neural Networks Overview"), where Andrew Ng explains how each layer in a neural network is just another logistic regression, but he doesn't explain why this makes things more accurate.

So in a 2-layer network, how does computing logistic regression multiple times make it more accurate?

Best Answer

When using logistic activation functions, it's true that the function relating the inputs of each unit to its output is the same as for logistic regression. But, this isn't really the same as each unit performing logistic regression. The difference is that, in logistic regression, the weights and bias are chosen such that the output best matches given target values (using the log/cross-entropy loss). In contrast, hidden units in a neural net send their outputs to downstream units. There is no target output to match for individual hidden units. Rather, the weights and biases are chosen to minimize some objective function that depends on the final output of the network.
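
To make that concrete, here is a minimal NumPy sketch (variable names, sizes, and data are purely illustrative) of a 2-layer network with sigmoid units. Notice that the gradient for the hidden-layer weights is obtained by back-propagating the final cross-entropy loss through the output layer; no hidden unit ever sees a target of its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 examples, 3 input features (made up)
y = rng.integers(0, 2, size=(8, 1))   # binary targets for the *network output* only

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer

# Forward pass
H = sigmoid(X @ W1 + b1)        # hidden activations
p = sigmoid(H @ W2 + b2)        # predicted probability of class 1
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # single final objective

# Backward pass (chain rule)
d_out = (p - y) / len(X)        # gradient w.r.t. the output pre-activation
dW2 = H.T @ d_out               # output weights: like logistic regression on H
dH = d_out @ W2.T               # gradient reaching the hidden layer...
d_hid = dH * H * (1 - H)        # ...through the sigmoid derivative
dW1 = X.T @ d_hid               # hidden weights are driven only by the final loss
```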

Rather than performing logistic regression, it might make more sense to think of each hidden unit as computing a coordinate in some feature space. From this perspective, the purpose of a hidden layer is to transform its input--the input vector is mapped to a vector of hidden layer activations. You can think of this as mapping the input into a feature space with a dimension corresponding to each hidden unit.
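
As a tiny sketch of that view (NumPy, made-up shapes), the hidden layer is just a function phi that re-expresses the input in a space with one coordinate per hidden unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x, W1, b1):
    """Hidden layer viewed as a feature-space mapping."""
    return sigmoid(W1 @ x + b1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # 5 hidden units, 3 raw inputs
x = rng.normal(size=3)
print(phi(x, W1, b1).shape)   # (5,) -- the input re-expressed in feature space
```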

The output layer can often be thought of as a standard learning algorithm that operates in this feature space. For example, in a classification task, using a logistic output unit with cross entropy loss is equivalent to performing logistic regression in feature space (or multinomial logistic regression if using softmax outputs). In a regression task, using a linear output with squared error is equivalent to performing least squares linear regression in feature space.
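
Here is a hedged sketch of that equivalence using scikit-learn (the feature map below is a fixed, random hidden layer, purely for illustration, not a trained one): with the hidden layer frozen, fitting a logistic output unit under cross-entropy loss is exactly logistic regression on the hidden activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # raw inputs (synthetic)
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # some nonlinear labelling rule

W1, b1 = rng.normal(size=(3, 10)), np.zeros(10)  # fixed hidden layer (10 units)
H = sigmoid(X @ W1 + b1)                         # inputs mapped into feature space

# "Output layer" = logistic regression fit in feature space
clf = LogisticRegression(max_iter=1000).fit(H, y)
print(clf.score(H, y))   # accuracy of the feature-space fit (illustrative only,
                         # since this feature map is random rather than learned)
```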

Training the network amounts to learning the feature space mapping and classification/regression function (in feature space) that, together, give the best performance. Assuming nonlinear hidden units, increasing the width of the hidden layer or stacking multiple hidden layers permits more complex feature space mappings, thereby allowing more complex functions to be fit.
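
For a concrete (illustrative, with assumed hyperparameters) demonstration of that capacity gain: XOR is not linearly separable, so a single logistic regression unit cannot fit it, but a small network with a nonlinear hidden layer trained end to end typically can.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # hidden layer, 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # output layer

lr = 1.0                                          # assumed learning rate
for _ in range(5000):
    # Forward: feature mapping, then "logistic regression" on the features
    H = sigmoid(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2)
    # Backward: every gradient flows from the single final cross-entropy loss
    d_out = (p - y) / len(X)
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 3))   # typically approaches [0, 1, 1, 0];
                        # a single logistic unit on the raw inputs cannot do this
```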