Solved – If each neuron in a neural network is basically a logistic regression function, why are multiple layers better?

logistic, neural networks

I'm going through Coursera's deeplearning.ai course (Week 3, video 1, "Neural Networks Overview"), where Andrew Ng explains how each layer in a neural network is just another logistic regression, but he doesn't explain why this makes things more accurate.

So in a 2-layer network, how does computing logistic regression multiple times make it more accurate?

Best Answer

When using logistic activation functions, it's true that the function relating the inputs of each unit to its output is the same as for logistic regression. But, this isn't really the same as each unit performing logistic regression. The difference is that, in logistic regression, the weights and bias are chosen such that the output best matches given target values (using the log/cross-entropy loss). In contrast, hidden units in a neural net send their outputs to downstream units. There is no target output to match for individual hidden units. Rather, the weights and biases are chosen to minimize some objective function that depends on the final output of the network.
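
To make that concrete, here is a minimal NumPy sketch (variable names, sizes, and data are purely illustrative) of a 2-layer network with sigmoid units. Notice that the gradient for the hidden-layer weights is obtained by back-propagating the final cross-entropy loss through the output layer; no hidden unit ever sees a target of its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 examples, 3 input features (made up)
y = rng.integers(0, 2, size=(8, 1))   # binary targets for the *network output* only

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer

# Forward pass
H = sigmoid(X @ W1 + b1)        # hidden activations
p = sigmoid(H @ W2 + b2)        # predicted probability of class 1
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # single final objective

# Backward pass (chain rule)
d_out = (p - y) / len(X)        # gradient w.r.t. the output pre-activation
dW2 = H.T @ d_out               # output weights: like logistic regression on H
dH = d_out @ W2.T               # gradient reaching the hidden layer...
d_hid = dH * H * (1 - H)        # ...through the sigmoid derivative
dW1 = X.T @ d_hid               # hidden weights are driven only by the final loss
```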

Rather than performing logistic regression, it might make more sense to think of each hidden unit as computing a coordinate in some feature space. From this perspective, the purpose of a hidden layer is to transform its input--the input vector is mapped to a vector of hidden layer activations. You can think of this as mapping the input into a feature space with a dimension corresponding to each hidden unit.
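
As a tiny sketch of that view (NumPy, made-up shapes), the hidden layer is just a function phi that re-expresses the input in a space with one coordinate per hidden unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x, W1, b1):
    """Hidden layer viewed as a feature-space mapping."""
    return sigmoid(W1 @ x + b1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # 5 hidden units, 3 raw inputs
x = rng.normal(size=3)
print(phi(x, W1, b1).shape)   # (5,) -- the input re-expressed in feature space
```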

The output layer can often be thought of as a standard learning algorithm that operates in this feature space. For example, in a classification task, using a logistic output unit with cross entropy loss is equivalent to performing logistic regression in feature space (or multinomial logistic regression if using softmax outputs). In a regression task, using a linear output with squared error is equivalent to performing least squares linear regression in feature space.
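
Here is a hedged sketch of that equivalence using scikit-learn (the feature map below is a fixed, random hidden layer, purely for illustration, not a trained one): with the hidden layer frozen, fitting a logistic output unit under cross-entropy loss is exactly logistic regression on the hidden activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # raw inputs (synthetic)
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # some nonlinear labelling rule

W1, b1 = rng.normal(size=(3, 10)), np.zeros(10)  # fixed hidden layer (10 units)
H = sigmoid(X @ W1 + b1)                         # inputs mapped into feature space

# "Output layer" = logistic regression fit in feature space
clf = LogisticRegression(max_iter=1000).fit(H, y)
print(clf.score(H, y))   # accuracy of the feature-space fit (illustrative only,
                         # since this feature map is random rather than learned)
```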

Training the network amounts to learning the feature space mapping and classification/regression function (in feature space) that, together, give the best performance. Assuming nonlinear hidden units, increasing the width of the hidden layer or stacking multiple hidden layers permits more complex feature space mappings, thereby allowing more complex functions to be fit.
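
For a concrete (illustrative, with assumed hyperparameters) demonstration of that capacity gain: XOR is not linearly separable, so a single logistic regression unit cannot fit it, but a small network with a nonlinear hidden layer trained end to end typically can.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # hidden layer, 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # output layer

lr = 1.0                                          # assumed learning rate
for _ in range(5000):
    # Forward: feature mapping, then "logistic regression" on the features
    H = sigmoid(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2)
    # Backward: every gradient flows from the single final cross-entropy loss
    d_out = (p - y) / len(X)
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 3))   # typically approaches [0, 1, 1, 0];
                        # a single logistic unit on the raw inputs cannot do this
```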