[Math] Neural Networks – Why Use Derivatives?

Tags: derivatives, neural-networks

Good Day

I am trying to get an understanding of neural networks. I have gone through a few web sites and came to know the following:

1) One of the main objectives of a neural network is to "predict" based on data.
2) To predict:
a. Train the network with known data.
b. Adjust the weights based on the difference between the "Target Output" and the "Calculated Output".
c. To do that, we use derivatives and partial derivatives (the chain rule, etc.).

I can understand the overall concept of a neural network:
a) I can also understand that a "derivative" is nothing but the rate of change of one quantity with respect to another (at a given point).
b) A partial derivative is the rate of change of one quantity with respect to another, holding the other quantities fixed, when more than two factors are in the equation.

The points that I canNOT relate to or understand clearly are:
a) Why should we use derivatives in a neural network, and how exactly do they help?
b) Why should we use an activation function, which in most cases is the sigmoid function?
c) I could not get a complete picture of how derivatives help a neural network.

Can you please help me understand the complete picture? If possible, try not to use mathematical terms, so that it will be easy for me to grasp.

Thanks,
Satheesh

Best Answer

As you said: "A partial derivative is the rate of change of one quantity with respect to another, holding the other quantities fixed, when more than two factors are in the equation."

This means that we can measure the rate of change of the output error with respect to the network weights. If we know how the error changes with respect to the weights, we can change those weights in a direction that decreases the error. As @user1952009 said, this is just gradient descent. Neural networks combine it with the chain rule so that non-output layers can be updated as well.
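To make that concrete, here is a minimal sketch (my own illustration, not from the original answer) of gradient descent on a single sigmoid neuron with a squared-error loss. The data, learning rate, and variable names are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): predict y from x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = 0.1, 0.0          # initial weight and bias
lr = 0.5                 # learning rate (step size), chosen arbitrarily

for step in range(1000):
    z = w * x + b        # pre-activation
    out = sigmoid(z)     # calculated output
    err = out - y        # calculated output minus target output

    # Chain rule: d(err^2)/dw = 2*err * sigmoid'(z) * x,
    # where sigmoid'(z) = out * (1 - out).
    grad_w = np.mean(2 * err * out * (1 - out) * x)
    grad_b = np.mean(2 * err * out * (1 - out))

    # Gradient descent: step opposite to the gradient,
    # i.e. in the direction that decreases the error.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)              # weights after training
```

Every update here uses exactly the derivative described above: the gradient tells us which direction increases the error, and we step the other way.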

Regarding the sigmoid activation, it has two uses: 1) to bound the neuron output; 2) to introduce nonlinearities into the network. The second point is essential for a neural network to solve problems that are not solvable by simple linear/logistic regression. If neurons didn't have nonlinear activation functions, you could rewrite your entire network as a single layer, which is not as useful. For instance, consider a 2-layer neural network. Its output would be $y = W_o(W_i\mathbf{x})$ ($W_i$ = input weights, $W_o$ = output weights, $\mathbf{x}$ = input), which can be rewritten as $y = (W_oW_i)\mathbf{x}$. Letting $W = W_oW_i$ leaves us with a single-layer neural network, $y = W\mathbf{x}$.
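You can verify this collapse numerically. The following sketch (again my own illustration; the shapes and values are arbitrary assumptions) shows that two purely linear layers are equivalent to one, and that inserting a sigmoid between them breaks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W_i = rng.standard_normal((4, 3))   # input-layer weights
W_o = rng.standard_normal((2, 4))   # output-layer weights
x = rng.standard_normal(3)          # input vector

two_layers = W_o @ (W_i @ x)        # y = W_o (W_i x), no nonlinearity
one_layer = (W_o @ W_i) @ x         # y = (W_o W_i) x = W x

print(np.allclose(two_layers, one_layer))  # True: the layers collapse

# With a nonlinearity between the layers, the collapse fails:
sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(np.allclose(W_o @ sigmoid(W_i @ x), one_layer))  # False in general
```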
