Stuck on understanding the change in output w.r.t. weights and bias in neural nets

calculus, machine learning, neural networks

I'm reading an online book on neural networks, which mentions that the sigmoid activation function is used instead of the perceptron because, with the sigmoid, a small change in the weights and bias produces only a small change in the output.

This property of the sigmoid is useful when we want to reach a desired output by gradually adjusting the weights and bias (i.e., to learn features).

I understand this concept intuitively, but mathematically the change in output is given by the following equation, which I don't understand:

$$\Delta\,\text{output} \approx \sum_j \frac{\partial\,\text{output}}{\partial w_j}\,\Delta w_j + \frac{\partial\,\text{output}}{\partial b}\,\Delta b$$

Is there a general rule for computing this approximate change? The equation for the actual output is:

$$\text{output} = w \cdot x + b$$

Best Answer

Firstly, as a small aside, the output of a sigmoid is given by $$\frac{1}{1 + \exp(-\sum_j w_j x_j - b)},$$ where the $x_j$ are inputs, $w_j$ are weights, and $b$ is a bias (roughly defining the threshold). It may be helpful to know that the mathematical world refers to such a function as a logistic curve.
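If it helps to make this concrete, here is a minimal Python sketch of that formula. The particular values for the $x_j$, $w_j$, and $b$ are arbitrary choices for illustration, not taken from the book.

```python
import numpy as np

def sigmoid_output(x, w, b):
    """Sigmoid (logistic) output: 1 / (1 + exp(-(sum_j w_j x_j + b)))."""
    z = np.dot(w, x) + b               # weighted input plus bias
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative values (not from the book)
x = np.array([0.5, -1.2, 3.0])         # inputs x_j
w = np.array([0.8, 0.1, -0.4])         # weights w_j
b = 0.2                                # bias

print(sigmoid_output(x, w, b))         # a value strictly between 0 and 1
```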

The exact formula isn't important for this application, though. The important thing is that the sigmoid is a differentiable function of $\vec{x}$, $\vec{w}$, and $b$, and there is a very general way to approximate differentiable functions.

Given a continuously differentiable function $f(x)$, one can approximate $f(x_1)$ from the value $f(x_0)$ and the derivative $f'(x_0)$, via $$ f(x_1) \approx f(x_0) + (x_1 - x_0) f'(x_0). \tag{1}$$ Approximations of this form are ubiquitous, and they have many names. One might call this a linear approximation, or a one-dimensional Taylor approximation (in terms of Taylor polynomials). Or one might say that it follows immediately from the definition of the derivative, since $$ \lim_{x_1 \to x_0}\frac{f(x_1) - f(x_0)}{x_1 - x_0} = f'(x_0),$$ and thus for $x_1 \approx x_0$, the approximation in $(1)$ is a good one. The point is that one can use linear approximations to approximate differentiable functions (and indeed, to be "differentiable" is essentially to have a good linear approximation at every point).
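As a quick numerical sanity check (my own addition, with the logistic function standing in for $f$ and an arbitrary base point), here is a small Python sketch of $(1)$:

```python
import numpy as np

# Numerically check f(x1) ~ f(x0) + (x1 - x0) * f'(x0) for a concrete f.
# The logistic function is used here purely as an example of a differentiable f.
f  = lambda x: 1.0 / (1.0 + np.exp(-x))
df = lambda x: f(x) * (1.0 - f(x))     # standard derivative of the logistic function

x0, dx = 0.3, 0.01                     # arbitrary base point and small step
x1 = x0 + dx

exact  = f(x1)
approx = f(x0) + dx * df(x0)
print(exact, approx, abs(exact - approx))   # the gap shrinks as dx shrinks
```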

Let's now apply this idea in more variables, and look at a function $f(x,w,b)$, where each of $x$, $w$, and $b$ is a real number. In neural-network terms, this is a one-input, one-weight example.

In terms of varying $w$ and $b$ (which reflects the dependence of the function on the weights and bias, for fixed inputs), the linear approximation to $f$ is given by $$ f(x, w_1, b_1) \approx f(x, w_0, b_0) + \frac{\partial}{\partial w} f(x, w_0, b_0) [w_1 - w_0] + \frac{\partial}{\partial b} f(x, w_0, b_0) [b_1 - b_0].$$ If this is a new idea, you can think of it as a naive linear approximation, or a two-dimensional Taylor approximation, or essentially the definition of the derivative. Rearranging, we get that $$ f(x, w_1, b_1) - f(x, w_0, b_0) \approx \frac{\partial}{\partial w} f(x, w_0, b_0) [w_1 - w_0] + \frac{\partial}{\partial b} f(x, w_0, b_0) [b_1 - b_0],$$ which is precisely $$ \Delta f \approx \frac{\partial f}{\partial w} \Delta w + \frac{\partial f}{\partial b} \Delta b.$$
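To see the two-variable version hold numerically, here is a sketch for the one-input, one-weight sigmoid neuron $f(x, w, b) = \sigma(wx + b)$; the base point and the changes $\Delta w$, $\Delta b$ are made up for illustration, and the partial derivatives use the standard identity $\sigma' = \sigma(1 - \sigma)$.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def f(x, w, b):
    """One-input, one-weight sigmoid neuron: f(x, w, b) = sigma(w*x + b)."""
    return sig(w * x + b)

def df_dw(x, w, b):
    """Partial derivative of f with respect to w (chain rule; sigma' = sigma*(1 - sigma))."""
    s = f(x, w, b)
    return s * (1.0 - s) * x

def df_db(x, w, b):
    """Partial derivative of f with respect to b."""
    s = f(x, w, b)
    return s * (1.0 - s)

# Arbitrary base point and small changes in w and b
x, w0, b0 = 1.5, 0.4, -0.2
dw, db = 0.01, -0.02

delta_exact  = f(x, w0 + dw, b0 + db) - f(x, w0, b0)
delta_approx = df_dw(x, w0, b0) * dw + df_db(x, w0, b0) * db
print(delta_exact, delta_approx)       # nearly equal for small dw, db
```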

Extending this to higher dimensions is a small step, as the argument is essentially the same; with a vector of weights, each weight contributes its own term, which is exactly the sum over $j$ in the equation you quoted (see the sketch below). The overarching idea is to use a linear approximation, and to recognize that derivatives are precisely linear approximations. For a slightly different phrasing, you might look at the definition of the total derivative on Wikipedia, as this is a first-principles linear approximation.
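For completeness, here is the same check in higher dimensions, matching the $\Delta\,\text{output}$ formula from your question (a sum over $j$ plus the bias term); all the numbers below are arbitrary illustrative values.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def output(x, w, b):
    """Sigmoid neuron with a vector of inputs and weights."""
    return sig(np.dot(w, x) + b)

# Arbitrary illustrative values
x  = np.array([0.5, -1.0, 2.0])        # inputs x_j
w0 = np.array([0.3, 0.8, -0.5])        # weights w_j
b0 = 0.1                               # bias
dw = np.array([0.01, -0.02, 0.005])    # small changes Delta w_j
db = -0.01                             # small change Delta b

s = output(x, w0, b0)
grad_w = s * (1.0 - s) * x             # partial output / partial w_j, for each j
grad_b = s * (1.0 - s)                 # partial output / partial b

delta_exact  = output(x, w0 + dw, b0 + db) - output(x, w0, b0)
delta_approx = np.dot(grad_w, dw) + grad_b * db
print(delta_exact, delta_approx)       # the two agree closely for small changes
```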

Good luck in your studies!
