Solved – How to change a weight/bias with gradient

backpropagation, gradient descent, intercept, neural networks, weights

After watching 3Blue1Brown's tutorial series, and an array of others, I'm attempting to make my own neural network from scratch.

So far, I'm able to calculate the gradient for each of the weights and biases.

Now that I have the gradient, how am I supposed to correct my weight/bias?

Should I:

Add the gradient and the original value?
Multiply the gradient and the original value?
Something else? (Most likely answer)

In addition to this, I've been hearing the term learning rate being tossed around, and how it is used to define the magnitude of the 'step' to descend to minimum cost. I figured this may also play a major role in reducing the cost.

Best Answer

After you've found the gradients, which we'll call $dW, db$ as shorthand for the gradients of the cost function with respect to the weights and biases respectively, what you do to your weights and biases $W, b$ depends on the specific optimization algorithm you are using. The simplest optimization algorithm is vanilla gradient descent. In this case, you apply the updates: $$W\rightarrow W-\alpha\, dW$$ $$b\rightarrow b-\alpha\, db$$ where $\alpha$ is your learning rate. In other words, you neither add nor multiply: you subtract the gradient, scaled by the learning rate, from the original value. The subtraction is there because the gradient points in the direction in which the cost increases fastest. The learning rate governs how quickly and how well your model learns. A learning rate that is too small may force your model to take many steps to converge; a learning rate that is too big might make your model overshoot the optimum and never converge at all. Finding a good learning rate is an iterative process.
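As a minimal sketch of the update rule above, here is vanilla gradient descent on a one-weight, one-bias linear model with a mean squared error cost. The data, the model, and all names (`x`, `y`, `alpha`, `n_steps`) are assumptions made up for illustration; they are not from the question.

```r
# Toy data generated by the "true" parameters W = 2, b = 1 (illustrative assumption).
x <- c(0, 1, 2, 3)
y <- 2 * x + 1

W <- 0            # initial weight
b <- 0            # initial bias
alpha <- 0.1      # learning rate
n_steps <- 500    # number of gradient descent steps

for (step in 1:n_steps) {
  err <- (x * W + b) - y      # prediction error of the current model
  dW <- mean(2 * err * x)     # gradient of the MSE cost with respect to W
  db <- mean(2 * err)         # gradient of the MSE cost with respect to b
  W <- W - alpha * dW         # W -> W - alpha * dW
  b <- b - alpha * db         # b -> b - alpha * db
}

print(c(W = W, b = b))        # ends up very close to W = 2, b = 1
```

On this particular toy problem, raising `alpha` to 0.3 makes each step overshoot by more than the previous one, so the parameters grow without bound instead of converging.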

There are other optimization algorithms out there, including gradient descent with momentum, gradient descent with Nesterov momentum, RMSprop, Adam, Adamax, Adagrad, Adadelta, etc. Which optimization algorithm to use is also a question for the cross-validation step of learning. Generally speaking, though, I find Adam to work quite well in a variety of situations. You can look here for a more in-depth description of the optimization algorithms.
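To give a feel for how these variants differ from the vanilla update, here is a sketch of gradient descent with classical momentum, in one common formulation: a velocity term accumulates past gradients and the parameters move by the velocity. The toy data, the momentum coefficient `mu`, and the other names are illustrative assumptions.

```r
# Same kind of toy problem: data generated by W = 2, b = 1 (illustrative assumption).
x <- c(0, 1, 2, 3)
y <- 2 * x + 1

W <- 0;  b <- 0       # parameters
vW <- 0; vb <- 0      # velocities, one per parameter
alpha <- 0.05         # learning rate
mu <- 0.9             # momentum coefficient (assumed value)

for (step in 1:1000) {
  err <- (x * W + b) - y
  dW <- mean(2 * err * x)       # gradient with respect to W
  db <- mean(2 * err)           # gradient with respect to b
  vW <- mu * vW - alpha * dW    # velocity: decayed history plus the new gradient step
  vb <- mu * vb - alpha * db
  W <- W + vW                   # move the parameters by the velocity
  b <- b + vb
}

print(c(W = W, b = b))          # ends up very close to W = 2, b = 1
```

Setting `mu` to 0 reduces this to the vanilla update, since the velocity is then just the current step $-\alpha\, dW$.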

Related Solutions

Solved – How to incorporate the biases in the feed-forward neural network

Anyone new to neural networks may feel confused when first reading tutorials that use different notations. Some tutorials use 'biases', while others use 'bias units'. The role of the bias is the same in both, which is well illustrated in this question, but I think the two notations reflect a slight implementation difference. The following two descriptions are for the same network, with the same input layer and first hidden layer.

Implementation for 'biases':
The input layer with $m$ units is represented by a $1\times m$ matrix, $v$ here; the hidden layer with $n$ units is represented by a $1\times n$ matrix, $h$; the weights from the input to the hidden layer are represented by an $m\times n$ weight matrix, $w$; the bias to the hidden layer is represented by another $1\times n$ matrix, $b$. A forward pass computes $h = v * w + b$ and then applies the activation function to $h$.

Implementation for 'bias units':
The input layer with $m+1$ units is represented by a $1\times (m+1)$ matrix $v$, whose first unit is a bias unit with constant value $1$; the weight matrix from the input to the hidden layer is of size $(m+1) \times n$, and the values in its first row are the weights corresponding to the bias; the hidden layer has $n+1$ units, the first of which is a bias unit with constant value $1$ that is not affected by forward passes. A forward pass computes $h=v*w$ and then applies the activation function to $h$.
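To see why the two notations describe the same computation, write $v$, $w$, $b$ for the input, weights and bias of the 'biases' implementation. The 'bias units' input is then $v$ with a $1$ prepended, and its weight matrix is the row $b$ stacked on top of $w$ (this block notation is introduced here only for illustration): $$\begin{bmatrix} 1 & v \end{bmatrix} * \begin{bmatrix} b \\ w \end{bmatrix} = 1 \cdot b + v * w = v * w + b$$ so both forward passes feed the same values into the activation function.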

The following image quoted from holehouse.org is an illustration of the second implementation.

Both implementations are common, so answer the question according to the notation it uses. According to the given conditions, your question follows the first implementation. Suppose your $v$ is a one-unit vector [2.8]; the following is an R implementation of the forward pass.

# Element-wise logistic (sigmoid) function; exp() is vectorized in R,
# so no explicit loop is needed.
logistic <- function(vec) {
  1 / (1 + exp(-vec))
}

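v = c(2.8)                        # single input unit
w = c(0.12,0.86,0.20,0.5)         # weights to the four hidden units
b = c(7.12,-6.20,0.90,-3.6)       # one bias per hidden unit
result = logistic(v%*%t(w) + b)   # h = v * w + b, then the activation
result
#           [,1]       [,2]      [,3]       [,4]
# [1,] 0.9994224 0.02205315 0.8115327 0.09975049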

If you use the second implementation instead, the input layer becomes [1, 2.8], the biases are merged into the weight matrix, which becomes [7.12,−6.20,0.90,−3.6; 0.12,0.86,0.20,0.5], and the hidden layer has a bias unit.

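v = c(1,2.8)                          # bias unit (constant 1) followed by the input
w = matrix(nrow = 2, ncol = 4)
w[1, ] = c(7.12,-6.20,0.90,-3.6)      # first row: weights of the bias unit
w[2, ] = c(0.12,0.86,0.20,0.5)        # second row: weights of the input unit
result = logistic(v%*%w)              # h = v * w, then the activation
result
#           [,1]       [,2]      [,3]       [,4]
# [1,] 0.9994224 0.02205315 0.8115327 0.09975049
h = c(1, result)                      # prepend the hidden layer's bias unit
h
# [1] 1.00000000 0.99942237 0.02205315 0.81153267 0.09975049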

Neural Networks – Why Use Gradient Descent with Neural Networks?

Because we can't solve for the minimum analytically. The optimization surface $S(\mathbf{w})$ as a function of the weights $\mathbf{w}$ is nonlinear, and no closed-form solution exists for $\frac{d S(\mathbf{w})}{d\mathbf{w}}=0$.
Gradient descent, by definition, descends. If you reach a stationary point after descending, it has to be a (local) minimum or a saddle point, but never a local maximum.
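A small sketch of the "it descends" point, on a one-dimensional cost chosen purely for illustration: $S(w) = w^4 - 2w^2$ has a local maximum at $w = 0$ and minima at $w = \pm 1$. Started slightly to the right of the maximum, gradient descent moves away from it and settles in the minimum at $w = 1$. The function, starting point and learning rate are all assumptions.

```r
S  <- function(w) w^4 - 2 * w^2      # illustrative cost: local max at 0, minima at -1 and 1
dS <- function(w) 4 * w^3 - 4 * w    # its derivative

w <- 0.01          # start just next to the local maximum
alpha <- 0.05      # learning rate
S_start <- S(w)

for (step in 1:200) {
  w <- w - alpha * dS(w)   # plain gradient descent step
}

print(c(w = w, S_start = S_start, S_end = S(w)))   # w is close to 1, and the cost has decreased
```

Starting exactly at $w = 0$ would leave `w` unchanged, because the gradient there is zero; the point of the sketch is that any small displacement from the maximum is carried further away from it, not back toward it.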
