I realise this is an old post but perhaps this answer will be useful for others.
Firstly, reinforcement learning is based on the idea of searching for the best long-term reward. That is why, in a Q-learning algorithm, we update the Q values (or 'goodness' values) for each state-action pair to be equal to the reward received plus some fraction (the rate of decay, gamma) of the predicted future reward. In this way, your algorithm could be converging on good Q values that consider both the expected immediate reward and potential future rewards.
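For reference, the standard tabular Q-learning update that the paragraph above describes informally is $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$ where $\alpha$ is the learning rate and $\gamma$ is the discount factor (the 'rate of decay' mentioned above).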
That being said, if your neural network is indeed diverging, then there are a number of things you can do to help your algorithm converge. My immediate advice would be to use Double Deep Q-learning, whereby you introduce a second neural network, copy the weights from your current network into it every so often (less often than the current network is updated), and use it to provide value predictions for the future state.
Suppose you have a neural network that takes a state (your input values) and outputs a list of values, where each index corresponds to a different action. This is how you would get each new input-target pair to train your model on:
import numpy as np

action = action_the_agent_did_in_this_memory_from_state_to_next_state
target = model.predict(state)  # a list of values, one per action, in the current state
future_target_one = model.predict(next_state)  # values for each action in the next state, as predicted by your current model
future_target_two = target_model.predict(next_state)  # values for each action in the next state, as predicted by your target model
best_future_action_index = np.argmax(future_target_one)  # index of the maximum-value action in the next state, using the current model
best_future_action_value = future_target_two[best_future_action_index]  # the value from the target model, at the index chosen by the current model

# if this is the last move before the game ends, then there is no future reward
if done:
    target[action] = reward  # the entry of target at the index of the chosen action is set equal to the reward
else:  # otherwise one must consider the future rewards too
    target[action] = reward + GAMMA * best_future_action_value
This idea is used to decouple the action choice (the index) and the value estimate from each other in the value predictions, which helps prevent problems with overestimation. I hope this helps, and that you weren't put off by my super-long variable names.
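As an aside, the periodic weight copy into the target network mentioned earlier can be done like this (a minimal sketch of my own, assuming Keras-style models with get_weights()/set_weights(); UPDATE_EVERY and step are illustrative names):

UPDATE_EVERY = 1000  # copy weights every 1000 training steps (an arbitrary choice)

# inside your training loop
if step % UPDATE_EVERY == 0:
    target_model.set_weights(model.get_weights())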
Here are the cost functions I understand so far. Most of these work best when given values between $0$ and $1$.
Quadratic cost
Also known as mean squared error, this is defined as:
$$C_{MSE}(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{MSE} = (a^L - E^r)$$
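As a quick illustration (a sketch of my own, with arbitrary example values), here is this cost and its gradient in NumPy:

import numpy as np
a = np.array([0.8, 0.3])  # network output a^L (arbitrary example values)
E = np.array([1.0, 0.0])  # desired output E^r
cost = 0.5 * np.sum((a - E) ** 2)
grad = a - E  # gradient of the cost with respect to the output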
Cross-entropy cost
Also known as Bernoulli negative log-likelihood or binary cross-entropy.
$$C_{CE}(W, B, S^r, E^r) = -\sum\limits_j [E^r_j \ln a^L_j + (1 - E^r_j) \ln(1-a^L_j)]$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{CE} = \frac{(a^L - E^r)}{(1-a^L)(a^L)}$$
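The same kind of sketch for cross-entropy; note the outputs must lie strictly between $0$ and $1$ for the logarithms to be defined:

import numpy as np
a = np.array([0.8, 0.3])  # network output, strictly inside (0, 1)
E = np.array([1.0, 0.0])  # binary targets
cost = -np.sum(E * np.log(a) + (1 - E) * np.log(1 - a))
grad = (a - E) / ((1 - a) * a)  # elementwise gradient with respect to a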
Exponential cost
This requires choosing some parameter $\tau$ that you think will give you the behavior you want. Typically you'll just need to play with this until things work well.
$$C_{EXP}(W, B, S^r, E^r) = \tau\text{ }\exp(\frac{1}{\tau} \sum\limits_j (a^L_j - E^r_j)^2)$$
where $\text{exp}(x)$ is simply shorthand for $e^x$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{2}{\tau}(a^L- E^r)C_{EXP}(W, B, S^r, E^r)$$
I could write out $C_{EXP}$ again, but that seems redundant. The point is that the gradient computes a vector and then multiplies it by the scalar $C_{EXP}$.
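An illustrative sketch, with an arbitrary choice of $\tau$:

import numpy as np
tau = 0.5  # tunable parameter; this value is arbitrary
a = np.array([0.8, 0.3])
E = np.array([1.0, 0.0])
cost = tau * np.exp(np.sum((a - E) ** 2) / tau)
grad = (2.0 / tau) * (a - E) * cost  # a vector multiplied by the scalar C_EXP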
Hellinger distance
$$C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}}\sum\limits_j(\sqrt{a^L_j}-\sqrt{E^r_j})^2$$
You can find more about this here. This cost needs positive values, ideally between $0$ and $1$; the same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{\sqrt{a^L}-\sqrt{E^r}}{\sqrt{2}\sqrt{a^L}}$$
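An illustrative NumPy sketch (both vectors must be non-negative, as noted above):

import numpy as np
a = np.array([0.8, 0.3])  # non-negative network output
E = np.array([0.9, 0.1])  # non-negative target
cost = np.sum((np.sqrt(a) - np.sqrt(E)) ** 2) / np.sqrt(2)
grad = (np.sqrt(a) - np.sqrt(E)) / (np.sqrt(2) * np.sqrt(a))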
Kullback–Leibler divergence
Also known as information divergence, information gain, relative entropy, KLIC, or KL divergence (see here).
Kullback–Leibler divergence is typically denoted $$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}$$,
where $D_{\mathrm{KL}}(P\|Q)$ is a measure of the information lost when $Q$ is used to approximate $P$. Thus we want to set $P=E^r$ and $Q=a^L$, because we want to measure how much information is lost when we use $a^L_j$ to approximate $E^r_j$. This gives us
$$C_{KL}(W, B, S^r, E^r)=\sum\limits_jE^r_j \log \frac{E^r_j}{a^L_j}$$
The other divergences here use this same idea of setting $P=E^r$ and $Q=a^L$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = -\frac{E^r}{a^L}$$
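A sketch with $E^r$ and $a^L$ treated as probability vectors (example values of my own):

import numpy as np
a = np.array([0.7, 0.3])  # Q: the network's approximation
E = np.array([0.9, 0.1])  # P: the desired distribution
cost = np.sum(E * np.log(E / a))
grad = -E / a  # gradient with respect to a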
Generalized Kullback–Leibler divergence
From here.
$$C_{GKL}(W, B, S^r, E^r)=\sum\limits_j E^r_j \log \frac{E^r_j}{a^L_j} -\sum\limits_j(E^r_j) + \sum\limits_j(a^L_j)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{a^L}$$
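The corresponding sketch for the generalized form, which does not require the vectors to sum to one:

import numpy as np
a = np.array([0.7, 0.4])  # positive network output
E = np.array([0.9, 0.1])  # positive target
cost = np.sum(E * np.log(E / a)) - np.sum(E) + np.sum(a)
grad = (a - E) / a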
Itakura–Saito distance
Also from here.
$$C_{IS}(W, B, S^r, E^r)= \sum_j \left(\frac {E^r_j}{a^L_j} - \log \frac{E^r_j}{a^L_j} - 1 \right)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{\left(a^L\right)^2}$$
where $\left(\left(a^L\right)^2\right)_j = a^L_j \cdot a^L_j$; in other words, $\left( a^L\right) ^2$ simply squares each element of $a^L$.
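And a final sketch for the Itakura–Saito distance (all values must be strictly positive):

import numpy as np
a = np.array([0.7, 0.4])  # strictly positive network output
E = np.array([0.9, 0.1])  # strictly positive target
cost = np.sum(E / a - np.log(E / a) - 1)
grad = (a - E) / a ** 2  # elementwise division by the square of each element of a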
Best Answer
It is not surprising that weight decay will hurt the performance of your neural network at some point. Let the prediction loss of your net be $\mathcal{L}$ and the weight decay loss $\mathcal{R}$. Given a coefficient $\lambda$ that establishes a tradeoff between the two, one optimises $$ \mathcal{L} + \lambda \mathcal{R}. $$ At the optimum of this loss, the gradients of the two terms will have to sum to zero: $$ \nabla \mathcal{L} = -\lambda \nabla \mathcal{R}. $$ This makes clear that we will not be at an optimum of the training loss. Moreover, the higher $\lambda$, the steeper the gradient of $\mathcal{L}$, which in the case of convex loss functions implies a greater distance from the optimum.
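To make this concrete, here is a one-dimensional sketch of my own (not from the original question): minimising $$\frac{1}{2}(w - w^\star)^2 + \frac{\lambda}{2} w^2$$ over $w$ gives $\hat{w} = \frac{w^\star}{1+\lambda}$, so the larger $\lambda$ becomes, the further the regularised solution is pulled from the unregularised optimum $w^\star$.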