I realise this is an old post but perhaps this answer will be useful for others.
Firstly, reinforcement learning is based on the idea of searching for the best long-term reward. That is why, in a Q-learning algorithm, we update the Q values (or 'goodness' values) for each state-action pair to be equal to the reward received plus some fraction (the rate of decay, gamma) of the predicted future reward. In this way, your algorithm could be converging on good Q values that consider both the expected immediate reward and potential future rewards.
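For reference, the standard tabular Q-learning update that the paragraph above describes informally is $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$ where $\alpha$ is the learning rate and $\gamma$ is the discount factor (the 'rate of decay' mentioned above).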
That being said, if your neural network is indeed diverging, then there are a number of things you can do to help your algorithm converge. My immediate advice would be to use Double Deep Q-learning, whereby you introduce a second neural network, copy the weights from your current network into it every so often (less often than the current network is updated), and use it to provide value predictions for the future state.
Suppose you have a neural network that takes a state (your input values) and outputs a list of values, where each index corresponds to a different action. This is how you would get each new input-target pair to train your model on:
import numpy as np

action = action_the_agent_did_in_this_memory_from_state_to_next_state
target = model.predict(state)  # a list of values, one per action, in the current state
future_target_one = model.predict(next_state)  # values for each action in the next state, as predicted by your current model
future_target_two = target_model.predict(next_state)  # values for each action in the next state, as predicted by your target model
best_future_action_index = np.argmax(future_target_one)  # index of the maximum-value action in the next state, using the current model
best_future_action_value = future_target_two[best_future_action_index]  # the value from the target model, at the index chosen by the current model

# if this is the last move before the game ends, then there is no future reward
if done:
    target[action] = reward  # the entry of target at the index of the chosen action is set equal to the reward
else:  # otherwise one must consider the future rewards too
    target[action] = reward + GAMMA * best_future_action_value
This idea is used to decouple the action choice (the index) and the value estimate from each other in the value predictions, which helps prevent problems with overestimation. I hope this helps, and that you weren't put off by my super-long variable names.
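As an aside, the periodic weight copy into the target network mentioned earlier can be done like this (a minimal sketch of my own, assuming Keras-style models with get_weights()/set_weights(); UPDATE_EVERY and step are illustrative names):

UPDATE_EVERY = 1000  # copy weights every 1000 training steps (an arbitrary choice)

# inside your training loop
if step % UPDATE_EVERY == 0:
    target_model.set_weights(model.get_weights())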
Here are the cost functions I understand so far. Most of these work best when given values between $0$ and $1$.
Quadratic cost
Also known as mean squared error, this is defined as:
$$C_{MSE}(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{MSE} = (a^L - E^r)$$
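As a quick illustration (a sketch of my own, with arbitrary example values), here is this cost and its gradient in NumPy:

import numpy as np
a = np.array([0.8, 0.3])  # network output a^L (arbitrary example values)
E = np.array([1.0, 0.0])  # desired output E^r
cost = 0.5 * np.sum((a - E) ** 2)
grad = a - E  # gradient of the cost with respect to the output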
Cross-entropy cost
Also known as Bernoulli negative log-likelihood or binary cross-entropy.
$$C_{CE}(W, B, S^r, E^r) = -\sum\limits_j [E^r_j \ln a^L_j + (1 - E^r_j) \ln(1-a^L_j)]$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{CE} = \frac{(a^L - E^r)}{(1-a^L)(a^L)}$$
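The same kind of sketch for cross-entropy; note the outputs must lie strictly between $0$ and $1$ for the logarithms to be defined:

import numpy as np
a = np.array([0.8, 0.3])  # network output, strictly inside (0, 1)
E = np.array([1.0, 0.0])  # binary targets
cost = -np.sum(E * np.log(a) + (1 - E) * np.log(1 - a))
grad = (a - E) / ((1 - a) * a)  # elementwise gradient with respect to a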
Exponential cost
This requires choosing some parameter $\tau$ that you think will give you the behavior you want. Typically you'll just need to play with this until things work well.
$$C_{EXP}(W, B, S^r, E^r) = \tau\text{ }\exp(\frac{1}{\tau} \sum\limits_j (a^L_j - E^r_j)^2)$$
where $\text{exp}(x)$ is simply shorthand for $e^x$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{2}{\tau}(a^L- E^r)C_{EXP}(W, B, S^r, E^r)$$
I could write out $C_{EXP}$ again, but that seems redundant. The point is that the gradient computes a vector and then multiplies it by the scalar $C_{EXP}$.
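An illustrative sketch, with an arbitrary choice of $\tau$:

import numpy as np
tau = 0.5  # tunable parameter; this value is arbitrary
a = np.array([0.8, 0.3])
E = np.array([1.0, 0.0])
cost = tau * np.exp(np.sum((a - E) ** 2) / tau)
grad = (2.0 / tau) * (a - E) * cost  # a vector multiplied by the scalar C_EXP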
Hellinger distance
$$C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}}\sum\limits_j(\sqrt{a^L_j}-\sqrt{E^r_j})^2$$
You can find more about this here. This cost needs positive values, ideally between $0$ and $1$; the same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{\sqrt{a^L}-\sqrt{E^r}}{\sqrt{2}\sqrt{a^L}}$$
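An illustrative NumPy sketch (both vectors must be non-negative, as noted above):

import numpy as np
a = np.array([0.8, 0.3])  # non-negative network output
E = np.array([0.9, 0.1])  # non-negative target
cost = np.sum((np.sqrt(a) - np.sqrt(E)) ** 2) / np.sqrt(2)
grad = (np.sqrt(a) - np.sqrt(E)) / (np.sqrt(2) * np.sqrt(a))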
Kullback–Leibler divergence
Also known as information divergence, information gain, relative entropy, KLIC, or KL divergence (see here).
Kullback–Leibler divergence is typically denoted $$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}$$,
where $D_{\mathrm{KL}}(P\|Q)$ is a measure of the information lost when $Q$ is used to approximate $P$. Thus we want to set $P=E^r$ and $Q=a^L$, because we want to measure how much information is lost when we use $a^L_j$ to approximate $E^r_j$. This gives us
$$C_{KL}(W, B, S^r, E^r)=\sum\limits_jE^r_j \log \frac{E^r_j}{a^L_j}$$
The other divergences here use this same idea of setting $P=E^r$ and $Q=a^L$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = -\frac{E^r}{a^L}$$
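A sketch with $E^r$ and $a^L$ treated as probability vectors (example values of my own):

import numpy as np
a = np.array([0.7, 0.3])  # Q: the network's approximation
E = np.array([0.9, 0.1])  # P: the desired distribution
cost = np.sum(E * np.log(E / a))
grad = -E / a  # gradient with respect to a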
Generalized Kullback–Leibler divergence
From here.
$$C_{GKL}(W, B, S^r, E^r)=\sum\limits_j E^r_j \log \frac{E^r_j}{a^L_j} -\sum\limits_j(E^r_j) + \sum\limits_j(a^L_j)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{a^L}$$
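The corresponding sketch for the generalized form, which does not require the vectors to sum to one:

import numpy as np
a = np.array([0.7, 0.4])  # positive network output
E = np.array([0.9, 0.1])  # positive target
cost = np.sum(E * np.log(E / a)) - np.sum(E) + np.sum(a)
grad = (a - E) / a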
Itakura–Saito distance
Also from here.
$$C_{IS}(W, B, S^r, E^r)= \sum_j \left(\frac {E^r_j}{a^L_j} - \log \frac{E^r_j}{a^L_j} - 1 \right)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{\left(a^L\right)^2}$$
where $\left(\left(a^L\right)^2\right)_j = a^L_j \cdot a^L_j$; in other words, $\left( a^L\right) ^2$ simply squares each element of $a^L$.
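And a final sketch for the Itakura–Saito distance (all values must be strictly positive):

import numpy as np
a = np.array([0.7, 0.4])  # strictly positive network output
E = np.array([0.9, 0.1])  # strictly positive target
cost = np.sum(E / a - np.log(E / a) - 1)
grad = (a - E) / a ** 2  # elementwise division by the square of each element of a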
Best Answer
It is not surprising that weight decay will hurt the performance of your neural network at some point. Let the prediction loss of your net be $\mathcal{L}$ and the weight decay loss $\mathcal{R}$. Given a coefficient $\lambda$ that establishes a tradeoff between the two, one optimises $$ \mathcal{L} + \lambda \mathcal{R}. $$ At the optimum of this loss, the gradients of the two terms will have to sum to zero: $$ \nabla \mathcal{L} = -\lambda \nabla \mathcal{R}. $$ This makes clear that we will not be at an optimum of the training loss. Moreover, the higher $\lambda$, the steeper the gradient of $\mathcal{L}$, which in the case of convex loss functions implies a greater distance from the optimum.
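To make this concrete, here is a one-dimensional sketch of my own (not from the original question): minimising $$\frac{1}{2}(w - w^\star)^2 + \frac{\lambda}{2} w^2$$ over $w$ gives $\hat{w} = \frac{w^\star}{1+\lambda}$, so the larger $\lambda$ becomes, the further the regularised solution is pulled from the unregularised optimum $w^\star$.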