Solved – Weight Size in Neural Networks

neural networks

I am using a neural network in combination with reinforcement learning. The network should learn the values of three actions in given states. The reward from the environment is scaled to [-0.9, 0.9]. The network consists of one input layer with 30 nodes, of which no more than three are activated at the same time. The hidden layer consists of 30 nodes as well, with tanh as the activation function. The output layer consists of three nodes (one per action), which are also activated with tanh.
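
Roughly, the architecture looks like this (sketched here in Keras just for illustration; this is not necessarily the framework I use):

import tensorflow as tf

# 30 inputs (at most three active at once), a hidden layer of 30 tanh nodes,
# and 3 tanh outputs giving the value of each of the three actions
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation='tanh', input_shape=(30,)),
    tf.keras.layers.Dense(3, activation='tanh'),
])
model.compile(optimizer='adam', loss='mse')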

When training my network, it often finds solutions for the given task, but often enough the values returned by the network get closer and closer to -1 and 1, even though the reward is normalized to [-0.9, 0.9]. Through debugging I found out that the weights from the hidden layer to the output layer keep getting bigger. Since the weights are used in backpropagation, a huge error is passed back through the network, causing even bigger weights, and so on.

Is there a way to prevent the network from getting into this vicious circle?

Best Answer

I realise this is an old post but perhaps this answer will be useful for others.

Firstly, reinforcement learning is based on the idea of searching for the best long-term reward. That is why, in a Q-learning algorithm, we update the Q value (or 'goodness' value) of each state-action pair to be equal to the reward received plus some fraction (the discount factor, gamma) of the predicted future reward. In this way, your algorithm could be converging on good Q values that account for both the expected immediate reward and potential future rewards.
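
For reference, the standard tabular Q-learning update being described here is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

where $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward received, and $s'$ the next state. When a neural network approximates Q, the bracketed target $r + \gamma \max_{a'} Q(s', a')$ is the value the network's prediction for $(s, a)$ is trained towards.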

That being said, if your neural network is indeed diverging, there are a number of things you can do to help your algorithm converge. My immediate advice would be to use Double Deep Q-learning: introduce a second neural network (a target network), copy the weights from your current network into it every so often (less often than the current network is updated), and use it to provide the value predictions for the future state.
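
A rough sketch of that periodic weight copy (assuming a Keras-style API; model, target_model and the update frequency here are placeholder names, not a prescribed implementation):

import tensorflow as tf

# build the target network once with the same architecture and initial weights as the online model
target_model = tf.keras.models.clone_model(model)
target_model.set_weights(model.get_weights())

# then, inside your training loop, sync the target network only every so often
TARGET_UPDATE_EVERY = 1000  # placeholder value; tune for your problem
if step % TARGET_UPDATE_EVERY == 0:
    target_model.set_weights(model.get_weights())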

So, for a neural network that takes a state (your input values) and outputs a list of values (whose indices correspond to the different actions), this is how you would construct each new (input, target) pair to train your model on:

import numpy as np

action = action_the_agent_did_in_this_memory_from_state_to_new_state

targets = model.predict(state)  # the value of each action in the current state
future_target_one = model.predict(new_state)  # the value of each action in the next state, as predicted by your current model
future_target_two = target_model.predict(new_state)  # the value of each action in the next state, as predicted by your target model
best_future_action_index = np.argmax(future_target_one)  # the index of the highest-value action in the next state, according to the current model
best_future_action_value = future_target_two[best_future_action_index]  # the value of that action taken from the target model (using the index from the current model)

# if this is the last move before the game ends, then there is no future reward
if done:
    targets[action] = reward  # the entry of targets at the index of the action taken is set to the reward alone
else:  # otherwise the discounted future reward must be considered too
    targets[action] = reward + GAMMA * best_future_action_value
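
You would then fit your current model on this pair; a minimal sketch, again assuming a Keras-style model and that state and targets are already shaped the way your predict/fit calls expect:

model.fit(state, targets, verbose=0)  # one small gradient step towards the corrected target values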

This idea decouples the action selection (the index, from the current model) from the value estimate (from the target model), which helps prevent problems with overestimation. I hope this helps, and that you weren't put off by my super-long variable name.
