Solved – Reward function in Reinforcement Learning

machine learning, reinforcement learning

I am currently looking to apply reinforcement learning to my recurrent neural network.

I am having trouble defining the reward function for the following application.
In my current setting, I am running a recurrent neural network. Each action generates a 1×5 vector, and the network terminates once it has generated 10 of them. Lastly, I append them together, forming a 10×5 matrix.

I then run another algorithm on this matrix to evaluate it, which gives a final reward as a single scalar value.

[Matrix A] -> run algorithm -> reward = 5.2

[Matrix B] -> run algorithm -> reward = 8.2
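
For concreteness, a rough Python sketch of this setup (where `policy` and `evaluate_matrix` are placeholders for my network and my scoring algorithm):

```python
import numpy as np

# Rough sketch of my setup: the network emits one 1x5 vector per step, and
# after 10 steps the stacked 10x5 matrix is scored by a separate algorithm
# (policy and evaluate_matrix are placeholders for my network and my scorer).
def run_episode(policy, evaluate_matrix):
    rows = [policy(step) for step in range(10)]   # each row is a 1x5 vector
    matrix = np.vstack(rows)                      # shape (10, 5)
    return evaluate_matrix(matrix)                # single scalar, e.g. 5.2 or 8.2
```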


However, I am not sure how I should define the reward function for my network. Since there is really no reward for the intermediate steps, should I give them a zero?

Best Answer

It is not 100% clear whether you are running something that would benefit from being framed as a full reinforcement learning problem.

Your RL agent should be doing one or both of:

  • Predicting total accumulated reward (called "return" or "utility") when completing the task, given a current state or state/action pair (or sometimes in deterministic environments, the next state). These predictions will be used to help decide the next action.

  • Generating a probability distribution over actions (e.g. a list of 10 means and standard deviations for generating the eventual vectors) from the current state. This probability density function describes a policy that can then be refined based on the eventual reward.

You need at least one of the above to be true in order to generate suitable gradients to train your RNN from sparse reward data.
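
To make the second option concrete, here is a minimal, hypothetical PyTorch sketch (the class and helper names are my own invention, and `evaluate_matrix` stands in for your external scoring algorithm): a recurrent cell that outputs a Gaussian distribution for each 1×5 action, trained with a REINFORCE-style update from the single terminal reward.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: an RNN cell that parameterises a Gaussian policy over
# each 1x5 action, trained with a REINFORCE-style update from the single
# terminal reward returned by the external scorer.
class GaussianPolicyCell(nn.Module):
    def __init__(self, action_dim=5, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRUCell(action_dim, hidden_dim)
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, prev_action, hidden):
        hidden = self.rnn(prev_action, hidden)
        mean = self.mean_head(hidden)
        std = self.log_std_head(hidden).exp()
        return torch.distributions.Normal(mean, std), hidden

def episode_loss(policy, evaluate_matrix, steps=10, action_dim=5, hidden_dim=32):
    hidden = torch.zeros(1, hidden_dim)
    action = torch.zeros(1, action_dim)
    log_probs, actions = [], []
    for _ in range(steps):
        dist, hidden = policy(action, hidden)
        action = dist.sample()                    # one 1x5 action vector
        log_probs.append(dist.log_prob(action).sum())
        actions.append(action)
    matrix = torch.cat(actions, dim=0)            # shape (10, 5)
    reward = evaluate_matrix(matrix)              # your external scalar score
    # REINFORCE: scale the log-probability of the whole trajectory by the
    # terminal reward; intermediate steps contribute no reward of their own.
    return -reward * torch.stack(log_probs).sum()
```

Whichever policy-gradient or actor-critic variant you end up using, the training signal comes from the log-probabilities (or value predictions), not from any hand-crafted per-step reward.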

Assuming you do have your problem set up in a reinforcement learning framework as above:

However, I am not sure how I should define the reward function for my network. Since there is really no reward for the intermediate steps, should I give them a zero?

In reinforcement learning, rewards are only granted when they directly relate to the goals of the task. There is no need for interim rewards, and adding heuristic ones, unless done very carefully, may hinder the agent's learning. So, yes, everything other than the end reward should be zero for your problem. Bear in mind, though, that your RNN should not be predicting the reward itself, but one or both of the state value (long-term reward) and/or probability distributions for the actions.
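
As a minimal sketch of what that sparse reward signal, and the per-step return targets a value predictor would be trained on, look like in plain Python (gamma is a discount factor I am assuming here; with only 10 steps per episode, gamma = 1.0 is reasonable):

```python
def episode_rewards(final_reward, steps=10):
    # Zero reward for every intermediate step, the scalar score at the end.
    return [0.0] * (steps - 1) + [final_reward]

def returns(rewards, gamma=1.0):
    # Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards.
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

# Example: a matrix scored 8.2 gives value targets of 8.2 at every step
# when gamma = 1.0, even though the intermediate rewards are all zero.
print(returns(episode_rewards(8.2)))
```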

It is OK to have just a single reward granted at the end of an episode, as you have set up with a scoring system for the matrix, with all other rewards zero. For example, this is a fairly common setup for games where the agent either wins, draws or loses.
