Reinforcement Learning – How to Create an Effective Reward Function in Reinforcement Learning

machine learning, reinforcement learning

While studying reinforcement learning, I have come across many forms of the reward function: $R(s,a)$, $R(s,a,s')$, and even a reward function that depends only on the current state. In the process, I have realized that it is not easy to 'make' or 'define' a reward function.

Here are my questions:

  1. Are there rules on how to make reward functions?
  2. Are there other forms of the reward function? For example, a polynomial form perhaps that depends on the state?

Best Answer

Reward functions describe how the agent "ought" to behave. In other words, they have "normative" content, stipulating what you want the agent to accomplish. For example, some rewarding state $s$ might represent the taste of food. Or perhaps, $(s,a)$ might represent the act of tasting the food. So, to the extent that the reward function determines what the agent's motivations are, yes, you have to make it up!

There are no absolute restrictions, but if your reward function is "better behaved", the agent will learn better. Practically, this means faster convergence and not getting stuck in local minima. But further specifics will depend strongly on the species of reinforcement learning you are using. For example: is the state/action space continuous or discrete? Is the world or the action selection stochastic? Is reward harvested continuously, or only at the end?

One way to view the problem is that the reward function determines the hardness of the problem. For example, traditionally, we might specify a single state to be rewarded: $$ R(s_1)=1 $$ $$ R(s_{2..n})=0 $$ In this case, the problem to be solved is quite a hard one, compared to, say, $R(s_i)=1/i^2$, where there is a reward gradient over states. For hard problems, specifying more detail, e.g. $R(s,a)$ or $R(s,a,s^\prime)$ can help some algorithms by providing extra clues, but potentially at the expense of requiring more exploration. You might well need to include costs as negative terms in $R$ (e.g. energetic costs), to make the problem well-specified.
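As a concrete illustration of the two cases above, here is a minimal sketch in Python (the chain of $n$ states, the goal at state 1, and the function names are all assumptions made for this example): a sparse reward that is non-zero only at the goal versus a shaped reward $R(s_i)=1/i^2$ that provides a gradient over states.

```python
# Minimal sketch: sparse vs. shaped rewards on a 1-D chain of states 1..n,
# with state 1 as the (assumed) goal state. Names and constants are illustrative.

def sparse_reward(i: int) -> float:
    """R(s_1) = 1, R(s_{2..n}) = 0: reward only at the goal state."""
    return 1.0 if i == 1 else 0.0

def shaped_reward(i: int) -> float:
    """R(s_i) = 1 / i**2: a reward gradient that grows toward the goal."""
    return 1.0 / i**2

if __name__ == "__main__":
    for i in range(1, 11):
        print(f"state {i:2d}: sparse={sparse_reward(i):.3f}  shaped={shaped_reward(i):.3f}")
```

With the sparse version, the agent gets no signal at all until it stumbles on state 1; with the shaped version, almost every transition gives some information about which direction is better.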

For the case of a continuous state space, if you want an agent to learn easily, the reward function should be continuous and differentiable. So polynomials can work well for many algorithms. Further, try to remove localised minima. There are a number of examples of how NOT to make a reward function -- like the Rastrigin function. Having said this, several RL algorithms (e.g. Boltzmann machines) are somewhat robust to such landscapes.
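To make the contrast concrete, here is a small sketch comparing a smooth quadratic reward with a Rastrigin-style one. The 1-D state $x$, the goal at $x=0$, and the negation of the Rastrigin function (so it reads as a reward to be maximised) are assumptions for illustration only.

```python
import numpy as np

def smooth_reward(x: float) -> float:
    """Continuous, differentiable, single optimum at x = 0."""
    return -x**2

def rastrigin_reward(x: float, A: float = 10.0) -> float:
    """Negated 1-D Rastrigin function: riddled with local optima away from x = 0."""
    return -(A + x**2 - A * np.cos(2 * np.pi * x))

if __name__ == "__main__":
    for x in np.linspace(-5, 5, 11):
        print(f"x={x:+.1f}  smooth={smooth_reward(x):8.2f}  rastrigin={rastrigin_reward(x):8.2f}")
```

A gradient-following agent climbs the quadratic reward straight to the goal, whereas the Rastrigin-style surface traps it in whichever local bump it happens to start near.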

If you are using RL to solve a real-world problem, you will probably find that although finding the reward function is the hardest part of the problem, it is intimately tied up with how you specify the state space. For example, in a time-dependent problem, the distance to the goal often makes a poor reward function (e.g. in the mountain car problem). Such situations can be solved by using higher dimensional state spaces (hidden states or memory traces), or by hierarchical RL.
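The mountain-car example can be sketched as follows. The state $(\text{position}, \text{velocity})$, the goal at position $\ge 0.5$, and the shaping constant are assumptions taken from the classic formulation of the task, not from any particular implementation.

```python
# Minimal sketch: why pure distance-to-goal is a poor reward for mountain car.
# Constants and function names are illustrative.

GOAL_POSITION = 0.5

def distance_reward(position: float, velocity: float) -> float:
    """Naive reward: negative distance to the goal.
    This punishes the backward swing the car needs to build momentum."""
    return -abs(GOAL_POSITION - position)

def energy_shaped_reward(position: float, velocity: float) -> float:
    """Also credits kinetic energy, so gaining momentum (even while moving
    away from the goal) is not penalised."""
    return -abs(GOAL_POSITION - position) + 10.0 * velocity**2
```

The second version works only because the reward now reads the velocity component of the state; this is the sense in which the reward function and the state-space specification are intimately tied together.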

At an abstract level, unsupervised learning was supposed to obviate stipulating "right and wrong" performance. But we can see now that RL simply shifts the responsibility from the teacher/critic to the reward function. There is a less circular way to approach the problem: infer the reward function itself. One family of methods is called inverse RL or "apprenticeship learning", which recovers a reward function that would reproduce the observed behaviours. Finding the best reward function to reproduce a set of observations can be framed as an MLE, Bayesian, or information-theoretic problem; search for "inverse reinforcement learning" for the details.
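As a toy stand-in for the MLE flavour of inverse RL, here is a sketch that fits per-action rewards to observed choices, assuming the demonstrator follows a Boltzmann (softmax) policy $\pi(a) \propto \exp(R(a))$. The bandit setting, the softmax demonstrator model, the step size, and all constants are assumptions made to keep the example self-contained; real IRL methods work over full MDP trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" reward per action and the resulting Boltzmann demonstrator.
true_R = np.array([0.0, 1.0, 2.0])
probs = np.exp(true_R) / np.exp(true_R).sum()
demos = rng.choice(len(true_R), size=500, p=probs)   # observed behaviour

counts = np.bincount(demos, minlength=len(true_R))
R_hat = np.zeros(len(true_R))                        # reward estimate (identifiable up to a constant)

# Gradient ascent on the log-likelihood: dL/dR[a] = counts[a] - N * softmax(R)[a]
for _ in range(2000):
    pi = np.exp(R_hat) / np.exp(R_hat).sum()
    grad = counts - len(demos) * pi
    R_hat += 0.01 * grad / len(demos)

print("true   R (centred):", true_R - true_R.mean())
print("fitted R (centred):", R_hat - R_hat.mean())
```

Both vectors are centred before printing because, as in inverse RL generally, the reward is only recoverable up to transformations that leave the induced behaviour unchanged.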