Reinforcement Learning – Does the Reward Function Need to Be Continuous?

reinforcement learning

Does the reward function need to be continuous in deep reinforcement learning? It should be noted that the reward is used for gradient computation.

The algorithm I use is Proximal Policy Optimization (Schulman J., Wolski F., Dhariwal P., et al. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347, 2017).

The objective function is the clipped surrogate objective from that paper:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The advantage is denoted $\hat{A}_t$ above, and it is estimated from:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

And you can see that each $\delta_t$ is a sample estimate of the state-action value minus the state value, both of which are derived from the immediate reward.
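
For concreteness, here is a minimal NumPy sketch of that advantage computation (the function name, hyperparameter values, and toy numbers are illustrative, not taken from the question):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimate A_hat_t, built from the
    immediate rewards r_t and the critic's value estimates V(s_t):
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), then a discounted sum of deltas."""
    values = np.append(values, last_value)            # V(s_0) ... V(s_T)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):            # backward recursion over the rollout
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Example: a sparse, discontinuous reward signal (0 everywhere, 1 at the end)
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.3, 0.5])              # V(s_0) ... V(s_3) from the critic
print(gae_advantages(rewards, values, last_value=0.0))
```

Note that the rewards enter this computation only as observed numbers; nothing here differentiates through them.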

Best Answer

Although you can define the reward as a "reward function", and you may have computer code that calculates reward from a function call with current state and action as inputs, typically reward is not considered a mathematical function. It is a variable that you can observe.

So to answer this, assuming you mean "reward" where you say "reward function":

Does the reward function need to be continuous in deep reinforcement learning? It should be noted that the reward is used for gradient computation.

No, there is no requirement for the reward to be drawn from any continuous function. That is because the value of $R_t$ is produced by the environment, independently of the parameters $\theta$ that the policy gradient is taken with respect to. Changing any part of $\theta$ would not change the value observed in the same context (although it may change whether you ever observe the same value again). In fact, this is used when deriving your first equation in the Policy Gradient Theorem (see appendix 1 of this paper): the gradient of $r$ is assumed to be zero when expanding the terms.

Intuitively, the reward is data that your algorithm learns from. It does not make sense to ask about the gradient of the reward w.r.t. the learnable parameters, any more than it makes sense to ask about the gradient of the input data in supervised learning w.r.t. the learnable parameters*.
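
To make this concrete, here is a minimal PyTorch-style sketch of the clipped surrogate update (the toy policy, batch, and numbers are illustrative, not from the question): the advantages built from observed rewards enter the loss as constants, so a sparse or discontinuous reward signal poses no problem for backpropagation.

```python
import torch

# Toy policy: a linear layer producing action logits for 2 discrete actions.
policy = torch.nn.Linear(4, 2)
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))

# Advantages computed from observed rewards; they are plain data with no
# dependence on the policy parameters, so they carry no gradient.
advantages = torch.tensor([0., 0., 1., 0., 0., 1., 0., 0.])  # discontinuous is fine
old_log_probs = torch.randn(8).detach()   # log pi_theta_old(a|s), recorded at rollout time

dist = torch.distributions.Categorical(logits=policy(states))
log_probs = dist.log_prob(actions)        # the only term that depends on theta
ratio = torch.exp(log_probs - old_log_probs)   # r_t(theta)

eps = 0.2
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()

loss.backward()   # gradients flow only through log_probs; the reward values
                  # are treated as constants, so their (dis)continuity never matters
```

The only path for gradients is through $\log \pi_\theta(a_t \mid s_t)$; the reward itself never needs a gradient of its own.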


* In some contexts, e.g. style transfer for images, we do take gradients of the input data in order to modify it. Technically these are gradients of a loss function w.r.t. the input, not of the input w.r.t. the learnable parameters (that would still be zero). There are also RL contexts where you fit a reward structure to observed behaviour, such as inverse reinforcement learning, where this could be useful, but that is not what you do when training an agent to optimise total reward in an environment.
