Solved – What does the reward function depend on in a Markov Decision Process (MDP), in the context of Reinforcement Learning

reinforcement learning

I was trying to understand MDPs in the context of reinforcement learning; specifically, I was trying to understand what the reward function explicitly depends on.

I have seen the reward function defined in Andrew Ng's lecture notes as:

$$R: S \times A \mapsto \mathbb{R}$$

This means that the reward function depends on the current state and the action taken in that state, and maps them to some real number (the reward).

To get a different perspective, I read the interpretation Wikipedia gives:

"The process responds at the next time step by randomly moving into a new state $s'$, and giving the decision maker a corresponding reward $R_a(s,s')$."

This seems to be a different interpretation, since it would make the reward a function of the form:

$$R: S \times A \times S\mapsto \mathbb{R}$$

This, in my opinion, seems to be a completely different thing. I was trying to understand whether the two formulations are actually the same (and whether it is possible to prove their equivalence) in the context of MDPs applied to reinforcement learning.

Best Answer

The two definitions are not the same, but it essentially boils down to a modelling choice: for some problems, the reward function might be easier to define on (state, action) pairs, while for others the tuple (state, action, next state) might be more appropriate. There is even a third option that defines the reward on the current state alone (this can also be found in some references).
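On the equivalence part of the question: whenever the objective is the expected return, a reward defined on (state, action, next state) can be collapsed to one defined on (state, action) by averaging over the transition probabilities,

$$R(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[R_a(s, s')\right] = \sum_{s'} P(s' \mid s, a)\, R_a(s, s'),$$

and the resulting MDP has the same value functions and optimal policies. The converse mapping is not unique, which is why the two signatures are genuinely different objects even though they lead to the same learning problem in expectation.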

That said, I think the definition of the reward function $R(s,a)$ on the (state, action) pair is the most common. Either way, the core learning algorithms remain the same whatever your exact design choice for the reward function.
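As a minimal sketch of that point (the toy MDP, shapes, and reward tables below are hypothetical, not taken from any particular library), a tabular Q-learning update only ever consumes the scalar reward observed on a transition, so the same update code works regardless of whether the environment computes that reward from $(s, a)$ or from $(s, a, s')$:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step.

    The update only uses the scalar reward r, so it does not care
    whether the environment computed r as R(s, a) or R(s, a, s').
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical 3-state, 2-action toy MDP with both reward conventions.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))

R_sa = np.random.rand(n_states, n_actions)              # R(s, a)
R_sas = np.random.rand(n_states, n_actions, n_states)   # R(s, a, s')

s, a, s_next = 0, 1, 2
Q = q_learning_update(Q, s, a, R_sa[s, a], s_next)           # reward from R(s, a)
Q = q_learning_update(Q, s, a, R_sas[s, a, s_next], s_next)  # reward from R(s, a, s')
```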