Solved – Why (and when) does one have to learn the reward function from samples in reinforcement learning

machine learning, reinforcement learning

In reinforcement learning we have a reward function that tells the agent how well its current actions and states are doing. In a somewhat general setting, the reward function is a function of three variables:

  1. The current state $s$
  2. The action taken in the current state, $a = \pi(s)$
  3. The next state $s'$

So it looks something like:

$$R(s, a, s')$$
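For concreteness, a reward function with this signature can often be written down directly as an ordinary function. The sketch below is a minimal, purely illustrative Python example; the state encoding and the particular reward values are assumptions, not something from any specific course notes:

```python
# A minimal sketch of a hand-specified reward function R(s, a, s').
# The string-valued states and the reward values are illustrative assumptions.
def R(s, a, s_next):
    # Sparse reward: +1 for landing in a designated goal state,
    # 0 otherwise, regardless of the action taken.
    return 1.0 if s_next == "goal" else 0.0
```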

My question is this (which probably reflects a misunderstanding): normally the person using reinforcement learning decides what the reward is. For example, they assign 1000 points for reaching the goal, or assign -1000 points for crashing the autonomous robot. In these scenarios, it's not clear to me why we would need samples to learn $R$: $R$ is specified a priori and then we deploy our agent. Right? However, I know I am wrong, because in Andrew Ng's notes he says:

[screenshot of Andrew Ng's lecture notes]

There he says that we don't know the reward function explicitly. That seems bizarre to me. I know I am wrong, and I'd love it if someone could clarify in what scenarios we actually have to learn $R$ from samples.

(Obviously, the transition probabilities have to be learned, because one does not know a priori how the environment will move our agent.)
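As an aside, here is a minimal sketch of what "learning the transition probabilities from samples" can look like, assuming a small discrete state and action space and simple empirical counting; all names below are illustrative:

```python
from collections import defaultdict

# Count observed transitions (s, a) -> s' and normalise the counts to get
# an empirical estimate of P(s' | s, a).
counts = defaultdict(lambda: defaultdict(int))

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_P(s, a, s_next):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0
```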

Best Answer

In his notes, when he says you must "estimate them from data", he does not mean the reward function. You rarely estimate the reward function. You typically learn the value function, which estimates the immediate reward plus the temporally discounted future reward (if the temporal discount is zero, you are estimating just the immediate rewards). Or you can learn Q values, which are values associated with state-action pairs.
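To make the distinction concrete, here is a minimal tabular Q-learning sketch. It assumes a hypothetical environment whose `step(a)` returns the next state, the reward, and a done flag; the names and hyperparameters are illustrative. Note that the reward arrives as an observed sample from the environment, and what the agent actually estimates are the Q values:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
Q = defaultdict(float)                    # Q[(state, action)] -> estimated value

def q_learning_episode(env, actions):
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection from the current Q estimates.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        # The environment hands back the reward as data; the agent never
        # needs an explicit formula for R.
        s_next, r, done = env.step(a)
        # Move the Q estimate toward the sampled one-step return.
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
```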

In summary, the reward function and the true transition function are defined by the environment. The agent learns estimates of things like the transition function, the Q values, and the value function.