Solved – Forms of the Reward function in Reinforcement Learning: A vector, a matrix, a linear combination

machine learning, reinforcement learning

Personally, one of the most intuitive forms of the reward function in reinforcement learning is $R(s,a)$ written as a matrix, where $s$ is a state and $a$ is an action. This way, if an agent is in state $s$ and plans to take action $a$, the corresponding reward is easy to read off from the matrix $R(s,a)$.
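As a minimal sketch (in Python, with made-up states, actions, and reward values), looking up the reward in this matrix form is a single indexing operation:

```python
import numpy as np

# A made-up 3-state, 2-action MDP: R[s, a] is the reward for taking
# action a in state s.
R = np.array([
    [ 0.0, -1.0],   # state 0
    [ 5.0,  0.0],   # state 1
    [-1.0, 10.0],   # state 2
])

s, a = 2, 1
print(R[s, a])  # 10.0 -- reward for action 1 in state 2
```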

My questions are:

(a) If the reward function is a vector of dimension $n$, where $n$ is the number of states, what is the interpretation of this?

(b) Also, in the case of an infinite state space, $R$ is most likely a linear combination of features (in feature-based reinforcement learning), $R = \sum_i \alpha_i \phi_i$, where the $\phi_i$ are the features and the $\alpha_i$ are the weights. How would one interpret this while the agent is learning? The features seem to change as time goes by, don't they? (Versus the action and the state in the case of $R(s,a)$, where everything is constant.)

Best Answer

For each case:

a) A reward vector of size $n$ simply means the reward function is a function of ONLY states, not actions (remember that in the general case the reward function is defined as $r:\mathcal{S} \times \mathcal{A} \longrightarrow \mathbb{R}$). The interpretation is simply that $r(s)$ is the reward you get for being in state $s$.
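A minimal sketch of this (Python, with made-up values): the reward function collapses to a one-dimensional lookup, with no dependence on the action:

```python
import numpy as np

# Made-up 4-state example: r[s] is the reward for being in state s,
# regardless of which action is taken there.
r = np.array([0.0, 0.0, -1.0, 10.0])

print(r[3])  # 10.0 -- the same for every action taken in state 3
```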

b) For infinite state spaces, you are right that function approximation is used. One common assumption, not usually mentioned in these cases, is that the environment is static, i.e., the features are constant with respect to time. Of course one can define some form of dynamic features, but this just explodes the dimensions. So if you take the assumption that the features are constant with respect to time, then $r(s, a)$ is equivalent to $r = \sum_i \alpha_i \phi_i$ if you define indicator features for every state and action. Remember that the features are also functions, defined as

$$\phi_i:\mathcal{S} \times \mathcal{A} \longrightarrow \mathbb{R}$$

in the most general case, but they can also be defined with respect to states only. A concrete example in Gridworld could be a feature indicating whether a state is terminal, and another feature indicating whether a state is blocked.
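To make that concrete, here is a sketch of the Gridworld example (Python; the particular states, weights, and feature definitions are made up for illustration):

```python
import numpy as np

# Made-up 2x2 Gridworld flattened to states 0..3, with two state-only
# indicator features as in the example above:
#   phi_0(s) = 1 if s is terminal, else 0
#   phi_1(s) = 1 if s is blocked,  else 0
terminal = {3}
blocked = {1}

def phi(s):
    """Feature vector (phi_0(s), phi_1(s)) for state s."""
    return np.array([float(s in terminal), float(s in blocked)])

# Static weights alpha_i, so r(s) = sum_i alpha_i * phi_i(s).
alpha = np.array([10.0, -5.0])

for s in range(4):
    print(s, alpha @ phi(s))
# -> states 0 and 2 give 0.0, state 1 gives -5.0, state 3 gives 10.0
```

Because the features are static, the only thing that changes during learning is the agent's estimate of the weights $\alpha_i$, not the features themselves.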