Solved – State space for Markov Decision Processes

markov-process

I'm currently trying to formulate an MDP for a Reinforcement Learning (RL) task. Having read a variety of papers where RL has been applied, I've been left somewhat confused as to what can be considered part of the state space. I was always under the impression that the state formed the agent's observable world and actions taken by the agent would cause a state transition. However, many authors add state variables that are useful for making predictions but will not and cannot change as the result of an action being selected.
For example, I've seen authors use the current time as a state variable to allow the agent to account for time-varying conditions, such as peak times of day, when controlling traffic lights. In addition, some authors have included things like real-time electricity pricing in the state space for planning demand response activities in the home. No action the user takes can possibly cause a change in the price of a unit of electricity, but obviously the decision the agent makes depends on it.
In short, can I have the following state space {price, currentTime} in my MDP, or does it need to be modelled differently?

Best Answer

I was always under the impression that the state formed the agent's observable world and actions taken by the agent would cause a state transition.

You are right that the agent's action $a$ causes the state transition, as defined by the transition function: $$P(s'|s,a)$$

The state should contain enough (but not too much) information for the problem to be solved. Furthermore, this information should be fully observable (i.e. you should always be able to deduce the current state); otherwise you need a Partially Observable Markov Decision Process (POMDP).

Once you have defined the MDP, i.e. the states, actions, transition function, and rewards, you can find the policy $\pi(s)$ (through Value Iteration or Policy Iteration) that maximizes the expected reward. The policy is a function:

$$\pi: S \rightarrow A$$

Thus, after having learnt the policy, you can apply it by passing the current state $s$ as an argument, and it returns the best action $a$ to take according to the policy.
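As an illustration, here is a minimal value-iteration sketch for a toy discrete MDP. The state names, actions, transition probabilities and rewards are hypothetical placeholders, not taken from your problem; the point is only to show how the transition function $P(s'|s,a)$ and rewards turn into a policy.

```python
# Minimal value-iteration sketch for a small discrete MDP.
# The states, actions, probabilities and rewards below are hypothetical.

# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "light": {
        "keep":   [(0.8, "light", 1.0), (0.2, "dense", 0.0)],
        "switch": [(0.6, "light", 0.5), (0.4, "dense", 0.0)],
    },
    "dense": {
        "keep":   [(0.9, "dense", -1.0), (0.1, "light", 1.0)],
        "switch": [(0.5, "dense", -0.5), (0.5, "light", 1.0)],
    },
}
gamma = 0.9                      # discount factor
V = {s: 0.0 for s in P}          # value estimate per state

# Value iteration: apply the Bellman optimality update until convergence:
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V(s'))
for _ in range(1000):
    new_V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in P.items()
    }
    if max(abs(new_V[s] - V[s]) for s in P) < 1e-8:
        V = new_V
        break
    V = new_V

# Extract the greedy policy pi(s) = argmax_a of the same expectation.
policy = {
    s: max(
        actions,
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
    )
    for s, actions in P.items()
}
print(V)       # state values
print(policy)  # best action per state
```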


In the case of controlling traffic lights, you could use the following state variables:

  • $S_\text{time}$: the time, for example hours {0..23} or {morning, afternoon, night}.
  • $S_\text{color}$: red, orange, green.
  • $S_\text{traffic}$: light, dense.

The total set of states would then be the Cartesian product: $$S = S_\text{time} \times S_\text{color} \times S_\text{traffic}$$
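If it helps, the Cartesian product is easy to enumerate in code; the variable names below mirror the list above (assuming hourly time, three colours and two traffic levels):

```python
from itertools import product

# Enumerate S = S_time x S_color x S_traffic as tuples.
S_time = range(24)                       # hours 0..23
S_color = ["red", "orange", "green"]
S_traffic = ["light", "dense"]

states = list(product(S_time, S_color, S_traffic))
print(len(states))   # 24 * 3 * 2 = 144
print(states[0])     # (0, 'red', 'light')
```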

You could now define different probabilities based on different situations:

$$P(S'=\{\text{morning, red, dense}\} \;|\; S=\{\text{morning, red, light}\}, a) = \ldots$$
$$P(S'=\{\text{morning, green, dense}\} \;|\; S=\{\text{morning, red, light}\}, a) = \ldots$$
etc.
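One possible way to record such probabilities in code, continuing the enumeration above, is a table keyed by (state, action) pairs. The action names and numeric values below are placeholders, not measured data:

```python
# Transition probabilities keyed by (state, action) pairs; the values
# below are placeholders, not measured data.
P = {
    (("morning", "red", "light"), "keep_red"): {
        ("morning", "red", "dense"): 0.3,
        ("morning", "red", "light"): 0.7,
    },
    (("morning", "red", "light"), "switch_to_green"): {
        ("morning", "green", "dense"): 0.2,
        ("morning", "green", "light"): 0.8,
    },
    # ... one entry per (state, action) pair
}
```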


In addition, some authors have included things like real-time electricity pricing in the state space for planning demand response activities in the home. No action the user takes can possibly cause a change in the price of a unit of electricity, but obviously the decision the agent makes depends on it.

Here you could include pricing in the state, so that when you execute the policy, the state variable $S_\text{price}$ is set to the current electricity price. Note that you have to discretize these values, or look into continuous-state MDPs. However, to learn a good policy you also need to define the probabilities of going from one state to another: the transition function. So you need to estimate beforehand the probabilities of going from one price state to another.
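One way to do that is to estimate the price transitions empirically from a historical price series. The sketch below assumes hypothetical hourly prices and an assumed three-bin discretization; in practice you would use your own data and thresholds:

```python
from collections import Counter, defaultdict

# Hypothetical historical hourly prices (e.g. cents/kWh).
prices = [10.2, 11.5, 14.8, 21.0, 19.3, 12.1, 9.8, 10.5, 15.2, 22.4]

def price_bin(p):
    """Discretize a price into {low, medium, high}; thresholds are assumptions."""
    if p < 12.0:
        return "low"
    if p < 18.0:
        return "medium"
    return "high"

# Count transitions between consecutive price bins ...
counts = defaultdict(Counter)
bins = [price_bin(p) for p in prices]
for b, b_next in zip(bins, bins[1:]):
    counts[b][b_next] += 1

# ... and normalize each row to estimate P(S'_price | S_price).
P_price = {
    b: {b2: n / sum(nexts.values()) for b2, n in nexts.items()}
    for b, nexts in counts.items()
}
print(P_price)
```

Because the price evolves independently of the agent's actions, this estimated price chain can simply be combined with the controllable part of the transition function.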


In short, can I have the following state space {price, currentTime} in my MDP, or does it need to be modelled differently?

So the answer is yes, but you have to make sure that you do not have too many states: the cost of finding a policy grows with the size of the state space, and the state space itself grows exponentially with the number of state variables. Furthermore, to end up with a good policy you need good approximations of the state transition probabilities $P(s'|s,a)$.
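To get a feel for the numbers, using the traffic example above with hourly time: $24 \times 3 \times 2 = 144$ states; adding a price variable discretized into, say, 10 levels multiplies this to $1440$ states. Every extra state variable multiplies the size of the state space in the same way.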
