Solved – Why is the optimal policy non-stationary in the case of finite-horizon problems, whereas it is stationary in the case of infinite-horizon problems

markov-process, reinforcement-learning, stationarity

I have difficulty understanding the meaning of stationary policy in the RL (MDP) setting.

Specifically, let's assume stationary dynamics $$P(s_{t+1}=j|s_t=i,a) = P(s_{k+1}=j|s_k=i,a) \ \forall t,k,i,j,a.$$ In other words, for any action $a$, the probability of transitioning from state $i$ to state $j$ does not change over time.

We know that a stationary policy will always choose the same action in the same state, independent of time, while a non-stationary policy can choose different actions in the same state, depending on the time step.
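For concreteness, here is a minimal sketch of both notions under a hypothetical two-state, two-action example (the function names are illustrative, not from the question): stationary dynamics never reference the time step, a stationary policy is a function of the state alone, and a non-stationary policy also takes the time step as an input.

```python
# Hypothetical two-state (0, 1), two-action (0, 1) example.

def transition_probs(state, action):
    # Stationary dynamics: P(s' | s, a) is the same at every time step t.
    if state == 0 and action == 1:
        return {1: 1.0}      # action 1 moves deterministically to state 1
    return {state: 1.0}      # everything else stays put

def stationary_policy(state):
    # The chosen action depends only on the state, never on the time step.
    return 0 if state == 0 else 1

def non_stationary_policy(state, t):
    # The chosen action may also depend on the current time step t.
    return 1 if t < 5 else 0
```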

I do not understand why, in the case of finite-horizon problems, the optimal policy is non-stationary, while in the case of infinite-horizon problems the optimal policy is stationary.

If this is truly the case, why do most RL algorithms use stationary policies in episodic (i.e. finite-horizon) settings?

Furthermore, in reality, many environments are non-stationary, and it would seem to make more sense to use a non-stationary policy rather than a stationary one. Again, why do most RL algorithms use stationary policies in these cases, too?

Best Answer

We know that a stationary policy will always choose the same action in the same state, independent of time, while a non-stationary policy can choose different actions in the same state, depending on the time step.

This is not entirely correct. A stationary policy can still be stochastic (e.g. have a 50% chance of selecting action 1 and a 50% chance of selecting action 2, regardless of the current time). What makes it stationary is that its mapping from states to (distributions over) actions does not depend on the time step.
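For instance, a stochastic stationary policy could be sketched as follows (a hypothetical example using NumPy); the action distribution is fixed per state and never references the time step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary stochastic policy: for every state, a fixed
# probability distribution over the two actions, independent of time.
action_probs = {0: [0.5, 0.5],   # in state 0: 50/50 between action 0 and action 1
                1: [0.9, 0.1]}   # in state 1: mostly action 0

def stochastic_stationary_policy(state):
    return rng.choice(2, p=action_probs[state])
```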

I do not understand why, in the case of finite-horizon problems, the optimal policy is non-stationary, while in the case of infinite-horizon problems the optimal policy is stationary.

The example you already gave yourself in the comments explains why non-stationary policies may be required in a finite-horizon setting; in general, as the deadline approaches, a distant high-reward state may no longer be reachable in the remaining steps, so the optimal action in the very same state can change with the time remaining. I don't think the same case can be made for the infinite-horizon setting though, since the discounting is the same at every time step (the remaining problem looks identical from every point in time), and there is no time limit that suddenly changes whether or not a highly-rewarding state is still reachable.
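To make the contrast concrete, here is a small self-contained sketch (a hypothetical two-state MDP, not taken from the question or the comments) that compares finite-horizon backward induction with discounted infinite-horizon value iteration. In the finite-horizon case the greedy action in state 0 changes with the number of steps remaining; in the discounted case value iteration converges to a single stationary greedy policy.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustration only).
# State 0: action 0 ("stay") keeps you in state 0 with reward 1,
#          action 1 ("go")   moves you to state 1 with reward 0.
# State 1: both actions keep you in state 1 with reward 3.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] = transition probability
R = np.zeros((n_states, n_actions))             # R[s, a]     = immediate reward
P[0, 0, 0] = 1.0; R[0, 0] = 1.0    # stay in state 0, small reward
P[0, 1, 1] = 1.0; R[0, 1] = 0.0    # invest a step to reach state 1
P[1, :, 1] = 1.0; R[1, :] = 3.0    # state 1 pays 3 per step

# --- Finite horizon: backward induction -----------------------------------
H = 5
V = np.zeros(n_states)                          # value with 0 steps remaining
for steps_to_go in range(1, H + 1):
    Q = R + P @ V                               # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
    print(f"{steps_to_go} step(s) left: best action in state 0 =", Q[0].argmax())
    V = Q.max(axis=1)
# With 1 step left the greedy action in state 0 is 0 ("stay"); with 2 or more
# steps left it is 1 ("go") -> the optimal finite-horizon policy is non-stationary.

# --- Infinite horizon, discounted: value iteration -------------------------
gamma, V = 0.9, np.zeros(n_states)
for _ in range(1000):
    V = (R + gamma * (P @ V)).max(axis=1)
policy = (R + gamma * (P @ V)).argmax(axis=1)
print("Discounted infinite horizon: stationary greedy policy =", policy)
# Action 1 ("go") is greedy in state 0 at every time step -> one stationary policy.
```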

If this is truly the case, why do most RL algorithms use stationary policies in episodic (i.e. finite-horizon) settings? Furthermore, in reality, many environments are non-stationary, and it would seem to make more sense to use a non-stationary policy rather than a stationary one. Again, why do most RL algorithms use stationary policies in these cases, too?

It's difficult to say why this choice was made across the board without looking at specific cases; the reasons may differ from algorithm to algorithm. In many cases, I expect it is done because it is already difficult enough in practice to learn a single stationary policy for a complex problem; learning many different policies for different remaining time horizons would take even more time.

Note that, in cases where the remaining time-to-go is observable to the agent and is also considered to be genuinely important, you can include that time-to-go in your state representation. Then a "stationary" policy that does not change over time (but implicitly still takes time into account by observing the remaining time in the state) can still be optimal.
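A minimal sketch of that construction, assuming the same hypothetical two-state toy problem as above: fold the number of remaining steps into the state, and a fixed (stationary) mapping over the augmented state can still behave differently at different time steps.

```python
# Hypothetical sketch: augment the state with the remaining number of steps.
# The policy below is a fixed mapping over (state, steps_to_go) pairs, so it
# is "stationary" over the augmented state space, yet it still picks different
# actions for the same underlying state at different remaining horizons.

def augmented_stationary_policy(augmented_state):
    state, steps_to_go = augmented_state
    if state == 0 and steps_to_go <= 1:
        return 0   # too late to reach the high-reward state; take the small immediate reward
    return 1       # otherwise, head for (or remain in) the high-reward state

print(augmented_stationary_policy((0, 1)))   # -> 0 (one step left)
print(augmented_stationary_policy((0, 5)))   # -> 1 (plenty of time left)
```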