Solved – Can reinforcement learning be “stateless”?

deep-learning, machine-learning, reinforcement-learning, terminology

After learning a little about RL and Q-learning, I am puzzled that many learning scenarios do not seem to involve a "state".

I understand that if we wish for a robot to navigate around a maze, each position in the maze is a state and the robot has a selection of actions {up, down, left, right}. We want to find a sequence of actions that leads the robot to the exit of the maze or to the highest reward.

However, it seems that in other situations, no state is needed. For example, if two players are playing rock-paper-scissors, at each round of the game, each player throws a hand sign and receives some reward. The goal is to throw a sequence of hand signs to maximize the reward over time. It is not clear to me what the state is in this situation. Is it every unit of time?

Similarly, suppose a single player has a selection of buttons he can press, and every time he presses a button, some reward is provided. Again, there does not seem to be a state involved. There is just action and reward, that's it.

Does anyone know if reinforcement learning can be formulated without states? If so, what distinguishes these types of RL from the RL that requires states?

Best Answer

Reinforcement learning is formulated as a problem with states, actions, and rewards, with transitions between states affected by the current state, chosen action and the environment. That is part of its definition (formulated as a Markov Decision Process), so generally you won't find stateless variants of it that are still called reinforcement learning.

However, there are related stateless problems. Multi-armed bandits have just actions and rewards. Solvers for them learn the expected reward of each action and select the best one; the main optimisation problems are being confident that you really have found the best action, and maximising the total reward you can accumulate whilst still testing which action is best. Your button-pushing example looks a lot like a multi-armed bandit problem. Another common example is advert selection for anonymous visitors to a site: although there is often plenty of data available, a huge amount is hidden, so a practical approach is to treat the probability of a click-through as depending only on the choice of content, which is then the site's action.
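For concreteness, here is a minimal sketch of an epsilon-greedy learner for the button-pushing example, treated as a multi-armed bandit. The three buttons and their hidden payout probabilities are invented purely for illustration.

```python
import random

# Epsilon-greedy multi-armed bandit sketch: no state, just action -> reward.
true_reward_prob = [0.2, 0.5, 0.8]   # hidden payout rate of each button (made up)
n_actions = len(true_reward_prob)
q_estimates = [0.0] * n_actions      # running estimate of each button's value
counts = [0] * n_actions
epsilon = 0.1                        # fraction of presses spent exploring

for step in range(10_000):
    if random.random() < epsilon:
        action = random.randrange(n_actions)          # explore
    else:
        action = q_estimates.index(max(q_estimates))  # exploit current best
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # Incremental mean update of the estimated value of the chosen button
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]

print(q_estimates)  # should approach [0.2, 0.5, 0.8]
```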

There is a "stateful" variant of multi-armed bandits called contextual bandits: some signal (the context, or state) is associated with the correct action, but the actions taken have no effect on what the next state will be. Contextual bandits have states, actions and rewards, but no transition rules, so each round can be treated as an entirely separate event.
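To illustrate the "no transition rules" property, here is a toy contextual bandit sketch; the two contexts, two actions and the reward rule are made up for this example.

```python
import random
from collections import defaultdict

# Contextual bandit sketch: each round an independent context arrives, the
# agent acts, gets a reward, and the next context is unaffected by that choice.
def reward(context, action):
    best = {"A": 0, "B": 1}          # hidden "correct" action per context (made up)
    return 1.0 if action == best[context] else 0.0

q = defaultdict(float)               # value estimate per (context, action)
n = defaultdict(int)
epsilon = 0.1

for step in range(5_000):
    context = random.choice(["A", "B"])                     # i.i.d., no transition rule
    if random.random() < epsilon:
        action = random.randrange(2)                         # explore
    else:
        action = max(range(2), key=lambda a: q[(context, a)])  # exploit
    r = reward(context, action)
    n[(context, action)] += 1
    q[(context, action)] += (r - q[(context, action)]) / n[(context, action)]
```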

A contextual bandit with added transition rules between states, but no influence from the selected action, is essentially a sub-class of the reinforcement learning problem, and you can use most of the same analysis to predict long term reward and learn optimal behaviour.

And just for completeness, there are Markov Reward Processes without agent interaction, which have states and rewards but no actions. It is possible to use reinforcement learning algorithms on these to predict long-term reward and/or the expected long-term value of being in a specific state.
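For example, TD(0) prediction, usually presented as part of reinforcement learning, works directly on a Markov reward process; the tiny three-state chain below is invented just to show the update.

```python
import random

# TD(0) value prediction on a toy Markov reward process: states and rewards,
# no actions. The chain and its rewards are made up for illustration.
transitions = {0: [1], 1: [0, 2], 2: []}   # state 2 is terminal
rewards = {0: 0.0, 1: 0.0, 2: 1.0}         # reward received on entering a state
gamma = 0.9                                 # discount factor
alpha = 0.1                                 # learning rate
value = {s: 0.0 for s in transitions}

for episode in range(2_000):
    s = 0
    while transitions[s]:
        s_next = random.choice(transitions[s])
        r = rewards[s_next]
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        value[s] += alpha * (r + gamma * value[s_next] - value[s])
        s = s_next

print(value)   # estimated long-term value of each state
```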

The rock-paper-scissors game does not fit neatly into any of the above problems because there are two agents (although you could analyse it as a multi-armed bandit if the opponent's policy was fixed to always play with specific unchanging probabilities, or as a contextual bandit if there were multiple such opponents, or a really easy "tell" in their style of play that never changed). Typically a game of rock-paper-scissors would be analysed using game theory; it has the interesting feature that the Nash equilibrium is achieved by a stochastic policy with an equal $\frac{1}{3}$ probability of each choice.

If you wrote a rock-paper-scissors agent to play against a human opponent, you might actually formulate it as a reinforcement learning problem, taking the last N plays as the state, because that could learn to take advantage of human players' poor judgement of randomness.
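A rough sketch of that formulation follows, using the opponent's last N throws as the state and a tabular Q-learning update. The opponent's bias towards repeating its previous throw and the choice N = 2 are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    if mine == theirs:
        return 0.0
    return 1.0 if BEATS[mine] == theirs else -1.0

def opponent(history):
    # Toy "human" flaw: tends to repeat its own previous throw
    if history and random.random() < 0.6:
        return history[-1]
    return random.choice(MOVES)

N = 2                                  # past opponent throws forming the state
q = defaultdict(float)                 # Q(state, action) table
epsilon, alpha, gamma = 0.1, 0.1, 0.9

history = []
for round_ in range(20_000):
    state = tuple(history[-N:])
    if random.random() < epsilon:
        my_move = random.choice(MOVES)                      # explore
    else:
        my_move = max(MOVES, key=lambda m: q[(state, m)])   # exploit
    opp_move = opponent(history)
    r = payoff(my_move, opp_move)
    history.append(opp_move)
    next_state = tuple(history[-N:])
    # Q-learning update, bootstrapping from the best action in the next state
    best_next = max(q[(next_state, m)] for m in MOVES)
    q[(state, my_move)] += alpha * (r + gamma * best_next - q[(state, my_move)])
```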
