Solved – Is value iteration considered a reinforcement learning algorithm or planning algorithm

machine learning, reinforcement learning, terminology

Recall value iteration:

$$
\begin{array}{l}
\text{Initialize } V_0(s) = 0, \ \forall s \in S \\
\text{Repeat until convergence } \{ \\
\quad \text{Given the value function } V_i(s),\ s \in S, \text{ for iteration } i, \text{ do:} \\
\quad V_{i+1}(s) := \max_{a \in A} \sum_{s'} T(s' | s, a)[ R(s,a,s') + \gamma V_i(s') ] \\
\}
\end{array}
$$
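To make the point about the model concrete, here is a minimal sketch of the update above in plain Python. The nested-list layout `T[s][a][s2]` and `R[s][a][s2]`, the function name `value_iteration`, the stopping tolerance, and the two-state toy MDP are all assumptions made for illustration; they are not part of the question. Note that the function only ever reads `T` and `R` and never samples a transition.

```python
def value_iteration(T, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Tabular value iteration on a fully known MDP.

    T[s][a][s2] is the probability of reaching s2 from s under action a.
    R[s][a][s2] is the reward received on that transition.
    """
    n_states = len(T)
    V = [0.0] * n_states  # V_0(s) = 0 for all s
    for _ in range(max_iters):
        # Bellman optimality backup, computed directly from the model.
        V_new = [
            max(
                sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(len(T[s]))
            )
            for s in range(n_states)
        ]
        # Stop once the largest change across states is below the tolerance.
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy MDP: in state 0, action 0 stays put (reward 0) and
# action 1 moves to state 1 (reward 1). State 1 is absorbing with reward 0.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]

V = value_iteration(T, R, gamma=0.9)
print(V)
```

For this toy problem the values settle at 1 for state 0 and 0 for state 1, since the best that can be done from state 0 is to collect the single reward immediately.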

It seems that the algorithm assumes that the rewards and the transition probabilities are known. Hence, the environment is known and there really isn't any interaction with it (no samples are drawn from the environment); i.e., we are given the model of the environment, compute the value function, and then compute the optimal policy (if we want) as follows:

$$ \pi(s) = \arg\max_{a \in A} \sum_{s'} T(s' | s, a)[ R(s,a,s') + \gamma V_i(s') ] $$
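As a rough sketch of this extraction step, the snippet below picks the greedy action in each state from a given value function. The function name `greedy_policy`, the nested-list layout of `T` and `R`, and the toy two-state MDP with its value function are assumptions chosen for illustration only.

```python
def greedy_policy(T, R, V, gamma=0.9):
    """Return, for each state, the action maximizing the one-step lookahead.

    T[s][a][s2] and R[s][a][s2] describe the known model; V is a list of
    state values. Ties are broken in favor of the lowest action index.
    """
    n_states = len(T)

    def lookahead(s, a):
        # Expected return of taking action a in state s, then following V.
        return sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                   for s2 in range(n_states))

    return [
        max(range(len(T[s])), key=lambda a, s=s: lookahead(s, a))
        for s in range(n_states)
    ]


# Hypothetical toy MDP: in state 0, action 1 moves to the absorbing
# state 1 and earns reward 1; everything else earns 0.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
V = [1.0, 0.0]  # value function assumed to come from value iteration

policy = greedy_policy(T, R, V, gamma=0.9)
print(policy)
```

Like the value update itself, this step needs the full model: without `T` and `R` the lookahead cannot be evaluated from $V$ alone.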

So it seems to me that the only "learning" done is of the policy (and value function). Isn't this type of scheme therefore better considered planning rather than learning?

Best Answer

According to Chapter 9, "Planning and Learning", of the (overall recommended) book Reinforcement Learning: An Introduction by Sutton and Barto, both learning and planning try to estimate a value function and use it to improve the overall policy. The difference is that planning uses simulated experience or knowledge from an environment model, whereas learning uses actual (trial-and-error) experience to learn the value function.

Since value iteration, as a dynamic programming method, requires full knowledge of the environment, it is indeed a planning algorithm. Value iteration is presented in the context of reinforcement learning as a theoretical precursor, for the case where an environment model is available, before moving on to sample-based methods (Monte Carlo, temporal-difference learning) for the case where it is not.
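For contrast, here is a minimal sketch of a sample-based method, tabular Q-learning (a temporal-difference method), on a toy problem. The environment function `step`, the two-state dynamics, the uniformly random behavior policy, and all hyperparameters are assumptions made for illustration. The point is only that the learner never reads the transition probabilities or the reward function; it sees individual sampled transitions.

```python
import random


def step(state, action):
    """Hypothetical environment, opaque to the learner.

    In state 0, action 1 moves to state 1 with reward 1; action 0 stays
    in state 0 with reward 0. State 1 ends the episode.
    """
    if state == 0 and action == 1:
        return 1, 1.0
    return state, 0.0


def q_learning(n_episodes=200, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[s][a], initialized to zero
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.randrange(2)          # explore with uniformly random actions
            s2, r = step(s, a)            # one sampled transition
            target = r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])  # TD update
            s = s2
            if s == 1:                    # terminal state reached
                break
    return Q


Q = q_learning()
print(Q[0])
```

Here the estimates are built purely from trial-and-error experience, which is what makes this learning in the sense described above, whereas value iteration computes the same kind of quantity directly from the model.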

However, as Sutton points out, it is not necessarily helpful to make such a distinction in practice. For example, one can learn the action-value function $Q(s,a)$ and then use it to plan during action selection (e.g. by using heuristic search).
