Solved – Why are value iteration and policy iteration classified as dynamic programming algorithms?

dynamic programming, policy-iteration, reinforcement learning, value-iteration

Algorithms like policy iteration and value iteration are often classified as dynamic programming methods that try to solve the Bellman optimality equations.

My current understanding of dynamic programming is this:

  1. It is a method applied to optimization problems.
  2. DP problems exhibit optimal substructure, i.e., the optimal solution to a problem contains optimal solutions to its subproblems.
  3. These subproblems are not independent of each other, but are overlapping.
  4. There are two approaches: one is bottom-up, the other is top-down (see the sketch after this list).
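As a concrete illustration of these two approaches (a minimal Python sketch on a toy problem, not tied to RL), the Fibonacci recurrence has heavily overlapping subproblems and can be solved either top-down with memoization or bottom-up with a table:

```python
from functools import lru_cache

# Top-down: recurse from the full problem and cache (memoize) results,
# so each overlapping subproblem is computed only once.
@lru_cache(maxsize=None)
def fib_top_down(n):
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

# Bottom-up: solve the smallest subproblems first and fill a table,
# so larger subproblems can reuse the stored answers.
def fib_bottom_up(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]
```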

I have the following questions:

  1. Is this understanding of DP comprehensive? Does every DP algorithm have an optimal substructure with overlapping subproblems?

  2. How do policy iteration and value iteration fit into this scheme? Can we call them bottom-up or top-down?

Best Answer

Have you seen Silver's lecture on dynamic programming? Note that Bellman coined the term "dynamic programming"; his 1957 book was titled "Dynamic Programming" (see Wikipedia).

  1. DP is an algorithmic technique for problems that have optimal substructure and overlapping subproblems. In contrast, if a problem's subproblems do not overlap, each subproblem only needs to be solved once, so there is nothing for DP to reuse.
  2. In the top-down DP approach we find a solution based on previously stored results. In policy iteration (policy evaluation + policy improvement) and value iteration we likewise update the value of each state from older estimates of the state values and of the policy (see the sketches below).
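To make point 2 concrete, here is a minimal sketch of tabular value iteration. The MDP format is an assumption made for illustration (P[s][a] is a list of (probability, next_state, reward) tuples), and the names are not from any particular library:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Back up every state's value from the previous (older) estimates,
    repeating the sweep until the values stop changing."""
    V = np.zeros(n_states)
    while True:
        V_old = V.copy()  # older estimates of the state values
        for s in range(n_states):
            # Bellman optimality backup: best one-step lookahead using V_old
            V[s] = max(
                sum(p * (r + gamma * V_old[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
        if np.max(np.abs(V - V_old)) < tol:  # iterate until convergence
            return V
```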

I think there is also a difference from "classical" DP: in policy iteration and value iteration we apply the update steps iteratively until convergence, whereas in classical DP the updates can usually be done in a single pass over the subproblems.
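Under the same assumed P[s][a] format, a sketch of policy iteration shows this pattern: both the inner policy evaluation and the outer evaluation/improvement loop repeat until nothing changes, rather than finishing in one pass:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement
    until the policy no longer changes."""
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        while True:
            V_old = V.copy()
            for s in range(n_states):
                V[s] = sum(p * (r + gamma * V_old[s2]) for p, s2, r in P[s][policy[s]])
            if np.max(np.abs(V - V_old)) < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        new_policy = np.array([
            max(range(n_actions),
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in range(n_states)
        ])
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```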
