Solved – Q-learning vs. Value Iteration

machine learning · q-learning · reinforcement learning

One of the primary differences I've seen stated between Q-learning, as a model-free algorithm, and Value Iteration, as a model-based algorithm, is that Value Iteration requires a "model of the environment", in that it incorporates knowledge of how to transition from one state to another (which I've understood as a policy over actions) and knowledge of the rewards received at each state. I've also seen it stated that

"Value iteration works fine, but it has two weaknesses: first, it can take a long time to converge in some situations, even when the underlying policy is not changing"

This has left me quite confused.

While I can see that it needs a given set of transition probabilities, I don't see anything in its update equation that explains why it requires knowledge of rewards beforehand, or why it cannot be trained in an online way as Q-learning can. The next thing is that I'm not sure why the policy would ever change in value iteration. I thought the whole idea was to "predict" how a particular policy would perform by using value iteration to find the value function for that policy, and then "control" this behaviour using policy iteration: one would adapt the previous policy slightly, perform value iteration, find a new value function, compare it to the previous value function, and check whether it had converged.

Is this an incorrect interpretation?

The equations given for value iteration and Q-learning are:

$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$$

$$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$$

Best Answer

I don't see anything in its update equation that explains why it requires knowledge of rewards beforehand, or why it cannot be trained in an online way as Q-learning can.

The usual Value Iteration update rule is:

$$v(s) \leftarrow \max_a \left[ \sum_{r,s'} p(r, s'|s,a)\,\bigl(r + \gamma v(s')\bigr) \right]$$

or it might be written as some variation of

$$v(s) \leftarrow \max_a \left[ R_s^a + \gamma \sum_{s'} p(s'|s,a)\,v(s') \right]$$

In the first equation, the sum is over possible reward values $r$, so you must already know their distribution. In the second equation, $R_s^a$ is the expected reward for a particular state, action pair. It is also possible to use $R_{ss'}^a$, the expected reward for a particular transition. None of these are observed reward instances.
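To make that concrete, here is a minimal sketch of one sweep of the second update rule. The array names `P` (for $p(s'|s,a)$) and `R` (for $R_s^a$) are hypothetical, not from any particular library; the point is that both have to be supplied in advance, which is exactly the model that Q-learning does not need:

```python
import numpy as np

def value_iteration_sweep(P, R, V, gamma=0.99):
    """One synchronous sweep: v(s) <- max_a [ R_s^a + gamma * sum_s' p(s'|s,a) v(s') ].

    P[s, a, s2] is the known transition probability p(s'|s,a),
    R[s, a] is the known expected reward R_s^a.
    """
    n_states, n_actions = R.shape
    V_new = np.empty(n_states)
    for s in range(n_states):
        # The backup evaluates every action under the known model.
        action_values = [R[s, a] + gamma * np.dot(P[s, a], V)
                         for a in range(n_actions)]
        V_new[s] = max(action_values)
    return V_new
```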

You cannot use observed reward instances directly in Value Iteration and trust that this will work in expectation over the long term, because you need to take the max over possible actions, and it is not usually possible to observe the outcome of taking every possible action from the same state.

You could in theory maintain a separate table of estimated expected rewards, and that would then be a complete RL algorithm, not much different from other single-step TD algorithms. It would be a kind of half-way house to Q-learning as you would still need to know transition probabilities in order to use it.
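A rough sketch of that half-way house might look like the following, where the transition probabilities `P` are still given but the expected rewards are estimated online with a running mean. All names here are hypothetical, and this is a sketch rather than a standard named algorithm:

```python
import numpy as np

class HalfwayValueIteration:
    """Value Iteration with known transitions but rewards estimated from samples."""

    def __init__(self, P, gamma=0.99):
        self.P = P                                      # known model p(s'|s,a), shape (S, A, S)
        n_states, n_actions, _ = P.shape
        self.R_hat = np.zeros((n_states, n_actions))    # running-mean reward estimates
        self.counts = np.zeros((n_states, n_actions))   # sample counts per (s, a)
        self.V = np.zeros(n_states)
        self.gamma = gamma

    def observe(self, s, a, r):
        """Fold one observed reward into the estimate of R_s^a."""
        self.counts[s, a] += 1
        self.R_hat[s, a] += (r - self.R_hat[s, a]) / self.counts[s, a]

    def backup(self, s):
        """Value Iteration backup at s using estimated rewards and the known model."""
        n_actions = self.P.shape[1]
        self.V[s] = max(self.R_hat[s, a] + self.gamma * np.dot(self.P[s, a], self.V)
                        for a in range(n_actions))
```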

The equation you give for value iteration uses a slightly different view of reward:

$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$$

Here $R(s)$ appears to stand for "fixed reward from arriving in this state". If you have a full model to go with the state transitions, then it is provided. But could you learn it from observations? Yes, with one important caveat: The formula assumes that each state has one and only one associated reward value.

If you can safely assume that your MDP has a fixed reward associated with landing in each specific state (and often you can, if you have constructed the MDP as part of a game or virtual environment), then yes, you should be able to run value iteration in an online fashion, much like Q-learning, using that update rule. The equation will work just fine without you needing to store or estimate anything about immediate rewards, since the expected and observed immediate rewards are the same. Note, however, that this is not a completely generic solution that applies to all MDPs.
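Under that fixed-reward assumption, an online backup could be as simple as the sketch below. The names are again hypothetical; `T` holds the known transition probabilities, and the reward used is simply whatever was observed on arriving in `s`:

```python
import numpy as np

def online_backup(U, T, s, observed_reward, gamma=0.99):
    """Update U[s] in place from a single visit to state s.

    Assumes R(s) is deterministic, so the reward observed on arrival in s can be
    used directly: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s').
    T[s, a, s2] are the known transition probabilities; nothing about rewards is stored.
    """
    n_actions = T.shape[1]
    U[s] = observed_reward + gamma * max(np.dot(T[s, a], U) for a in range(n_actions))
    return U
```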

The next thing is that I'm not sure why the policy would ever change in value iteration.

The policy changes implicitly. The deterministic policy from value iteration is the one that takes the action with the best expected return:

$$\pi(s) \leftarrow \operatorname{argmax}_a \left[ \sum_{r,s'} p(r, s'|s,a)\,\bigl(r + \gamma v(s')\bigr) \right]$$

This clearly will change as your values converge. All optimal control algorithms need to deal with this, which is why many use a rolling average (a learning rate parameter) to estimate values, and not a true mean.
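As a small illustration, reusing the same hypothetical `P`, `R`, `V` arrays as in the sketches above, the greedy policy can be read off the current values at any point, and the one extracted early in training will generally differ from the one extracted after the values have converged:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.99):
    """pi(s) = argmax_a [ R_s^a + gamma * sum_s' p(s'|s,a) v(s') ]."""
    n_states, n_actions = R.shape
    pi = np.empty(n_states, dtype=int)
    for s in range(n_states):
        pi[s] = int(np.argmax([R[s, a] + gamma * np.dot(P[s, a], V)
                               for a in range(n_actions)]))
    return pi
```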
