Solved – Batch reinforcement learning: Algorithm example

algorithms, machine learning, reinforcement learning

I have been reading Reinforcement Learning: An Introduction by Sutton and Barto (2012) and I have come across the batch learning method. Unfortunately, this method is not described in much detail in the book, and the scientific articles on batch learning are still too advanced for me.

Could anyone please elaborate on the method and provide an example pseudo-algorithm for it?

Batch updating is mentioned in Sutton and Barto on page 137, in the section "Optimality of TD(0)"; here is the relevant paragraph:

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer […] We call this batch updating because updates are made only after processing each complete batch of training data.

After searching for some articles on Google, I tried to understand this one, but was not very successful.

Best Answer

The reference to batch updating does not describe any new or undocumented reinforcement learning method; it is just a subtle re-ordering of how the experience and the updates interact. You can use batch updates where experience is in short supply (as opposed to computation time).

Essentially you can just use the standard TD update mechanism, but instead of consuming each piece of experience once as it is observed, you store and replay the trajectories that you have seen so far (i.e. without selecting new actions). You can keep repeating them, much like policy evaluation in Dynamic Programming (except using only your sampled trajectories), until the value estimates change by only a small amount - or perhaps keep repeating for e.g. 10 seconds, until the next pieces of experience arrive.

Obviously this approach has limitations. It can only evaluate actions that have been taken so far; it cannot explore further. It will probably suffer from sampling bias in the value estimates. However, it does represent, in some ways, the best estimates of state values seen so far, and provided the stochastic parts of the MDP or policy are not too wild, it may be a good choice where gaining experience is costly or time-consuming.

In terms of implementation, you could do something like this (for SARSA(0)):

Initialize $Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$

Repeat:

$\qquad$ Initialize $S$

$\qquad$ Choose $A$ from $S$ using policy derived from $Q$ (e.g., ε-greedy)

$\qquad$ Repeat (for each step of episode):

$\qquad\qquad$ Take action $A$, observe $R, S'$

$\qquad\qquad$ Choose $A'$ from $S'$ using policy derived from $Q$ (e.g., ε-greedy)

$\qquad\qquad$ Store the five sampled values $S, A, R, S', A'$ in the experience table

$\qquad\qquad$ If experience table is large enough, repeat:

$\qquad\qquad\qquad$ For each $(S, A, R, S', A')$ in the experience table:

$\qquad\qquad\qquad\qquad Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma Q(S', A') - Q(S, A)]$

$\qquad\qquad$ Until computation time runs out, or the maximum change in $Q$ is small

$\qquad\qquad$ If the experience table was used, then clear it (as the policy will have changed)

$\qquad\qquad S \leftarrow S'; A \leftarrow A'$

$\qquad$ Until $S$ is terminal
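To make the above more concrete, here is a minimal Python sketch of tabular SARSA(0) with this kind of batch replay. The `env_reset`/`env_step` interface, the hyperparameters, and the stopping thresholds are illustrative assumptions, not something prescribed by the pseudocode.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def batch_sarsa0(env_reset, env_step, actions, episodes=200,
                 alpha=0.1, gamma=0.95, eps=0.1,
                 batch_size=500, sweeps=50, tol=1e-4):
    """Tabular SARSA(0) with batch replay of stored (S, A, R, S', A') tuples.

    env_reset() -> initial state and env_step(state, action) -> (reward,
    next_state, done) are assumed interfaces, not part of any particular library.
    """
    Q = defaultdict(float)      # Q(terminal, .) stays 0 because it is never updated
    experience = []             # stored (S, A, R, S', A') tuples

    for _ in range(episodes):
        s = env_reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            a2 = None if done else epsilon_greedy(Q, s2, actions, eps)
            experience.append((s, a, r, s2, a2))

            # If the experience table is large enough, replay it repeatedly
            if len(experience) >= batch_size:
                for _ in range(sweeps):
                    max_change = 0.0
                    for (S, A, R, S2, A2) in experience:
                        target = R if A2 is None else R + gamma * Q[(S2, A2)]
                        delta = alpha * (target - Q[(S, A)])
                        Q[(S, A)] += delta
                        max_change = max(max_change, abs(delta))
                    if max_change < tol:      # changes in Q are small
                        break
                experience.clear()            # policy has changed, discard old data

            s, a = s2, a2
    return Q
```

The replay loop is exactly the "repeat over the experience table" step of the pseudocode; everything else is the usual online SARSA(0) episode loop.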

There are obviously a few variations of this - you can interleave the experience replay/batch step in different places, which might work better for episodic or continuous problems. You could keep the online learning updates too, possibly with a different learning rate (a small sketch of this follows below). If you have an episodic problem, then your experience table really should contain at least one episode end, and definitely so if $\gamma = 1$. If you are performing policy evaluation of a fixed policy (as opposed to control, where you adjust the policy), then you can keep the older experience; you don't need to cull it.
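As one hypothetical way to keep the online updates alongside the batch sweeps, you could apply a single small-step update as each transition arrives and let the replay sweeps use a larger step size; the names and step sizes below are assumptions made for the sketch, not values from the answer above.

```python
ALPHA_ONLINE = 0.02   # small step size for the immediate, single-sample update
ALPHA_BATCH = 0.1     # larger step size used when sweeping the experience table

def online_sarsa0_update(Q, s, a, r, s2, a2, gamma=0.95):
    """Apply one immediate SARSA(0) update before storing the transition.

    The stored (S, A, R, S', A') tuple is still replayed later with
    ALPHA_BATCH, so each transition contributes both online and in batch.
    """
    target = r if a2 is None else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += ALPHA_ONLINE * (target - Q[(s, a)])
```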

Furthermore, this is very similar to the experience replay approaches that are useful when using function approximators such as neural networks. In that case your experience table can be sampled to create mini-batch updates for supervised learning of the estimator. In an off-policy scenario like Q-learning, you can drop the storage of $A'$ and substitute the maximising action at the time of replay, which also allows you to keep more history.
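To illustrate that last point, here is a sketch of what the replay step could look like with Q-learning targets; storing a done flag instead of $A'$ and taking the max over actions at replay time are the assumptions here.

```python
def q_learning_replay(Q, experience, actions, alpha=0.1, gamma=0.95, sweeps=50):
    """Replay stored (S, A, R, S', done) tuples with Q-learning targets.

    The maximising action is looked up at replay time, so A' never needs to
    be stored, and experience gathered under older policies stays usable.
    """
    for _ in range(sweeps):
        for (S, A, R, S2, done) in experience:
            target = R if done else R + gamma * max(Q[(S2, a)] for a in actions)
            Q[(S, A)] += alpha * (target - Q[(S, A)])
```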