Implementing temporal difference learning for a random walk in Python

machine-learning, python

I'm trying to reproduce an example from Richard Sutton's book on Reinforcement Learning (Chapter 6 of this PDF). Here is the problem statement:

[Problem statement figure from the book: the five-state random walk example.]

The example discusses the difference between Monte Carlo (MC) and Temporal Difference (TD) learning, but I'd just like to implement TD learning so that it converges.

Following my understanding of the book and http://www.scholarpedia.org/article/Temporal_difference_learning, I tried to reproduce this example with the following Python code:

import numpy as np
np.set_printoptions(precision=3, suppress=True)

class Chain:
    gamma = 1       # Discount factor for future rewards (=1, i.e. undiscounted, because every episode terminates)

    def __init__(self, alpha = 0.1, values = np.ones(7)*0.5):
        self.nodes = ["Left", "A", "B", "C", "D", "E", "Right"]
        self.position = 3       # Initial position is "C"
        self.node = self.nodes[self.position]
        self.values = values
        self.terminated = False
        self.alpha = alpha

    def move(self):
        if not self.terminated:
            direction = np.random.choice(["left", "right"])

            if direction == "left":
                print("Moving to the left.")
                new_position = self.position - 1
            elif direction == "right":
                print("Moving to the right.")
                new_position = self.position + 1

            reward = self.get_reward(new_position)

            # self.update_value_table_immediate_reward(self.position, new_position)
            self.update_value_table_future_reward(self.position, new_position)

            self.position = new_position
            self.node = self.nodes[self.position]

            if (self.node == "Left") or (self.node == "Right"):
                print("The random walk has terminated.")
                self.terminated = True

        else:
            print("Moving is not possible as the random walk has already terminated.")

    def get_reward(self, new_position):
        return 1.0 if self.nodes[new_position] == "Right" else 0.0

    def update_value_table_immediate_reward(self, old_position, new_position):
        reward = self.get_reward(new_position)
        self.values[old_position] += self.alpha * (reward - self.values[old_position])

    def update_value_table_future_reward(self, old_position, new_position):
        reward = self.get_reward(new_position) + self.gamma * self.values[new_position]
        self.values[old_position] += self.alpha * (reward - self.values[old_position])


value_estimates = np.ones(7) * 0.5      # Initialize all estimates at 0.5

N_episodes = 1000

for episode in range(N_episodes):
    c = Chain(values = value_estimates)         # Use value estimates from previous iteration
    while not c.terminated:
        c.move()
    value_estimates = c.values                  # Update value estimates

print(value_estimates[1:-1])

If I run the code as it is, I would expect it to print the following array at the end:

[ 0.1667,  0.3333,  0.5   ,  0.6667,  0.8333]

with some slight fluctuations about the expected values. However, instead I get this:

[ 0.691  0.858  1.044  1.214  1.345]

In short, the value estimates are not converging to the expected future reward in each state.
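For reference, the expected array above comes from solving the walk's Bellman equations directly, $V(s) = 0.5\,V(s_{\text{left}}) + 0.5\,V(s_{\text{right}})$ with the terminals contributing 0 and 1, which gives $V = (1/6, 2/6, 3/6, 4/6, 5/6)$. A minimal sketch of that check (the matrix set-up is my own, not from the book):

import numpy as np
np.set_printoptions(precision=3, suppress=True)

# Solve the Bellman equations V_i = 0.5*V_{i-1} + 0.5*V_{i+1}
# for the five interior states A..E, with the "Left" terminal worth 0
# and entering "Right" worth a reward of 1.
n = 5
A = np.eye(n)
b = np.zeros(n)
for i in range(n):
    if i > 0:
        A[i, i - 1] -= 0.5      # left neighbour is another interior state
    if i < n - 1:
        A[i, i + 1] -= 0.5      # right neighbour is another interior state
    else:
        b[i] += 0.5 * 1.0       # right neighbour is the "Right" terminal (reward 1)
print(np.linalg.solve(A, b))    # approx. [0.167 0.333 0.5 0.667 0.833]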

If I replace the call to update_value_table_future_reward with update_value_table_immediate_reward in the Chain class, I get the following output:

[ 0.     0.     0.     0.     0.559]

This is more-or-less what I would expect: the expected immediate reward in state "E" is 0.5, and zero in the other states.
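As a quick check of that number: from "E" the walk steps to "Right" with probability 0.5 (reward 1) and to "D" with probability 0.5 (reward 0), so the expected immediate reward is

$0.5 \cdot 1 + 0.5 \cdot 0 = 0.5,$

and from every other state the only reachable rewards are 0.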

What I thus suspect is that there is something wrong with the update_value_table_future_reward function, but I can't see what. Following the Scholarpedia article, I'm trying to implement

$p_t := p_t + \alpha\left[y_{t+1} + \gamma P_t\left(x_{t+1}\right) - p_t\right],$

where $:=$ denotes the update at each time step, $p_t$ is the look-up-table value of the current state, $\alpha$ is the learning rate (set to 0.1 as in the book), $\gamma$ is the discount factor for future rewards (set to 1 because the episodes are finite and undiscounted), $y_{t+1}$ is the immediate reward received after the move, and $P_t\left(x_{t+1}\right)$ is the value of the next state (the state moved to) according to the current value table.
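In the notation of the book, this should correspond to the tabular TD(0) update

$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right].$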

Any ideas on how to implement TD learning for this example?

Best Answer

I think you're double-counting in update_value_table_future_reward when you reach a terminal state: the terminal entries of your value table stay at their initial value of 0.5, so bootstrapping from them inflates the target. The value of a terminal state should be treated as zero, so only the immediate reward should be used on the last step:

def update_value_table_future_reward(self, old_position, new_position):
    # Do not bootstrap from a terminal state: its value is 0 by definition,
    # so the target on the last step is just the immediate reward.
    if self.nodes[new_position] in ("Left", "Right"):
        reward = self.get_reward(new_position)
    else:
        reward = self.get_reward(new_position) + self.gamma * self.values[new_position]
    self.values[old_position] += self.alpha * (reward - self.values[old_position])

After 5,000 episodes, I get the following values:

[ 0.163 0.357 0.557 0.67 0.852]
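For what it's worth, an equivalent fix is to pin the two terminal entries of the value table at 0, so the original one-line bootstrapped update can stay as it is. A minimal self-contained sketch along those lines (the seed, episode count and indexing are arbitrary choices of mine, not from the question):

import numpy as np
np.set_printoptions(precision=3, suppress=True)
np.random.seed(0)

alpha, gamma = 0.1, 1.0

# Index 0 = "Left" terminal, 1..5 = states A..E, 6 = "Right" terminal.
values = np.full(7, 0.5)
values[0] = values[6] = 0.0     # terminal states are worth 0, so bootstrapping from them adds nothing

for episode in range(5000):
    position = 3                # every walk starts in "C"
    while position not in (0, 6):
        new_position = position + np.random.choice([-1, 1])
        reward = 1.0 if new_position == 6 else 0.0
        # TD(0) update: bootstrap from the current estimate of the next state
        values[position] += alpha * (reward + gamma * values[new_position] - values[position])
        position = new_position

print(values[1:-1])             # fluctuates around [0.167 0.333 0.5 0.667 0.833]

With $\alpha$ held at 0.1 the estimates keep fluctuating around $[1/6, 2/6, 3/6, 4/6, 5/6]$ rather than settling exactly, which matches the spread in the 5,000-episode values above.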
