Solved – Gradient Bandit Algorithm baseline

Tags: gradient-descent, multiarmed-bandit, reinforcement-learning, stochastic-gradient-descent

I am reading Sutton's latest draft of "Reinforcement Learning: An Introduction" and I have come to the Gradient Bandit Algorithm (page 29). I am having a bit of trouble understanding how the baseline should be computed. The book says:

"$\bar{R_t} ∈ \mathbb{R}$ is the average of all the rewards up through and including time $t$, which can be computed incrementally as described in Section 2.4 "

Section 2.4 briefly explains how to compute sample averages incrementally using this formula: $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$. So I am using it to compute the average reward, but I am not getting the results displayed in the book.
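Just to rule out the incremental average itself: a quick sanity check like the following (plain NumPy, independent of my bandit code) shows that the update reproduces the ordinary sample mean:

    import numpy as np

    rewards = np.random.randn(1000)   # any stream of rewards

    avg = 0.0
    for n, r in enumerate(rewards, start=1):
        # Q_{n+1} = Q_n + (R_n - Q_n) / n
        avg += (r - avg) / n

    print(np.isclose(avg, rewards.mean()))   # True: matches the batch sample mean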

These are the book's results:

[plot: the book's results, % optimal action vs. steps]

These are mine:

[plot: my results, % optimal action vs. steps]

This is my code [it is concise, I promise :)]:

    import numpy as np

    def GradientBandit(bandits, iterations, step_size, no_baseline=False):
        k = len(bandits)
        H = np.zeros(k)      # action preferences
        P = np.ones(k) / k   # softmax action probabilities
        N = np.zeros(k)      # (unused)

        is_optimal = np.zeros(iterations)
        avg_reward = 0.0
        for t in range(1, iterations + 1):
            A = np.argmax(P)

            R = bandits.reward(A)
            # incremental sample average of all rewards up to and including time t
            avg_reward = avg_reward + (1.0 / t) * (R - avg_reward)
            baseline = 0.0 if no_baseline else avg_reward

            # preference updates for the gradient bandit algorithm
            H[A] = H[A] + step_size * (R - baseline) * (1 - P[A])
            for a in range(k):
                if a != A:
                    H[a] = H[a] - step_size * (R - baseline) * P[a]

            # softmax over the preferences
            aux_exp = np.exp(H)
            P = aux_exp / np.sum(aux_exp)
            is_optimal[t - 1] = 1.0 if bandits.is_optimal(A) else 0.0

        return P, is_optimal

I am not sure whether I have misunderstood how the baseline is supposed to be computed. At first I thought it might be a per-action baseline, but in the derivation the book states that the baseline must not depend on the selected action for the gradient to come out correctly. Am I wrong?
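If I follow the derivation correctly, the baseline term only vanishes because the action probabilities sum to one (here $B_t$ is a generic baseline and $H_t(a)$ are the preferences):

$$\sum_x B_t \frac{\partial \pi_t(x)}{\partial H_t(a)} \;=\; B_t \frac{\partial}{\partial H_t(a)} \sum_x \pi_t(x) \;=\; B_t \frac{\partial}{\partial H_t(a)} 1 \;=\; 0,$$

and pulling $B_t$ out of the sum is only valid if it does not depend on the action $x$, which is why a per-action baseline would break the argument.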

My results are pretty bad in two ways: first, the no-baseline algorithm does not learn anything at all; second, the baseline version barely reaches about 47% optimal-action selection, whereas the book's results get over 80%.

The testbed is a 10-armed bandit problem where the true expected rewards are shifted by 4 units, as suggested in Chapter 2 (i.e., the mean of each bandit is drawn from a Gaussian with mean 4). The results are averaged over 2000 runs of 1000 steps each. The is_optimal check returns true if the chosen action is the one with the maximum true mean over all bandits.
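For reference, the testbed my code assumes behaves roughly like the sketch below (the class and the `reward`/`is_optimal` methods are my own names, not from the book):

    import numpy as np

    class BanditTestbed:
        """10-armed testbed: each arm's true value ~ N(4, 1), rewards ~ N(true value, 1)."""

        def __init__(self, k=10, mean_shift=4.0, seed=None):
            self.rng = np.random.default_rng(seed)
            self.means = self.rng.normal(loc=mean_shift, scale=1.0, size=k)

        def __len__(self):
            return len(self.means)

        def reward(self, a):
            # reward for pulling arm a: Gaussian around its true mean
            return self.rng.normal(loc=self.means[a], scale=1.0)

        def is_optimal(self, a):
            # True if arm a has the highest true mean
            return a == int(np.argmax(self.means))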

Best Answer

Your implementation is very close to being correct. The issue is in the definition of $\pi$ (which you have as $P$ in your code). From the notes:

where here we have also introduced a useful new notation, $\pi_t(a)$, for the probability of taking action $a$ at time $t$.

But in your code, you take an argmax:

    A = np.argmax(P)

This should instead be an action index sampled at random according to the probability mass function $P$ (i.e., the softmax distribution $\pi_t$).
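A minimal one-line sketch of that change (keeping the variable names from the question):

    # sample the action from the softmax distribution instead of taking the argmax
    A = np.random.choice(len(P), p=P)

With `argmax` the policy is effectively deterministic, so the softmax probabilities are never actually used for exploration, and the stochastic-gradient argument in the book (which assumes $A_t \sim \pi_t$) no longer applies.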