Solved – How to understand k-armed bandit example from Sutton’s RL book chapter 2

multiarmed-bandit, reinforcement-learning

I have what is probably a trivial question, but I do not fully understand the k-armed bandit exercise from Chapter 2. My question is based on Sutton and Barto's "Reinforcement Learning: An Introduction", second edition. The exercise reads:

Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2, A4 = 2, R4 = 2, A5 = 3, R5 = 0. On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

My assumption was as follows: A1 may have been chosen either randomly or greedily (no information about that is provided), A2 may have been random (exploration), A3 and A4 were greedy, and A5 was exploratory. So my answer to the question:

On which time steps did this definitely occur?

would be that action A5 was definitely random (I am not sure about A1 and A2).

How do I answer this question correctly, and the second one as well? What I do not quite get is why the immediate rewards R3 and R4 are equal to 2 when I assumed they should be either 0 (lose) or 1 (win). I could understand the action-value estimates Q3(a) and Q4(a) being equal to 2, but R3 and R4 confused me. Could someone show me the proper way of solving this problem step by step? That would help me understand the essence of the k-armed bandit problem.

Best Answer

From what you posted, I don't see where the rewards are constrained to {0, 1}. In general, bandit rewards can be any real value; the 10-armed testbed in Chapter 2, for example, draws its rewards from normal distributions.

To understand the algorithm, let's write down the estimates after each iteration, in the format "iteration: Q(1), Q(2), Q(3), Q(4)":

0: 0, 0, 0, 0

1: 1, 0, 0, 0

2: 1, 1, 0, 0

3: 1, 1.5, 0, 0

4: 1, 1.67, 0, 0

5: 1, 1.67, 0, 0
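
If it helps, here is a minimal Python sketch (my own illustration, not code from the book) that reproduces these estimates using the incremental form of the sample-average update, Q_{n+1}(a) = Q_n(a) + (R_n - Q_n(a)) / n:

```python
# Sketch: reproduce the sample-average estimates for the given sequence.
# Actions are 1-indexed, as in the exercise; index 0 is unused.
actions = [1, 2, 2, 2, 3]
rewards = [1, 1, 2, 2, 0]

k = 4
q = [0.0] * (k + 1)   # q[a]: current sample-average estimate of action a
n = [0] * (k + 1)     # n[a]: number of times action a has been selected

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    n[a] += 1
    q[a] += (r - q[a]) / n[a]   # incremental form of the sample average
    print(f"{t}: " + ", ".join(f"{q[i]:.2f}" for i in range(1, k + 1)))
```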

Now consider how the algorithm works.

At iteration 1, all the estimates are tied at 0, so picking action 1 is consistent with a greedy choice (ties are broken arbitrarily), but it could just as well have been a random pick - we really can't say.

At iteration 2, the greedy choice is action 1, since it has the strictly largest estimate. Since action 2 is picked instead, we know the ε case occurred here.

At iteration 3, actions 1 and 2 are tied for the largest estimate (both 1), so picking action 2 is consistent with a greedy choice, but the ε case could also have occurred and happened to select action 2.

You should be able to use the same reasoning to see that A4 is consistent with exploitation (action 2 has the strictly largest estimate at that point, though a random pick could still have landed on it) and that A5 was definitely an exploration step, since action 3's estimate of 0 is below action 2's 1.67. Keep in mind that the ε case selects uniformly over all actions, so it can coincide with the greedy action; that is why the answer to "on which time steps could this possibly have occurred" is every time step.
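
As a sanity check, here is a short sketch (again my own, not from the book) that replays the five steps and flags which ones are definitely exploratory, i.e. where the chosen action's estimate is strictly below the current maximum:

```python
# Sketch: replay the episode and classify each step.
# A step is "definitely the epsilon case" only if the chosen action is
# not among the greedy (maximum-estimate) actions at that moment.
actions = [1, 2, 2, 2, 3]
rewards = [1, 1, 2, 2, 0]

k = 4
q = [0.0] * (k + 1)
n = [0] * (k + 1)

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = {i for i in range(1, k + 1) if q[i] == max(q[1:])}
    verdict = "could be either case" if a in greedy else "definitely the epsilon case"
    print(f"step {t}: A{t} = {a}, greedy set = {sorted(greedy)} -> {verdict}")
    n[a] += 1
    q[a] += (r - q[a]) / n[a]
```

Running this flags steps 2 and 5 as definitely exploratory, while steps 1, 3, and 4 only match a greedy action - which is exactly the point: the ε case could have occurred on any of the five steps.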
