Multi-Armed Bandit – How Does the Gambler Choose the Best Strategy in Multi-Armed Bandit Problems?

multi-armed-bandit, reinforcement-learning

In the multi-armed bandit problem, I would like to clarify exactly what happens from time step $t=1$ onward under the epsilon-greedy strategy, both for $\epsilon=0$ and for $0<\epsilon \leq 1$. In particular, I am not sure exactly what the gambler does to determine the best strategy. Does he simply try each of the $k$ levers to see which one is best, so that at time step $k+1$ he "starts" the greedy behaviour, picking the lever that has been best up to the $(k+1)$-th step with probability $1-\epsilon$ and choosing uniformly at random from all levers with probability $\epsilon$ at each time step?

Best Answer

You mention two options, $\epsilon=0$ and $\epsilon>0$.

For $\epsilon=0$ there is no exploration, only exploitation, so your agent will always select the arm with the maximal estimated reward. If you start from the beginning and all the reward estimates are random or 0, this is a very poor choice (you may still get some exploration if you break ties at random).
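As a small illustration, greedy selection with random tie-breaking could be sketched as follows (the array name `q_estimates` and the NumPy usage are my own assumptions, not part of the question):

```python
import numpy as np

def greedy_action(q_estimates, rng):
    """Pick the arm with the highest estimated value, breaking ties uniformly at random."""
    best = np.flatnonzero(q_estimates == q_estimates.max())  # indices of all tied maxima
    return rng.choice(best)

rng = np.random.default_rng(0)
print(greedy_action(np.zeros(5), rng))  # with all-zero estimates every arm is tied, so the pick is random
```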

For $\epsilon>0$, at each time step you select the current best action (best lever) with probability $1-\epsilon$, and with probability $\epsilon$ you select an action (lever) uniformly at random. This does not change over time. The random choices are what provide exploration. In the end, if you wanted to report the best actions, you would look at the greedy (exploiting) choices only.
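To make the rule concrete, here is a minimal sketch of an epsilon-greedy agent on a stationary $k$-armed bandit with sample-average value estimates; the Gaussian reward model and all names are illustrative assumptions rather than part of the original question:

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon, n_steps=1000, seed=0):
    """Run epsilon-greedy on a k-armed bandit with unit-variance Gaussian rewards (assumed)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)       # estimated value of each lever (sample averages)
    counts = np.zeros(k)  # number of pulls per lever
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.integers(k)                           # explore: uniformly random lever
        else:
            arm = rng.choice(np.flatnonzero(q == q.max()))  # exploit: greedy lever, ties broken at random
        reward = rng.normal(true_means[arm], 1.0)           # assumed reward model
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]           # incremental sample-average update
        total_reward += reward

    return q, total_reward
```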

For $\epsilon=1$ every action at every time step is chosen at random, so there is no exploitation at all.
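Using the sketch above, the three regimes can be compared directly (the specific arm means are just an example):

```python
for eps in (0.0, 0.1, 1.0):
    q, total = run_epsilon_greedy([0.2, 0.5, 0.8], epsilon=eps)
    print(f"epsilon={eps}: estimated values {q.round(2)}, total reward {total:.1f}")
```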