Solved – Is UCB1 for the multi-armed bandit stochastic or deterministic?

machine-learning, mathematical-statistics, multiarmed-bandit, reinforcement-learning

I would like to know if UCB1 for multi armed bandit problems is deterministic or stochastic.

I understand that the arm chosen depends on the expected reward and the "width" of the upper bound, which depends on the number of times I have pulled that arm. So my guess is it's deterministic, but I'm not sure.
Also, I am assuming my MAB is not dealing with adversarial rewards – is this the key to saying it's deterministic?

– edit – I have since found that it should be classed as stochastic, as opposed to adversarial. How can I understand this, though? Does this mean that adversarial is deterministic? I wouldn't have said so; the only difference is who/what is giving me my rewards, but I have the same knowledge in the two cases, I think. Could you help clarify?

Thanks!

Best Answer

UCB1 is a deterministic algorithm, as at each decision-making point (play of the bandit) it selects the best alternative by a deterministic rule:

$$ \text{action} = \arg \max_j \left( \bar{x}_j + \sqrt{2\log(t)/n_j} \right)$$

where $t$ is the iteration count, $n_j$ is the number of times we've already chosen action $j$, and $\bar{x}_j$ is the average reward we've seen from choosing action $j$. There is no randomness in the selection of $\text{action}$ given our state of knowledge $(t, n_j, \bar{x}_j)$ at each decision point, hence the designation of "deterministic".
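As a minimal sketch of this rule (in Python; the function name `ucb1_select` and the argument layout are illustrative, and each arm is assumed to have been played at least once so that every $n_j \geq 1$):

```python
import math

def ucb1_select(t, n, xbar):
    """Deterministic UCB1 arm selection (illustrative sketch).

    t    -- total number of plays so far (assumed >= 2 so log(t) > 0)
    n    -- list with n[j] = number of times arm j has been played (all >= 1)
    xbar -- list with xbar[j] = average observed reward of arm j

    Given the same knowledge state (t, n, xbar), this always returns the same arm.
    """
    scores = [xbar[j] + math.sqrt(2 * math.log(t) / n[j]) for j in range(len(n))]
    return max(range(len(n)), key=lambda j: scores[j])
```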

The rewards are random, but that does not make the algorithm itself stochastic; the sample path of actions chosen is random only because the results of the bandit plays are random.
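To see the distinction, here is a small simulation sketch (reusing the hypothetical `ucb1_select` above, with made-up Bernoulli arms): two runs with different random reward draws can yield different action sequences, even though the selection rule itself never flips a coin.

```python
import random

def run_ucb1(true_means, horizon, seed):
    """Run UCB1 on Bernoulli arms with the given success probabilities.

    Returns the sequence of arms chosen. The selection rule is the
    deterministic ucb1_select above; only the reward draws are random.
    """
    rng = random.Random(seed)
    k = len(true_means)
    n = [0] * k
    xbar = [0.0] * k
    actions = []
    # Play each arm once so that every n[j] >= 1 before applying the rule.
    for j in range(k):
        r = 1.0 if rng.random() < true_means[j] else 0.0
        n[j], xbar[j] = 1, r
        actions.append(j)
    for t in range(k + 1, horizon + 1):
        j = ucb1_select(t, n, xbar)
        r = 1.0 if rng.random() < true_means[j] else 0.0
        xbar[j] = (xbar[j] * n[j] + r) / (n[j] + 1)  # incremental mean update
        n[j] += 1
        actions.append(j)
    return actions

# Different seeds -> different reward realizations -> possibly different action paths.
print(run_ucb1([0.3, 0.5], horizon=20, seed=0))
print(run_ucb1([0.3, 0.5], horizon=20, seed=1))
```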

Adversarial or non-adversarial has nothing to do with it. Decision-making algorithms in adversarial games can be either deterministic or stochastic.
