Solved – Best bandit algorithm

algorithms, machine-learning, mathematical-statistics, multiarmed-bandit, reinforcement-learning

The most well-known bandit algorithm is the upper confidence bound (UCB), which popularized this class of algorithms. I presume better algorithms have been developed since then. What is the current best algorithm (in terms of either empirical performance or theoretical bounds)? Is this algorithm optimal in some sense?

Best Answer

A NIPS 2011 paper ("An Empirical Evaluation of Thompson Sampling") shows experimentally that Thompson Sampling beats UCB. UCB chooses the lever that promises the highest reward under optimistic assumptions: when the variance of your estimate of an expected reward is high, you pull levers you don't know well. Thompson Sampling, by contrast, is fully Bayesian: it draws a bandit configuration (i.e. a vector of expected rewards) from the posterior distribution and then acts as if this were the true configuration (i.e. it pulls the lever with the highest sampled expected reward).
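To make the contrast concrete, here is a minimal sketch of both policies on a Bernoulli bandit. The arm means, horizon, Beta(1, 1) prior, and the UCB1 variant of UCB are all illustrative choices of mine, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arm reward probabilities
n_arms, horizon = len(true_means), 10_000

def thompson(horizon):
    # Beta(1, 1) prior on each arm's mean; counts of successes/failures
    # are the posterior parameters.
    alpha, beta = np.ones(n_arms), np.ones(n_arms)
    reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)        # sample a bandit configuration
        arm = int(np.argmax(theta))          # act as if the sample were true
        r = rng.random() < true_means[arm]
        alpha[arm] += r
        beta[arm] += 1 - r
        reward += r
    return reward

def ucb1(horizon):
    counts = np.ones(n_arms)                 # play each arm once to initialise
    sums = (rng.random(n_arms) < true_means).astype(float)
    reward = sums.sum()
    for t in range(n_arms, horizon):
        bonus = np.sqrt(2 * np.log(t) / counts)
        # optimism in the face of uncertainty: mean estimate + exploration bonus
        arm = int(np.argmax(sums / counts + bonus))
        r = rng.random() < true_means[arm]
        counts[arm] += 1
        sums[arm] += r
        reward += r
    return reward

best = horizon * true_means.max()
print("Thompson regret:", best - thompson(horizon))
print("UCB1 regret:    ", best - ucb1(horizon))
```

On toy problems like this one, Thompson Sampling typically accumulates less regret because its exploration is driven by posterior uncertainty rather than a fixed confidence bonus.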

The Bayesian Control Rule ("A Minimum Relative Entropy Principle for Learning and Acting", JAIR), a generalization of Thompson Sampling, derives it from information-theoretic principles and causality. In particular, the paper shows that the Bayesian Control Rule is the optimal strategy when you want to minimize the KL divergence between your strategy and the (unknown) optimal strategy, subject to causal constraints. This matters because it can be viewed as an extension of Bayesian inference to actions: Bayesian inference can be shown to be the optimal prediction strategy when your performance criterion is the KL divergence between your estimator and the (unknown) true distribution.
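For intuition, the action rule the paper derives can be sketched in simplified notation. This is a schematic rendering, not the paper's exact statement; the hats mark past actions as causal interventions, i.e. the posterior over environments is conditioned on actions the agent set rather than merely observed:

```latex
% Schematic form of the Bayesian Control Rule (notation simplified).
% theta indexes candidate environments; a_t are actions, o_t observations.
P(a_t \mid \hat{a}_{<t}, o_{<t})
  \;=\; \sum_{\theta} P(a_t \mid \theta, a_{<t}, o_{<t})\,
        P(\theta \mid \hat{a}_{<t}, o_{<t})
```

Sampling an environment θ from this intervention-adjusted posterior and then acting optimally for θ is exactly the Thompson Sampling procedure described above, which is the sense in which the rule generalizes it.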
