The most well-known bandit algorithm is the upper confidence bound (UCB) algorithm, which popularized this class of algorithms. I presume better algorithms have been developed since then. What is the current best algorithm (in terms of either empirical performance or theoretical bounds)? Is this algorithm optimal in some sense?
Solved – Best bandit algorithm
algorithms, machine learning, mathematical-statistics, multiarmed-bandit, reinforcement learning
Related Solutions
I am going to try to give an explanation without any mathematics. Part of this answer is repeated from some points I made in an answer to another question on MAB problems.
The strategic trade-off in multi-armed bandit problems: In multi-armed bandit problems, the gambler plays one "bandit" each round and attempts to maximise his total expected return over a given number of rounds. The expected return of each of the bandits is determined by some unknown parameters of the problem, so as we observe more outcomes over successive rounds, we get more information about these unknown parameters, and hence about the expected return of each of the bandits. In each round of play (except the last), the MAB problem involves a strategic trade-off by the gambler between two objectives:
Immediate rewards: In each round he would like to choose a distribution that gives him a high expected reward on this round, which entails a preference for distributions he (presently) infers to have a high mean reward;
Future rewards (affected by information gain): On the other hand, he wants to refine his knowledge of the true expected rewards by gaining more information on the distributions (especially those that he has not played as much as others), so that he can improve his choices in future rounds.
The relative importance of these two things will determine the trade-off, and this relative importance is affected by a number of factors. For example, if there is only a small number of remaining rounds in the problem then inference for future trials is relatively less valuable, whereas if there is a large number of remaining rounds then inference for future rewards is relatively more valuable. So the gambler needs to consider how much he wants to focus on maximising the immediate rewards in the current round, and how much he wants to deviate from this, to learn more about the unknown parameters that determine the expected reward of each of the bandits.
Thompson sampling: The basic idea of Thompson sampling is that in each round, we take our existing knowledge of the machines, which is in the form of a posterior belief about the unknown parameters, and we "sample" the parameters from this posterior distribution. This sampled parameter yields a set of expected rewards for each machine, and now we bet on the one with the highest expected return, under that sampled parameter.
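To make this concrete, here is a minimal sketch of a single Thompson sampling round, assuming Bernoulli bandits with conjugate Beta posteriors; the Beta(1, 1) priors and the success/failure bookkeeping are illustrative choices on my part, not something prescribed by the scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_round(successes, failures):
    """One Thompson sampling round for Bernoulli bandits with Beta posteriors.

    successes[i] and failures[i] count the outcomes observed for machine i so far;
    with a Beta(1, 1) prior, the posterior for machine i is
    Beta(successes[i] + 1, failures[i] + 1).
    """
    # "Sample" a plausible mean reward for each machine from its posterior.
    sampled_means = rng.beta(successes + 1, failures + 1)
    # Bet on the machine with the highest expected return under the sampled parameters.
    return int(np.argmax(sampled_means))

# Example: three machines with different amounts of evidence so far.
successes = np.array([2.0, 5.0, 9.0])
failures = np.array([8.0, 5.0, 1.0])
chosen_machine = thompson_round(successes, failures)
```

The randomness of the posterior draw is exactly what produces the occasional "search" choices discussed next.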
Prima facie, the Thompson sampling scheme seems to involve an attempt to maximise the immediate expected return in each round (since it involves this maximisation step after sampling the parameter). However, because it involves random sampling of the parameter from the posterior, the scheme involves an implicit trade-off between maximising the present reward and searching for more information. Most of the time we will get a parameter "sample" that is somewhere in the main part of the posterior, and the choice of machine will roughly approximate maximisation of the immediate reward. However, sometimes we will randomly sample a parameter value that is far in the tails of the posterior distribution, and in that case we will end up choosing a machine that does not maximise the immediate reward - i.e., this will constitute more of a "search" to assist with future rewards.
The Thompson scheme also has the nice property that we tend to decrease our "search" as we get more information, and this mimics the desirable strategic trade-off in the problem, where we want to focus less on searches as we obtain more information. As we play more and more rounds and accumulate more and more data, the posterior converges closer to the true parameter values, and so the random "sampling" in the Thompson scheme becomes more tightly packed around the parameter values that lead to maximisation of the immediate reward. Hence, there is an implicit tendency of this scheme to be more "search-oriented" early on, when there is little information, and less "search-oriented" later on, when there is a lot of data.
Now, having said this, one clear drawback of the Thompson sampling scheme is that it does not take into account the number of rounds remaining in the MAB problem. This scheme is sometimes formulated on the basis of a game with infinite rounds, and in this case that is not an issue. However, in MAB problems with finite rounds, it is preferable to take account of the number of remaining rounds in order to decrease the "search" as the number of future rounds decreases. (And in particular, the optimal play in the last round is to ignore searches completely and just bet on the bandit with the highest posterior expected return.) The Thompson scheme does not do this, so it will play finite-round games in a way that is clearly sub-optimal in certain cases.
What you have there is commonly called the exploration term. The upper confidence bound is the empirical mean plus this exploration term.
Let's consider each term separately:
$c$ is a constant which lets the user set the exploration/exploitation trade-off. For theoretical results it is often optimized for the problem at hand (e.g. k-armed bandits with Gaussian priors).
$\sqrt{1/n_i}$ is proportional to the posterior standard deviation after $n_i$ samples of action $i$. Essentially this says that as you pull an arm more often, there is less uncertainty about its mean reward.
$\sqrt{\ln(N)}$, where $N$ is the total number of pulls across all arms, ensures that you don't stop exploring too early. As $N$ becomes very large, the per-arm uncertainties $\sqrt{1/n_i}$ become small, and this logarithmically growing factor compensates to ensure that we never completely stop exploring. Most of the technical math is to show that $\sqrt{\ln(N)}$ is just enough (but not too much) compensation.
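Putting the three terms together, a minimal sketch of the resulting selection rule might look like the following; the $c\sqrt{\ln(N)/n_i}$ form and the convention of pulling each untried arm once before applying the bound are my assumptions for illustration.

```python
import numpy as np

def ucb_choose(empirical_means, counts, c=2.0):
    """Choose an arm by empirical mean plus the exploration term c * sqrt(ln(N) / n_i).

    empirical_means[i] -- average observed reward of arm i
    counts[i]          -- number of times arm i has been pulled (n_i)
    """
    counts = np.asarray(counts, dtype=float)
    # Pull any arm that has never been tried, so every n_i is positive below.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    total_pulls = counts.sum()  # N
    exploration = c * np.sqrt(np.log(total_pulls) / counts)
    return int(np.argmax(np.asarray(empirical_means) + exploration))
```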
For a more technical description, the paper by Auer et al. is a good starting point.
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.
Best Answer
A paper from NIPS 2011 ("An Empirical Evaluation of Thompson Sampling", by Chapelle and Li) shows, in experiments, that Thompson Sampling beats UCB. UCB is based on choosing the lever that promises the highest reward under optimistic assumptions (i.e. when the variance of your estimate of the expected reward is high, you pull levers that you don't know that well). In contrast, Thompson Sampling is fully Bayesian: it generates a bandit configuration (i.e. a vector of expected rewards) by sampling from a posterior distribution, and then acts as if this were the true configuration (i.e. it pulls the lever with the highest expected reward under the sampled configuration).
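For intuition only (this is not the paper's experimental setup), a toy simulation along these lines can compare the two strategies on Bernoulli arms; the arm means, horizon, Beta(1, 1) priors, and the $\sqrt{2\ln t / n_i}$ exploration term below are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.1, 0.5, 0.6])   # unknown to both algorithms
horizon = 5000

def run(policy):
    successes = np.zeros(len(true_means))
    failures = np.zeros(len(true_means))
    total_reward = 0
    for t in range(1, horizon + 1):
        counts = successes + failures
        if policy == "ucb":
            if np.any(counts == 0):
                arm = int(np.argmin(counts))      # try every lever once first
            else:
                means = successes / counts
                arm = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        else:                                      # Thompson Sampling with Beta(1, 1) priors
            arm = int(np.argmax(rng.beta(successes + 1, failures + 1)))
        reward = int(rng.random() < true_means[arm])  # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return horizon * true_means.max() - total_reward  # regret vs. always playing the best arm

print("UCB regret:              ", run("ucb"))
print("Thompson Sampling regret:", run("thompson"))
```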
The Bayesian Control Rule ("A Minimum Relative Entropy Principle for Learning and Acting", by Ortega and Braun, JAIR), a generalization of Thompson Sampling, derives Thompson Sampling from information-theoretic principles and causality. In particular, it is shown that the Bayesian Control Rule is the optimal strategy when you want to minimize the KL divergence between your strategy and the (unknown) optimal strategy, while taking causal constraints into account. The reason this is important is that it can be viewed as an extension of Bayesian inference to actions: Bayesian inference can be shown to be the optimal prediction strategy when your performance criterion is the KL divergence between your estimator and the (unknown) true distribution.
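As a rough sketch of that rule (my paraphrase, so the notation is mine rather than the paper's): with latent parameter $\theta$, observations $o_{<t}$, and past actions $\hat{a}_{<t}$ treated as causal interventions (so the posterior over $\theta$ is updated from observations but not from the agent's own action choices), actions are drawn from the posterior mixture of the parameter-conditional policies:

$$P(a_t \mid \hat{a}_{<t}, o_{<t}) = \sum_{\theta} P(a_t \mid \theta, \hat{a}_{<t}, o_{<t}) \, P(\theta \mid \hat{a}_{<t}, o_{<t}).$$

Thompson Sampling is then the special case where $P(a_t \mid \theta, \cdot)$ deterministically pulls the lever with the highest expected reward under $\theta$.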