Solved – Thompson/Bayesian Bandit Algorithm

algorithms, bayesian, multiarmed-bandit

I am looking to use the Bayesian bandits strategy to find the best arm of a multi-armed bandit.

As outlined in the link, the Bayesian algorithm is

  1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
  2. Select the bandit with the largest sample, i.e. select bandit $B$ = argmax $X_b$.
  3. Observe the result of pulling bandit $B$, and update the prior of bandit $B$.
  4. Return to 1.
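
In case a concrete loop helps, here is a minimal sketch of steps 1–4 for Bernoulli rewards. The array `true_probs` and the horizon of 1000 pulls are made-up values used only to simulate the environment; they are not part of the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

true_probs = [0.3, 0.5, 0.7]   # unknown in practice; used here only to simulate pulls
K = len(true_probs)
alpha = np.ones(K)             # Beta prior parameter alpha_b = 1 for each bandit b
beta = np.ones(K)              # Beta prior parameter beta_b = 1 for each bandit b

for t in range(1000):
    # 1. Sample X_b from the current Beta prior/posterior of every bandit b.
    samples = rng.beta(alpha, beta)
    # 2. Select the bandit with the largest sample.
    B = int(np.argmax(samples))
    # 3. Pull bandit B and observe a Bernoulli reward.
    reward = int(rng.random() < true_probs[B])
    # 4. Update the prior of bandit B (conjugate Beta-Bernoulli update), then repeat.
    alpha[B] += reward
    beta[B] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```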

My questions are:

  • Does each bandit have a different prior? That is, does each bandit have an individual Beta distribution with different initial $\alpha, \beta$?
  • If each bandit does have a different prior, how do we incorporate prior knowledge about the bandits? That is, if I suspect that bandit $1$ has a higher probability of success than bandit $2$, how do I include this?

Best Answer

The algorithm is actually called "Thompson sampling for Bernoulli bandits". For bandit $i$, let $X_i = 1$ if a play is successful and $X_i = 0$ if it fails. Then we can assume $X_i \sim Ber(p_i)$, where $p_i$ is the success probability.

Since the conjugate prior for Bernoulli samples is the Beta distribution, we assume the prior $p_i \sim Beta(\alpha_i,\beta_i)$ for bandit $i$. We commonly choose the non-informative prior $\alpha_i = \beta_i = 1$ for all bandits $i=1,\cdots,K$.
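
To spell out the conjugacy: after observing $S_i$ successes and $F_i$ failures on bandit $i$, the posterior is proportional to the Bernoulli likelihood times the Beta prior,

$$\pi(p_i \mid S_i, F_i) \;\propto\; p_i^{S_i}(1-p_i)^{F_i}\cdot p_i^{\alpha_i-1}(1-p_i)^{\beta_i-1} \;=\; p_i^{\alpha_i+S_i-1}(1-p_i)^{\beta_i+F_i-1},$$

so $p_i \mid \text{data} \sim Beta(\alpha_i + S_i,\ \beta_i + F_i)$.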

Now we can address your two questions.

  • In the beginning, all bandits have the same prior distribution $Beta(1,1)$. However, after playing for a while, suppose bandit $i$ has recorded $S_i$ successes and $F_i$ failures. Then each bandit has its own posterior $Beta(S_i + 1, F_i + 1)$.
  • You can indeed choose a different $(\alpha_i, \beta_i)$ for each bandit $i$. Note that $\mathbb{E}(p_i) = \frac{\alpha_i}{\alpha_i + \beta_i}$ for $p_i\sim Beta(\alpha_i,\beta_i)$. If you believe bandit $i$ has a higher success probability from the very beginning, you can simply choose a large $\alpha_i$ and a small $\beta_i$ so that the expected success probability of bandit $i$ is close to $1$ (see the short sketch below).
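
As a hedged illustration of that second point, suppose you believe bandit $1$ succeeds about 80% of the time and you are moderately confident, while you know nothing about bandit $2$. The specific numbers below are just one way to encode that belief; Thompson sampling then proceeds exactly as before, starting from these priors instead of $Beta(1,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bandit 1: informative prior with mean 8 / (8 + 2) = 0.8,
# worth roughly 10 "imagined" pulls of prior evidence.
# Bandit 2: non-informative prior Beta(1, 1) with mean 0.5.
alpha = np.array([8.0, 1.0])
beta = np.array([2.0, 1.0])

# One Thompson-sampling draw from these priors: bandit 1 will usually
# win the argmax early on, until observed data says otherwise.
samples = rng.beta(alpha, beta)
print("sampled success probabilities:", samples,
      "-> play bandit", int(np.argmax(samples)) + 1)
```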