I am looking to use the Bayesian Bandits strategy to find the best arm of a multi-armed bandit.
As outlined in the link, the Bayesian algorithm is
1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
2. Select the bandit with the largest sample, i.e. select bandit $B = \arg\max_b X_b$.
3. Observe the result of pulling bandit $B$, and update the prior of bandit $B$.
4. Return to 1.
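The loop above can be sketched directly in code. This is a minimal illustration, not a reference implementation: the true success probabilities, the round count, and the `Beta(1, 1)` starting priors are all assumptions made up for the example.

```python
import random

def thompson_sampling(true_probs, n_rounds=10_000, seed=0):
    """Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta parameter: 1 + observed successes per bandit
    beta = [1.0] * k   # Beta parameter: 1 + observed failures per bandit
    pulls = [0] * k
    for _ in range(n_rounds):
        # 1. Sample X_b from each bandit's current Beta distribution.
        samples = [rng.betavariate(alpha[b], beta[b]) for b in range(k)]
        # 2. Select the bandit B with the largest sample.
        b = max(range(k), key=samples.__getitem__)
        # 3. Observe the (simulated) Bernoulli reward and update bandit B's prior.
        reward = rng.random() < true_probs[b]
        alpha[b] += reward
        beta[b] += 1 - reward
        pulls[b] += 1
    return pulls

# Hypothetical three-armed bandit; the best arm (index 2) should
# accumulate the vast majority of pulls.
pulls = thompson_sampling([0.3, 0.5, 0.7])
```

Note that only the pulled bandit's distribution is updated each round; the other bandits' parameters are left untouched.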
My questions are:
- Does each bandit have a different prior? i.e. does each bandit have an individual beta distribution, with different initial $\alpha, \beta$?
- If each bandit does have a different prior, how do we incorporate prior knowledge about the bandits? i.e. if I suspect that bandit $1$ has a higher success probability than bandit $2$, how do I include this?
Best Answer
The algorithm is actually called "Thompson Sampling for Bernoulli bandits". For bandit $i$, let $X_i = 1$ if a pull is successful and $X_i = 0$ if it fails. Then we can assume $X_i \sim Ber(p_i)$, where $p_i$ is the success probability.
Since the conjugate prior for a Bernoulli likelihood is the Beta distribution, we assume the prior distribution $p_i\sim Beta(\alpha_i,\beta_i)$ for bandit $i$. We commonly choose the non-informative prior $\alpha_i = \beta_i = 1$ for all bandits $i=1,\cdots,K$.
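Conjugacy is what makes the update step cheap: after observing $s$ successes and $f$ failures on bandit $i$, the posterior is simply $Beta(\alpha_i + s, \beta_i + f)$. The sketch below shows this update, and also how an informative prior could encode a suspicion that one bandit is better; the specific parameter values ($Beta(4, 2)$ vs. $Beta(1, 1)$) and the observed counts are hypothetical numbers chosen for illustration.

```python
# Non-informative prior for bandit 2: Beta(1, 1), i.e. uniform on [0, 1].
# Informative prior for bandit 1: Beta(4, 2), mean 4/6 ~= 0.67, encoding a
# belief that bandit 1 has a higher success probability than bandit 2.
priors = {"bandit 1": (4.0, 2.0), "bandit 2": (1.0, 1.0)}

def beta_update(alpha, beta, successes, failures):
    """Conjugate Beta-Bernoulli update: Beta(a, b) -> Beta(a + s, b + f)."""
    return (alpha + successes, beta + failures)

# Hypothetical data: 7 successes and 3 failures observed on bandit 1.
alpha, beta = priors["bandit 1"]
posterior = beta_update(alpha, beta, successes=7, failures=3)
print(posterior)  # (11.0, 5.0)
```

The prior acts like pseudo-counts: $Beta(4, 2)$ behaves as if bandit 1 had already been pulled with roughly 3 successes and 1 failure before any real data arrived.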
Now we can address your two questions.