Solved – Multi-armed bandit for general reward distribution

multiarmed-bandit, references

I'm working on a multi-armed bandit problem where we do not have any information about the reward distribution.

I have found many papers that guarantee regret bounds for distributions with a known bound, and for general distributions with support in $[0,1]$.

I would like to find out whether there is a way to perform well in an environment where no guarantees are available about the support of the reward distribution.
My idea is to compute a nonparametric tolerance limit and use that number to scale the rewards, so that I can apply Algorithm 2 from this paper (http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf). Does anyone think this approach will work?
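Roughly, the scaling step I have in mind looks like the sketch below (the function names and the choice of the sample maximum as the tolerance limit are just placeholders; it assumes i.i.d., non-negative rewards):

```python
import numpy as np

def nonparametric_upper_limit(samples: np.ndarray) -> float:
    """Largest observed value, used as a distribution-free upper tolerance limit.

    For the maximum of n i.i.d. draws from a continuous distribution F,
    P(F(max) >= p) = 1 - p**n, so enough initial samples give high coverage.
    """
    return float(np.max(samples))

def scale_reward(reward: float, limit: float) -> float:
    """Map a raw reward into [0, 1], clamping anything that exceeds the limit."""
    return min(max(reward / limit, 0.0), 1.0)
```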

If not, can anyone point me to the right spot?

Thanks a bunch!

Best Answer

Research into MAB algorithms is closely tied to theoretical performance guarantees. Indeed, the resurgence of interest in these algorithms (recall that Thompson sampling was proposed in the 1930s) only really happened after Auer's 2002 paper proving $\mathcal{O}(\log T)$ regret bounds for several UCB and $\epsilon$-greedy variants. As such, there is little interest in problems where the reward distribution has no known bound, since almost nothing can be said about them theoretically.

Even the simple Thompson sampling algorithm you mention requires Bernoulli-distributed rewards, and even in that case it took 80 years before a logarithmic regret bound was proved!

In practice, however, when you do not know the reward distribution for certain, you may simply scale the rewards into $[0,1]$ by dividing by a large number $S$; whenever you observe a reward above $S$, double the scale, $S := 2S$. This approach comes with no regret guarantees, but it typically works quite well.
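A minimal sketch of that doubling heuristic, assuming non-negative rewards (the class and method names are just illustrative):

```python
class RewardScaler:
    """Rescales raw rewards into [0, 1] by dividing by a running scale S."""

    def __init__(self, initial_scale: float = 100.0):
        self.scale = initial_scale

    def rescale(self, reward: float) -> float:
        # If a reward above S is observed, keep doubling: S := 2S.
        while reward > self.scale:
            self.scale *= 2.0
        return reward / self.scale
```

Note that rewards observed before a doubling step end up on a different scale than later ones, which is part of why no regret guarantee survives this trick.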

Also, note that the Thompson sampling algorithm you mention needs Bernoulli trials, so you cannot feed it arbitrary continuous rewards directly. You could fit a Gaussian posterior instead of a Beta, but this is somewhat sensitive to your choice of prior, so you may want to make it very flat. If you are not looking to prove anything about your implementation, this will probably work quite well.
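For example, a Gaussian-posterior variant of Thompson sampling could look roughly like the sketch below. It assumes unit observation noise and a very flat prior (large prior variance `v0`); the function name and the `pull(arm)` callback are illustrative, not part of the paper you linked:

```python
import numpy as np

def gaussian_thompson_sampling(pull, n_arms: int, horizon: int, v0: float = 1e6):
    """Thompson sampling with a Gaussian posterior on each arm's mean reward."""
    mu = np.zeros(n_arms)              # posterior means
    prec = np.full(n_arms, 1.0 / v0)   # posterior precisions (1 / variance)
    for _ in range(horizon):
        # Draw a plausible mean for each arm from its current posterior.
        theta = np.random.normal(mu, 1.0 / np.sqrt(prec))
        arm = int(np.argmax(theta))
        reward = pull(arm)             # observe a (possibly unbounded) reward
        # Conjugate update for a Gaussian likelihood with unit noise variance.
        prec[arm] += 1.0
        mu[arm] += (reward - mu[arm]) / prec[arm]
    return mu
```

With `v0` large the prior is effectively flat, so the first pull of each arm essentially sets its posterior mean to the observed reward.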