Multi-Armed Bandit – Learning Payoffs from a Variable Number of Bandits

multiarmed-bandit thompson-sampling

Is there a technique that allows an extra bandit to be introduced while we are computing the returns of a multi-armed bandit? If the number of bandits were fixed, we could directly apply the Thompson Sampling (TS) algorithm.

Perhaps a variant of TS exists that can take a variable number of bandits as input? For instance, while applying TS to 3 slot machines, we learn of the existence of a fourth slot machine, and we would like our algorithm to take that machine into account.

While asking around, I was told this problem could be modeled with a probability distribution over a non-stationary number of variables, but I am not sure how to do that.
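
Concretely, here is a minimal sketch of the behaviour I have in mind, assuming Bernoulli rewards and independent Beta(1, 1) posteriors per machine; the reward probabilities and the round at which the fourth machine appears are made up for illustration:

```python
import random

# One Beta(alpha, beta) posterior per machine; introducing a new machine
# is just appending a fresh, uninformative prior.
true_probs = [0.3, 0.5, 0.6]               # unknown to the learner (made up)
posteriors = [[1, 1] for _ in true_probs]  # [alpha, beta] per machine

for t in range(10_000):
    if t == 2_000:                  # a fourth machine is discovered mid-run
        true_probs.append(0.7)      # hypothetical payoff of the new machine
        posteriors.append([1, 1])   # fresh Beta(1, 1) prior for the new arm

    # Thompson Sampling: draw from each posterior, play the best draw
    draws = [random.betavariate(a, b) for a, b in posteriors]
    arm = draws.index(max(draws))
    reward = 1 if random.random() < true_probs[arm] else 0

    # Conjugate Beta-Bernoulli update for the played arm
    posteriors[arm][0] += reward
    posteriors[arm][1] += 1 - reward
```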

Best Answer

The setting you are referring to is called sleeping bandits (see here). Sleeping bandits address problems in which "the decision space (set of arms A) changes over time".

Hence, in this setting, arms may appear or disappear at every time step.
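
For intuition, here is a minimal sketch of Thompson Sampling restricted to whichever arms are awake at a given round; the availability schedule and reward probabilities below are hypothetical:

```python
import random

true_probs = {"A": 0.3, "B": 0.5, "C": 0.6, "D": 0.7}  # made up
posteriors = {arm: [1, 1] for arm in true_probs}       # Beta(alpha, beta) per arm

def available_arms(t):
    """Hypothetical availability schedule: arm D 'sleeps' for the
    first 2000 rounds, i.e. the decision set changes over time."""
    return [a for a in true_probs if not (a == "D" and t < 2000)]

for t in range(10_000):
    awake = available_arms(t)
    # Thompson Sampling restricted to the arms awake this round
    draws = {a: random.betavariate(*posteriors[a]) for a in awake}
    arm = max(draws, key=draws.get)
    reward = 1 if random.random() < true_probs[arm] else 0
    posteriors[arm][0] += reward
    posteriors[arm][1] += 1 - reward
```

Note that this treats the per-arm posteriors independently, so an arm that sleeps simply keeps its posterior frozen until it wakes up again.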

As for the extension of TS to this setting, have a look at this paper.
