Solved – In a multi-arm bandit problem, how does one calculate the cumulative regret in real life

contextual-bandit, machine-learning, multiarmed-bandit

I was recently looking at some material on multi-armed bandits, and they have the notion of cumulative regret. It is defined as follows:

Consider $K$ arms, each having a normal distribution with mean $\mu_k$ taken from:

$$
\mu_k \sim N(0,1)
$$

Then, the reward function $R_t(\mu_k)$ at time $t$ has distribution:

$$
R_t(\mu_k) \sim N(\mu_k, 1)
$$

Then, the mean of the best arm is taken to be $\mu^* = \max_k \mu_k$.

From this, assume we have $T$ total pulls of the bandit. Then, the cumulative regret is defined to be:

$$
\text{Regret} = T\mu^* - \sum_{t=1}^T R_t
$$
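
To make the computation concrete, here is a minimal simulation sketch of this setup in Python with NumPy; the values of $K$ and $T$ and the uniform-random pulling policy are illustrative choices, not part of the definition, and any actual bandit algorithm would slot in where the actions are drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 1000

# Arm means mu_k ~ N(0, 1); best-arm mean mu* = max_k mu_k
mu = rng.normal(0.0, 1.0, size=K)
mu_star = mu.max()

# Pull arms with some policy (here: uniform random, purely for illustration)
# and observe rewards R_t ~ N(mu_{a_t}, 1).
actions = rng.integers(0, K, size=T)
rewards = rng.normal(mu[actions], 1.0)

# Cumulative regret as defined above: T * mu* - sum_t R_t
regret = T * mu_star - rewards.sum()
print(regret)
```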

I was wondering how this definition of regret differs from ones that place an expectation in front of the sum. Does anyone have any specific ideas here?

Best Answer

In comparison to formalised theories of regret, this is a non-standard way to define "regret" (see e.g., Loomes and Sugden 1982). Within formalised regret theory, regret is defined as the foregone utility from the actual action choice in comparison to the case where the random states of the world are treated as known (hindsight knowledge). This is usually measured fully as a random variable, without any expectation, but obviously you can also measure expected regret. Within regret theory the optimisation problem for the decision-maker is usually to minimise regret, which is different to maximising utility.

Suppose we let $R_{a,t}$ denote the return at time $t$ for action $a = 1, ..., K$ and let $\boldsymbol{a} = (a_1, ..., a_T)$ denote the vector of actions that are chosen in the problem. Following your own notation, we will also let $R_t = R_{a_t, t}$ denote the return of the chosen action at time $t$. Then the usual form for the regret would be the random variable:

$$\text{Regret}(\boldsymbol{a}) = \sum_{t=1}^T ( \max_a R_{a,t} - R_t).$$
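
As a sketch of this definition, the simulation below (again with illustrative parameters and a uniform-random policy, which are assumptions of the example rather than part of the answer) draws the full reward table $R_{a,t}$ so that the hindsight-optimal return $\max_a R_{a,t}$ is observable in every round.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 1000

mu = rng.normal(0.0, 1.0, size=K)            # arm means mu_k ~ N(0, 1)

# Full reward table R_{a,t} ~ N(mu_a, 1), so max_a R_{a,t} is known per round.
R = rng.normal(mu[:, None], 1.0, size=(K, T))

actions = rng.integers(0, K, size=T)          # illustrative random policy
chosen = R[actions, np.arange(T)]             # R_t = R_{a_t, t}

# Regret(a) = sum_t ( max_a R_{a,t} - R_t ), non-negative by construction
hindsight_regret = (R.max(axis=0) - chosen).sum()
print(hindsight_regret)
```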

The regret function is non-negative, since it compares the actual return to the optimal return under hindsight knowledge. Within regret theory it is impossible to have negative regret (the optimum is zero regret). This also means that the expected regret is non-negative.

In the analysis you are showing, the "regret" function takes the expected value of the first part but not the second. So the "regret" you are using is the difference between the expected gain under hindsight knowledge and the actual (random) gain. One strange aspect of this measure of "regret" is that it can actually be negative (i.e., you can gain more than the expected gain under hindsight knowledge). Hence, this measure of "regret" corresponds neither to the usual definition in regret theory nor to the expected regret in regret theory. It is a hybrid of the two, and it does not obey the property of non-negativity.
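
A quick simulation makes the contrast visible. In the sketch below (parameters and seed are illustrative), the policy always pulls the arm with the largest true mean; even then, the question's "regret" $T\mu^* - \sum_t R_t$ comes out negative in roughly half of the replications, while the hindsight regret never does.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, reps = 10, 100, 1000

neg_pseudo, neg_hindsight = 0, 0
for _ in range(reps):
    mu = rng.normal(0.0, 1.0, size=K)
    mu_star, best = mu.max(), mu.argmax()

    # Even the policy that always pulls the best arm shows the difference.
    R = rng.normal(mu[:, None], 1.0, size=(K, T))   # full reward table R_{a,t}
    chosen = R[best]                                 # R_t when a_t = argmax_k mu_k

    neg_pseudo += (T * mu_star - chosen.sum()) < 0        # happens ~half the time
    neg_hindsight += (R.max(axis=0) - chosen).sum() < 0   # never happens

print(neg_pseudo / reps, neg_hindsight / reps)
```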


Loomes, G. and Sugden, R. (1982). Regret theory: an alternative theory of rational choice under uncertainty. The Economic Journal 92(368), pp. 805–824.
