Online Algorithms Regret – Understanding and Minimizing Regret in Online Learning

online-algorithms

In online learning / online convex optimization, an algorithm is typically compared against the best fixed action in hindsight (see, e.g., https://people.cs.umass.edu/~akshay/courses/cs690m/files/lec15.pdf):

$$
\operatorname{Regret}(T)=\sum_{t=1}^{T} f_t(w_t)-\min_u \sum_{t=1}^{T} f_t(u)
$$

Here $t = 1, \ldots, T$, and $u$ is the best fixed action over the entire sequence. In practice, a popular alternative is a rolling window (e.g., regress over the past 10 periods of data). Does the definition of $u$ mean the benchmark is effectively using a growing window? If $f_t$ changes over time, wouldn't $u$ perform terribly, since it is essentially averaging over all past observations?

Best Answer

The regret measure $$\operatorname{Regret}(T)=\sum_{t=1}^T f_t(w_t)-\min_u \sum_{t=1}^T f_t(u)$$ is the difference between the cumulative loss of the player and that of the best fixed decision in hindsight. So, as you have already observed, the measure implicitly assumes that a single decision is reasonably good over all iterations. The measure is thus not suitable for non-stationary environments and is sometimes called static regret.
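To make this concrete, here is a minimal numerical sketch (my own illustration, not from the answer or the lecture notes): online gradient descent in the style of [1] on 1-D quadratic losses $f_t(w) = (w - z_t)^2$. The targets $z_t$, the random seed, and the step size $\eta_t = 1/\sqrt{t}$ are all illustrative assumptions.

```python
# Static regret of online gradient descent (OGD) on 1-D quadratic losses
# f_t(w) = (w - z_t)^2. All data below is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
z = rng.normal(loc=1.0, scale=0.5, size=T)   # stationary targets

w = 0.0
player_loss = []
for t in range(1, T + 1):
    player_loss.append((w - z[t - 1]) ** 2)  # suffer f_t(w_t)
    grad = 2.0 * (w - z[t - 1])              # gradient of f_t at w_t
    w -= grad / np.sqrt(t)                   # step size ~ 1/sqrt(t)

# Best fixed decision in hindsight: u* = mean(z) minimizes sum_t (u - z_t)^2.
u_star = z.mean()
comparator_loss = ((u_star - z) ** 2).sum()
static_regret = sum(player_loss) - comparator_loss
print(f"static regret after T={T} rounds: {static_regret:.2f}")
```

Because the environment here is stationary, the fixed comparator $u^\star$ is a strong benchmark and the static regret stays small; the problem you describe arises precisely when no single $u$ is good for the whole sequence.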

There are two mainstream measures for tackling non-stationary environments:

  • adaptive regret: ensuring small static regret on every (small) interval (this is exactly the measure mentioned by @Sebastiaan);
  • dynamic regret: ensuring small regret against a changing comparator sequence (I will focus on this measure).

The definition of dynamic regret is as follows: $$\mbox{D-Regret}_T(u_1,\ldots,u_T) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u_t).$$ So you can see that the comparators $\{u_1,\ldots,u_T\}$ are allowed to change across iterations, instead of being a single fixed decision as in static regret.
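As an illustration (again my own sketch, not an algorithm from [2]): the same OGD update on drifting quadratic losses, with the comparators chosen as the per-round minimizers $u_t = \arg\min_u f_t(u) = z_t$. The drift pattern and noise level are made-up assumptions.

```python
# Dynamic regret of OGD when the optimum drifts over time.
# Comparators u_t = z_t are the per-round minimizers of f_t(u) = (u - z_t)^2.
import numpy as np

rng = np.random.default_rng(1)
T = 1000
drift = np.linspace(0.0, 3.0, T)              # slowly moving optimum
z = drift + rng.normal(scale=0.1, size=T)     # non-stationary targets

w = 0.0
player_loss = 0.0
for t in range(1, T + 1):
    player_loss += (w - z[t - 1]) ** 2        # suffer f_t(w_t)
    w -= 2.0 * (w - z[t - 1]) / np.sqrt(t)    # OGD step

# With u_t = z_t we have f_t(u_t) = 0, so the comparator's cumulative
# loss vanishes and the dynamic regret equals the player's loss.
dynamic_regret = player_loss
print(f"dynamic regret after T={T} rounds: {dynamic_regret:.2f}")
```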

Dynamic regret bounds will typically involve a term measuring the non-stationarity of the environment, for example the path length of the comparator sequence, $P_T = \sum_{t=2}^{T} \| u_t - u_{t-1}\|$.
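Continuing the sketch above (still with the hypothetical comparators $u_t = z_t$), the path length is a one-liner:

```python
# Path length P_T = sum_{t=2}^T |u_t - u_{t-1}| of the comparator
# sequence, reusing z from the dynamic-regret sketch above.
P_T = np.abs(np.diff(z)).sum()
print(f"path length P_T: {P_T:.2f}")
```

The optimal dynamic regret established in [2] is of order $O(\sqrt{T(1+P_T)})$, so a slowly drifting comparator sequence (small $P_T$) yields a correspondingly small regret guarantee.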

The notion of dynamic regret is also called tracking regret or shifting regret in the early development of prediction with expert advice. For online convex optimization (OCO), the pioneering work [1] first introduced this measure; [2] establishes the lower bound and designs an optimal algorithm for OCO; and [3] extends the idea to bandit convex optimization. See [2] and [3] for further introduction and related work.

References

[1] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.

[2] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive Online Learning in Dynamic Environments. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.

[3] Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit Convex Optimization in Non-stationary Environments. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
