<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Augmented Online Learning Algorithms - The Adversarial Bandit with Knapsacks framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Drago</string-name>
          <email>davide.drago@studbocconi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Celli</string-name>
          <email>andrea.celli2@unibocconi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Eliáš</string-name>
          <email>marek.elias@unibocconi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bocconi University</institution>
          ,
          <addr-line>Via Guglielmo Röntgen 1, Milan, 20136</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We delve into the Bandits with Knapsacks framework with the aim of designing a learning-augmented online algorithm with better competitive guarantees than the state-of-the-art worst-case algorithms. In particular, we obtain better competitive ratios when the input predictions are accurate, while also upholding worst-case guarantees when predictions are imprecise. We introduce two algorithms: the first works in a full-feedback environment, while the other is tailored to the bandit setting. Both algorithms integrate a static prediction into a worst-case α-competitive algorithm. This yields an improved competitive ratio of 1/[λ + (1/α)(1 − λ)] when the prediction is perfect, and a marginally compromised constant competitive ratio of α/(1 − λ) when the prediction is highly imprecise, with λ ∈ (0, 1) a parameter chosen by the decision-maker.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          Bandits with Knapsacks. In the case of adversarial bandits with knapsacks, Immorlica et al.
[
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provide a competitive ratio of O(m log T). This was improved by Kesselheim and Singla
[
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to O(log m log T). Subsequently, Castiglioni et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provided the first constant-factor
competitive ratio for the case in which B = Ω(T). Such a competitive ratio is 1/α = B/T.
Learning-augmented online algorithms. The framework of learning-augmented online
algorithms was formally established by Lykouris and Vassilvitskii [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Applications of this
framework are wide-ranging and include scheduling [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], and caching or paging algorithms [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
In addition, a general framework for integrating predictions into online primal-dual
algorithms was recently introduced in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Setting</title>
      <p>
        The decision maker makes a sequence of T decisions, drawing actions from a finite set A. A
randomized strategy is a distribution x_t ∈ Δ(A). We denote by x̃ the predicted mixed
strategy, and by x* the best fixed distribution. The decision maker has m available resources
and a budget B for each of them. A sequence of items γ is selected by an adversary. In our
setting, γ_t is composed of a reward vector r_t ∈ [0, 1]^|A| and a cost matrix c_t ∈ [0, 1]^(|A|×m). We focus
on the case in which B = Ω(T). We denote as ρ = B/T the ratio of budget to time horizon.
Benchmark. The benchmark used in the paper is the Fixed Distribution benchmark, defined
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and denoted as OPT_FD. Such a quantity is defined as the expected total reward of the best
distribution over actions x*, i.e., the one maximizing E[REW].
      </p>
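      <p>To make the interaction protocol concrete, the following minimal sketch renders it in Python. It is our own illustration, not code from the paper: the callables adversary and strategy, and all identifiers, are hypothetical names.</p>
      <preformat>
import numpy as np

def run_protocol(T, B, m, n_actions, adversary, strategy):
    """One pass of the BwK protocol: each round the learner commits to a mixed
    strategy over the n_actions arms, the adversary reveals rewards and costs,
    and the process stops once any of the m resource budgets is depleted."""
    budget = np.full(m, float(B))        # budget B for each of the m resources
    total_reward = 0.0
    for t in range(T):
        x_t = strategy(t)                # a point in the simplex over actions
        a = np.random.choice(n_actions, p=x_t)
        r_t, c_t = adversary(t)          # r_t in [0,1]^|A|, c_t in [0,1]^(|A| x m)
        if np.any(c_t[a] > budget):      # stop before overspending any resource
            break
        budget -= c_t[a]
        total_reward += r_t[a]
    return total_reward
      </preformat>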
      <p>Regret. To evaluate the algorithm we use the notion of pseudo-regret, expressed according to the following inequality:</p>
      <p>E[REW_ALG] ≥ (1/α) OPT_FD − reg,
where 1/α is the competitive ratio, OPT_FD is the profit of the fixed distribution benchmark, and
reg is a sublinear regret term.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Algorithms</title>
      <p>Full feedback. In the full-feedback algorithm, at each iteration, with probability λ the
prediction is played, with probability δ the iteration is skipped, and with the remaining
probability 1 − λ − δ the worst-case algorithm is played. Both the prediction and the worst-case
algorithm are assigned the full budget B and are stopped when the budget assigned to them
would be depleted, had they been played for the full sequence. The worst-case algorithm is
updated at each iteration.
Bandit. The difference in the bandit-feedback algorithm lies in the update rules. The worst-case
algorithm is updated only in the iterations in which it is played; otherwise, we set the feedback
to (0, 0). Moreover, since the expected stopping times cannot be computed under bandit
feedback, the budget must be divided preemptively between the two algorithms,
proportionally to their probabilities of being played. A sketch of one round of the full-feedback
variant follows.</p>
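      <p>The sketch below is our own rendering of one round of the full-feedback meta-algorithm, under the assumption that the worst-case algorithm exposes play/update methods in the natural way; none of these identifiers come from the paper.</p>
      <preformat>
import numpy as np

def meta_round(t, lam, delta, x_pred, wc_alg, virt_pred, virt_wc, B, feedback):
    """One round of the full-feedback meta-algorithm (a sketch).  virt_pred and
    virt_wc track the cost each component would have accrued had it played every
    round; a component is shut down once that virtual spend exceeds B."""
    x_wc = wc_alg.play(t)                       # hypothetical interface
    u = np.random.rand()
    if u >= lam + delta:                        # remaining prob.: worst-case alg.
        x_t = x_wc if not np.any(virt_wc > B) else None
    elif u >= lam:                              # prob. delta: skip the iteration
        x_t = None
    else:                                       # prob. lam: play the prediction
        x_t = x_pred if not np.any(virt_pred > B) else None
    r_t, c_t = feedback(t)                      # full feedback: always observed
    virt_pred += c_t.T @ x_pred                 # expected spend of the prediction
    virt_wc += c_t.T @ x_wc                     # expected spend of worst-case alg.
    wc_alg.update(r_t, c_t)                     # full feedback: update every round
    return x_t
      </preformat>
      <p>In the bandit variant, wc_alg.update would instead be called only on rounds where x_wc is actually played, with (0, 0) fed back otherwise, and the budget would be split beforehand between prediction and worst-case algorithm in proportion to λ and 1 − λ − δ.</p>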
      <p>Results. Both algorithms enjoy the same competitive-ratio guarantees in their respective settings.
Theorem 3.1. The algorithms with x̃ = x* and δ = 2√(2 log(1/λ))/T^(1/2), for a sequence of inputs γ,
achieve w.h.p. a competitive ratio of 1/[λ + ρ(1 − λ)]. When Σ_t r_t(x̃) = 0, i.e., the prediction
accrues no reward, the competitive ratio degrades to 1/[ρ(1 − λ)].</p>
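      <p>As a numerical illustration of Theorem 3.1 (ours, not from the paper), the snippet below evaluates both ratios for a sample budget ratio ρ and trust parameter λ.</p>
      <preformat>
def competitive_ratios(rho, lam):
    """Competitive ratio with a perfect prediction vs. a worthless one,
    for budget ratio rho = B/T and trust parameter lam in (0, 1)."""
    perfect = 1.0 / (lam + rho * (1.0 - lam))
    worthless = 1.0 / (rho * (1.0 - lam))
    return perfect, worthless

# With rho = 0.5 the worst-case ratio is alpha = 1/rho = 2.  Trusting the
# prediction with lam = 0.5 improves the ratio to about 1.33 when the
# prediction is perfect, at the price of ratio 4 when it is worthless.
print(competitive_ratios(0.5, 0.5))   # (1.333..., 4.0)
      </preformat>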
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>Our findings, although encouraging, have some limitations. Specifically, our algorithms do not
ensure sublinear regret under stochastic inputs, and are not designed to adjust the probability
parameter λ in response to real-time performance. Future research could focus on adapting
our framework to stochastic environments and on creating algorithms capable of dynamically
modifying λ as system dynamics change. Moreover, enhancing our model to
provide best-of-both-worlds guarantees may be useful in diverse applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>De Benedictis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Castiglioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ferraioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Malvone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Scala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Serina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tosello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vallati</surname>
          </string-name>
          ,
          <article-title>Preface to the Italian Workshop on Planning and Scheduling, RCRA Workshop on Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion, and SPIRIT Workshop on Strategies, Prediction, Interaction, and Reasoning in Italy (IPS-RCRA-SPIRIT 2023)</article-title>
          , in:
          <source>Proceedings of the Italian Workshop on Planning and Scheduling, RCRA Workshop on Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion, and SPIRIT Workshop on Strategies, Prediction, Interaction, and Reasoning in Italy (IPS-RCRA-SPIRIT 2023), co-located with the 22nd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Badanidiyuru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Slivkins</surname>
          </string-name>
          ,
          <article-title>Bandits with knapsacks</article-title>
          ,
          <source>in: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          . doi:10.1109/FOCS.2013.30.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Immorlica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sankararaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schapire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Slivkins</surname>
          </string-name>
          ,
          <article-title>Adversarial bandits with knapsacks</article-title>
          ,
          <source>Journal of the ACM</source>
          <volume>69</volume>
          (
          <year>2022</year>
          ). doi:10.1145/3557045.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kesselheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>Online learning with vector costs and bandits with knapsacks</article-title>
          , in: J. Abernethy, S. Agarwal (Eds.),
          <source>Proceedings of Thirty Third Conference on Learning Theory</source>
          , volume
          <volume>125</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2286</fpage>
          -
          <lpage>2305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Castiglioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kroer</surname>
          </string-name>
          ,
          <article-title>Online learning with knapsacks: the best of both worlds</article-title>
          , in: K. Chaudhuri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvari</surname>
          </string-name>
          , G. Niu, S. Sabato (Eds.),
          <source>Proceedings of the 39th International Conference on Machine Learning</source>
          , volume
          <volume>162</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2767</fpage>
          -
          <lpage>2783</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lykouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          ,
          <article-title>Competitive caching with machine learned advice</article-title>
          , in: J. Dy, A. Krause (Eds.),
          <source>Proceedings of the 35th International Conference on Machine Learning</source>
          , volume
          <volume>80</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3296</fpage>
          -
          <lpage>3305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lattanzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavastida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moseley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          ,
          <article-title>Online Scheduling via Learned Weights</article-title>
          , in:
          <source>Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1859</fpage>
          -
          <lpage>1877</lpage>
          . doi:10.1137/1.9781611975994.114.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          ,
          <article-title>Scheduling with predictions and the price of misprediction</article-title>
          ,
          <year>2019</year>
          . arXiv:1902.00732.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          ,
          <article-title>Near-optimal bounds for online caching with machine learned advice</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.12172.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Antoniadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Coester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <article-title>Online metric algorithms with untrusted predictions</article-title>
          , in: H. Daumé III, A. Singh (Eds.),
          <source>Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>É.</given-names>
            <surname>Bamas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maggiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Svensson</surname>
          </string-name>
          ,
          <article-title>The primal-dual method for learning augmented algorithms</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.11632.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>