Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

Hossein Aboutalebi*1, Doina Precup1, and Tibor Schuster2
1 Department of Computer Science, McGill University. Mila Quebec AI Institute
2 Department of Family Medicine, McGill University
* hossein.aboutalebi@mail.mcgill.ca

Abstract

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms. Finally, we provide a web application where users can create their desired synthetic bandit environment and compare the performance of different bandit algorithms online.

Introduction

The multi-armed bandit is a standard model for researchers to investigate the exploration-exploitation trade-off, see e.g. [Baransi et al., 2014; Auer et al., 2002; Sani et al., 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998]. One of the main advantages of multi-armed bandit problems is their simplicity, which allows for a higher level of theoretical study. The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated with it. Consider a series of multiple arm pulls (steps) t = 1, ..., T, with a specific arm a ∈ A selected at each step, i.e. a(t) = a_t. The standard goal in the multi-armed bandit setting is to find the arm ⋆ which has the maximum expected reward µ_⋆ (or, equivalently, minimum expected regret). The expected regret after T steps, R_T, is defined as the sum of the expected differences between the mean reward under {a_t} and the reward expected under the optimal arm ⋆:

\[
R_T = \mathbb{E}\left[\sum_{t=1}^{T} (\mu_\star - \mu_{a_t})\right]
\]

While this objective is very popular, there are practical applications, for example in medical research and AI safety [Garcıa and Fernández, 2015], where maximizing expected value is not sufficient and it would be better to have an algorithm that is also sensitive to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients' treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, as the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial's objective. Due to these problems, previous contributions like [Sani et al., 2012a] try to include the variance of rewards in the regret definition and develop algorithms to solve this slightly enhanced problem. While these modified approaches try to account for variability in the response of arms, they induce new problems, because the variance is not necessarily a good measure of variability for a distribution: it penalizes responses above and below the mean equally. Other articles, like [Galichet et al., 2013], use the conditional value at risk to arrive at a better regret definition. Though the conditional value at risk may address the problem we faced with including the variance, it may not reflect the amount of variability we could observe for a distribution over its entire domain. All in all, the consistency of treatments among patients is essential, with the ideal treatment usually defined as the one which has a high positive response rate while showing low variability in response among patients. Thus, the idea of consistency and safety is to some extent subjective and problem dependent. As a result, it might be necessary to develop an algorithm which can work with an arbitrary definition of consistency for a distribution. This kind of system design, which allows the separation of different parts of a system (here, the regret function and the learning algorithm), has already been explored in modular programming. In modular programming, we emphasize splitting the entire system into independent modules whose composition builds the system. This design principle is necessary when we are dealing with changing customer demands and require the system to adapt to the new demands. Here, we follow the same paradigm by making the regret definition independent of the learning algorithm. As a result, we allow more flexibility in defining a regret function that is capable of incorporating problem-specific demands.

Finally, we achieve the aforementioned goals by extending one of the recent algorithms in the bandit literature, BESA (Best Empirical Sampled Average) [Baransi et al., 2014]. One of the main advantages of BESA compared to other existing bandit algorithms is that it does not involve many hyper-parameters. This is especially useful when one has no, or insufficient, prior knowledge about the different arms in the beginning. Also, this feature makes it easier to introduce modular design by using McDiarmid's Lemma [El-Yaniv and Pechyony, 2009].

Key contributions: We provide a modular definition of regret called safety-aware regret, which allows higher flexibility in defining the risk for multi-armed bandit problems. We propose a new algorithm called BESA+ which solves this category of problems. We show upper bounds on its safety-aware regret for two-armed and multi-armed bandits. In the experiments, we compare our model with some notable earlier research works and show that BESA+ has satisfying performance. In the last experiment, we depict the performance of our algorithm on a real clinical dataset and illustrate that it is capable of solving the problem with a user-defined safety-aware regret. Finally, for the first time as far as we know, we provide a web application which allows users to create their own custom environment and compare our algorithm with other works.

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], a policy can have logarithmic cumulative regret:

\[
\liminf_{t\to\infty} \frac{R_t}{\log(t)} \ge \sum_{a:\,\mu_a < \mu_\star} \frac{\mu_\star - \mu_a}{K_{\inf}(r_a; r_\star)}
\]

where K_inf(r_a; r_⋆) is the Kullback-Leibler divergence between the reward distributions of the respective arms. Policies for which this bound holds are called admissible.

Several algorithms have been shown to produce admissible policies, including UCB1 [Auer et al., 2002], Thompson sampling [Chapelle and Li, 2011; Agrawal and Goyal, 2013] and BESA [Baransi et al., 2014]. However, theoretical bounds are not always matched by empirical results. For example, it has been shown in [Kuleshov and Precup, 2014] that two algorithms which do not produce admissible policies, ε-greedy and Boltzmann exploration [Sutton and Barto, 1998], behave better than UCB1 on certain problems. Both BESA and Thompson sampling were shown to have comparable performance with Softmax and ε-greedy.

While the expected regret is a natural and popular measure of performance which allows the development of theoretical results, some papers have recently explored other definitions of regret. For example, [Sani et al., 2012b] consider a linear combination of variance and mean as the definition of regret for a learning algorithm A:

\[
\widehat{MV}_t(A) = \hat{\sigma}_t^2(A) - \rho\, \hat{\mu}_t(A) \tag{1}
\]

where µ̂_t is the estimate of the average of observed rewards up to time step t and σ̂_t^2 is a biased estimate of the variance of rewards up to time step t. The regret is then defined as:

\[
\widehat{R}_t(A) = \widehat{MV}_t(A) - \widehat{MV}_{\star,t}(A),
\]

where ⋆ is the optimal arm. According to [Maillard, 2013], however, this definition penalizes the algorithm if it switches between optimal arms. Instead, in [Maillard, 2013], the authors devise a new definition of regret which controls the lower tail of the reward distribution. However, the algorithm to solve the corresponding objective function seems time-consuming, and the optimization to be performed may be intricate. Finally, in [Galichet et al., 2013], the authors use the notion of conditional value at risk in order to define the regret.

Background and Notation

We consider the standard bandit setting with action (arm) set A, where each action a ∈ A is characterized by a reward distribution ϕ_a. The distribution for action a has mean µ_a and variance σ_a^2. Let X_{a,i} ∼ ϕ_a denote the i-th reward sampled from the distribution of action a. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) t, the player (an algorithm) selects action (arm) a_t and observes sample X_{a,N_{a,t}}, where N_{a,t} = \sum_{s=1}^{t} \mathbb{1}\{a_s = a\} denotes the number of samples observed for action a up to time t (inclusively). A policy is a distribution over A. In general, stochastic policies are necessary during the learning stage in order to identify the best arm. We discuss the exact notion of "best" below.

We define I_S(m, j) as the set obtained by sub-sampling without replacement j elements from a set S of size m. Let X_{a,t} denote the history of observations (records) obtained from action (arm) a up to time t (inclusively), such that |X_{a,t}| = N_{a,t}. The notation X_{a,t}(I) indicates the set of sub-samples from X_{a,t} indexed by the sub-sample I ⊂ {1, 2, . . . , N_{a,t}}.
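To make this notation concrete, the following minimal Python sketch (our own; the names Arm and sub_sample are not from the paper) shows how the history X_{a,t}, the count N_{a,t}, and the without-replacement sub-sampling operator I_S(m, j) can be represented:

```python
import random

def sub_sample(history, j, rng=random):
    """I_S(m, j): draw j elements without replacement from a history of size m."""
    return rng.sample(history, j)

class Arm:
    """Toy arm keeping the observation history X_{a,t}; N_{a,t} is len(history)."""
    def __init__(self, sampler):
        self.sampler = sampler   # callable returning one reward in [0, 1]
        self.history = []        # X_{a,t}

    def pull(self):
        reward = self.sampler()
        self.history.append(reward)
        return reward

# Example: an arm with Uniform[0, 1] rewards, pulled five times and then sub-sampled.
arm = Arm(random.random)
for _ in range(5):
    arm.pull()
print(sub_sample(arm.history, 3))
```

Here random.sample draws without replacement, which is exactly the role I_S(m, j) plays in the algorithms below.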
Measure of regret

Unlike previous works, we now give a formal definition of a class of functions which can be used as a separate module inside the learning algorithm to measure the regret. We call this class of functions "safety value functions". In the following, we define these functions formally. Assume we have k arms (|A| = k) with reward distributions ϕ_1, ϕ_2, . . . , ϕ_k.

Definition 0.1 (safety value function). Let D denote the set of all possible reward distributions over a given interval. A safety value function v : D → R provides a score for a given distribution.

The optimal arm ⋆ under this value function is defined as

\[
\star \in \arg\max_{a \in A} v(\varphi_a) \tag{2}
\]

The regret corresponding to the safety value function up to time T is defined as:

\[
R_{T,v} = \mathbb{E}\left[\sum_{t=1}^{T} \big(v(\varphi_\star) - v(\varphi_{a_t})\big)\right] \tag{3}
\]

We call (3) the safety-aware regret. When the context is clear, we usually drop the subscript v and write only R_T for ease of notation.

Definition 0.2 (well-behaved safety value function). Given a reward distribution ϕ_a over the interval [0, 1], a safety value function v for this distribution is called well-behaved if there exists an unbiased estimator v̂ of v such that, for any set of observations {x_1, x_2, . . . , x_n} sampled from ϕ_a and for some constant γ, we have:

\[
\sup_{\hat{x}_i} \big|\hat{v}(x_1, \ldots, x_i, \ldots, x_n) - \hat{v}(x_1, \ldots, \hat{x}_i, \ldots, x_n)\big| < \frac{\gamma}{n} \tag{4}
\]

If (4) holds for any reward distribution ϕ over the interval [0, 1], we call v a well-behaved safety value function.

Example 1: For a given arm a whose reward distribution is limited to the interval [0, 1], consider the safety value function µ_a − ρσ_a^2, which measures the balance between the mean and the variance of the reward distribution of arm a. Here ρ is a hyper-parameter constant for adjusting the balance between the variance and the mean. This is a well-behaved safety value function if we use the following estimators for the empirical mean and variance:

\[
\hat{\mu}_{a,t} = \frac{1}{N_{a,t}} \sum_{i=1}^{N_{a,t}} r_{a,i} \tag{5}
\]

\[
\hat{\sigma}_{a,t}^2 = \frac{1}{N_{a,t} - 1} \sum_{i=1}^{N_{a,t}} \big(r_{a,i} - \hat{\mu}_{a,t}\big)^2 \tag{6}
\]

where r_{a,i} is the i-th reward obtained from pulling arm a. It should be clear that the unbiased estimator µ̂_{a,t} − ρσ̂_{a,t}^2 satisfies (4).

Other well-behaved safety value functions can be defined as functions of the standard deviation or the conditional value at risk, in a similar way to the previous example. In the next section, we develop an algorithm which can optimize the safety-aware regret.

Proposed Algorithm

In order to optimize the safety-aware regret, we build on the BESA algorithm, which we now briefly review. As discussed in [Baransi et al., 2014], BESA is a non-parametric (hyper-parameter-free) approach for finding the optimal arm according to the expected-mean regret criterion. Consider a two-armed bandit with actions a and ⋆, where µ_⋆ > µ_a, and assume that N_{a,t} < N_{⋆,t} at time step t. In order to select the next arm for time step t + 1, BESA first sub-samples s_⋆ = I_⋆(N_{⋆,t}, N_{a,t}) from the observation history (records) of arm ⋆ and similarly sub-samples s_a = I_a(N_{a,t}, N_{a,t}) = X_{a,t} from the records of arm a. If µ̂_{s_a} > µ̂_{s_⋆}, BESA chooses arm a; otherwise it chooses arm ⋆. The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may cause bias in the estimates, diminishes. When there are more than two arms, BESA runs a tournament algorithm on the arms [Baransi et al., 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which the authors did not provide any formal proof. In this paper, we avoid using this lemma when proving the soundness of our proposed algorithm for a more general regret family. Also, we extend the proof to the multi-armed case, which was not provided in [Baransi et al., 2014].

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi et al., 2014], we focus on the two-armed bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ (two-action case)
Input: safety-aware value function v and its estimator v̂
Parameters: current time step t, actions a and b. Initially N_{a,0} = 0, N_{b,0} = 0
1: if N_{a,t-1} = 0 ∨ N_{a,t-1} < log(t) then
2:   a_t = a
3: else if N_{b,t-1} = 0 ∨ N_{b,t-1} < log(t) then
4:   a_t = b
5: else
6:   n_{t-1} = min{N_{a,t-1}, N_{b,t-1}}
7:   I_{a,t-1} ← I_a(N_{a,t-1}, n_{t-1})
8:   I_{b,t-1} ← I_b(N_{b,t-1}, n_{t-1})
9:   Calculate ṽ_{a,t} = v̂(X_{a,t-1}(I_{a,t-1})) and ṽ_{b,t} = v̂(X_{b,t-1}(I_{b,t-1}))
10:  a_t = arg max_{i∈{a,b}} ṽ_{i,t} (break ties by choosing the arm with fewer tries)
11: end if
12: return a_t

If there is a strong belief that one arm should be better than the other, then instead of using the factor log(t) in Algorithm BESA+, one can use a factor α log(t) (where 0 < α < 1 is a constant) to reduce the final regret.

The first major difference between BESA+ and BESA is the use of the safety-aware value function in place of the empirical mean. A second important change is that BESA+ selects the arm which has been tried less up to time step t whenever that arm has been chosen fewer than log(t) times. Essentially, this change is negligible in terms of the total expected regret, as we cannot achieve any bound better than log(T), which is shown in Robbins' lemma [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by log(T) (a result which we present shortly).

To better understand why this modification is necessary, consider a two-arm scenario. The first arm gives a deterministic reward of r ∈ [0, 0.5) and the second arm has a uniform distribution on the interval [0, 1], with an expected reward of 0.5. If we are only interested in the expected reward (µ), the algorithm should ultimately favor the second arm. On the other hand, there is a probability of r that the BESA algorithm will constantly choose the first arm if the second arm gives a value less than r on its first pull. In contrast, BESA+ evades this problem by letting the second arm be selected enough times that it eventually becomes distinguishable from the first arm.

We are now ready to state the main theoretical result of our proposed algorithm.
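To make the selection rule concrete, here is a minimal Python sketch of a single BESA+ decision for the two-action case, with the mean-variance estimator of Example 1 plugged in as v̂. The function names (besa_plus_select, mean_variance) and the tie-breaking details are our own; this illustrates the pseudocode above under those assumptions and is not the authors' reference implementation.

```python
import math
import random

def mean_variance(samples, rho=1.0):
    """Estimator v-hat for Example 1: empirical mean minus rho times sample variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / (n - 1) if n > 1 else 0.0
    return mu - rho * var

def besa_plus_select(t, hist_a, hist_b, v_hat=mean_variance, rng=random):
    """One BESA+ decision between arms 'a' and 'b' at time step t (two-action case)."""
    n_a, n_b = len(hist_a), len(hist_b)
    # Forced exploration: play any arm pulled fewer than log(t) times.
    if n_a == 0 or n_a < math.log(t):
        return "a"
    if n_b == 0 or n_b < math.log(t):
        return "b"
    # Sub-sample both histories, without replacement, down to the smaller size.
    n = min(n_a, n_b)
    v_a = v_hat(rng.sample(hist_a, n))
    v_b = v_hat(rng.sample(hist_b, n))
    if v_a > v_b:
        return "a"
    if v_b > v_a:
        return "b"
    # Tie: prefer the arm with fewer tries.
    return "a" if n_a <= n_b else "b"
```

Repeatedly calling besa_plus_select, pulling the returned arm, and appending the observed reward to the corresponding history gives the two-arm learning loop; a tournament over pairs of arms extends it to the multi-armed case, as in BESA.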
Theorem 0.1. Let v be a well-behaved safety value function. Let A = {a, ⋆} be a two-armed bandit with bounded rewards in [0, 1], and let the value gap be ∆ = v_⋆ − v_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

\[
R_T \le \zeta_{\Delta,\gamma} \log(T) + \theta_{\Delta,\gamma} \tag{7}
\]

where in (7), ζ_{∆,γ} and θ_{∆,γ} are constants which depend on the values of γ and ∆.

Proof. Due to the page limit, we cannot include the full proof; we provide a short overview. The proof mainly consists of two parts. The first part is similar to [Baransi et al., 2014], but instead we use McDiarmid's Lemma [El-Yaniv and Pechyony, 2009] [Tolstikhin, 2017]. For the second part, unlike [Baransi et al., 2014], we avoid using the unproven lemma in their work and instead compute the upper bound directly by exploiting the log trick in our algorithm (this trick is further elaborated in the first experiment). The interested reader can visit here to see the full proof.

Theorem 0.2. Let v be a well-behaved safety value function. Let A = {a_1, . . . , a_{k−1}, ⋆} be a k-armed bandit with bounded rewards in [0, 1]. Without loss of generality, let the optimal arm be ⋆ and the value gap for arm a be ∆_a = v_⋆ − v_a. Also let ∆_max = max_{a∈A} ∆_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

\[
R_T \le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \big[\zeta_{\Delta_{\hat{a}},\gamma} \log(T) + \theta_{\Delta_{\hat{a}},\gamma}\big] + k \Delta_{\max}\, n \tag{8}
\]

where in (8), ζ and θ are constants which depend on the values of γ and ∆. Moreover, â is defined as

\[
\hat{a} = \arg\max_{a \in A}\ \zeta_{\Delta_a,\gamma} \log(T) + \theta_{\Delta_a,\gamma}
\]

for T ≥ n.

Proof. We know that arm ⋆ has to play at most ⌈log k⌉ matches (games) in order to win the round. If it loses any of these ⌈log k⌉ games, we know that at that round we will see a regret, and this regret is at most ∆_max. In the following, we use the notation 1_{−a_⋆,i} to denote the indicator of the event that a_⋆ loses the i-th match (1 ≤ i ≤ ⌈log k⌉).

\[
\begin{aligned}
R_T &= \sum_{t=1}^{T} \sum_{i=1}^{k} \Delta_{a_i}\, \mathbb{E}[\mathbb{1}_{a_t = a_i}] \\
&\le \sum_{t=1}^{T} \sum_{i=1}^{\lceil \log k \rceil} \Delta_{\max}\, \mathbb{E}[\mathbb{1}_{-a_\star,i}] \\
&\le \sum_{t=1}^{T} \sum_{i=1}^{\lceil \log k \rceil} \Delta_{\max}\, \max_{i'} \{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \sum_{i=1}^{\lceil \log k \rceil} \sum_{t=1}^{T} \Delta_{\max}\, \max_{i'} \{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \sum_{t=n}^{T} \Delta_{\hat{a}}\, \mathbb{E}[\mathbb{1}_{-a_\star,\hat{a}}] + k \Delta_{\max}\, n \\
&\le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \big[\zeta_{\Delta_{\hat{a}},\gamma} \log(T) + \theta_{\Delta_{\hat{a}},\gamma}\big] + k \Delta_{\max}\, n
\end{aligned} \tag{9}
\]

Empirical results

Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example discussed in the previous section through the results in Figures 1-3, for r ∈ {0.2, 0.3, 0.4}. Each experiment has been repeated 200 times. Note that while BESA has an almost linear regret behavior, BESA+ can learn the optimal arm within the given time horizon, and its expected accumulated regret is upper bounded by a log function. It is also easy to notice that BESA+ has a faster convergence rate compared with BESA. As r gets closer to 0.5, the problem becomes harder. This phenomenon is a direct illustration of our theoretical result.

Figure 1: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.4.
Figure 2: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.3.
Figure 3: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.2.
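The qualitative effect behind Figures 1-3 can be reproduced with a short, self-contained simulation. The sketch below uses the expected reward as the safety value and compares the sub-sampling rule with and without the log(t) forced-exploration threshold; the horizon, number of repetitions, and all implementation details are our own choices rather than the exact experimental setup of the paper.

```python
import math
import random

def select(t, hist, forced_exploration):
    """Choose between arm 0 and arm 1 by comparing sub-sampled means (the BESA core).
    With forced_exploration=True this follows the BESA+ rule with the log(t) threshold."""
    for arm in (0, 1):
        if len(hist[arm]) == 0 or (forced_exploration and len(hist[arm]) < math.log(t)):
            return arm
    n = min(len(hist[0]), len(hist[1]))
    m0 = sum(random.sample(hist[0], n)) / n
    m1 = sum(random.sample(hist[1], n)) / n
    if m0 == m1:
        return 0 if len(hist[0]) <= len(hist[1]) else 1
    return 0 if m0 > m1 else 1

def run(r=0.4, horizon=2000, forced_exploration=True):
    """Accumulated expected regret on the two-arm example: arm 0 pays r deterministically,
    arm 1 is Uniform[0, 1] with mean 0.5 (the optimal arm)."""
    means = (r, 0.5)
    hist = ([], [])
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = select(t, hist, forced_exploration)
        reward = r if arm == 0 else random.random()
        hist[arm].append(reward)
        regret += max(means) - means[arm]
    return regret

if __name__ == "__main__":
    runs = 10  # the paper uses a 10,000-step horizon and 200 repetitions; smaller numbers keep the sketch fast
    for label, forced in (("without log(t) rule", False), ("with log(t) rule", True)):
        avg = sum(run(0.4, 2000, forced) for _ in range(runs)) / runs
        print(label, "-> average accumulated regret:", round(avg, 1))
```

With the threshold disabled, a run whose first uniform draw falls below r can lock onto the deterministic arm and accumulate regret linearly, which is the failure mode discussed above; the forced-exploration rule removes it.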
Conditional value at risk safety value function

As discussed in [Galichet et al., 2013], in some situations we need to limit the exploration of risky arms. Examples include financial investment, where investors may tend to choose a risk-averse strategy. Using the conditional value at risk as a risk measure is one approach to achieving this goal. Informally, the conditional value at risk at level α is defined as the expected value of the quantile of the reward distribution in which values occur with probability at most α. More formally:

\[
CVaR_\alpha = \mathbb{E}[X \mid X < v_\alpha] \tag{10}
\]

where in (10), v_α = arg max_β {P(X < β) ≤ α}.

To estimate (10), we use the estimator introduced by [Chen, 2007]. This estimator is also employed in [Galichet et al., 2013] to derive their MARAB algorithm. Here, we use it for the conditional value at risk safety value function, which is the regret measure for this problem. Our environment consists of 20 arms, where each arm's reward distribution is a truncated Gaussian mixture consisting of four Gaussian distributions with equal probability. The rewards of the arms are restricted to the interval [0, 1]. To make the environment more complex, the means and standard deviations of the arms are sampled uniformly from the intervals [0, 1] and [0.5, 1], respectively. The experiments are carried out for α = 10%. For the MARAB algorithm, we used grid search and set the value C = 1. Figures 4 and 5 depict the results over ten experiments. It is noticeable that in both figures BESA+ has a lower variance across experiments.

Figure 4: Accumulated regret (regret vs. steps) of MARAB and BESA+. The safety value function here is the conditional value at risk.
Figure 5: Percentage of optimal arm plays (vs. steps) of MARAB and BESA+. The safety value function here is the conditional value at risk.

Mean-variance safety value function

Next, we evaluated the performance of BESA+ with the regret definition provided by [Sani et al., 2012a]. Here, we used the same 20-arm Gaussian mixture environment described in the previous section. We ran the experiments with ρ = 1, which is the trade-off factor between the variance and the mean. The results of this experiment are depicted in Figures 6 and 7. The hyper-parameters used here for the MV-LCB and ExpExp algorithms are those suggested by [Sani et al., 2012a]. Again, we can see that BESA+ has a relatively small variance over the 10 experiments.

Figure 6: Accumulated regret (regret vs. steps). The safety value function here is mean-variance.
Figure 7: Percentage of optimal arm plays (vs. steps). The safety value function here is mean-variance.
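For readers who want to plug a risk measure into the sketches above, the following is one simple empirical estimator of CVaR at level α that can serve as v̂. It is the plain lower-tail sample average, offered as an illustrative stand-in rather than the exact estimator of [Chen, 2007] used in these experiments.

```python
def cvar(samples, alpha=0.10):
    """Empirical conditional value at risk at level alpha: the average of the
    lowest alpha-fraction of observed rewards (higher means a safer arm)."""
    if not samples:
        raise ValueError("CVaR needs at least one sample")
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))  # size of the lower tail
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

Passing cvar in place of mean_variance in the besa_plus_select sketch gives a risk-averse variant in the spirit of this experiment; α = 0.10 matches the level used above.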
Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1, Thompson sampling, MV-LCB, and ExpExp) on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different kinds of treatment (a standard treatment and a test treatment) were applied, and the outcomes are the number of days a patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the experiment has already been conducted, to apply bandit algorithms we proceeded as follows: each time a treatment is selected by a bandit algorithm, we sample uniformly from the recorded results of the patients who received that treatment and use the survival time as the reward signal. Figure 8 shows the distributions for treatments 1 and 2. We categorized the survival time into ten categories (category 1 denoting the minimum survival time). It is interesting to notice that while treatment 2 has a higher mean than treatment 1 (due to the effect of outliers), it also has a higher level of variance than treatment 1. From Figure 8 it is easy to deduce that treatment 1 has a more consistent behavior than treatment 2 and that a higher number of patients who received treatment 2 died early. That is why treatment 1 may be preferred over treatment 2 if we use the safety value function described in Example 1. In this regard, by setting ρ = 1, treatment 1 has a lower expected mean-variance regret than treatment 2, and it should ultimately be favored by the learning algorithm. Figure 9 illustrates the performance of the different bandit algorithms. It is easy to notice that BESA+ performs relatively better than all the other ones.

Figure 8: Distribution of survival-time categories (1-10) for treatment 1 and treatment 2.
Figure 9: Accumulated consistency-aware regret (regret vs. steps) for UCB1, Thompson sampling, MV-LCB, ExpExp, BESA, and BESA+.
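The bootstrap construction described above (replaying recorded outcomes as bandit rewards) can be sketched as follows. The survival times, the cut-off, and the equal-width ten-category binning below are placeholders of our own; the actual values come from the dataset in [Ripley et al., 2013], and the paper does not specify the exact binning procedure.

```python
import random

def to_categories(survival_days, n_categories=10, max_days=1000):
    """Map raw survival times (days) to categories 1..n (category 1 = shortest survival),
    then rescale to [0, 1] so they can be used as bounded rewards.
    Equal-width bins and the max_days cut-off are illustrative assumptions."""
    rewards = []
    for days in survival_days:
        cat = min(n_categories, 1 + (n_categories * min(days, max_days - 1)) // max_days)
        rewards.append(cat / n_categories)
    return rewards

class BootstrapTreatmentArm:
    """Bandit arm replaying recorded patient outcomes for one treatment:
    each pull samples uniformly (with replacement) from the recorded rewards."""
    def __init__(self, recorded_rewards, rng=random):
        self.recorded = list(recorded_rewards)
        self.rng = rng

    def pull(self):
        return self.rng.choice(self.recorded)

# Hypothetical survival times (days); the real ones come from the lung cancer dataset.
standard = BootstrapTreatmentArm(to_categories([150, 180, 200, 160, 220, 140, 170]))
test = BootstrapTreatmentArm(to_categories([5, 8, 900, 10, 980, 12, 15]))
```

With these placeholder numbers, the "test" arm has the higher mean but much higher variability, so an expected-reward learner favors it while the mean-variance safety value of Example 1 (with ρ = 1) favors the more consistent "standard" arm, mirroring the discussion above.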
Web Application Simulator

As discussed earlier, for this project we have developed a web application simulator for bandit problems, where users can create their customized environment and run experiments online. Research works usually provide only a limited set of experiments to test their method. We tried to overcome this problem by developing this web application, where the user can select the number of arms and change their reward distributions. The web application then sends the input to the web server and shows the results to the user, providing regret figures and additional figures describing the way the algorithms have chosen arms over time. This software can be used as a benchmark for future bandit research, and it is open-sourced for future extension. The link is here.

Conclusion and future work

In this paper, we developed a modular safety-aware regret definition which can be used to define the function of interest as a safety measure. We also modified the BESA algorithm and equipped it with new features to solve modular safety-aware regret bandit problems. We then computed the asymptotic regret of BESA+ and showed that it can perform like an admissible policy if the safety value function satisfies a mild assumption. Finally, we depicted the performance of BESA+ on the regret definitions of previous works and showed that it can have better performance in most cases.

It is still interesting to investigate whether we can find better bounds for the BESA+ algorithm with the modular safety-aware regret definition. Another interesting path would be to research whether we can define similar safety-aware regret definitions for broader reinforcement learning problems, including MDP environments.

Acknowledgment

We would like to thank Audrey Durand for her comments and insight on this project. We also thank the Department of Family Medicine of McGill University and CIHR for their generous support during this project.

References

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[Austin, 2011] Peter C Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
[Baransi et al., 2014] Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In ECML-KDD, pages 115–131, 2014.
[Burnetas and Katehakis, 1996] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[Chen, 2007] Song Xi Chen. Nonparametric estimation of expected shortfall. Journal of Financial Econometrics, 6(1):87–107, 2007.
[El-Yaniv and Pechyony, 2009] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35(1):193, 2009.
[Galichet et al., 2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
[Garcıa and Fernández, 2015] Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[Kuleshov and Precup, 2014] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
[Lai and Robbins, 1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[Maillard, 2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In ICML, pages 218–233, 2013.
[Ripley et al., 2013] Brian Ripley, Bill Venables, Douglas M Bates, Kurt Hornik, Albrecht Gebhardt, David Firth, and Maintainer Brian Ripley. Package 'MASS'. CRAN R, 2013.
[Robbins, 1985] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
[Sani et al., 2012a] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
[Sani et al., 2012b] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In NIPS, pages 3275–3283, 2012.
[Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[Tolstikhin, 2017] IO Tolstikhin. Concentration inequalities for samples without replacement. Theory of Probability & Its Applications, 61(3):462–481, 2017.