Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive
                                 Clinical Trials

Hossein Aboutalebi∗1, Doina Precup1, and Tibor Schuster2
1 Department of Computer Science, McGill University; Mila Quebec AI Institute
2 Department of Family Medicine, McGill University
∗ hossein.aboutalebi@mail.mcgill.ca


Abstract

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant potential applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems and on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms. Finally, we provide a web application where users can create their desired synthetic bandit environment and compare the performance of different bandit algorithms online.

Introduction

The multi-armed bandit is a standard model for investigating the exploration-exploitation trade-off, see e.g. [Baransi et al., 2014; Auer et al., 2002; Sani et al., 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998]. One of the main advantages of the multi-armed bandit problem is its simplicity, which allows for a higher level of theoretical study.

The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated with it. Consider a series of multiple arm pulls (or steps) t = 1, ..., T, with a specific arm a ∈ A selected at each step, i.e. a(t) = a_t. The standard goal in the multi-armed bandit setting is to find the arm ⋆ which has the maximum expected reward µ_⋆ (or, equivalently, minimum expected regret). The expected regret after T steps, R_T, is defined as the sum of the expected differences between the mean reward under {a_t} and the expected reward of the optimal arm ⋆:

    R_T = E[ Σ_{t=1}^{T} (µ_⋆ − µ_{a_t}) ]

While this objective is very popular, there are practical applications, for example in medical research and AI safety [García and Fernández, 2015], where maximizing the expected value is not sufficient, and it would be better to have an algorithm that is also sensitive to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients' treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, as the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial's objective. Because of these problems, previous contributions like [Sani et al., 2012a] include the variance of rewards in the regret definition and develop algorithms to solve this slightly enhanced problem. While these modified approaches account for variability in the response of arms, they introduce new problems, because the variance is not necessarily a good measure of variability for a distribution: it penalizes responses above and below the mean equally. Other articles, like [Galichet et al., 2013], use the conditional value at risk to obtain a better regret definition. Though the conditional value at risk may address the problem we faced when including the variance, it may not reflect the amount of variability we could observe for a distribution over its entire domain. All in all, the consistency of treatments among patients is essential, with the ideal treatment usually defined as the one which has a high positive response rate while showing low variability in response among patients. Thus, the idea of consistency and safety is to some extent subjective and problem dependent. As a result, it might be necessary to develop an algorithm which can work with an arbitrary definition of consistency for a distribution.
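To make concrete why the classical objective can be insufficient here, the following toy computation (our own illustrative example with made-up numbers, not from the paper) shows two arms with the same mean but very different dispersion: their contributions to the expected regret R_T = E[ Σ_t (µ_⋆ − µ_{a_t}) ] are identical, so an expected-regret learner cannot prefer the more consistent arm.

```python
# True mean rewards of two hypothetical arms: same mean, very different spread.
mu = {"stable_arm": 0.5, "volatile_arm": 0.5}
sigma = {"stable_arm": 0.01, "volatile_arm": 0.40}

mu_star = max(mu.values())
# Per-pull contribution to R_T = E[ sum_t (mu_star - mu_{a_t}) ]:
# zero for both arms, so the classical objective cannot tell them apart,
# despite the large gap in variability.
for arm in mu:
    print(arm, "regret per pull:", mu_star - mu[arm], "std:", sigma[arm])
```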
This kind of system design, which allows the separation of different parts of a system (here, the regret function and the learning algorithm), has already been explored in modular programming. In modular programming, the emphasis is on splitting the entire system into independent modules whose composition builds the system. This design is necessary when customer demands change and we require our system to adapt to the new demands. Here, we follow the same paradigm by making the regret definition independent of the learning algorithm. As a result, we allow more flexibility in defining a regret function that is capable of incorporating problem-specific demands.

Finally, we achieve the aforementioned goals by extending one of the recent algorithms in the bandit literature, called BESA (Best Empirical Sampled Average) [Baransi et al., 2014]. One of the main advantages of BESA compared to other existing bandit algorithms is that it does not involve many hyper-parameters. This is especially useful when one has no, or insufficient, prior knowledge about the different arms in the beginning. This feature also makes it easier to introduce a modular design by using McDiarmid's Lemma [El-Yaniv and Pechyony, 2009].

Key contributions: We provide a modular definition of regret, called safety-aware regret, which allows higher flexibility in defining the risk for multi-armed bandit problems. We propose a new algorithm, called BESA+, which solves this category of problems. We show upper bounds on its safety-aware regret for two-armed and multi-armed bandits. In the experiments, we compare our method with notable earlier works and show that BESA+ performs well. In the last experiment, we illustrate the performance of our algorithm on a real clinical dataset and show that it is capable of solving the problem with a user-defined safety-aware regret. Finally, to the best of our knowledge for the first time, we provide a web application which allows users to create their own custom environments and compare our algorithm with other works.

Background and Notation

We consider the standard bandit setting with action (arm) set A, where each action a ∈ A is characterized by a reward distribution ϕ_a. The distribution for action a has mean µ_a and variance σ_a². Let X_{a,i} ∼ ϕ_a denote the i-th reward sampled from the distribution of action a. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) t, the player (an algorithm) selects action (arm) a_t and observes sample X_{a,N_{a,t}}, where N_{a,t} = Σ_{s=1}^{t} 1{a_s = a} denotes the number of samples observed for action a up to time t (inclusively). A policy is a distribution over A. In general, stochastic policies are necessary during the learning stage in order to identify the best arm. We discuss the exact notion of "best" below.

We define I_S(m, j) as the set obtained by sub-sampling without replacement j elements from the set S of size m. Let X_{a,t} denote the history of observations (records) obtained from action (arm) a up to time t (inclusively), such that |X_{a,t}| = N_{a,t}. The notation X_{a,t}(I) indicates the set of sub-samples from X_{a,t}, where the sub-sample I ⊂ {1, 2, ..., N_{a,t}}.

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that, under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], a policy can have logarithmic cumulative regret:

    lim inf_{t→∞} R_t / log(t) ≥ Σ_{a : µ_a < µ_⋆} (µ_⋆ − µ_a) / K_inf(r_a; r_⋆)

where K_inf(r_a; r_⋆) is the Kullback-Leibler divergence between the reward distributions of the respective arms. Policies for which this bound holds are called admissible.

Several algorithms have been shown to produce admissible policies, including UCB1 [Auer et al., 2002], Thompson sampling [Chapelle and Li, 2011; Agrawal and Goyal, 2013] and BESA [Baransi et al., 2014]. However, theoretical bounds are not always matched by empirical results. For example, it has been shown in [Kuleshov and Precup, 2014] that two algorithms which do not produce admissible policies, ε-greedy and Boltzmann exploration [Sutton and Barto, 1998], behave better than UCB1 on certain problems. Both BESA and Thompson sampling were shown to have comparable performance with Softmax and ε-greedy.

While the expected regret is a natural and popular measure of performance which allows the development of theoretical results, some papers have recently explored other definitions of regret. For example, [Sani et al., 2012b] consider a linear combination of variance and mean as the definition of regret for a learning algorithm A:

    MV̂_t(A) = σ̂_t²(A) − ρ µ̂_t(A)    (1)

where µ̂_t is the estimate of the average of the observed rewards up to time step t and σ̂_t² is a biased estimate of the variance of the rewards up to time step t. The regret is then defined as:

    R_t(A) = MV̂_t(A) − MV̂_{⋆,t}(A),

where ⋆ is the optimal arm. According to [Maillard, 2013], however, this definition penalizes the algorithm if it switches between optimal arms. Instead, in [Maillard, 2013], the authors devise a new definition of regret which controls the lower tail of the reward distribution. However, the algorithm that solves the corresponding objective function is time-consuming, and the optimization to be performed may be intricate. Finally, in [Galichet et al., 2013], the authors use the notion of conditional value at risk to define the regret.

Measure of regret

Unlike previous works, we now give a formal definition of the class of functions which can be used as a separate module inside our learning algorithm to measure the regret. We call this class of functions "safety value functions".

In the following, we formally define these functions. Assume we have k arms (|A| = k) with reward distributions ϕ_1, ϕ_2, ..., ϕ_k.

Definition 0.1 (safety value function). Let D denote the set of all possible reward distributions over a given interval. A safety value function v : D → R provides a score for a given distribution.

The optimal arm ⋆ under this value function is defined as

    ⋆ ∈ arg max_{a ∈ A} v(ϕ_a)    (2)

The regret corresponding to the safety value function up to time T is defined as:

    R_{T,v} = E[ Σ_{t=1}^{T} (v(ϕ_⋆) − v(ϕ_{a_t})) ]    (3)

We call (3) the safety-aware regret. When the context is clear, we usually drop the subscript v and write R_T for ease of notation.
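To make the modular role of the safety value function concrete, the sketch below treats v as a pluggable callable and accumulates the safety-aware regret of Equation (3) for an arbitrary sequence of pulled arms. This is a minimal illustration under our own assumptions (distributions are summarized by large samples, and the helper names are ours), not the paper's implementation.

```python
import numpy as np

# Two illustrative safety value functions v (scores computed from a large sample
# that stands in for the arm's reward distribution phi_a in this sketch).
def v_mean(samples):
    return float(np.mean(samples))

def v_mean_variance(samples, rho=1.0):
    # A mean-variance trade-off (used as Example 1 later in the text).
    return float(np.mean(samples) - rho * np.var(samples, ddof=1))

def safety_aware_regret(pulled_arms, arm_samples, v):
    """Accumulate R_{T,v} = sum_t (v(phi_star) - v(phi_{a_t})), as in Eq. (3)."""
    values = [v(s) for s in arm_samples]   # v(phi_a) for every arm
    v_star = max(values)                   # score of the optimal arm under v
    return sum(v_star - values[a] for a in pulled_arms)

# Toy usage: two arms, learner pulled arm 0 three times and arm 1 once.
rng = np.random.default_rng(0)
arm_samples = [rng.uniform(0.0, 1.0, 10_000), rng.beta(8, 2, 10_000)]
print(safety_aware_regret([0, 0, 1, 0], arm_samples, v_mean_variance))
```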
Definition 0.2 (Well-behaved safety value function). Given a reward distribution ϕ_a over the interval [0, 1], a safety value function v for this distribution is called well-behaved if there exists an unbiased estimator v̂ of v such that, for any set of observations {x_1, x_2, ..., x_n} sampled from ϕ_a and for some constant γ, we have:

    sup_{x̂_i} | v̂(x_1, ..., x_i, ..., x_n) − v̂(x_1, ..., x̂_i, ..., x_n) | < γ / n    (4)

If (4) holds for any reward distribution ϕ over the interval [0, 1], we call v a well-behaved safety value function.

Example 1: For a given arm a whose reward distribution is supported on the interval [0, 1], consider the safety value function µ_a − ρσ_a², which measures the balance between the mean and the variance of the reward distribution of arm a. Here ρ is a constant hyper-parameter adjusting the balance between the variance and the mean. This is a well-behaved safety value function if we use the following estimators for the empirical mean and variance:

    µ̂_{a,t} = (1 / N_{a,t}) Σ_{i=1}^{N_{a,t}} r_{a,i}    (5)

    σ̂²_{a,t} = (1 / (N_{a,t} − 1)) Σ_{i=1}^{N_{a,t}} (r_{a,i} − µ̂_{a,t})²    (6)

where r_{a,i} is the i-th reward obtained from pulling arm a. It should be clear that the unbiased estimator µ̂_{a,t} − ρσ̂²_{a,t} satisfies (4).
Other types of well-behaved safety value functions can be defined as functions of the standard deviation or of the conditional value at risk, similarly to the previous example. In the next section, we develop an algorithm which can optimize the safety-aware regret.

Proposed Algorithm

In order to optimize the safety-aware regret, we build on the BESA algorithm, which we now briefly review. As discussed in [Baransi et al., 2014], BESA is a non-parametric (hyper-parameter-free) approach for finding the optimal arm according to the expected-mean regret criterion. Consider a two-armed bandit with actions a and ⋆, where µ_⋆ > µ_a, and assume that N_{a,t} < N_{⋆,t} at time step t. In order to select the next arm for time step t + 1, BESA first sub-samples s_⋆ = I_⋆(N_{⋆,t}, N_{a,t}) from the observation history (records) of arm ⋆ and, similarly, sub-samples s_a = I_a(N_{a,t}, N_{a,t}) = X_{a,t} from the records of arm a. If µ̂_{s_a} > µ̂_{s_⋆}, BESA chooses arm a; otherwise it chooses arm ⋆.

The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may bias the estimates, diminishes. When there are more than two arms, BESA runs a tournament algorithm on the arms [Baransi et al., 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which the authors did not provide a formal proof. In this paper, we avoid using this lemma when proving the soundness of our proposed algorithm for a more general regret family. We also extend the proof to the multi-armed case, which was not provided in [Baransi et al., 2014].

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi et al., 2014], we focus on the two-armed bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ (two-action case)
Input: safety-aware value function v and its estimator v̂
Parameters: current time step t, actions a and b. Initially N_{a,0} = 0, N_{b,0} = 0
 1: if N_{a,t−1} = 0 ∨ N_{a,t−1} < log(t) then
 2:   a_t = a
 3: else if N_{b,t−1} = 0 ∨ N_{b,t−1} < log(t) then
 4:   a_t = b
 5: else
 6:   n_{t−1} = min{N_{a,t−1}, N_{b,t−1}}
 7:   I_{a,t−1} ← I_a(N_{a,t−1}, n_{t−1})
 8:   I_{b,t−1} ← I_b(N_{b,t−1}, n_{t−1})
 9:   Calculate ṽ_{a,t} = v̂(X_{a,t−1}(I_{a,t−1})) and ṽ_{b,t} = v̂(X_{b,t−1}(I_{b,t−1}))
10:   a_t = arg max_{i ∈ {a,b}} ṽ_{i,t} (break ties by choosing the arm with fewer tries)
11: end if
12: return a_t
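Below is a minimal Python sketch of the two-action selection rule as we read it from the pseudocode above; the function name, the data layout (a list of observed rewards per arm), and the use of Python's random.sample for sub-sampling without replacement are our own assumptions, not the authors' code.

```python
import math
import random

def besa_plus_select(history_a, history_b, t, v_hat, rng=random):
    """One BESA+ decision between actions "a" and "b".
    history_a / history_b: rewards observed so far for each action.
    t: current time step (t >= 1).
    v_hat: maps a list of rewards to an empirical safety value."""
    n_a, n_b = len(history_a), len(history_b)
    # Forced exploration: keep playing an arm until it has about log(t) observations.
    if n_a == 0 or n_a < math.log(t):
        return "a"
    if n_b == 0 or n_b < math.log(t):
        return "b"
    # Sub-sample both histories, without replacement, down to the smaller size.
    n = min(n_a, n_b)
    v_a = v_hat(rng.sample(history_a, n))
    v_b = v_hat(rng.sample(history_b, n))
    if v_a != v_b:
        return "a" if v_a > v_b else "b"
    # Tie: prefer the arm that has been tried fewer times.
    return "a" if n_a <= n_b else "b"
```

For more than two arms, the same pairwise rule would be applied in a tournament, as mentioned above.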
If there is a strong belief that one arm should be better than the other, then instead of the factor log(t) in Algorithm BESA+ one can use a factor α log(t) (where 0 < α < 1 is constant) to reduce the final regret.

The first major difference between BESA+ and BESA is the use of the safety-aware value function instead of the simple regret. A second important change is that BESA+ selects the arm which has been tried less up to time step t if that arm has been chosen fewer than log(t) times up to t. This change is essentially negligible in terms of the total expected regret, as we cannot achieve a better bound than log(T), which is shown in Robbins' lemma [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by log(T) (a result which we present shortly).

To better understand why this modification is necessary, consider a two-arm scenario. The first arm gives a deterministic reward of r ∈ [0, 0.5) and the second arm has a uniform distribution on the interval [0, 1], with an expected reward of 0.5. If we are only interested in the expected reward (µ), the algorithm should ultimately favor the second arm. On the other hand, there is a probability of r that the BESA algorithm will constantly choose the first arm if the second arm gives a value less than r on its first pull. In contrast, BESA+ avoids this problem by letting the second arm be selected enough times that it eventually becomes distinguishable from the first arm.

We are now ready to state the main theoretical result for our proposed algorithm.

Theorem 0.1. Let v be a well-behaved safety value function. Let A = {a, ⋆} be a two-armed bandit with bounded rewards in [0, 1] and value gap ∆ = v_⋆ − v_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

    R_T ≤ ζ_{∆,γ} log(T) + θ_{∆,γ}    (7)

where in (7), ζ_{∆,γ} and θ_{∆,γ} are constants which depend on the values of γ and ∆.

Proof. Due to the page limit, we cannot include the full proof; we only provide a short overview. The proof consists of two parts. The first part is similar to [Baransi et al., 2014], but instead uses McDiarmid's Lemma [El-Yaniv and Pechyony, 2009] [Tolstikhin, 2017]. In the second part, unlike [Baransi et al., 2014], we avoid their unproven lemma and instead compute the upper bound directly by exploiting the log trick in our algorithm (this trick is further elaborated in the first experiment). The interested reader can visit here to see the full proof.

Theorem 0.2. Let v be a well-behaved safety value function. Let A = {a_1, ..., a_{k−1}, ⋆} be a k-armed bandit with bounded rewards in [0, 1]. Without loss of generality, let the optimal arm be ⋆ and let the value gap for arms a, ⋆ be ∆_a = v_⋆ − v_a. Also let ∆_max = max_{a ∈ A} ∆_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

    R_T ≤ (∆_max ⌈log k⌉ / ∆_â) [ζ_{∆_â,γ} log(T) + θ_{∆_â,γ}] + k ∆_max n    (8)

where in (8), ζ and θ are constants which depend on the values of γ and ∆. Moreover, â is defined as:

    â = arg max_{a ∈ A} ζ_{∆_a,γ} log(T) + θ_{∆_a,γ}

for T > n.

Proof. We know that the arm ⋆ has to play at most ⌈log k⌉ matches (games) in order to win a round. If it loses any of these ⌈log k⌉ games, we incur a regret at that round, and this regret is at most ∆_max. In the following, we use the notation 1_{−⋆,i} to denote the indicator of the event that ⋆ loses the i-th match (1 ≤ i ≤ ⌈log k⌉).

    R_T = Σ_{t=1}^{T} Σ_{i=1}^{k} ∆_{a_i} E[1_{a_t = a_i}]
        ≤ Σ_{t=1}^{T} Σ_{i=1}^{⌈log k⌉} ∆_max E[1_{−⋆,i}]
        ≤ Σ_{t=1}^{T} Σ_{i=1}^{⌈log k⌉} ∆_max max_{i'} E[1_{−⋆,i'}]
        ≤ Σ_{i=1}^{⌈log k⌉} ∆_max Σ_{t=1}^{T} max_{i'} E[1_{−⋆,i'}]
        ≤ (∆_max ⌈log k⌉ / ∆_â) Σ_{t=n}^{T} ∆_â E[1_{−⋆,â}] + k ∆_max n
        ≤ (∆_max ⌈log k⌉ / ∆_â) [ζ_{∆_â,γ} log(T) + θ_{∆_â,γ}] + k ∆_max n    (9)

Empirical results

Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example discussed in the previous section through the results in Figures 1-3, for r ∈ {0.2, 0.3, 0.4}. Each experiment has been repeated 200 times. Note that while BESA has an almost linear regret behavior, BESA+ learns the optimal arm within the given time horizon and its expected accumulated regret is upper bounded by a log function. It is also easy to notice that BESA+ has a faster convergence rate compared with BESA. As r gets closer to 0.5, the problem becomes harder. This phenomenon is a direct illustration of our theoretical result.
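For reference, a minimal simulation of the synthetic two-arm environment behind Figures 1-3 could look as follows; it reuses the besa_plus_select sketch above with the empirical mean as the safety value, both of which are our own illustrative choices rather than the authors' experimental code.

```python
import random

def run_deterministic_vs_uniform(r=0.4, horizon=10_000, seed=0):
    """Arm "a" pays r deterministically; arm "b" is Uniform[0, 1] (mean 0.5)."""
    rng = random.Random(seed)
    history = {"a": [], "b": []}
    true_mean = {"a": r, "b": 0.5}
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = besa_plus_select(history["a"], history["b"], t,
                               v_hat=lambda xs: sum(xs) / len(xs), rng=rng)
        reward = r if arm == "a" else rng.random()
        history[arm].append(reward)
        regret += max(true_mean.values()) - true_mean[arm]   # expected regret
    return regret
```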
Figure 1: Result of accumulated expected regret for r = 0.4 (accumulated regret vs. steps, BESA vs. BESA+).

Figure 2: Result of accumulated expected regret for r = 0.3 (accumulated regret vs. steps, BESA vs. BESA+).

Figure 3: Result of accumulated expected regret for r = 0.2 (accumulated regret vs. steps, BESA vs. BESA+).
Conditional value at risk safety value function

As discussed in [Galichet et al., 2013], in some situations we need to limit the exploration of risky arms. Examples include financial investment, where investors may tend to choose a risk-averse strategy. Using the conditional value at risk as a risk measure is one approach to achieve this goal. Informally, the conditional value at risk at level α is the expected value over the lower quantile of the reward distribution, where the probability of falling inside this quantile is at most α. More formally:

    CVaR_α = E[X | X < v_α]    (10)

where in (10), v_α = arg max_β { P(X < β) ≤ α }. To estimate (10), we used the estimator introduced by [Chen, 2007]. This estimator is also employed in [Galichet et al., 2013] to derive their MARAB algorithm. Here, we used it for the conditional value at risk safety value function, which is the regret measure for this problem. Our environment consists of 20 arms, where each arm's reward distribution is a truncated Gaussian mixture consisting of four Gaussian distributions with equal probability. The rewards of the arms are restricted to the interval [0, 1]. To make the environment more complex, the means and standard deviations of the arms are sampled uniformly from the intervals [0, 1] and [0.5, 1], respectively. The experiments are carried out for α = 10%. For the MARAB algorithm, we used grid search and set the value C = 1. Figures 4 and 5 depict the results of the runs for ten experiments. It is noticeable that in both figures BESA+ has a lower variance across experiments.
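For reference, one simple plug-in estimate of CVaR_α (the average of the lowest α-fraction of the observed rewards) is sketched below. The paper relies on the estimator of [Chen, 2007], which may differ in details, so treat this as an illustrative assumption rather than the exact quantity used in the experiments.

```python
import numpy as np

def cvar_alpha(rewards, alpha=0.10):
    """Plug-in estimate of CVaR_alpha = E[X | X < v_alpha]:
    the mean of the lowest ceil(alpha * n) observed rewards."""
    x = np.sort(np.asarray(rewards, dtype=float))
    m = max(1, int(np.ceil(alpha * len(x))))
    return float(x[:m].mean())
```

Used as a safety value function, a higher CVaR means less probability mass in the low-reward tail, so an estimator of this kind could be passed as v̂ to the BESA+ sketch above.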
Figure 4: Accumulated regret (vs. steps, MARAB vs. BESA+). The safety value function here is conditional value at risk.

Figure 5: Percentage of optimal arm plays (vs. steps, MARAB vs. BESA+). The safety value function here is conditional value at risk.

Mean-variance safety value function

Next, we evaluated the performance of BESA+ with the regret definition provided by [Sani et al., 2012a]. Here, we used the same 20-arm Gaussian-mixture environment described in the previous section. We ran the experiments with ρ = 1, which is the trade-off factor between the variance and the mean. The results of this experiment are depicted in Figures 6 and 7. The hyper-parameters used here for the MV-LCB and ExpExp algorithms are those suggested by [Sani et al., 2012a]. Again, we can see that BESA+ has a relatively small variance over 10 experiments.

Figure 6: Accumulated regret. The safety value function here is mean-variance.

Figure 7: Percentage of optimal arm plays. The safety value function here is mean-variance.

Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1, Thompson sampling, MV-LCB, and ExpExp) on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different kinds of treatment (a standard treatment and a test treatment) were applied, and the results are based on the number of days each patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the experiment had already been conducted, to apply bandit algorithms, each time a treatment is selected by a bandit algorithm we sampled uniformly from the recorded results of the patients who received that treatment and used the survival time as the reward signal. Figure 8 shows the distributions of treatments 1 and 2. We categorized the survival time into ten categories (category 1 representing the shortest survival time). It is interesting to notice that while treatment 2 has a higher mean than treatment 1 (due to the effect of outliers), it also has a higher level of variance compared to treatment 1. From Figure 8 it is easy to deduce that treatment 1 has more consistent behavior than treatment 2, and a higher number of patients who received treatment 2 died early. That is why treatment 1 may be preferred over treatment 2 if we use the safety value function described in Example 1. In this regard, by setting ρ = 1, treatment 1 has a smaller expected mean-variance regret than treatment 2, and it should ultimately be favored by the learning algorithm. Figure 9 illustrates the performance of the different bandit algorithms. It is easy to notice that BESA+ performs better than all the other ones.

Figure 8: Distribution of survival-time categories (1-10) for treatment 1 and treatment 2.

Figure 9: Accumulated consistency-aware regret (vs. steps) for UCB1, Thompson sampling, MV-LCB, ExpExp, BESA, and BESA+.
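The reward-generation scheme described above can be sketched as a simple resampling from the recorded outcomes; the arrays below are placeholders with made-up category values, not the actual dataset.

```python
import random

# Hypothetical recorded outcomes: survival-time categories (1..10) per treatment.
recorded_outcomes = {
    "treatment_1": [2, 3, 3, 4, 2, 5, 3, 4, 2, 3],   # placeholder values
    "treatment_2": [1, 1, 9, 1, 10, 2, 1, 8, 1, 6],  # placeholder values
}

def sample_reward(treatment, rng=random):
    """Each time the bandit selects a treatment, draw uniformly (with replacement)
    from the recorded outcomes of patients who received that treatment."""
    return rng.choice(recorded_outcomes[treatment])
```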
                                                                      search if we can define similar safety-aware regret definition
Web Application Simulator                                             for broader reinforcement learning problems including MDP
As discussed earlier, for this project, we have developed a           environments.
web application simulator for bandit problem where users
can create their customized environment and run experiments           Acknowledgment
online. Usually, research works provide limited experiments
                                                                      We would like to thank Audrey Durand for her comments and
to testify their method. We tried to overcome this problem
                                                                      insight on this project. We also thank department of family
by developing this web application where the user can select
                                                                      medicine of McGill University and CIHR for their generous
number of arms and change their reward distribution. Then
                                                                      support during this project.
the web application will send the input to the web-server and
show the results to the user by providing regret figures and
additional figures describing the way algorithms have chosen          References
arms over time. This software can be used as a benchmark              [Agrawal and Goyal, 2013] Shipra Agrawal and Navin
for future bandit research and it is open sourced for future            Goyal. Further optimal regret bounds for thompson
References

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107, 2013.

[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[Austin, 2011] Peter C Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399-424, 2011.

[Baransi et al., 2014] Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In ECML-PKDD, pages 115-131, 2014.

[Burnetas and Katehakis, 1996] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122-142, 1996.

[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249-2257, 2011.

[Chen, 2007] Song Xi Chen. Nonparametric estimation of expected shortfall. Journal of Financial Econometrics, 6(1):87-107, 2007.

[El-Yaniv and Pechyony, 2009] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35(1):193, 2009.

[Galichet et al., 2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245-260, 2013.

[García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.

[Kuleshov and Precup, 2014] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[Lai and Robbins, 1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

[Maillard, 2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In ICML, pages 218-233, 2013.

[Ripley et al., 2013] Brian Ripley, Bill Venables, Douglas M Bates, Kurt Hornik, Albrecht Gebhardt, David Firth, and Maintainer Brian Ripley. Package 'MASS'. CRAN R, 2013.

[Robbins, 1985] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169-177. Springer, 1985.

[Sani et al., 2012a] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275-3283, 2012.

[Sani et al., 2012b] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In NIPS, pages 3275-3283, 2012.

[Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[Tolstikhin, 2017] IO Tolstikhin. Concentration inequalities for samples without replacement. Theory of Probability & Its Applications, 61(3):462-481, 2017.