Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

Hossein Aboutalebi*1, Doina Precup1, and Tibor Schuster2
1 Department of Computer Science, McGill University. Mila Quebec AI Institute
2 Department of Family Medicine, McGill University
* hossein.aboutalebi@mail.mcgill.ca

Abstract

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms. Finally, we provide a web application where users can create their desired synthetic bandit environment and compare the performance of different bandit algorithms online.

Introduction

The multi-armed bandit is a standard model for researchers to investigate the exploration-exploitation trade-off, see e.g. [Baransi et al., 2014; Auer et al., 2002; Sani et al., 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998]. One of the main advantages of multi-armed bandit problems is their simplicity, which allows for a higher level of theoretical study. The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated with it. Consider a series of multiple arm pulls (steps) t = 1, ..., T, with a specific arm a ∈ A selected at each step, i.e. a(t) = a_t. The standard goal in the multi-armed bandit setting is to find the arm ⋆ which has the maximum expected reward µ_⋆ (or, equivalently, minimum expected regret). The expected regret after T steps, R_T, is defined as the sum of the expected differences between the mean reward under {a_t} and the reward expected under the optimal arm ⋆:

\[
R_T = \mathbb{E}\left[\sum_{t=1}^{T} (\mu_\star - \mu_{a_t})\right]
\]

While this objective is very popular, there are practical applications, for example in medical research and AI safety [Garcıa and Fernández, 2015], where maximizing expected value is not sufficient and it would be better to have an algorithm that is also sensitive to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients' treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, as the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial's objective. Due to these problems, previous contributions like [Sani et al., 2012a] try to include the variance of rewards in the regret definition and develop algorithms to solve this slightly enhanced problem. While these modified approaches try to account for variability in the response of arms, they induce new problems, because the variance is not necessarily a good measure of variability for a distribution: it penalizes responses above and below the mean equally. Other articles, like [Galichet et al., 2013], use the conditional value at risk to arrive at a better regret definition. Though the conditional value at risk may address the problem we faced with including the variance, it may not reflect the amount of variability we could observe for a distribution over its entire domain. All in all, the consistency of treatments among patients is essential, with the ideal treatment usually defined as the one which has a high positive response rate while showing low variability in response among patients. Thus, the idea of consistency and safety is to some extent subjective and problem dependent. As a result, it might be necessary to develop an algorithm which can work with an arbitrary definition of consistency for a distribution. This kind of system design, which allows the separation of different parts of a system (here, the regret function and the learning algorithm), has already been explored in modular programming. In modular programming, we emphasize splitting the entire system into independent modules whose composition builds the system. This design principle is necessary when we are dealing with changing customer demands and require the system to adapt to the new demands. Here, we follow the same paradigm by making the regret definition independent of the learning algorithm. As a result, we allow more flexibility in defining a regret function that is capable of incorporating problem-specific demands.

Finally, we achieve the aforementioned goals by extending one of the recent algorithms in the bandit literature, BESA (Best Empirical Sampled Average) [Baransi et al., 2014]. One of the main advantages of BESA compared to other existing bandit algorithms is that it does not involve many hyper-parameters. This is especially useful when one has no, or insufficient, prior knowledge about the different arms in the beginning. Also, this feature makes it easier to introduce modular design by using McDiarmid's Lemma [El-Yaniv and Pechyony, 2009].

Key contributions: We provide a modular definition of regret called safety-aware regret, which allows higher flexibility in defining the risk for multi-armed bandit problems. We propose a new algorithm called BESA+ which solves this category of problems. We show upper bounds on its safety-aware regret for two-armed and multi-armed bandits. In the experiments, we compare our model with some notable earlier research works and show that BESA+ has satisfying performance. In the last experiment, we depict the performance of our algorithm on a real clinical dataset and illustrate that it is capable of solving the problem with a user-defined safety-aware regret. Finally, for the first time as far as we know, we provide a web application which allows users to create their own custom environment and compare our algorithm with other works.

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], a policy can have logarithmic cumulative regret:

\[
\liminf_{t\to\infty} \frac{R_t}{\log(t)} \ge \sum_{a:\,\mu_a < \mu_\star} \frac{\mu_\star - \mu_a}{K_{\inf}(r_a; r_\star)}
\]

where K_inf(r_a; r_⋆) is the Kullback-Leibler divergence between the reward distributions of the respective arms. Policies for which this bound holds are called admissible.

Several algorithms have been shown to produce admissible policies, including UCB1 [Auer et al., 2002], Thompson sampling [Chapelle and Li, 2011; Agrawal and Goyal, 2013] and BESA [Baransi et al., 2014]. However, theoretical bounds are not always matched by empirical results. For example, it has been shown in [Kuleshov and Precup, 2014] that two algorithms which do not produce admissible policies, ε-greedy and Boltzmann exploration [Sutton and Barto, 1998], behave better than UCB1 on certain problems. Both BESA and Thompson sampling were shown to have comparable performance with Softmax and ε-greedy.

While the expected regret is a natural and popular measure of performance which allows the development of theoretical results, some papers have recently explored other definitions of regret. For example, [Sani et al., 2012b] consider a linear combination of variance and mean as the definition of regret for a learning algorithm A:

\[
\widehat{MV}_t(A) = \hat{\sigma}_t^2(A) - \rho\, \hat{\mu}_t(A) \tag{1}
\]

where µ̂_t is the estimate of the average of observed rewards up to time step t and σ̂_t^2 is a biased estimate of the variance of rewards up to time step t. The regret is then defined as:

\[
\widehat{R}_t(A) = \widehat{MV}_t(A) - \widehat{MV}_{\star,t}(A),
\]

where ⋆ is the optimal arm. According to [Maillard, 2013], however, this definition penalizes the algorithm if it switches between optimal arms. Instead, in [Maillard, 2013], the authors devise a new definition of regret which controls the lower tail of the reward distribution. However, the algorithm to solve the corresponding objective function seems time-consuming, and the optimization to be performed may be intricate. Finally, in [Galichet et al., 2013], the authors use the notion of conditional value at risk in order to define the regret.

Background and Notation

We consider the standard bandit setting with action (arm) set A, where each action a ∈ A is characterized by a reward distribution ϕ_a. The distribution for action a has mean µ_a and variance σ_a^2. Let X_{a,i} ∼ ϕ_a denote the i-th reward sampled from the distribution of action a. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) t, the player (an algorithm) selects action (arm) a_t and observes sample X_{a,N_{a,t}}, where N_{a,t} = \sum_{s=1}^{t} \mathbb{1}\{a_s = a\} denotes the number of samples observed for action a up to time t (inclusively). A policy is a distribution over A. In general, stochastic policies are necessary during the learning stage in order to identify the best arm. We discuss the exact notion of "best" below.

We define I_S(m, j) as the set obtained by sub-sampling without replacement j elements from a set S of size m. Let X_{a,t} denote the history of observations (records) obtained from action (arm) a up to time t (inclusively), such that |X_{a,t}| = N_{a,t}. The notation X_{a,t}(I) indicates the set of sub-samples from X_{a,t} indexed by the sub-sample I ⊂ {1, 2, . . . , N_{a,t}}.
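To make this notation concrete, the following minimal Python sketch (our own; the names Arm and sub_sample are not from the paper) shows how the history X_{a,t}, the count N_{a,t}, and the without-replacement sub-sampling operator I_S(m, j) can be represented:

```python
import random

def sub_sample(history, j, rng=random):
    """I_S(m, j): draw j elements without replacement from a history of size m."""
    return rng.sample(history, j)

class Arm:
    """Toy arm keeping the observation history X_{a,t}; N_{a,t} is len(history)."""
    def __init__(self, sampler):
        self.sampler = sampler   # callable returning one reward in [0, 1]
        self.history = []        # X_{a,t}

    def pull(self):
        reward = self.sampler()
        self.history.append(reward)
        return reward

# Example: an arm with Uniform[0, 1] rewards, pulled five times and then sub-sampled.
arm = Arm(random.random)
for _ in range(5):
    arm.pull()
print(sub_sample(arm.history, 3))
```

Here random.sample draws without replacement, which is exactly the role I_S(m, j) plays in the algorithms below.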
Measure of regret

Unlike previous works, we now give a formal definition of a class of functions which can be used as a separate module inside the learning algorithm to measure the regret. We call this class of functions "safety value functions". In the following, we define these functions formally. Assume we have k arms (|A| = k) with reward distributions ϕ_1, ϕ_2, . . . , ϕ_k.

Definition 0.1 (safety value function). Let D denote the set of all possible reward distributions over a given interval. A safety value function v : D → R provides a score for a given distribution.

The optimal arm ⋆ under this value function is defined as

\[
\star \in \arg\max_{a \in A} v(\varphi_a) \tag{2}
\]

The regret corresponding to the safety value function up to time T is defined as:

\[
R_{T,v} = \mathbb{E}\left[\sum_{t=1}^{T} \big(v(\varphi_\star) - v(\varphi_{a_t})\big)\right] \tag{3}
\]

We call (3) the safety-aware regret. When the context is clear, we usually drop the subscript v and write only R_T for ease of notation.

Definition 0.2 (well-behaved safety value function). Given a reward distribution ϕ_a over the interval [0, 1], a safety value function v for this distribution is called well-behaved if there exists an unbiased estimator v̂ of v such that, for any set of observations {x_1, x_2, . . . , x_n} sampled from ϕ_a and for some constant γ, we have:

\[
\sup_{\hat{x}_i} \big|\hat{v}(x_1, \ldots, x_i, \ldots, x_n) - \hat{v}(x_1, \ldots, \hat{x}_i, \ldots, x_n)\big| < \frac{\gamma}{n} \tag{4}
\]

If (4) holds for any reward distribution ϕ over the interval [0, 1], we call v a well-behaved safety value function.

Example 1: For a given arm a whose reward distribution is limited to the interval [0, 1], consider the safety value function µ_a − ρσ_a^2, which measures the balance between the mean and the variance of the reward distribution of arm a. Here ρ is a hyper-parameter constant for adjusting the balance between the variance and the mean. This is a well-behaved safety value function if we use the following estimators for the empirical mean and variance:

\[
\hat{\mu}_{a,t} = \frac{1}{N_{a,t}} \sum_{i=1}^{N_{a,t}} r_{a,i} \tag{5}
\]

\[
\hat{\sigma}_{a,t}^2 = \frac{1}{N_{a,t} - 1} \sum_{i=1}^{N_{a,t}} \big(r_{a,i} - \hat{\mu}_{a,t}\big)^2 \tag{6}
\]

where r_{a,i} is the i-th reward obtained from pulling arm a. It should be clear that the unbiased estimator µ̂_{a,t} − ρσ̂_{a,t}^2 satisfies (4).

Other well-behaved safety value functions can be defined as functions of the standard deviation or the conditional value at risk, in a similar way to the previous example. In the next section, we develop an algorithm which can optimize the safety-aware regret.

Proposed Algorithm

In order to optimize the safety-aware regret, we build on the BESA algorithm, which we now briefly review. As discussed in [Baransi et al., 2014], BESA is a non-parametric (hyper-parameter-free) approach for finding the optimal arm according to the expected-mean regret criterion. Consider a two-armed bandit with actions a and ⋆, where µ_⋆ > µ_a, and assume that N_{a,t} < N_{⋆,t} at time step t. In order to select the next arm for time step t + 1, BESA first sub-samples s_⋆ = I_⋆(N_{⋆,t}, N_{a,t}) from the observation history (records) of arm ⋆ and similarly sub-samples s_a = I_a(N_{a,t}, N_{a,t}) = X_{a,t} from the records of arm a. If µ̂_{s_a} > µ̂_{s_⋆}, BESA chooses arm a; otherwise it chooses arm ⋆. The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may cause bias in the estimates, diminishes. When there are more than two arms, BESA runs a tournament algorithm on the arms [Baransi et al., 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which the authors did not provide any formal proof. In this paper, we avoid using this lemma when proving the soundness of our proposed algorithm for a more general regret family. Also, we extend the proof to the multi-armed case, which was not provided in [Baransi et al., 2014].

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi et al., 2014], we focus on the two-armed bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ (two-action case)
Input: safety-aware value function v and its estimator v̂
Parameters: current time step t, actions a and b. Initially N_{a,0} = 0, N_{b,0} = 0
1: if N_{a,t-1} = 0 ∨ N_{a,t-1} < log(t) then
2:   a_t = a
3: else if N_{b,t-1} = 0 ∨ N_{b,t-1} < log(t) then
4:   a_t = b
5: else
6:   n_{t-1} = min{N_{a,t-1}, N_{b,t-1}}
7:   I_{a,t-1} ← I_a(N_{a,t-1}, n_{t-1})
8:   I_{b,t-1} ← I_b(N_{b,t-1}, n_{t-1})
9:   Calculate ṽ_{a,t} = v̂(X_{a,t-1}(I_{a,t-1})) and ṽ_{b,t} = v̂(X_{b,t-1}(I_{b,t-1}))
10:  a_t = arg max_{i∈{a,b}} ṽ_{i,t} (break ties by choosing the arm with fewer tries)
11: end if
12: return a_t

If there is a strong belief that one arm should be better than the other, then instead of using the factor log(t) in Algorithm BESA+, one can use a factor α log(t) (where 0 < α < 1 is a constant) to reduce the final regret.

The first major difference between BESA+ and BESA is the use of the safety-aware value function in place of the empirical mean. A second important change is that BESA+ selects the arm which has been tried less up to time step t whenever that arm has been chosen fewer than log(t) times. Essentially, this change is negligible in terms of the total expected regret, as we cannot achieve any bound better than log(T), which is shown in Robbins' lemma [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by log(T) (a result which we present shortly).

To better understand why this modification is necessary, consider a two-arm scenario. The first arm gives a deterministic reward of r ∈ [0, 0.5) and the second arm has a uniform distribution on the interval [0, 1], with an expected reward of 0.5. If we are only interested in the expected reward (µ), the algorithm should ultimately favor the second arm. On the other hand, there is a probability of r that the BESA algorithm will constantly choose the first arm if the second arm gives a value less than r on its first pull. In contrast, BESA+ evades this problem by letting the second arm be selected enough times that it eventually becomes distinguishable from the first arm.

We are now ready to state the main theoretical result of our proposed algorithm.
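To make the selection rule concrete, here is a minimal Python sketch of a single BESA+ decision for the two-action case, with the mean-variance estimator of Example 1 plugged in as v̂. The function names (besa_plus_select, mean_variance) and the tie-breaking details are our own; this illustrates the pseudocode above under those assumptions and is not the authors' reference implementation.

```python
import math
import random

def mean_variance(samples, rho=1.0):
    """Estimator v-hat for Example 1: empirical mean minus rho times sample variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / (n - 1) if n > 1 else 0.0
    return mu - rho * var

def besa_plus_select(t, hist_a, hist_b, v_hat=mean_variance, rng=random):
    """One BESA+ decision between arms 'a' and 'b' at time step t (two-action case)."""
    n_a, n_b = len(hist_a), len(hist_b)
    # Forced exploration: play any arm pulled fewer than log(t) times.
    if n_a == 0 or n_a < math.log(t):
        return "a"
    if n_b == 0 or n_b < math.log(t):
        return "b"
    # Sub-sample both histories, without replacement, down to the smaller size.
    n = min(n_a, n_b)
    v_a = v_hat(rng.sample(hist_a, n))
    v_b = v_hat(rng.sample(hist_b, n))
    if v_a > v_b:
        return "a"
    if v_b > v_a:
        return "b"
    # Tie: prefer the arm with fewer tries.
    return "a" if n_a <= n_b else "b"
```

Repeatedly calling besa_plus_select, pulling the returned arm, and appending the observed reward to the corresponding history gives the two-arm learning loop; a tournament over pairs of arms extends it to the multi-armed case, as in BESA.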
Theorem 0.1. Let v be a well-behaved safety value function. Let A = {a, ⋆} be a two-armed bandit with bounded rewards in [0, 1], and let the value gap be ∆ = v_⋆ − v_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

\[
R_T \le \zeta_{\Delta,\gamma} \log(T) + \theta_{\Delta,\gamma} \tag{7}
\]

where in (7), ζ_{∆,γ} and θ_{∆,γ} are constants which depend on the values of γ and ∆.

Proof. Due to the page limit, we cannot include the full proof; we provide a short overview. The proof mainly consists of two parts. The first part is similar to [Baransi et al., 2014], but instead we use McDiarmid's Lemma [El-Yaniv and Pechyony, 2009] [Tolstikhin, 2017]. For the second part, unlike [Baransi et al., 2014], we avoid using the unproven lemma in their work and instead compute the upper bound directly by exploiting the log trick in our algorithm (this trick is further elaborated in the first experiment). The interested reader can visit here to see the full proof.

Theorem 0.2. Let v be a well-behaved safety value function. Let A = {a_1, . . . , a_{k−1}, ⋆} be a k-armed bandit with bounded rewards in [0, 1]. Without loss of generality, let the optimal arm be ⋆ and the value gap for arm a be ∆_a = v_⋆ − v_a. Also let ∆_max = max_{a∈A} ∆_a. Given the value γ, the expected safety-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

\[
R_T \le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \big[\zeta_{\Delta_{\hat{a}},\gamma} \log(T) + \theta_{\Delta_{\hat{a}},\gamma}\big] + k \Delta_{\max}\, n \tag{8}
\]

where in (8), ζ and θ are constants which depend on the values of γ and ∆. Moreover, â is defined as

\[
\hat{a} = \arg\max_{a \in A}\ \zeta_{\Delta_a,\gamma} \log(T) + \theta_{\Delta_a,\gamma}
\]

for T ≥ n.

Proof. We know that arm ⋆ has to play at most ⌈log k⌉ matches (games) in order to win the round. If it loses any of these ⌈log k⌉ games, we know that at that round we will see a regret, and this regret is at most ∆_max. In the following, we use the notation 1_{−a_⋆,i} to denote the indicator of the event that a_⋆ loses the i-th match (1 ≤ i ≤ ⌈log k⌉).

\[
\begin{aligned}
R_T &= \sum_{t=1}^{T} \sum_{i=1}^{k} \Delta_{a_i}\, \mathbb{E}[\mathbb{1}_{a_t = a_i}] \\
&\le \sum_{t=1}^{T} \sum_{i=1}^{\lceil \log k \rceil} \Delta_{\max}\, \mathbb{E}[\mathbb{1}_{-a_\star,i}] \\
&\le \sum_{t=1}^{T} \sum_{i=1}^{\lceil \log k \rceil} \Delta_{\max}\, \max_{i'} \{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \sum_{i=1}^{\lceil \log k \rceil} \sum_{t=1}^{T} \Delta_{\max}\, \max_{i'} \{\mathbb{E}[\mathbb{1}_{-a_\star,i'}]\} \\
&\le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \sum_{t=n}^{T} \Delta_{\hat{a}}\, \mathbb{E}[\mathbb{1}_{-a_\star,\hat{a}}] + k \Delta_{\max}\, n \\
&\le \frac{\Delta_{\max} \lceil \log k \rceil}{\Delta_{\hat{a}}} \big[\zeta_{\Delta_{\hat{a}},\gamma} \log(T) + \theta_{\Delta_{\hat{a}},\gamma}\big] + k \Delta_{\max}\, n
\end{aligned} \tag{9}
\]

Empirical results

Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example discussed in the previous section through the results in Figures 1-3, for r ∈ {0.2, 0.3, 0.4}. Each experiment has been repeated 200 times. Note that while BESA has an almost linear regret behavior, BESA+ can learn the optimal arm within the given time horizon, and its expected accumulated regret is upper bounded by a log function. It is also easy to notice that BESA+ has a faster convergence rate compared with BESA. As r gets closer to 0.5, the problem becomes harder. This phenomenon is a direct illustration of our theoretical result.

Figure 1: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.4.
Figure 2: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.3.
Figure 3: Accumulated expected regret (regret vs. steps) of BESA and BESA+ for r = 0.2.
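The qualitative effect behind Figures 1-3 can be reproduced with a short, self-contained simulation. The sketch below uses the expected reward as the safety value and compares the sub-sampling rule with and without the log(t) forced-exploration threshold; the horizon, number of repetitions, and all implementation details are our own choices rather than the exact experimental setup of the paper.

```python
import math
import random

def select(t, hist, forced_exploration):
    """Choose between arm 0 and arm 1 by comparing sub-sampled means (the BESA core).
    With forced_exploration=True this follows the BESA+ rule with the log(t) threshold."""
    for arm in (0, 1):
        if len(hist[arm]) == 0 or (forced_exploration and len(hist[arm]) < math.log(t)):
            return arm
    n = min(len(hist[0]), len(hist[1]))
    m0 = sum(random.sample(hist[0], n)) / n
    m1 = sum(random.sample(hist[1], n)) / n
    if m0 == m1:
        return 0 if len(hist[0]) <= len(hist[1]) else 1
    return 0 if m0 > m1 else 1

def run(r=0.4, horizon=2000, forced_exploration=True):
    """Accumulated expected regret on the two-arm example: arm 0 pays r deterministically,
    arm 1 is Uniform[0, 1] with mean 0.5 (the optimal arm)."""
    means = (r, 0.5)
    hist = ([], [])
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = select(t, hist, forced_exploration)
        reward = r if arm == 0 else random.random()
        hist[arm].append(reward)
        regret += max(means) - means[arm]
    return regret

if __name__ == "__main__":
    runs = 10  # the paper uses a 10,000-step horizon and 200 repetitions; smaller numbers keep the sketch fast
    for label, forced in (("without log(t) rule", False), ("with log(t) rule", True)):
        avg = sum(run(0.4, 2000, forced) for _ in range(runs)) / runs
        print(label, "-> average accumulated regret:", round(avg, 1))
```

With the threshold disabled, a run whose first uniform draw falls below r can lock onto the deterministic arm and accumulate regret linearly, which is the failure mode discussed above; the forced-exploration rule removes it.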
Conditional value at risk safety value function

As discussed in [Galichet et al., 2013], in some situations we need to limit the exploration of risky arms. Examples include financial investment, where investors may tend to choose a risk-averse strategy. Using the conditional value at risk as a risk measure is one approach to achieving this goal. Informally, the conditional value at risk at level α is defined as the expected value of the quantile of the reward distribution in which values occur with probability at most α. More formally:

\[
CVaR_\alpha = \mathbb{E}[X \mid X < v_\alpha] \tag{10}
\]

where in (10), v_α = arg max_β {P(X < β) ≤ α}.

To estimate (10), we use the estimator introduced by [Chen, 2007]. This estimator is also employed in [Galichet et al., 2013] to derive their MARAB algorithm. Here, we use it for the conditional value at risk safety value function, which is the regret measure for this problem. Our environment consists of 20 arms, where each arm's reward distribution is a truncated Gaussian mixture consisting of four Gaussian distributions with equal probability. The rewards of the arms are restricted to the interval [0, 1]. To make the environment more complex, the means and standard deviations of the arms are sampled uniformly from the intervals [0, 1] and [0.5, 1], respectively. The experiments are carried out for α = 10%. For the MARAB algorithm, we used grid search and set the value C = 1. Figures 4 and 5 depict the results over ten experiments. It is noticeable that in both figures BESA+ has a lower variance across experiments.

Figure 4: Accumulated regret (regret vs. steps) of MARAB and BESA+. The safety value function here is the conditional value at risk.
Figure 5: Percentage of optimal arm plays (vs. steps) of MARAB and BESA+. The safety value function here is the conditional value at risk.

Mean-variance safety value function

Next, we evaluated the performance of BESA+ with the regret definition provided by [Sani et al., 2012a]. Here, we used the same 20-arm Gaussian mixture environment described in the previous section. We ran the experiments with ρ = 1, which is the trade-off factor between the variance and the mean. The results of this experiment are depicted in Figures 6 and 7. The hyper-parameters used here for the MV-LCB and ExpExp algorithms are those suggested by [Sani et al., 2012a]. Again, we can see that BESA+ has a relatively small variance over the 10 experiments.

Figure 6: Accumulated regret (regret vs. steps). The safety value function here is mean-variance.
Figure 7: Percentage of optimal arm plays (vs. steps). The safety value function here is mean-variance.
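For readers who want to plug a risk measure into the sketches above, the following is one simple empirical estimator of CVaR at level α that can serve as v̂. It is the plain lower-tail sample average, offered as an illustrative stand-in rather than the exact estimator of [Chen, 2007] used in these experiments.

```python
def cvar(samples, alpha=0.10):
    """Empirical conditional value at risk at level alpha: the average of the
    lowest alpha-fraction of observed rewards (higher means a safer arm)."""
    if not samples:
        raise ValueError("CVaR needs at least one sample")
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))  # size of the lower tail
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

Passing cvar in place of mean_variance in the besa_plus_select sketch gives a risk-averse variant in the spirit of this experiment; α = 0.10 matches the level used above.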
Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1, Thompson sampling, MV-LCB, and ExpExp) on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different kinds of treatment (a standard treatment and a test treatment) were applied, and the outcomes are the number of days a patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the experiment has already been conducted, to apply bandit algorithms we proceeded as follows: each time a treatment is selected by a bandit algorithm, we sample uniformly from the recorded results of the patients who received that treatment and use the survival time as the reward signal. Figure 8 shows the distributions for treatments 1 and 2. We categorized the survival time into ten categories (category 1 denoting the minimum survival time). It is interesting to notice that while treatment 2 has a higher mean than treatment 1 (due to the effect of outliers), it also has a higher level of variance than treatment 1. From Figure 8 it is easy to deduce that treatment 1 has a more consistent behavior than treatment 2 and that a higher number of patients who received treatment 2 died early. That is why treatment 1 may be preferred over treatment 2 if we use the safety value function described in Example 1. In this regard, by setting ρ = 1, treatment 1 has a lower expected mean-variance regret than treatment 2, and it should ultimately be favored by the learning algorithm. Figure 9 illustrates the performance of the different bandit algorithms. It is easy to notice that BESA+ performs relatively better than all the other ones.

Figure 8: Distribution of survival-time categories (1-10) for treatment 1 and treatment 2.
Figure 9: Accumulated consistency-aware regret (regret vs. steps) for UCB1, Thompson sampling, MV-LCB, ExpExp, BESA, and BESA+.
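The bootstrap construction described above (replaying recorded outcomes as bandit rewards) can be sketched as follows. The survival times, the cut-off, and the equal-width ten-category binning below are placeholders of our own; the actual values come from the dataset in [Ripley et al., 2013], and the paper does not specify the exact binning procedure.

```python
import random

def to_categories(survival_days, n_categories=10, max_days=1000):
    """Map raw survival times (days) to categories 1..n (category 1 = shortest survival),
    then rescale to [0, 1] so they can be used as bounded rewards.
    Equal-width bins and the max_days cut-off are illustrative assumptions."""
    rewards = []
    for days in survival_days:
        cat = min(n_categories, 1 + (n_categories * min(days, max_days - 1)) // max_days)
        rewards.append(cat / n_categories)
    return rewards

class BootstrapTreatmentArm:
    """Bandit arm replaying recorded patient outcomes for one treatment:
    each pull samples uniformly (with replacement) from the recorded rewards."""
    def __init__(self, recorded_rewards, rng=random):
        self.recorded = list(recorded_rewards)
        self.rng = rng

    def pull(self):
        return self.rng.choice(self.recorded)

# Hypothetical survival times (days); the real ones come from the lung cancer dataset.
standard = BootstrapTreatmentArm(to_categories([150, 180, 200, 160, 220, 140, 170]))
test = BootstrapTreatmentArm(to_categories([5, 8, 900, 10, 980, 12, 15]))
```

With these placeholder numbers, the "test" arm has the higher mean but much higher variability, so an expected-reward learner favors it while the mean-variance safety value of Example 1 (with ρ = 1) favors the more consistent "standard" arm, mirroring the discussion above.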
Web Application Simulator

As discussed earlier, for this project we have developed a web application simulator for bandit problems, where users can create their customized environment and run experiments online. Research works usually provide only a limited set of experiments to test their method. We tried to overcome this problem by developing this web application, where the user can select the number of arms and change their reward distributions. The web application then sends the input to the web server and shows the results to the user, providing regret figures and additional figures describing the way the algorithms have chosen arms over time. This software can be used as a benchmark for future bandit research, and it is open-sourced for future extension. The link is here.

Conclusion and future work

In this paper, we developed a modular safety-aware regret definition which can be used to define the function of interest as a safety measure. We also modified the BESA algorithm and equipped it with new features to solve modular safety-aware regret bandit problems. We then computed the asymptotic regret of BESA+ and showed that it can perform like an admissible policy if the safety value function satisfies a mild assumption. Finally, we depicted the performance of BESA+ on the regret definitions of previous works and showed that it can have better performance in most cases.

It is still interesting to investigate whether we can find better bounds for the BESA+ algorithm with the modular safety-aware regret definition. Another interesting path would be to research whether we can define similar safety-aware regret definitions for broader reinforcement learning problems, including MDP environments.

Acknowledgment

We would like to thank Audrey Durand for her comments and insight on this project. We also thank the Department of Family Medicine of McGill University and CIHR for their generous support during this project.

References

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[Austin, 2011] Peter C Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
[Baransi et al., 2014] Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In ECML-KDD, pages 115–131, 2014.
[Burnetas and Katehakis, 1996] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[Chen, 2007] Song Xi Chen. Nonparametric estimation of expected shortfall. Journal of Financial Econometrics, 6(1):87–107, 2007.
[El-Yaniv and Pechyony, 2009] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35(1):193, 2009.
[Galichet et al., 2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
[Garcıa and Fernández, 2015] Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[Kuleshov and Precup, 2014] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
[Lai and Robbins, 1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[Maillard, 2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In ICML, pages 218–233, 2013.
[Ripley et al., 2013] Brian Ripley, Bill Venables, Douglas M Bates, Kurt Hornik, Albrecht Gebhardt, David Firth, and Maintainer Brian Ripley. Package 'MASS'. CRAN R, 2013.
[Robbins, 1985] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
[Sani et al., 2012a] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
[Sani et al., 2012b] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In NIPS, pages 3275–3283, 2012.
[Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[Tolstikhin, 2017] IO Tolstikhin. Concentration inequalities for samples without replacement. Theory of Probability & Its Applications, 61(3):462–481, 2017.