<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Reliable Policies in the Bandit Setting with Application to Adaptive Clinical Trials</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, McGill University. Mila Quebec AI Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Family Medicine, McGill University</institution>
        </aff>
      </contrib-group>
      <fpage>43</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>The stochastic multi-armed bandit problem is a well-known model for studying the explorationexploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for a dynamic change of patient allocation ratios. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data especially when the sample size is small. In this article, we propose a modification of the BESA algorithm [Baransi, Maillard, and Mannor, 2014] which takes into account the variance in the action outcomes in addition to the mean. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset form the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The multi-armed bandit is a standard model for researchers
to investigate the exploration-exploitation trade-off, see
e.g [Baransi, Maillard, and Mannor, 2014; Auer,
CesaBianchi, and Fischer, 2002; Sani, Lazaric, and Munos, 2012a;
Chapelle and Li, 2011; Sutton and Barto, 1998] . Unlike
fully sequential decision-making problems, multi-armed
bandit problems are simple enough to allow for theoretical
studies.</p>
      <p>The multi-armed bandit problem consists of a set of arms,
each of which generates a stochastic reward from a fixed but
unknown distribution associated to it. The standard goal in
this setting is to find the arm ? which has the maximum
expected reward ? (or equivalently, minimum expected regret).
The expected regret RT is defined as the sum of the expected
difference between the mean reward of the chosen arm at and
the optimal arm until t = T :</p>
      <p>RT = E
" T</p>
      <p>X( ?
t=1
at )
#</p>
      <p>While this objective is very popular, there are
practical applications, for example in medical research and AI
safety [Garcıa and Ferna´ndez, 2015] where maximizing
expected value is not sufficient, and it would be better to have
an algorithm sensitive also to the variability of the outcomes
of a given arm. For example, consider multi-arm clinical
trials where the objective is to find the most promising treatment
among a pool of available treatments. Due to heterogeneity in
patients’ treatment responses, considering only the expected
mean may not be of interest [Austin, 2011]. Specifically, as
the mean is usually sensitive to outliers and does not provide
information about the dispersion of individual responses, the
expected reward has only limited value in achieving a clinical
trial’s objective. This is especially true if some outcomes are
very bad for the patients. Due to this issue, analysis of
variance approaches for studying the effectiveness of the
treatments were recently propose [Corbin-Berrigan et al., 2018].
In other studies like [Carandini, 2004], the variance itself is
the source of interest. Also, the consistency of treatment
effects among patients is essential, with the ideal treatment
usually defined as the one which has a high positive response rate
while showing low variability in response among patients.</p>
      <p>In this paper, we tackle the problem of designing bandit
algorithms that reflect both the mean and variability of the
arms, by extending one of the recent algorithms in the
bandit literature called BESA (Best Empirical Sampled
Average) [Baransi, Maillard, and Mannor, 2014]. One of the main
advantage of BESA compared to other existing bandit
algorithms is that it is a non-parametric learning algorithm. This
is especially useful when one does not have any prior
knowledge or has insufficient prior knowledge about the different
arms in the beginning. We establish regret bounds for the
proposed algorithm, and we show that this new algorithm is
superior to some of the past risk-averse learning algorithms like
MV-LCB and ExpExp [Sani, Lazaric, and Munos, 2012a] in
both simulated tasks as well as in some clinical trial tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Notation</title>
      <p>We consider the standard bandit setting with action (arm)
set A, where each action a 2 A is characterized by a reward
distribution ra bounded in the interval [0; 1]. The distribution
for action a has mean a and variance a2. Let Xa;i ra
denote the i-th reward sampled from the distribution of
action a. All actions and samples are independent. The bandit
problem is described as an iterative game where, on each step
(round) t, the player (an algorithm) selects action (arm) at and
observes sample Xa;Na;t , where Na;t = Pts=1 Ifas = ag
denotes the number of samples observed for action a up to
time t (inclusively). A policy is a distribution over A. In
general, stochastic distributions are necessary during the learning
stage, in order to identify the best arm. We discuss the exact
notion of “best” below.</p>
      <p>We define IS (m; j) as the set obtained by sub-sampling
without replacement j elements form the set S of size m.
Let Xa;t denote the history of observations (records) obtained
from action (arm) a up to time t (inclusively), such that
jXa;tj = Na;t. The notation Xa;t(I) indicates the set of
subsamples from Xa;t, where sub-sample I f1; 2; : : : ; Na;tg.</p>
      <p>The multi-armed bandit was first presented in the
seminal work of Robbins [Robbins, 1985]. It has been shown
that under certain conditions [Burnetas and Katehakis, 1996;
Lai and Robbins, 1985], a policy can have logarithmic
cumulative regret:
lim inf
t!1</p>
      <p>Rt
log(t)
&gt;</p>
      <p>X
?</p>
      <p>a</p>
      <p>Kinf (ra; r?)
a: a&lt; ?
where Kinf (ra; r?) is the Kullback-Leibler divergence
between the reward distributions of the respective arms. Policies
for which this bound holds are called admissible.</p>
      <p>Several algorithms have been shown to produce
admissible policies, including UCB1 [Auer, Cesa-Bianchi, and
Fischer, 2002], Thompson sampling [Chapelle and Li, 2011;
Agrawal and Goyal, 2013] and BESA [Baransi, Maillard, and
Mannor, 2014]. However, theoretical bounds are not always
matched by empirical results. For example, it has been shown
in [Kuleshov and Precup, 2014] that two algorithms which do
not produce admissible policies, "-greedy and Boltzmann
exploration [Sutton and Barto, 1998], behave better than UCB1
on certain problems. Both BESA and Thompson sampling
were shown to have comparable performance with Softmax
and "-greedy.</p>
      <p>While the expected regret is a natural and popular measure
of performance which allows the development of theoretical
results, recently, some papers have explored other definitions
for regret. For example, [Sani, Lazaric, and Munos, 2012b]
consider a linear combination of variance and mean as the
definition of regret for a learning algorithm A:</p>
      <p>MdV t(A) = bt2(A) bt(A);
uwphetoretimbteisstethpeteasntidmbatteisoaf bthiaeseadveersatgime aotfe oobfstehrevveadriraenwcaerdosf
rewards up to time step t. The regret is then defined as:</p>
      <p>Rt(A) = MdV t(A) MdV ?;t(A);
where ? is the optimal arm. According to [Maillard, 2013],
however, this definition is going to penalize the algorithm if it
switches between optimal arms. Instead, in [Maillard, 2013],
the authors devise a new definition of regret which controls
the lower tail of the reward distribution. However, the
algorithm to solve the corresponding objective function seems
time-consuming, and the optimization to be performed may
be intricate. Finally, in [Galichet, Sebag, and Teytaud, 2013],
the authors use the notion of conditional value at risk in order
to define the regret.</p>
    </sec>
    <sec id="sec-3">
      <title>Measure of regret</title>
      <p>In this section, we will present our definition of regret which
considers both the expected value and the variability of the
reward of arms, but unlike [Sani, Lazaric, and Munos, 2012b],
it does not penalize the algorithm if it switches between
optimal arms.</p>
      <p>Definition 0.1. Optimal arm (action): ? is an optimal arm if
it maximizes the trade-off between the expected outcome and
the variance:
? 2 arg max( a
a2A
a2);
for some fixed</p>
      <p>Definition 0.2. Consistency-aware regret: The
consistencyaware regret for a bandit algorithm B is defined as:
RTB ( ) =</p>
      <p>X ( ?
a2A
2
?
a +
a2)E[Na;T ]</p>
      <p>Note that when = 0, the consistency-aware regret
corresponds to the well-known expected regret.</p>
      <p>RTB =</p>
      <p>X ( ?
a2A
a)E[Na;T ]:
Note that when the context is clear, we will just use RT .</p>
      <p>It is clear that computing the consistency-aware regret is
not feasible in a real environment, as we do not have access
to the underlying distributions of arms. Hence, we define the
following empirical mean and variance which will be used to
estimate this regret in our algorithm:
Definition 0.3. Empirical mean and variance: For an
algorithm B, the empirical mean and empirical variance of arm a
up to time t is:
ba;t</p>
      <p>2
ba;t
=
=
1 Na;t</p>
      <p>X ra;i
Na;t i=1</p>
      <p>1
Na;t
1</p>
      <p>Na;t
X(ra;i
i=1
ba;t)
2
(1)
(2)
(3)
(4)
(5)
where ra;i is the ith reward obtained from pulling arm a.</p>
      <p>Note that unlike in [Sani, Lazaric, and Munos, 2012b], the
empirical estimation of the variance of an arm here is
unbiased. We will exploit this feature later in our proofs.</p>
      <p>For ease of notation, we define the value function for the
set of records of an arm as follows:
Definition 0.4. Consistency-aware value function: For a
given record set Xa;t(I) of an arm a up to time step t, the
corresponding value function is defined as:
vb(Xa;t(I)) = b(Xa;t(I))
b2(Xa;t(Ia;t)):
(6)</p>
      <p>In the next section, we are going to develop an algorithm
which optimizes the consistency-aware regret using the
quantities defined above in its estimation.</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Algorithm</title>
      <p>In order to optimize the consistency-aware regret, we build on
the BESA algorithm, which we will now briefly review. As
discussed in [Baransi, Maillard, and Mannor, 2014], BESA
is a non-parametric approach for finding the optimal arm
according to the expected mean regret criterion. Consider a
twoarmed bandit with actions a and ? ,where ? &gt; a, and
assume that Na;t &lt; N?;t at time step t. In order to select the
next arm for time step t + 1, BESA first sub-samples s? =
I?(N?;t; Na;t) from the observation history (records) of the
arm ? and similarly sub-sample sa = Ia(Na;t; Na;t) = Xa;t
from the records the arm a. If bsa &gt; bs? , BESA chooses arm
a, otherwise it chooses arm ?.</p>
      <p>The main reason behind the sub-sampling is that it gives
a similar opportunity to both arms. Consequently, the effect
of having a small sample size, which may cause bias in the
estimates, diminishes. When there are more than two arms,
BESA runs a tournament algorithm on the arms [Baransi,
Maillard, and Mannor, 2014].</p>
      <p>Finally, it is worth mentioning that the proof of the regret
bound of BESA uses a non-trivial lemma for which authors
did not provide any formal proof. In this paper, we will avoid
using this lemma to prove the soundness of our proposed
algorithm.</p>
      <p>We are now ready to outline our proposed approach, which
we call BESA+. As in [Baransi, Maillard, and Mannor, 2014],
we focus on the two-arm bandit. For more than two arms, a
tournament can be set up in our case as well.</p>
      <sec id="sec-4-1">
        <title>Algorithm BESA+ two action case</title>
        <p>Parameters: current time step t, actions a and b. Initially
Na;0 = 0; Nb;0 = 0
Shuffle the arms a and b with a function M to get a0; b0.
1: if Na0;t 1 = 0 _ Na0;t 1 &lt; log(t) then
2: at = a0
3: else if Nb0;t 1 = 0 _ Nb0;t 1 &lt; log(t) then
4: at = b0</p>
      </sec>
      <sec id="sec-4-2">
        <title>5: else</title>
        <p>6:
7:
8:
9:
nt 1 = minfNa0;t 1; Nb0;t 1g
Ia0;t 1 Ia0 (Na0;t 1; nt 1)
Ib0;t 1 Ib0 (Nb0;t 1; nt 1)
Calculate v~a0;t = vb(Xa0;t 1(Ia0;t 1)) and v~b0;t =
vb(Xb0;t 1(Ib0;t 1))
10: at = arg maxi2fa0;b0g v~i;t (break ties by choosing arm
with fewer tries)
11: end if
12: return M 1(at)</p>
        <p>The first major difference between BESA+ and BESA is
the use of the consistency-aware value function instead of the
simple regret. A second important change is that BESA+
selects the arm which has been tried less up to time step t if
the arm has been chosen less than log(t) times up to t.
Essentially, this change in the algorithm is negligible in terms
of establishing the total expected regret, as we cannot achieve
any better bound than log(T ), as shown in Robbins’ lemma
[Lai and Robbins, 1985]. This tweak also turns out to be vital
in proving that the expected regret of the BESA+ algorithm
is bounded by log(T ) (a result which we present shortly).</p>
        <p>To better understand why this modification is necessary,
consider a two arms scenario. The first arm gives a
deterministic reward of r 2 [0; 0:5) and the second arm has a uniform
distribution in the interval [0,1] with the expected reward of
0.5. If we are only interested in the expected reward ( = 0),
the algorithm should ultimately favor the second arm. On the
other hand, there exists a probability of r that the BESA
algorithm is going to constantly choose the first arm if the
second arm gives a value less than r on its first pull. In contrast,
BESA+ evades this problem by letting the second arm be
selected enough times such that it eventually becomes
distinguishable from the first arm.</p>
        <p>We are now ready to state the main theoretical result of our
proposed algorithm.</p>
        <p>Theorem 0.1. Let A = fa; ?g be a two-armed bandit with
bounded rewards 2 [0; 1], and the value gap = v? va.
Given the value , the expected consistency-aware regret of
the Algorithm BESA+ up to time T is upper bounded as
follows:</p>
        <p>RT = C ; + O(log(T ))
(7)
where in (7), C ; is a constant which is dependent on the
value of ; .</p>
        <p>Interested reader can visit here to see the full proof.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Empirical results</title>
      <sec id="sec-5-1">
        <title>Empirical comparison of BESA and BESA+</title>
        <p>As discussed in the previous section, BESA+ has some
advantages over BESA. We illustrate the example we discussed
in the previous section through the results in Figures 1-6, for
r 2 f0:2; 0:3; 0:4g. Each experiment has been repeated 200
times. Note that while BESA has an almost a linear regret
behavior, BESA+ can learn the optimal arm within the given
the time horizon and its expected accumulated regret is
upper bounded by a log function. It is also easy to notice that
BESA+ has a faster convergence rate compared with BESA.
As r gets closer to 0:5, the problem becomes harder. This
phenomenon is a direct illustration of our theoretical result.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Statistical dispersion estimate via sub-sampling without replacement</title>
        <p>In this subsection, we are going to explore the effect of
sample size and sub-sample size on the consistency-aware value
function error of BESA+. We have studied different
distributions to find out their effect on consistency-aware value
function error as well. Average results and standard error bars are
computed over 200 independent experiments in all graphs.</p>
        <p>Based on our experiments, changing the value in the
consistency-aware value function definition does not have
much impact on the convergence rate. Moreover, as one
would expect, the distribution type does not have a noticeable
influence on the convergence rate either, although some small
differences can be observed. Due to the space limit, we only
included two figures (figures 7, 8) to illustrate our claim. As
it can be seen in both figures, as we increase the sub-sample
size, the expected error decreases significantly. It is also
in0
2000
8000</p>
        <p>10000
4000 steps 6000
0
2000
8000</p>
        <p>10000
4000 steps 6000
teresting to note that the expected error is almost independent
of the sample size given the sub-sample size.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Algorithm BESA+ performance</title>
        <p>We also evaluated the performance of BESA+ with
consistency-aware regret in a simulated environment. Our
work here is similar to the work [Sani, Lazaric, and Munos,
2012b] in which the authors depicted the performance of
MVLCB and ExpExp with a synthetic environment. Here, we
considered a two-arm environment with arm 1 having mean 1
and variance in the range [0:1; 1] and arm 2 having the mean
in the range [0:1; 1] and variance 1. It is clear that under any
positive value of , arm 1 should be preferred over arm 2. We
have examined different values of and studied their
corresponding effects on the expected consistency-aware regret of
the Algorithm BESA+. (figures 9, 10, 11). In the figures, n
stands for time step. These figures uncover three important
aspects of BESA+. First, we can observe the kind of
problems which are difficult for BESA+. It appears that as the
difference between the mean or variance of two arms shrinks,
BESA+ usually suffers a higher amount of regret. This fact
can also be inferred from Theorem 0.1. Second, we can see
the importance of value in diminishing the effect of
variance or mean. In figure 9, where = 1, we can observe a
bump near the squares where both mean and variance gaps
are small. In figure 10, when = 10, the effect of the mean
gap almost vanishes and we see that as we go from figure 9
to figure 10, the regret graph orients itself toward the smaller
variance gap. The same thing happens as we go from figure
10 to 11. In this regard, in figure 11 (when = 0:1), the
0.0 0.2</p>
        <p>0.4
variance gap0.6 0.8
0.0 0.2</p>
        <p>0.4
variance gap0.6 0.8
0.8</p>
        <p>0.2
0.6 0m.e4angap
0.8</p>
        <p>0.2
0.6 0m.e4angap
regret graph has oriented itself toward the smaller mean gap
size. Finally, these figures depict the speed of convergence
of BESA+ Algorithm which seems faster than MV-LCB and
ExpExp.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Real Clinical Trial Dataset</title>
        <p>Finally, we examined the performance of BESA+ against
other methods (BESA, UCB1 , Thompson sampling,
MVLCB, and ExpExp) based on a real clinical dataset. This
dataset includes the survival times of patients who were
suffering from lung cancer [Ripley et al., 2013]. Two different
kinds of treatments (standard treatment and test treatment)
were applied to them and the results are based on the number
of days the patient survived after receiving one of the
treatments. For the purpose of illustration and simplicity, we
assumed non-informative censoring and equal follow-up times
in both treatment groups. As the experiment has already been
conducted, to apply bandit algorithms, each time a treatment
is selected by a bandit algorithm, we sampled uniformly from
the recorded results of the patients whom received that
selected treatment and used the survival time as the reward
signal. Figure 12 shows the distribution of treatment 1 and 2. We
categorized the survival time into ten categories (category 1
showing the minimum survival time). It is interesting to
notice that while treatment 2 has a higher mean than treatment
1 due to the effect of outliers, it has a higher level of variance
compared to treatment 1. From figure 12 it is easy to deduce
that treatment 1 has a more consistent behavior than treatment
2 and a higher number of patients who received treatment 2
died early. That is why treatment 1 may be preferred over
treatment 2 if we use the consistency-aware regret. In this
regard, by setting = 1, treatment 1 has less expected
meanvariance regret than treatment 2, and it should be ultimately
favored by the learning algorithm. Figure 13 illustrates the
performance of different bandit algorithms. It is easy to
notice that BESA+ has relatively better performance than all the
other ones.
0.0 0.2 0.4
variance gap0.6 0.8
0.8
0.16 rt</p>
        <p>e
0000000....0.0.0.01112468024 onsistency-awarereg
0.00 c
0.0
0.0 0.2 0.4
variance gap0.6 0.8
0.8</p>
        <p>0.2
0.6 0m.e4angap</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>In this paper, we developed a new definition of regret (called
consistency-aware regret) which is sensitive to the variability
in rewards of the different arms by considering the variance of
the rewards. We extended and modified the BESA algorithm
to optimize consistency-aware regret and provided a bound
on its performance. Finally, we illustrated the utility of our
proposed algorithm on a real clinical dataset and studied its
behaviour on some synthetic datasets.</p>
      <p>We believe still there exist a noticeable gap between
clinical trial problems, which inspired our work, and the nature
of multi-armed bandit problems. Considering other ways to
incorporate reward variability and providing some bounds on
the confidence interval of the arm chosen by a bandit
learning algorithm are promising for the future studies. It is also
interesting to extend other bandit algorithms like Thompson
sampling to consistency-aware regret and study their
properties. Finally, utilizing BESA+ in the acquisition of real data
would be an important future validation step.</p>
      <p>ert
0.030 reg
0.025 re
0.020 y-awa
0.015 c</p>
      <p>n
00..000150 onsit
e
s
0.000 c
0.0
0.008 rt</p>
      <p>e
0.007 ereg
0.006 r
0000....000000002345 nsistency-awa
0.001 o
0.000 c
0.0</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>We would like to thank Audrey Durand for her comments and
insight on this project. We also thank department of family
medicine of McGill University and CIHR for their generous
support during this project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Agrawal and Goyal</source>
          , 2013] Agrawal,
          <string-name>
            <given-names>S.</given-names>
            , and
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>Further optimal regret bounds for thompson sampling</article-title>
          .
          <source>In Artificial Intelligence and Statistics</source>
          ,
          <volume>99</volume>
          -
          <fpage>107</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Auer,
          <string-name>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , and
          <string-name>
            <surname>Fischer</surname>
            , 2002] Auer,
            <given-names>P.</given-names>
          </string-name>
          ; CesaBianchi, N.; and
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Finite-time analysis of the multiarmed bandit problem</article-title>
          .
          <source>Machine learning 47(2-3)</source>
          :
          <fpage>235</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Austin, 2011] Austin,
          <string-name>
            <surname>P. C.</surname>
          </string-name>
          <year>2011</year>
          .
          <article-title>An introduction to propensity score methods for reducing the effects of confounding in observational studies</article-title>
          .
          <source>Multivariate behavioral research 46</source>
          <volume>(3)</volume>
          :
          <fpage>399</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Baransi, Maillard, and
          <string-name>
            <surname>Mannor</surname>
            , 2014] Baransi,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maillard</surname>
            ,
            <given-names>O.-A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mannor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Sub-sampling for multiarmed bandits</article-title>
          .
          <source>In ECML-KDD</source>
          ,
          <fpage>115</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Burnetas and Katehakis</source>
          , 1996] Burnetas,
          <string-name>
            <given-names>A. N.</given-names>
            , and
            <surname>Katehakis</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. N.</surname>
          </string-name>
          <year>1996</year>
          .
          <article-title>Optimal adaptive policies for sequential allocation problems</article-title>
          .
          <source>Advances in Applied Mathematics</source>
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>122</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Carandini</source>
          , 2004] Carandini,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2004</year>
          .
          <article-title>Amplification of trial-to-trial response variability by neurons in visual cortex</article-title>
          .
          <source>PLoS biology 2</source>
          (
          <issue>9</issue>
          ):
          <fpage>e264</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Chapelle and Li</source>
          , 2011] Chapelle,
          <string-name>
            <given-names>O.</given-names>
            , and
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <year>2011</year>
          .
          <article-title>An empirical evaluation of thompson sampling</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>2249</volume>
          -
          <fpage>2257</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [
          <string-name>
            <surname>Corbin-Berrigan</surname>
          </string-name>
          et al.,
          <year>2018</year>
          ]
          <article-title>Corbin-</article-title>
          <string-name>
            <surname>Berrigan</surname>
          </string-name>
          , L.-A.;
          <string-name>
            <surname>Kowalski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Faubert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Christie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gagnon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Three-dimensional multiple object tracking in the pediatric population: the neurotracker and its promising role in the management of mild traumatic brain injury</article-title>
          .
          <source>NeuroReport</source>
          <volume>29</volume>
          (
          <issue>7</issue>
          ):
          <fpage>559</fpage>
          -
          <lpage>563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Galichet, Sebag, and
          <string-name>
            <surname>Teytaud</surname>
            , 2013] Galichet,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Teytaud</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Exploration vs exploitation vs safety: Risk-aware multi-armed bandits</article-title>
          .
          <source>In Asian Conference on Machine Learning</source>
          ,
          <fpage>245</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Garcıa and Ferna´ndez, 2015] Garcıa,
          <string-name>
            <surname>J.</surname>
          </string-name>
          , and Ferna´ndez,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Kuleshov and Precup</source>
          , 2014] Kuleshov,
          <string-name>
            <given-names>V.</given-names>
            , and
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2014</year>
          .
          <article-title>Algorithms for multi-armed bandit problems</article-title>
          .
          <source>arXiv preprint arXiv:1402</source>
          .
          <fpage>6028</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Lai and Robbins</source>
          , 1985] Lai,
          <string-name>
            <given-names>T. L.</given-names>
            , and
            <surname>Robbins</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <year>1985</year>
          .
          <article-title>Asymptotically efficient adaptive allocation rules</article-title>
          .
          <source>Advances in applied mathematics 6</source>
          (
          <issue>1</issue>
          ):
          <fpage>4</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[Maillard</source>
          , 2013] Maillard,
          <string-name>
            <surname>O.-A.</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>Robust risk-averse stochastic multi-armed bandits</article-title>
          .
          <source>In ICML</source>
          ,
          <fpage>218</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Ripley et al.,
          <year>2013</year>
          ] Ripley,
          <string-name>
            <given-names>B.</given-names>
            ;
            <surname>Venables</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ;
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            ;
            <surname>Hornik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Gebhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Firth</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; and Ripley,
          <string-name>
            <surname>M. B.</surname>
          </string-name>
          <year>2013</year>
          . Package mass. Cran R.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>[Robbins</source>
          , 1985] Robbins,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <year>1985</year>
          .
          <article-title>Some aspects of the sequential design of experiments</article-title>
          .
          <source>In Herbert Robbins Selected Papers</source>
          . Springer.
          <fpage>169</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Sani, Lazaric, and
          <string-name>
            <surname>Munos</surname>
          </string-name>
          , 2012a]
          <string-name>
            <surname>Sani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lazaric</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Munos,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2012a</year>
          .
          <article-title>Risk-aversion in multi-armed bandits</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>3275</volume>
          -
          <fpage>3283</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Sani, Lazaric, and
          <string-name>
            <surname>Munos</surname>
          </string-name>
          , 2012b]
          <string-name>
            <surname>Sani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lazaric</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Munos,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2012b</year>
          .
          <article-title>Risk-aversion in multi-armed bandits</article-title>
          .
          <source>In NIPS</source>
          ,
          <fpage>3275</fpage>
          -
          <lpage>3283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Sutton and Barto</source>
          , 1998] Sutton,
          <string-name>
            <given-names>R. S.</given-names>
            , and
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G.</surname>
          </string-name>
          <year>1998</year>
          .
          <article-title>Reinforcement learning: An introduction</article-title>
          , volume
          <volume>1</volume>
          . MIT press Cambridge.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>