Introduction

Learning Reliable Policies in the Bandit Setting with Application to Adaptive Clinical Trials

0 Department of Computer Science, McGill University. Mila Quebec AI Institute 1 Department of Family Medicine, McGill University

43 49

The stochastic multi-armed bandit problem is a well-known model for studying the explorationexploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for a dynamic change of patient allocation ratios. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data especially when the sample size is small. In this article, we propose a modification of the BESA algorithm [Baransi, Maillard, and Mannor, 2014] which takes into account the variance in the action outcomes in addition to the mean. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset form the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.

Introduction

The multi-armed bandit is a standard model for researchers to investigate the exploration-exploitation trade-off, see e.g [Baransi, Maillard, and Mannor, 2014; Auer, CesaBianchi, and Fischer, 2002; Sani, Lazaric, and Munos, 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998] . Unlike fully sequential decision-making problems, multi-armed bandit problems are simple enough to allow for theoretical studies.

The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated to it. The standard goal in this setting is to find the arm ? which has the maximum expected reward ? (or equivalently, minimum expected regret). The expected regret RT is defined as the sum of the expected difference between the mean reward of the chosen arm at and the optimal arm until t = T :

RT = E " T

X( ? t=1 at ) #

While this objective is very popular, there are practical applications, for example in medical research and AI safety [Garcıa and Ferna´ndez, 2015] where maximizing expected value is not sufficient, and it would be better to have an algorithm sensitive also to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients’ treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, as the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial’s objective. This is especially true if some outcomes are very bad for the patients. Due to this issue, analysis of variance approaches for studying the effectiveness of the treatments were recently propose [Corbin-Berrigan et al., 2018]. In other studies like [Carandini, 2004], the variance itself is the source of interest. Also, the consistency of treatment effects among patients is essential, with the ideal treatment usually defined as the one which has a high positive response rate while showing low variability in response among patients.

In this paper, we tackle the problem of designing bandit algorithms that reflect both the mean and variability of the arms, by extending one of the recent algorithms in the bandit literature called BESA (Best Empirical Sampled Average) [Baransi, Maillard, and Mannor, 2014]. One of the main advantage of BESA compared to other existing bandit algorithms is that it is a non-parametric learning algorithm. This is especially useful when one does not have any prior knowledge or has insufficient prior knowledge about the different arms in the beginning. We establish regret bounds for the proposed algorithm, and we show that this new algorithm is superior to some of the past risk-averse learning algorithms like MV-LCB and ExpExp [Sani, Lazaric, and Munos, 2012a] in both simulated tasks as well as in some clinical trial tasks.

Background and Notation

We consider the standard bandit setting with action (arm) set A, where each action a 2 A is characterized by a reward distribution ra bounded in the interval [0; 1]. The distribution for action a has mean a and variance a2. Let Xa;i ra denote the i-th reward sampled from the distribution of action a. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) t, the player (an algorithm) selects action (arm) at and observes sample Xa;Na;t , where Na;t = Pts=1 Ifas = ag denotes the number of samples observed for action a up to time t (inclusively). A policy is a distribution over A. In general, stochastic distributions are necessary during the learning stage, in order to identify the best arm. We discuss the exact notion of “best” below.

We define IS (m; j) as the set obtained by sub-sampling without replacement j elements form the set S of size m. Let Xa;t denote the history of observations (records) obtained from action (arm) a up to time t (inclusively), such that jXa;tj = Na;t. The notation Xa;t(I) indicates the set of subsamples from Xa;t, where sub-sample I f1; 2; : : : ; Na;tg.

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], a policy can have logarithmic cumulative regret: lim inf t!1

Rt log(t) >

X ?

Kinf (ra; r?) a: a< ? where Kinf (ra; r?) is the Kullback-Leibler divergence between the reward distributions of the respective arms. Policies for which this bound holds are called admissible.

Several algorithms have been shown to produce admissible policies, including UCB1 [Auer, Cesa-Bianchi, and Fischer, 2002], Thompson sampling [Chapelle and Li, 2011; Agrawal and Goyal, 2013] and BESA [Baransi, Maillard, and Mannor, 2014]. However, theoretical bounds are not always matched by empirical results. For example, it has been shown in [Kuleshov and Precup, 2014] that two algorithms which do not produce admissible policies, "-greedy and Boltzmann exploration [Sutton and Barto, 1998], behave better than UCB1 on certain problems. Both BESA and Thompson sampling were shown to have comparable performance with Softmax and "-greedy.

While the expected regret is a natural and popular measure of performance which allows the development of theoretical results, recently, some papers have explored other definitions for regret. For example, [Sani, Lazaric, and Munos, 2012b] consider a linear combination of variance and mean as the definition of regret for a learning algorithm A:

MdV t(A) = bt2(A) bt(A); uwphetoretimbteisstethpeteasntidmbatteisoaf bthiaeseadveersatgime aotfe oobfstehrevveadriraenwcaerdosf rewards up to time step t. The regret is then defined as:

Rt(A) = MdV t(A) MdV ?;t(A); where ? is the optimal arm. According to [Maillard, 2013], however, this definition is going to penalize the algorithm if it switches between optimal arms. Instead, in [Maillard, 2013], the authors devise a new definition of regret which controls the lower tail of the reward distribution. However, the algorithm to solve the corresponding objective function seems time-consuming, and the optimization to be performed may be intricate. Finally, in [Galichet, Sebag, and Teytaud, 2013], the authors use the notion of conditional value at risk in order to define the regret.

Measure of regret

In this section, we will present our definition of regret which considers both the expected value and the variability of the reward of arms, but unlike [Sani, Lazaric, and Munos, 2012b], it does not penalize the algorithm if it switches between optimal arms.

Definition 0.1. Optimal arm (action): ? is an optimal arm if it maximizes the trade-off between the expected outcome and the variance: ? 2 arg max( a a2A a2); for some fixed

Definition 0.2. Consistency-aware regret: The consistencyaware regret for a bandit algorithm B is defined as: RTB ( ) =

X ( ? a2A 2 ? a + a2)E[Na;T ]

Note that when = 0, the consistency-aware regret corresponds to the well-known expected regret.

RTB =

X ( ? a2A a)E[Na;T ]: Note that when the context is clear, we will just use RT .

It is clear that computing the consistency-aware regret is not feasible in a real environment, as we do not have access to the underlying distributions of arms. Hence, we define the following empirical mean and variance which will be used to estimate this regret in our algorithm: Definition 0.3. Empirical mean and variance: For an algorithm B, the empirical mean and empirical variance of arm a up to time t is: ba;t

2 ba;t = = 1 Na;t

X ra;i Na;t i=1

1 Na;t 1

Na;t X(ra;i i=1 ba;t) 2 (1) (2) (3) (4) (5) where ra;i is the ith reward obtained from pulling arm a.

Note that unlike in [Sani, Lazaric, and Munos, 2012b], the empirical estimation of the variance of an arm here is unbiased. We will exploit this feature later in our proofs.

For ease of notation, we define the value function for the set of records of an arm as follows: Definition 0.4. Consistency-aware value function: For a given record set Xa;t(I) of an arm a up to time step t, the corresponding value function is defined as: vb(Xa;t(I)) = b(Xa;t(I)) b2(Xa;t(Ia;t)): (6)

In the next section, we are going to develop an algorithm which optimizes the consistency-aware regret using the quantities defined above in its estimation.

Proposed Algorithm

In order to optimize the consistency-aware regret, we build on the BESA algorithm, which we will now briefly review. As discussed in [Baransi, Maillard, and Mannor, 2014], BESA is a non-parametric approach for finding the optimal arm according to the expected mean regret criterion. Consider a twoarmed bandit with actions a and ? ,where ? > a, and assume that Na;t < N?;t at time step t. In order to select the next arm for time step t + 1, BESA first sub-samples s? = I?(N?;t; Na;t) from the observation history (records) of the arm ? and similarly sub-sample sa = Ia(Na;t; Na;t) = Xa;t from the records the arm a. If bsa > bs? , BESA chooses arm a, otherwise it chooses arm ?.

The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may cause bias in the estimates, diminishes. When there are more than two arms, BESA runs a tournament algorithm on the arms [Baransi, Maillard, and Mannor, 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which authors did not provide any formal proof. In this paper, we will avoid using this lemma to prove the soundness of our proposed algorithm.

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi, Maillard, and Mannor, 2014], we focus on the two-arm bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ two action case

Parameters: current time step t, actions a and b. Initially Na;0 = 0; Nb;0 = 0 Shuffle the arms a and b with a function M to get a0; b0. 1: if Na0;t 1 = 0 _ Na0;t 1 < log(t) then 2: at = a0 3: else if Nb0;t 1 = 0 _ Nb0;t 1 < log(t) then 4: at = b0

5: else

6: 7: 8: 9: nt 1 = minfNa0;t 1; Nb0;t 1g Ia0;t 1 Ia0 (Na0;t 1; nt 1) Ib0;t 1 Ib0 (Nb0;t 1; nt 1) Calculate v~a0;t = vb(Xa0;t 1(Ia0;t 1)) and v~b0;t = vb(Xb0;t 1(Ib0;t 1)) 10: at = arg maxi2fa0;b0g v~i;t (break ties by choosing arm with fewer tries) 11: end if 12: return M 1(at)

The first major difference between BESA+ and BESA is the use of the consistency-aware value function instead of the simple regret. A second important change is that BESA+ selects the arm which has been tried less up to time step t if the arm has been chosen less than log(t) times up to t. Essentially, this change in the algorithm is negligible in terms of establishing the total expected regret, as we cannot achieve any better bound than log(T ), as shown in Robbins’ lemma [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by log(T ) (a result which we present shortly).

To better understand why this modification is necessary, consider a two arms scenario. The first arm gives a deterministic reward of r 2 [0; 0:5) and the second arm has a uniform distribution in the interval [0,1] with the expected reward of 0.5. If we are only interested in the expected reward ( = 0), the algorithm should ultimately favor the second arm. On the other hand, there exists a probability of r that the BESA algorithm is going to constantly choose the first arm if the second arm gives a value less than r on its first pull. In contrast, BESA+ evades this problem by letting the second arm be selected enough times such that it eventually becomes distinguishable from the first arm.

We are now ready to state the main theoretical result of our proposed algorithm.

Theorem 0.1. Let A = fa; ?g be a two-armed bandit with bounded rewards 2 [0; 1], and the value gap = v? va. Given the value , the expected consistency-aware regret of the Algorithm BESA+ up to time T is upper bounded as follows:

RT = C ; + O(log(T )) (7) where in (7), C ; is a constant which is dependent on the value of ; .

Interested reader can visit here to see the full proof.

Empirical results Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example we discussed in the previous section through the results in Figures 1-6, for r 2 f0:2; 0:3; 0:4g. Each experiment has been repeated 200 times. Note that while BESA has an almost a linear regret behavior, BESA+ can learn the optimal arm within the given the time horizon and its expected accumulated regret is upper bounded by a log function. It is also easy to notice that BESA+ has a faster convergence rate compared with BESA. As r gets closer to 0:5, the problem becomes harder. This phenomenon is a direct illustration of our theoretical result.

Statistical dispersion estimate via sub-sampling without replacement

In this subsection, we are going to explore the effect of sample size and sub-sample size on the consistency-aware value function error of BESA+. We have studied different distributions to find out their effect on consistency-aware value function error as well. Average results and standard error bars are computed over 200 independent experiments in all graphs.

Based on our experiments, changing the value in the consistency-aware value function definition does not have much impact on the convergence rate. Moreover, as one would expect, the distribution type does not have a noticeable influence on the convergence rate either, although some small differences can be observed. Due to the space limit, we only included two figures (figures 7, 8) to illustrate our claim. As it can be seen in both figures, as we increase the sub-sample size, the expected error decreases significantly. It is also in0 2000 8000

10000 4000 steps 6000 0 2000 8000

10000 4000 steps 6000 teresting to note that the expected error is almost independent of the sample size given the sub-sample size.

Algorithm BESA+ performance

We also evaluated the performance of BESA+ with consistency-aware regret in a simulated environment. Our work here is similar to the work [Sani, Lazaric, and Munos, 2012b] in which the authors depicted the performance of MVLCB and ExpExp with a synthetic environment. Here, we considered a two-arm environment with arm 1 having mean 1 and variance in the range [0:1; 1] and arm 2 having the mean in the range [0:1; 1] and variance 1. It is clear that under any positive value of , arm 1 should be preferred over arm 2. We have examined different values of and studied their corresponding effects on the expected consistency-aware regret of the Algorithm BESA+. (figures 9, 10, 11). In the figures, n stands for time step. These figures uncover three important aspects of BESA+. First, we can observe the kind of problems which are difficult for BESA+. It appears that as the difference between the mean or variance of two arms shrinks, BESA+ usually suffers a higher amount of regret. This fact can also be inferred from Theorem 0.1. Second, we can see the importance of value in diminishing the effect of variance or mean. In figure 9, where = 1, we can observe a bump near the squares where both mean and variance gaps are small. In figure 10, when = 10, the effect of the mean gap almost vanishes and we see that as we go from figure 9 to figure 10, the regret graph orients itself toward the smaller variance gap. The same thing happens as we go from figure 10 to 11. In this regard, in figure 11 (when = 0:1), the 0.0 0.2

0.4 variance gap0.6 0.8 0.0 0.2

0.4 variance gap0.6 0.8 0.8

0.2 0.6 0m.e4angap 0.8

0.2 0.6 0m.e4angap regret graph has oriented itself toward the smaller mean gap size. Finally, these figures depict the speed of convergence of BESA+ Algorithm which seems faster than MV-LCB and ExpExp.

Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1 , Thompson sampling, MVLCB, and ExpExp) based on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different kinds of treatments (standard treatment and test treatment) were applied to them and the results are based on the number of days the patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the experiment has already been conducted, to apply bandit algorithms, each time a treatment is selected by a bandit algorithm, we sampled uniformly from the recorded results of the patients whom received that selected treatment and used the survival time as the reward signal. Figure 12 shows the distribution of treatment 1 and 2. We categorized the survival time into ten categories (category 1 showing the minimum survival time). It is interesting to notice that while treatment 2 has a higher mean than treatment 1 due to the effect of outliers, it has a higher level of variance compared to treatment 1. From figure 12 it is easy to deduce that treatment 1 has a more consistent behavior than treatment 2 and a higher number of patients who received treatment 2 died early. That is why treatment 1 may be preferred over treatment 2 if we use the consistency-aware regret. In this regard, by setting = 1, treatment 1 has less expected meanvariance regret than treatment 2, and it should be ultimately favored by the learning algorithm. Figure 13 illustrates the performance of different bandit algorithms. It is easy to notice that BESA+ has relatively better performance than all the other ones. 0.0 0.2 0.4 variance gap0.6 0.8 0.8 0.16 rt

e 0000000....0.0.0.01112468024 onsistency-awarereg 0.00 c 0.0 0.0 0.2 0.4 variance gap0.6 0.8 0.8

0.2 0.6 0m.e4angap

Conclusion and future work

In this paper, we developed a new definition of regret (called consistency-aware regret) which is sensitive to the variability in rewards of the different arms by considering the variance of the rewards. We extended and modified the BESA algorithm to optimize consistency-aware regret and provided a bound on its performance. Finally, we illustrated the utility of our proposed algorithm on a real clinical dataset and studied its behaviour on some synthetic datasets.

We believe still there exist a noticeable gap between clinical trial problems, which inspired our work, and the nature of multi-armed bandit problems. Considering other ways to incorporate reward variability and providing some bounds on the confidence interval of the arm chosen by a bandit learning algorithm are promising for the future studies. It is also interesting to extend other bandit algorithms like Thompson sampling to consistency-aware regret and study their properties. Finally, utilizing BESA+ in the acquisition of real data would be an important future validation step.

ert 0.030 reg 0.025 re 0.020 y-awa 0.015 c

n 00..000150 onsit e s 0.000 c 0.0 0.008 rt

e 0.007 ereg 0.006 r 0000....000000002345 nsistency-awa 0.001 o 0.000 c 0.0

Acknowledgment

We would like to thank Audrey Durand for her comments and insight on this project. We also thank department of family medicine of McGill University and CIHR for their generous support during this project.

[Agrawal and Goyal , 2013] Agrawal, S. , and Goyal , N. 2013 . Further optimal regret bounds for thompson sampling . In Artificial Intelligence and Statistics , 99 - 107 .

[Auer, Cesa-Bianchi , and Fischer , 2002] Auer, P. ; CesaBianchi, N.; and Fischer , P. 2002 . Finite-time analysis of the multiarmed bandit problem . Machine learning 47(2-3) : 235 - 256 .

[Austin, 2011] Austin, P. C. 2011 . An introduction to propensity score methods for reducing the effects of confounding in observational studies . Multivariate behavioral research 46 (3) : 399 - 424 .

[Baransi, Maillard, and Mannor , 2014] Baransi, A. ; Maillard , O.-A. ; and Mannor , S. 2014 . Sub-sampling for multiarmed bandits . In ECML-KDD , 115 - 131 .

[Burnetas and Katehakis , 1996] Burnetas, A. N. , and Katehakis , M. N. 1996 . Optimal adaptive policies for sequential allocation problems . Advances in Applied Mathematics 17 ( 2 ): 122 - 142 .

[Carandini , 2004] Carandini, M. 2004 . Amplification of trial-to-trial response variability by neurons in visual cortex . PLoS biology 2 ( 9 ): e264 .

[Chapelle and Li , 2011] Chapelle, O. , and Li , L. 2011 . An empirical evaluation of thompson sampling . In Advances in neural information processing systems , 2249 - 2257 .

[ Corbin-Berrigan et al., 2018 ] Corbin- Berrigan , L.-A.; Kowalski , K. ; Faubert , J. ; Christie , B. ; and Gagnon , I. 2018 . Three-dimensional multiple object tracking in the pediatric population: the neurotracker and its promising role in the management of mild traumatic brain injury . NeuroReport 29 ( 7 ): 559 - 563 .

[Galichet, Sebag, and Teytaud , 2013] Galichet, N. ; Sebag , M. ; and Teytaud , O. 2013 . Exploration vs exploitation vs safety: Risk-aware multi-armed bandits . In Asian Conference on Machine Learning , 245 - 260 .

[Garcıa and Ferna´ndez, 2015] Garcıa, J. , and Ferna´ndez, F. 2015 . A comprehensive survey on safe reinforcement learning . Journal of Machine Learning Research 16 ( 1 ): 1437 - 1480 .

[Kuleshov and Precup , 2014] Kuleshov, V. , and Precup , D. 2014 . Algorithms for multi-armed bandit problems . arXiv preprint arXiv:1402 . 6028 .

[Lai and Robbins , 1985] Lai, T. L. , and Robbins , H. 1985 . Asymptotically efficient adaptive allocation rules . Advances in applied mathematics 6 ( 1 ): 4 - 22 .

[Maillard , 2013] Maillard, O.-A. 2013 . Robust risk-averse stochastic multi-armed bandits . In ICML , 218 - 233 .

[Ripley et al., 2013 ] Ripley, B. ; Venables , B. ; Bates , D. M. ; Hornik , K. ; Gebhardt , A. ; Firth , D. ; and Ripley, M. B. 2013 . Package mass. Cran R.

[Robbins , 1985] Robbins, H. 1985 . Some aspects of the sequential design of experiments . In Herbert Robbins Selected Papers . Springer. 169 - 177 .

[Sani, Lazaric, and Munos , 2012a] Sani , A. ; Lazaric , A. ; and Munos, R. 2012a . Risk-aversion in multi-armed bandits . In Advances in Neural Information Processing Systems , 3275 - 3283 .

[Sani, Lazaric, and Munos , 2012b] Sani , A. ; Lazaric , A. ; and Munos, R. 2012b . Risk-aversion in multi-armed bandits . In NIPS , 3275 - 3283 .

[Sutton and Barto , 1998] Sutton, R. S. , and Barto , A. G. 1998 . Reinforcement learning: An introduction , volume 1 . MIT press Cambridge.