Adaptive Budgeted Bandit Algorithms for Trust in a Supply-Chain Setting

Sandip Sen, Department of Computer Science, The University of Tulsa, sandip@utulsa.edu
Anton Ridgway, Department of Computer Science, The University of Tulsa, anton-ridgway@utulsa.edu
Michael Ripley, Department of Computer Science, The University of Tulsa, michael-ripley@utulsa.edu

Abstract

Recently, an AAMAS Challenges & Visions paper identified several key components of a comprehensive trust management approach that have been understudied by the research community [SEN13]. We believe that we can build on recent advances in closely related research in other sub-fields of AI and multiagent systems to address some of these issues. For example, the budgeted multi-armed bandit problem involves pulling multiple arms with stochastic rewards with the goal of maximizing the total reward generated from those arms, while keeping the cost of pulling the arms beneath a given budget. We argue that multi-armed bandit algorithms can be adapted to address research issues in the trust engagement and evaluation components of a comprehensive trust management approach. To support this proposition, we consider a supply-chain application, where a tree of dependent supplier agents can be considered as an arm of the online bandit problem with budget constraints. Each node in the supply chain must then solve its local bandit problem in parallel to determine which of its sub-suppliers is most trustworthy. We use new arm-selection strategies and demonstrate how they can be gainfully applied to trust-based decision making in the supply chain to reduce time to production, and hence improve utility through timely delivery of products.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: R. Cohen, R. Falcone and T. J. Norman (eds.): Proceedings of the 17th International Workshop on Trust in Agent Societies, Paris, France, 05-MAY-2014, published at http://ceur-ws.org

1 Introduction

Research on trust in multiagent systems has given us a variety of conceptual frameworks for viewing trust, as well as effective algorithms for evaluating the trustworthiness of other agents given the history of mutual interactions [CAS98, GAM90, HUY06, SEN02, VOG10, YU02, YOL05]. However, it has been posited that to fully harness the effectiveness of trust-based decision making, it is critical to develop a more comprehensive approach to trust management that addresses not only trust evaluation but also provides strategic reasoning procedures that determine whom to interact with, in what context, and how to best utilize the resultant knowledge about the trustworthiness of others [SEN13]. The goal then is to develop proactive trust mechanisms that explore possible fruitful partnerships, create situations where interactions will provide discriminatory evidence of the trustworthiness of potential partners, and use a well-thought-out plan of how to best utilize the trusted partners given expectations of future goals, plans and resource requirements. In this paper, we consider the problem of proactive engagement of potential long-term partners to determine who can be trusted to provide consistent service on which an agent's goal is critically dependent.
Whereas most of the research on trust in multiagent systems focuses on after-the-event, offline evaluation of interaction histories to determine the trustworthiness of partners, real-world scenarios demand forward-looking, online schemes that must choose which interaction partners to engage, where the process of engagement itself incurs a cost. This "exploration cost" must be balanced against the possible gain, or "exploitation benefit", obtained in the long run from knowing who the more trustworthy partners are. Research in machine learning, and in particular reinforcement learning, has long confronted similar exploration versus exploitation dilemmas [SUT98]. However, most of this research has ignored the time and cost of exploration, and focused primarily on proving convergence to optimal policies for solving the underlying Markov Decision Processes in the limit, i.e., without time or budget constraints [KAE96]. In particular, the multi-armed bandit (MAB) problem has been proposed as a theoretical framework for evaluating the utility of interacting with multiple entities with stochastically varying performance [AGR88, AUE02]. Though this would appear to be a natural mapping of the problem of evaluating trust in new partners, on closer inspection a key missing aspect underlines the fundamental difference between the two. The basic MAB model does not consider any cost for pulling an arm, whereas interacting with a partner to gather further information to evaluate their trustworthiness involves risks, time and resource commitment and consumption, as well as other costs. More recently, however, researchers have considered augmentations of the basic MAB problem that consider the cost of pulling arms in combination with a budget limit for exploration [DIN13, TRA10, TRA12], i.e., a resource constraint that necessitates strategic engagement with potential partners to quickly identify partnerships which are likely to deliver maximal long-term benefits. We believe that solution approaches for the Multi-Armed Bandit Problem with Budget Constraints (MAB-BF) can be adapted to address the issue of strategic engagement and evaluation of potential partners' trustworthiness under a number of settings and trust metrics such as reliability, fairness, quality of performance, timeliness, etc. The MAB-BF has already been utilized for a wide range of reinforcement learning applications, including bidding in ad exchanges, bid optimization in search, and service provider selection in cloud computing [ARD11, YEH11, BOR07, CHA10, GUH07].

Various metrics for determining the trustworthiness of a partner have been proposed. In this paper, we consider performance, in terms of reliability, as the objective criterion for trustworthiness. This choice is motivated by the following observation [SEN13]: trust in another agent reduces the uncertainty over that agent's independent actions, which positively correlates with the truster's utility. We consider existing and recently developed approaches to the MAB problem with cost and budget constraints for proactively engaging and evaluating potential partners to gauge their reliability and hence their long-term trustworthiness. To evaluate the efficacy of these approaches, we introduce a Supply Chain domain where a manufacturer has to procure raw materials from contractors of initially unknown reliability. Contractors in turn have to depend on sub-contractors to produce their deliverables.
The goal of each agent in the supply chain is to reduce its time to delivery of its product, for which it has to strategically engage its sub-contractors so as to discriminate between sub-contractors of differing reliability. We introduce three new algorithms for addressing the MAB-BF problem. We first formally present the MAB-BF problem (Section 2); then, we introduce recent algorithms proposed for this variant of the MAB problem, together with our own proposed algorithms, highlighting their interrelationships and differences (Section 3). We next introduce the supply chain domain used for evaluating the effectiveness of the proposed algorithms in determining the trustworthiness of the sub-contractors under varying budget constraints and performance variability (Section 4). In Section 5 we describe the simulation framework used for experiments, and then present a series of experimental results from a variety of scenarios in Section 6, varying the branching factor of the arms in each tree, the payoff distributions of the arms, and the budget available to sample from the arms. In conclusion, in Section 7 we summarize the key findings from the comparative experimental evaluation and present thoughts for future research.

2 Fixed-cost Multi-Armed Bandit with Budget Constraint

We now formally introduce the version of the MAB-BF problem where the goal of the agent is to maximize the reward obtained by pulling arms from a set A = {1, . . . , K}, where the ith arm has a fixed cost c_i per pull, with reward drawn from an unknown distribution with mean value µ_i. The agent has at its disposal a total budget of B. Let C = ⟨c_1, . . . , c_K⟩ and µ = ⟨µ_1, . . . , µ_K⟩ refer to the vectors of pulling costs and mean rewards for the K arms. We refer to the above problem as the Fixed-Cost Multi-Armed Bandit with Budget Constraint (MAB-BF), defined by the tuple P = (A, B, C, µ). The goal of an agent facing such a problem is to choose a sequence of arms that optimizes the expected reward without exceeding the total budget for arm pulls. Because the number of arm pulls in a given sequence is limited by the costs of the particular arms chosen, we generally consider the utility ratio µ_i/c_i rather than µ_i alone to determine the priority of a given arm i. It should be noted that while one could improve overall performance somewhat by incorporating other statistical measures together with the utility ratio to produce a more general utility index, the result would only be a general optimization, applicable to any of the algorithms we consider; for simplicity's sake, we choose to focus on the utility ratio only, and leave the determination of a best-case utility index for another setting. Finally, note that the µ values are not available to the agent. The number of arm pulls that an agent can make depends on the multiset of arms it decides to pull.
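To see why the reward-cost ratio, rather than the mean alone, is the natural priority under a budget, consider a hypothetical pair of arms: one with mean reward 5 and cost 4, and one with mean reward 3 and cost 2. The cheaper arm has the lower mean but the higher ratio (1.5 versus 1.25), and with a budget of 8 it yields an expected total reward of 12 versus 10. The short Python sketch below, with illustrative numbers of our own choosing, makes the same comparison.

```python
# Illustrative only: two hypothetical arms competing for a shared budget.
arms = [
    {"mean": 5.0, "cost": 4.0},   # higher mean, lower reward-cost ratio (1.25)
    {"mean": 3.0, "cost": 2.0},   # lower mean, higher reward-cost ratio (1.5)
]
budget = 8.0

for arm in arms:
    pulls = budget // arm["cost"]            # pulls affordable if the budget goes to this arm alone
    expected_total = pulls * arm["mean"]     # expected total reward of that sequence
    ratio = arm["mean"] / arm["cost"]
    print(f"ratio={ratio:.2f}  pulls={pulls:.0f}  expected total reward={expected_total:.0f}")
```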
We now introduce some notation to refer to sequences of arm pulls:

• S = ⟨a_1, . . . , a_t⟩ refers to the sequence of arms pulled by the agent in the first t attempts. We will use |S| to refer to the number of arm pulls in the sequence S, and S(i), ∀i ∈ {1, . . . , |S|}, to refer to the ith arm pulled in that sequence. We also denote the subsequence of pulls in S from t = t_1, . . . , t_2 as S_{t_1,t_2}. We will use a ∈ S to test for the existence of an arm a in the sequence S or to range over the set of arms in the sequence.

• n^S_a represents the number of times arm a ∈ A was pulled in sequence S.

• A_t is the set of arms which can still be pulled, after the sequence of arm pulls S_{1,t}, with the remaining budget B_t = B − Σ_{i=1}^{t} c_{S(i)}, i.e., A_t = {a | a ∈ A ∧ c_a ≤ B_t}.

• A sequence S is valid if it does not violate the budget constraint, i.e., Σ_{a∈S} c_a ≤ B.

• 𝒮 = {S | S is valid} is the set of all valid sequences. For the rest of the paper we will use the term 'sequence' only for valid sequences.

• G_S = Σ_{i=1}^{|S|} r_{a_i,i} is the total reward obtained from a valid sequence S, where r_{a_i,i} is the reward returned from arm a_i ∈ A pulled at time i. As the rewards generated are non-deterministic, we are more interested in the sum of the mean rewards in the sequence; that is, the Expected Total Reward of sequence S, E[G_S] = Σ_{i=1}^{|S|} µ_{a_i}. Correspondingly, we are interested in the optimal sequence S* if the reward distributions for the arms were known,

  S* = arg max_{S ∈ 𝒮} E[G_S]   (1)

and the associated payoff E[G_{S*}]. Note that since the mean rewards for the arms are not known a priori, we will only know G_S.

• Given a sequence of arm pulls S, an estimate of the mean reward for each of the sampled arms, µ^S, can be formed. Then µ^S = ⟨µ^S_1, . . . , µ^S_K⟩, where µ^S_i, ∀i ∈ A, is the average reward obtained from the n^S_i pulls of arm i in the sequence S:

  µ^S_i = (1 / n^S_i) Σ_{j=1}^{|S|} I(S(j) = i) r_{i,j}   (2)

where I(·) is the indicator function.

• The expected regret R(S) of the sequence S is then calculated as the difference R(S) = E[G_{S*}] − E[G_S]. The desirability of a sequence of arm pulls can be measured by its expected regret, and the problem of optimizing expected reward can be mapped into the problem of minimizing expected regret.
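To make the definitions above concrete, the following is a minimal Python sketch of an MAB-BF instance P = (A, B, C, µ) and of the expected total reward of a pull sequence. The class name and the reward model (uniform noise around the hidden means) are illustrative assumptions of ours, not part of the formal problem statement.

```python
import random

class BudgetedBandit:
    """A minimal fixed-cost, budget-constrained bandit instance P = (A, B, C, mu)."""

    def __init__(self, costs, means, budget):
        self.costs = list(costs)    # C = <c_1, ..., c_K>, fixed cost per pull
        self.means = list(means)    # mu = <mu_1, ..., mu_K>, hidden from the agent
        self.budget = budget        # remaining budget B_t
        self.sequence = []          # S, the sequence of arms pulled so far

    def affordable_arms(self):
        """A_t: arms that can still be pulled with the remaining budget."""
        return [i for i, c in enumerate(self.costs) if c <= self.budget]

    def pull(self, i):
        """Pull arm i: charge c_i against the budget and return a stochastic reward."""
        assert self.costs[i] <= self.budget, "pull would make the sequence invalid"
        self.budget -= self.costs[i]
        self.sequence.append(i)
        # Illustrative reward model: uniform noise around the (hidden) mean mu_i.
        return random.uniform(0.0, 2.0 * self.means[i])

def expected_total_reward(sequence, means):
    """E[G_S]: the sum of the mean rewards of the arms pulled in the sequence S."""
    return sum(means[a] for a in sequence)
```

The expected regret of a sequence S is then simply the difference between expected_total_reward for the optimal sequence S* and for S.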
3 Algorithms for the MAB-BF problem

In this section we introduce the different algorithms we experiment with. We first introduce some existing algorithms and then discuss our proposed approaches.

3.1 Existing Algorithms

We now describe five existing algorithms. The first two (ε-first, Greedy) have an initial exploration phase for estimating the mean rewards of the K arms, and in the subsequent exploitation phase pull the arms in a greedy manner; the next two (fKUBE, UCB-BV) are Upper Confidence Bound algorithms that integrate the exploration and exploitation phases by comparing relative confidence for each arm; and the last (fKDE) stochastically integrates exploration and exploitation while narrowing its focus.

Budget-Limited ε-first: This algorithm, henceforth referred to as ε-first, uniformly selects from the set of arms, performing unordered sweeps of each before beginning again, until its exploration budget, εB, is exhausted, i.e., not enough budget remains to pull even the minimum-cost arm (we only consider scenarios where the exploration budget allows us to pull every arm the same number of times in the exploration phase). Let S^explore_ε-first(P) be the sequence of arm pulls for the exploration phase of an ε-first approach with MAB-BF problem P (we will omit the problem argument P in cases where the context is clear). At the end of the exploration phase, an estimate of the mean rewards, µ^{S^explore_ε-first}, is calculated and subsequently used in the exploitation phase. Next, the arms are sorted by the ratio of their estimated mean to pulling cost (their reward-cost ratio). The best such arm, I^max = arg max_{i∈A} µ^{S^explore_ε-first}_i / c_i, which we refer to as the active arm, is pulled until not enough of the exploitation budget, (1 − ε)B, is left for one more pull of that arm. This process is repeated with the rest of the arms until the budget is completely exhausted (not enough remains to pull even the lowest-cost arm). We refer to the sequence of arms pulled in this exploitation phase as S^exploit_ε-first(P) and the combined exploration-exploitation sequence as S_ε-first = S^explore_ε-first(P) + S^exploit_ε-first(P). This algorithm is presented in Algorithm 1. One key difference between the original formulation of this algorithm [TRA10] and our implementation of it is that our version continues to update the estimates of the arms during the exploitation phase, so that the active arm in that phase can change if its utility ratio drops below that of another arm, even if its cost is still affordable. We refer to our versions of algorithms that behave this way as online-exploitation variants, in contrast to the original offline-exploitation variants. Experiments have shown that the online variants significantly outperform the offline variants in most scenarios, and hence in this paper we present results with the online variants only, unless otherwise noted.

Input: P = (A, B, C, µ): MAB-BF problem; ε, exploration fraction of budget
Output: S, a sequence of arm pulls; µ′, a vector of estimated mean rewards for the K arms; G, total reward from all arm pulls
Exploration phase:
  t ← 1; S ← ∅; G ← 0; ExplorationRounds ← ⌊εB / Σ_{i=1}^{K} c_i⌋
  for i = 1 → K do µ′_i ← 0
  for n = 1 → ExplorationRounds do
    for i = 1 → K do
      pull arm i to obtain r_{i,t}; G ← G + r_{i,t}
      S ← S + i                              // add i to sequence S
      µ′_i ← µ′_i + r_{i,t}
      t ← t + 1
  for i = 1 → K do
    µ′_i ← µ′_i / n^S_i                      // form reward mean estimates
Exploitation phase:
  RemainingBudget ← B − ExplorationRounds × Σ_{i=1}^{K} c_i
  A′ ← A                                     // initialize available arms
  while RemainingBudget ≥ min_{i∈A′} c_i do
    I^max ← arg max_{i∈A′} µ′_i / c_i        // pick best arm
    if RemainingBudget ≥ c_{I^max} then      // if budget allows pulling this arm
      pull arm I^max to get reward r_{I^max,t}
      G ← G + r_{I^max,t}
      µ′_{I^max} ← (µ′_{I^max} · n^S_{I^max} + r_{I^max,t}) / (n^S_{I^max} + 1)   // update mean reward estimate
      S ← S + I^max
      RemainingBudget ← RemainingBudget − c_{I^max}
      t ← t + 1
    else
      A′ ← A′ \ {I^max}                      // eliminate arm
  return (S, µ′, G)

Algorithm 1: The ε-first algorithm.

Greedy Algorithm: The Greedy algorithm is a special case of the ε-first algorithm where the exploration budget only allows each arm to be pulled once before the exploitation phase begins. This algorithm is remarkable because, to our knowledge, it has not been used before for the MAB-BF problem, yet its online-exploitation variant nevertheless performs exceptionally well on average. Note that since we require all arms to be pulled at least once, when ε ≤ (Σ_{i=1}^{K} c_i) / B, ε-first algorithms reduce to the Greedy algorithm.
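The following is a rough Python sketch of the online-exploitation ε-first loop described above; it reuses the hypothetical BudgetedBandit class from the earlier sketch and is not the authors' implementation. Fixing the exploration phase to a single sweep of the arms gives the Greedy special case.

```python
def epsilon_first_online(bandit, epsilon):
    """Online-exploitation epsilon-first (sketch): estimates keep updating during exploitation."""
    K = len(bandit.costs)
    sweep_cost = sum(bandit.costs)
    exploration_rounds = int((epsilon * bandit.budget) // sweep_cost)

    totals = [0.0] * K   # summed rewards per arm
    counts = [0] * K     # number of pulls per arm

    # Exploration phase: uniform, unordered sweeps over all K arms.
    for _ in range(exploration_rounds):
        for i in range(K):
            totals[i] += bandit.pull(i)
            counts[i] += 1

    # Exploitation phase: always pull the affordable arm with the best estimated
    # reward-cost ratio, re-estimating after every pull (the online variant).
    while True:
        candidates = [i for i in bandit.affordable_arms() if counts[i] > 0]
        if not candidates:
            break
        best = max(candidates, key=lambda i: (totals[i] / counts[i]) / bandit.costs[i])
        totals[best] += bandit.pull(best)
        counts[best] += 1

    estimates = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(K)]
    return bandit.sequence, estimates
```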
Fractional KUBE (fKUBE): The fractional Knapsack-based Upper Confidence Bound Exploration and Exploitation algorithm (fKUBE) [TRA10] is based on a class of upper-confidence bound (UCB) algorithms originally developed for the MAB problem without budget constraints. This algorithm has a minimal exploration phase, pulling each arm exactly once, and thereafter uses a confidence index based on the reward-cost ratio of the arms to determine which arm a_i should be pulled at time t + 1. The index is given by the following, where A_t is the set of arms still affordable at time t, µ′_{i,t} is the sample mean of a_i at t, and n_{i,t} is the number of times that a_i has been pulled by t:

  arg max_{i ∈ A_t} ( µ′_{i,t} + √(2 ln t / n_{i,t}) ) / c_i   (3)

UCB with Budget Constraint and Variable Costs (UCB-BV): UCB-BV, created for the budget-constrained MAB with the added complication of variable costs for each arm, inherits from the UCB algorithms [DIN13]. It begins with a minimal exploration phase, followed by a mixed exploration-exploitation phase driven by a confidence index. After each sweep of the arms, the arm with the highest index value becomes the active arm. This bound, for arm i at time t, where n_{i,t} represents the number of times it has been pulled by this time, is given by:

  arg max_{i ∈ A} [ µ′_{i,t} / c_i + ( (1 + 1/min_j c_j) √(ln(t−1) / n_{i,t}) ) / ( min_j c_j − √(ln(t−1) / n_{i,t}) ) ]   (4)

Fractional Knapsack-based Decreasing ε-greedy (fKDE): The fractional KDE algorithm [TRA10] is a decreasing-exploration-over-time approach, which first pulls the arms uniformly γ times and thereafter pulls the arm with the highest estimated reward-cost ratio with increasing probability. The probability of uniform random exploration after t arm pulls is set to ε_t = min(1, γ/t); otherwise, the arm with the highest estimated reward-cost ratio, based on the sequence of arms pulled, is chosen.
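To make the difference between the two confidence indices concrete, here is a small Python sketch of the per-arm index computations in Equations 3 and 4. The function names and bookkeeping are our own; both functions assume each arm has already been pulled at least once (so n_{i,t} ≥ 1 and t ≥ 2), which the minimal exploration phase guarantees.

```python
import math

def fkube_index(mu_hat, n_pulls, cost, t):
    """fKUBE (Eq. 3): optimistic estimate of the mean, divided by the arm's cost."""
    bonus = math.sqrt(2.0 * math.log(t) / n_pulls)
    return (mu_hat + bonus) / cost

def ucb_bv_index(mu_hat, n_pulls, cost, t, min_cost):
    """UCB-BV (Eq. 4): reward-cost ratio plus a cost-aware exploration term."""
    e = math.sqrt(math.log(t - 1) / n_pulls)
    # Note: the denominator can be small or non-positive very early on,
    # while the exploration term still exceeds the minimum pulling cost.
    return mu_hat / cost + (1.0 + 1.0 / min_cost) * e / (min_cost - e)

def select_arm(mu_hats, counts, costs, affordable, t, rule="fkube"):
    """Pick the affordable arm with the highest confidence index (illustrative only)."""
    min_cost = min(costs)
    def score(i):
        if rule == "fkube":
            return fkube_index(mu_hats[i], counts[i], costs[i], t)
        return ucb_bv_index(mu_hats[i], counts[i], costs[i], t, min_cost)
    return max(affordable, key=score)
```

In both cases the arm maximizing the index becomes the active arm, selected on every pull for fKUBE and after every sweep for UCB-BV.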
3.2 New Algorithms

We now present the algorithms that we have developed for the MAB-BF problem. We believe these algorithms explore more intelligently than their predecessors: the choice of arms to be pulled during exploration is driven either by the number of arms |A|, or by the distribution of the arm rewards in µ and the costs in C, whereas past algorithms only made use of uniform or minimal exploration phases, and fixed the exploration budget without consideration of the bandit at hand. Also note that all the algorithms that we introduce are online algorithms, i.e., an eliminated arm may be reconsidered if the reward-cost ratio of a sufficient number of previously preferred arms drops below the corresponding ratio of this arm upon further sampling.

l-split (lS): This is a generalized Greedy approach. Instead of eliminating all but one arm after the first pass, the lS algorithm successively eliminates (1 − 1/l) of the arms after each pass: if A_lS(p) is the number of surviving arms after p splitting passes of the algorithm, then A_lS(p+1) = ⌈(1/l) A_lS(p)⌉. After ⌈log_l K⌉ passes, lS narrows down the choice to one arm, and thereafter performs greedy exploitation. The algorithm is presented in detail in Algorithm 2. The simplest of this family of algorithms is the 2-split (2S) or halving algorithm, which successively eliminates approximately half of the underperforming arms after each pass.

Progressive exploration ε-first (PEEF): The PEEF algorithm was developed upon a careful evaluation of the ε-first algorithm. Whereas the latter expends its exploration budget uniformly over the set of K arms, we conjectured that in a number of scenarios, particularly those with a large number of arms, there might be some low-return arms that can be quickly discarded. More importantly, we believe that the exploration budget can be better utilized to tease out differences between similar, high reward-cost ratio arms by visiting them with increasing frequency. Hence, rather than uniform exploration, we propose a progressive exploration scheme where we perform an l-split operation, as in the lS algorithm, after each pass. The difference with that algorithm is that in PEEF the splitting value l is calculated such that the number of remaining arm choices is reduced to 1 approximately at the end of the exploration phase. Given an exploration budget of εB, we want l to be such that the following condition holds:

  Σ_{j=1}^{log_l K} (K / l^j) c_avg = εB,   (5)

where c_avg = (1/K) Σ_{i=1}^{K} c_i is the average cost of pulling an arm. Solving this equation for l, we obtain l = (εB − K)/(εB − 1). (Note that this calculation of l is necessarily approximate, both in the use of c_avg as the cost per unit pull during the exploration phase and in the use of log_l K as the number of passes during the exploration phase.) For obvious reasons, we will perform a pairwise comparison of the ε-first algorithm and the corresponding PEEF algorithm for different scenarios in the experimental section.

Input: P = (A, B, C, µ): MAB-BF problem; l, elimination factor
Output: S, a sequence of arm pulls; µ′, a vector of the arms' sample mean rewards; G, total reward from all arm pulls
  t ← 1; S ← ∅; G ← 0; RemainingBudget ← B; NumPasses ← 0
  for i = 1 → K do µ′_i ← 0
  A′ ← A                                     // initialize available arms
  while A′ ≠ ∅ do
    foreach a ∈ A′ do
      if RemainingBudget ≥ c_a then          // if budget allows pulling this arm
        pull arm a to obtain reward r_{a,t}
        G ← G + r_{a,t}
        µ′_a ← (µ′_a · n^S_a + r_{a,t}) / (n^S_a + 1)   // update mean reward estimate of arm a
        S ← S + a
        RemainingBudget ← RemainingBudget − c_a
        t ← t + 1
    NumPasses ← NumPasses + 1
    A′ ← ∅
    NumToPull ← ⌈K / l^NumPasses⌉
    while NumToPull > 0 and A − A′ ≠ ∅ do
      A′ ← A′ ∪ {arg max_{i∈A} µ′_i / c_i | c_i ≤ RemainingBudget, i ∉ A′}
      NumToPull ← NumToPull − 1
  return (S, µ′, G)

Algorithm 2: The l-Split algorithm.

Survival of the Above Average (SOAAv): This algorithm also successively narrows down the set of active arms by eliminating underperforming arms. Rather than eliminating a fixed number of arms after each pass, however, it eliminates arms whose estimated reward-cost ratio is below (1 + x) times the average of such ratios over the arms in the last pass. Setting x = 0 means that only above-average arms survive from one pass to the next. Note again that this is an online approach, where a previously eliminated arm can come back into the active set if the estimates of other active arms drop. This algorithm is presented in more detail in Algorithm 3.

Of the above algorithms, both lS and PEEF are rank-based algorithms, where arms are eliminated based on their ranking by reward-cost ratios, whereas only the SOAAv algorithm is a value-based approach, where arms with estimated reward-cost ratios below a certain factor of the average of the currently active set are eliminated.
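To contrast the two kinds of elimination rule, the sketch below shows the survivor-selection step of a single pass in isolation: the rank-based rule keeps the top ⌈K / l^p⌉ arms by estimated reward-cost ratio, while the value-based rule keeps every arm whose ratio is at least (1 + x) times the pass average. The function names, data layout, and example numbers are our own simplifications.

```python
import math

def lsplit_survivors(ratios, l, num_passes):
    """Rank-based (lS/PEEF-style): keep the ceil(K / l**num_passes) best arms by ratio."""
    K = len(ratios)
    keep = max(1, math.ceil(K / (l ** num_passes)))
    ranked = sorted(range(K), key=lambda i: ratios[i], reverse=True)
    return set(ranked[:keep])

def soaav_survivors(ratios, x):
    """Value-based (SOAAv): keep arms whose ratio is at least (1 + x) times the pass average."""
    avg = sum(ratios) / len(ratios)
    return {i for i, r in enumerate(ratios) if r >= (1.0 + x) * avg}

# Example with made-up reward-cost ratio estimates after one pass:
ratios = [1.4, 0.9, 1.1, 0.6, 1.3]
print(lsplit_survivors(ratios, l=2, num_passes=1))   # roughly the top half of the arms
print(soaav_survivors(ratios, x=0.0))                # only above-average arms
```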
4 Supply-chain Model

We now introduce the supply-chain model, in which contractors have to engage strategically with sub-contractors of initially unknown reliability (trustworthiness). Each interaction has a cost, and the contractor's goal is to distinguish the trustworthiness of all of its sub-contractors given a fixed budget to pay for the interactions. This domain allows us to examine

Input: P = (A, B, C, µ): MAB-BF problem; x, elimination factor
Output: S, a sequence of arm pulls; µ′, a vector of the arms' sample mean rewards; G, total reward from all arm pulls
  t ← 1; S ← ∅; G ← 0; RemainingBudget ← B
  for i = 1 → K do µ′_i ← 0
  A′ ← A                                     // initialize available arms
  while RemainingBudget ≥ min_{i∈A′} c_i do
    numPullsInPass ← 0; passAverageRatio ← 0
    foreach a ∈ A′ do
      if RemainingBudget ≥ c_a then          // if budget allows pulling this arm
        pull arm a to obtain reward r_{a,t}
        G ← G + r_{a,t}
        µ′_a ← (µ′_a · n^S_a + r_{a,t}) / (n^S_a + 1)   // update mean reward estimate of arm a
        S ← S + a
        RemainingBudget ← RemainingBudget − c_a
        t ← t + 1
        passAverageRatio ← passAverageRatio + r_{a,t} / c_a
        numPullsInPass ← numPullsInPass + 1
      else
        A′ ← A′ \ {a}                        // eliminate arm
    if numPullsInPass > 0 then
      passAverageRatio ← passAverageRatio / numPullsInPass
    A′ ← ∅
    foreach a ∈ A do
      if c_a ≤ RemainingBudget and µ′_a / c_a ≥ (1 + x) · passAverageRatio then
        A′ ← A′ ∪ {a}                        // arm survives into (or re-enters) the active set
  return (S, µ′, G)

Algorithm 3: The SOAAv algorithm.
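As a rough illustration of the mapping described above, each node in the supply chain can be viewed as running its own local MAB-BF instance over its sub-contractors. The sketch below reuses the hypothetical BudgetedBandit and epsilon_first_online sketches from earlier; the node structure, reliability-based reward model, and parameter values are our own illustrative assumptions, not the simulation framework described in Section 5.

```python
class SupplyChainNode:
    """A contractor with a fixed interaction budget over sub-contractors of unknown reliability."""

    def __init__(self, name, subcontractor_reliabilities, interaction_costs, budget):
        self.name = name
        # One arm per sub-contractor; its hidden mean reflects that sub-contractor's reliability.
        self.bandit = BudgetedBandit(costs=interaction_costs,
                                     means=subcontractor_reliabilities,
                                     budget=budget)

    def evaluate_subcontractors(self, epsilon=0.2):
        """Spend the interaction budget to estimate each sub-contractor's reliability."""
        _, estimates = epsilon_first_online(self.bandit, epsilon)
        return estimates

# Illustrative use: a manufacturer with three sub-contractors of differing reliability.
node = SupplyChainNode("manufacturer",
                       subcontractor_reliabilities=[0.9, 0.6, 0.75],
                       interaction_costs=[1.0, 1.0, 2.0],
                       budget=60.0)
print(node.evaluate_subcontractors())
```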