               Adaptive Budgeted Bandit Algorithms for Trust in a
                            Supply-Chain Setting

           Sandip Sen                  Anton Ridgway                Michael Ripley
                           Department of Computer Science
                               The University of Tulsa
        sandip@utulsa.edu      anton-ridgway@utulsa.edu      michael-ripley@utulsa.edu




                                                                   Abstract

                           Recently, an AAMAS Challenges & Visions paper identified several key com-
                           ponents of a comprehensive trust management approach that have been under-
                           studied by the research community [SEN13]. We believe that we can build on
                           recent advances in closely related research in other sub-fields of AI and multi-
                           agent systems to address some of these issues. For example, the budgeted
                           multi-armed bandit problem involves pulling multiple arms with stochastic re-
                           wards with the goal of maximizing the total reward generated from those arms,
                           while keeping the cost of pulling the arms beneath a given budget. We argue
                           that multi-armed bandit algorithms can be adapted to address research issues
                           in the trust engagement and evaluation components of a comprehensive trust
                           management approach. To support this proposition, we consider a supply-chain
                           application, where a tree of dependent supplier agents can be considered as an
                           arm of the online bandit problem with budget constraints. Each of the nodes
                           in the supply chain must then solve its local bandit problem in parallel to
                           determine which of its sub-suppliers is most trustworthy. We use new arm-
                           selection strategies, and demonstrate how they can be gainfully applied to
                           trust-based decision making in the supply chain to reduce time to production
                           and hence improve utility through timely delivery of products.


1    Introduction
Research on trust in multiagent systems has given us a variety of conceptual frameworks to view trust and effective
algorithms for evaluating the trustworthiness of other agents given the history of mutual interactions [CAS98, GAM90,
HUY06, SEN02, VOG10, YU02, YOL05]. However, it has been posited that to fully harness the effectiveness of trust-
based decision making, it is critical to develop a more comprehensive approach to trust management that addresses not
only trust evaluation but also provides strategic reasoning procedures that determine whom to interact with, in what context,
and how to best utilize the resultant knowledge about the trustworthiness of others [SEN13]. The goal then is to develop
proactive trust mechanisms that explore possible fruitful partnerships, create situations where interactions will provide

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: R. Cohen, R. Falcone and T. J. Norman (eds.): Proceedings of the 17th International Workshop on Trust in Agent Societies, Paris, France, 05-MAY-
2014, published at http://ceur-ws.org




discriminatory evidence of the trustworthiness of potential partners, and use a well-thought-out plan of how to best utilize
the trusted partners given expectations of future goals, plans and resource requirements.
   In this paper, we consider the problem of proactive engagement of potential long-term partners to determine who
can be trusted to provide consistent service on which an agent’s goal is critically dependent. Whereas most of the
research on trust in multiagent systems focuses on after-the-event, offline evaluation of interaction histories to determine
the trustworthiness of partners, real-world scenarios demand forward-looking, online schemes that have to choose to engage
interaction partners where the process of engagement incurs a cost. This “exploration cost” must be balanced against the
possible gain, or “exploitation benefit”, in the long run from the knowledge of who the more trustworthy partners are.
Research in machine learning, and in particular reinforcement learning, has long confronted similar exploration versus
exploitation dilemmas [SUT98]. However, most of this research has ignored the time and cost of exploration, and
focused primarily on proving convergence to optimal policies for solving the underlying Markov Decision Processes in
the limit, i.e., without time or budget constraints [KAE96]. In particular, the multi-armed bandit (MAB) problem has
been proposed as a theoretical framework for evaluating the utility of interacting with multiple entities with stochastically
varying performance [AGR88, AUE02].
   Though this would appear to be a natural mapping of the problem of evaluation of trust in new partners, on closer
inspection a key missing aspect underlines the fundamental difference between the two. The basic MAB model does not
consider any cost for pulling an arm, whereas interacting with a partner to gather further information to evaluate their
trustworthiness involves risks, time and resource commitment and consumption as well as other costs. More recently,
however, researchers have considered augmentations of the basic MAB problem that consider the cost of pulling arms
in combination with a budget limit for exploration [DIN13, TRA10, TRA12], i.e., a resource constraint that necessitates
strategic engagement with potential partners to quickly identify partnerships which are likely to deliver maximal long-term
benefits. We believe that solution approaches for the Multi-Armed Bandit Problem with Budget Constraints (MAB-BF)
can be adapted to address the issue of strategic engagement and evaluation of potential partner’s trustworthiness under a
number of settings and trust metrics such as reliability, fairness, quality of performance, timeliness, etc. The MAB-BF
has already been utilized for a wide range of reinforcement learning applications, including bidding in ad exchanges, bid
optimization in search, and service provider selection in cloud computing [ARD11, YEH11, BOR07, CHA10, GUH07].
   Various metrics for determining the trustworthiness of a partner have been proposed. In this paper, we consider per-
formance, in terms of reliability, as the objective criterion for trustworthiness. This choice is motivated by the following
observation [SEN13]: Trust in another agent reduces the uncertainty over that agent’s independent actions which posi-
tively correlates with the truster’s utility. We consider existing and recently developed approaches to addressing the MAB
problem with cost and budget constraints to proactively engage and evaluate potential partners to gauge their reliability and
hence long-term trustworthiness. To evaluate the efficacy of these approaches, we introduce a Supply Chain domain where
a manufacturer has to procure raw materials from contractors of initially unknown reliability. Contractors in turn have to
depend on sub-contractors to produce their deliverables. The goal of each agent in the supply chain is to reduce the time
to delivery of its product, for which it has to strategically engage its sub-contractors and discriminate between sub-contractors
of differing reliability.
   We introduce three new algorithms for addressing the MAB-BF problem. We first formally present the MAB-BF
problem (Section 2); then we introduce recent algorithms proposed for this variant of the MAB problem, together with our
own proposed algorithms, highlighting their interrelationships and differences (Section 3). We next introduce the supply
chain domain used for evaluating the effectiveness of the proposed algorithms in determining the trustworthiness of the
sub-contractors under varying budget constraints and performance variability (Section 4). In Section 5 we describe the
simulation framework used for experiments and then present a series of experimental results from a variety of scenarios
in Section 6, varying the branching factor of the arms in each tree, the payoff distributions of the arms, and the budget
available to sample from the arms. In conclusion, in Section 7 we summarize the key findings from the comparative
experimental evaluation and present thoughts for future research.

2   Fixed-cost Multi-Armed Bandit with Budget Constraint
We now formally introduce the version of the MAB-BF problem where the goal of the agent is to maximize the reward obtained
by pulling arms from a set A = {1, . . . , K}, where the ith arm has a fixed cost ci per pull and a reward drawn from an
unknown distribution with mean value µi . The agent has at its disposal a total budget of B. Let C = ⟨c1 , . . . , cK ⟩ and
µ = ⟨µ1 , . . . , µK ⟩ refer to the vectors of pulling costs and mean rewards for the K arms. We refer to the above problem as
the Fixed-Cost Multi-Armed Bandit with Budget Constraint (MAB-BF), defined by the tuple P = (A, B, C, µ). The goal
of an agent facing such a problem is to choose a sequence of arms that optimizes the expected reward without exceeding
the total budget for arm pulls. Because the number of arm pulls in a given sequence is limited by the costs of the particular




arms chosen, we generally consider the utility ratio µi /ci rather than µi alone to determine the priority of a given arm i. It
should be noted that while one could improve overall performance somewhat by incorporating other statistical measures
together with the utility ratio to produce a more general utility index, the result would only be a general optimization,
applicable to any of the algorithms we consider; for simplicity’s sake, we choose to focus on the utility ratio only, and
leave the determination of a best-case utility index for another setting. Finally, note that the µ values are not available to
the agent.
   The number of arm pulls that an agent can make is dependent on the multiset of arms it decides to pull. We now
introduce some notations to refer to sequences of arm pulls:
   • S = ⟨a1 , . . . , at ⟩ refers to the sequence of arms pulled by the agent in the first t attempts. We will use |S| to refer to
     the number of arm pulls in the sequence S and S(i), ∀i ∈ {1, . . . , |S|} to refer to the ith arm pulled in that sequence.
     Also we denote the subsequence of pulls in S from t = t1 , . . . , t2 as St1 ,t2 . We will use a ∈ S to test for the existence
     of an arm a in the sequence S or to range over the set of arms in the sequence.
   • nSa represents the number of times arm a ∈ A was pulled in sequence S.
   • At is the set of arms which can still be pulled after the sequence of arm pulls S1,t with the remaining budget
     Bt = B − Σ_{i=1}^{t} c_{S(i)} , i.e., At = {a | a ∈ A ∧ ca ≤ Bt }.
   • A sequence S is valid if it does not violate the budget constraint, i.e., Σ_{a∈S} ca ≤ B.
   • S = {S|S is valid} is the set of all valid sequences. For the rest of the paper we will use the term ‘sequence’ for valid
     sequences.
   • GS = Σ_{i=1}^{|S|} r_{ai,i} is the total reward obtained from a valid sequence S, where r_{ai,i} is the reward returned from arm
     ai ∈ A pulled at time i. As the rewards generated are non-deterministic, we are more interested in the sum of the mean
     rewards in the sequence; that is, the Expected Total Reward of sequence S, E[GS ] = Σ_{i=1}^{|S|} µ_{ai} . Correspondingly, we
     are interested in the optimal sequence S ∗ if the reward distributions for the arms were known,

                                                                  S ∗ = arg max_{S∈S} E[GS ]                                                          (1)
      and the associated payoff E[GS ∗ ]. Note that since the mean rewards for the arms are not known a priori, we will only
      know GS .
    • Given a sequence of arm pulls, S, an estimate of the mean reward for each of the sampled arms, µ^S , can be formed.
      Then µ^S = ⟨µ^S_1 , . . . , µ^S_K ⟩ where µ^S_i , ∀i ∈ A, is the average reward obtained from the n^S_i pulls of arm i in the
      sequence S:

                                                             µ^S_i = (1 / n^S_i) Σ_{j=1}^{|S|} I(S(j) = i) r_{i,j}                                    (2)

      where I(·) is the indicator function.
    • The expected regret, R(S), of the sequence S is then calculated as the difference R(S) = E[G_{S∗}] − E[G_S]. The
      desirability of a sequence of arm pulls can be measured by its expected regret, and the problem of optimizing expected
      reward can be mapped into the problem of minimizing expected regret.
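   To make the setting concrete, the following minimal Python sketch simulates the MAB-BF ingredients defined above; the names (BanditArm, play_sequence) and the Gaussian reward model are our own illustrative assumptions, not part of the formal model.

    import random
    from dataclasses import dataclass

    @dataclass
    class BanditArm:
        cost: float          # fixed pulling cost c_i
        mean_reward: float   # mu_i, unknown to the deciding agent

        def pull(self, rng: random.Random) -> float:
            # stochastic reward; a Gaussian distribution is assumed only for illustration
            return rng.gauss(self.mean_reward, 0.1)

    def play_sequence(arms, sequence, budget, seed=0):
        """Pull arms in the given order while the budget allows; return (G_S, amount spent)."""
        rng = random.Random(seed)
        total_reward, spent = 0.0, 0.0
        for i in sequence:
            if spent + arms[i].cost > budget:   # keep the sequence valid
                break
            spent += arms[i].cost
            total_reward += arms[i].pull(rng)
        return total_reward, spent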

3     Algorithms for the MAB-BF problem
In this section we introduce the different algorithms we experiment with. We first introduce some existing algorithms and
then discuss our proposed approaches.

3.1    Existing Algorithms
We now describe five existing algorithms; the first two (ε-first, Greedy) have an initial exploration phase for estimating
the mean rewards of the K arms, and in the subsequent exploitation phase pull the arms in a greedy manner; the next
two (fKUBE, UCB-BV) are Upper Confidence Bound algorithms that integrate the exploration and exploitation phases
by comparing relative confidence for each arm; and the last (fKDE) stochastically integrates exploration and exploitation
while narrowing its focus:
Budget-Limited ε-first: This algorithm, henceforth referred to as ε-first, uniformly selects from the set of arms, perform-
     ing unordered sweeps of each before beginning again, until its exploration budget, εB, is exhausted (not enough bud-
     get remains to pull even the minimum cost arm¹). Let S^explore_{ε-first}(P)² be the sequence of arm pulls for the exploration
    1 We only consider scenarios where the exploration budget allows us to pull every arm the same number of times in the exploration phase.
    2 We will omit the problem argument P in cases where the context is clear.

       phase for an ε-first approach with MAB-BF problem P . At the end of the exploration phase, an estimate of the mean
       rewards, µ^{S^explore_{ε-first}}, is calculated and subsequently used in the exploitation phase. Next, the arms are sorted by the ratio
       of their estimated mean to pulling cost (their reward-cost ratio). The best such arm, I^max = arg max_{i∈A} µ^{S^explore_{ε-first}}_i / c_i,
       which we refer to as the active arm, is pulled until not enough of the exploitation budget, (1 − ε)B, is left for one
       more pull of that arm. This process is repeated with the rest of the arms until the budget is completely exhausted (not
       enough remains to pull even the lowest cost arm). We refer to the sequence of arms pulled in this exploitation phase
       as S^exploit_{ε-first}(P) and the combined exploration-exploitation sequence as S_{ε-first} = S^explore_{ε-first}(P) + S^exploit_{ε-first}(P). This
       algorithm is presented in Algorithm 1.
       One key difference between the original formulation of this algorithm [TRA10] and our implementation of it is that
       our version continues to update the estimates of the arms during the exploitation phase, so that the active arm in
       that phase can change if its utility ratio drops below that of another arm, even if its cost is still affordable. We refer
       to our version of algorithms that behave this way as online-exploitation variants, in contrast to the original offline-
       exploitation variants. Experiments have shown that the online variants significantly outperform the offline variants in
       most scenarios and hence in this paper we present results with the online variants only, unless otherwise noted.

       Input : P = (A, B, C, µ): MAB-BF problem;
               ε, exploration fraction of budget
       Output: S, a sequence of arm pulls;
               µ̂, a vector of estimated mean rewards from the K arms;
               G, total reward from all arm pulls;

       Exploration phase:
       t ← 1; S ← ∅; G ← 0; ExplorationRounds ← ⌊ εB / Σ_{i=1}^{K} c_i ⌋;
       for i = 1 → K do
           µ̂_i ← 0
       for n = 1 → ExplorationRounds do
           for i = 1 → K do
               pull arm i to obtain r_{i,t}; G ← G + r_{i,t};
               S ← S + i ;                               // add i to sequence S
               µ̂_i ← µ̂_i + r_{i,t};
               t ← t + 1;
       for i = 1 → K do
           µ̂_i ← µ̂_i / n^S_i ;                           // form reward mean estimates

       Exploitation phase:
       RemainingBudget ← B − ExplorationRounds · Σ_{i=1}^{K} c_i;
       A′ ← A ;                                          // initialize available arms
       while RemainingBudget ≥ min_{i∈A′} c_i do
           I^max ← arg max_{i∈A′} µ̂_i / c_i ;             // pick best arm
           if RemainingBudget ≥ c_{I^max} then           // if budget allows a pull of this arm
               pull arm I^max to get reward r_{I^max,t};
               G ← G + r_{I^max,t};
               µ̂_{I^max} ← (µ̂_{I^max} · n^S_{I^max} + r_{I^max,t}) / (n^S_{I^max} + 1) ;   // update mean reward estimate of arm
               S ← S + I^max ;
               RemainingBudget ← RemainingBudget − c_{I^max};
               t ← t + 1;
           else
               A′ ← A′ \ {I^max} ;                       // eliminate arm
       return (S, µ̂, G);

                                                       Algorithm 1: The ε-first algorithm.
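   A compact Python sketch of the online-exploitation ε-first strategy described above is given below; the representation of an arm as a (cost, sampler) pair and the name epsilon_first are our own illustrative choices, not the authors' implementation.

    import random

    def epsilon_first(arms, budget, epsilon, seed=0):
        """Sketch of budget-limited epsilon-first (online-exploitation variant).
        `arms` is a list of (cost, sampler) pairs, where sampler(rng) returns a reward."""
        rng = random.Random(seed)
        K = len(arms)
        est, pulls = [0.0] * K, [0] * K       # running mean estimates and pull counts
        seq, total_reward = [], 0.0
        remaining = budget

        def pull(i):
            nonlocal total_reward, remaining
            cost, sampler = arms[i]
            r = sampler(rng)
            pulls[i] += 1
            est[i] += (r - est[i]) / pulls[i]  # incremental mean update
            total_reward += r
            remaining -= cost
            seq.append(i)

        # Exploration: uniform sweeps of all arms until the exploration budget is spent.
        sweep_cost = sum(c for c, _ in arms)
        spent_exploring = 0.0
        while spent_exploring + sweep_cost <= epsilon * budget:
            for i in range(K):
                pull(i)
            spent_exploring += sweep_cost

        # Exploitation: repeatedly pull the affordable arm with the best estimated
        # reward-cost ratio, re-ranking after every pull since estimates keep changing.
        while True:
            affordable = [i for i in range(K) if arms[i][0] <= remaining]
            if not affordable:
                break
            pull(max(affordable, key=lambda i: est[i] / arms[i][0]))

        return seq, est, total_reward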

Greedy Algorithm: The Greedy algorithm is a special case of the ε-first algorithm where the exploration budget only
    allows each arm to be pulled once before the exploitation phase begins. This algorithm is remarkable because to
    our knowledge it has not been used before for the MAB-BF problem, but its online-exploitation variant nevertheless
    performs exceptionally well on average. Note that since we require all arms to be pulled at least once, when
    ε ≤ (Σ_{i=1}^{K} c_i)/B, ε-first algorithms reduce to the Greedy algorithm.
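    As a small numerical illustration of this reduction (the costs and budget below are made up), the threshold on ε can be computed directly:

    # Hypothetical costs and budget, for illustration only.
    costs = [1.0, 2.0, 1.5]   # pulling costs c_i
    budget = 60.0             # total budget B

    # epsilon-first degenerates to Greedy whenever epsilon <= sum(c_i) / B,
    # since the exploration budget then allows at most one pull per arm.
    greedy_threshold = sum(costs) / budget
    print(f"epsilon-first reduces to Greedy for epsilon <= {greedy_threshold:.3f}")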
Fractional KUBE (fKUBE): The Fractional Knapsack-based Upper Confidence Bound Exploration and Exploitation
    (fKUBE) algorithm [TRA10] is based on a class of upper-confidence bound (UCB) algorithms originally developed
    for the MAB problem without budget constraints. This algorithm has a minimal exploration phase, pulling each arm
    exactly once, and thereafter uses a confidence index based on the reward-cost ratio of the arms to determine which
    arm ai should be pulled at time t + 1, given by the following, where At is the set of arms still affordable at time t, µ̂_{i,t}
    is the sample mean of ai at t, and n_{i,t} is the number of times that ai has been pulled by t:

                                         arg max_{i∈At}  [ µ̂_{i,t} + √(2 ln t / n_{i,t}) ] / c_i                                (3)
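    A stand-alone Python sketch of the fKUBE choice rule (the helper name and argument layout are ours) mirrors the index above:

    import math

    def fkube_choice(estimates, pulls, costs, t, remaining_budget):
        """Pick the next arm per the fKUBE index: among still-affordable arms,
        maximize (mu_hat_{i,t} + sqrt(2 ln t / n_{i,t})) / c_i.
        Assumes every arm has already been pulled once (pulls[i] >= 1, t >= 1)
        and that the caller stops once no arm is affordable."""
        affordable = [i for i in range(len(costs)) if costs[i] <= remaining_budget]
        return max(
            affordable,
            key=lambda i: (estimates[i] + math.sqrt(2 * math.log(t) / pulls[i])) / costs[i],
        )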

UCB with Budget Constraint and Variable Costs (UCB-BV): UCB-BV, created for the budget-constrained MAB with
   the added complication of variable costs for each arm, inherits from the UCB family of algorithms [DIN13]. It begins with a
   minimal exploration phase, followed by a mixed exploration-exploitation phase driven by a confidence index. After
   each sweep of the arms, the arm with the highest index value becomes the active arm. This bound, for arm i at time t,
   where n_{i,t} represents the number of times it has been pulled by this time, is given by:

               arg max_{ai∈A}  µ̂_{i,t} / c_i  +  [ (1 + 1/min_j c_j) √(ln(t−1) / n_{i,t}) ] / [ min_j c_j − √(ln(t−1) / n_{i,t}) ]                (4)
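   The UCB-BV index of Eq. (4) can be sketched the same way; the helper below is our own illustration and assumes every arm has already been sampled once and that the confidence radius stays below the minimum cost, so the denominator remains positive:

    import math

    def ucb_bv_choice(estimates, pulls, costs, t):
        """Pick the next arm per the UCB-BV index of Eq. (4).
        Assumes pulls[i] >= 1 for all arms, t >= 2, and
        sqrt(ln(t-1)/n_{i,t}) < min_j c_j so the denominator is positive."""
        min_cost = min(costs)                          # min_j c_j
        def index(i):
            radius = math.sqrt(math.log(t - 1) / pulls[i])
            return estimates[i] / costs[i] + (1 + 1 / min_cost) * radius / (min_cost - radius)
        return max(range(len(costs)), key=index)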


Fractional Knapsack-based Decreasing ε-greedy (fKDE): The fractional KDE algorithm [TRA10] decreases its explo-
    ration over time: for the first γ pulls the arms are selected uniformly at random, and thereafter the arm with the
    highest estimated reward-cost ratio is pulled with increasing probability. Specifically, the probability of uniform
    random exploration after t arm pulls is set to ε_t = min(1, γ/t); otherwise, the arm with the highest estimated
    reward-cost ratio, based on the sequence of arms pulled so far, is chosen.
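    A single fKDE decision step can be sketched as follows (the helper name and signature are ours):

    import random

    def fkde_choice(estimates, costs, t, gamma, rng):
        """One fKDE decision: with probability eps_t = min(1, gamma / t) pick an arm
        uniformly at random, otherwise exploit the best estimated reward-cost ratio."""
        if rng.random() < min(1.0, gamma / t):
            return rng.randrange(len(costs))                  # uniform exploration
        return max(range(len(costs)), key=lambda i: estimates[i] / costs[i])

    # usage: fkde_choice([0.4, 0.7], [1.0, 2.0], t=12, gamma=5, rng=random.Random(0))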


3.2   New Algorithms

Now, we present algorithms that we have developed for the MAB-BF problem. We believe these algorithms explore more
intelligently compared to their predecessors; the choice of arms to be pulled during exploration is driven by either the
number of arms |A|, or the distribution of the arm rewards in µ and the costs in C; past algorithms only made use of
uniform or minimal exploration phases, and fixed the exploration budget without consideration of the bandit at hand. Also
note that all the algorithms that we introduce are online algorithms, i.e., an eliminated arm may be reconsidered if the
reward-cost ratio of a sufficient number of previously preferred arms drops below the corresponding ratio of this arm upon
further sampling.
l-split (lS): This is a generalized Greedy approach. Instead of eliminating all but one arm after the first pass, the lS
      algorithm successively eliminates a fraction (1 − 1/l) of the arms after each pass: if A_lS(p) is the number of surviving arms after
      p splitting passes of the algorithm, then A_lS(p+1) = ⌈A_lS(p)/l⌉. After ⌈log_l K⌉ passes, lS narrows down the choice
      to one, and thereafter performs greedy exploitation. The algorithm is presented in detail in Algorithm 2. The simplest
      of this family of algorithms is the 2-split (2S) or halving algorithm, which successively eliminates approximately half
      of the underperforming arms after each pass.
Progressive exploration ε-first (PEEF): The PEEF algorithm was developed upon a careful evaluation of the ε-first al-
      gorithm. Whereas the latter expends its exploration budget uniformly over the set of K arms, we conjectured that
      in a number of scenarios, particularly those with a large number of arms, there might be some low return arms that
      can be quickly discarded. More importantly, we believe that the exploration budget can be better utilized to tease out
      differences between similar, high reward-cost ratio arms by visiting them with increasing frequency. Hence, rather
      than uniform exploration, we propose a progressive exploration scheme where we perform an l-split operation, as in
      the lS algorithm, after each pass. The difference from that algorithm is that in PEEF the splitting value l is calculated
      such that the number of remaining arm choices is reduced to approximately one by the end of the exploration phase.
      Given an exploration budget of εB, we want l to be such that the following condition holds:




                                              Σ_{j=1}^{log_l K} (K / l^j) · c_avg = εB,                                                              (5)

         where c_avg = (1/K) Σ_{i=1}^{K} c_i is the average cost of pulling an arm³. Solving this equation for l we obtain
         l = (εB − 1)/(εB − K).

         For obvious reasons, we will perform a pairwise comparison of the ε-first algorithm and the corresponding PEEF
         algorithm for different scenarios in the experimental section.
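   A small helper (ours, not part of the original algorithm) can compute this split factor; following footnote 3, the exploration budget is first converted into an approximate number of average-cost pulls before applying the closed form above:

    def peef_split_factor(K, exploration_budget, avg_cost):
        """Approximate l such that roughly one arm remains once the exploration
        budget (epsilon * B) is spent, per Eq. (5); requires the budget to cover
        more than K average-cost pulls."""
        m = exploration_budget / avg_cost   # exploration budget in average-cost pulls
        if m <= K:
            raise ValueError("exploration budget too small for K arms")
        return (m - 1) / (m - K)            # l = (eps*B - 1) / (eps*B - K) in pull units

    # e.g. K = 10 arms and an exploration budget of 50 unit-cost pulls -> l = 49/40 = 1.225
    print(peef_split_factor(10, 50, 1.0))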

         Input : P = (A, B, C, µ): MAB-BF problem;
                 l, elimination factor
         Output: S, a sequence of arm pulls;
                 µ̂, a vector of the arms’ sample mean rewards;
                 G, total reward from all arm pulls;

         t ← 1; S ← ∅; G ← 0; RemainingBudget ← B; NumPasses ← 0
         for i = 1 → K do
             µ̂_i ← 0
         A′ ← A ;                                          // initialize available arms
         while A′ ≠ ∅ do
             foreach a ∈ A′ do
                 if RemainingBudget ≥ c_a then             // if budget allows a pull of this arm
                     pull arm a to obtain reward r_{a,t};
                     G ← G + r_{a,t};
                     µ̂_a ← (µ̂_a · n^S_a + r_{a,t}) / (n^S_a + 1) ;   // update mean reward estimate of arm
                     S ← S + a;
                     RemainingBudget ← RemainingBudget − c_a;
                     t ← t + 1;
             NumPasses ← NumPasses + 1;
             A′ ← ∅;
             NumToPull ← ⌈ K / l^NumPasses ⌉;
             while NumToPull > 0 and A − A′ ≠ ∅ do
                 A′ ← A′ ∪ { arg max_{i ∈ A\A′, c_i ≤ RemainingBudget} µ̂_i / c_i };
                 NumToPull ← NumToPull − 1;
         return (S, µ̂, G);

                                                       Algorithm 2: The l-Split algorithm.
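   The elimination schedule of Algorithm 2 is easy to inspect in isolation; the helper below (our own illustration) just tracks how many arms survive each pass for a given l:

    import math

    def lsplit_survivors(K, l):
        """Number of surviving arms after each pass of l-split (requires l > 1):
        keep ceil(previous / l) arms until a single arm remains."""
        survivors = [K]
        while survivors[-1] > 1:
            survivors.append(math.ceil(survivors[-1] / l))
        return survivors

    # 2-split (halving) over 10 arms: [10, 5, 3, 2, 1]
    print(lsplit_survivors(10, 2))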

Survival of the Above Average (SOAAv): This algorithm also successively narrows down the set of active arms by elim-
     inating underperforming arms. But rather than eliminating a fixed number of arms after each pass, it eliminates arms
     whose estimated reward-cost ratio is below (1 + x) times the average of such ratios of the arms in the last pass.
     Setting x = 0 means only above average individuals survive from one pass of the arms to the next. Note again that this
     is an online-exploration approach where a previously eliminated arm can come back into the active set if estimates of
     other active arms drop. This algorithm is presented in more detail in Algorithm 3.
   Of the above algorithms, both lS and PEEF are rank-based algorithms, where arms are eliminated based on their ranking
by reward-cost ratio, whereas only the SOAAv algorithm is a value-based approach, where arms with estimated
reward-cost ratios below a certain factor of the average of the currently active set are eliminated.
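   The value-based survival rule of SOAAv can likewise be isolated in a few lines (the helper and its dictionary input are our own illustration):

    def soaav_survivors(ratio_estimates, x=0.0):
        """Keep the arms whose estimated reward-cost ratio is at least (1 + x) times
        the average ratio observed over the last pass; with x = 0 only above-average
        arms survive."""
        average = sum(ratio_estimates.values()) / len(ratio_estimates)
        threshold = (1 + x) * average
        return {arm for arm, ratio in ratio_estimates.items() if ratio >= threshold}

    # e.g. with x = 0, arms "a" and "c" survive this pass (average ratio is 0.85)
    print(soaav_survivors({"a": 0.9, "b": 0.5, "c": 1.3, "d": 0.7}))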

4        Supply-chain Model
We now introduce the supply-chain model where contractors have to engage strategically with sub-contractors of initially-
unknown reliability (trustworthiness). Each interaction has a cost and the contractor’s goal is to distinguish the trustwor-
thiness of all of its subcontractors given a fixed budget to pay for the interactions. This domain allows us to examine
    3 Note that this calculation of l is necessarily approximate, both for the use of c_avg as the cost per unit pull during the exploration phase and in the use
of log_l K as the number of passes during the exploration phase.




        Input : P = (A, B, C, µ): MAB-BF problem;
                x, elimination factor
        Output: S, a sequence of arm pulls;
                µ̂, a vector of the arms’ sample mean rewards;
                G, total reward from all arm pulls;

        t ← 1; S ← ∅; G ← 0; RemainingBudget ← B;
        for i = 1 → K do
            µ̂_i ← 0
        A′ ← A ;                                          // initialize available arms
        while RemainingBudget ≥ min_{i∈A′} c_i do
            numPullsInPass ← 0; passAverageRatio ← 0;
            foreach a ∈ A′ do
                if RemainingBudget ≥ c_a then             // if budget allows a pull of this arm
                    pull arm a to obtain reward r_{a,t};
                    G ← G + r_{a,t};
                    µ̂_a ← (µ̂_a · n^S_a + r_{a,t}) / (n^S_a + 1) ;   // update mean reward estimate of arm
                    S ← S + a;
                    RemainingBudget ← RemainingBudget − c_a;
                    t ← t + 1;
                    passAverageRatio ← passAverageRatio + r_{a,t} / c_a;
                    numPullsInPass ← numPullsInPass + 1;
                else
                    A′ ← A′ \ {a} ;                       // eliminate arm

            if numPullsInPass > 0 then
                passAverageRatio ← passAverageRatio / numPullsInPass;
                A′ ← ∅
                foreach a ∈ A do
                    if µ̂_a / c_a ≥ (1 + x) · passAverageRatio then