Adaptive Budgeted Bandit Algorithms for Trust in a Supply-Chain Setting

Sandip Sen, Department of Computer Science, The University of Tulsa, sandip@utulsa.edu
Anton Ridgway, Department of Computer Science, The University of Tulsa, anton-ridgway@utulsa.edu
Michael Ripley, Department of Computer Science, The University of Tulsa, michael-ripley@utulsa.edu

Abstract

Recently, an AAMAS Challenges & Visions paper identified several key components of a comprehensive trust management approach that have been understudied by the research community [SEN13]. We believe that we can build on recent advances in closely related research in other sub-fields of AI and multiagent systems to address some of these issues. For example, the budgeted multi-armed bandit problem involves pulling multiple arms with stochastic rewards with the goal of maximizing the total reward generated from those arms, while keeping the cost of pulling the arms beneath a given budget. We argue that multi-armed bandit algorithms can be adapted to address research issues in the trust engagement and evaluation components of a comprehensive trust management approach. To support this proposition, we consider a supply-chain application, where a tree of dependent supplier agents can be considered as an arm of the online bandit problem with budget constraints. Each node in the supply chain must then solve its local bandit problem in parallel to determine which of its sub-suppliers is most trustworthy. We use new arm-selection strategies and demonstrate how they can be gainfully applied to trust-based decision making in the supply chain to reduce time to production, and hence improve utility through timely delivery of products.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: R. Cohen, R. Falcone and T. J. Norman (eds.): Proceedings of the 17th International Workshop on Trust in Agent Societies, Paris, France, 05-MAY-2014, published at http://ceur-ws.org

1 Introduction

Research on trust in multiagent systems has given us a variety of conceptual frameworks for viewing trust, as well as effective algorithms for evaluating the trustworthiness of other agents given the history of mutual interactions [CAS98, GAM90, HUY06, SEN02, VOG10, YU02, YOL05]. However, it has been posited that to fully harness the effectiveness of trust-based decision making, it is critical to develop a more comprehensive approach to trust management that addresses not only trust evaluation but also provides strategic reasoning procedures that determine whom to interact with, in what context, and how to best utilize the resultant knowledge about the trustworthiness of others [SEN13]. The goal then is to develop proactive trust mechanisms that explore possible fruitful partnerships, create situations where interactions will provide discriminatory evidence of the trustworthiness of potential partners, and use a well-thought-out plan of how to best utilize the trusted partners given expectations of future goals, plans and resource requirements. In this paper, we consider the problem of proactive engagement of potential long-term partners to determine who can be trusted to provide consistent service on which an agent's goal is critically dependent.
Whereas most of the research on trust in multiagent systems focuses on after-the-event, offline evaluation of interaction histories to determine the trustworthiness of partners, real-world scenarios demand forward-looking, online schemes that must choose which interaction partners to engage, where the process of engagement itself incurs a cost. This "exploration cost" must be balanced against the possible gain, or "exploitation benefit", obtained in the long run from knowing who the more trustworthy partners are. Research in machine learning, and in particular reinforcement learning, has long confronted similar exploration versus exploitation dilemmas [SUT98]. However, most of this research has ignored the time and cost of exploration, and focused primarily on proving convergence to optimal policies for solving the underlying Markov Decision Processes in the limit, i.e., without time or budget constraints [KAE96]. In particular, the multi-armed bandit (MAB) problem has been proposed as a theoretical framework for evaluating the utility of interacting with multiple entities with stochastically varying performance [AGR88, AUE02]. Though this would appear to be a natural mapping of the problem of evaluating trust in new partners, on closer inspection a key missing aspect underlines the fundamental difference between the two. The basic MAB model does not consider any cost for pulling an arm, whereas interacting with a partner to gather further information to evaluate their trustworthiness involves risks, time and resource commitment and consumption, as well as other costs. More recently, however, researchers have considered augmentations of the basic MAB problem that consider the cost of pulling arms in combination with a budget limit for exploration [DIN13, TRA10, TRA12], i.e., a resource constraint that necessitates strategic engagement with potential partners to quickly identify partnerships which are likely to deliver maximal long-term benefits. We believe that solution approaches for the Multi-Armed Bandit Problem with Budget Constraints (MAB-BF) can be adapted to address the issue of strategic engagement and evaluation of potential partners' trustworthiness under a number of settings and trust metrics such as reliability, fairness, quality of performance, timeliness, etc. The MAB-BF has already been utilized for a wide range of reinforcement learning applications, including bidding in ad exchanges, bid optimization in search, and service provider selection in cloud computing [ARD11, YEH11, BOR07, CHA10, GUH07].

Various metrics for determining the trustworthiness of a partner have been proposed. In this paper, we consider performance, in terms of reliability, as the objective criterion for trustworthiness. This choice is motivated by the following observation [SEN13]: trust in another agent reduces the uncertainty over that agent's independent actions, which positively correlates with the truster's utility. We consider existing and recently developed approaches to the MAB problem with cost and budget constraints for proactively engaging and evaluating potential partners to gauge their reliability and hence their long-term trustworthiness. To evaluate the efficacy of these approaches, we introduce a Supply Chain domain where a manufacturer has to procure raw materials from contractors of initially unknown reliability. Contractors in turn have to depend on sub-contractors to produce their deliverables.
The goal of each agent in the supply chain is to reduce its time to delivery of its product, for which it has to strategically engage its sub-contractors so as to discriminate between sub-contractors of differing reliability. We introduce three new algorithms for addressing the MAB-BF problem. We first formally present the MAB-BF problem (Section 2); then, we introduce recent algorithms proposed for this variant of the MAB problem, together with our own proposed algorithms, highlighting their interrelationships and differences (Section 3). We next introduce the supply chain domain used for evaluating the effectiveness of the proposed algorithms in determining the trustworthiness of the sub-contractors under varying budget constraints and performance variability (Section 4). In Section 5 we describe the simulation framework used for experiments, and then present a series of experimental results from a variety of scenarios in Section 6, varying the branching factor of the arms in each tree, the payoff distributions of the arms, and the budget available to sample from the arms. In conclusion, in Section 7 we summarize the key findings from the comparative experimental evaluation and present thoughts for future research.

2 Fixed-cost Multi-Armed Bandit with Budget Constraint

We now formally introduce the version of the MAB-BF problem where the goal of the agent is to maximize the reward obtained by pulling arms from a set A = {1, . . . , K}, where the ith arm has a fixed cost c_i per pull, with reward drawn from an unknown distribution with mean value µ_i. The agent has at its disposal a total budget of B. Let C = ⟨c_1, . . . , c_K⟩ and µ = ⟨µ_1, . . . , µ_K⟩ refer to the vectors of pulling costs and mean rewards for the K arms. We refer to the above problem as the Fixed-Cost Multi-Armed Bandit with Budget Constraint (MAB-BF), defined by the tuple P = (A, B, C, µ). The goal of an agent facing such a problem is to choose a sequence of arms that optimizes the expected reward without exceeding the total budget for arm pulls. Because the number of arm pulls in a given sequence is limited by the costs of the particular arms chosen, we generally consider the utility ratio µ_i/c_i rather than µ_i alone to determine the priority of a given arm i. It should be noted that while one could improve overall performance somewhat by incorporating other statistical measures together with the utility ratio to produce a more general utility index, the result would only be a general optimization, applicable to any of the algorithms we consider; for simplicity's sake, we choose to focus on the utility ratio only, and leave the determination of a best-case utility index for another setting. Finally, note that the µ values are not available to the agent. The number of arm pulls that an agent can make depends on the multiset of arms it decides to pull.
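To see why the reward-cost ratio, rather than the mean alone, is the natural priority under a budget, consider a hypothetical pair of arms: one with mean reward 5 and cost 4, and one with mean reward 3 and cost 2. The cheaper arm has the lower mean but the higher ratio (1.5 versus 1.25), and with a budget of 8 it yields an expected total reward of 12 versus 10. The short Python sketch below, with illustrative numbers of our own choosing, makes the same comparison.

```python
# Illustrative only: two hypothetical arms competing for a shared budget.
arms = [
    {"mean": 5.0, "cost": 4.0},   # higher mean, lower reward-cost ratio (1.25)
    {"mean": 3.0, "cost": 2.0},   # lower mean, higher reward-cost ratio (1.5)
]
budget = 8.0

for arm in arms:
    pulls = budget // arm["cost"]            # pulls affordable if the budget goes to this arm alone
    expected_total = pulls * arm["mean"]     # expected total reward of that sequence
    ratio = arm["mean"] / arm["cost"]
    print(f"ratio={ratio:.2f}  pulls={pulls:.0f}  expected total reward={expected_total:.0f}")
```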
We now introduce some notation to refer to sequences of arm pulls:

• S = ⟨a_1, . . . , a_t⟩ refers to the sequence of arms pulled by the agent in the first t attempts. We will use |S| to refer to the number of arm pulls in the sequence S, and S(i), ∀i ∈ {1, . . . , |S|}, to refer to the ith arm pulled in that sequence. We also denote the subsequence of pulls in S from t = t_1, . . . , t_2 as S_{t_1,t_2}. We will use a ∈ S to test for the existence of an arm a in the sequence S or to range over the set of arms in the sequence.

• n^S_a represents the number of times arm a ∈ A was pulled in sequence S.

• A_t is the set of arms which can still be pulled, after the sequence of arm pulls S_{1,t}, with the remaining budget B_t = B − Σ_{i=1}^{t} c_{S(i)}, i.e., A_t = {a | a ∈ A ∧ c_a ≤ B_t}.

• A sequence S is valid if it does not violate the budget constraint, i.e., Σ_{a∈S} c_a ≤ B.

• 𝒮 = {S | S is valid} is the set of all valid sequences. For the rest of the paper we will use the term 'sequence' only for valid sequences.

• G_S = Σ_{i=1}^{|S|} r_{a_i,i} is the total reward obtained from a valid sequence S, where r_{a_i,i} is the reward returned from arm a_i ∈ A pulled at time i. As the rewards generated are non-deterministic, we are more interested in the sum of the mean rewards in the sequence; that is, the Expected Total Reward of sequence S, E[G_S] = Σ_{i=1}^{|S|} µ_{a_i}. Correspondingly, we are interested in the optimal sequence S* if the reward distributions for the arms were known,

  S* = arg max_{S ∈ 𝒮} E[G_S]   (1)

and the associated payoff E[G_{S*}]. Note that since the mean rewards for the arms are not known a priori, we will only know G_S.

• Given a sequence of arm pulls S, an estimate of the mean reward for each of the sampled arms, µ^S, can be formed. Then µ^S = ⟨µ^S_1, . . . , µ^S_K⟩, where µ^S_i, ∀i ∈ A, is the average reward obtained from the n^S_i pulls of arm i in the sequence S:

  µ^S_i = (1 / n^S_i) Σ_{j=1}^{|S|} I(S(j) = i) r_{i,j}   (2)

where I(·) is the indicator function.

• The expected regret R(S) of the sequence S is then calculated as the difference R(S) = E[G_{S*}] − E[G_S]. The desirability of a sequence of arm pulls can be measured by its expected regret, and the problem of optimizing expected reward can be mapped into the problem of minimizing expected regret.
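To make the definitions above concrete, the following is a minimal Python sketch of an MAB-BF instance P = (A, B, C, µ) and of the expected total reward of a pull sequence. The class name and the reward model (uniform noise around the hidden means) are illustrative assumptions of ours, not part of the formal problem statement.

```python
import random

class BudgetedBandit:
    """A minimal fixed-cost, budget-constrained bandit instance P = (A, B, C, mu)."""

    def __init__(self, costs, means, budget):
        self.costs = list(costs)    # C = <c_1, ..., c_K>, fixed cost per pull
        self.means = list(means)    # mu = <mu_1, ..., mu_K>, hidden from the agent
        self.budget = budget        # remaining budget B_t
        self.sequence = []          # S, the sequence of arms pulled so far

    def affordable_arms(self):
        """A_t: arms that can still be pulled with the remaining budget."""
        return [i for i, c in enumerate(self.costs) if c <= self.budget]

    def pull(self, i):
        """Pull arm i: charge c_i against the budget and return a stochastic reward."""
        assert self.costs[i] <= self.budget, "pull would make the sequence invalid"
        self.budget -= self.costs[i]
        self.sequence.append(i)
        # Illustrative reward model: uniform noise around the (hidden) mean mu_i.
        return random.uniform(0.0, 2.0 * self.means[i])

def expected_total_reward(sequence, means):
    """E[G_S]: the sum of the mean rewards of the arms pulled in the sequence S."""
    return sum(means[a] for a in sequence)
```

The expected regret of a sequence S is then simply the difference between expected_total_reward for the optimal sequence S* and for S.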
3 Algorithms for the MAB-BF problem

In this section we introduce the different algorithms we experiment with. We first introduce some existing algorithms and then discuss our proposed approaches.

3.1 Existing Algorithms

We now describe five existing algorithms. The first two (ε-first, Greedy) have an initial exploration phase for estimating the mean rewards of the K arms, and in the subsequent exploitation phase pull the arms in a greedy manner; the next two (fKUBE, UCB-BV) are Upper Confidence Bound algorithms that integrate the exploration and exploitation phases by comparing relative confidence for each arm; and the last (fKDE) stochastically integrates exploration and exploitation while narrowing its focus.

Budget-Limited ε-first: This algorithm, henceforth referred to as ε-first, uniformly selects from the set of arms, performing unordered sweeps of each before beginning again, until its exploration budget, εB, is exhausted, i.e., not enough budget remains to pull even the minimum-cost arm (we only consider scenarios where the exploration budget allows us to pull every arm the same number of times in the exploration phase). Let S^explore_ε-first(P) be the sequence of arm pulls for the exploration phase of an ε-first approach with MAB-BF problem P (we will omit the problem argument P in cases where the context is clear). At the end of the exploration phase, an estimate of the mean rewards, µ^{S^explore_ε-first}, is calculated and subsequently used in the exploitation phase. Next, the arms are sorted by the ratio of their estimated mean to pulling cost (their reward-cost ratio). The best such arm, I^max = arg max_{i∈A} µ^{S^explore_ε-first}_i / c_i, which we refer to as the active arm, is pulled until not enough of the exploitation budget, (1 − ε)B, is left for one more pull of that arm. This process is repeated with the rest of the arms until the budget is completely exhausted (not enough remains to pull even the lowest-cost arm). We refer to the sequence of arms pulled in this exploitation phase as S^exploit_ε-first(P) and the combined exploration-exploitation sequence as S_ε-first = S^explore_ε-first(P) + S^exploit_ε-first(P). This algorithm is presented in Algorithm 1. One key difference between the original formulation of this algorithm [TRA10] and our implementation of it is that our version continues to update the estimates of the arms during the exploitation phase, so that the active arm in that phase can change if its utility ratio drops below that of another arm, even if its cost is still affordable. We refer to our versions of algorithms that behave this way as online-exploitation variants, in contrast to the original offline-exploitation variants. Experiments have shown that the online variants significantly outperform the offline variants in most scenarios, and hence in this paper we present results with the online variants only, unless otherwise noted.

Input: P = (A, B, C, µ): MAB-BF problem; ε, exploration fraction of budget
Output: S, a sequence of arm pulls; µ′, a vector of estimated mean rewards for the K arms; G, total reward from all arm pulls
Exploration phase:
  t ← 1; S ← ∅; G ← 0; ExplorationRounds ← ⌊εB / Σ_{i=1}^{K} c_i⌋
  for i = 1 → K do µ′_i ← 0
  for n = 1 → ExplorationRounds do
    for i = 1 → K do
      pull arm i to obtain r_{i,t}; G ← G + r_{i,t}
      S ← S + i                              // add i to sequence S
      µ′_i ← µ′_i + r_{i,t}
      t ← t + 1
  for i = 1 → K do
    µ′_i ← µ′_i / n^S_i                      // form reward mean estimates
Exploitation phase:
  RemainingBudget ← B − ExplorationRounds × Σ_{i=1}^{K} c_i
  A′ ← A                                     // initialize available arms
  while RemainingBudget ≥ min_{i∈A′} c_i do
    I^max ← arg max_{i∈A′} µ′_i / c_i        // pick best arm
    if RemainingBudget ≥ c_{I^max} then      // if budget allows pulling this arm
      pull arm I^max to get reward r_{I^max,t}
      G ← G + r_{I^max,t}
      µ′_{I^max} ← (µ′_{I^max} · n^S_{I^max} + r_{I^max,t}) / (n^S_{I^max} + 1)   // update mean reward estimate
      S ← S + I^max
      RemainingBudget ← RemainingBudget − c_{I^max}
      t ← t + 1
    else
      A′ ← A′ \ {I^max}                      // eliminate arm
  return (S, µ′, G)

Algorithm 1: The ε-first algorithm.

Greedy Algorithm: The Greedy algorithm is a special case of the ε-first algorithm where the exploration budget only allows each arm to be pulled once before the exploitation phase begins. This algorithm is remarkable because, to our knowledge, it has not been used before for the MAB-BF problem, yet its online-exploitation variant nevertheless performs exceptionally well on average. Note that since we require all arms to be pulled at least once, when ε ≤ (Σ_{i=1}^{K} c_i) / B, ε-first algorithms reduce to the Greedy algorithm.
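The following is a rough Python sketch of the online-exploitation ε-first loop described above; it reuses the hypothetical BudgetedBandit class from the earlier sketch and is not the authors' implementation. Fixing the exploration phase to a single sweep of the arms gives the Greedy special case.

```python
def epsilon_first_online(bandit, epsilon):
    """Online-exploitation epsilon-first (sketch): estimates keep updating during exploitation."""
    K = len(bandit.costs)
    sweep_cost = sum(bandit.costs)
    exploration_rounds = int((epsilon * bandit.budget) // sweep_cost)

    totals = [0.0] * K   # summed rewards per arm
    counts = [0] * K     # number of pulls per arm

    # Exploration phase: uniform, unordered sweeps over all K arms.
    for _ in range(exploration_rounds):
        for i in range(K):
            totals[i] += bandit.pull(i)
            counts[i] += 1

    # Exploitation phase: always pull the affordable arm with the best estimated
    # reward-cost ratio, re-estimating after every pull (the online variant).
    while True:
        candidates = [i for i in bandit.affordable_arms() if counts[i] > 0]
        if not candidates:
            break
        best = max(candidates, key=lambda i: (totals[i] / counts[i]) / bandit.costs[i])
        totals[best] += bandit.pull(best)
        counts[best] += 1

    estimates = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(K)]
    return bandit.sequence, estimates
```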
Fractional KUBE (fKUBE): The fractional Knapsack-based Upper Confidence Bound Exploration and Exploitation algorithm (fKUBE) [TRA10] is based on a class of upper-confidence bound (UCB) algorithms originally developed for the MAB problem without budget constraints. This algorithm has a minimal exploration phase, pulling each arm exactly once, and thereafter uses a confidence index based on the reward-cost ratio of the arms to determine which arm a_i should be pulled at time t + 1. The index is given by the following, where A_t is the set of arms still affordable at time t, µ′_{i,t} is the sample mean of a_i at t, and n_{i,t} is the number of times that a_i has been pulled by t:

  arg max_{i ∈ A_t} ( µ′_{i,t} + √(2 ln t / n_{i,t}) ) / c_i   (3)

UCB with Budget Constraint and Variable Costs (UCB-BV): UCB-BV, created for the budget-constrained MAB with the added complication of variable costs for each arm, inherits from the UCB algorithms [DIN13]. It begins with a minimal exploration phase, followed by a mixed exploration-exploitation phase driven by a confidence index. After each sweep of the arms, the arm with the highest index value becomes the active arm. This bound, for arm i at time t, where n_{i,t} represents the number of times it has been pulled by this time, is given by:

  arg max_{i ∈ A} [ µ′_{i,t} / c_i + ( (1 + 1/min_j c_j) √(ln(t−1) / n_{i,t}) ) / ( min_j c_j − √(ln(t−1) / n_{i,t}) ) ]   (4)

Fractional Knapsack-based Decreasing ε-greedy (fKDE): The fractional KDE algorithm [TRA10] is a decreasing-exploration-over-time approach, which first pulls the arms uniformly γ times and thereafter pulls the arm with the highest estimated reward-cost ratio with increasing probability. The probability of uniform random exploration after t arm pulls is set to ε_t = min(1, γ/t); otherwise, the arm with the highest estimated reward-cost ratio, based on the sequence of arms pulled, is chosen.
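To make the difference between the two confidence indices concrete, here is a small Python sketch of the per-arm index computations in Equations 3 and 4. The function names and bookkeeping are our own; both functions assume each arm has already been pulled at least once (so n_{i,t} ≥ 1 and t ≥ 2), which the minimal exploration phase guarantees.

```python
import math

def fkube_index(mu_hat, n_pulls, cost, t):
    """fKUBE (Eq. 3): optimistic estimate of the mean, divided by the arm's cost."""
    bonus = math.sqrt(2.0 * math.log(t) / n_pulls)
    return (mu_hat + bonus) / cost

def ucb_bv_index(mu_hat, n_pulls, cost, t, min_cost):
    """UCB-BV (Eq. 4): reward-cost ratio plus a cost-aware exploration term."""
    e = math.sqrt(math.log(t - 1) / n_pulls)
    # Note: the denominator can be small or non-positive very early on,
    # while the exploration term still exceeds the minimum pulling cost.
    return mu_hat / cost + (1.0 + 1.0 / min_cost) * e / (min_cost - e)

def select_arm(mu_hats, counts, costs, affordable, t, rule="fkube"):
    """Pick the affordable arm with the highest confidence index (illustrative only)."""
    min_cost = min(costs)
    def score(i):
        if rule == "fkube":
            return fkube_index(mu_hats[i], counts[i], costs[i], t)
        return ucb_bv_index(mu_hats[i], counts[i], costs[i], t, min_cost)
    return max(affordable, key=score)
```

In both cases the arm maximizing the index becomes the active arm, selected on every pull for fKUBE and after every sweep for UCB-BV.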
3.2 New Algorithms

We now present the algorithms that we have developed for the MAB-BF problem. We believe these algorithms explore more intelligently than their predecessors: the choice of arms to be pulled during exploration is driven either by the number of arms |A|, or by the distribution of the arm rewards in µ and the costs in C, whereas past algorithms only made use of uniform or minimal exploration phases, and fixed the exploration budget without consideration of the bandit at hand. Also note that all the algorithms that we introduce are online algorithms, i.e., an eliminated arm may be reconsidered if the reward-cost ratio of a sufficient number of previously preferred arms drops below the corresponding ratio of this arm upon further sampling.

l-split (lS): This is a generalized Greedy approach. Instead of eliminating all but one arm after the first pass, the lS algorithm successively eliminates (1 − 1/l) of the arms after each pass: if A_lS(p) is the number of surviving arms after p splitting passes of the algorithm, then A_lS(p+1) = ⌈(1/l) A_lS(p)⌉. After ⌈log_l K⌉ passes, lS narrows down the choice to one arm, and thereafter performs greedy exploitation. The algorithm is presented in detail in Algorithm 2. The simplest of this family of algorithms is the 2-split (2S) or halving algorithm, which successively eliminates approximately half of the underperforming arms after each pass.

Progressive exploration ε-first (PEEF): The PEEF algorithm was developed upon a careful evaluation of the ε-first algorithm. Whereas the latter expends its exploration budget uniformly over the set of K arms, we conjectured that in a number of scenarios, particularly those with a large number of arms, there might be some low-return arms that can be quickly discarded. More importantly, we believe that the exploration budget can be better utilized to tease out differences between similar, high reward-cost ratio arms by visiting them with increasing frequency. Hence, rather than uniform exploration, we propose a progressive exploration scheme where we perform an l-split operation, as in the lS algorithm, after each pass. The difference with that algorithm is that in PEEF the splitting value l is calculated such that the number of remaining arm choices is reduced to 1 approximately at the end of the exploration phase. Given an exploration budget of εB, we want l to be such that the following condition holds:

  Σ_{j=1}^{log_l K} (K / l^j) c_avg = εB,   (5)

where c_avg = (1/K) Σ_{i=1}^{K} c_i is the average cost of pulling an arm. Solving this equation for l, we obtain l = (εB − K)/(εB − 1). (Note that this calculation of l is necessarily approximate, both in the use of c_avg as the cost per unit pull during the exploration phase and in the use of log_l K as the number of passes during the exploration phase.) For obvious reasons, we will perform a pairwise comparison of the ε-first algorithm and the corresponding PEEF algorithm for different scenarios in the experimental section.

Input: P = (A, B, C, µ): MAB-BF problem; l, elimination factor
Output: S, a sequence of arm pulls; µ′, a vector of the arms' sample mean rewards; G, total reward from all arm pulls
  t ← 1; S ← ∅; G ← 0; RemainingBudget ← B; NumPasses ← 0
  for i = 1 → K do µ′_i ← 0
  A′ ← A                                     // initialize available arms
  while A′ ≠ ∅ do
    foreach a ∈ A′ do
      if RemainingBudget ≥ c_a then          // if budget allows pulling this arm
        pull arm a to obtain reward r_{a,t}
        G ← G + r_{a,t}
        µ′_a ← (µ′_a · n^S_a + r_{a,t}) / (n^S_a + 1)   // update mean reward estimate of arm a
        S ← S + a
        RemainingBudget ← RemainingBudget − c_a
        t ← t + 1
    NumPasses ← NumPasses + 1
    A′ ← ∅
    NumToPull ← ⌈K / l^NumPasses⌉
    while NumToPull > 0 and A − A′ ≠ ∅ do
      A′ ← A′ ∪ {arg max_{i∈A} µ′_i / c_i | c_i ≤ RemainingBudget, i ∉ A′}
      NumToPull ← NumToPull − 1
  return (S, µ′, G)

Algorithm 2: The l-Split algorithm.

Survival of the Above Average (SOAAv): This algorithm also successively narrows down the set of active arms by eliminating underperforming arms. Rather than eliminating a fixed number of arms after each pass, however, it eliminates arms whose estimated reward-cost ratio is below (1 + x) times the average of such ratios over the arms in the last pass. Setting x = 0 means that only above-average arms survive from one pass to the next. Note again that this is an online approach, where a previously eliminated arm can come back into the active set if the estimates of other active arms drop. This algorithm is presented in more detail in Algorithm 3.

Of the above algorithms, both lS and PEEF are rank-based algorithms, where arms are eliminated based on their ranking by reward-cost ratios, whereas only the SOAAv algorithm is a value-based approach, where arms with estimated reward-cost ratios below a certain factor of the average of the currently active set are eliminated.
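To contrast the two kinds of elimination rule, the sketch below shows the survivor-selection step of a single pass in isolation: the rank-based rule keeps the top ⌈K / l^p⌉ arms by estimated reward-cost ratio, while the value-based rule keeps every arm whose ratio is at least (1 + x) times the pass average. The function names, data layout, and example numbers are our own simplifications.

```python
import math

def lsplit_survivors(ratios, l, num_passes):
    """Rank-based (lS/PEEF-style): keep the ceil(K / l**num_passes) best arms by ratio."""
    K = len(ratios)
    keep = max(1, math.ceil(K / (l ** num_passes)))
    ranked = sorted(range(K), key=lambda i: ratios[i], reverse=True)
    return set(ranked[:keep])

def soaav_survivors(ratios, x):
    """Value-based (SOAAv): keep arms whose ratio is at least (1 + x) times the pass average."""
    avg = sum(ratios) / len(ratios)
    return {i for i, r in enumerate(ratios) if r >= (1.0 + x) * avg}

# Example with made-up reward-cost ratio estimates after one pass:
ratios = [1.4, 0.9, 1.1, 0.6, 1.3]
print(lsplit_survivors(ratios, l=2, num_passes=1))   # roughly the top half of the arms
print(soaav_survivors(ratios, x=0.0))                # only above-average arms
```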
4 Supply-chain Model

We now introduce the supply-chain model, in which contractors have to engage strategically with sub-contractors of initially unknown reliability (trustworthiness). Each interaction has a cost, and the contractor's goal is to distinguish the trustworthiness of all of its sub-contractors given a fixed budget to pay for the interactions. This domain allows us to examine

Input: P = (A, B, C, µ): MAB-BF problem; x, elimination factor
Output: S, a sequence of arm pulls; µ′, a vector of the arms' sample mean rewards; G, total reward from all arm pulls
  t ← 1; S ← ∅; G ← 0; RemainingBudget ← B
  for i = 1 → K do µ′_i ← 0
  A′ ← A                                     // initialize available arms
  while RemainingBudget ≥ min_{i∈A′} c_i do
    numPullsInPass ← 0; passAverageRatio ← 0
    foreach a ∈ A′ do
      if RemainingBudget ≥ c_a then          // if budget allows pulling this arm
        pull arm a to obtain reward r_{a,t}
        G ← G + r_{a,t}
        µ′_a ← (µ′_a · n^S_a + r_{a,t}) / (n^S_a + 1)   // update mean reward estimate of arm a
        S ← S + a
        RemainingBudget ← RemainingBudget − c_a
        t ← t + 1
        passAverageRatio ← passAverageRatio + r_{a,t} / c_a
        numPullsInPass ← numPullsInPass + 1
      else
        A′ ← A′ \ {a}                        // eliminate arm
    if numPullsInPass > 0 then
      passAverageRatio ← passAverageRatio / numPullsInPass
    A′ ← ∅
    foreach a ∈ A do
      if c_a ≤ RemainingBudget and µ′_a / c_a ≥ (1 + x) · passAverageRatio then
        A′ ← A′ ∪ {a}                        // arm survives into (or re-enters) the active set
  return (S, µ′, G)

Algorithm 3: The SOAAv algorithm.
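As a rough illustration of the mapping described above, each node in the supply chain can be viewed as running its own local MAB-BF instance over its sub-contractors. The sketch below reuses the hypothetical BudgetedBandit and epsilon_first_online sketches from earlier; the node structure, reliability-based reward model, and parameter values are our own illustrative assumptions, not the simulation framework described in Section 5.

```python
class SupplyChainNode:
    """A contractor with a fixed interaction budget over sub-contractors of unknown reliability."""

    def __init__(self, name, subcontractor_reliabilities, interaction_costs, budget):
        self.name = name
        # One arm per sub-contractor; its hidden mean reflects that sub-contractor's reliability.
        self.bandit = BudgetedBandit(costs=interaction_costs,
                                     means=subcontractor_reliabilities,
                                     budget=budget)

    def evaluate_subcontractors(self, epsilon=0.2):
        """Spend the interaction budget to estimate each sub-contractor's reliability."""
        _, estimates = epsilon_first_online(self.bandit, epsilon)
        return estimates

# Illustrative use: a manufacturer with three sub-contractors of differing reliability.
node = SupplyChainNode("manufacturer",
                       subcontractor_reliabilities=[0.9, 0.6, 0.75],
                       interaction_costs=[1.0, 1.0, 2.0],
                       budget=60.0)
print(node.evaluate_subcontractors())
```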