=Paper= {{Paper |id=Vol-2419/paper8 |storemode=property |title=Computational Strategies for the Trustworthy Pursuit and the Safe Modeling of Probabilistic Maintenance Commitments |pdfUrl=https://ceur-ws.org/Vol-2419/paper_8.pdf |volume=Vol-2419 |authors=Qi Zhang,Edmund Durfee,Satinder Singh |dblpUrl=https://dblp.org/rec/conf/ijcai/ZhangDS19 }} ==Computational Strategies for the Trustworthy Pursuit and the Safe Modeling of Probabilistic Maintenance Commitments== https://ceur-ws.org/Vol-2419/paper_8.pdf
                  Computational Strategies for the Trustworthy Pursuit
            and the Safe Modeling of Probabilistic Maintenance Commitments

                                Qi Zhang , Edmund Durfee , Satinder Singh
                            Computer Science and Engineering, University of Michigan
                                       {qizhg,durfee,baveja}@umich.edu


                          Abstract                                  worthy agent, given that an agent often lacks complete con-
                                                                    trol over its environment. Specifically, the form of interde-
     Most research on probabilistic commitments fo-                 pendency we focus on is with respect to a scenario where an
     cuses on commitments to achieve conditions for                 agent (the commitment provider) makes a social commitment
     other agents. Our work reveals that probabilis-                [Singh, 1999; Kalia et al., 2014] to another (the commitment
     tic commitments to instead maintain conditions for             recipient). When stochasticity is inherent in the environ-
     others are surprisingly different from their achieve-          ment, the provider cannot guarantee to bring about the out-
     ment counterparts, despite strong semantic similar-            comes that the recipient expects [Kwiatkowska et al., 2007;
     ities. We focus on the question of how the com-                Nuzzo et al., 2019], and in fact could discover after making
     mitment recipient should model the provider’s ef-              the commitment that how it planned to try to bring about the
     fect on the recipient’s local environment, with only           outcomes would be more costly or risky than it had previ-
     imperfect information being provided in the com-               ously realized. Given that the recipient is unable to predict
     mitment specification. Our theoretic analyses show             precisely the future situations it will face, it’s also unclear
     that we can more tightly bound the inefficiency                how it should model the commitment safely and effectively.
     of this imperfect modeling for achievement com-
     mitments than for maintenance commitments. We                      There exists work focusing on semantics and mechanisms
     empirically demonstrate that probabilistic mainte-             for an agent to follow such that it is assured of faithfully
     nance commitments are qualitatively more chal-                 pursuing its commitments despite the uncertainty [Jennings,
     lenging for the recipient to model, and addressing             1993; Xing and Singh, 2001; Winikoff, 2006; Durfee and
     the challenges can require the provider to adhere to           Singh, 2016]. Previous work articulated the perspective that
     a more detailed profile and sacrifice flexibility.             such a probabilistic commitment should be considered ful-
                                                                    filled if the provider’s actions would have brought about the
                                                                    desired outcome with a high enough expectation, even if in
1   Introduction                                                    a particular instance the desired outcome was not realized.
Safe cooperative behavior among humans is often realized via        That is, the provider acted in good faith. Thus, even if the
social commitments that constrain people to acting reliably.        provider changes its course of action as it learns more about
By making a commitment, a person promises to act in a man-          costs and risks on the fly, it can still fulfill its commitment
ner to fulfill it. This form of commitment-based interaction        if whatever course of action it pursued could be expected to
also exists among autonomous agents. To build safe artificial       achieve the desired outcome with at least the promised like-
multiagent systems, we need trustworthy and reliable mecha-         lihood. With this perspective, previous work has focused
nisms that are accountable for pursuing and modeling agent-         largely on commitments of achievement [Xuan and Lesser,
based commitments. Adopting a decision-theoretic frame-             1999; Maheswaran et al., 2008; Witwicki and Durfee, 2009;
work, this paper formulates and studies problems that arise         Zhang et al., 2016; Zhang et al., 2017], which we also call
in managing commitments in multiagent systems.                      enablement commitments, where the provider commits to
   In multiagent systems, agents are often interdependent in        changing some features of the state in a way desired by the re-
the sense that what one agent does can be beneficial or harm-       cipient with some probability by some time. For example, the
ful to another. If trust means having confidence that another       recipient plans to take an action (e.g., move from one room to
will act so as to reciprocate help in a safe and reliable manner,   another) with a precondition (e.g., the door separating them
then being trustworthy—worthy of such trust—constrains the          is open) that it is counting on the provider to enable.
agent to acting thusly. To persuade an agent designer to create         This paper focuses on another form of commitment, which
trustworthy agents, other agents (individually and/or collec-       we refer to as a maintenance commitment, where instead of
tively) can form and share opinions about agents’ trustwor-         committing to some course of action that in expectation will
thiness, and won’t act to benefit agents with a bad reputation.     enable conditions the recipient wants, the provider instead
   Our work assumes that the designer has been persuaded.           commits to courses of action to probabilistically avoid chang-
Even so, however, it isn’t always clear how to create a trust-      ing conditions that are already the way the recipient wants
them maintained, up until a particular time. After that time,              We assume the recipient’s cumulative reward can be ex-
the condition cannot be assumed to remain unchanged, and                   pressed in the trajectory of l:
before that time, there is a (usually small) probability it could
be changed at any point. For example, an open door the re-                          R(s, a, s0 ) = R(s0 ) = R((l0 , u0 )) = R(l0 ).
cipient needs might be initially achieved, but as the provider             Note that though the value of u does not directly determine
opens and closes other doors during its housekeeping tasks, a              the reward, it does affect the value of l0 at the next time step.
resulting draft could close the door the recipient needs open.             Throughout, we refer to Pu as the true profile of u, which is
The provider could plan its tasks to postpone altering the                 fully determined by the provider’s policy.
riskiest doors as long as possible, but an ill-placed breeze
could close the door at any time.                                          2.1   Commitment Semantics
   Our claim is that decision-theoretic mechanisms for repre-              An enablement or maintenance commitment is concerned
senting and reasoning about enablement commitments can-                    with state feature u that is shared by both agents but only
not straightforwardly apply to maintenance commitments be-                 controllable by the provider. Intuitively, a commitment pro-
cause, despite strong superficial similarities, the two types of           vides partial information about Pu from which the recipient
commitments are fundamentally different. We will substanti-                can plan accordingly. We will refer to u as the commit-
ate this claim analytically and empirically. This in turn raises           ment feature. In this paper, we focus on the setting where
the questions of whether modifications could be made to                    the value of u is binary, letting u+ , as opposed to u− , be
existing decision-theoretic mechanisms for representing and                the value of u that is desirable for the recipient. Further, we
reasoning about enablement commitments so that they can be                 assume that u can be toggled at most once. Citations with
applied to maintenance commitments. We will show empir-                    this assumption include [Hindriks and van Riemsdijk, 2007;
ical results that cast doubt on whether this is possible, thus             Witwicki and Durfee, 2009; Zhang et al., 2016]. In transac-
suggesting that in the future a different treatment of mainte-             tional settings, a feature (e.g., possession of goods) changing
nance commitments should be considered.                                    once is common. It is also common in multiagent planning
                                                                           domains where one agent establishes a precondition needed
2    Preliminaries                                                         by an action of another. Some cooperative agent work re-
In this section, we describe the decision-theoretic setting                quires agents to return changed features to prior values (e.g.,
we adopt for analyzing probabilistic commitments, includ-                  shutting the door after opening and passing through it). And
ing both enablement commitments and maintenance commit-                    in the extreme case where toggling reliably repeats frequently
ments. We review the prior work in the definition and seman-               (e.g., a traffic light) there may be no need for explicit commit-
tics of enablement commitments, which can be extended to                   ments. More generally, while removing this assumption can
maintenance commitments.                                                   complicate the specification of a commitment (e.g., a com-
   The recipient’s environment is modeled as an MDP de-                    pound commitment to enable and then maintain a condition
fined by the tuple M = (S, A, P, R, H, s0 ) where S is the                 over a time interval), we think the fundamental difference be-
finite state space, A is the finite action space, P : S × A →              tween modeling enablement and maintenance commitments
∆(S) (∆(S) denotes the set of all probability distributions                can be best theoretically explained and conceptually under-
over S), R : S × A × S → R is the reward function, H is                    stood without such complications. We next formally give
the finite horizon, and s0 is the initial state. The state space is        the definition and semantics of enablement commitments and
                                                          SH
partitioned into disjoint sets by the time step, S = h=0 Sh ,              maintenance commitments, respectively.
where states in Sh only transition to states in Sh+1 . The                 Enablement Commitments
MDP starts in s0 and terminates in SH . Given a policy                     Let the initial state be factored as s0 = (l0 , u0 ). For enable-
π : S → A and starting in the initial state, a random tra-                 ment commitments, the initial value of the commitment fea-
jectory is generated by ah = π(sh ), sh+1 ∼ P (sh , ah ), rh =             ture is u− , i.e. u0 = u− . The provider commits to pursuing a
R(sh , ah , sh+1 ) for h = 0, · · · , H − 1. The value function of         course of action that can bring about the commitment feature
        π
                     PH−1
π is VM   (s) = E[ h0 =h rh0 |π, sh = s] where h is such that              desirable to the recipient with some probability. Formally, an
                                                                 ∗
s ∈ Sh . There exists an optimal policy in M , denoted as πM       ,       enablement commitment is defined by tuple ce = (Te , pe ),
                                        π
and its value function maximizes VM for all s ∈ S and is ab-               where Te is the enablement commitment time, and pe is the
                 ∗
breviated as VM    . The value of the initial state is abbreviated         enablement commitment probability. The provider’s commit-
     π        π
as vM   := VM   (s0 ).                                                     ment semantics is to follow a policy µ that sets u to u+ by
   As one way to model the interaction between the provider                time step Te with at least probability pe , i.e.
and the recipient [Witwicki and Durfee, 2010; Zhang et al.,
2016], we assume that the recipient’s state can be factored as                           Pr(uTe = u+ |u0 = u− , µ) ≥ pe .
s = (l, u), where l is the set of all the recipient’s state features       Maintenance Commitments
locally controlled by the recipient, and u is the state feature            As a reminder, our maintenance commitment is motivated by
shared with the provider. The provider and the recipient are               scenarios where the initial value of state feature u is desir-
weakly coupled in the sense that u is the only shared state                able to the recipient, who wants it to maintain its initial value
feature and is only controllable by the provider. Formally, the            for some interval of time (e.g., [Hindriks and van Riemsdijk,
dynamics of the recipient’s state can be factored as                       2007; Duff et al., 2014]), but where the provider could want
P (s0 |s, a) = P ((l0 , u0 )|(l, u), a) = Pu (u0 |u)Pl (l0 |(l, u), a) .   to take actions that could change it. Formally, a maintenance
commitment is defined by tuple cm = (Tm , pm ), where Tm               from time step t = Te − 1 to t = Te with probability pe , and
is the maintenance commitment time, and pm is the main-                does not toggle u at any other time step.
tenance commitment probability. Given such a maintenance               Definition 2. Given maintenance commitment cm =
commitment, the provider is constrained to follow a policy                                                   pessimistic
                                                                       (Tm , pm ), its pessimistic profile Pbu,c         toggles u in the
µ that keeps u unchanged for the first Tm time steps with at                                                     m
                                                                       transition from time step t = 0 to t = 1 with probability
least probability pm . Since u can be toggled at most once, it
                                                                       1 − pm , and from t = Tm to t = Tm + 1 with probability
is equivalent to guaranteeing u = u+ at Tm , i.e.
                                                                       one. It does not toggle u at any other time step.
              Pr(uTm = u0 |u0 = u+ , µ) ≥ pm .                            In a previous, unpublished workshop paper [Zhang et al.,
2.2   The Recipient’s Approximate Profile                              2018], we have addressed the general topic of using an ap-
                                                                       proximate profile, especially the pessimitisc profile, for the
The commitment semantics provides partial information on               recipient to model probabilistic commitments, but didn’t pro-
Pu by specifying the profile at the single time step, rather           vide any results. This paper presents both theoretical anal-
than in a potentially more gradual manner. Why would we                ysis (Section 3) and empirical results (Section 4) that reveal
choose to do that? The most important reason is that it opens          the fundamental difference between enablement and mainte-
up even more latitude for the provider to evolve its policy as it      nance commitments.
learns more about its environment: instead of needing to meet
probabilistic expectations at multiple time steps, it can mod-
ify its policy much more flexibly as long as in the long run           3   Theoretical Analysis
(by the commitment time) it hits the target probability. Prior         In this section, we derive bounds on the suboptimality of the
work has shown the value of having such flexibility [Zhang             pessimistic profiles. Our analysis makes the following two
et al., 2017]. Here, we are interested in the problem when             assumptions. Assumption 1 intuitively says that u+ estab-
the recipient creates a profile Pbu with this partial information      lishes a condition for an action that would be irrational, or
as an approximation of Pu , and plans accordingly. Specifi-            even unsafe, to take when u− holds. For example, if u+ is
cally, we are interested in the quality of the plan computed           a door being open, then the action of moving into the door-
from approximate profile Pbu when evaluated in (true) profile          way could be part of an optimal plan, but taking that action if
                                 c = (S, A, Pb, R, H, s0 ) be the      the door is closed (u− ) never is. Assumption 2 is a simplify-
Pu . Formally, given Pbu , let M
                                                                       ing assumption for our analysis stating the true profile agrees
approximate model that only differs from M in terms of the
                                                                       with the pessimistic profile after the commitment time, so that
profile of u, i.e. Pb = (Pl , Pbu ). The quality of Pbu is evaluated   the suboptimality is caused by the imperfect modeling by the
using the difference between the value of the optimal policy           commitment time.
for Mc and the value of the optimal policy for M when both
                                                                       Assumption 1. Let s− = (l, u− ) and s+ = (l, u+ ) be a pair
policies are evaluated in M starting in s0 , i.e.
                                                                       of states that only differ in u. For any M with arbitrary profile
                                  ∗          π∗                        Pu , we have
                 Suboptimality : vM − vMM
                                        c
                                          .
                                                                                  Pl ·|s− , πM
                                                                                             ∗
                                                                                                (s− ) = Pl ·|s+ , πM
                                                                                                                   ∗
                                                                                                                      (s− ) .
                                                                                                                          
Note that when the support of Pu is not fully contained in
the support of Pbu , the recipient could end up in un-modelled         Assumption 2. Pu (uh+1 |uh ) agrees with the pessimistic
                             ∗                             π∗
states when executing πM     c in M , which makes VM
                                                           M
                                                               ill-    profile for h ≥ T , where T is the commitment time.
                                                           c


defined. In this paper, we fix this by re-planning: during exe-           To derive bounds on enablement and maintenance commit-
             ∗
cution of πM c, the recipient re-plans from un-modelled states.        ments, we will make use of the following lemma, where M +
   Previous work chooses an intuitive and straightforward ap-          (M − ) is defined as the recipient’s MDP identical to M ex-
proximate profile for enablement commitments that models a             cept that u is always set to u+ (u− ). Lemma 1 directly follows
single branch (at the commitment time) for when u− proba-              from Assumption 1, stating that the value of M − is no more
bilistically toggles to u+ . This strategy reduces the complex-        than that of M + and the value of any M is between the two.
ity of the recipient’s reasoning; the inefficiency caused by
                                                                       Lemma 1. For any M with arbitrary profile Pu and initial
imperfect modelling is easily outweighed by computational                                   ∗      ∗    ∗
                                                                       value of u, we have vM − ≤ vM ≤ vM + .
benefits [Witwicki and Durfee, 2010; Zhang et al., 2016].
This profile takes a pessimistic view in the sense that u is           Proof. Let’s first consider the case in which Pu toggles u
(stochastically) enabled at the latest possible time consistent        only at a single time step. We show vM    ∗         ∗
                                                                                                                   − ≤ vM by con-
with the commitment, and, if it was not enabled at that time,          structing a policy in M for which the value is vM ∗
                                                                                                                           − by mim-
will never be enabled after the commitment time. Therefore,                     ∗                            −
                                                                       icking πM − . Whether u is initially u and later toggled to u+
we refer to it as the pessimistic profile, as formalized in Defi-      or the other way around, we can construct a policy πM that
nition 1. For maintenance commitments, the pessimistic pro-            chooses the same actions as πM ∗                    −
                                                                                                        − assuming u = u      through-
file should probabilistically disable u at the earliest time, and
                                                                       out the episode. Formally, for any s = (l, u− ), letting
                                                                                                                −
should determinisically disable u after the commitment time,
                                                                       s+ = (l, u+ ), we have πM (s+ ) = πM (s− ) = πM       ∗      −
                                                                                                                                − (s ).
as formalized in Definition 2.
                                                                       By Assumption 1, πM in M yields the same trajectory distri-
Definition 1. Given enablement commitment ce = (Te , pe ),                             ∗        −                   πM       ∗
                                                                       bution of l as πM − in M    , and therefore vM   = vM   − since
                          pessimistic
its pessimistic profile Pbu,c e
                                      toggles u in the transition      value only depends on the trajectory of l.
                          ∗       ∗
   Similarly, we show vM     ≤ vM   + by constructing a policy
            +
πM + in M for which the value is vM      ∗
                                            by mimicking πM ∗
                                                              .                                ...               …...
                                                                              0     1                L0=5                        L=14
Formally, for time steps when u = u− in M , let πM + (s+ ) =                                                       Approx. profile policy
  ∗
πM  (s− ). For time steps when u = u+ in M , let πM + (s+ ) =                                                      Optimal policy
πM (s+ ), where s− = (l, u− ), s+ = (l, u+ ).
  ∗                                                                                            ...               …...
   For the case in which Pu toggles u at K > 1 time steps, we                 0     1                L0=5                       L=14
can decompose the value function Pu as the weighted aver-
                                                                      Figure 1: 1D Walk. Up: Example in the proof of Theorem 1. Down:
age of K value functions corresponding to the K profiles that         Example in the proof of Theorem 2.
toggle u at a single time step, and the weights of the average
are the toggling probabilities of Pu at these K time steps.
                                                                      Proof. The derivation is straightforward from Lemma 2:
3.1    Bounding Suboptimality for Enablement                                              π∗                π∗
                                                                                ∗          ∗            ∗       ∗
Here, we derive a bound on the suboptimality for enablement                    vM − vMM
                                                                                      c
                                                                                        ≤ vM + − v
                                                                                                   c ≤ vM + − v M − .
                                                                                                    M
                                                                                                    c
                                                                                                            M
commitments. From the two assumptions, we also make use
                                                                      Then, we use a simple illustrative example to give an enable-
of Lemma 2 to prove Theorem 1 that bounds the suboptimal-
                                                                      ment commitment for which the equality is attained.
ity for enablement commitments as the difference between
  ∗         ∗                                                         Example: An Enablement Commitment in 1D Walk
vM  − and vM + . Lemma 2 states that, for enablement com-
mitments, the possible imperfect modeling of the pessimistic          Consider the example of a 1D walk on [0, L], as illustrated
profile can only improve the expected value.                          in Figure 1(top), in which the recipient starts at L0 and can
Lemma 2. Given enablement commitment ce = (Te , pe ), let             move right, left, or stay still. There is a gate between 0 and 1
        pessimistic                 π ∗c   π ∗c                       for which u+ denotes the state of open and u− denotes closed.
Pbu = Pbu,c e
                    , then we have v MM  ≥v M   where profile         The gate toggles stochastically according to Pu . For each step
                                               M
                                               c
Pu in M respects the commitment semantics of ce .                     until the recipient reaches either end, a −1 reward is given.
                                                                      Therefore, the optimal policy is to reach either end as soon as
Proof. For enablement commitments, the initial value of u is          possible in expectation. We assume 1 ≤ L0 < L/2 to avoid
u− . Let Pu (t) be the probability that u is enabled to u+ at t in    the uninteresting trivial case of vM  ∗         ∗
                                                                                                              − = vM + . A negative
profile Pu , v πt be the initial state value of π when u is enabled   reward is incurred when bumping into the closed gate, which
from u− to u+ at t with probability one. By Assumption 2,             makes Assumption 1 hold.
 π∗        π∗
vMM
  c
    and v cM
           c
             can be decomposed as                                        Here, we derive an enablement commitment for which the
           M
                                                                      bound in Theorem 1 is attained. Consider L = 14, L0 =
            π∗                  πM∗
                                                π ∗c
          vMM =
                   PTe
                                    + (1 − pe )vMM                    5, H = 15, enablement commitment (Te = L−L0 = 9, pe =
                    t=1 Pu (t)v t                  −,
            c                     c


           πM∗
                      π ∗
                                      π ∗                             1), and the true profile Pu in M that toggles the gate to open at
          v cc = pe v TMc
                          + (1 − pe )vMMc
                                         −.                           t = 4 with probability pe = 1. The optimal policy in M is to
           M            e
                                                                                                     ∗      ∗
                                                                      move left to 0. Therefore, vM     = vM   + = −L0 = −5. Given
                                  ∗
When u is enabled at t in M , πM  c can be executed as if u           the pessimistic profile, moving right to L (arriving at time 9)
is not enabled, by Assumption 1, yielding identical trajectory        is faster than waiting for the gate to toggle at Te = 9 and then
distribution of l (therefore value) as in M
                                          c. Therefore, the           reaching location 0 at time 10. Had the recipient known the
recipient’s replanning at t when u = u+ will derive a better          gate would toggle at time 4, it would have moved left, but by
policy if possible. Therefore, the value of executing πM ∗            the time it toggles at time 4 the recipient is at location 9, and
                                                         c in                                                            π∗
                                      ∗        ∗                                                                        ∗
                          c, i.e. v πt M     πc                       going to L is the faster choice. Therefore vMM
                                                                                                                   c
                                                                                                                     = vM − =
M is no less than that in M            c
                                         ≥ v TM . Therefore,
                                              e                       −(L − L0 ) = −9, and bound (1) is attained.
  π∗     PTe            πM∗
                                        π ∗c
 vMM
   c
     =      t=1 Pu (t)v t
                          c
                            + (1 − pe )vMM −                          3.2   Bounding Suboptimality for Maintenance
                          ∗               ∗
        PTe             πM              πM                            We next ask if the bound in Equation (1) on suboptimality in
       ≥ t=1 Pu (t)v Te + (1 − pe )vM −
                          c                c

                                                                      enablement commitments also holds for maintenance com-
             π ∗c             π ∗c
       ≥pe v TM   + (1 − pe )vMM −     commitment semantics           mitments. Unfortunately, as stated in Theorem 2, the opti-
               e
           ∗
                                                                      mal policy of the pessimistic profile for maintenance com-
         πc
       =v cM  .                                                       mitments can be arbitrarily bad when evaluated in the true
         M
                                                                      profile, incurring a suboptimality exceeding (1). An existence
                                                                      proof is given with an example.
Theorem 1. Given enablement commitment ce , let Pbu =                 Theorem 2. There exists an MDP M with nonnegative re-
  pessimistic
Pbu,c         . The suboptimality can be bounded as                   wards, and a maintenance commitment cm , such that the
      e
                                                                      profile Pu in M respects the commitment semantics of cm ,
                           π∗                                          ∗    ∗        π∗
                   ∗          ∗      ∗                                vM = vM + , vM
                                                                                    M
                                                                                      = 0, and therefore the suboptimality is
                     − vMM ≤ vM + − vM −
                                                                                    c
                  vM     c
                                                               (1)
                                                                                                ∗     π∗   ∗
where profile Pu in M respects the commitment semantics                                        vM − vMM
                                                                                                      c
                                                                                                        = vM +                              (2)
of ce . Further, there exists an enablement commitment for
                                                                                    pessimistic
which the equality is attained.                                       where Pbu = Pbu,c m
                                                                                                is the profile in M
                                                                                                                  c.
Proof. As an existence proof, we give an example of a main-        Pu , Pbu are chosen. Besides using the pessimistic profile to
tenance commitment in 1D Walk with nonnegative rewards             approximate the true profile, we consider the following three
                             π∗
for which vM ∗
                = vM ∗
                       + and vM
                                M
                                c
                                   = 0.                            heuristics for generating approximate profile Pbu ∈ Pu1 :
   Consider 1D Walk with the same L = 14, L0 = 5, H = 15           Optimistic. As opposed to the pessimistic, the optimistic
as in the example for Theorem 1. Here we offset the re-                profile toggles u right after the initial time step for en-
wards by +1 such that the rewards are nonnegative. Consider            ablement commitments, and at the commitment time for
maintenance commitment (Tm = 7, pm = 0), and Pu tog-                   maintenance commitments.
gles the gate from open to closed at t = 6 with probability
1 − pm = 1. As illustrated in Figure 1(bottom), the optimal-       Minimum Value. The toggling time minimizes the op-
ity policy should take 5 steps to move directly to 0, for which        timal value over all possible profiles in Pu1 , i.e.,
                                                                                        ∗
the value is vM∗
                 + . With probability pm , the gate is kept open
                                                                       arg minPbu ∈P 1 vM
                                                                                        c, where Pu is the profile of u in M .
                                                                                                 b                         c
                                                                                       u
                     ∗
through Tm , and πM   c takes 7 steps to reach 0. With probabil-   Minimax Regret. The toggling time is chosen based on the
                                                ∗
ity 1 − pm , the gate is closed at t = 6, and πMc takes 19 > H         minimax regret principle. Formally,
                                     π∗
steps to reach L = 14. Therefore, vMM
                                    c
                                      = 0.                                                                    ∗        π∗
                                                                                  arg minPbu ∈P 1 maxPu ∈Pu1 vM − vMM
                                                                                                                    c

                                                                                                u
   Comparing bound (1) in Theorem 1 with bound (2) in The-
orem 2 reveals a fundamental difference between enablement               where Pu , Pbu are the profiles of u in M, M
                                                                                                                    c, respectively.
and maintenance commitments: maintenance commitments
are inherently less tolerant to an unexpected change in the        The four heuristics include two simple, inexpensive heuris-
commitment feature. For enablement commitments, it is easy         tics (Pessimistic and Optimistic), and two more complex and
to construct a pessimistic profile, such that any unexpected       expensive heuristics (Minimum Value and Minimax Regret).
changes to the feature, if they impact the recipient at all, can   Recall that our theoretical analysis suggests, for maintenance,
only improve the expected value. Thus, if despite the pes-         the worst time for toggling to u− is not right away, but right
simistic profile, a recipient has chosen to follow a policy that   before the recipient uses the condition, and this causes the
exploits the commitment, it can never experience a true pro-       poor performance of the pessimistic profile. We hypothesize
file that would lead it to regret having done so. The same         that the latter two heuristics can improve the pessimistic pro-
cannot be said for maintenance commitments. The easily-            file by identifying the worst toggling time.
constructed pessimistic profile does not guarantee that any        Results Here we evaluate the suboptimality of our candi-
deviations from the profile can only improve the expected          date heuristics for both enablement commitments and main-
value. As our theoretical results show, the pessimistic pro-       tenance commitments. The setting is the same as the exam-
file of assuming toggling from u+ to u− right away can still       ple for Theorem 1 except that the horizon is longer, L =
lead to negative surprises, since if the toggling doesn’t occur    14, L0 = 5, H = 30. Figure 2 shows the mean, minimum,
the profile suggests that it is safe to assume no toggling until   and maximum suboptimality over all realizations of Pu ∈ Pu1
Tm , but that is not true since toggling could happen sooner,      for commitment time Te , Tm ∈ {5, 7, 10}. We see that for en-
after the recipient has incurred cost for a policy that would      ablement commitments, the suboptimality of the pessimistic
need to be abandoned. The poor performance of the pes-             profile is comparable to the two more sophisticated strategies,
simistic model for maintenance is because it is not reliably       and the optimistic profile incurs most suboptimality overall.
pessimistic enough: in the example for Theorem 2, the worst        For maintenance commitments, however, the two expensive
time for toggling to u− is not right away, but right before the    strategies incur overall less suboptimality than the pessimistic
condition would be used (the gate shutting just as the recipi-     and the optimistic, yet it is difficult to identify a single best
ent was about to pass through).                                    heuristic that reliably reduces the suboptimality for all the
                                                                   maintenance commitments.
4     Empirical Results
                                                                   4.2    Gate Control
Our analyses suggest the pessimistic profile might not be the
best approximate profile for a recipient to adopt for mainte-      In this domain, we are concerned with the more general situ-
nance commitments. In this section, we identify several alter-     ation in which Pu 6∈ Pu1 can toggle u at more than one time
native heuristics to create approximate profiles for the recipi-   step by the commitment time, and even can toggle u after the
ent, and evaluate them for both maintenance and enablement         commitment time. We also consider approximate profiles Pbu
commitments. We conduct our evaluations in two domains.            that are not elements of Pu1 .
The first is the same 1D Walk domain as in our theoretical            As illustrated in Figure 3, the provider’s environment con-
analysis, and the second is a Gate Control problem with a          tains four cells, A ↔ B ↔ C ↔ D ↔ A, that are connected
more interesting transition profile (violating Assumption 2).      circularly. The provider can deterministically move to an ad-
                                                                   jacent cell or stay in the current cell. Upon a transition, the
4.1    1D Walk                                                     gate could toggle with probability 0.5 if the provider ends
As previously defined, the 1D Walk domain restricts the set        up in cell C. In the enablement commitment scenario, the
of profiles to toggle u only at a single time step no later than   provider gets a +1 reward if it ends up in cell C, and in the
the commitment time, and agree with Assumption 2 there-            maintenance commitment scenario it gets a +1 reward if end-
after. We denote the set of such profiles as Pu1 from which        ing up in cell A. For a given commitment, the provider adopts
                             commitment time = 5                                                              commitment time = 5                                 2.00                                                                      2.00
                 14                             Pessimistic                            14                                                                                                          Pessimistic
                                                Optimistic                                                                                                        1.75                             Optimistic                               1.75
                 12                             Min Value                              12
                                                                                                                                                                                                   Min Value
                                                                                                                                                                  1.50                                                                      1.50
                 10                             Minimax Regret                         10                                                                                                          Minimax Regret
                                                                                                                                                                  1.25                             Const                                    1.25
 suboptimality




                                                                                                                                                  suboptimality




                                                                                                                                                                                                                            suboptimality
                                                                       suboptimality
                  8                                                                     8                                                                                                          Multi Timepoints
                                                                                                                                                                  1.00                                                                      1.00
                  6                                                                     6
                                                                                                                                                                  0.75                                                                      0.75
                  4                                                                     4
                                                                                                                                                                  0.50                                                                      0.50
                  2                                                                     2
                                                                                                                                                                  0.25                                                                      0.25
                  0                                                                     0
                                                                                                                                                                  0.00                                                                      0.00
                   0.0    0.2      0.4     0.6      0.8          1.0                     0.0           0.2       0.4      0.6         0.8   1.0                       0.0    0.2      0.4     0.6      0.8            1.0                       0.0    0.2      0.4      0.6      0.8      1.0
                         enablement commitment probability                                             maintenance commitment probability                                   enablement commitment probability                                         maintenance commitment probability

                     (a) enablement, Te = 5                                             (b) maintenance, Tm = 5                                                         (a) enablement, Te = 4                                                 (b) maintenance, Tm = 4
                             commitment time = 7                                                              commitment time = 7                                 2.00                                                                      2.00
                 14                                                                    14
                                                                                                                                                                  1.75                                                                      1.75
                 12                                                                    12
                                                                                                                                                                  1.50                                                                      1.50
                 10                                                                    10
                                                                                                                                                                  1.25                                                                      1.25
 suboptimality




                                                                                                                                                  suboptimality




                                                                                                                                                                                                                            suboptimality
                                                                       suboptimality

                  8                                                                     8
                                                                                                                                                                  1.00                                                                      1.00
                  6                                                                     6
                                                                                                                                                                  0.75                                                                      0.75
                  4                                                                     4
                                                                                                                                                                  0.50                                                                      0.50
                  2                                                                     2
                                                                                                                                                                  0.25                                                                      0.25
                  0                                                                     0
                                                                                                                                                                  0.00                                                                      0.00
                   0.0    0.2      0.4     0.6      0.8          1.0                     0.0           0.2       0.4      0.6         0.8   1.0                       0.0    0.2      0.4     0.6      0.8            1.0                       0.0    0.2      0.4      0.6      0.8      1.0
                         enablement commitment probability                                             maintenance commitment probability                                   enablement commitment probability                                         maintenance commitment probability

                     (c) enablement, Te = 7                                             (d) maintenance, Tm = 7                                                         (c) enablement, Te = 6                                                 (d) maintenance, Tm = 6
                             commitment time = 10                                                             commitment time = 10
                 14                                                                    14                                                         Figure 4: Suboptimality in Gate Control. Markers on the curves
                 12                                                                    12                                                         show the mean suboptimality over possible time steps of toggling.
                 10                                                                    10
                                                                                                                                                  Bars show the minimum and maximum.
 suboptimality




                                                                       suboptimality




                  8                                                                     8
                  6                                                                     6
                  4                                                                     4
                  2                                                                     2                                                                               other time steps T . Here, we consider T = {1, bT /2c},
                  0                                                                     0                                                                               and the pessimistic heuristic is then used to match the
                   0.0    0.2      0.4     0.6      0.8          1.0                     0.0           0.2       0.4      0.6         0.8   1.0
                         enablement commitment probability                                             maintenance commitment probability                               toggling probabilities at these three time steps.
                   (e) enablement, Te = 10                                             (f) maintenance, Tm = 10
                                                                                                                                                  Results We consider the combination of the following sce-
Figure 2: Suboptimality in 1D Walk. Markers on the curves show                                                                                    narios: the provider can start in any one of the four cells;
the mean suboptimality over possible time steps of toggling, Pu ∈                                                                                 and the toggling can happen in even, odd, or all time steps.
Pu1 . Bars show the minimum and maximum.                                                                                                          The time horizon is H = 10 for both the provider and the
                                                                                                                                                  recipient. This yields in total 12 (true) profiles Pu . Figure 4
                                                                                                                                                  shows the mean, maximum, and minimum suboptimality for
a policy that aims to maximize its cumulative reward while re-                                                                                    Te , Tm ∈ {4, 6} over the 12 profiles. Similar to 1D Walk,
specting the commitment semantics. The recipient gets a -0.1                                                                                      the results show that the pessimistic profile is among the best
reward each time step. Upon reaching cell G, the recipient                                                                                        for enablement commitments, but it is difficult for mainte-
gets a +1 reward and the episode ends.                                                                                                            nance commitments to identify a best heuristic, besides the
   Besides the four heuristics we considered for the 1D Walk,                                                                                     multi-timepoints, that reliably reduces the suboptimality for
we further consider the following two that choose an approx-                                                                                      all commitment time/probability pairs we consider. Using the
imate profile outside of the set Pu1 :                                                                                                            multi-timepoints profile that is more aligned with the true pro-
Constant. This profile toggles u at every time step up to the                                                                                     file, the suboptimality can be dramatically reduced for main-
    commitment time with a constant probability, and the                                                                                          tenance commitments, but it has a less significant impact for
    probability is chosen such that the overall probability of                                                                                    enablement commitments. This suggests that, unlike enable-
    toggling by the commitment time matches the commit-                                                                                           ment commitments where the cost is low of the provider re-
    ment probability. It agrees with the pessimistic profile                                                                                      taining considerable flexibility by only committing to a single
    after the commitment time.                                                                                                                    time-probability pair (leaving itself freedom to change its pol-
                                                                                                                                                  icy dynamically so long as it meets or exceeds that target),
Multi-timepoints. Besides time T , the provider also pro-                                                                                         maintenance commitments greatly benefit from a provider
    vides the recipient with the toggling probabilities for                                                                                       committing to a more detailed profile, sacrificing flexibility
                                                                                                                                                  in order to improve the quality of the recipient’s expectations
                                                                                                        G                                         to reduce the frequency and costs of negative surprises.
                                      A                                                        3


                                                                                               2

                            D                    B                                                                                                5                   Conclusion
                                                                                               1

                                                                                                                                                  We have shown that we cannot straightforwardly extend the
                                      C                                                        0                                              S         ?
                                                                                                                                                  semantics   and algorithms for trustworthy fulfillment of en-
                                                                                               y
                                                                                                   x      0        1       2        3
                                                                                                                                                  ablement commitments to maintenance commitments. Our
Figure 3: Gate Control. Left: The provider. Cell C toggles the gate.                                                                              theoretical and empirical results suggest that, despite their
Right: The recipient.                                                                                                                             similarities in describing the toggling of conditions over time,
maintenance commitments are fundamentally different from             Computer, Communication and Software Systems, pages
enablement commitments. We have theoretically shown that             220–270. Springer, 2007.
the easily-constructed pessimistic profile can only improve       [Maheswaran et al., 2008] Rajiv         Maheswaran,         Pedro
the expected value in the face of unexpected changes for             Szekely, Marcel Becker, Stephen Fitzpatrick, Gergely
enablement commitments but not for maintenance commit-               Gati, Jing Jin, Robert Neches, Narges Noori, Craig
ments. Empirically, we have seen that an inexpensive pes-            Rogers, Romeo Sanchez, et al. Predictability & criticality
simistic approximation of the profile works comparably to            metrics for coordination in complex environments. In
more sophisticated approximations for enablement commit-             Proceedings of the 7th International Joint Conference on
ments, but not for maintenance commitments.                          Autonomous Agents and Multiagent Systems-Volume 2,
   The fact that approximating profiles well is harder for           pages 647–654, 2008.
maintenance commitments could mean that agents engaged
in maintenance commitments might need to make a differ-           [Nuzzo et al., 2019] Pierluigi Nuzzo, Jiwei Li, Alberto L.
ent tradeoff. That is, for enablement, we could give the             Sangiovanni-Vincentelli, Yugeng Xi, and Dewei Li.
provider a lot of flexibility by only constraining it to meet        Stochastic assume-guarantee contracts for cyber-physical
the probability at the commitment time and so can unilater-          system design. ACM Trans. Embed. Comput. Syst.,
ally change the profile before then. The gain in flexibility         18(1):2:1–2:26, January 2019.
for the provider is worth the relatively small value loss to      [Singh, 1999] M. P. Singh. An ontology for commitments
the recipient from using the pessimistic profile. However, for       in multiagent systems. Artificial Intelligence and Law,
maintenance commitments, the potential for the recipient to          7(1):97–113, 1999.
lose more value with a bad (simple or sophisticated) approxi-     [Winikoff, 2006] Michael Winikoff. Implementing flexible
mate profile could mean that the provider should commit to a         and robust agent interactions using distributed commit-
more detailed profile—the loss of flexibility for the provider       ment machines. Multiagent and Grid Systems, 2(4):365–
in this case is warranted because the recipient makes much           381, 2006.
better decisions. In other words, our theoretical and empirical
work suggests that maintenance commitments could require          [Witwicki and Durfee, 2009] Stefan J. Witwicki and Ed-
providers and recipients to inherently be more tightly coupled       mund H. Durfee. Commitment-based service coordina-
than they need to be for enablement commitments.                     tion. Int.J. Agent-Oriented Software Engineering, 3:59–
Acknowledgments We thank the anonymous reviewers for                 87, 01 2009.
great suggestions. This work was supported in part by the         [Witwicki and Durfee, 2010] Stefan J. Witwicki and Ed-
Air Force Office of Scientific Research under grant FA9550-          mund H. Durfee. Influence-based policy abstraction for
15-1-0039. Any opinions, findings, conclusions, or recom-            weakly-coupled Dec-POMDPs. In Int. Conf. Auto. Plan-
mendations expressed here are those of the authors and do            ning Sys. (ICAPS), pages 185–192, 2010.
not necessarily reflect the views of the sponsors.
                                                                  [Xing and Singh, 2001] Jie Xing and Munindar P Singh.
                                                                     Formalization of commitment-based agent interaction. In
References                                                           Proceedings of the 2001 ACM symposium on Applied com-
[Duff et al., 2014] S. Duff, J. Thangarajah, and J. Harland.         puting, pages 115–120. ACM, 2001.
   Maintenance goals in intelligent agents. Computational         [Xuan and Lesser, 1999] Ping Xuan and Victor R Lesser. In-
   Intelligence, 30(1):71–114, 2014.                                 corporating uncertainty in agent commitments. In Inter-
[Durfee and Singh, 2016] Edmund H. Durfee and Satinder               national Workshop on Agent Theories, Architectures, and
   Singh. On the trustworthy fulfillment of commitments. In          Languages, pages 57–70. Springer, 1999.
   Nardine Osman and Carles Sierra, editors, AAMAS Work-          [Zhang et al., 2016] Qi Zhang, Edmund H Durfee, Satinder
   shops, pages 1–13. Springer, 2016.                                Singh, Anna Chen, and Stefan J Witwicki. Commitment
[Hindriks and van Riemsdijk, 2007] K. V. Hindriks and                semantics for sequential decision making under reward un-
   M. B. van Riemsdijk. Satisfying maintenance goals.                certainty. In Int J. Conf. on Artificial Intelligence (IJCAI),
   In 5th Int. Workshop Declarative Agent Languages and              pages 3315–3323, 2016.
   Technologies (DALT), pages 86–103, 2007.                       [Zhang et al., 2017] Qi Zhang, Satinder Singh, and Edmund
[Jennings, 1993] Nick R Jennings. Commitments and con-               Durfee. Minimizing maximum regret in commitment con-
   ventions: The foundation of coordination in multi-agent           strained sequential decision making. In Int. Conf. Auto-
   systems. The knowledge engineering review, 8(3):223–              mated Planning Systems (ICAPS), pages 348–356, 2017.
   250, 1993.                                                     [Zhang et al., 2018] Qi Zhang, Edmund H. Durfee, and
[Kalia et al., 2014] A. K. Kalia, Z. Zhang, and M. P. Singh.         Satinder P. Singh. Challenges in the trustworthy pursuit of
   Estimating trust from agents’ interactions via commit-            maintenance commitments under uncertainty. In Proceed-
   ments. In 21st Euro. Conf. on AI (ECAI), pages 1043–              ings of the 20th International Trust Workshop co-located
   1044, 2014.                                                       with AAMAS/IJCAI/ECAI/ICML 2018, Stockholm, Swe-
[Kwiatkowska et al., 2007] Marta Kwiatkowska, Gethin                 den, July 14, 2018., pages 75–86, 2018.
   Norman, and David Parker. Stochastic model checking. In
   International School on Formal Methods for the Design of