=Paper=
{{Paper
|id=Vol-3022/paper7
|storemode=property
|title=Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception
|pdfUrl=https://ceur-ws.org/Vol-3022/paper7.pdf
|volume=Vol-3022
|authors=Sarit Adhikari,Piotr J. Gmytrasiewicz
|dblpUrl=https://dblp.org/rec/conf/atal/AdhikariG21
}}
==Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception==
Sarit Adhikari and Piotr J. Gmytrasiewicz, University of Illinois at Chicago, Chicago, IL 60607. sadhik6@uic.edu, piotr@uic.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: R. Falcone, J. Zhang, and D. Wang (eds.): Proceedings of the 22nd International Workshop on Trust in Agent Societies, London, UK, May 3-7, 2021, published at http://ceur-ws.org

===Abstract===
In Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs), agents use Bayesian updates to process their observations and messages without the usual assumption of cooperative discourse. We formalize the notions of sincerity and deception in terms of the messages in the message space and the belief of the agent sending the message. We then use a variant of the point-based value iteration method, called IPBVI-Comm, to study the optimal interactive behavior of agents in cooperative and competitive variants of the Tiger game. Modeling the belief and preferences of the opponent allows an agent to predict the opponent's optimal communicative behavior and physical actions. Unsurprisingly, it may be optimal for agents to attempt to mislead others if their preferences are not aligned. But it turns out that a higher depth of reasoning allows an agent to detect insincere communication and to guard against it. Specifically, in some scenarios the agent can distinguish a truthful friend from a deceptive foe when the received message contradicts the agent's observations, even when the received message does not directly reveal the opponent's type.

===1 Introduction===
The study of communication and interaction among self-interested agents in partially observable and stochastic domains has applications in several fields, ranging from the military [BAG+05, GYPC20] to social robotics [BTK08]. Communicative behavior among agents in multi-agent systems has been studied in cognitive science [AdR18], economics [EG19], and artificial intelligence [GL20, YFS+20]. With the advancement of artificial intelligence, the topic of machine deception has become more important. In particular, since communication among agents is becoming ubiquitous, malicious agents trying to exploit vulnerabilities in other AI systems and in humans may be a common problem in the future. It is therefore important to lay the foundation for deception-resistant AI systems. Further, as more AI agents become part of our social life, the study of emergent social behavior among communicating agents (both artificial and human) with varied preferences is vital. Although a vast literature exists on machines fooling humans with fake content [SSW+17] and on humans fooling machines with adversarial attacks [KGB17], the study of deception in sequential decision-making scenarios, by modeling other agents, has rarely been explored. As argued in [IB17], AI needs to guard itself against malevolent humans and sometimes be able to deceive as well. On the other hand, when the agents' preferences align, they benefit from sincere communication.

Figure 1: Theory of Mind (ToM) reasoning from the perspective of agent i interacting with agent j when there is uncertainty about the opponent type. The level is indicative of cognitive sophistication. The neutral agent i thinks j might be an enemy, a friend, or a random agent. Further, i thinks that j thinks i might be a sincere and gullible agent or a random agent. The behavior of j is simulated by i by putting itself in j's shoes. Within that simulation, i needs to reason about how j simulates i's behavior.

Like physical actions, communicative actions are guided by the expected utility obtained in a particular state. Agents sometimes benefit from being sincere, and sometimes it is in their best interest to deceive. To be able to cooperate, deceive, or guard against deception, an agent needs to model the belief, intention, and preferences of the other agent [BI11]. Humans are not very adept at detecting lies; we find it difficult to tell a skillful liar from a truthful, but possibly misinformed, friend. The main reason is that both may behave the same. We show examples of interactions during which agents, using Bayesian updates, can form informative beliefs about the identity of their opponents by keeping careful track of their possible states of knowledge and by comparing the content of their communications to the agent's own observations. It is important to note that analyzing communicative behavior separately from physical actions is bound to be insufficient, unless agents care directly about the states of other agents' beliefs, as opposed to caring only about the physical states of the world. In the latter, arguably more realistic, case, rational agents will attempt to modify others' beliefs only because this could induce them into advantageous physical actions; this could not be analyzed if one abstracts physical actions away.

Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs) provide a principled framework for rational interaction and communication in multi-agent environments [Gmy20]. The CIPOMDP framework is an extension of interactive POMDPs [GD05] that includes the exchange of messages among the agents. IPOMDPs, in turn, extend Partially Observable Markov Decision Processes (POMDPs) [Son78] to include other agents by incorporating their models into the state space. Thus, while POMDPs provide a theoretically sound framework to model uncertainty about the state of the world, the Theory of Mind (ToM) approach of IPOMDPs and CIPOMDPs allows modeling of uncertainty about other agents, including their beliefs about the world and about other agents. Figure 1 shows the theory of mind of the decision-making agent, which is uncertain about the type of the other agent and about how that agent may model others, including the original agent. At the bottom of the hierarchy of models could be a random agent, or a rational agent that does not model others, i.e., a classical POMDP agent. A great deal of research in psychology establishes a connection between deception and recursive theory of mind reasoning, which starts at an early age in humans [STHP91, RO98, DWW+15]. More recently, [OSV19] provides a comprehensive quantitative analysis of the role of rationality and theory of mind in deception and in detecting deception.

We first formalize the definitions of sincerity and deception in terms of the message sent by the agent and its belief about the state of the world. We then propose an offline solution method for communicative IPOMDPs, which adopts a point-based solution technique to solve for the optimal communicative behavior. The algorithm computes a policy that yields the optimal action-message pair at each time-step of interaction with the other agent. The executed action may not be known to the other agent, but we assume that the message sent is received by the other agent with certainty.
The subsequent policies are conditional not only on the observation but also on the message received from the other agent. Based on the preferences of the agent and what is known about the preferences of the modeled agent, the agent may benefit from incorporating the message from the other agent into its belief, but it discounts the message if it thinks the other agent has an incentive to lie. Since a CIPOMDP agent needs a message space to calculate its policy offline, and messages express an agent's beliefs, we construct the message space by discretizing the interactive belief space. The policy of the POMDP agent at the bottom of a ToM hierarchy is augmented with the sincere messages it can send, using a look-ahead reachability tree from the initial belief in the interactive state of the modeling agent. Similarly, we propose a way for an agent modeled as a POMDP to receive messages by augmenting its observation space. We apply CIPOMDPs to agents interacting in the multi-agent tiger game and show that communication is valuable to agents and results in superior policies compared to their no-communication counterparts. In cooperative scenarios, the agent can take advantage of messages from a sincere agent as additional observations, and can send sincere messages that inform the other agent. In competitive scenarios, the agent not only attempts to deceive the other agent but also ignores messages it knows to be deceitful. We then show how the Bayesian update allows an agent higher in a cognitive hierarchy to tell a friend from a foe based on the message received and its own observations.

===2 Related work===
The problem of agents communicating and interacting simultaneously has been addressed in several decision-theoretic as well as RL settings. [FAdFW16] uses DDRQN to learn a communication protocol and solve a riddle through coordination. [FSH+19] combines multi-agent RL with a Bayesian update to compute communication protocols and policies in cooperative, partially observable multi-agent settings. [SsF16] uses a neural model that learns communication along with the policy. In the planning and control setting, [NMT13] uses communication for controllers to share part of their observation and control history at each step. More recently, [ULS20] used a POMDP with communication for a human-robot collaboration task. Other work in HRI includes [CT16], [DCA17], and [WPH16]. [DA16] uses a theory of mind approach for the execution of a shared plan. Communication has been studied extensively in other multi-agent decision-theoretic frameworks [NRYT03], [NPY+04], [ZVlY04], [OSV07]. In [RMG15], agents use an extended belief state that contains an approximation of other agents' beliefs. However, these works assume fully cooperative interactions and mostly involve central planning. CIPOMDPs, on the other hand, provide subjective theory of mind reasoning during communication and follow Bayesian approaches to pragmatics.

Deception has been widely studied across multiple disciplines, including game theory [STX+18], [EJ10], psychology [GA14], and economics [Gne05]. When it comes to sequential decision processes for studying attackers' and defenders' approaches in cybersecurity research, the decision-theoretic frameworks of POMDPs [AASN+20] and IPOMDPs [SDS20] have been used. [SPB+19] combines ToM with components from deception theory and implements an epistemic agent using an agent-oriented programming language.
===3 Background===
====3.1 Communicative Interactive POMDPs====
The CIPOMDP [Gmy20] is the first general framework for an autonomous, self-interested agent to communicate and interact with other agents in the environment based on Bayesian decision theory. A finitely nested communicative interactive POMDP of agent i in an environment with agent j is defined as:

CIPOMDP_i = \langle IS_{i,l}, A_i, M, \Omega_i, T_i, O_i, R_i \rangle \qquad (1)

where IS_{i,l} is a set of interactive states, defined as IS_{i,l} = S × M_{j,k}, l ≥ 1, where S is the set of physical states, M_{j,k} is the set of possible models of agent j, l is the strategy (nesting) level, and k < l. The set of possible models M_{j,k} consists of intentional models, Θ_j, or sub-intentional ones, SM_j. While the intentional models ascribe beliefs, preferences, and rationality in action selection to the modeled agent, the sub-intentional models do not. We consider k-th level (k < l) intentional models of agent j, defined as θ_{j,k} = ⟨b_{j,k}, A_j, Ω_j, T_j, O_j, R_j⟩, where b_{j,k} is agent j's belief nested to level k, b_{j,k} ∈ Δ(IS_{j,k}). The intentional model θ_{j,k}, sometimes called a type, can be rewritten as θ_{j,k} = ⟨b_{j,k}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame. Among the classes of sub-intentional models, we consider the no-information model [GD00], which randomly selects an action to execute and a message to send at each time-step. The random models are possible at each level, starting with level 0.

In contrast to classical POMDPs, and similarly to IPOMDPs, the transition, observation, and reward functions in CIPOMDPs take the actions of other agents into account. A = A_i × A_j is the set of joint actions of all agents, Ω_i is the set of agent i's possible observations, T_i : S × A × S → [0, 1] is the state transition function, O_i : S × A × Ω_i → [0, 1] is the observation function, and R_i : S × A → ℝ is the reward function. The interactive state space IS_{i,l} can be defined inductively:

IS_{i,0} = S, \quad \Theta_{j,0} = \{\langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(S)\}, \quad M_{j,0} = \Theta_{j,0} \cup SM_j
IS_{i,1} = S \times M_{j,0}, \quad \Theta_{j,1} = \{\langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1})\}, \quad M_{j,1} = \Theta_{j,1} \cup SM_j
\ldots
IS_{i,l} = S \times \bigcup_{k=0}^{l-1} M_{j,k}, \quad \Theta_{j,l} = \{\langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l})\}, \quad M_{j,l} = \Theta_{j,l} \cup SM_j

The above defines the 0-level model, θ_{j,0}, as having beliefs only over the physical state space, S. The level-1 agent model maintains beliefs over the physical states and the 0-level models of the opponent. A level-l agent, θ_{j,l}, maintains beliefs over S and over models of the opponent nested up to level l − 1.

M is a set of messages the agents can send to and receive from each other, i.e., it is a communication language the agents share. Since agents' beliefs are probability distributions and communication is intended to share beliefs, it is natural to interpret a message in M as a marginal probability distribution over a subset of variables in the agents' interactive state spaces IS_i and IS_j, which overlap. That way, M is a set of probabilistic statements about the interactive state space. The message nil, i.e., silence, contains no variables. Note that we do not assume that messages reflect agents' actual beliefs. We will further discretize M below.
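To make the nesting of models concrete, the sketch below shows one possible Python representation of frames, intentional models, and interactive states for the tiger game used later in the paper. The class and field names are our own illustration and are not part of the CIPOMDP formalism; a full implementation would also attach the transition, observation, and reward functions to each frame.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Physical states of the multi-agent tiger game used in Section 5.
STATES = ("TL", "TR")

@dataclass(frozen=True)
class Frame:
    """Agent j's frame: every element of an intentional model except the belief."""
    actions: Tuple[str, ...]        # A_j
    observations: Tuple[str, ...]   # Omega_j
    reward_name: str                # e.g. "friend", "enemy", "neutral"

@dataclass
class IntentionalModel:
    """theta_{j,k} = <b_{j,k}, frame>: a belief nested to level k plus a frame."""
    level: int
    belief: Dict[object, float]     # b_{j,k} in Delta(IS_{j,k}); Delta(S) at level 0
    frame: Frame

@dataclass
class RandomModel:
    """Sub-intentional no-information model: uniform over actions and messages."""
    actions: Tuple[str, ...]

Model = Union[IntentionalModel, RandomModel]

@dataclass(frozen=True)
class InteractiveState:
    """An interactive state pairs a physical state with a model of the other agent."""
    physical_state: str
    other_model_id: str             # key into a finite dictionary of candidate models

# A level-0 intentional model of j: beliefs over physical states only.
level0_friend = IntentionalModel(
    level=0,
    belief={"TL": 0.5, "TR": 0.5},
    frame=Frame(actions=("OR", "OL", "L"), observations=("GL", "GR"),
                reward_name="friend"),
)
candidate_models: Dict[str, Model] = {
    "friend0": level0_friend,
    "random": RandomModel(actions=("OR", "OL", "L")),
}

# A level-1 interactive belief of agent i: a distribution over (state, model) pairs.
b_i_1 = {InteractiveState(s, m): 0.25 for s in STATES for m in candidate_models}
assert abs(sum(b_i_1.values()) - 1.0) < 1e-9
</syntaxhighlight>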
====3.2 Belief update in CIPOMDPs====
Belief update in CIPOMDPs is analogous to belief update in IPOMDPs when it comes to actions and observations. At any particular time step, agents i and j can not only perform physical actions and observe, but also send and receive messages. Call the message i sent at time t − 1 m^{t-1}_{i,s}, and the one i received at time t m^t_{i,r}, and analogously for j. We assume all messages are in M and that message transmission is perfect. We provide a precise definition of the message space in our formulation in Section 4.1. The belief update in CIPOMDPs has to update the probability of the interactive state given the previous belief, action, and observation, and given the message sent (at the previous time step) and received (at the current time step), P(is^t | b_i^{t-1}, a_i^{t-1}, m_{i,s}^{t-1}, o_i^t, m_{i,r}^t):

b_i^t(is^t) = P(is^t \mid b_i^{t-1}, a_i^{t-1}, m_{i,s}^{t-1}, o_i^t, m_{i,r}^t) = \eta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} P(m_{j,s}^{t-1}, a_j^{t-1} \mid \theta_j^{t-1}) \times O_i(s^t, a^{t-1}, o_i^t) \, T_i(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, m_{j,s}^{t-1}, o_j^t, m_{j,r}^t, b_j^t) \, O_j(s^t, a^{t-1}, o_j^t) \qquad (2)

The term P(m_{j,s}^{t-1}, a_j^{t-1} | θ_j^{t-1}) quantifies the relation between the message i received from j and the model, θ_j, of agent j that generated the message. (Note that m_{j,s}^{t-1} = m_{i,r}^t because message transmission is assumed to be perfect.) This term is the measure of j's sincerity, i.e., whether the message j sent reflects j's beliefs, which are part of the model θ_j. η is the normalizing constant.

====3.3 Planning in CIPOMDPs====
The utility of the interactive belief of agent i, contained in i's type θ_i, is:

U_i(\theta_i) = \max_{(m_{i,s}, a_i)} \Big\{ \sum_{is \in IS} b_i(is) \, ER_i(is, m_{i,s}, a_i) + \gamma \sum_{(m_{i,r}, o_i)} P(m_{i,r}, o_i \mid b_i, a_i) \, U_i(\langle SE_{\theta_i}(b_i, a_i, m_{i,s}, o_i, m_{i,r}), \hat{\theta}_i \rangle) \Big\} \qquad (3)

ER_i(is, m_{i,s}, a_i) above is the immediate reward to i for sending m_{i,s} and executing action a_i given the interactive state is, and is equal to Σ_{a_j} R_i(is, a_i, a_j, m_{i,s}) P(a_j | θ_j). Planning in CIPOMDPs makes use of Equation 3, which is based on the Bellman optimality principle. The policy computes the optimal action-message pair, which results in the maximum expected reward. Consequently, value iteration in CIPOMDPs is analogous to that in IPOMDPs and POMDPs. The set of optimal message-action pairs, (m*_{i,s}, a*_i), is obtained by replacing max in Equation 3 with argmax. We call the resulting set of optimal message-action pairs OPT(θ_i). When agent i models agent j as a strict optimizer, i predicts that j would choose an action-message pair in the OPT set with equal probability:

P(m_{j,s}, a_j \mid \theta_j) = \frac{1}{|OPT(\theta_j)|} \qquad (4)

and P(m_{j,s}, a_j | θ_j) is equal to zero if (m_{j,s}, a_j) is not in OPT. The possibility that agents may be less than optimal is considered in [Gmy20]. Being able to compute the probabilities of messages given the preferences and beliefs of a speaker is of crucial importance when sincerity is not guaranteed. The belief update in CIPOMDPs provides a principled way to discount the content of messages that may be insincere because it is in the interest of the speaker to transmit them. We give an example of this further below.
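To illustrate how the sincerity term of Equations 2 and 4 enters the update, the following sketch performs a stripped-down version of the CIPOMDP belief update over (physical state, opponent type) pairs. It is a simplification under our own assumptions: the predicted action-message distribution for each candidate type is assumed to have been obtained offline by solving that model, and the propagation of the opponent's nested belief through the state estimator τ is omitted.

<syntaxhighlight lang="python">
from collections import defaultdict

def cipomdp_belief_update(prior, predict_msg_act, T, O, states, a_i, o_i, m_received):
    """Simplified form of Eq. (2): posterior over (state, opponent_type) pairs.

    prior:           dict (s, type) -> probability
    predict_msg_act: dict type -> dict (m_j, a_j) -> probability  (sincerity term, Eq. 4)
    T:               dict (s, a_i, a_j, s_next) -> probability
    O:               dict (s_next, a_i, a_j, o_i) -> probability
    """
    posterior = defaultdict(float)
    for (s, typ), p in prior.items():
        for (m_j, a_j), p_ma in predict_msg_act[typ].items():
            if m_j != m_received:      # perfect channel: m_{j,s} equals m_{i,r}
                continue
            for s_next in states:
                posterior[(s_next, typ)] += (p * p_ma
                                             * T.get((s, a_i, a_j, s_next), 0.0)
                                             * O.get((s_next, a_i, a_j, o_i), 0.0))
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()} if z > 0 else dict(posterior)
</syntaxhighlight>

In this sketch, predict_msg_act for an intentional model would come from the OPT set of Equation 4 (uniform over the optimal pairs and zero elsewhere), and for the random model it would be uniform over all action-message pairs.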
===4 Approach===
====4.1 Message space====
We define the message space, M, as a set of marginal probability distributions over a subset of variables in the interactive states of the agent. Since the belief space is continuous, we make the computation more tractable by quantizing M into a finite set of belief points. The message space is augmented with nil, which is analogous to the no-op operation in the physical action set. Limiting the message space to only the nil message reduces a CIPOMDP to an IPOMDP. Usually, a message contains information about only a subset of the possible interactive states. (For example, if the physical state space, S, is factored into variables X and Y, a message, m, might be "P(0 ≤ X ≤ 100) = 0.7".) The variables that the message does not mention are interpreted as being marginalized. We use the principle of indifference and assume that the probability is uniformly distributed among the remaining variables.

The received message m_{i,r} can provide a mapping from physical states to beliefs, marginalizing the other variables of the interactive state (the belief of the other agent, its frame, etc.). Then m_{i,r}(s) denotes the belief the message ascribes to physical state s ∈ S. Further, a message may contain probabilities for only a subset of the values that the variables mentioned in the message can take. Let S' be the set of states not mentioned in the message. If a message m does not mention state s', then for all s' ∈ S':

m(s') = \frac{1 - \sum_{s \in S - S'} m(s)}{|S| - |S'|} \qquad (5)

====4.2 Sincerity and Deception====
Let X denote the set of variables describing the interactive state of the agent. ℘(X) is the set of all non-empty subsets of X. The joint belief b^t of an agent can be marginalized over a subset of the variables in X. Let b_X represent the belief marginalized over the variables in X. Accordingly, the message space M can be factored into sets of messages marginalized over each X ∈ ℘(X). A sincere message can be defined as a message m ∈ M_X in the message space which is closest to the marginalized belief b^t_X consistent with the true joint belief b^t of the agent at time t. The distance is defined in terms of the L1 norm. Thus the set of sincere messages is given by

M^t_{sincere} = \bigcup_{X \in \wp(X)} \arg\min_{m \in M_X} \| b^t_X - m \| \qquad (6)

An insincere (deceptive) message can be defined as any message in the message space except the one closest to the true belief b^t of the agent at time t. Thus the set of insincere messages is given by

M^t_{insincere} = \bigcup_{X \in \wp(X)} \Big( M_X - \arg\min_{m \in M_X} \| b^t_X - m \| \Big) \qquad (7)
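For the discretized message space used in the experiments of Section 5, Equations 6 and 7 reduce to a nearest-neighbor computation in the L1 norm. A minimal sketch, restricted to messages over the physical states only (the helper names are ours):

<syntaxhighlight lang="python">
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Message space over (TL, TR), quantized as in Section 5.1.2; nil is handled separately.
MESSAGES = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]

def sincere_messages(belief_marginal, messages=MESSAGES):
    """Messages closest (in L1) to the agent's marginal belief -- Eq. (6)."""
    best = min(l1(belief_marginal, m) for m in messages)
    return [m for m in messages if abs(l1(belief_marginal, m) - best) < 1e-12]

def insincere_messages(belief_marginal, messages=MESSAGES):
    """All remaining messages -- Eq. (7)."""
    sincere = set(sincere_messages(belief_marginal, messages))
    return [m for m in messages if m not in sincere]

# Example: after hearing GL once with 0.85 sensor accuracy, the belief over (TL, TR)
# is (0.85, 0.15); the closest, and hence sincere, message is (0.75, 0.25).
print(sincere_messages((0.85, 0.15)))      # [(0.75, 0.25)]
print(insincere_messages((0.85, 0.15)))    # the other four messages
</syntaxhighlight>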
====4.3 Communication for POMDP (level-0 CIPOMDP)====
The recursion in a CIPOMDP bottoms out as a flat POMDP, which does not have a model of the other agent. We use the definition of a literal speaker from [Gmy20]: a literal speaker generates a message reflecting its true belief about the physical states of the world, b^t, with probability 1 − α, and every other message, including nil, with probability α/(|M| − 1). The messages are then broadcast to "no one in particular" (NOIP) and do not take part in the belief update of the POMDP agent. A literal listener can incorporate the incoming message as an additional observation, which we describe in the following section. These assumptions enable an agent that does not model the other agent to participate in the exchange of messages.

=====4.3.1 Augmented observation space and function for POMDP=====
We propose that a POMDP (θ_0) can receive a message and include it in its belief update by augmenting its observation space and, consequently, its observation function. The observation space becomes the Cartesian product of the usual observation space and the message space:

\Omega' = \Omega \times M \qquad (8)

The joint probability of the observation and the message received is obtained by combining the likelihood function for a physical observation with a message distribution. The message distribution is represented by a triangular distribution, with the idea that the likelihood of a message reflecting a belief about the world state should increase monotonically with that belief. For example, if the belief is defined over two states (s1, s2) and the true state of the world is s1, the likelihood of a received message reflecting the belief (1, 0) should be higher than the likelihood of a received message reflecting the belief (0.5, 0.5). This formulation makes the agent gullible. The message likelihood is

P(m_{i,r} \mid s) = \begin{cases} \frac{1}{|S|} & \text{if } m_{i,r} = nil \\ m_{i,r}(s) & \text{otherwise} \end{cases} \qquad (9)

Given the state of the world, the observation is conditionally independent of the message received, so the augmented observation function can be defined, for all m_{i,r} ∈ M, o ∈ Ω, and s ∈ S, as

O'(s, a, o, m_{i,r}) = \frac{P(o, m_{i,r} \mid s, a)}{\sum_{o, m_{i,r}} P(o, m_{i,r} \mid s, a)} = \frac{P(o \mid s, a) P(m_{i,r} \mid s, a)}{\sum_{o} P(o \mid s, a) \sum_{m_{i,r}} P(m_{i,r} \mid s)} = \frac{O(s, a, o) P(m_{i,r} \mid s)}{\sum_{m_{i,r}} P(m_{i,r} \mid s)} \qquad (10)
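A minimal sketch of the augmented level-0 observation function of Equations 9 and 10, again using the quantized tiger-game message space; the helper names and the 0.85 accuracy default are our own choices for illustration.

<syntaxhighlight lang="python">
def message_likelihood(m, s, states):
    """P(m_{i,r} | s) from Eq. (9): uniform for nil, else the mass the message puts on s."""
    if m == "nil":
        return 1.0 / len(states)
    return m[states.index(s)]

def augmented_observation(O, s, a, o, m, messages, states):
    """O'(s, a, o, m_{i,r}) from Eq. (10), assuming o and m_{i,r} are independent given s."""
    num = O(s, a, o) * message_likelihood(m, s, states)
    denom = sum(message_likelihood(mm, s, states) for mm in messages)
    return num / denom

def O_tiger(s, a, o, accuracy=0.85):
    """Tiger-game observation function: growls are informative only while listening."""
    if a != "L":
        return 0.5
    correct = {"TL": "GL", "TR": "GR"}[s]
    return accuracy if o == correct else 1.0 - accuracy

STATES = ["TL", "TR"]
MESSAGES = ["nil", (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
# Joint likelihood of hearing GL and receiving (0.75, 0.25) in state TL after listening.
print(augmented_observation(O_tiger, "TL", "L", "GL", (0.75, 0.25), MESSAGES, STATES))
</syntaxhighlight>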
====4.4 Algorithm====
=====4.4.1 Value Iteration=====
As in POMDPs, the value function of a CIPOMDP is represented as a maximum over linear segments called alpha-vectors. Each alpha-vector corresponds to a policy, and each component of an alpha-vector ascribes a value to an interactive state. Value iteration proceeds by backing up alpha-vectors to a higher time horizon, starting from horizon 1.

=====4.4.2 Complexity=====
POMDPs suffer from the curse of dimensionality and the curse of history. Naturally, these curses carry over to IPOMDPs and CIPOMDPs, which require solving nested POMDPs and CIPOMDPs. The curse of history is more prominent in CIPOMDPs because the policy is now conditional on both the observation and the message received. Since computing optimal policies for POMDPs with exact solution methods is proven to be PSPACE-complete for a finite time horizon and undecidable for an infinite time horizon [MHC03], a large amount of work has gone into computing approximate solutions. [PGT03] introduced the point-based value iteration (PBVI) algorithm, which approximates exact value iteration by selecting a fixed set of representative belief points and maintaining only the alpha-vectors that are optimal at those points. Our work builds on interactive point-based value iteration [DP08], which showed an improvement in runtime over other IPOMDP solution methods such as [DG05]. Exact value iteration quickly becomes intractable in POMDPs and IPOMDPs due to the generation of a large number of alpha-vectors, which is exponential in the observation space: |A| |ν^{t+1}|^{|Ω|}, where ν^{t+1} denotes the set of alpha-vectors being backed up from t + 1 to t. In the case of CIPOMDPs, the size explodes further due to the inclusion of the message space in the policy: the exact number of alpha-vectors generated at time t + 1 is |A| |ν^{t+1}|^{|Ω||M|}. To keep the size of the alpha-set tractable in each iteration, we can use the point-based method, which only retains the vectors that are optimal at a fixed set of belief points. As in IPOMDPs, we need to solve the lower-level models to compute the alpha-vectors. Accordingly, we limit the initial models of the other agent to a finite set.

=====4.4.3 IPBVI-Comm=====
The point-based value iteration approach backs up the alpha-vectors optimal at a fixed set of belief points in each iteration, starting from horizon 1. Each iteration consists of three steps, which we describe below; along the way we highlight the differences with IPBVI [DP08]. The belief set for each horizon consists of randomly generated belief points across all interactive states.

'''Step 1.''' The first step involves calculating the intermediate sets of alpha-vectors Γ^{a_i, m_{i,s}, *}, representing the immediate reward for action a_i and message m_{i,s} (Equation 11), and Γ^{a_i, o_i, m_{i,s}, m_{i,r}}, representing the future reward after receiving observation o_i and message m_{i,r} (Equation 12). This step is performed for all actions, observations, and messages. For horizon 1, the computation of the immediate reward is sufficient and is used as the initial alpha-set for subsequent backups. Differently from the point-based algorithm for IPOMDPs, we need to calculate Pr(m_{i,r} | θ_{j,l-1}) and perform the belief update for the other agent j, which now depends on the message sent by i. Due to the one-step delay in the message exchange, the message sent in the current time step allows computing the interactive states in the next time step and backing up the values from the next time step. Assuming sincerity for θ_{j,0}, only the message closest to the belief gets probability 1 − α; all the other messages, including nil, share the probability α. When the message received is other than a sincere message, the level-1 CIPOMDP ignores the message, and the belief update proceeds as in an IPOMDP. For higher-level CIPOMDPs, the probability of the received message is uniformly distributed among all the messages in the OPT(θ_j) set, as defined in Section 3.3. For all a_i ∈ A_i, o_i ∈ Ω_i, m_{i,s} ∈ M, and is ∈ IS:

\Gamma^{a_i, m_{i,s}, *} \leftarrow \alpha^{a_i, m_{i,s}, *}(is) = \sum_{a_j \in A_j} R_i(s, a_i, a_j, m_{i,s}) \, Pr(a_j \mid \theta_{j,l-1}) \qquad (11)

\Gamma^{a_i, o_i, m_{i,s}, m_{i,r}} \leftarrow \alpha^{a_i, o_i, m_{i,s}, m_{i,r}}(is) = \gamma \sum_{is' \in IS'} \sum_{a_j \in A_j} Pr(m_{i,r}, a_j \mid \theta_{j,l-1}) \, T_i(s, a_i, a_j, s') \, O_i(s', a_i, a_j, o_i) \sum_{o_j} O_j(s', a_i, a_j, o_j) \, \delta_D(SE_{\hat{\theta}_j}(b_{j,l-1}, a_j, o_j, m_{j,s}, m_{i,s}) - b'_{j,l-1}) \, \alpha'(is') \qquad (12)

Here, δ_D is the Dirac delta function, taking the difference between the current belief and the updated belief as its argument. The updated belief is returned by the state estimator function SE_{θ̂_j}.

'''Step 2.''' The second step combines the intermediate alpha-vectors calculated in Step 1, weighted by the observation and message likelihoods, using a cross-sum operation. Due to the point-based approach, the cross-sum operation of exact value iteration is simplified. The step proceeds by selecting only those intermediate alpha-vectors that are optimal at at least one of the given belief points:

\Gamma^{a_i, m_{i,s}} \leftarrow \Gamma^{a_i, m_{i,s}, *} \oplus_{o_i \in \Omega_i, m_{i,r} \in M} \arg\max_{\Gamma^{a_i, o_i, m_{i,s}, m_{i,r}}} (\alpha^{a_i, o_i, m_{i,s}, m_{i,r}} \cdot b_{i,l}), \quad \forall b_{i,l} \in B_{i,l} \qquad (13)

Figure 2: (a) Policy tree for a CIPOMDP agent with the friend reward function modeling a sincere and gullible agent. (b) Alpha-vectors for the friend reward function and sensor accuracy 0.85, for the IPOMDP (without communication, in red) and for the CIPOMDP (with communication, in blue). Each vector is labelled with a 3-step policy; for example, one such label means that the agent starts by listening (the list of observations is empty), listens again no matter what it hears, and in the third step performs OL if the previous two observations were GR and GR, but listens again otherwise. Communicative policies additionally include the contents of messages received simultaneously with the growls.

'''Step 3.''' In the final step, the belief points in the set B_{i,l} are used again to select the alpha-vectors for the final set of the current iteration. Since different action-message pairs can be optimal for the modeled agent, we need to include alpha-vectors corresponding to all optimal action-message pairs in the final alpha-set:

\nu^t \leftarrow \arg\max_{\alpha^t \in \cup_{a_i} \Gamma^{a_i, m_{i,s}}} (\alpha^t \cdot b_{i,l}), \quad \forall b_{i,l} \in B_{i,l} \qquad (14)

The recursion bottoms out as a POMDP at level 0, which we assume to be a literal speaker. Since a POMDP policy only computes physical actions, we need to augment the policy with a sincere message.
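The point-based pruning used in Steps 2 and 3 keeps, for each belief point, only the alpha-vector with the highest dot product. The helper below is a generic PBVI-style sketch of that selection, not the full IPBVI-Comm backup (which also enumerates messages and the opponent's models).

<syntaxhighlight lang="python">
import numpy as np

def best_at_points(alpha_vectors, belief_points):
    """Keep only the alpha-vectors that are optimal at at least one belief point."""
    kept = {}
    for b in belief_points:
        values = [float(np.dot(alpha, b)) for alpha in alpha_vectors]
        idx = int(np.argmax(values))
        kept[idx] = alpha_vectors[idx]      # index-keyed to avoid duplicates
    return list(kept.values())

# Toy example with two interactive states and three candidate vectors.
alphas = [np.array([10.0, -100.0]), np.array([-100.0, 10.0]), np.array([-1.0, -1.0])]
B = [np.array([0.95, 0.05]), np.array([0.5, 0.5]), np.array([0.05, 0.95])]
print(len(best_at_points(alphas, B)))       # 3: each vector is optimal somewhere
</syntaxhighlight>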
===5 Experiments and results===
We report results on the multi-agent tiger game, which elegantly models complex interaction with the environment and other agents in sequential decision-making scenarios under uncertainty. The game provides intuitive scenarios for the study of sincerity and deception because the agents can be incentivized to influence the other agent to open the right or the wrong door, while not ignoring uncertainty in the environment.

====5.1 Multi-agent tiger game====
In this version, two agents face two doors: "left" and "right". Behind one door lies a hungry tiger and behind the other is a pot of gold, but the agents do not know the position of either. Thus, the set of states is S = {TL, TR}, indicating the tiger's presence behind the left or the right door. Each agent can open either door. Agents can also independently listen for the presence of the tiger, so the actions are A = {OR, OL, L} for opening the right door, opening the left door, and listening; the action set is the same for both agents. The transition function T specifies that every time either agent opens one of the doors, the state is reset to TR or TL with equal probability, regardless of the action of the other agent. However, if both agents listen, the state remains unchanged. After every action, each agent can hear the tiger's growl coming either from the left, GL, or from the right door, GR. The observation function O (identical for both agents) specifies the accuracy of observations. We assume that the tiger's growls are informative, with a predefined sensor accuracy, only if the agent listens. If the agent opens a door, the growls have an equal chance of coming from the left or the right door and are thus completely uninformative.

Tables 1-4 (shown together below) give the reward functions used in the experiments; each entry is the reward to agent i for the joint action ⟨a_i, a_j⟩ when the tiger is behind the left (TL) or right (TR) door.

{| class="wikitable"
|+ Tables 1-4: Reward functions (Neutral, Friend, Enemy A, Enemy B)
! rowspan="2" | ⟨a_i, a_j⟩ !! colspan="2" | Table 1: Neutral !! colspan="2" | Table 2: Friend !! colspan="2" | Table 3: Enemy A !! colspan="2" | Table 4: Enemy B
|-
! TL !! TR !! TL !! TR !! TL !! TR !! TL !! TR
|-
| OR, L || 10 || -100 || 9.5 || -100.5 || 10.5 || -99.5 || 10 || -100
|-
| OL, L || -100 || 10 || -100.5 || 9.5 || -99.5 || 10.5 || -100 || 10
|-
| L, L || -1 || -1 || -1.5 || -1.5 || -0.5 || -0.5 || -1 || -1
|-
| OR, OL || 10 || -100 || -40 || -95 || 60 || -105 || 10 || -150
|-
| OL, OL || -100 || 10 || -150 || 15 || -50 || 5 || -100 || -40
|-
| L, OL || -1 || -1 || -51 || 4 || 49 || -6 || -1 || -51
|-
| OR, OR || 10 || -100 || 15 || -150 || 5 || -50 || -40 || -100
|-
| OL, OR || -100 || 10 || -95 || -40 || -105 || 60 || -150 || 10
|-
| L, OR || -1 || -1 || 4 || -51 || -6 || 49 || -51 || -1
|}
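For reference, a compact encoding of the dynamics just described and of the neutral reward of Table 1. This is our own sketch of the standard multi-agent tiger dynamics (with the sensor accuracy as a parameter), not code from the paper; it treats growls as informative whenever agent i itself listens.

<syntaxhighlight lang="python">
import random

STATES = ["TL", "TR"]
ACTIONS = ["OR", "OL", "L"]

def transition(s, a_i, a_j):
    """T: if either agent opens a door the tiger is reset uniformly, otherwise unchanged."""
    if a_i != "L" or a_j != "L":
        return random.choice(STATES)
    return s

def growl_probability(s_next, a_i, o_i, accuracy=0.85):
    """O_i: growls carry information only when agent i listens; otherwise uniform."""
    if a_i != "L":
        return 0.5
    correct = {"TL": "GL", "TR": "GR"}[s_next]
    return accuracy if o_i == correct else 1.0 - accuracy

def reward_neutral(s, a_i):
    """Table 1: agent i's reward does not depend on a_j."""
    if a_i == "L":
        return -1
    opened_gold_door = (a_i == "OR" and s == "TL") or (a_i == "OL" and s == "TR")
    return 10 if opened_gold_door else -100
</syntaxhighlight>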
{| class="wikitable"
|+ Table 5: Reward comparison for the CIPOMDP agent against the IPOMDP agent in different scenarios. For the Enemy, the reward function in Table 3 is used.
! rowspan="2" | Nesting level !! rowspan="2" | Agent !! rowspan="2" | Opponent !! colspan="2" | Reward, h=3 !! colspan="2" | Reward, h=4 !! colspan="2" | Reward, h=5
|-
! CIPOMDP !! IPOMDP !! CIPOMDP !! IPOMDP !! CIPOMDP !! IPOMDP
|-
| rowspan="3" | 1 || Neutral || Sincere and Gullible || 3.4 ± 8.92 || 2.84 ± 16.02 || 3.9 ± 8.29 || 2.39 ± 7.95 || 3.5 ± 8.037 || 1.067 ± 20.275
|-
| Enemy || Sincere and Gullible || 46 ± 24.69 || 1.53 ± 18.07 || 66.48 ± 42.15 || 1.07 ± 8.887 || 86.00 ± 36.57 || -1.31 ± 24.79
|-
| Friend || Sincere and Gullible || 5.08 ± 10.91 || 4.15 ± 18.02 || 6.05 ± 9.42 || 3.56 ± 8.86 || 5.71 ± 17.10 || -0.81 ± 23.32
|-
| rowspan="4" | 2 || Neutral || Enemy || 3.39 ± 8.08 || 2.81 ± 16.02 || 3.9 ± 11.40 || 2.39 ± 7.67 || 3.4942 ± 8.94 || 1.55 ± 17.14
|-
| Friend || Friend || 5.08 ± 10.5 || 4.14 ± 18.02 || 6.21 ± 8.56 || 3.67 ± 8.97 || 5.02 ± 10.099 || 3.65 ± 17.94
|-
| Enemy || Enemy || 5.44 ± 10.83 || 1.53 ± 18.07 || 8.99 ± 18.88 || 2.32 ± 15.72 || 10.78 ± 18.09 || 0.5 ± 19.81
|-
| Neutral || Uncertain (Enemy or Friend) || 3.43 ± 7.80 || 1.53 ± 18.07 || 4.19 ± 9.162 || 2.44 ± 7.87 || 3.45 ± 8.33 || 0.82 ± 13.71
|}

=====5.1.1 Reward functions=====
The reward functions are chosen to simulate cooperative and competitive scenarios. Table 1 represents the scenario in which the reward of the agent is independent of the action of the other agent. Table 2 is the friend reward function, where the agent gets half of the reward obtained by the other agent in addition to its own reward. In Table 3, the agent gets half of the negative of the reward obtained by the other agent, which represents the competitive case. In the other competitive reward function, Table 4, the agent gets a -50 reward if the other agent opens the correct door, but there is no extra reward if the other agent opens the wrong door. Table 3 incentivizes the extreme lie, while Table 4 incentivizes a more believable lie.

=====5.1.2 Experimental setup=====
Table 5 shows the total reward collected in the multi-agent tiger game, averaged across 10000 episodes, for a sensor accuracy of 0.85. The message space M is limited to distributions over the physical states only and has been quantized into 5 equally spaced belief points, (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0), plus nil. The value of α is fixed to 0.01. The results show that the CIPOMDP agent outperforms the IPOMDP agent in terms of the average reward collected, due to the sophistication of message exchange. The difference is more prominent when the agent is able to deceive the other agent. The behavior of the agent across multiple scenarios is discussed below.

====5.2 Cooperative Scenarios====
When a level-2 friend i models a friend j, agent i sends a sincere message to j reflecting its belief and further includes the message from the other agent in its belief update. For example, after starting from a uniform belief, if agent i hears GL, it will send the sincere message m_{i,s} = (0.75, 0.25), which assigns probability 0.75 to TL and 0.25 to TR. Figure 2(b) shows the alpha-vector plot corresponding to communicative policies for a CIPOMDP agent with the friend reward function modeling a CIPOMDP (also with a friend reward). The resulting policy tree when the agents are uncertain about the initial location of the tiger is shown in Figure 2(a).

====5.3 Non-cooperative Scenarios====
When a level-1 CIPOMDP agent i models a gullible agent j, it can collect a large reward by sending a deceitful message. For example, after starting from a uniform belief and getting GL, the agent sends the message m_{i,s} = (0, 1), which indicates that agent i is certain the tiger is on the right, opposite to its own observation. When a level-2 enemy i models an enemy j, the sophisticated agent i can prevent itself from being deceived by j and further take advantage of the deceitful message as an extra observation. Also, since the level-2 agent knows that the level-1 CIPOMDP models the other as a sincere agent, the former tends to deceive the latter by sending a deceitful message. When a level-2 neutral agent models an enemy, it has no incentive to deceive but can prevent itself from being deceived by ignoring the message from the other agent. When the nesting level is increased to level 3, the agent tends to send a sincere message with the intention of deceiving the level-2 agent, which is expecting a deceitful message. When the agents are indifferent to each other's actions, i.e., both have the neutral reward function, then communication does not provide any added value and the reward collected is the same as for the IPOMDP agent.

====5.4 Uncertainty about the opponent====
=====5.4.1 Message non-revealing of the agent's type=====
Let us consider a two-agent interaction in the multi-agent tiger game where both agents i and j start with no information about the location of the tiger. The level-2 CIPOMDP agent i is uncertain about the level-1 opponent j's type and thus assigns uniform probability to the other agent being a friend, an enemy, or a random agent. (It could also be the case that the level-2 CIPOMDP is uncertain about the strategic level of the opponent and might want to model a POMDP as well, but to simplify the illustration we stick to these three types of models.) The friend is characterized by the friend reward function (R_f), which incentivizes sincere communication, while the enemy is characterized by the enemy reward function (R_e), which incentivizes deceptive behavior. Also, all other elements in the frames of the CIPOMDP agents at all levels are assumed to be the same.

{| class="wikitable"
|+ Table 6: Different scenarios of belief update for the level-2 CIPOMDP agent i modeling an enemy, a friend, and a random agent (the initial belief is uniform over all the models and the location of the tiger). In all scenarios, agent i sends a nil message and executes the listen action in each time-step. The belief updates are shown in Figure 3.
! Scenario !! Time-step !! Observation !! Message received !! Remarks
|-
| rowspan="2" | 1 || 1 || GL || nil || rowspan="2" | The received message contradicts the agent's consecutive observations of GLs. The other agent is more likely to be an enemy than a friend.
|-
| 2 || GL || (0.25, 0.75)
|-
| rowspan="2" | 2 || 1 || GL || nil || rowspan="2" | The received message agrees with the agent's consecutive observations of GLs. The other agent is more likely to be a friend than an enemy.
|-
| 2 || GL || (0.75, 0.25)
|-
| rowspan="2" | 3 || 1 || GR || nil || rowspan="2" | The agent gets contradictory growls; there is an equal chance that the other agent is an enemy (and the tiger is on the left) or a friend (and the tiger is on the right).
|-
| 2 || GL || (0.25, 0.75)
|-
| rowspan="2" | 4 || 1 || GR || nil || rowspan="2" | The agent gets contradictory growls; there is an equal chance that the other agent is a friend (and the tiger is on the left) or an enemy (and the tiger is on the right).
|-
| 2 || GL || (0.75, 0.25)
|-
| rowspan="2" | 5 || 1 || GL || nil || rowspan="2" | The agent gets a message that is in the OPT set of neither the enemy nor the friend. Hence, it is certain that the other agent is a random agent.
|-
| 2 || GL || nil
|}
'''Behavior of the level-0 agent.''' The POMDP agent at level 0, which starts with a uniform distribution over the physical states of the world TL and TR, would open a door if it received two consecutive growls indicating the same location of the tiger and a message not indicative of the opposing growl. Further, the agent would open a door at any time-step, regardless of its observation, if it got an extreme message indicating the location of the tiger with certainty. Also, being a literal speaker, the agent sends out its true belief at every time-step with probability 1 − α.

'''Behavior of the level-1 agent.''' The less sophisticated level-1 agent θ_{j,1} models the sincere and gullible POMDP agent θ_{i,0} and the random agent SM_i. Let us consider the behavior of the level-1 agent j, which has predicted the behavior of θ_{i,0} for every possible message it can send. The agent θ^e_{j,1} with the enemy reward function (Table 4) has an incentive to deceive the opponent, because every time i opens the correct door, j gets a -50 reward. Suppose the agent θ^e_{j,1} listens at time step t = 0, to be certain of the tiger's location, and then sends a message at time step t = 1 indicating a growl from the location opposite to what it heard. If θ^e_{j,1} got GL, it would send the message (0.25, 0.75). On the other hand, if the level-1 agent had the friend reward function (Table 2), it would get a share of i's reward. Let us call this agent θ^f_{j,1}. θ^f_{j,1} would want the other agent i to open the correct door, and hence it sends the sincere message, indicative of its belief updated based on its observation. For an observation of GR, it would send the message (0.25, 0.75) at t = 1.

'''Belief update for the level-2 agent.''' The level-2 CIPOMDP i uses IPBVI-Comm to solve for the anticipated behavior of the modeled agents θ^f_{j,1} and θ^e_{j,1}. Now let us see how the belief update proceeds for i. Note that the optimal messages from the level-1 agents are not indicative of the agent type. For example, if the enemy received GL, it would send the message (0.25, 0.75), and if the friend received GR, it would again send the message (0.25, 0.75). We study the scenario in which the message itself is indicative of the agent type in the next section. It turns out that agent i is still able to assign a higher probability to one agent type than to the other based on the observations it received. For example, if the observations and messages received by i in two time-steps are ⟨GL, nil⟩ and ⟨GL, (0.25, 0.75)⟩, as shown in Table 6, scenario 1, the other agent is more likely to be an enemy than a friend because i's observations contradict the received message. The belief updates for the other scenarios are shown in Figure 3. The description of each scenario is provided in Table 6.

Figure 3: Belief updates for the specific scenarios for the level-2 CIPOMDP agent i modeling an enemy, a friend, and a random agent. Each row represents a scenario from Table 6 and each column represents a time-step. In each scenario, agent i has a uniform initial belief over the state and the frame of the other agent j. Each belief update is due to the physical action, message sent, observation, and message received, which can be referenced from Table 6.
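The qualitative effect in scenarios 1 and 2 of Table 6 can be reproduced with a small Bayesian calculation over the opponent's type. The sketch below is our own simplified illustration, not the IPBVI-Comm computation: it assumes both agents listen with 0.85 accuracy, that the friend reports its one-growl belief, that the enemy reports the reverse of it, and that the random agent picks any of the six messages uniformly; the uninformative nil message of the first time-step is omitted.

<syntaxhighlight lang="python">
from itertools import product

STATES = ["TL", "TR"]
MESSAGES = ["nil", (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
ACC = 0.85                      # assumed sensor accuracy for both agents

def p_obs(o, s):
    return ACC if (o, s) in (("GL", "TL"), ("GR", "TR")) else 1.0 - ACC

def p_message(m, typ, s):
    """Assumed likelihood of the step-2 message, marginalizing over j's own growl."""
    if typ == "random":
        return 1.0 / len(MESSAGES)
    total = 0.0
    for o_j in ("GL", "GR"):
        honest = (0.75, 0.25) if o_j == "GL" else (0.25, 0.75)
        sent = honest if typ == "friend" else (honest[1], honest[0])   # enemy reverses
        total += p_obs(o_j, s) * (1.0 if m == sent else 0.0)
    return total

def type_posterior(observations, message):
    """P(type | i's growls, received message) under a uniform prior over type and state."""
    joint = {}
    for typ, s in product(("friend", "enemy", "random"), STATES):
        lik = 1.0
        for o in observations:
            lik *= p_obs(o, s)
        joint[(typ, s)] = (1.0 / 6.0) * lik * p_message(message, typ, s)
    z = sum(joint.values())
    post = {}
    for (typ, s), p in joint.items():
        post[typ] = post.get(typ, 0.0) + p / z
    return post

# Scenario 1: i hears GL twice and receives (0.25, 0.75) -> enemy becomes most likely.
print(type_posterior(["GL", "GL"], (0.25, 0.75)))
# Scenario 2: i hears GL twice and receives (0.75, 0.25) -> friend becomes most likely.
print(type_posterior(["GL", "GL"], (0.75, 0.25)))
</syntaxhighlight>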
=====5.4.2 Message revealing the agent's type=====
Again we consider the scenario in which there is uncertainty about the opponent type, i.e., when a level-2 neutral agent i models both an enemy and a friend. The only difference now is that the agent modeled at level 1, θ_{j,1}, has the reward function that incentivizes the extreme lie (Table 4). In this case, the higher-level agent θ_{i,2} can figure out whether the other agent j is a friend or an enemy based on the message content alone. The CIPOMDP agent θ_{i,2} incorporates the message from j as a sincere message if it is close to the belief of the modeled agent θ_{j,1}, and discards the message as insincere if the incoming message is far from the belief of the modeled agent. For example, if the belief of the modeled agent is (0.85, 0.15), then (0.75, 0.25) is considered a sincere message, while (1, 0) is considered an insincere message. Suppose θ_{i,2} starts by listening at t = 0. After receiving GL and the message nil, the belief shifts towards the physical state TL, but the belief remains equally distributed among both frames. At t = 1, after receiving the message (0.75, 0.25) and the observation GL, the belief concentrates on the tiger being on the left and the other agent being a friend. The agent is able to tell the friend from the enemy through the calculation of the sincerity term P(m_{i,r} | θ_j): when θ_j = (b_j, R_f), P(m_{i,r} | θ_j) = 1, and when θ_j = (b_j, R_e), P(m_{i,r} | θ_j) = 0. This happens because, for the level-1 CIPOMDP with the enemy reward function, the optimal action is to lie to the extreme, which would make θ_{i,0} open the door intended by the deceiver θ_{j,1}, disregarding its own observation. Such a message reveals j as an enemy to the higher-level agent i.

====5.5 Algorithm Performance====
Since the point-based algorithm only backs up alpha-vectors optimal at a fixed set of belief points, the performance depends on the number of belief points chosen. Figure 4 shows the comparison of the expected reward for different numbers of belief points.

Figure 4: The performance profile of IPBVI-Comm for the level-2 (left) and level-1 (right) CIPOMDP agent. The increase in expected reward is due to the improvement in policy after increasing the number of belief points. The figure also shows the comparison with exact value iteration.

===6 Conclusion===
We started by devising a technique to incorporate messages into the POMDP belief update in order to allow an agent that does not model other agents to take part in the exchange of (literal) messages. We then formalized the notions of sincerity and deception in terms of the belief of the agent and the messages in the message space. We adapted a point-based solution method to CIPOMDPs to alleviate the complexities of considering communication as well as observations and physical actions. The analysis of the computed policies shows that the added sophistication of communication results in policies of superior quality, which is further supported by the empirical results of several experiments conducted on the multi-agent tiger game. More importantly, we showed how higher levels of theory of mind may allow an agent to disambiguate a sincere friend from a deceitful foe based on the received message and the observations from the environment. In future work, we want to explore higher depths of nesting and a more relaxed soft-maximization criterion for action selection, which can give rise to richer rational communicative behavior that agents can engage in. We are also considering online planning methods like Monte Carlo tree search for computing the policy, which provide scalability and, with some variation, could accommodate a continuous message space.

===References===
[AASN+20] Md Ali Reza Al Amin, Sachin Shetty, Laurent L. Njilla, Deepak K. Tosh, and Charles A. Kamhoua. Dynamic Cyber Deception Using Partially Observable Monte-Carlo Planning Framework, chapter 14, pages 331-355. John Wiley & Sons, Ltd, 2020.

[AdR18] Saul Albert and J. P. de Ruiter. Repair: The interface between interaction and cognition. Topics in Cognitive Science, 10(2):279-313, 2018.

[BAG+05] P. Beautement, David H. Allsopp, M. Greaves, Steve Goldsmith, S. Spires, S. Thompson, and H. Janicke. Autonomous agents and multi-agent systems (AAMAS) for the military - issues and challenges. In DAMAS, 2005.

[BI11] Will Bridewell and Alistair Isaac. Recognizing deception: A model of dynamic belief attribution. In AAAI Fall Symposium: Advances in Cognitive Systems, 2011.

[BTK08] Cynthia Breazeal, Atsuo Takanishi, and Tetsunori Kobayashi. Social Robots that Interact with People, pages 1349-1369. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[CT16] Crystal Chao and Andrea Thomaz.
Timed petri nets for fluent turn-taking over multimodal in- teraction resources in human-robot collaboration. The International Journal of Robotics Research, 35(11):1330–1353, 2016. [DA16] S. Devin and R. Alami. An implemented theory of mind to improve human-robot shared plans execution. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 319–326, 2016. [DCA17] Sandra Devin, Aurélie Clodic, and Rachid Alami. About decisions during human-robot shared plan achievement: Who should act and how? In ICSR, 2017. [DG05] Prashant Doshi and Piotr Gmytrasiewicz. Approximating state estimation in multiagent settings using particle filters. In Proceeding of AAMAS 2005, 2005. [DP08] Prashant Doshi and Dennis Perez. Generalized point based value iteration for interactive pomdps. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 1, AAAI’08, page 63–68. AAAI Press, 2008. [DWW+ 15] Xiao Pan Ding, Henry M. Wellman, Yu Wang, Genyue Fu, and Kang Lee. Theory-of-mind training causes honest young children to lie. Psychological Science, 26(11):1812–1821, 2015. PMID: 26431737. [EG19] Piotr Evdokimov and Umberto Garfagnini. Communication and behavior in organizations: An experiment. Quantitative Economics, 10(2):775–801, 2019. [EJ10] David Ettinger and Philippe Jehiel. A theory of deception. American Economic Journal: Microe- conomics, 2(1):1–20, February 2010. [FAdFW16] Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to com- municate to solve riddles with deep distributed recurrent q-networks, 2016. [FSH+ 19] Jakob N. Foerster, H. Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew M Botvinick, and Michael H. Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. ArXiv, abs/1811.01458, 2019. [GA14] Matthias Gamer and Wolfgang Ambach. Deception research today. Frontiers in Psychology, 5:256, 2014. [GD00] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational coordination in multi-agent environments. Autonomous Agents and Multiagent Systems Journal, 3(4):319–350, 2000. [GD05] Piotr Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multiagent set- tings. Journal of Artificial Intelligence Research, 24:49–79, 2005. http://jair.org/contents/v24.html. 13 [GL20] Andrea L Guzman and Seth C Lewis. Artificial intelligence and communication: A human–machine communication research agenda. New Media & Society, 22(1):70–86, 2020. [Gmy20] Piotr J. Gmytrasiewicz. How to do things with words: A bayesian approach. J. Artif. Intell. Res., 68:753–776, 2020. [Gne05] U. Gneezy. Deception: The role of consequences. The American Economic Review, 95:384–394, 2005. [GYPC20] J. George, C. T. Yilmaz, A. Parayil, and A. Chakrabortty. A model-free approach to distributed transmit beamforming. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5170–5174, 2020. [IB17] Alistair Isaac and Will Bridewell. White lies on silver tongues: Why robots need to deceive (and how). pages 157–172, 10 2017. [KGB17] A. Kurakin, Ian J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. ArXiv, abs/1607.02533, 2017. [MHC03] Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5 – 34, 2003. Planning with Uncertainty and Incomplete Information. [NMT13] A. Nayyar, A. Mahajan, and D. Teneketzis. 
Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644– 1658, 2013. [NPY+ 04] Ranjit Nair, David Pynadath, Makoto Yokoo, Milind Tambe, and Stacy Marsella. Communica- tion for improving policy computation in distributed pomdps. In Proceedings of the Agents and Autonomous Multiagent Systems (AAMAS), 2004. [NRYT03] Ranjit Nair, Maayan Roth, Makoto Yokoo, and Milind Tambe. Taming decentralized pomdps: Towards efficient policy computation for multiagent settings. In Proceedings of the Eighteenth In- ternational Joint Conference on Artificial Intelligence, 2003. [OSV07] Frans Olihoek, Mathijs Spaan, and Nikos Vlassis. Dec-pomdps with delayed communication. In Proceedings of MSDM 2007 May 15, 2007, Honolulu, Hawai’i, USA, 2007. [OSV19] Lauren A Oey, Adena Schachner, and Edward Vul. Designing good deception: Recursive theory of mind in lying and lie detection, May 2019. [PGT03] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime al- gorithm for pomdps. In Proceedings of the 18th International Joint Conference on Artificial Intel- ligence, IJCAI’03, page 1025–1030, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc. [RMG15] J. Renoux, A. Mouaddib, and S. L. Gloannec. A decision-theoretic planning approach for multi- robot exploration and event search. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5287–5293, 2015. [RO98] Nancy K. Ratner and Rose R. Olver. Reading a tale of deception, learning a theory of mind? Early Childhood Research Quarterly, 13(2):219 – 239, 1998. [SDS20] Aditya Shinde, Prashant Doshi, and Omid Setayeshfar. Active deception using factored interactive pomdps to recognize cyber attacker’s intent, 2020. [Son78] Edward J. Sondik. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Oper. Res., 26(2):282–304, April 1978. [SPB+ 19] S. Sarkadi, Alison R. Panisson, Rafael Heitor Bordini, P. McBurney, S. Parsons, and M. Chapman. Modelling deception using theory of mind in multi-agent systems. AI Commun., 32:287–302, 2019. 14 [SsF16] Sainbayar Sukhbaatar, arthur szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2244–2252. Curran Associates, Inc., 2016. [SSW+ 17] K. Shu, A. Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. SIGKDD Explor., 19:22–36, 2017. [STHP91] Beate Sodian, Catherine Taylor, Paul L. Harris, and Josef Perner. Early deception and the child’s theory of mind: False trails and genuine markers. Child Development, 62(3):468–483, 1991. [STX+ 18] Aaron Schlenker, Omkar Thakoor, Haifeng Xu, Fei Fang, Milind Tambe, Long Tran-Thanh, Phebe Vayanos, and Yevgeniy Vorobeychik. Deceiving cyber adversaries: A game theoretic approach. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, page 892–900, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems. [ULS20] Vaibhav V. Unhelkar, Shen Li, and Julie A. Shah. Decision-making for bidirectional communication in sequential human-robot collaborative tasks. In Proceedings of the 2020 ACM/IEEE Interna- tional Conference on Human-Robot Interaction, HRI ’20, page 329–341, New York, NY, USA, 2020. 
Association for Computing Machinery. [WPH16] Ning Wang, David V. Pynadath, and Susan G. Hill. Trust calibration within a human-robot team: Comparing automatically generated explanations. In The Eleventh ACM/IEEE International Con- ference on Human Robot Interaction, HRI ’16, page 109–116. IEEE Press, 2016. [YFS+ 20] Luyao Yuan, Zipeng Fu, Jingyue Shen, Lu Xu, Junhong Shen, and Song-Chun Zhu. Emergence of Pragmatics from Referential Game between Theory of Mind Agents. arXiv e-prints, page arXiv:2001.07752, Jan 2020. [ZVlY04] Yu Zhang, Richard A. Volz, Thomas R. loerger, and John Yen. A decision-theoretic approach for designing proactive communication in multi-agent teamwork. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC ’04, page 64–71, New York, NY, USA, 2004. Association for Computing Machinery. 15