Introduction

Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception

Sarit Adhikari

sadhik6@uic.edu 0

Piotr J. Gmytrasiewicz

piotr@uic.edu 0 0 University of Illinois at Chicago , Chicago, IL 60607 , USA

In Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs), agents use Bayes update to process their observations and messages without the usual assumption of cooperative discourse. We formalize the notion of sincerity and deception in terms of message in message space and belief of the agent sending the message. We then use a variant of the point-based value iteration method called IPBVI-Comm to study the optimal interactive behavior of agents in cooperative and competitive variants of the Tiger game. Modeling the belief and preferences of the opponent allows an agent to predict their optimal communicative behavior and physical action. Unsurprisingly, it may be optimal for agents to attempt to mislead others if their preferences are not aligned. But it turns out the higher depth of reasoning allows an agent to detect insincere communication and to guard against it. Speci cally, in some scenarios, the agent can distinguish a truthful friend from a deceptive foe when the message received contradicts the agent's observations, even when the received message does not directly reveal the opponent type.

Introduction

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) In: R. Falcone, J. Zhang, and D. Wang (eds.): Proceedings of the 22nd International Workshop on Trust in Agent Societies, London, UK on May 3-7, 2021, published at http://ceur-ws.org align, then they bene t from sincere communication. Like physical actions, communicative actions are guided by the expected utility obtained in a particular state. Agents sometimes bene t from being sincere and sometimes it is in their best interest to deceive. To be able to cooperate, deceive or guard against deception, the agent needs to model the belief, intention, and preference of the other agent [BI11].

Humans are not very adept in detecting lies; we nd it di cult to tell a skillful liar from a truthful, but possibly misinformed, friend. The main reason is that both may behave the same. We show examples of interactions during which agents, using Bayes update, can form informative beliefs about the identity of their opponents by keeping careful track of their possible states of knowledge, and by comparing the content of their communications to the agent's own observations. It is important to note that analyzing communicative behavior as separate from physical actions is bound to be insu cient, unless agents care directly about states of other agents' beliefs, as opposed to caring only about physical states of the world. In that latter, arguably more realistic, case, rational agents will attempt to modify others' beliefs only because this could induce them into advantageous physical actions - this could not be analyzed if one abstracts physical actions away.

Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs) provide a principled framework for rational interaction and communication in a multi-agent environments [Gmy20]. CIPOMDP framework is an extension of interactive POMDP [GD05] to include the exchange of messages among the agents. IPOMDP, in turn, extends Partially Observable Markov Decision Process (POMDP) [Son78] to include other agents by incorporating their models into its state space. Thus, while POMDPs provide a theoretically sound framework to model uncertainty about the state of the world, the Theory of Mind (ToM) approach of IPOMDPs and CIPOMDPs allows modeling of uncertainty about other agents, including their beliefs about the world and other agents. Figure 1 shows the theory of mind (ToM) of the decision-making agent which is uncertain about the type of another agent and how it may model others including the original agent. At the bottom of the hierarchy of models could be a random agent, or a rational agent that does not model others, i.e., a classical POMDP agent. A great deal of research in psychology establishes a connection of deception to the recursive theory of mind reasoning, which starts at an early age in humans [STHP91, RO98, DWW+15]. More recently, [OSV19] provides a comprehensive quantitative analysis of the role of rationality and theory of mind in deception and detecting deception.

We rst formalize the de nition of sincerity and deception in terms of the message sent by the agent and its belief about the state of the world. We then propose an o ine solution method for communicative IPOMDPs, which adopts a point-based solution technique to solve for the optimal communicative behavior. The algorithm computes the policy to nd the optimal action-message pair at each time-step of interaction with the other agent. The executed action may not be known to other agents but we assume that the message sent is received by the other agent with certainty. The subsequent policies are conditional on not only observation but also message received from the other agent. Based on the preference of the agent and what is known about the preferences of the modeled agent, the agent may bene t from incorporating the message from another agent into its belief, but discounting it if it thinks the other agent has an incentive to lie. Since CIPOMDP agent needs message space to calculate the policy o ine and messages express agent's beliefs, we construct the message space by discretizing the interactive belief space. The policy of POMDP agent on the bottom of a ToM hierarchy is augmented with a sincere messages it can send using a look-ahead reachability tree from the initial belief in the interactive state of the modeling agent. Similarly, we propose a way for an agent modeled as a POMDP to receive messages by augmenting its observation space. We apply CIPOMDPs to agents interacting in the multi-agent tiger game and we show that communication is valuable to agents and results in superior policies compared to its no communication counterpart. In cooperative scenarios, the agent can take advantage of messages from a sincere agent as additional observations, and can send sincere messages that inform the other agent. In competitive scenarios, the agent not only attempts to deceive the other agent but also ignores the message it knows to be deceitful. We then show how Bayesian update allows an agent higher in a cognitive hierarchy to tell a friend from foe based on the message received and its own observation. 2

Related work

The problem of agents communicating and interacting simultaneously has been addressed in several decisiontheoretic as well as RL settings. [FAdFW16] uses DDRQN to learn communication protocol and solve a riddle by coordination. [FSH+19] combines multiagent RL with a bayesian update to compute communication protocols and policies in cooperative, partially observable multi-agent settings. [SsF16] uses a neural model that learns communication along with the policy. In the planning and control setting, [NMT13] uses communication for controllers to share part of their observation and control history at each step. More recently, [ULS20] used POMDP with communication for human-robot collaboration task. Other work in HRI include [CT16], [DCA17] and [WPH16] . [DA16] uses a theory of mind approach for the execution of a shared plan.

Communication has been studied extensively in other multi-agent decision theoretic frameworks [NRYT03], [NPY+04], [ZVlY04], [OSV07] In [RMG15], agents use extended belief state that contain approximation of other agents' beliefs. But these works assume fully cooperative interactions and mostly involve central planning. CIPOMDPs, on the other hand, provide subjective theory of mind reasoning during communication, and follows Bayesian approaches to pragmatics.

Deception has been widely studied across multiple disciplines including game theory [STX+18], [EJ10], psychology [GA14] and economics [Gne05]. When it comes to a sequential decision process to study attacker's and defender's approaches in cybersecurity research , decision theoretic framework of POMDP [AASN+20] and IPOMDP [SDS20] has been used. [SPB+19] combines ToM with components from deception theory and implements an epistemic agent using Agent-Oriented Programming Language. 3 3.1

Background Communicative Interactive POMDPs

CIPOMDP [Gmy20] is the rst general framework for an autonomous self-interested agent to communicate and interact with other agents in the environment based on Bayesian decision theory. A nitely nested communicative interactive POMDP of agent i in an environment with agent j, is de ned as:

CIP OM DPi = hISi;l; Ai; M; i; Ti; Oi; Rii (1) where ISi;l is a set of interactive states, de ned as ISi;l = S Mj;k; l 1, where S is the set of physical states and Mj;k is the set of possible models of agent j, l is the strategy (nesting) level, and k < l. The set of possible models Mj;k consists of intentional models, j , or sub-intentional ones, SMj . While the intentional models ascribe beliefs, preferences and rationality in action selection to the modeled agent, the sub-intentional models do not. We consider kth (less than l) level intentional models of agent j de ned as j;k = hbj;k; Aj ; j ; Tj ; Oj ; Rj i, where bj;k is agent j's belief nested to the level k, bj;k 2 (ISj;k). The intentional model j;k, is sometimes called type, can be rewritten as j;k = hbj;k; ^j i, where ^j includes all elements of the intentional model other than the belief and is called the agent j's frame. Among the classes of sub-intentional models, we consider no-information model [GD00] which randomly selects action to execute and message to send in each time-step. The random models are possible in each level starting with level-0.

In contrast to classical POMDPs and similar to IPOMDPs, the transition, observation and reward functions in CIPOMDPs take actions of other agents into account. A = Ai Aj is the set of joint actions of all agents, i is the set of agent i's possible observations, Ti : S A S ! [0; 1] is the state transition function, Oi : S A i ! [0; 1] is the observation function, Ri : S A ! R is the reward function.

The ISi;l can be de ned inductively

ISi;0 = S; ISi;1 = S

Mj;0; :::::: ISi;l = S

Mj;k; l 1 k=0 j;0 = fhbj;0; ^j i : bj;0 2

(S)g Mj;0 =

j;0 [ SMj Mj;1 =

j;1 [ SMj j;1 = fhbj;1; ^j i : bj;1 2

(ISj;1)g j;l = fhbj;l; ^j i : bj;l 2

(ISj;l)g Mj;l =

j;l [ SMj

The above de nes the 0-level model, j;0 as having beliefs only over the physical state space, S. The level 1 agent model maintains beliefs over the physical states and 0-level models of the opponent. A level l agent, j;l, maintains beliefs over S and over models of the opponent nested up to l 1.

M is a set of messages the agents can send to and receive from each other , i.e., it is a communication language the agents share. Since agents' beliefs are probability distributions and communication is intended to share beliefs, it is natural to interpret a message in M as a marginal probability distribution over a subset of variables in the agents' interactive state spaces ISi and ISj , which overlap. That way M is a set of probabilistic statements about the interactive state space. The message nil, i.e., silence, contains no variables. Note that we do not assume that messages re ect agents' actual beliefs. We will further discretize M below. 3.2

Belief update in CIPOMDPs

Belief update in CIPOMDPs is analogous to belief update in IPOMDPs when it comes to actions and observations. At any particular time step agents i and j can not only perform physical actions and observe but also send and receive messages. Call the message i sent at time t 1 mit;s1, and one i received at time t mit;r, and analogously for j. We assume all messages are in M and that message transmission is perfect. We provide precise de nition of message space in our formulation in section 4.1. The belief update in CIPOMDPs has to update the probability of interactive state given the previous belief, action and observation, and given the message sent (at the previous time step) and received (at the current time): P (istjbit 1; ait 1; mit;s1; oit; mit;r): bit(ist) = P (istjbit 1; ait 1; mit;s1; oit; mit;r) =

X bit 1(ist 1) X

P (mtj;s1; atj 1j jt 1) ist 1 at 1 j Oi(st; at 1; oit)Ti(st 1; at 1; st) X

jt (btj 1; atj 1; mtj;s1; otj ; mtj;r; btj )Oj (st; at 1; otj ) ot j The term P (mtj;s1; atj 1 t 1) quanti es the relation between the message i received from j and the model, j j j , of agent j that generated the message.1 This term is the measure of j0s sincerity, i.e., whether the message j sent re ects j's beliefs which are part of the model j . is the normalizing constant. 3.3

Planning in CIPOMDPs

The utility of interactive belief of agent i, contained in i's type i, is: (2) (3) Ui( i) = max(mi;s;ai) n X bis(s)ERi(is; mi;s; ai)

is2IS +

X (mi;r;oi)

P (mi;r; oijbi; ai)Ui(hSE i (bi; ai; mi;s; oi; mi;r); ^ii) o 1Note that mtj;s1 = mit;r because message transmission is assumed to be perfect.

ERi(is; mi;s; ai) above is the immediate reward to i for sending mi;s and executing action ai given the interactive state is and is equal to aj Ri(is; ai; aj ; mi;s)P (aj j j ).

The planning in CIPOMDP makes use of equation 3 , which is based on the Bellman optimality principle. The policy computes optimal action, message pair which results in a maximum expected reward. Consequently, value iteration in CIPOMDP is analogous to that in IPOMDP and POMDP. The set of optimal message-action pairs, (mi;s; ai ) is obtained by replacing max in equation 3 with argmax. We call the resulting set of optimal message-action pairs OP T ( i). When agent i models agent j as a strict optimizer, i predicts j would choose action-message pair in OP T set with equal probability:

P (mj;s; aj j j ) =

1 jOP T ( j )j and that P (mj;s; aj j j ) is equal to zero if (mj;s; aj ) is not in OP T . The possibility that agents may be less than optimal is considered in [Gmy20]

Being able to compute the probabilities of messages given the preferences and beliefs of a speaker is of crucial importance when sincerity is not guaranteed. The belief update in CIPOMDPs provides the principled way to discount content of messages that may be insincere because it is in the interest of the speaker to transmit them. We give an example of this further below.

Msincere = [

arg min kbtX X2}(X) m2MX

mk Minsincere = [

MX arg min kbtX m2MX mk

Insincere(deceptive) message can be de ned as any message in message space except the one closest to true belief bt of the agent at time t. Thus the set of insincere messages is given by

X2}(X) 2For example, if the physical state space, S, is factored into variables X and Y , a message, m, might be "P (0 X 100) = 0:7". We de ne message space, M, as a set of marginal probability distributions over a subset of variables in the interactive states of the agent. Since the belief space is continuous we make the computation more tractable by quantizing M into nite set of belief points. The message space is augmented with nil, which is analogous to no-op operation in the physical action set. Limiting message space to only nil message reduces CIPOMDP to IPOMDP. Usually, a message contains information about only the subset of possible interactive states.2 The variables that the message doesn't mention are interpreted as being marginalized. We use the principle of indi erence and assume that probability is uniformly distributed among the remaining variables.

The message received mi;r can provide a mapping from physical state to belief marginalizing other variables of interactive state (belief of another agent, frame, etc). Then mi;r(s) denotes belief the message ascribes to physical state s 2 S. Further, message may only contain probabilities of subset of values that the variables mentioned in message can take. Let S0 be the set of states not mentioned in the message. If a message m does not mention state s' then

Sincerity and Deception

m(s0) = 1 8s0 2 S0

Ps2S S0 m(s) jSj jS0j Let X denote a set of variables describing the interactive state of the agent. }(X) is a set of all non-empty subsets of X. The joint belief bt of an agent can be marginalized over the subset of variables in X. Let bX represent belief marginalized over variables in X. Accordingly message space M can be factored into the sets of messages marginalized over each of X 2 }(X).

Sincere message can be de ned as a message in message space m 2 MX which is closest to the marginalized belief btX consistent with the true joint belief bt of the agent at time t. The distance is de ned in terms of the L1 norm. Thus the set of sincere messages is given by (4) (5) (6) (7) The recursion in CIPOMDP bottoms out as a at POMDP which does not have a model of the other agent. We use the de nition of literal speaker 3 from [Gmy20]. A literal listener can incorporate the incoming message as additional observation, which we describe in the following section. These assumptions enable an agent that does not model the other agent to participate in the exchange of messages. We propose POMDP ( 0) can receive the message and include it in its belief update, by augmenting its observation space and consequently observation function. Observation space now becomes a Cartesian product of usual observation space and message space.

The joint probability of observation and message received is obtained by combining the likelihood function for a physical observation with message distribution. The message distribution is represented by a triangular distribution with the idea that the likelihood of a message re ecting belief about the world state should increase monotonically as the belief. For e.g. if the belief is de ned over two states (s1; s2), and the true state of the world is s1, the likelihood of received message re ecting belief of (1,0) should be higher than the likelihood of received message re ecting belief of (0.5, 0.5). This formulation makes the agent gullible.

Given the state of the world, observation is conditionally independent of the message received, then the augmented observation function can be de ned as

P (mi;rjs) = 1 jSj (mi;r(s)) if mi;r = nil otherwise 8mi;r2M and 8o2

and 8s2S O0(s; a; o; mi;r) =

P (o; mi;rjs; a)

Po;mi;r P (o; mi;rjs; a) = Po PP((oojjss;;aa))PP(mmi;ir;rPjs(;ma)i;rjs)

O(s; a; o)P (mi;rjs) = Pmi;r P (mi;rjs) (8) (9) (10) 4.4 4.4.1

Algorithm Value Iteration

As in POMDPs, the value function of CIPOMDPs is represented in terms of max over linear segments called alpha-vectors. Each alpha vector corresponds to a policy and each component of alpha vector ascribes value to an interactive state. Value iteration proceeds by backing up alpha-vectors to a higher time horizon starting from horizon 1. 4.4.2

Complexity

POMDPs su er from the curse of dimensionality and the curse of history. Naturally, These curses are carried over to IPOMDPs and CIPOMDPs, which require solving of nested POMDPs and CIPOMDPs. The curse of history is more prominent in CIPOMDPs because the policy is now conditional on both observation and message received. Since, computing optimal policies for POMDPs by exact solution methods are proven to be PSPACE-complete for nite time horizon and undecidable for an in nite time horizon [MHC03], a large amount of work has been done in computing approximate solution. [PGT03] introduced a point-based value iteration 3literal speaker generates a message re ecting its true belief about the physical states of the world bt with probability 1 and all other messages including 'nil' with probability jMj 1 . The messages are then broadcasted to "no one in particular" (NOIP), and do not take part in belief update for POMDP agent (PBVI) algorithm to approximate exact value iteration by selecting a xed set of representative belief points and maintaining alpha vectors that are optimal at those points only. Our work builds on the interactive point-based value iteration [DP08] which showed improvement in runtime over other IPOMDP solution methods like [DG05]. Exact value iteration quickly becomes intractable in PODMPs and IPOMDPs due to generation of large number of alpha-vectors which is exponential in observation space jAjj t+1jj j, where t+1 denote set of alpha vectors being backed-up from t + 1 to t. In the case of CIPOMDP, the size is further exploded due to the inclusion of message space in the policy. The exact number of alpha-vectors generated at time t + 1 will be jAjj t+1jj jjMj . To keep the size of the alpha-set in each iteration tractable, we can use the point-based method, which only retains the vectors which are optimal at the xed set of belief points. As in IPOMDP, we need to solve the lower-level model to compute the alpha vectors. Accordingly, we limit the initial model of other agents to a nite set. 4.4.3

IPBVI-Comm

The point-based value iteration approach backs up the alpha-vectors optimal at a xed set of belief points in each iteration starting from horizon 1. Each iteration consists of three steps, which we describe below, and along the way we highlight the di erence with IPBVI [DP08]. The belief set for each horizon consists of randomly generated belief points across all interactive states.

Step 1

The rst step involves calculating the intermediate set of alpha vectors ai;mi;s; representing immediate reward for action ai and message mi;s (equation 11), and ai;oi;mi;s;mi;r representing future reward after receiving observation oi and message mi;r (equation 12). The step is performed for all actions, observations and messages. For horizon 1, computation of immediate reward is su cient and will be used as initial alpha set for subsequent backups.

Di erent from point-based algorithm for IPOMDPs, we need to calculate P r(mi;rj j;l 1) and perform belief update for the other agent j which now depends on message sent by i. Due to one-step delay in message exchange, the message sent in the current time step allows computing interactive states in next time step and backup the values from next time step. Assuming sincerity for j;0, only the message closest to belief will get probability 1- . All the other messages, including 'nil', will share the probability . When a message received is other than a sincere message, level-1 CIPOMDP ignores the message, and belief update proceeds as IPOMDP. For higher-level CIPOMDPs, the probability of message received is uniformly distributed among all the messages in OP T ( j ) set, as de ned in section 3.3.

8ai 2 Ai; 8oi 2 i; 8mi;s 2 M; 8is 2 IS ai;mi;s; ai;mi;s; (is) =

Ri(s; ai; aj ; mi;s)P r(aj j j;l 1) ai;oi;mi;s;mi;r ai;oi;mi;s;mi;r (is) =

P r(mi;r; aj j j;l 1)Ti(s; ai; aj ; s0) Oi(s0; ai; aj ; oi) X Oj (s0; ai; aj ; oj ) D(SE ^j (bj;l 1; aj ; oj ; mj;s; mi;s) oj b0j;l 1) 0 (is0)

Here, D is the Dirac delta function taking the current belief and updated belief as an argument. The updated belief is returned by state estimator function SE ^ .

Step 2

The second step involves combining intermediate alpha vectors calculated in step 1 weighted by the observation and message likelihood using a cross sum operation. Due to the point-based approach, the cross sum operation in exact value iteration is simpli ed. The step proceeds by selecting only those intermediate alpha vectors which are optimal at any of the given set of belief points.

ai;mi;s ai;mi;s;

arg max oi2 i;mi;r2M ai;oi;mi;s;mi;r

( ai;oi;mi;s;mi;r :bi;l) X aj2Aj X

X is02IS0 aj2Aj 8bi;l 2 Bi;l (11) (12) (13) (a) (b) In the nal step, the belief points in set Bi;l are used again to select the alpha vectors for the nal set for the current iteration. Since di erent action, message pairs can be optimal for the modeled agent, we need to include alpha vectors corresponding to all optimal action, message pairs in the nal alpha set.

t arg max ( t:bi;l) t2[ai ai;mi We report the results on the multi-agent tiger game, which elegantly models complex interaction with the environment and other agents in sequential decision-making scenarios under uncertainty. The game provides intuitive scenarios for the study of sincerity and deception because the agents can be incentivized to in uence another agent to open right or wrong door while not ignoring uncertainty in the environment. 5.1

Multi-agent tiger game

In this version, two agents are facing two doors: \left" and \right". Behind one door lies a hungry tiger and behind the other is a pot of gold but the agents do not know the position of either. Thus, the set of states is S = fT L; T Rg indicating the tiger's presence behind the left, or right, door. Each agent can open either door. Agents can also independently listen for the presence of the tiger, so the actions are A = fOR; OL; Lg for opening the right door, opening the left door, and listening and is the same for both agents. The transition function T , speci es that every time either agent opens one of the doors, the state is reset to T R or T L with equal probability, regardless of the action of the other agent. However, if both agents listen, the state remains unchanged. After every action, each agent can hear the tiger's growl coming either from the left, GL, or from the right door, GR. The observation function O (identical for both agents) speci es the accuracy of observations. We assume that tiger's growls are informative, with prede ned sensor accuracy, only if the agents listen. If the agent opens the doors the growls have an equal chance to come from the left or right door and are thus completely uninformative. The reward functions are chosen to simulate cooperative and competitive scenarios. Table 1 represents the scenario when the reward of the agent is independent of the action of the other agent. Table 2 is the friend reward function where the agent gets half of the reward obtained by the other agent, in addition to its own reward. In table 3 the agent gets half of the negative of the reward obtained by the other agent, hence represents the competitive case. In other reward function Table 4, the agent gets -50 reward if another agent opens the correct door but there is no extra reward if the other agent opens the wrong door. Table 3 incentivizes the extreme lie while 4 incentivizes more believable lie. 5.1.2

Experimental setup

When level 2 friend i models a friend j, agent i sends a sincere message to j re ecting its belief and further includes a message from another agent in its belief update. For e.g. after starting from uniform belief, if the agent i hears GL, it will send the sincere message mi;s = (0:75; 0:25), which assigns probability 0.75 to TL and 0.25 to TR. Figure 2 (b) shows the alpha vectors plot corresponding to communicative policies for CIPOMDP agent with friend reward function modeling CIPOMDP (also with a friend reward). The resulting policy tree when the agents are uncertain about the initial location of the tiger is shown in gure 2 (a). 5.3

Non-cooperative Scenarios

When a level 1 CIPOMDP agent i models a gullible agent j, it can collect a large reward by sending a deceitful message. For e.g. after starting from uniform belief and getting GL, the agent sends a message mi;s = (0; 1) which indicates agent i is certain tiger is on the right, opposite to its own observation. When level 2 enemy i models enemy j, the sophisticated agent i can prevent itself from being deceived by j and further take advantage of the deceitful message as an extra observation. Also, since level-2 agent knows level-1 CIPOMDP models the other as a sincere agent, the former tends to deceive the latter by sending a deceitful message. When level 2 neutral agent models an enemy, it has no incentive to deceive but can prevent itself from being deceived by ignoring the message from the other agent. When the nesting level is increased to level 3, the agent tends to send a sincere message with the intention to deceive the level 2 agent which is expecting a deceitful message.When the agents are indi erent to each other's actions i.e. both have a neutral reward function, then communication doesn't provide any added value and the reward collected is the same as the IPOMDP agent. 5.4 5.4.1

Uncertainty about the opponent Message non-revealing of the agent's type

Let's consider a two-agent interaction in a multi-agent tiger game where both the agents i and j start with no information about the location of the tiger. The level-2 CIPOMDP agent i is uncertain about the level-1 opponent j 's type and thus assigns uniform probability over the other agent being friend, enemy or random 4. The friend is characterized by the friend reward function (Rf ) which incentivizes sincere communication while the enemy is characterized by the enemy reward function (Re) incentivizing the deceptive behavior. Also, all other elements in the frame of the CIPOMDP agent at all levels are assumed to be the same.

Behavior of level-0 agent

The POMDP agent at level-0 which starts with a uniform distribution over the physical states of the world T L and T R, would open the door if it received two consecutive growls indicating the same location of tiger and the message non-indicative of opposing growl. Further, the agent would open the door at any time-step, regardless 4It can be the case that level-2 CIPOMDP is uncertain about the strategic level of the opponent and might want to model POMDP as well, but to simplify the illustration, we stick to 3 types of models of its observation, if it got the extreme message indicating the location of a tiger with certainty. Also, being a literal speaker, the agent sends out its true belief at every time-step with probability .

Behavior of level-1 agent

The less sophisticated level-1 agent j;1 models sincere and gullible POMDP agent i;0, and the random agent SMi. Let's consider the behavior of level-1 agent j which has predicted the behavior of i;0 for every possible message it can send. The agent je;1 with enemy reward function (Table 4), has incentive to deceive the opponent, e because every time i opens the right door, j gets -50 reward. Let's suppose the agent j;1 listens at time step t = 0 , to be certain of the tiger's location and then sends the message at time step t = 1 indicating growl from opposite location than what it heard. Let's suppose je;1 got GL, then it would send the message (0.25, 0.75). On other hand, if the level-1 agent had a friend reward function (Table 2), it would get a share of i's reward. Let's call the agent jf;1. jf;1 would want the other agent i to open the right door, and hence send the sincere message, indicative of its belief updated based on the observation. For an observation of GR, it would send a message (0.25, 0.75) at t = 1.

Belief update for level-2 agent

f Level-2 CIPOMDP i uses IPBVI-Comm to solve for the anticipated behavior of modeled agents j;1 and je;1. Now let's see how the belief update proceeds for i. Note, that optimal messages from level-1 agents are not indicative of the agent type. For e.g. if the enemy received GL, it would send the message (0.25, 0.75) and if the friend received GR, it would again send the message (0.25, 0.75). We study the scenario when the message itself is indicative of agent type in the next section. It turns out the agent i is still able to assign a higher probability to one agent type than the other based on the observation it received. For e.g. if the observation and message received for i in two time-steps are < GL;0 nil0 > and < GL; (0:25; 0:75) >, as shown in table 6, scenario 1, the agent is more likely to be an enemy than the friend because its observations contradict the received message. The belief update of other scenarios are shown in gure 3. The description of each scenario is provided in table 6. Again we consider the scenario when there is uncertainty about the opponent type, i.e. when level 2 neutral agent i models both enemy and friend. The only di erence now is that the agent modeled at level-1 j;1 has the reward function that incentivizes the extreme lie (Table4). In this case, the higher level agent i;2 can gure out if the other agent j is friend or enemy based on the message content only. The CIPOMDP agent i;2 incorporates the message from j as a sincere message if it is closer to the belief of the modeled agent j;1 and discards the message as an insincere message if the incoming message is way o the belief of the modeled agent. For e.g. if the belief of the modeled agent is (0:85; 0:15), (0:75; 0:25) is considered a sincere message while (1; 0) is considered insincere message. Let's suppose i;2 starts by listening at t = 0. After receiving GL and message nil, the belief shifts towards physical state T L but belief is equally distributed among both frames. At t = 1, after receiving message (0:75; 0:25) and observation GL, the belief concentrates on tiger being on left and other agent being a friend. The agent is able to detect friend from enemy by calculation of sincerity term P (mi;rj j ). When j = (bj ; Rf ), P (mi;rj j ) = 1 and when j = (bj ; Re), P (mi;rj j ) = 0. This happens because for level-1 CIPOMDP with enemy reward function, the optimal action would be to lie to the extreme. This would make i;0 open the door intended by the deceiver j;1, disregarding its own observation. This message reveals j as an enemy to the higher level agent i. Since the point-based algorithm only backs up alpha-vectors optimal at xed set of belief points, the performance would depend on the number of belief points chosen. Figure 4 shows the comparison of expected reward for the di erent number of belief points. 6

Conclusion

We started by devising a technique to incorporate the message into POMDP belief update in order to allow an agent that does not model other agents to take part in exchange of (literal) messages. We then formalized the notion of sincerity and deception in terms of belief of the agent and messages in message space. We adopted a point based solution method to CIPOMDPs to alleviate the complexities of considering communication as well as observations and physical actions. The analysis of computed policies shows the added sophistication of communication results in policies of superior quality, which is further supported by the empirical results on several experiments conducted on multi-agent tiger game. More importantly, we showed how higher levels of theory of mind may allow an agent to disambiguate a sincere friend from deceitful foe based on received message and observation from the environment.

In future work, we want to explore the higher depth of nesting, and more relaxed soft maximization criterion for action selection which can give rise to richer rational communicative behavior agents can engage in. We are considering online planning method like Monte Carlo tree search for computing policy which provides scalability and with some variation, could accomodate continuous message space. [AdR18] [BAG+05] [Gmy20]

Andrea L Guzman and Seth C Lewis. Arti cial intelligence and communication: A human{machine communication research agenda. New Media & Society, 22(1):70{86, 2020. [SSW+17] [STHP91] [STX+18] [YFS+20] [ZVlY04]

Sainbayar Sukhbaatar, arthur szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2244{2252. Curran A ssociates, Inc., 2016 .

[AASN+20]

Ali Reza Al Amin ,

Sachin

Shetty , Laurent L. Njilla, Deepak K. Tosh , and Charles

Kamhoua .

Dynamic

Cyber Deception Using Partially Observable Monte-Carlo Planning Framework, chapter 14 , pages 331 { 355 . John Wiley Sons, Ltd, 2020 .

Saul

Albert and J. P. de Ruiter . Repair: The interface between interaction and cognition . Topics in Cognitive Science , 10 ( 2 ): 279 { 313 , 2018 .

Beautement ,

David H.

Allsopp ,

Greaves , Steve Goldsmith,

Spires ,

Thompson , and

Janicke . Autonomous agents and multi -agent systems (aamas) for the military - issues and challenges . In DAMAS , 2005 .

In AAAI Fall Symposium: Advances in Cognitive Systems , 2011 .

Cynthia

Breazeal , Atsuo Takanishi, and

Tetsunori

Kobayashi . Social Robots that Interact with People , pages 1349 { 1369 . Springer Berlin Heidelberg, Berlin, Heidelberg, 2008 .

Crystal

Chao and

Andrea

Thomaz . Timed petri nets for uent turn-taking over multimodal interaction resources in human-robot collaboration . The International Journal of Robotics Research , 35 ( 11 ): 1330 { 1353 , 2016 .

Devin and

Alami . An implemented theory of mind to improve human-robot shared plans execution . In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI) , pages 319 { 326 , 2016 .

Sandra

Devin , Aurelie Clodic, and

Rachid

Alami . About decisions during human-robot shared plan achievement: Who should act and how ? In ICSR, 2017 .

Prashant

Doshi and

Piotr

Gmytrasiewicz . Approximating state estimation in multiagent settings using particle lters . In Proceeding of AAMAS 2005 , 2005 .

In Proceedings of the 23rd National Conference on Arti cial Intelligence - Volume 1, AAAI'08 , page 63 { 68 . AAAI Press, 2008 .

[DWW+15]

Xiao

Pan Ding , Henry M. Wellman , Yu

Wang

, Genyue Fu , and Kang Lee . Theory-of-mind training causes honest young children to lie . Psychological Science , 26 ( 11 ): 1812 { 1821 , 2015 . PMID: 26431737 .

Piotr

Evdokimov and

Umberto

Garfagnini . Communication and behavior in organizations: An experiment . Quantitative Economics , 10 ( 2 ): 775 { 801 , 2019 .

David

Ettinger and

Philippe

Jehiel . A theory of deception . American Economic Journal: Microeconomics , 2 ( 1 ):1{ 20 , February 2010 .

[FAdFW16] Jakob

Foerster , Yannis M. Assael , Nando de Freitas, and Shimon Whiteson . Learning to communicate to solve riddles with deep distributed recurrent q-networks , 2016 .

Jakob N.

Foerster ,

H. Francis

Song , Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew M Botvinick , and Michael H. Bowling . Bayesian action decoder for deep multi-agent reinforcement learning . ArXiv , abs/ 1811 .01458, 2019 .

Matthias

Gamer and

Wolfgang

Ambach . Deception research today. Frontiers in Psychology, 5 : 256 , 2014 .

Autonomous Agents and Multiagent Systems Journal , 3 ( 4 ): 319 { 350 , 2000 .

Piotr

Gmytrasiewicz and

Prashant

Doshi . A framework for sequential planning in multiagent settings . Journal of Arti cial Intelligence Research , 24 : 49 { 79 , 2005 . http://jair.org/contents/v24.html.

Piotr J.

Gmytrasiewicz . How to do things with words: A bayesian approach . J. Artif. Intell. Res. , 68 : 753 { 776 , 2020 .

Gneezy . Deception: The role of consequences . The American Economic Review , 95 : 384 { 394 , 2005 .

[GYPC20]

George ,

C. T.

Yilmaz ,

Parayil , and

Chakrabortty . A model-free approach to distributed transmit beamforming . In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 5170 { 5174 , 2020 .

Alistair

Isaac and

Will

Bridewell . White lies on silver tongues: Why robots need to deceive (and how) . pages 157{172 , 10 2017 .

Kurakin ,

Ian J.

Goodfellow , and

Bengio . Adversarial examples in the physical world . ArXiv, abs/1607.02533 , 2017 .

Omid

Madani ,

Steve

Hanks , and

Anne

Condon . On the undecidability of probabilistic planning and related stochastic optimization problems . Arti cial Intelligence , 147 ( 1 ):5 { 34 , 2003 . Planning with Uncertainty and Incomplete Information .

Nayyar ,

Mahajan , and

Teneketzis . Decentralized stochastic control with partial history sharing: A common information approach . IEEE Transactions on Automatic Control , 58 ( 7 ): 1644 { 1658 , 2013 .

Ranjit

Nair , David Pynadath,

Makoto

Yokoo , Milind Tambe, and

Stacy

Marsella . Communication for improving policy computation in distributed pomdps . In Proceedings of the Agents and Autonomous Multiagent Systems (AAMAS) , 2004 .

Ranjit

Nair , Maayan Roth, Makoto Yokoo, and

Milind

Tambe . Taming decentralized pomdps: Towards e cient policy computation for multiagent settings . In Proceedings of the Eighteenth International Joint Conference on Arti cial Intelligence , 2003 .

Frans

Olihoek , Mathijs Spaan, and

Nikos

Vlassis . Dec-pomdps with delayed communication . In Proceedings of MSDM 2007 May 15 , 2007 , Honolulu, Hawai'i, USA , 2007 .

Lauren A Oey ,

Adena

Schachner , and

Edward

Vul . Designing good deception: Recursive theory of mind in lying and lie detection , May 2019 .

Joelle

Pineau , Geo Gordon, and

Sebastian

Thrun . Point-based value iteration: An anytime algorithm for pomdps . In Proceedings of the 18th International Joint Conference on Arti cial Intelligence , IJCAI'03 , page 1025 { 1030 , San Francisco, CA, USA, 2003 . Morgan Kaufmann Publishers Inc.

Renoux ,

Mouaddib , and

S. L.

Gloannec . A decision-theoretic planning approach for multirobot exploration and event search . In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 5287 { 5293 , 2015 .

Nancy

Ratner and Rose R. Olver . Reading a tale of deception, learning a theory of mind? Early Childhood Research Quarterly , 13 ( 2 ): 219 { 239 , 1998 .

Aditya

Shinde , Prashant Doshi, and

Omid

Setayeshfar . Active deception using factored interactive pomdps to recognize cyber attacker's intent , 2020 .

Edward J. Sondik.

The optimal control of partially observable markov processes over the in nite horizon: Discounted costs . Oper. Res. , 26 ( 2 ): 282 { 304 , April 1978 .

Modelling deception using theory of mind in multi-agent systems . AI Commun ., 32 : 287 { 302 , 2019 .

Shu ,

Sliva , Suhang Wang,

Jiliang

Tang , and Huan Liu. Fake news detection on social media: A data mining perspective . SIGKDD Explor ., 19 : 22 { 36 , 2017 .

Beate

Sodian , Catherine Taylor, Paul L. Harris, and

Josef

Perner . Early deception and the child's theory of mind: False trails and genuine markers . Child Development , 62 ( 3 ): 468 { 483 , 1991 .

Aaron

Schlenker , Omkar Thakoor, Haifeng Xu, Fei Fang, Milind Tambe, Long Tran-Thanh, Phebe Vayanos, and Yevgeniy Vorobeychik. Deceiving cyber adversaries: A game theoretic approach . In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems , AAMAS ' 18 , page 892 { 900 , Richland , SC , 2018 . International Foundation for Autonomous Agents and

Multiagent

Systems .

Vaibhav

Unhelkar , Shen

Li , and Julie

Shah . Decision-making for bidirectional communication in sequential human-robot collaborative tasks . In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI '20, page 329 { 341 , New York, NY, USA, 2020 .

Ning

Wang ,

David V.

Pynadath , and Susan G. Hill. Trust calibration within a human-robot team: Comparing automatically generated explanations . In The Eleventh ACM/IEEE International Conference on Human Robot Interaction , HRI ' 16 , page 109 { 116 . IEEE Press, 2016 .

Luyao

Yuan , Zipeng Fu, Jingyue Shen, Lu Xu,

Junhong

Shen , and Song-Chun Zhu . Emergence of Pragmatics from Referential Game between Theory of Mind Agents . arXiv e-prints, page arXiv: 2001 .07752, Jan 2020 .

Zhang

, Richard A. Volz, Thomas R. loerger, and John Yen. A decision-theoretic approach for designing proactive communication in multi-agent teamwork . In Proceedings of the 2004 ACM Symposium on Applied Computing , SAC ' 04 , page 64 { 71 , New York, NY, USA, 2004 . Association for Computing Machinery .