<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarit Adhikari</string-name>
          <email>sadhik6@uic.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr J. Gmytrasiewicz</string-name>
          <email>piotr@uic.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Illinois at Chicago</institution>
          ,
          <addr-line>Chicago, IL 60607</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs), agents use Bayes update to process their observations and messages without the usual assumption of cooperative discourse. We formalize the notion of sincerity and deception in terms of message in message space and belief of the agent sending the message. We then use a variant of the point-based value iteration method called IPBVI-Comm to study the optimal interactive behavior of agents in cooperative and competitive variants of the Tiger game. Modeling the belief and preferences of the opponent allows an agent to predict their optimal communicative behavior and physical action. Unsurprisingly, it may be optimal for agents to attempt to mislead others if their preferences are not aligned. But it turns out the higher depth of reasoning allows an agent to detect insincere communication and to guard against it. Speci cally, in some scenarios, the agent can distinguish a truthful friend from a deceptive foe when the message received contradicts the agent's observations, even when the received message does not directly reveal the opponent type.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC
BY 4.0)
In: R. Falcone, J. Zhang, and D. Wang (eds.): Proceedings of the 22nd International Workshop on Trust in Agent Societies, London,
UK on May 3-7, 2021, published at http://ceur-ws.org
align, then they bene t from sincere communication. Like physical actions, communicative actions are guided by
the expected utility obtained in a particular state. Agents sometimes bene t from being sincere and sometimes
it is in their best interest to deceive. To be able to cooperate, deceive or guard against deception, the agent
needs to model the belief, intention, and preference of the other agent [BI11].</p>
      <p>Humans are not very adept in detecting lies; we nd it di cult to tell a skillful liar from a truthful, but possibly
misinformed, friend. The main reason is that both may behave the same. We show examples of interactions
during which agents, using Bayes update, can form informative beliefs about the identity of their opponents by
keeping careful track of their possible states of knowledge, and by comparing the content of their communications
to the agent's own observations. It is important to note that analyzing communicative behavior as separate from
physical actions is bound to be insu cient, unless agents care directly about states of other agents' beliefs, as
opposed to caring only about physical states of the world. In that latter, arguably more realistic, case, rational
agents will attempt to modify others' beliefs only because this could induce them into advantageous physical
actions - this could not be analyzed if one abstracts physical actions away.</p>
      <p>Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs) provide a principled
framework for rational interaction and communication in a multi-agent environments [Gmy20]. CIPOMDP
framework is an extension of interactive POMDP [GD05] to include the exchange of messages among the agents.
IPOMDP, in turn, extends Partially Observable Markov Decision Process (POMDP) [Son78] to include other
agents by incorporating their models into its state space. Thus, while POMDPs provide a theoretically sound
framework to model uncertainty about the state of the world, the Theory of Mind (ToM) approach of IPOMDPs
and CIPOMDPs allows modeling of uncertainty about other agents, including their beliefs about the world and
other agents. Figure 1 shows the theory of mind (ToM) of the decision-making agent which is uncertain about
the type of another agent and how it may model others including the original agent. At the bottom of the
hierarchy of models could be a random agent, or a rational agent that does not model others, i.e., a classical
POMDP agent. A great deal of research in psychology establishes a connection of deception to the recursive
theory of mind reasoning, which starts at an early age in humans [STHP91, RO98, DWW+15]. More recently,
[OSV19] provides a comprehensive quantitative analysis of the role of rationality and theory of mind in deception
and detecting deception.</p>
      <p>We rst formalize the de nition of sincerity and deception in terms of the message sent by the agent and its
belief about the state of the world. We then propose an o ine solution method for communicative IPOMDPs,
which adopts a point-based solution technique to solve for the optimal communicative behavior. The algorithm
computes the policy to nd the optimal action-message pair at each time-step of interaction with the other agent.
The executed action may not be known to other agents but we assume that the message sent is received by the
other agent with certainty. The subsequent policies are conditional on not only observation but also message
received from the other agent. Based on the preference of the agent and what is known about the preferences of
the modeled agent, the agent may bene t from incorporating the message from another agent into its belief, but
discounting it if it thinks the other agent has an incentive to lie. Since CIPOMDP agent needs message space to
calculate the policy o ine and messages express agent's beliefs, we construct the message space by discretizing
the interactive belief space. The policy of POMDP agent on the bottom of a ToM hierarchy is augmented with
a sincere messages it can send using a look-ahead reachability tree from the initial belief in the interactive state
of the modeling agent. Similarly, we propose a way for an agent modeled as a POMDP to receive messages by
augmenting its observation space. We apply CIPOMDPs to agents interacting in the multi-agent tiger game
and we show that communication is valuable to agents and results in superior policies compared to its no
communication counterpart. In cooperative scenarios, the agent can take advantage of messages from a sincere
agent as additional observations, and can send sincere messages that inform the other agent. In competitive
scenarios, the agent not only attempts to deceive the other agent but also ignores the message it knows to be
deceitful. We then show how Bayesian update allows an agent higher in a cognitive hierarchy to tell a friend
from foe based on the message received and its own observation.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>The problem of agents communicating and interacting simultaneously has been addressed in several
decisiontheoretic as well as RL settings. [FAdFW16] uses DDRQN to learn communication protocol and solve a riddle by
coordination. [FSH+19] combines multiagent RL with a bayesian update to compute communication protocols
and policies in cooperative, partially observable multi-agent settings. [SsF16] uses a neural model that learns
communication along with the policy. In the planning and control setting, [NMT13] uses communication for
controllers to share part of their observation and control history at each step. More recently, [ULS20] used
POMDP with communication for human-robot collaboration task. Other work in HRI include [CT16], [DCA17]
and [WPH16] . [DA16] uses a theory of mind approach for the execution of a shared plan.</p>
      <p>Communication has been studied extensively in other multi-agent decision theoretic frameworks [NRYT03],
[NPY+04], [ZVlY04], [OSV07] In [RMG15], agents use extended belief state that contain approximation of other
agents' beliefs. But these works assume fully cooperative interactions and mostly involve central planning.
CIPOMDPs, on the other hand, provide subjective theory of mind reasoning during communication, and follows
Bayesian approaches to pragmatics.</p>
      <p>Deception has been widely studied across multiple disciplines including game theory [STX+18], [EJ10],
psychology [GA14] and economics [Gne05]. When it comes to a sequential decision process to study attacker's
and defender's approaches in cybersecurity research , decision theoretic framework of POMDP [AASN+20] and
IPOMDP [SDS20] has been used. [SPB+19] combines ToM with components from deception theory and
implements an epistemic agent using Agent-Oriented Programming Language.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <sec id="sec-3-1">
        <title>Communicative Interactive POMDPs</title>
        <p>CIPOMDP [Gmy20] is the rst general framework for an autonomous self-interested agent to communicate and
interact with other agents in the environment based on Bayesian decision theory. A nitely nested communicative
interactive POMDP of agent i in an environment with agent j, is de ned as:</p>
        <p>CIP OM DPi = hISi;l; Ai; M; i; Ti; Oi; Rii
(1)
where ISi;l is a set of interactive states, de ned as ISi;l = S Mj;k; l 1, where S is the set of physical states
and Mj;k is the set of possible models of agent j, l is the strategy (nesting) level, and k &lt; l. The set of possible
models Mj;k consists of intentional models, j , or sub-intentional ones, SMj . While the intentional models
ascribe beliefs, preferences and rationality in action selection to the modeled agent, the sub-intentional models
do not. We consider kth (less than l) level intentional models of agent j de ned as j;k = hbj;k; Aj ; j ; Tj ; Oj ; Rj i,
where bj;k is agent j's belief nested to the level k, bj;k 2 (ISj;k). The intentional model j;k, is sometimes called
type, can be rewritten as j;k = hbj;k; ^j i, where ^j includes all elements of the intentional model other than the
belief and is called the agent j's frame. Among the classes of sub-intentional models, we consider no-information
model [GD00] which randomly selects action to execute and message to send in each time-step. The random
models are possible in each level starting with level-0.</p>
        <p>In contrast to classical POMDPs and similar to IPOMDPs, the transition, observation and reward functions in
CIPOMDPs take actions of other agents into account. A = Ai Aj is the set of joint actions of all agents, i is the
set of agent i's possible observations, Ti : S A S ! [0; 1] is the state transition function, Oi : S A i ! [0; 1]
is the observation function, Ri : S A ! R is the reward function.</p>
        <p>The ISi;l can be de ned inductively</p>
        <p>ISi;0 = S;
ISi;1 = S</p>
        <p>Mj;0;
::::::
ISi;l = S</p>
        <p>Mj;k;
l 1
k=0
j;0 = fhbj;0; ^j i : bj;0 2</p>
        <p>(S)g
Mj;0 =</p>
        <p>j;0 [ SMj
Mj;1 =</p>
        <p>j;1 [ SMj
j;1 = fhbj;1; ^j i : bj;1 2</p>
        <p>(ISj;1)g
j;l = fhbj;l; ^j i : bj;l 2</p>
        <p>(ISj;l)g
Mj;l =</p>
        <p>j;l [ SMj</p>
        <p>The above de nes the 0-level model, j;0 as having beliefs only over the physical state space, S. The level 1
agent model maintains beliefs over the physical states and 0-level models of the opponent. A level l agent, j;l,
maintains beliefs over S and over models of the opponent nested up to l 1.</p>
        <p>M is a set of messages the agents can send to and receive from each other , i.e., it is a communication
language the agents share. Since agents' beliefs are probability distributions and communication is intended to
share beliefs, it is natural to interpret a message in M as a marginal probability distribution over a subset of
variables in the agents' interactive state spaces ISi and ISj , which overlap. That way M is a set of probabilistic
statements about the interactive state space. The message nil, i.e., silence, contains no variables. Note that we
do not assume that messages re ect agents' actual beliefs. We will further discretize M below.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Belief update in CIPOMDPs</title>
        <p>Belief update in CIPOMDPs is analogous to belief update in IPOMDPs when it comes to actions and
observations. At any particular time step agents i and j can not only perform physical actions and observe but also
send and receive messages. Call the message i sent at time t 1 mit;s1, and one i received at time t mit;r, and
analogously for j. We assume all messages are in M and that message transmission is perfect. We provide precise
de nition of message space in our formulation in section 4.1. The belief update in CIPOMDPs has to update
the probability of interactive state given the previous belief, action and observation, and given the message sent
(at the previous time step) and received (at the current time): P (istjbit 1; ait 1; mit;s1; oit; mit;r):
bit(ist) = P (istjbit 1; ait 1; mit;s1; oit; mit;r) =</p>
        <p>X bit 1(ist 1) X</p>
        <p>P (mtj;s1; atj 1j jt 1)
ist 1
at 1
j
Oi(st; at 1; oit)Ti(st 1; at 1; st) X</p>
        <p>jt (btj 1; atj 1; mtj;s1; otj ; mtj;r; btj )Oj (st; at 1; otj )
ot
j
The term P (mtj;s1; atj 1 t 1) quanti es the relation between the message i received from j and the model,
j j
j , of agent j that generated the message.1 This term is the measure of j0s sincerity, i.e., whether the message
j sent re ects j's beliefs which are part of the model j . is the normalizing constant.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Planning in CIPOMDPs</title>
        <p>The utility of interactive belief of agent i, contained in i's type i, is:
(2)
(3)
Ui( i) = max(mi;s;ai)
n X bis(s)ERi(is; mi;s; ai)</p>
        <p>is2IS
+</p>
        <p>X
(mi;r;oi)</p>
        <p>P (mi;r; oijbi; ai)Ui(hSE i (bi; ai; mi;s; oi; mi;r); ^ii)
o
1Note that mtj;s1 = mit;r because message transmission is assumed to be perfect.</p>
        <p>ERi(is; mi;s; ai) above is the immediate reward to i for sending mi;s and executing action ai given the
interactive state is and is equal to aj Ri(is; ai; aj ; mi;s)P (aj j j ).</p>
        <p>The planning in CIPOMDP makes use of equation 3 , which is based on the Bellman optimality principle.
The policy computes optimal action, message pair which results in a maximum expected reward. Consequently,
value iteration in CIPOMDP is analogous to that in IPOMDP and POMDP. The set of optimal message-action
pairs, (mi;s; ai ) is obtained by replacing max in equation 3 with argmax. We call the resulting set of optimal
message-action pairs OP T ( i). When agent i models agent j as a strict optimizer, i predicts j would choose
action-message pair in OP T set with equal probability:</p>
        <p>P (mj;s; aj j j ) =</p>
        <p>1
jOP T ( j )j
and that P (mj;s; aj j j ) is equal to zero if (mj;s; aj ) is not in OP T . The possibility that agents may be less
than optimal is considered in [Gmy20]</p>
        <p>Being able to compute the probabilities of messages given the preferences and beliefs of a speaker is of crucial
importance when sincerity is not guaranteed. The belief update in CIPOMDPs provides the principled way to
discount content of messages that may be insincere because it is in the interest of the speaker to transmit them.
We give an example of this further below.</p>
        <p>Msincere =
[</p>
        <p>arg min kbtX
X2}(X) m2MX</p>
        <p>mk
Minsincere =
[</p>
        <p>MX
arg min kbtX
m2MX
mk</p>
        <p>Insincere(deceptive) message can be de ned as any message in message space except the one closest to true
belief bt of the agent at time t. Thus the set of insincere messages is given by</p>
        <p>X2}(X)
2For example, if the physical state space, S, is factored into variables X and Y , a message, m, might be "P (0
X
100) = 0:7".
We de ne message space, M, as a set of marginal probability distributions over a subset of variables in the
interactive states of the agent. Since the belief space is continuous we make the computation more tractable by
quantizing M into nite set of belief points. The message space is augmented with nil, which is analogous to
no-op operation in the physical action set. Limiting message space to only nil message reduces CIPOMDP to
IPOMDP. Usually, a message contains information about only the subset of possible interactive states.2 The
variables that the message doesn't mention are interpreted as being marginalized. We use the principle of
indi erence and assume that probability is uniformly distributed among the remaining variables.</p>
        <p>The message received mi;r can provide a mapping from physical state to belief marginalizing other variables
of interactive state (belief of another agent, frame, etc). Then mi;r(s) denotes belief the message ascribes to
physical state s 2 S. Further, message may only contain probabilities of subset of values that the variables
mentioned in message can take. Let S0 be the set of states not mentioned in the message. If a message m does
not mention state s' then</p>
      </sec>
      <sec id="sec-3-4">
        <title>Sincerity and Deception</title>
        <p>m(s0) =
1
8s0 2 S0</p>
        <p>Ps2S S0 m(s)
jSj jS0j
Let X denote a set of variables describing the interactive state of the agent. }(X) is a set of all non-empty
subsets of X. The joint belief bt of an agent can be marginalized over the subset of variables in X. Let bX
represent belief marginalized over variables in X. Accordingly message space M can be factored into the sets of
messages marginalized over each of X 2 }(X).</p>
        <p>Sincere message can be de ned as a message in message space m 2 MX which is closest to the marginalized
belief btX consistent with the true joint belief bt of the agent at time t. The distance is de ned in terms of the
L1 norm. Thus the set of sincere messages is given by
(4)
(5)
(6)
(7)
The recursion in CIPOMDP bottoms out as a at POMDP which does not have a model of the other agent. We
use the de nition of literal speaker 3 from [Gmy20]. A literal listener can incorporate the incoming message as
additional observation, which we describe in the following section. These assumptions enable an agent that does
not model the other agent to participate in the exchange of messages.
We propose POMDP ( 0) can receive the message and include it in its belief update, by augmenting its observation
space and consequently observation function. Observation space now becomes a Cartesian product of usual
observation space and message space.</p>
        <p>The joint probability of observation and message received is obtained by combining the likelihood function
for a physical observation with message distribution. The message distribution is represented by a triangular
distribution with the idea that the likelihood of a message re ecting belief about the world state should increase
monotonically as the belief. For e.g. if the belief is de ned over two states (s1; s2), and the true state of the
world is s1, the likelihood of received message re ecting belief of (1,0) should be higher than the likelihood of
received message re ecting belief of (0.5, 0.5). This formulation makes the agent gullible.</p>
        <p>Given the state of the world, observation is conditionally independent of the message received, then the
augmented observation function can be de ned as</p>
        <p>P (mi;rjs) =
1
jSj
(mi;r(s))
if mi;r = nil
otherwise
8mi;r2M and 8o2</p>
        <p>and 8s2S
O0(s; a; o; mi;r) =</p>
        <p>P (o; mi;rjs; a)</p>
        <p>Po;mi;r P (o; mi;rjs; a)
= Po PP((oojjss;;aa))PP(mmi;ir;rPjs(;ma)i;rjs)</p>
        <p>O(s; a; o)P (mi;rjs)
= Pmi;r P (mi;rjs)
(8)
(9)
(10)
4.4
4.4.1</p>
      </sec>
      <sec id="sec-3-5">
        <title>Algorithm</title>
      </sec>
      <sec id="sec-3-6">
        <title>Value Iteration</title>
        <p>As in POMDPs, the value function of CIPOMDPs is represented in terms of max over linear segments called
alpha-vectors. Each alpha vector corresponds to a policy and each component of alpha vector ascribes value to
an interactive state. Value iteration proceeds by backing up alpha-vectors to a higher time horizon starting from
horizon 1.
4.4.2</p>
      </sec>
      <sec id="sec-3-7">
        <title>Complexity</title>
        <p>POMDPs su er from the curse of dimensionality and the curse of history. Naturally, These curses are carried
over to IPOMDPs and CIPOMDPs, which require solving of nested POMDPs and CIPOMDPs. The curse
of history is more prominent in CIPOMDPs because the policy is now conditional on both observation and
message received. Since, computing optimal policies for POMDPs by exact solution methods are proven to be
PSPACE-complete for nite time horizon and undecidable for an in nite time horizon [MHC03], a large amount
of work has been done in computing approximate solution. [PGT03] introduced a point-based value iteration
3literal speaker generates a message re ecting its true belief about the physical states of the world bt with probability 1 and
all other messages including 'nil' with probability jMj 1 . The messages are then broadcasted to "no one in particular" (NOIP), and
do not take part in belief update for POMDP agent
(PBVI) algorithm to approximate exact value iteration by selecting a xed set of representative belief points and
maintaining alpha vectors that are optimal at those points only. Our work builds on the interactive point-based
value iteration [DP08] which showed improvement in runtime over other IPOMDP solution methods like [DG05].
Exact value iteration quickly becomes intractable in PODMPs and IPOMDPs due to generation of large number
of alpha-vectors which is exponential in observation space jAjj t+1jj j, where t+1 denote set of alpha vectors
being backed-up from t + 1 to t. In the case of CIPOMDP, the size is further exploded due to the inclusion of
message space in the policy. The exact number of alpha-vectors generated at time t + 1 will be jAjj t+1jj jjMj
. To keep the size of the alpha-set in each iteration tractable, we can use the point-based method, which only
retains the vectors which are optimal at the xed set of belief points. As in IPOMDP, we need to solve the
lower-level model to compute the alpha vectors. Accordingly, we limit the initial model of other agents to a nite
set.
4.4.3</p>
      </sec>
      <sec id="sec-3-8">
        <title>IPBVI-Comm</title>
        <p>The point-based value iteration approach backs up the alpha-vectors optimal at a xed set of belief points in
each iteration starting from horizon 1. Each iteration consists of three steps, which we describe below, and along
the way we highlight the di erence with IPBVI [DP08]. The belief set for each horizon consists of randomly
generated belief points across all interactive states.</p>
        <sec id="sec-3-8-1">
          <title>Step 1</title>
          <p>The rst step involves calculating the intermediate set of alpha vectors ai;mi;s; representing immediate reward
for action ai and message mi;s (equation 11), and ai;oi;mi;s;mi;r representing future reward after receiving
observation oi and message mi;r (equation 12). The step is performed for all actions, observations and messages.
For horizon 1, computation of immediate reward is su cient and will be used as initial alpha set for subsequent
backups.</p>
          <p>Di erent from point-based algorithm for IPOMDPs, we need to calculate P r(mi;rj j;l 1) and perform belief
update for the other agent j which now depends on message sent by i. Due to one-step delay in message exchange,
the message sent in the current time step allows computing interactive states in next time step and backup the
values from next time step. Assuming sincerity for j;0, only the message closest to belief will get probability
1- . All the other messages, including 'nil', will share the probability . When a message received is other
than a sincere message, level-1 CIPOMDP ignores the message, and belief update proceeds as IPOMDP. For
higher-level CIPOMDPs, the probability of message received is uniformly distributed among all the messages in
OP T ( j ) set, as de ned in section 3.3.</p>
          <p>8ai 2 Ai; 8oi 2 i; 8mi;s 2 M; 8is 2 IS
ai;mi;s;
ai;mi;s; (is) =</p>
          <p>Ri(s; ai; aj ; mi;s)P r(aj j j;l 1)
ai;oi;mi;s;mi;r
ai;oi;mi;s;mi;r (is) =</p>
          <p>P r(mi;r; aj j j;l 1)Ti(s; ai; aj ; s0)
Oi(s0; ai; aj ; oi) X Oj (s0; ai; aj ; oj ) D(SE ^j (bj;l 1; aj ; oj ; mj;s; mi;s)
oj
b0j;l 1) 0 (is0)</p>
          <p>Here, D is the Dirac delta function taking the current belief and updated belief as an argument. The updated
belief is returned by state estimator function SE ^ .</p>
          <p>j</p>
        </sec>
        <sec id="sec-3-8-2">
          <title>Step 2</title>
          <p>The second step involves combining intermediate alpha vectors calculated in step 1 weighted by the observation
and message likelihood using a cross sum operation. Due to the point-based approach, the cross sum operation
in exact value iteration is simpli ed. The step proceeds by selecting only those intermediate alpha vectors which
are optimal at any of the given set of belief points.</p>
          <p>ai;mi;s ai;mi;s;</p>
          <p>arg max
oi2 i;mi;r2M ai;oi;mi;s;mi;r</p>
          <p>( ai;oi;mi;s;mi;r :bi;l)
X
aj2Aj
X</p>
          <p>X
is02IS0 aj2Aj
8bi;l 2 Bi;l
(11)
(12)
(13)
(a)
(b)
In the nal step, the belief points in set Bi;l are used again to select the alpha vectors for the nal set for the
current iteration. Since di erent action, message pairs can be optimal for the modeled agent, we need to include
alpha vectors corresponding to all optimal action, message pairs in the nal alpha set.</p>
          <p>t
arg max ( t:bi;l)
t2[ai ai;mi
We report the results on the multi-agent tiger game, which elegantly models complex interaction with the
environment and other agents in sequential decision-making scenarios under uncertainty. The game provides
intuitive scenarios for the study of sincerity and deception because the agents can be incentivized to in uence
another agent to open right or wrong door while not ignoring uncertainty in the environment.
5.1</p>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>Multi-agent tiger game</title>
        <p>In this version, two agents are facing two doors: \left" and \right". Behind one door lies a hungry tiger and
behind the other is a pot of gold but the agents do not know the position of either. Thus, the set of states
is S = fT L; T Rg indicating the tiger's presence behind the left, or right, door. Each agent can open either
door. Agents can also independently listen for the presence of the tiger, so the actions are A = fOR; OL; Lg
for opening the right door, opening the left door, and listening and is the same for both agents. The transition
function T , speci es that every time either agent opens one of the doors, the state is reset to T R or T L with
equal probability, regardless of the action of the other agent. However, if both agents listen, the state remains
unchanged. After every action, each agent can hear the tiger's growl coming either from the left, GL, or from the
right door, GR. The observation function O (identical for both agents) speci es the accuracy of observations.
We assume that tiger's growls are informative, with prede ned sensor accuracy, only if the agents listen. If
the agent opens the doors the growls have an equal chance to come from the left or right door and are thus
completely uninformative.
The reward functions are chosen to simulate cooperative and competitive scenarios. Table 1 represents the
scenario when the reward of the agent is independent of the action of the other agent. Table 2 is the friend
reward function where the agent gets half of the reward obtained by the other agent, in addition to its own
reward. In table 3 the agent gets half of the negative of the reward obtained by the other agent, hence represents
the competitive case. In other reward function Table 4, the agent gets -50 reward if another agent opens the
correct door but there is no extra reward if the other agent opens the wrong door. Table 3 incentivizes the
extreme lie while 4 incentivizes more believable lie.
5.1.2</p>
      </sec>
      <sec id="sec-3-10">
        <title>Experimental setup</title>
        <p>When level 2 friend i models a friend j, agent i sends a sincere message to j re ecting its belief and further
includes a message from another agent in its belief update. For e.g. after starting from uniform belief, if the
agent i hears GL, it will send the sincere message mi;s = (0:75; 0:25), which assigns probability 0.75 to TL and
0.25 to TR. Figure 2 (b) shows the alpha vectors plot corresponding to communicative policies for CIPOMDP
agent with friend reward function modeling CIPOMDP (also with a friend reward). The resulting policy tree
when the agents are uncertain about the initial location of the tiger is shown in gure 2 (a).
5.3</p>
      </sec>
      <sec id="sec-3-11">
        <title>Non-cooperative Scenarios</title>
        <p>When a level 1 CIPOMDP agent i models a gullible agent j, it can collect a large reward by sending a deceitful
message. For e.g. after starting from uniform belief and getting GL, the agent sends a message mi;s = (0; 1)
which indicates agent i is certain tiger is on the right, opposite to its own observation. When level 2 enemy i
models enemy j, the sophisticated agent i can prevent itself from being deceived by j and further take advantage
of the deceitful message as an extra observation. Also, since level-2 agent knows level-1 CIPOMDP models the
other as a sincere agent, the former tends to deceive the latter by sending a deceitful message. When level 2
neutral agent models an enemy, it has no incentive to deceive but can prevent itself from being deceived by
ignoring the message from the other agent. When the nesting level is increased to level 3, the agent tends to send
a sincere message with the intention to deceive the level 2 agent which is expecting a deceitful message.When
the agents are indi erent to each other's actions i.e. both have a neutral reward function, then communication
doesn't provide any added value and the reward collected is the same as the IPOMDP agent.
5.4
5.4.1</p>
      </sec>
      <sec id="sec-3-12">
        <title>Uncertainty about the opponent</title>
      </sec>
      <sec id="sec-3-13">
        <title>Message non-revealing of the agent's type</title>
        <p>Let's consider a two-agent interaction in a multi-agent tiger game where both the agents i and j start with
no information about the location of the tiger. The level-2 CIPOMDP agent i is uncertain about the level-1
opponent j 's type and thus assigns uniform probability over the other agent being friend, enemy or random 4.
The friend is characterized by the friend reward function (Rf ) which incentivizes sincere communication while
the enemy is characterized by the enemy reward function (Re) incentivizing the deceptive behavior. Also, all
other elements in the frame of the CIPOMDP agent at all levels are assumed to be the same.</p>
        <sec id="sec-3-13-1">
          <title>Behavior of level-0 agent</title>
          <p>The POMDP agent at level-0 which starts with a uniform distribution over the physical states of the world T L
and T R, would open the door if it received two consecutive growls indicating the same location of tiger and the
message non-indicative of opposing growl. Further, the agent would open the door at any time-step, regardless
4It can be the case that level-2 CIPOMDP is uncertain about the strategic level of the opponent and might want to model
POMDP as well, but to simplify the illustration, we stick to 3 types of models
of its observation, if it got the extreme message indicating the location of a tiger with certainty. Also, being a
literal speaker, the agent sends out its true belief at every time-step with probability .</p>
        </sec>
        <sec id="sec-3-13-2">
          <title>Behavior of level-1 agent</title>
          <p>The less sophisticated level-1 agent j;1 models sincere and gullible POMDP agent i;0, and the random agent
SMi. Let's consider the behavior of level-1 agent j which has predicted the behavior of i;0 for every possible
message it can send. The agent je;1 with enemy reward function (Table 4), has incentive to deceive the opponent,
e
because every time i opens the right door, j gets -50 reward. Let's suppose the agent j;1 listens at time step
t = 0 , to be certain of the tiger's location and then sends the message at time step t = 1 indicating growl from
opposite location than what it heard. Let's suppose je;1 got GL, then it would send the message (0.25, 0.75).
On other hand, if the level-1 agent had a friend reward function (Table 2), it would get a share of i's reward.
Let's call the agent jf;1. jf;1 would want the other agent i to open the right door, and hence send the sincere
message, indicative of its belief updated based on the observation. For an observation of GR, it would send a
message (0.25, 0.75) at t = 1.</p>
        </sec>
        <sec id="sec-3-13-3">
          <title>Belief update for level-2 agent</title>
          <p>f
Level-2 CIPOMDP i uses IPBVI-Comm to solve for the anticipated behavior of modeled agents j;1 and je;1.
Now let's see how the belief update proceeds for i. Note, that optimal messages from level-1 agents are not
indicative of the agent type. For e.g. if the enemy received GL, it would send the message (0.25, 0.75) and if the
friend received GR, it would again send the message (0.25, 0.75). We study the scenario when the message itself
is indicative of agent type in the next section. It turns out the agent i is still able to assign a higher probability
to one agent type than the other based on the observation it received. For e.g. if the observation and message
received for i in two time-steps are &lt; GL;0 nil0 &gt; and &lt; GL; (0:25; 0:75) &gt;, as shown in table 6, scenario 1, the
agent is more likely to be an enemy than the friend because its observations contradict the received message.
The belief update of other scenarios are shown in gure 3. The description of each scenario is provided in table
6.
Again we consider the scenario when there is uncertainty about the opponent type, i.e. when level 2 neutral
agent i models both enemy and friend. The only di erence now is that the agent modeled at level-1 j;1 has
the reward function that incentivizes the extreme lie (Table4). In this case, the higher level agent i;2 can
gure out if the other agent j is friend or enemy based on the message content only. The CIPOMDP agent i;2
incorporates the message from j as a sincere message if it is closer to the belief of the modeled agent j;1 and
discards the message as an insincere message if the incoming message is way o the belief of the modeled agent.
For e.g. if the belief of the modeled agent is (0:85; 0:15), (0:75; 0:25) is considered a sincere message while (1; 0)
is considered insincere message. Let's suppose i;2 starts by listening at t = 0. After receiving GL and message
nil, the belief shifts towards physical state T L but belief is equally distributed among both frames. At t = 1,
after receiving message (0:75; 0:25) and observation GL, the belief concentrates on tiger being on left and other
agent being a friend. The agent is able to detect friend from enemy by calculation of sincerity term P (mi;rj j ).
When j = (bj ; Rf ), P (mi;rj j ) = 1 and when j = (bj ; Re), P (mi;rj j ) = 0. This happens because for level-1
CIPOMDP with enemy reward function, the optimal action would be to lie to the extreme. This would make
i;0 open the door intended by the deceiver j;1, disregarding its own observation. This message reveals j as an
enemy to the higher level agent i.
Since the point-based algorithm only backs up alpha-vectors optimal at xed set of belief points, the performance
would depend on the number of belief points chosen. Figure 4 shows the comparison of expected reward for the
di erent number of belief points.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We started by devising a technique to incorporate the message into POMDP belief update in order to allow an
agent that does not model other agents to take part in exchange of (literal) messages. We then formalized the
notion of sincerity and deception in terms of belief of the agent and messages in message space. We adopted
a point based solution method to CIPOMDPs to alleviate the complexities of considering communication as
well as observations and physical actions. The analysis of computed policies shows the added sophistication
of communication results in policies of superior quality, which is further supported by the empirical results on
several experiments conducted on multi-agent tiger game. More importantly, we showed how higher levels of
theory of mind may allow an agent to disambiguate a sincere friend from deceitful foe based on received message
and observation from the environment.</p>
      <p>In future work, we want to explore the higher depth of nesting, and more relaxed soft maximization criterion
for action selection which can give rise to richer rational communicative behavior agents can engage in. We are
considering online planning method like Monte Carlo tree search for computing policy which provides scalability
and with some variation, could accomodate continuous message space.
[AdR18]
[BAG+05]
[Gmy20]</p>
      <p>Andrea L Guzman and Seth C Lewis. Arti cial intelligence and communication: A human{machine
communication research agenda. New Media &amp; Society, 22(1):70{86, 2020.
[SSW+17]
[STHP91]
[STX+18]
[YFS+20]
[ZVlY04]</p>
      <p>
        Sainbayar Sukhbaatar, arthur szlam, and Rob Fergus. Learning multiagent communication with
backpropagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing Systems 29, pages 2244{2252. Curran A
        <xref ref-type="bibr" rid="ref8">ssociates, Inc.,
2016</xref>
        .
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AASN+20]
          <string-name>
            <given-names>Md</given-names>
            <surname>Ali Reza Al Amin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sachin</given-names>
            <surname>Shetty</surname>
          </string-name>
          , Laurent L. Njilla,
          <string-name>
            <surname>Deepak K. Tosh</surname>
          </string-name>
          , and
          <string-name>
            <surname>Charles</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kamhoua</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dynamic</given-names>
            <surname>Cyber Deception Using Partially Observable Monte-Carlo Planning</surname>
          </string-name>
          Framework, chapter
          <volume>14</volume>
          , pages
          <fpage>331</fpage>
          {
          <fpage>355</fpage>
          . John Wiley Sons, Ltd,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Saul</given-names>
            <surname>Albert and J. P. de Ruiter</surname>
          </string-name>
          .
          <article-title>Repair: The interface between interaction and cognition</article-title>
          .
          <source>Topics in Cognitive Science</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <volume>279</volume>
          {
          <fpage>313</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Beautement</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David H.</given-names>
            <surname>Allsopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greaves</surname>
          </string-name>
          , Steve Goldsmith,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Janicke</surname>
          </string-name>
          .
          <article-title>Autonomous agents and multi -agent systems (aamas) for the military - issues and challenges</article-title>
          .
          <source>In DAMAS</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>In AAAI Fall Symposium: Advances in Cognitive Systems</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Breazeal</surname>
          </string-name>
          , Atsuo Takanishi, and
          <string-name>
            <given-names>Tetsunori</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          .
          <source>Social Robots that Interact with People</source>
          , pages
          <volume>1349</volume>
          {
          <fpage>1369</fpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Crystal</given-names>
            <surname>Chao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Thomaz</surname>
          </string-name>
          .
          <article-title>Timed petri nets for uent turn-taking over multimodal interaction resources in human-robot collaboration</article-title>
          .
          <source>The International Journal of Robotics Research</source>
          ,
          <volume>35</volume>
          (
          <issue>11</issue>
          ):
          <volume>1330</volume>
          {
          <fpage>1353</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Devin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Alami</surname>
          </string-name>
          .
          <article-title>An implemented theory of mind to improve human-robot shared plans execution</article-title>
          .
          <source>In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI)</source>
          , pages
          <fpage>319</fpage>
          {
          <fpage>326</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Sandra</given-names>
            <surname>Devin</surname>
          </string-name>
          , Aurelie Clodic, and
          <string-name>
            <given-names>Rachid</given-names>
            <surname>Alami</surname>
          </string-name>
          .
          <article-title>About decisions during human-robot shared plan achievement: Who should act and how</article-title>
          ? In ICSR,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Prashant</given-names>
            <surname>Doshi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Gmytrasiewicz</surname>
          </string-name>
          .
          <article-title>Approximating state estimation in multiagent settings using particle lters</article-title>
          .
          <source>In Proceeding of AAMAS</source>
          <year>2005</year>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>In Proceedings of the 23rd National Conference on Arti cial Intelligence - Volume 1, AAAI'08</source>
          , page
          <volume>63</volume>
          {
          <fpage>68</fpage>
          . AAAI Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [DWW+15]
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Pan</surname>
          </string-name>
          <string-name>
            <given-names>Ding</given-names>
            ,
            <surname>Henry M. Wellman</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Genyue Fu</surname>
            , and
            <given-names>Kang</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Theory-of-mind training causes honest young children to lie</article-title>
          .
          <source>Psychological Science</source>
          ,
          <volume>26</volume>
          (
          <issue>11</issue>
          ):
          <year>1812</year>
          {
          <year>1821</year>
          ,
          <year>2015</year>
          . PMID:
          <volume>26431737</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Evdokimov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Umberto</given-names>
            <surname>Garfagnini</surname>
          </string-name>
          .
          <article-title>Communication and behavior in organizations: An experiment</article-title>
          .
          <source>Quantitative Economics</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <volume>775</volume>
          {
          <fpage>801</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Ettinger</surname>
          </string-name>
          and
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Jehiel</surname>
          </string-name>
          .
          <article-title>A theory of deception</article-title>
          .
          <source>American Economic Journal: Microeconomics</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>20</fpage>
          ,
          <year>February 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [FAdFW16]
          <string-name>
            <surname>Jakob</surname>
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yannis M. Assael</surname>
            , Nando de Freitas, and
            <given-names>Shimon</given-names>
          </string-name>
          <string-name>
            <surname>Whiteson</surname>
          </string-name>
          .
          <article-title>Learning to communicate to solve riddles with deep distributed recurrent q-networks</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Jakob N.</given-names>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Francis</given-names>
            <surname>Song</surname>
          </string-name>
          , Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson,
          <string-name>
            <surname>Matthew M Botvinick</surname>
            ,
            <given-names>and Michael H.</given-names>
          </string-name>
          <string-name>
            <surname>Bowling</surname>
          </string-name>
          .
          <article-title>Bayesian action decoder for deep multi-agent reinforcement learning</article-title>
          .
          <source>ArXiv</source>
          , abs/
          <year>1811</year>
          .01458,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Gamer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Ambach</surname>
          </string-name>
          . Deception research today. Frontiers in Psychology,
          <volume>5</volume>
          :
          <fpage>256</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>Autonomous Agents and Multiagent Systems Journal</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <volume>319</volume>
          {
          <fpage>350</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Gmytrasiewicz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Prashant</given-names>
            <surname>Doshi</surname>
          </string-name>
          .
          <article-title>A framework for sequential planning in multiagent settings</article-title>
          .
          <source>Journal of Arti cial Intelligence Research</source>
          ,
          <volume>24</volume>
          :
          <fpage>49</fpage>
          {
          <fpage>79</fpage>
          ,
          <year>2005</year>
          . http://jair.org/contents/v24.html.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Piotr J.</given-names>
            <surname>Gmytrasiewicz</surname>
          </string-name>
          .
          <article-title>How to do things with words: A bayesian approach</article-title>
          .
          <source>J. Artif. Intell. Res.</source>
          ,
          <volume>68</volume>
          :
          <fpage>753</fpage>
          {
          <fpage>776</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>U.</given-names>
            <surname>Gneezy</surname>
          </string-name>
          . Deception:
          <article-title>The role of consequences</article-title>
          .
          <source>The American Economic Review</source>
          ,
          <volume>95</volume>
          :
          <fpage>384</fpage>
          {
          <fpage>394</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [GYPC20]
          <string-name>
            <given-names>J.</given-names>
            <surname>George</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parayil</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabortty</surname>
          </string-name>
          .
          <article-title>A model-free approach to distributed transmit beamforming</article-title>
          .
          <source>In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>5170</fpage>
          {
          <fpage>5174</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Isaac</surname>
          </string-name>
          and
          <string-name>
            <given-names>Will</given-names>
            <surname>Bridewell</surname>
          </string-name>
          .
          <article-title>White lies on silver tongues: Why robots need to deceive (and how)</article-title>
          .
          <source>pages 157{172</source>
          , 10
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurakin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Adversarial examples in the physical world</article-title>
          .
          <source>ArXiv, abs/1607.02533</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Omid</given-names>
            <surname>Madani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steve</given-names>
            <surname>Hanks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anne</given-names>
            <surname>Condon</surname>
          </string-name>
          .
          <article-title>On the undecidability of probabilistic planning and related stochastic optimization problems</article-title>
          .
          <source>Arti cial Intelligence</source>
          ,
          <volume>147</volume>
          (
          <issue>1</issue>
          ):5 {
          <fpage>34</fpage>
          ,
          <year>2003</year>
          .
          <article-title>Planning with Uncertainty and Incomplete Information</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Nayyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Teneketzis</surname>
          </string-name>
          .
          <article-title>Decentralized stochastic control with partial history sharing: A common information approach</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          ,
          <volume>58</volume>
          (
          <issue>7</issue>
          ):
          <volume>1644</volume>
          {
          <fpage>1658</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Ranjit</given-names>
            <surname>Nair</surname>
          </string-name>
          , David Pynadath,
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Yokoo</surname>
          </string-name>
          , Milind Tambe, and
          <string-name>
            <given-names>Stacy</given-names>
            <surname>Marsella</surname>
          </string-name>
          .
          <article-title>Communication for improving policy computation in distributed pomdps</article-title>
          .
          <source>In Proceedings of the Agents and Autonomous Multiagent Systems (AAMAS)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Ranjit</given-names>
            <surname>Nair</surname>
          </string-name>
          , Maayan Roth, Makoto Yokoo, and
          <string-name>
            <given-names>Milind</given-names>
            <surname>Tambe</surname>
          </string-name>
          .
          <article-title>Taming decentralized pomdps: Towards e cient policy computation for multiagent settings</article-title>
          .
          <source>In Proceedings of the Eighteenth International Joint Conference on Arti cial Intelligence</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Frans</given-names>
            <surname>Olihoek</surname>
          </string-name>
          , Mathijs Spaan, and
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Vlassis</surname>
          </string-name>
          .
          <article-title>Dec-pomdps with delayed communication</article-title>
          .
          <source>In Proceedings of MSDM 2007 May</source>
          <volume>15</volume>
          ,
          <year>2007</year>
          , Honolulu,
          <source>Hawai'i, USA</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Lauren A Oey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Adena</given-names>
            <surname>Schachner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Vul</surname>
          </string-name>
          .
          <article-title>Designing good deception: Recursive theory of mind in lying and lie detection</article-title>
          , May
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          , Geo Gordon, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Thrun</surname>
          </string-name>
          .
          <article-title>Point-based value iteration: An anytime algorithm for pomdps</article-title>
          .
          <source>In Proceedings of the 18th International Joint Conference on Arti cial Intelligence</source>
          ,
          <source>IJCAI'03</source>
          , page
          <volume>1025</volume>
          {
          <fpage>1030</fpage>
          , San Francisco, CA, USA,
          <year>2003</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Renoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mouaddib</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Gloannec</surname>
          </string-name>
          .
          <article-title>A decision-theoretic planning approach for multirobot exploration and event search</article-title>
          .
          <source>In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          , pages
          <fpage>5287</fpage>
          {
          <fpage>5293</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Nancy</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Ratner and Rose R. Olver</surname>
          </string-name>
          .
          <article-title>Reading a tale of deception, learning a theory of mind?</article-title>
          <source>Early Childhood Research Quarterly</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <volume>219</volume>
          {
          <fpage>239</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Shinde</surname>
          </string-name>
          , Prashant Doshi, and
          <string-name>
            <given-names>Omid</given-names>
            <surname>Setayeshfar</surname>
          </string-name>
          .
          <article-title>Active deception using factored interactive pomdps to recognize cyber attacker's intent</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Edward J. Sondik.</surname>
          </string-name>
          <article-title>The optimal control of partially observable markov processes over the in nite horizon: Discounted costs</article-title>
          .
          <source>Oper. Res.</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <volume>282</volume>
          {
          <fpage>304</fpage>
          ,
          <string-name>
            <surname>April</surname>
          </string-name>
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>Modelling deception using theory of mind in multi-agent systems</article-title>
          .
          <source>AI Commun</source>
          .,
          <volume>32</volume>
          :
          <fpage>287</fpage>
          {
          <fpage>302</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sliva</surname>
          </string-name>
          , Suhang Wang,
          <string-name>
            <given-names>Jiliang</given-names>
            <surname>Tang</surname>
          </string-name>
          , and Huan Liu.
          <article-title>Fake news detection on social media: A data mining perspective</article-title>
          .
          <source>SIGKDD Explor</source>
          .,
          <volume>19</volume>
          :
          <fpage>22</fpage>
          {
          <fpage>36</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>Beate</given-names>
            <surname>Sodian</surname>
          </string-name>
          , Catherine Taylor, Paul L. Harris, and
          <string-name>
            <given-names>Josef</given-names>
            <surname>Perner</surname>
          </string-name>
          .
          <article-title>Early deception and the child's theory of mind: False trails and genuine markers</article-title>
          .
          <source>Child Development</source>
          ,
          <volume>62</volume>
          (
          <issue>3</issue>
          ):
          <volume>468</volume>
          {
          <fpage>483</fpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Schlenker</surname>
          </string-name>
          , Omkar Thakoor, Haifeng Xu, Fei Fang,
          <article-title>Milind Tambe, Long Tran-Thanh, Phebe Vayanos, and Yevgeniy Vorobeychik. Deceiving cyber adversaries: A game theoretic approach</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems</source>
          , AAMAS '
          <volume>18</volume>
          , page
          <volume>892</volume>
          {
          <fpage>900</fpage>
          ,
          <string-name>
            <surname>Richland</surname>
            ,
            <given-names>SC</given-names>
          </string-name>
          ,
          <year>2018</year>
          . International Foundation for Autonomous Agents and
          <string-name>
            <given-names>Multiagent</given-names>
            <surname>Systems</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Vaibhav</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Unhelkar</surname>
            ,
            <given-names>Shen</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Julie</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Decision-making for bidirectional communication in sequential human-robot collaborative tasks</article-title>
          .
          <source>In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI '20, page</source>
          <volume>329</volume>
          {
          <fpage>341</fpage>
          , New York, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <given-names>Ning</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David V.</given-names>
            <surname>Pynadath</surname>
          </string-name>
          , and Susan G. Hill.
          <article-title>Trust calibration within a human-robot team: Comparing automatically generated explanations</article-title>
          .
          <source>In The Eleventh ACM/IEEE International Conference on Human Robot Interaction</source>
          , HRI '
          <volume>16</volume>
          , page
          <volume>109</volume>
          {
          <fpage>116</fpage>
          . IEEE Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <given-names>Luyao</given-names>
            <surname>Yuan</surname>
          </string-name>
          , Zipeng Fu, Jingyue Shen, Lu Xu,
          <string-name>
            <given-names>Junhong</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <surname>Song-Chun Zhu</surname>
          </string-name>
          .
          <article-title>Emergence of Pragmatics from Referential Game between Theory of Mind Agents</article-title>
          . arXiv e-prints, page arXiv:
          <year>2001</year>
          .07752,
          <year>Jan 2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Yu</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Richard A. Volz, Thomas R. loerger, and John Yen.
          <article-title>A decision-theoretic approach for designing proactive communication in multi-agent teamwork</article-title>
          .
          <source>In Proceedings of the 2004 ACM Symposium on Applied Computing</source>
          , SAC '
          <volume>04</volume>
          , page
          <volume>64</volume>
          {
          <fpage>71</fpage>
          , New York, NY, USA,
          <year>2004</year>
          .
          <article-title>Association for Computing Machinery</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>