=Paper=
{{Paper
|id=Vol-3022/paper7
|storemode=property
|title=Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception
|pdfUrl=https://ceur-ws.org/Vol-3022/paper7.pdf
|volume=Vol-3022
|authors=Sarit Adhikari,Piotr J. Gmytrasiewicz
|dblpUrl=https://dblp.org/rec/conf/atal/AdhikariG21
}}
==Telling Friend from Foe - Towards a Bayesian Approach to Sincerity and Deception==
Sarit Adhikari and Piotr J. Gmytrasiewicz, University of Illinois at Chicago, Chicago, IL 60607. sadhik6@uic.edu, piotr@uic.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: R. Falcone, J. Zhang, and D. Wang (eds.): Proceedings of the 22nd International Workshop on Trust in Agent Societies, London, UK, May 3-7, 2021, published at http://ceur-ws.org

===Abstract===
In Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs), agents use Bayesian updates to process their observations and messages without the usual assumption of cooperative discourse. We formalize the notions of sincerity and deception in terms of the messages in the message space and the belief of the agent sending the message. We then use a variant of the point-based value iteration method, called IPBVI-Comm, to study the optimal interactive behavior of agents in cooperative and competitive variants of the Tiger game. Modeling the belief and preferences of the opponent allows an agent to predict the opponent's optimal communicative behavior and physical actions. Unsurprisingly, it may be optimal for agents to attempt to mislead others if their preferences are not aligned. But it turns out that a higher depth of reasoning allows an agent to detect insincere communication and to guard against it. Specifically, in some scenarios the agent can distinguish a truthful friend from a deceptive foe when the received message contradicts the agent's observations, even when the received message does not directly reveal the opponent's type.

===1 Introduction===
The study of communication and interaction among self-interested agents in partially observable and stochastic domains has applications in several fields, ranging from the military [BAG+05, GYPC20] to social robotics [BTK08]. Communicative behavior among agents in multi-agent systems has been studied in cognitive science [AdR18], economics [EG19], and artificial intelligence [GL20, YFS+20]. With the advancement of artificial intelligence, the topic of machine deception has become more important. In particular, since communication among agents is becoming ubiquitous, malicious agents trying to exploit vulnerabilities in other AI systems and in humans may be a common problem in the future. It is therefore important to lay the foundation for deception-resistant AI systems. Further, as more AI agents become part of our social life, the study of emergent social behavior among communicating agents (both artificial and human) with varied preferences is vital. Although a vast literature exists on machines fooling humans with fake content [SSW+17] and on humans fooling machines with adversarial attacks [KGB17], the study of deception in sequential decision-making scenarios, by modeling other agents, has rarely been explored. As argued in [IB17], AI needs to guard itself against malevolent humans and sometimes be able to deceive as well. On the other hand, when the agents' preferences align, they benefit from sincere communication.

Figure 1: Theory of Mind (ToM) reasoning from the perspective of agent i interacting with agent j when there is uncertainty about the opponent type. The level is indicative of cognitive sophistication. The neutral agent i thinks j might be an enemy, a friend, or a random agent. Further, i thinks that j thinks i might be a sincere and gullible agent or a random agent. The behavior of j is simulated by i by putting itself in j's shoes. Within that simulation, i needs to reason about how j simulates i's behavior.

Like physical actions, communicative actions are guided by the expected utility obtained in a particular state. Agents sometimes benefit from being sincere, and sometimes it is in their best interest to deceive. To be able to cooperate, deceive, or guard against deception, an agent needs to model the belief, intention, and preferences of the other agent [BI11]. Humans are not very adept at detecting lies; we find it difficult to tell a skillful liar from a truthful, but possibly misinformed, friend. The main reason is that both may behave the same. We show examples of interactions during which agents, using Bayesian updates, can form informative beliefs about the identity of their opponents by keeping careful track of their possible states of knowledge and by comparing the content of their communications to the agent's own observations. It is important to note that analyzing communicative behavior separately from physical actions is bound to be insufficient, unless agents care directly about the states of other agents' beliefs, as opposed to caring only about the physical states of the world. In the latter, arguably more realistic, case, rational agents will attempt to modify others' beliefs only because this could induce them into advantageous physical actions; this could not be analyzed if one abstracts physical actions away.

Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs) provide a principled framework for rational interaction and communication in multi-agent environments [Gmy20]. The CIPOMDP framework is an extension of interactive POMDPs [GD05] that includes the exchange of messages among the agents. IPOMDPs, in turn, extend Partially Observable Markov Decision Processes (POMDPs) [Son78] to include other agents by incorporating their models into the state space. Thus, while POMDPs provide a theoretically sound framework to model uncertainty about the state of the world, the Theory of Mind (ToM) approach of IPOMDPs and CIPOMDPs allows modeling of uncertainty about other agents, including their beliefs about the world and about other agents. Figure 1 shows the theory of mind of the decision-making agent, which is uncertain about the type of the other agent and about how that agent may model others, including the original agent. At the bottom of the hierarchy of models could be a random agent, or a rational agent that does not model others, i.e., a classical POMDP agent. A great deal of research in psychology establishes a connection between deception and recursive theory of mind reasoning, which starts at an early age in humans [STHP91, RO98, DWW+15]. More recently, [OSV19] provides a comprehensive quantitative analysis of the role of rationality and theory of mind in deception and in detecting deception.

We first formalize the definitions of sincerity and deception in terms of the message sent by the agent and its belief about the state of the world. We then propose an offline solution method for communicative IPOMDPs, which adopts a point-based solution technique to solve for the optimal communicative behavior. The algorithm computes a policy that yields the optimal action-message pair at each time-step of interaction with the other agent. The executed action may not be known to the other agent, but we assume that the message sent is received by the other agent with certainty.
The subsequent policies are conditional not only on the observation but also on the message received from the other agent. Based on the preferences of the agent and what is known about the preferences of the modeled agent, the agent may benefit from incorporating the message from the other agent into its belief, but it discounts the message if it thinks the other agent has an incentive to lie. Since a CIPOMDP agent needs a message space to calculate its policy offline, and messages express an agent's beliefs, we construct the message space by discretizing the interactive belief space. The policy of the POMDP agent at the bottom of a ToM hierarchy is augmented with the sincere messages it can send, using a look-ahead reachability tree from the initial belief in the interactive state of the modeling agent. Similarly, we propose a way for an agent modeled as a POMDP to receive messages by augmenting its observation space. We apply CIPOMDPs to agents interacting in the multi-agent tiger game and show that communication is valuable to agents and results in superior policies compared to their no-communication counterparts. In cooperative scenarios, the agent can take advantage of messages from a sincere agent as additional observations, and can send sincere messages that inform the other agent. In competitive scenarios, the agent not only attempts to deceive the other agent but also ignores messages it knows to be deceitful. We then show how the Bayesian update allows an agent higher in a cognitive hierarchy to tell a friend from a foe based on the message received and its own observations.

===2 Related work===
The problem of agents communicating and interacting simultaneously has been addressed in several decision-theoretic as well as RL settings. [FAdFW16] uses DDRQN to learn a communication protocol and solve a riddle through coordination. [FSH+19] combines multi-agent RL with a Bayesian update to compute communication protocols and policies in cooperative, partially observable multi-agent settings. [SsF16] uses a neural model that learns communication along with the policy. In the planning and control setting, [NMT13] uses communication for controllers to share part of their observation and control history at each step. More recently, [ULS20] used a POMDP with communication for a human-robot collaboration task. Other work in HRI includes [CT16], [DCA17], and [WPH16]. [DA16] uses a theory of mind approach for the execution of a shared plan. Communication has been studied extensively in other multi-agent decision-theoretic frameworks [NRYT03], [NPY+04], [ZVlY04], [OSV07]. In [RMG15], agents use an extended belief state that contains an approximation of other agents' beliefs. However, these works assume fully cooperative interactions and mostly involve central planning. CIPOMDPs, on the other hand, provide subjective theory of mind reasoning during communication and follow Bayesian approaches to pragmatics.

Deception has been widely studied across multiple disciplines, including game theory [STX+18], [EJ10], psychology [GA14], and economics [Gne05]. When it comes to sequential decision processes for studying attackers' and defenders' approaches in cybersecurity research, the decision-theoretic frameworks of POMDPs [AASN+20] and IPOMDPs [SDS20] have been used. [SPB+19] combines ToM with components from deception theory and implements an epistemic agent using an agent-oriented programming language.
===3 Background===
====3.1 Communicative Interactive POMDPs====
The CIPOMDP [Gmy20] is the first general framework for an autonomous, self-interested agent to communicate and interact with other agents in the environment based on Bayesian decision theory. A finitely nested communicative interactive POMDP of agent i in an environment with agent j is defined as:

CIPOMDP_i = \langle IS_{i,l}, A_i, M, \Omega_i, T_i, O_i, R_i \rangle \qquad (1)

where IS_{i,l} is a set of interactive states, defined as IS_{i,l} = S × M_{j,k}, l ≥ 1, where S is the set of physical states, M_{j,k} is the set of possible models of agent j, l is the strategy (nesting) level, and k < l. The set of possible models M_{j,k} consists of intentional models, Θ_j, or sub-intentional ones, SM_j. While the intentional models ascribe beliefs, preferences, and rationality in action selection to the modeled agent, the sub-intentional models do not. We consider k-th level (k < l) intentional models of agent j, defined as θ_{j,k} = ⟨b_{j,k}, A_j, Ω_j, T_j, O_j, R_j⟩, where b_{j,k} is agent j's belief nested to level k, b_{j,k} ∈ Δ(IS_{j,k}). The intentional model θ_{j,k}, sometimes called a type, can be rewritten as θ_{j,k} = ⟨b_{j,k}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame. Among the classes of sub-intentional models, we consider the no-information model [GD00], which randomly selects an action to execute and a message to send at each time-step. The random models are possible at each level, starting with level 0.

In contrast to classical POMDPs, and similarly to IPOMDPs, the transition, observation, and reward functions in CIPOMDPs take the actions of other agents into account. A = A_i × A_j is the set of joint actions of all agents, Ω_i is the set of agent i's possible observations, T_i : S × A × S → [0, 1] is the state transition function, O_i : S × A × Ω_i → [0, 1] is the observation function, and R_i : S × A → ℝ is the reward function. The interactive state space IS_{i,l} can be defined inductively:

IS_{i,0} = S, \quad \Theta_{j,0} = \{\langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(S)\}, \quad M_{j,0} = \Theta_{j,0} \cup SM_j
IS_{i,1} = S \times M_{j,0}, \quad \Theta_{j,1} = \{\langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1})\}, \quad M_{j,1} = \Theta_{j,1} \cup SM_j
\ldots
IS_{i,l} = S \times \bigcup_{k=0}^{l-1} M_{j,k}, \quad \Theta_{j,l} = \{\langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l})\}, \quad M_{j,l} = \Theta_{j,l} \cup SM_j

The above defines the 0-level model, θ_{j,0}, as having beliefs only over the physical state space, S. The level-1 agent model maintains beliefs over the physical states and the 0-level models of the opponent. A level-l agent, θ_{j,l}, maintains beliefs over S and over models of the opponent nested up to level l − 1.

M is a set of messages the agents can send to and receive from each other, i.e., it is a communication language the agents share. Since agents' beliefs are probability distributions and communication is intended to share beliefs, it is natural to interpret a message in M as a marginal probability distribution over a subset of variables in the agents' interactive state spaces IS_i and IS_j, which overlap. That way, M is a set of probabilistic statements about the interactive state space. The message nil, i.e., silence, contains no variables. Note that we do not assume that messages reflect agents' actual beliefs. We will further discretize M below.
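To make the nesting of models concrete, the sketch below shows one possible Python representation of frames, intentional models, and interactive states for the tiger game used later in the paper. The class and field names are our own illustration and are not part of the CIPOMDP formalism; a full implementation would also attach the transition, observation, and reward functions to each frame.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Physical states of the multi-agent tiger game used in Section 5.
STATES = ("TL", "TR")

@dataclass(frozen=True)
class Frame:
    """Agent j's frame: every element of an intentional model except the belief."""
    actions: Tuple[str, ...]        # A_j
    observations: Tuple[str, ...]   # Omega_j
    reward_name: str                # e.g. "friend", "enemy", "neutral"

@dataclass
class IntentionalModel:
    """theta_{j,k} = <b_{j,k}, frame>: a belief nested to level k plus a frame."""
    level: int
    belief: Dict[object, float]     # b_{j,k} in Delta(IS_{j,k}); Delta(S) at level 0
    frame: Frame

@dataclass
class RandomModel:
    """Sub-intentional no-information model: uniform over actions and messages."""
    actions: Tuple[str, ...]

Model = Union[IntentionalModel, RandomModel]

@dataclass(frozen=True)
class InteractiveState:
    """An interactive state pairs a physical state with a model of the other agent."""
    physical_state: str
    other_model_id: str             # key into a finite dictionary of candidate models

# A level-0 intentional model of j: beliefs over physical states only.
level0_friend = IntentionalModel(
    level=0,
    belief={"TL": 0.5, "TR": 0.5},
    frame=Frame(actions=("OR", "OL", "L"), observations=("GL", "GR"),
                reward_name="friend"),
)
candidate_models: Dict[str, Model] = {
    "friend0": level0_friend,
    "random": RandomModel(actions=("OR", "OL", "L")),
}

# A level-1 interactive belief of agent i: a distribution over (state, model) pairs.
b_i_1 = {InteractiveState(s, m): 0.25 for s in STATES for m in candidate_models}
assert abs(sum(b_i_1.values()) - 1.0) < 1e-9
</syntaxhighlight>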
====3.2 Belief update in CIPOMDPs====
Belief update in CIPOMDPs is analogous to belief update in IPOMDPs when it comes to actions and observations. At any particular time step, agents i and j can not only perform physical actions and observe, but also send and receive messages. Call the message i sent at time t − 1 m^{t-1}_{i,s}, and the one i received at time t m^t_{i,r}, and analogously for j. We assume all messages are in M and that message transmission is perfect. We provide a precise definition of the message space in our formulation in Section 4.1. The belief update in CIPOMDPs has to update the probability of the interactive state given the previous belief, action, and observation, and given the message sent (at the previous time step) and received (at the current time step), P(is^t | b_i^{t-1}, a_i^{t-1}, m_{i,s}^{t-1}, o_i^t, m_{i,r}^t):

b_i^t(is^t) = P(is^t \mid b_i^{t-1}, a_i^{t-1}, m_{i,s}^{t-1}, o_i^t, m_{i,r}^t) = \eta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} P(m_{j,s}^{t-1}, a_j^{t-1} \mid \theta_j^{t-1}) \times O_i(s^t, a^{t-1}, o_i^t) \, T_i(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, m_{j,s}^{t-1}, o_j^t, m_{j,r}^t, b_j^t) \, O_j(s^t, a^{t-1}, o_j^t) \qquad (2)

The term P(m_{j,s}^{t-1}, a_j^{t-1} | θ_j^{t-1}) quantifies the relation between the message i received from j and the model, θ_j, of agent j that generated the message. (Note that m_{j,s}^{t-1} = m_{i,r}^t because message transmission is assumed to be perfect.) This term is the measure of j's sincerity, i.e., whether the message j sent reflects j's beliefs, which are part of the model θ_j. η is the normalizing constant.

====3.3 Planning in CIPOMDPs====
The utility of the interactive belief of agent i, contained in i's type θ_i, is:

U_i(\theta_i) = \max_{(m_{i,s}, a_i)} \Big\{ \sum_{is \in IS} b_i(is) \, ER_i(is, m_{i,s}, a_i) + \gamma \sum_{(m_{i,r}, o_i)} P(m_{i,r}, o_i \mid b_i, a_i) \, U_i(\langle SE_{\theta_i}(b_i, a_i, m_{i,s}, o_i, m_{i,r}), \hat{\theta}_i \rangle) \Big\} \qquad (3)

ER_i(is, m_{i,s}, a_i) above is the immediate reward to i for sending m_{i,s} and executing action a_i given the interactive state is, and is equal to Σ_{a_j} R_i(is, a_i, a_j, m_{i,s}) P(a_j | θ_j). Planning in CIPOMDPs makes use of Equation 3, which is based on the Bellman optimality principle. The policy computes the optimal action-message pair, which results in the maximum expected reward. Consequently, value iteration in CIPOMDPs is analogous to that in IPOMDPs and POMDPs. The set of optimal message-action pairs, (m*_{i,s}, a*_i), is obtained by replacing max in Equation 3 with argmax. We call the resulting set of optimal message-action pairs OPT(θ_i). When agent i models agent j as a strict optimizer, i predicts that j would choose an action-message pair in the OPT set with equal probability:

P(m_{j,s}, a_j \mid \theta_j) = \frac{1}{|OPT(\theta_j)|} \qquad (4)

and P(m_{j,s}, a_j | θ_j) is equal to zero if (m_{j,s}, a_j) is not in OPT. The possibility that agents may be less than optimal is considered in [Gmy20]. Being able to compute the probabilities of messages given the preferences and beliefs of a speaker is of crucial importance when sincerity is not guaranteed. The belief update in CIPOMDPs provides a principled way to discount the content of messages that may be insincere because it is in the interest of the speaker to transmit them. We give an example of this further below.
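To illustrate how the sincerity term of Equations 2 and 4 enters the update, the following sketch performs a stripped-down version of the CIPOMDP belief update over (physical state, opponent type) pairs. It is a simplification under our own assumptions: the predicted action-message distribution for each candidate type is assumed to have been obtained offline by solving that model, and the propagation of the opponent's nested belief through the state estimator τ is omitted.

<syntaxhighlight lang="python">
from collections import defaultdict

def cipomdp_belief_update(prior, predict_msg_act, T, O, states, a_i, o_i, m_received):
    """Simplified form of Eq. (2): posterior over (state, opponent_type) pairs.

    prior:           dict (s, type) -> probability
    predict_msg_act: dict type -> dict (m_j, a_j) -> probability  (sincerity term, Eq. 4)
    T:               dict (s, a_i, a_j, s_next) -> probability
    O:               dict (s_next, a_i, a_j, o_i) -> probability
    """
    posterior = defaultdict(float)
    for (s, typ), p in prior.items():
        for (m_j, a_j), p_ma in predict_msg_act[typ].items():
            if m_j != m_received:      # perfect channel: m_{j,s} equals m_{i,r}
                continue
            for s_next in states:
                posterior[(s_next, typ)] += (p * p_ma
                                             * T.get((s, a_i, a_j, s_next), 0.0)
                                             * O.get((s_next, a_i, a_j, o_i), 0.0))
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()} if z > 0 else dict(posterior)
</syntaxhighlight>

In this sketch, predict_msg_act for an intentional model would come from the OPT set of Equation 4 (uniform over the optimal pairs and zero elsewhere), and for the random model it would be uniform over all action-message pairs.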
===4 Approach===
====4.1 Message space====
We define the message space, M, as a set of marginal probability distributions over a subset of variables in the interactive states of the agent. Since the belief space is continuous, we make the computation more tractable by quantizing M into a finite set of belief points. The message space is augmented with nil, which is analogous to the no-op operation in the physical action set. Limiting the message space to only the nil message reduces a CIPOMDP to an IPOMDP. Usually, a message contains information about only a subset of the possible interactive states. (For example, if the physical state space, S, is factored into variables X and Y, a message, m, might be "P(0 ≤ X ≤ 100) = 0.7".) The variables that the message does not mention are interpreted as being marginalized. We use the principle of indifference and assume that the probability is uniformly distributed among the remaining variables.

The received message m_{i,r} can provide a mapping from physical states to beliefs, marginalizing the other variables of the interactive state (the belief of the other agent, its frame, etc.). Then m_{i,r}(s) denotes the belief the message ascribes to physical state s ∈ S. Further, a message may contain probabilities for only a subset of the values that the variables mentioned in the message can take. Let S' be the set of states not mentioned in the message. If a message m does not mention state s', then for all s' ∈ S':

m(s') = \frac{1 - \sum_{s \in S - S'} m(s)}{|S| - |S'|} \qquad (5)

====4.2 Sincerity and Deception====
Let X denote the set of variables describing the interactive state of the agent. ℘(X) is the set of all non-empty subsets of X. The joint belief b^t of an agent can be marginalized over a subset of the variables in X. Let b_X represent the belief marginalized over the variables in X. Accordingly, the message space M can be factored into sets of messages marginalized over each X ∈ ℘(X). A sincere message can be defined as a message m ∈ M_X in the message space which is closest to the marginalized belief b^t_X consistent with the true joint belief b^t of the agent at time t. The distance is defined in terms of the L1 norm. Thus the set of sincere messages is given by

M^t_{sincere} = \bigcup_{X \in \wp(X)} \arg\min_{m \in M_X} \| b^t_X - m \| \qquad (6)

An insincere (deceptive) message can be defined as any message in the message space except the one closest to the true belief b^t of the agent at time t. Thus the set of insincere messages is given by

M^t_{insincere} = \bigcup_{X \in \wp(X)} \Big( M_X - \arg\min_{m \in M_X} \| b^t_X - m \| \Big) \qquad (7)
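For the discretized message space used in the experiments of Section 5, Equations 6 and 7 reduce to a nearest-neighbor computation in the L1 norm. A minimal sketch, restricted to messages over the physical states only (the helper names are ours):

<syntaxhighlight lang="python">
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Message space over (TL, TR), quantized as in Section 5.1.2; nil is handled separately.
MESSAGES = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]

def sincere_messages(belief_marginal, messages=MESSAGES):
    """Messages closest (in L1) to the agent's marginal belief -- Eq. (6)."""
    best = min(l1(belief_marginal, m) for m in messages)
    return [m for m in messages if abs(l1(belief_marginal, m) - best) < 1e-12]

def insincere_messages(belief_marginal, messages=MESSAGES):
    """All remaining messages -- Eq. (7)."""
    sincere = set(sincere_messages(belief_marginal, messages))
    return [m for m in messages if m not in sincere]

# Example: after hearing GL once with 0.85 sensor accuracy, the belief over (TL, TR)
# is (0.85, 0.15); the closest, and hence sincere, message is (0.75, 0.25).
print(sincere_messages((0.85, 0.15)))      # [(0.75, 0.25)]
print(insincere_messages((0.85, 0.15)))    # the other four messages
</syntaxhighlight>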
====4.3 Communication for POMDP (level-0 CIPOMDP)====
The recursion in a CIPOMDP bottoms out as a flat POMDP, which does not have a model of the other agent. We use the definition of a literal speaker from [Gmy20]: a literal speaker generates a message reflecting its true belief about the physical states of the world, b^t, with probability 1 − α, and every other message, including nil, with probability α/(|M| − 1). The messages are then broadcast to "no one in particular" (NOIP) and do not take part in the belief update of the POMDP agent. A literal listener can incorporate the incoming message as an additional observation, which we describe in the following section. These assumptions enable an agent that does not model the other agent to participate in the exchange of messages.

=====4.3.1 Augmented observation space and function for POMDP=====
We propose that a POMDP (θ_0) can receive a message and include it in its belief update by augmenting its observation space and, consequently, its observation function. The observation space becomes the Cartesian product of the usual observation space and the message space:

\Omega' = \Omega \times M \qquad (8)

The joint probability of the observation and the message received is obtained by combining the likelihood function for a physical observation with a message distribution. The message distribution is represented by a triangular distribution, with the idea that the likelihood of a message reflecting a belief about the world state should increase monotonically with that belief. For example, if the belief is defined over two states (s1, s2) and the true state of the world is s1, the likelihood of a received message reflecting the belief (1, 0) should be higher than the likelihood of a received message reflecting the belief (0.5, 0.5). This formulation makes the agent gullible. The message likelihood is

P(m_{i,r} \mid s) = \begin{cases} \frac{1}{|S|} & \text{if } m_{i,r} = nil \\ m_{i,r}(s) & \text{otherwise} \end{cases} \qquad (9)

Given the state of the world, the observation is conditionally independent of the message received, so the augmented observation function can be defined, for all m_{i,r} ∈ M, o ∈ Ω, and s ∈ S, as

O'(s, a, o, m_{i,r}) = \frac{P(o, m_{i,r} \mid s, a)}{\sum_{o, m_{i,r}} P(o, m_{i,r} \mid s, a)} = \frac{P(o \mid s, a) P(m_{i,r} \mid s, a)}{\sum_{o} P(o \mid s, a) \sum_{m_{i,r}} P(m_{i,r} \mid s)} = \frac{O(s, a, o) P(m_{i,r} \mid s)}{\sum_{m_{i,r}} P(m_{i,r} \mid s)} \qquad (10)
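A minimal sketch of the augmented level-0 observation function of Equations 9 and 10, again using the quantized tiger-game message space; the helper names and the 0.85 accuracy default are our own choices for illustration.

<syntaxhighlight lang="python">
def message_likelihood(m, s, states):
    """P(m_{i,r} | s) from Eq. (9): uniform for nil, else the mass the message puts on s."""
    if m == "nil":
        return 1.0 / len(states)
    return m[states.index(s)]

def augmented_observation(O, s, a, o, m, messages, states):
    """O'(s, a, o, m_{i,r}) from Eq. (10), assuming o and m_{i,r} are independent given s."""
    num = O(s, a, o) * message_likelihood(m, s, states)
    denom = sum(message_likelihood(mm, s, states) for mm in messages)
    return num / denom

def O_tiger(s, a, o, accuracy=0.85):
    """Tiger-game observation function: growls are informative only while listening."""
    if a != "L":
        return 0.5
    correct = {"TL": "GL", "TR": "GR"}[s]
    return accuracy if o == correct else 1.0 - accuracy

STATES = ["TL", "TR"]
MESSAGES = ["nil", (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
# Joint likelihood of hearing GL and receiving (0.75, 0.25) in state TL after listening.
print(augmented_observation(O_tiger, "TL", "L", "GL", (0.75, 0.25), MESSAGES, STATES))
</syntaxhighlight>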
====4.4 Algorithm====
=====4.4.1 Value Iteration=====
As in POMDPs, the value function of a CIPOMDP is represented as a maximum over linear segments called alpha-vectors. Each alpha-vector corresponds to a policy, and each component of an alpha-vector ascribes a value to an interactive state. Value iteration proceeds by backing up alpha-vectors to a higher time horizon, starting from horizon 1.

=====4.4.2 Complexity=====
POMDPs suffer from the curse of dimensionality and the curse of history. Naturally, these curses carry over to IPOMDPs and CIPOMDPs, which require solving nested POMDPs and CIPOMDPs. The curse of history is more prominent in CIPOMDPs because the policy is now conditional on both the observation and the message received. Since computing optimal policies for POMDPs with exact solution methods is proven to be PSPACE-complete for a finite time horizon and undecidable for an infinite time horizon [MHC03], a large amount of work has gone into computing approximate solutions. [PGT03] introduced the point-based value iteration (PBVI) algorithm, which approximates exact value iteration by selecting a fixed set of representative belief points and maintaining only the alpha-vectors that are optimal at those points. Our work builds on interactive point-based value iteration [DP08], which showed an improvement in runtime over other IPOMDP solution methods such as [DG05]. Exact value iteration quickly becomes intractable in POMDPs and IPOMDPs due to the generation of a large number of alpha-vectors, which is exponential in the observation space: |A| |ν^{t+1}|^{|Ω|}, where ν^{t+1} denotes the set of alpha-vectors being backed up from t + 1 to t. In the case of CIPOMDPs, the size explodes further due to the inclusion of the message space in the policy: the exact number of alpha-vectors generated at time t + 1 is |A| |ν^{t+1}|^{|Ω||M|}. To keep the size of the alpha-set tractable in each iteration, we can use the point-based method, which only retains the vectors that are optimal at a fixed set of belief points. As in IPOMDPs, we need to solve the lower-level models to compute the alpha-vectors. Accordingly, we limit the initial models of the other agent to a finite set.

=====4.4.3 IPBVI-Comm=====
The point-based value iteration approach backs up the alpha-vectors optimal at a fixed set of belief points in each iteration, starting from horizon 1. Each iteration consists of three steps, which we describe below; along the way we highlight the differences with IPBVI [DP08]. The belief set for each horizon consists of randomly generated belief points across all interactive states.

'''Step 1.''' The first step involves calculating the intermediate sets of alpha-vectors Γ^{a_i, m_{i,s}, *}, representing the immediate reward for action a_i and message m_{i,s} (Equation 11), and Γ^{a_i, o_i, m_{i,s}, m_{i,r}}, representing the future reward after receiving observation o_i and message m_{i,r} (Equation 12). This step is performed for all actions, observations, and messages. For horizon 1, the computation of the immediate reward is sufficient and is used as the initial alpha-set for subsequent backups. Differently from the point-based algorithm for IPOMDPs, we need to calculate Pr(m_{i,r} | θ_{j,l-1}) and perform the belief update for the other agent j, which now depends on the message sent by i. Due to the one-step delay in the message exchange, the message sent in the current time step allows computing the interactive states in the next time step and backing up the values from the next time step. Assuming sincerity for θ_{j,0}, only the message closest to the belief gets probability 1 − α; all the other messages, including nil, share the probability α. When the message received is other than a sincere message, the level-1 CIPOMDP ignores the message, and the belief update proceeds as in an IPOMDP. For higher-level CIPOMDPs, the probability of the received message is uniformly distributed among all the messages in the OPT(θ_j) set, as defined in Section 3.3. For all a_i ∈ A_i, o_i ∈ Ω_i, m_{i,s} ∈ M, and is ∈ IS:

\Gamma^{a_i, m_{i,s}, *} \leftarrow \alpha^{a_i, m_{i,s}, *}(is) = \sum_{a_j \in A_j} R_i(s, a_i, a_j, m_{i,s}) \, Pr(a_j \mid \theta_{j,l-1}) \qquad (11)

\Gamma^{a_i, o_i, m_{i,s}, m_{i,r}} \leftarrow \alpha^{a_i, o_i, m_{i,s}, m_{i,r}}(is) = \gamma \sum_{is' \in IS'} \sum_{a_j \in A_j} Pr(m_{i,r}, a_j \mid \theta_{j,l-1}) \, T_i(s, a_i, a_j, s') \, O_i(s', a_i, a_j, o_i) \sum_{o_j} O_j(s', a_i, a_j, o_j) \, \delta_D(SE_{\hat{\theta}_j}(b_{j,l-1}, a_j, o_j, m_{j,s}, m_{i,s}) - b'_{j,l-1}) \, \alpha'(is') \qquad (12)

Here, δ_D is the Dirac delta function, taking the difference between the current belief and the updated belief as its argument. The updated belief is returned by the state estimator function SE_{θ̂_j}.

'''Step 2.''' The second step combines the intermediate alpha-vectors calculated in Step 1, weighted by the observation and message likelihoods, using a cross-sum operation. Due to the point-based approach, the cross-sum operation of exact value iteration is simplified. The step proceeds by selecting only those intermediate alpha-vectors that are optimal at at least one of the given belief points:

\Gamma^{a_i, m_{i,s}} \leftarrow \Gamma^{a_i, m_{i,s}, *} \oplus_{o_i \in \Omega_i, m_{i,r} \in M} \arg\max_{\Gamma^{a_i, o_i, m_{i,s}, m_{i,r}}} (\alpha^{a_i, o_i, m_{i,s}, m_{i,r}} \cdot b_{i,l}), \quad \forall b_{i,l} \in B_{i,l} \qquad (13)

Figure 2: (a) Policy tree for a CIPOMDP agent with the friend reward function modeling a sincere and gullible agent. (b) Alpha-vectors for the friend reward function and sensor accuracy 0.85, for the IPOMDP (without communication, in red) and for the CIPOMDP (with communication, in blue). Each vector is labelled with a 3-step policy; for example, one such label means that the agent starts by listening (the list of observations is empty), listens again no matter what it hears, and in the third step performs OL if the previous two observations were GR and GR, but listens again otherwise. Communicative policies additionally include the contents of messages received simultaneously with the growls.

'''Step 3.''' In the final step, the belief points in the set B_{i,l} are used again to select the alpha-vectors for the final set of the current iteration. Since different action-message pairs can be optimal for the modeled agent, we need to include alpha-vectors corresponding to all optimal action-message pairs in the final alpha-set:

\nu^t \leftarrow \arg\max_{\alpha^t \in \cup_{a_i} \Gamma^{a_i, m_{i,s}}} (\alpha^t \cdot b_{i,l}), \quad \forall b_{i,l} \in B_{i,l} \qquad (14)

The recursion bottoms out as a POMDP at level 0, which we assume to be a literal speaker. Since a POMDP policy only computes physical actions, we need to augment the policy with a sincere message.
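The point-based pruning used in Steps 2 and 3 keeps, for each belief point, only the alpha-vector with the highest dot product. The helper below is a generic PBVI-style sketch of that selection, not the full IPBVI-Comm backup (which also enumerates messages and the opponent's models).

<syntaxhighlight lang="python">
import numpy as np

def best_at_points(alpha_vectors, belief_points):
    """Keep only the alpha-vectors that are optimal at at least one belief point."""
    kept = {}
    for b in belief_points:
        values = [float(np.dot(alpha, b)) for alpha in alpha_vectors]
        idx = int(np.argmax(values))
        kept[idx] = alpha_vectors[idx]      # index-keyed to avoid duplicates
    return list(kept.values())

# Toy example with two interactive states and three candidate vectors.
alphas = [np.array([10.0, -100.0]), np.array([-100.0, 10.0]), np.array([-1.0, -1.0])]
B = [np.array([0.95, 0.05]), np.array([0.5, 0.5]), np.array([0.05, 0.95])]
print(len(best_at_points(alphas, B)))       # 3: each vector is optimal somewhere
</syntaxhighlight>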
===5 Experiments and results===
We report results on the multi-agent tiger game, which elegantly models complex interaction with the environment and other agents in sequential decision-making scenarios under uncertainty. The game provides intuitive scenarios for the study of sincerity and deception because the agents can be incentivized to influence the other agent to open the right or the wrong door, while not ignoring uncertainty in the environment.

====5.1 Multi-agent tiger game====
In this version, two agents face two doors: "left" and "right". Behind one door lies a hungry tiger and behind the other is a pot of gold, but the agents do not know the position of either. Thus, the set of states is S = {TL, TR}, indicating the tiger's presence behind the left or the right door. Each agent can open either door. Agents can also independently listen for the presence of the tiger, so the actions are A = {OR, OL, L} for opening the right door, opening the left door, and listening; the action set is the same for both agents. The transition function T specifies that every time either agent opens one of the doors, the state is reset to TR or TL with equal probability, regardless of the action of the other agent. However, if both agents listen, the state remains unchanged. After every action, each agent can hear the tiger's growl coming either from the left, GL, or from the right door, GR. The observation function O (identical for both agents) specifies the accuracy of observations. We assume that the tiger's growls are informative, with a predefined sensor accuracy, only if the agent listens. If the agent opens a door, the growls have an equal chance of coming from the left or the right door and are thus completely uninformative.

Tables 1-4 (shown together below) give the reward functions used in the experiments; each entry is the reward to agent i for the joint action ⟨a_i, a_j⟩ when the tiger is behind the left (TL) or right (TR) door.

{| class="wikitable"
|+ Tables 1-4: Reward functions (Neutral, Friend, Enemy A, Enemy B)
! rowspan="2" | ⟨a_i, a_j⟩ !! colspan="2" | Table 1: Neutral !! colspan="2" | Table 2: Friend !! colspan="2" | Table 3: Enemy A !! colspan="2" | Table 4: Enemy B
|-
! TL !! TR !! TL !! TR !! TL !! TR !! TL !! TR
|-
| OR, L || 10 || -100 || 9.5 || -100.5 || 10.5 || -99.5 || 10 || -100
|-
| OL, L || -100 || 10 || -100.5 || 9.5 || -99.5 || 10.5 || -100 || 10
|-
| L, L || -1 || -1 || -1.5 || -1.5 || -0.5 || -0.5 || -1 || -1
|-
| OR, OL || 10 || -100 || -40 || -95 || 60 || -105 || 10 || -150
|-
| OL, OL || -100 || 10 || -150 || 15 || -50 || 5 || -100 || -40
|-
| L, OL || -1 || -1 || -51 || 4 || 49 || -6 || -1 || -51
|-
| OR, OR || 10 || -100 || 15 || -150 || 5 || -50 || -40 || -100
|-
| OL, OR || -100 || 10 || -95 || -40 || -105 || 60 || -150 || 10
|-
| L, OR || -1 || -1 || 4 || -51 || -6 || 49 || -51 || -1
|}
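For reference, a compact encoding of the dynamics just described and of the neutral reward of Table 1. This is our own sketch of the standard multi-agent tiger dynamics (with the sensor accuracy as a parameter), not code from the paper; it treats growls as informative whenever agent i itself listens.

<syntaxhighlight lang="python">
import random

STATES = ["TL", "TR"]
ACTIONS = ["OR", "OL", "L"]

def transition(s, a_i, a_j):
    """T: if either agent opens a door the tiger is reset uniformly, otherwise unchanged."""
    if a_i != "L" or a_j != "L":
        return random.choice(STATES)
    return s

def growl_probability(s_next, a_i, o_i, accuracy=0.85):
    """O_i: growls carry information only when agent i listens; otherwise uniform."""
    if a_i != "L":
        return 0.5
    correct = {"TL": "GL", "TR": "GR"}[s_next]
    return accuracy if o_i == correct else 1.0 - accuracy

def reward_neutral(s, a_i):
    """Table 1: agent i's reward does not depend on a_j."""
    if a_i == "L":
        return -1
    opened_gold_door = (a_i == "OR" and s == "TL") or (a_i == "OL" and s == "TR")
    return 10 if opened_gold_door else -100
</syntaxhighlight>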
{| class="wikitable"
|+ Table 5: Reward comparison for the CIPOMDP agent against the IPOMDP agent in different scenarios. For the Enemy, the reward function in Table 3 is used.
! rowspan="2" | Nesting level !! rowspan="2" | Agent !! rowspan="2" | Opponent !! colspan="2" | Reward, h=3 !! colspan="2" | Reward, h=4 !! colspan="2" | Reward, h=5
|-
! CIPOMDP !! IPOMDP !! CIPOMDP !! IPOMDP !! CIPOMDP !! IPOMDP
|-
| rowspan="3" | 1 || Neutral || Sincere and Gullible || 3.4 ± 8.92 || 2.84 ± 16.02 || 3.9 ± 8.29 || 2.39 ± 7.95 || 3.5 ± 8.037 || 1.067 ± 20.275
|-
| Enemy || Sincere and Gullible || 46 ± 24.69 || 1.53 ± 18.07 || 66.48 ± 42.15 || 1.07 ± 8.887 || 86.00 ± 36.57 || -1.31 ± 24.79
|-
| Friend || Sincere and Gullible || 5.08 ± 10.91 || 4.15 ± 18.02 || 6.05 ± 9.42 || 3.56 ± 8.86 || 5.71 ± 17.10 || -0.81 ± 23.32
|-
| rowspan="4" | 2 || Neutral || Enemy || 3.39 ± 8.08 || 2.81 ± 16.02 || 3.9 ± 11.40 || 2.39 ± 7.67 || 3.4942 ± 8.94 || 1.55 ± 17.14
|-
| Friend || Friend || 5.08 ± 10.5 || 4.14 ± 18.02 || 6.21 ± 8.56 || 3.67 ± 8.97 || 5.02 ± 10.099 || 3.65 ± 17.94
|-
| Enemy || Enemy || 5.44 ± 10.83 || 1.53 ± 18.07 || 8.99 ± 18.88 || 2.32 ± 15.72 || 10.78 ± 18.09 || 0.5 ± 19.81
|-
| Neutral || Uncertain (Enemy or Friend) || 3.43 ± 7.80 || 1.53 ± 18.07 || 4.19 ± 9.162 || 2.44 ± 7.87 || 3.45 ± 8.33 || 0.82 ± 13.71
|}

=====5.1.1 Reward functions=====
The reward functions are chosen to simulate cooperative and competitive scenarios. Table 1 represents the scenario in which the reward of the agent is independent of the action of the other agent. Table 2 is the friend reward function, where the agent gets half of the reward obtained by the other agent in addition to its own reward. In Table 3, the agent gets half of the negative of the reward obtained by the other agent, which represents the competitive case. In the other competitive reward function, Table 4, the agent gets a -50 reward if the other agent opens the correct door, but there is no extra reward if the other agent opens the wrong door. Table 3 incentivizes the extreme lie, while Table 4 incentivizes a more believable lie.

=====5.1.2 Experimental setup=====
Table 5 shows the total reward collected in the multi-agent tiger game, averaged across 10000 episodes, for a sensor accuracy of 0.85. The message space M is limited to distributions over the physical states only and has been quantized into 5 equally spaced belief points, (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0), plus nil. The value of α is fixed to 0.01. The results show that the CIPOMDP agent outperforms the IPOMDP agent in terms of the average reward collected, due to the sophistication of message exchange. The difference is more prominent when the agent is able to deceive the other agent. The behavior of the agent across multiple scenarios is discussed below.

====5.2 Cooperative Scenarios====
When a level-2 friend i models a friend j, agent i sends a sincere message to j reflecting its belief and further includes the message from the other agent in its belief update. For example, after starting from a uniform belief, if agent i hears GL, it will send the sincere message m_{i,s} = (0.75, 0.25), which assigns probability 0.75 to TL and 0.25 to TR. Figure 2(b) shows the alpha-vector plot corresponding to communicative policies for a CIPOMDP agent with the friend reward function modeling a CIPOMDP (also with a friend reward). The resulting policy tree when the agents are uncertain about the initial location of the tiger is shown in Figure 2(a).

====5.3 Non-cooperative Scenarios====
When a level-1 CIPOMDP agent i models a gullible agent j, it can collect a large reward by sending a deceitful message. For example, after starting from a uniform belief and getting GL, the agent sends the message m_{i,s} = (0, 1), which indicates that agent i is certain the tiger is on the right, opposite to its own observation. When a level-2 enemy i models an enemy j, the sophisticated agent i can prevent itself from being deceived by j and further take advantage of the deceitful message as an extra observation. Also, since the level-2 agent knows that the level-1 CIPOMDP models the other as a sincere agent, the former tends to deceive the latter by sending a deceitful message. When a level-2 neutral agent models an enemy, it has no incentive to deceive but can prevent itself from being deceived by ignoring the message from the other agent. When the nesting level is increased to level 3, the agent tends to send a sincere message with the intention of deceiving the level-2 agent, which is expecting a deceitful message. When the agents are indifferent to each other's actions, i.e., both have the neutral reward function, then communication does not provide any added value and the reward collected is the same as for the IPOMDP agent.

====5.4 Uncertainty about the opponent====
=====5.4.1 Message non-revealing of the agent's type=====
Let us consider a two-agent interaction in the multi-agent tiger game where both agents i and j start with no information about the location of the tiger. The level-2 CIPOMDP agent i is uncertain about the level-1 opponent j's type and thus assigns uniform probability to the other agent being a friend, an enemy, or a random agent. (It could also be the case that the level-2 CIPOMDP is uncertain about the strategic level of the opponent and might want to model a POMDP as well, but to simplify the illustration we stick to these three types of models.) The friend is characterized by the friend reward function (R_f), which incentivizes sincere communication, while the enemy is characterized by the enemy reward function (R_e), which incentivizes deceptive behavior. Also, all other elements in the frames of the CIPOMDP agents at all levels are assumed to be the same.

{| class="wikitable"
|+ Table 6: Different scenarios of belief update for the level-2 CIPOMDP agent i modeling an enemy, a friend, and a random agent (the initial belief is uniform over all the models and the location of the tiger). In all scenarios, agent i sends a nil message and executes the listen action in each time-step. The belief updates are shown in Figure 3.
! Scenario !! Time-step !! Observation !! Message received !! Remarks
|-
| rowspan="2" | 1 || 1 || GL || nil || rowspan="2" | The received message contradicts the agent's consecutive observations of GLs. The other agent is more likely to be an enemy than a friend.
|-
| 2 || GL || (0.25, 0.75)
|-
| rowspan="2" | 2 || 1 || GL || nil || rowspan="2" | The received message agrees with the agent's consecutive observations of GLs. The other agent is more likely to be a friend than an enemy.
|-
| 2 || GL || (0.75, 0.25)
|-
| rowspan="2" | 3 || 1 || GR || nil || rowspan="2" | The agent gets contradictory growls; there is an equal chance that the other agent is an enemy (and the tiger is on the left) or a friend (and the tiger is on the right).
|-
| 2 || GL || (0.25, 0.75)
|-
| rowspan="2" | 4 || 1 || GR || nil || rowspan="2" | The agent gets contradictory growls; there is an equal chance that the other agent is a friend (and the tiger is on the left) or an enemy (and the tiger is on the right).
|-
| 2 || GL || (0.75, 0.25)
|-
| rowspan="2" | 5 || 1 || GL || nil || rowspan="2" | The agent gets a message that is in the OPT set of neither the enemy nor the friend. Hence, it is certain that the other agent is a random agent.
|-
| 2 || GL || nil
|}
'''Behavior of the level-0 agent.''' The POMDP agent at level 0, which starts with a uniform distribution over the physical states of the world TL and TR, would open a door if it received two consecutive growls indicating the same location of the tiger and a message not indicative of the opposing growl. Further, the agent would open a door at any time-step, regardless of its observation, if it got an extreme message indicating the location of the tiger with certainty. Also, being a literal speaker, the agent sends out its true belief at every time-step with probability 1 − α.

'''Behavior of the level-1 agent.''' The less sophisticated level-1 agent θ_{j,1} models the sincere and gullible POMDP agent θ_{i,0} and the random agent SM_i. Let us consider the behavior of the level-1 agent j, which has predicted the behavior of θ_{i,0} for every possible message it can send. The agent θ^e_{j,1} with the enemy reward function (Table 4) has an incentive to deceive the opponent, because every time i opens the correct door, j gets a -50 reward. Suppose the agent θ^e_{j,1} listens at time step t = 0, to be certain of the tiger's location, and then sends a message at time step t = 1 indicating a growl from the location opposite to what it heard. If θ^e_{j,1} got GL, it would send the message (0.25, 0.75). On the other hand, if the level-1 agent had the friend reward function (Table 2), it would get a share of i's reward. Let us call this agent θ^f_{j,1}. θ^f_{j,1} would want the other agent i to open the correct door, and hence it sends the sincere message, indicative of its belief updated based on its observation. For an observation of GR, it would send the message (0.25, 0.75) at t = 1.

'''Belief update for the level-2 agent.''' The level-2 CIPOMDP i uses IPBVI-Comm to solve for the anticipated behavior of the modeled agents θ^f_{j,1} and θ^e_{j,1}. Now let us see how the belief update proceeds for i. Note that the optimal messages from the level-1 agents are not indicative of the agent type. For example, if the enemy received GL, it would send the message (0.25, 0.75), and if the friend received GR, it would again send the message (0.25, 0.75). We study the scenario in which the message itself is indicative of the agent type in the next section. It turns out that agent i is still able to assign a higher probability to one agent type than to the other based on the observations it received. For example, if the observations and messages received by i in two time-steps are ⟨GL, nil⟩ and ⟨GL, (0.25, 0.75)⟩, as shown in Table 6, scenario 1, the other agent is more likely to be an enemy than a friend because i's observations contradict the received message. The belief updates for the other scenarios are shown in Figure 3. The description of each scenario is provided in Table 6.

Figure 3: Belief updates for the specific scenarios for the level-2 CIPOMDP agent i modeling an enemy, a friend, and a random agent. Each row represents a scenario from Table 6 and each column represents a time-step. In each scenario, agent i has a uniform initial belief over the state and the frame of the other agent j. Each belief update is due to the physical action, message sent, observation, and message received, which can be referenced from Table 6.
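The qualitative effect in scenarios 1 and 2 of Table 6 can be reproduced with a small Bayesian calculation over the opponent's type. The sketch below is our own simplified illustration, not the IPBVI-Comm computation: it assumes both agents listen with 0.85 accuracy, that the friend reports its one-growl belief, that the enemy reports the reverse of it, and that the random agent picks any of the six messages uniformly; the uninformative nil message of the first time-step is omitted.

<syntaxhighlight lang="python">
from itertools import product

STATES = ["TL", "TR"]
MESSAGES = ["nil", (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
ACC = 0.85                      # assumed sensor accuracy for both agents

def p_obs(o, s):
    return ACC if (o, s) in (("GL", "TL"), ("GR", "TR")) else 1.0 - ACC

def p_message(m, typ, s):
    """Assumed likelihood of the step-2 message, marginalizing over j's own growl."""
    if typ == "random":
        return 1.0 / len(MESSAGES)
    total = 0.0
    for o_j in ("GL", "GR"):
        honest = (0.75, 0.25) if o_j == "GL" else (0.25, 0.75)
        sent = honest if typ == "friend" else (honest[1], honest[0])   # enemy reverses
        total += p_obs(o_j, s) * (1.0 if m == sent else 0.0)
    return total

def type_posterior(observations, message):
    """P(type | i's growls, received message) under a uniform prior over type and state."""
    joint = {}
    for typ, s in product(("friend", "enemy", "random"), STATES):
        lik = 1.0
        for o in observations:
            lik *= p_obs(o, s)
        joint[(typ, s)] = (1.0 / 6.0) * lik * p_message(message, typ, s)
    z = sum(joint.values())
    post = {}
    for (typ, s), p in joint.items():
        post[typ] = post.get(typ, 0.0) + p / z
    return post

# Scenario 1: i hears GL twice and receives (0.25, 0.75) -> enemy becomes most likely.
print(type_posterior(["GL", "GL"], (0.25, 0.75)))
# Scenario 2: i hears GL twice and receives (0.75, 0.25) -> friend becomes most likely.
print(type_posterior(["GL", "GL"], (0.75, 0.25)))
</syntaxhighlight>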
=====5.4.2 Message revealing the agent's type=====
Again we consider the scenario in which there is uncertainty about the opponent type, i.e., when a level-2 neutral agent i models both an enemy and a friend. The only difference now is that the agent modeled at level 1, θ_{j,1}, has the reward function that incentivizes the extreme lie (Table 4). In this case, the higher-level agent θ_{i,2} can figure out whether the other agent j is a friend or an enemy based on the message content alone. The CIPOMDP agent θ_{i,2} incorporates the message from j as a sincere message if it is close to the belief of the modeled agent θ_{j,1}, and discards the message as insincere if the incoming message is far from the belief of the modeled agent. For example, if the belief of the modeled agent is (0.85, 0.15), then (0.75, 0.25) is considered a sincere message, while (1, 0) is considered an insincere message. Suppose θ_{i,2} starts by listening at t = 0. After receiving GL and the message nil, the belief shifts towards the physical state TL, but the belief remains equally distributed among both frames. At t = 1, after receiving the message (0.75, 0.25) and the observation GL, the belief concentrates on the tiger being on the left and the other agent being a friend. The agent is able to tell the friend from the enemy through the calculation of the sincerity term P(m_{i,r} | θ_j): when θ_j = (b_j, R_f), P(m_{i,r} | θ_j) = 1, and when θ_j = (b_j, R_e), P(m_{i,r} | θ_j) = 0. This happens because, for the level-1 CIPOMDP with the enemy reward function, the optimal action is to lie to the extreme, which would make θ_{i,0} open the door intended by the deceiver θ_{j,1}, disregarding its own observation. Such a message reveals j as an enemy to the higher-level agent i.

====5.5 Algorithm Performance====
Since the point-based algorithm only backs up alpha-vectors optimal at a fixed set of belief points, the performance depends on the number of belief points chosen. Figure 4 shows the comparison of the expected reward for different numbers of belief points.

Figure 4: The performance profile of IPBVI-Comm for the level-2 (left) and level-1 (right) CIPOMDP agent. The increase in expected reward is due to the improvement in policy after increasing the number of belief points. The figure also shows the comparison with exact value iteration.

===6 Conclusion===
We started by devising a technique to incorporate messages into the POMDP belief update in order to allow an agent that does not model other agents to take part in the exchange of (literal) messages. We then formalized the notions of sincerity and deception in terms of the belief of the agent and the messages in the message space. We adapted a point-based solution method to CIPOMDPs to alleviate the complexities of considering communication as well as observations and physical actions. The analysis of the computed policies shows that the added sophistication of communication results in policies of superior quality, which is further supported by the empirical results of several experiments conducted on the multi-agent tiger game. More importantly, we showed how higher levels of theory of mind may allow an agent to disambiguate a sincere friend from a deceitful foe based on the received message and the observations from the environment. In future work, we want to explore higher depths of nesting and a more relaxed soft-maximization criterion for action selection, which can give rise to richer rational communicative behavior that agents can engage in. We are also considering online planning methods like Monte Carlo tree search for computing the policy, which provide scalability and, with some variation, could accommodate a continuous message space.

===References===
[AASN+20] Md Ali Reza Al Amin, Sachin Shetty, Laurent L. Njilla, Deepak K. Tosh, and Charles A. Kamhoua. Dynamic Cyber Deception Using Partially Observable Monte-Carlo Planning Framework, chapter 14, pages 331-355. John Wiley & Sons, Ltd, 2020.

[AdR18] Saul Albert and J. P. de Ruiter. Repair: The interface between interaction and cognition. Topics in Cognitive Science, 10(2):279-313, 2018.

[BAG+05] P. Beautement, David H. Allsopp, M. Greaves, Steve Goldsmith, S. Spires, S. Thompson, and H. Janicke. Autonomous agents and multi-agent systems (AAMAS) for the military - issues and challenges. In DAMAS, 2005.

[BI11] Will Bridewell and Alistair Isaac. Recognizing deception: A model of dynamic belief attribution. In AAAI Fall Symposium: Advances in Cognitive Systems, 2011.

[BTK08] Cynthia Breazeal, Atsuo Takanishi, and Tetsunori Kobayashi. Social Robots that Interact with People, pages 1349-1369. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[CT16] Crystal Chao and Andrea Thomaz.
Timed petri nets for fluent turn-taking over multimodal in- teraction resources in human-robot collaboration. The International Journal of Robotics Research, 35(11):1330–1353, 2016. [DA16] S. Devin and R. Alami. An implemented theory of mind to improve human-robot shared plans execution. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 319–326, 2016. [DCA17] Sandra Devin, Aurélie Clodic, and Rachid Alami. About decisions during human-robot shared plan achievement: Who should act and how? In ICSR, 2017. [DG05] Prashant Doshi and Piotr Gmytrasiewicz. Approximating state estimation in multiagent settings using particle filters. In Proceeding of AAMAS 2005, 2005. [DP08] Prashant Doshi and Dennis Perez. Generalized point based value iteration for interactive pomdps. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 1, AAAI’08, page 63–68. AAAI Press, 2008. [DWW+ 15] Xiao Pan Ding, Henry M. Wellman, Yu Wang, Genyue Fu, and Kang Lee. Theory-of-mind training causes honest young children to lie. Psychological Science, 26(11):1812–1821, 2015. PMID: 26431737. [EG19] Piotr Evdokimov and Umberto Garfagnini. Communication and behavior in organizations: An experiment. Quantitative Economics, 10(2):775–801, 2019. [EJ10] David Ettinger and Philippe Jehiel. A theory of deception. American Economic Journal: Microe- conomics, 2(1):1–20, February 2010. [FAdFW16] Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to com- municate to solve riddles with deep distributed recurrent q-networks, 2016. [FSH+ 19] Jakob N. Foerster, H. Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew M Botvinick, and Michael H. Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. ArXiv, abs/1811.01458, 2019. [GA14] Matthias Gamer and Wolfgang Ambach. Deception research today. Frontiers in Psychology, 5:256, 2014. [GD00] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational coordination in multi-agent environments. Autonomous Agents and Multiagent Systems Journal, 3(4):319–350, 2000. [GD05] Piotr Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multiagent set- tings. Journal of Artificial Intelligence Research, 24:49–79, 2005. http://jair.org/contents/v24.html. 13 [GL20] Andrea L Guzman and Seth C Lewis. Artificial intelligence and communication: A human–machine communication research agenda. New Media & Society, 22(1):70–86, 2020. [Gmy20] Piotr J. Gmytrasiewicz. How to do things with words: A bayesian approach. J. Artif. Intell. Res., 68:753–776, 2020. [Gne05] U. Gneezy. Deception: The role of consequences. The American Economic Review, 95:384–394, 2005. [GYPC20] J. George, C. T. Yilmaz, A. Parayil, and A. Chakrabortty. A model-free approach to distributed transmit beamforming. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5170–5174, 2020. [IB17] Alistair Isaac and Will Bridewell. White lies on silver tongues: Why robots need to deceive (and how). pages 157–172, 10 2017. [KGB17] A. Kurakin, Ian J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. ArXiv, abs/1607.02533, 2017. [MHC03] Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5 – 34, 2003. Planning with Uncertainty and Incomplete Information. [NMT13] A. Nayyar, A. Mahajan, and D. Teneketzis. 
Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644– 1658, 2013. [NPY+ 04] Ranjit Nair, David Pynadath, Makoto Yokoo, Milind Tambe, and Stacy Marsella. Communica- tion for improving policy computation in distributed pomdps. In Proceedings of the Agents and Autonomous Multiagent Systems (AAMAS), 2004. [NRYT03] Ranjit Nair, Maayan Roth, Makoto Yokoo, and Milind Tambe. Taming decentralized pomdps: Towards efficient policy computation for multiagent settings. In Proceedings of the Eighteenth In- ternational Joint Conference on Artificial Intelligence, 2003. [OSV07] Frans Olihoek, Mathijs Spaan, and Nikos Vlassis. Dec-pomdps with delayed communication. In Proceedings of MSDM 2007 May 15, 2007, Honolulu, Hawai’i, USA, 2007. [OSV19] Lauren A Oey, Adena Schachner, and Edward Vul. Designing good deception: Recursive theory of mind in lying and lie detection, May 2019. [PGT03] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime al- gorithm for pomdps. In Proceedings of the 18th International Joint Conference on Artificial Intel- ligence, IJCAI’03, page 1025–1030, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc. [RMG15] J. Renoux, A. Mouaddib, and S. L. Gloannec. A decision-theoretic planning approach for multi- robot exploration and event search. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5287–5293, 2015. [RO98] Nancy K. Ratner and Rose R. Olver. Reading a tale of deception, learning a theory of mind? Early Childhood Research Quarterly, 13(2):219 – 239, 1998. [SDS20] Aditya Shinde, Prashant Doshi, and Omid Setayeshfar. Active deception using factored interactive pomdps to recognize cyber attacker’s intent, 2020. [Son78] Edward J. Sondik. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Oper. Res., 26(2):282–304, April 1978. [SPB+ 19] S. Sarkadi, Alison R. Panisson, Rafael Heitor Bordini, P. McBurney, S. Parsons, and M. Chapman. Modelling deception using theory of mind in multi-agent systems. AI Commun., 32:287–302, 2019. 14 [SsF16] Sainbayar Sukhbaatar, arthur szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2244–2252. Curran Associates, Inc., 2016. [SSW+ 17] K. Shu, A. Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. SIGKDD Explor., 19:22–36, 2017. [STHP91] Beate Sodian, Catherine Taylor, Paul L. Harris, and Josef Perner. Early deception and the child’s theory of mind: False trails and genuine markers. Child Development, 62(3):468–483, 1991. [STX+ 18] Aaron Schlenker, Omkar Thakoor, Haifeng Xu, Fei Fang, Milind Tambe, Long Tran-Thanh, Phebe Vayanos, and Yevgeniy Vorobeychik. Deceiving cyber adversaries: A game theoretic approach. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, page 892–900, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems. [ULS20] Vaibhav V. Unhelkar, Shen Li, and Julie A. Shah. Decision-making for bidirectional communication in sequential human-robot collaborative tasks. In Proceedings of the 2020 ACM/IEEE Interna- tional Conference on Human-Robot Interaction, HRI ’20, page 329–341, New York, NY, USA, 2020. 
Association for Computing Machinery. [WPH16] Ning Wang, David V. Pynadath, and Susan G. Hill. Trust calibration within a human-robot team: Comparing automatically generated explanations. In The Eleventh ACM/IEEE International Con- ference on Human Robot Interaction, HRI ’16, page 109–116. IEEE Press, 2016. [YFS+ 20] Luyao Yuan, Zipeng Fu, Jingyue Shen, Lu Xu, Junhong Shen, and Song-Chun Zhu. Emergence of Pragmatics from Referential Game between Theory of Mind Agents. arXiv e-prints, page arXiv:2001.07752, Jan 2020. [ZVlY04] Yu Zhang, Richard A. Volz, Thomas R. loerger, and John Yen. A decision-theoretic approach for designing proactive communication in multi-agent teamwork. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC ’04, page 64–71, New York, NY, USA, 2004. Association for Computing Machinery. 15