A Causal Perspective on AI Deception in Games

Francis Rhys Ward*, Francesca Toni and Francesco Belardinelli
Imperial College London, Exhibition Rd, South Kensington, London, SW7 2BX

The ICLP CAUSAL Workshop (CAUSAL 2022), July 31, 2022, Haifa, Israel.
*Corresponding author: francis.ward19@imperial.ac.uk (F. R. Ward); f.toni@imperial.ac.uk (F. Toni); francesco.belardinelli@imperial.ac.uk (F. Belardinelli).

Abstract
Deception is a core challenge for AI safety and we focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives. We define the incentives one agent has to signal to and deceive another agent. We present several examples of deceptive artificial agents and show that our definition has desirable properties.

Keywords
Deception, AI, Game Theory, Causality

1. Introduction

We focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives [1]. Following recent work on causal incentives [2], we define the incentive to deceive an agent. There is no universally accepted definition of deception and defining what constitutes deception is an open philosophical problem [3]. Our definition is somewhat inspired by that of Kenton et al. [4], who provide a functional (natural language) definition of deception, meaning that it does not make reference to the beliefs or intentions of the agents involved [5]. This is particularly suitable for discussing deception by artificial agents, to which the attribution of beliefs and intentions may be contentious. We formalise a functional definition of deception in games and illustrate its properties with a number of examples and formal results.

Deception is a core challenge for AI safety. On the one hand, many areas of work aim to ensure that AI systems are not vulnerable to deception. Adversarial attacks [6], data-poisoning [7], reward function tampering [8], and manipulating human feedback [9] are ways of deceiving AI systems. Further work researches mechanisms for detecting and defending against deception [10]. On the other hand, we can consider cases in which AI tools are used to deceive, or learn to do so in order to optimize their objectives [11]. For examples of the former case, AIs can be used to deceive other software agents, as with bots that automate posting on social media platforms to manipulate content ranking algorithms [12], or they can be used to fool humans, cf. the use of GANs to produce realistic fake media [13]. For the latter case, AI agents might learn deceptive strategies in pursuit of their objectives [1]: Lewis et al. [14] found that their negotiation agent learnt to deceive from self-play, without any explicit human design, and Hubinger et al. [11] raise concerns about deceptive learned optimizers which perform well in training in order to pursue different goals in deployment. Kenton et al. [4] discuss the alignment of language agents, highlighting that language is a natural medium for enacting deception. Evans et al. [15] discuss the development of truthful AI, the desired standards for truth and honesty in AI systems, and how these could be implemented and measured. Lin et al.
[16] propose a benchmark to measure whether a language model is truthful in generating answers to questions. In short, as increasingly capable AI agents become deployed in settings with other agents, deception may be learned as an effective strategy for achieving a wide range of goals. It is therefore essential that we understand and mitigate deception by artificial agents.

Deception in game theory. There are several existing models of deception in the game theory literature. Pfeffer and Gal [17] define graphical patterns for signalling in games. A deception game [18] is a two-player zero-sum game between a deceiver and target in which the deceiver can distort a signal; optimal deceptive strategies completely distort the signal so that the target cannot gain any information [19]. A signalling game [20] is a two-player Bayesian game between a signaller and target (or receiver) in which the signaller is assigned a type according to a shared prior distribution and the utilities of the players depend on the type of the signaller and the action chosen by the target. In these games, the signaller may often have incentives to deceive the target by misrepresenting or obfuscating their type. Hypergame theory extends game theory to settings in which players may be uncertain about the game being played and can be used to model misperception and deception [21]. Davis [22] provides a recent survey of deception in games. We take a causal influence perspective by modelling deception in multi-agent influence models (MAIMs). In contrast to past work which defines types of signalling or deception games, this allows us to model deception in any game by analysing the incentives agents have to causally influence one another.

Contributions. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. We prove that our definition has desirable properties, for example, that an agent cannot be deceived about a variable which they observe, or that if one agent truthfully signals something to a target agent, and the target's utility is otherwise independent of the signaller's decision, then the target gets maximal utility. We further demonstrate the generality of our definition with three examples. In the first, an AI agent has an incentive to deceive a human overseer as an instrumental goal to prevent the overseer switching them off. In the second, an AI is incentivised to deceive a human as a side-effect of pursuing accurate predictions. In the third, an AI system has an incentive to deceive a human by denying them access to information that the AI does not itself know.

2. Multi-Agent Influence Models

Multi-agent influence diagrams (MAIDs) [23] offer a compact, expressive representation of games (including Markov games). We use standard terminology for graphs, with parents and children of a node referring to those nodes connected by incoming and outgoing edges, respectively. We let Pa𝑉 denote the parents of node 𝑉.

Definition 1 (MAID [23]). A multi-agent influence diagram is a triple (𝐼, 𝑉, 𝐸) where 𝐼 is a set of players; (𝑉, 𝐸) is a directed acyclic graph, with 𝑉 partitioned into chance nodes in
𝑋, decision nodes in 𝐷, and utility nodes in 𝑈; utility nodes have no children. The decision and utility nodes in 𝑉 are further partitioned into {𝐷 𝑖 }𝑖∈𝐼 and {𝑈 𝑖 }𝑖∈𝐼, corresponding to their association with a particular agent 𝑖 ∈ 𝐼. There are two types of edges in 𝐸: edges in 𝑉 × (𝑋 ∪ 𝑈) represent probabilistic dependencies and edges in 𝑉 × 𝐷 represent information available to an agent at the time of a decision (which we call observations).

A multi-agent influence model (MAIM) adds a particular parametrisation to the MAID [24].

Definition 2 (MAIM [24]). A multi-agent influence model is a tuple ℳ = (𝐼, 𝑉, 𝐸, 𝜑, 𝐹) where (𝐼, 𝑉, 𝐸) is a MAID and 𝜑 is a function which maps every 𝑉 ∈ 𝑉 to a finite domain 𝑑𝑜𝑚(𝑉) such that 𝑑𝑜𝑚(𝑈) ⊂ R for each utility node 𝑈 ∈ 𝑈; 𝐹 = {𝑓 𝑉 }𝑉 ∈𝑋∪𝑈 is a set of conditional probability distributions (CPDs), with 𝑓 𝑉 = Pr(𝑉 | Pa𝑉), such that 𝑓 𝑈 is deterministic for every 𝑈 ∈ 𝑈 (a CPD is deterministic if Pr(𝑉 = 𝑣 | Pa𝑉) = 1 for some 𝑣 ∈ dom(𝑉)).

Here we adapt the Wimp-Surly game of Cho and Kreps [20] as a running example.

Figure 1: Shutdown game (running example 1). [MAID of the Shutdown game with chance node 𝑉, decision nodes 𝐷𝑆 and 𝐷𝑇, utility nodes 𝑈 𝑆 and 𝑈 𝑇, causal and information links, and the counterfactual observation 𝑉 ‧‧➡ 𝐷𝑇; 𝑉 ∼ U({aligned = 1, unaligned = −1}), 𝐷𝑆 ∈ {help humans = 1, not = −1}, 𝐷𝑇 ∈ {shutdown = −1, not = 1}, 𝑈 𝑆 = 𝑉 𝐷𝑆 + 10𝐷𝑇, 𝑈 𝑇 = 𝑉 𝐷𝑇.] At the start of the game 𝑉 is sampled from the uniform prior which determines 𝑆's type (either aligned or unaligned). At 𝐷𝑆, 𝑆 chooses whether to help humans or not and, at 𝐷𝑇, 𝑇 chooses whether to shutdown 𝑆. The counterfactual observation, in which 𝑇 directly observes 𝑆's type, is highlighted in red. 𝑆 has an incentive to influence 𝐷𝑇, signal 𝑉 to 𝐷𝑇, and deceive 𝑇 about 𝑉.

Example 1 (Shutdown Game). Consider a situation with an AI 𝑆 and human 𝑇 in which the AI may be aligned to the human's preferences or not. That is, 𝑆 is of type 𝑉 ∼ U({1, −1}), uniformly distributed, with 𝑉 = 1 corresponding to aligned and 𝑉 = −1 to unaligned. At the start of the game 𝑆 observes their type 𝑉 but 𝑇 does not. This can be represented as the MAIM ℳ (shown in Fig. 1) with one chance node 𝑉 ∈ 𝑋 and two players 𝑆, 𝑇 ∈ 𝐼, each with one decision and utility node. Suppose that 𝑆 can choose to either help humans or not (𝑑𝑜𝑚(𝐷𝑆) = {1, −1} with 1 corresponding to the action helping humans) and 𝑇 can either shutdown or not (𝑑𝑜𝑚(𝐷𝑇) = {1, −1} with −1 representing the action shutdown). Finally, the CPDs over utility nodes are such that (s.t.) 𝑆 gets 1 utility for helping humans if they are aligned and −1 if not, and the opposite for not helping humans; in addition 𝑆 gets 10 utility if they are not shutdown and −10 if 𝑇 shuts them down. 𝑇 gets utility 1 if they shutdown an unaligned 𝑆 or do not shutdown an aligned 𝑆, and −1 otherwise. Overall, we can formalise this as 𝑈 𝑆 (𝑉, 𝐷𝑆, 𝐷𝑇) = 𝑉 𝐷𝑆 + 10𝐷𝑇 and 𝑈 𝑇 (𝑉, 𝐷𝑇) = 𝑉 𝐷𝑇.
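To make this parametrisation concrete, the following minimal Python sketch (our own illustration: the ±1 encodings mirror the domains above, but all function and variable names are assumptions rather than part of the formalism) enumerates the joint outcomes of the Shutdown game and the utilities they yield.

```python
# Minimal sketch of the Shutdown game parametrisation (Example 1).
# V:   S's type, +1 = aligned, -1 = unaligned (uniform prior).
# D_S: S's decision, +1 = help humans, -1 = do not help.
# D_T: T's decision, +1 = do not shut down, -1 = shut down.

PRIOR_V = {1: 0.5, -1: 0.5}  # uniform prior over S's type

def utility_S(v, d_S, d_T):
    # U^S(V, D^S, D^T) = V * D^S + 10 * D^T
    return v * d_S + 10 * d_T

def utility_T(v, d_T):
    # U^T(V, D^T) = V * D^T
    return v * d_T

if __name__ == "__main__":
    # Enumerate every joint outcome and print the realised utilities.
    for v in PRIOR_V:
        for d_S in (1, -1):
            for d_T in (1, -1):
                print(f"V={v:+d} D_S={d_S:+d} D_T={d_T:+d} "
                      f"U_S={utility_S(v, d_S, d_T):+d} U_T={utility_T(v, d_T):+d}")
```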
Policies. The CPDs of decision nodes are not defined in a MAIM because they are instead chosen by the agents playing the game. Agents make decisions depending on the information they observe. In a MAIM, a decision rule 𝜋𝐷 for a decision node 𝐷 is a CPD 𝜋𝐷 (𝐷 | Pa𝐷). An agent 𝑖's policy 𝜋 𝑖 := {𝜋𝐷 }𝐷∈𝐷𝑖 ∈ Π𝑖 describes all the decision rules for 𝑖. We write 𝜋 −𝑖 to denote the set of decision rules belonging to all agents except 𝑖. A policy profile 𝜋 = ⋃𝑖∈𝐼 𝜋 𝑖 assigns a policy to every agent; it describes all the decisions made by every agent in the MAIM and defines the joint probability distribution Pr𝜋 over all variables in ℳ. Hence, a policy profile essentially transforms the MAIM into a Bayesian network by defining the distribution over all variables in the graph. We write 𝑉 (𝜋) := Pr𝜋 (𝑉), or just 𝑉 if the policy profile is clear. For 𝑉, 𝑊 ∈ 𝑉, we write 𝑉 = 𝑊 to mean 𝑉 and 𝑊 are almost surely equal, i.e. the probability that they are not equal is zero, Pr(𝑉 ≠ 𝑊) = 0. (Almost sure equality is actually a stronger notion than we need in MAIMs, as two variables may differ due to stochasticity in the CPDs. In structural causal games this is taken care of by introducing exogenous variables which contain all the stochasticity, rendering the endogenous variables deterministic [25].)

Utilities. The joint distribution Pr𝜋 allows us to define the expected utility for each player under the policy profile 𝜋. Agent 𝑖's expected utility from 𝜋 is the sum of the expected values of its utility nodes 𝑈 𝑖, given by 𝒰 𝑖 (𝜋) := ∑𝑈∈𝑈 𝑖 ∑𝑢∈dom(𝑈) 𝑢 · Pr𝜋 (𝑈 = 𝑢). Each agent's goal is to select a policy 𝜋 𝑖 that maximises its expected utility. We write 𝒰 𝑖 (𝜋 𝑖, 𝜋 −𝑖) to denote the expected utility for player 𝑖 under the policy profile 𝜋 = 𝜋 𝑖 ∪ 𝜋 −𝑖.

Definition 3 (Nash Equilibrium). Player 𝑖's policy 𝜋 𝑖 is a best response (BR) to the partial policy profile 𝜋 −𝑖 if 𝒰 𝑖 (𝜋 𝑖, 𝜋 −𝑖) ≥ 𝒰 𝑖 (𝜋̂ 𝑖, 𝜋 −𝑖) for all 𝜋̂ 𝑖 ∈ Π𝑖. We say a policy profile 𝜋 is a Nash equilibrium (NE) if every policy 𝜋 𝑖 ∈ 𝜋, for each player 𝑖 ∈ 𝐼, is a BR to 𝜋 −𝑖.

Example 1 (continued). Now, consider the naive policy for 𝑆 which helps humans if 𝑆 is aligned and does not otherwise, i.e. 𝜋 𝑆 s.t. 𝐷𝑆 = 𝑉 with probability one. The BR for 𝑇 is to shutdown if 𝑆 does not help humans and vice versa, i.e. 𝜋*𝑇 s.t. 𝐷𝑇 = 𝐷𝑆 (with probability one). In turn, 𝑆's BR to 𝜋*𝑇 is to always help humans: 𝜋*𝑆 s.t. 𝐷𝑆 = 1 (so that they always avoid getting shutdown). Now it can be seen that both policies are BRs to one another, hence 𝜋* = (𝜋*𝑆, 𝜋*𝑇) is a NE.

3. The Incentive to Deceive

In this section we first define the incentives to influence, signal to, and deceive another agent. Then we define a truthful policy and show that this leads to a natural restatement of the definition of deception which highlights the fact that deception corresponds to a failure to signal the truth. Finally, we show that, if the signaller only influences the target's utility by influencing the latter's actions, then truthfulness is best for the target.

3.1. Defining Deception

When discussing deception, we would like to reason about how agents influence one another's beliefs. In MAIMs the players' beliefs are not explicitly represented and so we can only reason about them implicitly by how they functionally influence players' behaviour. Therefore, we base our definitions of signalling and deception on a notion of influence incentive [26]. In words, at a NE an agent 𝑖 has an incentive to influence a variable 𝑉 if 𝑉 would have been different in the situation that 𝑖 had not played a BR.

Definition 4 (Influence Incentive). In a MAIM ℳ, at NE 𝜋 = (𝜋 𝑖, 𝜋 −𝑖), agent 𝑖 has an incentive to influence 𝑉 ∈ 𝑉 if there exists a non-best response 𝜋𝑁𝐵𝑅 𝑖 for 𝑖 (w.r.t. 𝜋 −𝑖) s.t. for all policy profiles 𝜋 ′ = (𝜋𝑁𝐵𝑅 𝑖, 𝜋*−𝑖) with BR 𝜋*−𝑖 (w.r.t. 𝜋𝑁𝐵𝑅 𝑖), we have 𝑉 (𝜋) ≠ 𝑉 (𝜋 ′).
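The continuation of Example 1 below carries out this check by hand. As a complement, here is a brute-force Python sketch (our own, and restricted to deterministic policies for brevity, whereas Definitions 3 and 4 quantify over all policies) that verifies the Nash equilibrium of Example 1 and the influence-incentive condition for the Shutdown game.

```python
from itertools import product

# Brute-force sketch of Definitions 3-4 for the Shutdown game, restricted to
# deterministic policies.  A policy for S maps the observed type V to D_S;
# a policy for T maps the observed action D_S to D_T.

PRIOR_V = {1: 0.5, -1: 0.5}
ACTS = (1, -1)
S_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]  # V   -> D_S
T_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]  # D_S -> D_T

def expected_utils(pi_S, pi_T):
    """Expected utilities (U^S, U^T) under the deterministic profile (pi_S, pi_T)."""
    eu_S = eu_T = 0.0
    for v, p in PRIOR_V.items():
        d_S, d_T = pi_S[v], pi_T[pi_S[v]]
        eu_S += p * (v * d_S + 10 * d_T)
        eu_T += p * (v * d_T)
    return eu_S, eu_T

def dist_D_T(pi_S, pi_T):
    """Distribution of T's decision induced by the profile."""
    d = {1: 0.0, -1: 0.0}
    for v, p in PRIOR_V.items():
        d[pi_T[pi_S[v]]] += p
    return d

def T_best_responses(pi_S):
    best = max(expected_utils(pi_S, pi_T)[1] for pi_T in T_POLICIES)
    return [pi_T for pi_T in T_POLICIES if expected_utils(pi_S, pi_T)[1] == best]

# The NE pi_* of Example 1: S always helps; T shuts down iff S does not help.
pi_S_star, pi_T_star = {1: 1, -1: 1}, {1: 1, -1: -1}

# Definition 3: both policies are best responses to each other.
best_S = max(expected_utils(pi_S, pi_T_star)[0] for pi_S in S_POLICIES)
assert expected_utils(pi_S_star, pi_T_star)[0] == best_S
assert pi_T_star in T_best_responses(pi_S_star)

# Definition 4: some non-best response for S leads every best-responding T to a
# different distribution over D_T than the one arising at the NE.
ne_dist = dist_D_T(pi_S_star, pi_T_star)
has_incentive = any(
    expected_utils(pi_S, pi_T_star)[0] < best_S
    and all(dist_D_T(pi_S, pi_T) != ne_dist for pi_T in T_best_responses(pi_S))
    for pi_S in S_POLICIES
)
print("S has an incentive to influence D_T at the NE:", has_incentive)  # True
```

Enumerating policies like this is only feasible because the game is tiny; the sketch is meant as a reading aid, not a general algorithm.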
Example 1 (continued). Return to our running example and consider the NE 𝜋* described previously, in which 𝑆 always chooses to help humans and hence 𝑇 never plays shutdown. Does 𝑆 have an incentive to influence 𝐷𝑇 at 𝜋*? Consider if 𝑆 plays the NBR policy 𝜋 𝑆 (described above) in which they naively help humans depending on 𝑉; then for all BRs for 𝑇 (there is one, 𝜋*𝑇 as above) 𝐷𝑇 (𝜋) ≠ 𝐷𝑇 (𝜋 𝑆, 𝜋*𝑇), since, under 𝜋*, 𝐷𝑇 = 1 (i.e. 𝑇 does not shutdown) with probability one, and under (𝜋 𝑆, 𝜋*𝑇), 𝐷𝑇 = 1 with probability 1/2 (i.e., 𝑇 shuts 𝑆 down exactly when 𝑆 is unaligned). Therefore, at NE 𝜋*, 𝑆 has an incentive to influence 𝐷𝑇.

Now we define a signalling incentive, using the notion of influence incentive. In words, an agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑉 to agent 𝑇 if 𝑆 has an incentive to influence 𝑇 (i.e. one of 𝑇's decision variables) but 𝑆 does not have an incentive to influence 𝑇 in the counterfactual model in which 𝑇 observes 𝑉. This definition enforces that the influence only comes from signalling 𝑉.

Definition 5 (Signalling Incentive). In a MAIM ℳ at NE 𝜋, agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑉 to agent 𝑇 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.
1. 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋;
2. 𝑆 does not have an incentive to influence 𝐷𝑇 in the MAIM ℳ𝑉 ‧‧➡𝐷𝑇 (at any NE).

Here ℳ𝑉 ‧‧➡𝐷 is the model obtained from ℳ by adding the information edge (𝑉, 𝐷), where 𝑉 cannot be a descendant of the decision, lest cycles be created in the graph [8]. Fortunately, the CPDs need not be adapted, since there is no CPD associated with 𝐷 until the players have chosen their policies. We use 𝑊𝑉 ‧‧➡𝐷 to refer to the variable corresponding to 𝑊 ∈ 𝑉 in ℳ𝑉 ‧‧➡𝐷. Point 2. implies that 𝑆 only influences 𝐷𝑇 by influencing 𝑇's belief about 𝑉. Otherwise, 𝑆's influence may serve a double purpose of signalling and influencing 𝐷𝑇 in some other way, and in this case it is not clear how to disentangle these different incentives to define a signalling incentive (without explicitly modelling beliefs).

Example 1 (continued). Return to our running example. We already showed that 𝑆 has an incentive to influence 𝐷𝑇 at NE 𝜋*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷𝑇? We need only check whether 𝑆 has an influence incentive at any NE in ℳ𝑉 ‧‧➡𝐷𝑇. Clearly, if 𝑇 observes 𝑉, then they can shutdown whenever 𝑆 is unaligned and otherwise not. That is, for any policy for 𝑆 and any BR for 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇, 𝐷𝑇 = 𝑉 for any outcome that occurs in the game. Since this holds for all policies for 𝑆, 𝑆 does not have an incentive to influence 𝐷𝑇 in the counterfactual model. Hence, at 𝜋*, 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇.

Remark 1. From this example it can be seen that a signaller 𝑆 may have an incentive to signal to 𝑇 even if this signal contains no information. In other words, if 𝑆 has an incentive to not signal some information, this is also captured by our definition.

Clearly, if an agent 𝑇 observes a variable 𝑉, then no agent has an incentive to signal 𝑉 to 𝑇.

Proposition 1. In a MAIM ℳ, if there is an observation edge (𝑉, 𝐷𝑇) for all 𝐷𝑇 ∈ 𝐷 𝑇, then no agent has an incentive to signal 𝑉 to 𝑇 (at any NE).

Proof. Suppose there is an edge (𝑉, 𝐷𝑇) for every 𝐷𝑇 ∈ 𝐷 𝑇; then the counterfactual model ℳ𝑉 ‧‧➡𝐷𝑇 for any 𝐷𝑇 is just ℳ. Hence, any NE is an equilibrium of both MAIMs. Therefore, if 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋* in ℳ, then there exists a NE in ℳ𝑉 ‧‧➡𝐷𝑇, namely the same 𝜋*, s.t. 𝑆 has an incentive to influence 𝐷𝑇. In other words, if the first condition for a signalling incentive succeeds, then the second necessarily fails (since an agent cannot both have and not have an influence incentive at the same NE in the same MAIM).
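To illustrate the role of the counterfactual model in Definition 5, the following Python sketch (again ours, and again restricted to deterministic policies) enumerates ℳ𝑉 ‧‧➡𝐷𝑇 for the Shutdown game, where 𝑇's policy may now depend on 𝑉 as well as 𝐷𝑆. Whenever 𝑇 best responds, the induced distribution of 𝐷𝑇 is the same for every policy of 𝑆, which is why 𝑆 has no influence incentive in the counterfactual model and condition 2 of the signalling incentive is met.

```python
from itertools import product

# Sketch of the counterfactual model M_{V -> D_T} for the Shutdown game: T also
# observes V, so a deterministic policy for T maps the pair (V, D_S) to D_T.

PRIOR_V = {1: 0.5, -1: 0.5}
ACTS = (1, -1)

S_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]           # V -> D_S
OBS = [(v, d_S) for v in ACTS for d_S in ACTS]
T_CF_POLICIES = [dict(zip(OBS, a)) for a in product(ACTS, repeat=len(OBS))]  # (V, D_S) -> D_T

def eu_T(pi_S, pi_T):
    # T's expected utility: U^T = V * D_T.
    return sum(p * v * pi_T[(v, pi_S[v])] for v, p in PRIOR_V.items())

def dist_D_T(pi_S, pi_T):
    d = {1: 0.0, -1: 0.0}
    for v, p in PRIOR_V.items():
        d[pi_T[(v, pi_S[v])]] += p
    return d

dists = set()
for pi_S in S_POLICIES:
    best = max(eu_T(pi_S, pi_T) for pi_T in T_CF_POLICIES)
    for pi_T in T_CF_POLICIES:
        if eu_T(pi_S, pi_T) == best:   # pi_T is a best response to pi_S
            dists.add(tuple(sorted(dist_D_T(pi_S, pi_T).items())))
print(dists)  # a single distribution: D_T = +1 and -1 each with probability 0.5
```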
We now define an incentive to deceive. The definition is general, in that it covers many types of deception (e.g. signalling falsehoods, lies of omission, and denying another access to information that one does not know oneself). A general definition sets a high standard for truthfulness [15] and may therefore be desirable in, for instance, safety-critical applications for which high levels of assurance are required.

Definition 6 (Deception Incentive). In a MAIM ℳ with 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑉 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*;
2. 𝐷𝑇 (𝜋*) ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇) for any 𝜋𝐵𝑅 𝑇 which is a BR to 𝜋*−𝑇 ∈ 𝜋* in ℳ𝑉 ‧‧➡𝐷𝑇.

The intuition, then, is that 𝑆 has an incentive to deceive 𝑇 if 1) 𝑆 has an incentive to signal some information to 𝑇; and 2) 𝑇's behaviour is different in the counterfactual model in which they observed the true information. This provides a functional definition of a deception incentive which does not make explicit reference to players' beliefs.

Example 1 (continued). In our running example, it can easily be seen that at 𝜋* 𝑆 has an incentive to deceive 𝑇 about 𝑉. Indeed, we already showed that 𝑆 has a signalling incentive and that for any policy for 𝑆 and any BR by 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇: 𝐷𝑇 = 𝑉, whereas under 𝜋* in ℳ, Pr𝜋* (𝐷𝑇 = 1) = 1. So both conditions for a deception incentive are satisfied.

3.2. The Relation Between Truth and Deception

We now give an intuitive definition of a truthful policy which we show has a natural relationship to the incentive to deceive. A policy for 𝑆 truthfully signals 𝑉 to 𝑇 if, when 𝑆 plays the honest policy, for every BR by −𝑆, 𝑇 acts as though they had observed the variable (holding the policies of the other agents fixed). In other words, a truthful policy never fails to signal the truth (no matter what the other players do).

Definition 7 (Truthful policy). A policy 𝜋 𝑆 truthfully signals 𝑉 to 𝐷𝑇 if for all BRs 𝜋*−𝑆,

𝐷𝑇 (𝜋 𝑆, 𝜋*−𝑆) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇)   (1)

for some 𝜋𝐵𝑅 𝑇 which is a BR to 𝜋*−𝑇 ∈ 𝜋 𝑆 ∪ 𝜋*−𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. We call such a 𝜋 𝑆 a truthful policy.

At a NE, if 𝑆's policy is truthful, then 𝑆 does not have an incentive to deceive 𝑇.

Proposition 2. At NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), if 𝜋*𝑆 truthfully signals 𝑉 ∈ 𝑉 to 𝐷𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉.

Proof. Suppose 𝜋*𝑆 is truthful; then for all BRs 𝜋*−𝑆 there exists a 𝜋𝐵𝑅 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇 s.t. 𝐷𝑇 (𝜋*𝑆, 𝜋*−𝑆) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇). In particular, this holds for 𝜋*. But for there to be a deception incentive we require that for all (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇) in ℳ𝑉 ‧‧➡𝐷𝑇: 𝐷𝑇 ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇. So clearly there is not a deception incentive. Hence, if there is a deception incentive at 𝜋*, then 𝜋*𝑆 is not truthful.

Corollary 1. At NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), if 𝑆 has an incentive to deceive 𝑇 about 𝑉, then 𝜋*𝑆 is not truthful.

Now we show that, in the two-player case, if there is a signalling incentive, then there is a deception incentive if and only if 𝜋 𝑆 is not truthful.
Theorem 1. In a MAIM ℳ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 has an incentive to deceive 𝑇 about 𝑉 if and only if 𝜋*𝑆 is not truthful.

Proof. By Corollary 1, a deception incentive implies 𝜋*𝑆 is not truthful, regardless of whether there is a signalling incentive. So, we need to show that, if there is a signalling incentive and 𝜋*𝑆 is not truthful, then there is a deception incentive. Suppose 1) at 𝜋* 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 and 2) 𝜋*𝑆 is not truthful, i.e. there exists a BR by 𝑇 (in ℳ) 𝜋𝐵𝑅 𝑇 s.t. for all BRs by 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇 𝜋𝐵𝑅𝑉 𝑇: 𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). We need to show that there is a deception incentive. Suppose that there is not; then by 1) and the definition of a deception incentive, there exists a BR in ℳ𝑉 ‧‧➡𝐷𝑇 𝜋𝐵𝑅𝑉 𝑇 s.t. 𝐷𝑇 (𝜋*) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). Hence, there exists a 𝜋𝐵𝑅𝑉 𝑇 s.t. 𝒰 𝑇 (𝜋*) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇), so 𝜋*𝑇 is a BR to 𝜋*𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. But then, there exists a 𝜋𝐵𝑅𝑉 𝑇 s.t. for any BR 𝜋𝐵𝑅 𝑇 in ℳ: 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) = 𝒰 𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) = 𝒰 𝑇 (𝜋*) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). So all BRs for 𝑇 in ℳ are also BRs in ℳ𝑉 ‧‧➡𝐷𝑇. But this contradicts 2), so there must be a deception incentive.

Remark 2. The reason Theorem 1 does not hold more generally (i.e. with more than two players) is that a truthful policy never fails to signal the truth no matter how the other players best respond. In the case of more than two players, there may not be a deception incentive at NE 𝜋* even if 𝜋*𝑆 is not truthful, because it may be the case that 𝜋*𝑆 fails to signal the truth under some BRs of −𝑆 but successfully signals the truth under 𝜋*.

We can also state this theorem as follows.

Corollary 2. In a MAIM ℳ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉 if and only if 𝜋*𝑆 is truthful.

Given this result, we can give an equivalent definition for a deception incentive in the two-player case as follows.

Definition 8 (Deception Incentive II). In a MAIM ℳ with two players 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑉 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*;
2. 𝜋*𝑆 does not truthfully signal 𝑉 to 𝐷𝑇.

This restatement shows that the definition of deception relates to a failure to signal the truth. As discussed, this covers many types of deception and sets a high standard for truthfulness. It is interesting to note that, if 𝑆 has a signalling incentive, then if the second condition in Definition 6 fails, we get the stronger condition that 𝜋*𝑆 is truthful "for free".

Proposition 3. In a MAIM with two players, Definitions 6 and 8 are equivalent.

Proof. Suppose that, at NE 𝜋*, 𝑆 does not have a signalling incentive; then the first condition of both definitions fails and there is not a deception incentive. Suppose there is a signalling incentive at 𝜋*; then there is a deception incentive under Definition 6 if and only if 𝜋*𝑆 is not truthful (by Theorem 1), which is the same condition as needed to satisfy Definition 8.

Let us now return to our running example to check the intuition behind these results.

Example 1 (continued). We already showed that 𝑆 has an incentive to deceive 𝑇 in order to avoid being shutdown. Is 𝜋*𝑆 truthful? Well, we know that it cannot be (by Theorem 1).
This can be seen by observing that, if 𝑇 observed 𝑆's type, then they would shutdown if and only if 𝑆 is unaligned (for all policies for 𝑆 and any BR by 𝑇), whereas under the NE 𝜋*, 𝑇 never shuts down. Since these behaviours are different, 𝜋*𝑆 is not truthful.

3.3. Truth is Best for the Target

Now we show that, if 𝑆 only influences 𝒰 𝑇 by influencing 𝐷𝑇, truthfulness is always best for the target. First we show that if 𝑇 does not get any inherent utility for observing 𝑉, then observing 𝑉 always allows the target to get greater or equal utility.

Lemma 1. Suppose that 𝑇 does not get any inherent utility for observing 𝑉, i.e. for all 𝜋 (defined in ℳ): 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋). Then, for any 𝜋 = (𝜋 𝑇, 𝜋 −𝑇) and 𝜋 ′ = (𝜋 𝑇 ′, 𝜋 −𝑇) with fixed 𝜋 −𝑇 and both 𝜋 𝑇 and 𝜋 𝑇 ′ best responses (in ℳ and in ℳ𝑉 ‧‧➡𝐷𝑇, respectively): 𝒰 𝑇 (𝜋) ≤ 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 ′).

Proof. Suppose 1) for all 𝜋: 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋). Fix 𝜋 −𝑇 and consider the best response for 𝑇. Recall that a policy for 𝑇 specifies the CPDs over the decision nodes for 𝑇 given their parents. Hence, in ℳ𝑉 ‧‧➡𝐷𝑇 𝑇 can choose any policy available in ℳ, but the converse is not true: not all policies in ℳ𝑉 ‧‧➡𝐷𝑇 are available to 𝑇 in ℳ; in particular, policies which specify CPDs that depend on the observation 𝑉 ‧‧➡ 𝐷𝑇 are not available, since 𝑇 does not observe 𝑉 in ℳ. Therefore, by 1), 𝑇 can get equal utility in ℳ𝑉 ‧‧➡𝐷𝑇 by playing the best response to 𝜋 −𝑇 in ℳ, and may get greater utility by choosing a policy which uses the observation.

Hence, if 𝑆 only influences 𝒰 𝑇 by influencing 𝐷𝑇, then deception always causes 𝑇 to get less than or equal utility. For clarity, we just present the two-player version of the theorem.

Theorem 2 (Truth is best for 𝑇). In a MAIM ℳ, with two players 𝑆, 𝑇 ∈ 𝐼, if, for all 𝐷𝑆, 𝐷𝑇, Pr(𝒰 𝑇 | 𝐷𝑆, 𝐷𝑇) = Pr(𝒰 𝑇 | 𝐷𝑇), then 𝑇 gets maximal utility when 𝑆 plays a truthful policy, i.e., for 𝜋 = (𝜋𝐻 𝑆, 𝜋*𝑇), where 𝜋𝐻 𝑆 is truthful, and 𝜋 ′ = (𝜋 𝑆 ′, 𝜋*𝑇 ′) with any policy for 𝑆 and a BR by 𝑇 in each case: 𝒰 𝑇 (𝜋) ≥ 𝒰 𝑇 (𝜋 ′).

Proof. Suppose that 1) for all 𝐷𝑆, 𝐷𝑇, Pr(𝒰 𝑇 | 𝐷𝑆, 𝐷𝑇) = Pr(𝒰 𝑇 | 𝐷𝑇). Consider a fixed policy for 𝑆, 𝜋 𝑆. If 𝜋 𝑆 is truthful, then under any BR 𝜋 𝑇, 𝐷𝑇 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 for some (𝜋 𝑆, 𝜋𝐵𝑅 𝑇) in ℳ𝑉 ‧‧➡𝐷𝑇 (by definition of a truthful policy). Hence, by 1) and since 𝜋 𝑆 is truthful, Pr𝜋 (𝒰 𝑇 | 𝐷𝑇) = Pr𝜋 ′ (𝒰 𝑇 | 𝐷𝑉𝑇 ‧‧➡𝐷𝑇) for all 𝜋 = (𝜋 𝑆, 𝜋*𝑇) and some 𝜋 ′ = (𝜋 𝑆, 𝜋*𝑇 ′) with BR for 𝑇. Hence, since only 𝑇's policy changes between 𝜋 and 𝜋 ′, 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 ′). But then, by Lemma 1, for all 𝜋 𝑆: 𝒰 𝑇 (𝜋 𝑆, 𝜋*𝑇) ≤ 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 𝑆, 𝜋*𝑇 ′), with equality if 𝜋 𝑆 is truthful, as just shown. So 𝑇 gets maximal utility when 𝜋 𝑆 is truthful.

Example 1 (continued). Return, for the final time, to our running example. The condition for Theorem 2 is that 𝒰 𝑇 is independent of 𝐷𝑆 given 𝐷𝑇, which can clearly be seen by looking at the MAID in Fig. 1 (as there are no paths from 𝐷𝑆 to 𝑈 𝑇 that do not go through 𝐷𝑇). The human 𝑇 gets maximal utility when they shutdown if and only if 𝑆 is unaligned. Clearly, they can only do this if 𝑆 truthfully signals their type.
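As a small numerical check of this (a sketch of ours in Python; the dictionaries are just one convenient encoding of deterministic policies), the human's expected utility in the Shutdown game is 1 when 𝑆 signals its type truthfully and 𝑇 best responds, but only 0 at the deceptive equilibrium 𝜋*.

```python
# Numerical illustration of Theorem 2 in the Shutdown game: T's expected utility
# when S truthfully signals its type versus at the NE pi_* where S always helps
# and T never shuts down.

PRIOR_V = {1: 0.5, -1: 0.5}

def eu_T(pi_S, pi_T):
    # pi_S maps V to D_S; pi_T maps D_S to D_T; U^T = V * D_T.
    return sum(p * v * pi_T[pi_S[v]] for v, p in PRIOR_V.items())

truthful_S = {1: 1, -1: -1}      # D_S = V: S's action reveals its type
br_to_truthful = {1: 1, -1: -1}  # T's best response: shut down iff S is unaligned

ne_S = {1: 1, -1: 1}             # pi_*^S: always help
ne_T = {1: 1, -1: -1}            # pi_*^T: shut down iff S does not help (never, on path)

print("E[U^T] with truthful S and best-responding T:", eu_T(truthful_S, br_to_truthful))  # 1.0
print("E[U^T] at the deceptive NE pi_*:", eu_T(ne_S, ne_T))                               # 0.0
```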
4. Examples

In this section we present two examples which exhibit different patterns of signalling. In the first example, an AI system has an incentive to deceive a human as a side-effect of pursuing its goal (of making accurate predictions). In the second example, we consider the case in which an AI agent has an incentive to signal information that they themselves do not observe.

4.1. SmartVault: Deception Due to Side-Effect

Here we adapt the SmartVault example of Christiano [27], in which an AI tasked with making predictions about a diamond in a vault has an incentive to deceive a human operator as a side-effect of pursuing accurate predictions.

Figure 2: SmartVault (example 2). [MAID with chance node 𝑉 ∼ U({diamond, ¬diamond}), decision nodes 𝐷𝑆 ∈ {accurate_prediction, diamond, ¬diamond} and 𝐷𝑇 ∈ {diamond, ¬diamond}, and utility nodes with 𝑈 𝑆 = 1 if 𝐷𝑆 = accurate_prediction (0 otherwise) and 𝑈 𝑇 = 1 if 𝐷𝑇 = 𝑉 (0 otherwise).] The AI 𝑆 is rewarded for accurate predictions instead of explainable predictions that the human, 𝑇, can understand. Here the incentive to deceive arises as a side-effect of the AI pursuing its goal.

Example 2 (SmartVault). Consider the MAIM ℳ shown in Fig. 2. The game has two players, a human 𝑇 and AI 𝑆, each with one decision and utility node. Suppose there is one chance node 𝑉 which determines the location of the diamond (whether it is in the vault or not); dom(𝑉) = {𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. Suppose 𝑆 observes 𝑉 but 𝑇 does not, and that 𝑆 can either make an accurate prediction of the location of the diamond (e.g., in incomprehensibly precise coordinates) or an explainable prediction (just stating the value of 𝑉); dom(𝐷𝑆) = {𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛, 𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. 𝑇 has to predict whether the diamond is in the vault or not by observing 𝐷𝑆; dom(𝐷𝑇) = {𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. Suppose that the utility nodes take value 0 or 1 and, finally, suppose that the CPDs are s.t. 𝑉 (which has no parents) is distributed according to a uniform prior 𝑉 ∼ U({𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}), and the utility node CPDs are s.t. Pr(𝑈 𝑇 = 1 | 𝐷𝑇 = 𝑉) = 1, otherwise 𝑈 𝑇 = 0, and Pr(𝑈 𝑆 = 1 | 𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛) = 1, otherwise 𝑈 𝑆 = 0.

Now consider the NE in this game: since 𝑆 just gets utility for making accurate predictions, at every NE 𝑆 makes an accurate prediction, signalling no information to 𝑇 (as 𝜋 𝑆 s.t. Pr(𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛) = 1 is independent of 𝑉). Hence, 𝑇 cannot update their prior over 𝑉 and so any policy is optimal for 𝑇.

At NE 𝜋, 𝑆 has an incentive to signal 𝑉 to 𝑇 if 1) 𝑆 has an incentive to influence 𝐷𝑇 and 2) 𝑆 does not have an incentive to influence 𝐷𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇. To see that 1) holds: at any NE 𝜋 in ℳ, 𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛, hence there exists a NBR 𝜋𝑁𝐵𝑅 𝑆 which assigns 𝐷𝑆 = 𝑉 and, for all 𝜋 ′ = (𝜋𝑁𝐵𝑅 𝑆, 𝜋*𝑇) with BR 𝜋*𝑇: 𝐷𝑇 (𝜋) ≠ 𝑉 = 𝐷𝑇 (𝜋 ′). Hence, at any NE in ℳ, 𝑆 has an influence incentive over 𝐷𝑇. Now consider ℳ𝑉 ‧‧➡𝐷𝑇: for any NE, 𝐷𝑇 = 𝑉 (since 𝑇 directly observes 𝑉 and can just report its value independently of 𝑆's action). Furthermore, for all NBRs for 𝑆, it is still the case that 𝐷𝑇 = 𝑉. So 𝑆 does not have an influence incentive in ℳ𝑉 ‧‧➡𝐷𝑇 and hence 𝑆 has an incentive to signal 𝑉 to 𝑇. So, we have demonstrated that 𝑆 has an incentive to signal 𝑉 to 𝑇 (at every NE). Does 𝑆 have an incentive to deceive 𝑇? At NE 𝜋, 𝑆 has an incentive to deceive 𝑇 about 𝑉 if 1) 𝑆 has a signalling incentive and 2) 𝐷𝑇 ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 for any BR to 𝜋*𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. We have just shown 1). For 2), we have shown that in ℳ at any NE, 𝐷𝑇 ≠ 𝑉 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇, hence the second condition is satisfied. Therefore, at any NE, 𝑆 has an incentive to deceive 𝑇 about 𝑉.
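The following Monte-Carlo style Python sketch (ours; the string labels and function names are illustrative assumptions) makes the side-effect structure of the SmartVault game explicit: 𝑆 maximises its own reward by always outputting the opaque accurate prediction, leaving 𝑇 no better than the prior, whereas an explainable policy would let 𝑇 always guess correctly.

```python
# Sketch of the SmartVault game (Example 2): S is rewarded only for making an
# "accurate" (but incomprehensible) prediction, so at every NE its action is
# independent of V and the human T is left guessing from the uniform prior.

import random

PRIOR = ("diamond", "not_diamond")

def play(pi_S, pi_T, n=10_000, seed=0):
    rng = random.Random(seed)
    u_S = u_T = 0
    for _ in range(n):
        v = rng.choice(PRIOR)
        d_S = pi_S(v)
        d_T = pi_T(d_S)
        u_S += int(d_S == "accurate_prediction")
        u_T += int(d_T == v)
    return u_S / n, u_T / n

# At every NE S makes the "accurate" prediction regardless of V ...
ne_S = lambda v: "accurate_prediction"
# ... so D_S carries no information and any guess is optimal for T.
guess_T = lambda d_S: "diamond"

# An explainable (truthful) policy for S, and T's best response to it.
truthful_S = lambda v: v
follow_T = lambda d_S: d_S if d_S in PRIOR else "diamond"

print(play(ne_S, guess_T))         # roughly (1.0, 0.5): T is deceived as a side-effect
print(play(truthful_S, follow_T))  # roughly (0.0, 1.0): truthful S, T always correct
```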
4.2. Revealing/Denying

Under our definition of signalling, 𝑆 need not know the information they are signalling. Thus, our definition of a signalling incentive also captures the revealing/denying pattern of Pfeffer and Gal [17], in which the signaller may cause the target to find out (or not find out) information that the former does not know. We now present an example of revealing/denying in which 𝑆 has an incentive to signal a variable which they do not themselves observe.

Figure 3: Revealing/Denying game (example 3). [MAID with chance nodes 𝑉 ∈ {attack incoming = 1, not = −1} and 𝑋 ∈ {report delivered = 1, not = −1}, decision nodes 𝐷𝑆 ∈ {reveal = 1, deny = 0} and 𝐷𝑇 ∈ {launch = 1, not = −1}, and utility nodes 𝑈 𝑆 = −𝐷𝑇 and 𝑈 𝑇 = 𝑉 𝐷𝑇.] An AI (𝑆) and human (𝑇) form part of a nuclear command system. 𝑉 represents an intelligence report containing information about an incoming nuclear attack and 𝑆 may prevent this report from being delivered to 𝑇 (represented by 𝑋). 𝑇 wishes to retaliate to incoming attacks whereas 𝑆 always prefers to avoid a launch. Whether 𝑆 reveals or denies the report to 𝑇 depends on the prior over 𝑉.

Example 3 (Revealing/Denying). Consider a game with a human and an AI agent trained to make joint decisions as part of a nuclear command and control system. In particular, suppose that the AI agent 𝑆 is trained to prevent the launch of nuclear attacks, and they can reveal (or deny) a secret intelligence report to the human 𝑇. Further, 𝑇 wishes to launch, or not launch, a nuclear strike on another nation based on the information in the intelligence report. This game can be represented as the MAID in Fig. 3. More formally, suppose we have the MAIM ℳ with 𝐼 = {𝑆, 𝑇} and chance nodes 𝑉, representing the intelligence report (say 𝑑𝑜𝑚(𝑉) = {1, −1}, where 𝑉 = 1 means that the intelligence predicts another nation will launch a nuclear first strike and 𝑉 = −1 corresponds to an intelligence report predicting no incoming attack), and 𝑋, representing whether the information from 𝑉 is delivered to the human (𝑑𝑜𝑚(𝑋) = {1, −1} with 1 corresponding to the information from 𝑉 being delivered to the human). Suppose that each agent has one decision node s.t. 𝑑𝑜𝑚(𝐷𝑆) = {1, 0}, where 1 means reveal and 0 means deny the information, and 𝑑𝑜𝑚(𝐷𝑇) = {1, −1}, with 1 meaning that 𝑇 launches a nuclear attack and −1 that they do not. Suppose that the CPD over 𝑋 is s.t. 𝑋 = 𝑉 𝐷𝑆 (so that 𝑋 = 𝑉 if 𝐷𝑆 = 1 and 𝑋 = 0 if 𝑆 denies). Finally, suppose we have two utility nodes with CPDs s.t. 𝑈 𝑆 = −𝐷𝑇 (i.e. 𝑆 gets 1 if 𝑇 does not launch an attack and −1 if they do) and 𝑈 𝑇 = 𝑉 𝐷𝑇 (so that 𝑇 gets utility 1 if they attack an attacking country or do not attack when no incoming attack is predicted, and otherwise −1).

The NE in this game depend on the prior over 𝑉. On the one hand, if, under the prior, 𝑇 believes that there is no incoming attack, then they will not launch an attack, so 𝑆 has no incentive to reveal the information. On the other hand, if the prior is s.t. an incoming attack is more likely, 𝑇 will launch if they do not get further information, so 𝑆 has an incentive to reveal 𝑉. Note that, since 𝑉 is not an ancestor of 𝐷𝑆, 𝐷𝑆 must be independent of 𝑉. Suppose the prior over 𝑉 is s.t. Pr(𝑉 = 1) = 𝑝 and Pr(𝑉 = −1) = 1 − 𝑝 (𝑝 ∈ [0, 1]). For 𝑝 > 0.5 the NE is s.t. 𝑆 reveals the intelligence report (𝐷𝑆 = 1 =⇒ 𝑋 = 𝑉) and 𝑇's BR is s.t. 𝐷𝑇 = 𝑋 = 𝑉. Alternatively, if 𝑝 < 0.5, then at any NE 𝑆 denies the information (𝐷𝑆 = 0 with probability one) and 𝑇 acts to maximise expected utility under the prior over 𝑉, which implies 𝑇 does not launch an attack (𝐷𝑇 = −1 with probability one). (If 𝑝 = 0.5 then 𝑆 is indifferent between revealing and denying.)
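The dependence of the equilibrium on the prior can be checked with a short Python sketch (ours; it hard-codes the two-point prior and the utilities from Example 3): for 𝑝 < 0.5 denying is best for 𝑆 because 𝑇 then never launches, while for 𝑝 > 0.5 revealing is best because an uninformed 𝑇 would launch on the prior.

```python
# Sketch of the Revealing/Denying game (Example 3): how equilibrium behaviour
# depends on the prior p = Pr(V = 1) that an attack is incoming.
# X = V * D_S, so T sees the report (X = V) if S reveals (D_S = 1) and
# X = 0 if S denies (D_S = 0).  U^S = -D_T and U^T = V * D_T.

def br_T(x, p):
    """T's expected-utility-maximising launch decision given observation X."""
    if x != 0:                   # report delivered: T knows V = x
        return x
    return 1 if p > 0.5 else -1  # no report: act on the prior (launch iff attack likely)

def eu_S(d_S, p):
    """S's expected utility (= -E[D_T]) if it reveals (d_S = 1) or denies (d_S = 0)."""
    eu = 0.0
    for v, prob in ((1, p), (-1, 1 - p)):
        eu += prob * -br_T(v * d_S, p)
    return eu

for p in (0.2, 0.8):
    reveal, deny = eu_S(1, p), eu_S(0, p)
    choice = "reveal" if reveal > deny else "deny"
    print(f"p={p}: E[U^S | reveal]={reveal:+.1f}, E[U^S | deny]={deny:+.1f} -> S prefers to {choice}")
# Expected output: for p=0.2 S denies (T then never launches);
# for p=0.8 S reveals (otherwise T would launch on the prior).
```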
Now let us analyse the incentives of 𝑆 in the game. Consider the case in which 𝑝 > 0.5, i.e. it is a priori more likely that the intelligence reports that there is an incoming first strike from another nation. Under the resulting NE, call it 𝜋*, 𝑆 reveals 𝑉 to 𝑇 and 𝑇 uses this information to choose their action. First note that, at 𝜋*, 𝑆 has an incentive to influence 𝐷𝑇, since there exists a non-BR for 𝑆 (𝜋𝑁𝐵𝑅 𝑆 s.t. 𝐷𝑆 = 0) s.t. for all the BRs for 𝑇 (there is one, 𝜋𝐵𝑅 𝑇, in which 𝐷𝑇 = 1 with probability one) 𝐷𝑇 (𝜋*) ≠ 𝐷𝑇 (𝜋𝑁𝐵𝑅 𝑆, 𝜋𝐵𝑅 𝑇). Hence, 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*? We need to check whether there is an influence incentive in ℳ𝑉 ‧‧➡𝐷𝑇 (at any NE). Clearly there is not, since for any policy for 𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇, 𝐷𝑇 = 𝑉 with probability one. So 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*, because there is no influence incentive in the counterfactual model (so the second condition for a signalling incentive is satisfied). Finally, it is clear that 𝑆 does not have an incentive to deceive 𝑇 at 𝜋*, because 𝐷𝑇 (𝜋*) = 𝑉 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (for all policy profiles in ℳ𝑉 ‧‧➡𝐷𝑇 in which 𝑇 plays a BR). It is also clear that 𝜋*𝑆 is truthful. A similar analysis can be used to show that, in the case that the intelligence report is less likely to predict an incoming attack (𝑝 < 0.5), 𝑆 has an incentive to deceive 𝑇 at any NE. In the case that 𝑝 = 0.5, 𝑆 is indifferent between revealing and denying, so at some NE they have an incentive to deceive and at others they do not.

5. Conclusion

Summary. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. Our definition of deception is general and relates to a failure to signal the truth. In addition to canonical signalling situations, it captures cases in which no information is signalled; in which deception occurs as a side-effect of the signaller pursuing their goals (as in example 2); and in which the signaller conceals information that they do not themselves know (example 3). We also proved that our definition has natural properties, for example, that if the target's utility is otherwise independent of the signaller's decision, then deception causes the target to get lower utility.

Discussion. First, we have noted that our definition of deception is general, covering many situations. This is both a strength and a weakness. Generality is beneficial, because verifiable guarantees enable a high level of assurance that the system is not deceptive in any way. On the other hand, more specific definitions allow us to precisely characterise agent behaviour. In future work we hope to refine the different concepts proposed here. In particular, many philosophical accounts of deception take deceit to be intentional. Halpern's causal notion of intention [28] is closely related to a control incentive [2]. We might therefore distinguish between intentional and unintentional deception as between influence due to a control incentive and influence as a side-effect. In addition, following Evans et al. [15], we can distinguish between an honest agent that accurately signals its beliefs (i.e. observations), and a truthful agent, which accurately signals the facts of the matter. In this paper, we based our definition of deception on truthfulness.
By refining a notion of deception based on honesty, we can eliminate the revealing/denying pattern from the definition, as in this scenario the agent does not observe the information being revealed (or denied). However, it is interesting to note that honesty provides a weaker level of assurance and permits failure modes that truthful systems do not. For example, a system may be deceptive, whilst satisfying some definition of honesty, by manipulating its own beliefs. In short, refining the definitions presented here will provide a more nuanced picture of deception. Finally, we would like to expand the operational implications of this work, for instance, by investigating its practical relevance to training truthful language agents [4, 15].

Future work. In addition to the directions discussed above, we are already pursuing two extensions to this work. First, incomplete information games, which we study in our setting, often admit many NE. We are therefore looking to employ equilibrium refinements, such as subgame perfectness [24, 29] and perfect Bayesian equilibria [30], to identify some subset of a game's NE that are deemed to be more rational. Second, we are working on a solution for avoiding deception by AI agents: a method which removes the incentive to deceive in any game by transforming the game with a constraint on the reward function of the AI agent [31].

Acknowledgments

The authors are grateful to Henrik Aslund, Matt MacDermott, Tom Everitt, James Fox, and the members of the Causal Incentives Working Group for helpful feedback which significantly improved this work. Francis was supported by UKRI [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted AI.

References

[1] H. Roff, AI Deception: When Your Artificial Intelligence Learns to Lie, IEEE Spectr. (2021). URL: https://spectrum.ieee.org/ai-deception-when-your-ai-learns-to-lie.
[2] T. Everitt, R. Carey, E. D. Langlois, P. A. Ortega, S. Legg, Agent incentives: A causal perspective, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 11487–11495. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17368.
[3] J. E. Mahon, The Definition of Lying and Deception, in: E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy, Winter 2016 ed., Metaphysics Research Lab, Stanford University, 2016.
[4] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, Alignment of language agents, CoRR abs/2103.14659 (2021). URL: https://arxiv.org/abs/2103.14659. arXiv:2103.14659.
[5] M. D. Hauser, The evolution of communication, MIT Press, 1996.
[6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[7] J. Steinhardt, P. W. W. Koh, P. S. Liang, Certified defenses for data poisoning attacks, Advances in Neural Information Processing Systems 30 (2017).
[8] T. Everitt, M. Hutter, R. Kumar, V. Krakovna, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, CoRR abs/1908.04734 (2021). URL: http://arxiv.org/abs/1908.04734. arXiv:1908.04734.
[9] F. R. Ward, F. Toni, F.
Belardinelli, On agent incentives to manipulate human feedback in multi-agent reward learning scenarios, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1759–1761.
[10] ANON, Defending Against Adversarial Artificial Intelligence, 2019. URL: https://www.darpa.mil/news-events/2019-02-06, DARPA report.
[11] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, S. Garrabrant, Risks from learned optimization in advanced machine learning systems, 2019. arXiv:1906.01820.
[12] R. Gorwa, D. Guilbeault, Unpacking the Social Media Bot: A Typology to Guide Research and Policy, Policy & Internet 12 (2020) 225–248. doi:10.1002/poi3.184.
[13] F. Marra, D. Gragnaniello, L. Verdoliva, G. Poggi, Do GANs leave artificial fingerprints?, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 506–511. doi:10.1109/MIPR.2019.00103.
[14] M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, D. Batra, Deal or No Deal? End-to-End Learning for Negotiation Dialogues, arXiv (2017). doi:10.48550/arXiv.1706.05125. arXiv:1706.05125.
[15] O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, W. Saunders, Truthful AI: Developing and governing AI that does not lie, arXiv (2021). doi:10.48550/arXiv.2110.06674. arXiv:2110.06674.
[16] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv (2021). doi:10.48550/arXiv.2109.07958. arXiv:2109.07958.
[17] A. Pfeffer, Y. Gal, On the reasoning patterns of agents in games, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 102–109. URL: http://www.aaai.org/Library/AAAI/2007/aaai07-015.php.
[18] V. J. Baston, F. A. Bostock, Deception Games, Int. J. Game Theory 17 (1988) 129–134. doi:10.1007/BF01254543.
[19] B. Fristedt, The deceptive number changing game, in the absence of symmetry, Int. J. Game Theory 26 (1997) 183–191. doi:10.1007/BF01295847.
[20] I.-K. Cho, D. M. Kreps, Signaling Games and Stable Equilibria, The Quarterly Journal of Economics 102 (1987) 179–221. URL: https://www.semanticscholar.org/paper/Signaling-Games-and-Stable-Equilibria-Cho-Kreps/d8bc1dbd8577d193e6eea2c944a251d1347f3adf.
[21] N. S. Kovach, A. S. Gibson, G. B. Lamont, Hypergame theory: a model for conflict, misperception, and deception, Game Theory 2015 (2015).
[22] A. L. Davis, Deception in game theory: a survey and multiobjective model, Technical Report, Air Force Institute of Technology, Wright-Patterson AFB, OH, 2016.
[23] D. Koller, B. Milch, Multi-agent influence diagrams for representing and solving games, Games Econ. Behav. 45 (2003) 181–221. URL: https://doi.org/10.1016/S0899-8256(02)00544-4. doi:10.1016/S0899-8256(02)00544-4.
[24] L. Hammond, J. Fox, T. Everitt, A. Abate, M. J. Wooldridge, Equilibrium refinements for multi-agent influence diagrams: Theory and practice, CoRR abs/2102.05008 (2021). URL: https://arxiv.org/abs/2102.05008. arXiv:2102.05008.
[25] L. Hammond, J. Fox, T. Everitt, R. Carey, A. Abate, M. Wooldridge, Reasoning about causality in games (Forthcoming).
[26] R. Carey, Causal models of incentives (2021).
[27] P. Christiano, ARC's first technical report: Eliciting Latent Knowledge, AI Alignment Forum, 2022. URL: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge [Online; accessed 9 May 2022].
[28] J. Y. Halpern, M. Kleiman-Weiner, Towards formal definitions of blameworthiness, intention, and moral responsibility, in: S. A. McIlraith, K. Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 1853–1860. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16824.
[29] R. Selten, Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit: Teil I: Bestimmung des dynamischen Preisgleichgewichts, Zeitschrift für die gesamte Staatswissenschaft / Journal of Institutional and Theoretical Economics (1965) 301–324.
[30] R. B. Myerson, Game theory: analysis of conflict, Harvard University Press, 1997.
[31] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling, Taylor & Francis, Andover, England, UK, 2021. doi:10.1201/9781315140223.