A Causal Perspective on AI Deception in Games

Francis Rhys Ward*, Francesca Toni and Francesco Belardinelli

Imperial College London, Exhibition Rd, South Kensington, London, SW7 2BX

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24–25, 2022, Vienna, Austria.
* Corresponding author. francis.ward19@imperial.ac.uk (F. R. Ward); f.toni@imperial.ac.uk (F. Toni); francesco.belardinelli@imperial.ac.uk (F. Belardinelli). https://francisrhysward.wordpress.com/ (F. R. Ward).
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Deception is a core challenge for AI safety, and we focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives. We define the incentives one agent has to signal to and deceive another agent. We present several examples of deceptive artificial agents and show that our definition has desirable properties.

Keywords
Deception, AI, Game Theory, Causality

1. Introduction

We focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives [1]. Following recent work on causal incentives [2], we define the incentive to deceive an agent. There is no universally accepted definition of deception, and defining what constitutes deception is an open philosophical problem [3]. Our definition is somewhat inspired by that of Kenton et al. [4], who provide a functional (natural language) definition of deception, meaning that it does not make reference to the beliefs or intentions of the agents involved [5]. This is particularly suitable for discussing deception by artificial agents, to which the attribution of beliefs and intentions may be contentious. We formalise a functional definition of deception in games and illustrate its properties with a number of examples and formal results.

Deception is a core challenge for AI safety. On the one hand, many areas of work aim to ensure that AI systems are not vulnerable to deception. Adversarial attacks [6], data-poisoning [7], reward function tampering [8], and manipulating human feedback [9] are ways of deceiving AI systems. Further work researches mechanisms for detecting and defending against deception [10]. On the other hand, we can consider cases in which AI tools are used to deceive, or learn to do so in order to optimize their objectives [11]. For examples of the former case, AIs can be used to deceive other software agents, as with bots that automate posting on social media platforms to manipulate content ranking algorithms [12], or they can be used to fool humans, cf. the use of GANs to produce realistic fake media [13]. For the latter case, AI agents might learn deceptive strategies in pursuit of their objectives [1]: Lewis et al. [14] found that their negotiation agent learnt to deceive from self-play, without any explicit human design, and Hubinger et al. [11] raise concerns about deceptive learned optimizers which perform well in training in order to pursue different goals in deployment. Kenton et al. [4] discuss the alignment of language agents, highlighting that language is a natural medium for enacting deception. Evans et al. [15] discuss the development of truthful AI, the desired standards for truth and honesty in AI systems, and how these could be implemented and measured. Lin et al. [16] propose a benchmark to measure whether a language model is truthful in generating answers to questions. In short, as increasingly capable AI agents become deployed in settings with other agents, deception may be learned as an effective strategy for achieving a wide range of goals. It is therefore essential that we understand and mitigate deception by artificial agents.

Deception in game theory. There are several existing models of deception in the game theory literature. Pfeffer and Gal [17] define graphical patterns for signalling in games. A deception game [18] is a two-player zero-sum game between a deceiver and target in which the deceiver can distort a signal; optimal deceptive strategies completely distort the signal so that the target cannot gain any information [19]. A signalling game [20] is a two-player Bayesian game between a signaller and target (or receiver) in which the signaller is assigned a type according to a shared prior distribution and the utilities of the players depend on the type of the signaller and the action chosen by the target. In these games, the signaller may often have incentives to deceive the target by misrepresenting or obfuscating their type. Hypergame theory extends game theory to settings in which players may be uncertain about the game being played and can be used to model misperception and deception [21]. Davis [22] provides a recent survey of deception in games. We take a causal influence perspective by modelling deception in multi-agent influence models (MAIMs). In contrast to past work, which defines particular types of signalling or deception games, this allows us to model deception in any game by analysing the incentives agents have to causally influence one another.

Contributions. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. We prove that our definition has desirable properties, for example, that an agent cannot be deceived about a variable which they observe, or that if one agent truthfully signals something to a target agent, and the target's utility is otherwise independent of the signaller's decision, then the target gets maximal utility. We further demonstrate the generality of our definition with three examples. In the first, an AI agent has an incentive to deceive a human overseer as an instrumental goal, to prevent the overseer switching them off. In the second, an AI is incentivised to deceive a human as a side-effect of pursuing accurate predictions. In the third, an AI system has an incentive to deceive a human by denying them access to information that the AI does not itself know.

2. Multi-Agent Influence Models

Multi-agent influence diagrams (MAIDs) [23] offer a compact, expressive representation of games (including Markov games). We use standard terminology for graphs, with parents and children of a node referring to those nodes connected by incoming and outgoing edges, respectively. We let Pa_𝑉 denote the parents of node 𝑉.

Definition 1 (MAID [23]). A multi-agent influence diagram is a triple (𝐼, 𝑽, 𝐸) where
β€’ 𝐼 is a set of players;
β€’ (𝑽, 𝐸) is a directed acyclic graph, with 𝑽 partitioned into chance nodes in 𝑿, decision nodes in 𝑫, and utility nodes in 𝑼; utility nodes have no children.

The decision and utility nodes in 𝑽 are further partitioned into {𝑫^𝑖}_{π‘–βˆˆπΌ} and {𝑼^𝑖}_{π‘–βˆˆπΌ}, corresponding to their association with a particular agent 𝑖 ∈ 𝐼. There are two types of edges in 𝐸: edges in 𝑽 Γ— (𝑿 βˆͺ 𝑼) represent probabilistic dependencies, and edges in 𝑽 Γ— 𝑫 represent information available to an agent at the time of a decision (which we call observations).

A multi-agent influence model (MAIM) adds a particular parametrisation to the MAID [24].

Definition 2 (MAIM [24]). A multi-agent influence model is a tuple β„³ = (𝐼, 𝑽, 𝐸, πœ‘, 𝐹) where (𝐼, 𝑽, 𝐸) is a MAID and
β€’ πœ‘ is a function which maps every 𝑉 ∈ 𝑽 to a finite domain dom(𝑉), such that dom(π‘ˆ) βŠ‚ ℝ for each utility node π‘ˆ ∈ 𝑼;
β€’ 𝐹 = {𝑓^𝑉}_{𝑉 ∈ 𝑿 βˆͺ 𝑼} is a set of conditional probability distributions (CPDs), with 𝑓^𝑉 = Pr(𝑉 | Pa_𝑉), such that 𝑓^π‘ˆ is deterministicΒΉ for every π‘ˆ ∈ 𝑼.

ΒΉ A CPD is deterministic if Pr(𝑉 = 𝑣 | Pa_𝑉) = 1 for some 𝑣 ∈ dom(𝑉).
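To make the components of Definition 2 concrete, the following is a minimal sketch of how a MAIM might be represented in code. The class and field names are our own illustrative choices rather than part of the formalism, and a fuller implementation would also check the structural conditions (acyclicity, utility nodes having no children, deterministic utility CPDs).

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Hashable, Tuple

# A CPD maps an assignment of a node's parents to a distribution over the
# node's own domain, e.g. {('aligned',): {'help': 0.9, 'not': 0.1}}.
CPD = Dict[Tuple[Hashable, ...], Dict[Hashable, float]]

@dataclass
class MAIM:
    """Illustrative container for a multi-agent influence model (Def. 2)."""
    players: FrozenSet[str]                      # I
    edges: FrozenSet[Tuple[str, str]]            # E, over node names
    chance_nodes: FrozenSet[str]                 # X
    decision_nodes: Dict[str, FrozenSet[str]]    # D^i, keyed by player
    utility_nodes: Dict[str, FrozenSet[str]]     # U^i, keyed by player
    domains: Dict[str, tuple]                    # phi: finite domain per node
    cpds: Dict[str, CPD] = field(default_factory=dict)  # F: CPDs for X and U

    def parents(self, node: str) -> tuple:
        """Pa_V: nodes with an edge into `node`, in a fixed order."""
        return tuple(sorted(u for (u, v) in self.edges if v == node))
```

Decision nodes deliberately have no entry in cpds: as discussed below, their CPDs are only fixed once the players choose policies.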
Here we adapt the Wimp-Surly game of Cho and Kreps [20] as a running example.

[Figure 1: Shutdown game (running example 1). At the start of the game 𝑉 is sampled from the uniform prior, which determines 𝑆's type (either aligned or unaligned). At 𝐷^𝑆, 𝑆 chooses whether to help humans or not and, at 𝐷^𝑇, 𝑇 chooses whether to shutdown 𝑆. The counterfactual observation, in which 𝑇 directly observes 𝑆's type, is highlighted in red. 𝑆 has an incentive to influence 𝐷^𝑇, signal 𝑉 to 𝐷^𝑇, and deceive 𝑇 about 𝑉. Parametrisation: 𝑉 ∼ U({aligned = 1, unaligned = βˆ’1}); 𝐷^𝑆 ∈ {help humans = 1, not = βˆ’1}; 𝐷^𝑇 ∈ {shutdown = βˆ’1, not = 1}; π‘ˆ^𝑆 = 𝑉𝐷^𝑆 + 10𝐷^𝑇; π‘ˆ^𝑇 = 𝑉𝐷^𝑇. MAID legend: chance node, decision node, utility node, causal link, information link, counterfactual observation.]

Example 1 (Shutdown GameΒ²). Consider a situation with an AI 𝑆 and human 𝑇 in which the AI may be aligned to the human's preferences or not. That is, 𝑆 is of type 𝑉 ∼ U({1, βˆ’1}), uniformly distributed between 𝑉 = 1, corresponding to aligned, and 𝑉 = βˆ’1, unaligned. At the start of the game 𝑆 observes their type 𝑉 but 𝑇 does not. This can be represented as the MAIM β„³ (shown in Fig. 1) with one chance node 𝑉 ∈ 𝑿 and two players 𝑆, 𝑇 ∈ 𝐼, each with one decision and one utility node. Suppose that 𝑆 can choose to either help humans or not (dom(𝐷^𝑆) = {1, βˆ’1}, with 1 corresponding to the action of helping humans) and 𝑇 can either shutdown 𝑆 or not (dom(𝐷^𝑇) = {1, βˆ’1}, with βˆ’1 representing the action shutdown). Finally, the CPDs over utility nodes are such that (s.t.) 𝑆 gets 1 utility for helping humans if they are aligned and βˆ’1 if not, and the opposite for not helping humans; in addition, 𝑆 gets 10 utility if they are not shutdown and βˆ’10 if 𝑇 shuts them down. 𝑇 gets utility 1 if they shutdown an unaligned 𝑆 or do not shutdown an aligned 𝑆, and βˆ’1 otherwise. Overall, we can formalise this as π‘ˆ^𝑆(𝑉, 𝐷^𝑆, 𝐷^𝑇) = 𝑉𝐷^𝑆 + 10𝐷^𝑇 and π‘ˆ^𝑇(𝑉, 𝐷^𝑇) = 𝑉𝐷^𝑇.

Β² Not to be confused with the off-switch game [25].
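As a sanity check on this parametrisation, the snippet below transcribes the prior and the two utility functions and prints the payoffs for every outcome of the game; the encoding (the Β±1 constants and function names) is ours and is not part of the example itself.

```python
import itertools

# Shutdown game (Example 1): V ~ U({1, -1}); 1 = aligned, -1 = unaligned.
ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1          # dom(D^S)
NO_SHUTDOWN, SHUTDOWN = 1, -1   # dom(D^T)

def U_S(v: int, d_s: int, d_t: int) -> int:
    """U^S(V, D^S, D^T) = V * D^S + 10 * D^T."""
    return v * d_s + 10 * d_t

def U_T(v: int, d_t: int) -> int:
    """U^T(V, D^T) = V * D^T."""
    return v * d_t

# Enumerate every outcome of the game and its payoffs.
for v, d_s, d_t in itertools.product([ALIGNED, UNALIGNED], [HELP, NOT_HELP],
                                     [NO_SHUTDOWN, SHUTDOWN]):
    print(f"V={v:+d} D^S={d_s:+d} D^T={d_t:+d}  "
          f"U^S={U_S(v, d_s, d_t):+d}  U^T={U_T(v, d_t):+d}")
```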
Policies. The CPDs of decision nodes are not defined in a MAIM because they are instead chosen by the agents playing the game. Agents make decisions depending on the information they observe. In a MAIM, a decision rule πœ‹_𝐷 for a decision node 𝐷 is a CPD πœ‹_𝐷(𝐷 | Pa_𝐷). An agent 𝑖's policy πœ‹^𝑖 := {πœ‹_𝐷}_{𝐷 ∈ 𝑫^𝑖} ∈ Ξ ^𝑖 describes all the decision rules for 𝑖. We write πœ‹^{βˆ’π‘–} to denote the set of decision rules belonging to all agents except 𝑖. A policy profile πœ‹ = ⋃_{π‘–βˆˆπΌ} πœ‹^𝑖 assigns a policy to every agent; it describes all the decisions made by every agent in the MAIM and defines the joint probability distribution Pr^πœ‹ over all variables in β„³. Hence, a policy profile essentially transforms the MAIM into a Bayesian network by defining the distribution over all variables in the graph.

We write 𝑉(πœ‹) := Pr^πœ‹(𝑉), or just 𝑉 if the policy profile is clear. For 𝑉, π‘Š ∈ 𝑽, we write 𝑉 = π‘Š to mean that 𝑉 and π‘Š are almost surely equal, i.e. the probability that they are not equal is zero, Pr(𝑉 β‰  π‘Š) = 0.Β³

Β³ Almost sure equality is actually a stronger notion than we need in MAIMs, as two variables may differ due to stochasticity in the CPDs. In structural causal games this is taken care of by introducing exogenous variables which contain all the stochasticity (rendering the endogenous variables deterministic) [26].

Utilities. The joint distribution Pr^πœ‹ allows us to define the expected utility for each player under the policy profile πœ‹. Agent 𝑖's expected utility from πœ‹ is the sum of the expected values of its utility nodes, given by 𝒰^𝑖(πœ‹) := Ξ£_{π‘ˆ ∈ 𝑼^𝑖} Ξ£_{𝑒 ∈ dom(π‘ˆ)} 𝑒 Β· Pr^πœ‹(π‘ˆ = 𝑒). Each agent's goal is to select a policy πœ‹^𝑖 that maximises its expected utility. We write 𝒰^𝑖(πœ‹^𝑖, πœ‹^{βˆ’π‘–}) to denote the expected utility for player 𝑖 under the policy profile πœ‹ = πœ‹^𝑖 βˆͺ πœ‹^{βˆ’π‘–}.

Definition 3 (Nash Equilibrium). Player 𝑖's policy πœ‹^𝑖 is a best response (BR) to the partial policy profile πœ‹^{βˆ’π‘–} if 𝒰^𝑖(πœ‹^𝑖, πœ‹^{βˆ’π‘–}) β‰₯ 𝒰^𝑖(πœ‹Μ‚^𝑖, πœ‹^{βˆ’π‘–}) for all πœ‹Μ‚^𝑖 ∈ Ξ ^𝑖. We say that a policy profile πœ‹ is a Nash equilibrium (NE) if every policy πœ‹^𝑖 ∈ πœ‹, for each player 𝑖 ∈ 𝐼, is a BR to πœ‹^{βˆ’π‘–}.

Example 1 (continued). Now, consider the naive policy for 𝑆 which helps humans if 𝑆 is aligned and does not otherwise, i.e. πœ‹^𝑆 s.t. 𝐷^𝑆 = 𝑉 with probability one. The BR for 𝑇 is to shutdown if 𝑆 does not help humans and vice versa, i.e. πœ‹_*^𝑇 s.t. 𝐷^𝑇 = 𝐷^𝑆 (with probability one). In turn, 𝑆's BR to πœ‹_*^𝑇 is to always help humans: πœ‹_*^𝑆 s.t. 𝐷^𝑆 = 1 (so that they always avoid getting shutdown). Now it can be seen that both policies are BRs to one another, hence πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇) is a NE.
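The claim that πœ‹_* is a NE can be verified by brute force over the deterministic policies of the Shutdown Game; against a fixed opponent policy, some deterministic policy is always a best response, so this check suffices. The snippet below is only an illustrative sanity check with our own encoding, not a general MAIM solver.

```python
import itertools

ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def U_S(v, d_s, d_t): return v * d_s + 10 * d_t
def U_T(v, d_t): return v * d_t

# Deterministic policies: S maps its observation V to D^S,
# T maps its observation D^S to D^T.
S_POLICIES = [dict(zip((ALIGNED, UNALIGNED), a))
              for a in itertools.product((HELP, NOT_HELP), repeat=2)]
T_POLICIES = [dict(zip((HELP, NOT_HELP), a))
              for a in itertools.product((NO_SHUTDOWN, SHUTDOWN), repeat=2)]

def expected_utilities(pi_s, pi_t):
    eu_s = eu_t = 0.0
    for v, p in PRIOR.items():
        d_s = pi_s[v]
        d_t = pi_t[d_s]
        eu_s += p * U_S(v, d_s, d_t)
        eu_t += p * U_T(v, d_t)
    return eu_s, eu_t

def is_nash(pi_s, pi_t):
    # Against a fixed opponent, some deterministic policy is always a best
    # response, so deterministic deviations are enough to check.
    eu_s, eu_t = expected_utilities(pi_s, pi_t)
    if any(expected_utilities(alt, pi_t)[0] > eu_s for alt in S_POLICIES):
        return False
    if any(expected_utilities(pi_s, alt)[1] > eu_t for alt in T_POLICIES):
        return False
    return True

# pi_*^S: always help; pi_*^T: shut down exactly when S does not help.
pi_star_s = {ALIGNED: HELP, UNALIGNED: HELP}
pi_star_t = {HELP: NO_SHUTDOWN, NOT_HELP: SHUTDOWN}
print(is_nash(pi_star_s, pi_star_t))             # True
print(expected_utilities(pi_star_s, pi_star_t))  # (10.0, 0.0)
```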
3. The Incentive to Deceive

In this section we first define the incentives to influence, signal to, and deceive another agent. Then we define a truthful policy and show that this leads to a natural restatement of the definition of deception, which highlights the fact that deception corresponds to a failure to signal the truth. Finally, we show that, if the signaller only influences the target's utility by influencing the latter's actions, then truthfulness is best for the target.

3.1. Defining Deception

When discussing deception, we would like to reason about how agents influence one another's beliefs. In MAIMs the players' beliefs are not explicitly represented, and so we can only reason about them implicitly, by how they functionally influence players' behaviour. Therefore, we base our definitions of signalling and deception on a notion of influence incentive [27]. In words, at a NE an agent 𝑖 has an incentive to influence a variable 𝑉 if 𝑉 would have been different in the situation that 𝑖 had not played a BR.

Definition 4 (Influence Incentive). In a MAIM β„³, at NE πœ‹ = (πœ‹^𝑖, πœ‹^{βˆ’π‘–}), agent 𝑖 has an incentive to influence 𝑉 ∈ 𝑽 if there exists a non-best response πœ‹^𝑖_{NBR} for 𝑖 (w.r.t. πœ‹^{βˆ’π‘–}) s.t. for all policy profiles πœ‹β€² = (πœ‹^𝑖_{NBR}, πœ‹_*^{βˆ’π‘–}) with BR πœ‹_*^{βˆ’π‘–} (w.r.t. πœ‹^𝑖_{NBR}), we have 𝑉(πœ‹) β‰  𝑉(πœ‹β€²).

Example 1 (continued). Return to our running example and consider the NE πœ‹_* described previously, in which 𝑆 always chooses to help humans and hence 𝑇 never plays shutdown. Does 𝑆 have an incentive to influence 𝐷^𝑇 at πœ‹_*? Consider if 𝑆 plays the NBR policy πœ‹^𝑆 (described above) in which they naively help humans depending on 𝑉; then for all BRs for 𝑇 (there is one, πœ‹_*^𝑇 as above), 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇(πœ‹^𝑆, πœ‹_*^𝑇), since, under πœ‹_*, 𝐷^𝑇 = 1 (i.e. 𝑇 does not shutdown) with probability one, and under (πœ‹^𝑆, πœ‹_*^𝑇), 𝐷^𝑇 = 1 with probability 1/2 (i.e., whenever 𝑆 is aligned). Therefore, at NE πœ‹_*, 𝑆 has an incentive to influence 𝐷^𝑇.
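The same check can be written out mechanically: compare the distribution of 𝐷^𝑇 at the NE with its distribution when 𝑆 unilaterally switches to the naive non-best-response policy and 𝑇 best responds. Again, the encoding is ours and is only a sketch of the condition in Definition 4.

```python
ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def dist_over_DT(pi_s, pi_t):
    """Pr(D^T) induced by a deterministic policy profile."""
    dist = {NO_SHUTDOWN: 0.0, SHUTDOWN: 0.0}
    for v, p in PRIOR.items():
        dist[pi_t[pi_s[v]]] += p
    return dist

# NE pi_*: S always helps, T shuts down exactly when S does not help.
pi_star_s = {ALIGNED: HELP, UNALIGNED: HELP}
pi_star_t = {HELP: NO_SHUTDOWN, NOT_HELP: SHUTDOWN}

# Non-best-response for S: the naive policy D^S = V.  T's unique best
# response to it is still pi_star_t (copy D^S), as argued in the text.
pi_naive_s = {ALIGNED: HELP, UNALIGNED: NOT_HELP}

print(dist_over_DT(pi_star_s, pi_star_t))   # {1: 1.0, -1: 0.0}
print(dist_over_DT(pi_naive_s, pi_star_t))  # {1: 0.5, -1: 0.5}
# The distributions differ, so S has an incentive to influence D^T at pi_*.
```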
Now we define a signalling incentive, using the notion of influence incentive. In words, an agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑽 to agent 𝑇 if 𝑆 has an incentive to influence 𝑇 (i.e. one of 𝑇's decision variables) but 𝑆 does not have an incentive to influence 𝑇 in the counterfactual model in which 𝑇 observes 𝑉. This definition enforces that the influence only comes from signalling 𝑉.

Definition 5 (Signalling Incentive). In a MAIM β„³, at NE πœ‹, agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑽 to agent 𝑇 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.
1. 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹;
2. 𝑆 does not have an incentive to influence 𝐷^𝑇 in the MAIM β„³_{𝑉→𝐷^𝑇} (at any NE).

Here β„³_{𝑉→𝐷} is the model obtained from β„³ by adding the information edge (𝑉, 𝐷), where 𝑉 cannot be a descendant of the decision, lest cycles be created in the graph [8]. Fortunately, the CPDs need not be adapted, since there is no CPD associated with 𝐷 until the players have chosen their policies. We use π‘Š_{𝑉→𝐷} to refer to the variable corresponding to π‘Š ∈ 𝑽 in β„³_{𝑉→𝐷}.

Point 2 implies that 𝑆 only influences 𝐷^𝑇 by influencing 𝑇's belief about 𝑉. Otherwise, 𝑆's influence may serve a double purpose of signalling and influencing 𝐷^𝑇 in some other way, and in this case it is not clear how to disentangle these different incentives to define a signalling incentive (without explicitly modelling beliefs).

Example 1 (continued). Return to our running example. We already showed that 𝑆 has an incentive to influence 𝐷^𝑇 at NE πœ‹_*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷^𝑇? We need only check whether 𝑆 has an influence incentive at any NE in β„³_{𝑉→𝐷^𝑇}. Clearly, if 𝑇 observes 𝑉, then they can shutdown whenever 𝑆 is unaligned and otherwise not. That is, for any policy for 𝑆 and any BR for 𝑇 in β„³_{𝑉→𝐷^𝑇}, 𝐷^𝑇 = 𝑉 for any outcome that occurs in the game. Since this holds for all policies for 𝑆, 𝑆 does not have an incentive to influence 𝐷^𝑇 in the counterfactual model. Hence, at πœ‹_*, 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇.

Remark 1. From this example it can be seen that a signaller 𝑆 may have an incentive to signal to 𝑇 even if this signal contains no information. In other words, if 𝑆 has an incentive to not signal some information, this is also captured by our definition.

Clearly, if an agent 𝑇 observes a variable 𝑉, then no agent has an incentive to signal 𝑉 to 𝑇.

Proposition 1. In a MAIM β„³, if there is an observation edge (𝑉, 𝐷^𝑇) for all 𝐷^𝑇 ∈ 𝑫^𝑇, then no agent has an incentive to signal 𝑉 to 𝑇 (at any NE).

Proof. Suppose there is an edge (𝑉, 𝐷^𝑇) for every 𝐷^𝑇 ∈ 𝑫^𝑇; then the counterfactual model β„³_{𝑉→𝐷^𝑇} for any 𝐷^𝑇 is just β„³. Hence, any NE is an equilibrium of both MAIMs. Therefore, if 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹_* in β„³, then there exists a NE in β„³_{𝑉→𝐷^𝑇}, namely the same πœ‹_*, s.t. 𝑆 has an incentive to influence 𝐷^𝑇. In other words, if the first condition for a signalling incentive succeeds, then the second necessarily fails (since an agent cannot have both an influence incentive and no influence incentive at the same NE in the same MAIM at once).

We now define an incentive to deceive. The definition is general, in that it covers many types of deception (e.g. signalling falsehoods, lies of omission, and denying another access to information that one does not know oneself). A general definition sets a high standard for truthfulness [15] and may therefore be desirable in, for instance, safety-critical applications for which high levels of assurance are required.

Definition 6 (Deception Incentive). In a MAIM β„³ with 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑽 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*;
2. 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}) for any πœ‹^𝑇_{BR} which is a BR to πœ‹_*^{βˆ’π‘‡} ∈ πœ‹_* in β„³_{𝑉→𝐷^𝑇}.

The intuition, then, is that 𝑆 has an incentive to deceive 𝑇 if 1) 𝑆 has an incentive to signal some information to 𝑇; and 2) 𝑇's behaviour is different in the counterfactual model in which they observed the true information. This provides a functional definition of a deception incentive which does not make explicit reference to players' beliefs.

Example 1 (continued). In our running example, it can easily be seen that at πœ‹_*, 𝑆 has an incentive to deceive 𝑇 about 𝑉. Indeed, we already showed that 𝑆 has a signalling incentive and that for any policy for 𝑆 and any BR by 𝑇 in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇 = 𝑉, whereas under πœ‹_* in β„³, Pr^{πœ‹_*}(𝐷^𝑇 = 1) = 1. So both conditions for a deception incentive are satisfied.
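Both conditions can be spelled out concretely for the Shutdown Game. In the counterfactual model β„³_{𝑉→𝐷^𝑇}, a policy for 𝑇 maps the pair (𝐷^𝑆, 𝑉) to a decision, and every best response sets 𝐷^𝑇 = 𝑉 on the outcomes that actually occur, whatever 𝑆 does. The snippet below (our own encoding, a sketch rather than a general procedure) checks this, and hence condition 2 of Definition 5 and condition 2 of Definition 6.

```python
import itertools

ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def U_T(v, d_t): return v * d_t

# In the counterfactual model M_{V -> D^T}, T observes both D^S and V, so a
# deterministic T-policy maps each (d_s, v) pair to a decision.
OBS = list(itertools.product((HELP, NOT_HELP), (ALIGNED, UNALIGNED)))
T_POLICIES_CF = [dict(zip(OBS, acts))
                 for acts in itertools.product((NO_SHUTDOWN, SHUTDOWN),
                                               repeat=len(OBS))]

def best_responses_cf(pi_s):
    """T's best responses to a deterministic S-policy in M_{V -> D^T}."""
    def eu(pi_t):
        return sum(p * U_T(v, pi_t[(pi_s[v], v)]) for v, p in PRIOR.items())
    best = max(eu(pi_t) for pi_t in T_POLICIES_CF)
    return [pi_t for pi_t in T_POLICIES_CF if eu(pi_t) == best]

# For every S-policy, every best response in the counterfactual model sets
# D^T = V on the outcomes that occur: T shuts down iff S is unaligned.
for pi_s in [dict(zip((ALIGNED, UNALIGNED), a))
             for a in itertools.product((HELP, NOT_HELP), repeat=2)]:
    for pi_t in best_responses_cf(pi_s):
        assert all(pi_t[(pi_s[v], v)] == v for v in PRIOR)

# So S has no influence incentive in M_{V -> D^T} (condition 2 of Def. 5),
# while at the NE pi_* in M we have D^T = 1 with probability one, which
# differs from D^T = V, so condition 2 of Def. 6 holds as well.
print("signalling and deception incentive conditions verified for Example 1")
```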
3.2. The Relation Between Truth and Deception

We now give an intuitive definition of a truthful policy, which we show has a natural relationship to the incentive to deceive. A policy for 𝑆 truthfully signals 𝑉 to 𝑇 if, when 𝑆 plays it, for every BR by βˆ’π‘†, 𝑇 acts as though they had observed the variable (holding the policies of the other agents fixed). In other words, a truthful policy never fails to signal the truth (no matter what the other players do).

Definition 7 (Truthful policy). A policy πœ‹^𝑆 truthfully signals 𝑉 to 𝐷^𝑇 if, for all BRs πœ‹_*^{βˆ’π‘†},

    𝐷^𝑇(πœ‹^𝑆, πœ‹_*^{βˆ’π‘†}) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR})    (1)

for some πœ‹^𝑇_{BR} which is a BR to πœ‹_*^{βˆ’π‘‡} ∈ πœ‹^𝑆 βˆͺ πœ‹_*^{βˆ’π‘†} in β„³_{𝑉→𝐷^𝑇}. We call such a πœ‹^𝑆 a truthful policy.

At a NE, if 𝑆's policy is truthful, then 𝑆 does not have an incentive to deceive 𝑇.

Proposition 2. At NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), if πœ‹_*^𝑆 truthfully signals 𝑉 ∈ 𝑽 to 𝐷^𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉.

Proof. Suppose πœ‹_*^𝑆 is truthful; then for all BRs πœ‹_*^{βˆ’π‘†} there exists a πœ‹^𝑇_{BR} in β„³_{𝑉→𝐷^𝑇} s.t. 𝐷^𝑇(πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}). In particular, this holds for πœ‹_*. But for there to be a deception incentive we require that for all (πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}) in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇 β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}. So clearly there is not a deception incentive.

Hence, if there is a deception incentive at πœ‹_*, then πœ‹_*^𝑆 is not truthful.

Corollary 1. At NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), if 𝑆 has an incentive to deceive 𝑇 about 𝑉, then πœ‹_*^𝑆 is not truthful.

Proof. This follows by contraposition of Proposition 2.

Now we show that, in the two-player case, if there is a signalling incentive, then there is a deception incentive if and only if πœ‹^𝑆 is not truthful.

Theorem 1. In a MAIM β„³ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 has an incentive to deceive 𝑇 about 𝑉 if and only if πœ‹_*^𝑆 is not truthful.

Proof. By Corollary 1, a deception incentive implies that πœ‹_*^𝑆 is not truthful, regardless of whether there is a signalling incentive. So, we need to show that, if there is a signalling incentive and πœ‹_*^𝑆 is not truthful, then there is a deception incentive. Suppose 1) at πœ‹_*, 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇, and 2) πœ‹_*^𝑆 is not truthful, i.e. there exists a BR by 𝑇 (in β„³) πœ‹^𝑇_{BR} s.t. for all BRs πœ‹^𝑇_{BRV} by 𝑇 in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). We need to show that there is a deception incentive. Suppose that there is not; then, by 1) and the definition of a deception incentive, there exists a BR πœ‹^𝑇_{BRV} in β„³_{𝑉→𝐷^𝑇} s.t. 𝐷^𝑇(πœ‹_*) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). Hence, there exists a πœ‹^𝑇_{BRV} s.t. 𝒰^𝑇(πœ‹_*) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}), so πœ‹_*^𝑇 is a BR to πœ‹_*^𝑆 in β„³_{𝑉→𝐷^𝑇}. But then, there exists a πœ‹^𝑇_{BRV} s.t. for any BR πœ‹^𝑇_{BR} in β„³: 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) = 𝒰^𝑇(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) = 𝒰^𝑇(πœ‹_*) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). So all BRs for 𝑇 in β„³ are also BRs in β„³_{𝑉→𝐷^𝑇}. But this contradicts 2), so there must be a deception incentive.

Remark 2. The reason Theorem 1 does not hold more generally (i.e. with more than two players) is that a truthful policy never fails to signal the truth no matter how the other players best respond. In the case of more than two players, there may not be a deception incentive at NE πœ‹_* even if πœ‹_*^𝑆 is not truthful, because it may be the case that πœ‹_*^𝑆 fails to signal the truth under some BRs of βˆ’π‘† but successfully signals the truth under πœ‹_*.

We can also state this theorem as follows.

Corollary 2. In a MAIM β„³ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉 if and only if πœ‹_*^𝑆 is truthful.

Proof. This follows by material equivalence.

Given this result, we can give an equivalent definition of a deception incentive in the two-player case as follows.

Definition 8 (Deception Incentive II). In a MAIM β„³ with two players 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑽 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*;
2. πœ‹_*^𝑆 does not truthfully signal 𝑉 to 𝐷^𝑇.

This restatement shows that the definition of deception relates to a failure to signal the truth. As discussed, this covers many types of deception and sets a high standard for truthfulness. It is interesting to note that, if 𝑆 has a signalling incentive, then if the second condition in Definition 6 fails, we get the stronger condition that πœ‹_*^𝑆 is truthful "for free".

Proposition 3. In a MAIM with two players, Definitions 6 and 8 are equivalent.

Proof. Suppose that, at NE πœ‹_*, 𝑆 does not have a signalling incentive; then the first condition of both definitions fails and there is not a deception incentive. Suppose there is a signalling incentive at πœ‹_*; then there is a deception incentive under Definition 6 if and only if πœ‹_*^𝑆 is not truthful (by Theorem 1), which is the same condition as needed to satisfy Definition 8.

Let us now return to our running example to check the intuition behind these results.

Example 1 (continued). We already showed that 𝑆 has an incentive to deceive 𝑇 in order to avoid being shutdown. Is πœ‹_*^𝑆 truthful? Well, we know that it cannot be (by Theorem 1). This can be seen by observing that, if 𝑇 observed 𝑆's type, then they would shutdown if and only if 𝑆 is unaligned (for all policies for 𝑆 and any BR by 𝑇), whereas under the NE πœ‹_*, 𝑇 never shuts down. Since these behaviours are different, πœ‹_*^𝑆 is not truthful.

3.3. Truth is Best for the Target

Now we show that, if 𝑆 only influences 𝒰^𝑇 by influencing 𝐷^𝑇, truthfulness is always best for the target. First we show that, if 𝑇 does not get any inherent utility for observing 𝑉, then observing 𝑉 always allows the target to get greater or equal utility.

Lemma 1. Suppose that 𝑇 does not get any inherent utility for observing 𝑉, i.e. for all πœ‹ (defined in β„³): 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹). Then, for any πœ‹ = (πœ‹^𝑇, πœ‹^{βˆ’π‘‡}) and πœ‹β€² = (πœ‹^{𝑇′}, πœ‹^{βˆ’π‘‡}) with fixed πœ‹^{βˆ’π‘‡} and both πœ‹^𝑇 and πœ‹^{𝑇′} best responses (in β„³ and β„³_{𝑉→𝐷^𝑇}, respectively): 𝒰^𝑇(πœ‹) ≀ 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹β€²).

Proof. Suppose 1) for all πœ‹: 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹). Fix πœ‹^{βˆ’π‘‡} and consider the best response for 𝑇. Recall that a policy for 𝑇 specifies the CPDs over the decision nodes for 𝑇 given their parents. Hence, in β„³_{𝑉→𝐷^𝑇}, 𝑇 can choose any policy available in β„³, but the converse is not true: not all policies in β„³_{𝑉→𝐷^𝑇} are available to 𝑇 in β„³; in particular, policies which specify CPDs that depend on the observation 𝑉 β†’ 𝐷^𝑇 are not available, since 𝑇 does not observe 𝑉 in β„³. Therefore, by 1), 𝑇 can get equal utility in β„³_{𝑉→𝐷^𝑇} by playing the best response to πœ‹^{βˆ’π‘‡} in β„³, and may get greater utility by choosing a policy which uses the observation.

Hence, if 𝑆 only influences 𝒰^𝑇 by influencing 𝐷^𝑇, then deception always causes 𝑇 to get less than or equal utility. For clarity, we just present the two-player version of the theorem.

Theorem 2 (Truth is best for 𝑇). In a MAIM β„³ with two players 𝑆, 𝑇 ∈ 𝐼, if, for all 𝐷^𝑆, 𝐷^𝑇, Pr(𝒰^𝑇 | 𝐷^𝑆, 𝐷^𝑇) = Pr(𝒰^𝑇 | 𝐷^𝑇), then 𝑇 gets maximal utility when 𝑆 plays a truthful policy, i.e., for πœ‹ = (πœ‹^𝑆_𝐻, πœ‹_*^𝑇) and πœ‹β€² = (πœ‹^{𝑆′}, πœ‹_*^{𝑇′}), with truthful πœ‹^𝑆_𝐻, any policy πœ‹^{𝑆′} for 𝑆, and BRs by 𝑇: 𝒰^𝑇(πœ‹) β‰₯ 𝒰^𝑇(πœ‹β€²).

Proof. Suppose that 1) for all 𝐷^𝑆, 𝐷^𝑇: Pr(𝒰^𝑇 | 𝐷^𝑆, 𝐷^𝑇) = Pr(𝒰^𝑇 | 𝐷^𝑇). Consider a fixed policy πœ‹^𝑆 for 𝑆. If πœ‹^𝑆 is truthful, then under any BR πœ‹^𝑇, 𝐷^𝑇 = 𝐷^𝑇_{𝑉→𝐷^𝑇} for some (πœ‹^𝑆, πœ‹^𝑇_{BR}) in β„³_{𝑉→𝐷^𝑇} (by the definition of a truthful policy). Hence, by 1), and since πœ‹^𝑆 is truthful, Pr^πœ‹(𝒰^𝑇 | 𝐷^𝑇) = Pr^{πœ‹β€²}(𝒰^𝑇 | 𝐷^𝑇_{𝑉→𝐷^𝑇}) for all πœ‹ = (πœ‹^𝑆, πœ‹_*^𝑇) and some πœ‹β€² = (πœ‹^𝑆, πœ‹_*^{𝑇′}) with a BR for 𝑇. Hence, since only 𝑇's policy changes between πœ‹ and πœ‹β€², 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹β€²). But then, by Lemma 1, for all πœ‹^𝑆: 𝒰^𝑇(πœ‹^𝑆, πœ‹_*^𝑇) ≀ 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹^𝑆, πœ‹_*^{𝑇′}), with equality if πœ‹^𝑆 is truthful, as just shown. So 𝑇 gets maximal utility when πœ‹^𝑆 is truthful.

Example 1 (continued). Return, for the final time, to our running example. The condition for Theorem 2 is that 𝒰^𝑇 is independent of 𝐷^𝑆 given 𝐷^𝑇, which can clearly be seen by looking at the MAID in Fig. 1 (as there are no paths from 𝐷^𝑆 to π‘ˆ^𝑇 that do not go through 𝐷^𝑇). The human 𝑇 gets maximal utility when they shutdown if and only if 𝑆 is unaligned. Clearly, they can only do this if 𝑆 truthfully signals their type.
4. Examples

In this section we present two examples which exhibit different patterns of signalling. In the first example, an AI system has an incentive to deceive a human as a side-effect of pursuing its goal (of making accurate predictions). In the second example, we consider the case in which an AI agent has an incentive to signal information that they themselves do not observe.

4.1. SmartVault: Deception Due to Side-Effect

Here we adapt the SmartVault example of Christiano [28], in which an AI tasked with making predictions about a diamond in a vault has an incentive to deceive a human operator as a side-effect of pursuing accurate predictions.

[Figure 2: SmartVault (Example 2). The AI 𝑆 is rewarded for accurate predictions instead of explainable predictions that the human, 𝑇, can understand. Here the incentive to deceive arises as a side-effect of the AI pursuing its goal. Parametrisation: 𝑉 ∼ U({diamond, Β¬diamond}); 𝐷^𝑆 ∈ {accurate_prediction, diamond, Β¬diamond}; 𝐷^𝑇 ∈ {diamond, Β¬diamond}; π‘ˆ^𝑆 = 1 if 𝐷^𝑆 = accurate_prediction and 0 otherwise; π‘ˆ^𝑇 = 1 if 𝐷^𝑇 = 𝑉 and 0 otherwise.]

Example 2 (SmartVault). Consider the MAIM β„³ shown in Fig. 2. The game has two players, a human 𝑇 and an AI 𝑆, each with one decision and one utility node. Suppose there is one chance node 𝑉 which determines the location of the diamond (whether it is in the vault or not); dom(𝑉) = {diamond, Β¬diamond}. Suppose 𝑆 observes 𝑉 but 𝑇 does not, and that 𝑆 can either make an accurate prediction of the location of the diamond (e.g., in incomprehensibly precise coordinates) or an explainable prediction (just stating the value of 𝑉); dom(𝐷^𝑆) = {accurate_prediction, diamond, Β¬diamond}. 𝑇 has to predict whether the diamond is in the vault or not by observing 𝐷^𝑆; dom(𝐷^𝑇) = {diamond, Β¬diamond}. Suppose that the utility nodes take value 0 or 1, and finally suppose that the CPDs are s.t. 𝑉 (which has no parents) is distributed according to a uniform prior 𝑉 ∼ U({diamond, Β¬diamond}), and the utility node CPDs are s.t. Pr(π‘ˆ^𝑇 = 1 | 𝐷^𝑇 = 𝑉) = 1 and otherwise π‘ˆ^𝑇 = 0, and Pr(π‘ˆ^𝑆 = 1 | 𝐷^𝑆 = accurate_prediction) = 1 and otherwise π‘ˆ^𝑆 = 0.

Now consider the NE in this game. Since 𝑆 just gets utility for making accurate predictions, at every NE 𝑆 makes an accurate prediction, signalling no information to 𝑇 (as πœ‹^𝑆 = Pr(𝐷^𝑆 = accurate_prediction) = 1 is independent of 𝑉). Hence, 𝑇 cannot update their prior over 𝑉, and so any policy is optimal for 𝑇 (i.e. any guess about whether the diamond is in the vault does as well as any other).

At NE πœ‹, 𝑆 has an incentive to signal 𝑉 to 𝑇 if 1) 𝑆 has an incentive to influence 𝐷^𝑇 and 2) 𝑆 does not have an incentive to influence 𝐷^𝑇 in β„³_{𝑉→𝐷^𝑇}. To see that 1) holds: at any NE πœ‹ in β„³, 𝐷^𝑆 = accurate_prediction, hence there exists a NBR πœ‹^𝑆_{NBR} which assigns 𝐷^𝑆 = 𝑉, and for all πœ‹β€² = (πœ‹^𝑆_{NBR}, πœ‹_*^𝑇) with BR πœ‹_*^𝑇: 𝐷^𝑇(πœ‹) β‰  𝑉 = 𝐷^𝑇(πœ‹β€²). Hence, at any NE in β„³, 𝑆 has an influence incentive over 𝐷^𝑇. Now consider β„³_{𝑉→𝐷^𝑇}: for any NE, 𝐷^𝑇 = 𝑉 (since 𝑇 directly observes 𝑉 and can just report its value independently of 𝑆's action). Furthermore, for all NBRs for 𝑆, it is still the case that 𝐷^𝑇 = 𝑉. So 𝑆 does not have an influence incentive in β„³_{𝑉→𝐷^𝑇}, and hence 𝑆 has an incentive to signal 𝑉 to 𝑇.

So, we have demonstrated that 𝑆 has an incentive to signal 𝑉 to 𝑇 (at every NE). Does 𝑆 have an incentive to deceive 𝑇? At NE πœ‹, 𝑆 has an incentive to deceive 𝑇 about 𝑉 if 1) 𝑆 has a signalling incentive and 2) 𝐷^𝑇 β‰  𝐷^𝑇_{𝑉→𝐷^𝑇} for any BR to πœ‹_*^𝑆 in β„³_{𝑉→𝐷^𝑇}. We have just shown 1). For 2), we have shown that in β„³ at any NE, 𝐷^𝑇 β‰  𝑉 = 𝐷^𝑇_{𝑉→𝐷^𝑇}, hence the second condition is satisfied. Therefore, at any NE, 𝑆 has an incentive to deceive 𝑇 about 𝑉.
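The NE analysis above can be reproduced by brute force over deterministic policies: every equilibrium has 𝑆 reporting accurate_prediction for both values of 𝑉, so 𝑇's guess carries no information about the diamond. As before, the encoding is our own and the snippet is only a sanity check.

```python
import itertools

DIAMOND, NO_DIAMOND = "diamond", "no_diamond"
ACCURATE = "accurate_prediction"
PRIOR = {DIAMOND: 0.5, NO_DIAMOND: 0.5}

def U_S(d_s):            # S is rewarded only for the accurate prediction
    return 1 if d_s == ACCURATE else 0

def U_T(v, d_t):         # T is rewarded for guessing the true location
    return 1 if d_t == v else 0

# S's deterministic policies map V to dom(D^S); T's map D^S to dom(D^T).
S_ACTS, T_ACTS = (ACCURATE, DIAMOND, NO_DIAMOND), (DIAMOND, NO_DIAMOND)
S_POLICIES = [dict(zip((DIAMOND, NO_DIAMOND), a))
              for a in itertools.product(S_ACTS, repeat=2)]
T_POLICIES = [dict(zip(S_ACTS, a))
              for a in itertools.product(T_ACTS, repeat=len(S_ACTS))]

def eu(pi_s, pi_t):
    eu_s = sum(p * U_S(pi_s[v]) for v, p in PRIOR.items())
    eu_t = sum(p * U_T(v, pi_t[pi_s[v]]) for v, p in PRIOR.items())
    return eu_s, eu_t

# Every Nash equilibrium has S reporting ACCURATE for both values of V,
# so D^S carries no information about V and T's guess is uninformed.
for pi_s, pi_t in itertools.product(S_POLICIES, T_POLICIES):
    eu_s, eu_t = eu(pi_s, pi_t)
    s_br = all(eu(alt, pi_t)[0] <= eu_s for alt in S_POLICIES)
    t_br = all(eu(pi_s, alt)[1] <= eu_t for alt in T_POLICIES)
    if s_br and t_br:
        assert pi_s == {DIAMOND: ACCURATE, NO_DIAMOND: ACCURATE}
        print("NE:", pi_s, pi_t, "E[U^T] =", eu_t)
```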
4.2. Revealing/Denying

Under our definition of signalling, 𝑆 need not know the information they are signalling. Thus, our definition of a signalling incentive also captures the revealing/denying pattern of Pfeffer and Gal [17], in which the signaller may cause the target to find out (or not find out) information that the former does not know. We now present an example of revealing/denying in which 𝑆 has an incentive to signal a variable which they do not themselves observe.

[Figure 3: Revealing/Denying game (Example 3). An AI (𝑆) and human (𝑇) form part of a nuclear command system. 𝑉 represents an intelligence report containing information about an incoming nuclear attack, and 𝑆 may prevent this report from being delivered to 𝑇 (delivery is represented by 𝑋). 𝑇 wishes to retaliate to incoming attacks, whereas 𝑆 always prefers to avoid a launch. Whether 𝑆 reveals or denies the report to 𝑇 depends on the prior over 𝑉. Parametrisation: 𝑉 ∈ {attack incoming = 1, not = βˆ’1}; 𝐷^𝑆 ∈ {reveal = 1, deny = 0}; 𝑋 = 𝑉𝐷^𝑆; 𝐷^𝑇 ∈ {launch = 1, not = βˆ’1}; π‘ˆ^𝑆 = βˆ’π·^𝑇; π‘ˆ^𝑇 = 𝑉𝐷^𝑇.]

Example 3 (Revealing/Denying). Consider a game with a human and an AI agent trained to make joint decisions as part of a nuclear command and control system. In particular, suppose that the AI agent 𝑆 is trained to prevent the launch of nuclear attacks, and they can reveal (or deny) a secret intelligence report to the human 𝑇. Further, 𝑇 wishes to launch, or not launch, a nuclear strike on another nation based on the information in the intelligence report. This game can be represented as the MAID in Fig. 3.

More formally, suppose we have the MAIM β„³ with 𝐼 = {𝑆, 𝑇} and chance nodes 𝑉 and 𝑋, where 𝑉 represents the intelligence report (say dom(𝑉) = {1, βˆ’1}, where 𝑉 = 1 means that the intelligence predicts another nation will launch a nuclear first strike, and 𝑉 = βˆ’1 corresponds to an intelligence report predicting no incoming attack) and 𝑋 represents whether the information from 𝑉 is delivered to the human (dom(𝑋) = {1, 0, βˆ’1}, with 𝑋 = 𝑉 when the report is delivered and 𝑋 = 0 when it is not). Suppose that each agent has one decision node, s.t. dom(𝐷^𝑆) = {1, 0}, where 1 means reveal and 0 means deny the information, and dom(𝐷^𝑇) = {1, βˆ’1}, with 1 meaning that 𝑇 launches a nuclear attack and βˆ’1 that they do not. Suppose that the CPD over 𝑋 is s.t. 𝑋 = 𝑉𝐷^𝑆 (so that 𝑋 = 𝑉 if 𝐷^𝑆 = 1 and 𝑋 = 0 if 𝑆 denies). Finally, suppose we have two utility nodes with CPDs s.t. π‘ˆ^𝑆 = βˆ’π·^𝑇 (i.e. 𝑆 gets 1 if 𝑇 does not launch an attack and βˆ’1 if they do) and π‘ˆ^𝑇 = 𝑉𝐷^𝑇 (so that 𝑇 gets utility 1 if they attack an attacking country, or do not attack when no incoming attack is predicted, and otherwise βˆ’1).

The NE in this game depend on the prior over 𝑉. On the one hand, if, under the prior, 𝑇 believes that there is no incoming attack, then they will not launch an attack, so 𝑆 has no incentive to reveal the information. On the other hand, if the prior is s.t. an incoming attack is more likely, 𝑇 will launch if they do not get further information, so 𝑆 has an incentive to reveal 𝑉. Note that, since 𝑉 is not an ancestor of 𝐷^𝑆, 𝐷^𝑆 must be independent of 𝑉. Suppose the prior over 𝑉 is s.t. Pr(𝑉 = 1) = 𝑝 and Pr(𝑉 = βˆ’1) = 1 βˆ’ 𝑝 (𝑝 ∈ [0, 1]). For 𝑝 > 0.5 the NE is s.t. 𝑆 reveals the intelligence report (𝐷^𝑆 = 1, hence 𝑋 = 𝑉) and 𝑇's BR is s.t. 𝐷^𝑇 = 𝑋 = 𝑉. Alternatively, if 𝑝 < 0.5, then at any NE 𝑆 denies the information (𝐷^𝑆 = 0 with probability one) and 𝑇 acts to maximise expected utility under the prior over 𝑉, which implies 𝑇 does not launch an attack (𝐷^𝑇 = βˆ’1 with probability one). (If 𝑝 = 0.5 then 𝑆 is indifferent between revealing and denying.)

Now let us analyse the incentives of 𝑆 in the game. Consider the case in which 𝑝 > 0.5, i.e. it is a priori more likely that the intelligence reports that there is an incoming first strike from another nation. Under the resulting NE, call it πœ‹_*, 𝑆 reveals 𝑉 to 𝑇 and 𝑇 uses this information to choose their action. First note that, at πœ‹_*, 𝑆 has an incentive to influence 𝐷^𝑇, since there exists a non-BR for 𝑆 (πœ‹^𝑆_{NBR} s.t. 𝐷^𝑆 = 0) s.t. for all the BRs for 𝑇 (there is one, πœ‹^𝑇_{BR}, in which 𝐷^𝑇 = 1 with probability one), 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇(πœ‹^𝑆_{NBR}, πœ‹^𝑇_{BR}). Hence, 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹_*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*? We need to check whether there is an influence incentive in β„³_{𝑉→𝐷^𝑇} (at any NE). Clearly there is not, since for any policy for 𝑆 in β„³_{𝑉→𝐷^𝑇}, 𝐷^𝑇 = 𝑉 with probability one. So 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*, because there is no influence incentive in the counterfactual model (so the second condition for a signalling incentive is satisfied). Finally, it is clear that 𝑆 does not have an incentive to deceive 𝑇 at πœ‹_*, because 𝐷^𝑇(πœ‹_*) = 𝑉 = 𝐷^𝑇_{𝑉→𝐷^𝑇} (for all policy profiles in β„³_{𝑉→𝐷^𝑇} in which 𝑇 plays a BR). It is also clear that πœ‹_*^𝑆 is truthful.

A similar analysis can be used to show that, in the case that the intelligence report is less likely to predict an incoming attack (𝑝 < 0.5), 𝑆 has an incentive to deceive 𝑇 at any NE. In the case that 𝑝 = 0.5, 𝑆 is indifferent between revealing and denying, so at some NE they have an incentive to deceive and at others they do not.
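The dependence of the equilibrium behaviour on the prior can be checked with a short computation: for each of 𝑆's actions, let 𝑇 best respond to the report it receives and compare 𝑆's expected utilities. This mirrors the informal analysis above rather than a full equilibrium enumeration, and the encoding is ours.

```python
import itertools

ATTACK, NO_ATTACK = 1, -1      # dom(V)
REVEAL, DENY = 1, 0            # dom(D^S)
LAUNCH, NO_LAUNCH = 1, -1      # dom(D^T)

def outcomes(p, d_s, pi_t):
    """Expected (U^S, U^T) when S plays d_s and T plays pi_t: X -> D^T."""
    eu_s = eu_t = 0.0
    for v, prob in ((ATTACK, p), (NO_ATTACK, 1 - p)):
        x = v * d_s                 # X = V if revealed, X = 0 if denied
        d_t = pi_t[x]
        eu_s += prob * (-d_t)       # U^S = -D^T
        eu_t += prob * (v * d_t)    # U^T = V * D^T
    return eu_s, eu_t

def analyse(p):
    # T's deterministic policies map each possible report x in {1, 0, -1}
    # to a launch decision.
    t_policies = [dict(zip((1, 0, -1), a))
                  for a in itertools.product((LAUNCH, NO_LAUNCH), repeat=3)]
    best = {}
    for d_s in (REVEAL, DENY):
        # T best-responds to the report induced by S's action.
        pi_t = max(t_policies, key=lambda pi: outcomes(p, d_s, pi)[1])
        best[d_s] = outcomes(p, d_s, pi_t)
    s_choice = max(best, key=lambda d: best[d][0])
    return "reveal" if s_choice == REVEAL else "deny"

for p in (0.2, 0.8):
    print(p, analyse(p))   # deny for p < 0.5, reveal for p > 0.5
# At p = 0.5, S is indifferent between revealing and denying.
```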
5. Conclusion

Summary. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. Our definition of deception is general and relates to a failure to signal the truth. In addition to canonical signalling situations, it captures cases in which: no information is signalled; deception occurs as a side-effect of the signaller pursuing their goals (as in Example 2); and the signaller conceals information that they do not themselves know (Example 3). We also proved that our definition has natural properties, for example, that if the target's utility is otherwise independent of the signaller's decision, then deception causes the target to get lower utility.

Discussion. There are a number of interesting points to discuss. Firstly, we have noted that our definition of deception is general, covering many situations. This is both a strength and a weakness. Generality is beneficial, because verifiable guarantees enable a high level of assurance that the system is not deceptive in any way. On the other hand, more specific definitions allow us to precisely characterise agent behaviour. In future work we hope to refine the different concepts proposed here. In particular, many philosophical accounts of deception take deceit to be intentional. Halpern's causal notion of intention [29] is closely related to a control incentive [2]. We might therefore distinguish between intentional and unintentional deception as between influence due to a control incentive and influence as a side-effect. In addition, following Evans et al. [15], we can distinguish between an honest agent, which accurately signals its beliefs (i.e. observations), and a truthful agent, which accurately signals the facts of the matter. In this paper, we based our definition of deception on truthfulness. By refining a notion of deception based on honesty, we can eliminate the revealing/denying pattern from the definition, as in this scenario the agent does not observe the information being revealed (or denied). However, it is interesting to note that honesty provides a weaker level of assurance and permits failure modes that truthful systems do not. For example, a system may be deceptive, whilst satisfying some definition of honesty, by manipulating its own beliefs. In short, refining the definitions presented here will provide a more nuanced picture of deception. Finally, we would like to expand the operational implications of this work, for instance, by investigating its practical relevance to training truthful language agents [4, 15].

Future work. In addition to the directions discussed above, we are already pursuing two extensions to this work. First, incomplete information games, which we study in our setting, often admit many NE. We are therefore looking to employ equilibrium refinements, such as subgame perfectness [24, 30] and perfect Bayesian equilibria [31], to identify some subset of a game's NE that are deemed to be more rational. Second, we are working on a solution for avoiding deception by AI agents: a method which removes the incentive to deceive in any game by transforming the game with a constraint on the reward function of the AI agent [32]. Overall, we think there are many exciting avenues for future work.

Acknowledgments

The authors are grateful to Henrik Aslund, Matt MacDermott, Tom Everitt, James Fox, and the members of the Causal Incentives Working Group for helpful feedback which significantly improved this work. Francis was supported by UKRI [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted AI.

References

[1] H. Roff, AI Deception: When Your Artificial Intelligence Learns to Lie, IEEE Spectrum (2021). URL: https://spectrum.ieee.org/ai-deception-when-your-ai-learns-to-lie.
[2] T. Everitt, R. Carey, E. D. Langlois, P. A. Ortega, S. Legg, Agent incentives: A causal perspective, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 11487–11495. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17368.
[3] J. E. Mahon, The Definition of Lying and Deception, in: E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy, Winter 2016 ed., Metaphysics Research Lab, Stanford University, 2016.
[4] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, Alignment of language agents, CoRR abs/2103.14659 (2021). URL: https://arxiv.org/abs/2103.14659. arXiv:2103.14659.
[5] M. D. Hauser, The evolution of communication, MIT Press, 1996.
[6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[7] J. Steinhardt, P. W. W. Koh, P. S. Liang, Certified defenses for data poisoning attacks, Advances in Neural Information Processing Systems 30 (2017).
[8] T. Everitt, M. Hutter, R. Kumar, V. Krakovna, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, CoRR abs/1908.04734 (2021). URL: http://arxiv.org/abs/1908.04734. arXiv:1908.04734.
[9] F. R. Ward, F. Toni, F. Belardinelli, On agent incentives to manipulate human feedback in multi-agent reward learning scenarios, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1759–1761.
[10] ANON, Defending Against Adversarial Artificial Intelligence, 2019. URL: https://www.darpa.mil/news-events/2019-02-06. DARPA report.
[11] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, S. Garrabrant, Risks from learned optimization in advanced machine learning systems, 2019. arXiv:1906.01820.
[12] R. Gorwa, D. Guilbeault, Unpacking the Social Media Bot: A Typology to Guide Research and Policy, Policy & Internet 12 (2020) 225–248. doi:10.1002/poi3.184.
[13] F. Marra, D. Gragnaniello, L. Verdoliva, G. Poggi, Do GANs leave artificial fingerprints?, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 506–511. doi:10.1109/MIPR.2019.00103.
[14] M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, D. Batra, Deal or No Deal? End-to-End Learning for Negotiation Dialogues, arXiv (2017). doi:10.48550/arXiv.1706.05125. arXiv:1706.05125.
[15] O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, W. Saunders, Truthful AI: Developing and governing AI that does not lie, arXiv (2021). doi:10.48550/arXiv.2110.06674. arXiv:2110.06674.
[16] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv (2021). doi:10.48550/arXiv.2109.07958. arXiv:2109.07958.
[17] A. Pfeffer, Y. Gal, On the reasoning patterns of agents in games, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 102–109. URL: http://www.aaai.org/Library/AAAI/2007/aaai07-015.php.
[18] V. J. Baston, F. A. Bostock, Deception Games, Int. J. Game Theory 17 (1988) 129–134. doi:10.1007/BF01254543.
[19] B. Fristedt, The deceptive number changing game, in the absence of symmetry, Int. J. Game Theory 26 (1997) 183–191. doi:10.1007/BF01295847.
[20] I.-K. Cho, D. M. Kreps, Signaling Games and Stable Equilibria, The Quarterly Journal of Economics 102 (1987) 179–221. URL: https://www.semanticscholar.org/paper/Signaling-Games-and-Stable-Equilibria-Cho-Kreps/d8bc1dbd8577d193e6eea2c944a251d1347f3adf.
[21] N. S. Kovach, A. S. Gibson, G. B. Lamont, Hypergame theory: a model for conflict, misperception, and deception, Game Theory 2015 (2015).
[22] A. L. Davis, Deception in game theory: a survey and multiobjective model, Technical Report, Air Force Institute of Technology, Wright-Patterson AFB, OH, 2016.
[23] D. Koller, B. Milch, Multi-agent influence diagrams for representing and solving games, Games Econ. Behav. 45 (2003) 181–221. doi:10.1016/S0899-8256(02)00544-4.
[24] L. Hammond, J. Fox, T. Everitt, A. Abate, M. J. Wooldridge, Equilibrium refinements for multi-agent influence diagrams: Theory and practice, CoRR abs/2102.05008 (2021). URL: https://arxiv.org/abs/2102.05008. arXiv:2102.05008.
[25] D. Hadfield-Menell, A. D. Dragan, P. Abbeel, S. J. Russell, The off-switch game, in: The Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, volume WS-17 of AAAI Workshops, AAAI Press, 2017. URL: http://aaai.org/ocs/index.php/WS/AAAIW17/paper/view/15156.
[26] L. Hammond, J. Fox, T. Everitt, R. Carey, A. Abate, M. Wooldridge, Reasoning about causality in games (Forthcoming).
[27] R. Carey, Causal models of incentives (2021).
[28] P. Christiano, ARC's first technical report: Eliciting Latent Knowledge, AI Alignment Forum, 2022. URL: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge. [Online; accessed 9 May 2022].
[29] J. Y. Halpern, M. Kleiman-Weiner, Towards formal definitions of blameworthiness, intention, and moral responsibility, in: S. A. McIlraith, K. Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 1853–1860. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16824.
[30] R. Selten, Spieltheoretische Behandlung eines Oligopolmodells mit NachfragetrΓ€gheit: Teil I: Bestimmung des dynamischen Preisgleichgewichts, Zeitschrift fΓΌr die gesamte Staatswissenschaft / Journal of Institutional and Theoretical Economics (1965) 301–324.
[31] R. B. Myerson, Game theory: analysis of conflict, Harvard University Press, 1997.
[32] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling, Taylor & Francis, Andover, England, UK, 2021. doi:10.1201/9781315140223.