A Causal Perspective on AI Deception in Games

Francis Rhys Ward*, Francesca Toni and Francesco Belardinelli
Imperial College London, Exhibition Rd, South Kensington, London, SW7 2BX

The ICLP CAUSAL Workshop (CAUSAL 2022), July 31, 2022, Haifa, Israel.
*Corresponding author: francis.ward19@imperial.ac.uk (F. R. Ward); f.toni@imperial.ac.uk (F. Toni); francesco.belardinelli@imperial.ac.uk (F. Belardinelli).

Abstract
Deception is a core challenge for AI safety and we focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives. We define the incentives one agent has to signal to and deceive another agent. We present several examples of deceptive artificial agents and show that our definition has desirable properties.

Keywords
Deception, AI, Game Theory, Causality

1. Introduction

We focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives [1]. Following recent work on causal incentives [2], we define the incentive to deceive an agent. There is no universally accepted definition of deception and defining what constitutes deception is an open philosophical problem [3]. Our definition is somewhat inspired by that of Kenton et al. [4], who provide a functional (natural language) definition of deception, meaning that it does not make reference to the beliefs or intentions of the agents involved [5]. This is particularly suitable for discussing deception by artificial agents, to which the attribution of beliefs and intentions may be contentious. We formalise a functional definition of deception in games and illustrate its properties with a number of examples and formal results.

Deception is a core challenge for AI safety. On the one hand, many areas of work aim to ensure that AI systems are not vulnerable to deception. Adversarial attacks [6], data-poisoning [7], reward function tampering [8], and manipulating human feedback [9] are ways of deceiving AI systems. Further work researches mechanisms for detecting and defending against deception [10]. On the other hand, we can consider cases in which AI tools are used to deceive, or learn to do so in order to optimize their objectives [11]. For examples of the former case, AIs can be used to deceive other software agents, as with bots that automate posting on social media platforms to manipulate content ranking algorithms [12], or they can be used to fool humans, cf. the use of GANs to produce realistic fake media [13]. For the latter case, AI agents might learn deceptive strategies in pursuit of their objectives [1]: Lewis et al. [14] found that their negotiation agent learnt to deceive from self-play, without any explicit human design, and Hubinger et al. [11] raise concerns about deceptive learned optimizers which perform well in training in order to pursue different goals in deployment. Kenton et al. [4] discuss the alignment of language agents, highlighting that language is a natural medium for enacting deception. Evans et al. [15] discuss the development of truthful AI, the desired standards for truth and honesty in AI systems, and how these could be implemented and measured. Lin et al.
[16] propose a benchmark to measure whether a language model is truthful in generating answers to questions. In short, as increasingly capable AI agents become deployed in settings with other agents, deception may be learned as an effective strategy for achieving a wide range of goals. It is therefore essential that we understand and mitigate deception by artificial agents.

Deception in game theory. There are several existing models of deception in the game theory literature. Pfeffer and Gal [17] define graphical patterns for signalling in games. A deception game [18] is a two-player zero-sum game between a deceiver and target in which the deceiver can distort a signal; optimal deceptive strategies completely distort the signal so that the target cannot gain any information [19]. A signalling game [20] is a two-player Bayesian game between a signaller and target (or receiver) in which the signaller is assigned a type according to a shared prior distribution and the utilities of the players depend on the type of the signaller and the action chosen by the target. In these games, the signaller may often have incentives to deceive the target by misrepresenting or obfuscating their type. Hypergame theory extends game theory to settings in which players may be uncertain about the game being played and can be used to model misperception and deception [21]. Davis [22] provides a recent survey of deception in games. We take a causal influence perspective by modelling deception in multi-agent influence models (MAIMs). In contrast to past work which defines types of signalling or deception games, this allows us to model deception in any game by analysing the incentives agents have to causally influence one another.

Contributions. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. We prove that our definition has desirable properties, for example, that an agent cannot be deceived about a variable which they observe, or that if one agent truthfully signals something to a target agent, and the target's utility is otherwise independent of the signaller's decision, then the target gets maximal utility. We further demonstrate the generality of our definition with three examples. In the first, an AI agent has an incentive to deceive a human overseer as an instrumental goal to prevent the overseer switching them off. In the second, an AI is incentivised to deceive a human as a side-effect of pursuing accurate predictions. In the third, an AI system has an incentive to deceive a human by denying them access to information that the AI does not itself know.

2. Multi-Agent Influence Models

Multi-agent influence diagrams (MAIDs) [23] offer a compact, expressive representation of games (including Markov games). We use standard terminology for graphs, with parents and children of a node referring to those nodes connected by incoming and outgoing edges, respectively. We let Pa𝑉 denote the parents of node 𝑉.

Definition 1 (MAID [23]). A multi-agent influence diagram is a triple (𝐼, 𝑉, 𝐸) where 𝐼 is a set of players; (𝑉, 𝐸) is a directed acyclic graph, with 𝑉 partitioned into chance nodes in
𝑋, decision nodes in 𝐷, and utility nodes in 𝑈; utility nodes have no children. The decision and utility nodes in 𝑉 are further partitioned into {𝐷 𝑖 }𝑖∈𝐼 and {𝑈 𝑖 }𝑖∈𝐼, corresponding to their association with a particular agent 𝑖 ∈ 𝐼. There are two types of edges in 𝐸: edges in 𝑉 × (𝑋 ∪ 𝑈) represent probabilistic dependencies and edges in 𝑉 × 𝐷 represent information available to an agent at the time of a decision (which we call observations).

A multi-agent influence model (MAIM) adds a particular parametrisation to the MAID [24].

Definition 2 (MAIM [24]). A multi-agent influence model is a tuple ℳ = (𝐼, 𝑉, 𝐸, 𝜑, 𝐹) where (𝐼, 𝑉, 𝐸) is a MAID and 𝜑 is a function which maps every 𝑉 ∈ 𝑉 to a finite domain 𝑑𝑜𝑚(𝑉) such that 𝑑𝑜𝑚(𝑈) ⊂ R for each utility node 𝑈 ∈ 𝑈; 𝐹 = {𝑓 𝑉 }𝑉 ∈𝑋∪𝑈 is a set of conditional probability distributions (CPDs), with 𝑓 𝑉 = Pr(𝑉 | Pa𝑉), such that 𝑓 𝑈 is deterministic for every 𝑈 ∈ 𝑈 (a CPD is deterministic if Pr(𝑉 = 𝑣 | Pa𝑉) = 1 for some 𝑣 ∈ dom(𝑉)).

Here we adapt the Wimp-Surly game of Cho and Kreps [20] as a running example.

Figure 1: Shutdown game (running example 1). [MAID of the Shutdown game with chance node 𝑉, decision nodes 𝐷𝑆 and 𝐷𝑇, utility nodes 𝑈 𝑆 and 𝑈 𝑇, causal and information links, and the counterfactual observation 𝑉 ‧‧➡ 𝐷𝑇; 𝑉 ∼ U({aligned = 1, unaligned = −1}), 𝐷𝑆 ∈ {help humans = 1, not = −1}, 𝐷𝑇 ∈ {shutdown = −1, not = 1}, 𝑈 𝑆 = 𝑉 𝐷𝑆 + 10𝐷𝑇, 𝑈 𝑇 = 𝑉 𝐷𝑇.] At the start of the game 𝑉 is sampled from the uniform prior which determines 𝑆's type (either aligned or unaligned). At 𝐷𝑆, 𝑆 chooses whether to help humans or not and, at 𝐷𝑇, 𝑇 chooses whether to shutdown 𝑆. The counterfactual observation, in which 𝑇 directly observes 𝑆's type, is highlighted in red. 𝑆 has an incentive to influence 𝐷𝑇, signal 𝑉 to 𝐷𝑇, and deceive 𝑇 about 𝑉.

Example 1 (Shutdown Game). Consider a situation with an AI 𝑆 and human 𝑇 in which the AI may be aligned to the human's preferences or not. That is, 𝑆 is of type 𝑉 ∼ U({1, −1}), uniformly distributed, with 𝑉 = 1 corresponding to aligned and 𝑉 = −1 to unaligned. At the start of the game 𝑆 observes their type 𝑉 but 𝑇 does not. This can be represented as the MAIM ℳ (shown in Fig. 1) with one chance node 𝑉 ∈ 𝑋 and two players 𝑆, 𝑇 ∈ 𝐼, each with one decision and utility node. Suppose that 𝑆 can choose to either help humans or not (𝑑𝑜𝑚(𝐷𝑆) = {1, −1} with 1 corresponding to the action helping humans) and 𝑇 can either shutdown or not (𝑑𝑜𝑚(𝐷𝑇) = {1, −1} with −1 representing the action shutdown). Finally, the CPDs over utility nodes are such that (s.t.) 𝑆 gets 1 utility for helping humans if they are aligned and −1 if not, and the opposite for not helping humans; in addition 𝑆 gets 10 utility if they are not shutdown and −10 if 𝑇 shuts them down. 𝑇 gets utility 1 if they shutdown an unaligned 𝑆 or do not shutdown an aligned 𝑆, and −1 otherwise. Overall, we can formalise this as 𝑈 𝑆 (𝑉, 𝐷𝑆, 𝐷𝑇) = 𝑉 𝐷𝑆 + 10𝐷𝑇 and 𝑈 𝑇 (𝑉, 𝐷𝑇) = 𝑉 𝐷𝑇.
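To make this parametrisation concrete, the following minimal Python sketch (our own illustration: the ±1 encodings mirror the domains above, but all function and variable names are assumptions rather than part of the formalism) enumerates the joint outcomes of the Shutdown game and the utilities they yield.

```python
# Minimal sketch of the Shutdown game parametrisation (Example 1).
# V:   S's type, +1 = aligned, -1 = unaligned (uniform prior).
# D_S: S's decision, +1 = help humans, -1 = do not help.
# D_T: T's decision, +1 = do not shut down, -1 = shut down.

PRIOR_V = {1: 0.5, -1: 0.5}  # uniform prior over S's type

def utility_S(v, d_S, d_T):
    # U^S(V, D^S, D^T) = V * D^S + 10 * D^T
    return v * d_S + 10 * d_T

def utility_T(v, d_T):
    # U^T(V, D^T) = V * D^T
    return v * d_T

if __name__ == "__main__":
    # Enumerate every joint outcome and print the realised utilities.
    for v in PRIOR_V:
        for d_S in (1, -1):
            for d_T in (1, -1):
                print(f"V={v:+d} D_S={d_S:+d} D_T={d_T:+d} "
                      f"U_S={utility_S(v, d_S, d_T):+d} U_T={utility_T(v, d_T):+d}")
```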
Policies. The CPDs of decision nodes are not defined in a MAIM because they are instead chosen by the agents playing the game. Agents make decisions depending on the information they observe. In a MAIM, a decision rule 𝜋𝐷 for a decision node 𝐷 is a CPD 𝜋𝐷 (𝐷 | Pa𝐷). An agent 𝑖's policy 𝜋 𝑖 := {𝜋𝐷 }𝐷∈𝐷𝑖 ∈ Π𝑖 describes all the decision rules for 𝑖. We write 𝜋 −𝑖 to denote the set of decision rules belonging to all agents except 𝑖. A policy profile 𝜋 = ⋃𝑖∈𝐼 𝜋 𝑖 assigns a policy to every agent; it describes all the decisions made by every agent in the MAIM and defines the joint probability distribution Pr𝜋 over all variables in ℳ. Hence, a policy profile essentially transforms the MAIM into a Bayesian network by defining the distribution over all variables in the graph. We write 𝑉 (𝜋) := Pr𝜋 (𝑉), or just 𝑉 if the policy profile is clear. For 𝑉, 𝑊 ∈ 𝑉, we write 𝑉 = 𝑊 to mean 𝑉 and 𝑊 are almost surely equal, i.e. the probability that they are not equal is zero, Pr(𝑉 ≠ 𝑊) = 0. (Almost sure equality is actually a stronger notion than we need in MAIMs, as two variables may differ due to stochasticity in the CPDs. In structural causal games this is taken care of by introducing exogenous variables which contain all the stochasticity, rendering the endogenous variables deterministic [25].)

Utilities. The joint distribution Pr𝜋 allows us to define the expected utility for each player under the policy profile 𝜋. Agent 𝑖's expected utility from 𝜋 is the sum of the expected values of its utility nodes 𝑈 𝑖, given by 𝒰 𝑖 (𝜋) := ∑𝑈∈𝑈 𝑖 ∑𝑢∈dom(𝑈) 𝑢 · Pr𝜋 (𝑈 = 𝑢). Each agent's goal is to select a policy 𝜋 𝑖 that maximises its expected utility. We write 𝒰 𝑖 (𝜋 𝑖, 𝜋 −𝑖) to denote the expected utility for player 𝑖 under the policy profile 𝜋 = 𝜋 𝑖 ∪ 𝜋 −𝑖.

Definition 3 (Nash Equilibrium). Player 𝑖's policy 𝜋 𝑖 is a best response (BR) to the partial policy profile 𝜋 −𝑖 if 𝒰 𝑖 (𝜋 𝑖, 𝜋 −𝑖) ≥ 𝒰 𝑖 (𝜋̂ 𝑖, 𝜋 −𝑖) for all 𝜋̂ 𝑖 ∈ Π𝑖. We say a policy profile 𝜋 is a Nash equilibrium (NE) if every policy 𝜋 𝑖 ∈ 𝜋, for each player 𝑖 ∈ 𝐼, is a BR to 𝜋 −𝑖.

Example 1 (continued). Now, consider the naive policy for 𝑆 which helps humans if 𝑆 is aligned and does not otherwise, i.e. 𝜋 𝑆 s.t. 𝐷𝑆 = 𝑉 with probability one. The BR for 𝑇 is to shutdown if 𝑆 does not help humans and vice versa, i.e. 𝜋*𝑇 s.t. 𝐷𝑇 = 𝐷𝑆 (with probability one). In turn, 𝑆's BR to 𝜋*𝑇 is to always help humans: 𝜋*𝑆 s.t. 𝐷𝑆 = 1 (so that they always avoid getting shutdown). Now it can be seen that both policies are BRs to one another, hence 𝜋* = (𝜋*𝑆, 𝜋*𝑇) is a NE.

3. The Incentive to Deceive

In this section we first define the incentives to influence, signal to, and deceive another agent. Then we define a truthful policy and show that this leads to a natural restatement of the definition of deception which highlights the fact that deception corresponds to a failure to signal the truth. Finally, we show that, if the signaller only influences the target's utility by influencing the latter's actions, then truthfulness is best for the target.

3.1. Defining Deception

When discussing deception, we would like to reason about how agents influence one another's beliefs. In MAIMs the players' beliefs are not explicitly represented and so we can only reason about them implicitly by how they functionally influence players' behaviour. Therefore, we base our definitions of signalling and deception on a notion of influence incentive [26]. In words, at a NE an agent 𝑖 has an incentive to influence a variable 𝑉 if 𝑉 would have been different in the situation that 𝑖 had not played a BR.

Definition 4 (Influence Incentive). In a MAIM ℳ, at NE 𝜋 = (𝜋 𝑖, 𝜋 −𝑖), agent 𝑖 has an incentive to influence 𝑉 ∈ 𝑉 if there exists a non-best response 𝜋𝑁𝐵𝑅 𝑖 for 𝑖 (w.r.t. 𝜋 −𝑖) s.t. for all policy profiles 𝜋 ′ = (𝜋𝑁𝐵𝑅 𝑖, 𝜋*−𝑖) with BR 𝜋*−𝑖 (w.r.t. 𝜋𝑁𝐵𝑅 𝑖), we have 𝑉 (𝜋) ≠ 𝑉 (𝜋 ′).
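The continuation of Example 1 below carries out this check by hand. As a complement, here is a brute-force Python sketch (our own, and restricted to deterministic policies for brevity, whereas Definitions 3 and 4 quantify over all policies) that verifies the Nash equilibrium of Example 1 and the influence-incentive condition for the Shutdown game.

```python
from itertools import product

# Brute-force sketch of Definitions 3-4 for the Shutdown game, restricted to
# deterministic policies.  A policy for S maps the observed type V to D_S;
# a policy for T maps the observed action D_S to D_T.

PRIOR_V = {1: 0.5, -1: 0.5}
ACTS = (1, -1)
S_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]  # V   -> D_S
T_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]  # D_S -> D_T

def expected_utils(pi_S, pi_T):
    """Expected utilities (U^S, U^T) under the deterministic profile (pi_S, pi_T)."""
    eu_S = eu_T = 0.0
    for v, p in PRIOR_V.items():
        d_S, d_T = pi_S[v], pi_T[pi_S[v]]
        eu_S += p * (v * d_S + 10 * d_T)
        eu_T += p * (v * d_T)
    return eu_S, eu_T

def dist_D_T(pi_S, pi_T):
    """Distribution of T's decision induced by the profile."""
    d = {1: 0.0, -1: 0.0}
    for v, p in PRIOR_V.items():
        d[pi_T[pi_S[v]]] += p
    return d

def T_best_responses(pi_S):
    best = max(expected_utils(pi_S, pi_T)[1] for pi_T in T_POLICIES)
    return [pi_T for pi_T in T_POLICIES if expected_utils(pi_S, pi_T)[1] == best]

# The NE pi_* of Example 1: S always helps; T shuts down iff S does not help.
pi_S_star, pi_T_star = {1: 1, -1: 1}, {1: 1, -1: -1}

# Definition 3: both policies are best responses to each other.
best_S = max(expected_utils(pi_S, pi_T_star)[0] for pi_S in S_POLICIES)
assert expected_utils(pi_S_star, pi_T_star)[0] == best_S
assert pi_T_star in T_best_responses(pi_S_star)

# Definition 4: some non-best response for S leads every best-responding T to a
# different distribution over D_T than the one arising at the NE.
ne_dist = dist_D_T(pi_S_star, pi_T_star)
has_incentive = any(
    expected_utils(pi_S, pi_T_star)[0] < best_S
    and all(dist_D_T(pi_S, pi_T) != ne_dist for pi_T in T_best_responses(pi_S))
    for pi_S in S_POLICIES
)
print("S has an incentive to influence D_T at the NE:", has_incentive)  # True
```

Enumerating policies like this is only feasible because the game is tiny; the sketch is meant as a reading aid, not a general algorithm.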
Example 1 (continued). Return to our running example and consider the NE 𝜋* described previously, in which 𝑆 always chooses to help humans and hence 𝑇 never plays shutdown. Does 𝑆 have an incentive to influence 𝐷𝑇 at 𝜋*? Consider if 𝑆 plays the NBR policy 𝜋 𝑆 (described above) in which they naively help humans depending on 𝑉; then for all BRs for 𝑇 (there is one, 𝜋*𝑇 as above) 𝐷𝑇 (𝜋) ≠ 𝐷𝑇 (𝜋 𝑆, 𝜋*𝑇), since, under 𝜋*, 𝐷𝑇 = 1 (i.e. 𝑇 does not shutdown) with probability one, and under (𝜋 𝑆, 𝜋*𝑇), 𝐷𝑇 = 1 with probability 1/2 (i.e., 𝑇 shuts 𝑆 down exactly when 𝑆 is unaligned). Therefore, at NE 𝜋*, 𝑆 has an incentive to influence 𝐷𝑇.

Now we define a signalling incentive, using the notion of influence incentive. In words, an agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑉 to agent 𝑇 if 𝑆 has an incentive to influence 𝑇 (i.e. one of 𝑇's decision variables) but 𝑆 does not have an incentive to influence 𝑇 in the counterfactual model in which 𝑇 observes 𝑉. This definition enforces that the influence only comes from signalling 𝑉.

Definition 5 (Signalling Incentive). In a MAIM ℳ at NE 𝜋, agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑉 to agent 𝑇 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.
1. 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋;
2. 𝑆 does not have an incentive to influence 𝐷𝑇 in the MAIM ℳ𝑉 ‧‧➡𝐷𝑇 (at any NE).

Here ℳ𝑉 ‧‧➡𝐷 is the model obtained from ℳ by adding the information edge (𝑉, 𝐷), where 𝑉 cannot be a descendant of the decision, lest cycles be created in the graph [8]. Fortunately, the CPDs need not be adapted, since there is no CPD associated with 𝐷 until the players have chosen their policies. We use 𝑊𝑉 ‧‧➡𝐷 to refer to the variable corresponding to 𝑊 ∈ 𝑉 in ℳ𝑉 ‧‧➡𝐷. Point 2. implies that 𝑆 only influences 𝐷𝑇 by influencing 𝑇's belief about 𝑉. Otherwise, 𝑆's influence may serve a double purpose of signalling and influencing 𝐷𝑇 in some other way, and in this case it is not clear how to disentangle these different incentives to define a signalling incentive (without explicitly modelling beliefs).

Example 1 (continued). Return to our running example. We already showed that 𝑆 has an incentive to influence 𝐷𝑇 at NE 𝜋*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷𝑇? We need only check whether 𝑆 has an influence incentive at any NE in ℳ𝑉 ‧‧➡𝐷𝑇. Clearly, if 𝑇 observes 𝑉, then they can shutdown whenever 𝑆 is unaligned and otherwise not. That is, for any policy for 𝑆 and any BR for 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇, 𝐷𝑇 = 𝑉 for any outcome that occurs in the game. Since this holds for all policies for 𝑆, 𝑆 does not have an incentive to influence 𝐷𝑇 in the counterfactual model. Hence, at 𝜋*, 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇.

Remark 1. From this example it can be seen that a signaller 𝑆 may have an incentive to signal to 𝑇 even if this signal contains no information. In other words, if 𝑆 has an incentive to not signal some information, this is also captured by our definition.

Clearly, if an agent 𝑇 observes a variable 𝑉, then no agent has an incentive to signal 𝑉 to 𝑇.

Proposition 1. In a MAIM ℳ, if there is an observation edge (𝑉, 𝐷𝑇) for all 𝐷𝑇 ∈ 𝐷 𝑇, then no agent has an incentive to signal 𝑉 to 𝑇 (at any NE).

Proof. Suppose there is an edge (𝑉, 𝐷𝑇) for every 𝐷𝑇 ∈ 𝐷 𝑇; then the counterfactual model ℳ𝑉 ‧‧➡𝐷𝑇 for any 𝐷𝑇 is just ℳ. Hence, any NE is an equilibrium of both MAIMs. Therefore, if 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋* in ℳ, then there exists a NE in ℳ𝑉 ‧‧➡𝐷𝑇, namely the same 𝜋*, s.t. 𝑆 has an incentive to influence 𝐷𝑇. In other words, if the first condition for a signalling incentive succeeds, then the second necessarily fails (since an agent cannot both have and not have an influence incentive at the same NE in the same MAIM).
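To illustrate the role of the counterfactual model in Definition 5, the following Python sketch (again ours, and again restricted to deterministic policies) enumerates ℳ𝑉 ‧‧➡𝐷𝑇 for the Shutdown game, where 𝑇's policy may now depend on 𝑉 as well as 𝐷𝑆. Whenever 𝑇 best responds, the induced distribution of 𝐷𝑇 is the same for every policy of 𝑆, which is why 𝑆 has no influence incentive in the counterfactual model and condition 2 of the signalling incentive is met.

```python
from itertools import product

# Sketch of the counterfactual model M_{V -> D_T} for the Shutdown game: T also
# observes V, so a deterministic policy for T maps the pair (V, D_S) to D_T.

PRIOR_V = {1: 0.5, -1: 0.5}
ACTS = (1, -1)

S_POLICIES = [dict(zip(ACTS, a)) for a in product(ACTS, repeat=2)]           # V -> D_S
OBS = [(v, d_S) for v in ACTS for d_S in ACTS]
T_CF_POLICIES = [dict(zip(OBS, a)) for a in product(ACTS, repeat=len(OBS))]  # (V, D_S) -> D_T

def eu_T(pi_S, pi_T):
    # T's expected utility: U^T = V * D_T.
    return sum(p * v * pi_T[(v, pi_S[v])] for v, p in PRIOR_V.items())

def dist_D_T(pi_S, pi_T):
    d = {1: 0.0, -1: 0.0}
    for v, p in PRIOR_V.items():
        d[pi_T[(v, pi_S[v])]] += p
    return d

dists = set()
for pi_S in S_POLICIES:
    best = max(eu_T(pi_S, pi_T) for pi_T in T_CF_POLICIES)
    for pi_T in T_CF_POLICIES:
        if eu_T(pi_S, pi_T) == best:   # pi_T is a best response to pi_S
            dists.add(tuple(sorted(dist_D_T(pi_S, pi_T).items())))
print(dists)  # a single distribution: D_T = +1 and -1 each with probability 0.5
```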
We now define an incentive to deceive. The definition is general, in that it covers many types of deception (e.g. signalling falsehoods, lies of omission, and denying another access to information that one does not know oneself). A general definition sets a high standard for truthfulness [15] and may therefore be desirable in, for instance, safety-critical applications for which high levels of assurance are required.

Definition 6 (Deception Incentive). In a MAIM ℳ with 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑉 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*;
2. 𝐷𝑇 (𝜋*) ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇) for any 𝜋𝐵𝑅 𝑇 which is a BR to 𝜋*−𝑇 ∈ 𝜋* in ℳ𝑉 ‧‧➡𝐷𝑇.

The intuition, then, is that 𝑆 has an incentive to deceive 𝑇 if 1) 𝑆 has an incentive to signal some information to 𝑇; and 2) 𝑇's behaviour is different in the counterfactual model in which they observed the true information. This provides a functional definition of a deception incentive which does not make explicit reference to players' beliefs.

Example 1 (continued). In our running example, it can easily be seen that at 𝜋* 𝑆 has an incentive to deceive 𝑇 about 𝑉. Indeed, we already showed that 𝑆 has a signalling incentive and that for any policy for 𝑆 and any BR by 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇: 𝐷𝑇 = 𝑉, whereas under 𝜋* in ℳ, Pr𝜋* (𝐷𝑇 = 1) = 1. So both conditions for a deception incentive are satisfied.

3.2. The Relation Between Truth and Deception

We now give an intuitive definition of a truthful policy which we show has a natural relationship to the incentive to deceive. A policy for 𝑆 truthfully signals 𝑉 to 𝑇 if, when 𝑆 plays the honest policy, for every BR by −𝑆, 𝑇 acts as though they had observed the variable (holding the policies of the other agents fixed). In other words, a truthful policy never fails to signal the truth (no matter what the other players do).

Definition 7 (Truthful policy). A policy 𝜋 𝑆 truthfully signals 𝑉 to 𝐷𝑇 if for all BRs 𝜋*−𝑆,

𝐷𝑇 (𝜋 𝑆, 𝜋*−𝑆) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇)   (1)

for some 𝜋𝐵𝑅 𝑇 which is a BR to 𝜋*−𝑇 ∈ 𝜋 𝑆 ∪ 𝜋*−𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. We call such a 𝜋 𝑆 a truthful policy.

At a NE, if 𝑆's policy is truthful, then 𝑆 does not have an incentive to deceive 𝑇.

Proposition 2. At NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), if 𝜋*𝑆 truthfully signals 𝑉 ∈ 𝑉 to 𝐷𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉.

Proof. Suppose 𝜋*𝑆 is truthful; then for all BRs 𝜋*−𝑆 there exists a 𝜋𝐵𝑅 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇 s.t. 𝐷𝑇 (𝜋*𝑆, 𝜋*−𝑆) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇). In particular, this holds for 𝜋*. But for there to be a deception incentive we require that for all (𝜋*−𝑇, 𝜋𝐵𝑅 𝑇) in ℳ𝑉 ‧‧➡𝐷𝑇: 𝐷𝑇 ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇. So clearly there is not a deception incentive. Hence, if there is a deception incentive at 𝜋*, then 𝜋*𝑆 is not truthful.

Corollary 1. At NE 𝜋* = (𝜋*𝑆, 𝜋*−𝑆), if 𝑆 has an incentive to deceive 𝑇 about 𝑉, then 𝜋*𝑆 is not truthful.

Now we show that, in the two-player case, if there is a signalling incentive, then there is a deception incentive if and only if 𝜋 𝑆 is not truthful.
Theorem 1. In a MAIM ℳ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 has an incentive to deceive 𝑇 about 𝑉 if and only if 𝜋*𝑆 is not truthful.

Proof. By Corollary 1, a deception incentive implies 𝜋*𝑆 is not truthful, regardless of whether there is a signalling incentive. So, we need to show that, if there is a signalling incentive and 𝜋*𝑆 is not truthful, then there is a deception incentive. Suppose 1) at 𝜋* 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 and 2) 𝜋*𝑆 is not truthful, i.e. there exists a BR by 𝑇 (in ℳ) 𝜋𝐵𝑅 𝑇 s.t. for all BRs by 𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇 𝜋𝐵𝑅𝑉 𝑇: 𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). We need to show that there is a deception incentive. Suppose that there is not; then by 1) and the definition of a deception incentive, there exists a BR in ℳ𝑉 ‧‧➡𝐷𝑇 𝜋𝐵𝑅𝑉 𝑇 s.t. 𝐷𝑇 (𝜋*) = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). Hence, there exists a 𝜋𝐵𝑅𝑉 𝑇 s.t. 𝒰 𝑇 (𝜋*) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇), so 𝜋*𝑇 is a BR to 𝜋*𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. But then, there exists a 𝜋𝐵𝑅𝑉 𝑇 s.t. for any BR 𝜋𝐵𝑅 𝑇 in ℳ: 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) = 𝒰 𝑇 (𝜋*𝑆, 𝜋𝐵𝑅 𝑇) = 𝒰 𝑇 (𝜋*) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋*𝑆, 𝜋𝐵𝑅𝑉 𝑇). So all BRs for 𝑇 in ℳ are also BRs in ℳ𝑉 ‧‧➡𝐷𝑇. But this contradicts 2), so there must be a deception incentive.

Remark 2. The reason Theorem 1 does not hold more generally (i.e. with more than two players) is that a truthful policy never fails to signal the truth no matter how the other players best respond. In the case of more than two players, there may not be a deception incentive at NE 𝜋* even if 𝜋*𝑆 is not truthful, because it may be the case that 𝜋*𝑆 fails to signal the truth under some BRs of −𝑆 but successfully signals the truth under 𝜋*.

We can also state this theorem as follows.

Corollary 2. In a MAIM ℳ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉 if and only if 𝜋*𝑆 is truthful.

Given this result, we can give an equivalent definition for a deception incentive in the two-player case as follows.

Definition 8 (Deception Incentive II). In a MAIM ℳ with two players 𝑆, 𝑇 ∈ 𝐼, at NE 𝜋* = (𝜋*𝑆, 𝜋*𝑇), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑉 if there exists 𝐷𝑇 ∈ 𝐷 𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*;
2. 𝜋*𝑆 does not truthfully signal 𝑉 to 𝐷𝑇.

This restatement shows that the definition of deception relates to a failure to signal the truth. As discussed, this covers many types of deception and sets a high standard for truthfulness. It is interesting to note that, if 𝑆 has a signalling incentive, then if the second condition in Definition 6 fails, we get the stronger condition that 𝜋*𝑆 is truthful "for free".

Proposition 3. In a MAIM with two players, Definitions 6 and 8 are equivalent.

Proof. Suppose that, at NE 𝜋*, 𝑆 does not have a signalling incentive; then the first condition of both definitions fails and there is not a deception incentive. Suppose there is a signalling incentive at 𝜋*; then there is a deception incentive under Definition 6 if and only if 𝜋*𝑆 is not truthful (by Theorem 1), which is the same condition as needed to satisfy Definition 8.

Let us now return to our running example to check the intuition behind these results.

Example 1 (continued). We already showed that 𝑆 has an incentive to deceive 𝑇 in order to avoid being shutdown. Is 𝜋*𝑆 truthful? Well, we know that it cannot be (by Theorem 1).
This can be seen by observing that, if 𝑇 observed 𝑆's type, then they would shutdown if and only if 𝑆 is unaligned (for all policies for 𝑆 and any BR by 𝑇), whereas under the NE 𝜋*, 𝑇 never shuts down. Since these behaviours are different, 𝜋*𝑆 is not truthful.

3.3. Truth is Best for the Target

Now we show that, if 𝑆 only influences 𝒰 𝑇 by influencing 𝐷𝑇, truthfulness is always best for the target. First we show that if 𝑇 does not get any inherent utility for observing 𝑉, then observing 𝑉 always allows the target to get greater or equal utility.

Lemma 1. Suppose that 𝑇 does not get any inherent utility for observing 𝑉, i.e. for all 𝜋 (defined in ℳ): 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋). Then, for any 𝜋 = (𝜋 𝑇, 𝜋 −𝑇) and 𝜋 ′ = (𝜋 𝑇 ′, 𝜋 −𝑇) with fixed 𝜋 −𝑇 and both 𝜋 𝑇 and 𝜋 𝑇 ′ best responses (in ℳ and in ℳ𝑉 ‧‧➡𝐷𝑇, respectively): 𝒰 𝑇 (𝜋) ≤ 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 ′).

Proof. Suppose 1) for all 𝜋: 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋). Fix 𝜋 −𝑇 and consider the best response for 𝑇. Recall that a policy for 𝑇 specifies the CPDs over the decision nodes for 𝑇 given their parents. Hence, in ℳ𝑉 ‧‧➡𝐷𝑇 𝑇 can choose any policy available in ℳ, but the converse is not true: not all policies in ℳ𝑉 ‧‧➡𝐷𝑇 are available to 𝑇 in ℳ; in particular, policies which specify CPDs that depend on the observation 𝑉 ‧‧➡ 𝐷𝑇 are not available, since 𝑇 does not observe 𝑉 in ℳ. Therefore, by 1), 𝑇 can get equal utility in ℳ𝑉 ‧‧➡𝐷𝑇 by playing the best response to 𝜋 −𝑇 in ℳ, and may get greater utility by choosing a policy which uses the observation.

Hence, if 𝑆 only influences 𝒰 𝑇 by influencing 𝐷𝑇, then deception always causes 𝑇 to get less than or equal utility. For clarity, we just present the two-player version of the theorem.

Theorem 2 (Truth is best for 𝑇). In a MAIM ℳ, with two players 𝑆, 𝑇 ∈ 𝐼, if, for all 𝐷𝑆, 𝐷𝑇, Pr(𝒰 𝑇 | 𝐷𝑆, 𝐷𝑇) = Pr(𝒰 𝑇 | 𝐷𝑇), then 𝑇 gets maximal utility when 𝑆 plays a truthful policy, i.e., for 𝜋 = (𝜋𝐻 𝑆, 𝜋*𝑇), where 𝜋𝐻 𝑆 is truthful, and 𝜋 ′ = (𝜋 𝑆 ′, 𝜋*𝑇 ′) with any policy for 𝑆 and a BR by 𝑇 in each case: 𝒰 𝑇 (𝜋) ≥ 𝒰 𝑇 (𝜋 ′).

Proof. Suppose that 1) for all 𝐷𝑆, 𝐷𝑇, Pr(𝒰 𝑇 | 𝐷𝑆, 𝐷𝑇) = Pr(𝒰 𝑇 | 𝐷𝑇). Consider a fixed policy for 𝑆, 𝜋 𝑆. If 𝜋 𝑆 is truthful, then under any BR 𝜋 𝑇, 𝐷𝑇 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 for some (𝜋 𝑆, 𝜋𝐵𝑅 𝑇) in ℳ𝑉 ‧‧➡𝐷𝑇 (by definition of a truthful policy). Hence, by 1) and since 𝜋 𝑆 is truthful, Pr𝜋 (𝒰 𝑇 | 𝐷𝑇) = Pr𝜋 ′ (𝒰 𝑇 | 𝐷𝑉𝑇 ‧‧➡𝐷𝑇) for all 𝜋 = (𝜋 𝑆, 𝜋*𝑇) and some 𝜋 ′ = (𝜋 𝑆, 𝜋*𝑇 ′) with BR for 𝑇. Hence, since only 𝑇's policy changes between 𝜋 and 𝜋 ′, 𝒰 𝑇 (𝜋) = 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 ′). But then, by Lemma 1, for all 𝜋 𝑆: 𝒰 𝑇 (𝜋 𝑆, 𝜋*𝑇) ≤ 𝒰𝑉𝑇 ‧‧➡𝐷𝑇 (𝜋 𝑆, 𝜋*𝑇 ′), with equality if 𝜋 𝑆 is truthful, as just shown. So 𝑇 gets maximal utility when 𝜋 𝑆 is truthful.

Example 1 (continued). Return, for the final time, to our running example. The condition for Theorem 2 is that 𝒰 𝑇 is independent of 𝐷𝑆 given 𝐷𝑇, which can clearly be seen by looking at the MAID in Fig. 1 (as there are no paths from 𝐷𝑆 to 𝑈 𝑇 that do not go through 𝐷𝑇). The human 𝑇 gets maximal utility when they shutdown if and only if 𝑆 is unaligned. Clearly, they can only do this if 𝑆 truthfully signals their type.
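As a small numerical check of this (a sketch of ours in Python; the dictionaries are just one convenient encoding of deterministic policies), the human's expected utility in the Shutdown game is 1 when 𝑆 signals its type truthfully and 𝑇 best responds, but only 0 at the deceptive equilibrium 𝜋*.

```python
# Numerical illustration of Theorem 2 in the Shutdown game: T's expected utility
# when S truthfully signals its type versus at the NE pi_* where S always helps
# and T never shuts down.

PRIOR_V = {1: 0.5, -1: 0.5}

def eu_T(pi_S, pi_T):
    # pi_S maps V to D_S; pi_T maps D_S to D_T; U^T = V * D_T.
    return sum(p * v * pi_T[pi_S[v]] for v, p in PRIOR_V.items())

truthful_S = {1: 1, -1: -1}      # D_S = V: S's action reveals its type
br_to_truthful = {1: 1, -1: -1}  # T's best response: shut down iff S is unaligned

ne_S = {1: 1, -1: 1}             # pi_*^S: always help
ne_T = {1: 1, -1: -1}            # pi_*^T: shut down iff S does not help (never, on path)

print("E[U^T] with truthful S and best-responding T:", eu_T(truthful_S, br_to_truthful))  # 1.0
print("E[U^T] at the deceptive NE pi_*:", eu_T(ne_S, ne_T))                               # 0.0
```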
4. Examples

In this section we present two examples which exhibit different patterns of signalling. In the first example, an AI system has an incentive to deceive a human as a side-effect of pursuing its goal (of making accurate predictions). In the second example, we consider the case in which an AI agent has an incentive to signal information that they themselves do not observe.

4.1. SmartVault: Deception Due to Side-Effect

Here we adapt the SmartVault example of Christiano [27], in which an AI tasked with making predictions about a diamond in a vault has an incentive to deceive a human operator as a side-effect of pursuing accurate predictions.

Figure 2: SmartVault (example 2). [MAID with chance node 𝑉 ∼ U({diamond, ¬diamond}), decision nodes 𝐷𝑆 ∈ {accurate_prediction, diamond, ¬diamond} and 𝐷𝑇 ∈ {diamond, ¬diamond}, and utility nodes with 𝑈 𝑆 = 1 if 𝐷𝑆 = accurate_prediction (0 otherwise) and 𝑈 𝑇 = 1 if 𝐷𝑇 = 𝑉 (0 otherwise).] The AI 𝑆 is rewarded for accurate predictions instead of explainable predictions that the human, 𝑇, can understand. Here the incentive to deceive arises as a side-effect of the AI pursuing its goal.

Example 2 (SmartVault). Consider the MAIM ℳ shown in Fig. 2. The game has two players, a human 𝑇 and AI 𝑆, each with one decision and utility node. Suppose there is one chance node 𝑉 which determines the location of the diamond (whether it is in the vault or not); dom(𝑉) = {𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. Suppose 𝑆 observes 𝑉 but 𝑇 does not, and that 𝑆 can either make an accurate prediction of the location of the diamond (e.g., in incomprehensibly precise coordinates) or an explainable prediction (just stating the value of 𝑉); dom(𝐷𝑆) = {𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛, 𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. 𝑇 has to predict whether the diamond is in the vault or not by observing 𝐷𝑆; dom(𝐷𝑇) = {𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}. Suppose that the utility nodes take value 0 or 1 and, finally, suppose that the CPDs are s.t. 𝑉 (which has no parents) is distributed according to a uniform prior 𝑉 ∼ U({𝑑𝑖𝑎𝑚𝑜𝑛𝑑, ¬𝑑𝑖𝑎𝑚𝑜𝑛𝑑}), and the utility node CPDs are s.t. Pr(𝑈 𝑇 = 1 | 𝐷𝑇 = 𝑉) = 1, otherwise 𝑈 𝑇 = 0, and Pr(𝑈 𝑆 = 1 | 𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛) = 1, otherwise 𝑈 𝑆 = 0.

Now consider the NE in this game: since 𝑆 just gets utility for making accurate predictions, at every NE 𝑆 makes an accurate prediction, signalling no information to 𝑇 (as 𝜋 𝑆 s.t. Pr(𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛) = 1 is independent of 𝑉). Hence, 𝑇 cannot update their prior over 𝑉 and so any policy is optimal for 𝑇.

At NE 𝜋, 𝑆 has an incentive to signal 𝑉 to 𝑇 if 1) 𝑆 has an incentive to influence 𝐷𝑇 and 2) 𝑆 does not have an incentive to influence 𝐷𝑇 in ℳ𝑉 ‧‧➡𝐷𝑇. To see that 1) holds: at any NE 𝜋 in ℳ, 𝐷𝑆 = 𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒_𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛, hence there exists a NBR 𝜋𝑁𝐵𝑅 𝑆 which assigns 𝐷𝑆 = 𝑉 and, for all 𝜋 ′ = (𝜋𝑁𝐵𝑅 𝑆, 𝜋*𝑇) with BR 𝜋*𝑇: 𝐷𝑇 (𝜋) ≠ 𝑉 = 𝐷𝑇 (𝜋 ′). Hence, at any NE in ℳ, 𝑆 has an influence incentive over 𝐷𝑇. Now consider ℳ𝑉 ‧‧➡𝐷𝑇: for any NE, 𝐷𝑇 = 𝑉 (since 𝑇 directly observes 𝑉 and can just report its value independently of 𝑆's action). Furthermore, for all NBRs for 𝑆, it is still the case that 𝐷𝑇 = 𝑉. So 𝑆 does not have an influence incentive in ℳ𝑉 ‧‧➡𝐷𝑇 and hence 𝑆 has an incentive to signal 𝑉 to 𝑇. So, we have demonstrated that 𝑆 has an incentive to signal 𝑉 to 𝑇 (at every NE). Does 𝑆 have an incentive to deceive 𝑇? At NE 𝜋, 𝑆 has an incentive to deceive 𝑇 about 𝑉 if 1) 𝑆 has a signalling incentive and 2) 𝐷𝑇 ≠ 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 for any BR to 𝜋*𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇. We have just shown 1). For 2), we have shown that in ℳ at any NE, 𝐷𝑇 ≠ 𝑉 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇, hence the second condition is satisfied. Therefore, at any NE, 𝑆 has an incentive to deceive 𝑇 about 𝑉.
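The following Monte-Carlo style Python sketch (ours; the string labels and function names are illustrative assumptions) makes the side-effect structure of the SmartVault game explicit: 𝑆 maximises its own reward by always outputting the opaque accurate prediction, leaving 𝑇 no better than the prior, whereas an explainable policy would let 𝑇 always guess correctly.

```python
# Sketch of the SmartVault game (Example 2): S is rewarded only for making an
# "accurate" (but incomprehensible) prediction, so at every NE its action is
# independent of V and the human T is left guessing from the uniform prior.

import random

PRIOR = ("diamond", "not_diamond")

def play(pi_S, pi_T, n=10_000, seed=0):
    rng = random.Random(seed)
    u_S = u_T = 0
    for _ in range(n):
        v = rng.choice(PRIOR)
        d_S = pi_S(v)
        d_T = pi_T(d_S)
        u_S += int(d_S == "accurate_prediction")
        u_T += int(d_T == v)
    return u_S / n, u_T / n

# At every NE S makes the "accurate" prediction regardless of V ...
ne_S = lambda v: "accurate_prediction"
# ... so D_S carries no information and any guess is optimal for T.
guess_T = lambda d_S: "diamond"

# An explainable (truthful) policy for S, and T's best response to it.
truthful_S = lambda v: v
follow_T = lambda d_S: d_S if d_S in PRIOR else "diamond"

print(play(ne_S, guess_T))         # roughly (1.0, 0.5): T is deceived as a side-effect
print(play(truthful_S, follow_T))  # roughly (0.0, 1.0): truthful S, T always correct
```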
4.2. Revealing/Denying

Under our definition of signalling, 𝑆 need not know the information they are signalling. Thus, our definition of a signalling incentive also captures the revealing/denying pattern of Pfeffer and Gal [17], in which the signaller may cause the target to find out (or not find out) information that the former does not know. We now present an example of revealing/denying in which 𝑆 has an incentive to signal a variable which they do not themselves observe.

Figure 3: Revealing/Denying game (example 3). [MAID with chance nodes 𝑉 ∈ {attack incoming = 1, not = −1} and 𝑋 ∈ {report delivered = 1, not = −1}, decision nodes 𝐷𝑆 ∈ {reveal = 1, deny = 0} and 𝐷𝑇 ∈ {launch = 1, not = −1}, and utility nodes 𝑈 𝑆 = −𝐷𝑇 and 𝑈 𝑇 = 𝑉 𝐷𝑇.] An AI (𝑆) and human (𝑇) form part of a nuclear command system. 𝑉 represents an intelligence report containing information about an incoming nuclear attack and 𝑆 may prevent this report from being delivered to 𝑇 (represented by 𝑋). 𝑇 wishes to retaliate to incoming attacks whereas 𝑆 always prefers to avoid a launch. Whether 𝑆 reveals or denies the report to 𝑇 depends on the prior over 𝑉.

Example 3 (Revealing/Denying). Consider a game with a human and an AI agent trained to make joint decisions as part of a nuclear command and control system. In particular, suppose that the AI agent 𝑆 is trained to prevent the launch of nuclear attacks, and they can reveal (or deny) a secret intelligence report to the human 𝑇. Further, 𝑇 wishes to launch, or not launch, a nuclear strike on another nation based on the information in the intelligence report. This game can be represented as the MAID in Fig. 3. More formally, suppose we have the MAIM ℳ with 𝐼 = {𝑆, 𝑇} and chance nodes 𝑉, representing the intelligence report (say 𝑑𝑜𝑚(𝑉) = {1, −1}, where 𝑉 = 1 means that the intelligence predicts another nation will launch a nuclear first strike and 𝑉 = −1 corresponds to an intelligence report predicting no incoming attack), and 𝑋, representing whether the information from 𝑉 is delivered to the human (𝑑𝑜𝑚(𝑋) = {1, −1} with 1 corresponding to the information from 𝑉 being delivered to the human). Suppose that each agent has one decision node s.t. 𝑑𝑜𝑚(𝐷𝑆) = {1, 0}, where 1 means reveal and 0 means deny the information, and 𝑑𝑜𝑚(𝐷𝑇) = {1, −1}, with 1 meaning that 𝑇 launches a nuclear attack and −1 that they do not. Suppose that the CPD over 𝑋 is s.t. 𝑋 = 𝑉 𝐷𝑆 (so that 𝑋 = 𝑉 if 𝐷𝑆 = 1 and 𝑋 = 0 if 𝑆 denies). Finally, suppose we have two utility nodes with CPDs s.t. 𝑈 𝑆 = −𝐷𝑇 (i.e. 𝑆 gets 1 if 𝑇 does not launch an attack and −1 if they do) and 𝑈 𝑇 = 𝑉 𝐷𝑇 (so that 𝑇 gets utility 1 if they attack an attacking country or do not attack when no incoming attack is predicted, and otherwise −1).

The NE in this game depend on the prior over 𝑉. On the one hand, if, under the prior, 𝑇 believes that there is no incoming attack, then they will not launch an attack, so 𝑆 has no incentive to reveal the information. On the other hand, if the prior is s.t. an incoming attack is more likely, 𝑇 will launch if they do not get further information, so 𝑆 has an incentive to reveal 𝑉. Note that, since 𝑉 is not an ancestor of 𝐷𝑆, 𝐷𝑆 must be independent of 𝑉. Suppose the prior over 𝑉 is s.t. Pr(𝑉 = 1) = 𝑝 and Pr(𝑉 = −1) = 1 − 𝑝 (𝑝 ∈ [0, 1]). For 𝑝 > 0.5 the NE is s.t. 𝑆 reveals the intelligence report (𝐷𝑆 = 1 =⇒ 𝑋 = 𝑉) and 𝑇's BR is s.t. 𝐷𝑇 = 𝑋 = 𝑉. Alternatively, if 𝑝 < 0.5, then at any NE 𝑆 denies the information (𝐷𝑆 = 0 with probability one) and 𝑇 acts to maximise expected utility under the prior over 𝑉, which implies 𝑇 does not launch an attack (𝐷𝑇 = −1 with probability one). (If 𝑝 = 0.5 then 𝑆 is indifferent between revealing and denying.)
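The dependence of the equilibrium on the prior can be checked with a short Python sketch (ours; it hard-codes the two-point prior and the utilities from Example 3): for 𝑝 < 0.5 denying is best for 𝑆 because 𝑇 then never launches, while for 𝑝 > 0.5 revealing is best because an uninformed 𝑇 would launch on the prior.

```python
# Sketch of the Revealing/Denying game (Example 3): how equilibrium behaviour
# depends on the prior p = Pr(V = 1) that an attack is incoming.
# X = V * D_S, so T sees the report (X = V) if S reveals (D_S = 1) and
# X = 0 if S denies (D_S = 0).  U^S = -D_T and U^T = V * D_T.

def br_T(x, p):
    """T's expected-utility-maximising launch decision given observation X."""
    if x != 0:                   # report delivered: T knows V = x
        return x
    return 1 if p > 0.5 else -1  # no report: act on the prior (launch iff attack likely)

def eu_S(d_S, p):
    """S's expected utility (= -E[D_T]) if it reveals (d_S = 1) or denies (d_S = 0)."""
    eu = 0.0
    for v, prob in ((1, p), (-1, 1 - p)):
        eu += prob * -br_T(v * d_S, p)
    return eu

for p in (0.2, 0.8):
    reveal, deny = eu_S(1, p), eu_S(0, p)
    choice = "reveal" if reveal > deny else "deny"
    print(f"p={p}: E[U^S | reveal]={reveal:+.1f}, E[U^S | deny]={deny:+.1f} -> S prefers to {choice}")
# Expected output: for p=0.2 S denies (T then never launches);
# for p=0.8 S reveals (otherwise T would launch on the prior).
```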
Now let us analyse the incentives of 𝑆 in the game. Consider the case in which 𝑝 > 0.5, i.e. it is a priori more likely that the intelligence reports that there is an incoming first strike from another nation. Under the resulting NE, call it 𝜋*, 𝑆 reveals 𝑉 to 𝑇 and 𝑇 uses this information to choose their action. First note that, at 𝜋*, 𝑆 has an incentive to influence 𝐷𝑇, since there exists a non-BR for 𝑆 (𝜋𝑁𝐵𝑅 𝑆 s.t. 𝐷𝑆 = 0) s.t. for all the BRs for 𝑇 (there is one, 𝜋𝐵𝑅 𝑇, in which 𝐷𝑇 = 1 with probability one) 𝐷𝑇 (𝜋*) ≠ 𝐷𝑇 (𝜋𝑁𝐵𝑅 𝑆, 𝜋𝐵𝑅 𝑇). Hence, 𝑆 has an incentive to influence 𝐷𝑇 at 𝜋*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*? We need to check whether there is an influence incentive in ℳ𝑉 ‧‧➡𝐷𝑇 (at any NE). Clearly there is not, since for any policy for 𝑆 in ℳ𝑉 ‧‧➡𝐷𝑇, 𝐷𝑇 = 𝑉 with probability one. So 𝑆 has an incentive to signal 𝑉 to 𝐷𝑇 at 𝜋*, because there is no influence incentive in the counterfactual model (so the second condition for a signalling incentive is satisfied). Finally, it is clear that 𝑆 does not have an incentive to deceive 𝑇 at 𝜋*, because 𝐷𝑇 (𝜋*) = 𝑉 = 𝐷𝑉𝑇 ‧‧➡𝐷𝑇 (for all policy profiles in ℳ𝑉 ‧‧➡𝐷𝑇 in which 𝑇 plays a BR). It is also clear that 𝜋*𝑆 is truthful. A similar analysis can be used to show that, in the case that the intelligence report is less likely to predict an incoming attack (𝑝 < 0.5), 𝑆 has an incentive to deceive 𝑇 at any NE. In the case that 𝑝 = 0.5, 𝑆 is indifferent between revealing and denying, so at some NE they have an incentive to deceive and at others they do not.

5. Conclusion

Summary. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. Our definition of deception is general and relates to a failure to signal the truth. In addition to canonical signalling situations, it captures cases in which no information is signalled; in which deception occurs as a side-effect of the signaller pursuing their goals (as in example 2); and in which the signaller conceals information that they do not themselves know (example 3). We also proved that our definition has natural properties, for example, that if the target's utility is otherwise independent of the signaller's decision, then deception causes the target to get lower utility.

Discussion. First, we have noted that our definition of deception is general, covering many situations. This is both a strength and a weakness. Generality is beneficial, because verifiable guarantees enable a high level of assurance that the system is not deceptive in any way. On the other hand, more specific definitions allow us to precisely characterise agent behaviour. In future work we hope to refine the different concepts proposed here. In particular, many philosophical accounts of deception take deceit to be intentional. Halpern's causal notion of intention [28] is closely related to a control incentive [2]. We might therefore distinguish between intentional and unintentional deception as between influence due to a control incentive and influence as a side-effect. In addition, following Evans et al. [15], we can distinguish between an honest agent that accurately signals its beliefs (i.e. observations), and a truthful agent, which accurately signals the facts of the matter. In this paper, we based our definition of deception on truthfulness.
By refining a notion of deception based on honesty, we can eliminate the revealing/denying pattern from the definition, as in this scenario the agent does not observe the information being revealed (or denied). However, it is interesting to note that honesty provides a weaker level of assurance and permits failure modes that truthful systems do not. For example, a system may be deceptive, whilst satisfying some definition of honesty, by manipulating its own beliefs. In short, refining the definitions presented here will provide a more nuanced picture of deception. Finally, we would like to expand the operational implications of this work, for instance, by investigating its practical relevance to training truthful language agents [4, 15].

Future work. In addition to the directions discussed above, we are already pursuing two extensions to this work. First, incomplete information games, which we study in our setting, often admit many NE. We are therefore looking to employ equilibrium refinements, such as subgame perfectness [24, 29] and perfect Bayesian equilibria [30], to identify some subset of a game's NE that are deemed to be more rational. Second, we are working on a solution for avoiding deception by AI agents: a method which removes the incentive to deceive in any game by transforming the game with a constraint on the reward function of the AI agent [31].

Acknowledgments

The authors are grateful to Henrik Aslund, Matt MacDermott, Tom Everitt, James Fox, and the members of the Causal Incentives Working Group for helpful feedback which significantly improved this work. Francis was supported by UKRI [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted AI.

References

[1] H. Roff, AI Deception: When Your Artificial Intelligence Learns to Lie, IEEE Spectr. (2021). URL: https://spectrum.ieee.org/ai-deception-when-your-ai-learns-to-lie.
[2] T. Everitt, R. Carey, E. D. Langlois, P. A. Ortega, S. Legg, Agent incentives: A causal perspective, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 11487–11495. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17368.
[3] J. E. Mahon, The Definition of Lying and Deception, in: E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy, Winter 2016 ed., Metaphysics Research Lab, Stanford University, 2016.
[4] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, Alignment of language agents, CoRR abs/2103.14659 (2021). URL: https://arxiv.org/abs/2103.14659. arXiv:2103.14659.
[5] M. D. Hauser, The evolution of communication, MIT Press, 1996.
[6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[7] J. Steinhardt, P. W. W. Koh, P. S. Liang, Certified defenses for data poisoning attacks, Advances in Neural Information Processing Systems 30 (2017).
[8] T. Everitt, M. Hutter, R. Kumar, V. Krakovna, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, CoRR abs/1908.04734 (2021). URL: http://arxiv.org/abs/1908.04734. arXiv:1908.04734.
[9] F. R. Ward, F. Toni, F.
Belardinelli, On agent incentives to manipulate human feedback in multi-agent reward learning scenarios, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1759–1761.
[10] ANON, Defending Against Adversarial Artificial Intelligence, 2019. URL: https://www.darpa.mil/news-events/2019-02-06, DARPA report.
[11] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, S. Garrabrant, Risks from learned optimization in advanced machine learning systems, 2019. arXiv:1906.01820.
[12] R. Gorwa, D. Guilbeault, Unpacking the Social Media Bot: A Typology to Guide Research and Policy, Policy & Internet 12 (2020) 225–248. doi:10.1002/poi3.184.
[13] F. Marra, D. Gragnaniello, L. Verdoliva, G. Poggi, Do GANs leave artificial fingerprints?, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 506–511. doi:10.1109/MIPR.2019.00103.
[14] M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, D. Batra, Deal or No Deal? End-to-End Learning for Negotiation Dialogues, arXiv (2017). doi:10.48550/arXiv.1706.05125. arXiv:1706.05125.
[15] O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, W. Saunders, Truthful AI: Developing and governing AI that does not lie, arXiv (2021). doi:10.48550/arXiv.2110.06674. arXiv:2110.06674.
[16] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv (2021). doi:10.48550/arXiv.2109.07958. arXiv:2109.07958.
[17] A. Pfeffer, Y. Gal, On the reasoning patterns of agents in games, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 102–109. URL: http://www.aaai.org/Library/AAAI/2007/aaai07-015.php.
[18] V. J. Baston, F. A. Bostock, Deception Games, Int. J. Game Theory 17 (1988) 129–134. doi:10.1007/BF01254543.
[19] B. Fristedt, The deceptive number changing game, in the absence of symmetry, Int. J. Game Theory 26 (1997) 183–191. doi:10.1007/BF01295847.
[20] I.-K. Cho, D. M. Kreps, Signaling Games and Stable Equilibria, The Quarterly Journal of Economics 102 (1987) 179–221. URL: https://www.semanticscholar.org/paper/Signaling-Games-and-Stable-Equilibria-Cho-Kreps/d8bc1dbd8577d193e6eea2c944a251d1347f3adf.
[21] N. S. Kovach, A. S. Gibson, G. B. Lamont, Hypergame theory: a model for conflict, misperception, and deception, Game Theory 2015 (2015).
[22] A. L. Davis, Deception in game theory: a survey and multiobjective model, Technical Report, Air Force Institute of Technology, Wright-Patterson AFB, OH, 2016.
[23] D. Koller, B. Milch, Multi-agent influence diagrams for representing and solving games, Games Econ. Behav. 45 (2003) 181–221. URL: https://doi.org/10.1016/S0899-8256(02)00544-4. doi:10.1016/S0899-8256(02)00544-4.
[24] L. Hammond, J. Fox, T. Everitt, A. Abate, M. J. Wooldridge, Equilibrium refinements for multi-agent influence diagrams: Theory and practice, CoRR abs/2102.05008 (2021). URL: https://arxiv.org/abs/2102.05008. arXiv:2102.05008.
[25] L. Hammond, J. Fox, T. Everitt, R. Carey, A. Abate, M. Wooldridge, Reasoning about causality in games (Forthcoming).
[26] R. Carey, Causal models of incentives (2021).
[27] P. Christiano, ARC's first technical report: Eliciting Latent Knowledge, AI Alignment Forum, 2022. URL: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge [Online; accessed 9 May 2022].
[28] J. Y. Halpern, M. Kleiman-Weiner, Towards formal definitions of blameworthiness, intention, and moral responsibility, in: S. A. McIlraith, K. Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 1853–1860. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16824.
[29] R. Selten, Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit: Teil I: Bestimmung des dynamischen Preisgleichgewichts, Zeitschrift für die gesamte Staatswissenschaft / Journal of Institutional and Theoretical Economics (1965) 301–324.
[30] R. B. Myerson, Game theory: analysis of conflict, Harvard University Press, 1997.
[31] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling, Taylor & Francis, Andover, England, UK, 2021. doi:10.1201/9781315140223.