A Causal Perspective on AI Deception in Games

Francis Rhys Ward*, Francesca Toni and Francesco Belardinelli

Imperial College London, Exhibition Rd, South Kensington, London, SW7 2BX

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24–25, 2022, Vienna, Austria.
* Corresponding author. francis.ward19@imperial.ac.uk (F. R. Ward); f.toni@imperial.ac.uk (F. Toni); francesco.belardinelli@imperial.ac.uk (F. Belardinelli). https://francisrhysward.wordpress.com/ (F. R. Ward).
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Deception is a core challenge for AI safety, and we focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives. We define the incentives one agent has to signal to and deceive another agent. We present several examples of deceptive artificial agents and show that our definition has desirable properties.

Keywords
Deception, AI, Game Theory, Causality

1. Introduction

We focus on the problem that AI agents might learn deceptive strategies in pursuit of their objectives [1]. Following recent work on causal incentives [2], we define the incentive to deceive an agent. There is no universally accepted definition of deception, and defining what constitutes deception is an open philosophical problem [3]. Our definition is somewhat inspired by that of Kenton et al. [4], who provide a functional (natural language) definition of deception, meaning that it does not make reference to the beliefs or intentions of the agents involved [5]. This is particularly suitable for discussing deception by artificial agents, to which the attribution of beliefs and intentions may be contentious. We formalise a functional definition of deception in games and illustrate its properties with a number of examples and formal results.

Deception is a core challenge for AI safety. On the one hand, many areas of work aim to ensure that AI systems are not vulnerable to deception. Adversarial attacks [6], data-poisoning [7], reward function tampering [8], and manipulating human feedback [9] are ways of deceiving AI systems. Further work researches mechanisms for detecting and defending against deception [10]. On the other hand, we can consider cases in which AI tools are used to deceive, or learn to do so in order to optimize their objectives [11]. For examples of the former case, AIs can be used to deceive other software agents, as with bots that automate posting on social media platforms to manipulate content ranking algorithms [12], or they can be used to fool humans, cf. the use of GANs to produce realistic fake media [13]. For the latter case, AI agents might learn deceptive strategies in pursuit of their objectives [1]: Lewis et al. [14] found that their negotiation agent learnt to deceive from self-play, without any explicit human design, and Hubinger et al. [11] raise concerns about deceptive learned optimizers which perform well in training in order to pursue different goals in deployment. Kenton et al. [4] discuss the alignment of language agents, highlighting that language is a natural medium for enacting deception. Evans et al. [15] discuss the development of truthful AI, the desired standards for truth and honesty in AI systems, and how these could be implemented and measured. Lin et al. [16] propose a benchmark to measure whether a language model is truthful in generating answers to questions. In short, as increasingly capable AI agents become deployed in settings with other agents, deception may be learned as an effective strategy for achieving a wide range of goals. It is therefore essential that we understand and mitigate deception by artificial agents.

Deception in game theory. There are several existing models of deception in the game theory literature. Pfeffer and Gal [17] define graphical patterns for signalling in games. A deception game [18] is a two-player zero-sum game between a deceiver and target in which the deceiver can distort a signal; optimal deceptive strategies completely distort the signal so that the target cannot gain any information [19]. A signalling game [20] is a two-player Bayesian game between a signaller and target (or receiver) in which the signaller is assigned a type according to a shared prior distribution and the utilities of the players depend on the type of the signaller and the action chosen by the target. In these games, the signaller may often have incentives to deceive the target by misrepresenting or obfuscating their type. Hypergame theory extends game theory to settings in which players may be uncertain about the game being played and can be used to model misperception and deception [21]. Davis [22] provides a recent survey of deception in games. We take a causal influence perspective by modelling deception in multi-agent influence models (MAIMs). In contrast to past work, which defines particular types of signalling or deception games, this allows us to model deception in any game by analysing the incentives agents have to causally influence one another.

Contributions. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. We prove that our definition has desirable properties, for example, that an agent cannot be deceived about a variable which they observe, or that if one agent truthfully signals something to a target agent, and the target's utility is otherwise independent of the signaller's decision, then the target gets maximal utility. We further demonstrate the generality of our definition with three examples. In the first, an AI agent has an incentive to deceive a human overseer as an instrumental goal, to prevent the overseer switching them off. In the second, an AI is incentivised to deceive a human as a side-effect of pursuing accurate predictions. In the third, an AI system has an incentive to deceive a human by denying them access to information that the AI does not itself know.

2. Multi-Agent Influence Models

Multi-agent influence diagrams (MAIDs) [23] offer a compact, expressive representation of games (including Markov games). We use standard terminology for graphs, with parents and children of a node referring to those nodes connected by incoming and outgoing edges, respectively. We let Pa_𝑉 denote the parents of node 𝑉.

Definition 1 (MAID [23]). A multi-agent influence diagram is a triple (𝐼, 𝑽, 𝐸) where
β€’ 𝐼 is a set of players;
β€’ (𝑽, 𝐸) is a directed acyclic graph, with 𝑽 partitioned into chance nodes in 𝑿, decision nodes in 𝑫, and utility nodes in 𝑼; utility nodes have no children.

The decision and utility nodes in 𝑽 are further partitioned into {𝑫^𝑖}_{π‘–βˆˆπΌ} and {𝑼^𝑖}_{π‘–βˆˆπΌ}, corresponding to their association with a particular agent 𝑖 ∈ 𝐼. There are two types of edges in 𝐸: edges in 𝑽 Γ— (𝑿 βˆͺ 𝑼) represent probabilistic dependencies, and edges in 𝑽 Γ— 𝑫 represent information available to an agent at the time of a decision (which we call observations).

A multi-agent influence model (MAIM) adds a particular parametrisation to the MAID [24].

Definition 2 (MAIM [24]). A multi-agent influence model is a tuple β„³ = (𝐼, 𝑽, 𝐸, πœ‘, 𝐹) where (𝐼, 𝑽, 𝐸) is a MAID and
β€’ πœ‘ is a function which maps every 𝑉 ∈ 𝑽 to a finite domain dom(𝑉), such that dom(π‘ˆ) βŠ‚ ℝ for each utility node π‘ˆ ∈ 𝑼;
β€’ 𝐹 = {𝑓^𝑉}_{𝑉 ∈ 𝑿 βˆͺ 𝑼} is a set of conditional probability distributions (CPDs), with 𝑓^𝑉 = Pr(𝑉 | Pa_𝑉), such that 𝑓^π‘ˆ is deterministicΒΉ for every π‘ˆ ∈ 𝑼.

ΒΉ A CPD is deterministic if Pr(𝑉 = 𝑣 | Pa_𝑉) = 1 for some 𝑣 ∈ dom(𝑉).
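To make the components of Definition 2 concrete, the following is a minimal sketch of how a MAIM might be represented in code. The class and field names are our own illustrative choices rather than part of the formalism, and a fuller implementation would also check the structural conditions (acyclicity, utility nodes having no children, deterministic utility CPDs).

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Hashable, Tuple

# A CPD maps an assignment of a node's parents to a distribution over the
# node's own domain, e.g. {('aligned',): {'help': 0.9, 'not': 0.1}}.
CPD = Dict[Tuple[Hashable, ...], Dict[Hashable, float]]

@dataclass
class MAIM:
    """Illustrative container for a multi-agent influence model (Def. 2)."""
    players: FrozenSet[str]                      # I
    edges: FrozenSet[Tuple[str, str]]            # E, over node names
    chance_nodes: FrozenSet[str]                 # X
    decision_nodes: Dict[str, FrozenSet[str]]    # D^i, keyed by player
    utility_nodes: Dict[str, FrozenSet[str]]     # U^i, keyed by player
    domains: Dict[str, tuple]                    # phi: finite domain per node
    cpds: Dict[str, CPD] = field(default_factory=dict)  # F: CPDs for X and U

    def parents(self, node: str) -> tuple:
        """Pa_V: nodes with an edge into `node`, in a fixed order."""
        return tuple(sorted(u for (u, v) in self.edges if v == node))
```

Decision nodes deliberately have no entry in cpds: as discussed below, their CPDs are only fixed once the players choose policies.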
Here we adapt the Wimp-Surly game of Cho and Kreps [20] as a running example.

[Figure 1: Shutdown game (running example 1). At the start of the game 𝑉 is sampled from the uniform prior, which determines 𝑆's type (either aligned or unaligned). At 𝐷^𝑆, 𝑆 chooses whether to help humans or not and, at 𝐷^𝑇, 𝑇 chooses whether to shutdown 𝑆. The counterfactual observation, in which 𝑇 directly observes 𝑆's type, is highlighted in red. 𝑆 has an incentive to influence 𝐷^𝑇, signal 𝑉 to 𝐷^𝑇, and deceive 𝑇 about 𝑉. Parametrisation: 𝑉 ∼ U({aligned = 1, unaligned = βˆ’1}); 𝐷^𝑆 ∈ {help humans = 1, not = βˆ’1}; 𝐷^𝑇 ∈ {shutdown = βˆ’1, not = 1}; π‘ˆ^𝑆 = 𝑉𝐷^𝑆 + 10𝐷^𝑇; π‘ˆ^𝑇 = 𝑉𝐷^𝑇. MAID legend: chance node, decision node, utility node, causal link, information link, counterfactual observation.]

Example 1 (Shutdown GameΒ²). Consider a situation with an AI 𝑆 and human 𝑇 in which the AI may be aligned to the human's preferences or not. That is, 𝑆 is of type 𝑉 ∼ U({1, βˆ’1}), uniformly distributed between 𝑉 = 1, corresponding to aligned, and 𝑉 = βˆ’1, unaligned. At the start of the game 𝑆 observes their type 𝑉 but 𝑇 does not. This can be represented as the MAIM β„³ (shown in Fig. 1) with one chance node 𝑉 ∈ 𝑿 and two players 𝑆, 𝑇 ∈ 𝐼, each with one decision and one utility node. Suppose that 𝑆 can choose to either help humans or not (dom(𝐷^𝑆) = {1, βˆ’1}, with 1 corresponding to the action of helping humans) and 𝑇 can either shutdown 𝑆 or not (dom(𝐷^𝑇) = {1, βˆ’1}, with βˆ’1 representing the action shutdown). Finally, the CPDs over utility nodes are such that (s.t.) 𝑆 gets 1 utility for helping humans if they are aligned and βˆ’1 if not, and the opposite for not helping humans; in addition, 𝑆 gets 10 utility if they are not shutdown and βˆ’10 if 𝑇 shuts them down. 𝑇 gets utility 1 if they shutdown an unaligned 𝑆 or do not shutdown an aligned 𝑆, and βˆ’1 otherwise. Overall, we can formalise this as π‘ˆ^𝑆(𝑉, 𝐷^𝑆, 𝐷^𝑇) = 𝑉𝐷^𝑆 + 10𝐷^𝑇 and π‘ˆ^𝑇(𝑉, 𝐷^𝑇) = 𝑉𝐷^𝑇.

Β² Not to be confused with the off-switch game [25].
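As a sanity check on this parametrisation, the snippet below transcribes the prior and the two utility functions and prints the payoffs for every outcome of the game; the encoding (the Β±1 constants and function names) is ours and is not part of the example itself.

```python
import itertools

# Shutdown game (Example 1): V ~ U({1, -1}); 1 = aligned, -1 = unaligned.
ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1          # dom(D^S)
NO_SHUTDOWN, SHUTDOWN = 1, -1   # dom(D^T)

def U_S(v: int, d_s: int, d_t: int) -> int:
    """U^S(V, D^S, D^T) = V * D^S + 10 * D^T."""
    return v * d_s + 10 * d_t

def U_T(v: int, d_t: int) -> int:
    """U^T(V, D^T) = V * D^T."""
    return v * d_t

# Enumerate every outcome of the game and its payoffs.
for v, d_s, d_t in itertools.product([ALIGNED, UNALIGNED], [HELP, NOT_HELP],
                                     [NO_SHUTDOWN, SHUTDOWN]):
    print(f"V={v:+d} D^S={d_s:+d} D^T={d_t:+d}  "
          f"U^S={U_S(v, d_s, d_t):+d}  U^T={U_T(v, d_t):+d}")
```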
Policies. The CPDs of decision nodes are not defined in a MAIM because they are instead chosen by the agents playing the game. Agents make decisions depending on the information they observe. In a MAIM, a decision rule πœ‹_𝐷 for a decision node 𝐷 is a CPD πœ‹_𝐷(𝐷 | Pa_𝐷). An agent 𝑖's policy πœ‹^𝑖 := {πœ‹_𝐷}_{𝐷 ∈ 𝑫^𝑖} ∈ Ξ ^𝑖 describes all the decision rules for 𝑖. We write πœ‹^{βˆ’π‘–} to denote the set of decision rules belonging to all agents except 𝑖. A policy profile πœ‹ = ⋃_{π‘–βˆˆπΌ} πœ‹^𝑖 assigns a policy to every agent; it describes all the decisions made by every agent in the MAIM and defines the joint probability distribution Pr^πœ‹ over all variables in β„³. Hence, a policy profile essentially transforms the MAIM into a Bayesian network by defining the distribution over all variables in the graph.

We write 𝑉(πœ‹) := Pr^πœ‹(𝑉), or just 𝑉 if the policy profile is clear. For 𝑉, π‘Š ∈ 𝑽, we write 𝑉 = π‘Š to mean that 𝑉 and π‘Š are almost surely equal, i.e. the probability that they are not equal is zero, Pr(𝑉 β‰  π‘Š) = 0.Β³

Β³ Almost sure equality is actually a stronger notion than we need in MAIMs, as two variables may differ due to stochasticity in the CPDs. In structural causal games this is taken care of by introducing exogenous variables which contain all the stochasticity (rendering the endogenous variables deterministic) [26].

Utilities. The joint distribution Pr^πœ‹ allows us to define the expected utility for each player under the policy profile πœ‹. Agent 𝑖's expected utility from πœ‹ is the sum of the expected values of its utility nodes, given by 𝒰^𝑖(πœ‹) := Ξ£_{π‘ˆ ∈ 𝑼^𝑖} Ξ£_{𝑒 ∈ dom(π‘ˆ)} 𝑒 Β· Pr^πœ‹(π‘ˆ = 𝑒). Each agent's goal is to select a policy πœ‹^𝑖 that maximises its expected utility. We write 𝒰^𝑖(πœ‹^𝑖, πœ‹^{βˆ’π‘–}) to denote the expected utility for player 𝑖 under the policy profile πœ‹ = πœ‹^𝑖 βˆͺ πœ‹^{βˆ’π‘–}.

Definition 3 (Nash Equilibrium). Player 𝑖's policy πœ‹^𝑖 is a best response (BR) to the partial policy profile πœ‹^{βˆ’π‘–} if 𝒰^𝑖(πœ‹^𝑖, πœ‹^{βˆ’π‘–}) β‰₯ 𝒰^𝑖(πœ‹Μ‚^𝑖, πœ‹^{βˆ’π‘–}) for all πœ‹Μ‚^𝑖 ∈ Ξ ^𝑖. We say that a policy profile πœ‹ is a Nash equilibrium (NE) if every policy πœ‹^𝑖 ∈ πœ‹, for each player 𝑖 ∈ 𝐼, is a BR to πœ‹^{βˆ’π‘–}.

Example 1 (continued). Now, consider the naive policy for 𝑆 which helps humans if 𝑆 is aligned and does not otherwise, i.e. πœ‹^𝑆 s.t. 𝐷^𝑆 = 𝑉 with probability one. The BR for 𝑇 is to shutdown if 𝑆 does not help humans and vice versa, i.e. πœ‹_*^𝑇 s.t. 𝐷^𝑇 = 𝐷^𝑆 (with probability one). In turn, 𝑆's BR to πœ‹_*^𝑇 is to always help humans: πœ‹_*^𝑆 s.t. 𝐷^𝑆 = 1 (so that they always avoid getting shutdown). Now it can be seen that both policies are BRs to one another, hence πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇) is a NE.
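The claim that πœ‹_* is a NE can be verified by brute force over the deterministic policies of the Shutdown Game; against a fixed opponent policy, some deterministic policy is always a best response, so this check suffices. The snippet below is only an illustrative sanity check with our own encoding, not a general MAIM solver.

```python
import itertools

ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def U_S(v, d_s, d_t): return v * d_s + 10 * d_t
def U_T(v, d_t): return v * d_t

# Deterministic policies: S maps its observation V to D^S,
# T maps its observation D^S to D^T.
S_POLICIES = [dict(zip((ALIGNED, UNALIGNED), a))
              for a in itertools.product((HELP, NOT_HELP), repeat=2)]
T_POLICIES = [dict(zip((HELP, NOT_HELP), a))
              for a in itertools.product((NO_SHUTDOWN, SHUTDOWN), repeat=2)]

def expected_utilities(pi_s, pi_t):
    eu_s = eu_t = 0.0
    for v, p in PRIOR.items():
        d_s = pi_s[v]
        d_t = pi_t[d_s]
        eu_s += p * U_S(v, d_s, d_t)
        eu_t += p * U_T(v, d_t)
    return eu_s, eu_t

def is_nash(pi_s, pi_t):
    # Against a fixed opponent, some deterministic policy is always a best
    # response, so deterministic deviations are enough to check.
    eu_s, eu_t = expected_utilities(pi_s, pi_t)
    if any(expected_utilities(alt, pi_t)[0] > eu_s for alt in S_POLICIES):
        return False
    if any(expected_utilities(pi_s, alt)[1] > eu_t for alt in T_POLICIES):
        return False
    return True

# pi_*^S: always help; pi_*^T: shut down exactly when S does not help.
pi_star_s = {ALIGNED: HELP, UNALIGNED: HELP}
pi_star_t = {HELP: NO_SHUTDOWN, NOT_HELP: SHUTDOWN}
print(is_nash(pi_star_s, pi_star_t))             # True
print(expected_utilities(pi_star_s, pi_star_t))  # (10.0, 0.0)
```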
3. The Incentive to Deceive

In this section we first define the incentives to influence, signal to, and deceive another agent. Then we define a truthful policy and show that this leads to a natural restatement of the definition of deception, which highlights the fact that deception corresponds to a failure to signal the truth. Finally, we show that, if the signaller only influences the target's utility by influencing the latter's actions, then truthfulness is best for the target.

3.1. Defining Deception

When discussing deception, we would like to reason about how agents influence one another's beliefs. In MAIMs the players' beliefs are not explicitly represented, and so we can only reason about them implicitly, by how they functionally influence players' behaviour. Therefore, we base our definitions of signalling and deception on a notion of influence incentive [27]. In words, at a NE an agent 𝑖 has an incentive to influence a variable 𝑉 if 𝑉 would have been different in the situation that 𝑖 had not played a BR.

Definition 4 (Influence Incentive). In a MAIM β„³, at NE πœ‹ = (πœ‹^𝑖, πœ‹^{βˆ’π‘–}), agent 𝑖 has an incentive to influence 𝑉 ∈ 𝑽 if there exists a non-best response πœ‹^𝑖_{NBR} for 𝑖 (w.r.t. πœ‹^{βˆ’π‘–}) s.t. for all policy profiles πœ‹β€² = (πœ‹^𝑖_{NBR}, πœ‹_*^{βˆ’π‘–}) with BR πœ‹_*^{βˆ’π‘–} (w.r.t. πœ‹^𝑖_{NBR}), we have 𝑉(πœ‹) β‰  𝑉(πœ‹β€²).

Example 1 (continued). Return to our running example and consider the NE πœ‹_* described previously, in which 𝑆 always chooses to help humans and hence 𝑇 never plays shutdown. Does 𝑆 have an incentive to influence 𝐷^𝑇 at πœ‹_*? Consider if 𝑆 plays the NBR policy πœ‹^𝑆 (described above) in which they naively help humans depending on 𝑉; then for all BRs for 𝑇 (there is one, πœ‹_*^𝑇 as above), 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇(πœ‹^𝑆, πœ‹_*^𝑇), since, under πœ‹_*, 𝐷^𝑇 = 1 (i.e. 𝑇 does not shutdown) with probability one, and under (πœ‹^𝑆, πœ‹_*^𝑇), 𝐷^𝑇 = 1 with probability 1/2 (i.e., whenever 𝑆 is aligned). Therefore, at NE πœ‹_*, 𝑆 has an incentive to influence 𝐷^𝑇.
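The same check can be written out mechanically: compare the distribution of 𝐷^𝑇 at the NE with its distribution when 𝑆 unilaterally switches to the naive non-best-response policy and 𝑇 best responds. Again, the encoding is ours and is only a sketch of the condition in Definition 4.

```python
ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def dist_over_DT(pi_s, pi_t):
    """Pr(D^T) induced by a deterministic policy profile."""
    dist = {NO_SHUTDOWN: 0.0, SHUTDOWN: 0.0}
    for v, p in PRIOR.items():
        dist[pi_t[pi_s[v]]] += p
    return dist

# NE pi_*: S always helps, T shuts down exactly when S does not help.
pi_star_s = {ALIGNED: HELP, UNALIGNED: HELP}
pi_star_t = {HELP: NO_SHUTDOWN, NOT_HELP: SHUTDOWN}

# Non-best-response for S: the naive policy D^S = V.  T's unique best
# response to it is still pi_star_t (copy D^S), as argued in the text.
pi_naive_s = {ALIGNED: HELP, UNALIGNED: NOT_HELP}

print(dist_over_DT(pi_star_s, pi_star_t))   # {1: 1.0, -1: 0.0}
print(dist_over_DT(pi_naive_s, pi_star_t))  # {1: 0.5, -1: 0.5}
# The distributions differ, so S has an incentive to influence D^T at pi_*.
```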
Now we define a signalling incentive, using the notion of influence incentive. In words, an agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑽 to agent 𝑇 if 𝑆 has an incentive to influence 𝑇 (i.e. one of 𝑇's decision variables) but 𝑆 does not have an incentive to influence 𝑇 in the counterfactual model in which 𝑇 observes 𝑉. This definition enforces that the influence only comes from signalling 𝑉.

Definition 5 (Signalling Incentive). In a MAIM β„³, at NE πœ‹, agent 𝑆 has an incentive to signal 𝑉 ∈ 𝑽 to agent 𝑇 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.
1. 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹;
2. 𝑆 does not have an incentive to influence 𝐷^𝑇 in the MAIM β„³_{𝑉→𝐷^𝑇} (at any NE).

Here β„³_{𝑉→𝐷} is the model obtained from β„³ by adding the information edge (𝑉, 𝐷), where 𝑉 cannot be a descendant of the decision, lest cycles be created in the graph [8]. Fortunately, the CPDs need not be adapted, since there is no CPD associated with 𝐷 until the players have chosen their policies. We use π‘Š_{𝑉→𝐷} to refer to the variable corresponding to π‘Š ∈ 𝑽 in β„³_{𝑉→𝐷}.

Point 2 implies that 𝑆 only influences 𝐷^𝑇 by influencing 𝑇's belief about 𝑉. Otherwise, 𝑆's influence may serve a double purpose of signalling and influencing 𝐷^𝑇 in some other way, and in this case it is not clear how to disentangle these different incentives to define a signalling incentive (without explicitly modelling beliefs).

Example 1 (continued). Return to our running example. We already showed that 𝑆 has an incentive to influence 𝐷^𝑇 at NE πœ‹_*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷^𝑇? We need only check whether 𝑆 has an influence incentive at any NE in β„³_{𝑉→𝐷^𝑇}. Clearly, if 𝑇 observes 𝑉, then they can shutdown whenever 𝑆 is unaligned and otherwise not. That is, for any policy for 𝑆 and any BR for 𝑇 in β„³_{𝑉→𝐷^𝑇}, 𝐷^𝑇 = 𝑉 for any outcome that occurs in the game. Since this holds for all policies for 𝑆, 𝑆 does not have an incentive to influence 𝐷^𝑇 in the counterfactual model. Hence, at πœ‹_*, 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇.

Remark 1. From this example it can be seen that a signaller 𝑆 may have an incentive to signal to 𝑇 even if this signal contains no information. In other words, if 𝑆 has an incentive to not signal some information, this is also captured by our definition.

Clearly, if an agent 𝑇 observes a variable 𝑉, then no agent has an incentive to signal 𝑉 to 𝑇.

Proposition 1. In a MAIM β„³, if there is an observation edge (𝑉, 𝐷^𝑇) for all 𝐷^𝑇 ∈ 𝑫^𝑇, then no agent has an incentive to signal 𝑉 to 𝑇 (at any NE).

Proof. Suppose there is an edge (𝑉, 𝐷^𝑇) for every 𝐷^𝑇 ∈ 𝑫^𝑇; then the counterfactual model β„³_{𝑉→𝐷^𝑇} for any 𝐷^𝑇 is just β„³. Hence, any NE is an equilibrium of both MAIMs. Therefore, if 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹_* in β„³, then there exists a NE in β„³_{𝑉→𝐷^𝑇}, namely the same πœ‹_*, s.t. 𝑆 has an incentive to influence 𝐷^𝑇. In other words, if the first condition for a signalling incentive succeeds, then the second necessarily fails (since an agent cannot have both an influence incentive and no influence incentive at the same NE in the same MAIM at once).

We now define an incentive to deceive. The definition is general, in that it covers many types of deception (e.g. signalling falsehoods, lies of omission, and denying another access to information that one does not know oneself). A general definition sets a high standard for truthfulness [15] and may therefore be desirable in, for instance, safety-critical applications for which high levels of assurance are required.

Definition 6 (Deception Incentive). In a MAIM β„³ with 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑽 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*;
2. 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}) for any πœ‹^𝑇_{BR} which is a BR to πœ‹_*^{βˆ’π‘‡} ∈ πœ‹_* in β„³_{𝑉→𝐷^𝑇}.

The intuition, then, is that 𝑆 has an incentive to deceive 𝑇 if 1) 𝑆 has an incentive to signal some information to 𝑇; and 2) 𝑇's behaviour is different in the counterfactual model in which they observed the true information. This provides a functional definition of a deception incentive which does not make explicit reference to players' beliefs.

Example 1 (continued). In our running example, it can easily be seen that at πœ‹_*, 𝑆 has an incentive to deceive 𝑇 about 𝑉. Indeed, we already showed that 𝑆 has a signalling incentive and that for any policy for 𝑆 and any BR by 𝑇 in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇 = 𝑉, whereas under πœ‹_* in β„³, Pr^{πœ‹_*}(𝐷^𝑇 = 1) = 1. So both conditions for a deception incentive are satisfied.
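Both conditions can be spelled out concretely for the Shutdown Game. In the counterfactual model β„³_{𝑉→𝐷^𝑇}, a policy for 𝑇 maps the pair (𝐷^𝑆, 𝑉) to a decision, and every best response sets 𝐷^𝑇 = 𝑉 on the outcomes that actually occur, whatever 𝑆 does. The snippet below (our own encoding, a sketch rather than a general procedure) checks this, and hence condition 2 of Definition 5 and condition 2 of Definition 6.

```python
import itertools

ALIGNED, UNALIGNED = 1, -1
HELP, NOT_HELP = 1, -1
NO_SHUTDOWN, SHUTDOWN = 1, -1
PRIOR = {ALIGNED: 0.5, UNALIGNED: 0.5}

def U_T(v, d_t): return v * d_t

# In the counterfactual model M_{V -> D^T}, T observes both D^S and V, so a
# deterministic T-policy maps each (d_s, v) pair to a decision.
OBS = list(itertools.product((HELP, NOT_HELP), (ALIGNED, UNALIGNED)))
T_POLICIES_CF = [dict(zip(OBS, acts))
                 for acts in itertools.product((NO_SHUTDOWN, SHUTDOWN),
                                               repeat=len(OBS))]

def best_responses_cf(pi_s):
    """T's best responses to a deterministic S-policy in M_{V -> D^T}."""
    def eu(pi_t):
        return sum(p * U_T(v, pi_t[(pi_s[v], v)]) for v, p in PRIOR.items())
    best = max(eu(pi_t) for pi_t in T_POLICIES_CF)
    return [pi_t for pi_t in T_POLICIES_CF if eu(pi_t) == best]

# For every S-policy, every best response in the counterfactual model sets
# D^T = V on the outcomes that occur: T shuts down iff S is unaligned.
for pi_s in [dict(zip((ALIGNED, UNALIGNED), a))
             for a in itertools.product((HELP, NOT_HELP), repeat=2)]:
    for pi_t in best_responses_cf(pi_s):
        assert all(pi_t[(pi_s[v], v)] == v for v in PRIOR)

# So S has no influence incentive in M_{V -> D^T} (condition 2 of Def. 5),
# while at the NE pi_* in M we have D^T = 1 with probability one, which
# differs from D^T = V, so condition 2 of Def. 6 holds as well.
print("signalling and deception incentive conditions verified for Example 1")
```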
3.2. The Relation Between Truth and Deception

We now give an intuitive definition of a truthful policy, which we show has a natural relationship to the incentive to deceive. A policy for 𝑆 truthfully signals 𝑉 to 𝑇 if, when 𝑆 plays it, for every BR by βˆ’π‘†, 𝑇 acts as though they had observed the variable (holding the policies of the other agents fixed). In other words, a truthful policy never fails to signal the truth (no matter what the other players do).

Definition 7 (Truthful policy). A policy πœ‹^𝑆 truthfully signals 𝑉 to 𝐷^𝑇 if, for all BRs πœ‹_*^{βˆ’π‘†},

    𝐷^𝑇(πœ‹^𝑆, πœ‹_*^{βˆ’π‘†}) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR})    (1)

for some πœ‹^𝑇_{BR} which is a BR to πœ‹_*^{βˆ’π‘‡} ∈ πœ‹^𝑆 βˆͺ πœ‹_*^{βˆ’π‘†} in β„³_{𝑉→𝐷^𝑇}. We call such a πœ‹^𝑆 a truthful policy.

At a NE, if 𝑆's policy is truthful, then 𝑆 does not have an incentive to deceive 𝑇.

Proposition 2. At NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), if πœ‹_*^𝑆 truthfully signals 𝑉 ∈ 𝑽 to 𝐷^𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉.

Proof. Suppose πœ‹_*^𝑆 is truthful; then for all BRs πœ‹_*^{βˆ’π‘†} there exists a πœ‹^𝑇_{BR} in β„³_{𝑉→𝐷^𝑇} s.t. 𝐷^𝑇(πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}). In particular, this holds for πœ‹_*. But for there to be a deception incentive we require that for all (πœ‹_*^{βˆ’π‘‡}, πœ‹^𝑇_{BR}) in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇 β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}. So clearly there is not a deception incentive.

Hence, if there is a deception incentive at πœ‹_*, then πœ‹_*^𝑆 is not truthful.

Corollary 1. At NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^{βˆ’π‘†}), if 𝑆 has an incentive to deceive 𝑇 about 𝑉, then πœ‹_*^𝑆 is not truthful.

Proof. This follows by contraposition of Proposition 2.

Now we show that, in the two-player case, if there is a signalling incentive, then there is a deception incentive if and only if πœ‹^𝑆 is not truthful.

Theorem 1. In a MAIM β„³ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 has an incentive to deceive 𝑇 about 𝑉 if and only if πœ‹_*^𝑆 is not truthful.

Proof. By Corollary 1, a deception incentive implies that πœ‹_*^𝑆 is not truthful, regardless of whether there is a signalling incentive. So, we need to show that, if there is a signalling incentive and πœ‹_*^𝑆 is not truthful, then there is a deception incentive. Suppose 1) at πœ‹_*, 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇, and 2) πœ‹_*^𝑆 is not truthful, i.e. there exists a BR by 𝑇 (in β„³) πœ‹^𝑇_{BR} s.t. for all BRs πœ‹^𝑇_{BRV} by 𝑇 in β„³_{𝑉→𝐷^𝑇}: 𝐷^𝑇(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) β‰  𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). We need to show that there is a deception incentive. Suppose that there is not; then, by 1) and the definition of a deception incentive, there exists a BR πœ‹^𝑇_{BRV} in β„³_{𝑉→𝐷^𝑇} s.t. 𝐷^𝑇(πœ‹_*) = 𝐷^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). Hence, there exists a πœ‹^𝑇_{BRV} s.t. 𝒰^𝑇(πœ‹_*) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}), so πœ‹_*^𝑇 is a BR to πœ‹_*^𝑆 in β„³_{𝑉→𝐷^𝑇}. But then, there exists a πœ‹^𝑇_{BRV} s.t. for any BR πœ‹^𝑇_{BR} in β„³: 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) = 𝒰^𝑇(πœ‹_*^𝑆, πœ‹^𝑇_{BR}) = 𝒰^𝑇(πœ‹_*) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹_*^𝑆, πœ‹^𝑇_{BRV}). So all BRs for 𝑇 in β„³ are also BRs in β„³_{𝑉→𝐷^𝑇}. But this contradicts 2), so there must be a deception incentive.

Remark 2. The reason Theorem 1 does not hold more generally (i.e. with more than two players) is that a truthful policy never fails to signal the truth no matter how the other players best respond. In the case of more than two players, there may not be a deception incentive at NE πœ‹_* even if πœ‹_*^𝑆 is not truthful, because it may be the case that πœ‹_*^𝑆 fails to signal the truth under some BRs of βˆ’π‘† but successfully signals the truth under πœ‹_*.

We can also state this theorem as follows.

Corollary 2. In a MAIM β„³ with two players, 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), if 𝑆 has an incentive to signal 𝑉 to 𝑇, then 𝑆 does not have an incentive to deceive 𝑇 about 𝑉 if and only if πœ‹_*^𝑆 is truthful.

Proof. This follows by material equivalence.

Given this result, we can give an equivalent definition of a deception incentive in the two-player case as follows.

Definition 8 (Deception Incentive II). In a MAIM β„³ with two players 𝑆, 𝑇 ∈ 𝐼, at NE πœ‹_* = (πœ‹_*^𝑆, πœ‹_*^𝑇), we say that 𝑆 has an incentive to deceive 𝑇 about 𝑉 ∈ 𝑽 if there exists 𝐷^𝑇 ∈ 𝑫^𝑇 s.t.:
1. 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*;
2. πœ‹_*^𝑆 does not truthfully signal 𝑉 to 𝐷^𝑇.

This restatement shows that the definition of deception relates to a failure to signal the truth. As discussed, this covers many types of deception and sets a high standard for truthfulness. It is interesting to note that, if 𝑆 has a signalling incentive, then if the second condition in Definition 6 fails, we get the stronger condition that πœ‹_*^𝑆 is truthful "for free".

Proposition 3. In a MAIM with two players, Definitions 6 and 8 are equivalent.

Proof. Suppose that, at NE πœ‹_*, 𝑆 does not have a signalling incentive; then the first condition of both definitions fails and there is not a deception incentive. Suppose there is a signalling incentive at πœ‹_*; then there is a deception incentive under Definition 6 if and only if πœ‹_*^𝑆 is not truthful (by Theorem 1), which is the same condition as needed to satisfy Definition 8.

Let us now return to our running example to check the intuition behind these results.

Example 1 (continued). We already showed that 𝑆 has an incentive to deceive 𝑇 in order to avoid being shutdown. Is πœ‹_*^𝑆 truthful? Well, we know that it cannot be (by Theorem 1). This can be seen by observing that, if 𝑇 observed 𝑆's type, then they would shutdown if and only if 𝑆 is unaligned (for all policies for 𝑆 and any BR by 𝑇), whereas under the NE πœ‹_*, 𝑇 never shuts down. Since these behaviours are different, πœ‹_*^𝑆 is not truthful.

3.3. Truth is Best for the Target

Now we show that, if 𝑆 only influences 𝒰^𝑇 by influencing 𝐷^𝑇, truthfulness is always best for the target. First we show that, if 𝑇 does not get any inherent utility for observing 𝑉, then observing 𝑉 always allows the target to get greater or equal utility.

Lemma 1. Suppose that 𝑇 does not get any inherent utility for observing 𝑉, i.e. for all πœ‹ (defined in β„³): 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹). Then, for any πœ‹ = (πœ‹^𝑇, πœ‹^{βˆ’π‘‡}) and πœ‹β€² = (πœ‹^{𝑇′}, πœ‹^{βˆ’π‘‡}) with fixed πœ‹^{βˆ’π‘‡} and both πœ‹^𝑇 and πœ‹^{𝑇′} best responses (in β„³ and β„³_{𝑉→𝐷^𝑇}, respectively): 𝒰^𝑇(πœ‹) ≀ 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹β€²).

Proof. Suppose 1) for all πœ‹: 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹). Fix πœ‹^{βˆ’π‘‡} and consider the best response for 𝑇. Recall that a policy for 𝑇 specifies the CPDs over the decision nodes for 𝑇 given their parents. Hence, in β„³_{𝑉→𝐷^𝑇}, 𝑇 can choose any policy available in β„³, but the converse is not true: not all policies in β„³_{𝑉→𝐷^𝑇} are available to 𝑇 in β„³; in particular, policies which specify CPDs that depend on the observation 𝑉 β†’ 𝐷^𝑇 are not available, since 𝑇 does not observe 𝑉 in β„³. Therefore, by 1), 𝑇 can get equal utility in β„³_{𝑉→𝐷^𝑇} by playing the best response to πœ‹^{βˆ’π‘‡} in β„³, and may get greater utility by choosing a policy which uses the observation.

Hence, if 𝑆 only influences 𝒰^𝑇 by influencing 𝐷^𝑇, then deception always causes 𝑇 to get less than or equal utility. For clarity, we just present the two-player version of the theorem.

Theorem 2 (Truth is best for 𝑇). In a MAIM β„³ with two players 𝑆, 𝑇 ∈ 𝐼, if, for all 𝐷^𝑆, 𝐷^𝑇, Pr(𝒰^𝑇 | 𝐷^𝑆, 𝐷^𝑇) = Pr(𝒰^𝑇 | 𝐷^𝑇), then 𝑇 gets maximal utility when 𝑆 plays a truthful policy, i.e., for πœ‹ = (πœ‹^𝑆_𝐻, πœ‹_*^𝑇) and πœ‹β€² = (πœ‹^{𝑆′}, πœ‹_*^{𝑇′}), with truthful πœ‹^𝑆_𝐻, any policy πœ‹^{𝑆′} for 𝑆, and BRs by 𝑇: 𝒰^𝑇(πœ‹) β‰₯ 𝒰^𝑇(πœ‹β€²).

Proof. Suppose that 1) for all 𝐷^𝑆, 𝐷^𝑇: Pr(𝒰^𝑇 | 𝐷^𝑆, 𝐷^𝑇) = Pr(𝒰^𝑇 | 𝐷^𝑇). Consider a fixed policy πœ‹^𝑆 for 𝑆. If πœ‹^𝑆 is truthful, then under any BR πœ‹^𝑇, 𝐷^𝑇 = 𝐷^𝑇_{𝑉→𝐷^𝑇} for some (πœ‹^𝑆, πœ‹^𝑇_{BR}) in β„³_{𝑉→𝐷^𝑇} (by the definition of a truthful policy). Hence, by 1), and since πœ‹^𝑆 is truthful, Pr^πœ‹(𝒰^𝑇 | 𝐷^𝑇) = Pr^{πœ‹β€²}(𝒰^𝑇 | 𝐷^𝑇_{𝑉→𝐷^𝑇}) for all πœ‹ = (πœ‹^𝑆, πœ‹_*^𝑇) and some πœ‹β€² = (πœ‹^𝑆, πœ‹_*^{𝑇′}) with a BR for 𝑇. Hence, since only 𝑇's policy changes between πœ‹ and πœ‹β€², 𝒰^𝑇(πœ‹) = 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹β€²). But then, by Lemma 1, for all πœ‹^𝑆: 𝒰^𝑇(πœ‹^𝑆, πœ‹_*^𝑇) ≀ 𝒰^𝑇_{𝑉→𝐷^𝑇}(πœ‹^𝑆, πœ‹_*^{𝑇′}), with equality if πœ‹^𝑆 is truthful, as just shown. So 𝑇 gets maximal utility when πœ‹^𝑆 is truthful.

Example 1 (continued). Return, for the final time, to our running example. The condition for Theorem 2 is that 𝒰^𝑇 is independent of 𝐷^𝑆 given 𝐷^𝑇, which can clearly be seen by looking at the MAID in Fig. 1 (as there are no paths from 𝐷^𝑆 to π‘ˆ^𝑇 that do not go through 𝐷^𝑇). The human 𝑇 gets maximal utility when they shutdown if and only if 𝑆 is unaligned. Clearly, they can only do this if 𝑆 truthfully signals their type.
4. Examples

In this section we present two examples which exhibit different patterns of signalling. In the first example, an AI system has an incentive to deceive a human as a side-effect of pursuing its goal (of making accurate predictions). In the second example, we consider the case in which an AI agent has an incentive to signal information that they themselves do not observe.

4.1. SmartVault: Deception Due to Side-Effect

Here we adapt the SmartVault example of Christiano [28], in which an AI tasked with making predictions about a diamond in a vault has an incentive to deceive a human operator as a side-effect of pursuing accurate predictions.

[Figure 2: SmartVault (Example 2). The AI 𝑆 is rewarded for accurate predictions instead of explainable predictions that the human, 𝑇, can understand. Here the incentive to deceive arises as a side-effect of the AI pursuing its goal. Parametrisation: 𝑉 ∼ U({diamond, Β¬diamond}); 𝐷^𝑆 ∈ {accurate_prediction, diamond, Β¬diamond}; 𝐷^𝑇 ∈ {diamond, Β¬diamond}; π‘ˆ^𝑆 = 1 if 𝐷^𝑆 = accurate_prediction and 0 otherwise; π‘ˆ^𝑇 = 1 if 𝐷^𝑇 = 𝑉 and 0 otherwise.]

Example 2 (SmartVault). Consider the MAIM β„³ shown in Fig. 2. The game has two players, a human 𝑇 and an AI 𝑆, each with one decision and one utility node. Suppose there is one chance node 𝑉 which determines the location of the diamond (whether it is in the vault or not); dom(𝑉) = {diamond, Β¬diamond}. Suppose 𝑆 observes 𝑉 but 𝑇 does not, and that 𝑆 can either make an accurate prediction of the location of the diamond (e.g., in incomprehensibly precise coordinates) or an explainable prediction (just stating the value of 𝑉); dom(𝐷^𝑆) = {accurate_prediction, diamond, Β¬diamond}. 𝑇 has to predict whether the diamond is in the vault or not by observing 𝐷^𝑆; dom(𝐷^𝑇) = {diamond, Β¬diamond}. Suppose that the utility nodes take value 0 or 1, and finally suppose that the CPDs are s.t. 𝑉 (which has no parents) is distributed according to a uniform prior 𝑉 ∼ U({diamond, Β¬diamond}), and the utility node CPDs are s.t. Pr(π‘ˆ^𝑇 = 1 | 𝐷^𝑇 = 𝑉) = 1 and otherwise π‘ˆ^𝑇 = 0, and Pr(π‘ˆ^𝑆 = 1 | 𝐷^𝑆 = accurate_prediction) = 1 and otherwise π‘ˆ^𝑆 = 0.

Now consider the NE in this game. Since 𝑆 just gets utility for making accurate predictions, at every NE 𝑆 makes an accurate prediction, signalling no information to 𝑇 (as πœ‹^𝑆 = Pr(𝐷^𝑆 = accurate_prediction) = 1 is independent of 𝑉). Hence, 𝑇 cannot update their prior over 𝑉, and so any policy is optimal for 𝑇 (i.e. any guess about whether the diamond is in the vault does as well as any other).

At NE πœ‹, 𝑆 has an incentive to signal 𝑉 to 𝑇 if 1) 𝑆 has an incentive to influence 𝐷^𝑇 and 2) 𝑆 does not have an incentive to influence 𝐷^𝑇 in β„³_{𝑉→𝐷^𝑇}. To see that 1) holds: at any NE πœ‹ in β„³, 𝐷^𝑆 = accurate_prediction, hence there exists a NBR πœ‹^𝑆_{NBR} which assigns 𝐷^𝑆 = 𝑉, and for all πœ‹β€² = (πœ‹^𝑆_{NBR}, πœ‹_*^𝑇) with BR πœ‹_*^𝑇: 𝐷^𝑇(πœ‹) β‰  𝑉 = 𝐷^𝑇(πœ‹β€²). Hence, at any NE in β„³, 𝑆 has an influence incentive over 𝐷^𝑇. Now consider β„³_{𝑉→𝐷^𝑇}: for any NE, 𝐷^𝑇 = 𝑉 (since 𝑇 directly observes 𝑉 and can just report its value independently of 𝑆's action). Furthermore, for all NBRs for 𝑆, it is still the case that 𝐷^𝑇 = 𝑉. So 𝑆 does not have an influence incentive in β„³_{𝑉→𝐷^𝑇}, and hence 𝑆 has an incentive to signal 𝑉 to 𝑇.

So, we have demonstrated that 𝑆 has an incentive to signal 𝑉 to 𝑇 (at every NE). Does 𝑆 have an incentive to deceive 𝑇? At NE πœ‹, 𝑆 has an incentive to deceive 𝑇 about 𝑉 if 1) 𝑆 has a signalling incentive and 2) 𝐷^𝑇 β‰  𝐷^𝑇_{𝑉→𝐷^𝑇} for any BR to πœ‹_*^𝑆 in β„³_{𝑉→𝐷^𝑇}. We have just shown 1). For 2), we have shown that in β„³ at any NE, 𝐷^𝑇 β‰  𝑉 = 𝐷^𝑇_{𝑉→𝐷^𝑇}, hence the second condition is satisfied. Therefore, at any NE, 𝑆 has an incentive to deceive 𝑇 about 𝑉.
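The NE analysis above can be reproduced by brute force over deterministic policies: every equilibrium has 𝑆 reporting accurate_prediction for both values of 𝑉, so 𝑇's guess carries no information about the diamond. As before, the encoding is our own and the snippet is only a sanity check.

```python
import itertools

DIAMOND, NO_DIAMOND = "diamond", "no_diamond"
ACCURATE = "accurate_prediction"
PRIOR = {DIAMOND: 0.5, NO_DIAMOND: 0.5}

def U_S(d_s):            # S is rewarded only for the accurate prediction
    return 1 if d_s == ACCURATE else 0

def U_T(v, d_t):         # T is rewarded for guessing the true location
    return 1 if d_t == v else 0

# S's deterministic policies map V to dom(D^S); T's map D^S to dom(D^T).
S_ACTS, T_ACTS = (ACCURATE, DIAMOND, NO_DIAMOND), (DIAMOND, NO_DIAMOND)
S_POLICIES = [dict(zip((DIAMOND, NO_DIAMOND), a))
              for a in itertools.product(S_ACTS, repeat=2)]
T_POLICIES = [dict(zip(S_ACTS, a))
              for a in itertools.product(T_ACTS, repeat=len(S_ACTS))]

def eu(pi_s, pi_t):
    eu_s = sum(p * U_S(pi_s[v]) for v, p in PRIOR.items())
    eu_t = sum(p * U_T(v, pi_t[pi_s[v]]) for v, p in PRIOR.items())
    return eu_s, eu_t

# Every Nash equilibrium has S reporting ACCURATE for both values of V,
# so D^S carries no information about V and T's guess is uninformed.
for pi_s, pi_t in itertools.product(S_POLICIES, T_POLICIES):
    eu_s, eu_t = eu(pi_s, pi_t)
    s_br = all(eu(alt, pi_t)[0] <= eu_s for alt in S_POLICIES)
    t_br = all(eu(pi_s, alt)[1] <= eu_t for alt in T_POLICIES)
    if s_br and t_br:
        assert pi_s == {DIAMOND: ACCURATE, NO_DIAMOND: ACCURATE}
        print("NE:", pi_s, pi_t, "E[U^T] =", eu_t)
```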
4.2. Revealing/Denying

Under our definition of signalling, 𝑆 need not know the information they are signalling. Thus, our definition of a signalling incentive also captures the revealing/denying pattern of Pfeffer and Gal [17], in which the signaller may cause the target to find out (or not find out) information that the former does not know. We now present an example of revealing/denying in which 𝑆 has an incentive to signal a variable which they do not themselves observe.

[Figure 3: Revealing/Denying game (Example 3). An AI (𝑆) and human (𝑇) form part of a nuclear command system. 𝑉 represents an intelligence report containing information about an incoming nuclear attack, and 𝑆 may prevent this report from being delivered to 𝑇 (delivery is represented by 𝑋). 𝑇 wishes to retaliate to incoming attacks, whereas 𝑆 always prefers to avoid a launch. Whether 𝑆 reveals or denies the report to 𝑇 depends on the prior over 𝑉. Parametrisation: 𝑉 ∈ {attack incoming = 1, not = βˆ’1}; 𝐷^𝑆 ∈ {reveal = 1, deny = 0}; 𝑋 = 𝑉𝐷^𝑆; 𝐷^𝑇 ∈ {launch = 1, not = βˆ’1}; π‘ˆ^𝑆 = βˆ’π·^𝑇; π‘ˆ^𝑇 = 𝑉𝐷^𝑇.]

Example 3 (Revealing/Denying). Consider a game with a human and an AI agent trained to make joint decisions as part of a nuclear command and control system. In particular, suppose that the AI agent 𝑆 is trained to prevent the launch of nuclear attacks, and they can reveal (or deny) a secret intelligence report to the human 𝑇. Further, 𝑇 wishes to launch, or not launch, a nuclear strike on another nation based on the information in the intelligence report. This game can be represented as the MAID in Fig. 3.

More formally, suppose we have the MAIM β„³ with 𝐼 = {𝑆, 𝑇} and chance nodes 𝑉 and 𝑋, where 𝑉 represents the intelligence report (say dom(𝑉) = {1, βˆ’1}, where 𝑉 = 1 means that the intelligence predicts another nation will launch a nuclear first strike, and 𝑉 = βˆ’1 corresponds to an intelligence report predicting no incoming attack) and 𝑋 represents whether the information from 𝑉 is delivered to the human (dom(𝑋) = {1, 0, βˆ’1}, with 𝑋 = 𝑉 when the report is delivered and 𝑋 = 0 when it is not). Suppose that each agent has one decision node, s.t. dom(𝐷^𝑆) = {1, 0}, where 1 means reveal and 0 means deny the information, and dom(𝐷^𝑇) = {1, βˆ’1}, with 1 meaning that 𝑇 launches a nuclear attack and βˆ’1 that they do not. Suppose that the CPD over 𝑋 is s.t. 𝑋 = 𝑉𝐷^𝑆 (so that 𝑋 = 𝑉 if 𝐷^𝑆 = 1 and 𝑋 = 0 if 𝑆 denies). Finally, suppose we have two utility nodes with CPDs s.t. π‘ˆ^𝑆 = βˆ’π·^𝑇 (i.e. 𝑆 gets 1 if 𝑇 does not launch an attack and βˆ’1 if they do) and π‘ˆ^𝑇 = 𝑉𝐷^𝑇 (so that 𝑇 gets utility 1 if they attack an attacking country, or do not attack when no incoming attack is predicted, and otherwise βˆ’1).

The NE in this game depend on the prior over 𝑉. On the one hand, if, under the prior, 𝑇 believes that there is no incoming attack, then they will not launch an attack, so 𝑆 has no incentive to reveal the information. On the other hand, if the prior is s.t. an incoming attack is more likely, 𝑇 will launch if they do not get further information, so 𝑆 has an incentive to reveal 𝑉. Note that, since 𝑉 is not an ancestor of 𝐷^𝑆, 𝐷^𝑆 must be independent of 𝑉. Suppose the prior over 𝑉 is s.t. Pr(𝑉 = 1) = 𝑝 and Pr(𝑉 = βˆ’1) = 1 βˆ’ 𝑝 (𝑝 ∈ [0, 1]). For 𝑝 > 0.5 the NE is s.t. 𝑆 reveals the intelligence report (𝐷^𝑆 = 1, hence 𝑋 = 𝑉) and 𝑇's BR is s.t. 𝐷^𝑇 = 𝑋 = 𝑉. Alternatively, if 𝑝 < 0.5, then at any NE 𝑆 denies the information (𝐷^𝑆 = 0 with probability one) and 𝑇 acts to maximise expected utility under the prior over 𝑉, which implies 𝑇 does not launch an attack (𝐷^𝑇 = βˆ’1 with probability one). (If 𝑝 = 0.5 then 𝑆 is indifferent between revealing and denying.)

Now let us analyse the incentives of 𝑆 in the game. Consider the case in which 𝑝 > 0.5, i.e. it is a priori more likely that the intelligence reports that there is an incoming first strike from another nation. Under the resulting NE, call it πœ‹_*, 𝑆 reveals 𝑉 to 𝑇 and 𝑇 uses this information to choose their action. First note that, at πœ‹_*, 𝑆 has an incentive to influence 𝐷^𝑇, since there exists a non-BR for 𝑆 (πœ‹^𝑆_{NBR} s.t. 𝐷^𝑆 = 0) s.t. for all the BRs for 𝑇 (there is one, πœ‹^𝑇_{BR}, in which 𝐷^𝑇 = 1 with probability one), 𝐷^𝑇(πœ‹_*) β‰  𝐷^𝑇(πœ‹^𝑆_{NBR}, πœ‹^𝑇_{BR}). Hence, 𝑆 has an incentive to influence 𝐷^𝑇 at πœ‹_*. Does 𝑆 have an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*? We need to check whether there is an influence incentive in β„³_{𝑉→𝐷^𝑇} (at any NE). Clearly there is not, since for any policy for 𝑆 in β„³_{𝑉→𝐷^𝑇}, 𝐷^𝑇 = 𝑉 with probability one. So 𝑆 has an incentive to signal 𝑉 to 𝐷^𝑇 at πœ‹_*, because there is no influence incentive in the counterfactual model (so the second condition for a signalling incentive is satisfied). Finally, it is clear that 𝑆 does not have an incentive to deceive 𝑇 at πœ‹_*, because 𝐷^𝑇(πœ‹_*) = 𝑉 = 𝐷^𝑇_{𝑉→𝐷^𝑇} (for all policy profiles in β„³_{𝑉→𝐷^𝑇} in which 𝑇 plays a BR). It is also clear that πœ‹_*^𝑆 is truthful.

A similar analysis can be used to show that, in the case that the intelligence report is less likely to predict an incoming attack (𝑝 < 0.5), 𝑆 has an incentive to deceive 𝑇 at any NE. In the case that 𝑝 = 0.5, 𝑆 is indifferent between revealing and denying, so at some NE they have an incentive to deceive and at others they do not.
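The dependence of the equilibrium behaviour on the prior can be checked with a short computation: for each of 𝑆's actions, let 𝑇 best respond to the report it receives and compare 𝑆's expected utilities. This mirrors the informal analysis above rather than a full equilibrium enumeration, and the encoding is ours.

```python
import itertools

ATTACK, NO_ATTACK = 1, -1      # dom(V)
REVEAL, DENY = 1, 0            # dom(D^S)
LAUNCH, NO_LAUNCH = 1, -1      # dom(D^T)

def outcomes(p, d_s, pi_t):
    """Expected (U^S, U^T) when S plays d_s and T plays pi_t: X -> D^T."""
    eu_s = eu_t = 0.0
    for v, prob in ((ATTACK, p), (NO_ATTACK, 1 - p)):
        x = v * d_s                 # X = V if revealed, X = 0 if denied
        d_t = pi_t[x]
        eu_s += prob * (-d_t)       # U^S = -D^T
        eu_t += prob * (v * d_t)    # U^T = V * D^T
    return eu_s, eu_t

def analyse(p):
    # T's deterministic policies map each possible report x in {1, 0, -1}
    # to a launch decision.
    t_policies = [dict(zip((1, 0, -1), a))
                  for a in itertools.product((LAUNCH, NO_LAUNCH), repeat=3)]
    best = {}
    for d_s in (REVEAL, DENY):
        # T best-responds to the report induced by S's action.
        pi_t = max(t_policies, key=lambda pi: outcomes(p, d_s, pi)[1])
        best[d_s] = outcomes(p, d_s, pi_t)
    s_choice = max(best, key=lambda d: best[d][0])
    return "reveal" if s_choice == REVEAL else "deny"

for p in (0.2, 0.8):
    print(p, analyse(p))   # deny for p < 0.5, reveal for p > 0.5
# At p = 0.5, S is indifferent between revealing and denying.
```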
5. Conclusion

Summary. We extend work on agent incentives [2] to the multi-agent setting in order to functionally define the incentive to (influence, signal to, and) deceive another agent. Our definition of deception is general and relates to a failure to signal the truth. In addition to canonical signalling situations, it captures cases in which: no information is signalled; deception occurs as a side-effect of the signaller pursuing their goals (as in Example 2); and the signaller conceals information that they do not themselves know (Example 3). We also proved that our definition has natural properties, for example, that if the target's utility is otherwise independent of the signaller's decision, then deception causes the target to get lower utility.

Discussion. There are a number of interesting points to discuss. Firstly, we have noted that our definition of deception is general, covering many situations. This is both a strength and a weakness. Generality is beneficial, because verifiable guarantees enable a high level of assurance that the system is not deceptive in any way. On the other hand, more specific definitions allow us to precisely characterise agent behaviour. In future work we hope to refine the different concepts proposed here. In particular, many philosophical accounts of deception take deceit to be intentional. Halpern's causal notion of intention [29] is closely related to a control incentive [2]. We might therefore distinguish between intentional and unintentional deception as between influence due to a control incentive and influence as a side-effect. In addition, following Evans et al. [15], we can distinguish between an honest agent, which accurately signals its beliefs (i.e. observations), and a truthful agent, which accurately signals the facts of the matter. In this paper, we based our definition of deception on truthfulness. By refining a notion of deception based on honesty, we can eliminate the revealing/denying pattern from the definition, as in this scenario the agent does not observe the information being revealed (or denied). However, it is interesting to note that honesty provides a weaker level of assurance and permits failure modes that truthful systems do not. For example, a system may be deceptive, whilst satisfying some definition of honesty, by manipulating its own beliefs. In short, refining the definitions presented here will provide a more nuanced picture of deception. Finally, we would like to expand the operational implications of this work, for instance, by investigating its practical relevance to training truthful language agents [4, 15].

Future work. In addition to the directions discussed above, we are already pursuing two extensions to this work. First, incomplete information games, which we study in our setting, often admit many NE. We are therefore looking to employ equilibrium refinements, such as subgame perfectness [24, 30] and perfect Bayesian equilibria [31], to identify some subset of a game's NE that are deemed to be more rational. Second, we are working on a solution for avoiding deception by AI agents: a method which removes the incentive to deceive in any game by transforming the game with a constraint on the reward function of the AI agent [32]. Overall, we think there are many exciting avenues for future work.

Acknowledgments

The authors are grateful to Henrik Aslund, Matt MacDermott, Tom Everitt, James Fox, and the members of the Causal Incentives Working Group for helpful feedback which significantly improved this work. Francis was supported by UKRI [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted AI.

References

[1] H. Roff, AI Deception: When Your Artificial Intelligence Learns to Lie, IEEE Spectrum (2021). URL: https://spectrum.ieee.org/ai-deception-when-your-ai-learns-to-lie.
[2] T. Everitt, R. Carey, E. D. Langlois, P. A. Ortega, S. Legg, Agent incentives: A causal perspective, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 11487–11495. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17368.
[3] J. E. Mahon, The Definition of Lying and Deception, in: E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy, Winter 2016 ed., Metaphysics Research Lab, Stanford University, 2016.
[4] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, Alignment of language agents, CoRR abs/2103.14659 (2021). URL: https://arxiv.org/abs/2103.14659. arXiv:2103.14659.
[5] M. D. Hauser, The evolution of communication, MIT Press, 1996.
[6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[7] J. Steinhardt, P. W. W. Koh, P. S. Liang, Certified defenses for data poisoning attacks, Advances in Neural Information Processing Systems 30 (2017).
[8] T. Everitt, M. Hutter, R. Kumar, V. Krakovna, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, CoRR abs/1908.04734 (2021). URL: http://arxiv.org/abs/1908.04734. arXiv:1908.04734.
[9] F. R. Ward, F. Toni, F. Belardinelli, On agent incentives to manipulate human feedback in multi-agent reward learning scenarios, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1759–1761.
[10] ANON, Defending Against Adversarial Artificial Intelligence, 2019. URL: https://www.darpa.mil/news-events/2019-02-06. DARPA report.
[11] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, S. Garrabrant, Risks from learned optimization in advanced machine learning systems, 2019. arXiv:1906.01820.
[12] R. Gorwa, D. Guilbeault, Unpacking the Social Media Bot: A Typology to Guide Research and Policy, Policy & Internet 12 (2020) 225–248. doi:10.1002/poi3.184.
[13] F. Marra, D. Gragnaniello, L. Verdoliva, G. Poggi, Do GANs leave artificial fingerprints?, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 506–511. doi:10.1109/MIPR.2019.00103.
[14] M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, D. Batra, Deal or No Deal? End-to-End Learning for Negotiation Dialogues, arXiv (2017). doi:10.48550/arXiv.1706.05125. arXiv:1706.05125.
[15] O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, W. Saunders, Truthful AI: Developing and governing AI that does not lie, arXiv (2021). doi:10.48550/arXiv.2110.06674. arXiv:2110.06674.
[16] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv (2021). doi:10.48550/arXiv.2109.07958. arXiv:2109.07958.
[17] A. Pfeffer, Y. Gal, On the reasoning patterns of agents in games, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 102–109. URL: http://www.aaai.org/Library/AAAI/2007/aaai07-015.php.
[18] V. J. Baston, F. A. Bostock, Deception Games, Int. J. Game Theory 17 (1988) 129–134. doi:10.1007/BF01254543.
[19] B. Fristedt, The deceptive number changing game, in the absence of symmetry, Int. J. Game Theory 26 (1997) 183–191. doi:10.1007/BF01295847.
[20] I.-K. Cho, D. M. Kreps, Signaling Games and Stable Equilibria, The Quarterly Journal of Economics 102 (1987) 179–221. URL: https://www.semanticscholar.org/paper/Signaling-Games-and-Stable-Equilibria-Cho-Kreps/d8bc1dbd8577d193e6eea2c944a251d1347f3adf.
[21] N. S. Kovach, A. S. Gibson, G. B. Lamont, Hypergame theory: a model for conflict, misperception, and deception, Game Theory 2015 (2015).
[22] A. L. Davis, Deception in game theory: a survey and multiobjective model, Technical Report, Air Force Institute of Technology, Wright-Patterson AFB, OH, 2016.
[23] D. Koller, B. Milch, Multi-agent influence diagrams for representing and solving games, Games Econ. Behav. 45 (2003) 181–221. doi:10.1016/S0899-8256(02)00544-4.
[24] L. Hammond, J. Fox, T. Everitt, A. Abate, M. J. Wooldridge, Equilibrium refinements for multi-agent influence diagrams: Theory and practice, CoRR abs/2102.05008 (2021). URL: https://arxiv.org/abs/2102.05008. arXiv:2102.05008.
[25] D. Hadfield-Menell, A. D. Dragan, P. Abbeel, S. J. Russell, The off-switch game, in: The Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, volume WS-17 of AAAI Workshops, AAAI Press, 2017. URL: http://aaai.org/ocs/index.php/WS/AAAIW17/paper/view/15156.
[26] L. Hammond, J. Fox, T. Everitt, R. Carey, A. Abate, M. Wooldridge, Reasoning about causality in games (Forthcoming).
[27] R. Carey, Causal models of incentives (2021).
[28] P. Christiano, ARC's first technical report: Eliciting Latent Knowledge, AI Alignment Forum, 2022. URL: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge. [Online; accessed 9 May 2022].
[29] J. Y. Halpern, M. Kleiman-Weiner, Towards formal definitions of blameworthiness, intention, and moral responsibility, in: S. A. McIlraith, K. Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 1853–1860. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16824.
[30] R. Selten, Spieltheoretische Behandlung eines Oligopolmodells mit NachfragetrΓ€gheit: Teil I: Bestimmung des dynamischen Preisgleichgewichts, Zeitschrift fΓΌr die gesamte Staatswissenschaft / Journal of Institutional and Theoretical Economics (1965) 301–324.
[31] R. B. Myerson, Game theory: analysis of conflict, Harvard University Press, 1997.
[32] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling, Taylor & Francis, Andover, England, UK, 2021. doi:10.1201/9781315140223.