A Game-Theoretic Perspective on Risk-Sensitive Reinforcement Learning Mathieu Godbout 1, 2, 3 , Maxime Heuillet 1, 2 , Sharath Chandra Raparthy 3, Rupali Bhati 1, 2, 3 , Audrey Durand 1, 2, 3, * , 1 Université Laval, 2 Institut Intelligence et Données, 3 Mila, * CIFAR AI Chair {mathieu.godbout.3@, maxime.heuillet.1@, rupali.bhati.1@, audrey.durand@ift}.ulaval.ca, raparths@mila.quebec Abstract a nerve or a critical region of the brain (Baek et al. 2018). Automation in this context and other safety-critical domains Most Reinforcement Learning (RL) approaches usually aim to find a policy that maximizes its expected return. However, can only come if the agent is able to successfully reach its this objective may be inappropriate in many safety-critical goal while avoiding the most risky actions (Gottesman et al. domains such as healthcare or autonomous driving, where it 2019). Therefore, naively maximizing the expected return is is often preferable to optimize for a risk-sensitive measure not a satisfying approach in such scenarios. of the policy’s return as the learning objective, such as the This has motivated the community to design risk-sensitive Conditional-Value-at-Risk (CVaR). Although previous litera- algorithms, where the agent is trained to account for the pos- ture exists to address the problem of learning CVaR-optimal sibility of catastrophic events. One way to proceed is to in- policies in Markov decision problems, it mostly relies on the clude a risk measure (Artzner et al. 1999) in the algorithm’s distributional RL perspective. In this paper, we solve this problem by rather proposing an approach based on a game objective. A commonly used risk measure in RL is the Con- theoretic perspective, which can be applied on top of any ex- ditional Value-at-Risk (CVaRα ), defined as the expectation isting RL algorithm. At the core of our approach is a two- over the worst α-quantile of a distribution. The lower the player zero-sum game between a policy player and an adver- value of α, the more risk-averse the agent. The search for sary that perturbs the policy player’s state transitions given a CVaR-optimal policies, usually referred to as CVaR RL, is finite budget. We show that, the closer the players are to the typically achieved with distributional RL (Schubert et al. game’s equilibrium point, the closer the learned policy is to 2021; Keramati et al. 2020). In this approach, the agent pre- the CVaR-optimal one with a risk tolerance explicitly related dicts the whole distribution of returns rather than only its to the adversary’s budget. We provide a gradient-based train- mean (Bellemare, Dabney, and Munos 2017; Dabney et al. ing procedure to solve the proposed game by formulating it 2018). A CVaRα transform is then applied to the predicted as a Stackelberg game, enabling the use of deep RL architec- tures and training algorithms. We illustrate the applicability distribution, resulting in more conservative choices depend- of our approach on a risky artificial environment, presenting ing on the chosen α threshold. the different policies learned for various adversary budgets. In this work, we rather tackle the CVaR RL problem in a game-theoretic setting. Looking at a problem from a game-theoretic perspective has helped advance other fields 1 Introduction like Generative Adversarial Networks (GANs) (Berthelot, Reinforcement Learning (RL) (Sutton, Barto et al. 1998) is Schumm, and Metz 2017; Goodfellow et al. 
2020), suggest- a branch of machine learning where the learner (agent) is ing that this perspective is a promising new research avenue not told how to behave in an environment, but instead must for risk-sensitivity in RL. Precisely, we suggest that learn- learn to do so by trial and error. RL approaches usually aim ing the policy of the RL protagonist (agent) can be cast as a to find agents that maximize their expected return. This ex- two-player zero-sum game between this protagonist and an pectation maximization objective has led to increased suc- antagonist. In this game, the protagonist aims at maximizing cesses in domains like videogames (Vinyals et al. 2019), its reward collection, while the goal of the antagonist is to board games (Silver et al. 2018) or content recommenda- minimize the rewards obtained by the protagonist. In order tion (Li et al. 2010). However, in safety-critical domains like to achieve this, the antagonist has access to a fixed budget for healthcare, autonomous driving or financial planning, some interfering with the next state transitions of the protagonist. erroneous actions may lead to disastrous consequences. In Our contributions are the following: healthcare for instance, an RL agent may be in charge of de- signing the shortest path to an organ for surgery. In this task, • we propose a two-player zero-sum game formulation some paths may be shorter but at the same time may be risk which yields (approximate) CVaR-optimal policies at its endangering the patient as they are too close to an artery, (approximate) equilibrium; Copyright © 2022 for this paper by its authors. Use permitted un- • we provide a gradient-based algorithm to solve the pro- der Creative Commons License Attribution 4.0 International (CC posed game alongside sufficient conditions to ensure its BY 4.0). convergence; • we illustrate the applicability of our method in a risky Conditional-Value-at-Risk (Artzner et al. 1999) at confi- artificial experiment. dence level α is defined as   The rest of the paper is separated as follows. Section 2 1 first introduces the necessary background and notation. Fol- CVaRα (Z) := min w + E [max(Z − w, 0)] . (2) w∈R α lowing is an overview of the related work in Section 3. Next, we present in Section 4 an overview of our proposed game If Z has a continuous distribution, the CVaR measure can be formulation alongside its relevant properties, which include written as the convergence to a CVaR RL policy at equilibrium point. CVaRα (Z) := E [z | z ≤ VaR(Z)] . (3) In Section 5, we develop a gradient-based algorithm based on the Stackelberg game formulation to solve the presented Intuitively, the CVaRα (Z) measure can therefore be viewed game, presenting both a theoretical analysis of its conver- as the mean over the α-quantile worst values of Z. Since gence as well as key practical concerns. Lastly, we illustrate Z represents returns in our case, the worst values of Z can the validity of the method in an artificial gridworld experi- be seen as the ones where most risk has been incurred. This ment (Section 6). last interpretation is particularly attractive, as it makes the CVaRα easy to understand for non-experts who might be 2 Background and Notation involved in the design of any risk-sensitive model in safety- Reinforcement Learning critical domains. In RL, an agent interacts with an environment, trying to CVaR Reinforcement Learning identify actions that lead to the highest rewards. 
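To make the measure concrete, the following sketch (Python/NumPy; illustrative only, not part of the released implementation) estimates CVaRα from a batch of sampled returns, once through the tail-average reading of (3) and once through a grid search on a variational form written for the lower tail of the return distribution:

import numpy as np

def cvar_empirical(returns, alpha):
    """Mean over the worst alpha-fraction of sampled returns (cf. Eq. (3))."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst returns first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the alpha-tail
    return returns[:k].mean()

def cvar_variational(returns, alpha, grid_size=1000):
    """CVaR via a grid search on the lower-tail variational form
    max_w { w - E[(w - Z)^+] / alpha }, the return-maximization
    counterpart of the minimization form in Eq. (2)."""
    returns = np.asarray(returns, dtype=float)
    candidates = np.linspace(returns.min(), returns.max(), grid_size)
    values = [w - np.mean(np.maximum(w - returns, 0.0)) / alpha for w in candidates]
    return max(values)

# Example: the worst-5% average of a standard normal is roughly -2.06.
samples = np.random.default_rng(0).normal(size=100_000)
print(cvar_empirical(samples, 0.05), cvar_variational(samples, 0.05))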
The most To measure the level of risk associated with a policy π, the common RL formulation for the environment is that of a CVaRα measure defined in (3) can be applied to the random Markov Decision Process (MDP) framework. An MDP M discounted return J (π) defined in (1). Optimizing for this is represented by a tuple hS, A, P, R, γ, ρi, where S is the yields the CVaR RL objective state space, A is the action space, P : S × A → [0, 1]S is the transition probability function which governs the evolu- π ∗ = arg max CVaRα (J (π)) . (4) tion of the system, R : S × A → R is the reward function, π γ ∈ [0, 1) is a discount factor and ρ is the initial state dis- This objective, which differs from the traditional expectation tribution. We formalize an agent’s decisions as following a maximization goal, can be viewed as finding a risk-sensitive policy πθ : S → [0, 1]A that generates distributions over policy. Indeed, by having the optimal CVaRα return, the op- which actions to take for each state from parameters θ1 . timal policy π ∗ is maximizing the expectation over its worst A policy’s actions not only generate immediate rewards α-percentile trajectories. Since the expectation is now taken but also influence future rewards because they rule the next over a small, disadvantageous subset of all returns rather state the agent will end up in. This is why the performance of than over all of them, the optimal policy is more sensible a policy π is measured by balancing immediate and distant towards avoiding the probability of large negative returns. rewards, using a measure called the random return: In practice, this means that optimizing for the CVaR RL ob- ∞ X jective (4) will produce policies that accept to reduce their J (π) := γ t rt , (1) expected performance, so long as it means they avoid catas- t=0 trophic trajectories in return. Let us take a moment to note here that it is well known where γ is a discount factor, at ∼ π(st ), st+1 ∼ P(st , at ) that the optimal policy π ∗ for the CVaR RL objective can and rt ∼ R(st , at ). Due to the stochasticity that may be be history-dependent (Shapiro, Dentcheva, and Ruszczyński present in either the reward function, the policy’s action 2014), meaning that there are cases where the agent needs selection or the environment transitions, the random re- to know what previous rewards were collected to achieve turn J (π) is accurately defined as a random variable. The CVaR optimal returns. We follow previous work on CVaR usual objective in RL is to find the policy π ∗ that maxi- RL (Keramati et al. 2020; Dabney et al. 2018) and limit our mizes its expected return, which boils down to finding π ∗ = analysis to stationary, history independent policies which arg maxπ E[J (π)]. can be suboptimal but typically achieve high CVaR nonethe- less. Conditional-Value-at-Risk Let Z be a random variable with a bounded expectation 3 Related Work E[Z] < ∞ and cumulative distribution function F (z) := Adversarial Reinforcement Learning P(Z ≤ z). In this paper, we interpret Z as a return to be maximized. The Value-at-Risk (VaR) at confidence level Although typically used in the supervised learning setup, ad- α ∈ (0, 1] represents the worst 1 − α quantile of Z, versarial learning (Kurakin, Goodfellow, and Bengio 2017) i.e., VaRα (Z) := min{z|F (z) ≥ α}. Analogously, the has already been applied in RL with the aim of increasing task difficulty. Dennis et al. (2020); Tobin et al. 
(2017) use 1 For ease of notation, we will only write π when it is clear that an adversary to design hard environment instances in order we are talking about a parametrized policy πθ . This applies for any to optimize the robustness of the policy learned by a RL pro- parametrized function throughout the paper. tagonist. This is achieved by allowing the adversary to select a different environment from a constrained set at the begin- to apply external forces or disturbances to the environment ning of each episode. The adversary is therefore limited to model. Their antagonist’s action space is however rather a single move (at the beginning of each episode) and its ac- different, as it can only change parameters of the model tions are not budgeted. within a given range at every time step. In contrast, our More similar to the current work, Mandlekar et al. (2017) approach lets the antagonist explicitly change the model’s use an adversary to perturb the action selection policy of a transition probabilities and we limit the antagonist’s pertur- RL protagonist at every time step. However, their adversary bation amount over a whole trajectory rather than at every consists in gradient-based perturbations which do not adapt time step. Moreover, unlike their CVaR formulation which to the policy of the protagonist. In the formulation tackled is essentially only intuitive, we establish a clear, theoreti- in the current work, the antagonist strategy is jointly learned cally justified, connection between our antagonist’s budget with the protagonist strategy, resulting in an antagonist strat- and the α-quantile the protagonist’s policy is optimized for. egy adapted to battle the protagonist. 4 Game Setting Risk-Sensitive RL Before diving into the explanation of our proposed adversar- Broadly speaking, risk-sensitivity in RL represents the en- ial game, let us first present the two assumptions necessary semble of methods that aim at reducing the level of risk on the target MDP for the game to be applicable. associated with the learned policy. Different risk measures have been proposed in the literature to address this problem. Assumption 1. The MDP has stochastic state transitions P. First, numerous approaches have been proposed to explic- The stochasticity is required because it is precisely upon itly balance the expectation and variance of the return of a those state transitions that the antagonist will be acting. policy (Tamar and Mannor 2013; Pan et al. 2019). Contrary This assumption is the least restrictive, as a large number to our CVaR objective which only assumes the returns to be of MDPs are already stochastic in nature. Furthermore, any bounded, such approaches are valid only under the assump- deterministic transition function P could be made stochastic tion that returns follow a normal distribution. by adding a noise component, either via ε-random sampling One other popular alternative to incorporate risk- for discrete states or adding a Gaussian noise ε ∼ N (0, σ) sensitivity is the use of exponential utility functions (Mi- for continuous states. hatsch and Neuneier 2002; Fei et al. 2020). In these ap- proaches, the return landscape is reshaped into a convex Assumption 2. It is possible to access the state transitions set, essentially achieving balance between expectation and P and arbitrarily modify their non-zero values. worst-case maximization of the return. 
The reshaping is This assumption is the one that allows the antagonist’s based on applying an exponential function and is regulated actions to be registered. This is a much more restrictive as- by a user-specified λ trade-off parameter. However, it is dif- sumption and is only likely to be met in artificial environ- ficult to interpret what different values of λ represent regard- ments over which the user has a lot of control. Notably, given ing the balance between mean and minimal return (Gosavi, an MDP with added noise as described above, it should be Das, and Murray 2014), in contrast with the straightforward relatively straightforward to meet this assumption by access- interpretation of our proposed CVaR measure. ing the analytical formulation of the added noise component. Conditional-Value-at-Risk RL Formulation Many algorithms have been proposed to find optimal poli- We aim to learn a risk-sensitive policy with respect to the cies with respect to the CVaR criterion. Proposed algo- CVaR risk measure. To do so, we leverage Chow et al. rithms range from policy gradient (Tamar et al. 2015) and Q- (2015)’s model perturbation framework. Namely, we design learning (Chow et al. 2015), to actor-critic methods (Tamar a zero-sum adversarial game where a protagonist tries to and Mannor 2013). Unlike our algorithm that is compati- learn a policy πθ to maximize its return while an antagonist ble with modern deep learning, the above approaches are all tries to learn a perturbation function Λω that instead mini- limited to either tabular or low-dimensional action and state mizes the protagonist’s return. space settings. To explain our game setting, let us first define time step More recently, distributional RL (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018) has been used to learn transition dynamics Pt = P(· | st , at ) ∈ [0, 1]|S| from a CVaR optimal policies in high-dimensional settings (Kera- given trajectory τ = {(st , at , rt )}Tt=1 , where T is the trajec- mati et al. 2020; Zhang and Weng 2021). This approach has tory’s duration. At time step t, the antagonist is given direct seen growing usage, largely due to the empirical efficiency access to Pt and allowed to perturb it with a multiplicative of distributional RL and its ability to incorporate a wide va- probability perturbation δt ∈ (0, ∞)|S| , yielding perturbed riety of risk measures in its objective, not exclusively the transitions P̂t = Pt ◦ δt , for ◦ the Hadamard (pointwise) CVaR. Instead of relying on distributional RL, our paper product. A perturbation δ is admissible if it generates a valid presents a game-theoretic perspective which can be used on probability distribution P̂ . Let ∆t be the set of all perturba- top of any conventional RL algorithm. tions δt admissible for a given Pt and ∆ = ∆1 × ... × ...∆T The closest work to ours would be Robust Adversarial Re- be the set of all possible perturbations of a trajectory τ . We inforcement Learning (RARL) (Pinto et al. 2017), where a can define a trajectory perturbation budget η ≥ 1 and con- CVaR optimal policy is learned by allowing an antagonist strain ∆η to contain only perturbations within an η budget st by posing the admissible perturbation envelope rt Protagonist ∆η = ∆ = {δt }Tt=1 | δ1 (s1 ) ∗ ... ∗ δT (sT ) ≤ η , (5)  where δt (si ) represents the antagonist’s perturbation to- at wards next state si at time step t. 
Note here that, since per- st+1 turbations are multiplicative in nature, η = 1 represents an empty budget, meaning that the antagonist could not apply rt+1 Environment any modifications to next state transitions. This definition and limitation of the antagonist’s actions −rt+1 δt η t Pt enjoy two principal interesting properties. First, since the antagonist is only given a limited budget for each trajec- tory, they have to be efficient with their perturbations, trying Antagonist to maximize the damage to the protagonist’s return without wasting their budget. Intuitively, the role of the budget can be seen as a way to ensure that the antagonist cannot force Figure 1: Overview of the proposed game’s interaction loop. the worst possible scenario to occur at every time step. Sec- Components that deviate from the standard RL training loop ondly, the fact that the antagonist perturbations are done via are in red. a Hadamard product constrains the perturbations to states with non-zero transition probability by default. Essentially, this means that the antagonist is only allowed to modify the Definition 1. (Minimax Equilibrium). A Minimax Equilib- distribution over next states that were already reachable. rium (π ∗ , Λ∗ ) is defined as a point where The main idea of this paper is that an antagonist Λ can be learned to select perturbations within the admissible pertur- E [J η (π, Λ∗ )] ≤ E [J η (π ∗ , Λ∗ )] ≤ E [J η (π ∗ , Λ)] , (7) bation envelope ∆η defined in (5), attempting to minimize for which the inequalities hold for any other protagonist π the protagonist π’s reward. Combining the protagonist and or antagonist Λ. antagonist naturally produces the two-player zero-sum game At this equilibrium point, we have the following interest- max min E [J η (π, Λ)] , (6) ing result. π Λ where E[J η (π, Λ)] represents the expected return collected Lemma 1. Suppose πθ and Λω are expressive enough from a protagonist π with admissible perturbations sampled player parametrizations, meaning that they can represent from an antagonist Λ respecting a budget η. Interestingly the optimal players π ∗ and Λ∗ . Then, at a Minimax Equilib- enough, this formulation implies that both players’ goals are rium point (π ∗ , Λ∗ ), the protagonist’s policy π ∗ is CVaRα to respectively minimize and maximize the expected return, optimal for α = η1 , where η is the antagonist Λ’s perturba- retrieving the classical RL objective. This notably means tion budget: that any classical RL algorithm like actor-critic methods or max min E [J η (π, Λ)] = max CVaR η1 [J (π)] . Deep Q-Network may be used without fundamental changes π Λ π for our game. Moreover, the embedded flexibility of the Proof. Proposition 1 from Chow et al. (2015) established game also allows for the underlying MDP to have contin- that uous action and/or state spaces. inf0 E [J η (π, Λ0 )] = CVaR η1 [J (π)] . We note that both the protagonist and antagonist players Λ are well defined under the MDP framework for a shared If the parametrization of the antagonist Λ is expressive environment. On one hand, the protagonist player’s MDP enough, the inf operation will return an antagonist in the remains exactly the same, with the only modification that search space, implying that one can look for the best avail- next states are sampled from P̂t rather than Pt . On the other able antagonist for a protagonist to find the CVaR of its re- hand, the antagonist receives as input states s0t = (Pt , ηt ) ∈ turns. 
The result follows by taking the max over all possible S 0 = [0, 1]|S| × [1, ∞) and outputs perturbed transitions protagonists, where we once again use the fact that π is ex- pressive enough to guarantee the existence in the parameter P̂t = Pt ◦ δt subject to P̂t being a distribution over next search space. states and δt (s) ≤ ηt for all states. Next antagonist states s0t+1 are selected by first sampling the next protagonist state The result in Lemma 1 is satisfying in two ways. First, st+1 from P̂t , then sampling the next protagonist action it directly connects our adversarial framework to risk- at+1 from π(st+1 ), and finally observing the next transi- sensitivity for the learned protagonist policy, even though tion distribution Pt+1 . Lastly, since the game is zero-sum, the protagonist and antagonist’s objectives remain axed on the antagonist’s reward is simply −rt+1 . An overview of the expectation maximization. Secondly, the equation estab- interaction between the protagonist, antagonist and environ- lishes that the α confidence interval in the CVaRα objective ment is shown in Figure 1. is directly linked to the budget perturbation η via α = η1 . Given our parametrized setting where players are learned Connection to CVaR RL rather than exactly computed, we are actually more inter- The solution concept to the game (6) is given by its Minimax ested in the notion of approximate equilibrium, which is Equilibrium. more likely to be encountered in practice. Definition 2. (ε-Minimax Equilibrium). Let V ∗ := uf (θl , θf ), where θ represents the parameters of a given E [J η (π ∗ , Λ∗ )], the value at the Minimax Equilibrium player. The particularity in this game is that the follower’s (π ∗ , Λ∗ ). An ε-Minimax equilibrium is a pair of players parameters always represent a best-response with respect to (π 0 , Λ0 ) where the leader’s parameters. Accordingly, the leader can then pick its parameters by taking for granted that the follower’s V ∗ − min E [J η (π 0 , Λ)] ≤ ε parameters will be optimal with respect to them. This yields Λ (8) the following optimization problem for the leader and max E [J η (π, Λ0 )] − V ∗ ≤ ε π ( ) 0 0 Interestingly, this approximate equilibrium yields an ap- θl ∈ arg max ul (θ, θf ) θf ∈ arg max uf (θl , θf ) , proximately CVaR-optimal policy. θ θf Theorem 1. Suppose πθ and Λω are expressive enough and for the follower player parametrizations. Then, at an ε-Minimax Equilib- rium point (π 0 , Λ0 ), the protagonist’s policy π 0 is at most θf ∈ arg max uf (θl , θ). θ ε-suboptimal: The key intuition to note here is that, since the follower al- max CVaR η1 [J (π)] − CVaR η1 [J (π 0 )] ≤ ε π ways has optimal parameters with respect to the leader, then the follower parameters may be seen as an implicit function Proof. Taking the first inequality in (8), we have taking as input the leader’s parameters θf (θl ). In turn, this allows the leader to exploit this optimal-response form of its V ∗ − minΛ E [J η (π 0 , Λ)] ≤ ε follower to update its own parameters towards its own goal. ⇐⇒ V ∗ − CVaR η1 [J (π 0 )] ≤ ε Intuitively, updating both player’s parameters may be viewed as a bi-level optimization procedure. In order to ⇐⇒ maxπ CVaR η1 [J (π)] − CVaR η1 [J (π 0 )] ≤ ε, solve a Stackelberg game, we can therefore simply alternate between solving each player player’s respective optimiza- first leveraging Proposition 1 from Chow et al. (2015) and tion problem, producing a simple yet effective algorithm. then using Lemma 1. 
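As a minimal sketch of this alternating scheme, the snippet below (PyTorch-style; the estimate_return placeholder and the parameter names are assumptions for illustration, not the released implementation) performs one leader/follower round: the antagonist takes several gradient steps toward a best response before the protagonist takes a single step against that approximately optimal follower.

import torch

def stackelberg_round(protagonist, antagonist, estimate_return,
                      opt_protagonist, opt_antagonist, k_ant=2):
    """One leader/follower round of the alternating optimization.

    `estimate_return(protagonist, antagonist)` is assumed to return a
    differentiable scalar Monte-Carlo estimate of E[J^eta(pi, Lambda)].
    """
    # Follower: minimize the protagonist's expected perturbed return.
    for _ in range(k_ant):
        opt_antagonist.zero_grad()
        antagonist_loss = estimate_return(protagonist, antagonist)
        antagonist_loss.backward()
        opt_antagonist.step()

    # Leader: maximize against the (approximately) best-responding follower.
    # The implicit dependence of the follower's parameters on the leader's
    # is dropped, i.e. a first-order approximation of the leader's gradient.
    opt_protagonist.zero_grad()
    protagonist_loss = -estimate_return(protagonist, antagonist)
    protagonist_loss.backward()
    opt_protagonist.step()

In practice, each call to estimate_return would be replaced by a standard policy-gradient update (e.g. a PPO step) computed on freshly collected perturbed trajectories.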
As Stackelberg games are a generalization of minimax games such as ours, this Stackelberg algorithm only requires 5 Gradient-based algorithm us to establish a leader and a follower to be able to find our Having established the favorable properties of an approxi- game’s equilibrium. We arbitrarily select the protagonist as mate equilibrium of our proposed game, we only need to the leader and the antagonist as their follower, but show in find an algorithm that can reliably reach it. First note that, the following convergence analysis that this choice can be in order for the algorithm to be able to tackle challeng- reversed without problem. ing problems, we wish both players to be used with high- capacity function approximators such as neural networks. Convergence Analysis Therefore, the algorithm we are looking for should not only Given some assumptions on the environment MDP, the converge towards the Minimax Equilibrium, but also do so Stackelberg algorithm described previously converges to the using gradient-based updates. equilibrium of our game. However, computing the solution to the minmax game in (6) for players represented by neural networks is not a Assumption 3. The reward function R and all transition simple task. Indeed, the dependence of each player’s re- functions P are both smooth and convex. ward on the other makes the optimization objective non- Assumption 3 is quite restrictive in practice. Indeed, while stationary (Fiez, Chasnov, and Ratliff 2020), hampering the smoothness portion of it is likely to hold in practice, e.g. naive gradient-based algorithms’ convergence properties. To many real-world environments like robotics can be consid- this end, we propose to view our game as a Stackelberg ered smooth (Pirotta, Restelli, and Bascetta 2015) and the game (Von Stackelberg 2010), which allows us to derive same can be said about neural networks with ReLU acti- well-defined gradient-based algorithms. vations (Petersen and Voigtländer 2018), the convexity as- sumption is much less likely to hold. We however wish to Stackelberg Algorithm emphasize that previous work (Pinto et al. 2017; Perolat An essential step to achieve stable learning of the game is to et al. 2015; Patek 1997) also uses this assumption in the con- take into account the game’s structure in the parameter up- text of minimax objectives and still obtain good empirical dates. Although there aren’t good gradient-based optimiza- results, even when the convexity assumption is not met. tion updates specifically designed for minimax games like When Assumption 3 is met, we have the following lemma. ours, such updates do exist for Stackelberg games. We there- Lemma 2. The expected return E[J η (π, Λ)] is convex with fore cast our game under the Stackelberg formulation, allow- respect to π and concave with respect to Λ. ing us to derive our desired gradient updates. A two-player Stackelberg game is an asymmetric game Proof. This follows from the convexity of R and all P. played with a leader l and a follower f . Both players aim to maximize their respective payoff functions ul (θl , θf ) and Building upon Lemma 2, we have the following theorem. Theorem 2. The proposed Stackelberg algorithm will con- Algorithm 1: CVaR Adversarial Stackelberg Algorithm verge towards the Minimax Equilibrium of the game. Require: πθ (protagonist), Λω (antagonist), η (perturbation Proof. 
This is a well-known result for Stackelberg games budget), Kant (number of intermediate antagonist steps) in the two-player zero-sum setting with a convex-concave 1: Nupdates = 0 playoff function (see Fiez, Chasnov, and Ratliff (2020)), a 2: while training not done do condition guaranteed by Lemma 2. 3: Get initial state st 4: ητ = η . Remaining antagonist budget Remember that the (approximate) Minimax Equilibrium 5: while st not terminal do of the game contains a CVaR (approximately) optimal pol- 6: at ∼ πθ (st ), Pt = P(st , at ) icy. It follows that Theorem 2 constitutes the proof that our 7: δt = Λω (Pt , ητ ) overall framework (the game and the proposed algorithm), 8: P̂t = Pt ◦ δ is indeed a theoretically justified way to attain a CVaR opti- 9: st+1 ∼ P̂t , rt+1 ∼ R(st+1 ) mal policy. A last theoretical result follows from the game’s 10: ητ = δt (sηt+1 τ . Update remaining budget ) convex-concave objective. 11: end while Corollary 1. For the proposed game, we have 12: Update θ or ω according to Nupdates and Kant . min max E [J η (π, Λ)] = max min E [J η (π, Λ)] . 13: Nupdates = Nupdates + 1 Λ π π Λ 14: end while Proof. This follows from the convexity of the payoff func- tion E [J η (π, Λ)] (Lemma 2), which allows us to apply the minimax theorem (Sion 1958), inversing the min and max terms. This last corollary is of separate interest, as it notably im- plies that the Stackelberg algorithm can take any of the play- ers as its leader. Practical Considerations In practice, we adopt a few relaxations of the exact Stack- elberg algorithm described above. The adopted relaxations Figure 2: Overview of our Lava gridworld environment. have proven effective in domains like GANs (Heusel et al. 2017; Metz et al. 2017) or model-based RL (Rajeswaran, Mordatch, and Kumar 2020), to name a few. Precisely, we grid). A lava zone lies between the starting point and the goal (i) implement the almost optimal updates of the follower by (wavy orange squares). A state in this game corresponds to doing Kant > 1 gradient steps of the antagonist (follower) the (x, y) coordinates of the protagonist and the state space before updating the protagonist (leader). Also, we (ii) use S corresponds to the set of all possible coordinates in the the first-order approximation of the joint gradient to update grid. The action space A for the protagonist corresponds to the leader rather than the computationally expensive true Ja- moving into one of the four cardinal directions (up, down, cobian term. right, left). The environment is stochastic with probability Another important consideration is the fact that the pro- (p = 0.05) for the protagonist to perform a random action tagonist needs to output perturbations within the admissible instead of the chosen action at . At each time step t of a game perturbation envelope for every collected trajectory. To do episode, the protagonist incurs a reward penalty of −0.035, so, we propose to simply keep track of the remaining antag- motivating the protagonist to reach the goal using the short- onist budget after every time step t, updating it according est path. The episode ends when the protagonist either (i) to the realized perturbation (consumed budget) δt (st+1 ). A reaches the goal, resulting into a reward of +1, (ii) falls into convex transformation and a rescaling function are applied the lava, resulting into a reward of −1, or (iii) when the max- to the budget predictions in order to constrain the predic- imum number of 40 steps is reached within the episode. 
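For a concrete reading of the inner loop of Algorithm 1, the sketch below rolls out one perturbed episode in a discrete-state environment. The env, protagonist and antagonist interfaces are hypothetical placeholders, and the final renormalization is only a numerical safeguard, since an admissible δt already yields a valid distribution.

import numpy as np

def run_perturbed_episode(env, protagonist, antagonist, eta, rng=None):
    """Roll out one episode of the proposed game (cf. Figure 1 and Algorithm 1).

    At each step the antagonist multiplies the true next-state distribution P_t
    by a perturbation delta_t, and the remaining budget is divided by the
    realized perturbation delta_t(s_{t+1}). States are integer ids here.
    """
    rng = rng or np.random.default_rng()
    state, eta_left, done = env.reset(), float(eta), False
    trajectory = []
    while not done:
        action = protagonist.act(state)
        probs = env.transition_probs(state, action)     # P_t, accessible by Assumption 2
        delta = antagonist.perturb(probs, eta_left)     # admissible delta_t given eta_left
        perturbed = probs * delta                       # Hadamard product P_t o delta_t
        perturbed = perturbed / perturbed.sum()         # numerical safeguard only
        next_state = rng.choice(len(perturbed), p=perturbed)
        reward = env.reward(state, action, next_state)
        eta_left /= delta[next_state]                   # consume (or recover) budget
        trajectory.append((state, action, reward, -reward))  # zero-sum antagonist reward
        state, done = next_state, env.is_terminal(next_state)
    return trajectory, eta_left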
tions to give a probability distribution when multiplied with As the protagonist aims to reach the goal location, the the state probability. A pseudocode of the overall algorithm antagonist rather aims to push the protagonist into the lava incorporating all the above relaxations can be seen in Algo- zone. Since the antagonist acts upon a limited budget, the rithm 1. protagonist is able to reduce the chances of falling into the lava by just moving farther away from the lava. As the 6 Experiments protagonist also incurs penalties for every step taken, they We conduct experiments on a variation of the Gym Min- nonetheless want to avoid making unnecessary steps, even- igrid Lava environment (Chevalier-Boisvert, Willems, and tually needing to be close to the lava zone. This dilemma be- Pal 2018) displayed in Figure 2. In our environment, a pro- tween minimizing the chances of falling into the lava while tagonist player walks around in a 7 × 10 grid with the aim also making sure that no unnecessary steps are taken serves to reach a goal location (top right corner of the grid) as fast as a good minimal illustration of a safety-critical environ- as possible given their initial location (top left corner of the ment. Indeed, the more weight the protagonist will put on avoiding large negative rewards, the more it will be willing than 0.05 indicates that the antagonist is disturbing the tran- to take the time to get to the lower cases of the grid before sition probabilities at this location since the probability dif- crossing from left to right. fers from the natural environment randomness. Implementation details Both players are trained using Actor-Critic PPO algo- Results rithms (Schulman et al. 2017). To keep our antagonist’s state and action spaces reasonable, we represent next state transi- Figure 3 displays the learned protagonist policy averaged tions Pt as the concatenation of the protagonist current co- over the 10 seeds for each of the considered game instances. ordinates (xt , yt ), the probability vector pt of each action As expected, Figure 3a shows that the protagonist learns a being executed and the value of their remaining perturba- the shortest path to the goal in the absence of an antago- tion budget ηt . The players are trained jointly over 5M time nist (η = 1). However, when there is an antagonist (η > 1), steps, following Algorithm 1 and the antagonist’s budget Figures 3b and 3c show that the protagonist learns a care- constraint is enforced using a convex rescaling method. The ful path that deviates from the lava. More specifically, the number Kant of antagonist updates between the policy up- stronger the antagonist (the larger the perturbation budget dates has been set to 2, after evaluating over Kant = 2, 4, 8. η), the more careful the protagonist has to be in order to An update of the parameters occurs every 10k time steps. avoid getting pushed into lava. This confirms that the pres- Both algorithms use an Adam optimizer with their own ence of an antagonist agent when learning the protagonist decaying learning rate that decreases linearly from 0.005 to policy results in a safer policy. 0.0001 each time their parameters are updated. Further de- Note: The noticeable noise in the averaged protagonist tails on the hyper-parameters and implementation are dis- policy displayed in Figure 3b is due to convergence instabil- cussed in our publicly available implementation. 
Our code- ities, which caused the protagonist to fail reaching the goal base is made publicly available2 for full reproduction of the in 2 out of the 10 tested seeds. This may be due to the pro- experiments and figures. tagonist being particularly unlucky in these realizations of the environment and/or in the initialization of their policy. Evaluation In other applications of game theory with gradient-based We consider three instances of the described game: the case algorithms (e.g. Generative Adversarial Networks (Good- where there is no antagonist, which corresponds to having fellow et al. 2020)), it is known that systems learning in an antagonist with a perturbation budget η = 1 (as a refer- an adversarial manner are subject to instabilities (Berthelot, ence for comparison), as well as having an antagonist with Schumm, and Metz 2017). Despite these challenges inherent two different budgets, η ∈ {25, 100}. Without an antagonist, to the training procedure, the protagonist consistently dis- the game corresponds to the classical problem of maximiz- played convergence to the same policy in 8/10 seeds, still ing the excepted returns. When the antagonist has a positive demonstrating the robustness of the method to diverse train- perturbation budget, this corresponds to optimizing a CVaR- ing scenarios. optimal policy, i.e. CVaR0.04 for η = 25 and CVaR0.01 for Figure 4 shows 300k trajectories obtained by executing a η = 100. In order to evaluate the robustness of the proposed protagonist (single seed) in a stochastic environment (p = approach, we replicate each instance of the game 10 times, 0.05) in presence of the antagonist used during their train- each time with a different random seed for the environment ing. The number on a given square indicates the probability (governing its randomness), for the protagonist (governing that the protagonist will perform a different action than at the initialization of its policy and the action selection pro- when st corresponds to that location. This includes both the cess), and for the antagonist (governing the initialization of natural environment randomness as well as the influence of its policy and the perturbation prediction process). the antagonist. Figures 4b and 4c show that the antagonist In order to visualize the protagonist strategy in any game is mostly active at the beginning of a game episode, regard- instance, we execute the policy learned by the protagonist less of the amount of perturbation budget. The beginning of agent in a deterministic environment (p = 0) without the the game is indeed the riskiest situation for the protagonist antagonist. This allows to observe the decisions made by the since the starting point is very close to the lava (top-left cor- protagonist agent without perturbations. To visualize the an- ner). More sophisticated adversarial strategies emerge when tagonist strategy in a given game instance, we execute the the antagonist benefits from a higher budget (see Figure 4c). policy learned by a protagonist under this instance, i.e. the For instance, when the protagonist gets far from the lava, stochastic environment (p = 0.05) in presence of the an- we see that the antagonist exploits the borders of the grid tagonist. For each state in the grid, we estimate the proba- to force the protagonist into accumulating moving reward bility that the protagonist performs a different action than penalties. Also, when the protagonist is at the extreme bot- their preferred action in that state. 
A probability estimate tom of the grid (far from both the lava and the goal), the equal to 0.05 indicates that the antagonist does not hamper perturbation probability drops below 0.05 (the natural envi- the protagonist on this location since this corresponds to the ronment randomness), meaning that the antagonist tends to environment randomness (p). A probability greater or lesser give more leverage to the protagonist in this region. The an- tagonist does this in order to recover its budget to get more 2 https://github.com/TortillasAlfred/CvarAdversarialRL powerful when the protagonist gets closer to the goal. (a) No antagonist (η = 1) (b) Perturbation budget η = 25 (c) Perturbation budget η = 100 Figure 3: Protagonist policy (averaged over 10 random seeds) learned on a stochastic environment with different antagonist perturbation budget. S and G respectively denote the starting location of the protagonist and the goal to reach. (a) No antagonist (η = 1) (b) Perturbation budget η = 25 (c) Perturbation budget η = 100 Figure 4: Antagonist policy averaged over 300k trajectories of a protagonist learned on a stochastic environment with different antagonist perturbation budget. Numbers indicate the probabilities that the protagonist may perform an action different from the selected action at when their state st corresponds to the given location. In order to display accurate estimators, only the probabilities for the locations that the agent visited at least 500 times are displayed. G denotes the goal. 7 Conclusion potential of the method. Also, in light of the instability ob- served in our own empirical results, we would like to address In this work, we presented an adversarial game designed to the issue of sensibility to hyperparameters in our practical compute risk-sensitive policies with respect to the Condi- gradient-based algorithm. tional Value-at-Risk (CVaR) risk measure. Our adversarial game is particular in the sense that it limits an antagonist’s References perturbations to be within a specified budget and to only change next state transitions for a protagonist. We then pre- Artzner, P.; Delbaen, F.; Eber, J.-M.; and Heath, D. 1999. Coherent measures of risk. Mathematical finance, 9(3): 203–228. sented a gradient-based algorithm based on the Stackelberg formulation of the game to solve it, both proving its conver- Baek, D.; Hwang, M.; Kim, H.; and Kwon, D.-S. 2018. Path gence under sufficient conditions and presenting key consid- Planning for Automation of Surgery Robot based on Probabilistic Roadmap and Reinforcement Learning. In 2018 15th International erations for a practical implementation. We put our method Conference on Ubiquitous Robots (UR), 342–347. to the test on an artificial risky environment, illustrating that Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distri- increasing the antagonist’s budget indeed leads a more cau- butional perspective on reinforcement learning. In International tious protagonist policy. Given the fact that this is the first Conference on Machine Learning, 449–458. PMLR. game-theoretic perspective on CVaR RL, we expect this pa- Berthelot, D.; Schumm, T.; and Metz, L. 2017. BEGAN: per to be a building block towards connecting risk-sensitivity Boundary Equilibrium Generative Adversarial Networks. CoRR, in RL with the field of Game Theory. abs/1703.10717. One possibly interesting application of our method is with Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. 
Minimalis- regards to the Sim2Real domain, where one wishes to opti- tic Gridworld Environment for OpenAI Gym. https://github.com/ mize to train an RL agent on a high-quality simulation en- maximecb/gym-minigrid. vironment before real-life deployment. Since this domain is Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk- naturally concerned with risk-sensitivity and supposes ac- Sensitive and Robust Decision-Making: a CVaR Optimization Ap- cess to a simulator to access and perturbate the true next proach. In NIPS, 1522–1530. state transitions, it appears highly likely that our proposed Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018. Im- approach can be of use in this context. plicit quantile networks for distributional reinforcement learn- In future work, we would like to apply our game to a con- ing. In International conference on machine learning, 1096–1105. crete Sim2Real application to try and leverage the empirical PMLR. Dennis, M.; Jaques, N.; Vinitsky, E.; Bayen, A. M.; Russell, Pan, X.; Seita, D.; Gao, Y.; and Canny, J. 2019. Risk averse robust S.; Critch, A.; and Levine, S. 2020. Emergent Complexity and adversarial reinforcement learning. In 2019 International Confer- Zero-shot Transfer via Unsupervised Environment Design. In ence on Robotics and Automation (ICRA), 8522–8528. IEEE. Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., Patek, S. D. 1997. Stochastic and shortest path games: theory and eds., Advances in Neural Information Processing Systems 33: An- algorithms. Ph.D. thesis, Massachusetts Institute of Technology. nual Conference on Neural Information Processing Systems 2020, Perolat, J.; Scherrer, B.; Piot, B.; and Pietquin, O. 2015. Ap- NeurIPS 2020, December 6-12, 2020, virtual. proximate dynamic programming for two-player zero-sum markov Fei, Y.; Yang, Z.; Chen, Y.; Wang, Z.; and Xie, Q. 2020. Risk- games. In International Conference on Machine Learning, 1321– Sensitive Reinforcement Learning: Near-Optimal Risk-Sample 1329. PMLR. Tradeoff in Regret. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Petersen, P.; and Voigtländer, F. 2018. Optimal approximation Balcan, M.; and Lin, H., eds., Advances in Neural Information of piecewise smooth functions using deep ReLU neural networks. Processing Systems 33: Annual Conference on Neural Information Neural Networks, 108: 296–330. Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust virtual. adversarial reinforcement learning. In International Conference on Fiez, T.; Chasnov, B.; and Ratliff, L. 2020. Implicit Learning Dy- Machine Learning, 2817–2826. PMLR. namics in Stackelberg Games: Equilibria Characterization, Con- Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy gradient in vergence Analysis, and Empirical Study. In III, H. D.; and Singh, lipschitz markov decision processes. Machine Learning, 100(2): A., eds., Proceedings of the 37th International Conference on Ma- 255–283. chine Learning, volume 119 of Proceedings of Machine Learning Rajeswaran, A.; Mordatch, I.; and Kumar, V. 2020. A Game Theo- Research, 3133–3144. PMLR. retic Framework for Model Based Reinforcement Learning. In III, Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde- H. D.; and Singh, A., eds., Proceedings of the 37th International Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Gener- Conference on Machine Learning, volume 119 of Proceedings of ative adversarial networks. 
Communications of the ACM, 63(11): Machine Learning Research, 7953–7963. PMLR. 139–144. Schubert, F.; Eimer, T.; Rosenhahn, B.; and Lindauer, M. 2021. Au- Gosavi, A. A.; Das, S. K.; and Murray, S. L. 2014. Beyond ex- tomatic Risk Adaptation in Distributional Reinforcement Learning. ponential utility functions: A variance-adjusted approach for risk- CoRR, abs/2106.06317. averse reinforcement learning. In 2014 IEEE Symposium on Adap- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, tive Dynamic Programming and Reinforcement Learning (AD- O. 2017. Proximal Policy Optimization Algorithms. CoRR, PRL), 1–8. IEEE. abs/1707.06347. Gottesman, O.; Johansson, F.; Komorowski, M.; Faisal, A.; Sontag, Shapiro, A.; Dentcheva, D.; and Ruszczyński, A. 2014. Lectures D.; Doshi-Velez, F.; and Celi, L. A. 2019. Guidelines for reinforce- on stochastic programming: modeling and theory. SIAM. ment learning in healthcare. Nature medicine, 25(1): 16–18. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Hochreiter, S. 2017. Gans trained by a two time-scale update rule 2018. A general reinforcement learning algorithm that masters converge to a local nash equilibrium. Advances in neural informa- chess, shogi, and Go through self-play. Science, 362(6419): 1140– tion processing systems, 30. 1144. Sion, M. 1958. On general minimax theorems. Pacific Journal of Keramati, R.; Dann, C.; Tamkin, A.; and Brunskill, E. 2020. Be- mathematics, 8(1): 171–176. ing optimistic to be conservative: Quickly learning a cvar policy. In Proceedings of the AAAI Conference on Artificial Intelligence, Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to reinforce- volume 34, 4436–4443. ment learning, volume 135. MIT press Cambridge. Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2017. Adversarial Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2015. Pol- Machine Learning at Scale. In 5th International Conference on icy Gradient for Coherent Risk Measures. In Cortes, C.; Lawrence, Learning Representations, ICLR 2017, Toulon, France, April 24- N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances 26, 2017, Conference Track Proceedings. OpenReview.net. in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A 2015, Montreal, Quebec, Canada, 1468–1476. Contextual-Bandit Approach to Personalized News Article Recom- Tamar, A.; and Mannor, S. 2013. Variance Adjusted Actor Critic mendation. In Proceedings of the 19th International Conference on Algorithms. CoRR, abs/1310.3697. World Wide Web, WWW ’10, 661–670. New York, NY, USA: As- sociation for Computing Machinery. ISBN 9781605587998. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; and Abbeel, P. 2017. Domain randomization for transferring deep neu- Mandlekar, A.; Zhu, Y.; Garg, A.; Fei-Fei, L.; and Savarese, S. ral networks from simulation to the real world. In 2017 IEEE/RSJ 2017. Adversarially robust policy learning: Active construction international conference on intelligent robots and systems (IROS), of physically-plausible perturbations. In 2017 IEEE/RSJ Interna- 23–30. IEEE. tional Conference on Intelligent Robots and Systems (IROS), 3932– Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; 3939. IEEE. Dudzik, A.; Chung, J.; Choi, D. 
H.; Powell, R.; Ewalds, T.; Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using Generative Adversarial Networks. In 5th International Conference multi-agent reinforcement learning. Nature, 575(7782): 350–354. on Learning Representations, ICLR 2017, Toulon, France, April Von Stackelberg, H. 2010. Market structure and equilibrium. 24-26, 2017, Conference Track Proceedings. OpenReview.net. Springer Science & Business Media. Mihatsch, O.; and Neuneier, R. 2002. Risk-sensitive reinforcement Zhang, J.; and Weng, P. 2021. Safe Distributional Reinforcement learning. Machine learning, 49(2): 267–290. Learning. CoRR, abs/2102.13446.