A Game-Theoretic Perspective on Risk-Sensitive Reinforcement Learning Mathieu Godbout 1, 2, 3 , Maxime Heuillet 1, 2 , Sharath Chandra Raparthy 3, Rupali Bhati 1, 2, 3 , Audrey Durand 1, 2, 3, * , 1 Université Laval, 2 Institut Intelligence et Données, 3 Mila, * CIFAR AI Chair {mathieu.godbout.3@, maxime.heuillet.1@, rupali.bhati.1@, audrey.durand@ift}.ulaval.ca, raparths@mila.quebec Abstract a nerve or a critical region of the brain (Baek et al. 2018). Automation in this context and other safety-critical domains Most Reinforcement Learning (RL) approaches usually aim to find a policy that maximizes its expected return. However, can only come if the agent is able to successfully reach its this objective may be inappropriate in many safety-critical goal while avoiding the most risky actions (Gottesman et al. domains such as healthcare or autonomous driving, where it 2019). Therefore, naively maximizing the expected return is is often preferable to optimize for a risk-sensitive measure not a satisfying approach in such scenarios. of the policy’s return as the learning objective, such as the This has motivated the community to design risk-sensitive Conditional-Value-at-Risk (CVaR). Although previous litera- algorithms, where the agent is trained to account for the pos- ture exists to address the problem of learning CVaR-optimal sibility of catastrophic events. One way to proceed is to in- policies in Markov decision problems, it mostly relies on the clude a risk measure (Artzner et al. 1999) in the algorithm’s distributional RL perspective. In this paper, we solve this problem by rather proposing an approach based on a game objective. A commonly used risk measure in RL is the Con- theoretic perspective, which can be applied on top of any ex- ditional Value-at-Risk (CVaRα ), defined as the expectation isting RL algorithm. At the core of our approach is a two- over the worst α-quantile of a distribution. The lower the player zero-sum game between a policy player and an adver- value of α, the more risk-averse the agent. The search for sary that perturbs the policy player’s state transitions given a CVaR-optimal policies, usually referred to as CVaR RL, is finite budget. We show that, the closer the players are to the typically achieved with distributional RL (Schubert et al. game’s equilibrium point, the closer the learned policy is to 2021; Keramati et al. 2020). In this approach, the agent pre- the CVaR-optimal one with a risk tolerance explicitly related dicts the whole distribution of returns rather than only its to the adversary’s budget. We provide a gradient-based train- mean (Bellemare, Dabney, and Munos 2017; Dabney et al. ing procedure to solve the proposed game by formulating it 2018). A CVaRα transform is then applied to the predicted as a Stackelberg game, enabling the use of deep RL architec- tures and training algorithms. We illustrate the applicability distribution, resulting in more conservative choices depend- of our approach on a risky artificial environment, presenting ing on the chosen α threshold. the different policies learned for various adversary budgets. In this work, we rather tackle the CVaR RL problem in a game-theoretic setting. Looking at a problem from a game-theoretic perspective has helped advance other fields 1 Introduction like Generative Adversarial Networks (GANs) (Berthelot, Reinforcement Learning (RL) (Sutton, Barto et al. 1998) is Schumm, and Metz 2017; Goodfellow et al. 
2020), suggest- a branch of machine learning where the learner (agent) is ing that this perspective is a promising new research avenue not told how to behave in an environment, but instead must for risk-sensitivity in RL. Precisely, we suggest that learn- learn to do so by trial and error. RL approaches usually aim ing the policy of the RL protagonist (agent) can be cast as a to find agents that maximize their expected return. This ex- two-player zero-sum game between this protagonist and an pectation maximization objective has led to increased suc- antagonist. In this game, the protagonist aims at maximizing cesses in domains like videogames (Vinyals et al. 2019), its reward collection, while the goal of the antagonist is to board games (Silver et al. 2018) or content recommenda- minimize the rewards obtained by the protagonist. In order tion (Li et al. 2010). However, in safety-critical domains like to achieve this, the antagonist has access to a fixed budget for healthcare, autonomous driving or financial planning, some interfering with the next state transitions of the protagonist. erroneous actions may lead to disastrous consequences. In Our contributions are the following: healthcare for instance, an RL agent may be in charge of de- signing the shortest path to an organ for surgery. In this task, • we propose a two-player zero-sum game formulation some paths may be shorter but at the same time may be risk which yields (approximate) CVaR-optimal policies at its endangering the patient as they are too close to an artery, (approximate) equilibrium; Copyright © 2022 for this paper by its authors. Use permitted un- • we provide a gradient-based algorithm to solve the pro- der Creative Commons License Attribution 4.0 International (CC posed game alongside sufficient conditions to ensure its BY 4.0). convergence; • we illustrate the applicability of our method in a risky Conditional-Value-at-Risk (Artzner et al. 1999) at confi- artificial experiment. dence level α is defined as   The rest of the paper is separated as follows. Section 2 1 first introduces the necessary background and notation. Fol- CVaRα (Z) := min w + E [max(Z − w, 0)] . (2) w∈R α lowing is an overview of the related work in Section 3. Next, we present in Section 4 an overview of our proposed game If Z has a continuous distribution, the CVaR measure can be formulation alongside its relevant properties, which include written as the convergence to a CVaR RL policy at equilibrium point. CVaRα (Z) := E [z | z ≤ VaR(Z)] . (3) In Section 5, we develop a gradient-based algorithm based on the Stackelberg game formulation to solve the presented Intuitively, the CVaRα (Z) measure can therefore be viewed game, presenting both a theoretical analysis of its conver- as the mean over the α-quantile worst values of Z. Since gence as well as key practical concerns. Lastly, we illustrate Z represents returns in our case, the worst values of Z can the validity of the method in an artificial gridworld experi- be seen as the ones where most risk has been incurred. This ment (Section 6). last interpretation is particularly attractive, as it makes the CVaRα easy to understand for non-experts who might be 2 Background and Notation involved in the design of any risk-sensitive model in safety- Reinforcement Learning critical domains. In RL, an agent interacts with an environment, trying to CVaR Reinforcement Learning identify actions that lead to the highest rewards. 
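To make the measure concrete, the following sketch (Python/NumPy; illustrative only, not part of the released implementation) estimates CVaRα from a batch of sampled returns, once through the tail-average reading of (3) and once through a grid search on a variational form written for the lower tail of the return distribution:

import numpy as np

def cvar_empirical(returns, alpha):
    """Mean over the worst alpha-fraction of sampled returns (cf. Eq. (3))."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst returns first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the alpha-tail
    return returns[:k].mean()

def cvar_variational(returns, alpha, grid_size=1000):
    """CVaR via a grid search on the lower-tail variational form
    max_w { w - E[(w - Z)^+] / alpha }, the return-maximization
    counterpart of the minimization form in Eq. (2)."""
    returns = np.asarray(returns, dtype=float)
    candidates = np.linspace(returns.min(), returns.max(), grid_size)
    values = [w - np.mean(np.maximum(w - returns, 0.0)) / alpha for w in candidates]
    return max(values)

# Example: the worst-5% average of a standard normal is roughly -2.06.
samples = np.random.default_rng(0).normal(size=100_000)
print(cvar_empirical(samples, 0.05), cvar_variational(samples, 0.05))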
The most To measure the level of risk associated with a policy π, the common RL formulation for the environment is that of a CVaRα measure defined in (3) can be applied to the random Markov Decision Process (MDP) framework. An MDP M discounted return J (π) defined in (1). Optimizing for this is represented by a tuple hS, A, P, R, γ, ρi, where S is the yields the CVaR RL objective state space, A is the action space, P : S × A → [0, 1]S is the transition probability function which governs the evolu- π ∗ = arg max CVaRα (J (π)) . (4) tion of the system, R : S × A → R is the reward function, π γ ∈ [0, 1) is a discount factor and ρ is the initial state dis- This objective, which differs from the traditional expectation tribution. We formalize an agent’s decisions as following a maximization goal, can be viewed as finding a risk-sensitive policy πθ : S → [0, 1]A that generates distributions over policy. Indeed, by having the optimal CVaRα return, the op- which actions to take for each state from parameters θ1 . timal policy π ∗ is maximizing the expectation over its worst A policy’s actions not only generate immediate rewards α-percentile trajectories. Since the expectation is now taken but also influence future rewards because they rule the next over a small, disadvantageous subset of all returns rather state the agent will end up in. This is why the performance of than over all of them, the optimal policy is more sensible a policy π is measured by balancing immediate and distant towards avoiding the probability of large negative returns. rewards, using a measure called the random return: In practice, this means that optimizing for the CVaR RL ob- ∞ X jective (4) will produce policies that accept to reduce their J (π) := γ t rt , (1) expected performance, so long as it means they avoid catas- t=0 trophic trajectories in return. Let us take a moment to note here that it is well known where γ is a discount factor, at ∼ π(st ), st+1 ∼ P(st , at ) that the optimal policy π ∗ for the CVaR RL objective can and rt ∼ R(st , at ). Due to the stochasticity that may be be history-dependent (Shapiro, Dentcheva, and Ruszczyński present in either the reward function, the policy’s action 2014), meaning that there are cases where the agent needs selection or the environment transitions, the random re- to know what previous rewards were collected to achieve turn J (π) is accurately defined as a random variable. The CVaR optimal returns. We follow previous work on CVaR usual objective in RL is to find the policy π ∗ that maxi- RL (Keramati et al. 2020; Dabney et al. 2018) and limit our mizes its expected return, which boils down to finding π ∗ = analysis to stationary, history independent policies which arg maxπ E[J (π)]. can be suboptimal but typically achieve high CVaR nonethe- less. Conditional-Value-at-Risk Let Z be a random variable with a bounded expectation 3 Related Work E[Z] < ∞ and cumulative distribution function F (z) := Adversarial Reinforcement Learning P(Z ≤ z). In this paper, we interpret Z as a return to be maximized. The Value-at-Risk (VaR) at confidence level Although typically used in the supervised learning setup, ad- α ∈ (0, 1] represents the worst 1 − α quantile of Z, versarial learning (Kurakin, Goodfellow, and Bengio 2017) i.e., VaRα (Z) := min{z|F (z) ≥ α}. Analogously, the has already been applied in RL with the aim of increasing task difficulty. Dennis et al. (2020); Tobin et al. 
(2017) use 1 For ease of notation, we will only write π when it is clear that an adversary to design hard environment instances in order we are talking about a parametrized policy πθ . This applies for any to optimize the robustness of the policy learned by a RL pro- parametrized function throughout the paper. tagonist. This is achieved by allowing the adversary to select a different environment from a constrained set at the begin- to apply external forces or disturbances to the environment ning of each episode. The adversary is therefore limited to model. Their antagonist’s action space is however rather a single move (at the beginning of each episode) and its ac- different, as it can only change parameters of the model tions are not budgeted. within a given range at every time step. In contrast, our More similar to the current work, Mandlekar et al. (2017) approach lets the antagonist explicitly change the model’s use an adversary to perturb the action selection policy of a transition probabilities and we limit the antagonist’s pertur- RL protagonist at every time step. However, their adversary bation amount over a whole trajectory rather than at every consists in gradient-based perturbations which do not adapt time step. Moreover, unlike their CVaR formulation which to the policy of the protagonist. In the formulation tackled is essentially only intuitive, we establish a clear, theoreti- in the current work, the antagonist strategy is jointly learned cally justified, connection between our antagonist’s budget with the protagonist strategy, resulting in an antagonist strat- and the α-quantile the protagonist’s policy is optimized for. egy adapted to battle the protagonist. 4 Game Setting Risk-Sensitive RL Before diving into the explanation of our proposed adversar- Broadly speaking, risk-sensitivity in RL represents the en- ial game, let us first present the two assumptions necessary semble of methods that aim at reducing the level of risk on the target MDP for the game to be applicable. associated with the learned policy. Different risk measures have been proposed in the literature to address this problem. Assumption 1. The MDP has stochastic state transitions P. First, numerous approaches have been proposed to explic- The stochasticity is required because it is precisely upon itly balance the expectation and variance of the return of a those state transitions that the antagonist will be acting. policy (Tamar and Mannor 2013; Pan et al. 2019). Contrary This assumption is the least restrictive, as a large number to our CVaR objective which only assumes the returns to be of MDPs are already stochastic in nature. Furthermore, any bounded, such approaches are valid only under the assump- deterministic transition function P could be made stochastic tion that returns follow a normal distribution. by adding a noise component, either via ε-random sampling One other popular alternative to incorporate risk- for discrete states or adding a Gaussian noise ε ∼ N (0, σ) sensitivity is the use of exponential utility functions (Mi- for continuous states. hatsch and Neuneier 2002; Fei et al. 2020). In these ap- proaches, the return landscape is reshaped into a convex Assumption 2. It is possible to access the state transitions set, essentially achieving balance between expectation and P and arbitrarily modify their non-zero values. worst-case maximization of the return. 
The reshaping is This assumption is the one that allows the antagonist’s based on applying an exponential function and is regulated actions to be registered. This is a much more restrictive as- by a user-specified λ trade-off parameter. However, it is dif- sumption and is only likely to be met in artificial environ- ficult to interpret what different values of λ represent regard- ments over which the user has a lot of control. Notably, given ing the balance between mean and minimal return (Gosavi, an MDP with added noise as described above, it should be Das, and Murray 2014), in contrast with the straightforward relatively straightforward to meet this assumption by access- interpretation of our proposed CVaR measure. ing the analytical formulation of the added noise component. Conditional-Value-at-Risk RL Formulation Many algorithms have been proposed to find optimal poli- We aim to learn a risk-sensitive policy with respect to the cies with respect to the CVaR criterion. Proposed algo- CVaR risk measure. To do so, we leverage Chow et al. rithms range from policy gradient (Tamar et al. 2015) and Q- (2015)’s model perturbation framework. Namely, we design learning (Chow et al. 2015), to actor-critic methods (Tamar a zero-sum adversarial game where a protagonist tries to and Mannor 2013). Unlike our algorithm that is compati- learn a policy πθ to maximize its return while an antagonist ble with modern deep learning, the above approaches are all tries to learn a perturbation function Λω that instead mini- limited to either tabular or low-dimensional action and state mizes the protagonist’s return. space settings. To explain our game setting, let us first define time step More recently, distributional RL (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018) has been used to learn transition dynamics Pt = P(· | st , at ) ∈ [0, 1]|S| from a CVaR optimal policies in high-dimensional settings (Kera- given trajectory τ = {(st , at , rt )}Tt=1 , where T is the trajec- mati et al. 2020; Zhang and Weng 2021). This approach has tory’s duration. At time step t, the antagonist is given direct seen growing usage, largely due to the empirical efficiency access to Pt and allowed to perturb it with a multiplicative of distributional RL and its ability to incorporate a wide va- probability perturbation δt ∈ (0, ∞)|S| , yielding perturbed riety of risk measures in its objective, not exclusively the transitions P̂t = Pt ◦ δt , for ◦ the Hadamard (pointwise) CVaR. Instead of relying on distributional RL, our paper product. A perturbation δ is admissible if it generates a valid presents a game-theoretic perspective which can be used on probability distribution P̂ . Let ∆t be the set of all perturba- top of any conventional RL algorithm. tions δt admissible for a given Pt and ∆ = ∆1 × ... × ...∆T The closest work to ours would be Robust Adversarial Re- be the set of all possible perturbations of a trajectory τ . We inforcement Learning (RARL) (Pinto et al. 2017), where a can define a trajectory perturbation budget η ≥ 1 and con- CVaR optimal policy is learned by allowing an antagonist strain ∆η to contain only perturbations within an η budget st by posing the admissible perturbation envelope rt Protagonist ∆η = ∆ = {δt }Tt=1 | δ1 (s1 ) ∗ ... ∗ δT (sT ) ≤ η , (5)  where δt (si ) represents the antagonist’s perturbation to- at wards next state si at time step t. 
Note here that, since per- st+1 turbations are multiplicative in nature, η = 1 represents an empty budget, meaning that the antagonist could not apply rt+1 Environment any modifications to next state transitions. This definition and limitation of the antagonist’s actions −rt+1 δt η t Pt enjoy two principal interesting properties. First, since the antagonist is only given a limited budget for each trajec- tory, they have to be efficient with their perturbations, trying Antagonist to maximize the damage to the protagonist’s return without wasting their budget. Intuitively, the role of the budget can be seen as a way to ensure that the antagonist cannot force Figure 1: Overview of the proposed game’s interaction loop. the worst possible scenario to occur at every time step. Sec- Components that deviate from the standard RL training loop ondly, the fact that the antagonist perturbations are done via are in red. a Hadamard product constrains the perturbations to states with non-zero transition probability by default. Essentially, this means that the antagonist is only allowed to modify the Definition 1. (Minimax Equilibrium). A Minimax Equilib- distribution over next states that were already reachable. rium (π ∗ , Λ∗ ) is defined as a point where The main idea of this paper is that an antagonist Λ can be learned to select perturbations within the admissible pertur- E [J η (π, Λ∗ )] ≤ E [J η (π ∗ , Λ∗ )] ≤ E [J η (π ∗ , Λ)] , (7) bation envelope ∆η defined in (5), attempting to minimize for which the inequalities hold for any other protagonist π the protagonist π’s reward. Combining the protagonist and or antagonist Λ. antagonist naturally produces the two-player zero-sum game At this equilibrium point, we have the following interest- max min E [J η (π, Λ)] , (6) ing result. π Λ where E[J η (π, Λ)] represents the expected return collected Lemma 1. Suppose πθ and Λω are expressive enough from a protagonist π with admissible perturbations sampled player parametrizations, meaning that they can represent from an antagonist Λ respecting a budget η. Interestingly the optimal players π ∗ and Λ∗ . Then, at a Minimax Equilib- enough, this formulation implies that both players’ goals are rium point (π ∗ , Λ∗ ), the protagonist’s policy π ∗ is CVaRα to respectively minimize and maximize the expected return, optimal for α = η1 , where η is the antagonist Λ’s perturba- retrieving the classical RL objective. This notably means tion budget: that any classical RL algorithm like actor-critic methods or max min E [J η (π, Λ)] = max CVaR η1 [J (π)] . Deep Q-Network may be used without fundamental changes π Λ π for our game. Moreover, the embedded flexibility of the Proof. Proposition 1 from Chow et al. (2015) established game also allows for the underlying MDP to have contin- that uous action and/or state spaces. inf0 E [J η (π, Λ0 )] = CVaR η1 [J (π)] . We note that both the protagonist and antagonist players Λ are well defined under the MDP framework for a shared If the parametrization of the antagonist Λ is expressive environment. On one hand, the protagonist player’s MDP enough, the inf operation will return an antagonist in the remains exactly the same, with the only modification that search space, implying that one can look for the best avail- next states are sampled from P̂t rather than Pt . On the other able antagonist for a protagonist to find the CVaR of its re- hand, the antagonist receives as input states s0t = (Pt , ηt ) ∈ turns. 
The result follows by taking the max over all possible S 0 = [0, 1]|S| × [1, ∞) and outputs perturbed transitions protagonists, where we once again use the fact that π is ex- pressive enough to guarantee the existence in the parameter P̂t = Pt ◦ δt subject to P̂t being a distribution over next search space. states and δt (s) ≤ ηt for all states. Next antagonist states s0t+1 are selected by first sampling the next protagonist state The result in Lemma 1 is satisfying in two ways. First, st+1 from P̂t , then sampling the next protagonist action it directly connects our adversarial framework to risk- at+1 from π(st+1 ), and finally observing the next transi- sensitivity for the learned protagonist policy, even though tion distribution Pt+1 . Lastly, since the game is zero-sum, the protagonist and antagonist’s objectives remain axed on the antagonist’s reward is simply −rt+1 . An overview of the expectation maximization. Secondly, the equation estab- interaction between the protagonist, antagonist and environ- lishes that the α confidence interval in the CVaRα objective ment is shown in Figure 1. is directly linked to the budget perturbation η via α = η1 . Given our parametrized setting where players are learned Connection to CVaR RL rather than exactly computed, we are actually more inter- The solution concept to the game (6) is given by its Minimax ested in the notion of approximate equilibrium, which is Equilibrium. more likely to be encountered in practice. Definition 2. (ε-Minimax Equilibrium). Let V ∗ := uf (θl , θf ), where θ represents the parameters of a given E [J η (π ∗ , Λ∗ )], the value at the Minimax Equilibrium player. The particularity in this game is that the follower’s (π ∗ , Λ∗ ). An ε-Minimax equilibrium is a pair of players parameters always represent a best-response with respect to (π 0 , Λ0 ) where the leader’s parameters. Accordingly, the leader can then pick its parameters by taking for granted that the follower’s V ∗ − min E [J η (π 0 , Λ)] ≤ ε parameters will be optimal with respect to them. This yields Λ (8) the following optimization problem for the leader and max E [J η (π, Λ0 )] − V ∗ ≤ ε π ( ) 0 0 Interestingly, this approximate equilibrium yields an ap- θl ∈ arg max ul (θ, θf ) θf ∈ arg max uf (θl , θf ) , proximately CVaR-optimal policy. θ θf Theorem 1. Suppose πθ and Λω are expressive enough and for the follower player parametrizations. Then, at an ε-Minimax Equilib- rium point (π 0 , Λ0 ), the protagonist’s policy π 0 is at most θf ∈ arg max uf (θl , θ). θ ε-suboptimal: The key intuition to note here is that, since the follower al- max CVaR η1 [J (π)] − CVaR η1 [J (π 0 )] ≤ ε π ways has optimal parameters with respect to the leader, then the follower parameters may be seen as an implicit function Proof. Taking the first inequality in (8), we have taking as input the leader’s parameters θf (θl ). In turn, this allows the leader to exploit this optimal-response form of its V ∗ − minΛ E [J η (π 0 , Λ)] ≤ ε follower to update its own parameters towards its own goal. ⇐⇒ V ∗ − CVaR η1 [J (π 0 )] ≤ ε Intuitively, updating both player’s parameters may be viewed as a bi-level optimization procedure. In order to ⇐⇒ maxπ CVaR η1 [J (π)] − CVaR η1 [J (π 0 )] ≤ ε, solve a Stackelberg game, we can therefore simply alternate between solving each player player’s respective optimiza- first leveraging Proposition 1 from Chow et al. (2015) and tion problem, producing a simple yet effective algorithm. then using Lemma 1. 
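As a minimal sketch of this alternating scheme, the snippet below (PyTorch-style; the estimate_return placeholder and the parameter names are assumptions for illustration, not the released implementation) performs one leader/follower round: the antagonist takes several gradient steps toward a best response before the protagonist takes a single step against that approximately optimal follower.

import torch

def stackelberg_round(protagonist, antagonist, estimate_return,
                      opt_protagonist, opt_antagonist, k_ant=2):
    """One leader/follower round of the alternating optimization.

    `estimate_return(protagonist, antagonist)` is assumed to return a
    differentiable scalar Monte-Carlo estimate of E[J^eta(pi, Lambda)].
    """
    # Follower: minimize the protagonist's expected perturbed return.
    for _ in range(k_ant):
        opt_antagonist.zero_grad()
        antagonist_loss = estimate_return(protagonist, antagonist)
        antagonist_loss.backward()
        opt_antagonist.step()

    # Leader: maximize against the (approximately) best-responding follower.
    # The implicit dependence of the follower's parameters on the leader's
    # is dropped, i.e. a first-order approximation of the leader's gradient.
    opt_protagonist.zero_grad()
    protagonist_loss = -estimate_return(protagonist, antagonist)
    protagonist_loss.backward()
    opt_protagonist.step()

In practice, each call to estimate_return would be replaced by a standard policy-gradient update (e.g. a PPO step) computed on freshly collected perturbed trajectories.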
As Stackelberg games are a generalization of minimax games such as ours, this Stackelberg algorithm only requires 5 Gradient-based algorithm us to establish a leader and a follower to be able to find our Having established the favorable properties of an approxi- game’s equilibrium. We arbitrarily select the protagonist as mate equilibrium of our proposed game, we only need to the leader and the antagonist as their follower, but show in find an algorithm that can reliably reach it. First note that, the following convergence analysis that this choice can be in order for the algorithm to be able to tackle challeng- reversed without problem. ing problems, we wish both players to be used with high- capacity function approximators such as neural networks. Convergence Analysis Therefore, the algorithm we are looking for should not only Given some assumptions on the environment MDP, the converge towards the Minimax Equilibrium, but also do so Stackelberg algorithm described previously converges to the using gradient-based updates. equilibrium of our game. However, computing the solution to the minmax game in (6) for players represented by neural networks is not a Assumption 3. The reward function R and all transition simple task. Indeed, the dependence of each player’s re- functions P are both smooth and convex. ward on the other makes the optimization objective non- Assumption 3 is quite restrictive in practice. Indeed, while stationary (Fiez, Chasnov, and Ratliff 2020), hampering the smoothness portion of it is likely to hold in practice, e.g. naive gradient-based algorithms’ convergence properties. To many real-world environments like robotics can be consid- this end, we propose to view our game as a Stackelberg ered smooth (Pirotta, Restelli, and Bascetta 2015) and the game (Von Stackelberg 2010), which allows us to derive same can be said about neural networks with ReLU acti- well-defined gradient-based algorithms. vations (Petersen and Voigtländer 2018), the convexity as- sumption is much less likely to hold. We however wish to Stackelberg Algorithm emphasize that previous work (Pinto et al. 2017; Perolat An essential step to achieve stable learning of the game is to et al. 2015; Patek 1997) also uses this assumption in the con- take into account the game’s structure in the parameter up- text of minimax objectives and still obtain good empirical dates. Although there aren’t good gradient-based optimiza- results, even when the convexity assumption is not met. tion updates specifically designed for minimax games like When Assumption 3 is met, we have the following lemma. ours, such updates do exist for Stackelberg games. We there- Lemma 2. The expected return E[J η (π, Λ)] is convex with fore cast our game under the Stackelberg formulation, allow- respect to π and concave with respect to Λ. ing us to derive our desired gradient updates. A two-player Stackelberg game is an asymmetric game Proof. This follows from the convexity of R and all P. played with a leader l and a follower f . Both players aim to maximize their respective payoff functions ul (θl , θf ) and Building upon Lemma 2, we have the following theorem. Theorem 2. The proposed Stackelberg algorithm will con- Algorithm 1: CVaR Adversarial Stackelberg Algorithm verge towards the Minimax Equilibrium of the game. Require: πθ (protagonist), Λω (antagonist), η (perturbation Proof. 
This is a well-known result for Stackelberg games budget), Kant (number of intermediate antagonist steps) in the two-player zero-sum setting with a convex-concave 1: Nupdates = 0 playoff function (see Fiez, Chasnov, and Ratliff (2020)), a 2: while training not done do condition guaranteed by Lemma 2. 3: Get initial state st 4: ητ = η . Remaining antagonist budget Remember that the (approximate) Minimax Equilibrium 5: while st not terminal do of the game contains a CVaR (approximately) optimal pol- 6: at ∼ πθ (st ), Pt = P(st , at ) icy. It follows that Theorem 2 constitutes the proof that our 7: δt = Λω (Pt , ητ ) overall framework (the game and the proposed algorithm), 8: P̂t = Pt ◦ δ is indeed a theoretically justified way to attain a CVaR opti- 9: st+1 ∼ P̂t , rt+1 ∼ R(st+1 ) mal policy. A last theoretical result follows from the game’s 10: ητ = δt (sηt+1 τ . Update remaining budget ) convex-concave objective. 11: end while Corollary 1. For the proposed game, we have 12: Update θ or ω according to Nupdates and Kant . min max E [J η (π, Λ)] = max min E [J η (π, Λ)] . 13: Nupdates = Nupdates + 1 Λ π π Λ 14: end while Proof. This follows from the convexity of the payoff func- tion E [J η (π, Λ)] (Lemma 2), which allows us to apply the minimax theorem (Sion 1958), inversing the min and max terms. This last corollary is of separate interest, as it notably im- plies that the Stackelberg algorithm can take any of the play- ers as its leader. Practical Considerations In practice, we adopt a few relaxations of the exact Stack- elberg algorithm described above. The adopted relaxations Figure 2: Overview of our Lava gridworld environment. have proven effective in domains like GANs (Heusel et al. 2017; Metz et al. 2017) or model-based RL (Rajeswaran, Mordatch, and Kumar 2020), to name a few. Precisely, we grid). A lava zone lies between the starting point and the goal (i) implement the almost optimal updates of the follower by (wavy orange squares). A state in this game corresponds to doing Kant > 1 gradient steps of the antagonist (follower) the (x, y) coordinates of the protagonist and the state space before updating the protagonist (leader). Also, we (ii) use S corresponds to the set of all possible coordinates in the the first-order approximation of the joint gradient to update grid. The action space A for the protagonist corresponds to the leader rather than the computationally expensive true Ja- moving into one of the four cardinal directions (up, down, cobian term. right, left). The environment is stochastic with probability Another important consideration is the fact that the pro- (p = 0.05) for the protagonist to perform a random action tagonist needs to output perturbations within the admissible instead of the chosen action at . At each time step t of a game perturbation envelope for every collected trajectory. To do episode, the protagonist incurs a reward penalty of −0.035, so, we propose to simply keep track of the remaining antag- motivating the protagonist to reach the goal using the short- onist budget after every time step t, updating it according est path. The episode ends when the protagonist either (i) to the realized perturbation (consumed budget) δt (st+1 ). A reaches the goal, resulting into a reward of +1, (ii) falls into convex transformation and a rescaling function are applied the lava, resulting into a reward of −1, or (iii) when the max- to the budget predictions in order to constrain the predic- imum number of 40 steps is reached within the episode. 
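For a concrete reading of the inner loop of Algorithm 1, the sketch below rolls out one perturbed episode in a discrete-state environment. The env, protagonist and antagonist interfaces are hypothetical placeholders, and the final renormalization is only a numerical safeguard, since an admissible δt already yields a valid distribution.

import numpy as np

def run_perturbed_episode(env, protagonist, antagonist, eta, rng=None):
    """Roll out one episode of the proposed game (cf. Figure 1 and Algorithm 1).

    At each step the antagonist multiplies the true next-state distribution P_t
    by a perturbation delta_t, and the remaining budget is divided by the
    realized perturbation delta_t(s_{t+1}). States are integer ids here.
    """
    rng = rng or np.random.default_rng()
    state, eta_left, done = env.reset(), float(eta), False
    trajectory = []
    while not done:
        action = protagonist.act(state)
        probs = env.transition_probs(state, action)     # P_t, accessible by Assumption 2
        delta = antagonist.perturb(probs, eta_left)     # admissible delta_t given eta_left
        perturbed = probs * delta                       # Hadamard product P_t o delta_t
        perturbed = perturbed / perturbed.sum()         # numerical safeguard only
        next_state = rng.choice(len(perturbed), p=perturbed)
        reward = env.reward(state, action, next_state)
        eta_left /= delta[next_state]                   # consume (or recover) budget
        trajectory.append((state, action, reward, -reward))  # zero-sum antagonist reward
        state, done = next_state, env.is_terminal(next_state)
    return trajectory, eta_left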
tions to give a probability distribution when multiplied with As the protagonist aims to reach the goal location, the the state probability. A pseudocode of the overall algorithm antagonist rather aims to push the protagonist into the lava incorporating all the above relaxations can be seen in Algo- zone. Since the antagonist acts upon a limited budget, the rithm 1. protagonist is able to reduce the chances of falling into the lava by just moving farther away from the lava. As the 6 Experiments protagonist also incurs penalties for every step taken, they We conduct experiments on a variation of the Gym Min- nonetheless want to avoid making unnecessary steps, even- igrid Lava environment (Chevalier-Boisvert, Willems, and tually needing to be close to the lava zone. This dilemma be- Pal 2018) displayed in Figure 2. In our environment, a pro- tween minimizing the chances of falling into the lava while tagonist player walks around in a 7 × 10 grid with the aim also making sure that no unnecessary steps are taken serves to reach a goal location (top right corner of the grid) as fast as a good minimal illustration of a safety-critical environ- as possible given their initial location (top left corner of the ment. Indeed, the more weight the protagonist will put on avoiding large negative rewards, the more it will be willing than 0.05 indicates that the antagonist is disturbing the tran- to take the time to get to the lower cases of the grid before sition probabilities at this location since the probability dif- crossing from left to right. fers from the natural environment randomness. Implementation details Both players are trained using Actor-Critic PPO algo- Results rithms (Schulman et al. 2017). To keep our antagonist’s state and action spaces reasonable, we represent next state transi- Figure 3 displays the learned protagonist policy averaged tions Pt as the concatenation of the protagonist current co- over the 10 seeds for each of the considered game instances. ordinates (xt , yt ), the probability vector pt of each action As expected, Figure 3a shows that the protagonist learns a being executed and the value of their remaining perturba- the shortest path to the goal in the absence of an antago- tion budget ηt . The players are trained jointly over 5M time nist (η = 1). However, when there is an antagonist (η > 1), steps, following Algorithm 1 and the antagonist’s budget Figures 3b and 3c show that the protagonist learns a care- constraint is enforced using a convex rescaling method. The ful path that deviates from the lava. More specifically, the number Kant of antagonist updates between the policy up- stronger the antagonist (the larger the perturbation budget dates has been set to 2, after evaluating over Kant = 2, 4, 8. η), the more careful the protagonist has to be in order to An update of the parameters occurs every 10k time steps. avoid getting pushed into lava. This confirms that the pres- Both algorithms use an Adam optimizer with their own ence of an antagonist agent when learning the protagonist decaying learning rate that decreases linearly from 0.005 to policy results in a safer policy. 0.0001 each time their parameters are updated. Further de- Note: The noticeable noise in the averaged protagonist tails on the hyper-parameters and implementation are dis- policy displayed in Figure 3b is due to convergence instabil- cussed in our publicly available implementation. 
Our code- ities, which caused the protagonist to fail reaching the goal base is made publicly available2 for full reproduction of the in 2 out of the 10 tested seeds. This may be due to the pro- experiments and figures. tagonist being particularly unlucky in these realizations of the environment and/or in the initialization of their policy. Evaluation In other applications of game theory with gradient-based We consider three instances of the described game: the case algorithms (e.g. Generative Adversarial Networks (Good- where there is no antagonist, which corresponds to having fellow et al. 2020)), it is known that systems learning in an antagonist with a perturbation budget η = 1 (as a refer- an adversarial manner are subject to instabilities (Berthelot, ence for comparison), as well as having an antagonist with Schumm, and Metz 2017). Despite these challenges inherent two different budgets, η ∈ {25, 100}. Without an antagonist, to the training procedure, the protagonist consistently dis- the game corresponds to the classical problem of maximiz- played convergence to the same policy in 8/10 seeds, still ing the excepted returns. When the antagonist has a positive demonstrating the robustness of the method to diverse train- perturbation budget, this corresponds to optimizing a CVaR- ing scenarios. optimal policy, i.e. CVaR0.04 for η = 25 and CVaR0.01 for Figure 4 shows 300k trajectories obtained by executing a η = 100. In order to evaluate the robustness of the proposed protagonist (single seed) in a stochastic environment (p = approach, we replicate each instance of the game 10 times, 0.05) in presence of the antagonist used during their train- each time with a different random seed for the environment ing. The number on a given square indicates the probability (governing its randomness), for the protagonist (governing that the protagonist will perform a different action than at the initialization of its policy and the action selection pro- when st corresponds to that location. This includes both the cess), and for the antagonist (governing the initialization of natural environment randomness as well as the influence of its policy and the perturbation prediction process). the antagonist. Figures 4b and 4c show that the antagonist In order to visualize the protagonist strategy in any game is mostly active at the beginning of a game episode, regard- instance, we execute the policy learned by the protagonist less of the amount of perturbation budget. The beginning of agent in a deterministic environment (p = 0) without the the game is indeed the riskiest situation for the protagonist antagonist. This allows to observe the decisions made by the since the starting point is very close to the lava (top-left cor- protagonist agent without perturbations. To visualize the an- ner). More sophisticated adversarial strategies emerge when tagonist strategy in a given game instance, we execute the the antagonist benefits from a higher budget (see Figure 4c). policy learned by a protagonist under this instance, i.e. the For instance, when the protagonist gets far from the lava, stochastic environment (p = 0.05) in presence of the an- we see that the antagonist exploits the borders of the grid tagonist. For each state in the grid, we estimate the proba- to force the protagonist into accumulating moving reward bility that the protagonist performs a different action than penalties. Also, when the protagonist is at the extreme bot- their preferred action in that state. 
A probability estimate tom of the grid (far from both the lava and the goal), the equal to 0.05 indicates that the antagonist does not hamper perturbation probability drops below 0.05 (the natural envi- the protagonist on this location since this corresponds to the ronment randomness), meaning that the antagonist tends to environment randomness (p). A probability greater or lesser give more leverage to the protagonist in this region. The an- tagonist does this in order to recover its budget to get more 2 https://github.com/TortillasAlfred/CvarAdversarialRL powerful when the protagonist gets closer to the goal. (a) No antagonist (η = 1) (b) Perturbation budget η = 25 (c) Perturbation budget η = 100 Figure 3: Protagonist policy (averaged over 10 random seeds) learned on a stochastic environment with different antagonist perturbation budget. S and G respectively denote the starting location of the protagonist and the goal to reach. (a) No antagonist (η = 1) (b) Perturbation budget η = 25 (c) Perturbation budget η = 100 Figure 4: Antagonist policy averaged over 300k trajectories of a protagonist learned on a stochastic environment with different antagonist perturbation budget. Numbers indicate the probabilities that the protagonist may perform an action different from the selected action at when their state st corresponds to the given location. In order to display accurate estimators, only the probabilities for the locations that the agent visited at least 500 times are displayed. G denotes the goal. 7 Conclusion potential of the method. Also, in light of the instability ob- served in our own empirical results, we would like to address In this work, we presented an adversarial game designed to the issue of sensibility to hyperparameters in our practical compute risk-sensitive policies with respect to the Condi- gradient-based algorithm. tional Value-at-Risk (CVaR) risk measure. Our adversarial game is particular in the sense that it limits an antagonist’s References perturbations to be within a specified budget and to only change next state transitions for a protagonist. We then pre- Artzner, P.; Delbaen, F.; Eber, J.-M.; and Heath, D. 1999. Coherent measures of risk. Mathematical finance, 9(3): 203–228. sented a gradient-based algorithm based on the Stackelberg formulation of the game to solve it, both proving its conver- Baek, D.; Hwang, M.; Kim, H.; and Kwon, D.-S. 2018. Path gence under sufficient conditions and presenting key consid- Planning for Automation of Surgery Robot based on Probabilistic Roadmap and Reinforcement Learning. In 2018 15th International erations for a practical implementation. We put our method Conference on Ubiquitous Robots (UR), 342–347. to the test on an artificial risky environment, illustrating that Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distri- increasing the antagonist’s budget indeed leads a more cau- butional perspective on reinforcement learning. In International tious protagonist policy. Given the fact that this is the first Conference on Machine Learning, 449–458. PMLR. game-theoretic perspective on CVaR RL, we expect this pa- Berthelot, D.; Schumm, T.; and Metz, L. 2017. BEGAN: per to be a building block towards connecting risk-sensitivity Boundary Equilibrium Generative Adversarial Networks. CoRR, in RL with the field of Game Theory. abs/1703.10717. One possibly interesting application of our method is with Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. 
Minimalis- regards to the Sim2Real domain, where one wishes to opti- tic Gridworld Environment for OpenAI Gym. https://github.com/ mize to train an RL agent on a high-quality simulation en- maximecb/gym-minigrid. vironment before real-life deployment. Since this domain is Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk- naturally concerned with risk-sensitivity and supposes ac- Sensitive and Robust Decision-Making: a CVaR Optimization Ap- cess to a simulator to access and perturbate the true next proach. In NIPS, 1522–1530. state transitions, it appears highly likely that our proposed Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018. Im- approach can be of use in this context. plicit quantile networks for distributional reinforcement learn- In future work, we would like to apply our game to a con- ing. In International conference on machine learning, 1096–1105. crete Sim2Real application to try and leverage the empirical PMLR. Dennis, M.; Jaques, N.; Vinitsky, E.; Bayen, A. M.; Russell, Pan, X.; Seita, D.; Gao, Y.; and Canny, J. 2019. Risk averse robust S.; Critch, A.; and Levine, S. 2020. Emergent Complexity and adversarial reinforcement learning. In 2019 International Confer- Zero-shot Transfer via Unsupervised Environment Design. In ence on Robotics and Automation (ICRA), 8522–8528. IEEE. Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., Patek, S. D. 1997. Stochastic and shortest path games: theory and eds., Advances in Neural Information Processing Systems 33: An- algorithms. Ph.D. thesis, Massachusetts Institute of Technology. nual Conference on Neural Information Processing Systems 2020, Perolat, J.; Scherrer, B.; Piot, B.; and Pietquin, O. 2015. Ap- NeurIPS 2020, December 6-12, 2020, virtual. proximate dynamic programming for two-player zero-sum markov Fei, Y.; Yang, Z.; Chen, Y.; Wang, Z.; and Xie, Q. 2020. Risk- games. In International Conference on Machine Learning, 1321– Sensitive Reinforcement Learning: Near-Optimal Risk-Sample 1329. PMLR. Tradeoff in Regret. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Petersen, P.; and Voigtländer, F. 2018. Optimal approximation Balcan, M.; and Lin, H., eds., Advances in Neural Information of piecewise smooth functions using deep ReLU neural networks. Processing Systems 33: Annual Conference on Neural Information Neural Networks, 108: 296–330. Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust virtual. adversarial reinforcement learning. In International Conference on Fiez, T.; Chasnov, B.; and Ratliff, L. 2020. Implicit Learning Dy- Machine Learning, 2817–2826. PMLR. namics in Stackelberg Games: Equilibria Characterization, Con- Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy gradient in vergence Analysis, and Empirical Study. In III, H. D.; and Singh, lipschitz markov decision processes. Machine Learning, 100(2): A., eds., Proceedings of the 37th International Conference on Ma- 255–283. chine Learning, volume 119 of Proceedings of Machine Learning Rajeswaran, A.; Mordatch, I.; and Kumar, V. 2020. A Game Theo- Research, 3133–3144. PMLR. retic Framework for Model Based Reinforcement Learning. In III, Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde- H. D.; and Singh, A., eds., Proceedings of the 37th International Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Gener- Conference on Machine Learning, volume 119 of Proceedings of ative adversarial networks. 
Communications of the ACM, 63(11): Machine Learning Research, 7953–7963. PMLR. 139–144. Schubert, F.; Eimer, T.; Rosenhahn, B.; and Lindauer, M. 2021. Au- Gosavi, A. A.; Das, S. K.; and Murray, S. L. 2014. Beyond ex- tomatic Risk Adaptation in Distributional Reinforcement Learning. ponential utility functions: A variance-adjusted approach for risk- CoRR, abs/2106.06317. averse reinforcement learning. In 2014 IEEE Symposium on Adap- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, tive Dynamic Programming and Reinforcement Learning (AD- O. 2017. Proximal Policy Optimization Algorithms. CoRR, PRL), 1–8. IEEE. abs/1707.06347. Gottesman, O.; Johansson, F.; Komorowski, M.; Faisal, A.; Sontag, Shapiro, A.; Dentcheva, D.; and Ruszczyński, A. 2014. Lectures D.; Doshi-Velez, F.; and Celi, L. A. 2019. Guidelines for reinforce- on stochastic programming: modeling and theory. SIAM. ment learning in healthcare. Nature medicine, 25(1): 16–18. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Hochreiter, S. 2017. Gans trained by a two time-scale update rule 2018. A general reinforcement learning algorithm that masters converge to a local nash equilibrium. Advances in neural informa- chess, shogi, and Go through self-play. Science, 362(6419): 1140– tion processing systems, 30. 1144. Sion, M. 1958. On general minimax theorems. Pacific Journal of Keramati, R.; Dann, C.; Tamkin, A.; and Brunskill, E. 2020. Be- mathematics, 8(1): 171–176. ing optimistic to be conservative: Quickly learning a cvar policy. In Proceedings of the AAAI Conference on Artificial Intelligence, Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to reinforce- volume 34, 4436–4443. ment learning, volume 135. MIT press Cambridge. Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2017. Adversarial Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2015. Pol- Machine Learning at Scale. In 5th International Conference on icy Gradient for Coherent Risk Measures. In Cortes, C.; Lawrence, Learning Representations, ICLR 2017, Toulon, France, April 24- N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances 26, 2017, Conference Track Proceedings. OpenReview.net. in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A 2015, Montreal, Quebec, Canada, 1468–1476. Contextual-Bandit Approach to Personalized News Article Recom- Tamar, A.; and Mannor, S. 2013. Variance Adjusted Actor Critic mendation. In Proceedings of the 19th International Conference on Algorithms. CoRR, abs/1310.3697. World Wide Web, WWW ’10, 661–670. New York, NY, USA: As- sociation for Computing Machinery. ISBN 9781605587998. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; and Abbeel, P. 2017. Domain randomization for transferring deep neu- Mandlekar, A.; Zhu, Y.; Garg, A.; Fei-Fei, L.; and Savarese, S. ral networks from simulation to the real world. In 2017 IEEE/RSJ 2017. Adversarially robust policy learning: Active construction international conference on intelligent robots and systems (IROS), of physically-plausible perturbations. In 2017 IEEE/RSJ Interna- 23–30. IEEE. tional Conference on Intelligent Robots and Systems (IROS), 3932– Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; 3939. IEEE. Dudzik, A.; Chung, J.; Choi, D. 
H.; Powell, R.; Ewalds, T.; Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using Generative Adversarial Networks. In 5th International Conference multi-agent reinforcement learning. Nature, 575(7782): 350–354. on Learning Representations, ICLR 2017, Toulon, France, April Von Stackelberg, H. 2010. Market structure and equilibrium. 24-26, 2017, Conference Track Proceedings. OpenReview.net. Springer Science & Business Media. Mihatsch, O.; and Neuneier, R. 2002. Risk-sensitive reinforcement Zhang, J.; and Weng, P. 2021. Safe Distributional Reinforcement learning. Machine learning, 49(2): 267–290. Learning. CoRR, abs/2102.13446.