      Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents

                                      Jared Markowitz, Marie Chau, I-Jeng Wang
                                     Johns Hopkins University Applied Physics Laboratory
                                                 11000 Johns Hopkins Road
                                                  Laurel, Maryland 20723
                         Jared.Markowitz@jhuapl.edu, Marie.Chau@jhuapl.edu, I-Jeng.Wang@jhuapl.edu


Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Current deep reinforcement learning (DRL) methods fail to address risk in an intelligent manner, potentially leading to unsafe behaviors when deployed. One strategy for improving agent risk management is to mimic human behavior. While imperfect, human risk processing displays two key benefits absent from standard artificial agents: accounting for rare but consequential events and incorporating context. The former ability may prevent catastrophic outcomes in unfamiliar settings while the latter results in asymmetric processing of potential gains and losses. These two attributes have been quantified by behavioral economists and form the basis of cumulative prospect theory (CPT), a leading model of human decision-making. We introduce a two-step method for training DRL agents to maximize the CPT-value of full-episode rewards accumulated from an environment, rather than the standard practice of maximizing expected discounted rewards. We quantitatively compare the distribution of outcomes when optimizing full-episode expected reward, CPT-value, and conditional value-at-risk (CVaR) in the CrowdSim robot navigation environment, elucidating the impacts of different objectives on the agent's willingness to trade safety for speed. We find that properly-configured maximization of CPT-value allows for a reduction of the frequency of negative outcomes with only a slight degradation of the best outcomes, compared to maximization of expected reward.

1 Introduction

Increasingly impressive demonstrations of machine learning raise increasingly critical questions about its robustness and safety in real-world environments. To inspire trust from humans, machines must be capable of reasoning, acting, and generalizing in alignment with human preferences. In particular, they must be able to adeptly handle rare but consequential scenarios and should incorporate context in their decision-making.

Reinforcement learning (RL) methods that consider edge behaviors are often referred to as risk-sensitive and have become increasingly well-studied as the prospects for real-world deployment of RL have grown. Potential target applications include autonomous vehicles (e.g., cars and planes), autonomous schedulers (e.g., power grids), and financial portfolio management. It is critical that RL-based systems adhere to strict safety standards, particularly when the potential for injury or damage exists. They must be able to integrate seamlessly with humans, anticipating and adjusting to the actions of operators and bystanders beyond the situations in which they were trained. Few present-day systems meet this rigorous standard.

Human preferences in decision-making problems have been widely studied in finance and economics, but have only recently begun to be addressed in earnest by the machine learning community (Jie et al. 2018). As humans have a natural ability to emphasize rare events and incorporate context when assessing risk, having artificial agents mimic human decision-making behaviors could be beneficial. On the other hand, given the known shortcomings of human decision-making (Kahneman and Tversky 1979; Tversky and Kahneman 1992), human imitation could represent a mere stepping stone to more adept risk-handling strategies.

Classical reinforcement learning finds policies for sequential decision-making problems by optimizing expected future rewards. This criterion alone fails to adequately emphasize edge behaviors, rendering it unsuitable for many practical applications. However, numerous approaches exist for addressing edge behaviors through either explicit or implicit means. One widely-used explicit technique is to artificially increase the frequency of known problematic edge cases during training; however, this can lead to performance degradation on more frequently observed scenarios. Another explicit strategy is to apply a risk-sensitive measure during training. Example risk-sensitive measures include exponential utility (Pratt 1964), percentile performance criteria (Wu and Lin 1999), value-at-risk (Leavens 1945), conditional value-at-risk (Rockafellar and Uryasev 2000), prospect theory (Kahneman and Tversky 1979), and cumulative prospect theory (CPT; Tversky and Kahneman (1992)). An implicit strategy incorporates a notion of risk by considering the full value distribution as opposed to the expected value (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a).

In this paper, we extend CPT-RL (Prashanth et al. 2016), which explicitly incorporates cumulative prospect theory into the perceived reward of an RL agent. Our method allows for the application of CPT-RL to agents with deep policy networks, unlocking CPT-based control of more complex problem spaces. More precisely, we demonstrate a methodology for modifying agents trained by conventional deep reinforcement learning (DRL) algorithms to maximize CPT-value instead of expected rewards. We evaluate our method on an enhanced version of the CrowdSim robot navigation environment (Chen et al. 2019). Our results show that the incorporation of CPT allows for fewer negative outcomes with only a slight degradation of the best outcomes. In this case, the CPT-based agent accepts a slight reduction in speed in order to facilitate surer progress.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of common explicit and implicit risk-sensitive methods. In Section 3, we introduce cumulative prospect theory in the context of reinforcement learning, including a CPT-value estimation technique and an efficient high-dimensional stochastic approximation method. In Section 4, we present Deep CPT-RL, an extension of CPT-RL to algorithms that use deep policy networks. In Section 5, we provide computational results to illustrate the benefits of Deep CPT-RL when applied to robot navigation. In Section 6, we provide concluding remarks and directions for future research.

2 Related Work

Both explicit and implicit means for introducing risk sensitivity to artificial agents have been previously investigated. On the explicit side, methods often incorporate risk by distorting the reward function. For instance, exponential utility transforms the expected discounted reward E[Z] to (1/β) log E[exp(βZ)], where β determines the level of risk.¹ Another class of risk-sensitive measures focuses on rare occurrences not captured in the standard expected utility, in order to mitigate possible detrimental outcomes. Value-at-risk at confidence level α (VaRα) is an α-quantile that focuses on the tail distribution, VaRα(Z) = max{z | P(Z < z) ≤ α}, where α sets risk. Conditional value-at-risk at confidence level α (CVaRα) considers the expectation of the tail below the α-quantile point VaRα.

¹This is apparent after applying a straightforward Taylor expansion; risk-averse and risk-seeking behavior translate to β < 0 and β > 0, respectively.
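For concreteness, both tail measures can be estimated from a batch of sampled full-episode rewards. The sketch below is illustrative only; it is not taken from this paper, and quantile conventions for VaR vary across the literature.

```python
import numpy as np

def empirical_var_cvar(rewards, alpha=0.25):
    """Empirical VaR_alpha and CVaR_alpha of a sample of full-episode rewards.

    VaR_alpha is approximated by the alpha-quantile of the sample; CVaR_alpha
    is the mean of the rewards at or below that quantile (the lower tail).
    """
    rewards = np.asarray(rewards, dtype=float)
    var = np.quantile(rewards, alpha)
    tail = rewards[rewards <= var]
    cvar = tail.mean() if tail.size > 0 else var
    return var, cvar

# Example with a synthetic batch of 5000 hypothetical episode rewards
rng = np.random.default_rng(0)
print(empirical_var_cvar(rng.normal(0.4, 0.2, size=5000), alpha=0.25))
```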
On the implicit side, distributional reinforcement learning (DistRL) approaches risk from a different perspective. Instead of distorting the utility function and optimizing the expectation, DistRL models the value distribution in the optimization process and selects the policy based on its mean. Bellemare, Dabney, and Munos (2017) introduced the first DistRL algorithm, C51, and showed that it outperforms standard Deep Q-Networks (DQN; Mnih et al. (2015)) in some environments. C51 estimates probabilities on a fixed set of uniformly-spaced possible returns, requiring bounds that may not exist in practice. Quantile regression DQN (QR-DQN) overcomes this limitation by estimating the inverse cumulative distribution function (CDF) of equally-spaced quantiles (Dabney et al. 2018b). Implicit quantile networks further improve on QR-DQN by estimating the inverse CDF of random quantiles (Dabney et al. 2018a). Another recent method, Distributional Soft Actor-Critic (Ma et al. 2020), enables application of DistRL to continuous spaces.

Another approach for addressing edge cases is to look to human decision-making for inspiration. Human decision makers preferentially consider rare events and perform admirably on many tasks that RL agents are trained to tackle. The goal in many applications is to maximize agent alignment with human preferences, which may also point to approaches that attempt to mimic human decision-making. One method for incorporating human tendencies into agent behavior is to have the agent maximize the CPT-value function instead of expected future reward (CPT-RL; Prashanth et al. (2016)). CPT, a leading model of human decision-making, is a refined variant of prospect theory that includes a generalization of expected utility and is supported by substantial empirical evidence.

In this paper, we extend CPT-RL to enable application to deep neural network (DNN) policies. Because the gradient-free methods used in CPT-RL do not scale to training DNNs from scratch, we perform initial training to maximize expected reward and update a subset of the resulting optimal weights to maximize the value of the CPT function. Our approach resembles transfer learning at first glance, as the policy is retrained according to a related objective function. However, in our case, a subset of the optimal weights are retained and a subset are re-initialized and re-trained.

3 Background

3.1 Cumulative Prospect Theory

Cumulative prospect theory uses two components to model human behavior: a utility function that quantifies human attitudes toward outcomes and a weight function that quantifies the emphasis placed on different outcomes (Figure 1). The utility function u = (u^+, u^-), where u^± : R → R^+, u^+(x) = 0 for x ≤ 0, and u^-(x) = 0 for x > 0, has two regions separated by a reference point, which we consider as zero for illustrative purposes. The region to the left of the reference point specifies the utility of losses, while the region to its right specifies the utilities of gains. Humans have empirically been found to be more risk-averse with gains than losses (Tversky and Kahneman 1992), leading to the asymmetric concavities on the two sides of the reference point. The reference point may be static or dynamic; in the case of humans it may change with accumulated experience. In our experiments we consider dynamic reference points for reasons that will be discussed in Section 5.

The probability weight function w = (w^+, w^-), where w^± : [0, 1] → [0, 1], models human consideration of events based on frequency, highlighting both positive and negative rare events relative to more common events (Figure 1(b)). Tversky and Kahneman (1992) recommend the weight function w^±(p) = w(p) = p^η / (p^η + (1 - p)^η)^{1/η}, while Prelec (1998) suggests w^±(p) = w(p) = exp(-(-ln p)^η), where η ∈ (0, 1).

[Figure 1 appears here: (a) an example CPT utility function plotted over losses and gains, showing u^+ and -u^-; (b) an example CPT weight function plotting weight against probability p, showing the identity, w^+(p), and w^-(p).]

Figure 1: (a) Example CPT utility function. Gains and losses are treated asymmetrically, with behavior towards gains being more risk-averse. (b) Example CPT weight function. Rare outcomes on either side are emphasized, as determined by the slope of w^+(p) (and similarly w^-(p)).

The CPT-value function, defined as

    C_{u,w}(X) = \int_0^{+\infty} w^+(P(u^+(X) > z)) dz - \int_0^{+\infty} w^-(P(u^-(X) > z)) dz,    (1)

satisfies certain requirements. We consider u and w to be fixed; therefore, we simplify notation henceforth: C_{u,w} → C.

The CPT function (1) is a generalization of expected value. In particular, if we consider w^±(x) = x and u^+(x) = x for x ≥ 0, u^-(x) = -x for x < 0, then

    C(X) = \int_0^{+\infty} P(X > z) dz - \int_0^{+\infty} P(-X > z) dz
         = \int_0^{+\infty} P(\max(X, 0) > z) dz - \int_0^{+\infty} P(\max(-X, 0) > z) dz
         = E[\max(X, 0)] - E[\max(-X, 0)],

where u^+ and u^- are utility functions corresponding to gains (X > 0) and losses (X < 0) with reference point X = 0.
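The text above fixes the functional forms of w^± but not of u^±. The following sketch pairs the quoted Tversky-Kahneman and Prelec weight functions with the standard power-law utilities from the behavioral-economics literature; the parameter values shown (σ, λ, η) are commonly published fits, not values prescribed by this paper.

```python
import numpy as np

def u_plus(x, sigma=0.88):
    """Gain-side utility: zero for x <= 0, concave power law for gains."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, np.abs(x) ** sigma, 0.0)

def u_minus(x, sigma=0.88, lam=2.25):
    """Loss-side utility: zero for x > 0, loss-aversion-scaled power law otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 0.0, lam * np.abs(x) ** sigma)

def w_tk(p, eta=0.61):
    """Tversky-Kahneman weight: p^eta / (p^eta + (1 - p)^eta)^(1/eta)."""
    p = np.asarray(p, dtype=float)
    return p ** eta / (p ** eta + (1.0 - p) ** eta) ** (1.0 / eta)

def w_prelec(p, eta=0.65):
    """Prelec weight: exp(-(-ln p)^eta); p is clipped away from 0 so w(p) -> 0 as p -> 0."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return np.exp(-(-np.log(p)) ** eta)
```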
3.2 CPT-Value Estimation

Prashanth et al. (2016) proposed a provably convergent CPT-value estimate of (1), inspired by quantiles. Their approach is summarized in Algorithm 1. The random reward X is based on a policy π_θ, where θ is a set of controllable parameter values chosen to optimize the CPT function (1). Reward X_i is generated from episode i, where an episode is a trajectory from initial to terminal state guided by actions. Episodes are assumed to be both noisy and expensive to generate. Note that in addition to selecting appropriate utility and weight functions, this approach requires users to choose a suitable sample size or number of episodes m to produce a reasonable estimate.

Algorithm 1: CPT-value Estimation (Prashanth et al. 2016)

1. Generate X_1, ..., X_m i.i.d. from the distribution X.

2. Let

    C_m^+(X) = \sum_{i=1}^{m} u^+(X_{[i]}) \left[ w^+\left(\frac{m+1-i}{m}\right) - w^+\left(\frac{m-i}{m}\right) \right],

    C_m^-(X) = \sum_{i=1}^{m} u^-(X_{[i]}) \left[ w^-\left(\frac{i}{m}\right) - w^-\left(\frac{i-1}{m}\right) \right],

   where X_{[i]} is the ith order statistic.

3. Return C(X) = C_m^+(X) - C_m^-(X).
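A direct transcription of Algorithm 1 in NumPy is shown below (a sketch; the utility and weight callables are assumed to follow the conventions of Section 3.1, such as the functions sketched there, and rewards are assumed to already be measured relative to the reference point).

```python
import numpy as np

def cpt_value_estimate(x, u_plus, u_minus, w_plus, w_minus):
    """Quantile-based CPT-value estimate of Algorithm 1.

    x: array of m i.i.d. full-episode rewards X_1, ..., X_m (relative to the
       reference point).
    Returns C_m^+(X) - C_m^-(X).
    """
    x = np.sort(np.asarray(x, dtype=float))  # order statistics X_[1] <= ... <= X_[m]
    m = x.size
    i = np.arange(1, m + 1)

    # C_m^+ = sum_i u^+(X_[i]) [ w^+((m+1-i)/m) - w^+((m-i)/m) ]
    c_plus = np.sum(u_plus(x) * (w_plus((m + 1 - i) / m) - w_plus((m - i) / m)))

    # C_m^- = sum_i u^-(X_[i]) [ w^-(i/m) - w^-((i-1)/m) ]
    c_minus = np.sum(u_minus(x) * (w_minus(i / m) - w_minus((i - 1) / m)))

    return c_plus - c_minus
```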
3.3 Gradient-Free Stochastic Approximation

Consider the stochastic optimization problem

    \max_{\theta \in \Theta} C(X(\theta)),

where C(·) is the CPT-value function (1), Θ is a compact, convex subset of R^d, X(·) is a random reward, and θ = (θ_1, ..., θ_d) is a d-dimensional weight parameter. Stochastic approximation updates the weights θ iteratively using the recursion

    \theta_{n+1} = \Gamma(\theta_n + \gamma_n \hat{g}(X(\theta_n))),

where θ_0 is chosen randomly, Γ(θ) is a projection operator that ensures θ ∈ Θ, γ_n is a diagonal matrix with step sizes for each dimension on the diagonal, and ĝ is an estimate of the true gradient ∇C (Robbins and Monro 1951; Chau and Fu 2015; Chau et al. 2014). Direct (unbiased) gradients are unavailable for CPT-values; however, with the availability of CPT-value estimates, indirect (biased) gradient estimates can be computed. Applicable indirect methods include finite differences (Kiefer and Wolfowitz 1952) and simultaneous perturbation stochastic approximation (SPSA) (Spall 1992).

In this paper, we employ SPSA for its computational efficiency. The kth component of the SPSA gradient is defined as

    g_k^{SPSA}(\theta) = \frac{C(\theta + \delta\Delta) - C(\theta - \delta\Delta)}{2\delta\Delta_k}    (2)

for k = 1, ..., d, where δ > 0 and Δ = (Δ_1, ..., Δ_d) has random i.i.d. components with zero mean and finite inverse moments. Each CPT-value estimate is generated from sample rewards based on m episodes, i.e., X_i for i = 1, ..., m. Note that the number of episodes m can vary from one iteration to the next.
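A minimal SPSA estimator following (2) is sketched below. The callable cpt_of stands in for an environment-specific routine that rolls out m episodes under the policy parameterized by θ and applies Algorithm 1; it is an assumption of the sketch, not an interface defined in the paper.

```python
import numpy as np

def spsa_gradient(theta, cpt_of, delta=0.05, rng=None):
    """One SPSA estimate of grad C(theta), per Eq. (2)."""
    rng = np.random.default_rng() if rng is None else rng
    # Rademacher perturbation: +/-1 with probability 0.5 (zero mean, finite inverse moments)
    perturb = rng.choice([-1.0, 1.0], size=theta.size)
    c_plus = cpt_of(theta + delta * perturb)
    c_minus = cpt_of(theta - delta * perturb)
    # g_k = (C(theta + delta*Delta) - C(theta - delta*Delta)) / (2 * delta * Delta_k)
    return (c_plus - c_minus) / (2.0 * delta * perturb)
```

Because the objective is being maximized, the resulting estimate is used in an ascent direction.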
4 Deep CPT-RL

The CPT-RL approach described in Section 3 was applied by Prashanth et al. (2016) to successfully optimize a Boltzmann policy and thereby control a relatively simple traffic management scenario. However, CPT-RL does not directly scale to policy mappings with high-dimensional parameter spaces, including deep neural networks. This is because SPSA produces gradient estimates with lower fidelity and higher variance than backpropagation.

To address these issues and thereby extend CPT-RL to deep policy networks, we employ a two-stage approach. First, the deep policy network of an agent is trained using a conventional actor-critic method. The actor-critic formulation was chosen in accordance with the need to learn a policy and the desire to reduce the variance of policy gradient estimates. The lower, input-side layers of the network are frozen and the upper, output-side layers (one or more) are retrained to maximize the CPT-value using a procedure similar to CPT-RL. Thus the first stage learns a feature representation of the observation space and the second stage learns a policy under the learned representation. By limiting the second stage updates to the upper layer(s) of the network, we overcome the limitations of SPSA in high-dimensional settings.

More explicitly, Stage 1 aims to find an optimal policy π_θ* for choosing actions a_t that maximize expected return. The optimal policy parameters θ* are given by

    \theta^* = \arg\max_{\theta \in \Theta} E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right],

where the τ are trajectories drawn from the distribution p_θ(τ) of possible trajectories an agent using policy π_θ may take in the environment, Θ is the feasible policy parameter region, t indicates time, and r(s_t, a_t) is the stochastic reward granted by the environment when action a_t is taken in state s_t. Stage 2 aims to find a policy that maximizes the CPT-value function (1), or in principle any risk-sensitive measure. Unfortunately, unlike the standard expectation of rewards, the CPT-functional is based on cumulative probabilities and is piecewise, precluding direct, sample-based differentiation. No nested structure is assumed, precluding the use of bootstrapping methods. Hence, we generate indirect (biased) gradients from the CPT-value estimates.

In principle, any policy-based DRL algorithm may be applied in the first stage of our procedure. We applied Proximal Policy Optimization (PPO; Schulman et al. (2017)), a leading actor-critic approach, to train our agents to convergence. As previously stated, the lower layers of the network learned in Stage 1 provide a feature representation for the upper layer(s). Therefore, PPO can be thought of as learning a "perception" mapping suitable for the ensuing "action" layers learned by the second stage. In our experiments, we found it sufficient to re-initialize the parameters and tune only the last layer in Stage 2. Further flexibility may be gained through the re-optimization of the last two layers.

The second stage is feasible because 1) the lower layers of a properly-trained initial network tease out the features necessary to produce intelligent agent behavior and 2) the action layer(s) comprise a small fraction of the parameter count of the entire network. We used SPSA to estimate the gradient because of its computational efficiency; in particular, finite differences was not considered due to the sheer number of parameters involved. As in the original CPT-RL formulation, all information required to compute the gradients was derived from batches of episodes. To increase stability, we averaged over gradient estimates generated from multiple perturbations before conducting a network update via the Adam optimizer (Kingma and Ba 2015).

One notable extension to the work by Prashanth et al. (2016) is our consideration of multiple reward terms. While other approaches are possible, we employed the simple strategy of choosing a single reference point based on all terms. We allowed this reference point to vary based on the outcome of a given episode, as required by our experimental setup discussed below. One could alternatively compute a CPT-value for each reward term where it makes sense to weigh risk, using expectations for the other terms, but we found this to provide similar performance with additional complexity.

Our approach is outlined in Algorithm 2. For notational simplicity, in the neural network, let θ_{-1} ∈ Θ_{-1} and θ_{:-1} ∈ Θ_{:-1} denote the weights in the last layer and all but the last layer, respectively. The weight parameters θ_{:-1} are fixed in the second stage, so we drop them from the input, X(θ_{:-1} ∪ θ_{-1}) → X(θ_{-1}).

Algorithm 2: Deep CPT-RL

Stage 1.

1. Identify deep policy network architecture with weight parameters θ.

2. Train deep policy network to obtain optimal weight parameters θ* that maximize expected return (e.g., using PPO).

Stage 2.

3. Initialize stopping time N, gradient estimates per update M, SPSA parameters {δ_n}, Adam parameters α > 0, β ∈ (0, 1), ε > 0.

4. Fix θ̃_{:-1} = θ*_{:-1}, re-initialize θ̃_{-1} s.t. θ̃_{-1;i} ∼ Uniform(-|θ_{-1}|^{-1/2}, |θ_{-1}|^{-1/2}) for i = 0, ..., |θ̃_{-1}| - 1, where θ̃_{-1} = (θ̃_{-1;1}, ..., θ̃_{-1;|θ̃_{-1}|}).

5. For i = 0, ..., N - 1:
   • For j = 0, ..., M - 1:
     – Generate Δ = {Δ_k}_{k=0}^{|θ̃_{-1}|-1}, where Δ_k = ±1 with probability 0.5, independent for all k.
     – Set θ̃_{-1}^± = θ̃_{-1} ± δ_i Δ and generate X_j(θ̃_{-1}^±).
     – Generate C(X_j(θ̃_{-1}^+)) and C(X_j(θ̃_{-1}^-)).
     – Compute g_j^{SPSA}(θ̃_{-1}) using (2).
   • Average {g_j^{SPSA}}_{j=0}^{M-1} and use the result for the Adam update G_i.
   • Update θ̃_{-1} ← θ̃_{-1} + G_i.

6. Return θ̃^{CPT} = θ̃_{:-1} ∪ θ̃_{-1}.
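For illustration, the top-layer retraining of Stage 2 could be realized as in the sketch below. This is written under assumed interfaces: the rollout-and-estimate routine cpt_of and all hyperparameter values are placeholders rather than the configuration used in the paper, and Adam is written out explicitly so the update corresponds to step 5.

```python
import numpy as np

def retrain_last_layer(theta_init, cpt_of, n_updates=1500, m_grads=8, delta=0.05,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Sketch of Stage 2 of Algorithm 2 under assumed interfaces.

    theta_init: flattened, re-initialized last-layer weights; the frozen lower
        layers are assumed to live inside cpt_of.
    cpt_of(theta): assumed callable that rolls out a batch of episodes with the
        given last-layer weights and returns a CPT-value estimate (Algorithm 1).
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta_init, dtype=float)
    m1 = np.zeros_like(theta)  # Adam first moment
    m2 = np.zeros_like(theta)  # Adam second moment

    for i in range(1, n_updates + 1):
        grads = []
        for _ in range(m_grads):
            # Rademacher perturbation: each component is +/-1 with probability 0.5
            perturb = rng.choice([-1.0, 1.0], size=theta.size)
            c_plus = cpt_of(theta + delta * perturb)
            c_minus = cpt_of(theta - delta * perturb)
            grads.append((c_plus - c_minus) / (2.0 * delta * perturb))  # Eq. (2)
        g = np.mean(grads, axis=0)  # average M SPSA estimates for stability

        # Adam step in the ascent direction (the CPT-value is being maximized)
        m1 = beta1 * m1 + (1.0 - beta1) * g
        m2 = beta2 * m2 + (1.0 - beta2) * g * g
        theta += lr * (m1 / (1.0 - beta1 ** i)) / (np.sqrt(m2 / (1.0 - beta2 ** i)) + eps)

    return theta
```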
[Figure 2 appears here: Stage 1 learning curve, plotting reward against training frames.]

Figure 2: Stage 1 training of agent using PPO.

5 Numerical Experiments

To quantify the impact of optimizing the CPT-value, we applied our two-step procedure to an enhanced version of the CrowdSim robotic navigation environment (Chen et al. 2019). CrowdSim allows for explicit evaluation of an agent's risk-taking behavior in the form of its willingness to trade safety for speed.

5.1 CrowdSim Environment

CrowdSim was developed to study social mobility; that is, to train robotic agents to avoid collisions with pedestrians while moving toward a goal. In the default configuration, pedestrians follow the Optimal Reciprocal Collision Avoidance (ORCA; van den Berg et al. (2011)) protocol to avoid other pedestrians while traversing to their own goals. An episode concludes when the goal is reached, the robot collides with a pedestrian, or the system times out. We took the robot to be invisible to the pedestrians in order to provide a more challenging test case. To facilitate our experiments, we made a few changes to the published environment. These changes were broadly designed to make the scenario more realistic, facilitate efficient learning, and provide rotational invariance. We closed the simulation, enforcing elastic collisions when an agent reaches an outer wall. We kept pedestrians in constant motion, assigning them a new goal destination when they reached a target. While the original version fixed the initial position of the robot and its goal, we randomly placed them opposite each other on a circle centered at the origin.

In addition to environmental modifications, we adjusted the observations and rewards received by the agents. We replaced the feature-based representation with grayscale, pixel-based observations. This allowed us to encode the geometry of the system more naturally, enabling the learning of rotation-invariant navigation policies that were challenging to produce with the original, feature-based representation. We modified the published reward function to 1) allow removal of the initial imitation learning step used in Chen et al. (2019) and 2) explicitly model a speed-safety tradeoff. Removal of the imitation learning was desired to allow quantification of CPT-based shaping of an RL agent trained tabula rasa, as well as to prevent our training from requiring more than two phases. It was enabled through the use of a progress term that encouraged movement in the direction of the goal. The speed-safety tradeoff came from a small constant penalty being assessed at each step, encouraging the agent to get to the goal quickly. In sum, the reward function was formulated as

    r_t = -C_time + C_progress (d_{t-1} - d_t),    (3)

where d_t is the distance between the robot and the goal at time t (i.e., d_{t-1} - d_t is the "progress" made in the timestep), C_time > 0 is the time penalty, and C_progress > 0 is the progress reward. We set C_time = 0.02 and C_progress = 0.1, with the total distance from the agent's starting position to the target being 10. Our choice of C_progress provided a maximal progress contribution of 1 per episode; C_time was chosen to provide a meaningful fraction of that value for episodes of standard duration. An episode ends when one of three things happens: the robot reaches the goal, the robot collides with a pedestrian, or a timeout occurs. Hence, even though there is no explicit collision penalty, the agent is incentivized to avoid collisions, because a collision precludes the possibility of accumulating more progress reward.
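As a concrete check of (3) with the constants above (a small illustrative snippet, not code from the paper):

```python
C_TIME = 0.02      # per-step time penalty (value used in the paper)
C_PROGRESS = 0.1   # progress reward coefficient (value used in the paper)

def step_reward(d_prev, d_curr):
    """Per-step reward of Eq. (3): r_t = -C_time + C_progress * (d_{t-1} - d_t)."""
    return -C_TIME + C_PROGRESS * (d_prev - d_curr)

# Example: moving 0.25 m closer to the goal in one step
print(round(step_reward(5.00, 4.75), 6))  # -0.02 + 0.1 * 0.25 = 0.005
```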
Our agent was configured to choose from 33 different motions: remaining at rest, or one of 4 different speeds (0.25 m/s, 0.5 m/s, 0.75 m/s, 1.0 m/s) at each of 8 evenly-spaced angles in [0, 2π). Both the robot and the pedestrians were configured to move at a preferred speed of v_pref = 1.0 m/s. We considered 10 pedestrians in each training episode, allowing that number to vary in testing to evaluate "out-of-distribution" performance. This configuration was chosen with the goal of exploring challenging regimes (where the agent would not always be able to avoid collisions) in order to generate meaningful comparisons amongst methods.

[Figure 3 appears here: Stage 2 learning curves, plotting reward against top-layer updates for the CVaR (25%), AVG, and CPT objectives.]

Figure 3: Stage 2 training of agent using CVaR, AVG, and CPT-based objectives.

[Figure 4 appears here: probability densities of full-episode rewards for the CVaR (25%), Average, and CPT-RL agents; panel (a) shows 10-pedestrian scenarios and panel (b) shows 11-15 pedestrian scenarios.]

Figure 4: (a) Histogram of full-episode rewards for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode rewards for agents in test scenarios with 11-15 pedestrians.

                  Rewards, 10 Pedestrians                         Rewards, 11-15 Pedestrians
Method    Mean            Median    0.01-quantile    Mean            Median    0.01-quantile
CVaR      0.262 ± 0.002   0.260     −0.029           0.239 ± 0.002   0.232     −0.020
AVG       0.418 ± 0.003   0.414     −0.172           0.358 ± 0.003   0.320     −0.066
CPT       0.432 ± 0.003   0.461     −0.085           0.375 ± 0.003   0.360     −0.123

Table 1: Reward distribution statistics for testing of Stage 2 agents. Reported error bars are standard error.

5.2 Network Architecture

In our experiments, we used a convolutional neural network that closely resembles the architecture commonly used for Atari (Mnih et al. 2015). However, our input images were 50% larger in each direction than those used by Mnih et al. (2015) to ensure proper representation of agent and goal edges. To encode motion, 4 frames were stacked. The input to the network thus consisted of 126 × 126 × 4 image arrays, which were passed to 3 consecutive convolutional layers. These layers had 32, 64, and 64 channels (input-side to output-side), kernel sizes of 12, 4, and 3, and stride lengths of 6, 2, and 1. The layers were separated by rectified linear unit (ReLU) nonlinearities. The output of the last convolutional layer was passed to a rectified, 512-dimensional fully-connected layer. The resulting output was used as the input
to a policy head of dimension 33 and a scalar value head.                                           contribution from time varies with episode progress:
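A PyTorch sketch of this architecture is shown below; the layer sizes follow the text above, while other details (initialization, output handling) are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class NavPolicyNet(nn.Module):
    """Sketch of the actor-critic network described in Section 5.2."""

    def __init__(self, n_actions: int = 33):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=12, stride=6), nn.ReLU(),  # 126 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9 -> 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)  # logits over the 33 motions
        self.value_head = nn.Linear(512, 1)           # scalar state value

    def forward(self, obs):
        # obs: (batch, 4, 126, 126) stacked grayscale frames
        h = self.features(obs)
        return self.policy_head(h), self.value_head(h)

# Shape check with a dummy batch
logits, value = NavPolicyNet()(torch.zeros(2, 4, 126, 126))
print(logits.shape, value.shape)  # torch.Size([2, 33]) torch.Size([2, 1])
```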

5.3 Stage 1 Training

The neural networks governing our navigation agents were first trained to convergence using PPO. Hyperparameters were chosen to mimic those used for Atari in Schulman et al. (2017), except with shorter windows (16 steps) and more windows per batch (64).

5.4 Stage 2 Training

After Stage 1 training, we trained our agent to maximize the CPT-value under Algorithm 2. For comparison purposes, we also used the same procedure to optimize for full-episode average reward (AVG) and full-episode conditional value-at-risk (CVaR). CVaR was included to evaluate a standard risk-sensitive measure, focused on improving the worst agent outcomes. While the standard practice is to consider the bottom 5% of the distribution for CVaR, we used the bottom 25% in an effort to maximize performance over a broader range of the distribution. We chose hyperparameters for the SPSA gradient estimation and the Adam optimizer in line with standard guidance (Spall 1992; Kingma and Ba 2015).

One critical detail of CPT-based training is the selection of the reference point. Our variable reference point consists of two terms, corresponding to the two terms in the reward function (3). We chose a fixed contribution of p_ref = 0.5 from the progress term, corresponding to the agent making it halfway from the start to the goal before a collision. The contribution from time varies with episode progress:

    x_ref = p_ref + \frac{p(T)}{p_max} t_{ref,max},    (4)

where p(T)/p_max is the fractional progress toward the goal made by the agent over the whole episode ending at T and t_{ref,max} = -0.3 is a reference time multiplied by the time penalty scaling factor C_time in (3). Varying the reference point in this manner correlates the "acceptable" amount of time an agent may take in an episode with the amount of progress it makes. It prevents an agent from being unduly rewarded for quickly running into a pedestrian, resulting in a small episode time.
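A small illustrative computation of (4), under the assumption that p(T)/p_max is supplied as a fraction in [0, 1] (this snippet is not from the paper):

```python
P_REF = 0.5        # fixed progress contribution to the reference point (paper's value)
T_REF_MAX = -0.3   # reference time multiplied by C_time (paper's value)

def reference_point(fractional_progress):
    """Variable reference point of Eq. (4): x_ref = p_ref + (p(T)/p_max) * t_ref,max."""
    return P_REF + fractional_progress * T_REF_MAX

# An agent covering 80% of the distance to the goal is allowed a correspondingly
# larger time contribution before its outcome counts as a loss.
print(round(reference_point(0.8), 6))  # 0.5 + 0.8 * (-0.3) = 0.26
```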
[Figure 5 appears here: histograms of full-episode average velocities (occurrences versus full-episode average velocity) for the CVaR (25%), Average, and CPT-RL agents; panel (a) shows 10-pedestrian scenarios and panel (b) shows 11-15 pedestrian scenarios.]

Figure 5: (a) Histogram of full-episode average velocities for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode average velocities for agents in test scenarios with 11-15 pedestrians.

                            10 Pedestrians                                             11-15 Pedestrians
Method    Progress       Time           Velocity        Success    Progress       Time            Velocity        Success
CVaR      4.35 ± 0.03    8.61 ± 0.10    0.617 ± 0.003   0.028      3.81 ± 0.03    7.14 ± 0.08     0.621 ± 0.003   0.018
AVG       6.23 ± 0.04    10.26 ± 0.09   0.661 ± 0.002   0.294      5.39 ± 0.04    9.07 ± 0.09     0.640 ± 0.002   0.188
CPT       6.63 ± 0.04    11.52 ± 0.10   0.631 ± 0.002   0.343      5.86 ± 0.04    10.55 ± 0.10    0.605 ± 0.002   0.240

Table 2: Agent behavior statistics during testing. Reported error bars are the standard error, and the "success" quantity is the fraction of the time the agent reaches the goal without a collision.

5.5 Results

The learning curves for training in Stages 1 and 2 are shown in Figures 2 and 3, respectively. For subsequent analysis, the same amount of training data and number of Stage 2 network updates (1500; chosen for convergence) were used to compare the agents that maximized average reward, CPT-value, and CVaR. Figure 4a shows the reward distributions earned by each agent over 5000 test runs, each with 10 pedestrians. To investigate the impact of more challenging "out-of-distribution" test cases, we also evaluated on scenarios with 11-15 pedestrians sampled uniformly on the same networks, as illustrated in Figure 4b. Statistics describing each test run are provided in Table 1. Note that different random seeds were used for each of the 5000 test runs, but these seeds were the same across test conditions.

Figure 4 displays quantitative differences in the testing performance of the three agents. The CVaR approach focuses exclusively on the lower region of performance. As such, it does the best at removing lower outliers, as evidenced by the 1% quantiles in Table 1. However, it does not make any effort to enhance performance far away from the worst case and therefore cannot compete with the other two methods in any other region. The maximization of CPT-value and average reward led to similar reward distributions. However, CPT was seen to reduce the frequency of poor outcomes compared to AVG, at the cost of a slight reduction in top-end performance. Intuitively, this behavior was to be expected. As shown in Figure 1a, CPT penalizes negative outcomes that are reasonably close to the reference point more harshly than AVG while assigning slightly less utility to the very best outcomes. The weighting from Figure 1b may have played a small role in the improved performance on both edges of the distribution. The differences between the CPT and AVG agents were preserved when presented with more challenging test scenarios, as illustrated in Figures 4a and 4b.

Looking more closely at the CPT and AVG agents, we see that the former averages more progress before a collision and reaches the goal more often because it is more deliberate. Figure 5 shows the distribution of average velocities v̄ of the agent's movement toward the goal over the test runs, defined by

    \bar{v} = \frac{d_T - d_0}{T},    (5)

where d_t is the distance from the agent to the goal at time t as previously defined and T is the duration of a given episode. As can be seen from Table 2, this slight reduction in speed allows the CPT-based agent to, on average, make more progress toward the goal before a collision than the other agents. It also allows the agent to reach the target without a collision more often, as reflected by the "success" fractions given in Table 2.
6 Conclusions and Future Research

We have developed a method to shape the full-episode reward distributions of DRL agents through the maximization of quantities besides expected reward. Our two-step approach has been shown to enable different agent behaviors when maximizing CPT-value, CVaR, and full-episode expected reward, both qualitatively and quantitatively. In particular, the CPT-based agents in our navigation experiments were seen to process risk differently than the AVG agents, on average proceeding more deliberately and thereby making more progress toward the goal before a collision. While CVaR is a standard risk-sensitive measure, we found that optimizing it produced agents unable to compete with the others above the very bottom of the reward distribution, even at the 25% level.

To further this work, we intend to investigate the impact of adjusting the tunable parameters of the CPT function on the risk-taking behavior of agents. In working toward assured operation, we plan to explore the combination of our methods with techniques for constrained RL (Achiam et al. 2017). Finally, to gain a deeper understanding of the effects of our methods, we also plan to apply them to more complex experimental testbeds.

7 Acknowledgments

The authors thank Ashley Llorens for insightful technical discussions and for his leadership of the Johns Hopkins Institute for Assured Autonomy (JHU IAA) project that supported our experimentation and analysis. This project was conducted under funding from both JHU/APL Internal Research and Development and JHU IAA.

References

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv preprint arXiv:1705.10528.
Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A Distributional Perspective on Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 449–458.
Chau, M.; and Fu, M. C. 2015. An Overview of Stochastic Approximation, 149–178. Springer.
Chau, M.; Fu, M. C.; Qu, H.; and Ryzhov, I. O. 2014. Simulation Optimization: A Tutorial Overview and Recent Developments in Gradient-based Methods. In Proceedings of the 2014 Winter Simulation Conference, 21–35.
Chen, C.; Liu, Y.; Kreiss, S.; and Alahi, A. 2019. Crowd-Robot Interaction: Crowd-Aware Robot Navigation With Attention-Based Deep Reinforcement Learning. In 2019 International Conference on Robotics and Automation, 6015–6022.
Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 1096–1105.
Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional Reinforcement Learning With Quantile Regression. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2892–2901.
Jie, C.; Prashanth, L.; Fu, M. C.; Marcus, S.; and Szepesvári, C. 2018. Stochastic Optimization in a Cumulative Prospect Theory Framework. IEEE Transactions on Automatic Control 63: 2867–2882.
Kahneman, D.; and Tversky, A. 1979. Prospect Theory: An Analysis of Decision under Risk. Econometrica 47(2): 263–291.
Kiefer, J.; and Wolfowitz, J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462–466.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations.
Leavens, D. H. 1945. Diversification of Investments. Trusts and Estates 80: 469–473.
Ma, X.; Xia, L.; Zhou, Z.; Yang, J.; and Zhao, Q. 2020. DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning. arXiv preprint arXiv:2004.14547.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level Control Through Deep Reinforcement Learning. Nature 518(7540): 529–533.
Prashanth, L.; Jie, C.; Fu, M. C.; Marcus, S. I.; and Szepesvári, C. 2016. Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control. In Proceedings of the 33rd International Conference on Machine Learning, 1406–1415.
Pratt, J. W. 1964. Risk Aversion in the Small and in the Large. Econometrica 32: 122–136.
Prelec, D. 1998. The Probability Weighting Function. Econometrica 66: 497–527.
Robbins, H.; and Monro, S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3): 400–407.
Rockafellar, R. T.; and Uryasev, S. 2000. Optimization of Conditional Value-at-Risk. Journal of Risk 2: 21–41.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Spall, J. C. 1992. Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control 37(3): 332–341.
Tversky, A.; and Kahneman, D. 1992. Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty 5(4): 297–323.
van den Berg, J.; Guy, S.; Lin, M.; and Manocha, D.
2011. Robotics Research, volume 70 of Springer Tracts
in Advanced Robotics, chapter Reciprocal n-Body Collision
Avoidance, 3–19. Springer.
Wu, C.; and Lin, Y. 1999. Minimizing Risk Models in
Markov Decision Processes with Policies Depending on Tar-
get Values. Journal of Mathematical Analysis and Applica-
tions 231: 47–67.