=Paper=
{{Paper
|id=Vol-2808/Paper_30
|storemode=property
|title=Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_30.pdf
|volume=Vol-2808
|authors=Jared Markowitz,Marie Chau,I-Jeng Wang
|dblpUrl=https://dblp.org/rec/conf/aaai/MarkowitzCW21
}}
==Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents==
Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents

Jared Markowitz, Marie Chau, I-Jeng Wang
Johns Hopkins University Applied Physics Laboratory
11000 Johns Hopkins Road, Laurel, Maryland 20723
Jared.Markowitz@jhuapl.edu, Marie.Chau@jhuapl.edu, I-Jeng.Wang@jhuapl.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Current deep reinforcement learning (DRL) methods fail to address risk in an intelligent manner, potentially leading to unsafe behaviors when deployed. One strategy for improving agent risk management is to mimic human behavior. While imperfect, human risk processing displays two key benefits absent from standard artificial agents: accounting for rare but consequential events and incorporating context. The former ability may prevent catastrophic outcomes in unfamiliar settings, while the latter results in asymmetric processing of potential gains and losses. These two attributes have been quantified by behavioral economists and form the basis of cumulative prospect theory (CPT), a leading model of human decision-making. We introduce a two-step method for training DRL agents to maximize the CPT-value of full-episode rewards accumulated from an environment, rather than the standard practice of maximizing expected discounted rewards. We quantitatively compare the distribution of outcomes when optimizing full-episode expected reward, CPT-value, and conditional value-at-risk (CVaR) in the CrowdSim robot navigation environment, elucidating the impacts of different objectives on the agent's willingness to trade safety for speed. We find that properly-configured maximization of CPT-value allows for a reduction of the frequency of negative outcomes with only a slight degradation of the best outcomes, compared to maximization of expected reward.

1 Introduction

Increasingly impressive demonstrations of machine learning raise increasingly critical questions about its robustness and safety in real-world environments. To inspire trust from humans, machines must be capable of reasoning, acting, and generalizing in alignment with human preferences. In particular, they must be able to adeptly handle rare but consequential scenarios and should incorporate context in their decision-making.

Reinforcement learning (RL) methods that consider edge behaviors are often referred to as risk-sensitive and have become increasingly well-studied as the prospects for real-world deployment of RL have grown. Potential target applications include autonomous vehicles (e.g., cars and planes), autonomous schedulers (e.g., power grids), and financial portfolio management. It is critical that RL-based systems adhere to strict safety standards, particularly when the potential for injury or damage exists. They must be able to integrate seamlessly with humans, anticipating and adjusting to the actions of operators and bystanders beyond the situations in which they were trained. Few present-day systems meet this rigorous standard.

Human preferences in decision-making problems have been widely studied in finance and economics, but have only recently begun to be addressed in earnest by the machine learning community (Jie et al. 2018). As humans have a natural ability to emphasize rare events and incorporate context when assessing risk, having artificial agents mimic human decision-making behaviors could be beneficial. On the other hand, given the known shortcomings of human decision-making (Kahneman and Tversky 1979; Tversky and Kahneman 1992), human imitation could represent a mere stepping stone to more adept risk-handling strategies.

Classical reinforcement learning finds policies for sequential decision-making problems by optimizing expected future rewards. This criterion alone fails to adequately emphasize edge behaviors, rendering it unsuitable for many practical applications. However, numerous approaches exist for addressing edge behaviors through either explicit or implicit means. One widely-used explicit technique is to artificially increase the frequency of known problematic edge cases during training; however, this can lead to performance degradation on more frequently observed scenarios. Another explicit strategy is to apply a risk-sensitive measure during training. Example risk-sensitive measures include exponential utility (Pratt 1964), percentile performance criteria (Wu and Lin 1999), value-at-risk (Leavens 1945), conditional value-at-risk (Rockafellar and Uryasev 2000), prospect theory (Kahneman and Tversky 1979), and cumulative prospect theory (CPT; Tversky and Kahneman (1992)). An implicit strategy incorporates a notion of risk by considering the full value distribution as opposed to the expected value (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a).
In this paper, we extend CPT-RL (Prashanth et al. 2016), which explicitly incorporates cumulative prospect theory into the perceived reward of an RL agent. Our method allows for the application of CPT-RL to agents with deep policy networks, unlocking CPT-based control of more complex problem spaces. More precisely, we demonstrate a methodology for modifying agents trained by conventional deep reinforcement learning (DRL) algorithms to maximize CPT-value instead of expected rewards. We evaluate our method on an enhanced version of the CrowdSim robot navigation environment (Chen et al. 2019). Our results show that the incorporation of CPT allows for fewer negative outcomes with only a slight degradation of the best outcomes. In this case, the CPT-based agent accepts a slight reduction in speed in order to facilitate surer progress.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of common explicit and implicit risk-sensitive methods. In Section 3, we introduce cumulative prospect theory in the context of reinforcement learning, including a CPT-value estimation technique and an efficient high-dimensional stochastic approximation method. In Section 4, we present Deep CPT-RL, an extension of CPT-RL to algorithms that use deep policy networks. In Section 5, we provide computational results to illustrate the benefits of Deep CPT-RL when applied to robot navigation. In Section 6, we provide concluding remarks and directions for future research.
2 Related Work

Both explicit and implicit means for introducing risk sensitivity to artificial agents have been previously investigated. On the explicit side, methods often incorporate risk by distorting the reward function. For instance, exponential utility transforms the expected discounted reward E[Z] to (1/β) log E[exp(βZ)], where β determines the level of risk (this is apparent after applying a straightforward Taylor expansion; risk-averse and risk-seeking behavior translate to β < 0 and β > 0, respectively). Another class of risk-sensitive measures focuses on rare occurrences not captured in the standard expected utility, in order to mitigate possible detrimental outcomes. Value-at-risk at confidence level α (VaR_α) is an α-quantile that focuses on the tail of the distribution, VaR_α(Z) = max{z | P(Z < z) ≤ α}, where α sets the level of risk. Conditional value-at-risk at confidence level α (CVaR_α) considers the expectation of the tail below the α-quantile point VaR_α.
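As a concrete illustration of these measures (not code from the paper), the NumPy sketch below estimates the exponential-utility objective, VaR_α, and CVaR_α from a sample of full-episode returns. The function names and the sampling example are our own assumptions.

```python
import numpy as np

def exponential_utility(returns, beta):
    """Exponential-utility objective (1/beta) * log E[exp(beta * Z)].

    beta < 0 corresponds to risk-averse behavior, beta > 0 to risk-seeking
    (illustrative helper, not from the paper's code base)."""
    z = np.asarray(returns, dtype=float)
    return np.log(np.mean(np.exp(beta * z))) / beta

def value_at_risk(returns, alpha):
    """Sample VaR_alpha: the alpha-quantile of the return distribution."""
    return np.quantile(np.asarray(returns, dtype=float), alpha)

def conditional_value_at_risk(returns, alpha):
    """Sample CVaR_alpha: mean return over the worst alpha fraction of outcomes."""
    z = np.asarray(returns, dtype=float)
    return z[z <= np.quantile(z, alpha)].mean()

# Example with synthetic, heavy-left-tailed returns (placeholder data).
rng = np.random.default_rng(0)
sample = rng.normal(0.4, 0.2, size=5000) - rng.exponential(0.05, size=5000)
print(value_at_risk(sample, 0.25), conditional_value_at_risk(sample, 0.25))
```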
On the implicit side, distributional reinforcement learning (DistRL) approaches risk from a different perspective. Instead of distorting the utility function and optimizing the expectation, DistRL models the value distribution in the optimization process and selects the policy based on its mean. Bellemare, Dabney, and Munos (2017) introduced the first DistRL algorithm, C51, and showed that it outperforms standard Deep Q-Networks (DQN; Mnih et al. (2015)) in some environments. C51 estimates probabilities on a fixed set of uniformly-spaced possible returns, requiring bounds that may not exist in practice. Quantile regression DQN (QR-DQN) overcomes this limitation by estimating the inverse cumulative distribution function (CDF) of equally-spaced quantiles (Dabney et al. 2018b). Implicit quantile networks further improve on QR-DQN by estimating the inverse CDF of random quantiles (Dabney et al. 2018a). Another recent method, Distributional Soft Actor-Critic (Ma et al. 2020), enables application of DistRL to continuous spaces.

Another approach for addressing edge cases is to look to human decision-making for inspiration. Human decision makers preferentially consider rare events and perform admirably on many tasks that RL agents are trained to tackle. The goal in many applications is to maximize agent alignment with human preferences, which may also point to approaches that attempt to mimic human decision-making. One method for incorporating human tendencies into agent behavior is to have the agent maximize the CPT-value function instead of expected future reward (CPT-RL; Prashanth et al. (2016)). CPT, a leading model of human decision-making, is a refined variant of prospect theory that includes a generalization of expected utility and is supported by substantial empirical evidence.

In this paper, we extend CPT-RL to enable application to deep neural network (DNN) policies. Because the gradient-free methods used in CPT-RL do not scale to training DNNs from scratch, we perform initial training to maximize expected reward and update a subset of the resulting optimal weights to maximize the value of the CPT function. Our approach resembles transfer learning at first glance, as the policy is retrained according to a related objective function. However, in our case, a subset of the optimal weights are retained and a subset are re-initialized and re-trained.

3 Background

3.1 Cumulative Prospect Theory

Cumulative prospect theory uses two components to model human behavior: a utility function that quantifies human attitudes toward outcomes and a weight function that quantifies the emphasis placed on different outcomes (Figure 1). The utility function u = (u^+, u^-), where u^± : R → R^+, u^+(x) = 0 for x ≤ 0, and u^-(x) = 0 for x > 0, has two regions separated by a reference point, which we consider to be zero for illustrative purposes. The region to the left of the reference point specifies the utility of losses, while the region to its right specifies the utilities of gains. Humans have empirically been found to be more risk-averse with gains than losses (Tversky and Kahneman 1992), leading to the asymmetric concavities on the two sides of the reference point. The reference point may be static or dynamic; in the case of humans it may change with accumulated experience. In our experiments we consider dynamic reference points for reasons that will be discussed in Section 5.

The probability weight function w = (w^+, w^-), where w^± : [0, 1] → [0, 1], models human consideration of events based on frequency, highlighting both positive and negative rare events relative to more common events (Figure 1(b)). Tversky and Kahneman (1992) recommend the weight function

w^{\pm}(p) = w(p) = \frac{p^{\eta}}{\left(p^{\eta} + (1-p)^{\eta}\right)^{1/\eta}},

while Prelec (1998) suggests w^{\pm}(p) = w(p) = \exp\left(-(-\ln p)^{\eta}\right), where \eta \in (0, 1).

The CPT-value function, defined as

C_{u,w}(X) = \int_0^{+\infty} w^{+}\big(P(u^{+}(X) > z)\big)\,dz - \int_0^{+\infty} w^{-}\big(P(u^{-}(X) > z)\big)\,dz,   (1)

satisfies certain requirements. We consider u and w to be fixed; therefore, we simplify notation henceforth: C_{u,w} → C.

Figure 1: (a) Example CPT utility function. Gains and losses are treated asymmetrically, with behavior towards gains being more risk-averse. (b) Example CPT weight function. Rare outcomes on either side are emphasized, as determined by the slope of w^+(p) (and similarly w^-(p)).

The CPT function (1) is a generalization of expected value. In particular, if we consider w^{\pm}(x) = x, u^{+}(x) = x for x ≥ 0, and u^{-}(x) = -x for x < 0, then

C(X) = \int_0^{+\infty} P(X > z)\,dz - \int_0^{+\infty} P(-X > z)\,dz = \int_0^{+\infty} P(\max(X, 0) > z)\,dz - \int_0^{+\infty} P(\max(-X, 0) > z)\,dz = E[\max(X, 0)] - E[\max(-X, 0)],

where u^+ and u^- are utility functions corresponding to gains (X > 0) and losses (X < 0) with reference point X = 0.
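As an illustration of these two components, the sketch below implements the Tversky-Kahneman and Prelec weight functions and a piecewise-power utility of the shape sketched in Figure 1. The exponent and loss-aversion values (η, α, λ) are commonly cited defaults from the behavioral-economics literature, not parameters reported in this paper.

```python
import numpy as np

def tk_weight(p, eta=0.61):
    """Tversky-Kahneman weight w(p) = p^eta / (p^eta + (1-p)^eta)^(1/eta)."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

def prelec_weight(p, eta=0.65):
    """Prelec weight w(p) = exp(-(-ln p)^eta) for eta in (0, 1)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return np.exp(-((-np.log(p)) ** eta))

def u_plus(x, alpha=0.88):
    """Gain utility u+(x): zero for x <= 0, concave power of gains (illustrative alpha)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, np.abs(x) ** alpha, 0.0)

def u_minus(x, alpha=0.88, lam=2.25):
    """Loss utility u-(x): zero for x > 0, steeper (loss-averse) magnitude of losses."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, lam * np.abs(x) ** alpha, 0.0)
```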
3.2 CPT-Value Estimation

Prashanth et al. (2016) proposed a provably convergent CPT-value estimate of (1), inspired by quantiles. Their approach is summarized in Algorithm 1. The random reward X is based on a policy π_θ, where θ is a set of controllable parameter values chosen to optimize the CPT function (1). Reward X_i is generated from episode i, where an episode is a trajectory from initial to terminal state guided by actions. Episodes are assumed to be both noisy and expensive to generate. Note that in addition to selecting appropriate utility and weight functions, this approach requires users to choose a suitable sample size, or number of episodes m, to produce a reasonable estimate.

Algorithm 1: CPT-value Estimation (Prashanth et al. 2016)
1. Generate X_1, ..., X_m i.i.d. from the distribution of X.
2. Let
   C_m^{+}(X) = \sum_{i=1}^{m} u^{+}(X_{[i]}) \left[ w^{+}\left(\tfrac{m+1-i}{m}\right) - w^{+}\left(\tfrac{m-i}{m}\right) \right],
   C_m^{-}(X) = \sum_{i=1}^{m} u^{-}(X_{[i]}) \left[ w^{-}\left(\tfrac{i}{m}\right) - w^{-}\left(\tfrac{i-1}{m}\right) \right],
   where X_{[i]} is the ith order statistic.
3. Return C(X) = C_m^{+}(X) - C_m^{-}(X).

3.3 Gradient-Free Stochastic Approximation

Consider the stochastic optimization problem

\max_{\theta \in \Theta} C(X(\theta)),

where C(·) is the CPT-value function (1), Θ is a compact, convex subset of R^d, X(·) is a random reward, and θ = (θ_1, ..., θ_d) is a d-dimensional weight parameter. Stochastic approximation updates the weights θ iteratively using the recursion

\theta_{n+1} = \Gamma\left(\theta_n + \gamma_n \hat{g}(X(\theta_n))\right),

where θ_0 is chosen randomly, Γ(θ) is a projection operator that ensures θ ∈ Θ, γ_n is a diagonal matrix with step sizes for each dimension on the diagonal, and ĝ is an estimate of the true gradient ∇C (Robbins and Monro 1951; Chau and Fu 2015; Chau et al. 2014). Direct (unbiased) gradients are unavailable for CPT-values; however, with the availability of CPT-value estimates, indirect (biased) gradient estimates can be computed. Applicable indirect methods include finite differences (Kiefer and Wolfowitz 1952) and simultaneous perturbation stochastic approximation (SPSA) (Spall 1992).

In this paper, we employ SPSA for its computational efficiency. The kth component of the SPSA gradient is defined as

g_k^{SPSA}(\theta) = \frac{C(\theta + \delta\Delta) - C(\theta - \delta\Delta)}{2\delta\Delta_k}   (2)

for k = 1, ..., d, where δ > 0 and Δ = (Δ_1, ..., Δ_d) has random i.i.d. components with zero mean and finite inverse moments. Each CPT-value estimate is generated from sample rewards based on m episodes, i.e., X_i for i = 1, ..., m. Note that the number of episodes m can vary from one iteration to the next.
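A minimal NumPy sketch of Algorithm 1 and the SPSA estimate (2) follows. It reuses the utility and weight helpers from the previous sketch; the generic spsa_gradient routine, its argument names, and the rollout callback it assumes are our own illustration rather than the authors' implementation.

```python
import numpy as np

def cpt_value_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Quantile-based CPT-value estimate of Eq. (1) (Algorithm 1, Prashanth et al. 2016).

    samples: m i.i.d. full-episode rewards X_1, ..., X_m."""
    x = np.sort(np.asarray(samples, dtype=float))   # order statistics X_[1] <= ... <= X_[m]
    m = len(x)
    i = np.arange(1, m + 1)
    # C_m^+ = sum_i u+(X_[i]) * [w+((m+1-i)/m) - w+((m-i)/m)]
    c_plus = np.sum(u_plus(x) * (w_plus((m + 1 - i) / m) - w_plus((m - i) / m)))
    # C_m^- = sum_i u-(X_[i]) * [w-(i/m) - w-((i-1)/m)]
    c_minus = np.sum(u_minus(x) * (w_minus(i / m) - w_minus((i - 1) / m)))
    return c_plus - c_minus

def spsa_gradient(theta, cpt_of, delta, rng):
    """Two-sided SPSA estimate of grad C(theta), Eq. (2), with Rademacher perturbations.

    cpt_of(theta) is assumed to roll out m episodes under the policy parameterized
    by theta and return a CPT-value estimate (placeholder supplied by the caller)."""
    perturb = rng.choice([-1.0, 1.0], size=theta.shape)
    return (cpt_of(theta + delta * perturb) - cpt_of(theta - delta * perturb)) \
        / (2.0 * delta * perturb)

# Example usage (taking w+ and w- equal for simplicity):
#   c_hat = cpt_value_estimate(episode_rewards, u_plus, u_minus, tk_weight, tk_weight)
```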
4 Deep CPT-RL

The CPT-RL approach described in Section 3 was applied by Prashanth et al. (2016) to successfully optimize a Boltzmann policy and thereby control a relatively simple traffic management scenario. However, CPT-RL does not directly scale to policy mappings with high-dimensional parameter spaces, including deep neural networks. This is because SPSA produces gradient estimates with lower fidelity and higher variance than backpropagation.

To address these issues and thereby extend CPT-RL to deep policy networks, we employ a two-stage approach. First, the deep policy network of an agent is trained using a conventional actor-critic method. The actor-critic formulation was chosen in accordance with the need to learn a policy and the desire to reduce the variance of policy gradient estimates. The lower, input-side layers of the network are then frozen and the upper, output-side layers (one or more) are retrained to maximize the CPT-value using a procedure similar to CPT-RL. Thus the first stage learns a feature representation of the observation space and the second stage learns a policy under the learned representation. By limiting the second-stage updates to the upper layer(s) of the network, we overcome the limitations of SPSA in high-dimensional settings.

More explicitly, Stage 1 aims to find an optimal policy π_{θ*} for choosing actions a_t that maximize expected return. The optimal policy parameters θ* are given by

\theta^{*} = \arg\max_{\theta \in \Theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_t r(s_t, a_t)\right],

where the τ are trajectories drawn from the distribution p_θ(τ) of possible trajectories an agent using policy π_θ may take in the environment, Θ is the feasible policy parameter region, t indicates time, and r(s_t, a_t) is the stochastic reward granted by the environment when action a_t is taken in state s_t. Stage 2 aims to find a policy that maximizes the CPT-value function (1), or in principle any risk-sensitive measure. Unfortunately, unlike the standard expectation of rewards, the CPT-functional is based on cumulative probabilities and is piecewise, precluding direct, sample-based differentiation. No nested structure is assumed, precluding the use of bootstrapping methods. Hence, we generate indirect (biased) gradients from the CPT-value estimates.

In principle, any policy-based DRL algorithm may be applied in the first stage of our procedure. We applied Proximal Policy Optimization (PPO; Schulman et al. (2017)), a leading actor-critic approach, to train our agents to convergence. As previously stated, the lower layers of the network learned in Stage 1 provide a feature representation for the upper layer(s). Therefore, PPO can be thought of as learning a "perception" mapping suitable for the ensuing "action" layers learned by the second stage. In our experiments, we found it sufficient to re-initialize the parameters and tune only the last layer in Stage 2. Further flexibility may be gained through the re-optimization of the last two layers.

The second stage is feasible because 1) the lower layers of a properly-trained initial network tease out the features necessary to produce intelligent agent behavior and 2) the action layer(s) comprise a small fraction of the parameter count of the entire network. We used SPSA to estimate the gradient because of its computational efficiency; in particular, finite differences was not considered due to the sheer number of parameters involved. As in the original CPT-RL formulation, all information required to compute the gradients was derived from batches of episodes. To increase stability, we averaged over gradient estimates generated from multiple perturbations before conducting a network update via the Adam optimizer (Kingma and Ba 2015).

One notable extension to the work by Prashanth et al. (2016) is our consideration of multiple reward terms. While other approaches are possible, we employed the simple strategy of choosing a single reference point based on all terms. We allowed this reference point to vary based on the outcome of a given episode, as required by our experimental setup discussed below. One could alternatively compute a CPT-value for each reward term where it makes sense to weigh risk, using expectations for other terms, but we found this to provide similar performance with additional complexity.

Our approach is outlined in Algorithm 2. For notational simplicity, let θ_{-1} ∈ Θ_{-1} and θ_{:-1} ∈ Θ_{:-1} denote the neural network weights in the last layer and in all but the last layer, respectively. The weight parameters θ_{:-1} are fixed in the second stage, so we drop them from the input, X(θ_{:-1} ∪ θ_{-1}) → X(θ_{-1}).

Algorithm 2: Deep CPT-RL

Stage 1.
1. Identify a deep policy network architecture with weight parameters θ.
2. Train the deep policy network to obtain optimal weight parameters θ* that maximize expected return (e.g., using PPO).

Stage 2.
3. Initialize the stopping time N, gradient estimates per update M, SPSA parameters {δ_n}, and Adam parameters α > 0, β ∈ (0, 1), ε > 0.
4. Fix θ̃_{:-1} = θ*_{:-1} and re-initialize θ̃_{-1} such that θ̃_{-1;i} ∼ Uniform(-|θ_{-1}|^{-1/2}, |θ_{-1}|^{-1/2}) for i = 1, ..., |θ̃_{-1}|, where θ̃_{-1} = (θ̃_{-1;1}, ..., θ̃_{-1;|θ̃_{-1}|}).
5. For i = 0, ..., N - 1:
   - For j = 0, ..., M - 1:
     * Generate Δ = {Δ_k}_{k=1}^{|θ̃_{-1}|}, where Δ_k = ±1 with probability 0.5, independently for all k.
     * Set θ̃_{-1}^{±} = θ̃_{-1} ± δ_i Δ and generate X_j(θ̃_{-1}^{±}).
     * Generate C(X_j(θ̃_{-1}^{+})) and C(X_j(θ̃_{-1}^{-})).
     * Compute g_j^{SPSA}(θ̃_{-1}) using (2).
   - Average {g_j^{SPSA}}_{j=0}^{M-1} and use the result for the Adam update G_i.
   - Update θ̃_{-1} ← θ̃_{-1} + G_i.
6. Return θ̃^{CPT} = θ̃_{:-1} ∪ θ̃_{-1}.
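The sketch below illustrates the Stage 2 loop for a last layer flattened into a single weight vector. Freezing the lower layers, the uniform re-initialization scale, SPSA perturbations, gradient averaging, and the Adam-style ascent step follow Algorithm 2; the rollout callback cpt_of, the hyperparameter values, and the explicit Adam implementation are placeholders we introduce for illustration.

```python
import numpy as np

def deep_cpt_stage2(cpt_of, n_last, num_updates=1500, grads_per_update=4,
                    delta=0.05, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Stage 2 of Deep CPT-RL (sketch): retrain only the last-layer weights with
    SPSA gradient estimates of the CPT-value and Adam-style ascent updates.

    cpt_of(theta_last) -> CPT-value estimate from a batch of episodes rolled out
    with the frozen lower layers and last-layer weights theta_last (placeholder).
    n_last: number of last-layer parameters. Hyperparameter values are illustrative."""
    rng = np.random.default_rng(0)
    bound = 1.0 / np.sqrt(n_last)
    theta = rng.uniform(-bound, bound, size=n_last)   # step 4: re-initialize last layer
    m_t, v_t = np.zeros(n_last), np.zeros(n_last)     # Adam moment estimates
    for i in range(1, num_updates + 1):
        grads = []
        for _ in range(grads_per_update):             # multiple SPSA perturbations
            perturb = rng.choice([-1.0, 1.0], size=n_last)
            g = (cpt_of(theta + delta * perturb) - cpt_of(theta - delta * perturb)) \
                / (2.0 * delta * perturb)
            grads.append(g)
        g_bar = np.mean(grads, axis=0)                # average the SPSA estimates
        m_t = beta1 * m_t + (1 - beta1) * g_bar       # Adam moments on the ascent gradient
        v_t = beta2 * v_t + (1 - beta2) * g_bar**2
        m_hat, v_hat = m_t / (1 - beta1**i), v_t / (1 - beta2**i)
        theta = theta + lr * m_hat / (np.sqrt(v_hat) + eps)   # gradient ascent step G_i
    return theta
```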
5 Numerical Experiments

To quantify the impact of optimizing the CPT-value, we applied our two-step procedure to an enhanced version of the CrowdSim robotic navigation environment (Chen et al. 2019). CrowdSim allows for explicit evaluation of an agent's risk-taking behavior in the form of its willingness to trade safety for speed.

5.1 CrowdSim Environment

CrowdSim was developed to study social mobility; that is, to train robotic agents to avoid collisions with pedestrians while moving toward a goal. In the default configuration, pedestrians follow the Optimal Reciprocal Collision Avoidance (ORCA; van den Berg et al. (2011)) protocol to avoid other pedestrians while traversing to their own goals. An episode concludes when the goal is reached, the robot collides with a pedestrian, or the system times out. We took the robot to be invisible to the pedestrians in order to provide a more challenging test case. To facilitate our experiments, we made a few changes to the published environment. These changes were broadly designed to make the scenario more realistic, facilitate efficient learning, and provide rotational invariance. We closed the simulation, enforcing elastic collisions when an agent reaches an outer wall. We kept pedestrians in constant motion, assigning them a new goal destination when they reached a target. While the original version fixed the initial position of the robot and its goal, we randomly placed them opposite each other on a circle centered at the origin.

In addition to environmental modifications, we adjusted the observations and rewards received by the agents. We replaced the feature-based representation with grayscale, pixel-based observations. This allowed us to encode the geometry of the system more naturally, enabling the learning of rotation-invariant navigation policies that were challenging to produce with the original, feature-based representation. We modified the published reward function to 1) allow removal of the initial imitation learning step used in Chen et al. (2019) and 2) explicitly model a speed-safety trade-off. Removal of the imitation learning was desired to allow quantification of CPT-based shaping of an RL agent trained tabula rasa, as well as to prevent our training from requiring more than two phases. It was enabled through the use of a progress term that encouraged movement in the direction of the goal. The speed-safety tradeoff came from a small constant penalty assessed at each step, encouraging the agent to reach the goal quickly. In sum, the reward function was formulated as

r_t = -C_{\text{time}} + C_{\text{progress}} (d_{t-1} - d_t),   (3)

where d_t is the distance between the robot and the goal at time t (i.e., d_{t-1} - d_t is the "progress" made in the timestep), C_time > 0 is the time penalty, and C_progress > 0 is the progress reward. We set C_time = 0.02 and C_progress = 0.1, with the total distance from the agent's starting position to the target being 10. Our choice of C_progress provided a maximal progress contribution of 1 per episode; C_time was chosen to provide a meaningful fraction of that value for episodes of standard duration. An episode ends when one of three things happens: the robot reaches the goal, the robot collides with a pedestrian, or a timeout occurs. Hence, even though there is no explicit collision penalty, the agent is incentivized to avoid collisions because a collision precludes the possibility of accumulating more progress reward.

Our agent was configured to choose from 33 different motions: remaining at rest, or moving at one of 4 speeds (0.25 m/s, 0.5 m/s, 0.75 m/s, 1.0 m/s) along one of 8 evenly-spaced angles in [0, 2π). Both the robot and the pedestrians were configured to move at a preferred speed of v_pref = 1.0 m/s. We considered 10 pedestrians in each training episode, allowing that number to vary in testing to evaluate "out-of-distribution" performance. This configuration was chosen with the goal of exploring challenging regimes (where the agent would not always be able to avoid collisions) in order to generate meaningful comparisons amongst methods.
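A small sketch of the modified reward (3) and the 33-motion action set described above is given below; it is meant only to make the stated constants concrete, and the velocity-vector encoding of the actions is our own assumption.

```python
import numpy as np

C_TIME, C_PROGRESS = 0.02, 0.1   # time penalty and progress reward from Eq. (3)

def step_reward(d_prev, d_curr):
    """Per-step reward r_t = -C_time + C_progress * (d_{t-1} - d_t)."""
    return -C_TIME + C_PROGRESS * (d_prev - d_curr)

# 33 discrete motions: remain at rest, plus 4 speeds at 8 evenly-spaced headings,
# encoded here as (vx, vy) velocity vectors (an illustrative representation).
SPEEDS = [0.25, 0.5, 0.75, 1.0]                  # m/s
ANGLES = [2 * np.pi * k / 8 for k in range(8)]   # radians in [0, 2*pi)
ACTIONS = [(0.0, 0.0)] + [(v * np.cos(a), v * np.sin(a)) for v in SPEEDS for a in ANGLES]
assert len(ACTIONS) == 33
```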
5.2 Network Architecture

In our experiments, we used a convolutional neural network that closely resembles the architecture commonly used for Atari (Mnih et al. 2015). However, our input images were 50% larger in each direction than those used by Mnih et al. (2015) to ensure proper representation of agent and goal edges. To encode motion, 4 frames were stacked. The input to the network thus consisted of 126 × 126 × 4 image arrays, which were passed to 3 consecutive convolutional layers. These layers had 32, 64, and 64 channels (input-side to output-side), kernel sizes of 12, 4, and 3, and stride lengths of 6, 2, and 1. The layers were separated by rectified linear unit (ReLU) nonlinearities. The output of the last convolutional layer was passed to a rectified, 512-dimensional fully-connected layer. The resulting output was used as the input to a policy head of dimension 33 and a scalar value head.
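A PyTorch sketch consistent with the architecture described above is shown below. The paper does not name its framework, so the PyTorch implementation, the layer names, and the flattened dimension (64 × 7 × 7, implied by the stated kernels and strides on a 126 × 126 input) are our reconstruction rather than the authors' code.

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Atari-style CNN for 126x126x4 grayscale stacks with policy and value heads."""
    def __init__(self, num_actions=33):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=12, stride=6), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),   # -> 64 x 7 x 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                   # 512-d embedding
        )
        self.policy_head = nn.Linear(512, num_actions)  # logits over the 33 motions
        self.value_head = nn.Linear(512, 1)             # scalar state value

    def forward(self, frames):
        # frames: (batch, 4, 126, 126) stacked grayscale observations
        z = self.features(frames)
        return self.policy_head(z), self.value_head(z)
```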
5.3 Stage 1 Training

The neural networks governing our navigation agents were first trained to convergence using PPO. Hyperparameters were chosen to mimic those used for Atari in Schulman et al. (2017), except with shorter windows (16 steps) and more windows per batch (64).

Figure 2: Stage 1 training of the agent using PPO (reward versus training frames).

5.4 Stage 2 Training

After Stage 1 training, we trained our agent to maximize the CPT-value under Algorithm 2. For comparison purposes, we also used the same procedure to optimize for full-episode average reward (AVG) and full-episode conditional value-at-risk (CVaR). CVaR was included to evaluate a standard risk-sensitive measure, focused on improving the worst agent outcomes. While the standard practice is to consider the bottom 5% of the distribution for CVaR, we used the bottom 25% in an effort to maximize performance over a broader range of the distribution. We chose hyperparameters for the SPSA gradient estimation and the Adam optimizer in line with standard guidance (Spall 1992; Kingma and Ba 2015).

One critical detail of CPT-based training is the selection of the reference point. Our variable reference point consists of two terms, corresponding to the two terms in the reward function (3). We chose a fixed contribution of p_ref = 0.5 from the progress term, corresponding to the agent making it halfway from the start to the goal before a collision. The contribution from time varies with episode progress:

x_{\text{ref}} = p_{\text{ref}} + \frac{p(T)}{p_{\max}} \, t_{\text{ref,max}},   (4)

where p(T)/p_max is the fractional progress toward the goal made by the agent over the whole episode ending at T, and t_{ref,max} = -0.3 is a reference time multiplied by the time penalty scaling factor C_time in (3). Varying the reference point in this manner correlates the "acceptable" amount of time an agent may take in an episode with the amount of progress it makes. It prevents an agent from being unduly rewarded for quickly running into a pedestrian, resulting in a small episode time.

Figure 3: Stage 2 training of the agent using CVaR, AVG, and CPT-based objectives (reward versus top-layer updates).
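To make the reference-point rule (4) concrete, the sketch below computes x_ref for an episode and shifts the full-episode reward by it before CPT evaluation. The shifting step reflects our reading of how a reference point enters the utility in (1) and, like the helper names, should be treated as an assumption rather than the authors' exact procedure.

```python
P_REF = 0.5          # fixed progress contribution (halfway to the goal)
T_REF_MAX = -0.3     # reference time scaled by C_time, from Eq. (4)

def reference_point(progress_at_T, max_progress=1.0):
    """x_ref = p_ref + (p(T) / p_max) * t_ref_max, Eq. (4)."""
    return P_REF + (progress_at_T / max_progress) * T_REF_MAX

def centered_episode_reward(episode_reward, progress_at_T):
    """Shift the full-episode reward by the episode's reference point so that gains
    and losses in the CPT utility are measured relative to x_ref (our reading)."""
    return episode_reward - reference_point(progress_at_T)
```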
5.5 Results

The learning curves for training in Stages 1 and 2 are shown in Figures 2 and 3, respectively. For subsequent analysis, the same amount of training data and the same number of Stage 2 network updates (1500; chosen for convergence) were used to compare the agents that maximized average reward, CPT-value, and CVaR. Figure 4a shows the reward distributions earned by each agent over 5000 test runs, each with 10 pedestrians. To investigate the impact of more challenging "out-of-distribution" test cases, we also evaluated the same networks on scenarios with 11-15 pedestrians sampled uniformly, as illustrated in Figure 4b. Statistics describing each test run are provided in Table 1. Note that different random seeds were used for each of the 5000 test runs, but these seeds were the same across test conditions.

Figure 4: (a) Histogram of full-episode rewards for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode rewards for agents in test scenarios with 11-15 pedestrians.

Table 1: Reward distribution statistics for testing of Stage 2 agents. Reported error bars are standard error.

| Method | Mean (10 ped.) | Median (10 ped.) | 0.01-quantile (10 ped.) | Mean (11-15 ped.) | Median (11-15 ped.) | 0.01-quantile (11-15 ped.) |
| CVaR | 0.262 ± 0.002 | 0.260 | −0.029 | 0.239 ± 0.002 | 0.232 | −0.020 |
| AVG | 0.418 ± 0.003 | 0.414 | −0.172 | 0.358 ± 0.003 | 0.320 | −0.066 |
| CPT | 0.432 ± 0.003 | 0.461 | −0.085 | 0.375 ± 0.003 | 0.360 | −0.123 |

Figure 4 displays quantitative differences in the testing performance of the three agents. The CVaR approach focuses exclusively on the lower region of performance. As such, it does the best at removing lower outliers, as evidenced by the 1% quantiles in Table 1. However, it does not make any effort to enhance performance far away from the worst case and therefore cannot compete with the other two methods in any other region. The maximization of CPT-value and average reward led to similar reward distributions. However, CPT was seen to reduce the frequency of poor outcomes compared to AVG, at the cost of a slight reduction in top-end performance. Intuitively, this behavior was to be expected. As shown in Figure 1a, CPT penalizes negative outcomes that are reasonably close to the reference point more harshly than AVG while assigning slightly less utility to the very best outcomes. The weighting from Figure 1b may have played a small role in the improved performance on both edges of the distribution. The differences between the CPT and AVG agents were preserved when presented with more challenging test scenarios, as illustrated in Figures 4a and 4b.

Looking more closely at the CPT and AVG agents, we see that the former averages more progress before a collision and reaches the goal more often because it is more deliberate. Figure 5 shows the distribution of average velocities v̄ of the agent's movement toward the goal over the test runs, defined by

\bar{v} = \frac{d_0 - d_T}{T},   (5)

where d_t is the distance from the agent to the goal at time t, as previously defined, and T is the duration of a given episode. As can be seen from Table 2, this slight reduction in speed allows the CPT-based agent to, on average, make more progress toward the goal before a collision than the other agents. It also allows the agent to reach the target without a collision more often, as reflected by the "success" fractions given in Table 2.

Figure 5: (a) Histogram of full-episode average velocities for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode average velocities for agents in test scenarios with 11-15 pedestrians.

Table 2: Agent behavior statistics during testing. Reported error bars are the standard error, and the "success" quantity is the fraction of the time the agent reaches the goal without a collision.

10 Pedestrians:
| Method | Progress | Time | Velocity | Success |
| CVaR | 4.35 ± 0.03 | 8.61 ± 0.10 | 0.617 ± 0.003 | 0.028 |
| AVG | 6.23 ± 0.04 | 10.26 ± 0.09 | 0.661 ± 0.002 | 0.294 |
| CPT | 6.63 ± 0.04 | 11.52 ± 0.10 | 0.631 ± 0.002 | 0.343 |

11-15 Pedestrians:
| Method | Progress | Time | Velocity | Success |
| CVaR | 3.81 ± 0.03 | 7.14 ± 0.08 | 0.621 ± 0.003 | 0.018 |
| AVG | 5.39 ± 0.04 | 9.07 ± 0.09 | 0.640 ± 0.002 | 0.188 |
| CPT | 5.86 ± 0.04 | 10.55 ± 0.10 | 0.605 ± 0.002 | 0.240 |

6 Conclusions and Future Research

We have developed a method to shape the full-episode reward distributions of DRL agents through the maximization of quantities besides expected reward. Our two-step approach has been shown to enable different agent behaviors when maximizing CPT-value, CVaR, and full-episode expected reward, both qualitatively and quantitatively. In particular, the CPT-based agents in our navigation experiments were seen to process risk differently than the AVG agents, on average proceeding more deliberately and thereby making more progress toward the goal before a collision. While CVaR is a standard risk-sensitive measure, we found that optimizing it produced agents unable to compete with the others above the very bottom of the reward distribution, even at the 25% level.

To further this work, we intend to investigate the impact of adjusting the tunable parameters of the CPT function on the risk-taking behavior of agents. In working toward assured operation, we plan to explore the combination of our methods with techniques for constrained RL (Achiam et al. 2017). Finally, to gain a deeper understanding of the effects of our methods, we also plan to apply them to more complex experimental testbeds.

7 Acknowledgments

The authors thank Ashley Llorens for insightful technical discussions and for his leadership of the Johns Hopkins Institute for Assured Autonomy (JHU IAA) project that supported our experimentation and analysis. This project was conducted under funding from both JHU/APL Internal Research and Development and JHU IAA.

References

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv:1705.10528.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A Distributional Perspective on Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 449–458.

Chau, M.; and Fu, M. C. 2015. An Overview of Stochastic Approximation, 149–178. Springer.

Chau, M.; Fu, M. C.; Qu, H.; and Ryzhov, I. O. 2014. Simulation Optimization: A Tutorial Overview and Recent Developments in Gradient-based Methods. In Proceedings of the 2014 Winter Simulation Conference, 21–35.

Chen, C.; Liu, Y.; Kreiss, S.; and Alahi, A. 2019. Crowd-Robot Interaction: Crowd-Aware Robot Navigation With Attention-Based Deep Reinforcement Learning. In 2019 International Conference on Robotics and Automation, 6015–6022.

Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 1096–1105.

Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional Reinforcement Learning With Quantile Regression. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2892–2901.

Jie, C.; Prashanth, L.; Fu, M. C.; Marcus, S.; and Szepesvári, C. 2018. Stochastic Optimization in a Cumulative Prospect Theory Framework. IEEE Transactions on Automatic Control 63: 2867–2882.

Kahneman, D.; and Tversky, A. 1979. Prospect Theory: An Analysis of Decision under Risk. Econometrica 47(2): 263–291.

Kiefer, J.; and Wolfowitz, J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462–466.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations.

Leavens, D. H. 1945. Diversification of Investments. Trusts and Estates 80: 469–473.

Ma, X.; Xia, L.; Zhou, Z.; Yang, J.; and Zhao, Q. 2020. DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning. arXiv:2004.14547.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level Control Through Deep Reinforcement Learning. Nature 518(7540): 529–533.

Prashanth, L.; Jie, C.; Fu, M. C.; Marcus, S. I.; and Szepesvári, C. 2016. Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control. In Proceedings of the 33rd International Conference on Machine Learning, 1406–1415.

Pratt, J. W. 1964. Risk Aversion in the Small and in the Large. Econometrica 32: 122–136.

Prelec, D. 1998. The Probability Weighting Function. Econometrica 66: 497–527.

Robbins, H.; and Monro, S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3): 400–407.

Rockafellar, R. T.; and Uryasev, S. 2000. Optimization of Conditional Value-at-Risk. Journal of Risk 2: 21–41.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Spall, J. C. 1992. Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control 37(3): 332–341.

Tversky, A.; and Kahneman, D. 1992. Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty 5(4): 297–323.

van den Berg, J.; Guy, S.; Lin, M.; and Manocha, D. 2011. Reciprocal n-Body Collision Avoidance. In Robotics Research, volume 70 of Springer Tracts in Advanced Robotics, 3–19. Springer.

Wu, C.; and Lin, Y. 1999. Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values. Journal of Mathematical Analysis and Applications 231: 47–67.