=Paper=
{{Paper
|id=Vol-2808/Paper_30
|storemode=property
|title=Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_30.pdf
|volume=Vol-2808
|authors=Jared Markowitz,Marie Chau,I-Jeng Wang
|dblpUrl=https://dblp.org/rec/conf/aaai/MarkowitzCW21
}}
==Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents==
Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents

Jared Markowitz, Marie Chau, I-Jeng Wang
Johns Hopkins University Applied Physics Laboratory
11000 Johns Hopkins Road, Laurel, Maryland 20723
Jared.Markowitz@jhuapl.edu, Marie.Chau@jhuapl.edu, I-Jeng.Wang@jhuapl.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Current deep reinforcement learning (DRL) methods fail to address risk in an intelligent manner, potentially leading to unsafe behaviors when deployed. One strategy for improving agent risk management is to mimic human behavior. While imperfect, human risk processing displays two key benefits absent from standard artificial agents: accounting for rare but consequential events and incorporating context. The former ability may prevent catastrophic outcomes in unfamiliar settings, while the latter results in asymmetric processing of potential gains and losses. These two attributes have been quantified by behavioral economists and form the basis of cumulative prospect theory (CPT), a leading model of human decision-making. We introduce a two-step method for training DRL agents to maximize the CPT-value of full-episode rewards accumulated from an environment, rather than the standard practice of maximizing expected discounted rewards. We quantitatively compare the distribution of outcomes when optimizing full-episode expected reward, CPT-value, and conditional value-at-risk (CVaR) in the CrowdSim robot navigation environment, elucidating the impacts of different objectives on the agent's willingness to trade safety for speed. We find that properly-configured maximization of CPT-value allows for a reduction of the frequency of negative outcomes with only a slight degradation of the best outcomes, compared to maximization of expected reward.

1 Introduction

Increasingly impressive demonstrations of machine learning raise increasingly critical questions about its robustness and safety in real-world environments. To inspire trust from humans, machines must be capable of reasoning, acting, and generalizing in alignment with human preferences. In particular, they must be able to adeptly handle rare but consequential scenarios and should incorporate context in their decision-making.

Reinforcement learning (RL) methods that consider edge behaviors are often referred to as risk-sensitive and have become increasingly well-studied as the prospects for real-world deployment of RL have grown. Potential target applications include autonomous vehicles (e.g., cars and planes), autonomous schedulers (e.g., power grids), and financial portfolio management. It is critical that RL-based systems adhere to strict safety standards, particularly when the potential for injury or damage exists. They must be able to integrate seamlessly with humans, anticipating and adjusting to the actions of operators and bystanders beyond the situations in which they were trained. Few present-day systems meet this rigorous standard.

Human preferences in decision-making problems have been widely studied in finance and economics, but have only recently begun to be addressed in earnest by the machine learning community (Jie et al. 2018). As humans have a natural ability to emphasize rare events and incorporate context when assessing risk, having artificial agents mimic human decision-making behaviors could be beneficial. On the other hand, given the known shortcomings of human decision-making (Kahneman and Tversky 1979; Tversky and Kahneman 1992), human imitation could represent a mere stepping stone to more adept risk-handling strategies.

Classical reinforcement learning finds policies for sequential decision-making problems by optimizing expected future rewards. This criterion alone fails to adequately emphasize edge behaviors, rendering it unsuitable for many practical applications. However, numerous approaches exist for addressing edge behaviors through either explicit or implicit means. One widely-used explicit technique is to artificially increase the frequency of known problematic edge cases during training; however, this can lead to performance degradation on more frequently observed scenarios. Another explicit strategy is to apply a risk-sensitive measure during training. Example risk-sensitive measures include exponential utility (Pratt 1964), percentile performance criteria (Wu and Lin 1999), value-at-risk (Leavens 1945), conditional value-at-risk (Rockafellar and Uryasev 2000), prospect theory (Kahneman and Tversky 1979), and cumulative prospect theory (CPT; Tversky and Kahneman (1992)). An implicit strategy incorporates a notion of risk by considering the full value distribution as opposed to the expected value (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a).
In this paper, we extend CPT-RL (Prashanth et al. 2016), which explicitly incorporates cumulative prospect theory into the perceived reward of an RL agent. Our method allows for the application of CPT-RL to agents with deep policy networks, unlocking CPT-based control of more complex problem spaces. More precisely, we demonstrate a methodology for modifying agents trained by conventional deep reinforcement learning (DRL) algorithms to maximize CPT-value instead of expected rewards. We evaluate our method on an enhanced version of the CrowdSim robot navigation environment (Chen et al. 2019). Our results show that the incorporation of CPT allows for fewer negative outcomes with only a slight degradation of the best outcomes. In this case, the CPT-based agent accepts a slight reduction in speed in order to facilitate surer progress.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of common explicit and implicit risk-sensitive methods. In Section 3, we introduce cumulative prospect theory in the context of reinforcement learning, including a CPT-value estimation technique and an efficient high-dimensional stochastic approximation method. In Section 4, we present Deep CPT-RL, an extension of CPT-RL to algorithms that use deep policy networks. In Section 5, we provide computational results to illustrate the benefits of Deep CPT-RL when applied to robot navigation. In Section 6, we provide concluding remarks and directions for future research.
2 Related Work

Both explicit and implicit means for introducing risk sensitivity to artificial agents have been previously investigated. On the explicit side, methods often incorporate risk by distorting the reward function. For instance, exponential utility transforms the expected discounted reward E[Z] to (1/β) log E[exp(βZ)], where β determines the level of risk (this is apparent after applying a straightforward Taylor expansion; risk-averse and risk-seeking behavior translate to β < 0 and β > 0, respectively). Another class of risk-sensitive measures focuses on rare occurrences not captured in the standard expected utility, in order to mitigate possible detrimental outcomes. Value-at-risk at confidence level α (VaR_α) is an α-quantile that focuses on the tail of the distribution, VaR_α(Z) = max{z | P(Z < z) ≤ α}, where α sets the level of risk. Conditional value-at-risk at confidence level α (CVaR_α) considers the expectation of the tail below the α-quantile point VaR_α.
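As a concrete illustration of these measures (not code from the paper), the NumPy sketch below estimates the exponential-utility objective, VaR_α, and CVaR_α from a sample of full-episode returns. The function names and the sampling example are our own assumptions.

```python
import numpy as np

def exponential_utility(returns, beta):
    """Exponential-utility objective (1/beta) * log E[exp(beta * Z)].

    beta < 0 corresponds to risk-averse behavior, beta > 0 to risk-seeking
    (illustrative helper, not from the paper's code base)."""
    z = np.asarray(returns, dtype=float)
    return np.log(np.mean(np.exp(beta * z))) / beta

def value_at_risk(returns, alpha):
    """Sample VaR_alpha: the alpha-quantile of the return distribution."""
    return np.quantile(np.asarray(returns, dtype=float), alpha)

def conditional_value_at_risk(returns, alpha):
    """Sample CVaR_alpha: mean return over the worst alpha fraction of outcomes."""
    z = np.asarray(returns, dtype=float)
    return z[z <= np.quantile(z, alpha)].mean()

# Example with synthetic, heavy-left-tailed returns (placeholder data).
rng = np.random.default_rng(0)
sample = rng.normal(0.4, 0.2, size=5000) - rng.exponential(0.05, size=5000)
print(value_at_risk(sample, 0.25), conditional_value_at_risk(sample, 0.25))
```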
On the implicit side, distributional reinforcement learning (DistRL) approaches risk from a different perspective. Instead of distorting the utility function and optimizing the expectation, DistRL models the value distribution in the optimization process and selects the policy based on its mean. Bellemare, Dabney, and Munos (2017) introduced the first DistRL algorithm, C51, and showed that it outperforms standard Deep Q-Networks (DQN; Mnih et al. (2015)) in some environments. C51 estimates probabilities on a fixed set of uniformly-spaced possible returns, requiring bounds that may not exist in practice. Quantile regression DQN (QR-DQN) overcomes this limitation by estimating the inverse cumulative distribution function (CDF) of equally-spaced quantiles (Dabney et al. 2018b). Implicit quantile networks further improve on QR-DQN by estimating the inverse CDF of random quantiles (Dabney et al. 2018a). Another recent method, Distributional Soft Actor-Critic (Ma et al. 2020), enables application of DistRL to continuous spaces.

Another approach for addressing edge cases is to look to human decision-making for inspiration. Human decision makers preferentially consider rare events and perform admirably on many tasks that RL agents are trained to tackle. The goal in many applications is to maximize agent alignment with human preferences, which may also point to approaches that attempt to mimic human decision-making. One method for incorporating human tendencies into agent behavior is to have the agent maximize the CPT-value function instead of expected future reward (CPT-RL; Prashanth et al. (2016)). CPT, a leading model of human decision-making, is a refined variant of prospect theory that includes a generalization of expected utility and is supported by substantial empirical evidence.

In this paper, we extend CPT-RL to enable application to deep neural network (DNN) policies. Because the gradient-free methods used in CPT-RL do not scale to training DNNs from scratch, we perform initial training to maximize expected reward and update a subset of the resulting optimal weights to maximize the value of the CPT function. Our approach resembles transfer learning at first glance, as the policy is retrained according to a related objective function. However, in our case, a subset of the optimal weights are retained and a subset are re-initialized and re-trained.

3 Background

3.1 Cumulative Prospect Theory

Cumulative prospect theory uses two components to model human behavior: a utility function that quantifies human attitudes toward outcomes and a weight function that quantifies the emphasis placed on different outcomes (Figure 1). The utility function u = (u^+, u^-), where u^± : R → R^+, u^+(x) = 0 for x ≤ 0, and u^-(x) = 0 for x > 0, has two regions separated by a reference point, which we consider to be zero for illustrative purposes. The region to the left of the reference point specifies the utility of losses, while the region to its right specifies the utilities of gains. Humans have empirically been found to be more risk-averse with gains than losses (Tversky and Kahneman 1992), leading to the asymmetric concavities on the two sides of the reference point. The reference point may be static or dynamic; in the case of humans it may change with accumulated experience. In our experiments we consider dynamic reference points for reasons that will be discussed in Section 5.

The probability weight function w = (w^+, w^-), where w^± : [0, 1] → [0, 1], models human consideration of events based on frequency, highlighting both positive and negative rare events relative to more common events (Figure 1(b)). Tversky and Kahneman (1992) recommend the weight function

w^{\pm}(p) = w(p) = \frac{p^{\eta}}{\left(p^{\eta} + (1-p)^{\eta}\right)^{1/\eta}},

while Prelec (1998) suggests w^{\pm}(p) = w(p) = \exp\left(-(-\ln p)^{\eta}\right), where \eta \in (0, 1).

The CPT-value function, defined as

C_{u,w}(X) = \int_0^{+\infty} w^{+}\big(P(u^{+}(X) > z)\big)\,dz - \int_0^{+\infty} w^{-}\big(P(u^{-}(X) > z)\big)\,dz,   (1)

satisfies certain requirements. We consider u and w to be fixed; therefore, we simplify notation henceforth: C_{u,w} → C.

Figure 1: (a) Example CPT utility function. Gains and losses are treated asymmetrically, with behavior towards gains being more risk-averse. (b) Example CPT weight function. Rare outcomes on either side are emphasized, as determined by the slope of w^+(p) (and similarly w^-(p)).

The CPT function (1) is a generalization of expected value. In particular, if we consider w^{\pm}(x) = x, u^{+}(x) = x for x ≥ 0, and u^{-}(x) = -x for x < 0, then

C(X) = \int_0^{+\infty} P(X > z)\,dz - \int_0^{+\infty} P(-X > z)\,dz = \int_0^{+\infty} P(\max(X, 0) > z)\,dz - \int_0^{+\infty} P(\max(-X, 0) > z)\,dz = E[\max(X, 0)] - E[\max(-X, 0)],

where u^+ and u^- are utility functions corresponding to gains (X > 0) and losses (X < 0) with reference point X = 0.
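As an illustration of these two components, the sketch below implements the Tversky-Kahneman and Prelec weight functions and a piecewise-power utility of the shape sketched in Figure 1. The exponent and loss-aversion values (η, α, λ) are commonly cited defaults from the behavioral-economics literature, not parameters reported in this paper.

```python
import numpy as np

def tk_weight(p, eta=0.61):
    """Tversky-Kahneman weight w(p) = p^eta / (p^eta + (1-p)^eta)^(1/eta)."""
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p)**eta) ** (1.0 / eta)

def prelec_weight(p, eta=0.65):
    """Prelec weight w(p) = exp(-(-ln p)^eta) for eta in (0, 1)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return np.exp(-((-np.log(p)) ** eta))

def u_plus(x, alpha=0.88):
    """Gain utility u+(x): zero for x <= 0, concave power of gains (illustrative alpha)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, np.abs(x) ** alpha, 0.0)

def u_minus(x, alpha=0.88, lam=2.25):
    """Loss utility u-(x): zero for x > 0, steeper (loss-averse) magnitude of losses."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, lam * np.abs(x) ** alpha, 0.0)
```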
3.2 CPT-Value Estimation

Prashanth et al. (2016) proposed a provably convergent CPT-value estimate of (1), inspired by quantiles. Their approach is summarized in Algorithm 1. The random reward X is based on a policy π_θ, where θ is a set of controllable parameter values chosen to optimize the CPT function (1). Reward X_i is generated from episode i, where an episode is a trajectory from initial to terminal state guided by actions. Episodes are assumed to be both noisy and expensive to generate. Note that in addition to selecting appropriate utility and weight functions, this approach requires users to choose a suitable sample size, or number of episodes m, to produce a reasonable estimate.

Algorithm 1: CPT-value Estimation (Prashanth et al. 2016)
1. Generate X_1, ..., X_m i.i.d. from the distribution of X.
2. Let
   C_m^{+}(X) = \sum_{i=1}^{m} u^{+}(X_{[i]}) \left[ w^{+}\left(\tfrac{m+1-i}{m}\right) - w^{+}\left(\tfrac{m-i}{m}\right) \right],
   C_m^{-}(X) = \sum_{i=1}^{m} u^{-}(X_{[i]}) \left[ w^{-}\left(\tfrac{i}{m}\right) - w^{-}\left(\tfrac{i-1}{m}\right) \right],
   where X_{[i]} is the ith order statistic.
3. Return C(X) = C_m^{+}(X) - C_m^{-}(X).

3.3 Gradient-Free Stochastic Approximation

Consider the stochastic optimization problem

\max_{\theta \in \Theta} C(X(\theta)),

where C(·) is the CPT-value function (1), Θ is a compact, convex subset of R^d, X(·) is a random reward, and θ = (θ_1, ..., θ_d) is a d-dimensional weight parameter. Stochastic approximation updates the weights θ iteratively using the recursion

\theta_{n+1} = \Gamma\left(\theta_n + \gamma_n \hat{g}(X(\theta_n))\right),

where θ_0 is chosen randomly, Γ(θ) is a projection operator that ensures θ ∈ Θ, γ_n is a diagonal matrix with step sizes for each dimension on the diagonal, and ĝ is an estimate of the true gradient ∇C (Robbins and Monro 1951; Chau and Fu 2015; Chau et al. 2014). Direct (unbiased) gradients are unavailable for CPT-values; however, with the availability of CPT-value estimates, indirect (biased) gradient estimates can be computed. Applicable indirect methods include finite differences (Kiefer and Wolfowitz 1952) and simultaneous perturbation stochastic approximation (SPSA) (Spall 1992).

In this paper, we employ SPSA for its computational efficiency. The kth component of the SPSA gradient is defined as

g_k^{SPSA}(\theta) = \frac{C(\theta + \delta\Delta) - C(\theta - \delta\Delta)}{2\delta\Delta_k}   (2)

for k = 1, ..., d, where δ > 0 and Δ = (Δ_1, ..., Δ_d) has random i.i.d. components with zero mean and finite inverse moments. Each CPT-value estimate is generated from sample rewards based on m episodes, i.e., X_i for i = 1, ..., m. Note that the number of episodes m can vary from one iteration to the next.
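A minimal NumPy sketch of Algorithm 1 and the SPSA estimate (2) follows. It reuses the utility and weight helpers from the previous sketch; the generic spsa_gradient routine, its argument names, and the rollout callback it assumes are our own illustration rather than the authors' implementation.

```python
import numpy as np

def cpt_value_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Quantile-based CPT-value estimate of Eq. (1) (Algorithm 1, Prashanth et al. 2016).

    samples: m i.i.d. full-episode rewards X_1, ..., X_m."""
    x = np.sort(np.asarray(samples, dtype=float))   # order statistics X_[1] <= ... <= X_[m]
    m = len(x)
    i = np.arange(1, m + 1)
    # C_m^+ = sum_i u+(X_[i]) * [w+((m+1-i)/m) - w+((m-i)/m)]
    c_plus = np.sum(u_plus(x) * (w_plus((m + 1 - i) / m) - w_plus((m - i) / m)))
    # C_m^- = sum_i u-(X_[i]) * [w-(i/m) - w-((i-1)/m)]
    c_minus = np.sum(u_minus(x) * (w_minus(i / m) - w_minus((i - 1) / m)))
    return c_plus - c_minus

def spsa_gradient(theta, cpt_of, delta, rng):
    """Two-sided SPSA estimate of grad C(theta), Eq. (2), with Rademacher perturbations.

    cpt_of(theta) is assumed to roll out m episodes under the policy parameterized
    by theta and return a CPT-value estimate (placeholder supplied by the caller)."""
    perturb = rng.choice([-1.0, 1.0], size=theta.shape)
    return (cpt_of(theta + delta * perturb) - cpt_of(theta - delta * perturb)) \
        / (2.0 * delta * perturb)

# Example usage (taking w+ and w- equal for simplicity):
#   c_hat = cpt_value_estimate(episode_rewards, u_plus, u_minus, tk_weight, tk_weight)
```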
4 Deep CPT-RL

The CPT-RL approach described in Section 3 was applied by Prashanth et al. (2016) to successfully optimize a Boltzmann policy and thereby control a relatively simple traffic management scenario. However, CPT-RL does not directly scale to policy mappings with high-dimensional parameter spaces, including deep neural networks. This is because SPSA produces gradient estimates with lower fidelity and higher variance than backpropagation.

To address these issues and thereby extend CPT-RL to deep policy networks, we employ a two-stage approach. First, the deep policy network of an agent is trained using a conventional actor-critic method. The actor-critic formulation was chosen in accordance with the need to learn a policy and the desire to reduce the variance of policy gradient estimates. The lower, input-side layers of the network are then frozen and the upper, output-side layers (one or more) are retrained to maximize the CPT-value using a procedure similar to CPT-RL. Thus the first stage learns a feature representation of the observation space and the second stage learns a policy under the learned representation. By limiting the second-stage updates to the upper layer(s) of the network, we overcome the limitations of SPSA in high-dimensional settings.

More explicitly, Stage 1 aims to find an optimal policy π_{θ*} for choosing actions a_t that maximize expected return. The optimal policy parameters θ* are given by

\theta^{*} = \arg\max_{\theta \in \Theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_t r(s_t, a_t)\right],

where the τ are trajectories drawn from the distribution p_θ(τ) of possible trajectories an agent using policy π_θ may take in the environment, Θ is the feasible policy parameter region, t indicates time, and r(s_t, a_t) is the stochastic reward granted by the environment when action a_t is taken in state s_t. Stage 2 aims to find a policy that maximizes the CPT-value function (1), or in principle any risk-sensitive measure. Unfortunately, unlike the standard expectation of rewards, the CPT-functional is based on cumulative probabilities and is piecewise, precluding direct, sample-based differentiation. No nested structure is assumed, precluding the use of bootstrapping methods. Hence, we generate indirect (biased) gradients from the CPT-value estimates.

In principle, any policy-based DRL algorithm may be applied in the first stage of our procedure. We applied Proximal Policy Optimization (PPO; Schulman et al. (2017)), a leading actor-critic approach, to train our agents to convergence. As previously stated, the lower layers of the network learned in Stage 1 provide a feature representation for the upper layer(s). Therefore, PPO can be thought of as learning a "perception" mapping suitable for the ensuing "action" layers learned by the second stage. In our experiments, we found it sufficient to re-initialize the parameters and tune only the last layer in Stage 2. Further flexibility may be gained through the re-optimization of the last two layers.

The second stage is feasible because 1) the lower layers of a properly-trained initial network tease out the features necessary to produce intelligent agent behavior and 2) the action layer(s) comprise a small fraction of the parameter count of the entire network. We used SPSA to estimate the gradient because of its computational efficiency; in particular, finite differences was not considered due to the sheer number of parameters involved. As in the original CPT-RL formulation, all information required to compute the gradients was derived from batches of episodes. To increase stability, we averaged over gradient estimates generated from multiple perturbations before conducting a network update via the Adam optimizer (Kingma and Ba 2015).

One notable extension to the work by Prashanth et al. (2016) is our consideration of multiple reward terms. While other approaches are possible, we employed the simple strategy of choosing a single reference point based on all terms. We allowed this reference point to vary based on the outcome of a given episode, as required by our experimental setup discussed below. One could alternatively compute a CPT-value for each reward term where it makes sense to weigh risk, using expectations for other terms, but we found this to provide similar performance with additional complexity.

Our approach is outlined in Algorithm 2. For notational simplicity, let θ_{-1} ∈ Θ_{-1} and θ_{:-1} ∈ Θ_{:-1} denote the neural network weights in the last layer and in all but the last layer, respectively. The weight parameters θ_{:-1} are fixed in the second stage, so we drop them from the input, X(θ_{:-1} ∪ θ_{-1}) → X(θ_{-1}).

Algorithm 2: Deep CPT-RL

Stage 1.
1. Identify a deep policy network architecture with weight parameters θ.
2. Train the deep policy network to obtain optimal weight parameters θ* that maximize expected return (e.g., using PPO).

Stage 2.
3. Initialize the stopping time N, gradient estimates per update M, SPSA parameters {δ_n}, and Adam parameters α > 0, β ∈ (0, 1), ε > 0.
4. Fix θ̃_{:-1} = θ*_{:-1} and re-initialize θ̃_{-1} such that θ̃_{-1;i} ∼ Uniform(-|θ_{-1}|^{-1/2}, |θ_{-1}|^{-1/2}) for i = 1, ..., |θ̃_{-1}|, where θ̃_{-1} = (θ̃_{-1;1}, ..., θ̃_{-1;|θ̃_{-1}|}).
5. For i = 0, ..., N - 1:
   - For j = 0, ..., M - 1:
     * Generate Δ = {Δ_k}_{k=1}^{|θ̃_{-1}|}, where Δ_k = ±1 with probability 0.5, independently for all k.
     * Set θ̃_{-1}^{±} = θ̃_{-1} ± δ_i Δ and generate X_j(θ̃_{-1}^{±}).
     * Generate C(X_j(θ̃_{-1}^{+})) and C(X_j(θ̃_{-1}^{-})).
     * Compute g_j^{SPSA}(θ̃_{-1}) using (2).
   - Average {g_j^{SPSA}}_{j=0}^{M-1} and use the result for the Adam update G_i.
   - Update θ̃_{-1} ← θ̃_{-1} + G_i.
6. Return θ̃^{CPT} = θ̃_{:-1} ∪ θ̃_{-1}.
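The sketch below illustrates the Stage 2 loop for a last layer flattened into a single weight vector. Freezing the lower layers, the uniform re-initialization scale, SPSA perturbations, gradient averaging, and the Adam-style ascent step follow Algorithm 2; the rollout callback cpt_of, the hyperparameter values, and the explicit Adam implementation are placeholders we introduce for illustration.

```python
import numpy as np

def deep_cpt_stage2(cpt_of, n_last, num_updates=1500, grads_per_update=4,
                    delta=0.05, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Stage 2 of Deep CPT-RL (sketch): retrain only the last-layer weights with
    SPSA gradient estimates of the CPT-value and Adam-style ascent updates.

    cpt_of(theta_last) -> CPT-value estimate from a batch of episodes rolled out
    with the frozen lower layers and last-layer weights theta_last (placeholder).
    n_last: number of last-layer parameters. Hyperparameter values are illustrative."""
    rng = np.random.default_rng(0)
    bound = 1.0 / np.sqrt(n_last)
    theta = rng.uniform(-bound, bound, size=n_last)   # step 4: re-initialize last layer
    m_t, v_t = np.zeros(n_last), np.zeros(n_last)     # Adam moment estimates
    for i in range(1, num_updates + 1):
        grads = []
        for _ in range(grads_per_update):             # multiple SPSA perturbations
            perturb = rng.choice([-1.0, 1.0], size=n_last)
            g = (cpt_of(theta + delta * perturb) - cpt_of(theta - delta * perturb)) \
                / (2.0 * delta * perturb)
            grads.append(g)
        g_bar = np.mean(grads, axis=0)                # average the SPSA estimates
        m_t = beta1 * m_t + (1 - beta1) * g_bar       # Adam moments on the ascent gradient
        v_t = beta2 * v_t + (1 - beta2) * g_bar**2
        m_hat, v_hat = m_t / (1 - beta1**i), v_t / (1 - beta2**i)
        theta = theta + lr * m_hat / (np.sqrt(v_hat) + eps)   # gradient ascent step G_i
    return theta
```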
5 Numerical Experiments

To quantify the impact of optimizing the CPT-value, we applied our two-step procedure to an enhanced version of the CrowdSim robotic navigation environment (Chen et al. 2019). CrowdSim allows for explicit evaluation of an agent's risk-taking behavior in the form of its willingness to trade safety for speed.

5.1 CrowdSim Environment

CrowdSim was developed to study social mobility; that is, to train robotic agents to avoid collisions with pedestrians while moving toward a goal. In the default configuration, pedestrians follow the Optimal Reciprocal Collision Avoidance (ORCA; van den Berg et al. (2011)) protocol to avoid other pedestrians while traversing to their own goals. An episode concludes when the goal is reached, the robot collides with a pedestrian, or the system times out. We took the robot to be invisible to the pedestrians in order to provide a more challenging test case. To facilitate our experiments, we made a few changes to the published environment. These changes were broadly designed to make the scenario more realistic, facilitate efficient learning, and provide rotational invariance. We closed the simulation, enforcing elastic collisions when an agent reaches an outer wall. We kept pedestrians in constant motion, assigning them a new goal destination when they reached a target. While the original version fixed the initial position of the robot and its goal, we randomly placed them opposite each other on a circle centered at the origin.

In addition to environmental modifications, we adjusted the observations and rewards received by the agents. We replaced the feature-based representation with grayscale, pixel-based observations. This allowed us to encode the geometry of the system more naturally, enabling the learning of rotation-invariant navigation policies that were challenging to produce with the original, feature-based representation. We modified the published reward function to 1) allow removal of the initial imitation learning step used in Chen et al. (2019) and 2) explicitly model a speed-safety trade-off. Removal of the imitation learning was desired to allow quantification of CPT-based shaping of an RL agent trained tabula rasa, as well as to prevent our training from requiring more than two phases. It was enabled through the use of a progress term that encouraged movement in the direction of the goal. The speed-safety tradeoff came from a small constant penalty assessed at each step, encouraging the agent to reach the goal quickly. In sum, the reward function was formulated as

r_t = -C_{\text{time}} + C_{\text{progress}} (d_{t-1} - d_t),   (3)

where d_t is the distance between the robot and the goal at time t (i.e., d_{t-1} - d_t is the "progress" made in the timestep), C_time > 0 is the time penalty, and C_progress > 0 is the progress reward. We set C_time = 0.02 and C_progress = 0.1, with the total distance from the agent's starting position to the target being 10. Our choice of C_progress provided a maximal progress contribution of 1 per episode; C_time was chosen to provide a meaningful fraction of that value for episodes of standard duration. An episode ends when one of three things happens: the robot reaches the goal, the robot collides with a pedestrian, or a timeout occurs. Hence, even though there is no explicit collision penalty, the agent is incentivized to avoid collisions because a collision precludes the possibility of accumulating more progress reward.

Our agent was configured to choose from 33 different motions: remaining at rest, or moving at one of 4 speeds (0.25 m/s, 0.5 m/s, 0.75 m/s, 1.0 m/s) along one of 8 evenly-spaced angles in [0, 2π). Both the robot and the pedestrians were configured to move at a preferred speed of v_pref = 1.0 m/s. We considered 10 pedestrians in each training episode, allowing that number to vary in testing to evaluate "out-of-distribution" performance. This configuration was chosen with the goal of exploring challenging regimes (where the agent would not always be able to avoid collisions) in order to generate meaningful comparisons amongst methods.
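A small sketch of the modified reward (3) and the 33-motion action set described above is given below; it is meant only to make the stated constants concrete, and the velocity-vector encoding of the actions is our own assumption.

```python
import numpy as np

C_TIME, C_PROGRESS = 0.02, 0.1   # time penalty and progress reward from Eq. (3)

def step_reward(d_prev, d_curr):
    """Per-step reward r_t = -C_time + C_progress * (d_{t-1} - d_t)."""
    return -C_TIME + C_PROGRESS * (d_prev - d_curr)

# 33 discrete motions: remain at rest, plus 4 speeds at 8 evenly-spaced headings,
# encoded here as (vx, vy) velocity vectors (an illustrative representation).
SPEEDS = [0.25, 0.5, 0.75, 1.0]                  # m/s
ANGLES = [2 * np.pi * k / 8 for k in range(8)]   # radians in [0, 2*pi)
ACTIONS = [(0.0, 0.0)] + [(v * np.cos(a), v * np.sin(a)) for v in SPEEDS for a in ANGLES]
assert len(ACTIONS) == 33
```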
5.2 Network Architecture

In our experiments, we used a convolutional neural network that closely resembles the architecture commonly used for Atari (Mnih et al. 2015). However, our input images were 50% larger in each direction than those used by Mnih et al. (2015) to ensure proper representation of agent and goal edges. To encode motion, 4 frames were stacked. The input to the network thus consisted of 126 × 126 × 4 image arrays, which were passed to 3 consecutive convolutional layers. These layers had 32, 64, and 64 channels (input-side to output-side), kernel sizes of 12, 4, and 3, and stride lengths of 6, 2, and 1. The layers were separated by rectified linear unit (ReLU) nonlinearities. The output of the last convolutional layer was passed to a rectified, 512-dimensional fully-connected layer. The resulting output was used as the input to a policy head of dimension 33 and a scalar value head.
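A PyTorch sketch consistent with the architecture described above is shown below. The paper does not name its framework, so the PyTorch implementation, the layer names, and the flattened dimension (64 × 7 × 7, implied by the stated kernels and strides on a 126 × 126 input) are our reconstruction rather than the authors' code.

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Atari-style CNN for 126x126x4 grayscale stacks with policy and value heads."""
    def __init__(self, num_actions=33):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=12, stride=6), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),   # -> 64 x 7 x 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                   # 512-d embedding
        )
        self.policy_head = nn.Linear(512, num_actions)  # logits over the 33 motions
        self.value_head = nn.Linear(512, 1)             # scalar state value

    def forward(self, frames):
        # frames: (batch, 4, 126, 126) stacked grayscale observations
        z = self.features(frames)
        return self.policy_head(z), self.value_head(z)
```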
5.3 Stage 1 Training

The neural networks governing our navigation agents were first trained to convergence using PPO. Hyperparameters were chosen to mimic those used for Atari in Schulman et al. (2017), except with shorter windows (16 steps) and more windows per batch (64).

Figure 2: Stage 1 training of the agent using PPO (reward versus training frames).

5.4 Stage 2 Training

After Stage 1 training, we trained our agent to maximize the CPT-value under Algorithm 2. For comparison purposes, we also used the same procedure to optimize for full-episode average reward (AVG) and full-episode conditional value-at-risk (CVaR). CVaR was included to evaluate a standard risk-sensitive measure, focused on improving the worst agent outcomes. While the standard practice is to consider the bottom 5% of the distribution for CVaR, we used the bottom 25% in an effort to maximize performance over a broader range of the distribution. We chose hyperparameters for the SPSA gradient estimation and the Adam optimizer in line with standard guidance (Spall 1992; Kingma and Ba 2015).

One critical detail of CPT-based training is the selection of the reference point. Our variable reference point consists of two terms, corresponding to the two terms in the reward function (3). We chose a fixed contribution of p_ref = 0.5 from the progress term, corresponding to the agent making it halfway from the start to the goal before a collision. The contribution from time varies with episode progress:

x_{\text{ref}} = p_{\text{ref}} + \frac{p(T)}{p_{\max}} \, t_{\text{ref,max}},   (4)

where p(T)/p_max is the fractional progress toward the goal made by the agent over the whole episode ending at T, and t_{ref,max} = -0.3 is a reference time multiplied by the time penalty scaling factor C_time in (3). Varying the reference point in this manner correlates the "acceptable" amount of time an agent may take in an episode with the amount of progress it makes. It prevents an agent from being unduly rewarded for quickly running into a pedestrian, resulting in a small episode time.

Figure 3: Stage 2 training of the agent using CVaR, AVG, and CPT-based objectives (reward versus top-layer updates).
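To make the reference-point rule (4) concrete, the sketch below computes x_ref for an episode and shifts the full-episode reward by it before CPT evaluation. The shifting step reflects our reading of how a reference point enters the utility in (1) and, like the helper names, should be treated as an assumption rather than the authors' exact procedure.

```python
P_REF = 0.5          # fixed progress contribution (halfway to the goal)
T_REF_MAX = -0.3     # reference time scaled by C_time, from Eq. (4)

def reference_point(progress_at_T, max_progress=1.0):
    """x_ref = p_ref + (p(T) / p_max) * t_ref_max, Eq. (4)."""
    return P_REF + (progress_at_T / max_progress) * T_REF_MAX

def centered_episode_reward(episode_reward, progress_at_T):
    """Shift the full-episode reward by the episode's reference point so that gains
    and losses in the CPT utility are measured relative to x_ref (our reading)."""
    return episode_reward - reference_point(progress_at_T)
```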
5.5 Results

The learning curves for training in Stages 1 and 2 are shown in Figures 2 and 3, respectively. For subsequent analysis, the same amount of training data and the same number of Stage 2 network updates (1500; chosen for convergence) were used to compare the agents that maximized average reward, CPT-value, and CVaR. Figure 4a shows the reward distributions earned by each agent over 5000 test runs, each with 10 pedestrians. To investigate the impact of more challenging "out-of-distribution" test cases, we also evaluated the same networks on scenarios with 11-15 pedestrians sampled uniformly, as illustrated in Figure 4b. Statistics describing each test run are provided in Table 1. Note that different random seeds were used for each of the 5000 test runs, but these seeds were the same across test conditions.

Figure 4: (a) Histogram of full-episode rewards for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode rewards for agents in test scenarios with 11-15 pedestrians.

Table 1: Reward distribution statistics for testing of Stage 2 agents. Reported error bars are standard error.

| Method | Mean (10 ped.) | Median (10 ped.) | 0.01-quantile (10 ped.) | Mean (11-15 ped.) | Median (11-15 ped.) | 0.01-quantile (11-15 ped.) |
| CVaR | 0.262 ± 0.002 | 0.260 | −0.029 | 0.239 ± 0.002 | 0.232 | −0.020 |
| AVG | 0.418 ± 0.003 | 0.414 | −0.172 | 0.358 ± 0.003 | 0.320 | −0.066 |
| CPT | 0.432 ± 0.003 | 0.461 | −0.085 | 0.375 ± 0.003 | 0.360 | −0.123 |

Figure 4 displays quantitative differences in the testing performance of the three agents. The CVaR approach focuses exclusively on the lower region of performance. As such, it does the best at removing lower outliers, as evidenced by the 1% quantiles in Table 1. However, it does not make any effort to enhance performance far away from the worst case and therefore cannot compete with the other two methods in any other region. The maximization of CPT-value and average reward led to similar reward distributions. However, CPT was seen to reduce the frequency of poor outcomes compared to AVG, at the cost of a slight reduction in top-end performance. Intuitively, this behavior was to be expected. As shown in Figure 1a, CPT penalizes negative outcomes that are reasonably close to the reference point more harshly than AVG while assigning slightly less utility to the very best outcomes. The weighting from Figure 1b may have played a small role in the improved performance on both edges of the distribution. The differences between the CPT and AVG agents were preserved when presented with more challenging test scenarios, as illustrated in Figures 4a and 4b.

Looking more closely at the CPT and AVG agents, we see that the former averages more progress before a collision and reaches the goal more often because it is more deliberate. Figure 5 shows the distribution of average velocities v̄ of the agent's movement toward the goal over the test runs, defined by

\bar{v} = \frac{d_0 - d_T}{T},   (5)

where d_t is the distance from the agent to the goal at time t, as previously defined, and T is the duration of a given episode. As can be seen from Table 2, this slight reduction in speed allows the CPT-based agent to, on average, make more progress toward the goal before a collision than the other agents. It also allows the agent to reach the target without a collision more often, as reflected by the "success" fractions given in Table 2.

Figure 5: (a) Histogram of full-episode average velocities for agents in 10-pedestrian test scenarios. (b) Histogram of full-episode average velocities for agents in test scenarios with 11-15 pedestrians.

Table 2: Agent behavior statistics during testing. Reported error bars are the standard error, and the "success" quantity is the fraction of the time the agent reaches the goal without a collision.

10 Pedestrians:
| Method | Progress | Time | Velocity | Success |
| CVaR | 4.35 ± 0.03 | 8.61 ± 0.10 | 0.617 ± 0.003 | 0.028 |
| AVG | 6.23 ± 0.04 | 10.26 ± 0.09 | 0.661 ± 0.002 | 0.294 |
| CPT | 6.63 ± 0.04 | 11.52 ± 0.10 | 0.631 ± 0.002 | 0.343 |

11-15 Pedestrians:
| Method | Progress | Time | Velocity | Success |
| CVaR | 3.81 ± 0.03 | 7.14 ± 0.08 | 0.621 ± 0.003 | 0.018 |
| AVG | 5.39 ± 0.04 | 9.07 ± 0.09 | 0.640 ± 0.002 | 0.188 |
| CPT | 5.86 ± 0.04 | 10.55 ± 0.10 | 0.605 ± 0.002 | 0.240 |

6 Conclusions and Future Research

We have developed a method to shape the full-episode reward distributions of DRL agents through the maximization of quantities besides expected reward. Our two-step approach has been shown to enable different agent behaviors when maximizing CPT-value, CVaR, and full-episode expected reward, both qualitatively and quantitatively. In particular, the CPT-based agents in our navigation experiments were seen to process risk differently than the AVG agents, on average proceeding more deliberately and thereby making more progress toward the goal before a collision. While CVaR is a standard risk-sensitive measure, we found that optimizing it produced agents unable to compete with the others above the very bottom of the reward distribution, even at the 25% level.

To further this work, we intend to investigate the impact of adjusting the tunable parameters of the CPT function on the risk-taking behavior of agents. In working toward assured operation, we plan to explore the combination of our methods with techniques for constrained RL (Achiam et al. 2017). Finally, to gain a deeper understanding of the effects of our methods, we also plan to apply them to more complex experimental testbeds.

7 Acknowledgments

The authors thank Ashley Llorens for insightful technical discussions and for his leadership of the Johns Hopkins Institute for Assured Autonomy (JHU IAA) project that supported our experimentation and analysis. This project was conducted under funding from both JHU/APL Internal Research and Development and JHU IAA.

References

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv:1705.10528.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A Distributional Perspective on Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 449–458.

Chau, M.; and Fu, M. C. 2015. An Overview of Stochastic Approximation, 149–178. Springer.

Chau, M.; Fu, M. C.; Qu, H.; and Ryzhov, I. O. 2014. Simulation Optimization: A Tutorial Overview and Recent Developments in Gradient-based Methods. In Proceedings of the 2014 Winter Simulation Conference, 21–35.

Chen, C.; Liu, Y.; Kreiss, S.; and Alahi, A. 2019. Crowd-Robot Interaction: Crowd-Aware Robot Navigation With Attention-Based Deep Reinforcement Learning. In 2019 International Conference on Robotics and Automation, 6015–6022.

Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 1096–1105.

Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional Reinforcement Learning With Quantile Regression. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2892–2901.

Jie, C.; Prashanth, L.; Fu, M. C.; Marcus, S.; and Szepesvári, C. 2018. Stochastic Optimization in a Cumulative Prospect Theory Framework. IEEE Transactions on Automatic Control 63: 2867–2882.

Kahneman, D.; and Tversky, A. 1979. Prospect Theory: An Analysis of Decision under Risk. Econometrica 47(2): 263–291.

Kiefer, J.; and Wolfowitz, J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462–466.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations.

Leavens, D. H. 1945. Diversification of Investments. Trusts and Estates 80: 469–473.

Ma, X.; Xia, L.; Zhou, Z.; Yang, J.; and Zhao, Q. 2020. DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning. arXiv:2004.14547.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level Control Through Deep Reinforcement Learning. Nature 518(7540): 529–533.

Prashanth, L.; Jie, C.; Fu, M. C.; Marcus, S. I.; and Szepesvári, C. 2016. Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control. In Proceedings of the 33rd International Conference on Machine Learning, 1406–1415.

Pratt, J. W. 1964. Risk Aversion in the Small and in the Large. Econometrica 32: 122–136.

Prelec, D. 1998. The Probability Weighting Function. Econometrica 66: 497–527.

Robbins, H.; and Monro, S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3): 400–407.

Rockafellar, R. T.; and Uryasev, S. 2000. Optimization of Conditional Value-at-Risk. Journal of Risk 2: 21–41.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Spall, J. C. 1992. Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control 37(3): 332–341.

Tversky, A.; and Kahneman, D. 1992. Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty 5(4): 297–323.

van den Berg, J.; Guy, S.; Lin, M.; and Manocha, D. 2011. Reciprocal n-Body Collision Avoidance. In Robotics Research, volume 70 of Springer Tracts in Advanced Robotics, 3–19. Springer.

Wu, C.; and Lin, Y. 1999. Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values. Journal of Mathematical Analysis and Applications 231: 47–67.