<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jared Markowitz</string-name>
          <email>Jared.Markowitz@jhuapl.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie Chau</string-name>
          <email>Marie.Chau@jhuapl.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I-Jeng Wang</string-name>
          <email>I-Jeng.Wang@jhuapl.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johns Hopkins University Applied Physics Laboratory 11000 Johns Hopkins Road Laurel</institution>
          ,
          <addr-line>Maryland 20723</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current deep reinforcement learning (DRL) methods fail to address risk in an intelligent manner, potentially leading to unsafe behaviors when deployed. One strategy for improving agent risk management is to mimic human behavior. While imperfect, human risk processing displays two key benefits absent from standard artificial agents: accounting for rare but consequential events and incorporating context. The former ability may prevent catastrophic outcomes in unfamiliar settings while the latter results in asymmetric processing of potential gains and losses. These two attributes have been quantified by behavioral economists and form the basis of cumulative prospect theory (CPT), a leading model of human decision-making. We introduce a two-step method for training DRL agents to maximize the CPT-value of full-episode rewards accumulated from an environment, rather than the standard practice of maximizing expected discounted rewards. We quantitatively compare the distribution of outcomes when optimizing full-episode expected reward, CPT-value, and conditional value-at-risk (CVaR) in the CrowdSim robot navigation environment, elucidating the impacts of different objectives on the agent's willingness to trade safety for speed. We find that properly-configured maximization of CPT-value allows for a reduction of the frequency of negative outcomes with only a slight degradation of the best outcomes, compared to maximization of expected reward.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Increasingly impressive demonstrations of machine learning
raise increasingly critical questions about its robustness and
safety in real-world environments. To inspire trust from
humans, machines must be capable of reasoning, acting, and
generalizing in alignment with human preferences. In
particular, they must be able to adeptly handle rare but
consequential scenarios and should incorporate context in their
decision-making.</p>
      <p>Reinforcement learning (RL) methods that consider edge
behaviors are often referred to as risk-sensitive and have
become increasingly well-studied as the prospects for
real-world deployment of RL have grown. Potential target
applications include autonomous vehicles (e.g., cars and planes),
autonomous schedulers (e.g., power grids), and financial
portfolio management. It is critical that RL-based systems
adhere to strict safety standards, particularly when the
potential for injury or damage exists. They must be able to
integrate seamlessly with humans, anticipating and adjusting
to the actions of operators and bystanders beyond the
situations in which they were trained. Few present-day systems
meet this rigorous standard.</p>
      <p>
        Human preferences in decision-making problems have
been widely studied in finance and economics, but have only
recently begun to be addressed in earnest by the machine
learning community
        <xref ref-type="bibr" rid="ref9">(Jie et al. 2018)</xref>
        . As humans have a
natural ability to emphasize rare events and incorporate context
when assessing risk, having artificial agents mimic human
decision-making behaviors could be beneficial. On the other
hand, given the known shortcomings of human
decision-making
        <xref ref-type="bibr" rid="ref10 ref26">(Kahneman and Tversky 1979; Tversky and
Kahneman 1992)</xref>
        , human imitation could represent a mere stepping
stone to more adept risk-handling strategies.
      </p>
      <p>
        Classical reinforcement learning finds policies for
sequential decision-making problems by optimizing expected
future rewards. This criterion alone fails to adequately
emphasize edge behaviors, rendering it unsuitable for many
practical applications. However, numerous approaches
exist for addressing edge behaviors through either explicit or
implicit means. One widely-used explicit technique is to
artificially increase the frequency of known problematic edge
cases during training; however, this can lead to performance
degradation on more frequently observed scenarios. Another
explicit strategy is to apply a risk-sensitive measure during
training. Example risk-sensitive measures include
exponential utility
        <xref ref-type="bibr" rid="ref18">(Pratt 1964)</xref>
        , percentile performance criteria (Wu
and Lin 1999), value-at-risk
        <xref ref-type="bibr" rid="ref13">(Leavens 1945)</xref>
        , conditional
value-at-risk
        <xref ref-type="bibr" rid="ref22">(Rockafellar and Uryasev 2000)</xref>
        , prospect
theory
        <xref ref-type="bibr" rid="ref10">(Kahneman and Tversky 1979)</xref>
        , and cumulative prospect
theory (CPT;
        <xref ref-type="bibr" rid="ref26">Tversky and Kahneman (1992)</xref>
        ). An implicit
strategy incorporates a notion of risk by considering the full
value distribution as opposed to the expected value
        <xref ref-type="bibr" rid="ref2 ref6">(Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a)</xref>
        .
      </p>
      <p>
        In this paper, we extend CPT-RL
        <xref ref-type="bibr" rid="ref17">(Prashanth et al. 2016)</xref>
        ,
which explicitly incorporates cumulative prospect theory
into the perceived reward of an RL agent. Our method allows
for the application of CPT-RL to agents with deep policy
networks, unlocking CPT-based control of more complex
problem spaces. More precisely, we demonstrate a
methodology for modifying agents trained by conventional deep
reinforcement learning (DRL) algorithms to maximize
CPT-value instead of expected rewards. We evaluate our method
on an enhanced version of the CrowdSim robot navigation
environment
        <xref ref-type="bibr" rid="ref5">(Chen et al. 2019)</xref>
        . Our results show that the
incorporation of CPT allows for fewer negative outcomes with
only a slight degradation of the best outcomes. In this case,
the CPT-based agent accepts a slight reduction in speed in
order to facilitate surer progress.
      </p>
      <p>The remainder of this paper is organized as follows. In
Section 2, we provide an overview of common explicit
and implicit risk-sensitive methods. In Section 3, we
introduce cumulative prospect theory in the context of
reinforcement learning, including a CPT-value estimation technique
and an efficient high-dimensional stochastic approximation
method. In Section 4, we present Deep CPT-RL, an
extension of CPT-RL to algorithms that use deep policy networks.
In Section 5, we provide computational results to illustrate
the benefits of Deep CPT-RL when applied to robot
navigation. In Section 6, we provide concluding remarks and
directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>Both explicit and implicit means for introducing risk sensitivity to artificial agents have been previously investigated. On the explicit side, methods often incorporate risk by distorting the reward function. For instance, exponential utility transforms the expected discounted reward E[Z] to (1/β) log E[exp(βZ)], where β determines the level of risk.¹ Another class of risk-sensitive measures focuses on rare occurrences not captured in the standard expected utility, in order to mitigate possible detrimental outcomes. Value-at-risk at confidence level α (VaR_α) is an α-quantile that focuses on the tail distribution, VaR_α(Z) = max{z | P(Z &lt; z) ≤ α}, where α sets the level of risk. Conditional value-at-risk at confidence level α (CVaR_α) considers the expectation of the tail below the α-quantile point VaR_α.</p>
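      <p>For concreteness, both tail measures can be estimated directly from a batch of sampled full-episode returns, as in the sketch below; the helper name, the confidence level, and the simulated returns are illustrative assumptions rather than part of any cited formulation.</p>
      <preformat>
import numpy as np

def var_cvar(returns, alpha=0.25):
    """Empirical VaR and CVaR of a sample of returns at confidence level alpha.

    VaR_alpha is (approximately) the alpha-quantile of the return distribution;
    CVaR_alpha is the mean of the returns at or below that quantile.
    """
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, alpha)            # empirical alpha-quantile
    tail = returns[returns &lt;= var]               # outcomes at or below VaR_alpha
    cvar = tail.mean() if tail.size else var     # expectation of the lower tail
    return var, cvar

# Example: estimate both measures from 10,000 simulated episode returns.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=0.5, size=10_000)
print(var_cvar(sample, alpha=0.25))
      </preformat>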
      <p>
        On the implicit side, distributional reinforcement
learning (DistRL) approaches risk from a different perspective.
Instead of distorting the utility function and optimizing the
expectation, DistRL models the value distribution in the
optimization process and selects the policy based on its mean.
Bellemare, Dabney, and Munos (2017) introduced the first
DistRL algorithm, C51, and showed that it outperforms
standard Deep Q-Networks (DQN;
        <xref ref-type="bibr" rid="ref16">Mnih et al. (2015)</xref>
        ) in some
environments. C51 estimates probabilities on a fixed set
of uniformly-spaced possible returns, requiring bounds that
may not exist in practice. Quantile regression DQN
(QR-DQN) overcomes this limitation by estimating the inverse
cumulative distribution function (CDF) of equally-spaced
quantiles
        <xref ref-type="bibr" rid="ref6">(Dabney et al. 2018b)</xref>
        . Implicit quantile networks
further improve on QR-DQN by estimating the inverse CDF
of random quantiles
        <xref ref-type="bibr" rid="ref6">(Dabney et al. 2018a)</xref>
        . Another recent
method, Distributional Soft Actor-Critic
        <xref ref-type="bibr" rid="ref14">(Ma et al. 2020)</xref>
        ,
enables application of DistRL to continuous spaces.
      </p>
      <p>¹This is apparent after applying a straightforward Taylor expansion; risk-averse and risk-seeking behavior translate to β &lt; 0 and β &gt; 0, respectively.</p>
      <p>
        Another approach for addressing edge cases is to look
to human decision-making for inspiration. Human decision
makers preferentially consider rare events and perform
admirably on many tasks that RL agents are trained to tackle.
The goal in many applications is to maximize agent
alignment with human preferences, which may also point to
approaches that attempt to mimic human decision-making.
One method for incorporating human tendencies into agent
behavior is to have the agent maximize the CPT-value
function instead of expected future reward (CPT-RL;
        <xref ref-type="bibr" rid="ref17">Prashanth
et al. (2016)</xref>
        ). CPT, a leading model of human
decision-making, is a refined variant of prospect theory that includes
a generalization of expected utility and is supported by
substantial empirical evidence.
      </p>
      <p>In this paper, we extend CPT-RL to enable application to
deep neural network (DNN) policies. Because the
gradient-free methods used in CPT-RL do not scale to training DNNs
from scratch, we perform initial training to maximize
expected reward and update a subset of the resulting optimal
weights to maximize the value of the CPT function. Our
approach resembles transfer learning at first glance, as the policy is retrained according to a related objective function. However, in our case, a subset of the optimal weights is retained while the remainder are re-initialized and re-trained.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Background</title>
      <sec id="sec-3-1">
        <title>3.1 Cumulative Prospect Theory</title>
        <p>
          Cumulative prospect theory uses two components to model human behavior: a utility function that quantifies human attitudes toward outcomes and a weight function that quantifies the emphasis placed on different outcomes (Figure 1). The utility function u = (u^+, u^-), where u^± : R → R_+, u^+(x) = 0 for x ≤ 0, and u^-(x) = 0 for x &gt; 0, has two regions separated by a reference point, which we consider as zero for illustrative purposes. The region to the left of the reference point specifies the utility of losses, while the region to its right specifies the utilities of gains. Humans have empirically been found to be more risk-averse with gains than losses
          <xref ref-type="bibr" rid="ref26">(Tversky and Kahneman 1992)</xref>
          , leading to the asymmetric concavities on the two sides of the reference point. The reference point may be static or dynamic; in the case of humans it may change with accumulated experience. In our experiments we consider dynamic reference points for reasons that will be discussed in Section 5.
        </p>
        <p>
          The probability weight function w = (w^+, w^-), where w^± : [0, 1] → [0, 1], models human consideration of events based on frequency, highlighting both positive and negative rare events relative to more common events (Figure 1(b)).
          <xref ref-type="bibr" rid="ref26">Tversky and Kahneman (1992)</xref>
          recommend the weight function w^±(p) = w(p) = p^η / (p^η + (1 - p)^η)^{1/η}, while
          <xref ref-type="bibr" rid="ref19">Prelec (1998)</xref>
          suggests w^±(p) = w(p) = exp(-(-ln p)^η), where η ∈ (0, 1).
        </p>
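        <p>As an illustration, the sketch below implements one common parameterization of these two components in Python. The functional forms follow Tversky and Kahneman (1992), and the parameter values (α = 0.88, λ = 2.25, η = 0.61) are the median estimates reported there; the present paper does not state which parameter values its experiments used, so these numbers are assumptions for illustration only.</p>
        <preformat>
import numpy as np

# Assumed Tversky-Kahneman parameter values (illustrative only).
ALPHA = 0.88   # curvature of the utility function
LAMBDA = 2.25  # loss aversion: losses loom larger than equal-sized gains
ETA = 0.61     # curvature of the probability weight function

def utility_gain(x):
    """u+(x): utility of outcomes above the reference point (taken as zero)."""
    return np.where(x &gt; 0.0, np.abs(x) ** ALPHA, 0.0)

def utility_loss(x):
    """u-(x): magnitude of the disutility of outcomes below the reference point."""
    return np.where(x &lt; 0.0, LAMBDA * np.abs(x) ** ALPHA, 0.0)

def weight(p):
    """w(p): probability weight that emphasizes rare events at both extremes."""
    p = np.clip(p, 0.0, 1.0)
    return p ** ETA / (p ** ETA + (1.0 - p) ** ETA) ** (1.0 / ETA)
        </preformat>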
        <p>The CPT-value function is defined as
C_{u,w}(X) = ∫_0^∞ w^+(P(u^+(X) &gt; z)) dz - ∫_0^∞ w^-(P(u^-(X) &gt; z)) dz,   (1)
and satisfies certain requirements. We consider u and w to be fixed; therefore, we simplify notation henceforth: C_{u,w} → C.</p>
        <p>[Figure 1: (a) the utility function, with losses (-u^-) to the left of the reference point and gains (u^+) to the right; (b) the probability weight functions w^+(p) and w^-(p) compared to the identity, as functions of the probability p.]</p>
        <p>The CPT function (1) is a generalization of expected value. In particular, if we consider w^±(x) = x and u^+(x) = x for x ≥ 0, u^-(x) = -x for x &lt; 0, then C(X) = ∫_0^∞ P(X &gt; z) dz - ∫_0^∞ P(-X &gt; z) dz = ∫_0^∞ P(max(X, 0) &gt; z) dz - ∫_0^∞ P(max(-X, 0) &gt; z) dz = E[max(X, 0)] - E[max(-X, 0)] = E[X], where u^+ and u^- are utility functions corresponding to gains (X &gt; 0) and losses (X &lt; 0) with reference point X = 0.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 CPT-Value Estimation</title>
        <p>
          <xref ref-type="bibr" rid="ref17">Prashanth et al. (2016)</xref>
          proposed a provably convergent CPT-value estimate of (1), inspired by quantiles. Their approach is summarized in Algorithm 1. The random reward X is based on a policy π_θ, where θ is a set of controllable parameter values chosen to optimize the CPT function (1). Reward X_i is generated from episode i, where an episode is a trajectory from initial to terminal state guided by actions. Episodes are assumed to be both noisy and expensive to generate. Note that in addition to selecting appropriate utility and weight functions, this approach requires users to choose a suitable sample size or number of episodes m to produce a reasonable estimate.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Algorithm 1: CPT-value Estim. (Prashanth et al. 2016)</title>
        <p>1. Generate X_1, …, X_m i.i.d. from the distribution of X.
2. Let
C_m^+(X) = Σ_{i=1}^{m} u^+(X_[i]) [ w^+((m + 1 - i)/m) - w^+((m - i)/m) ],
C_m^-(X) = Σ_{i=1}^{m} u^-(X_[i]) [ w^-(i/m) - w^-((i - 1)/m) ],
where X_[i] is the ith order statistic.
3. Return C_m(X) = C_m^+(X) - C_m^-(X).</p>
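        <p>A direct transcription of this estimator is sketched below; the utility and weight functions are passed in as callables (for instance, the illustrative ones sketched in Section 3.1), and the routine itself only carries out the sorting and weight differences of Algorithm 1.</p>
        <preformat>
import numpy as np

def cpt_value_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """Order-statistics estimate of the CPT-value (1) from m i.i.d. episode rewards.

    samples : full-episode rewards X_1, ..., X_m, already expressed relative to
              the reference point (gains positive, losses negative)
    u_plus  : utility of gains, with u+(x) = 0 for x &lt;= 0
    u_minus : magnitude of the utility of losses, with u-(x) = 0 for x &gt; 0
    w_plus  : probability weight applied to gains
    w_minus : probability weight applied to losses
    """
    x = np.sort(np.asarray(samples, dtype=float))   # X_[1] &lt;= ... &lt;= X_[m]
    m = x.size
    i = np.arange(1, m + 1)

    # Gains:  sum_i u+(X_[i]) * [w+((m+1-i)/m) - w+((m-i)/m)]
    c_plus = np.sum(u_plus(x) * (w_plus((m + 1 - i) / m) - w_plus((m - i) / m)))
    # Losses: sum_i u-(X_[i]) * [w-(i/m) - w-((i-1)/m)]
    c_minus = np.sum(u_minus(x) * (w_minus(i / m) - w_minus((i - 1) / m)))

    return c_plus - c_minus
        </preformat>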
      </sec>
      <sec id="sec-3-4">
        <title>3.3 Gradient-Free Stochastic Approximation</title>
        <p>Consider the stochastic optimization problem
max_{θ ∈ Θ} C(X(θ)),
where C(·) is the CPT-value function (1), Θ is a compact, convex subset of R^d, X(θ) is a random reward, and θ = (θ_1, …, θ_d) is a d-dimensional weight parameter. Stochastic approximation updates the weights iteratively using the recursion
θ_{n+1} = Γ(θ_n + γ_n ĝ(X(θ_n))),
where θ_0 is chosen randomly, Γ(·) is a projection operator that ensures θ ∈ Θ, γ_n is a diagonal matrix with step sizes for each dimension on the diagonal, and ĝ is an estimate of the true gradient ∇C
          <xref ref-type="bibr" rid="ref12 ref21 ref3 ref4">(Robbins and Monro 1951; Chau and Fu 2015; Chau et al. 2014)</xref>
          . Direct (unbiased) gradients are unavailable for CPT-values; however, with the availability of CPT-value estimates, indirect (biased) gradient estimates can be computed. Applicable indirect methods include finite differences
          <xref ref-type="bibr" rid="ref11">(Kiefer and Wolfowitz 1952)</xref>
          and simultaneous perturbation stochastic approximation (SPSA)
          <xref ref-type="bibr" rid="ref25">(Spall 1992)</xref>
          .
        </p>
        <p>In this paper, we employ SPSA for its computational efficiency. The kth component of the SPSA gradient is defined as
ĝ_k^SPSA(θ) = [ C(θ + δΔ) - C(θ - δΔ) ] / (2δΔ_k),   (2)
for k = 1, …, d, where δ &gt; 0 and Δ = (Δ_1, …, Δ_d) has random i.i.d. components with zero mean and finite inverse moments. Each CPT-value estimate is generated from sample rewards based on m episodes, i.e., X_i for i = 1, …, m. Note that the number of episodes m can vary from one iteration to the next.</p>
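        <p>The sketch below computes one SPSA estimate of the form (2) for a black-box objective; here cpt_of_policy stands in for whatever routine rolls out m episodes at the supplied parameters and returns the Algorithm 1 estimate, and both its name and the perturbation size are assumed placeholders rather than details of the published method.</p>
        <preformat>
import numpy as np

def spsa_gradient(theta, cpt_of_policy, delta=0.05, rng=None):
    """One SPSA estimate of grad C(theta), following (2).

    theta         : 1-D array of the parameters being tuned
    cpt_of_policy : callable mapping a parameter vector to a (noisy) scalar
                    CPT-value estimate obtained from a batch of episodes
    delta         : perturbation size (delta &gt; 0)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Rademacher perturbation: each component is +1 or -1 with probability 0.5,
    # giving zero mean and finite inverse moments as required.
    perturb = rng.choice([-1.0, 1.0], size=theta.shape)
    c_plus = cpt_of_policy(theta + delta * perturb)
    c_minus = cpt_of_policy(theta - delta * perturb)
    return (c_plus - c_minus) / (2.0 * delta * perturb)
        </preformat>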
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Deep CPT-RL</title>
      <p>
        The CPT-RL approach described in Section 3 was applied by
        <xref ref-type="bibr" rid="ref17">Prashanth et al. (2016)</xref>
        to successfully optimize a Boltzmann
policy and thereby control a relatively simple traffic
management scenario. However, CPT-RL does not directly scale to
policy mappings with high-dimensional parameter spaces,
including deep neural networks. This is because SPSA
produces gradient estimates with lower fidelity and higher
variance than backpropagation.
      </p>
      <p>To address these issues and thereby extend CPT-RL to
deep policy networks, we employ a two-stage approach.
First, the deep policy network of an agent is trained using
a conventional actor-critic method. The actor-critic
formulation was chosen in accordance with the need to learn a
policy and the desire to reduce the variance of policy
gradient estimates. Second, the lower, input-side layers of the network are
frozen and the upper, output-side layers (one or more) are
retrained to maximize the CPT-value using a procedure similar
to CPT-RL. Thus the first stage learns a feature
representation of the observation space and the second stage learns a
policy under the learned representation. By limiting the
second stage updates to the upper layer(s) of the network, we
overcome the limitations of SPSA in high-dimensional
settings.</p>
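      <p>The hand-off between the two stages amounts to freezing the feature-extraction layers and re-initializing the output layer, roughly as in the PyTorch sketch below. The attribute names trunk and policy_head are illustrative assumptions about how the network is organized rather than identifiers from the authors' code, and the uniform re-initialization range anticipates step 4 of Algorithm 2 below.</p>
      <preformat>
import math
import torch

def prepare_stage2(policy_net):
    """Freeze the lower (feature) layers and re-initialize the retrained output layer.

    Assumes policy_net exposes a trunk (lower, input-side layers) and a
    policy_head (final linear layer); only the head is updated in Stage 2.
    """
    for param in policy_net.trunk.parameters():
        param.requires_grad_(False)            # Stage 1 features are kept fixed

    head = policy_net.policy_head
    bound = 1.0 / math.sqrt(head.weight.numel())
    with torch.no_grad():                      # re-initialize the retrained layer
        head.weight.uniform_(-bound, bound)
        head.bias.uniform_(-bound, bound)
    return [head.weight, head.bias]            # parameters to be tuned by SPSA
      </preformat>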
      <p>More explicitly, Stage 1 aims to find an optimal policy π_{θ*} for choosing actions a_t that maximize expected return. The optimal policy parameters are given by
θ* = arg max_{θ ∈ Θ} E_{τ ~ p_θ(τ)} [ Σ_t r(s_t, a_t) ],
where the τ are trajectories drawn from the distribution p_θ(τ) of possible trajectories an agent using policy π_θ may take in the environment, Θ is the feasible policy parameter region, t indicates time, and r(s_t, a_t) is the stochastic reward granted by the environment when action a_t is taken in state s_t. Stage 2 aims to find a policy that maximizes the CPT-value function (1), or in principle any risk-sensitive measure. Unfortunately, unlike the standard expectation of rewards, the CPT-functional is based on cumulative probabilities and is defined piecewise, precluding direct, sample-based differentiation. No nested structure is assumed, precluding the use of bootstrapping methods. Hence, we generate indirect (biased) gradients from the CPT-value estimates.</p>
      <p>
        In principle, any policy-based DRL algorithm may be
applied in the first stage of our procedure. We applied
Proximal Policy Optimization (PPO;
        <xref ref-type="bibr" rid="ref23">Schulman et al. (2017)</xref>
        ), a
leading actor-critic approach, to train our agents to
convergence. As previously stated, the lower layers of the network
learned in Stage 1 provide a feature representation for the
upper layer(s). Therefore, PPO can be thought of as
learning a “perception” mapping suitable for the ensuing “action”
layers learned by the second stage. In our experiments, we
found it sufficient to re-initialize the parameters and tune
only the last layer in Stage 2. Further flexibility may be
gained through the re-optimization of the last two layers.
      </p>
      <p>
        The second stage is feasible because 1) the lower layers
of a properly-trained initial network tease out the features
necessary to produce intelligent agent behavior and 2) the
action layer(s) comprise a small fraction of the parameter
count of the entire network. We used SPSA to estimate the
gradient because of its computational efficiency; in
particular, finite differences was not considered due to the sheer
number of parameters involved. As in the original CPT-RL
formulation, all information required to compute the
gradients was derived from batches of episodes. To increase
stability, we averaged over gradient estimates generated from
multiple perturbations before conducting a network update
via the Adam optimizer
        <xref ref-type="bibr" rid="ref12 ref3">(Kingma and Ba 2015)</xref>
        .
      </p>
      <p>
        One notable extension to the work by
        <xref ref-type="bibr" rid="ref17">Prashanth et al.
(2016)</xref>
        is our consideration of multiple reward terms. While
other approaches are possible, we employed the simple
strategy of choosing a single reference point based on all terms.
We allowed this reference point to vary based on the
outcome of a given episode, as required by our experimental
setup discussed below. One could alternatively compute a
CPT-value for each reward term where it makes sense to
weigh risk using expectations for other terms, but we found
this to provide similar performance with additional
complexity.
      </p>
      <p>Our approach is outlined in Algorithm 2. For notational simplicity, let θ_{-1} ∈ Θ_{-1} and θ_{:-1} ∈ Θ_{:-1} denote the weights in the last layer of the neural network and in all but the last layer, respectively. The weight parameters θ_{:-1} are fixed in the second stage, so we drop them from the input, X(θ_{:-1} ∪ θ_{-1}) → X(θ_{-1}).</p>
      <sec id="sec-4-1">
        <title>Algorithm 2: Deep CPT-RL</title>
      </sec>
      <sec id="sec-4-2">
        <title>Stage 1.</title>
        <p>1. Identify a deep policy network architecture with weight parameters θ.
2. Train the deep policy network to obtain optimal weight parameters θ* that maximize expected return (e.g., using PPO).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Stage 2.</title>
        <p>3. Initialize the stopping time N, the number of gradient estimates per update M, the SPSA parameters {δ_n}, and the Adam parameters α &gt; 0, β ∈ (0, 1), ε &gt; 0.
4. Fix θ̃_{:-1} = θ*_{:-1} and re-initialize θ̃_{-1} such that θ̃_{-1,i} ~ Uniform(-|θ_{-1}|^{-1/2}, |θ_{-1}|^{-1/2}) for i = 1, …, |θ̃_{-1}|, where θ̃_{-1} = (θ̃_{-1,1}, …, θ̃_{-1,|θ̃_{-1}|}).
5. For i = 0, …, N - 1:
   For j = 0, …, M - 1:
   – Generate Δ = (Δ_1, …, Δ_{|θ̃_{-1}|}), where Δ_k = ±1 each with probability 0.5, independently for all k.
   – Set θ̃_{-1}^± = θ̃_{-1} ± δ_i Δ and generate X_j(θ̃_{-1}^±).
   – Generate C(X_j(θ̃_{-1}^+)) and C(X_j(θ̃_{-1}^-)).
   – Compute ĝ_j^SPSA(θ̃_{-1}) using (2).
   Average {ĝ_j^SPSA}_{j=0}^{M-1} to obtain G_i and use G_i for an Adam update of θ̃_{-1}.
6. Return θ̃_CPT = θ̃_{:-1} ∪ θ̃_{-1}.</p>
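      <p>A compact sketch of the Stage 2 loop is given below. It treats the re-initialized last-layer weights as a flat vector; rollout_cpt_value is an assumed placeholder for a routine that runs m episodes with the frozen feature layers plus the supplied last-layer weights and returns the Algorithm 1 estimate. The default of 1,500 updates matches the number used for the experiments reported below, the remaining hyperparameter values are illustrative, and the Adam update is written out explicitly rather than taken from the authors' implementation.</p>
      <preformat>
import numpy as np

def stage2_deep_cpt(theta_last, rollout_cpt_value, n_updates=1500, m_perturbations=4,
                    delta=0.05, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, rng=None):
    """Stage 2 of Deep CPT-RL: tune only the last-layer weights with SPSA + Adam."""
    rng = np.random.default_rng() if rng is None else rng
    m1 = np.zeros_like(theta_last)  # Adam first-moment estimate
    m2 = np.zeros_like(theta_last)  # Adam second-moment estimate

    for i in range(1, n_updates + 1):
        grads = []
        for _ in range(m_perturbations):
            perturb = rng.choice([-1.0, 1.0], size=theta_last.shape)
            c_plus = rollout_cpt_value(theta_last + delta * perturb)
            c_minus = rollout_cpt_value(theta_last - delta * perturb)
            grads.append((c_plus - c_minus) / (2.0 * delta * perturb))
        g = np.mean(grads, axis=0)       # averaged SPSA estimate of grad C

        # Adam ascent step on the CPT-value (note the "+" in the final update).
        m1 = beta1 * m1 + (1.0 - beta1) * g
        m2 = beta2 * m2 + (1.0 - beta2) * g * g
        m1_hat = m1 / (1.0 - beta1 ** i)
        m2_hat = m2 / (1.0 - beta2 ** i)
        theta_last = theta_last + lr * m1_hat / (np.sqrt(m2_hat) + eps)

    return theta_last
      </preformat>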
      <p>To quantify the impact of optimizing the CPT-value, we applied our two-step procedure to an enhanced version of the CrowdSim robotic navigation environment
        <xref ref-type="bibr" rid="ref5">(Chen et al. 2019)</xref>
        . CrowdSim allows for explicit evaluation of an agent's risk-taking behavior in the form of its willingness to trade safety for speed.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.1 CrowdSim Environment</title>
        <p>CrowdSim was developed to study social mobility; that is,
to train robotic agents to avoid collisions with pedestrians
while moving toward a goal. In the default configuration,
pedestrians follow the Optimal Reciprocal Collision
Avoidance (ORCA; van den Berg et al. (2011)) protocol to avoid
other pedestrians while traversing to their own goals. An
episode concludes when the goal is reached, the robot
collides with a pedestrian, or the system times out. We took the
robot to be invisible to the pedestrians in order to provide
a more challenging test case. To facilitate our experiments,
we made a few changes to the published environment. These
changes were broadly designed to make the scenario more
realistic, facilitate efficient learning, and provide rotational
invariance. We closed the simulation, enforcing elastic
collisions when an agent reaches an outer wall. We kept
pedestrians in constant motion, assigning them a new goal
destination when they reached a target. While the original version
fixed the initial position of the robot and its goal, we
randomly placed them opposite each other on a circle centered
at the origin.</p>
        <p>
          In addition to environmental modifications, we adjusted the observations and rewards received by the agents. We replaced the feature-based representation with grayscale, pixel-based observations. This allowed us to encode the geometry of the system more naturally, enabling the learning of rotation-invariant navigation policies that were challenging to produce with the original, feature-based representation. We modified the published reward function to 1) allow removal of the initial imitation learning step used in
          <xref ref-type="bibr" rid="ref5">Chen et al. (2019)</xref>
          and 2) explicitly model a speed-safety tradeoff. Removal of the imitation learning was desired to allow quantification of CPT-based shaping of an RL agent trained tabula rasa, as well as to prevent our training from requiring more than two phases. It was enabled through the use of a progress term that encouraged movement in the direction of the goal. The speed-safety tradeoff came from a small constant penalty being assessed at each step, encouraging the agent to get to the goal quickly. In sum, the reward function was formulated as
r(s_t, a_t) = C_progress (d_{t-1} - d_t) - C_time,   (3)
where d_t is the distance between the robot and the goal at time t (i.e., d_{t-1} - d_t is the “progress” made in the timestep), C_time &gt; 0 is the time penalty, and C_progress &gt; 0 is the progress reward. We set C_time = 0.02 and C_progress = 0.1, with the total distance from the agent’s starting position to the target being 10. Our choice of C_progress provided a maximal progress contribution of 1 per episode; C_time was chosen to provide a meaningful fraction of that value for episodes of standard duration. An episode ends when one of three things happens: the robot reaches the goal, the robot collides with a pedestrian, or a timeout occurs. Hence, even though there is no explicit collision penalty, the agent is incentivized to avoid collisions because a collision precludes the possibility of accumulating more progress reward.
        </p>
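        <p>A minimal sketch of this per-step reward, using the constants quoted above, is shown below; the function name and the way distances are supplied are illustrative rather than taken from the modified CrowdSim code.</p>
        <preformat>
C_TIME = 0.02      # constant per-step time penalty
C_PROGRESS = 0.1   # scaling of progress toward the goal

def step_reward(dist_prev, dist_now):
    """Per-step reward (3): reward progress toward the goal, penalize elapsed time.

    dist_prev : robot-to-goal distance at the previous timestep (d_{t-1})
    dist_now  : robot-to-goal distance at the current timestep (d_t)
    """
    progress = dist_prev - dist_now   # positive when moving toward the goal
    return C_PROGRESS * progress - C_TIME

# Over a successful episode with a 10 m start-to-goal distance, the progress
# terms sum to C_PROGRESS * 10 = 1, the maximal contribution noted above.
        </preformat>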
        <p>Our agent was configured to choose from 33 different motions: remaining at rest, or moving at one of 4 speeds (0.25 m/s, 0.5 m/s, 0.75 m/s, 1.0 m/s) at each of 8 evenly-spaced angles in [0, 2π). Both the robot and the pedestrians were configured to move at a preferred speed of v_pref = 1.0 m/s. We considered 10 pedestrians in each training episode, allowing that number to vary in testing to evaluate “out-of-distribution” performance. This configuration was chosen with the goal of exploring challenging regimes (where the agent would not always be able to avoid collisions) in order to generate meaningful comparisons amongst methods.</p>
        <sec id="sec-4-4-1">
          <title>Stage 2 Training</title>
          <p>CVaR (25%)
AVG
CPT
0
200
400</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>5.2 Network Architecture</title>
        <p>
          In our experiments, we used a convolutional neural network that closely resembles the architecture commonly used for Atari
          <xref ref-type="bibr" rid="ref16">(Mnih et al. 2015)</xref>
          . However, our input images were 50% larger in each direction than those used by
          <xref ref-type="bibr" rid="ref16">Mnih et al. (2015)</xref>
          to ensure proper representation of agent and goal edges. To encode motion, 4 frames were stacked. The input to the network thus consisted of 126 × 126 × 4 image arrays, which were passed to 3 consecutive convolutional layers. These layers had 32, 64, and 64 channels (input-side to output-side), kernel sizes of 12, 4, and 3, and stride lengths of 6, 2, and 1. The layers were separated by rectified linear unit (ReLU) nonlinearities. The output of the last convolutional layer was passed to a rectified, 512-dimensional fully-connected layer. The resulting output was used as the input to a policy head of dimension 33 and a scalar value head.
        </p>
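        <p>Under the dimensions stated above, and assuming no padding (which yields 20 × 20, 9 × 9, and 7 × 7 feature maps after the three convolutions), the network can be written in PyTorch roughly as follows; this is a reconstruction for illustration, not the authors' code, and the class and attribute names are invented.</p>
        <preformat>
import torch.nn as nn

class CrowdSimPolicyNet(nn.Module):
    """Atari-style CNN trunk with a 33-way policy head and a scalar value head."""

    def __init__(self, n_actions=33):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=12, stride=6),   # 4x126x126 → 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),   # → 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),   # → 64x7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)  # re-initialized and retrained in Stage 2
        self.value_head = nn.Linear(512, 1)           # used by PPO during Stage 1

    def forward(self, obs):
        # obs: (batch, 4, 126, 126) stack of grayscale frames
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)
        </preformat>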
      </sec>
      <sec id="sec-4-6">
        <title>5.3 Stage 1 Training</title>
        <p>
          The neural networks governing our navigation agents were
first trained to convergence using PPO. Hyperparameters
were chosen to mimic those used for Atari in
          <xref ref-type="bibr" rid="ref23">Schulman et al.
(2017)</xref>
          , except with shorter windows (16 steps) and more
windows per batch (64).
        </p>
      </sec>
      <sec id="sec-4-7">
        <title>5.4 Stage 2 Training</title>
        <p>
          After Stage 1 training, we trained our agent to maximize the
CPT-value under Algorithm 2. For comparison purposes, we
also used the same procedure to optimize for full-episode
average reward (AVG) and full-episode conditional
value-atrisk (CVaR). CVaR was included to evaluate a standard
risksensitive measure, focused on improving the worst agent
outcomes. While the standard practice is to consider the
bottom 5% of the distribution for CVaR, we used the bottom
25% in an effort to maximize performance over a broader
range of the distribution. We chose hyperparameters for the
SPSA gradient estimation and the Adam optimizer in line
with standard guidance
          <xref ref-type="bibr" rid="ref12 ref25 ref3">(Spall 1992; Kingma and Ba 2015)</xref>
          .
        </p>
        <p>One critical detail of CPT-based training is the selection of the reference point. Our variable reference point consists of two terms, corresponding to the two terms in the reward function (3). We chose a fixed contribution of p_ref = 0.5 from the progress term, corresponding to the agent making it halfway from the start to the goal before a collision. The contribution from time varies with episode progress:
x_ref = p_ref + (p(T)/p_max) t_ref,max,   (4)
where p(T)/p_max is the fractional progress toward the goal made by the agent over the whole episode ending at T and t_ref,max = 0.3 is a reference time multiplied by the time penalty scaling factor C_time in (3). Varying the reference point in this manner correlates the “acceptable” amount of time an agent may take in an episode with the amount of progress it makes. It prevents an agent from being unduly rewarded for quickly running into a pedestrian, resulting in a small episode time.</p>
        <p>The learning curves for training in Stages 1 and 2 are shown in Figures 2 and 3, respectively. For subsequent analysis, the same amount of training data and number of Stage 2 network updates (1500; chosen for convergence) were used to compare the agents that maximized average reward, CPT-value, and CVaR. Figure 4a shows the reward distributions earned by each agent over 5000 test runs, each with 10 pedestrians. To investigate the impact of more challenging “out-of-distribution” test cases, we also evaluated the same networks on scenarios with 11-15 pedestrians sampled uniformly, as illustrated in Figure 4b. Statistics describing each test run are provided in Table 1. Note that different random seeds were used for each of the 5000 test runs, but these seeds were the same across test conditions.</p>
        <p>Figure 4 displays quantitative differences in the testing
performance of the three agents. The CVaR approach
focuses exclusively on the lower region of performance. As
such, it does the best at removing lower outliers, as
evidenced by the 1% quantiles in Table 1. However, it does
not make any effort to enhance performance far away from
the worst case and therefore cannot compete with the other
two methods in any other region. The maximization of
CPT-value and average reward led to similar reward distributions.
However, CPT was seen to reduce the frequency of poor
outcomes compared to AVG, at the cost of a slight reduction in
top-end performance. Intuitively, this behavior was to be
expected. As shown in Figure 1a, CPT penalizes negative
outcomes that are reasonably close to the reference point more
harshly than AVG while assigning slightly less utility to the
very best outcomes. The weighting from Figure 1b may have
played a small role in the improved performance on both
edges of the distribution. The differences between the CPT
and AVG agents were preserved when presented with more
challenging test scenarios, as illustrated in Figures 4a and
4b.</p>
      <p>Looking more closely at the CPT and AVG agents, we see that the former averages more progress before a collision and reaches the goal more often because it is more deliberate. Figure 5 shows the distribution of average velocities v of the agent's movement toward the goal over the test runs, defined by
v = (d_0 - d_T) / T,   (5)
where d_t is the distance from the agent to the goal at time t as previously defined and T is the duration of a given episode. As can be seen from Table 2, this slight reduction in speed allows the CPT-based agent to, on average, make more progress toward the goal before a collision than the other agents. It also allows the agent to reach the target without a collision more often, as reflected by the “success” fractions given in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Research</title>
      <p>We have developed a method to shape the full-episode
reward distributions of DRL agents through the
maximization of quantities besides expected reward. Our two-step
approach has been shown to enable different agent behaviors
when maximizing CPT-value, CVaR, and full-episode
expected reward, both qualitatively and quantitatively. In
particular, the CPT-based agents in our navigation experiments
were seen to process risk differently than the AVG agents,
on average proceeding more deliberately and thereby
making more progress toward the goal before a collision. While
CVaR is a standard risk-sensitive measure, we found that
optimizing it produced agents unable to compete with the
others above the very bottom of the reward distribution, even at
the 25% level.</p>
      <p>
        To further this work, we intend to investigate the impact
of adjusting the tunable parameters of the CPT function on
the risk-taking behavior of agents. In working toward
assured operation, we plan to explore the combination of our
methods with techniques for constrained RL
        <xref ref-type="bibr" rid="ref1">(Achiam et al.
2017)</xref>
        . Finally, to gain a deeper understanding of the effects
of our methods, we also plan to apply them to more complex
experimental testbeds.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors thank Ashley Llorens for insightful technical
discussions and for his leadership of the Johns Hopkins
Institute for Assured Autonomy (JHU IAA) project that
supported our experimentation and analysis. This project was
conducted under funding from both JHU/APL Internal
Research and Development and JHU IAA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Achiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Held</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tamar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Abbeel,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Constrained Policy Optimization</article-title>
          . arXiv:1705.10528 [cs]. URL http://arxiv.org/abs/1705.10528.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dabney</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and Munos,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>A Distributional Perspective on Reinforcement Learning</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          ,
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Fu,
          <string-name>
            <surname>M. C.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>An Overview of Stochastic Approximation</article-title>
          ,
          <fpage>149</fpage>
          -
          <lpage>178</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Qu</surname>
          </string-name>
          , H.; and
          <string-name>
            <surname>Ryzhov</surname>
            ,
            <given-names>I. O.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Simulation Optimization: A Tutorial Overview and Recent Developments in Gradient-based Methods</article-title>
          .
          <source>In Proceedings of the 2014 Winter Simulation Conference</source>
          ,
          <volume>21</volume>
          -
          <fpage>35</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kreiss</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Alahi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          . Crowd-Robot Interaction:
          <article-title>Crowd-Aware Robot Navigation With Attention-Based Deep Reinforcement Learning</article-title>
          .
          <source>2019 International Conference on Robotics and Automation</source>
          <volume>6015</volume>
          - 6022.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dabney</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Ostrovski,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; and Munos,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2018a</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>Implicit Quantile Networks for Distributional Reinforcement Learning</article-title>
          .
          <source>In Proceedings of the 35th International Conference on Machine Learning</source>
          , volume
          <volume>80</volume>
          ,
          <fpage>1096</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          2018b.
          <article-title>Distributional Reinforcement Learning With Quantile Regression</article-title>
          .
          <source>In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>2892</fpage>
          -
          <lpage>2901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Jie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Prashanth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and Szepesvári,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Stochastic Optimization in a Cumulative Prospect Theory Framework</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>63</volume>
          :
          <fpage>2867</fpage>
          -
          <lpage>2882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kahneman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tversky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1979</year>
          .
          <article-title>Prospect Theory: An Analysis of Decision under Risk</article-title>
          .
          <source>Econometrica</source>
          <volume>47</volume>
          (
          <issue>2</issue>
          ):
          <fpage>263</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kiefer</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Wolfowitz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1952</year>
          .
          <article-title>Stochastic Estimation of the Maximum of a Regression Function</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          <volume>23</volume>
          (
          <issue>3</issue>
          ):
          <fpage>462</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>In 3rd International Conference on Learning Representations.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Leavens</surname>
            ,
            <given-names>D. H.</given-names>
          </string-name>
          <year>1945</year>
          . Diversification of Investments.
          <source>Trusts and Estates</source>
          <volume>80</volume>
          :
          <fpage>469</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.;</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning</article-title>
          . arXiv:2004.14547.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Bellemare,
          <string-name>
            <given-names>M. G.</given-names>
            ;
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            ;
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Petersen,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          ; King,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ; and
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Human-level Control Through Deep Reinforcement Learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Prashanth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>S. I.;</given-names>
          </string-name>
          and Szepesvári,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control</article-title>
          .
          <source>In Proceedings of the 33nd International Conference on Machine Learning</source>
          ,
          <fpage>1406</fpage>
          -
          <lpage>1415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Pratt</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <year>1964</year>
          .
          <article-title>Risk Aversion in the Small and in the Large</article-title>
          .
          <source>Econometrica</source>
          <volume>32</volume>
          :
          <fpage>122</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Prelec</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>The Probability Weighting Function</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>Econometrica</source>
          <volume>66</volume>
          :
          <fpage>497</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Robbins</surname>
          </string-name>
          , H.; and
          <string-name>
            <surname>Monro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>1951</year>
          .
          <article-title>A Stochastic Approximation Method</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>400</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Rockafellar</surname>
          </string-name>
          , R. T.; and
          <string-name>
            <surname>Uryasev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Optimization of Conditional Value-at-Risk</article-title>
          .
          <source>Journal of Risk</source>
          <volume>2</volume>
          :
          <fpage>21</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wolski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Proximal Policy Optimization Algorithms</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>arXiv preprint arXiv:1707</source>
          .
          <fpage>06347</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Spall</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>332</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Tversky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kahneman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>Advances in Prospect Theory: Cumulative Representation of Uncertainty</article-title>
          .
          <source>Journal of Risk and Uncertainty</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>297</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values</article-title>
          .
          <source>Journal of Mathematical Analysis and Applications</source>
          <volume>231</volume>
          :
          <fpage>47</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>