<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Choice Set Misspecification in Reward Inference</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachel Freedman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rohin Shah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anca Dragan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior. A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections. To interpret this feedback, robots treat as approximately optimal a choice the person makes from a choice set, like the set of possible trajectories they could have demonstrated or possible corrections they could have made. In this work, we introduce the idea that the choice set itself might be difficult to specify, and analyze choice set misspecification: what happens as the robot makes incorrect assumptions about the set of choices from which the human selects their feedback. We propose a classification of different kinds of choice set misspecification, and show that these different classes lead to meaningful differences in the inferred reward and resulting performance. While we would normally expect misspecification to hurt, we find that certain kinds of misspecification are neither helpful nor harmful (in expectation). However, in other situations, misspecification can be extremely harmful, leading the robot to believe the opposite of what it should believe. We hope our results will allow for better prediction and response to the effects of misspecification in real-world reward inference.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Specifying reward functions for robots that operate in
environments without a natural reward signal can be challenging,
and incorrectly specified rewards can incentivise degenerate
or dangerous behavior [Leike et al., 2018; Krakovna, 2018].
A promising alternative to manually specifying reward
functions is to design techniques that allow robots to infer them
from observing and interacting with humans.</p>
      <p>These techniques typically model humans as optimal or
noisily optimal. Unfortunately, humans tend to deviate
from optimality in systematically biased ways [Kahneman
and Tversky, 1979; Choi et al., 2014]. Recent work
improves upon these models by modeling pedagogy
[Hadfield-Menell et al., 2016], strategic behavior [Waugh et al., 2013],
risk aversion [Majumdar et al., 2017], hyperbolic
discounting [Evans et al., 2015], or indifference between similar
options [Bobu et al., 2020b]. However, given the complexity of
human behavior, our human models will likely always be at
least somewhat misspecified [Steinhardt and Evans, 2017].</p>
      <p>One way to formally characterize misspecification is as
a misalignment between the real human and the robot’s
assumptions about the human. Recent work in this vein has
examined incorrect assumptions about the human’s hypothesis
space of rewards [Bobu et al., 2020a], their dynamics model
of the world [Reddy et al., 2018], and their level of pedagogic
behavior [Milli and Dragan, 2019]. In this work, we identify
another potential source of misalignment: what if the robot
is wrong about what feedback the human could have given?
Consider the situation illustrated in Figure 1, in which the
robot observes the human going grocery shopping. While the
grocery store contains two packages of peanuts, the human
only notices the more expensive version with flashy
packaging, and so buys that one. If the robot doesn’t realize that the
human was effectively unable to evaluate the cheaper
package on its merits, it will learn that the human values flashy
packaging.</p>
      <p>We formalize this in the recent framework of
reward-rational implicit choice (RRiC) [Jeon et al., 2020] as
misspecification in the human choice set, which specifies what
feedback the human could have given. Our core contribution is
to categorize choice set misspecification into several formally
and empirically distinguishable “classes”, and find that
different types have significantly different effects on performance.
As we might expect, misspecification is usually harmful; in
the most extreme case the choice set is so misspecified that
the robot believes the human feedback was the worst possible
feedback for the true reward, and so updates strongly towards
the opposite of the true reward. Surprisingly, we find that
under other circumstances misspecification is provably
neutral: it neither helps nor hurts performance in expectation.
Crucially, these results suggest that not all misspecification
is equivalently harmful to reward inference: we may be able
to minimize negative impact by systematically erring toward
particular misspecification classes defined in this work.
Future work will explore this possibility.
</p>
    </sec>
    <sec id="sec-2">
      <title>Reward Inference</title>
      <p>There are many ways that a human can provide feedback to
a robot: demonstrations [Ng and Russell, 2000; Abbeel and
Ng, 2004; Ziebart, 2010], comparisons [Sadigh et al., 2017;
Christiano et al., 2017], natural language [Goyal et al., 2019],
corrections [Bajcsy et al., 2017], the state of the world [Shah
et al., 2019], proxy rewards [Hadfield-Menell et al., 2017;
Mindermann et al., 2018], etc. Jeon et al. propose a unifying
formalism for reward inference to capture all of these possible
feedback modalities, called reward-rational (implicit) choice
(RRiC). Rather than study each feedback modality separately,
we study misspecification in this general framework.</p>
      <p>RRiC consists of two main components: the human’s
choice set, which corresponds to what the human could have
done, and the grounding function, which converts choices into
(distributions over) trajectories so that rewards can be
computed.</p>
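      <p>To make these two components concrete, the sketch below (illustrative Python; the names are ours, not from the paper's implementation) types a choice set as a list of feedback elements and a grounding function as a map from a choice to a distribution over trajectories, with the identity grounding used for demonstrations.</p>
      <preformat>
from typing import Callable, Dict, Hashable, List, Tuple

Trajectory = Tuple[Hashable, ...]   # e.g. a tuple of visited states
Choice = Hashable                   # a demonstration, a comparison answer, etc.
ChoiceSet = List[Choice]
# A grounding maps each choice to a distribution over trajectories,
# represented here as {trajectory: probability}.
Grounding = Callable[[Choice], Dict[Trajectory, float]]

def identity_grounding(c: Trajectory) -> Dict[Trajectory, float]:
    # For demonstrations, the choice *is* a trajectory with probability 1.
    return {c: 1.0}
      </preformat>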
      <p>For example, in the case of learning from comparisons, the
human chooses which out of two trajectories is better. Thus,
the human’s choice set is simply the set of trajectories they
are comparing, and the grounding function is the identity. A
more complex example is learning from the state of the world,
in which the robot is deployed in an environment in which
a human has already acted for T timesteps, and must infer
the human’s preferences from the current world state. In this
case, the robot can interpret the human as choosing between
different possible states. Thus, the choice set is the set of
possible states that the human could reach in T timesteps,
and the grounding function maps each such state to the set of
trajectories that could have produced it.</p>
      <p>Let $\xi$ denote a trajectory and $\Xi$ denote the set of all
possible trajectories. Given a choice set $C$ for the human and
grounding function $\psi : C \to (\Xi \to [0, 1])$, Jeon et al. define
a procedure for reward learning. They assume that the human
is Boltzmann-rational with rationality parameter $\beta$, so that
the probability of choosing any particular feedback is given
by:</p>
      <p>$$P(c \mid \theta, C) = \frac{\exp(\beta \, \mathbb{E}_{\psi(c)}[r_\theta(\xi)])}{\sum_{c' \in C} \exp(\beta \, \mathbb{E}_{\psi(c')}[r_\theta(\xi)])} \tag{1}$$</p>
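      <p>Eq. 1 is straightforward to evaluate for a discrete choice set. Below is a minimal sketch, assuming the identity grounding used for demonstrations, so that reward(c, theta) directly returns $\mathbb{E}_{\psi(c)}[r_\theta(\xi)]$; the function names are illustrative.</p>
      <preformat>
import math
from typing import Callable, Dict, Hashable, List

def boltzmann_choice_probs(choice_set: List[Hashable], theta,
                           reward: Callable, beta: float = 1.0) -> Dict:
    # P(c | theta, C): Boltzmann-rational choice over the set (Eq. 1).
    utilities = [beta * reward(c, theta) for c in choice_set]
    m = max(utilities)                   # subtract the max for stability
    exps = [math.exp(u - m) for u in utilities]
    z = sum(exps)
    return {c: e / z for c, e in zip(choice_set, exps)}
      </preformat>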
      <p>From the robot’s perspective, every piece of feedback $c$ is
an observation about the true reward parameterization $\theta$, so
the robot can use Bayesian inference to infer a posterior over
$\theta$. Given a prior over reward parameters $P(\theta)$, the RRiC
inference procedure is defined as:</p>
      <p>$$P(\theta \mid c, C) \propto \frac{\exp(\beta \, \mathbb{E}_{\psi(c)}[r_\theta(\xi)])}{\sum_{c' \in C} \exp(\beta \, \mathbb{E}_{\psi(c')}[r_\theta(\xi)])} \, P(\theta) \tag{2}$$</p>
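      <p>A minimal sketch of Eq. 2 over a discretized hypothesis space, reusing boltzmann_choice_probs from the sketch above; prior maps each candidate $\theta$ (any hashable encoding, e.g. a tuple of weights) to its prior probability.</p>
      <preformat>
from typing import Callable, Dict, Hashable, List

def rric_posterior(c: Hashable, choice_set: List[Hashable],
                   prior: Dict, reward: Callable,
                   beta: float = 1.0) -> Dict:
    # P(theta | c, C): Bayes rule with the Boltzmann likelihood (Eq. 2),
    # over the discretized hypothesis space given by the keys of `prior`.
    posterior = {}
    for theta, p_theta in prior.items():
        likelihood = boltzmann_choice_probs(choice_set, theta, reward, beta)[c]
        posterior[theta] = p_theta * likelihood
    z = sum(posterior.values())
    return {theta: p / z for theta, p in posterior.items()}
      </preformat>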
      <p>Since we care about misspecification of the choice set C,
we focus on learning from demonstrations, where we restrict
the set of trajectories that the expert can demonstrate. This
enables us to have a rich choice set, while allowing for a
simple grounding function (the identity). In future work, we aim
to test choice set misspecification with other feedback
modalities as well.
</p>
    </sec>
    <sec id="sec-3">
      <title>Choice Set Misspecification</title>
      <p>For many common forms of feedback, including
demonstrations and proxy rewards, the RRiC choice set is implicit. The
robot knows which element of feedback the human provided
(e.g. which demonstration they performed), but must assume
which elements of feedback the human could have provided
based on their model of the human. However, this
assumption could easily be incorrect – the robot may assume that the
human has capabilities that they do not, or may fail to
account for cognitive biases that blind the human to particular
feedback options, such as the human bias towards the most
visually attention-grabbing choice in Fig 1.</p>
      <p>To model such effects, we assume that the human selects
feedback $c \in C_{Human}$ according to $P(c \mid \theta, C_{Human})$, while
the robot updates their belief assuming a different choice
set $C_{Robot}$ to get $P(\theta \mid c, C_{Robot})$. Note that $C_{Robot}$ is
the robot’s assumption about what the human’s choice set
is – this is distinct from the robot’s action space. When
$C_{Human} \neq C_{Robot}$, we get choice set misspecification.</p>
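      <p>The following sketch (hypothetical names, building on the rric_posterior sketch from Section 2) makes the mismatch explicit: the simulated human optimizes over $C_{Human}$, while the robot conditions on $C_{Robot}$.</p>
      <preformat>
def simulate_misspecified_inference(c_human_set, c_robot_set, true_theta,
                                    prior, reward, beta=1.0):
    # The simulated human picks the best feedback in *their* choice set.
    c = max(c_human_set, key=lambda ci: reward(ci, true_theta))
    # If c were absent from the robot's assumed set, the robot would
    # observe an "impossible" choice and could revert to a safe baseline.
    assert c in c_robot_set, "robot observes an 'impossible' choice"
    b_correct = rric_posterior(c, c_human_set, prior, reward, beta)
    b_misspecified = rric_posterior(c, c_robot_set, prior, reward, beta)
    return b_correct, b_misspecified
      </preformat>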
      <p>It is easy to detect such misspecification when the human
chooses feedback $c \notin C_R$. In this case, the robot observes
a choice that it believes to be impossible, which should
certainly be grounds for reverting to some safe baseline policy.
So, we only consider the case where the human’s choice $c$ is
also present in $C_R$ (which also requires $C_H$ and $C_R$ to have
at least one element in common).</p>
      <p>Within these constraints, we propose a classification of
types of choice set misspecification in Table 1. On the
vertical axis, misspecification is classified according to
the location of the optimal element of feedback $c^* =
\operatorname{argmax}_{c \in C_R \cup C_H} \mathbb{E}_{\psi(c)}[r_{\theta^*}(\xi)]$. If $c^*$ is available to the
human (in $C_H$), then the class code begins with A. We only
consider the case where $c^*$ is also in $C_R$: the case where
it is in $C_H$ but not $C_R$ is uninteresting, as the robot would
observe the “impossible” event of the human choosing $c^*$,
which immediately demonstrates misspecification, at which
point the robot should revert to some safe baseline policy.
If $c^* \notin C_H$, then we must have $c^* \in C_R$ (since it was
chosen from $C_H \cup C_R$), and the class code begins with B. On
the horizontal axis, misspecification is classified according to
the relationship between $C_R$ and $C_H$. $C_R$ may be a subset
(code 1), superset (code 2), or intersecting class (code 3) of
$C_H$. For example, class A1 describes the case in which the
robot’s choice set is a subset of the human’s (perhaps because
the human is more versatile), but both choice sets contain the
optimal choice (perhaps because it is obvious).</p>
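      <p>A compact way to express this classification (an illustrative helper, not from the paper's code) is shown below. Note that class B1 cannot occur: $c^* \notin C_H$ together with $C_R \subset C_H$ would leave $c^*$ outside both sets.</p>
      <preformat>
def misspecification_class(c_robot: set, c_human: set, c_star) -> str:
    # Letter: is the optimal feedback c* available to the human?
    letter = "A" if c_star in c_human else "B"   # B: c* lies in C_R only
    # Digit: how does the robot's assumed set relate to the human's?
    if c_robot.issubset(c_human) and c_robot != c_human:
        digit = "1"          # C_R is a strict subset of C_H
    elif c_human.issubset(c_robot) and c_robot != c_human:
        digit = "2"          # C_R is a strict superset of C_H
    else:
        digit = "3"          # the two sets partially overlap
    return letter + digit    # B1 cannot occur: C_R inside C_H puts c* in C_H
      </preformat>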
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>To determine the effects of misspecification class, we
artificially generated CR and CH with the properties of each
particular class, simulated human feedback, ran RRiC reward
inference, and then evaluated the robot’s resulting belief
distribution and optimal policy.</p>
      <sec id="sec-4-0">
        <title>4.1 Experimental Setup</title>
        <p>Environment. To isolate the effects of misspecification and
allow for computationally tractable Bayesian inference, we
ran experiments in toy environments. We ran the randomized
experiments in the four $20 \times 20$ gridworlds shown in Fig 2.
Each square in environment $x$ is a state $s_x = \{lava, goal\}$.
$lava \in [0, 1]$ is a continuous feature, while $goal \in \{0, 1\}$
is a binary feature set to 1 in the lower-right square of each
grid and 0 everywhere else. The true reward function $r_{\theta^*}$ is
a linear combination of these features and a constant
stay-alive cost incurred at each timestep, parameterized by $\theta =
(w_{lava}, w_{goal}, w_{alive})$. Each episode begins with the robot in
the upper-left corner and ends once the robot reaches the goal
state or episode length reaches the horizon of 35 timesteps.
Robot actions $A_R$ move the robot one square in a cardinal
or diagonal direction, with actions that would move the robot
off of the grid causing it to remain in place. The transition
function $T$ is deterministic. Environment $x$ defines an MDP
$M_x = \langle S_x, A_R, T, r_{\theta^*} \rangle$.</p>
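        <p>A sketch of this reward parameterization (assumed encoding, with a state represented as its (lava, goal) feature pair):</p>
        <preformat>
# theta = (w_lava, w_goal, w_alive); a state is its (lava, goal) pair and
# a trajectory is a sequence of such states.
def state_features(lava, goal):
    return (lava, goal, 1.0)   # the constant 1.0 carries the alive cost

def trajectory_return(trajectory, theta):
    # Linear reward, accumulated over every timestep of the trajectory.
    return sum(w * f
               for (lava, goal) in trajectory
               for w, f in zip(theta, state_features(lava, goal)))
        </preformat>
        <p>With the identity grounding, trajectory_return can be passed directly as the reward argument of the inference sketches in Section 2.</p>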
        <p>Inference. While the RRiC framework enables inference
from many different types of feedback, we use
demonstration feedback here because demonstrations have an implicit
choice set and straightforward deterministic grounding. Only
the human knows their true reward function
parameterization $\theta^*$. The robot begins with a uniform prior distribution
over reward parameters $P(\theta)$ in which $w_{lava}$ and $w_{alive}$ vary,
but $w_{goal}$ is always $2.0$. $P(\theta)$ contains $\theta^*$. RRiC
inference proceeds as follows for each choice set tuple $\langle C_R, C_H \rangle$
and environment $x$. First, the simulated human selects the
best demonstration from their choice set with respect to the
true reward: $c_H = \operatorname{argmax}_{c \in C_H} \mathbb{E}_{\psi(c)}[r_{\theta^*}(\xi)]$. Then,
the simulated robot uses Eq. 2 to infer a “correct”
distribution over reward parameterizations $B_H(\theta) \triangleq P(\theta \mid c, C_H)$
using the true human choice set, and a “misspecified”
distribution $B_R(\theta) \triangleq P(\theta \mid c, C_R)$ using the misspecified
human choice set. In order to evaluate the effects of each
distribution on robot behavior, we define new MDPs $M_H^x =
\langle S_x, A_R, T, r_{\mathbb{E}[B_H(\theta)]} \rangle$ and $M_R^x = \langle S_x, A_R, T, r_{\mathbb{E}[B_R(\theta)]} \rangle$ for
each environment, solve them using value iteration, and then
evaluate the rollouts of the resulting deterministic policies
according to the true reward function $r_{\theta^*}$.</p>
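        <p>Condensed, the evaluation step looks as follows (solve_mdp and rollout_return stand in for value iteration and policy rollout in the gridworld; both are assumed helpers, not shown here):</p>
        <preformat>
import numpy as np

def mean_theta(belief):
    # Plan with the mean of the belief over theta.
    thetas = np.array([list(t) for t in belief.keys()])
    probs = np.array(list(belief.values()))
    return tuple(probs @ thetas)

def evaluate_belief(belief, true_theta, solve_mdp, rollout_return):
    # `solve_mdp` (value iteration) and `rollout_return` (roll out the
    # resulting deterministic policy, scored under the true reward) are
    # assumed gridworld helpers.
    policy = solve_mdp(mean_theta(belief))
    return rollout_return(policy, true_theta)
        </preformat>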
      </sec>
      <sec id="sec-4-1">
        <title>4.2 Randomized Choice Sets</title>
        <p>We ran experiments with randomized choice set selection for
each misspecification class to evaluate the effects of class on
entropy change and regret.</p>
        <p>Conditions. The experimental conditions are the classes of
choice set misspecification in Table 1: A1, A2, A3, B2 and
B3. We tested each misspecification class on each
environment, then averaged across environments to evaluate each
class. For each environment $x$, we first generated a
master set $C_M^x$ of all demonstrations that are optimal w.r.t. at
least one reward parameterization $\theta$. For each
experimental class, we randomly generated 6 valid $\langle C_R, C_H \rangle$ tuples,
with $C_R, C_H \subseteq C_M^x$. Duplicate tuples, or tuples in which
$c_H \notin C_R$, were not considered.</p>
        <p>Measures. There are two key experimental measures:
entropy change and regret. Entropy change is the difference
in entropy between the correct distribution $B_H$ and the
misspecified distribution $B_R$. That is, $\Delta H = H(B_H) - H(B_R)$.
If entropy change is positive, then misspecification induces
overconfidence, and if it is negative, then misspecification
induces underconfidence.</p>
        <p>Regret is the difference in return between the optimal
solution to $M_H^x$, with the correctly-inferred reward
parameterization, and the optimal solution to $M_R^x$, with the
incorrectly-inferred parameterization, averaged across all 4
environments. If $\tau_H^x$ is an optimal trajectory in $M_H^x$ and $\tau_R^x$ is an
optimal trajectory in $M_R^x$, then $\text{regret} = \frac{1}{4} \sum_{x=0}^{3} [r_{\theta^*}(\tau_H^x) - r_{\theta^*}(\tau_R^x)]$.
Note that we are measuring regret relative to the
optimal action under the correctly specified belief, rather than the
optimal action under the true reward. As a result, it is possible
for regret to be negative, e.g. if the misspecification makes the
robot become more confident in the true reward than it would
be under correct specification, and so execute a better policy.</p>
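        <p>Both measures are simple to compute from the two belief distributions and the per-environment returns; the sketch below uses illustrative names.</p>
        <preformat>
import math

def entropy(belief):
    return -sum(p * math.log(p) for p in belief.values() if p)

def entropy_change(b_correct, b_misspecified):
    # Positive: misspecification induced overconfidence; negative:
    # underconfidence.
    return entropy(b_correct) - entropy(b_misspecified)

def mean_regret(returns_correct, returns_misspecified):
    # Average over environments of r(tau_H) - r(tau_R); may be negative
    # if misspecification happens to help.
    diffs = [rh - rr for rh, rr in zip(returns_correct, returns_misspecified)]
    return sum(diffs) / len(diffs)
        </preformat>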
      </sec>
      <sec id="sec-4-2">
<title>4.3 Biased Choice Sets</title>
        <p>We also ran an experiment in a fifth gridworld where we
select the human choice set with a realistic human bias to
illustrate how choice set misspecification may arise in practice. In
this experiment the human only considers demonstrations that
end at the goal state because, to humans, the word “goal” can
be synonymous with “end” (Fig 3a). However, to the robot,
the goal is merely one of multiple features in the
environment. The robot has no reason to privilege it over the other
features, so the robot considers every demonstration that is
optimal w.r.t. some possible reward parameterization (Fig 3b).
The trajectory that only the robot considers is marked in blue.
We ran RRiC inference using this $\langle C_R, C_H \rangle$ and evaluated
the results using the same measures described above.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We summarize the aggregated measures, discuss the realistic
human bias result, then examine two interesting results:
symmetry between classes A1 and A2, and high regret in class B3.</p>
      <p>Entropy Change. Entropy change varied significantly
across misspecification class. As shown in Fig 4, the
interquartile ranges (IQRs) of classes A1 and A3 did not overlap
with the IQRs of A2 and B2. Moreover, A1 and A3 had
positive medians, suggesting a tendency toward overconfidence,
while A2 and B2 had negative medians, suggesting a
tendency toward underconfidence. B3 was less distinctive, with
an IQR that overlapped with that of all other classes.
Notably, the distributions over entropy change of classes A1 and
A2 are precisely symmetric about 0.</p>
      <p>Regret. Regret also varied as a function of misspecification
class. Each class had a median regret of 0, suggesting that
misspecification commonly did not induce a large enough
shift in belief for the robot to learn a different optimal policy.
However, the mean regret, plotted as green lines in Fig 5, did
vary markedly across classes. Regret was sometimes so high
in class B3 that outliers skewed the mean regret beyond the
whiskers of the boxplot. Again, classes A1 and A2 are
precisely symmetric. We discuss this symmetry in Section 5.3,
then discuss the poor performance of B3 in Section 5.4.</p>
      <p>[Figure 6: (a) the human feedback $c_H$; (b) the inferred posterior $P(\theta \mid c_H, C_R)$.]</p>
      <sec id="sec-5-1">
        <title>5.2 Effects of Biased Choice Sets</title>
        <p>The human bias of only considering demonstrations that
terminate at the goal leads to very poor inference in this
environment. Because the human does not consider the blue
demonstration from Fig 3b, which avoids the lava altogether,
they are forced to provide the demonstration in Fig 6a, which
terminates at the goal but is long and encounters lava. As a
result, the robot infers the very incorrect belief distribution in
Fig 6b. Not only is this distribution underconfident (entropy
change = 0:614), but it also induces poor performance
(regret = 0:666). This result shows that we can see an outsized
negative impact on robot reward inference with a small
incorrect assumption that the human considered and rejected
demonstrations that don’t terminate at the goal.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3 Symmetry</title>
        <p>Intuitively, misspecification should lead to worse
performance in expectation. Surprisingly, when we combine
misspecification classes A1 and A2, their impact on entropy
change and regret is actually neutral. The key to this is their
symmetry – if we switch the contents of $C_{Robot}$ and $C_{Human}$
in an instance of class A1 misspecification, we get an instance
of class A2 with exactly the opposite performance
characteristics. Thus, if a pair in A1 is harmful, then the analogous
pair in A2 must be helpful, meaning that it is better for
performance than having the correct belief about the human’s
choice set. We show below that this is always the case under
certain symmetry conditions that apply to A1 and A2.</p>
        <p>Assume that there is a master choice set $C_M$ containing all
possible elements of feedback for MDP $M$, and that choice
sets are sampled from a symmetric distribution over pairs
of subsets $D : 2^{C_M} \times 2^{C_M} \to [0, 1]$ with $D(C_x, C_y) =
D(C_y, C_x)$ (where $2^{C_M}$ is the set of subsets of $C_M$). Let
$ER(r_\theta, M)$ be the expected return from maximizing the
reward function $r_\theta$ in $M$. A reward parameterization $\theta^*$ is chosen
from a shared prior $P(\theta)$ and $C_H, C_R$ are sampled from $D$.
The human chooses the optimal element of feedback in their
choice set: $c_{C_H} = \operatorname{argmax}_{c \in C_H} \mathbb{E}_{\psi(c)}[r_{\theta^*}(\xi)]$.</p>
        <p>[Table: mean value by class – A1: 0.256, A2: $-0.256$.]</p>
        <p>Theorem 1. Let $M$ and $D$ be defined as above. Assume that
$\forall C_x, C_y \sim D$, we have $c_{C_x} = c_{C_y}$; that is, the human would
pick the same feedback regardless of which choice set she
sees. If the robot follows RRiC inference according to Eq. 2
and acts to maximize expected reward under the inferred
belief, then:</p>
        <p>$$\mathbb{E}_{C_H, C_R \sim D}[\text{Regret}(C_H, C_R)] = 0$$</p>
        <p>Proof. Define $R(C_x, c)$ to be the return achieved when the
robot follows RRiC inference with choice set $C_x$ and
feedback $c$, then acts to maximize $r_{\mathbb{E}[B_x(\theta)]}$, keeping $\theta$ fixed.
Since the human’s choice is symmetric across $D$, for any
$C_x, C_y \sim D$, regret is anti-symmetric:</p>
        <p>$$\text{Regret}(C_x, C_y) = R(C_x, c_{C_x}) - R(C_y, c_{C_x}) = R(C_x, c_{C_y}) - R(C_y, c_{C_y}) = -\text{Regret}(C_y, C_x)$$</p>
        <p>Since $D$ is symmetric, $\langle C_x, C_y \rangle$ is as likely as $\langle C_y, C_x \rangle$.
Combined with the anti-symmetry of regret, this implies that
the expected regret must be zero:</p>
        <p>$$\mathbb{E}_{C_x, C_y \sim D}[\text{Regret}(C_x, C_y)] = \tfrac{1}{2}\mathbb{E}[\text{Regret}(C_x, C_y)] + \tfrac{1}{2}\mathbb{E}[\text{Regret}(C_y, C_x)] = \tfrac{1}{2}\mathbb{E}[\text{Regret}(C_x, C_y)] - \tfrac{1}{2}\mathbb{E}[\text{Regret}(C_x, C_y)] = 0$$</p>
        <p>An analogous proof would work for any anti-symmetric
measure (including entropy change).</p>
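        <p>The pairing argument can also be checked numerically: for any concrete regret_fn implementing the full inference-and-planning pipeline (an assumed helper), each sampled pair must cancel with its mirror image.</p>
        <preformat>
import math

def regrets_cancel(c_x, c_y, regret_fn, tol=1e-9):
    # Under the theorem's assumption that both sets elicit the same human
    # feedback, Regret(Cx, Cy) = -Regret(Cy, Cx): each sampled pair
    # cancels with its mirror image, so the expectation over a symmetric
    # D is zero.
    return math.isclose(regret_fn(c_x, c_y), -regret_fn(c_y, c_x),
                        abs_tol=tol)
        </preformat>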
      </sec>
      <sec id="sec-5-3">
<title>5.4 Worst Case</title>
        <p>As shown in Table 4, class B3 misspecification can induce
regret an order of magnitude worse than the maximum regret
induced by classes A3 and B2, which each differ from B3
along a single axis. This is because the worst case inference
occurs in RRiC when the human feedback $c_H$ is the worst
element of $C_R$, and this is only possible in class B3. In class
B2, $C_R$ contains all of $C_H$, so as long as $|C_H| &gt; 1$, $C_R$
must contain at least one element worse than $c_H$. In class
A3, $c_H = c^*$, so $C_R$ cannot contain any elements better than
$c_H$. However, in class B3, $C_R$ need not contain any elements
worse than $c_H$, in which case the robot updates its belief in
the opposite direction from the ground truth.</p>
        <p>[Table 4 fragment: per-class values for A3, B2, B3 – $-0.001$, $-1.1058$, $-0.9973$, $-0.9973$.]</p>
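        <p>A tiny concrete instance of this failure mode, with made-up returns and a two-point hypothesis space, reusing the rric_posterior sketch from Section 2:</p>
        <preformat>
# Made-up returns: both of the human's options are bad; c_h is merely the
# less bad one. The robot's assumed set (class B3) shares only c_h and
# otherwise contains strictly better alternatives.
returns = {"c_h": -9.0, "c_other": -10.0, "good_1": -1.0, "good_2": -2.0}

def reward(c, theta):
    return theta * returns[c]   # theta = +1 is the truth; theta = -1 flips it

prior = {+1: 0.5, -1: 0.5}
b_correct = rric_posterior("c_h", ["c_h", "c_other"], prior, reward)
b_flipped = rric_posterior("c_h", ["c_h", "good_1", "good_2"], prior, reward)
print(b_correct[+1])   # ~0.73: the correct posterior favors the truth
print(b_flipped[-1])   # ~0.9998: the B3 posterior favors the opposite
        </preformat>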
        <p>For example, consider the sample human choice set in
Fig 7a. Both trajectories are particularly poor, but the human
chooses the demonstration $c_H$ in Fig 7b because it
encounters slightly less lava and so has a marginally higher reward.
Fig 8a shows a potential corresponding robot choice set $C_{R2}$
from B2, containing both trajectories from the human choice
set as well as a few others. Fig 8b shows $P(\theta \mid c_H, C_{R2})$.
The axes represent the weights on the lava and alive
features, and the space of possible parameterizations lies on the
circle where $w_{lava}^2 + w_{alive}^2 = 1$. The opacity of the gold line
is proportional to the weight that $P(\theta)$ places on each
parameter combination. The true reward has $w_{lava}, w_{alive} &lt; 0$,
whereas the peak of this distribution has $w_{lava} &lt; 0$, but
$w_{alive} &gt; 0$. This is because $C_{R2}$ contains shorter
trajectories that encounter the same amount of lava, and so the robot
infers that $c_H$ must be preferred in large part due to its length.</p>
        <p>Fig 9a shows an example robot choice set $C_{R3}$ from B3,
and Fig 9b shows the inferred $P(\theta \mid c_H, C_{R3})$. Note that the
peak of this distribution has $w_{lava}, w_{alive} &gt; 0$. Since $c_H$
is the longest and the highest-lava trajectory in $C_{R3}$, and
alternative shorter and lower-lava trajectories exist in $C_{R3}$, the
robot infers that the human is attempting to maximize both
trajectory length and lava encountered: the opposite of the
truth. Unsurprisingly, maximizing expected reward for this
belief leads to high regret. The key difference between B2
and B3 is that $c_H$ is the lowest-reward element in $C_{R3}$,
resulting in the robot updating directly away from the true reward.</p>
        <p>[Figure 8: (a) the robot choice set $C_{R2}$; (b) the inferred posterior $P(\theta \mid c_H, C_{R2})$.]</p>
        <p>Summary. In this work, we highlighted the problem of
choice set misspecification in generalized reward inference,
where a human gives feedback selected from choice set
$C_{Human}$ but the robot assumes that the human was
choosing from choice set $C_{Robot}$. As expected, such
misspecification on average induces suboptimal behavior resulting in
regret. However, a different story emerged once we
distinguished between misspecification classes. We defined five
distinct classes varying along two axes: the relationship
between $C_{Human}$ and $C_{Robot}$, and the location of the optimal
element of feedback $c^*$. We empirically showed that
different classes lead to different types of error, with some classes
leading to overconfidence, some to underconfidence, and one
to particularly high regret. Surprisingly, under certain
conditions the expected regret under choice set misspecification is
actually 0, meaning that in expectation, misspecification does
not hurt in these situations.</p>
        <p>Implications There is wide variance across the different
types of choice-set misspecification: some may have
particularly detrimental effects, and others may not be harmful at
all. This suggests strategies for designing robot choice sets
to minimize the impact of misspecification. For example,
we find that regret tends to be negative (that is,
misspecification is helpful) when the optimal element of feedback is in
both $C_{Robot}$ and $C_{Human}$ and $C_{Robot} \supset C_{Human}$ (class A2).
Similarly, worst-case inference occurs when the optimal
element of feedback is in $C_{Robot}$ only, and $C_{Human}$ contains
elements that are not in $C_{Robot}$ (class B3). This suggests that
erring on the side of specifying a large $C_{Robot}$, which makes
A2 more likely and B3 less likely, may lead to more benign
misspecification. Moreover, it may be possible to design
protocols for the robot to identify unrealistic choice set-feedback
combinations and verify its choice set with the human,
reducing the likelihood of misspecification in the first place. We
plan to investigate this in future work.</p>
        <p>Limitations and future work. In this paper, we primarily
sampled choice sets randomly from the master choice set of
all possibly optimal demonstrations. However, this is not a
realistic model. In future work, we plan to select human choice
sets based on actual human biases to improve ecological
validity. We also plan to test this classification and our
resulting conclusions in more complex and realistic environments.
Eventually, we plan to work on active learning protocols that
allow the robot to identify when its choice set is misspecified
and alter its beliefs accordingly.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Abbeel and Ng</source>
          , 2004]
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          and Andrew Y Ng.
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          .
          <source>In Proceedings of the twenty-first international conference on Machine learning, page 1</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Bajcsy et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Bajcsy</surname>
          </string-name>
          , Dylan P Losey, Marcia K O'Malley, and
          <string-name>
            <given-names>Anca D</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Learning robot objectives from physical human interaction</article-title>
          .
          <source>Proceedings of Machine Learning Research</source>
          ,
          <volume>78</volume>
          :
          <fpage>217</fpage>
          -
          <lpage>226</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bobu et al., 2020a]
          <string-name>
            <given-names>Andreea</given-names>
            <surname>Bobu</surname>
          </string-name>
          , Andrea Bajcsy, Jaime F Fisac,
          <string-name>
            <surname>Sampada Deglurkar</surname>
            , and
            <given-names>Anca D</given-names>
          </string-name>
          <string-name>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Quantifying hypothesis space misspecification in learning from human-robot demonstrations and physical corrections</article-title>
          .
          <source>IEEE Transactions on Robotics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Bobu et al., 2020b]
          <string-name>
            <given-names>Andreea</given-names>
            <surname>Bobu</surname>
          </string-name>
          , Dexter RR Scobee, Jaime F Fisac,
          <string-name>
            <given-names>S Shankar</given-names>
            <surname>Sastry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anca D</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Less is more: Rethinking probabilistic models of human behavior</article-title>
          .
          <source>arXiv preprint arXiv:2001.04465</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Choi et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Syngjoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , Shachar Kariv, Wieland Müller, and Dan Silverman.
          <article-title>Who is (more) rational?</article-title>
          <source>American Economic Review</source>
          ,
          <volume>104</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1518</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Christiano et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Paul F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          , Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning from human preferences</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          , 2017-December:
          <fpage>4300</fpage>
          -
          <lpage>4308</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Evans et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Stuhlmueller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Noah D.</given-names>
            <surname>Goodman</surname>
          </string-name>
          .
          <article-title>Learning the Preferences of Ignorant, Inconsistent Agents</article-title>
          . arXiv,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Goyal et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Prasoon</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Niekum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Raymond J</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Using natural language for reward shaping in reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1903.02020</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [
          <string-name>
            <surname>Hadfield-Menell</surname>
          </string-name>
          et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Stuart J Russell, Pieter Abbeel, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3909</fpage>
          -
          <lpage>3917</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [
          <string-name>
            <surname>Hadfield-Menell</surname>
          </string-name>
          et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Smitha Milli, Pieter Abbeel, Stuart J Russell, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Inverse reward design</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>6765</fpage>
          -
          <lpage>6774</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Jeon et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Hong Jun</given-names>
            <surname>Jeon</surname>
          </string-name>
          , Smitha Milli, and
          <string-name>
            <given-names>Anca D</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Reward-rational (implicit) choice: A unifying formalism for reward learning</article-title>
          .
          <source>arXiv preprint arXiv:2002.04833</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Kahneman and Tversky</source>
          , 1979]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Kahneman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amos</given-names>
            <surname>Tversky</surname>
          </string-name>
          .
          <article-title>Prospect Theory: An Analysis of Decision Under Risk</article-title>
          .
          <source>Econometrica</source>
          ,
          <volume>47</volume>
          (
          <issue>2</issue>
          ):
          <fpage>263</fpage>
          -
          <lpage>292</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[Krakovna</source>
          , 2018]
          <string-name>
            <given-names>Victoria</given-names>
            <surname>Krakovna</surname>
          </string-name>
          .
          <article-title>Specification gaming examples in ai</article-title>
          ,
          <year>April 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Leike et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Scalable agent alignment via reward modeling: a research direction</article-title>
          . arXiv,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Majumdar et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Anirudha</given-names>
            <surname>Majumdar</surname>
          </string-name>
          , Sumeet Singh,
          <string-name>
            <given-names>Ajay</given-names>
            <surname>Mandlekar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Pavone</surname>
          </string-name>
          .
          <article-title>Risksensitive inverse reinforcement learning via coherent risk models</article-title>
          .
          <source>In Robotics: Science and Systems</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>[Milli and Dragan</source>
          , 2019]
          <string-name>
            <given-names>Smitha</given-names>
            <surname>Milli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anca D</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Literal or pedagogic human? analyzing human model misspecification in objective learning</article-title>
          .
          <source>arXiv preprint arXiv:1903.03877</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Mindermann et al.,
          <year>2018</year>
          ] Sören Mindermann, Rohin Shah, Adam Gleave, and
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          .
          <article-title>Active inverse reward design</article-title>
          .
          <source>arXiv preprint arXiv:1809.03060</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Ng and Russell</source>
          , 2000]
          <string-name>
            <given-names>Andrew Y</given-names>
            <surname>Ng</surname>
          </string-name>
          and Stuart J Russell
          .
          <article-title>Algorithms for inverse reinforcement learning</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Reddy et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Sid</given-names>
            <surname>Reddy</surname>
          </string-name>
          , Anca Dragan, and
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Where do you think you're going?: Inferring beliefs about dynamics from behavior</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>1454</fpage>
          -
          <lpage>1465</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Sadigh et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Dorsa</given-names>
            <surname>Sadigh</surname>
          </string-name>
          , Anca D Dragan, Shankar Sastry, and Sanjit A Seshia.
          <article-title>Active preference-based learning of reward functions</article-title>
          .
          <source>In Robotics: Science and Systems</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Shah et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Rohin</given-names>
            <surname>Shah</surname>
          </string-name>
          , Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Preferences implicit in the state of the world</article-title>
          .
          <source>arXiv preprint arXiv:1902.04198</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>[Steinhardt and Evans</source>
          , 2017]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Model mis-specification and inverse reinforcement learning</article-title>
          ,
          <year>Feb 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Waugh et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Waugh</surname>
          </string-name>
          , Brian D Ziebart, and
          <string-name>
            <given-names>J Andrew</given-names>
            <surname>Bagnell</surname>
          </string-name>
          .
          <article-title>Computational rationalization: The inverse equilibrium problem</article-title>
          .
          <source>arXiv preprint arXiv:1308.3506</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>[Ziebart</source>
          , 2010]
          <string-name>
            <given-names>Brian D</given-names>
            <surname>Ziebart</surname>
          </string-name>
          .
          <article-title>Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy</article-title>
          .
          <source>PhD thesis</source>
          , Carnegie Mellon University,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>