                            Choice Set Misspecification in Reward Inference

                                Rachel Freedman∗, Rohin Shah and Anca Dragan
                                           University of California, Berkeley
                                  {rachel.freedman, rohinmshah, anca}@berkeley.edu

∗ Contact Author
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



                           Abstract

Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior. A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections. To interpret this feedback, robots treat as approximately optimal a choice the person makes from a choice set, like the set of possible trajectories they could have demonstrated or possible corrections they could have made. In this work, we introduce the idea that the choice set itself might be difficult to specify, and analyze choice set misspecification: what happens as the robot makes incorrect assumptions about the set of choices from which the human selects their feedback. We propose a classification of different kinds of choice set misspecification, and show that these different classes lead to meaningful differences in the inferred reward and resulting performance. While we would normally expect misspecification to hurt, we find that certain kinds of misspecification are neither helpful nor harmful (in expectation). However, in other situations, misspecification can be extremely harmful, leading the robot to believe the opposite of what it should believe. We hope our results will allow for better prediction and response to the effects of misspecification in real-world reward inference.

Figure 1: Example choice set misspecification: The human chooses a pack of peanuts at the supermarket. They only notice the expensive one because it has flashy packaging, so that's the one they buy. However, the robot incorrectly assumes that the human can see both the expensive flashy one and the cheap one with dull packaging but extra peanuts. As a result, the robot incorrectly infers that the human likes flashy packaging, paying more, and getting fewer peanuts.


1   Introduction
Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior [Leike et al., 2018; Krakovna, 2018]. A promising alternative to manually specifying reward functions is to design techniques that allow robots to infer them from observing and interacting with humans.
   These techniques typically model humans as optimal or noisily optimal. Unfortunately, humans tend to deviate from optimality in systematically biased ways [Kahneman and Tversky, 1979; Choi et al., 2014]. Recent work improves upon these models by modeling pedagogy [Hadfield-Menell et al., 2016], strategic behavior [Waugh et al., 2013], risk aversion [Majumdar et al., 2017], hyperbolic discounting [Evans et al., 2015], or indifference between similar options [Bobu et al., 2020b]. However, given the complexity of human behavior, our human models will likely always be at least somewhat misspecified [Steinhardt and Evans, 2017].
   One way to formally characterize misspecification is as a misalignment between the real human and the robot's assumptions about the human. Recent work in this vein has examined incorrect assumptions about the human's hypothesis space of rewards [Bobu et al., 2020a], their dynamics model of the world [Reddy et al., 2018], and their level of pedagogic behavior [Milli and Dragan, 2019]. In this work, we identify another potential source of misalignment: what if the robot is wrong about what feedback the human could have given? Consider the situation illustrated in Figure 1, in which the robot observes the human going grocery shopping. While the grocery store contains two packages of peanuts, the human only notices the more expensive version with flashy packaging, and so buys that one. If the robot doesn't realize that the human was effectively unable to evaluate the cheaper package on its merits, it will learn that the human values flashy packaging.
   We formalize this in the recent framework of reward-rational implicit choice (RRiC) [Jeon et al., 2020] as misspecification in the human choice set, which specifies what feedback the human could have given. Our core contribution is to categorize choice set misspecification into several formally and empirically distinguishable "classes", and find that different types have significantly different effects on performance. As we might expect, misspecification is usually harmful; in the most extreme case the choice set is so misspecified that the robot believes the human feedback was the worst possible feedback for the true reward, and so updates strongly towards the opposite of the true reward. Surprisingly, we find that under other circumstances misspecification is provably neutral: it neither helps nor hurts performance in expectation. Crucially, these results suggest that not all misspecification is equivalently harmful to reward inference: we may be able to minimize negative impact by systematically erring toward particular misspecification classes defined in this work. Future work will explore this possibility.

2   Reward Inference
There are many ways that a human can provide feedback to a robot: demonstrations [Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart, 2010], comparisons [Sadigh et al., 2017; Christiano et al., 2017], natural language [Goyal et al., 2019], corrections [Bajcsy et al., 2017], the state of the world [Shah et al., 2019], proxy rewards [Hadfield-Menell et al., 2017; Mindermann et al., 2018], etc. Jeon et al. propose a unifying formalism for reward inference to capture all of these possible feedback modalities, called reward-rational (implicit) choice (RRiC). Rather than study each feedback modality separately, we study misspecification in this general framework.
   RRiC consists of two main components: the human's choice set, which corresponds to what the human could have done, and the grounding function, which converts choices into (distributions over) trajectories so that rewards can be computed.
   For example, in the case of learning from comparisons, the human chooses which out of two trajectories is better. Thus, the human's choice set is simply the set of trajectories they are comparing, and the grounding function is the identity. A more complex example is learning from the state of the world, in which the robot is deployed in an environment in which a human has already acted for T timesteps, and must infer the human's preferences from the current world state. In this case, the robot can interpret the human as choosing between different possible states. Thus, the choice set is the set of possible states that the human could reach in T timesteps, and the grounding function maps each such state to the set of trajectories that could have produced it.
   Let ξ denote a trajectory and Ξ denote the set of all possible trajectories. Given a choice set C for the human and grounding function ψ : C → (Ξ → [0, 1]), Jeon et al. define a procedure for reward learning. They assume that the human is Boltzmann-rational with rationality parameter β, so that the probability of choosing any particular feedback is given by:

\[
P(c \mid \theta, C) \;=\; \frac{\exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c)}[r_\theta(\xi)]\big)}{\sum_{c' \in C} \exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c')}[r_\theta(\xi)]\big)} \tag{1}
\]

   From the robot's perspective, every piece of feedback c is an observation about the true reward parameterization θ*, so the robot can use Bayesian inference to infer a posterior over θ. Given a prior over reward parameters P(θ), the RRiC inference procedure is defined as:

\[
P(\theta \mid c, C) \;\propto\; \frac{\exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c)}[r_\theta(\xi)]\big)}{\sum_{c' \in C} \exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c')}[r_\theta(\xi)]\big)} \cdot P(\theta) \tag{2}
\]

   Since we care about misspecification of the choice set C, we focus on learning from demonstrations, where we restrict the set of trajectories that the expert can demonstrate. This enables us to have a rich choice set, while allowing for a simple grounding function (the identity). In future work, we aim to test choice set misspecification with other feedback modalities as well.
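To make the inference procedure concrete, here is a minimal Python sketch of Eqs. (1)-(2), assuming demonstration feedback with the identity grounding, a discrete grid of candidate reward parameters, and rewards that are linear in trajectory feature counts. The function and variable names are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def boltzmann_likelihood(returns, beta=1.0):
    """Eq. (1): P(c | theta, C) for every c in C, given r_theta(xi) per trajectory."""
    logits = beta * np.asarray(returns, dtype=float)
    logits -= logits.max()                    # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def rric_posterior(chosen_idx, choice_set_features, thetas, prior, beta=1.0):
    """Eq. (2): posterior P(theta | c, C) over a discrete grid of reward parameters.

    choice_set_features: (|C|, d) array of feature counts, one row per trajectory in C
    thetas:              (K, d) array of candidate reward parameter vectors
    prior:               (K,) array with the prior P(theta)
    """
    posterior = np.array(prior, dtype=float)
    for k, theta in enumerate(thetas):
        returns = choice_set_features @ theta             # r_theta(xi) for each xi in C
        posterior[k] *= boltzmann_likelihood(returns, beta)[chosen_idx]
    return posterior / posterior.sum()
```

Under choice set misspecification, the same observed demonstration is scored against two different choice sets: the human's actual set and the robot's assumed set. That is exactly the comparison run in the experiments below.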
3   Choice Set Misspecification
For many common forms of feedback, including demonstrations and proxy rewards, the RRiC choice set is implicit. The robot knows which element of feedback the human provided (ex. which demonstration they performed), but must assume which elements of feedback the human could have provided based on their model of the human. However, this assumption could easily be incorrect – the robot may assume that the human has capabilities that they do not, or may fail to account for cognitive biases that blind the human to particular feedback options, such as the human bias towards the most visually attention-grabbing choice in Fig 1.
   To model such effects, we assume that the human selects feedback c ∈ C_Human according to P(c | θ, C_Human), while the robot updates their belief assuming a different choice set C_Robot to get P(θ | c, C_Robot). Note that C_Robot is the robot's assumption about what the human's choice set is – this is distinct from the robot's action space. When C_Human ≠ C_Robot, we get choice set misspecification.
   It is easy to detect such misspecification when the human chooses feedback c ∉ C_R. In this case, the robot observes a choice that it believes to be impossible, which should certainly be grounds for reverting to some safe baseline policy. So, we only consider the case where the human's choice c is also present in C_R (which also requires C_H and C_R to have at least one element in common).

                     C_R ⊂ C_H    C_R ⊃ C_H    C_R ∩ C_H
   c* ∈ C_R ∩ C_H       A1           A2           A3
   c* ∈ C_R \ C_H        –           B2           B3

Table 1: Choice set misspecification classification, where C_R is the robot's assumed choice set, C_H is the human's actual choice set, and c* is the optimal element from C_R ∪ C_H. B1 is omitted because if C_R ⊂ C_H, then C_R \ C_H is empty and cannot contain c*.

   Within these constraints, we propose a classification of types of choice set misspecification in Table 1. On the vertical axis, misspecification is classified according to the location of the optimal element of feedback c* = argmax_{c ∈ C_R ∪ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)]. If c* is available to the human (in C_H), then the class code begins with A. We only consider the case where c* is also in C_R: the case where it is in C_H but not C_R is uninteresting, as the robot would observe the "impossible" event of the human choosing c*, which immediately demonstrates misspecification, at which point the robot should revert to some safe baseline policy. If c* ∉ C_H, then we must have c* ∈ C_R (since it was chosen from C_H ∪ C_R), and the class code begins with B. On the horizontal axis, misspecification is classified according to the relationship between C_R and C_H. C_R may be a subset (code 1), superset (code 2), or intersecting class (code 3) of C_H. For example, class A1 describes the case in which the robot's choice set is a subset of the human's (perhaps because the human is more versatile), but both choice sets contain the optimal choice (perhaps because it is obvious).
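As a sketch of how a pair ⟨C_R, C_H⟩ could be mapped to a cell of Table 1, the illustrative Python function below assumes feedback elements are hashable and uses a helper `expected_return(c)` standing in for E_{ξ∼ψ(c)}[r_θ*(ξ)]; it is not code from the paper.

```python
def classify_misspecification(C_R, C_H, expected_return):
    """Map a <C_R, C_H> pair to a class from Table 1 ('A1', 'A2', 'A3', 'B2', 'B3').

    C_R, C_H:        Python sets of hashable feedback elements
    expected_return: function giving E_{xi~psi(c)}[r_{theta*}(xi)] for an element c
    Returns None for the cases excluded in the text (correct specification, or
    c* missing from C_R, where the robot would observe an "impossible" choice).
    """
    if C_R == C_H:
        return None                                # correctly specified, not misspecification
    c_star = max(C_R | C_H, key=expected_return)   # optimal element of C_R ∪ C_H
    if c_star not in C_R:
        return None                                # excluded: robot should revert to a safe policy
    row = 'A' if c_star in C_H else 'B'            # vertical axis: where c* lives
    if C_R < C_H:
        col = '1'                                  # robot's set is a strict subset
    elif C_R > C_H:
        col = '2'                                  # robot's set is a strict superset
    else:
        col = '3'                                  # neither contains the other (intersecting)
    return row + col                               # B1 cannot be reached: it is caught by the c* check
```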
4   Experiments
To determine the effects of misspecification class, we artificially generated C_R and C_H with the properties of each particular class, simulated human feedback, ran RRiC reward inference, and then evaluated the robot's resulting belief distribution and optimal policy.

Figure 2: The set of four gridworlds used in randomized experiments, with the lava feature marked in red.

4.1   Experimental Setup
Environment To isolate the effects of misspecification and allow for computationally tractable Bayesian inference, we ran experiments in toy environments. We ran the randomized experiments in the four 20 × 20 gridworlds shown in Fig 2. Each square in environment x is a state s^x = {lava, goal}. lava ∈ [0, 1] is a continuous feature, while goal ∈ {0, 1} is a binary feature set to 1 in the lower-right square of each grid and 0 everywhere else. The true reward function r_θ* is a linear combination of these features and a constant stay-alive cost incurred at each timestep, parameterized by θ = (w_lava, w_goal, w_alive). Each episode begins with the robot in the upper-left corner and ends once the robot reaches the goal state or episode length reaches the horizon of 35 timesteps. Robot actions A_R move the robot one square in a cardinal or diagonal direction, with actions that would move the robot off of the grid causing it to remain in place. The transition function T is deterministic. Environment x defines an MDP M^x = ⟨S^x, A_R, T, r_θ*⟩.
Inference While the RRiC framework enables inference from many different types of feedback, we use demonstration feedback here because demonstrations have an implicit choice set and straightforward deterministic grounding. Only the human knows their true reward function parameterization θ*. The robot begins with a uniform prior distribution over reward parameters P(θ) in which w_lava and w_alive vary, but w_goal always = 2.0. P(θ) contains θ*. RRiC inference proceeds as follows for each choice set tuple ⟨C_R, C_H⟩ and environment x. First, the simulated human selects the best demonstration from their choice set with respect to the true reward, c_H = argmax_{c ∈ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)]. Then, the simulated robot uses Eq. 2 to infer a "correct" distribution over reward parameterizations B_H(θ) ≜ P(θ | c, C_H) using the true human choice set, and a "misspecified" distribution B_R(θ) ≜ P(θ | c, C_R) using the misspecified human choice set. In order to evaluate the effects of each distribution on robot behavior, we define new MDPs M_H^x = ⟨S^x, A_R, T, r_E[B_H(θ)]⟩ and M_R^x = ⟨S^x, A_R, T, r_E[B_R(θ)]⟩ for each environment, solve them using value iteration, and then evaluate the rollouts of the resulting deterministic policies according to the true reward function r_θ*.

4.2   Randomized Choice Sets
We ran experiments with randomized choice set selection for each misspecification class to evaluate the effects of class on entropy change and regret.
Conditions The experimental conditions are the classes of choice set misspecification in Table 1: A1, A2, A3, B2 and B3. We tested each misspecification class on each environment, then averaged across environments to evaluate each class. For each environment x, we first generated a master set C_M^x of all demonstrations that are optimal w.r.t. at least one reward parameterization θ. For each experimental class, we randomly generated 6 valid ⟨C_R, C_H⟩ tuples, with C_R, C_H ⊆ C_M^x. Duplicate tuples, or tuples in which c_H ∉ C_R, were not considered.
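The sketch below ties the setup together: illustrative trajectory features for the gridworlds (lava encountered, goal reached, timesteps alive), a simulated human who picks the best demonstration in C_H under the true reward, and the "correct" and "misspecified" posteriors B_H and B_R. It reuses the hypothetical `rric_posterior` helper from the Section 2 sketch, and the true weights shown are made up for illustration (the paper only fixes w_goal = 2.0 and reports that w_lava and w_alive are negative).

```python
import numpy as np

# Illustrative feature counts for a demonstration: (lava encountered, goal
# reached, timesteps alive), matching theta = (w_lava, w_goal, w_alive).
def trajectory_features(states, lava_map, goal_state):
    lava = sum(lava_map[s] for s in states)        # continuous lava feature
    goal = float(states[-1] == goal_state)         # binary goal feature
    alive = float(len(states))                     # stay-alive cost accrues per timestep
    return np.array([lava, goal, alive])

THETA_TRUE = np.array([-1.0, 2.0, -0.1])           # made-up true weights (w_goal = 2.0)

def infer_correct_and_misspecified(C_H_feats, C_R_feats, thetas, prior, beta=1.0):
    """Simulate one run: human demonstrates, robot infers B_H and B_R via Eq. 2."""
    # Simulated human: best demonstration in C_H w.r.t. the true reward.
    c_H = int(np.argmax(C_H_feats @ THETA_TRUE))
    c_H_feat = C_H_feats[c_H]

    # The same demonstration must also appear in the robot's assumed set C_R.
    idx_in_R = next(i for i, f in enumerate(C_R_feats) if np.allclose(f, c_H_feat))

    B_H = rric_posterior(c_H, C_H_feats, thetas, prior, beta)       # correct belief
    B_R = rric_posterior(idx_in_R, C_R_feats, thetas, prior, beta)  # misspecified belief
    return B_H, B_R
```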
Measures There are two key experimental measures: entropy change and regret. Entropy change is the difference in entropy between the correct distribution B_H and the misspecified distribution B_R. That is, ∆H = H(B_H) − H(B_R). If entropy change is positive, then misspecification induces overconfidence, and if it is negative, then misspecification induces underconfidence.
   Regret is the difference in return between the optimal solution to M_H^x, with the correctly-inferred reward parameterization, and the optimal solution to M_R^x, with the incorrectly-inferred parameterization, averaged across all 4 environments. If ξ_H^{*x} is an optimal trajectory in M_H^x and ξ_R^{*x} is an optimal trajectory in M_R^x, then

\[
\text{regret} = \frac{1}{4} \sum_{x=0}^{3} \left[ r_{\theta^*}(\xi_H^{*x}) - r_{\theta^*}(\xi_R^{*x}) \right]
\]

Note that we are measuring regret relative to the optimal action under the correctly specified belief, rather than the optimal action under the true reward. As a result, it is possible for regret to be negative, e.g. if the misspecification makes the robot become more confident in the true reward than it would be under correct specification, and so execute a better policy.

4.3   Biased Choice Sets
We also ran an experiment in a fifth gridworld where we select the human choice set with a realistic human bias to illustrate how choice set misspecification may arise in practice. In this experiment the human only considers demonstrations that end at the goal state because, to humans, the word "goal" can be synonymous with "end" (Fig 3a). However, to the robot, the goal is merely one of multiple features in the environment. The robot has no reason to privilege it over the other features, so the robot considers every demonstration that is optimal w.r.t. some possible reward parameterization (Fig 3b). The trajectory that only the robot considers is marked in blue. We ran RRiC inference using this ⟨C_R, C_H⟩ and evaluated the results using the same measures described above.

Figure 3: Human and robot choice sets with a human goal bias (panels: (a) C_H, (b) C_R). Because the human only considers trajectories that terminate at the goal, they don't consider the blue trajectory in C_R.

5   Results
We summarize the aggregated measures, discuss the realistic human bias result, then examine two interesting results: symmetry between classes A1 and A2 and high regret in class B3.

5.1   Aggregate Measures in Randomized Experiments
Entropy Change Entropy change varied significantly across misspecification class. As shown in Fig 4, the interquartile ranges (IQRs) of classes A1 and A3 did not overlap with the IQRs of A2 and B2. Moreover, A1 and A3 had positive medians, suggesting a tendency toward overconfidence, while A2 and B2 had negative medians, suggesting a tendency toward underconfidence. B3 was less distinctive, with an IQR that overlapped with that of all other classes. Notably, the distributions over entropy change of classes A1 and A2 are precisely symmetric about 0.

Figure 4: Entropy Change (N=24). The box is the IQR, the whiskers are the range, and the blue line is the median. There are no outliers.

Regret Regret also varied as a function of misspecification class. Each class had a median regret of 0, suggesting that misspecification commonly did not induce a large enough shift in belief for the robot to learn a different optimal policy. However, the mean regret, plotted as green lines in Fig 5, did vary markedly across classes. Regret was sometimes so high in class B3 that outliers skewed the mean regret beyond the whiskers of the boxplot. Again, classes A1 and A2 are precisely symmetric. We discuss this symmetry in Section 5.3, then discuss the poor performance of B3 in Section 5.4.

Figure 5: Regret (N=24). The box is the IQR, the whiskers are the most distant points within 1.5 × the IQR, and the green line is the mean. Multiple outliers are omitted.
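Both measures reported above can be computed directly from the two inferred beliefs. The following is a hedged sketch, with `optimal_trajectory_features(w)` a hypothetical planner (e.g. value iteration on one gridworld) that returns the feature counts of the trajectory maximizing the linear reward with weights w; the paper additionally averages regret over the four environments.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))

def entropy_change(B_H, B_R):
    # Positive: the misspecified belief is sharper than the correct one (overconfidence).
    return entropy(B_H) - entropy(B_R)

def regret_in_one_env(B_H, B_R, thetas, theta_true, optimal_trajectory_features):
    # Mean reward weights under each belief (rewards are linear in theta).
    w_H = np.asarray(B_H) @ np.asarray(thetas)
    w_R = np.asarray(B_R) @ np.asarray(thetas)
    xi_H = optimal_trajectory_features(w_H)    # optimal behaviour under the correct belief
    xi_R = optimal_trajectory_features(w_R)    # optimal behaviour under the misspecified belief
    return float(xi_H @ theta_true - xi_R @ theta_true)
```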
5.2   Effects of Biased Choice Sets
The human bias of only considering demonstrations that terminate at the goal leads to very poor inference in this environment. Because the human does not consider the blue demonstration from Fig 3b, which avoids the lava altogether, they are forced to provide the demonstration in Fig 6a, which terminates at the goal but is long and encounters lava. As a result, the robot infers the very incorrect belief distribution in Fig 6b. Not only is this distribution underconfident (entropy change = −0.614), but it also induces poor performance (regret = 0.666). This result shows that we can see an outsized negative impact on robot reward inference with a small incorrect assumption that the human considered and rejected demonstrations that don't terminate at the goal.

Figure 6: Human feedback and the resulting misspecified robot belief with a human goal bias (panels: (a) feedback c_H, (b) P(θ | c_H, C_R)). Because the feedback that the biased human provides is poor, the robot learns a very incorrect distribution over rewards.

5.3   Symmetry
Intuitively, misspecification should lead to worse performance in expectation. Surprisingly, when we combine misspecification classes A1 and A2, their impact on entropy change and regret is actually neutral. The key to this is their symmetry – if we switch the contents of C_Robot and C_Human in an instance of class A1 misspecification, we get an instance of class A2 with exactly the opposite performance characteristics. Thus, if a pair in A1 is harmful, then the analogous pair in A2 must be helpful, meaning that it is better for performance than having the correct belief about the human's choice set. We show below that this is always the case under certain symmetry conditions that apply to A1 and A2.

   Class     Mean      Std       Q1        Q3
   A1        0.256     0.2265    0.1153    0.4153
   A2       −0.256     0.2265   −0.4153   −0.1153

Table 2: Entropy change is symmetric across classes A1 and A2.

   Class     Mean      Std       Q1        Q3
   A1        0.04      0.4906    0.0       0.1664
   A2       −0.04      0.4906   −0.1664    0.0

Table 3: Regret is symmetric across classes A1 and A2.

   Assume that there is a master choice set C_M containing all possible elements of feedback for MDP M, and that choice sets are sampled from a symmetric distribution over pairs of subsets D : 2^{C_M} × 2^{C_M} → [0, 1] with D(C_x, C_y) = D(C_y, C_x) (where 2^{C_M} is the set of subsets of C_M). Let ER(r_θ, M) be the expected return from maximizing the reward function r_θ in M. A reward parameterization is chosen from a shared prior P(θ) and C_H, C_R are sampled from D. The human chooses the optimal element of feedback in their choice set, c_{C_H} = argmax_{c ∈ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)].

Theorem 1. Let M and D be defined as above. Assume that ∀ C_x, C_y ∼ D, we have c_{C_x} = c_{C_y}; that is, the human would pick the same feedback regardless of which choice set she sees. If the robot follows RRiC inference according to Eq. 2 and acts to maximize expected reward under the inferred belief, then:

\[
\mathbb{E}_{C_H, C_R \sim D}\,[\mathrm{Regret}(C_H, C_R)] = 0
\]

Proof. Define R(C_x, c) to be the return achieved when the robot follows RRiC inference with choice set C_x and feedback c, then acts to maximize r_{E[B_x(θ)]}, keeping β fixed. Since the human's choice is symmetric across D, for any C_x, C_y ∼ D, regret is anti-symmetric:

\[
\begin{aligned}
\mathrm{Regret}(C_x, C_y) &= R(C_x, c_{C_x}) - R(C_y, c_{C_x}) \\
                          &= R(C_x, c_{C_y}) - R(C_y, c_{C_y}) \\
                          &= -\mathrm{Regret}(C_y, C_x)
\end{aligned}
\]

Since D is symmetric, ⟨C_x, C_y⟩ is as likely as ⟨C_y, C_x⟩. Combined with the anti-symmetry of regret, this implies that the expected regret must be zero:

\[
\begin{aligned}
\mathbb{E}_{C_x, C_y \sim D}[\mathrm{Regret}(C_x, C_y)]
  &= \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}[\mathrm{Regret}(C_x, C_y)] + \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}[\mathrm{Regret}(C_y, C_x)] \\
  &= \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}[\mathrm{Regret}(C_x, C_y)] - \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}[\mathrm{Regret}(C_x, C_y)] \\
  &= 0
\end{aligned}
\]

An analogous proof would work for any anti-symmetric measure (including entropy change).
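The cancellation in this proof can be checked by brute force on a toy problem. The sketch below enumerates a uniform (hence symmetric) distribution over ordered pairs of subsets, keeps only pairs satisfying the theorem's assumption c_{C_x} = c_{C_y}, and averages the regret; `return_after_inference(C, c)` is a hypothetical stand-in for R(C, c), i.e. RRiC inference with choice set C and feedback c followed by acting on the inferred reward.

```python
from itertools import combinations

def expected_regret(master_set, chosen, return_after_inference):
    """Average Regret(C_x, C_y) = R(C_x, c) - R(C_y, c) over all ordered pairs of
    non-empty subsets of `master_set` on which the human's choice agrees.

    chosen(C):                    the feedback the human gives from choice set C
    return_after_inference(C, c): hypothetical R(C, c) from the proof of Theorem 1
    """
    subsets = [frozenset(s)
               for r in range(1, len(master_set) + 1)
               for s in combinations(master_set, r)]
    total, count = 0.0, 0
    for C_x in subsets:
        for C_y in subsets:
            if chosen(C_x) != chosen(C_y):
                continue                      # theorem's assumption: c_{C_x} = c_{C_y}
            c = chosen(C_x)
            total += return_after_inference(C_x, c) - return_after_inference(C_y, c)
            count += 1
    # Every kept pair (C_x, C_y) is matched by (C_y, C_x) with the opposite sign,
    # so the sum cancels exactly and the average is zero.
    return total / count
```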
5.4   Worst Case
As shown in Table 4, class B3 misspecification can induce regret an order of magnitude worse than the maximum regret induced by classes A3 and B2, which each differ from B3 along a single axis. This is because the worst case inference occurs in RRiC when the human feedback c_H is the worst element of C_R, and this is only possible in class B3. In class B2, C_R contains all of C_H, so as long as |C_H| > 1, C_R must contain at least one element worse than c_H. In class A3, c_H = c*, so C_R cannot contain any elements better than c_H. However, in class B3, C_R need not contain any elements worse than c_H, in which case the robot updates its belief in the opposite direction from the ground truth.

   Class     Mean      Std       Max        Min
   A3       −0.001     0.5964     1.1689   −1.1058
   B2        0.228     0.6395     1.6358   −0.9973
   B3        2.059     6.3767    24.7252   −0.9973

Table 4: Regret comparison showing that class B3 has much higher regret than neighboring classes.

   For example, consider the sample human choice set in Fig 7a. Both trajectories are particularly poor, but the human chooses the demonstration c_H in Fig 7b because it encounters slightly less lava and so has a marginally higher reward. Fig 8a shows a potential corresponding robot choice set C_R2 from B2, containing both trajectories from the human choice set as well as a few others. Fig 8b shows P(θ | c_H, C_R2). The axes represent the weights on the lava and alive features, and the space of possible parameterizations lies on the circle where w_lava² + w_alive² = 1. The opacity of the gold line is proportional to the weight that P(θ) places on each parameter combination. The true reward has w_lava, w_alive < 0, whereas the peak of this distribution has w_lava < 0, but w_alive > 0. This is because C_R2 contains shorter trajectories that encounter the same amount of lava, and so the robot infers that c_H must be preferred in large part due to its length.

Figure 7: Example human choice set and corresponding feedback (panels: (a) C_H, (b) c_H).

Figure 8: Robot choice set and resulting misspecified belief in B2 (panels: (a) C_R2, (b) P(θ | c_H, C_R2)).

   Fig 9a shows an example robot choice set C_R3 from B3, and Fig 9b shows the inferred P(θ | c_H, C_R3). Note that the peak of this distribution has w_lava, w_alive > 0. Since c_H is the longest and the highest-lava trajectory in C_R3, and alternative shorter and lower-lava trajectories exist in C_R3, the robot infers that the human is attempting to maximize both trajectory length and lava encountered: the opposite of the truth. Unsurprisingly, maximizing expected reward for this belief leads to high regret. The key difference between B2 and B3 is that c_H is the lowest-reward element in C_R3, resulting in the robot updating directly away from the true reward.

Figure 9: Robot choice set and resulting misspecified belief in B3 (panels: (a) C_R3, (b) P(θ | c_H, C_R3)).

6   Discussion
Summary In this work, we highlighted the problem of choice set misspecification in generalized reward inference, where a human gives feedback selected from choice set C_Human but the robot assumes that the human was choosing from choice set C_Robot. As expected, such misspecification on average induces suboptimal behavior resulting in regret. However, a different story emerged once we distinguished between misspecification classes. We defined five distinct classes varying along two axes: the relationship between C_Human and C_Robot and the location of the optimal element of feedback c*. We empirically showed that different classes lead to different types of error, with some classes leading to overconfidence, some to underconfidence, and one to particularly high regret. Surprisingly, under certain conditions the expected regret under choice set misspecification is actually 0, meaning that in expectation, misspecification does not hurt in these situations.
Implications There is wide variance across the different types of choice set misspecification: some may have particularly detrimental effects, and others may not be harmful at all. This suggests strategies for designing robot choice sets to minimize the impact of misspecification. For example, we find that regret tends to be negative (that is, misspecification is helpful) when the optimal element of feedback is in both C_Robot and C_Human and C_Robot ⊃ C_Human (class A2). Similarly, worst-case inference occurs when the optimal element of feedback is in C_Robot only, and C_Human contains elements that are not in C_Robot (class B3). This suggests that
erring on the side of specifying a large C_Robot, which makes A2 more likely and B3 less, may lead to more benign misspecification. Moreover, it may be possible to design protocols for the robot to identify unrealistic choice set-feedback combinations and verify its choice set with the human, reducing the likelihood of misspecification in the first place. We plan to investigate this in future work.
Limitations and future work. In this paper, we primarily sampled choice sets randomly from the master choice set of all possibly optimal demonstrations. However, this is not a realistic model. In future work, we plan to select human choice sets based on actual human biases to improve ecological validity. We also plan to test this classification and our resulting conclusions in more complex and realistic environments. Eventually, we plan to work on active learning protocols that allow the robot to identify when its choice set is misspecified and alter its beliefs accordingly.

References
[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1, 2004.
[Bajcsy et al., 2017] Andrea Bajcsy, Dylan P Losey, Marcia K O'Malley, and Anca D Dragan. Learning robot objectives from physical human interaction. Proceedings of Machine Learning Research, 78:217–226, 2017.
[Bobu et al., 2020a] Andreea Bobu, Andrea Bajcsy, Jaime F Fisac, Sampada Deglurkar, and Anca D Dragan. Quantifying hypothesis space misspecification in learning from human–robot demonstrations and physical corrections. IEEE Transactions on Robotics, 2020.
[Bobu et al., 2020b] Andreea Bobu, Dexter RR Scobee, Jaime F Fisac, S Shankar Sastry, and Anca D Dragan. Less is more: Rethinking probabilistic models of human behavior. arXiv preprint arXiv:2001.04465, 2020.
[Choi et al., 2014] Syngjoo Choi, Shachar Kariv, Wieland Müller, and Dan Silverman. Who is (more) rational? American Economic Review, 104(6):1518–50, 2014.
[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 2017-Decem:4300–4308, 2017.
[Evans et al., 2015] Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman. Learning the Preferences of Ignorant, Inconsistent Agents. arXiv, 2015.
[Goyal et al., 2019] Prasoon Goyal, Scott Niekum, and Raymond J Mooney. Using natural language for reward shaping in reinforcement learning. arXiv preprint arXiv:1903.02020, 2019.
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909–3917, 2016.
[Hadfield-Menell et al., 2017] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in neural information processing systems, pages 6765–6774, 2017.
[Jeon et al., 2020] Hong Jun Jeon, Smitha Milli, and Anca D Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. arXiv preprint arXiv:2002.04833, 2020.
[Kahneman and Tversky, 1979] Daniel Kahneman and Amos Tversky. Prospect Theory: An Analysis of Decision Under Risk. Econometrica, 47(2):263–292, 1979.
[Krakovna, 2018] Victoria Krakovna. Specification gaming examples in ai, April 2018.
[Leike et al., 2018] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv, 2018.
[Majumdar et al., 2017] Anirudha Majumdar, Sumeet Singh, Ajay Mandlekar, and Marco Pavone. Risk-sensitive inverse reinforcement learning via coherent risk models. In Robotics: Science and Systems, 2017.
[Milli and Dragan, 2019] Smitha Milli and Anca D Dragan. Literal or pedagogic human? analyzing human model misspecification in objective learning. arXiv preprint arXiv:1903.03877, 2019.
[Mindermann et al., 2018] Sören Mindermann, Rohin Shah, Adam Gleave, and Dylan Hadfield-Menell. Active inverse reward design. arXiv preprint arXiv:1809.03060, 2018.
[Ng and Russell, 2000] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.
[Reddy et al., 2018] Sid Reddy, Anca Dragan, and Sergey Levine. Where do you think you're going?: Inferring beliefs about dynamics from behavior. In Advances in Neural Information Processing Systems, pages 1454–1465, 2018.
[Sadigh et al., 2017] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.
[Shah et al., 2019] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Preferences implicit in the state of the world. arXiv preprint arXiv:1902.04198, 2019.
[Steinhardt and Evans, 2017] Jacob Steinhardt and Owain Evans. Model mis-specification and inverse reinforcement learning, Feb 2017.
[Waugh et al., 2013] Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.
[Ziebart, 2010] Brian D Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.