<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Robust Motion Planning and Safety Benchmarking in Human Workspaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shih-Yun Lo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shani Alkoby</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Stone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>shani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>pstoneg@cs.utexas.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Learning Agent Research Group, The University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>It is becoming increasingly feasible for robots to share a workspace with humans. However, for them to do so robustly, they need to be able to smoothly handle the dynamism and uncertainty caused by human motions, and efficiently adapt to newly observed event. While Markov Decision Processes (MDPs) serve as a common model for formulating cost-based approaches for robot planning, other agents are often modeled as part of the environment for the purpose of collision avoidance. This practice, however, has been shown to generate plans that are too inconsistent for humans to confidently interact with. In this work, we show how modeling other agents as part of the environment makes the problem ill-posed, and propose to instead model robot planning in human workspaces as a Stochastic Game. We thus propose a planner with safety guarantees while avoiding overly conservative behavior. Finally, we benchmark the evaluation process in the face of pedestrian modeling error, which has been identified as a major concern in state-of-the-art approaches for robot planning in human workspaces. We evaluate our approach with diverse pedestrian models based on real-world observations, and show that our approach is collision-safe when encountering various pedestrian behaviors, even when given inaccurate predictive models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Planning has a long history in the robotics community,
where efficiency, environmental uncertainty, and motion
feasibility all central concerns. Cost-based approaches that use
the MDP formulation
        <xref ref-type="bibr" rid="ref33">(Watkins and Dayan 1992)</xref>
        can be
successful for robot planning in static workspaces
        <xref ref-type="bibr" rid="ref23 ref24 ref25">(Quinlan and
Khatib 1993)</xref>
        . When planning in highly-dynamic
environments, however, this formulation is limited by its lack of
consideration of the dynamic environmental properties.
      </p>
      <p>More specifically, in MDPs, there exist intrinsic static
environment assumptions, upon which the solutions are built.
Those assumptions are: 1.a time-invariant state transition
function, 2. a time-invariant state-action reward function,
and therefore 3. a time-invariant state value function. As
those assumptions no longer hold in dynamic environments,
the accuracy of policy evaluation for long-horizon planning
deteriorates when using these methods. As a result,
MDPbased approaches for the traditional motion planning
literature suffer from poor performance when applied in the wild.
Copyright reserved by authors.</p>
      <p>
        Common solutions to the above disadvantages include
frequent replanning and finite-horizon planning; those
approaches update local information of the environment based
on online observations, and replan periodically based on
newly observed environmental conditions. Nevertheless, the
lack of awareness of future conditions causes the
planner to make shortsighted decisions (which leads to
socially incompetent behavior for human interaction
        <xref ref-type="bibr" rid="ref11 ref12">(Kruse
et al. 2012)</xref>
        ), or overly conservative behavior when
considering long-horizon planning (referred to as the
freezingrobot problem
        <xref ref-type="bibr" rid="ref27 ref29">(Trautman and Krause 2010)</xref>
        ). One example
is the commonly-seen flow-following strategy in crowd
navigation
        <xref ref-type="bibr" rid="ref5">(Helbing and Molnar 1995)</xref>
        , where people follow
one another to reach shared short-term subgoals. This
strategy relies on policy evaluation based on the future paths of
nearby agents. With static cost formulation being applied in
dynamic environments, the inaccuracy of policy evaluation
makes traditional planning algorithms fail to produce paths
with similar performance.
      </p>
      <p>
        Therefore, in this work, we first propose to
formulate robot planning in human environments as a
multiagent planning problem using Stochastic Games, or Markov
Games
        <xref ref-type="bibr" rid="ref15">(Littman 1994)</xref>
        , to compute the dynamic state-action
values that are also influenced by the states and actions of
other agents (here, the humans). We use this formulation
to design an online algorithm which incorporates the other
agents’ actions into the planning process for action
evaluation.
      </p>
      <p>For achieving the goal of robots being able (and allowed)
to smoothly steer around humans, safety guarantees are also
critically important. Traditional methods often use a
worstcase assumption to ensure safety. This assumption, however,
leads to overly conservative planning behaviors, degrading
the smoothness of the robot motions among humans.
Leveraging the notion of Stochastic Games, we propose to plan
based on worst-case predictions only in period games that
have a critical impact on the termination values; we seek to
plan carefully only when it matters and achieves safe yet not
overly conservative behavior.</p>
      <p>Finally, we evaluate our approach using the crowd
navigation domain, and discuss potential performance metrics for
robot planning in human workspaces. As humans are
adaptive and have heterogeneous behaviors, a perfect human
behavior model is likely never available; modeling error then
seems inevitable for robots to deal with on-the-fly, and we
need to quantify its impact on plan quality into the
evaluation process. To evaluate this before deploying robots into
the wild, we simulate different human behaviors, which are
based on real-world observations of pedestrian interactions
with robots, and use them to evaluate our approach in
unanticipated scenarios caused by modeling errors. We use these
scenarios to benchmark the evaluation process in
simulation, and show that our planner maintains collision-safe even
evaluated with different pedestrian models.</p>
      <p>2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In crowd navigation domains, the freezing robot problem
arises from the challenge of planning while considering
the time-variant crowd-interactive dynamics in the
environment
        <xref ref-type="bibr" rid="ref27 ref29">(Trautman and Krause 2010)</xref>
        . Despite efforts to
introduce human factors into the planning process, traditional
planning approaches which incorporate humans as a part of
the environment for collision avoidance have been shown
to generate motions that are neither interpretable nor
socially competent
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref16">(Lichtentha¨ler, Lorenzy, and Kirsch 2012;
Kruse et al. 2012)</xref>
        .
      </p>
      <p>
        For incorporating the future motions of other agents to
avoid collisions, the reciprocal n-body collision avoidance
approach has had success in the multi-agent setting
        <xref ref-type="bibr" rid="ref18 ref32">(Van
Den Berg et al. 2011)</xref>
        , where individuals assume the
others move at constant speeds. This approach, and the dynamic
window approach
        <xref ref-type="bibr" rid="ref2">(Fox, Burgard, and Thrun 1997)</xref>
        , are
commonly used for low-level safety checks when planning in
dynamic environments. However, due to the constant-speed
assumption, the approach is overly conservative when
interacting with real-world pedestrians, who are interactive and
respond to the robot’s motions.
      </p>
      <p>
        Recently, a community proposed to solve robot
planning in human workspaces as a joint multi-agent
dynamics learning problem, and uses the predicted motions of all
agents to plan for the robot, as if it was one of the
members of the crowd
        <xref ref-type="bibr" rid="ref11 ref12 ref17 ref27 ref29">(Trautman and Krause 2010; Kuderer
et al. 2012; Mavrogiannis and Knepper 2016)</xref>
        . Such joint
modeling methods have been shown to be effective at
outputting smooth human-mimicking trajectories, as they can
capture the interactive dynamics among pedestrians. One
major drawback of these approaches, however, is that the
multi-agent interactive dynamics are typically learnt from
data collected by human demonstrations while interacting
with other humans; but humans do not act the same way
around a robot compared to how they act in fully human
environments. This problem has been shown to render those
methods ineffective in scenarios where humans exhibit
different behaviors around robots – behaviors that humans will
not present in front of another human
        <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Pfeiffer et al. 2016)</xref>
        .
      </p>
      <p>To model joint behaviors among agents – how one’s
action affects that of the others – another approach is to
incorporate other agents’ actions into the formulation of the
individual’s action value function. Such joint behavior
formulation is widely studied in Game theory: with different
player strategies, the interaction among agents evolve over
time and result in different outcomes.</p>
      <p>
        For interactive agent designs in video games, the
dynamics of other agents have been introduced into MDP
models to simulate multi-agent planning performance
        <xref ref-type="bibr" rid="ref13 ref16 ref18 ref32">(Nguyen
et al. 2011; Macindoe, Kaelbling, and Lozano-Pe´rez 2012)</xref>
        .
In these formulations, the AI agent’s current actions are
assumed to be known by the humans to forward simulate their
potential policies. This assumption implies a turn-taking
setting, assuming full observability of the past strategies
of others, including the current action. This assumption is
however not valid when planning in the real world, as
humans and robots act simultaneously
        <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Sadigh et al. 2016)</xref>
        .
The turn-taking formulation is therefore not accurate for
describing real-time interaction. For the game-theoretic
formulation to be introduced along with Markov Decision
Processes, Markov Games have been proposed as a framework
for multi-agent reinforcement learning
        <xref ref-type="bibr" rid="ref15">(Littman 1994)</xref>
        , to
study the interactions among multiple learning agents. In
this formulation, as in Stochastic Games in Game Theory,
agents act at once and contribute to the joint reward of one
another. A number of studies have been proposed in this
field, such as on how one agent’s learning affects the final
outcome and how the other agents should learn at the same
time
        <xref ref-type="bibr" rid="ref1">(Foerster et al. 2018)</xref>
        . However, to accurately
simulate human interactions with the robot, one requires agents
modeled after humans, who have different learning
mechanisms and decision-making processes from reinforcement
learning agents. As humans present heterogeneous
behaviors, prior work has proposed to estimate pedestrian types
online based on pre-defined models
        <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Godoy et al. 2016)</xref>
        ;
here, instead of trying to perfectly predict humans, we
address the inevitable modeling errors by evaluating our
approach with various pedestrian models. This evaluation
procedure helps bridge the gap between simulation and
realworld deployment by evaluating plans under unexpected
conditions
        <xref ref-type="bibr" rid="ref3">(Fraichard 2007)</xref>
        . Our proposed planner
experienced zero collision under this evaluation procedure; with
the demonstrated robustness under modeling errors, we
suggest this criteria better ensures safety in real-world
deployment, where unanticipated scenarios are of major concerns.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem Formulation</title>
      <p>In this section, We first consider a general
costminimization-based formulation for robot planning in
human workspaces. We then introduce the dynamic
environment dilemma which is the motivation for our proposed new
framework. Finally, we discuss the multi-agent nature of the
problem and provide the solution formulation of multi-agent
planning problem.</p>
      <sec id="sec-3-1">
        <title>General Cost-Minimization-Based Formulation</title>
        <p>The robot has its state xt at time t defined in the state space,
xt 2 X, and its action at at time t defined in the action
space, at 2 A. The collision-free workspace is defined as
a subset of the overall workspace Wfree W , which is
defined as the feasibly reachable space given robot
kinematics. The robot motion planning problem is defined to
minimize the accumulated travel cost Ct, such that the robot’s
final state xT ends in the specified goal configuration set
XG Wfree:
s:t:
at:T = argmin tT Ct;</p>
        <p>at:T
xT 2 XG;
xt:T 2 Wfree;
xt+1 = T (xt; at); 8t;
where T is the state transition function (or robot dynamics
function). The sequence of a variable v from t to T is
denoted by vt:T .</p>
        <p>A common approach for solving the motion planning
problem is to assume Ct is a function of the state-action pair,
xt and at, represented by Cost(xt; at); we can then assign
a negative end state cost CT for arriving at the goal,
represented by Costto go(xT ), and transitions out of free space
Wfree to be of high cost to ensure safety. For example:
at:T = argmin tT Cost(xt; at) + Costto go(xT );
at:T
s:t:</p>
        <p>xt+1 = T (xt; at); 8t
where</p>
        <p>Costto go(xT ) =</p>
        <p>1000; xT 2 XG;</p>
        <p>Cost(xt; at) = 1; 8xt 2= Wfree;
to encourage goal-reaching and collision-avoidant behavior.
We then can apply some sequential optimizer to solve for
the optimal action sequence at:T which follows the state
transition constraint and minimizes the overall travel cost,
with guarantees to reach the goal in a collision-safe manner.
This common formulation follows the MDP setting, which
we later refer to as the single-agent MDP formulation and
its solution at time t is the single-agent optimal policy.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The Dynamic Environment Dilemma</title>
        <p>The general formulation for robot planning in human
workspaces laid out above relies on the assumption that objects in
the robot’s environment are static. In many scenarios
however, this does not hold. When encountering dynamic objects
in the environment, Wfree changes over time. This means
(1)
(2)
that the optimal sequence solved for time t may no longer
hold at time t + 1. An illustration of this challenge is shown
in Fig. 1-Right. The problem raised by introducing dynamic
objects into the environment is referred to here as a violation
of the static environment assumption in the MDP
formulation: with Wfree being time-variant, Cost becomes
timevariant as well.</p>
        <p>
          In the motion planning literature, online replanning
is a common practice to deal with dynamic
environments
          <xref ref-type="bibr" rid="ref23 ref24 ref25 ref8">(Koenig and Likhachev 2002; Quinlan and Khatib
1993)</xref>
          , which however leads to inefficient, inconsistent, and
even awkward motions for humans to confidently interact
with
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref16">(Lichtentha¨ler, Lorenzy, and Kirsch 2012; Kruse et al.
2012)</xref>
          , as shown in Fig. 1. For long-horizon planning, to
ensure collision safety, overly-conservative behavior arises,
due to the inability to incorporate future variations into the
cost function formulation, as shown in Fig. 2.
        </p>
        <p>We refer the situation as the dynamic environment
dilemma. To resolve this, instead of introducing other
agents as part of the environment and solving the problem
in a single-agent MDP setting, we propose to re-define the
planning domain using a multi-agent MDP setting, which
restores the validity of the static environment assumptions
by considering the simultaneous actions of other agents and
their joint state space.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Multi-agent MDPs vs. Stochastic Games</title>
        <p>
          In interactive agent design and human-robot interaction,
multi-agent MDPs (MAMDPs) can be used to forward
simulate human policies by chaining human policy
simulation after robot state transitions
          <xref ref-type="bibr" rid="ref13 ref16 ref19 ref21 ref26 ref4">(Macindoe, Kaelbling, and
Lozano-Pe´rez 2012; Nikolaidis et al. 2016)</xref>
          . This model
assumes that human agents have access to the action the AI
agent is about to make, as if they are omniscient. The
underlying game setting follows the turn-taking formulation.
        </p>
        <p>
          However, in reality, agents act at the same time
(illustrated in Fig. 3); in the literature of robot planning in
human workspaces
          <xref ref-type="bibr" rid="ref10 ref17 ref27 ref29">(Kretzschmar, Kuderer, and Burgard 2014;
Trautman and Krause 2010; Mavrogiannis and Knepper
2016)</xref>
          , state-of-the-art approaches generate smooth motions
in real-world interaction by planning while concerning the
effects of the simultaneous actions of other agents
          <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Sadigh et
al. 2016)</xref>
          . Our proposed dynamic environment dilemma
generalizes the issues addressed in these work to improve
realworld planning in human environments. And to resolve the
dilemma, we propose to model the problem using
Stochastic Games from the Game Theory literature, in which the
cost/reward received after one acts is dependent on other
agents’ states and actions at the same time. This is also true
for the state-action value function Q and state value function
V .
        </p>
        <p>In Stochastic Games, N agents act at the same time t:
the joint-action at = (at1; at2; :::; atN ) 2 A is defined in the
joint action spaces of all agents A = A1 A2::: AN ; the
joint-state xt = (xt1; xt2; :::; xtN ) 2 X is defined in the joint
state space of all agents X = X1 X2::: XN . The state
transition function: T : X A ! X affects the utility of
each agent under the same joint-action over time. Time is
discretized, and game periods are defined as follows. At the
start of each period t, each agent selects an action ait; i =
1 : N ; the transition function T takes in the current state xt
and determines (probablistically) the state at the beginning
of the next period xt+1. The game starts at the initial period
t = 0 and terminates at the final period t = T . The duration
of each period is selected along with the lookahead H based
on the computation can be afforded while maintaining
realtime performance, described in detail in Sec. 5. 1</p>
        <p>In Markov Games, the time-variant utilities are referred
to as rewards (inverse of Cost), which depend on the
jointstates xt.2 The reward rit 2 R of an agent i at time t is
defined as follows: rit = ri(xt; ait; at i); where ri is the agent’s
reward function, and at i denotes actions of all agents
except agent i. This formulation conveniently incorporates xt,
the time-variant states of other agents, into the reward/cost
formulation, which resolves the dynamic workspace issue.
It brings out the notion of planning while considering the
effects of the simultaneous actions of other agents, upon
which we base our solution to robot planning in human
workspaces.</p>
        <p>1Here we use periods instead of horizons as in the planning
literature, to distinguish our multi-agent formulation from the
traditional single-agent setting.</p>
        <p>2We start with Cost for consistency with traditional planning
literature, and continue with the notion of reward for convenience
of consistency with MDP planning literature.</p>
      </sec>
      <sec id="sec-3-4">
        <title>The Optimal Solution Formulation</title>
        <p>To find the optimal policy of this multi-agent MDP planning
problem, we first define a multi-agent policy of agent i as
i, which takes in the joint-state xt and outputs agent
action ait at time t. We first consider the known policies of the
other agents i; the state-action value function Q of agent
i executing policy i while other agents follow policy i
is defined as3:</p>
        <p>i
Q ij
i
(xt; ait; at i) = ri(xt; ait; at i)+Ext+1 [V j
(xt+1)];
(3)
where V ij i is the value function V of agent i executing
policy i while others using i. Note that the expectation is
taken over xt+1 conditioned on the state transition function
T , which is omitted throughout the paper for simplicity. The
value function of the optimal policy of agent i, when other
agents execute i, is defined as:
i
V ij
i
(xt) = maxEa i
ait t
i(xt)[Qij
i
(xt; ait; at i)]; (4)
where Qij i , the optimal state-action value of agent i given
i, is defined recursively:
i
Q j
i
(xt) = maxEa i
ait t
i(xt)[ri(xt; ait; at i)+V ij
i
(xt+1)]:
(5)
(6)
The optimal action of agent i at time t is therefore:
ait = argmax Ea i
ait t
i(xt)[Qij
i
(xt; ait; at i)]:
Note that the optimal action ait is defined in the joint state
space X, and it depends on agent i’s estimate of other
agents’ policies i.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Planning in Real World</title>
        <p>
          Despite the benefits of the Stochastic Game framework for
robot planning in human environments, there remain
challenges to deploying robots in the real world. In this work,
we propose solutions through planning and evaluation
approaches, to address safety guarantees, and evaluation under
modeling errors, as detailed below:
Imperfect Prediction of Human Behaviors While in
principle, optimizing Eq. 6 leads to an optimal solution to
robot planning, note that it relies on having an accurate
model of pedestrian motion i. In practice, there is no
easy way to obtain such a model, as people exhibit
heterogeneous
          <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Godoy et al. 2016)</xref>
          and highly adaptive
behavior
          <xref ref-type="bibr" rid="ref19 ref21 ref26 ref4">(Nikolaidis et al. 2016)</xref>
          . Since planning with incorrect
models could lead to safety concerns in human workspaces,
in this paper we seek planning methods that are robust to
modeling errors in the pedestrian transition function.
        </p>
        <p>3The state-action value function Q in the single-agent setting is
generally defined as: Q(xt+1) = rt(xt; at) + V (xt+1). We later
refer Qi as the single-agent state-action value of agent i.</p>
        <p>Performance Degrades for Safety Guarantees To
ensure collision safety, traditional robust planning often
formulates the problem through the maximin operation, often
leading to overly conservative behaviors. This property
prevents the robot from navigating agilely and smoothly among
human crowds, for which we propose improvement while
still guaranteeing safety. This is detailed in Sec. 4. Here, we
define safety guarantees as ensuring that the robot does not
go within a safety margin from any human while exceeding
a critical safety speed, as detailed in Sec. 5.</p>
        <p>Evaluation in Simulation Due to the inevitable
prediction errors in planning in human environments, evaluation
results in simulation often cannot provide sufficient
information for real-world deployments. We therefore seek new
metrics that better inform the smoothness, efficiency, and safety
criteria from simulation trials. The proposed metrics and the
experimental results of our proposed approach are shown in
Sec. 6 and Sec. 7.</p>
        <p>4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>In this section, we first introduce our proposed planning
method based on Stochastic Games. We use tree search, for
the convenience of forward simulation/state transitions in
MDPs with robot non-holonomic constraints, upon which
we introduce an approach featuring robust collision-safe
planning, to resolve the challenges raised in Sec. 3 regarding
robot planning in the real world.</p>
      <sec id="sec-4-1">
        <title>Tree Structure with All-agent Rollout</title>
        <p>A tree starts with a root node xt, and it expands by forward
simulating the state-action pair (possibly through a
stochastic function): xt+1 T (xt; at), and a reward rt = r(xt; at)
is received. An illustration of a graphical model of
singleagent MDP is shown in Fig. 4: Top-Left.</p>
        <p>However, when planning in Markov Games, both the
reward function ri(xt; ait; at i) and the transition function
T (xt; ait; at i) involve other agents’ actions, which are not
available until they are observed (Fig. 4: Top-Right).
Therefore, we need to sample their potential actions for reward
estimate r^ti and state transition estimate x^t+1 2 X , (Fig. 4:
Bottom-Right), to expand the tree with all-agent rollout
(Fig. 4: Bottom-Left). Each node then maintains a number
of samples of belief actions of other agents, which can be
later corrected online when new observations come in. The
reward rti is then estimated through the unweighted sample
averages: r^ti = K1 PkK=1 ri(xt; ait; at;ki ); and the joint state
xt+1 follows the same fashion:
(7)
x^t+1 =
1 XK T (xt; [ait; at;ki ]):
K</p>
        <p>
          k=1
The approach is closely related to Partially Observable
Monte Carlo Planning
          <xref ref-type="bibr" rid="ref27 ref29">(Silver and Veness 2010)</xref>
          , where the
belief state is maintained by K samples.
        </p>
        <p>Compared to the turn-taking formulation introduced in
Sec. 3, to roll out the multi-agent state transition, our
allagent rollout better captures the simultaneous-action nature
of real-time interactions, as illustrated in Fig. 3.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Planning for Collision Coordination: A</title>
      </sec>
      <sec id="sec-4-3">
        <title>Finite-period Game</title>
        <p>When sharing a workspace, agents have to plan to avoid
colliding with one another; in situations where they are aware
of a potential collision, agents should adjust their motions
early, to coordinate passing smoothly. The value of the
final passing depends on the joint-actions of previous periods;
planning lasts for a finite number of periods until the
collision threat is resolved. We define the game to terminate once
agents’ goal-oriented policies no longer lower the values of
each other 4.</p>
        <p>During the period before termination, which we refer to
as the final period, agents receive immediate penalties
(negative reward signals) due to the expected collision when
following their goal-oriented policies. They therefore need to
make sure they coordinate in the final period, to avoid this
high cost. Beyond that point, agents can safely recover back
to their goal-oriented policies. Therefore, when planning for
collision coordination, we only need to consider the
cumulative rewards up to time t = T 1, and the final-period
coordination value QiT :
i
a0:T = argmai0a:Tx Ea0:iT ;x0:T j</p>
        <p>T 1
i [X ri(xt; ait; at i) + QiT (xT ; aiT ; aT i)]:
t=0</p>
        <p>(8)
Note that, instead of QiTj i , we use QiT , since we expect no
interaction after the final period 5. An example of the
finalperiod action value is shown in Fig. 5.</p>
        <p>4The goal-oriented policy is the single-agent optimal
policy without considering the dynamic objects (here, humans) in
the environment for Wfree construction. It maintains the
staticenvironment assumptions, resulting in a policy that acts as if no
other agents are around
5This assumption holds among goal-oriented agents, but does</p>
        <p>(9)
which however results in overly conservative behavior,
commonly seen in robust planning. To avoid such behavior, we
leverage the following observation: when planning for
collision coordination, the large collision penalty does not apply
until the final period, and thus the planner does not need to
plan conservatively until then. Therefore, for collision-safe
yet not overly conservative planning, we propose to use the
average reward for t = 0 : T 1 as in Eq. 8, and use the
worst-case state-action value estimate at t = T :
A Game-theoretic Decision-making Model Figure 6: Example trajectories under different robot
objective functions (x-y in meters): a robot goes from x toward
Robust Planning for Safety Guarantees +x while avoiding a human going from y toward +y at
To ensure collision safety, a common practice is to use x = 4:8 (indicated by the solid blue line). Trajectories end
a worst-case analysis for reward estimates. Following our after 12 sec. In the initial condition the robot will arrive at
finite-period planning, the objective becomes: the intersection slightly later than the human. The
altruistic setting optimizes the pedestrian’s efficiency much more</p>
        <p>T 1 than that of the robot, resulting in yielding behaviors that
a0:T = argmax minEx0:T [X ri(xt; ait; at i) + QiT (xT ; aiT ; aT i)];always waits until the pedestrian passes and results; the
agi
ai0:T a0:iT t=0 gressive setting is the opposite, resulting in high robot travel
efficiency by reaching the furthest at the end. The
cooperative setting puts equal weights on the efficiency of both
agents, and produces passing behaviors that less hinder the
pedestrian considering his/her future trajectories.</p>
        <p>i
a0:T = argmai0a:Tx Ea0:iT ;x0:T j i</p>
        <p>T 1
[Xri(xt; ait; at i)]
t=0
+minQi(xT ; aiT ; aT i);
aT i</p>
        <p>(10)</p>
      </sec>
      <sec id="sec-4-4">
        <title>We refer this notion as to plan carefully only when it matters.</title>
        <p>5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Problem Instantiation</title>
      <p>We instantiate our problem formulation in the navigation
domain, where a robot moves in a shared workspace with
humans. This scenario is motivated by service or guidance
robots in malls and museums. Although service robots are
designed to assist humans, they have specified users who
may have conflicting interests with other users. For example,
a guidance robot may have path conflicts with people it is
not leading. We therefore define robot reward function to
reflect a weighted value (by parameter w) among the
individual interests of all parties involved. With the weight mostly
on the robot itself, it leads to aggressive behavior; with it
evenly on all agents, it leads to collaborative behavior; and
with it mostly on others, it presents altruistic behavior. An
example is shown in Fig. 6. In this section we consider ways
not hold among adversarial agents, who intentionally block the
others even when their paths are clear. We do not consider adversarial
agents here.
to reduce the computational complexity while planning
online, and provide a description of the search algorithm used.
Since pedestrian modeling is centrally important to our
experiments, we devote a separate section to it (Sec. 6).</p>
      <sec id="sec-5-1">
        <title>Robot Motion Generation</title>
        <p>
          In our implementation, a robot stays at a nominal speed
during normal navigation, and can vary its speed between
[0.3, 1.3] m=s and its acceleration between [-0.4, 0.4] m=s2,
in collision-dangerous situations. Collision-dangerous
situations are defined as a potential collision will occur within 3s
when both agents follow self-interested policies. Our choice
of speed, acceleration and time frame prior to a collision
follows the study on pedestrian crossing for crowd
simulation
          <xref ref-type="bibr" rid="ref20">(Paris, Pettre´, and Donikian 2007)</xref>
          , in which it is found
that humans start adapting their motions about 2:5s ahead of
a potential collision in the most collision-dangerous
situations (with two pedestrian having the same estimated arrival
time at their path intersection).
        </p>
        <p>In the nominal operation phase, we sample navigation
actions with constant heading acceleration (rotation around
z axis) of [-15, 15] deg=s2. This encourages path
exploration under constant speed (0.7 m=s). In collision
avoidance phases, we sample quartic (4th-order) polynomials
with safety margin (estimated minimum distance with the
other agent, using his/her current speed) of [0.2, 1.8] m.</p>
        <p>
          During node expansions, new robot actions and human
actions are sampled, and potential collisions are checked
under the condition that the distance between two agents
is within 0.9 m while the velocity of the robot is greater
than 0.3 m=s. We define the critical safety speed as 0.3 m=s
(defined as the safety action as) instead of stopping, since
people are very capable of avoiding low-speed objects.
Additionally, full stops are considered to be unnatural in
human crowds
          <xref ref-type="bibr" rid="ref27 ref29">(Trautman and Krause 2010)</xref>
          . Only valid nodes
(those satisfying a collision-free check) are added into the
tree.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Online Computation</title>
        <p>We plan in a receding-horizon fashion. This means that the
planner replans online at each period, up to the final period,
estimated based on forward simulation. To ensure search is
complete and the solution is returned at each period, the
planner needs to balance the computation of each run, which
depends on the number of particles K, sampled actions jAj,
and search depth H (or lookahead, periods to t = T ), with
the complexity of: KjAjH 6.</p>
        <p>
          Here we use K = 1 in our implementation when t &lt; T ,
which leads to the nominal/most-likely actions being
selected, as if we assume maximum likelihood observations
in belief space planning
          <xref ref-type="bibr" rid="ref22">(Platt Jr et al. 2010)</xref>
          . We do so until
it is the final period t = T , and apply K = 10 to
sample for worst-case estimates, to maintain the safety
mechanism from Eq. 10. We sample 5 15 actions depending on
how crucial the state is to collision coordination. Finally, we
consider H = 3, to keep the worst-case complexity under
500 nodes to search for for real-time computation. The time
duration of each period is 1 sec, to ensure the robot starts
planning 4 secs before the estimated collision timing, since
humans usually reacting to collision threats 2:5s ahead on
average
          <xref ref-type="bibr" rid="ref20">(Paris, Pettre´, and Donikian 2007)</xref>
          . Due to the tree
structure, once a node is visited, it is put in the closed set
(for the evaluated nodes), but not considered in the future
for re-evaluation.
        </p>
        <p>6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Human Behavior Modeling</title>
      <p>
        It is known that humans have heterogeneous behaviors.
Moreover, as suggested in state-of-the-art approaches that
have been deployed for real-world interactions
        <xref ref-type="bibr" rid="ref19 ref21 ref26 ref31 ref4">(Trautman
et al. 2015; Pfeiffer et al. 2016)</xref>
        , humans interact with robots
much differently than with other humans. As a result, no
sufficiently accurate human model is available for robot
planning. Thus, in order to obtain a level of assurance that the
robot will behave safely when deployed in the wild, it is
important to evaluate it when its human simulation model is
inaccurate. To construct such scenarios, and to evaluate a
planner on simulated humans in general, it is important to know
the behavior assumptions to interpret the results accordingly.
It is dangerous to evaluate an approach using a similar
simulated human behavior to what being used in the planner
for forward simulation
        <xref ref-type="bibr" rid="ref13 ref16">(Macindoe, Kaelbling, and
LozanoPe´rez 2012)</xref>
        , as it hides the effect of modeling inaccuracy.
Therefore, in this section, we leverage pedestrian
simulators, specifically, social force models with explicit collision
avoidance
        <xref ref-type="bibr" rid="ref7">(Karamouzas et al. 2009)</xref>
        , interpret their
underlying behavior assumptions using a game-theoretic
decisionmaking model, and then modify those assumptions to
simulate behaviors we observed in real-world interactions. We
6If we maintain K samples and do not apply sample average for
rollouts, as suggested in Eq. 7, the complexity will be jKjH jAjH
use them to evaluate our approach, and discuss about the
performance brought by model inaccuracy, to benchmark the
worst-case scenario.
      </p>
      <p>
        Popular approaches for behavior modeling are
datadriven, either from the behavior cloning community
(supervised learning), or inverse optimal planning community. In
the latter, agents are assumed to be rational, which means
their decisions optimize a certain objective. Here, we
consider the agent-based models in crowd simulation
        <xref ref-type="bibr" rid="ref20 ref5 ref7">(Helbing and Molnar 1995; Paris, Pettre´, and Donikian 2007;
Karamouzas et al. 2009)</xref>
        . Among those models,
nominal goal-driven navigation and interactive collision
avoidance motions are simulated through reactive policies. We
adapt the social force model with collision prediction
(SFCP
        <xref ref-type="bibr" rid="ref7">(Karamouzas et al. 2009)</xref>
        ) to sample avoidance motions.
In their work, collision avoidance behaviors are based on
the relative position at their closest point, which can be
categorized into two groups: one is anticipated to pass the path
intersection earlier, which then accelerate (to attempt to pass
in front); the other is anticipated pass later, which then
decelerate (to attempt to yield).
      </p>
      <p>In real world, however, those reactions are not based on
perfect timing estimates, and people make attempts based
on other criteria as well. Therefore, we sample avoidance
motions for both attempts, by sampling relative velocity
estimates for both earlier and later arrival timings, and use a
game-theoretic decision-making model to decide which
attempt to go for, to simulate heterogeneous behaviors we
observe in real world.</p>
      <p>For generating the worst case scenario we model the
pedestrian collision-avoidance (crossing) scenario as an
(two-agent) asymmetric game of Chicken, shown in Fig. 5,
where each agent knows the value estimates Qi; i of their
last-period actions aT , and their decisions are based on the
design of their own objectives, and the inference of the other
agents’ strategies, as listed in Eq. 6.</p>
      <p>When agents share the same objective and it is of
common knowledge to all agents i.e., all agents know they share
an objective and they all know others know that and so on,
all agents share the same optimal policy. In order to solve
this case, we can treat it as if one agent has full control to
the others, which all optimize that agent’s state-action value
function Qi:
ri(xt; ait; at i) + QiT (xT ; aiT ; aT i)]:
a0:T = argmax maxEx0:T [ X
i
ai0:T a0:iT t=0</p>
      <p>T 1
(11)
If an agent’s policy is to maximize the social welfare and
it assumes the others know that and do the same, the
overall function in Eq. 11 converges to the coordination policy
where agents yield whenever it has later arrival timing and
vice versa, such that the joint efficiency is maximized. This
behavior, which is simulated in SF-CP, is referred here as the
reciprocal behavior, as agents are cooperative and assume
the others are the same. In the real world, there are many
cases in which humans act strategically and thus do not
always act according to the reciprocal behavior assumptions.
Some people constantly yield and wait for the others to pass
first, while others are doing the opposite. This can be
observed not only among crowds, but also among pedestrians
who encounter a robot for the first time 7. Among
pedestrians who exhibit the constant-yielding behavior, some
suggested that they do not know how to predict what the robot
will do; we therefore refer those pedestrians as being
cautious, and simulate their decisions through the maximin
operation, as suggested in Eq. 9. Among the pedestrians who
exhibited the non-yielding behavior, some strong provided
feedback that the robot should have waited for them. We
simulate such behavior through Eq. 11, with self-interested
objective and altruistic assumptions for the other agent. We
refer to such behavior as being aggressive, which assumes
both parties attempt to maximize the one agent’s individual
efficiency.</p>
      <p>7</p>
    </sec>
    <sec id="sec-7">
      <title>Performance Metrics and Experiments</title>
      <p>In this section we evaluate our proposed robust planner
under randomized initial conditions, with the proposed three
types of pedestrian behaviors in simulation. We consider
plan safety, efficiency, and smoothness as performance
metrics. We show performance deterioration when different
types of pedestrians are simulated, and collision safety is
still ensured among all scenarios. We sample 100 initial
configurations for testing in a two-agent crossing scenario, with
the robot starting at its nominal speed, 0.7 m=s, the human
at a speed range of [0.9, 1.3] m=s, and the initial positions
controlled such that the estimated arrival timing difference
range of [-0.8, 0.8] s. We evaluate the planner using
reciprocal human model (SF-CP) for forward simulation, and
compare the performances among crossing 1. aggressive,
2.cautious, and 3. reciprocal pedestrians. As suggested in
Eq. 11, we expect optimal joint efficiency among
reciprocalreciprocal agents with equal travel time, highest pedestrian
individual efficiency with the aggressive model and the
opposite with the cautious model.</p>
      <p>We consider the safety metric as: the distance between
two agents never comes within 0:6m while the robot speed is
greater than 0:3m=s. Efficiency is measured by travel time.
We look into number of executed minimum-speed actions
(as) as an indicator of how smooth the crossing interaction
is. The result can be seen in Fig. 7. We can see the planner
experiences zero collisions among the three types of
simulated pedestrians 8. Due to the conservative prediction at
final periods, the expected safety actions as in the first
periods of planning are more than the executed ones at the final
periods. This is true with both reciprocal (of accurate
prediction) and cautious pedestrian models, but the opposite occurs
7We put a robot in a public space, running a policy that never
slows down until imminent threat of collision is detected. We
observed pedestrian responses, and ask them questions about what
they thought of the robot’s behavior afterwards.</p>
      <p>8In real-world interactions, another worst-case scenario is to
encounter a robot-interested pedestrian, who follows the robot after it
has passed, and block the robot when passing in front. We
acknowledge but do not explicitly consider this behavior here for
performance evaluation. A potential colliding situation when simulating
this adversarial type of agents for crossing is when they continue
with high speed while the robot has passed in front and slowed
down out of safety concern.
when encountering aggressive agents, where the robot’s
individual travel efficiency deteriorates due to frequent
slowdowns. Although it appears to have highest joint efficiency
on average, unexpected slow-downs in general cause
nonsmooth interaction; along with the frequent close-distance
interaction, we expect extra delays to both agents in the real
world.</p>
      <p>8</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion and Future Work</title>
      <p>We introduce a method for robot planning in human
workspaces as a Markov Game, to resolve the dynamic
environment dilemma in the motion planning literature when
planning in dynamic workspaces. We then proposed an
algorithm that provided safety guarantees in simulated
environments yet prevented the robot from exhibiting overly
conservative behavior through final-period worst-case simulation,
seeking to plan carefully only when it matters. We also
proposed pedestrian behavior assumptions based on
realworld observations, to benchmark the worst-case scenarios
caused by modeling inaccuracy, which is one of the most
difficult issues to deal with for state-of-the-art approaches
as they have been deployed in the real world. The proposed
approach relies on accurate modeling of the action space
of other agents A i for safety guarantee (as computed in
Eq. 10), which can not be validated for real-world
application unless deployed in real human workspaces.
Evaluation in real environments is then necessary to better support
the safety guarantee in the real world. We did not detail the
choice of heuristic in this paper, and desire to better
leverage those applied in grid-based multi-agent path
coordination and scheduling for fast online computation. Lastly, there
is limited literature on human behavior using multi-agent
decision-making models; we desire to extend the models to
incorporate more diverse interactive behavior, including
behavior adaptation, which is commonly seen in human-robot
interaction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Foerster et al. 2018]
          <string-name>
            <surname>Foerster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chen, R. Y.;
          <string-name>
            <surname>Al-Shedivat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Whiteson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Abbeel,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Mordatch</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Learning with opponent-learning awareness</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems</source>
          ,
          <volume>122</volume>
          -
          <fpage>130</fpage>
          . International Foundation for Autonomous Agents and
          <string-name>
            <given-names>Multiagent</given-names>
            <surname>Systems</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Fox, Burgard, and Thrun 1997]
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Burgard</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and Thrun,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>1997</year>
          .
          <article-title>The dynamic window approach to collision avoidance</article-title>
          .
          <source>IEEE Robotics &amp; Automation Magazine</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Fraichard 2007] Fraichard,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>A short paper about motion safety</article-title>
          .
          <source>In IEEE int. conf. on robotics and automation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Godoy et al. 2016]
          <string-name>
            <surname>Godoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Karamouzas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Guy,
          <string-name>
            <surname>S. J.;</surname>
          </string-name>
          and Gini,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Moving in a crowd: Safe and efficient navigation among heterogeneous agents</article-title>
          .
          <source>In Proc. Int. Joint Conf. on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Helbing and Molnar</source>
          <year>1995</year>
          ]
          <string-name>
            <surname>Helbing</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Molnar</surname>
          </string-name>
          , P.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          1995.
          <article-title>Social force model for pedestrian dynamics</article-title>
          .
          <source>Physical review E</source>
          <volume>51</volume>
          (
          <issue>5</issue>
          ):
          <fpage>4282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Karamouzas et al. 2009]
          <string-name>
            <surname>Karamouzas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Heil, P.; van Beek,
          <string-name>
            <surname>P.</surname>
          </string-name>
          ; and Overmars,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2009</year>
          .
          <article-title>A predictive collision avoidance model for pedestrian simulation</article-title>
          . In International Workshop on Motion in Games,
          <volume>41</volume>
          -
          <fpage>52</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Koenig and Likhachev</source>
          <year>2002</year>
          ] Koenig,
          <string-name>
            <given-names>S.</given-names>
            , and
            <surname>Likhachev</surname>
          </string-name>
          , M.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          2002.
          <article-title>Dˆ* lite</article-title>
          . Aaai/iaai 15.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Kretzschmar, Kuderer, and Burgard 2014] Kretzschmar,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Kuderer,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Burgard</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          <year>2014</year>
          .
          <article-title>Learning to predict trajectories of cooperatively navigating agents</article-title>
          .
          <source>In Robotics and Automation (ICRA)</source>
          ,
          <year>2014</year>
          IEEE International Conference on,
          <fpage>4015</fpage>
          -
          <lpage>4020</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Kruse et al. 2012] Kruse,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Basili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Glasauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ; and
            <surname>Kirsch</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Legible robot navigation in the proximity of moving humans</article-title>
          .
          <source>In Advanced Robotics and its Social Impacts (ARSO)</source>
          ,
          <source>2012 IEEE Workshop on</source>
          ,
          <fpage>83</fpage>
          -
          <lpage>88</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Kuderer et al. 2012] Kuderer,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Kretzschmar</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Sprunk,
          <string-name>
            <given-names>C.</given-names>
            ; and
            <surname>Burgard</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Feature-based prediction of trajectories for socially compliant navigation</article-title>
          .
          <source>In Robotics: science and systems. Citeseer.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Lichtentha¨ler, Lorenzy, and Kirsch 2012] Lichtentha¨ler, C.;
          <string-name>
            <surname>Lorenzy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kirsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Influence of legibility on perceived safety in a virtual human-robot path crossing task</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>In</surname>
            <given-names>RO</given-names>
          </string-name>
          -MAN,
          <year>2012</year>
          IEEE,
          <fpage>676</fpage>
          -
          <lpage>681</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>[Littman</source>
          <year>1994</year>
          ]
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Markov games as a framework for multi-agent reinforcement learning</article-title>
          .
          <source>In Machine Learning Proceedings 1994. Elsevier</source>
          .
          <volume>157</volume>
          -
          <fpage>163</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Macindoe, Kaelbling, and
          <string-name>
            <surname>Lozano-</surname>
          </string-name>
          Pe´rez 2012]
          <string-name>
            <surname>Macindoe</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lozano-</surname>
          </string-name>
          Pe´rez, T.
          <year>2012</year>
          .
          <article-title>Pomcop: Belief space planning for sidekicks in cooperative games</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Mavrogiannis and Knepper</source>
          <year>2016</year>
          ] Mavrogiannis,
          <string-name>
            <given-names>C. I.</given-names>
            , and
            <surname>Knepper</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. A.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Decentralized multi-agent navigation planning with braids</article-title>
          .
          <source>In Proceedings of the Workshop on the Algorithmic Foundations of Robotics</source>
          . San Francisco, USA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Nguyen et al. 2011] Nguyen,
          <string-name>
            <surname>T.-H. D.</surname>
          </string-name>
          ;
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>W. S.</given-names>
          </string-name>
          ; Leong, T.-Y.;
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lozano-Perez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Grant</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Capir: Collaborative action planning with intention recognition</article-title>
          .
          <source>In AIIDE.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Nikolaidis et al. 2016]
          <string-name>
            <surname>Nikolaidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Srinivasa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Formalizing human-robot mutual adaptation: A bounded memory model</article-title>
          .
          <source>In The Eleventh ACM/IEEE International Conference on Human Robot Interaction</source>
          ,
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          . IEEE Press.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Paris, Pettre´, and Donikian 2007] Paris, S.; Pettre´, J.; and
          <string-name>
            <surname>Donikian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Pedestrian reactive navigation for crowd simulation: a predictive approach</article-title>
          . In Computer Graphics Forum, volume
          <volume>26</volume>
          ,
          <fpage>665</fpage>
          -
          <lpage>674</lpage>
          . Wiley Online Library.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Pfeiffer et al. 2016] Pfeiffer,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Schwesinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            ;
            <surname>Sommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Galceran</surname>
          </string-name>
          , E.; and Siegwart,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Predicting actions to act predictably: Cooperative partial motion planning with maximum entropy models</article-title>
          .
          <source>In Intelligent Robots and Systems (IROS)</source>
          ,
          <year>2016</year>
          IEEE/RSJ International Conference on,
          <fpage>2096</fpage>
          -
          <lpage>2101</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>[Platt</surname>
            Jr et al. 2010]
            <given-names>Platt</given-names>
          </string-name>
          <string-name>
            <surname>Jr</surname>
            ,
            <given-names>R.</given-names>
            ; Tedrake, R.
          </string-name>
          ;
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lozano-Perez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Belief space planning assuming maximum likelihood observations</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>[Quinlan and Khatib</source>
          <year>1993</year>
          ] Quinlan,
          <string-name>
            <given-names>S.</given-names>
            , and
            <surname>Khatib</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          1993.
          <article-title>Elastic bands: Connecting path planning and control</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>In Robotics and Automation</source>
          ,
          <year>1993</year>
          . Proceedings., 1993 IEEE International Conference on,
          <fpage>802</fpage>
          -
          <lpage>807</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Sadigh et al. 2016]
          <string-name>
            <surname>Sadigh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Seshia</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dragan</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Planning for autonomous cars that leverages effects on human actions</article-title>
          .
          <source>In Proceedings of the Robotics: Science and Systems Conference (RSS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>[Silver and Veness</source>
          <year>2010</year>
          ]
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <article-title>Monte-carlo planning in large pomdps</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>2164</volume>
          -
          <fpage>2172</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>[Trautman and Krause</source>
          <year>2010</year>
          ] Trautman,
          <string-name>
            <given-names>P.</given-names>
            , and
            <surname>Krause</surname>
          </string-name>
          , A.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          2010.
          <article-title>Unfreezing the robot: Navigation in dense, interacting crowds</article-title>
          .
          <source>In Intelligent Robots and Systems (IROS)</source>
          ,
          <year>2010</year>
          IEEE/RSJ International Conference on,
          <fpage>797</fpage>
          -
          <lpage>803</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [Trautman et al. 2015] Trautman,
          <string-name>
            <surname>P.</surname>
          </string-name>
          ; Ma, J.; Murray,
          <string-name>
            <given-names>R. M.</given-names>
            ; and
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Robot navigation in dense human crowds: Statistical models and experimental studies of human-robot cooperation</article-title>
          .
          <source>The International Journal of Robotics Research</source>
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <fpage>335</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>[Van Den Berg</surname>
          </string-name>
          et al. 2011
          <string-name>
            <surname>] Van Den Berg</surname>
            , J.; Guy,
            <given-names>S. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manocha</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Reciprocal n-body collision avoidance</article-title>
          .
          <source>In Robotics research</source>
          . Springer. 3-
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>[Watkins and Dayan</source>
          <year>1992</year>
          ] Watkins,
          <string-name>
            <given-names>C. J.</given-names>
            , and
            <surname>Dayan</surname>
          </string-name>
          , P.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          1992.
          <article-title>Q-learning</article-title>
          .
          <source>Machine learning 8(3-4)</source>
          :
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>