<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Minimizing the Negative Side Effects of Planning with Reduced Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandhya Saisubramanian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shlomo Zilberstein</string-name>
          <email>shlomog@cs.umass.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information and Computer Sciences University of Massachusetts Amherst</institution>
          ,
          <addr-line>MA 01003</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reduced models of large Markov decision processes accelerate planning by considering a subset of outcomes for each state-action pair. This reduction in reachable states leads to replanning when the agent encounters states without a precomputed action during plan execution. However, not all states are suitable for replanning. In the worst case, the agent may not be able to reach the goal from the newly encountered state. Agents should be better prepared to handle such risky situations and avoid replanning in risky states. Hence, we consider replanning in states that are unsafe for deliberation as a negative side effect of planning with reduced models. While the negative side effects can be minimized by always using the full model, this defeats the purpose of using reduced models. The challenge is to plan with reduced models, but somehow account for the possibility of encountering risky situations. An agent should thus only replan in states that the user has approved as safe for replanning. To that end, we propose planning using a portfolio of reduced models, a planning paradigm that minimizes the negative side effects of planning using reduced models by alternating between different outcome selection approaches. We empirically demonstrate the effectiveness of our approach on three domains: an electric vehicle charging domain using real-world data from a university campus and two benchmark planning problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A Stochastic Shortest Path (SSP) problem is a Markov
Decision Process (MDP) with a start state and goal state(s),
where the objective is to find a policy that minimizes the
expected cost of reaching a goal state from the start state
        <xref ref-type="bibr" rid="ref3">(Bertsekas and Tsitsiklis 1991)</xref>
        . Solving large SSPs optimally and
efficiently is an active research area in automated planning,
with many practical applications. Given the computational
complexity of solving large SSPs optimally
        <xref ref-type="bibr" rid="ref16">(Littman 1997)</xref>
        ,
there has been considerable interest in developing efficient
approximations, such as reduced models, that trade solution
quality for computational gains. Reduced models simplify
the problem by partially or completely ignoring uncertainty,
thereby reducing the set of reachable states for the planner.
We consider reduced models in which the number of
outcomes per action is reduced relative to the original model.
      </p>
      <p>Due to model simplification, an agent may encounter
unexpected states during plan execution, for which the policy
computed using reduced models may not specify an action.</p>
      <p>
        In such instances, the agent needs to replan online and find
a new way to reach a goal state. We consider this a side
effect of using reduced models, as the agent never has to
replan when the complete model is used. Current reduced
model techniques assume that the agent can replan in any
state
        <xref ref-type="bibr" rid="ref10 ref21 ref26">(Yoon, Fern, and Givan 2007; Keller and Eyerich 2011;
Saisubramanian, Zilberstein, and Shenoy 2017)</xref>
        . That is,
they are oblivious to how the ignored outcomes may affect
the system. Some of these side effects may be acceptable to
the user, but others may be unsafe even though they
accelerate planning (Figure 1).
      </p>
      <p>
        Consider, for example, a robot navigating on a sidewalk
adjacent to a street. The robot has a 0.1 probability of
sliding onto the street when navigating at high speed and a 0.02
probability of sliding onto the street when navigating at low
speed. When planning with a reduced model that ignores the
possibility of sliding as an outcome, the robot will have to
replan if the ignored outcome occurs during plan execution.
However, it is risky for the robot to replan while on the street,
and it is expected to be better prepared to handle such
situations. Another example of replanning in a risky state is an
autonomous vehicle replanning in the middle of an
intersection after missing a turn or an exit. In the terminology
introduced by
        <xref ref-type="bibr" rid="ref1">Amodei et al. (2016)</xref>
        , these are negative side effects
of planning with reduced models — an undesired behavior
that is the result of ignoring certain outcomes in planning.
Table 1 outlines various properties of negative side effects.
      </p>
      <p>
        The importance of accounting for risks in AI systems is
attracting growing interest
        <xref ref-type="bibr" rid="ref15 ref28">(Kulic´ and Croft 2005;
Zilberstein 2015)</xref>
        . One such specific form of risk is the negative
side effects
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        that occur when an
autonomous agent optimizes an objective function that focuses
on certain aspects of the environment, leading to undesirable
effects on other aspects of the environment. Recently,
researchers have proposed techniques to address the negative
side effects that arise due to misspecified rewards
(Hadfield-Menell et al. 2017) and incomplete models
        <xref ref-type="bibr" rid="ref27">(Zhang, Durfee,
and Singh 2018)</xref>
        . In this work, we focus on addressing
negative side effects in the form of deliberation in unsafe states
that arise due to planning with simplified or reduced
models. This is a type of irreversible negative side effect that is
fully observable and avoidable. The severity of replanning
may be mild or significant depending on the state. While
the negative side effects can always be reduced by using the
full model, this defeats the purpose of using reduced models.
The challenge is to plan with reduced models while
simultaneously being prepared for handling risky situations. This
can be achieved by formulating reduced models that can
adapt to risks by selectively improving the model fidelity.
      </p>
      <p>We propose planning using a portfolio of outcome
selection principles to minimize the negative side effects of
planning with reduced models. The model fidelity is improved
in states that are risky — unsafe for deliberation or
replanning. We consider a factored state representation in which
the states are represented as a vector of features. Given a set
of features that characterize states unsafe for deliberation,
our approach employs more informative outcomes or the
full model for actions in states leading to these side effects
and a simple outcome selection otherwise. To identify states
where model fidelity is to be improved, the agent performs
reachability analysis using samples and the given features.
The sample trajectories are generated by depth-limited
random walk. Reachability analysis is performed primarily with
respect to negative side effects and this is used as a heuristic
for choosing the outcome selection principles in a
portfolio. We evaluate our approach in three proof-of-concept
domains: an electric vehicle charging problem using real-world
data, a racetrack domain and a sailing domain. Our empirical
results show that our approach significantly minimizes the
negative side effects, without affecting the run time. While
we address one particular form of negative side effects in
planning using reduced models, our proposed framework
is generic and can be extended to potentially address other
forms of negative side effects.</p>
      <p>Our primary contributions are: (i) formalizing negative
side effects in planning using reduced models; (ii)
introducing a portfolio of reduced models approach to minimize the
negative side effects; and (iii) empirical evaluation of the
proposed approach on three domains.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>This section provides a brief background on stochastic
shortest path problems and planning using reduced models.</p>
      <sec id="sec-2-1">
        <title>Property</title>
      </sec>
      <sec id="sec-2-2">
        <title>Impact Severity Occurrence Observability</title>
      </sec>
      <sec id="sec-2-3">
        <title>Reversible or irreversible Significant or insignificant Avoidable or unavoidable Fully observable or partially observable</title>
        <p>C : S A ! R+ [ f0g specifies the cost of executing an
action a in state s and is denoted by C(s; a);
s0 2 S is the start state; and
SG</p>
      </sec>
      <sec id="sec-2-4">
        <title>S is the set of absorbing goal states.</title>
        <p>The cost of an action is positive in all states except goal
states, where it is zero. The objective in an SSP is to
minimize the expected cost of reaching a goal state from the
start state. It is assumed that there exists at least one proper
policy, one that reaches a goal state from any state s with
probability 1. The optimal policy, π*, can be extracted using
the value function defined over the states, V*(s):
V*(s) = min_a Q*(s, a),  ∀s ∈ S   (1)
Q*(s, a) = C(s, a) + Σ_{s′} T(s, a, s′) V*(s′),  ∀(s, a)   (2)</p>
        <p>
          where Q*(s, a) denotes the optimal Q-value of the action a
in state s in the SSP M
          <xref ref-type="bibr" rid="ref3">(Bertsekas and Tsitsiklis 1991)</xref>
          .
        </p>
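          <p>As a concrete illustration of Equations 1 and 2, the value function of a small SSP can be computed by repeated Bellman backups. The toy model and variable names below are illustrative, not from the paper:</p>

```python
# Toy SSP solved by value iteration (illustrative model).
# T[s][a] maps each successor state to its probability; state 2 is the goal.
T = {
    0: {"a": {1: 0.9, 0: 0.1}, "b": {2: 0.5, 0: 0.5}},
    1: {"a": {2: 1.0}},
}
C = {0: {"a": 1.0, "b": 3.0}, 1: {"a": 1.0}}  # positive costs, zero at goal

V = {0: 0.0, 1: 0.0, 2: 0.0}  # goal value stays 0 (absorbing, zero cost)
for _ in range(200):  # repeat the Bellman backup (Eqs. 1 and 2) to convergence
    for s in T:
        V[s] = min(
            C[s][a] + sum(p * V[sp] for sp, p in T[s][a].items())
            for a in T[s]
        )

# The optimal policy is greedy with respect to the converged Q-values.
policy = {s: min(T[s], key=lambda a: C[s][a] +
                 sum(p * V[sp] for sp, p in T[s][a].items()))
          for s in T}
```

          <p>Here V converges to the optimal expected cost-to-go, and the greedy policy extracted from it is optimal.</p>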
        <sec id="sec-2-4-1">
          <title>Planning Using Reduced Models</title>
          <p>
            While SSPs can be solved in polynomial time in the
number of states, many problems of interest have a state-space
whose size is exponential in the number of variables
describing the problem
            <xref ref-type="bibr" rid="ref16">(Littman 1997)</xref>
            . This complexity has led to
the use of approximate methods such as reduced models.
Reduced models simplify planning by considering a subset
of outcomes. Let θ(s, a) denote the set of all outcomes of
(s, a): θ(s, a) = {s′ | T(s, a, s′) &gt; 0}.
          </p>
          <p>Definition 1. A reduced model of an SSP M is represented
by the tuple M′ = ⟨S, A, T′, C, s0, SG⟩ and characterized
by an altered transition function T′ such that ∀(s, a) ∈ S ×
A, θ′(s, a) ⊆ θ(s, a), where θ′(s, a) = {s′ | T′(s, a, s′) &gt; 0}
denotes the set of outcomes in the reduced model for action
a ∈ A in state s ∈ S.</p>
          <p>We normalize the probabilities of the outcomes included
in the reduced model, but more complex ways to redistribute
the probabilities of ignored outcomes may be considered.
The outcome selection process in a reduced model
framework determines the number of outcomes and how the
specific outcomes are selected. Depending on these two aspects,
a spectrum of reductions exist with varying levels of
probabilistic complexity.</p>
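          <p>As a hedged sketch of this normalization scheme (the function and names are illustrative, not from the paper), a reduction that keeps the k most likely outcomes of a state-action pair and renormalizes their probabilities is:</p>

```python
def reduce_outcomes(dist, k):
    """Keep the k most likely outcomes of a single (s, a) distribution
    and renormalize their probabilities (the simple scheme in the text)."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {sp: p / total for sp, p in top}

full = {"s1": 0.7, "s2": 0.2, "s3": 0.1}
mlod = reduce_outcomes(full, 1)   # most-likely outcome determinization
two = reduce_outcomes(full, 2)    # a two-outcome reduction
```

          <p>Most-likely outcome determinization is the k = 1 special case; larger k yields reductions of higher probabilistic complexity.</p>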
          <p>The outcome selection principle (OSP) determines the
outcomes included in the reduced model per state-action
pair, and the altered transition function for each state-action
pair. The outcome selection may be performed using a
simple function such as always choosing the most-likely
outcome or a more complex function. Typically, a reduced
model is characterized by a single OSP. That is, a single
principle is used to determine the number of outcomes and how
the outcomes are selected across the entire model. Hence,
existing reduced models are incapable of selectively
improving model fidelity.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Minimizing Negative Side Effects</title>
      <p>One form of negative side effects of planning using reduced
models is replanning in states, as a result of ignoring
outcomes, that a human has prescribed as unsafe for
deliberation. The set of states characterizing negative side
effects of planning with a reduced model M′, denoted
NSE(M′), is defined as NSE(M′) = {s′ | T(s, a, s′) &gt; 0 ∧
D(s′) = true ∧ s′ ∉ θ′(s, a)}, where D(s′) is true if
deliberation in state s′ is risky during policy execution.
Thus, the OSP determines the reduced model and thereby
the resulting negative side effects. The challenge is to
determine outcome selection principles that minimize the
negative side effects of planning with reduced models without
significantly compromising the run time gains.</p>
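      <p>Given explicit transition functions, the set NSE(M′) can be enumerated directly from this definition. A small sketch with hypothetical state names:</p>

```python
def nse_states(T, T_reduced, is_unsafe):
    """Enumerate NSE(M'): successors that are reachable in the true model,
    unsafe for deliberation (D(s') is true), and dropped by the reduction."""
    nse = set()
    for s, actions in T.items():
        for a, dist in actions.items():
            kept = T_reduced[s][a]      # theta'(s, a): outcomes kept by M'
            for sp, p in dist.items():
                if p > 0 and is_unsafe(sp) and sp not in kept:
                    nse.add(sp)
    return nse

T = {"s0": {"a": {"s1": 0.9, "street": 0.1}}}   # true model
Tr = {"s0": {"a": {"s1": 1.0}}}                  # reduction drops "street"
```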
      <p>
        While the negative side effects can always be minimized
using the full model or picking the right determinization,
the former is computationally complex and finding the right
determinization is a non-trivial meta-level decision
problem
        <xref ref-type="bibr" rid="ref19">(Pineda and Zilberstein 2014)</xref>
        . Another approach is to
penalize the agent for replanning in risky states, instead of
accounting for the risky outcomes in the reduced model
formulation. However, this is useful only when the agent
performs an exhaustive search in the space of reduced
models, which is computationally complex. Given a set of
features characterizing states unsuitable for deliberation, D(f⃗),
we address this issue by providing an efficient reduced model
formulation that can selectively improve model fidelity. A
straightforward approach is to improve the model fidelity in
risky states. However, this does not minimize the negative
side effects; it is more beneficial to improve the model
fidelity in states leading to these risky states. Based on this
insight, we present a heuristic approach, based on
reachability to these risky states, to select outcome selection
principles for each state-action pair.
      </p>
      <sec id="sec-3-1">
        <title>Portfolios of Reduced Models</title>
        <p>
          We present planning using a portfolio of reduced models
(PRM), a generalized approach to formulate reduced
models, by switching between different outcome selection
principles, each of which represents a different reduced model.
The approach is inspired by the benefits of using
portfolios of algorithms to solve complex computational
problems
          <xref ref-type="bibr" rid="ref11 ref18">(Petrik and Zilberstein 2006)</xref>
          .
        </p>
        <p>Definition 2. Given a finite portfolio of outcome selection
principles, Z = {ρ1, ρ2, ..., ρk}, k &gt; 1, a model selector, Δ,
generates T′ for a reduced model by mapping every (s, a) to
an outcome selection principle, Δ : S × A → ρi, ρi ∈ Z, such
that T′(s, a, s′) = T_{Δ(s,a)}(s, a, s′), where T_{Δ(s,a)}(s, a, s′)
denotes the transition probability corresponding to the
outcome selection principle selected by the model selector.</p>
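        <p>A minimal sketch of Definition 2 (helper names are illustrative, not from the paper): the model selector maps each state-action pair to an OSP, which then produces that pair's outcome set in T′:</p>

```python
def most_likely(dist):            # OSP: most-likely outcome determinization
    sp = max(dist, key=dist.get)
    return {sp: 1.0}

def full_outcomes(dist):          # OSP: keep the full distribution
    return dict(dist)

def build_reduced_T(T, selector):
    """T'(s, a, .) = (selector(s, a))(T(s, a, .)), per state-action pair."""
    return {s: {a: selector(s, a)(dist) for a, dist in acts.items()}
            for s, acts in T.items()}

T = {"s0": {"a": {"s1": 0.8, "s2": 0.2}},
     "s1": {"a": {"goal": 1.0}}}
# A selector that uses the full model only in state "s0":
selector = lambda s, a: full_outcomes if s == "s0" else most_likely
T_prime = build_reduced_T(T, selector)
```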
        <p>Clearly, the existing reduced models are a special case
of PRMs, with a model selector (Δ) that always selects the
same OSP for every state-action pair. In planning using a
portfolio of reduced models, however, the model selector
typically utilizes the synergy of multiple outcome selection
principles. Each state-action pair may have a different
number of outcomes and a different mechanism to select the
specific outcomes. Hence, we leverage this flexibility in
outcome selection to minimize negative side effects in reduced
models by using more informative outcomes in the risky
states and simple outcome selection principles
otherwise. Though the model selector may use multiple outcome
selection principles to generate T′ in a PRM, note that the
resulting model is still an SSP. In the rest of the paper, we
focus on a basic instantiation of a PRM that alternates between
the extremes of reductions.</p>
        <p>Definition 3. A 0/1 reduced model (0/1 RM) is a PRM with
a model selector that selects either one or all outcomes of
an action in a state to be included in the reduced model.</p>
        <p>A 0/1 RM is characterized by a model selector, Δ0/1, that
either ignores the stochasticity completely (0) by
considering only one outcome of (s, a), or fully accounts for the
stochasticity (1) by considering all outcomes of the
state-action pair in the reduced model. For example, it may use the
full model in states prone to negative side effects, and
determinization otherwise. Thus, a 0/1 RM that guarantees goal
reachability with probability 1 can be devised, if a proper
policy exists in the SSP. Our experiments using 0/1 RM
show that even such a basic instantiation of a PRM works well
in practice.</p>
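        <p>A minimal sketch of such a Δ0/1 selector, assuming some estimate of risky-state reachability is available (names and threshold handling are illustrative, not from the paper):</p>

```python
def make_01_selector(reach_estimate, threshold=0.25):
    """Sketch of Delta_{0/1}: keep all outcomes ('1') where the estimated
    reachability of a risky state meets the threshold, and use most-likely
    outcome determinization ('0') elsewhere."""
    def select(s, a, dist):
        if reach_estimate(s) >= threshold:
            return dict(dist)                 # '1': full model
        sp = max(dist, key=dist.get)
        return {sp: 1.0}                      # '0': determinization
    return select

# Hypothetical estimate: only "near_street" can reach a risky state.
select = make_01_selector(lambda s: 0.4 if s == "near_street" else 0.0)
```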
        <p>Devising an efficient model selector automatically can be
treated as a meta-level decision problem that is
computationally more complex than solving the reduced model, due
to the numerous possible combinations of outcome
selection principles. Even in a 0/1 RM, finding when to use
determinization and when to use the full model is non-trivial.
In the more general setting, all OSPs in Z may have to be
evaluated to determine the best reduced model formulation
in the worst case. Let τmax denote the maximum time taken
for this evaluation across all states. In a PRM with a large
portfolio, the OSPs may be redundant in terms of specific
outcomes. For example, selecting the most-likely outcome
and greedily selecting an outcome based on a heuristic could
result in the same outcome for some (s, a) pair. We use this
property to show that the worst-case time complexity of
devising an efficient model selector is independent of the size
of the portfolio, which may be much larger than the size of
the problem under consideration, and depends only on the size
of the state and action sets. The following proposition formally
proves this complexity.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption><p>An example SSP with a risky state: (a) Original problem; (b) MLOD; (c) 0/1 RM.</p></caption>
        </fig>
        <p>Proposition 1. The worst-case time complexity for Δ0/1 to
generate T′ for a 0/1 RM is O(|A||S|² τmax).</p>
        <p>Proof Sketch. For each (s, a), at most |Z| outcome selection
principles are to be evaluated and this takes at most τmax
time. Since this process is repeated for every (s, a), it takes
O(|S||A||Z| τmax) to generate T′. In the worst case, every
action may transition to all states and the outcome selection
principles in Z may be redundant in terms of the number
and specific outcome sets produced by them. To formulate
a 0/1 RM of an SSP, it may be necessary to evaluate
every outcome selection principle that corresponds to a
determinization or the full model. The set of unique outcomes, κ,
for a 0/1 RM is composed of all unique deterministic
outcomes, which is every state in the SSP, plus the full model, so
|κ| ≤ |S| + 1. Replacing |Z| with |κ|, the complexity
reduces to O(|A||S|² τmax).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Heuristic Approach to Model Selection</title>
        <p>Since it is computationally complex to perform an
exhaustive search in the space of possible 0/1 RM models, we
present an approximation to identify when to use the full
model, using samples.</p>
        <p>
          Given the features characterizing negative side effects,
model selection is performed by generating and solving
samples. The samples are generated automatically by
sampling states from the target problem. Alternatively, small
problem instances in the target domain may be used as
samples. In this paper, smaller problems are created
automatically by multiple trials of depth-limited random walk on the
target problems and solved using LAO*
          <xref ref-type="bibr" rid="ref6">(Hansen and
Zilberstein 2001)</xref>
          . The probability of reaching the states unsafe
for deliberation is computed for these samples, based on the
features. Trivially, as the number of samples and the depth
of the random walk are increased, the estimates converge to
their true values. The full model is used in states that meet
the reachability threshold. In all other states, determinization
is used.
        </p>
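        <p>The sampling step above can be sketched as follows, assuming a dictionary-based transition function and a risk predicate over states (names are illustrative, not from the paper):</p>

```python
import random

def risk_reachability(T, s, is_risky, depth=20, samples=30):
    """Estimate the probability of reaching a risky state from s using
    depth-limited random walks (a sketch of the sampling heuristic)."""
    hits = 0
    for _ in range(samples):
        cur = s
        for _ in range(depth):
            if is_risky(cur):
                hits += 1
                break
            actions = T.get(cur)
            if not actions:          # goal or dead end: stop this walk
                break
            a = random.choice(list(actions))
            succ = actions[a]
            cur = random.choices(list(succ), weights=list(succ.values()))[0]
    return hits / samples
```

        <p>The full model is then used wherever this estimate meets the chosen reachability threshold, and determinization elsewhere.</p>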
        <p>Figure 2 is an example of an SSP with a risky state
(shaded) and instantiations of reduced models formed with
uniform and adaptive outcome selection principles. The
most-likely outcome determinization (MLOD) uses a uniform
outcome selection principle – it always selects the most likely
outcome – and may ignore s5 in the reduced model. A
0/1 RM, however, uses the full model in the state-action pairs
leading to the risky state, minimizing negative side effects
and resulting in a model that is better prepared to handle the
risky state. The reachability threshold can be varied to devise
reduced models of different fidelity with varying
proportions of full model usage, resulting in different sensitivity to
negative side effects and computational complexity.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>We experiment with the 0/1 RM on three
proof-of-concept domains including an electric vehicle
charging problem using real-world data from a
university campus, and two benchmark planning
problems: the racetrack domain and the sailing domain.
The experiments employ a simple portfolio Z =
{most-likely outcome determinization (MLOD), full model}.
We compare the results of 0/1 RM to those of MLOD and
the M02 reduction, which greedily selects two outcomes for
each state-action pair.</p>
      <p>
        The expected number of negative side effects, cost of
reaching the goal, and planning time are used as
evaluation metrics. For 0/1 RM, we use the full model in states
that had a reachability of 0.25 or more to at least one of the
risky states, which is estimated using thirty samples. In all
other states, MLOD is used. All results are averaged over
100 trials of planning and execution simulations and the
average times include re-planning time. The deterministic
problems are solved using the A* algorithm
        <xref ref-type="bibr" rid="ref7">(Hart, Nilsson,
and Raphael 1968)</xref>
        and all other problems using LAO*. All
algorithms were implemented with ε = 10⁻³ and use the hmin
heuristic computed using a labeled version of LRTA*
        <xref ref-type="bibr" rid="ref14">(Korf
1990)</xref>
        .
      </p>
      <sec id="sec-4-1">
        <title>EV Charging Problem</title>
        <p>
          The electric vehicle (EV) charging domain considers an
EV operating in a vehicle-to-grid (V2G) setting in which
an EV can either charge or discharge energy from the
smart grid
          <xref ref-type="bibr" rid="ref17 ref21">(Ma et al. 2012; Saisubramanian, Zilberstein, and
Shenoy 2017)</xref>
          . By planning when to buy or sell electricity, an
EV can devise a robust policy for charging and discharging
that is consistent with the owner’s preferences, while
minimizing the long-term cost of operating the vehicle.
        </p>
        <p>We consider uncertainty in parking duration, which is
specified by a probability that certain states may become
terminal states. This problem is modeled as a discrete
finite-horizon MDP with the maximum parking duration as the
horizon H. Each state is represented by the tuple ⟨l, t, d, p, e⟩,
where l denotes the current level of charge of the vehicle,
t ≤ H denotes the current time step, d and p denote the
current demand level and price distribution for electricity, and
0 ≤ e ≤ 3 denotes the anticipated departure time specified
by the owner. If the owner has not provided this
information, then e = 3 and the agent plans for H. Otherwise, e
denotes the time steps remaining until departure. The vehicle
can charge (δi+) and discharge (δi−) at three different speed
levels, where 1 ≤ i ≤ 3 denotes the speed level, or remain
idle (⊥). The δi+ and δi− actions are stochastic, while the
⊥ action is deterministic. The objective is to find a reward-maximizing
policy π : S → A that maximizes goal
reachability, given the charging preferences of the vehicle owner.
If the exit level charge (goal) requirement of the owner is
not met at departure, then the owner may need to extend the
parking duration to reach the required charge level. Since
this is undesirable, any state from which the goal cannot
be reached in the remaining parking duration is treated as a
preference violation (risk) in the MDP with a large penalty.</p>
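        <p>A minimal encoding of this state and action space (field and action names are assumptions for illustration; only the tuple ⟨l, t, d, p, e⟩ and the action counts are given in the text):</p>

```python
from dataclasses import dataclass

# Illustrative encoding of the EV charging state <l, t, d, p, e>.
@dataclass(frozen=True)
class EVState:
    l: int   # current charge level
    t: int   # current time step (t ≤ H)
    d: int   # demand level (four levels in the experiments)
    p: int   # price distribution (two in the experiments)
    e: int   # anticipated departure: 3 = unannounced, else steps remaining

# Charge/discharge at three speed levels each, or remain idle:
ACTIONS = [f"charge_{i}" for i in (1, 2, 3)] + \
          [f"discharge_{i}" for i in (1, 2, 3)] + ["idle"]
```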
        <p>
          Four demand levels and two price distributions are used
in the experiments. Each t is equivalent to 30 minutes in real
time. We assume that the owner may depart between four
and six hours of parking, with a probability of 0.2 that they
announce their planned departure time. Outside that
window, there is a lower probability of 0.05 that they announce
their departure. The charging rewards and the peak hours are
based on real data
          <xref ref-type="bibr" rid="ref4">(Eversource 2017)</xref>
          . The battery capacity
and the charge speeds for the EV are based on Nissan Leaf
configuration. The charge and discharge speeds are
considered to be equal. The battery inefficiency is accounted for
by adding a 15% penalty on the rewards. The negative side
effects are estimated using: time remaining for departure, if
the current time is peak hour, and if there is sufficient charge
for discharging, with one step-lookahead.
        </p>
        <p>EV Dataset The data used in our experiments consist of
charging schedules of electric cars over a four month
duration in 2017 from an American university campus. The
data is clustered based on the entry and exit charges, and
we selected 25 representative problem instances across
clusters for our experiments. The data is from a typical charging
station, where the EV is unplugged once the desired charge
level is reached. Since we are considering an extended
parking scenario as in a vehicle parked at work, we assume
parking duration of up to eight hours in our experiments.
Therefore, for each instance, we alter the parking duration.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Racetrack Domain</title>
        <p>
          We experimented with six problem instances from the
racetrack domain
          <xref ref-type="bibr" rid="ref2">(Barto, Bradtke, and Singh 1995)</xref>
          , with
modifications to increase the difficulty of the problem. We
modified the problem such that, in addition to a 0.10 probability
of slipping, there is a 0.20 probability of randomly
changing the intended acceleration by one unit. The negative side
effects are estimated using one-step lookahead and state
features such as: whether the successor is a wall or pothole or
goal, and if the successor is moving away from the goal,
estimated using the heuristic.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Sailing Domain</title>
        <p>
          Finally, we present results on six instances of the sailing
domain
          <xref ref-type="bibr" rid="ref11 ref18">(Kocsis and Szepesva´ri 2006)</xref>
          . The problems vary in
terms of grid size and the goal position (opposite corner (C)
or middle (M) of the grid). In this domain, the actions are
deterministic and uncertainty is caused by stochastic changes
in the direction of the wind. The negative side effects are
estimated using one-step lookahead and based on state
features such as: the difference between the action's intended
direction of movement and the wind's direction, and whether the
successor is moving away from the goal, estimated using the
heuristic value.
        </p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Average number of negative side effects of MLOD, M02, and 0/1 RM on the racetrack instances (Square-3, Square-4, Square-5, Ring-3, Ring-5, Ring-6) and the sailing instances (20(C), 40(C), 80(C), 20(M), 40(M), 80(M)).</p></caption>
        </table-wrap>
      </sec>
      <sec id="sec-4-4">
        <title>Discussion</title>
        <p>Table 2 shows the average number of negative side effects
(risky states encountered without an action to execute) of
the three techniques in three domains. For the EV domain,
the results are aggregated over 25 problem instances for each
reward function. In our experiments, we observe that as the
model fidelity is improved, the negative side effects are
minimized. In all four reward function cases of the EV domain, 0/1 RM
has no negative side effects. Overall, 0/1 RM has the least
negative side effects in all three proof-of-concept domains.</p>
        <p>Table 3 shows the average increase in cost (%) (with
respect to the optimal cost as lower bound) and the time
savings (%) (with respect to solving the original problem). The
lower the cost increase (%), the closer the performance of the
technique is to optimal. Higher time savings
values indicate improved run time gains from using the model.
The runtime of 0/1 RM is considerably faster than solving
the original problem, while yielding almost optimal
solutions. This shows that our approach does not significantly
affect the run time gains so as to improve the solution
quality and minimize negative side effects. Note that we solve
0/1 RM using an optimal algorithm, LAO*, to demonstrate
the effectiveness of our framework by comparing the
optimal solutions of the models. In practice, any SSP solver
(optimal or not) may be used to improve run time gains.
Overall, 0/1 RM yields almost optimal results in terms of cost
and negative side effects.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Cost increase (%) and time savings (%) of MLOD, M02, and 0/1 RM on the racetrack instances (Square-3, Square-4, Square-5, Ring-4, Ring-5, Ring-6) and the sailing instances (20(C), 40(C), 80(C), 20(M), 40(M), 80(M)).</p></caption>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        Interest in using reduced models to accelerate planning
increased after the success of FF-Replan
        <xref ref-type="bibr" rid="ref26">(Yoon, Fern, and
Givan 2007)</xref>
        , which won the 2004 IPPC using the Fast Forward
(FF) technique to generate deterministic plans
        <xref ref-type="bibr" rid="ref8">(Hoffmann
2001)</xref>
        . Following the success of FF-Replan, researchers have
proposed various methods to improve determinization
        <xref ref-type="bibr" rid="ref10 ref25 ref9">(Yoon
et al. 2010; Keller and Eyerich 2011; Issakkimuthu et al.
2015)</xref>
        . Robust FF (RFF) generates a plan for an envelope of
states such that the probability of reaching a state outside
the envelope is below some predefined threshold
        <xref ref-type="bibr" rid="ref23">(TeichteilKo¨nigsbuch, Kuter, and Infantes 2010)</xref>
        and replans to reach the
subset of states for which the policy is already computed.
Our work differs from RFF in that we compute
reachability as a pre-processing step to formulate the 0/1 RM,
not to compute the policy.
      </p>
      <p>
        Safety issues in SSPs have been studied mostly with
respect to unavoidable dead ends that affect the goal
reachability
        <xref ref-type="bibr" rid="ref12 ref12 ref13 ref13 ref24">(Kolobov and Mausam 2012; Kolobov, Mausam, and
Weld 2012; Trevizan, Teichteil-Königsbuch, and Thiébaux
2017)</xref>
        . Recently, researchers have focused on formulating
richer reduced models that can balance the trade-off between
complexity and model fidelity so as to improve the
solution quality and safety when planning with reduced
models
        <xref ref-type="bibr" rid="ref19 ref22">(Pineda and Zilberstein 2014; Saisubramanian,
Zilberstein, and Shenoy 2018)</xref>
        . Our work aims to strike a balance
between minimizing negative side effects and using a
simpler model for planning, and in that sense resembles works that
balance the trade-off between solution quality and model
fidelity in reduced models. Unlike these existing works, however,
we target a well-defined, specific form of risk that arises
due to replanning in a reduced model setting.
      </p>
      <p>
        The problem of addressing negative side effects in AI
systems has attracted growing interest.
        <xref ref-type="bibr" rid="ref1">Amodei et al. (2016)</xref>
        address the problem of negative side effects in a
reinforcement learning setting by penalizing the agent, while
optimizing the reward.
        <xref ref-type="bibr" rid="ref27">Zhang, Durfee, and Singh (2018)</xref>
        study
a form of negative side effects in MDPs that occur when
an agent’s actions alter the state features in a way that may
negatively surprise the user. In their work, the agent queries
a user to learn the features that are acceptable for
modification and computes a plan accordingly.
        <xref ref-type="bibr" rid="ref5">Hadfield-Menell et
al. (2017)</xref>
        address negative side effects that arise from
misspecified objectives, treating the specified reward as an
observation from which to infer the true reward
function.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>Reduced models are a promising approach for quickly solving
large SSPs. However, existing techniques are oblivious
to the risky outcomes in the original problem when
formulating a reduced model. We propose planning using a
portfolio of reduced models, which provides flexibility in outcome
selection to minimize the negative side effects while still
accelerating planning. We also present a heuristic approach to
determine the outcome selection principles. Our empirical
results demonstrate the promise of this framework: the 0/1 RM
based on the proposed heuristic improves the
solution quality, apart from minimizing the negative side
effects. Our results also contribute to a better understanding
of how disparate reduced model techniques help improve
solution quality while minimizing negative side effects.</p>
      <p>This paper focuses on addressing one specific form of
negative side effect in planning that occurs when using
reduced models. There are a number of interesting future
research directions in this area. First, we are looking into
ways to automatically identify risks in the system. One
approach is to generate samples for learning risks using
machine learning techniques such as regression and decision
stumps. Another approach is to employ limited user data, in
the form of human demonstrations, to learn the features and
states that are risky for replanning. Second, we aim to build
a classifier to distinguish side effects from negative side
effects, which is a critical component for automatically
identifying risks. Finally, the 0/1 RM represents an initial
exploration of a broad spectrum of PRMs; we will
continue to explore and further analyze outcome selection
methods for PRMs with varying levels of probabilistic
complexity.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Support for this work was provided in part by the U.S.
National Science Foundation grants IIS-1524797 and
IIS-1724101.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steinhardt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Christiano,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Schulman</surname>
          </string-name>
          , J.; and Mané,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv preprint arXiv:1606.06565</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ; Bradtke,
          <string-name>
            <given-names>S. J.;</given-names>
            and
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. P.</surname>
          </string-name>
          <year>1995</year>
          .
          <article-title>Learning to act using real-time dynamic programming</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>72</volume>
          :
          <fpage>81</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bertsekas</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tsitsiklis</surname>
            ,
            <given-names>J. N.</given-names>
          </string-name>
          <year>1991</year>
          .
          <article-title>An analysis of stochastic shortest path problems</article-title>
          .
          <source>Mathematics of Operations Research</source>
          <volume>16</volume>
          :
          <fpage>580</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Eversource</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Eversource energy - time of use rates</article-title>
          . https://www.eversource.com/clp/vpp/vpp.aspx.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Hadfield-Menell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Milli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Abbeel,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.;</given-names>
            and
            <surname>Dragan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Inverse reward design</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>LAO*: A heuristic search algorithm that finds solutions with loops</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>129</volume>
          :
          <fpage>35</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nilsson</surname>
            ,
            <given-names>N. J.;</given-names>
          </string-name>
          and
          <string-name>
            <surname>Raphael</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>1968</year>
          .
          <article-title>A formal basis for the heuristic determination of minimum cost paths</article-title>
          .
          <source>IEEE Transactions on Systems Science and Cybernetics</source>
          <volume>4</volume>
          :
          <fpage>100</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hoffmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>FF: The fast-forward planning system</article-title>
          .
          <source>AI Magazine</source>
          <volume>22</volume>
          :
          <fpage>57</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Issakkimuthu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khardon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Tadepalli,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Hindsight optimization for probabilistic planning with factored actions</article-title>
          .
          <source>In Proc. of the 25th International Conference on Automated Planning and Scheduling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Keller</surname>
          </string-name>
          , T., and
          <string-name>
            <surname>Eyerich</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>A polynomial all outcome determinization for probabilistic planning</article-title>
          .
          <source>In Proc. of the 21st International Conference on Automated Planning and Scheduling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kocsis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and Szepesvári,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Bandit based Monte-Carlo planning</article-title>
          .
          <source>In Proc. of the 17th European Conference on Machine Learning</source>
          , volume
          <volume>6</volume>
          ,
          <fpage>282</fpage>
          -
          <lpage>293</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kolobov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mausam</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Planning with Markov decision processes: An AI perspective</article-title>
          .
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Kolobov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name><surname>Mausam</surname></string-name>;
          and
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>A theory of goal-oriented MDPs with dead ends</article-title>
          .
          <source>In Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <fpage>438</fpage>
          -
          <lpage>447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Korf</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          <year>1990</year>
          .
          <article-title>Real-time heuristic search</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>42</volume>
          :
          <fpage>189</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Kulić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Safe planning for human-robot interaction</article-title>
          .
          <source>Journal of Field Robotics</source>
          <volume>22</volume>
          :
          <fpage>383</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Probabilistic propositional planning: Representations and complexity</article-title>
          .
          <source>In Proc. of the 14th AAAI International Conference on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , Y.;
          <string-name>
            <surname>Houghton</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cruden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Infield</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Modeling the benefits of vehicle-to-grid technology to a power system</article-title>
          .
          <source>IEEE Transactions on Power Systems</source>
          <volume>27</volume>
          :
          <fpage>1012</fpage>
          -
          <lpage>1020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Petrik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Learning parallel portfolios of algorithms</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>48</volume>
          :
          <fpage>85</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Planning under uncertainty using reduced models: Revisiting determinization</article-title>
          .
          <source>In Proc. of the 24th International Conference on Automated Planning and Scheduling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Saisubramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and Shenoy,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Optimizing electric vehicle charging through determinization</article-title>
          .
          <source>In ICAPS Workshop on Scheduling and Planning Applications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Saisubramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and Shenoy,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Planning using a portfolio of reduced models</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Teichteil-Königsbuch</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kuter</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ; and Infantes,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Incremental plan aggregation for generating policies in MDPs</article-title>
          .
          <source>In 9th International Conference on Autonomous Agents and Multiagent Systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Trevizan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Teichteil-Königsbuch</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Thiébaux</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Efficient solutions for stochastic shortest path problems with dead ends</article-title>
          .
          <source>In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ruml</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Benton, J.; and Do,
          <string-name>
            <surname>M. B.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Improving determinization in hindsight for on-line probabilistic planning</article-title>
          .
          <source>In Proc. of the 20th International Conference on Automated Planning and Scheduling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Givan,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>FF-Replan: A baseline for probabilistic planning</article-title>
          .
          <source>In Proc. of the 17th International Conference on Automated Planning and Scheduling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Durfee</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S. P.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Minimax-regret querying on side effects for safe optimality in factored Markov decision processes</article-title>
          .
          <source>In IJCAI</source>
          ,
          <fpage>4867</fpage>
          -
          <lpage>4873</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Zilberstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Building strong semi-autonomous systems</article-title>
          .
          <source>In Proc. of the 29th AAAI Conference on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>