Minimizing the Negative Side Effects of Planning with Reduced Models

Sandhya Saisubramanian and Shlomo Zilberstein
College of Information and Computer Sciences
University of Massachusetts Amherst, MA 01003, USA
{saisubramanian,shlomo}@cs.umass.edu

Abstract

Reduced models of large Markov decision processes accelerate planning by considering a subset of outcomes for each state-action pair. This reduction in reachable states leads to replanning when the agent encounters states without a precomputed action during plan execution. However, not all states are suitable for replanning. In the worst case, the agent may not be able to reach the goal from the newly encountered state. Agents should be better prepared to handle such risky situations and avoid replanning in risky states. Hence, we consider replanning in states that are unsafe for deliberation as a negative side effect of planning with reduced models. While the negative side effects can be minimized by always using the full model, this defeats the purpose of using reduced models. The challenge is to plan with reduced models, but somehow account for the possibility of encountering risky situations. An agent should thus only replan in states that the user has approved as safe for replanning. To that end, we propose planning using a portfolio of reduced models, a planning paradigm that minimizes the negative side effects of planning using reduced models by alternating between different outcome selection approaches. We empirically demonstrate the effectiveness of our approach on three domains: an electric vehicle charging domain using real-world data from a university campus and two benchmark planning problems.

Figure 1: Reduced models are complemented by replanning when the agent encounters a state for which the policy computed using the reduced model does not specify an action. Some states may be unsafe for deliberation (replanning), which is considered a negative side effect of planning with reduced models.

Introduction

A Stochastic Shortest Path (SSP) problem is a Markov Decision Process (MDP) with a start state and goal state(s), where the objective is to find a policy that minimizes the expected cost of reaching a goal state from the start state (Bertsekas and Tsitsiklis 1991). Solving large SSPs optimally and efficiently is an active research area in automated planning, with many practical applications. Given the computational complexity of solving large SSPs optimally (Littman 1997), there has been considerable interest in developing efficient approximations, such as reduced models, that trade solution quality for computational gains. Reduced models simplify the problem by partially or completely ignoring uncertainty, thereby reducing the set of reachable states for the planner. We consider reduced models in which the number of outcomes per action is reduced relative to the original model. Due to model simplification, an agent may encounter unexpected states during plan execution, for which the policy computed using the reduced model may not specify an action. In such instances, the agent needs to replan online and find a new way to reach a goal state. We consider this a side effect of using reduced models, since the agent never has to replan when the complete model is used. Current reduced model techniques assume that the agent can replan in any state (Yoon, Fern, and Givan 2007; Keller and Eyerich 2011; Saisubramanian, Zilberstein, and Shenoy 2017). That is, they are oblivious to how the ignored outcomes may affect the system. Some of these side effects may be acceptable to the user, but others may be unsafe even though they accelerate planning (Figure 1).

Consider, for example, a robot navigating on a sidewalk adjacent to a street. The robot has a 0.1 probability of sliding onto the street when navigating at high speed and a 0.02 probability of sliding onto the street when navigating at low speed.
When planning with a reduced model that ignores the possibility of sliding as an outcome, the robot will have to replan if the ignored outcome occurs during plan execution. However, it is risky for the robot to replan while on the street, and it is expected to be better prepared to handle such situations. Another example of replanning in a risky state is an autonomous vehicle replanning in the middle of an intersection after missing a turn or an exit. In the terminology introduced by Amodei et al. (2016), these are negative side effects of planning with reduced models: undesired behavior that results from ignoring certain outcomes in planning. Table 1 outlines various properties of negative side effects.

Property        | Property Values
Impact          | Reversible or irreversible
Severity        | Significant or insignificant
Occurrence      | Avoidable or unavoidable
Observability   | Fully observable or partially observable

Table 1: Classification of negative side effects.

The importance of accounting for risks in AI systems is attracting growing interest (Kulić and Croft 2005; Zilberstein 2015). One specific form of risk is the negative side effects (Amodei et al. 2016) that occur when an autonomous agent optimizes an objective function that focuses on certain aspects of the environment, leading to undesirable effects on other aspects of the environment. Recently, researchers have proposed techniques to address the negative side effects that arise due to misspecified rewards (Hadfield-Menell et al. 2017) and incomplete models (Zhang, Durfee, and Singh 2018). In this work, we focus on addressing negative side effects in the form of deliberation in unsafe states that arise due to planning with simplified or reduced models. This is a type of irreversible negative side effect that is fully observable and avoidable. The severity of replanning may be mild or significant depending on the state. While the negative side effects can always be reduced by using the full model, this defeats the purpose of using reduced models. The challenge is to plan with reduced models while simultaneously being prepared for handling risky situations. This can be achieved by formulating reduced models that can adapt to risks by selectively improving the model fidelity.

We propose planning using a portfolio of outcome selection principles to minimize the negative side effects of planning with reduced models.
The model fidelity is improved in states that are risky, that is, unsafe for deliberation or replanning. We consider a factored state representation in which the states are represented as a vector of features. Given a set of features that characterize states unsafe for deliberation, our approach employs more informative outcomes or the full model for actions in states leading to these side effects, and a simple outcome selection otherwise. To identify states where model fidelity is to be improved, the agent performs reachability analysis using samples and the given features. The sample trajectories are generated by depth-limited random walks. Reachability analysis is performed primarily with respect to negative side effects, and this is used as a heuristic for choosing the outcome selection principles in a portfolio. We evaluate our approach on three proof-of-concept domains: an electric vehicle charging problem using real-world data, a racetrack domain, and a sailing domain. Our empirical results show that our approach significantly minimizes the negative side effects without affecting the run time. While we address one particular form of negative side effects in planning using reduced models, our proposed framework is generic and can be extended to potentially address other forms of negative side effects.

Our primary contributions are: (i) formalizing negative side effects in planning using reduced models; (ii) introducing a portfolio of reduced models approach to minimize the negative side effects; and (iii) empirical evaluation of the proposed approach on three domains.

Background

This section provides a brief background on stochastic shortest path problems and planning using reduced models.

Stochastic Shortest Path Problems

A Stochastic Shortest Path (SSP) MDP is defined by M = ⟨S, A, T, C, s0, SG⟩, where
• S is a finite set of states;
• A is a finite set of actions;
• T : S × A × S → [0, 1] is the transition function, which specifies the probability of reaching a state s′ by executing an action a in state s and is denoted by T(s, a, s′);
• C : S × A → R+ ∪ {0} specifies the cost of executing an action a in state s and is denoted by C(s, a);
• s0 ∈ S is the start state; and
• SG ⊆ S is the set of absorbing goal states.

The cost of an action is positive in all states except goal states, where it is zero. The objective in an SSP is to minimize the expected cost of reaching a goal state from the start state. It is assumed that there exists at least one proper policy, one that reaches a goal state from any state s with probability 1. The optimal policy, π*, can be extracted using the value function defined over the states, V*(s):

V*(s) = min_a Q*(s, a),  ∀s ∈ S    (1)
Q*(s, a) = C(s, a) + Σ_{s′} T(s, a, s′) V*(s′),  ∀(s, a)    (2)

where Q*(s, a) denotes the optimal Q-value of the action a in state s in the SSP M (Bertsekas and Tsitsiklis 1991).
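To make Equations (1) and (2) concrete, the following Python sketch performs Bellman backups over an explicit SSP until the residual falls below a threshold. The dictionary-based model interface (states, goal_states, applicable_actions, T, C) is an assumption of this illustration; the paper's experiments rely on heuristic search solvers such as A* and LAO* rather than plain value iteration.

```python
# A minimal value iteration sketch for an SSP, included only to make
# Equations (1) and (2) concrete; it is not the solver used in the paper.
def value_iteration(states, goal_states, applicable_actions, T, C, epsilon=1e-3):
    """states: iterable of states; goal_states: set of absorbing goals;
    applicable_actions(s): actions applicable in s;
    T(s, a): dict mapping successor s2 -> probability; C(s, a): cost."""
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            if s in goal_states:
                continue  # V*(s) = 0 for goal states
            # Bellman backup: Q(s, a) = C(s, a) + sum_s' T(s, a, s') V(s')
            best = min(
                C(s, a) + sum(p * V[s2] for s2, p in T(s, a).items())
                for a in applicable_actions(s)
            )
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < epsilon:
            return V
```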
Planning Using Reduced Models

While SSPs can be solved in polynomial time in the number of states, many problems of interest have a state space whose size is exponential in the number of variables describing the problem (Littman 1997). This complexity has led to the use of approximate methods such as reduced models. Reduced models simplify planning by considering a subset of outcomes. Let θ(s, a) denote the set of all outcomes of (s, a), θ(s, a) = {s′ | T(s, a, s′) > 0}.

Definition 1. A reduced model of an SSP M is represented by the tuple M′ = ⟨S, A, T′, C, s0, SG⟩ and characterized by an altered transition function T′ such that ∀(s, a) ∈ S × A: θ′(s, a) ⊆ θ(s, a), where θ′(s, a) = {s′ | T′(s, a, s′) > 0} denotes the set of outcomes in the reduced model for action a ∈ A in state s ∈ S.

We normalize the probabilities of the outcomes included in the reduced model, but more complex ways to redistribute the probabilities of ignored outcomes may be considered. The outcome selection process in a reduced model framework determines the number of outcomes and how the specific outcomes are selected. Depending on these two aspects, a spectrum of reductions exists with varying levels of probabilistic complexity.

The outcome selection principle (OSP) determines the outcomes included in the reduced model for each state-action pair, and thereby the altered transition function for each state-action pair. The outcome selection may be performed using a simple function, such as always choosing the most likely outcome, or a more complex function. Typically, a reduced model is characterized by a single OSP; that is, a single principle is used to determine the number of outcomes and how the outcomes are selected across the entire model. Hence, existing reduced models are incapable of selectively improving model fidelity.
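As an illustration of Definition 1 and the normalization of retained outcome probabilities described above, the sketch below builds θ′(s, a) and T′(s, a, ·) for the most likely outcome determinization. The function names and the dictionary representation of T are assumptions of this example, not the paper's implementation.

```python
def most_likely_outcomes(T, s, a):
    """theta'(s, a): keep only the most likely outcome of (s, a)."""
    outcomes = T(s, a)                       # dict: successor -> probability
    s_best = max(outcomes, key=outcomes.get)
    return {s_best: outcomes[s_best]}

def normalize(retained):
    """Renormalize the retained outcome probabilities so they sum to 1.
    More complex redistributions of the ignored mass are also possible."""
    total = sum(retained.values())
    return {s2: p / total for s2, p in retained.items()}

# T'(s, a, .) for the most likely outcome determinization (MLOD) of (s, a):
# t_prime_sa = normalize(most_likely_outcomes(T, s, a))
```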
Minimizing Negative Side Effects

One form of negative side effect of planning using reduced models is replanning, as a result of ignoring outcomes, in states that a human has prescribed as unsafe for deliberation. The set of states characterizing negative side effects for planning with a reduced model M′, denoted by NSE(M′), is defined as

NSE(M′) = {s′ | T(s, a, s′) > 0 ∧ D(s′) = true ∧ s′ ∉ θ′(s, a)},

where D(s′) is true if deliberation in the state is risky during policy execution.

Thus, the OSP determines the reduced model and thereby the resulting negative side effects. The challenge is to determine outcome selection principles that minimize the negative side effects of planning with reduced models without significantly compromising the run time gains.

While the negative side effects can always be minimized by using the full model or by picking the right determinization, the former is computationally complex and finding the right determinization is a non-trivial meta-level decision problem (Pineda and Zilberstein 2014). Another approach is to penalize the agent for replanning in risky states, that is, for not accounting for the risky outcomes in the reduced model formulation. However, this is useful only when the agent performs an exhaustive search in the space of reduced models, which is computationally complex. Given a set of features characterizing states unsuitable for deliberation, D(f⃗), we address this issue by providing an efficient reduced model formulation that can selectively improve model fidelity. A straightforward approach is to improve the model fidelity in the risky states themselves. However, this does not minimize the negative side effects; it is more beneficial to improve the model fidelity in states leading to these risky states. Based on this insight, we present a heuristic approach, based on reachability to these risky states, for selecting outcome selection principles for each state-action pair.

Portfolios of Reduced Models

We present planning using a portfolio of reduced models (PRM), a generalized approach to formulating reduced models by switching between different outcome selection principles, each of which represents a different reduced model. The approach is inspired by the benefits of using portfolios of algorithms to solve complex computational problems (Petrik and Zilberstein 2006).

Definition 2. Given a finite portfolio of outcome selection principles, Z = {ρ1, ρ2, ..., ρk}, k > 1, a model selector, Φ, generates T′ for a reduced model by mapping every (s, a) to an outcome selection principle, Φ : S × A → ρi, ρi ∈ Z, such that T′(s, a, s′) = TΦ(s,a)(s, a, s′), where TΦ(s,a)(s, a, s′) denotes the transition probability corresponding to the outcome selection principle selected by the model selector.

Clearly, existing reduced models are a special case of PRMs, with a model selector Φ that always selects the same OSP for every state-action pair. In planning using a portfolio of reduced models, however, the model selector typically utilizes the synergy of multiple outcome selection principles. Each state-action pair may have a different number of outcomes and a different mechanism to select the specific outcomes. Hence, we leverage this flexibility in outcome selection to minimize negative side effects in reduced models by using more informative outcomes in the risky states and simple outcome selection principles otherwise. Though the model selector may use multiple outcome selection principles to generate T′ in a PRM, note that the resulting model is still an SSP. In the rest of the paper, we focus on a basic instantiation of a PRM that alternates between the extremes of reduction.

Definition 3. A 0/1 reduced model (0/1 RM) is a PRM with a model selector that selects either one or all outcomes of an action in a state to be included in the reduced model.

A 0/1 RM is characterized by a model selector, Φ0/1, that either ignores the stochasticity completely (0) by considering only one outcome of (s, a), or fully accounts for the stochasticity (1) by considering all outcomes of the state-action pair in the reduced model. For example, it may use the full model in states prone to negative side effects, and determinization otherwise. Thus, a 0/1 RM that guarantees goal reachability with probability 1 can be devised, if a proper policy exists in the SSP. Our experiments using the 0/1 RM show that even such a basic instantiation of a PRM works well in practice.
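The following is a minimal sketch of a 0/1 model selector Φ0/1 in the sense of Definition 3: for each (s, a) it either retains all outcomes (full model) or collapses them to the most likely one (determinization). The predicate leads_to_risky_state(s, a), which stands in for the reachability-based test developed below, is an assumed helper for this illustration.

```python
def phi_0_1(T, s, a, leads_to_risky_state):
    """0/1 model selector: full model (1) for state-action pairs that can lead
    to states unsafe for deliberation, most likely outcome determinization (0)
    otherwise."""
    outcomes = T(s, a)                       # dict: successor -> probability
    if leads_to_risky_state(s, a):
        return dict(outcomes)                # keep all outcomes of (s, a)
    s_best = max(outcomes, key=outcomes.get)
    return {s_best: 1.0}                     # determinize
```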
Devising an efficient model selector automatically can be treated as a meta-level decision problem that is computationally more complex than solving the reduced model, due to the numerous possible combinations of outcome selection principles. Even in a 0/1 RM, deciding when to use determinization and when to use the full model is non-trivial. In the more general setting, all OSPs in Z may have to be evaluated to determine the best reduced model formulation in the worst case. Let τmax denote the maximum time taken for this evaluation across all states. In a PRM with a large portfolio, the OSPs may be redundant in terms of the specific outcomes they produce. For example, selecting the most likely outcome and greedily selecting an outcome based on a heuristic could result in the same outcome for some (s, a) pair. We use this property to show that the worst-case time complexity of devising an efficient model selector is independent of the size of the portfolio, which may be much larger than the size of the problem under consideration, and depends only on the sizes of the state and action sets. The following proposition formally establishes this complexity.

Proposition 1. The worst-case time complexity for Φ0/1 to generate T′ for a 0/1 RM is O(|A||S|² τmax).

Proof Sketch. For each (s, a), at most |Z| outcome selection principles are to be evaluated, and this takes at most τmax time. Since this process is repeated for every (s, a), Φ takes O(|S||A||Z| τmax) to generate T′. In the worst case, every action may transition to all states, and the outcome selection principles in Z may be redundant in terms of the number and specific set of outcomes they produce. To formulate a 0/1 RM of an SSP, it may be necessary to evaluate every outcome selection principle that corresponds to a determinization or to the full model. The set of unique outcome selection principles, k, for a 0/1 RM is composed of all unique deterministic outcomes, which is at most every state in the SSP, plus the full model, so |k| ≤ |S| + 1. Replacing |Z| with |k|, the complexity reduces to O(|A||S|² τmax).

Heuristic Approach to Model Selection

Since it is computationally complex to perform an exhaustive search in the space of possible 0/1 RM models, we present an approximation that identifies when to use the full model, using samples.

Given the features characterizing negative side effects, model selection is performed by generating and solving samples. The samples are generated automatically by sampling states from the target problem. Alternatively, small problem instances in the target domain may be used as samples. In this paper, smaller problems are created automatically by multiple trials of depth-limited random walk on the target problems and solved using LAO* (Hansen and Zilberstein 2001). The probability of reaching the states unsafe for deliberation is computed for these samples, based on the features. Trivially, as the number of samples and the depth of the random walk are increased, the estimates converge to their true values. The full model is used in states that meet the reachability threshold. In all other states, determinization is used.

Figure 2: Examples of the most likely outcome determinization (MLOD) and a 0/1 RM of a problem: (a) original problem, (b) MLOD, (c) 0/1 RM. The shaded state s5 in the original problem denotes a risky state.

Figure 2 is an example of an SSP with a risky state (shaded) and instantiations of reduced models formed with uniform and adaptive outcome selection principles. MLOD uses a uniform outcome selection principle (it always selects the most likely outcome) and may ignore s5 in the reduced model. A 0/1 RM, however, uses the full model in the state-action pairs leading to the risky state, minimizing negative side effects and resulting in a model that is better prepared to handle the risky state. The reachability threshold can be varied to devise reduced models of different fidelity, with varying proportions of full model usage and therefore different sensitivity to negative side effects and computational complexity.

Experimental Results

We experiment with the 0/1 RM on three proof-of-concept domains: an electric vehicle charging problem using real-world data from a university campus, and two benchmark planning problems, the racetrack domain and the sailing domain. The experiments employ a simple portfolio Z = {most likely outcome determinization (MLOD), full model}. We compare the results of the 0/1 RM to those of MLOD and the M02 reduction, which greedily selects two outcomes per state-action pair.

The expected number of negative side effects, the cost of reaching the goal, and the planning time are used as evaluation metrics. For the 0/1 RM, we use the full model in states that had a reachability of 0.25 or more to at least one of the risky states, estimated using thirty samples. In all other states, MLOD is used. All results are averaged over 100 trials of planning and execution simulations, and the average times include replanning time. The deterministic problems are solved using the A* algorithm (Hart, Nilsson, and Raphael 1968) and the other problems using LAO*. All algorithms were implemented with ε = 10⁻³ and use the hmin heuristic computed using a labeled version of LRTA* (Korf 1990).
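The sampling-based test described above can be sketched as follows: depth-limited random walks estimate the probability of reaching a state that is unsafe for deliberation, and the full model is used wherever the estimate meets the threshold (0.25 with thirty samples in our experiments). The uniform random action choice and the exact stopping rule are assumptions of this sketch.

```python
import random

def estimated_risk_reachability(T, applicable_actions, is_unsafe, start,
                                depth, num_samples=30):
    """Fraction of depth-limited random walks from `start` that reach a state
    unsafe for deliberation."""
    hits = 0
    for _ in range(num_samples):
        s = start
        for _ in range(depth):
            a = random.choice(applicable_actions(s))
            outcomes = T(s, a)               # dict: successor -> probability
            successors = list(outcomes)
            weights = [outcomes[s2] for s2 in successors]
            s = random.choices(successors, weights=weights, k=1)[0]
            if is_unsafe(s):
                hits += 1
                break
    return hits / num_samples

# Full model if the estimate is at least the threshold; MLOD otherwise, e.g.:
# use_full_model = estimated_risk_reachability(T, A, is_unsafe, s, depth) >= 0.25
```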
EV Charging Problem

The electric vehicle (EV) charging domain considers an EV operating in a vehicle-to-grid (V2G) setting, in which an EV can either charge or discharge energy from the smart grid (Ma et al. 2012; Saisubramanian, Zilberstein, and Shenoy 2017). By planning when to buy or sell electricity, an EV can devise a robust policy for charging and discharging that is consistent with the owner's preferences, while minimizing the long-term cost of operating the vehicle.

We consider uncertainty in parking duration, which is specified by a probability that certain states may become terminal states. This problem is modeled as a discrete finite-horizon MDP with the maximum parking duration as the horizon H. Each state is represented by the tuple ⟨l, t, d, p, e⟩, where l denotes the current charge level of the vehicle, t ≤ H denotes the current time step, d and p denote the current demand level and price distribution for electricity, and 0 ≤ e ≤ 3 denotes the anticipated departure time specified by the owner. If the owner has not provided this information, then e = 3 and the agent plans for H. Otherwise, e denotes the time steps remaining until departure. The vehicle can charge (Ω+_i) and discharge (Ω−_i) at three different speed levels, where 1 ≤ i ≤ 3 denotes the speed level, or remain idle (∅). The Ω+_i and Ω−_i actions are stochastic, while the ∅ action is deterministic. The objective is to find a reward-maximizing policy π* : S → A that maximizes goal reachability, given the charging preferences of the vehicle owner. If the exit charge level (goal) required by the owner is not met at departure, then the owner may need to extend the parking duration to reach the required charge level. Since this is undesirable, any state from which the goal cannot be reached in the remaining parking duration is treated as a preference violation (risk) in the MDP, with a large penalty.

Four demand levels and two price distributions are used in the experiments. Each t is equivalent to 30 minutes in real time. We assume that the owner may depart between four and six hours of parking, with a probability of 0.2 that they announce their planned departure time. Outside that window, there is a lower probability of 0.05 that they announce their departure. The charging rewards and the peak hours are based on real data (Eversource 2017). The battery capacity and the charge speeds for the EV are based on the Nissan Leaf configuration. The charge and discharge speeds are considered to be equal. The battery inefficiency is accounted for by adding a 15% penalty on the rewards. The negative side effects are estimated with a one-step lookahead, using the following features: the time remaining until departure, whether the current time is a peak hour, and whether there is sufficient charge for discharging.

EV Dataset. The data used in our experiments consist of charging schedules of electric cars over a four-month duration in 2017 from an American university campus. The data is clustered based on the entry and exit charges, and we selected 25 representative problem instances across clusters for our experiments. The data is from a typical charging station, where the EV is unplugged once the desired charge level is reached. Since we consider an extended parking scenario, as in a vehicle parked at work, we assume a parking duration of up to eight hours in our experiments. Therefore, for each instance, we alter the parking duration.
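To illustrate how the listed features drive the one-step lookahead in the EV domain, here is a hypothetical sketch. The state fields mirror the tuple ⟨l, t, d, p, e⟩ above, but the helper names (is_peak_hour, min_discharge_level) and thresholds are assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EVState:
    l: int  # current charge level
    t: int  # current time step, t <= H
    d: int  # demand level
    p: int  # price distribution
    e: int  # anticipated departure (0-2: steps remaining; 3: unannounced)

def deliberation_is_risky(s, horizon, is_peak_hour, min_discharge_level):
    """Feature-based check of whether replanning in s would be unsafe:
    little time left before departure, peak-hour pricing, or insufficient
    charge to discharge. Thresholds here are illustrative assumptions."""
    time_left = s.e if s.e < 3 else horizon - s.t
    return (time_left <= 1
            or is_peak_hour(s.t)
            or s.l < min_discharge_level)

def risky_successors(T, s, a, is_risky):
    """One-step lookahead: successors of (s, a) flagged as unsafe for
    deliberation, i.e., potential negative side effects if ignored."""
    return {s2 for s2, prob in T(s, a).items() if prob > 0 and is_risky(s2)}
```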
Racetrack Domain

We experimented with six problem instances from the racetrack domain (Barto, Bradtke, and Singh 1995), with modifications to increase the difficulty of the problem. We modified the problem such that, in addition to a 0.10 probability of slipping, there is a 0.20 probability of randomly changing the intended acceleration by one unit. The negative side effects are estimated using a one-step lookahead and state features such as whether the successor is a wall, a pothole, or the goal, and whether the successor is moving away from the goal, estimated using the heuristic.

Sailing Domain

Finally, we present results on six instances of the sailing domain (Kocsis and Szepesvári 2006). The problems vary in terms of grid size and goal position (opposite corner (C) or middle (M) of the grid). In this domain, the actions are deterministic and uncertainty is caused by stochastic changes in the direction of the wind. The negative side effects are estimated using a one-step lookahead and state features such as the difference between the action's intended direction of movement and the wind's direction, and whether the successor is moving away from the goal, estimated using the heuristic value.

Problem Instance     | MLOD  | M02   | 0/1 RM
EV-RF-1              | 12.91 | 7.16  | 0
EV-RF-2              | 11.88 | 6.60  | 0
EV-RF-3              | 19.42 | 8.24  | 0
EV-RF-4              | 19.90 | 12.09 | 0
Racetrack-Square-3   | 20.80 | 0     | 0.20
Racetrack-Square-4   | 21.90 | 10.73 | 7.20
Racetrack-Square-5   | 21.20 | 16.30 | 9.03
Racetrack-Ring-3     | 23.40 | 0     | 0
Racetrack-Ring-5     | 26.20 | 11.58 | 17.00
Racetrack-Ring-6     | 28.00 | 3.33  | 3.28
Sailing-20(C)        | 51.4  | 30.26 | 0
Sailing-40(C)        | 13.51 | 15.98 | 9.88
Sailing-80(C)        | 50.3  | 16.7  | 13.25
Sailing-20(M)        | 50.09 | 14.82 | 0.12
Sailing-40(M)        | 9.59  | 33.15 | 1.11
Sailing-80(M)        | 52.9  | 11.61 | 0

Table 2: Average negative side effects of different reduced models on three domains. The results are averaged over 100 trials.

Discussion

Table 2 shows the average number of negative side effects (risky states encountered without an action to execute) of the three techniques on the three domains. For the EV domain, the results are aggregated over 25 problem instances for each reward function. In our experiments, we observe that as the model fidelity is improved, the negative side effects are minimized. For all four reward functions of the EV domain, the 0/1 RM has no negative side effects. Overall, the 0/1 RM has the least negative side effects in all three proof-of-concept domains.

Table 3 shows the average increase in cost (%) (with respect to the optimal cost as a lower bound) and the time savings (%) (with respect to solving the original problem). A lower cost increase indicates that the performance of the technique is closer to optimal, and higher time savings indicate improved run time gains from using the model. The runtime of the 0/1 RM is considerably faster than solving the original problem, while yielding almost optimal solutions. This shows that our approach improves the solution quality and minimizes negative side effects without significantly affecting the run time gains. Note that we solve the 0/1 RM using an optimal algorithm, LAO*, to demonstrate the effectiveness of our framework by comparing the optimal solutions of the models. In practice, any SSP solver (optimal or not) may be used to improve the run time gains. Overall, the 0/1 RM yields almost optimal results in terms of cost and negative side effects.
Problem Instance     | MLOD (%Cost / %Time) | M02 (%Cost / %Time) | 0/1 RM (%Cost / %Time)
EV-RF-1              | 28.17 / 31.03        | 48.42 / 15.96       | 7.04 / 21.47
EV-RF-2              | 35.28 / 36.29        | 35.96 / 24.93       | 5.45 / 22.17
EV-RF-3              | 38.55 / 35.56        | 30.61 / 29.05       | 9.59 / 28.54
EV-RF-4              | 32.29 / 55.56        | 26.85 / 29.05       | 11.11 / 46.06
Racetrack-Square-3   | 24.69 / 99.73        | 22.45 / 99.45       | 0.26 / 99.59
Racetrack-Square-4   | 44.61 / 99.85        | 18.87 / 97.13       | 5.73 / 99.68
Racetrack-Square-5   | 33.32 / 93.83        | 8.73 / 34.21        | 5.58 / 94.34
Racetrack-Ring-4     | 8.59 / 99.20         | 14.07 / 99.13       | 2.93 / 99.06
Racetrack-Ring-5     | 58.25 / 99.55        | 30.45 / 82.91       | 17.17 / 94.35
Racetrack-Ring-6     | 30.62 / 99.26        | 6.41 / 66.05        | 8.30 / 92.07
Sailing-20(C)        | 76.79 / 92.86        | 31.83 / 49.68       | 7.11 / 85.72
Sailing-40(C)        | 63.34 / 75.91        | 19.52 / 62.21       | 2.85 / 71.66
Sailing-80(C)        | 39.19 / 86.99        | 13.25 / 70.27       | 2.81 / 83.88
Sailing-20(M)        | 73.81 / 99.84        | 79.62 / 39.56       | 1.63 / 93.25
Sailing-40(M)        | 65.64 / 94.56        | 37.75 / 88.86       | 3.08 / 85.01
Sailing-80(M)        | 87.66 / 93.48        | 24.59 / 90.37       | 4.99 / 96.67

Table 3: Comparison of the solution quality of different reduced models (formed using different model selectors) on three domains. Each entry reports % cost increase / % time savings, averaged over 100 trials.

Related Work

Interest in using reduced models to accelerate planning increased after the success of FF-Replan (Yoon, Fern, and Givan 2007), which won the 2004 IPPC using the Fast Forward (FF) technique to generate deterministic plans (Hoffmann 2001). Following the success of FF-Replan, researchers have proposed various methods to improve determinization (Yoon et al. 2010; Keller and Eyerich 2011; Issakkimuthu et al. 2015). Robust FF (RFF) generates a plan for an envelope of states such that the probability of reaching a state outside the envelope is below some predefined threshold (Teichteil-Königsbuch, Kuter, and Infantes 2010), and replans to the subset of states for which a policy has already been computed. Our work differs from RFF in that we compute reachability as a pre-processing step to formulate the 0/1 RM, not the policy.

Safety issues in SSPs have been studied mostly with respect to unavoidable dead ends that affect goal reachability (Kolobov and Mausam 2012; Kolobov, Mausam, and Weld 2012; Trevizan, Teichteil-Königsbuch, and Thiébaux 2017). Recently, researchers have focused on formulating richer reduced models that can balance the trade-off between complexity and model fidelity so as to improve the solution quality and safety when planning with reduced models (Pineda and Zilberstein 2014; Saisubramanian, Zilberstein, and Shenoy 2018). Our work aims to strike a balance between minimizing negative side effects and using a simpler model for planning, bearing similarity to the works that balance the trade-off between solution quality and model fidelity in reduced models. However, unlike the existing works, we target a well-defined and specific form of risk that arises due to replanning in a reduced model setting.

The importance of addressing negative side effects in AI systems is gaining growing interest. Amodei et al. (2016) address the problem of negative side effects in a reinforcement learning setting by penalizing the agent while optimizing the reward. Zhang, Durfee, and Singh (2018) study a form of negative side effects in MDPs that occurs when an agent's actions alter the state features in a way that may negatively surprise the user. In their work, the agent queries a user to learn the features that are acceptable for modification and computes a plan accordingly. Hadfield-Menell et al. (2017) address negative side effects that arise from misspecified objectives and misspecified rewards by using the specified reward as an observation from which to learn the true reward function.

Conclusion and Future Work

Reduced models are a promising approach to quickly solve large SSPs. However, the existing techniques are oblivious to the risky outcomes in the original problem when formulating a reduced model.
We propose planning using a portfolio of reduced models, which provides the flexibility in outcome selection needed to minimize the negative side effects while still accelerating planning. We also present a heuristic approach to determine the outcome selection principles. Our empirical results demonstrate the promise of this framework; the 0/1 RM based on the proposed heuristic improves the solution quality, apart from minimizing the negative side effects. Our results also contribute to a better understanding of how disparate reduced model techniques help improve the solution quality while minimizing the negative side effects.

This paper focuses on addressing one specific form of negative side effect in planning that occurs when using reduced models. There are a number of interesting future research directions in this area. First, we are looking into ways to automatically identify risks in the system. One approach is to generate samples for learning risks using machine learning techniques such as regression and decision stumps. Another approach is to employ limited user data, in the form of human demonstrations, to learn the features and states that are risky for replanning. Second, we aim to build a classifier to distinguish side effects from negative side effects, which is a critical component for automatically identifying the risks. Finally, the 0/1 RM represents an initial exploration of a broad spectrum of PRMs, which we will continue to explore, along with further analysis of outcome selection methods for PRMs with varying levels of probabilistic complexity.

Acknowledgments

Support for this work was provided in part by the U.S. National Science Foundation grants IIS-1524797 and IIS-1724101.
References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81–138.

Bertsekas, D. P., and Tsitsiklis, J. N. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research 16:580–595.

Eversource. 2017. Eversource energy - time of use rates. https://www.eversource.com/clp/vpp/vpp.aspx.

Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse reward design. In Advances in Neural Information Processing Systems.

Hansen, E. A., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence 129:35–62.

Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4:100–107.

Hoffmann, J. 2001. FF: The fast-forward planning system. AI Magazine 22:57.

Issakkimuthu, M.; Fern, A.; Khardon, R.; Tadepalli, P.; and Xue, S. 2015. Hindsight optimization for probabilistic planning with factored actions. In Proc. of the 25th International Conference on Automated Planning and Scheduling.

Keller, T., and Eyerich, P. 2011. A polynomial all outcome determinization for probabilistic planning. In Proc. of the 21st International Conference on Automated Planning and Scheduling.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proc. of the 17th European Conference on Machine Learning, 282–293. Springer.

Kolobov, A., and Mausam. 2012. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning 6:1–210.

Kolobov, A.; Mausam; and Weld, D. 2012. A theory of goal-oriented MDPs with dead ends. In Conference on Uncertainty in Artificial Intelligence, 438–447.

Korf, R. E. 1990. Real-time heuristic search. Artificial Intelligence 42:189–211.

Kulić, D., and Croft, E. A. 2005. Safe planning for human-robot interaction. Journal of Field Robotics 22:383–396.

Littman, M. L. 1997. Probabilistic propositional planning: Representations and complexity. In Proc. of the 14th AAAI Conference on Artificial Intelligence.

Ma, Y.; Houghton, T.; Cruden, A.; and Infield, D. 2012. Modeling the benefits of vehicle-to-grid technology to a power system. IEEE Transactions on Power Systems 27:1012–1020.

Petrik, M., and Zilberstein, S. 2006. Learning parallel portfolios of algorithms. Annals of Mathematics and Artificial Intelligence 48:85–106.

Pineda, L., and Zilberstein, S. 2014. Planning under uncertainty using reduced models: Revisiting determinization. In Proc. of the 24th International Conference on Automated Planning and Scheduling.

Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2017. Optimizing electric vehicle charging through determinization. In ICAPS Workshop on Scheduling and Planning Applications.

Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2018. Planning using a portfolio of reduced models. In Proc. of the 17th International Conference on Autonomous Agents and MultiAgent Systems.

Teichteil-Königsbuch, F.; Kuter, U.; and Infantes, G. 2010. Incremental plan aggregation for generating policies in MDPs. In Proc. of the 9th International Conference on Autonomous Agents and Multiagent Systems.

Trevizan, F.; Teichteil-Königsbuch, F.; and Thiébaux, S. 2017. Efficient solutions for stochastic shortest path problems with dead ends. In Proc. of the 33rd Conference on Uncertainty in Artificial Intelligence.

Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A baseline for probabilistic planning. In Proc. of the 17th International Conference on Automated Planning and Scheduling.

Yoon, S.; Ruml, W.; Benton, J.; and Do, M. B. 2010. Improving determinization in hindsight for on-line probabilistic planning. In Proc. of the 20th International Conference on Automated Planning and Scheduling.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-regret querying on side effects for safe optimality in factored Markov decision processes. In IJCAI, 4867–4873.

Zilberstein, S. 2015. Building strong semi-autonomous systems. In Proc. of the 29th AAAI Conference on Artificial Intelligence.