Minimizing the Negative Side Effects of Planning with Reduced Models

Sandhya Saisubramanian and Shlomo Zilberstein
College of Information and Computer Sciences
University of Massachusetts Amherst, MA 01003, USA
{saisubramanian,shlomo}@cs.umass.edu

Abstract

Reduced models of large Markov decision processes accelerate planning by considering a subset of outcomes for each state-action pair. This reduction in reachable states leads to replanning when the agent encounters states without a precomputed action during plan execution. However, not all states are suitable for replanning. In the worst case, the agent may not be able to reach the goal from the newly encountered state. Agents should be better prepared to handle such risky situations and avoid replanning in risky states. Hence, we consider replanning in states that are unsafe for deliberation as a negative side effect of planning with reduced models. While the negative side effects can be minimized by always using the full model, this defeats the purpose of using reduced models. The challenge is to plan with reduced models, but somehow account for the possibility of encountering risky situations. An agent should thus only replan in states that the user has approved as safe for replanning. To that end, we propose planning using a portfolio of reduced models, a planning paradigm that minimizes the negative side effects of planning using reduced models by alternating between different outcome selection approaches. We empirically demonstrate the effectiveness of our approach on three domains: an electric vehicle charging domain using real-world data from a university campus and two benchmark planning problems.

Figure 1: Reduced models are complemented by replanning when the agent encounters a state for which the policy computed using the reduced model does not specify an action. Some states may be unsafe for deliberation (replanning), which is considered a negative side effect of planning with reduced models.

Introduction

A Stochastic Shortest Path (SSP) problem is a Markov Decision Process (MDP) with a start state and goal state(s), where the objective is to find a policy that minimizes the expected cost of reaching a goal state from the start state (Bertsekas and Tsitsiklis 1991). Solving large SSPs optimally and efficiently is an active research area in automated planning, with many practical applications. Given the computational complexity of solving large SSPs optimally (Littman 1997), there has been considerable interest in developing efficient approximations, such as reduced models, that trade solution quality for computational gains. Reduced models simplify the problem by partially or completely ignoring uncertainty, thereby reducing the set of reachable states for the planner. We consider reduced models in which the number of outcomes per action is reduced relative to the original model. Due to model simplification, an agent may encounter unexpected states during plan execution, for which the policy computed using the reduced model may not specify an action. In such instances, the agent needs to replan online and find a new way to reach a goal state. We consider this a side effect of using reduced models, since the agent never has to replan when the complete model is used. Current reduced model techniques assume that the agent can replan in any state (Yoon, Fern, and Givan 2007; Keller and Eyerich 2011; Saisubramanian, Zilberstein, and Shenoy 2017). That is, they are oblivious to how the ignored outcomes may affect the system. Some of these side effects may be acceptable to the user, but others may be unsafe even though they accelerate planning (Figure 1).

Consider, for example, a robot navigating on a sidewalk adjacent to a street. The robot has a 0.1 probability of sliding onto the street when navigating at high speed and a 0.02 probability of sliding onto the street when navigating at low speed.
When planning with a reduced model that ignores the possibility of sliding as an outcome, the robot will have to replan if the ignored outcome occurs during plan execution. However, it is risky for the robot to replan while on the street, and it is expected to be better prepared to handle such situations. Another example of replanning in a risky state is an autonomous vehicle replanning in the middle of an intersection after missing a turn or an exit. In the terminology introduced by Amodei et al. (2016), these are negative side effects of planning with reduced models: undesired behavior that results from ignoring certain outcomes in planning. Table 1 outlines various properties of negative side effects.

Property        | Property Values
Impact          | Reversible or irreversible
Severity        | Significant or insignificant
Occurrence      | Avoidable or unavoidable
Observability   | Fully observable or partially observable

Table 1: Classification of negative side effects.

The importance of accounting for risks in AI systems is attracting growing interest (Kulić and Croft 2005; Zilberstein 2015). One specific form of risk is the negative side effects (Amodei et al. 2016) that occur when an autonomous agent optimizes an objective function that focuses on certain aspects of the environment, leading to undesirable effects on other aspects of the environment. Recently, researchers have proposed techniques to address the negative side effects that arise due to misspecified rewards (Hadfield-Menell et al. 2017) and incomplete models (Zhang, Durfee, and Singh 2018). In this work, we focus on addressing negative side effects in the form of deliberation in unsafe states that arise due to planning with simplified or reduced models. This is a type of irreversible negative side effect that is fully observable and avoidable. The severity of replanning may be mild or significant depending on the state. While the negative side effects can always be reduced by using the full model, this defeats the purpose of using reduced models. The challenge is to plan with reduced models while simultaneously being prepared for handling risky situations. This can be achieved by formulating reduced models that can adapt to risks by selectively improving the model fidelity.

We propose planning using a portfolio of outcome selection principles to minimize the negative side effects of planning with reduced models.
The model fidelity is improved in states that are risky, that is, unsafe for deliberation or replanning. We consider a factored state representation in which the states are represented as a vector of features. Given a set of features that characterize states unsafe for deliberation, our approach employs more informative outcomes or the full model for actions in states leading to these side effects, and a simple outcome selection otherwise. To identify states where model fidelity is to be improved, the agent performs reachability analysis using samples and the given features. The sample trajectories are generated by depth-limited random walks. Reachability analysis is performed primarily with respect to negative side effects, and this is used as a heuristic for choosing the outcome selection principles in a portfolio. We evaluate our approach on three proof-of-concept domains: an electric vehicle charging problem using real-world data, a racetrack domain, and a sailing domain. Our empirical results show that our approach significantly minimizes the negative side effects without affecting the run time. While we address one particular form of negative side effects in planning using reduced models, our proposed framework is generic and can be extended to potentially address other forms of negative side effects.

Our primary contributions are: (i) formalizing negative side effects in planning using reduced models; (ii) introducing a portfolio of reduced models approach to minimize the negative side effects; and (iii) empirical evaluation of the proposed approach on three domains.

Background

This section provides a brief background on stochastic shortest path problems and planning using reduced models.

Stochastic Shortest Path Problems

A Stochastic Shortest Path (SSP) MDP is defined by M = ⟨S, A, T, C, s0, SG⟩, where
• S is a finite set of states;
• A is a finite set of actions;
• T : S × A × S → [0, 1] is the transition function, which specifies the probability of reaching a state s′ by executing an action a in state s and is denoted by T(s, a, s′);
• C : S × A → R+ ∪ {0} specifies the cost of executing an action a in state s and is denoted by C(s, a);
• s0 ∈ S is the start state; and
• SG ⊆ S is the set of absorbing goal states.

The cost of an action is positive in all states except goal states, where it is zero. The objective in an SSP is to minimize the expected cost of reaching a goal state from the start state. It is assumed that there exists at least one proper policy, one that reaches a goal state from any state s with probability 1. The optimal policy, π*, can be extracted using the value function defined over the states, V*(s):

V*(s) = min_a Q*(s, a),  ∀s ∈ S    (1)
Q*(s, a) = C(s, a) + Σ_{s′} T(s, a, s′) V*(s′),  ∀(s, a)    (2)

where Q*(s, a) denotes the optimal Q-value of the action a in state s in the SSP M (Bertsekas and Tsitsiklis 1991).
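To make Equations (1) and (2) concrete, the following Python sketch performs Bellman backups over an explicit SSP until the residual falls below a threshold. The dictionary-based model interface (states, goal_states, applicable_actions, T, C) is an assumption of this illustration; the paper's experiments rely on heuristic search solvers such as A* and LAO* rather than plain value iteration.

```python
# A minimal value iteration sketch for an SSP, included only to make
# Equations (1) and (2) concrete; it is not the solver used in the paper.
def value_iteration(states, goal_states, applicable_actions, T, C, epsilon=1e-3):
    """states: iterable of states; goal_states: set of absorbing goals;
    applicable_actions(s): actions applicable in s;
    T(s, a): dict mapping successor s2 -> probability; C(s, a): cost."""
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            if s in goal_states:
                continue  # V*(s) = 0 for goal states
            # Bellman backup: Q(s, a) = C(s, a) + sum_s' T(s, a, s') V(s')
            best = min(
                C(s, a) + sum(p * V[s2] for s2, p in T(s, a).items())
                for a in applicable_actions(s)
            )
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < epsilon:
            return V
```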
Planning Using Reduced Models

While SSPs can be solved in polynomial time in the number of states, many problems of interest have a state space whose size is exponential in the number of variables describing the problem (Littman 1997). This complexity has led to the use of approximate methods such as reduced models. Reduced models simplify planning by considering a subset of outcomes. Let θ(s, a) denote the set of all outcomes of (s, a), θ(s, a) = {s′ | T(s, a, s′) > 0}.

Definition 1. A reduced model of an SSP M is represented by the tuple M′ = ⟨S, A, T′, C, s0, SG⟩ and characterized by an altered transition function T′ such that ∀(s, a) ∈ S × A: θ′(s, a) ⊆ θ(s, a), where θ′(s, a) = {s′ | T′(s, a, s′) > 0} denotes the set of outcomes in the reduced model for action a ∈ A in state s ∈ S.

We normalize the probabilities of the outcomes included in the reduced model, but more complex ways to redistribute the probabilities of ignored outcomes may be considered. The outcome selection process in a reduced model framework determines the number of outcomes and how the specific outcomes are selected. Depending on these two aspects, a spectrum of reductions exists with varying levels of probabilistic complexity.

The outcome selection principle (OSP) determines the outcomes included in the reduced model for each state-action pair, and thereby the altered transition function for each state-action pair. The outcome selection may be performed using a simple function, such as always choosing the most likely outcome, or a more complex function. Typically, a reduced model is characterized by a single OSP; that is, a single principle is used to determine the number of outcomes and how the outcomes are selected across the entire model. Hence, existing reduced models are incapable of selectively improving model fidelity.
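As an illustration of Definition 1 and the normalization of retained outcome probabilities described above, the sketch below builds θ′(s, a) and T′(s, a, ·) for the most likely outcome determinization. The function names and the dictionary representation of T are assumptions of this example, not the paper's implementation.

```python
def most_likely_outcomes(T, s, a):
    """theta'(s, a): keep only the most likely outcome of (s, a)."""
    outcomes = T(s, a)                       # dict: successor -> probability
    s_best = max(outcomes, key=outcomes.get)
    return {s_best: outcomes[s_best]}

def normalize(retained):
    """Renormalize the retained outcome probabilities so they sum to 1.
    More complex redistributions of the ignored mass are also possible."""
    total = sum(retained.values())
    return {s2: p / total for s2, p in retained.items()}

# T'(s, a, .) for the most likely outcome determinization (MLOD) of (s, a):
# t_prime_sa = normalize(most_likely_outcomes(T, s, a))
```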
Minimizing Negative Side Effects

One form of negative side effect of planning using reduced models is replanning, as a result of ignoring outcomes, in states that a human has prescribed as unsafe for deliberation. The set of states characterizing negative side effects for planning with a reduced model M′, denoted by NSE(M′), is defined as

NSE(M′) = {s′ | T(s, a, s′) > 0 ∧ D(s′) = true ∧ s′ ∉ θ′(s, a)},

where D(s′) is true if deliberation in the state is risky during policy execution.

Thus, the OSP determines the reduced model and thereby the resulting negative side effects. The challenge is to determine outcome selection principles that minimize the negative side effects of planning with reduced models without significantly compromising the run time gains.

While the negative side effects can always be minimized by using the full model or by picking the right determinization, the former is computationally complex and finding the right determinization is a non-trivial meta-level decision problem (Pineda and Zilberstein 2014). Another approach is to penalize the agent for replanning in risky states, that is, for not accounting for the risky outcomes in the reduced model formulation. However, this is useful only when the agent performs an exhaustive search in the space of reduced models, which is computationally complex. Given a set of features characterizing states unsuitable for deliberation, D(f⃗), we address this issue by providing an efficient reduced model formulation that can selectively improve model fidelity. A straightforward approach is to improve the model fidelity in the risky states themselves. However, this does not minimize the negative side effects; it is more beneficial to improve the model fidelity in states leading to these risky states. Based on this insight, we present a heuristic approach, based on reachability to these risky states, for selecting outcome selection principles for each state-action pair.

Portfolios of Reduced Models

We present planning using a portfolio of reduced models (PRM), a generalized approach to formulating reduced models by switching between different outcome selection principles, each of which represents a different reduced model. The approach is inspired by the benefits of using portfolios of algorithms to solve complex computational problems (Petrik and Zilberstein 2006).

Definition 2. Given a finite portfolio of outcome selection principles, Z = {ρ1, ρ2, ..., ρk}, k > 1, a model selector, Φ, generates T′ for a reduced model by mapping every (s, a) to an outcome selection principle, Φ : S × A → ρi, ρi ∈ Z, such that T′(s, a, s′) = TΦ(s,a)(s, a, s′), where TΦ(s,a)(s, a, s′) denotes the transition probability corresponding to the outcome selection principle selected by the model selector.

Clearly, existing reduced models are a special case of PRMs, with a model selector Φ that always selects the same OSP for every state-action pair. In planning using a portfolio of reduced models, however, the model selector typically utilizes the synergy of multiple outcome selection principles. Each state-action pair may have a different number of outcomes and a different mechanism to select the specific outcomes. Hence, we leverage this flexibility in outcome selection to minimize negative side effects in reduced models by using more informative outcomes in the risky states and simple outcome selection principles otherwise. Though the model selector may use multiple outcome selection principles to generate T′ in a PRM, note that the resulting model is still an SSP. In the rest of the paper, we focus on a basic instantiation of a PRM that alternates between the extremes of reduction.

Definition 3. A 0/1 reduced model (0/1 RM) is a PRM with a model selector that selects either one or all outcomes of an action in a state to be included in the reduced model.

A 0/1 RM is characterized by a model selector, Φ0/1, that either ignores the stochasticity completely (0) by considering only one outcome of (s, a), or fully accounts for the stochasticity (1) by considering all outcomes of the state-action pair in the reduced model. For example, it may use the full model in states prone to negative side effects, and determinization otherwise. Thus, a 0/1 RM that guarantees goal reachability with probability 1 can be devised, if a proper policy exists in the SSP. Our experiments using the 0/1 RM show that even such a basic instantiation of a PRM works well in practice.
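The following is a minimal sketch of a 0/1 model selector Φ0/1 in the sense of Definition 3: for each (s, a) it either retains all outcomes (full model) or collapses them to the most likely one (determinization). The predicate leads_to_risky_state(s, a), which stands in for the reachability-based test developed below, is an assumed helper for this illustration.

```python
def phi_0_1(T, s, a, leads_to_risky_state):
    """0/1 model selector: full model (1) for state-action pairs that can lead
    to states unsafe for deliberation, most likely outcome determinization (0)
    otherwise."""
    outcomes = T(s, a)                       # dict: successor -> probability
    if leads_to_risky_state(s, a):
        return dict(outcomes)                # keep all outcomes of (s, a)
    s_best = max(outcomes, key=outcomes.get)
    return {s_best: 1.0}                     # determinize
```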
Devising an efficient model selector automatically can be treated as a meta-level decision problem that is computationally more complex than solving the reduced model, due to the numerous possible combinations of outcome selection principles. Even in a 0/1 RM, deciding when to use determinization and when to use the full model is non-trivial. In the more general setting, all OSPs in Z may have to be evaluated to determine the best reduced model formulation in the worst case. Let τmax denote the maximum time taken for this evaluation across all states. In a PRM with a large portfolio, the OSPs may be redundant in terms of the specific outcomes they produce. For example, selecting the most likely outcome and greedily selecting an outcome based on a heuristic could result in the same outcome for some (s, a) pair. We use this property to show that the worst-case time complexity of devising an efficient model selector is independent of the size of the portfolio, which may be much larger than the size of the problem under consideration, and depends only on the sizes of the state and action sets. The following proposition formally establishes this complexity.

Proposition 1. The worst-case time complexity for Φ0/1 to generate T′ for a 0/1 RM is O(|A||S|² τmax).

Proof Sketch. For each (s, a), at most |Z| outcome selection principles are to be evaluated, and this takes at most τmax time. Since this process is repeated for every (s, a), Φ takes O(|S||A||Z| τmax) to generate T′. In the worst case, every action may transition to all states, and the outcome selection principles in Z may be redundant in terms of the number and specific set of outcomes they produce. To formulate a 0/1 RM of an SSP, it may be necessary to evaluate every outcome selection principle that corresponds to a determinization or to the full model. The set of unique outcome selection principles, k, for a 0/1 RM is composed of all unique deterministic outcomes, which is at most every state in the SSP, plus the full model, so |k| ≤ |S| + 1. Replacing |Z| with |k|, the complexity reduces to O(|A||S|² τmax).

Heuristic Approach to Model Selection

Since it is computationally complex to perform an exhaustive search in the space of possible 0/1 RM models, we present an approximation that identifies when to use the full model, using samples.

Given the features characterizing negative side effects, model selection is performed by generating and solving samples. The samples are generated automatically by sampling states from the target problem. Alternatively, small problem instances in the target domain may be used as samples. In this paper, smaller problems are created automatically by multiple trials of depth-limited random walk on the target problems and solved using LAO* (Hansen and Zilberstein 2001). The probability of reaching the states unsafe for deliberation is computed for these samples, based on the features. Trivially, as the number of samples and the depth of the random walk are increased, the estimates converge to their true values. The full model is used in states that meet the reachability threshold. In all other states, determinization is used.

Figure 2: Examples of the most likely outcome determinization (MLOD) and a 0/1 RM of a problem: (a) original problem, (b) MLOD, (c) 0/1 RM. The shaded state s5 in the original problem denotes a risky state.

Figure 2 is an example of an SSP with a risky state (shaded) and instantiations of reduced models formed with uniform and adaptive outcome selection principles. MLOD uses a uniform outcome selection principle (it always selects the most likely outcome) and may ignore s5 in the reduced model. A 0/1 RM, however, uses the full model in the state-action pairs leading to the risky state, minimizing negative side effects and resulting in a model that is better prepared to handle the risky state. The reachability threshold can be varied to devise reduced models of different fidelity, with varying proportions of full model usage and therefore different sensitivity to negative side effects and computational complexity.

Experimental Results

We experiment with the 0/1 RM on three proof-of-concept domains: an electric vehicle charging problem using real-world data from a university campus, and two benchmark planning problems, the racetrack domain and the sailing domain. The experiments employ a simple portfolio Z = {most likely outcome determinization (MLOD), full model}. We compare the results of the 0/1 RM to those of MLOD and the M02 reduction, which greedily selects two outcomes per state-action pair.

The expected number of negative side effects, the cost of reaching the goal, and the planning time are used as evaluation metrics. For the 0/1 RM, we use the full model in states that had a reachability of 0.25 or more to at least one of the risky states, estimated using thirty samples. In all other states, MLOD is used. All results are averaged over 100 trials of planning and execution simulations, and the average times include replanning time. The deterministic problems are solved using the A* algorithm (Hart, Nilsson, and Raphael 1968) and the other problems using LAO*. All algorithms were implemented with ε = 10⁻³ and use the hmin heuristic computed using a labeled version of LRTA* (Korf 1990).
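The sampling-based test described above can be sketched as follows: depth-limited random walks estimate the probability of reaching a state that is unsafe for deliberation, and the full model is used wherever the estimate meets the threshold (0.25 with thirty samples in our experiments). The uniform random action choice and the exact stopping rule are assumptions of this sketch.

```python
import random

def estimated_risk_reachability(T, applicable_actions, is_unsafe, start,
                                depth, num_samples=30):
    """Fraction of depth-limited random walks from `start` that reach a state
    unsafe for deliberation."""
    hits = 0
    for _ in range(num_samples):
        s = start
        for _ in range(depth):
            a = random.choice(applicable_actions(s))
            outcomes = T(s, a)               # dict: successor -> probability
            successors = list(outcomes)
            weights = [outcomes[s2] for s2 in successors]
            s = random.choices(successors, weights=weights, k=1)[0]
            if is_unsafe(s):
                hits += 1
                break
    return hits / num_samples

# Full model if the estimate is at least the threshold; MLOD otherwise, e.g.:
# use_full_model = estimated_risk_reachability(T, A, is_unsafe, s, depth) >= 0.25
```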
EV Charging Problem

The electric vehicle (EV) charging domain considers an EV operating in a vehicle-to-grid (V2G) setting, in which an EV can either charge or discharge energy from the smart grid (Ma et al. 2012; Saisubramanian, Zilberstein, and Shenoy 2017). By planning when to buy or sell electricity, an EV can devise a robust policy for charging and discharging that is consistent with the owner's preferences, while minimizing the long-term cost of operating the vehicle.

We consider uncertainty in parking duration, which is specified by a probability that certain states may become terminal states. This problem is modeled as a discrete finite-horizon MDP with the maximum parking duration as the horizon H. Each state is represented by the tuple ⟨l, t, d, p, e⟩, where l denotes the current charge level of the vehicle, t ≤ H denotes the current time step, d and p denote the current demand level and price distribution for electricity, and 0 ≤ e ≤ 3 denotes the anticipated departure time specified by the owner. If the owner has not provided this information, then e = 3 and the agent plans for H. Otherwise, e denotes the time steps remaining until departure. The vehicle can charge (Ω+_i) and discharge (Ω−_i) at three different speed levels, where 1 ≤ i ≤ 3 denotes the speed level, or remain idle (∅). The Ω+_i and Ω−_i actions are stochastic, while the ∅ action is deterministic. The objective is to find a reward-maximizing policy π* : S → A that maximizes goal reachability, given the charging preferences of the vehicle owner. If the exit charge level (goal) required by the owner is not met at departure, then the owner may need to extend the parking duration to reach the required charge level. Since this is undesirable, any state from which the goal cannot be reached in the remaining parking duration is treated as a preference violation (risk) in the MDP, with a large penalty.

Four demand levels and two price distributions are used in the experiments. Each t is equivalent to 30 minutes in real time. We assume that the owner may depart between four and six hours of parking, with a probability of 0.2 that they announce their planned departure time. Outside that window, there is a lower probability of 0.05 that they announce their departure. The charging rewards and the peak hours are based on real data (Eversource 2017). The battery capacity and the charge speeds for the EV are based on the Nissan Leaf configuration. The charge and discharge speeds are considered to be equal. The battery inefficiency is accounted for by adding a 15% penalty on the rewards. The negative side effects are estimated with a one-step lookahead, using the following features: the time remaining until departure, whether the current time is a peak hour, and whether there is sufficient charge for discharging.

EV Dataset. The data used in our experiments consist of charging schedules of electric cars over a four-month duration in 2017 from an American university campus. The data is clustered based on the entry and exit charges, and we selected 25 representative problem instances across clusters for our experiments. The data is from a typical charging station, where the EV is unplugged once the desired charge level is reached. Since we consider an extended parking scenario, as in a vehicle parked at work, we assume a parking duration of up to eight hours in our experiments. Therefore, for each instance, we alter the parking duration.
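To illustrate how the listed features drive the one-step lookahead in the EV domain, here is a hypothetical sketch. The state fields mirror the tuple ⟨l, t, d, p, e⟩ above, but the helper names (is_peak_hour, min_discharge_level) and thresholds are assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EVState:
    l: int  # current charge level
    t: int  # current time step, t <= H
    d: int  # demand level
    p: int  # price distribution
    e: int  # anticipated departure (0-2: steps remaining; 3: unannounced)

def deliberation_is_risky(s, horizon, is_peak_hour, min_discharge_level):
    """Feature-based check of whether replanning in s would be unsafe:
    little time left before departure, peak-hour pricing, or insufficient
    charge to discharge. Thresholds here are illustrative assumptions."""
    time_left = s.e if s.e < 3 else horizon - s.t
    return (time_left <= 1
            or is_peak_hour(s.t)
            or s.l < min_discharge_level)

def risky_successors(T, s, a, is_risky):
    """One-step lookahead: successors of (s, a) flagged as unsafe for
    deliberation, i.e., potential negative side effects if ignored."""
    return {s2 for s2, prob in T(s, a).items() if prob > 0 and is_risky(s2)}
```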
Racetrack Domain

We experimented with six problem instances from the racetrack domain (Barto, Bradtke, and Singh 1995), with modifications to increase the difficulty of the problem. We modified the problem such that, in addition to a 0.10 probability of slipping, there is a 0.20 probability of randomly changing the intended acceleration by one unit. The negative side effects are estimated using a one-step lookahead and state features such as whether the successor is a wall, a pothole, or the goal, and whether the successor is moving away from the goal, estimated using the heuristic.

Sailing Domain

Finally, we present results on six instances of the sailing domain (Kocsis and Szepesvári 2006). The problems vary in terms of grid size and goal position (opposite corner (C) or middle (M) of the grid). In this domain, the actions are deterministic and uncertainty is caused by stochastic changes in the direction of the wind. The negative side effects are estimated using a one-step lookahead and state features such as the difference between the action's intended direction of movement and the wind's direction, and whether the successor is moving away from the goal, estimated using the heuristic value.

Problem Instance     | MLOD  | M02   | 0/1 RM
EV-RF-1              | 12.91 | 7.16  | 0
EV-RF-2              | 11.88 | 6.60  | 0
EV-RF-3              | 19.42 | 8.24  | 0
EV-RF-4              | 19.90 | 12.09 | 0
Racetrack-Square-3   | 20.80 | 0     | 0.20
Racetrack-Square-4   | 21.90 | 10.73 | 7.20
Racetrack-Square-5   | 21.20 | 16.30 | 9.03
Racetrack-Ring-3     | 23.40 | 0     | 0
Racetrack-Ring-5     | 26.20 | 11.58 | 17.00
Racetrack-Ring-6     | 28.00 | 3.33  | 3.28
Sailing-20(C)        | 51.4  | 30.26 | 0
Sailing-40(C)        | 13.51 | 15.98 | 9.88
Sailing-80(C)        | 50.3  | 16.7  | 13.25
Sailing-20(M)        | 50.09 | 14.82 | 0.12
Sailing-40(M)        | 9.59  | 33.15 | 1.11
Sailing-80(M)        | 52.9  | 11.61 | 0

Table 2: Average negative side effects of different reduced models on three domains. The results are averaged over 100 trials.

Discussion

Table 2 shows the average number of negative side effects (risky states encountered without an action to execute) of the three techniques on the three domains. For the EV domain, the results are aggregated over 25 problem instances for each reward function. In our experiments, we observe that as the model fidelity is improved, the negative side effects are minimized. For all four reward functions of the EV domain, the 0/1 RM has no negative side effects. Overall, the 0/1 RM has the least negative side effects in all three proof-of-concept domains.

Table 3 shows the average increase in cost (%) (with respect to the optimal cost as a lower bound) and the time savings (%) (with respect to solving the original problem). A lower cost increase indicates that the performance of the technique is closer to optimal, and higher time savings indicate improved run time gains from using the model. The runtime of the 0/1 RM is considerably faster than solving the original problem, while yielding almost optimal solutions. This shows that our approach improves the solution quality and minimizes negative side effects without significantly affecting the run time gains. Note that we solve the 0/1 RM using an optimal algorithm, LAO*, to demonstrate the effectiveness of our framework by comparing the optimal solutions of the models. In practice, any SSP solver (optimal or not) may be used to improve the run time gains. Overall, the 0/1 RM yields almost optimal results in terms of cost and negative side effects.
Problem Instance     | MLOD (%Cost / %Time) | M02 (%Cost / %Time) | 0/1 RM (%Cost / %Time)
EV-RF-1              | 28.17 / 31.03        | 48.42 / 15.96       | 7.04 / 21.47
EV-RF-2              | 35.28 / 36.29        | 35.96 / 24.93       | 5.45 / 22.17
EV-RF-3              | 38.55 / 35.56        | 30.61 / 29.05       | 9.59 / 28.54
EV-RF-4              | 32.29 / 55.56        | 26.85 / 29.05       | 11.11 / 46.06
Racetrack-Square-3   | 24.69 / 99.73        | 22.45 / 99.45       | 0.26 / 99.59
Racetrack-Square-4   | 44.61 / 99.85        | 18.87 / 97.13       | 5.73 / 99.68
Racetrack-Square-5   | 33.32 / 93.83        | 8.73 / 34.21        | 5.58 / 94.34
Racetrack-Ring-4     | 8.59 / 99.20         | 14.07 / 99.13       | 2.93 / 99.06
Racetrack-Ring-5     | 58.25 / 99.55        | 30.45 / 82.91       | 17.17 / 94.35
Racetrack-Ring-6     | 30.62 / 99.26        | 6.41 / 66.05        | 8.30 / 92.07
Sailing-20(C)        | 76.79 / 92.86        | 31.83 / 49.68       | 7.11 / 85.72
Sailing-40(C)        | 63.34 / 75.91        | 19.52 / 62.21       | 2.85 / 71.66
Sailing-80(C)        | 39.19 / 86.99        | 13.25 / 70.27       | 2.81 / 83.88
Sailing-20(M)        | 73.81 / 99.84        | 79.62 / 39.56       | 1.63 / 93.25
Sailing-40(M)        | 65.64 / 94.56        | 37.75 / 88.86       | 3.08 / 85.01
Sailing-80(M)        | 87.66 / 93.48        | 24.59 / 90.37       | 4.99 / 96.67

Table 3: Comparison of the solution quality of different reduced models (formed using different model selectors) on three domains. Each entry reports % cost increase / % time savings, averaged over 100 trials.

Related Work

Interest in using reduced models to accelerate planning increased after the success of FF-Replan (Yoon, Fern, and Givan 2007), which won the 2004 IPPC using the Fast Forward (FF) technique to generate deterministic plans (Hoffmann 2001). Following the success of FF-Replan, researchers have proposed various methods to improve determinization (Yoon et al. 2010; Keller and Eyerich 2011; Issakkimuthu et al. 2015). Robust FF (RFF) generates a plan for an envelope of states such that the probability of reaching a state outside the envelope is below some predefined threshold (Teichteil-Königsbuch, Kuter, and Infantes 2010), and replans to the subset of states for which a policy has already been computed. Our work differs from RFF in that we compute reachability as a pre-processing step to formulate the 0/1 RM, not the policy.

Safety issues in SSPs have been studied mostly with respect to unavoidable dead ends that affect goal reachability (Kolobov and Mausam 2012; Kolobov, Mausam, and Weld 2012; Trevizan, Teichteil-Königsbuch, and Thiébaux 2017). Recently, researchers have focused on formulating richer reduced models that can balance the trade-off between complexity and model fidelity so as to improve the solution quality and safety when planning with reduced models (Pineda and Zilberstein 2014; Saisubramanian, Zilberstein, and Shenoy 2018). Our work aims to strike a balance between minimizing negative side effects and using a simpler model for planning, bearing similarity to the works that balance the trade-off between solution quality and model fidelity in reduced models. However, unlike the existing works, we target a well-defined and specific form of risk that arises due to replanning in a reduced model setting.

The importance of addressing negative side effects in AI systems is gaining growing interest. Amodei et al. (2016) address the problem of negative side effects in a reinforcement learning setting by penalizing the agent while optimizing the reward. Zhang, Durfee, and Singh (2018) study a form of negative side effects in MDPs that occurs when an agent's actions alter the state features in a way that may negatively surprise the user. In their work, the agent queries a user to learn the features that are acceptable for modification and computes a plan accordingly. Hadfield-Menell et al. (2017) address negative side effects that arise from misspecified objectives and misspecified rewards by using the specified reward as an observation from which to learn the true reward function.

Conclusion and Future Work

Reduced models are a promising approach to quickly solve large SSPs. However, the existing techniques are oblivious to the risky outcomes in the original problem when formulating a reduced model.
We propose planning using a portfolio of reduced models, which provides the flexibility in outcome selection needed to minimize the negative side effects while still accelerating planning. We also present a heuristic approach to determine the outcome selection principles. Our empirical results demonstrate the promise of this framework; the 0/1 RM based on the proposed heuristic improves the solution quality, apart from minimizing the negative side effects. Our results also contribute to a better understanding of how disparate reduced model techniques help improve the solution quality while minimizing the negative side effects.

This paper focuses on addressing one specific form of negative side effect in planning that occurs when using reduced models. There are a number of interesting future research directions in this area. First, we are looking into ways to automatically identify risks in the system. One approach is to generate samples for learning risks using machine learning techniques such as regression and decision stumps. Another approach is to employ limited user data, in the form of human demonstrations, to learn the features and states that are risky for replanning. Second, we aim to build a classifier to distinguish side effects from negative side effects, which is a critical component for automatically identifying the risks. Finally, the 0/1 RM represents an initial exploration of a broad spectrum of PRMs, which we will continue to explore, along with further analysis of outcome selection methods for PRMs with varying levels of probabilistic complexity.

Acknowledgments

Support for this work was provided in part by the U.S. National Science Foundation grants IIS-1524797 and IIS-1724101.
References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81–138.

Bertsekas, D. P., and Tsitsiklis, J. N. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research 16:580–595.

Eversource. 2017. Eversource energy - time of use rates. https://www.eversource.com/clp/vpp/vpp.aspx.

Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse reward design. In Advances in Neural Information Processing Systems.

Hansen, E. A., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence 129:35–62.

Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4:100–107.

Hoffmann, J. 2001. FF: The fast-forward planning system. AI Magazine 22:57.

Issakkimuthu, M.; Fern, A.; Khardon, R.; Tadepalli, P.; and Xue, S. 2015. Hindsight optimization for probabilistic planning with factored actions. In Proc. of the 25th International Conference on Automated Planning and Scheduling.

Keller, T., and Eyerich, P. 2011. A polynomial all outcome determinization for probabilistic planning. In Proc. of the 21st International Conference on Automated Planning and Scheduling.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proc. of the 17th European Conference on Machine Learning, 282–293. Springer.

Kolobov, A., and Mausam. 2012. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning 6:1–210.

Kolobov, A.; Mausam; and Weld, D. 2012. A theory of goal-oriented MDPs with dead ends. In Conference on Uncertainty in Artificial Intelligence, 438–447.

Korf, R. E. 1990. Real-time heuristic search. Artificial Intelligence 42:189–211.

Kulić, D., and Croft, E. A. 2005. Safe planning for human-robot interaction. Journal of Field Robotics 22:383–396.

Littman, M. L. 1997. Probabilistic propositional planning: Representations and complexity. In Proc. of the 14th AAAI Conference on Artificial Intelligence.

Ma, Y.; Houghton, T.; Cruden, A.; and Infield, D. 2012. Modeling the benefits of vehicle-to-grid technology to a power system. IEEE Transactions on Power Systems 27:1012–1020.

Petrik, M., and Zilberstein, S. 2006. Learning parallel portfolios of algorithms. Annals of Mathematics and Artificial Intelligence 48:85–106.

Pineda, L., and Zilberstein, S. 2014. Planning under uncertainty using reduced models: Revisiting determinization. In Proc. of the 24th International Conference on Automated Planning and Scheduling.

Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2017. Optimizing electric vehicle charging through determinization. In ICAPS Workshop on Scheduling and Planning Applications.

Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2018. Planning using a portfolio of reduced models. In Proc. of the 17th International Conference on Autonomous Agents and MultiAgent Systems.

Teichteil-Königsbuch, F.; Kuter, U.; and Infantes, G. 2010. Incremental plan aggregation for generating policies in MDPs. In Proc. of the 9th International Conference on Autonomous Agents and Multiagent Systems.

Trevizan, F.; Teichteil-Königsbuch, F.; and Thiébaux, S. 2017. Efficient solutions for stochastic shortest path problems with dead ends. In Proc. of the 33rd Conference on Uncertainty in Artificial Intelligence.

Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A baseline for probabilistic planning. In Proc. of the 17th International Conference on Automated Planning and Scheduling.

Yoon, S.; Ruml, W.; Benton, J.; and Do, M. B. 2010. Improving determinization in hindsight for on-line probabilistic planning. In Proc. of the 20th International Conference on Automated Planning and Scheduling.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-regret querying on side effects for safe optimality in factored Markov decision processes. In IJCAI, 4867–4873.

Zilberstein, S. 2015. Building strong semi-autonomous systems. In Proc. of the 29th AAAI Conference on Artificial Intelligence.