Interpretable Local Tree Surrogate Policies

John Mern,1 Sidhart Krishnan,2 Anil Yildiz,1 Kyle Hatch,2 Mykel J. Kochenderfer1
1 Department of Aeronautics and Astronautics, Stanford University
2 Department of Computer Science, Stanford University

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

High-dimensional policies, such as those represented by neural networks, cannot be reasonably interpreted by humans. This lack of interpretability reduces the trust users have in policy behavior, limiting their use to low-impact tasks such as video games. Unfortunately, many methods rely on neural network representations for effective learning. In this work, we propose a method to build interpretable policy trees as surrogates for policies such as neural networks. The policy trees are easily human interpretable and provide quantitative predictions of future behavior. We demonstrate the performance of this approach on several simulated tasks.

Figure 1: Surrogate Policy Tree. The figure shows a surrogate policy tree for the vaccine planning task, a finite-horizon MDP with 8 discrete actions. Each node represents an action and the states that led to it. Node borders and edge widths are proportional to the probability of encountering that node or edge during policy execution. The most likely trajectory from root to leaf is shown in blue.

Introduction

Deep reinforcement learning has achieved state-of-the-art performance in several challenging task domains (Mnih et al. 2015). Much of that performance comes from the use of highly expressive neural networks to represent policies. Humans are generally unable to meaningfully interpret neural network parameters, which has led to the common view of neural networks as "black-box" functions. Poor interpretability often leads to a lack of trust in neural networks and the use of other more transparent, though potentially less high-performing, policy models. This is often referred to as the performance-transparency trade-off.

Interpretability is important for high-consequence tasks. Domains in which neural networks have already been applied, such as image classification, often do not require interpretable decisions because the consequences of mislabeling images are typically low. In many potential applications of deep reinforcement learning, such as autonomous vehicle control, erroneous actions may be costly and dangerous. In these cases, greater trust in policy decisions is typically desired before systems are deployed.

Our work is motivated by tasks in which a human interacts directly with the policy, either by approving agent actions before they are taken or by enacting recommendations directly. This is often referred to as having a human "in the loop". An example is an automated cyber security incident response system that provides recommendations to a human analyst. In these cases, knowing the extended course of actions before committing to a recommendation can enhance the trust of the human operator.

It is difficult for humans to holistically interpret models with even a small number of interacting terms (Lipton 2018). Neural networks commonly have several thousand parameters and non-linear interactions, making holistic interpretation infeasible. Some existing methods constrain neural network architectures and train them to learn human-interpretable features during task learning. The training and architecture constraints, however, can degrade performance compared to an unconstrained policy. A common approach is to learn transparent surrogates from the original models (Adadi and Berrada 2018). A major challenge in this approach is balancing the fidelity of the surrogate to the original with the interpretability of the surrogate. Surrogate models that are generated stochastically can additionally struggle to provide consistent representations for the same policy.
In this work, we propose a method to develop transparent surrogate models as local policy trees. The resulting trees encode an intuitive plan of future actions with performance comparable to the original policy. The proposed approach allows users to specify tree size constraints and fidelity targets. The method is model-agnostic, meaning that it does not require the original policy to take any specific form. Though this work was motivated by neural networks, the proposed approach may be used with any baseline policy form.

During execution of a tree policy, the actions taken by an agent are guaranteed to be along one of the unique paths from the root. Experiments on a simple grid world task highlight the impact of algorithm parameters on tree behavior. The experiments also show that using the trees as a receding-horizon policy maintains good performance relative to the baseline model. Additional experiments on more complex infrastructure planning and cyber-physical security domains demonstrate the potential utility of this approach in real-world tasks.

Background

Despite its importance in machine learning, there is no precise qualitative or mathematical definition of interpretability. In the literature, interpretability is commonly used to describe a model with one of two characteristics. The first is explainability, defined by Miller (2019) as the degree to which a human can understand the cause of a decision. The second characteristic is predictability, which is defined by Kim, Koyejo, and Khanna (2016) as the degree to which a human can consistently predict a model's result. In this work, we refer to models that are either explainable or predictable as interpretable.

Techniques for model interpretability can vary greatly. We characterize the methods presented in this work using the taxonomy proposed by Adadi and Berrada (2018) as model-agnostic, local surrogate methods. Model-agnostic methods are those that may be applied to any model with the appropriate mapping from input to output. Surrogate modeling techniques distill the behavior of a complex baseline model into a more transparent surrogate. Local methods provide interpretability that is valid only in the neighborhood of a target input point. As a result, they tend to maintain higher fidelity to the original model in the local region than models that attempt to provide global interpretations.

Markov Decision Processes

The control tasks in this work are assumed to satisfy the Markov assumption and may be modeled as either Markov decision processes (MDPs) or partially observable Markov decision processes (POMDPs). MDPs are defined by tuples (S, A, T, r, γ), where S and A are the state and action spaces, respectively, T(s′ | s, a) is the transition model, r(s, a, s′) is the reward function, and γ is a time discount factor. POMDPs are modeled by the same tuple with the addition of the observation space O and observation model Z(o | s, a). A policy is a function that maps a state to an action in MDPs or a history of observations to an action in POMDPs. To solve an MDP is to learn the policy π : S → A that maximizes the expected sum of discounted rewards

    V(s_t) = \mathbb{E}\left[ \sum_{\tau=t} \gamma^{\tau-t} r\big(s_\tau, \pi(s_\tau)\big) \right]    (1)

over all states. Learning an optimal policy is equivalent to learning the optimal action value function

    Q(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \sum_{\tau=t+1} \gamma^{\tau-t} r\big(s_\tau, \pi^*(s_\tau)\big) \right]    (2)

where π* is the optimal policy. When learning an action value function estimator, the effective policy is

    \pi(s) \leftarrow \arg\max_{a \in A} \hat{Q}(s, a)    (3)

where Q̂(s, a) is the learned approximator. These problems can be solved using a variety of methods such as dynamic programming, Monte Carlo planning, and reinforcement learning (Kochenderfer, Wheeler, and Wray 2022). Similarly, various function types can be used to model the policy or value function estimator. Neural networks are commonly used as policies for complex tasks.
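To make eq. (3) concrete, the following minimal Python sketch selects the greedy action from a learned action-value estimator. The callable q_hat(state, action) and the actions list are illustrative placeholders, not part of the paper's implementation.

```python
import numpy as np

def greedy_policy(q_hat, state, actions):
    """Effective policy of eq. (3): pick the action with the highest
    estimated value under the learned approximator q_hat."""
    values = np.array([q_hat(state, a) for a in actions])
    return actions[int(np.argmax(values))]
```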
We sion in this section is restricted to methods relevant to deep characterize the methods presented in this work using the reinforcement learning. Methods requiring domain specific taxonomy proposed by Adadi and Berrada (2018) as model- representations (Verma et al. 2018) are not considered. agnostic, local surrogate methods. Model-agnostic methods A common technique for model-agnostic surrogate mod- are those that may be applied to any model with the ap- eling is to learn a surrogate model of an inherently inter- propriate mapping from input to output. Surrogate modeling pretable class. An useful example is locally interpretable techniques distill the behavior of a complex baseline model model-agnostic explanations (LIME) (Ribeiro, Singh, and into a more transparent surrogate. Local methods provide Guestrin 2016). LIME learns sparse linear representations interpretability that is valid only in the neighborhood of a of a target policy at a specific input by training on a data target input point. As a result, they tend to maintain higher set of points near the target point. While the resulting linear fidelity to the original model in the acceptable region than functions are interpretable, they are often not consistent and models that attempt to provide global interpretations. small variations in the training data can result in drastically different linear functions. Several methods have been proposed to distill neural net- Markov Decision Processes work policies to decision tree surrogates. Linear model U- trees (Liu et al. 2018) and soft decision trees (Coppens et al. The control tasks in this work are assumed to satisfy the 2019) take similar approaches to tree representation and Markov assumption and may be modeled as either Markov learning. Decision trees are global policy representations decision processes (MDPs) or partially observable Markov that map input observations to output actions by traversing a decision processes (POMDPs). MDPs are defined by tu- binary tree. Actions can be partially understood by inspect- ples (S, A, T, r, γ), where S and A are the state and ac- ing the values at each internal node. Unfortunately, the un- tion spaces, respectively, T (s0 | s, a) is the transition model, derstanding provided by decision trees can be limited be- r(s, a, s0 ) is the reward function, and γ is a time discount cause the mappings still pass the input observation through factor. POMDPs are modeled by the same tuple with the several layers of affine transforms and non-linear functions. addition of the observation space O and observation model Structural causal models (Madumal et al. 2020) also try Z(o | s, a). A policy is a function that maps a state to an to learn an inherently interpretable surrogate as a directed action in MDPs or a history of observations to an action in acyclic graph (DAG). The learned DAG represents relations POMDPs. To solve a MDP is to learn the policy π : S → A between objects in the task environment that can be easily that maximizes the expected sum of discounted rewards understood by humans. The DAG structure, however, must " # be provided for each task, and the fidelity is not assured. Our work proposes tree representations that provide in- X τ −t  V (st ) = E γ r sτ , π(sτ ) (1) τ =t tuitive maps over future courses of action. When used as a policy, the realized course of action is guaranteed to be along of the probability that the action sequence up to that node the branches of the tree. In this way, the method is similar to will be taken P (a0 , . . . 
Proposed Method

We present model-agnostic methods to represent policies for both MDPs and POMDPs as interpretable tree policies. Trees are inherently more interpretable than high-dimensional models like neural networks. The proposed method uses the baseline policy to generate simulations of trajectories reachable from a given initial state. The trajectories are then clustered to develop a width-constrained tree that represents the original policy with low expected return loss. The methods assume that baseline policies return distributions over actions or estimates of action values for a given state. Building trees also requires a generative model of the task environment.

In addition to representing the future policy, the tree provides useful statistics on expected performance, such as the likelihood of following each represented trajectory. The tree may be used as a policy during task execution with guarantees on expected behavior. We present methods to generalize to states not seen during tree construction by using the tree as a constraint on the baseline policy.

Local Tree Policy

Before describing the tree construction algorithm, we first present the local tree policies that it produces. A policy tree is a rooted polytree whose nodes represent actions taken during policy execution. An example tree policy for a stochastic, fully observable task is shown in fig. 1. Each node of the tree represents an action a_t taken at time t. The root action of each tree is the action recommended by the policy at the initial state. Each path from the tree root to a leaf node gives a trajectory of actions a_t, a_{t+1}, ..., a_{t+h} that the agent may take during policy execution up to some depth h. Trajectories leading to terminal conditions before h steps result in shallower tree branches.

Policy trees are interpretable representations of the future behavior of a policy that give explicit, quantitative predictions of future trajectories. Each node provides an estimate of the probability that the action sequence up to that node will be taken, P(a_0, ..., a_t | s_0, π), and an estimate of the policy value Q(s, a). Each node also contains a set of example states or observations that would result in that action.

Figure 2 provides more detailed views of the policy tree in fig. 1. The trajectory probabilities and value estimates shown in fig. 2a are calculated during tree construction and may be presented for any surrogate policy. The states in the example problem s_t ∈ R^n are real vectors, and each node in fig. 2b shows the mean of that node's state values. While the mean state value is useful for understanding this problem, it may not be useful in all problems. For example, the mean value may be meaningless for problems with discrete state spaces. Methods to compactly represent a node's set of states or observations cannot be generally defined and instead should be specified for each problem.

Figure 2: Detailed Tree Views. (a) The figure shows the first three layers of the vaccine surrogate policy tree. Each node provides its action, estimated value, and probability of reaching it. (b) The figure shows the first two tree layers with the mean of the observations leading to each action node displayed.

Trees are an intuitive choice for the local surrogate model of a control policy. Local surrogate methods seek to provide simple approximate models that are faithful to the original policy in a space around a target point. For control tasks, often only predictions of future behavior are useful. Trees provide compact surrogate models by only representing behavior in states that are forward-reachable from the current state. This is in contrast to methods that define local neighborhoods by arbitrarily perturbing the initial point (Ribeiro, Singh, and Guestrin 2016). In those cases, explanatory capacity is wasted on unnecessary states.
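To illustrate the information each node carries, the following is one possible Python representation of particles and action nodes. The class and field names are ours, chosen for this sketch rather than taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Particle:
    state: Any      # s_t
    reward: float   # reward received from t-1 to t
    # For POMDPs, the observation o_t and belief b_t could be stored as well.

@dataclass
class ActionNode:
    particles: List[Particle]   # P: particles assigned to this node
    action: Any                 # a: action taken from this node
    children: List["ActionNode"] = field(default_factory=list)  # Ch

    def path_probability(self, n_root_particles: int) -> float:
        """Fraction of simulated trajectories that reach this node; an
        estimate of P(a_0, ..., a_t | s_0, pi)."""
        return len(self.particles) / n_root_particles
```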
Tree Build Algorithm

Trees are built by simulating multiple executions of the baseline policy from the given initial state or belief and clustering them into action nodes. The process is illustrated in fig. 3. Each simulation is represented by a collection of particles p_t, one for each time t along the simulation trajectory. Particles are tuples (s_t, r_t) of the state at time t and the reward received from t−1 to t. For POMDPs, particles also record the observation o_t and belief b_t.

Figure 3: Tree build example. (a) An initial set of particles is sampled from the given initial state or belief. These particles are advanced in the simulation using the action at the root node a_0. Particles reaching terminal states are marked with ×. (b) Non-terminal particles are clustered into new action nodes. (c) Action nodes with a sufficient number of particles are advanced another time step. This process continues for each action node until all particles terminate or reach a maximum depth.

Tree construction begins by generating a set of particles for the initial state or belief. For the initial step, all particles are assigned to the root action node an_0, which takes the action given by the baseline policy, a_0 = π(s_0) for MDPs or a_0 = π(b_0) for POMDPs. The particles are then advanced through one simulation time step to produce the particles for t+1, as shown in fig. 3a. The t+1 particles that did not enter terminal states are clustered into new action nodes, as shown in fig. 3b. This process continues until a terminal state is encountered or a fixed depth limit is met. Action nodes that do not have at least n_min particles are not expanded further to limit over-fitting, where n_min is specified by the user.

The point of entry for tree construction is the BUILD function, presented in algorithm 1 for MDPs. The initial particle set is created and clustered into the root action node. Each root particle for an MDP policy is initialized with the same state s_0. Each action node is a tuple (P, a, Ch), where P is the node's set of particles, a is the action taken from that node, and Ch is the set of child nodes. The BUILD function for POMDPs follows the same procedure as for MDPs, though states and observations for the initial particle set are sampled from the initial belief b_0.

Algorithm 1: Build MDP
1: procedure BUILDMDP(s_0, π, n, d_max)
2:   P ← ∅
3:   Ch ← ∅
4:   d ← 0
5:   for i ∈ 1 : n
6:     P ← P ∪ {(s_0, 0)}
7:   p_0 ← π(s_0)
8:   a_0 ← arg max_{a∈A} π(s_0)
9:   root ← NODE(P, a_0, p_0, Ch)
10:  root ← ROLLOUT(root, π, d)
11:  return root

Particles are advanced through recursive calls to the ROLLOUT function (algorithm 2). Each time ROLLOUT is called on an action node an, it proceeds only if the size of the node particle set an_P meets the minimum threshold n_min and the depth limit d_max has not been exceeded. If these conditions are met, each particle is advanced by calling the simulator GEN(s, a) with that particle's state and the node action. The set of new particles is then grouped into new action nodes by the CLUSTER function. ROLLOUT is recursively called on each new action node, and the returned node is added to the child node set an_Ch.

Algorithm 2: Rollout
1: procedure ROLLOUT(an, π, d)
2:   if |an_P| ≥ n_min and d < d_max
3:     P ← ∅
4:     for p ∈ an_P
5:       s ← p_s
6:       s′, o′, r′, done ← GEN(s, an_a)
7:       if not done
8:         p′ ← PARTICLE(s′, o′, r′)
9:         P ← P ∪ {p′}
10:    C ← CLUSTER(P, π)
11:    for c ∈ C
12:      c′ ← ROLLOUT(c, π, d + 1)
13:      an_Ch ← an_Ch ∪ {c′}
14:  return an
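The recursion in algorithms 1 and 2 can be sketched in Python as follows, reusing the Particle and ActionNode classes above. The sketch simplifies the algorithms: policy is assumed to return the greedy action directly, gen(s, a) is assumed to return a (next_state, reward, done) tuple, and cluster stands in for algorithm 3; none of these names come from the paper's released code.

```python
def build_mdp(s0, policy, gen, cluster, n, d_max, n_min):
    """Sketch of algorithm 1: seed the root node with n copies of s0,
    take the baseline action there, and expand recursively."""
    root = ActionNode(particles=[Particle(s0, 0.0) for _ in range(n)],
                      action=policy(s0))
    return rollout(root, policy, gen, cluster, d=0, d_max=d_max, n_min=n_min)

def rollout(node, policy, gen, cluster, d, d_max, n_min):
    """Sketch of algorithm 2: advance the node's particles one step,
    cluster the survivors into child nodes, and recurse on each child."""
    if len(node.particles) < n_min or d >= d_max:
        return node
    advanced = []
    for p in node.particles:
        next_state, reward, done = gen(p.state, node.action)  # simulator step
        if not done:
            advanced.append(Particle(next_state, reward))
    for child in cluster(advanced, policy):                   # new action nodes
        node.children.append(
            rollout(child, policy, gen, cluster, d + 1, d_max, n_min))
    return node
```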
The action nodes of the tree T encode a deterministic policy for the sampled particles, T : S_R → A, where S_R is the set of states contained in the particle set. The particles are clustered into action nodes to minimize how much this policy deviates from the baseline policy π while meeting constraints on tree size. To achieve this, we developed a clustering algorithm that approximately solves for the optimal clustering through recursive greedy optimization.

The recursive clustering approach is presented in algorithm 3. The algorithm progressively clusters particles into an increasing number of nodes k until a distance measure δ between the actions assigned by the tree and the baseline policy is less than a threshold δ* or the maximum number of nodes c_max is exceeded.

Algorithm 3: Recursive Cluster
1: procedure CLUSTER(P, π)
2:   k ← 1
3:   repeat
4:     C, δ ← GREEDYCLUSTER(P, k, π)
5:     k ← k + 1
6:   until δ ≤ δ* or |C| ≥ c_max
7:   return C

Clusters are assigned according to the greedy heuristic process shown in algorithm 4. In this approach, the complete set of actions assigned to the particles under the baseline policy, A_U, is ranked according to frequency by the UNIQUEACTIONS function. An action node is formed for each of the top k actions, containing all particles assigned that action by the baseline policy. Any particles not clustered to a node at the end of this process are assigned to the previously formed node that minimizes the distance between that node's action and the action assigned by the baseline policy.

Algorithm 4: Greedy Cluster
1: procedure GREEDYCLUSTER(P, k, π)
2:   C ← ∅
3:   δ ← 0
4:   A_U ← UNIQUEACTIONS(P, π)
5:   for i ∈ 1 : k
6:     a ← A_U[i]
7:     P_A ← {p ∈ P | arg max_{a′∈A} π(p_s) = a}
8:     C ← {NODE(P_A, a, ∅)} ∪ C
9:     P ← P \ P_A
10:  for p ∈ P
11:    an ← arg min_{an′∈C} DISTANCE(p, an′, π)
12:    δ ← δ + DISTANCE(p, an, π)
13:    an_P ← {p} ∪ an_P
14:  return C, δ

The distance metric may be any appropriate measure assigned by the user for the given policy type. For action-value function policies, an effective distance measure for an action a is

    \delta = \left\| \hat{Q}(s, a) - \max_{a' \in A} \hat{Q}(s, a') \right\|    (4)

which intuitively gives the predicted sub-optimality of the action. For stochastic policies that output distributions over actions, the difference in action probability would be an appropriate distance.
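One way eq. (4) and the clustering of algorithms 3 and 4 could look in Python is sketched below, again reusing the ActionNode class from the earlier sketch. The distance(particle, node) signature and the policy_action callable are our simplifications, not the authors' interfaces.

```python
from collections import Counter

def q_distance(q_hat, state, action, actions):
    """Distance of eq. (4): predicted sub-optimality of `action` at `state`."""
    best = max(q_hat(state, a) for a in actions)
    return abs(best - q_hat(state, action))

def greedy_cluster(particles, policy_action, distance, k):
    """Sketch of algorithm 4: make a node for each of the k most frequent
    baseline actions, then attach leftover particles to the nearest node."""
    baseline = {id(p): policy_action(p.state) for p in particles}
    ranked = [a for a, _ in Counter(baseline.values()).most_common(k)]
    nodes = [ActionNode([p for p in particles if baseline[id(p)] == a], a)
             for a in ranked]
    delta = 0.0
    for p in (p for p in particles if baseline[id(p)] not in ranked):
        nearest = min(nodes, key=lambda node: distance(p, node))
        delta += distance(p, nearest)
        nearest.particles.append(p)
    return nodes, delta

def cluster(particles, policy_action, distance, delta_star, c_max):
    """Sketch of algorithm 3: grow k until the fidelity threshold delta_star
    is met or the node budget c_max is reached."""
    k = 1
    while True:
        nodes, delta = greedy_cluster(particles, policy_action, distance, k)
        if delta <= delta_star or len(nodes) >= c_max:
            return nodes
        k += 1
```

A routine of this shape could be bound to fixed distance, delta_star, and c_max arguments (for example with functools.partial) to match the two-argument cluster callable used in the rollout sketch above.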
Tree Policy Control

To use the tree as a policy during task execution, it must be able to generalize to states not encountered in the particle set of the tree. To do this, we propose a simple method that uses the baseline policy, constrained by the tree at each time step. For a tree constructed for a state s_0, the policy always takes the action at the root node an_0. For all remaining steps, the policy returns

    a \leftarrow \arg\max_{a' \in A_T} \pi(s_t)    (5)

where A_T is the set of actions in the child set of the preceding action node an_{t−1}. Intuitively, the agent takes the best action predicted by the baseline policy that is included in the tree. Building a new tree each time a leaf state is encountered allows the tree policies to be run in-the-loop for long or infinite-horizon problems.
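A minimal sketch of this tree-constrained control loop is shown below. The env_step(state, action) simulator, the build_tree(state) constructor, and the q_hat action-value callable are hypothetical stand-ins used only for illustration.

```python
def constrained_action(prev_node, q_hat, state):
    """Eq. (5): among the actions kept in the tree (children of the preceding
    action node), take the one the baseline policy scores highest."""
    children = {child.action: child for child in prev_node.children}
    action = max(children, key=lambda a: q_hat(state, a))
    return action, children[action]

def run_tree_policy(env_step, build_tree, q_hat, state, max_steps):
    """Receding-horizon use: rebuild a tree whenever a leaf node is reached."""
    node = build_tree(state)
    action = node.action                    # the root action is always taken
    for _ in range(max_steps):
        state, done = env_step(state, action)
        if done:
            break
        if not node.children:               # leaf reached: build a new tree
            node = build_tree(state)
            action = node.action
        else:
            action, node = constrained_action(node, q_hat, state)
    return state
```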
Experiments

We ran experiments to test the performance of the proposed approach. One set of experiments was conducted on a simple grid world task. These experiments were designed to quantitatively measure the effect of various algorithm parameters and environment features on tree size and performance. Two additional experiments were run on more complex tasks that demonstrate the utility of the approach on real-world motivating examples. The first of these is multi-city vaccine deployment planning. In large-scale infrastructure planning tasks such as this, it is often necessary to have a reasonable prediction of all steps of the plan in order to gain stakeholder trust. The second task is a cyber security agent that provides recommendations to a human analyst to secure a network against attack. Interpretable policies are important for tasks with human oversight and cooperation.

We implemented the proposed algorithm with the distance function defined in eq. (4) in Python. Neural network training was done in PyTorch (Paszke et al. 2019). Source code, full experiment descriptions, and results are available in the Appendix.

Grid World

In the grid world task, an agent must navigate a discrete, 2D world to reach a goal state while avoiding trap states. The agent may take one of n total actions to move between 1 and ⌊n/4⌋ units in any of the four cardinal directions on the grid. The agent moves in the intended direction with probability p and takes a random action otherwise. To incentivize solving the problem quickly, the agent receives a cost of −1 at each time step. The agent gets a positive reward of +10 for reaching a goal state and a penalty of −5 for reaching a trap state. The episode terminates when the agent reaches a goal state or when a maximum number of steps is reached. Because we know the exact transition probabilities, we used discrete value iteration to learn a baseline policy (Sutton and Barto 2018).

We constructed surrogate trees from the baseline policy using various environment and tree-build algorithm settings. For the environment, we swept over different values of the transition probability p and the number of actions n. For the tree build algorithm, we varied the total number of particles, the minimum particle count, the distance threshold δ*, and the maximum leaf node depth. Only one parameter was varied for each test, with all others held fixed. The baseline settings for the environment were n = 4 actions and a successful transition probability of p = 0.9. The baseline tree build parameters were 1000 total particles, 250 minimum particles, a distance threshold of 0.01, and a maximum depth of 10.

To test each tree, we ran it as a policy from the initial state until it reached a leaf node. The baseline policy was then used to complete the episode. We tested 2,500 trees for each parameter configuration. Select results are shown in table 1. The mean change in performance of the tree policy relative to the baseline is shown along with one standard error bounds. The average depth of leaf nodes is also shown; its standard errors are omitted because each was below 0.05.

Table 1: Parameter Search Results. The change in task performance relative to the baseline policy performance and the average depth of leaf nodes in the trees are given. Values generated with the baseline settings are marked with *.

Parameter      Value    Rel Change (%)   Leaf Depth
Trans. Prob.   0.5       4.5 ± 4.7       4.4
               0.7      −8.0 ± 4.0       4.4
               0.9*     −4.4 ± 1.0       4.2
               1.0       0.0 ± 0.0       4.0
No. Actions    4*       −4.4 ± 1.0       4.2
               8        −2.6 ± 0.4       2.6
               16       −1.0 ± 0.3       1.5
δ*             0.005    −5.7 ± 1.0       4.2
               0.01*    −4.4 ± 1.0       4.2
               0.1      −3.1 ± 1.0       3.1
Max Depth      3        −1.3 ± 0.9       2.0
               5        −2.9 ± 1.0       3.5
               10*      −4.4 ± 1.0       4.2

From the transition probability sweep, we see that relative performance generally improves as the transition probability increases. At p = 0.5, both policies perform very poorly, and the difference between the two is not significant as a result. In the deterministic case, the surrogate tree perfectly represents the baseline policy. These results suggest that surrogate trees are better suited for tasks with low stochasticity.

As we increase the number of actions, the difference in performance between the baseline and the tree policy decreases. The average leaf depth of the trees also decreases with more actions. This likely explains the improved performance, as shallower trees transition back to the baseline policy sooner in the tests. Another possible explanation is that as the number of actions grows, the cost of taking a sub-optimal action decreases. For example, with 4 actions, if the optimal action is not taken, the agent moves in a completely different direction than it should. With n > 4 actions, the agent may still move in the correct direction, though by more or less distance than optimal.

For the tree building algorithm parameters, we see that as δ* increases, the depth of the leaf nodes decreases. This is likely because higher values of δ* lead to more aggressive node clustering and fewer branches. Similar to the trend observed when varying the number of actions, as the average leaf depth decreases, relative performance increases. Likewise, as the max depth increases, the tree is allowed to grow deeper and the relative performance drops.
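For reference, the discrete value iteration used to obtain the grid world baseline can be written as a short tabular routine. The sketch below assumes explicit transition and reward arrays and is a generic illustration, not the experiment code.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration. T[s, a, s'] are transition probabilities and
    R[s, a] are expected rewards; returns the greedy (baseline) policy."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (T @ V)        # Q[s, a] = R[s, a] + gamma * sum_s' T * V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)    # deterministic baseline policy
        V = V_new
```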
Vaccine Planning

In the vaccine deployment task, an agent decides the order in which to start vaccine distributions in cities in the midst of a pandemic outbreak. At each step, the agent picks a city in which to start a vaccine program. The spread of the disease is modeled as a stochastic SIRD model (Bailey 1975), with portions of each city population being susceptible to infection (S), infected (I), recovered (R), or dead (D). The n cities are modeled as a fully connected, weighted graph, where weight w_ij encodes the amount of traffic between cities i and j. The state of city i changes from time t to t+1 as

    dS_t^{(i)} = -\beta \tilde{I}_t^{(i)} S_t^{(i)} - \alpha_t^{(i)} S_t^{(i)} + \epsilon_t^{S,(i)}    (6)
    dI_t^{(i)} = \beta \tilde{I}_t^{(i)} S_t^{(i)} + \epsilon_t^{S,(i)} - \gamma I_t^{(i)} - \mu I_t^{(i)}    (7)
    dR_t^{(i)} = \gamma I_t^{(i)} + \alpha_t^{(i)} S_t^{(i)}    (8)
    dD_t^{(i)} = \mu I_t^{(i)}    (9)

where β, γ, and µ are the mean infection, recovery, and death rates, respectively. The vaccination rate α_t^{(i)} of city i at time t is equal to zero until an action is taken to deploy a vaccination program to that city. The noise ε is sampled from a zero-mean Gaussian. The effective infection exposure at city i is defined as Ĩ^{(i)} = Σ_j w_ij I^{(j)}, where the sum is taken over all cities and w_ii = 1. Cities with closer index values have higher weights, for example w_12 > w_13. Each episode is initialized with up to 10% of each city infected and the remainder susceptible. One city is initialized with 25% infected. The episode concludes after five cities have had vaccine programs started. The simulation is then run until all cities have zero susceptible or infected population. The reward at each time step is r_t = a Σ_i dI_t^{(i)} + b Σ_i dD_t^{(i)}, where a and b are parameters.
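A direct transcription of the SIRD update in eqs. (6)-(9) might look as follows. The vectorized form, argument names, and the absence of clipping to valid population fractions are our simplifications, not details from the paper.

```python
import numpy as np

def sird_step(S, I, R, D, W, alpha, beta, gamma, mu, sigma, rng):
    """One stochastic SIRD update per eqs. (6)-(9). S, I, R, D, and alpha are
    per-city vectors, W is the traffic matrix (w_ii = 1), sigma the noise scale."""
    I_eff = W @ I                               # effective exposure I~ of each city
    eps = rng.normal(0.0, sigma, size=S.shape)  # zero-mean Gaussian noise
    dS = -beta * I_eff * S - alpha * S + eps
    dI = beta * I_eff * S + eps - gamma * I - mu * I
    dR = gamma * I + alpha * S
    dD = mu * I
    return S + dS, I + dI, R + dR, D + dD
```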
We trained a neural network policy using double DQN (van Hasselt 2010) with n-step returns. The trained policy achieved an average score, with one standard error bounds, of −149.8 ± 0.6 over 100 trial episodes. We built trees for 100 random initial states with 2000 particles and a minimum particle threshold of 250 and tested their performance as forward policies. The trees achieved an average score of −152.2 ± 0.8 over the 100 trials, for an average performance drop of 1.6%. To compare against a baseline surrogate modeling approach, we also trained and tested a LIME model (Ribeiro, Singh, and Guestrin 2016) with 2000 samples. The LIME average score was −184.2 ± 4.5, for an average performance loss of 23.0%.

The surrogate tree in fig. 2 provides an intuitive understanding of the learned policy. In fig. 2b we can see that the policy does not deploy vaccines to the most heavily infected cities first. It instead prioritizes cities with larger susceptible populations to give the vaccine time to take effect on a larger portion of the population.

Cyber Security

The cyber security task requires an agent to prevent unauthorized access to a secure data server on a computer network. The network is comprised of four local area networks (LANs), each of which has a local application server and ten workstations, and a single secure data server. The compromise state of the network is not known directly and may only be observed through noisy alerts generated from malware scans. Workstations are networked to all others on their LAN and to the LAN application server. Servers are randomly connected in a complete graph. An attacker begins with a single workstation compromised and takes actions to compromise additional workstations and servers to reach the data server.

The defender can scan all nodes on a LAN to locate compromised nodes with probability p_detect. Compromised nodes will also generate alerts without being scanned with low probability. The defender can also scan and clean individual nodes to detect and remove compromise. The reward is zero unless the data server is compromised, in which case a large penalty is incurred. The defender was trained using Rainbow DQN (Hessel et al. 2018).

Figure 4: Cyber Security Policy Tree. The figure shows the first three layers of a surrogate tree for the cyber security task. The most frequent observation from the "Clean Host" action node set is shown.

Automated systems such as this are often implemented with a human in the loop. Policies that can be more easily interpreted are more likely to be trusted by a human operator. A surrogate tree for the neural network policy is shown in fig. 4. Unlike the baseline neural network, the policy encoded by this tree can be easily interpreted. The agent will continually scan LAN 1 in most cases, and will only clean a workstation after malware has been detected.

Conclusions

In this work, we presented methods to construct local surrogate policy trees from arbitrary control policies. The trees are more interpretable than high-dimensional policies such as neural networks and provide quantitative estimates of future behavior. Our experiments show that, despite truncating the set of actions that may be taken at each future time step, the trees retain performance comparable to their baseline policies. Experiments demonstrate the effect of various environment and algorithm parameters on tree size and performance in a simple grid world. Demonstrations show how surrogate trees may be used in more complex, real-world scenarios.

The action node clustering presented in this work used a heuristic search method that provided good results, but without any optimality guarantees. Future work will look at improved approaches to clustering, for example by using a mixed integer program optimization. We will also explore using the scenarios simulated to construct the tree to back up more accurate value estimates and refine the resulting policy. Including empirical backups such as these may also allow calculation of confidence intervals or bounds on policy performance (Mern and Kochenderfer 2021).

References

Adadi, A.; and Berrada, M. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6: 52138–52160.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Journal of Machine Learning Research, 47(2-3): 235–256.

Bailey, N. T. 1975. The Mathematical Theory of Infectious Diseases and its Applications. 2nd edition. Charles Griffin & Company Ltd.
Coppens, Y.; Efthymiadis, K.; Lenaerts, T.; Nowé, A.; Miller, T.; Weber, R.; and Magazzeni, D. 2019. Distilling Deep Reinforcement Learning Policies in Soft Decision Trees. In International Joint Conference on Artificial Intelligence (IJCAI), 1–6.

Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2018. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI Conference on Artificial Intelligence (AAAI), 3215–3222.

Kim, B.; Koyejo, O.; and Khanna, R. 2016. Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2280–2288.

Kochenderfer, M. J.; Wheeler, T. A.; and Wray, K. H. 2022. Algorithms for Decision Making. MIT Press.

Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning (ECML).

Lipton, Z. C. 2018. The Mythos of Model Interpretability. Communications of the ACM, 61(10): 36–43.

Liu, C.; Arnon, T.; Lazarus, C.; Strong, C. A.; Barrett, C. W.; and Kochenderfer, M. J. 2021. Algorithms for Verifying Deep Neural Networks. Foundations and Trends in Optimization, 4(3-4): 244–404.

Liu, G.; Schulte, O.; Zhu, W.; and Li, Q. 2018. Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees. In European Conference on Machine Learning (ECML), volume 11052, 414–429.

Madumal, P.; Miller, T.; Sonenberg, L.; and Vetere, F. 2020. Explainable Reinforcement Learning through a Causal Lens. In AAAI Conference on Artificial Intelligence (AAAI), 2493–2500.

Mern, J.; and Kochenderfer, M. J. 2021. Measurable Monte Carlo Search Error Bounds. Computing Research Repository, abs/2106.04715.

Miller, T. 2019. Explanation in Artificial Intelligence: Insights from the Social Sciences. Artificial Intelligence, 267: 1–38.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540): 529–533.

Molnar, C. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS), 8024–8035.

Puiutta, E.; and Veith, E. M. S. P. 2020. Explainable Reinforcement Learning: A Survey. In Machine Learning and Knowledge Extraction International Cross-Domain Conference (CD-MAKE), volume 12279, 77–95.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

Sidrane, C.; Maleki, A.; Irfan, A.; and Kochenderfer, M. J. 2021. OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems. Computing Research Repository, abs/2108.01220.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. 2nd edition. The MIT Press.

van Hasselt, H. 2010. Double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2613–2621.

Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically Interpretable Reinforcement Learning. In International Conference on Machine Learning (ICML), volume 80, 5052–5061.