Interpretable Local Tree Surrogate Policies

John Mern,1 Sidhart Krishnan,2 Anil Yildiz,1 Kyle Hatch,2 Mykel J. Kochenderfer1
1 Department of Aeronautics and Astronautics, Stanford University
2 Department of Computer Science, Stanford University

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

High-dimensional policies, such as those represented by neural networks, cannot be reasonably interpreted by humans. This lack of interpretability reduces the trust users have in policy behavior, limiting their use to low-impact tasks such as video games. Unfortunately, many methods rely on neural network representations for effective learning. In this work, we propose a method to build interpretable policy trees as surrogates for policies such as neural networks. The policy trees are easily human interpretable and provide quantitative predictions of future behavior. We demonstrate the performance of this approach on several simulated tasks.

Figure 1: Surrogate Policy Tree. The figure shows a surrogate policy tree for the vaccine planning task, a finite-horizon MDP with 8 discrete actions. Each node represents an action and the states that led to it. Node borders and edge widths are proportional to the probability of encountering that node or edge during policy execution. The most likely trajectory from root to leaf is shown in blue.

Introduction

Deep reinforcement learning has achieved state-of-the-art performance in several challenging task domains (Mnih et al. 2015). Much of that performance comes from the use of highly expressive neural networks to represent policies. Humans are generally unable to meaningfully interpret neural network parameters, which has led to the common view of neural networks as "black-box" functions. Poor interpretability often leads to a lack of trust in neural networks and the use of other more transparent, though potentially less high-performing, policy models. This is often referred to as the performance-transparency trade-off.

Interpretability is important for high-consequence tasks. Domains in which neural networks have already been applied, such as image classification, often do not require interpretable decisions because the consequences of mislabeling images are typically low. In many potential applications of deep reinforcement learning, such as autonomous vehicle control, erroneous actions may be costly and dangerous. In these cases, greater trust in policy decisions is typically desired before systems are deployed.

Our work is motivated by tasks in which a human interacts directly with the policy, either by approving agent actions before they are taken or by enacting recommendations directly. This is often referred to as having a human "in the loop". An example is an automated cyber security incident response system that provides recommendations to a human analyst. In these cases, knowing the extended course of actions before committing to a recommendation can enhance the trust of the human operator.

It is difficult for humans to holistically interpret models with even a small number of interacting terms (Lipton 2018). Neural networks commonly have several thousand parameters and non-linear interactions, making holistic interpretation infeasible. Some existing methods constrain neural network architectures and train them to learn human-interpretable features during task learning. The training and architecture constraints, however, can degrade performance compared to an unconstrained policy. A common approach is to learn transparent surrogates from the original models (Adadi and Berrada 2018). A major challenge in this approach is balancing the fidelity of the surrogate to the original with the interpretability of the surrogate. Surrogate models that are generated stochastically can additionally struggle to provide consistent representations for the same policy.
In this work, we propose a method to develop transparent surrogate models as local policy trees. The resulting trees encode an intuitive plan of future actions with performance comparable to the original policy. The proposed approach allows users to specify tree size constraints and fidelity targets. The method is model-agnostic, meaning that it does not require the original policy to take any specific form. Though this work was motivated by neural networks, the proposed approach may be used with any baseline policy form.

During execution of a tree policy, the actions taken by an agent are guaranteed to be along one of the unique paths from the root. Experiments on a simple grid world task highlight the impact of algorithm parameters on tree behavior. The experiments also show that using the trees as a receding-horizon policy maintains good performance relative to the baseline model. Additional experiments on more complex infrastructure planning and cyber-physical security domains demonstrate the potential utility of this approach in real-world tasks.

Background

Despite its importance in machine learning, there is no precise qualitative or mathematical definition of interpretability. In the literature, interpretability is commonly used to describe a model with one of two characteristics. The first is explainability, defined by Miller (2019) as the degree to which a human can understand the cause of a decision. The second characteristic is predictability, which is defined by Kim, Koyejo, and Khanna (2016) as the degree to which a human can consistently predict a model's result. In this work, we refer to models that are either explainable or predictable as interpretable.

Techniques for model interpretability can vary greatly. We characterize the methods presented in this work using the taxonomy proposed by Adadi and Berrada (2018) as model-agnostic, local surrogate methods. Model-agnostic methods are those that may be applied to any model with the appropriate mapping from input to output. Surrogate modeling techniques distill the behavior of a complex baseline model into a more transparent surrogate. Local methods provide interpretability that is valid only in the neighborhood of a target input point. As a result, they tend to maintain higher fidelity to the original model in the local region than models that attempt to provide global interpretations.

Markov Decision Processes

The control tasks in this work are assumed to satisfy the Markov assumption and may be modeled as either Markov decision processes (MDPs) or partially observable Markov decision processes (POMDPs). MDPs are defined by tuples (S, A, T, r, γ), where S and A are the state and action spaces, respectively, T(s′ | s, a) is the transition model, r(s, a, s′) is the reward function, and γ is a time discount factor. POMDPs are modeled by the same tuple with the addition of the observation space O and observation model Z(o | s, a). A policy is a function that maps a state to an action in MDPs or a history of observations to an action in POMDPs. To solve an MDP is to learn the policy π : S → A that maximizes the expected sum of discounted rewards

    V(s_t) = \mathbb{E}\left[ \sum_{\tau=t} \gamma^{\tau-t} r\big(s_\tau, \pi(s_\tau)\big) \right]    (1)

over all states. Learning an optimal policy is equivalent to learning the optimal action value function

    Q(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \sum_{\tau=t+1} \gamma^{\tau-t} r\big(s_\tau, \pi^*(s_\tau)\big) \right]    (2)

where π* is the optimal policy. When learning an action value function estimator, the effective policy is

    \pi(s) \leftarrow \arg\max_{a \in A} \hat{Q}(s, a)    (3)

where Q̂(s, a) is the learned approximator. These problems can be solved using a variety of methods such as dynamic programming, Monte Carlo planning, and reinforcement learning (Kochenderfer, Wheeler, and Wray 2022). Similarly, various function types can be used to model the policy or value function estimator. Neural networks are commonly used as policies for complex tasks.
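To make eq. (3) concrete, the following minimal Python sketch selects the greedy action from a learned action-value estimator. The callable q_hat(state, action) and the actions list are illustrative placeholders, not part of the paper's implementation.

```python
import numpy as np

def greedy_policy(q_hat, state, actions):
    """Effective policy of eq. (3): pick the action with the highest
    estimated value under the learned approximator q_hat."""
    values = np.array([q_hat(state, a) for a in actions])
    return actions[int(np.argmax(values))]
```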
We sion in this section is restricted to methods relevant to deep characterize the methods presented in this work using the reinforcement learning. Methods requiring domain specific taxonomy proposed by Adadi and Berrada (2018) as model- representations (Verma et al. 2018) are not considered. agnostic, local surrogate methods. Model-agnostic methods A common technique for model-agnostic surrogate mod- are those that may be applied to any model with the ap- eling is to learn a surrogate model of an inherently inter- propriate mapping from input to output. Surrogate modeling pretable class. An useful example is locally interpretable techniques distill the behavior of a complex baseline model model-agnostic explanations (LIME) (Ribeiro, Singh, and into a more transparent surrogate. Local methods provide Guestrin 2016). LIME learns sparse linear representations interpretability that is valid only in the neighborhood of a of a target policy at a specific input by training on a data target input point. As a result, they tend to maintain higher set of points near the target point. While the resulting linear fidelity to the original model in the acceptable region than functions are interpretable, they are often not consistent and models that attempt to provide global interpretations. small variations in the training data can result in drastically different linear functions. Several methods have been proposed to distill neural net- Markov Decision Processes work policies to decision tree surrogates. Linear model U- trees (Liu et al. 2018) and soft decision trees (Coppens et al. The control tasks in this work are assumed to satisfy the 2019) take similar approaches to tree representation and Markov assumption and may be modeled as either Markov learning. Decision trees are global policy representations decision processes (MDPs) or partially observable Markov that map input observations to output actions by traversing a decision processes (POMDPs). MDPs are defined by tu- binary tree. Actions can be partially understood by inspect- ples (S, A, T, r, γ), where S and A are the state and ac- ing the values at each internal node. Unfortunately, the un- tion spaces, respectively, T (s0 | s, a) is the transition model, derstanding provided by decision trees can be limited be- r(s, a, s0 ) is the reward function, and γ is a time discount cause the mappings still pass the input observation through factor. POMDPs are modeled by the same tuple with the several layers of affine transforms and non-linear functions. addition of the observation space O and observation model Structural causal models (Madumal et al. 2020) also try Z(o | s, a). A policy is a function that maps a state to an to learn an inherently interpretable surrogate as a directed action in MDPs or a history of observations to an action in acyclic graph (DAG). The learned DAG represents relations POMDPs. To solve a MDP is to learn the policy π : S → A between objects in the task environment that can be easily that maximizes the expected sum of discounted rewards understood by humans. The DAG structure, however, must " # be provided for each task, and the fidelity is not assured. Our work proposes tree representations that provide in- X τ −t  V (st ) = E γ r sτ , π(sτ ) (1) τ =t tuitive maps over future courses of action. When used as a policy, the realized course of action is guaranteed to be along of the probability that the action sequence up to that node the branches of the tree. In this way, the method is similar to will be taken P (a0 , . . . 
Proposed Method

We present model-agnostic methods to represent policies for both MDPs and POMDPs as interpretable tree policies. Trees are inherently more interpretable than high-dimensional models like neural networks. The proposed method uses the baseline policy to generate simulations of trajectories reachable from a given initial state. The trajectories are then clustered to develop a width-constrained tree that represents the original policy with low expected return loss. The methods assume that baseline policies return distributions over actions or estimates of action values for a given state. Building trees also requires a generative model of the task environment.

In addition to representing the future policy, the tree provides useful statistics on expected performance, such as the likelihood of following each represented trajectory. The tree may be used as a policy during task execution with guarantees on expected behavior. We present methods to generalize to states not seen during tree construction by using the tree as a constraint on the baseline policy.

Local Tree Policy

Before describing the tree construction algorithm, we first present the local tree policies that it produces. A policy tree is a rooted polytree whose nodes represent actions taken during policy execution. An example tree policy for a stochastic, fully observable task is shown in fig. 1. Each node of the tree represents an action a_t taken at time t. The root action of each tree is the action recommended by the policy at the initial state. Each path from the tree root to a leaf node gives a trajectory of actions a_t, a_{t+1}, ..., a_{t+h} that the agent may take during policy execution up to some depth h. Trajectories leading to terminal conditions before h steps result in shallower tree branches.

Policy trees are interpretable representations of the future behavior of a policy that give explicit, quantitative predictions of future trajectories. Each node provides an estimate of the probability that the action sequence up to that node will be taken, P(a_0, ..., a_t | s_0, π), and an estimate of the policy value Q(s, a). Each node also contains a set of example states or observations that would result in that action.

Figure 2 provides more detailed views of the policy tree in fig. 1. The trajectory probabilities and value estimates shown in fig. 2a are calculated during tree construction and may be presented for any surrogate policy. The states in the example problem s_t ∈ R^n are real vectors, and each node in fig. 2b shows the mean of that node's state values. While the mean state value is useful for understanding this problem, it may not be useful in all problems. For example, the mean value may be meaningless for problems with discrete state spaces. Methods to compactly represent a node's set of states or observations cannot be generally defined and instead should be specified for each problem.

Figure 2: Detailed Tree Views. (a) The figure shows the first three layers of the vaccine surrogate policy tree. Each node provides its action, estimated value, and probability of reaching it. (b) The figure shows the first two tree layers with the mean of the observations leading to each action node displayed.

Trees are an intuitive choice for the local surrogate model of a control policy. Local surrogate methods seek to provide simple approximate models that are faithful to the original policy in a space around a target point. For control tasks, often only predictions of future behavior are useful. Trees provide compact surrogate models by only representing behavior in states that are forward-reachable from the current state. This is in contrast to methods that define local neighborhoods by arbitrarily perturbing the initial point (Ribeiro, Singh, and Guestrin 2016). In those cases, explanatory capacity is wasted on unnecessary states.
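To illustrate the information each node carries, the following is one possible Python representation of particles and action nodes. The class and field names are ours, chosen for this sketch rather than taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Particle:
    state: Any      # s_t
    reward: float   # reward received from t-1 to t
    # For POMDPs, the observation o_t and belief b_t could be stored as well.

@dataclass
class ActionNode:
    particles: List[Particle]   # P: particles assigned to this node
    action: Any                 # a: action taken from this node
    children: List["ActionNode"] = field(default_factory=list)  # Ch

    def path_probability(self, n_root_particles: int) -> float:
        """Fraction of simulated trajectories that reach this node; an
        estimate of P(a_0, ..., a_t | s_0, pi)."""
        return len(self.particles) / n_root_particles
```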
Tree Build Algorithm

Trees are built by simulating multiple executions of the baseline policy from the given initial state or belief and clustering them into action nodes. The process is illustrated in fig. 3. Each simulation is represented by a collection of particles p_t, one for each time t along the simulation trajectory. Particles are tuples (s_t, r_t) of the state at time t and the reward received from t−1 to t. For POMDPs, particles also record the observation o_t and belief b_t.

Figure 3: Tree build example. (a) An initial set of particles is sampled from the given initial state or belief. These particles are advanced in the simulation using the action at the root node a_0. Particles reaching terminal states are marked with ×. (b) Non-terminal particles are clustered into new action nodes. (c) Action nodes with a sufficient number of particles are advanced another time step. This process continues for each action node until all particles terminate or reach a maximum depth.

Tree construction begins by generating a set of particles for the initial state or belief. For the initial step, all particles are assigned to the root action node an_0, which takes the action given by the baseline policy, a_0 = π(s_0) for MDPs or a_0 = π(b_0) for POMDPs. The particles are then advanced through one simulation time step to produce the particles for t+1, as shown in fig. 3a. The t+1 particles that did not enter terminal states are clustered into new action nodes, as shown in fig. 3b. This process continues until a terminal state is encountered or a fixed depth limit is met. Action nodes that do not have at least n_min particles are not expanded further to limit over-fitting, where n_min is specified by the user.

The point of entry for tree construction is the BUILD function, presented in algorithm 1 for MDPs. The initial particle set is created and clustered into the root action node. Each root particle for an MDP policy is initialized with the same state s_0. Each action node is a tuple (P, a, Ch), where P is the node's set of particles, a is the action taken from that node, and Ch is the set of child nodes. The BUILD function for POMDPs follows the same procedure as for MDPs, though states and observations for the initial particle set are sampled from the initial belief b_0.

Algorithm 1: Build MDP
1: procedure BUILDMDP(s_0, π, n, d_max)
2:   P ← ∅
3:   Ch ← ∅
4:   d ← 0
5:   for i ∈ 1 : n
6:     P ← P ∪ {(s_0, 0)}
7:   p_0 ← π(s_0)
8:   a_0 ← arg max_{a∈A} π(s_0)
9:   root ← NODE(P, a_0, p_0, Ch)
10:  root ← ROLLOUT(root, π, d)
11:  return root

Particles are advanced through recursive calls to the ROLLOUT function (algorithm 2). Each time ROLLOUT is called on an action node an, it proceeds only if the size of the node particle set an_P meets the minimum threshold n_min and the depth limit d_max has not been exceeded. If these conditions are met, each particle is advanced by calling the simulator GEN(s, a) with that particle's state and the node action. The set of new particles is then grouped into new action nodes by the CLUSTER function. ROLLOUT is recursively called on each new action node, and the returned node is added to the child node set an_Ch.

Algorithm 2: Rollout
1: procedure ROLLOUT(an, π, d)
2:   if |an_P| ≥ n_min and d < d_max
3:     P ← ∅
4:     for p ∈ an_P
5:       s ← p_s
6:       s′, o′, r′, done ← GEN(s, an_a)
7:       if not done
8:         p′ ← PARTICLE(s′, o′, r′)
9:         P ← P ∪ {p′}
10:    C ← CLUSTER(P, π)
11:    for c ∈ C
12:      c′ ← ROLLOUT(c, π, d + 1)
13:      an_Ch ← an_Ch ∪ {c′}
14:  return an
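The recursion in algorithms 1 and 2 can be sketched in Python as follows, reusing the Particle and ActionNode classes above. The sketch simplifies the algorithms: policy is assumed to return the greedy action directly, gen(s, a) is assumed to return a (next_state, reward, done) tuple, and cluster stands in for algorithm 3; none of these names come from the paper's released code.

```python
def build_mdp(s0, policy, gen, cluster, n, d_max, n_min):
    """Sketch of algorithm 1: seed the root node with n copies of s0,
    take the baseline action there, and expand recursively."""
    root = ActionNode(particles=[Particle(s0, 0.0) for _ in range(n)],
                      action=policy(s0))
    return rollout(root, policy, gen, cluster, d=0, d_max=d_max, n_min=n_min)

def rollout(node, policy, gen, cluster, d, d_max, n_min):
    """Sketch of algorithm 2: advance the node's particles one step,
    cluster the survivors into child nodes, and recurse on each child."""
    if len(node.particles) < n_min or d >= d_max:
        return node
    advanced = []
    for p in node.particles:
        next_state, reward, done = gen(p.state, node.action)  # simulator step
        if not done:
            advanced.append(Particle(next_state, reward))
    for child in cluster(advanced, policy):                   # new action nodes
        node.children.append(
            rollout(child, policy, gen, cluster, d + 1, d_max, n_min))
    return node
```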
The action nodes of the tree T encode a deterministic policy for the sampled particles, T : S_R → A, where S_R is the set of states contained in the particle set. The particles are clustered into action nodes to minimize how much this policy deviates from the baseline policy π while meeting constraints on tree size. To achieve this, we developed a clustering algorithm that approximately solves for the optimal clustering through recursive greedy optimization.

The recursive clustering approach is presented in algorithm 3. The algorithm progressively clusters particles into an increasing number of nodes k until a distance measure δ between the actions assigned by the tree and the baseline policy is less than a threshold δ* or the maximum number of nodes c_max is exceeded.

Algorithm 3: Recursive Cluster
1: procedure CLUSTER(P, π)
2:   k ← 1
3:   repeat
4:     C, δ ← GREEDYCLUSTER(P, k, π)
5:     k ← k + 1
6:   until δ ≤ δ* or |C| ≥ c_max
7:   return C

Clusters are assigned according to the greedy heuristic process shown in algorithm 4. In this approach, the complete set of actions assigned to the particles under the baseline policy, A_U, is ranked according to frequency by the UNIQUEACTIONS function. An action node is formed for each of the top k actions, containing all particles assigned that action by the baseline policy. Any particles not clustered to a node at the end of this process are assigned to the previously formed node that minimizes the distance between that node's action and the action assigned by the baseline policy.

Algorithm 4: Greedy Cluster
1: procedure GREEDYCLUSTER(P, k, π)
2:   C ← ∅
3:   δ ← 0
4:   A_U ← UNIQUEACTIONS(P, π)
5:   for i ∈ 1 : k
6:     a ← A_U[i]
7:     P_A ← {p ∈ P | arg max_{a′∈A} π(p_s) = a}
8:     C ← {NODE(P_A, a, ∅)} ∪ C
9:     P ← P \ P_A
10:  for p ∈ P
11:    an ← arg min_{an′∈C} DISTANCE(p, an′, π)
12:    δ ← δ + DISTANCE(p, an, π)
13:    an_P ← {p} ∪ an_P
14:  return C, δ

The distance metric may be any appropriate measure assigned by the user for the given policy type. For action-value function policies, an effective distance measure for an action a is

    \delta = \left\| \hat{Q}(s, a) - \max_{a' \in A} \hat{Q}(s, a') \right\|    (4)

which intuitively gives the predicted sub-optimality of the action. For stochastic policies that output distributions over actions, the difference in action probability would be an appropriate distance.
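One way eq. (4) and the clustering of algorithms 3 and 4 could look in Python is sketched below, again reusing the ActionNode class from the earlier sketch. The distance(particle, node) signature and the policy_action callable are our simplifications, not the authors' interfaces.

```python
from collections import Counter

def q_distance(q_hat, state, action, actions):
    """Distance of eq. (4): predicted sub-optimality of `action` at `state`."""
    best = max(q_hat(state, a) for a in actions)
    return abs(best - q_hat(state, action))

def greedy_cluster(particles, policy_action, distance, k):
    """Sketch of algorithm 4: make a node for each of the k most frequent
    baseline actions, then attach leftover particles to the nearest node."""
    baseline = {id(p): policy_action(p.state) for p in particles}
    ranked = [a for a, _ in Counter(baseline.values()).most_common(k)]
    nodes = [ActionNode([p for p in particles if baseline[id(p)] == a], a)
             for a in ranked]
    delta = 0.0
    for p in (p for p in particles if baseline[id(p)] not in ranked):
        nearest = min(nodes, key=lambda node: distance(p, node))
        delta += distance(p, nearest)
        nearest.particles.append(p)
    return nodes, delta

def cluster(particles, policy_action, distance, delta_star, c_max):
    """Sketch of algorithm 3: grow k until the fidelity threshold delta_star
    is met or the node budget c_max is reached."""
    k = 1
    while True:
        nodes, delta = greedy_cluster(particles, policy_action, distance, k)
        if delta <= delta_star or len(nodes) >= c_max:
            return nodes
        k += 1
```

A routine of this shape could be bound to fixed distance, delta_star, and c_max arguments (for example with functools.partial) to match the two-argument cluster callable used in the rollout sketch above.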
Tree Policy Control

To use the tree as a policy during task execution, it must be able to generalize to states not encountered in the particle set of the tree. To do this, we propose a simple method that uses the baseline policy, constrained by the tree at each time step. For a tree constructed for a state s_0, the policy always takes the action at the root node an_0. For all remaining steps, the policy returns

    a \leftarrow \arg\max_{a' \in A_T} \pi(s_t)    (5)

where A_T is the set of actions in the child set of the preceding action node an_{t−1}. Intuitively, the agent takes the best action predicted by the baseline policy that is included in the tree. Building a new tree each time a leaf state is encountered allows the tree policies to be run in-the-loop for long or infinite-horizon problems.
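A minimal sketch of this tree-constrained control loop is shown below. The env_step(state, action) simulator, the build_tree(state) constructor, and the q_hat action-value callable are hypothetical stand-ins used only for illustration.

```python
def constrained_action(prev_node, q_hat, state):
    """Eq. (5): among the actions kept in the tree (children of the preceding
    action node), take the one the baseline policy scores highest."""
    children = {child.action: child for child in prev_node.children}
    action = max(children, key=lambda a: q_hat(state, a))
    return action, children[action]

def run_tree_policy(env_step, build_tree, q_hat, state, max_steps):
    """Receding-horizon use: rebuild a tree whenever a leaf node is reached."""
    node = build_tree(state)
    action = node.action                    # the root action is always taken
    for _ in range(max_steps):
        state, done = env_step(state, action)
        if done:
            break
        if not node.children:               # leaf reached: build a new tree
            node = build_tree(state)
            action = node.action
        else:
            action, node = constrained_action(node, q_hat, state)
    return state
```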
Experiments

We ran experiments to test the performance of the proposed approach. One set of experiments was conducted on a simple grid world task. These experiments were designed to quantitatively measure the effect of various algorithm parameters and environment features on tree size and performance. Two additional experiments were run on more complex tasks that demonstrate the utility of the approach on real-world motivating examples. The first of these is multi-city vaccine deployment planning. In large-scale infrastructure planning tasks such as this, it is often necessary to have a reasonable prediction of all steps of the plan in order to gain stakeholder trust. The second task is a cyber security agent that provides recommendations to a human analyst to secure a network against attack. Interpretable policies are important for tasks with human oversight and cooperation.

We implemented the proposed algorithm with the distance function defined in eq. (4) in Python. Neural network training was done in PyTorch (Paszke et al. 2019). Source code, full experiment descriptions, and results are available in the Appendix.

Grid World

In the grid world task, an agent must navigate a discrete, 2D world to reach a goal state while avoiding trap states. The agent may take one of n total actions to move between 1 and ⌊n/4⌋ units in any of the four cardinal directions on the grid. The agent moves in the intended direction with probability p and takes a random action otherwise. To incentivize solving the problem quickly, the agent receives a cost of −1 at each time step. The agent gets a positive reward of +10 for reaching a goal state and a penalty of −5 for reaching a trap state. The episode terminates when the agent reaches a goal state or when a maximum number of steps is reached. Because we know the exact transition probabilities, we used discrete value iteration to learn a baseline policy (Sutton and Barto 2018).

We constructed surrogate trees from the baseline policy using various environment and tree-build algorithm settings. For the environment, we swept over different values of the transition probability p and the number of actions n. For the tree build algorithm, we varied the total number of particles, the minimum particle count, the distance threshold δ*, and the maximum leaf node depth. Only one parameter was varied for each test, with all others held fixed. The baseline settings for the environment were n = 4 actions and a successful transition probability of p = 0.9. The baseline tree build parameters were 1000 total particles, 250 minimum particles, a distance threshold of 0.01, and a maximum depth of 10.

To test each tree, we ran it as a policy from the initial state until it reached a leaf node. The baseline policy was then used to complete the episode. We tested 2,500 trees for each parameter configuration. Select results are shown in table 1. The mean change in performance of the tree policy relative to the baseline is shown along with one standard error bounds. The average depth of leaf nodes is also shown; its standard errors are omitted because each was below 0.05.

Table 1: Parameter Search Results. The change in task performance relative to the baseline policy performance and the average depth of leaf nodes in the trees are given. Values generated with the baseline settings are marked with *.

Parameter      Value    Rel Change (%)   Leaf Depth
Trans. Prob.   0.5       4.5 ± 4.7       4.4
               0.7      −8.0 ± 4.0       4.4
               0.9*     −4.4 ± 1.0       4.2
               1.0       0.0 ± 0.0       4.0
No. Actions    4*       −4.4 ± 1.0       4.2
               8        −2.6 ± 0.4       2.6
               16       −1.0 ± 0.3       1.5
δ*             0.005    −5.7 ± 1.0       4.2
               0.01*    −4.4 ± 1.0       4.2
               0.1      −3.1 ± 1.0       3.1
Max Depth      3        −1.3 ± 0.9       2.0
               5        −2.9 ± 1.0       3.5
               10*      −4.4 ± 1.0       4.2

From the transition probability sweep, we see that relative performance generally improves as the transition probability increases. At p = 0.5, both policies perform very poorly, and the difference between the two is not significant as a result. In the deterministic case, the surrogate tree perfectly represents the baseline policy. These results suggest that surrogate trees are better suited for tasks with low stochasticity.

As we increase the number of actions, the difference in performance between the baseline and the tree policy decreases. The average leaf depth of the trees also decreases with more actions. This likely explains the improved performance, as shallower trees transition back to the baseline policy sooner in the tests. Another possible explanation is that as the number of actions grows, the cost of taking a sub-optimal action decreases. For example, with 4 actions, if the optimal action is not taken, the agent moves in a completely different direction than it should. With n > 4 actions, the agent may still move in the correct direction, though by more or less distance than optimal.

For the tree building algorithm parameters, we see that as δ* increases, the depth of the leaf nodes decreases. This is likely because higher values of δ* lead to more aggressive node clustering and fewer branches. Similar to the trend observed when varying the number of actions, as the average leaf depth decreases, relative performance increases. Likewise, as the max depth increases, the tree is allowed to grow deeper and the relative performance drops.
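For reference, the discrete value iteration used to obtain the grid world baseline can be written as a short tabular routine. The sketch below assumes explicit transition and reward arrays and is a generic illustration, not the experiment code.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration. T[s, a, s'] are transition probabilities and
    R[s, a] are expected rewards; returns the greedy (baseline) policy."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (T @ V)        # Q[s, a] = R[s, a] + gamma * sum_s' T * V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)    # deterministic baseline policy
        V = V_new
```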
Vaccine Planning

In the vaccine deployment task, an agent decides the order in which to start vaccine distributions in cities in the midst of a pandemic outbreak. At each step, the agent picks a city in which to start a vaccine program. The spread of the disease is modeled as a stochastic SIRD model (Bailey 1975), with portions of each city population being susceptible to infection (S), infected (I), recovered (R), or dead (D). The n cities are modeled as a fully connected, weighted graph, where weight w_ij encodes the amount of traffic between cities i and j. The state of city i changes from time t to t+1 as

    dS_t^{(i)} = -\beta \tilde{I}_t^{(i)} S_t^{(i)} - \alpha_t^{(i)} S_t^{(i)} + \epsilon_t^{S,(i)}    (6)
    dI_t^{(i)} = \beta \tilde{I}_t^{(i)} S_t^{(i)} + \epsilon_t^{S,(i)} - \gamma I_t^{(i)} - \mu I_t^{(i)}    (7)
    dR_t^{(i)} = \gamma I_t^{(i)} + \alpha_t^{(i)} S_t^{(i)}    (8)
    dD_t^{(i)} = \mu I_t^{(i)}    (9)

where β, γ, and µ are the mean infection, recovery, and death rates, respectively. The vaccination rate α_t^{(i)} of city i at time t is equal to zero until an action is taken to deploy a vaccination program to that city. The noise ε is sampled from a zero-mean Gaussian. The effective infection exposure at city i is defined as Ĩ^{(i)} = Σ_j w_ij I^{(j)}, where the sum is taken over all cities and w_ii = 1. Cities with closer index values have higher weights, for example w_12 > w_13. Each episode is initialized with up to 10% of each city infected and the remainder susceptible. One city is initialized with 25% infected. The episode concludes after five cities have had vaccine programs started. The simulation is then run until all cities have zero susceptible or infected population. The reward at each time step is r_t = a Σ_i dI_t^{(i)} + b Σ_i dD_t^{(i)}, where a and b are parameters.
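A direct transcription of the SIRD update in eqs. (6)-(9) might look as follows. The vectorized form, argument names, and the absence of clipping to valid population fractions are our simplifications, not details from the paper.

```python
import numpy as np

def sird_step(S, I, R, D, W, alpha, beta, gamma, mu, sigma, rng):
    """One stochastic SIRD update per eqs. (6)-(9). S, I, R, D, and alpha are
    per-city vectors, W is the traffic matrix (w_ii = 1), sigma the noise scale."""
    I_eff = W @ I                               # effective exposure I~ of each city
    eps = rng.normal(0.0, sigma, size=S.shape)  # zero-mean Gaussian noise
    dS = -beta * I_eff * S - alpha * S + eps
    dI = beta * I_eff * S + eps - gamma * I - mu * I
    dR = gamma * I + alpha * S
    dD = mu * I
    return S + dS, I + dI, R + dR, D + dD
```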
We trained a neural network policy using double DQN (van Hasselt 2010) with n-step returns. The trained policy achieved an average score, with one standard error bounds, of −149.8 ± 0.6 over 100 trial episodes. We built trees for 100 random initial states with 2000 particles and a minimum particle threshold of 250 and tested their performance as forward policies. The trees achieved an average score of −152.2 ± 0.8 over the 100 trials, for an average performance drop of 1.6%. To compare against a baseline surrogate modeling approach, we also trained and tested a LIME model (Ribeiro, Singh, and Guestrin 2016) with 2000 samples. The LIME average score was −184.2 ± 4.5, for an average performance loss of 23.0%.

The surrogate tree in fig. 2 provides an intuitive understanding of the learned policy. In fig. 2b we can see that the policy does not deploy vaccines to the most heavily infected cities first. It instead prioritizes cities with larger susceptible populations to give the vaccine time to take effect on a larger portion of the population.

Cyber Security

The cyber security task requires an agent to prevent unauthorized access to a secure data server on a computer network. The network is comprised of four local area networks (LANs), each of which has a local application server and ten workstations, and a single secure data server. The compromise state of the network is not known directly and may only be observed through noisy alerts generated from malware scans. Workstations are networked to all others on their LAN and to the LAN application server. Servers are randomly connected in a complete graph. An attacker begins with a single workstation compromised and takes actions to compromise additional workstations and servers to reach the data server.

The defender can scan all nodes on a LAN to locate compromised nodes with probability p_detect. Compromised nodes will also generate alerts without being scanned with low probability. The defender can also scan and clean individual nodes to detect and remove compromise. The reward is zero unless the data server is compromised, in which case a large penalty is incurred. The defender was trained using Rainbow DQN (Hessel et al. 2018).

Figure 4: Cyber Security Policy Tree. The figure shows the first three layers of a surrogate tree for the cyber security task. The most frequent observation from the "Clean Host" action node set is shown.

Automated systems such as this are often implemented with a human in the loop. Policies that can be more easily interpreted are more likely to be trusted by a human operator. A surrogate tree for the neural network policy is shown in fig. 4. Unlike the baseline neural network, the policy encoded by this tree can be easily interpreted. The agent will continually scan LAN 1 in most cases, and will only clean a workstation after malware has been detected.

Conclusions

In this work, we presented methods to construct local surrogate policy trees from arbitrary control policies. The trees are more interpretable than high-dimensional policies such as neural networks and provide quantitative estimates of future behavior. Our experiments show that, despite truncating the set of actions that may be taken at each future time step, the trees retain performance comparable to their baseline policies. Experiments demonstrate the effect of various environment and algorithm parameters on tree size and performance in a simple grid world. Demonstrations show how surrogate trees may be used in more complex, real-world scenarios.

The action node clustering presented in this work used a heuristic search method that provided good results, but without any optimality guarantees. Future work will look at improved approaches to clustering, for example by using a mixed integer program optimization. We will also explore using the scenarios simulated to construct the tree to back up more accurate value estimates and refine the resulting policy. Including empirical backups such as these may also allow calculation of confidence intervals or bounds on policy performance (Mern and Kochenderfer 2021).

References

Adadi, A.; and Berrada, M. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6: 52138–52160.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Journal of Machine Learning Research, 47(2-3): 235–256.

Bailey, N. T. 1975. The Mathematical Theory of Infectious Diseases and its Applications. 2nd edition. Charles Griffin & Company Ltd.
Coppens, Y.; Efthymiadis, K.; Lenaerts, T.; Nowé, A.; Miller, T.; Weber, R.; and Magazzeni, D. 2019. Distilling Deep Reinforcement Learning Policies in Soft Decision Trees. In International Joint Conference on Artificial Intelligence (IJCAI), 1–6.

Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2018. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI Conference on Artificial Intelligence (AAAI), 3215–3222.

Kim, B.; Koyejo, O.; and Khanna, R. 2016. Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2280–2288.

Kochenderfer, M. J.; Wheeler, T. A.; and Wray, K. H. 2022. Algorithms for Decision Making. MIT Press.

Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning (ECML).

Lipton, Z. C. 2018. The Mythos of Model Interpretability. Communications of the ACM, 61(10): 36–43.

Liu, C.; Arnon, T.; Lazarus, C.; Strong, C. A.; Barrett, C. W.; and Kochenderfer, M. J. 2021. Algorithms for Verifying Deep Neural Networks. Foundations and Trends in Optimization, 4(3-4): 244–404.

Liu, G.; Schulte, O.; Zhu, W.; and Li, Q. 2018. Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees. In European Conference on Machine Learning (ECML), volume 11052, 414–429.

Madumal, P.; Miller, T.; Sonenberg, L.; and Vetere, F. 2020. Explainable Reinforcement Learning through a Causal Lens. In AAAI Conference on Artificial Intelligence (AAAI), 2493–2500.

Mern, J.; and Kochenderfer, M. J. 2021. Measurable Monte Carlo Search Error Bounds. Computing Research Repository, abs/2106.04715.

Miller, T. 2019. Explanation in Artificial Intelligence: Insights from the Social Sciences. Artificial Intelligence, 267: 1–38.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540): 529–533.

Molnar, C. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS), 8024–8035.

Puiutta, E.; and Veith, E. M. S. P. 2020. Explainable Reinforcement Learning: A Survey. In Machine Learning and Knowledge Extraction International Cross-Domain Conference (CD-MAKE), volume 12279, 77–95.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

Sidrane, C.; Maleki, A.; Irfan, A.; and Kochenderfer, M. J. 2021. OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems. Computing Research Repository, abs/2108.01220.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. 2nd edition. The MIT Press.

van Hasselt, H. 2010. Double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2613–2621.

Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically Interpretable Reinforcement Learning. In International Conference on Machine Learning (ICML), volume 80, 5052–5061.