Attribution-based Salience Method towards Interpretable Reinforcement Learning. CEUR-WS.org Vol-2600, short4: https://ceur-ws.org/Vol-2600/short4.pdf (dblp: https://dblp.org/rec/conf/aaaiss/WangME20)
       Attribution-based Salience Method towards Interpretable Reinforcement
                                     Learning

                                   Yuyao Wang, Masayoshi Mase, and Masashi Egi
                                            Research & Development Group
                                                     Hitachi, Ltd.
                {yuyao.wang.fe@hitachi.com, masayoshi.mase.mh@hitachi.com, masashi.egi.zj@hitachi.com}


                           Abstract

Reinforcement Learning (RL), a general learning, predicting, and decision-making paradigm, has achieved great success in a wide range of games and robotics. Recently, RL has also proven its worth in real-world scenarios, such as adaptive decision control and recommendation. It is promising to deploy RL in the real world to gain real benefits. However, RL is criticized for being a black box. The real systems are owned and operated by humans, who need to be reassured about the controller's intentions and given insights into failure cases. Therefore, policy explanation is important. Existing methods towards interpretable RL include the Jacobian saliency map and the perturbation-based saliency map, which are limited to visual-input problems. To model complicated real-world use cases, numerical data are widely employed. In this paper, we propose an attribution-based salience method that is applicable to both visual and numerical input. We aim to understand RL agents in terms of the information they attend to for decision making. We verify our method with a machine control use case. The explanations we provide are understandable to AI experts and non-experts alike. (short paper)

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                         Introduction

Reinforcement learning (RL) is a general learning, predicting, and decision-making paradigm. It provides solution methods for decision-making problems. RL has achieved remarkable success in a broad range of game-playing, continuous control, and robotics tasks. Deep Reinforcement Learning (Deep RL) exceeded the human baseline in Atari games (Mnih et al. 2015) and beat a professional human player in Go (Silver et al. 2016). Recently, RL has also proven its worth in real-world scenarios, such as production systems and recommendation. A growing number of real-world use cases show that it is promising to deploy RL in the real world to gain real benefits. However, there are many issues for RL to be widely deployed in the real world. One of them is that RL is a black box. The real systems are owned and operated by humans, who need to be reassured about the controller's intentions and given insights into failure cases. For this reason, policy explanation is important.

Research on Explainable Artificial Intelligence (XAI) has become increasingly popular in recent years. One trend of research in providing post-hoc explanations focuses on how to explain individual predictions by learning a local approximation of a model. SHAP (Lundberg and Lee 2017) is one of the state-of-the-art techniques. SHAP decomposes the AI prediction into the sum of the contribution of each input feature. SHAP works well for regression and classification problems, but it does not work well for RL. We discuss this issue in later sections.

Existing methods for explaining deep RL include the Jacobian saliency map (Zahavy, Ben-Zrihem, and Mannor 2016) and the perturbation-based saliency map (Greydanus et al. 2017). These tools use visual-input test beds and are not applicable to problems with numerical feature values. There is a need for an explanation method for numerical inputs, which are widely employed to model complicated real-world use cases. For example, in our machine control use case, RL relies on sensor data to control the machine.

One of the challenges that arises in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation (Sutton and Barto 2018). Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment (Sutton and Barto 2018). These features make the explanations requested of RL different from those of other approaches. In this paper, we want to find out how RL agents make decisions. We aim to understand RL agents in terms of the information they attend to for decision making.

The contributions of the paper are as follows:
• Clarify the problem in applying attribution methods to RL
• Generate attributions by background data selection with domain knowledge for interpretable RL
• Evaluate on a machine control use case

                         Prerequisite

Attribution Method

The concept of attribution is studied in various papers, such as integrated gradients (Sundararajan, Taly, and Yan 2017)
and SHAP (Lundberg and Lee 2017). We give the definition of attribution following the statements in the papers above.

Definition (Attribution): Suppose we have a function f : R^n → R^m that represents a model, and an input x = (x_1, ..., x_n) ∈ R^n. An attribution of the prediction at input x relative to a baseline input x′ is a vector φ(x, x′) = (φ_1, ..., φ_n) ∈ R^n, where φ_i is the contribution of x_i to the prediction f(x).

Shapley Value

Let f be the original prediction model and g the explanation model. The explanation model uses simplified inputs x′ that map to the original inputs through a mapping function x = h_x(x′). Assuming g(z′) ≈ f(h_x(z′)) whenever z′ ≈ x′, the attribution method is defined as

    g(z') = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i                                   (1)

where z′ ∈ {0, 1}^N, N is the number of simplified input features, and φ_i ∈ R.

Under four axioms (efficiency, symmetry, dummy, and additivity), the attribution is proved to have a single unique solution, known as the Shapley value (Shapley 1953) in cooperative game theory:

    \phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,(N - |z'| - 1)!}{N!} \left[ f_x(z') - f_x(z' \setminus i) \right]      (2)

where |z′| is the number of non-zero entries in z′, and z′ ⊆ x′ represents all z′ vectors whose non-zero entries are a subset of the non-zero entries in x′.

SHAP (SHapley Additive exPlanations) (Lundberg and Lee 2017) is a state-of-the-art explanation framework based on the Shapley value. The SHAP value is defined as an approximation to Equation 2 using

    f_x(z') = f(h_x(z')) = E[f(z) \mid z_S]                                       (3)

where S is the set of non-zero indexes in z′.

Thus, the SHAP value attributes to each feature the change in the expected model prediction when that feature is toggled on. It explains how to get from the base value E[f(z)], which would be predicted if we did not know any features, to the model output f(x).

Problem of Attribution Methods on RL

The effect of each feature on a prediction is calculated relative to a baseline prediction. The input features of the baseline prediction (or base value) are called background data (or reference data). Usually, the background data is set to zero or to the average of the training dataset in prediction tasks. In image recognition tasks, the background data can be a black image, i.e., one whose pixel intensities are all zero. However, reinforcement learning generates its training data through exploitation and exploration in an uncertain environment. The dynamic learning process of a deep RL agent makes it problematic to apply SHAP directly. According to our experimental results, different selections of the background data lead to different explanation results. We want to solve this problem in our work. We also want to understand deep RL agents in terms of what information from the environment they use to make decisions. This matches the intuition of post-hoc explanations. Among the group of attribution methods, we use SHAP to analyze RL. We focus on an agent trained with the Deep Q-Network (DQN) (Mnih et al. 2015). Figure 1 shows the intuition of our problem setting.

Figure 1: Problem Setting

     Attribution-based Salience Method towards interpretable RL

Attribution generation

Deep RL agents learn what to do so as to maximize the cumulative reward, or value. In DQN, the value is approximated by the Q-function. The output of the DQN model is the Q-value for each action candidate. We adjust the original DQN model with an argmax operator in order to bridge the gap between the outputs and the action selection (decision making). We load the trained DQN model f_model from the deep RL agent and adjust the output by adding an activation layer. Note that this is done after the training process of our deep RL agent. In this way, the output of the modified model f_modified is the selected action, i.e., the action with the highest Q-value.

Next, we deal with the issue of background data. Instead of using one fixed set of background data, we embed domain knowledge to select the background data according to the environment the RL agent interacts with.

In an RL environment, we make a transition from one state s to the next state s′ by performing some action a and receiving a reward r. We load the learnt policy trajectory of our deep RL agent along the learning process and regard it as the dataset of our approach. Let P_{1:t} denote the trajectory of learnt policies from time step 1 to time step t; the trajectory file contains the state s and action a pair at each time step t. Therefore, we have P_t = P_t(s_t, a_t). Our background data is selected according to the trajectory P_{1:t} = P_{1:t}(s_{1:t}, a_{1:t}).

Then we calculate the attribution of each input, which is the SHAP value computed with our trained model and the selected background data.

Salience Method

A higher attribution value means a bigger impact of that input on the output of the model. The impact of each input changes over time. This means that the information the RL agent attends to for decision making changes. We select the highest attributions at each time step and visualize them to demonstrate the attention change of the RL agent.
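The attribution-generation steps above can be sketched end to end in a few lines. This is a minimal illustration, not the paper's implementation: the linear two-action Q-function and its weights, the crane-like four-feature states, and the toy trajectory are all invented, and we compute exact Shapley values (Equation 2) directly against a single background state rather than using KernelSHAP, which is tractable here only because there are four features.

```python
from itertools import combinations
from math import factorial

# Hypothetical linear Q-function over four crane-like states
# (distance x, velocity v, wire angle phi, angular velocity omega).
# Action 0 = accelerate, action 1 = decelerate; weights are invented.
W = {0: [-1.0, 0.5, 2.0, 1.0],
     1: [1.0, -0.5, -2.0, -1.0]}

def q(s, a):
    return sum(w * f for w, f in zip(W[a], s))

def f_modified(s):
    # argmax layer added after training: the output is the selected action
    return max(W, key=lambda a: q(s, a))

def exact_shapley(f, x, background):
    """Shapley values (Equation 2) of f at x against one background input;
    features outside the coalition S are replaced by background values."""
    n = len(x)
    def masked(S):
        return f([x[i] if i in S else background[i] for i in range(n)])
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (masked(set(S) | {i}) - masked(set(S)))
    return phi

# Background data selected from the agent's own trajectory of
# (state, action) pairs; here we simply take the start state.
trajectory = [([0.0, 0.0, 0.0, 0.0], 0), ([0.2, 0.4, 0.1, 0.05], 0)]
background = trajectory[0][0]
state = [0.9, 0.7, -0.1, -0.05]
phi = exact_shapley(f_modified, state, background)
# Efficiency axiom: attributions sum to f(state) - f(background).
assert abs(sum(phi) - (f_modified(state) - f_modified(background))) < 1e-9
```

In the paper the attribution is instead computed with KernelSHAP over the trained DQN; the exact enumeration above only serves to make the modified-model-plus-background recipe concrete.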
                         Experiment

We evaluated the proposed method on the automatic crane control use case.

Automatic Crane Control

A crane is a type of machine, generally equipped with a hoist rope, wire ropes or chains, and sheaves, that can be used to lift and lower materials and to move them horizontally. We want to realize automatic control of a crane with a deep RL agent and explain the policies of the agent. In Figure 2, we model the crane control problem.

Figure 2: Image of Automatic Crane Control Use Case

The object is connected to a trolley with a piece of wire. The object is supposed to be delivered by the trolley from the start position to the goal position. Operators can send acceleration and deceleration signals to the trolley to accomplish the delivery. Note that the trolley can only travel horizontally on the rail. The trolley is either accelerated by a specific constant value until the travelling velocity reaches the maximum, or decelerated by the same value until the velocity reaches zero. As the trolley starts moving, the object starts swinging like a pendulum. The objective is to deliver the object to the goal position as soon as possible and, at the same time, with negligible swinging at the goal position.

Figure 3: Image of State/Action Pair

Figure 3 is a scaled version of the trajectory, i.e., the state and action pair at each time step of the episode. In automatic crane control, there are four states (the inputs of our DQN model): the travelling distance of the trolley x, the velocity of the travelling trolley v, the angle of the wire φ, and the angular velocity of the swing ω. For intuitive understanding, we scaled the states in the figure. The grey line represents the action selected at each time step, which is the acceleration (targeting 0.73 m/s) or deceleration (targeting 0 m/s) signal our agent issued at each time step. The blue line represents the distance to the goal of the travelling trolley x, the orange line the velocity of the travelling trolley v, the green line the swing angle in the moving direction φ, and the pink line the angular velocity of the swing ω.

We applied our attribution-based salience method to the automatic crane control trajectory. We used KernelSHAP (Lundberg and Lee 2017) as the attribution method and selected the start position as the background data. Figure 4 shows the SHAP values for the four states. The blue, orange, green, and pink lines in the figure correspond to x, v, φ, and ω, respectively. The horizontal axis represents the attribution value for each state.

Figure 4: SHAP Values (Background Data: Start Position)

The result shows that at the beginning, the RL agent cares most about the velocity of the trolley. Gradually, it pays attention to the angle of the wire, i.e., the swing, while travelling at high speed. Near the goal, it takes the travelling distance as the most important state.

The strategy above is different from the one usually followed by a human operator. A human operator first looks at the travelling distance and velocity to move the trolley and stops near the goal as fast as possible. But at that point, the wire is swinging. The operator then watches the wire angle and accelerates and brakes the trolley a little at an appropriate wire angle to stabilize the swing at the goal position.

The RL agent delivers the object faster than a human operator because it does not wait for the appropriate swing angle by first stopping near the goal position. The adjustment of the swing phase is realized by paying attention to the swing angle and applying a little acceleration and braking while travelling at high speed, as described above. This result might be surprising to human operators but becomes intuitive after understanding the attention sequence of the RL agent.
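The dependence of such explanations on the chosen background data can be reproduced with a small, self-contained experiment. Everything concrete below is an invented stand-in (the toy scalar value function, the crane-like state, and the two candidate backgrounds are not taken from the paper); the sketch only demonstrates that exact Shapley attributions of the same state under the same model differ when computed against different baselines.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, background):
    """Shapley values of f at x, masking absent features to background values."""
    n = len(x)
    masked = lambda S: f([x[i] if i in S else background[i] for i in range(n)])
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (masked(set(S) | {i}) - masked(set(S)))
    return phi

# Toy scalar "value" of a crane-like state (x, v, phi, omega); weights invented.
value = lambda s: s[0] * s[1] + s[2] + 0.5 * s[3]
state = [0.5, 0.6, 0.2, 0.1]
start_bg = [1.0, 0.0, 0.0, 0.0]  # stand-in for a start-position background
goal_bg = [0.0, 0.0, 0.0, 0.0]   # stand-in for a goal-position background

phi_start = exact_shapley(value, state, start_bg)
phi_goal = exact_shapley(value, state, goal_bg)
# Same state, same model: the attribution of the distance feature flips sign
# depending on the background, so the "explanation" changes with the baseline.
assert phi_start[0] < 0 < phi_goal[0]
```

This baseline sensitivity is exactly why the background data must be chosen with domain knowledge, as discussed next.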
                         Discussion

In this section, we discuss the background data selection problem. We take automatic crane control as an example.

We also tried other candidate background data in comparative experiments. We selected the middle position and the goal position as the background data. Figure 5 shows the SHAP value results with the goal position selected as the background data. As shown in the figure, the travelling distance and travelling velocity are still the main features that contribute to the decision making. In this case, the SHAP values of the travelling distance of the trolley and the travelling velocity are approximately similar in magnitude but opposite in direction. At the beginning, the travelling distance contributes most, while near the goal, the travelling velocity contributes most. This is in contrast to what we observed in the experiment that used the start position as the background data.

Figure 5: SHAP Values (Background Data: Goal Position)

Figure 6 shows the SHAP value results where we selected the middle position as the background data. From 0 s to around 5 s, the travelling distance contributes much. However, its contribution decreases from 5 s to 10 s, and the contributions of the other states become greater around 8 s. At the end of the trajectory, the travelling distance contributes most.

Figure 6: SHAP Values (Background Data: Middle Position)

According to our investigation, when domain experts operate the crane, they first accelerate the crane. Then, when the crane reaches the maximum velocity, they operate to keep the crane at the maximum velocity. When the crane comes close to the goal position, they decelerate the crane.

Apparently, there are three phases in the operation of domain experts. According to the experimental results, it makes sense to select the start position as background data for these three phases of crane operation. However, in more complicated use cases, there will be more phases, and different background data should be selected for comparison with different patterns of data.

                         Conclusion

Our experiments show that different selections of background data generate different explanations. Some of the explanations match human intuition, while others are not straightforward enough for humans to understand. Since the calculation of attribution methods includes the selection of background data, we claim that this is a key issue for implementing attribution methods and reaching human-understandable explanations. Therefore, we select the background data, and thereby the generated explanation, considering domain knowledge and human intuition. Our proposed method explains the policies in terms of the contribution of each input state. We will verify our method with more use cases as future work. How to embed domain knowledge and human intuition in explanations so as to make them understandable to experts and non-experts alike is also an open question.

                         References

Greydanus, S.; Koul, A.; Dodge, J.; and Fern, A. 2017. Visualizing and understanding Atari agents. arXiv preprint arXiv:1711.00138.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

Shapley, L. S. 1953. A value for n-person games. Contributions to the Theory of Games 2(28):307–317.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction, second edition, volume 1. MIT Press, Cambridge.

Zahavy, T.; Ben-Zrihem, N.; and Mannor, S. 2016. Graying the black box: Understanding DQNs. In International Conference on Machine Learning, 1899–1908.