<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Attribution-based Salience Method towards Interpretable Reinforcement Learning</article-title>
      </title-group>
        <contrib-group>
          <contrib contrib-type="author">
            <string-name>Yuyao Wang</string-name>
            <email>yuyao.wang.fe@hitachi.com</email>
          </contrib>
          <contrib contrib-type="author">
            <string-name>Masayoshi Mase</string-name>
            <email>masayoshi.mase.mh@hitachi.com</email>
          </contrib>
          <contrib contrib-type="author">
            <string-name>Masashi Egi</string-name>
            <email>masashi.egi.zj@hitachi.com</email>
          </contrib>
          <aff>Research &amp; Development Group, Hitachi</aff>
        </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Reinforcement Learning (RL), a general learning, predicting and decision-making paradigm, has achieved great success in a wide range of games and robotics. Recently, RL has also proven its worth in real-world scenarios such as adaptive decision control and recommendation. It is promising to deploy RL in the real world to gain real benefits. However, RL is criticized for being a black box. The real systems are owned and operated by humans, who need to be reassured about the controller's intentions and given insights regarding failure cases. Therefore, policy explanation is important. Existing methods towards interpretable RL include the Jacobian saliency map and the perturbation-based saliency map, which are limited to visual-input problems. To model complicated real-world use cases, numerical data are widely employed. In this paper, we propose an attribution-based salience method that is applicable to both visual and numerical inputs. We aim to understand RL agents in terms of the information they attend to for decision making. We verify our method with a machine control use case. The explanations we provide are understandable to AI experts and non-experts alike. (short paper)</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Reinforcement learning (RL) is a general learning,
predicting, and decision-making paradigm that provides solution
methods for decision-making problems. RL has achieved
remarkable success in a broad range of game-playing,
continuous control, and robotics tasks. Deep Reinforcement Learning
(Deep RL) exceeded the human baseline in Atari games
        <xref ref-type="bibr" rid="ref4">(Mnih
et al. 2015)</xref>
        and beat professional human players in Go
        <xref ref-type="bibr" rid="ref6">(Silver et al. 2016)</xref>
        . Recently, RL has also proven its worth in
real-world scenarios such as production systems and
recommendation. A growing number of real-world use cases shows
that it is promising to deploy RL in the real world to gain
real benefits. However, many issues remain before RL can be
widely deployed in the real world. One of them is that RL
is a black box. The real systems are owned and operated
by humans, who need to be reassured about the controller’s
intentions and given insights regarding failure cases. For this
reason, policy explanation is important.
      </p>
      <p>
        Research on Explainable Artificial Intelligence (XAI) has
become increasingly popular in recent years. One trend of
research in providing post-hoc explanations focuses on
explaining individual predictions by learning a local
approximation of a model. SHAP
        <xref ref-type="bibr" rid="ref3">(Lundberg and Lee 2017)</xref>
        is one of
the state-of-the-art techniques. SHAP decomposes an AI
prediction into the sum of the contribution degrees of each input
feature. SHAP works well for regression and classification
problems, but it does not work well for RL. We discuss
this issue in later sections.
      </p>
      <p>
        Existing methods for explaining deep RL include the
Jacobian saliency map
        <xref ref-type="bibr" rid="ref10">(Zahavy, Ben-Zrihem, and Mannor
2016)</xref>
        and the perturbation-based saliency map
        <xref ref-type="bibr" rid="ref1">(Greydanus et
al. 2017)</xref>
        . These tools use visual-input test beds and are
not applicable to problems with numerical feature values.
There is a need for an explanation method for numerical
inputs, which are widely employed to model complicated
real-world use cases. For example, in our machine control use
case, RL relies on sensor data to control the machine.
      </p>
      <p>
        One of the challenges that arise in reinforcement learning,
and not in other kinds of learning, is the trade-off between
exploration and exploitation
        <xref ref-type="bibr" rid="ref9">(Sutton and Barto 2018)</xref>
        .
Another key feature of reinforcement learning is that it
explicitly considers the whole problem of a goal-directed agent
interacting with an uncertain environment
        <xref ref-type="bibr" rid="ref9">(Sutton and Barto
2018)</xref>
        . These features make the explanations required in RL
different from those in other approaches. In this paper, we want to
find out how RL agents make decisions. We aim to
understand RL agents in terms of the information they attend to
for decision making.
      </p>
      <p>The contributions of this paper are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>Clarify the problem in applying attribution methods to RL.</p>
        </list-item>
        <list-item>
          <p>Generate attributions by selecting background data with domain knowledge for interpretable RL.</p>
        </list-item>
        <list-item>
          <p>Evaluate the method on a machine control use case.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Prerequisite</title>
      <sec id="sec-2-1">
        <title>Attribution Method</title>
        <p>
          The concept of attribution is studied in various papers, such
as integrated gradients
          <xref ref-type="bibr" rid="ref8">(Sundararajan, Taly, and Yan 2017)</xref>
          and SHAP
          <xref ref-type="bibr" rid="ref3">(Lundberg and Lee 2017)</xref>
          . We give the definition
of attribution following the statements in the papers above.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Definition (Attribution):</title>
        <p>Suppose we have a function f : ℝ^n → ℝ^m that represents a
model, and an input x = (x_1, ..., x_n) ∈ ℝ^n. An attribution
of the prediction at input x relative to a baseline input x'
is a vector a(x, x') = (a_1, ..., a_n) ∈ ℝ^n, where a_i is the
contribution of x_i to the prediction f(x).</p>
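        <p>For illustration, for a linear model f(x) = Σ_i w_i x_i, a natural attribution relative to a baseline x' is a_i = w_i (x_i − x'_i), so that the attributions sum to f(x) − f(x').</p>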
      </sec>
      <sec id="sec-2-3">
        <title>Shapley Value</title>
        <p>Let f be the original prediction model and g the explanation
model. The explanation model uses simplified inputs x' that
map to the original inputs through a mapping function x = h_x(x').
Assuming g(z') ≈ f(h_x(z')) whenever z' ≈ x', the
attribution method is defined as</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[g(z') = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i]]></tex-math>
        </disp-formula>
        <p>where z' ∈ {0, 1}^N, N is the number of simplified input
features, and φ_i ∈ ℝ.</p>
        <p>
          Assuming four axioms (efficiency, symmetry,
dummy, and additivity), the attribution is proved to have a
single unique solution, known as the Shapley value
          <xref ref-type="bibr" rid="ref5">(Shapley 1953)</xref>
          in cooperative game theory:
        </p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[\phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,(N - |z'| - 1)!}{N!}\,\bigl[f_x(z') - f_x(z' \setminus i)\bigr]]]></tex-math>
        </disp-formula>
        <p>where |z'| is the number of non-zero entries in z', and z' ⊆ x'
represents all z' vectors whose non-zero entries are a
subset of the non-zero entries in x'.</p>
        <p>
          SHAP (SHapley Additive exPlanation)
          <xref ref-type="bibr" rid="ref3">(Lundberg and
Lee 2017)</xref>
          is a state-of-the-art explanation framework using the
Shapley value. The SHAP value is defined as an
approximation to equation (2):
        </p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[f_x(z') = f(h_x(z')) = E[f(z) \mid z_S]]]></tex-math>
        </disp-formula>
        <p>where S is the set of non-zero indexes in z'.</p>
        <p>Thus, the SHAP value attributes to each feature the change in
the expected model prediction when that feature is toggled
on. It explains how to get from the base value E[f(z)],
which would be predicted if we did not know any features, to
the model output f(x).</p>
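        <p>For illustration, the following sketch (ours, not from the cited papers) computes the Shapley value of equation (2) exactly for a small model by enumerating coalitions, replacing absent features with a single background point as a one-point stand-in for the conditional expectation in equation (3); all names are illustrative.</p>
        <preformat preformat-type="code">
# Minimal sketch (ours, not the authors' code): exact Shapley values for a
# small model by enumerating feature coalitions, as in equation (2). Features
# "toggled off" are replaced by a single background point, a one-point
# approximation of the conditional expectation in equation (3).
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(f, x, background):
    """Exact Shapley values of f at x relative to one background point."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                idx = list(subset) + [i]
                with_i = background.copy()
                with_i[idx] = x[idx]                        # coalition plus feature i
                without_i = background.copy()
                without_i[list(subset)] = x[list(subset)]   # coalition only
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


# Toy usage: for a linear model the Shapley value reduces to w_i * (x_i - x'_i).
w = np.array([1.0, -2.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
print(shapley_values(f, x, baseline))  # approx. [ 1.0, -2.0, 0.5 ]
        </preformat>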
      </sec>
      <sec id="sec-2-4">
        <title>Problem of Attribution Methods on RL</title>
        <p>
          The effect of each feature on a prediction is calculated relative
to a baseline prediction. The input features of the baseline
prediction (or base value) are called background data (or
reference data). In prediction tasks, the background data is usually
set to zero or to the average value of the training dataset.
In image recognition tasks, for example, the background data can be a
black image, i.e., an image in which all pixel intensities are zero.
However, reinforcement learning generates its training data
through exploitation and exploration in an uncertain
environment. The dynamic learning process of a deep RL agent
makes SHAP difficult to apply directly. According to our
experimental results, different selections of the background data
lead to different explanation results. We want to solve this
problem in our work. We also want to understand deep RL
agents in terms of what information about the environment they
use to make decisions. This matches the intuition of post-hoc
explanations. Among the group of attribution methods, we
use SHAP to analyze RL. We focus on agents trained with the
Deep Q-Network (DQN)
          <xref ref-type="bibr" rid="ref4">(Mnih et al. 2015)</xref>
          . Figure 1 shows
the intuition of our problem setting.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Attribution-based Salience Method towards Interpretable RL</title>
      <sec id="sec-3-1">
        <title>Attribution generation</title>
        <p>Deep RL agents learn what to do so as to maximize the
cumulative reward, or value. In DQN, the value is
approximated by a Q-function, and the output of the DQN model is the
Q-value for each action candidate. We adjust the original
DQN model with an argmax operator in order to bridge the
gap between the outputs and the action selection
(decision-making). We load the trained DQN model f_model from the deep
RL agent and adjust the output by adding an activation
layer. Note that this is done after the training process of our
deep RL agent. In this way, the output of the modified model
f_modified is the selected action, i.e., the action with the highest Q-value.</p>
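        <p>As a minimal sketch of this adjustment (assuming a PyTorch Q-network; the class and variable names below are illustrative, not prescribed by our method):</p>
        <preformat preformat-type="code">
# Minimal sketch (assumed PyTorch Q-network; names such as f_model follow the
# text, the implementation is illustrative): wrap the trained DQN so that the
# explained output is the greedy action selection rather than raw Q-values.
import torch
import torch.nn as nn


class ArgmaxWrappedDQN(nn.Module):
    def __init__(self, q_network: nn.Module):
        super().__init__()
        self.q_network = q_network             # trained f_model, loaded from the agent

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        q_values = self.q_network(state)       # one Q-value per action candidate
        return torch.argmax(q_values, dim=-1)  # f_modified: the selected action


# f_modified = ArgmaxWrappedDQN(f_model)       # applied only after training
        </preformat>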
        <p>Next, we deal with the issue of background data. Instead
of using one fixed set of background data, we embed domain
knowledge to select the background data according to the
environment the RL agent interacts with.</p>
        <p>In an RL environment, we make a transition from one state
s to the next state s' by performing some action a and
receiving a reward r. We load the learnt policy trajectory of our
deep RL agent along the learning process and regard it as
the dataset of our approach. Let P_1:t denote the trajectory of
learnt policies from time step 1 to time step t; the trajectory
file contains the state s and action a pair at each time step.
Therefore, we have P_t = P_t(s_t, a_t). Our background data is
selected according to the trajectory P_1:t = P_1:t(s_1:t, a_1:t).</p>
        <p>Then we calculate the attribution of each input, i.e., the
SHAP value, using our modified model and the selected
background data.</p>
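        <p>A minimal sketch of this step with the shap library follows; the variable names trajectory_states and f_modified_np are illustrative assumptions.</p>
        <preformat preformat-type="code">
# Minimal sketch (assumed names): attributions are computed with KernelSHAP,
# using states from the learnt trajectory P_1:t as background data.
import shap

# trajectory_states: array of shape (t, n_features) holding s_1 ... s_t
# f_modified_np: maps a batch of states to the selected action (NumPy in/out)
background = trajectory_states[:1]                       # e.g. the start-position state
explainer = shap.KernelExplainer(f_modified_np, background)
shap_values = explainer.shap_values(trajectory_states)   # one attribution per state feature
        </preformat>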
      </sec>
      <sec id="sec-3-2">
        <title>Salience Method</title>
        <p>A higher attribution value means a bigger impact of the
input on the output of the model. The impact of each input
changes over time, which means that the information the RL agent
attends to for decision-making changes. We select the highest
attributions at each time step and visualize them to demonstrate
the attention change of the RL agent.</p>
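        <p>A minimal sketch of this selection step (array names are illustrative):</p>
        <preformat preformat-type="code">
# Minimal sketch (assumed array names): at every time step, keep the feature
# with the largest absolute attribution to visualize the agent's attention shift.
import numpy as np

# shap_values: array of shape (T, n_features), one attribution vector per time step
salient_feature = np.argmax(np.abs(shap_values), axis=1)
# salient_feature[t] indexes the state (x, v, theta, omega) attended to at step t
        </preformat>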
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <p>We evaluated the proposed method on the automatic crane
control use case.</p>
      <sec id="sec-4-1">
        <title>Automatic Crane Control</title>
        <p>A crane is a type of machine, generally equipped with a hoist
rope, wire ropes or chains and sheaves, that can be used
to lift and lower materials and to move them horizontally.
We want to realize automatic control of a crane with a deep RL
agent and explain the policies of the agent. Figure 2 shows how we
model the crane control problem.</p>
        <p>The object is connected to a trolley with a piece of wire.
The object is supposed to be delivered by the trolley from
the start position to the goal position. Operators can send
acceleration and deceleration signals to the trolley to
accomplish the delivery. Note that the trolley can only travel
horizontally on the rail. The trolley is either accelerated
by a specific constant value until the travelling velocity
reaches its maximum, or decelerated by the same value
until the velocity reaches zero. As the trolley starts moving,
the object starts swinging like a pendulum. The objective is
to deliver the object to the goal position as soon as possible
and, at the same time, with negligible swinging at the goal
position.</p>
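        <p>For concreteness, the following toy sketch approximates the dynamics described above; the pendulum-on-trolley model and parameters such as the wire length are assumptions, not the simulator used in our experiments.</p>
        <preformat preformat-type="code">
# Toy sketch of the crane model described above (assumed pendulum-on-trolley
# dynamics and parameters; not the authors' simulator).
import numpy as np

G, L, DT = 9.81, 1.0, 0.05        # gravity, assumed wire length [m], time step [s]
A_MAX, V_MAX = 0.2, 0.73          # assumed acceleration [m/s^2]; max velocity from the text


def step(state, action):
    """state = (x, v, theta, omega); action 1 = accelerate, 0 = decelerate."""
    x, v, theta, omega = state
    a = A_MAX if action == 1 else -A_MAX
    v_new = float(np.clip(v + a * DT, 0.0, V_MAX))
    a_eff = (v_new - v) / DT                  # trolley acceleration after clipping
    x = x + v_new * DT
    # pendulum swing driven by gravity and the trolley acceleration
    omega = omega + (-(G / L) * np.sin(theta) - (a_eff / L) * np.cos(theta)) * DT
    theta = theta + omega * DT
    return (x, v_new, theta, omega)
        </preformat>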
        <p>Figure 3 is a scaled version of the trajectory, i.e., the state
and action pair at each time step. In automatic crane control,
there are four states (inputs of our DQN model): the
traveling distance of the trolley x, the velocity of the travelling
trolley v, the angle of the wire θ, and the angular velocity of the
swing ω. For intuitive understanding, we scaled the states
in the figure. The grey line represents the action selected at
each time step, which is the acceleration (target 0.73 m/s)
or deceleration (target 0 m/s) signal our agent applied
at that step. The blue line represents the distance of the
travelling trolley to the goal x. The orange line represents
the velocity of the travelling trolley v. The green line
represents the swing angle relative to the moving direction θ, and the
pink line represents the angular velocity of the swing ω.</p>
        <p>
          We applied our attribution-based salience method to the
automatic crane control trajectory. We used KernelSHAP
          <xref ref-type="bibr" rid="ref3">(Lundberg and Lee 2017)</xref>
          as the attribution method. We
selected the start position as the background data. Figure 4
shows the SHAP value scores for the four states. The blue,
orange, green and pink lines in the figure correspond to x,
v, θ, and ω, respectively. The horizontal axis represents the
attribution value score for each state.
        </p>
        <p>The result shows that at the beginning, the RL agent cares
more about the velocity of the trolley. Gradually, it pays
attention to the angle of the wire (the swing) while travelling
at high speed. It treats the traveling distance as the most
important state near the goal.</p>
        <p>The strategy above is different from the one usually
followed by a human operator. A human operator first looks
at the traveling distance and velocity to move the trolley and
stop near the goal as fast as possible. At that point, however, the
wire is still swinging. Then, the operator looks at the wire angle
and accelerates and brakes the trolley a little at an appropriate
wire angle to stabilize the swing at the goal position.</p>
        <p>The RL agent delivers the object faster than a human operator
because the RL agent does not stop near the goal position and
wait for an appropriate swing angle. The
adjustment of the swing phase is instead realized by paying
attention to the swing angle and applying a little acceleration and
braking while travelling at high speed, as described above. This
result might be surprising to human operators but becomes
intuitive after understanding the attention sequence of the
RL agent.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>In this section, we discuss the background data
selection problem, taking automatic crane control as an
example.</p>
      <p>We also tried other candidate background data as
comparative experiments. We selected the middle position and
the goal position as the background data. Figure 5 shows
the SHAP value results with the goal position selected as the
background data. As shown in the figure, the traveling
distance and traveling velocity are still the main features that
contribute to the decision making. In this case, the SHAP values
of the traveling distance of the trolley and of the traveling
velocity are approximately similar in magnitude but opposite
in direction. At the beginning, the traveling distance
contributes most, while near the goal position, the traveling
velocity contributes most. This is in contrast to what we
observed in the experiment that used the start position as the
background data.</p>
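      <p>A minimal sketch of how these comparative runs can be produced, reusing the assumed names from the attribution generation sketch:</p>
      <preformat preformat-type="code">
# Minimal sketch (assumed names, reusing the setup above): rerun KernelSHAP
# with the start, middle, and goal states as alternative background data.
import shap

mid = len(trajectory_states) // 2
candidates = {
    "start":  trajectory_states[:1],
    "middle": trajectory_states[mid:mid + 1],
    "goal":   trajectory_states[-1:],
}
explanations = {
    name: shap.KernelExplainer(f_modified_np, background).shap_values(trajectory_states)
    for name, background in candidates.items()
}
      </preformat>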
      <p>Figure 6 shows the SHAP value results where we selected
the middle position as the background data.
From 0 s to around 5 s, the traveling distance contributes
the most. However, its contribution decreases from 5 s to
10 s, and the contributions of the other states become greater
around 8 s. At the end of the trajectory, the traveling distance
contributes most.</p>
      <p>According to our investigation, when domain experts
operate the crane, they first accelerate the crane. Then,
when the crane reaches the maximum velocity, they operate to
keep the crane at the maximum velocity. When the crane
comes close to the goal position, they decelerate the crane.
Thus, there are three phases in the operation of
domain experts. According to the experimental results, selecting
the start position as the background data makes sense for these
three phases of crane operation. However, in more complicated
use cases, there will be more phases. Different background data
should be selected for comparison with different patterns of data.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Our experiments show that different selections of background
data generate different explanations. Some of the
explanations match human intuition, while others are not
straightforward enough for humans to understand. Since the
calculation of attribution methods includes the selection of
background data, we claim that this is a key issue for
implementing attribution methods and reaching human-understandable
explanations. Therefore, we select the background data and
the generated explanation considering domain
knowledge and human intuition. Our proposed method explains
the policies with regard to the contribution of each input
state. We will verify our method with more use cases in
future work. How to embed domain knowledge and
human intuition in explanations so as to make them
understandable to both experts and non-experts alike is also an open
question.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Greydanus</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Koul</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dodge</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Visualizing and understanding atari agents</article-title>
          .
          <source>arXiv preprint arXiv:1711</source>
          .
          <fpage>00138</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Lundberg</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.-I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A unified approach to interpreting model predictions</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>4765</volume>
          -
          <fpage>4774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Bellemare,
          <string-name>
            <given-names>M. G.</given-names>
            ;
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            ;
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; et al.
          <year>2015</year>
          .
          <article-title>Humanlevel control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Shapley</surname>
            ,
            <given-names>L. S.</given-names>
          </string-name>
          <year>1953</year>
          .
          <article-title>A value for n-person games</article-title>
          .
          <source>Contributions to the Theory of Games</source>
          <volume>2</volume>
          (
          <issue>28</issue>
          ):
          <fpage>307</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maddison</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sifre</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>a</year>
          .; Van Den Driessche, G.;
          <string-name>
            <surname>Schrittwieser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Pa˜nneershelvam, V.;
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; et al.
          <year>2016</year>
          .
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>nature</source>
          <volume>529</volume>
          (
          <issue>7587</issue>
          ):
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>arXiv:1703</source>
          .
          <fpage>01365</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          2017.
          <article-title>AxarXiv preprint Sutton</article-title>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            , and
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Reinforcement learning: An introduction, Second edition</article-title>
          , volume
          <volume>1</volume>
          . MIT press Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Zahavy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ben-Zrihem</surname>
          </string-name>
          , N.; and
          <string-name>
            <surname>Mannor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Graying the black box: Understanding dqns</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          ,
          <fpage>1899</fpage>
          -
          <lpage>1908</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>