<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Attribution-based Salience Method towards Interpretable Reinforcement Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yuyao</forename><surname>Wang</surname></persName>
							<email>yuyao.wang.fe@hitachi.com</email>
						</author>
						<author>
							<persName><forename type="first">Masayoshi</forename><surname>Mase</surname></persName>
							<email>masayoshi.mase.mh@hitachi.com</email>
						</author>
						<author>
							<persName><forename type="first">Masashi</forename><surname>Egi</surname></persName>
							<email>masashi.egi.zj@hitachi.com</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Research &amp; Development Group Hitachi, Ltd</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Stanford University</orgName>
								<address>
									<settlement>Palo Alto</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Attribution-based Salience Method towards Interpretable Reinforcement Learning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0E6DF02BCA507BE808A9D0C8B1DE09C5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Reinforcement Learning (RL), a general learning, predicting and decision-making paradigm, has achieved great success in a wide range of games and robotics tasks. Recently, RL has also proven its worth in real-world scenarios, such as adaptive decision control and recommendation. It is promising to deploy RL in the real world to gain real benefits. However, RL is criticized for being a black box. The real systems are owned and operated by humans, who need to be reassured about the controller's intentions and need insights regarding failure cases. Therefore, policy explanation is important. Existing methods towards interpretable RL include Jacobian saliency maps and perturbation-based saliency maps, which are limited to visual-input problems. To model complicated real-world use cases, however, numerical data are widely employed. In this paper, we propose an attribution-based salience method that is applicable to both visual and numerical inputs. We aim to understand RL agents in terms of the information they attend to for decision making. We verify our method with a machine control use case. The explanations we provide are understandable to AI experts and non-experts alike. (short paper)</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Reinforcement learning (RL) is a general learning, predicting and decision-making paradigm. It provides solution methods for decision-making problems. RL has achieved remarkable success in a broad range of game-playing, continuous control and robotics tasks. Deep Reinforcement Learning (Deep RL) exceeded the human baseline in Atari games <ref type="bibr" target="#b2">(Mnih et al. 2015)</ref> and beat a professional human player in Go <ref type="bibr" target="#b4">(Silver et al. 2016)</ref>. Recently, RL has also proven its worth in real-world scenarios, such as production systems and recommendation. A growing number of real-world use cases show that it is promising to deploy RL in the real world to gain real benefits. However, many issues remain before RL can be widely deployed in the real world. One of them is that RL is a black box. The real systems are owned and operated by humans, who need to be reassured about the controller's intentions and need insights regarding failure cases. For this reason, policy explanation is important.</p><p>Research on Explainable Artificial Intelligence (XAI) has become increasingly popular in recent years. One line of research on post-hoc explanations focuses on explaining individual predictions by learning a local approximation of a model. SHAP <ref type="bibr" target="#b1">(Lundberg and Lee 2017)</ref> is one of the state-of-the-art techniques. SHAP decomposes an AI prediction into the sum of the contributions of each input feature. SHAP works well for regression and classification problems, but not for RL; we discuss this issue in later sections.</p><p>Existing methods for explaining deep RL include the Jacobian saliency map (Zahavy, Ben-Zrihem, and Mannor 2016) and the perturbation-based saliency map <ref type="bibr" target="#b0">(Greydanus et al. 2017</ref>). These tools are built for visual-input test beds and are not applicable to problems with numerical feature values. There is a need for an explanation method for numerical inputs, which are widely employed to model complicated real-world use cases. For example, in our machine control use case, the RL agent relies on sensor data to control the machine.</p><p>One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation <ref type="bibr" target="#b6">(Sutton and Barto 2018</ref>). Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment <ref type="bibr" target="#b6">(Sutton and Barto 2018)</ref>. These features make the explanations requested for RL different from those for other approaches. In this paper, we want to find out how RL agents make decisions. We aim to understand RL agents in terms of the information they attend to for decision making.</p><p>The contribution of the paper is as follows: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prerequisite Attribution Method</head><p>The concept of attribution is studied in various papers, such as Integrated Gradients <ref type="bibr" target="#b5">(Sundararajan, Taly, and Yan 2017)</ref> and SHAP <ref type="bibr" target="#b1">(Lundberg and Lee 2017)</ref>. We give the definition of attribution following the statements in the papers above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition (Attribution):</head><p>Suppose we have a function f : R^n → R^m that represents a model, and an input x = (x_1, ..., x_n) ∈ R^n. An attribution of the prediction at input x relative to a baseline input x′ is a vector φ(x, x′) = (φ_1, ..., φ_n) ∈ R^n, where φ_i is the contribution of x_i to the prediction f(x).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Shapley Value</head><p>Let f be the original prediction model and g the explanation model. The explanation model uses simplified inputs x′ that map to the original inputs through a mapping function x = h_x(x′). Assuming g(z′) ≈ f(h_x(z′)) whenever z′ ≈ x′, the attribution method is defined as</p><formula xml:id="formula_0">g(z′) = φ_0 + Σ_{i=1}^{N} φ_i z′_i (1)</formula><p>where z′ ∈ {0, 1}^N, N is the number of simplified input features, and φ_i ∈ R.</p><p>Assuming the four axioms of efficiency, symmetry, dummy and additivity, the attribution is proved to have a single unique solution, known as the Shapley value <ref type="bibr" target="#b3">(Shapley 1953)</ref> in cooperative game theory:</p><formula xml:id="formula_1">φ_i(f, x) = Σ_{z′ ⊆ x′} (|z′|! (N − |z′| − 1)! / N!) [f_x(z′) − f_x(z′ \ i)] (2)</formula><p>where |z′| is the number of non-zero entries in z′, and z′ ⊆ x′ ranges over all z′ vectors whose non-zero entries are a subset of the non-zero entries in x′.</p><p>SHAP (SHapley Additive exPlanation) (Lundberg and Lee 2017) is a state-of-the-art explanation framework using Shapley values. The SHAP value is defined as an approximation to Equation 2:</p><formula xml:id="formula_2">f_x(z′) = f(h_x(z′)) = E[f(z) | z_S]<label>(3)</label></formula><p>where S is the set of non-zero indexes in z′. Thus, the SHAP value attributes to each feature the change in the expected model prediction when that feature is toggled on. It explains how to get from the base value E[f(z)], which would be predicted if no features were known, to the model output f(x).</p></div>
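As an illustration of Equation 2, the brute-force computation below enumerates every feature subset. This is our own sketch, not the paper's code: "toggled-off" features are simply replaced by their baseline values, a crude stand-in for the conditional expectation of Equation 3.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values (Equation 2): for each feature i, average the
    marginal contribution f_x(S plus i) - f_x(S) over all subsets S of the
    other features, with the combinatorial weight |S|!(N-|S|-1)!/N!."""
    n = len(x)
    def value(subset):
        # features in `subset` are "on" (take their real value),
        # the rest are "off" (replaced by the baseline value)
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for s in combinations(others, k):
                on = set(s)
                weight = factorial(len(on)) * factorial(n - len(on) - 1) / factorial(n)
                phi += weight * (value(on | {i}) - value(on))
        phis.append(phi)
    return phis

def f_linear(z):
    return 2.0 * z[0] + 3.0 * z[1] + 1.0 * z[2]

print(shapley_values(f_linear, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
# for a linear model this recovers w_i * (x_i - baseline_i), i.e. about [2, 3, 1]
```

The enumeration visits 2^N subsets, which is exactly why KernelSHAP approximates this sum by sampling instead of computing it exactly.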
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem of Attribution Methods on RL</head><p>The effect of each feature on a prediction is calculated relative to a baseline prediction. The input features of the baseline prediction (or base value) are called background data (or reference data). Usually, in prediction tasks, the background data is set to zero or to the average value of the training dataset. In image recognition tasks, the background data can be a black image, i.e., an image with all pixel intensities set to zero. However, reinforcement learning builds its training data through exploitation and exploration in an uncertain environment. The dynamic learning process of a deep RL agent makes it problematic to use SHAP directly: according to our experimental results, different selections of the background data lead to different explanation results. We want to solve this problem in our work. We also want to understand deep RL agents in terms of which information from the environment they use to make decisions. This matches the intuition of post-hoc explanations. Among the group of attribution methods, we use SHAP to analyze RL. We focus on agents trained with Deep Q-Network (DQN) <ref type="bibr" target="#b2">(Mnih et al. 2015)</ref>. Figure <ref type="figure">1</ref> shows the intuition of our problem setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Attribution-based Salience Method towards interpretable RL Attribution generation</head><p>Deep RL agents learn what to do so as to maximize the cumulative reward, or value. In DQN, the value is approximated by the Q-function. The output of the DQN model is the Q-value for each action candidate. We adjust the original DQN model with an argmax operator in order to bridge the gap between the outputs and the action selection (decision making). We load the trained DQN model f_model from the deep RL agent and adjust its output by adding an activation layer. Note that this is done after the training process of our deep RL agent. In this way, the output of the modified model f_modified is the selected action, i.e., the one with the highest Q-value.</p><p>Next, we deal with the issue of background data. Instead of using one fixed set of background data, we embed domain knowledge to select the background data according to the environment the RL agent interacts with.</p><p>In an RL environment, we make a transition from one state s to the next state s′ by performing some action a and receiving a reward r. We load the learnt policy trajectory of our deep RL agent along the learning process and regard it as the dataset of our approach. Let P_1:t denote the trajectory of learnt policies from time step 1 to time step t; the trajectory file contains the state s and action a pair at each time step t. Therefore, we have P_t = P_t(s_t, a_t). Our background data is selected according to the trajectory P_1:t = P_1:t(s_1:t, a_1:t).</p><p>Then we calculate the attribution of each input, which is the SHAP value computed with our trained model and the selected background data.</p></div>
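The argmax adjustment described above can be sketched as follows. This is our own minimal illustration: toy_q and its weights are hypothetical stand-ins for a trained DQN, not the paper's model; the comment about shap.KernelExplainer describes how such a wrapped model is typically consumed.

```python
def make_decision_model(q_function):
    """Wrap a trained Q-network f_model so that the output is the selected
    action (the argmax over Q-values) rather than the raw Q-values."""
    def f_modified(states):
        actions = []
        for s in states:
            q_values = q_function(s)
            # argmax over action candidates = the agent's decision
            best = max(range(len(q_values)), key=lambda a: q_values[a])
            actions.append(best)
        return actions
    return f_modified

def toy_q(state):
    # Hypothetical linear Q-heads for two actions (accelerate, decelerate);
    # a real f_model would be the trained DQN.
    x, v, phi, omega = state
    return [1.0 * x + 0.5 * v + 0.2 * phi,
            -0.5 * x + 1.0 * v + 0.1 * phi + 0.3 * omega]

f_modified = make_decision_model(toy_q)
print(f_modified([(1.0, 0.0, 0.1, 0.0), (0.0, 1.0, 0.0, 0.2)]))  # [0, 1]
# In practice, f_modified and the trajectory-selected background data would
# then be handed to the attribution method, e.g. shap.KernelExplainer.
```

The wrapper is deliberately applied after training, so the explanation targets the deployed decision rule rather than the raw value estimates.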
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Salience Method</head><p>A higher attribution value means a bigger impact of the input on the output of the model. The impact of the input is </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment</head><p>We evaluated the proposed method on the automatic crane control use case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Automatic Crane Control</head><p>A crane is a type of machine, generally equipped with a hoist rope, wire ropes or chains, and sheaves, that can be used to lift and lower materials and to move them horizontally. We want to realize automatic control of a crane with a deep RL agent and explain the agent's policies. Figure <ref type="figure" target="#fig_0">2</ref> shows how we model the crane control problem.</p><p>The object is connected to a trolley with a piece of wire. The object is supposed to be delivered by the trolley from the start position to the goal position. Operators can send acceleration and deceleration signals to the trolley to accomplish the delivery. Note that the trolley can only travel horizontally on the rail. The trolley is either accelerated by a specific constant value until the travelling velocity reaches the maximum, or decelerated by the same value until the velocity reaches zero. As the trolley starts moving, the object starts swinging like a pendulum. The objective is to deliver the object to the goal position as quickly as possible, with negligible swinging at the goal position.</p><p>We applied our attribution-based salience method to the automatic crane control trajectory. We used KernelSHAP (Lundberg and Lee 2017) as the attribution method. We selected the start position as the background data. Figure <ref type="figure" target="#fig_1">4</ref> shows the SHAP value scores for the four states. The blue, orange, green and pink lines in the figure correspond to x, v, φ, and ω, respectively. The horizontal axis represents the attribution value score for each state.</p><p>The result shows that at the beginning, the RL agent cares most about the velocity of the trolley. Gradually, it pays attention to the angle of the wire, i.e., the swing, while travelling at high speed. It takes the traveling distance as the most important state near the goal.</p><p>The strategy above is different from the one usually taken by a human operator. A human operator first watches the traveling distance and velocity, driving the trolley and stopping near the goal as fast as possible. At that point, however, the wire is still swinging. The operator then watches the wire angle and accelerates and brakes the trolley slightly at appropriate wire angles to stabilize the swing at the goal position.</p><p>The RL agent delivers the object faster than a human operator because it does not stop near the goal position to wait for an appropriate swing angle. The adjustment of the swing phase is realized by paying attention to the swing angle and applying small accelerations and brakes while travelling at high speed, as described above. This result might be surprising for human operators but becomes intuitive after understanding the attention sequence of the RL agent. </p></div>
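The crane model described above can be simulated on a toy scale. The sketch below is our own assumption of the dynamics (the paper gives no equations): a trolley whose velocity changes by a fixed step per action, driving a pendulum through the trolley's acceleration; all constants are hypothetical.

```python
import math

# Toy crane simulator; the dynamics and all constants are our own
# assumptions for illustration, not taken from the paper.
DT, ACC, V_MAX, WIRE_LEN, G = 0.1, 0.5, 2.0, 5.0, 9.8

def step(state, action):
    """One time step. state = (x, v, phi, omega); action is +1 to
    accelerate, -1 to decelerate, 0 to coast."""
    x, v, phi, omega = state
    # trolley velocity changes by a fixed step, clipped to [0, V_MAX]
    v_new = min(max(v + action * ACC * DT, 0.0), V_MAX)
    a_trolley = (v_new - v) / DT
    # wire angle driven by gravity and by the trolley's acceleration
    alpha = -(G / WIRE_LEN) * math.sin(phi) - (a_trolley / WIRE_LEN) * math.cos(phi)
    omega_new = omega + alpha * DT
    phi_new = phi + omega_new * DT
    x_new = x + v_new * DT
    return (x_new, v_new, phi_new, omega_new)

# Accelerating for 1 s from rest moves the trolley and starts the swing,
# reproducing the pendulum behaviour described in the text.
s = (0.0, 0.0, 0.0, 0.0)
for _ in range(10):
    s = step(s, +1)
print(s)
```

A DQN agent for this task would observe exactly the four states (x, v, φ, ω) that the SHAP analysis above attributes over.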
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>In this section, we discuss the background data selection problem, taking automatic crane control as an example.</p><p>We also tried other candidate background data in comparative experiments, selecting the middle position and the goal position as the background data. Figure <ref type="figure" target="#fig_2">5</ref> shows the SHAP value results with the goal position selected as the background data. As shown in the figure, the traveling distance and traveling velocity are still the main features contributing to the decision making. In this case, the SHAP values of the traveling distance and the traveling velocity are of similar magnitude but point in different directions. At the beginning, the traveling distance contributes most, while near the goal position the traveling velocity contributes most. This is in contrast to what we observed in the experiment that used the start position as the background data.</p><p>Figure <ref type="figure">6</ref> shows the SHAP value results with the middle position selected as the background data. From 0 s to around 5 s, the traveling distance contributes the most. However, its contribution decreases from 5 s to 10 s, and the contributions of the other states become greater around 8 s. At the end of the trajectory, the traveling distance again contributes most.</p><p>According to our investigation, when domain experts operate the crane, they first accelerate it. Then, when the crane reaches the maximum velocity, they operate to keep it at that velocity. When the crane comes close to the goal position, they decelerate it.</p><p>There are thus three phases in the domain experts' operation. According to the experiment results, selecting the start position as the background data makes sense for these three phases of crane operation. However, more complicated use cases will have more phases. Different background data should then be selected for comparison with the different patterns of data.</p></div>
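The baseline dependence discussed above can be reproduced even with a much cruder attribution than SHAP. In the sketch below (our own illustration; the value function and all numbers are hypothetical), a single-feature occlusion attribution phi_i = f(x) - f(x with x_i reset to its baseline value) is computed under two baselines: with the start position as baseline the traveling-distance attribution is large and positive, while with the goal position as baseline it flips sign.

```python
def occlusion_attribution(f, x, baseline):
    """phi_i = f(x) - f(x with feature i reset to its baseline value):
    a crude stand-in for SHAP that exhibits the same baseline dependence."""
    phis = []
    for i in range(len(x)):
        z = list(x)
        z[i] = baseline[i]
        phis.append(f(x) - f(z))
    return phis

# Hypothetical mid-trajectory crane state (x, v, phi, omega) and toy value
# function; none of these numbers come from the paper.
def value_fn(s):
    return 2.0 * s[0] + 1.0 * s[1] - 3.0 * s[2]

state = [5.0, 2.0, 0.1, 0.0]
start_baseline = [0.0, 0.0, 0.0, 0.0]   # trolley at the start, at rest
goal_baseline = [10.0, 0.0, 0.0, 0.0]   # trolley at the goal, at rest

print(occlusion_attribution(value_fn, state, start_baseline))  # distance term about +10
print(occlusion_attribution(value_fn, state, goal_baseline))   # distance term about -10
```

The same model and the same state thus yield opposite-signed attributions for the distance feature, which is the phenomenon that motivates selecting the background data with domain knowledge.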
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>Our experiments show that different selections of background data generate different explanations. Some of the explanations match human intuition, while others are not straightforward enough for humans to understand. Since the calculation of attribution methods includes the selection of background data, we claim that this is a key issue for implementing attribution methods and reaching human-understandable explanations. Therefore, we select the background data and generate the explanation considering domain knowledge and human intuition. Our proposed method explains policies in terms of the contribution of each input state. We will verify our method with more use cases in future work. How to embed domain knowledge and human intuition in explanations so that they are understandable to experts and non-experts alike is also an open question.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Image of Automatic Crane Control Use Case</figDesc><graphic coords="3,59.85,217.31,226.80,114.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: SHAP Values (Background Data: Start Position)</figDesc><graphic coords="3,325.35,54.00,226.80,124.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: SHAP Values (Background Data: Goal Position)</figDesc><graphic coords="4,59.85,213.33,226.80,123.78" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Visualizing and understanding atari agents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Greydanus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dodge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fern</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.00138</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A unified approach to interpreting model predictions</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4765" to="4774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Humanlevel control through deep reinforcement learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Rusu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Veness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Fidjeland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ostrovski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">518</biblScope>
			<biblScope unit="page">529</biblScope>
			<date type="published" when="2015">2015. 7540</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A value for n-person games</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">S</forename><surname>Shapley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Contributions to the Theory of Games</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">28</biblScope>
			<biblScope unit="page" from="307" to="317" />
			<date type="published" when="1953">1953</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Mastering the game of go with deep neural networks and tree search</title>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Maddison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Sifre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Van Den Driessche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schrittwieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Antonoglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pãnneershelvam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lanctot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">nature</title>
		<imprint>
			<biblScope unit="volume">529</biblScope>
			<biblScope unit="page" from="484" to="489" />
			<date type="published" when="2016">2016. 7587</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Axiomatic attribution for deep networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sundararajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Taly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1703.01365</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Reinforcement learning: An introduction</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Barto</surname></persName>
		</author>
		<imprint>
			<publisher>MIT press Cambridge</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Graying the black box: Understanding dqns</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zahavy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ben-Zrihem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mannor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1899" to="1908" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
