<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An Agent-Based Learning Approach for Finding and Exploiting Heuristics in Unknown Environments</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daan</forename><surname>Apeldoorn</surname></persName>
							<email>daan.apeldoorn@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Technische Universität Dortmund</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gabriele</forename><surname>Kern-Isberner</surname></persName>
							<email>gabriele.kern-isberner@cs.tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Technische Universität Dortmund</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">An Agent-Based Learning Approach for Finding and Exploiting Heuristics in Unknown Environments</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8F5008AAD95C7B5CD69FF5401B777CF1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents an agent-based hybrid symbolic/subsymbolic learning approach as a basic model for the human ability of quickly recognizing and exploiting heuristic rules in an initially unknown environment. Instead of mechanically iterating a task until the optimal policy was found (as usually done by common Reinforcement Learning techniques), the agent finds heuristics expressed by symbolic knowledge bases which allow for an intelligent exploitation of knowledge gained from few experiences. For this purpose, a measure is proposed which allows an agent to decide on its own at which point during the learning process, its decisions should rely on the recognized heuristics rather than on a learned state-action pair representation. The approach will be evaluated both in the context of several grid world scenarios and a game from the GVGAI framework. It will be shown that our approach outperforms standard Q-Learning in the selected scenarios and arguments will be provided that also other agent-based machine learning techniques could profit from the proposed approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Common machine learning techniques used in the context of learning agents (e. g., Reinforcement Learning based approaches like Q-Learning <ref type="bibr" target="#b15">(Watkins 1989)</ref>, <ref type="bibr" target="#b14">(Sutton and Barto 1998)</ref>) are usually based on a trial-and-error principle: In unknown environments, agents based on such approaches start with purely random actions and successively improve their behavior as the learning process progresses. Having no a priori knowledge about the environment, such agents have to explore large amounts of the state-action space, even if the underlying structure of the environment follows simple rules.</p><p>Humans, in contrast, are known for successfully applying rough heuristics learned from very few examples. This, of course, can lead to wrong decisions (see, e. g., <ref type="bibr" target="#b5">(Dörner 1992)</ref> for several examples), however, in the common cases many problems can be quickly solved by applying such heuristics derived from few examples. One reason for that is, that many problem domains (e. g., games) are geared towards the commonsense human way of thinking, and these kinds of problems can therefore be quickly solved by exploiting such heuristics.</p><p>In this paper, we propose a hybrid symbolic/sub-symbolic agent model which is able to autonomously find and exploit useful heuristics in a previously unknown environment. During a Reinforcement Learning process, the agent creates a symbolic knowledge base from learned experiences which comprises rule-based knowledge on multiple levels of abstraction as a model for heuristics. We do not incorporate any a priori knowledge (as done e. g. in <ref type="bibr" target="#b10">(Shapiro, Langley, and Shachter 2001)</ref>-although this would be possible in our model), instead the heuristics are inferred by the agent itself during the learning process. Furthermore, the agent is able to decide itself at which point during the learning process the decisions should be relied on the found heuristics rather than on the weighted state-action pairs of the Reinforcement Learning process. More concretely, the following contributions are made:</p><p>• A hybrid symbolic/sub-symbolic agent model which incorporates both symbolic rule-based knowledge and weighted state-action pair representations learned by a sub-symbolic learning approach.</p><p>• An approach for estimating the subjective structural complexity of a previously unknown environment from an agent perspective during a learning process (based on a "commonsense" measure for strategic depth <ref type="bibr" target="#b3">(Apeldoorn and Volz 2017)</ref>).</p><p>Compared to our previous works on knowledge base extraction presented in <ref type="bibr" target="#b1">(Apeldoorn and Kern-Isberner 2016)</ref> and <ref type="bibr" target="#b2">(Apeldoorn and Kern-Isberner 2017)</ref>, the novelty of this paper lies in the incorporation of our extraction approach into a learning agent model which is able to decide on its own when to exploit the extracted knowledge as a heuristic in an unknown environment. 
This is realized by incorporating the strategic depth measure introduced in (Apeldoorn and Volz 2017) as a decision criterion.</p><p>As a side-product, the resulting agent model is still able to explain the learned knowledge and the decisions made to some extent at any point during the learning process, as described in <ref type="bibr" target="#b2">(Apeldoorn and Kern-Isberner 2017)</ref>.</p><p>In Section 2, related approaches will be discussed. A brief introduction to the fundamental preliminary works will be given in Section 3. The agent model will be introduced in Section 4. In Section 5, the model will be evaluated in the context of several simple grid world scenarios and later in different levels of a game from the General Video Game Playing Artificial Intelligence (GVGAI) framework, which is usually used for the GVGAI competition where agents have to compete in playing previously unknown video games <ref type="bibr" target="#b9">(Perez-Liebana et al. 2016</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Several attempts have been made so far to closer relate symbolic knowledge representation and sub-symbolic machine learning approaches in the context of agents. The basic ideas can be roughly assigned to one of the following groups (representatives for every group will be discussed subsequently):</p><p>1. Extraction techniques to gain different knowledge representations (e. g., propositional rules) learned from data through machine learning techniques 2. Methods to integrate a priori knowledge into machine learning approaches to support the learning process (e. g., by shaping the search space with a priori knowledge and later refining this knowledge with learning techniques)</p><p>3. Cognitive architectures that combine 1. and 2.</p><p>Representatives of the first group are, e. g., <ref type="bibr" target="#b13">(Sun 2002)</ref>, where extraction techniques have been proposed to gain simple rules or plans from reinforcement learning. In <ref type="bibr" target="#b6">(Junges and Klügl 2013)</ref>, decision trees are created from learned weighted state-action representations with the primary goal of supporting agent developers in the implementation of adequate agent behavior. These works focus on the extraction of knowledge in different forms. In contrast, in this paper, we propose to (re)incorporate the extracted knowledge back into the learning process as a model for finding and exploiting rough heuristics in a previously unknown environment.</p><p>As representatives for the second group, e. g., <ref type="bibr" target="#b11">(Singh et al. 2011</ref>) and <ref type="bibr" target="#b10">(Shapiro, Langley, and Shachter 2001</ref>) can be considered: In these approaches, machine learning techniques are combined with mechanisms to incorporate a priori knowledge to accelerate the learning process or to later refine and adapt the a priori knowledge to a dynamically changing environment (in <ref type="bibr" target="#b11">(Singh et al. 2011)</ref> this is realized in the context of a BDI agent model where initial beliefs which can be refined during a learning process). In contrast, our approach works (nearly) without any a priori knowledge about the environment.</p><p>The third group is represented, e. g., by models such as CLARION <ref type="bibr" target="#b12">(Sun, Peterson, and Merrill 1999)</ref>, a cognitive architecture which incorporates both sub-symbolic learning and symbolic knowledge during learning and decision making. In this model, a sub-symbolic representation is combined with a rule network on top, and both the sub-symbolic and the rule level are used in the agent's decision making process. In our approach, however, we assume a reasonable point during a learning process, where a transition from a sub-symbolic (implicit) knowledge to a symbolic (explicit) knowledge representation should take place with the primary objective to quickly find useful heuristics in a previously unknown environment. Another representative related to the third group, the SPHINX approach, is described in <ref type="bibr" target="#b7">(Leopold, Kern-Isberner, and Peters 2008)</ref>, where reinforcement learning is combined with ranking functions and belief revision to support the learning process. It was shown on a object recognition task that the considered learning agent can benefit from the combined approach (with and without involving additional background knowledge). 
However, in contrast to our approach, it was not investigated in which phase of the learning process the agent should base its behavior more on the explicit, symbolic part of the knowledge (i. e., the ranking function) rather than on the implicit representation of the weights learned through the reinforcement learning part. Furthermore, the symbolic part is not primarily dedicated to representing a model for heuristics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries</head><p>This section introduces the two preliminary works needed for our approach: First, the concept of exception-tolerant Hierarchical Knowledge Bases (HKBs) <ref type="bibr">(Apeldoorn and</ref><ref type="bibr">Kern-Isberner 2016, 2017)</ref> will be briefly introduced and how they can be retrieved (Sections 3.1 and 3.2).<ref type="foot" target="#foot_0">1</ref> HKBs will be used later as a model for heuristics. After that, a "commonsense" strategy measure (recently introduced in (Apeldoorn and Volz 2017)), is described which will be used later in our agent model to let an agent subjectively estimate whether or not heuristics should be exploited (Section 3.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Definition of HKBs</head><p>To introduce HKBs and how they can be learned, we closely follow <ref type="bibr">(Apeldoorn and</ref><ref type="bibr">Kern-Isberner 2016, 2017)</ref>. There, an extraction approach is proposed which is able to extract an HKB from a weighted state-action pair representations learned by a common reinforcement learning approach like Q-Learning <ref type="bibr" target="#b15">(Watkins 1989</ref>). 2  This section only provides the main definitions needed to understand the basic idea of HKBs. For more details on HKBs, the reader should refer to the original literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Basics of the Agent Model</head><p>We consider an agent which is learning a policy by acting autonomously in a previously unknown environment. The agent is equipped with n sensors through which it can perceive its current state in the environment. The agent is able to perform actions from a predefined action space and can furthermore perceive, whether or not the performed actions were good, in form of (numeric) rewards. The perceived rewards are then used to learn a weighted state-action pair representation where the weights determine which action has to be performed, given a perceived state (usually the one with the highest weight).</p><p>More formally, in such a representation, a state s is an element of a multi-dimensional state space S = S 1 × ... × S n where n is the number of the agent's sensors (through which the agent is able to perceive its state in the environment) and every S i is a set of possible sensor values of the corresponding sensor. Furthermore, the agent selects actions from a predefined action set A and the learned weights are stored in a multi-dimensional matrix Q = (q s1,...,sn,a ) with s i ∈ S i and a ∈ A. The weights can be learned by different machine learning approaches, provided that the learning approach converges such that given a state, the highest weight determines the best action to be selected (i. e., a max s1,...,sn = arg max a ∈A q s1,...,sn,a ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Exception-Tolerant Hierarchical Knowledge Bases</head><p>An HKB consists of rules which are organized on different levels of abstraction. In contrast to Exception Lists <ref type="bibr" target="#b8">(Michael 2011)</ref>, an HKB can handle multiple rules per level and the rules also comprise weights. To be able to define these rules, two different kinds of states and two different kinds of rules will be distinguished ( </p><formula xml:id="formula_0">⇒ a ρ [w ρ ]</formula><p>, where p ρ is either a complete state (in case of an complete rule) or a partial state (in case of a generalized rule), the conclusion a ρ ∈ A is an action of an agent's action space A and w ρ ∈ [0, 1] is the rule's weight.</p><p>Thus, complete rules map complete states to actions and generalized rules map partial states to actions. An HKB can now be defined as follows:</p><p>Definition 3 (Exception-Tolerant Hierarchical Knowledge Base) An exception-tolerant Hierarchical Knowledge Base (HKB) is an ordered set KB := {R 1 , ..., R n+1 } of n + 1 rule sets, where n is the number of sensors (i. e., the number of state space dimensions). Every set R i&lt;n+1 contains generalized rules and the set R n+1 contains complete rules, such that every premise</p><formula xml:id="formula_1">p ρ = s∈Sρ s of a rule ρ ∈ R i is of length |S ρ | = i − 1.</formula><p>According to Definition 3, the set R 1 contains the most general rules (with empty premises) and the set R n+1 contains the most specific (i. e., complete) rules.</p><p>For the relations of rules, the term of needed exception will be used, according to the following definition (cf. (Apeldoorn and Kern-Isberner 2017)):</p><p>Definition 4 (Needed Exception) A rule ρ ∈ R j&gt;1 is an exception to a rule τ ∈ R j−1 with premise p τ = s∈Sτ s, action a τ as conclusion and weight w τ , if S τ ⊂ S ρ and a ρ = a τ . The exception is needed, if there exists no other rule υ ∈ R j−1 with premise p υ = s∈Sυ s and action a υ as conclusion where S υ ⊂ S ρ , a υ = a ρ and w υ &gt; w τ .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Learning an HKB</head><p>An HKB can be extracted from a weighted state-action pair representation Q (that is learned, e. g., through a Reinforcement Learning technique) using the following approach (Apeldoorn and Kern-Isberner 2016): <ref type="foot" target="#foot_3">3</ref>The approach takes a weighted state-action pair representation Q as input and returns an HKB KB Q which reflects the knowledge contained in Q by performing the following steps:</p><p>1. Initial creation of rule sets: In the first step, the multiple abstraction levels R 1 , ..., R n+1 of the knowledge base are initially filled with rules. We only consider state-action pairs here that contribute to the best policy found during the learning process so far.<ref type="foot" target="#foot_4">4</ref> </p><p>2. Removal of worse rules: In all sets R j , a rule ρ ∈ R j is removed, if there exists another rule σ ∈ R j with the same partial state as premise having a higher weight (i. e., in every set R j only the best rules for a given partial state are kept).</p><p>3. Removal of worse more specific rules: In all sets R j&gt;1 , a rule ρ ∈ R j with premise p ρ = s∈Sρ s, conclusion a ρ and weight w ρ is removed, if there exists a more general rule σ ∈ R j &lt;j with premise p σ = s∈Sσ s where S σ ⊂ S ρ = {s 1 , ..., s j−1 } and weight w σ ≥ w ρ .</p><p>4. Removal of too specific rules: In all sets R j , a rule ρ ∈ R j&gt;1 with premise p ρ = s∈Sρ and conclusion a ρ is removed, if there exists a more general rule σ ∈ R j &lt;j with the same action a σ = a ρ as conclusion and with premise p σ = s∈Sσ s where S σ ⊂ S ρ = {s 1 , ..., s j−1 } and if ρ is not a needed exception to a rule τ ∈ R j−1 .</p><p>5. Optional filter step: Optionally, filters may be applied to filter out further rules which are, e. g., helpful to explain the knowledge contained in Q through the optimal found policy so far, but which are not needed for reasoning later.</p><p>After performing these steps on Q, the knowledge base KB Q comprises all sets R j = ∅ with the extracted rules representing the implicit knowledge contained in the learned weights of Q in a compact way.</p><p>Reasoning on HKBs A reasoning algorithm on HKBs was proposed in (Apeldoorn and Kern-Isberner 2016): Given perceived sensor values s 1 , ..., s n , this algorithm searches an HKB upwards, starting from the bottom-most level R n+1 , for the first rule whose premise is fulfilled. This rule is then returned as concluding action (see <ref type="bibr" target="#b1">(Apeldoorn and Kern-Isberner 2016)</ref> for details). By this, the algorithm selects the most specific rule that fits to the perceived sensor values and falls back to the next more unspecific rule as a heuristic, in case no more specific rule with a fitting premise could be found.</p><p>Due to the possibility of falling back to more general rules in case no matching rule could be found for the given sensor values, the reasoning mechanism on an HKB allows for drawing meaningful conclusions over states that have not been seen before by an agent. The idea that a more general rule serves as a rough heuristic for unknown states (for which no better knowledge is available) will later contribute to accelerate the learning process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Strategic Depth</head><p>Based on an HKB, in <ref type="bibr" target="#b3">(Apeldoorn and Volz 2017)</ref>, a strategy measure was introduced which seems to adequately reflect the commonsense strategic depth of games. The measure was evaluated in the context of a survey. This measure will be used later in our agent model, to let the agent judge subjectively when to apply heuristics. According to (Apeldoorn and Volz 2017), the measure is defined as follows:</p><p>Definition 4 (Strategic Depth Measure d s ) Let KB = {R 1 , ..., R n+1 } be an HKB and S = {S 1 , ..., S n } be the set of all sensor value sets of an agent, then the strategic depth is defined as a function</p><formula xml:id="formula_2">d s (KB, S) := n+1 i=1 n i − 1 b i−1 |R i | S⊆S |S|=i−1 S∈S |S| ,<label>(1)</label></formula><p>where n + 1 = |KB| = |S| + 1 is the number of levels in KB, b is a weighting constant, and R i ∈ KB is the i-th level of KB.</p><p>As rules on a level R j&gt;1 can be considered exceptions to the rules on level R j−1 , the intuition behind d s is that it measures the weighted relative number of rules (hence, exceptions) represented by an HKB.</p><p>In <ref type="bibr" target="#b3">(Apeldoorn and Volz 2017)</ref>, also a modified version of the measure is proposed, where d s is additionally divided by the sum of weights to normalize it to values of range [0, 1]. For our purpose, this version of the measure is extremely useful, since it allows our agent model to estimate the strategic depth of a previously unknown environment relatively to the maximum strategic depth that can be expected, given the sets of possible sensor values S 1 , ..., S n . Thus, in the following, the normalized version of the measure (according to (Apeldoorn and Volz 2017)) will be used: (2)</p><p>In the following, ds will be used in our agent model to allow the agent to estimate subjectively when it could be useful to exploit heuristics during a learning process in an a priori unknown environment: If ds falls below a certain threshold th then this indicates that heuristics could possibly be applied successfully.</p><p>The intuition behind this is that the (normalized) strategic depth based on an HKB which was extracted from a weighted state-action pair representation learned by an agent, will successively decrease as the agent's learning process progresses: Since the agent's knowledge about the problem increases, the problem appears successively less complex to the agent and the subjectively measures normalized strategy depth converges to the de facto strategic depth of the task to be learned (i. e., the strategic depth according to the HKB of the optimal policy to solve the task). We will demonstrate this in the following on several examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Examples for Subjective Normalized Strategic Depth</head><p>We consider four different grid worlds of different structural complexities where an agent has to learn to navigate from a starting point A to a target point B (see Figure <ref type="figure">2</ref>). We use standard Q-Learning <ref type="bibr" target="#b15">(Watkins 1989</ref>) as a learning approach with a learning rate of α = 0.1, a random action probability of = 0.1 and a discount factor γ = 0.9. We run every scenario for 250 runs and we measure the normalized strategic depth subjectively perceived by the agent (averaged over 30 repetitions) in dependency of the number of runs already performed. Two phenomena can be observed in Figure <ref type="figure">2:</ref> • The subjective strategic depth converges to the de facto strategic depth of the respective scenario as the agent behavior converges to the optimal policy.</p><p>• The subjective strategic depth decreases in general as the learning process progresses, since the agent's knowledge about the scenario increases, and therefore the scenarios appears to be simpler to the agent. (In case of the fourth scenario, after the first runs, the agent seems to underestimate the depth of the scenario, since the scenario locally has straight path structures that form together a more complex path from a global point of view.)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Agent Model</head><p>This section introduces our model of a learning agent which is able to determine and exploit heuristics based on HKBs during a learning process in a previously unknown environment. In our agent model, a sub-symbolic learning approach can be used to learn weighted state-action pairs (where the maximum weight determines the preferred action, given a state). Figure <ref type="figure" target="#fig_1">1</ref> illustrates the main ideas of our agent model which mainly consists of two parts:</p><p>• an initialization part which is executed every time before a new learning episode starts, and</p><p>• a common agent cycle which is executed every time step.</p><p>In the former, the agent decides, whether it relies its decisions on the weighted state-action pairs (represented by the Q matrix) or on the heuristics (represented by the current  In the latter, the agent either chooses a random action (for exploration purpose) or decides to choose its action according to the Q matrix or the extracted HKB, respectively, depending on the decision made during the initialization. Note that in Figure <ref type="figure" target="#fig_1">1</ref>, components belonging to the sub-symbolic learning approach are indicated by a gray coloring, whereas those parts belonging to our heuristics extension are colored white. The gray parts could be replaced by a different learning approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>This Section first evaluates the approach in the four grid world scenarios from Figure <ref type="figure">2</ref> in Section 3.3. Later, the approach will be evaluated additionally in a more realistic example: a game from the GVGAI framework (Perez-Liebana et al. 2016).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Evaluation in the Context of Grid World Scenarios</head><p>We run the four grid world Scenarios from Figure <ref type="figure">2</ref> in Section 3.3 for 50 runs each and measure the percentage of how many times the optimal policy was found by the agent over 200 repetitions of the experiment. As learning approach we use again Q-Learning with the same parameters used for measuring the subjective strategic depth at the end of Section 3.3. As threshold parameter to determine the exploitation of found heuristics by means of the subjective strategic depth ds , we choose a threshold of th = 0.2. According to Figure <ref type="figure">2</ref>, this means that-in average-the agent would try to exploit the found heuristics after</p><p>• ≈ 25 runs in case of the first scenario,</p><p>• ≈ 75 runs in case of the second scenario and</p><p>• ≈ 100 runs in case of the third scenario.</p><p>For the second and the third scenario, this may sound confusing since we perform a total number of 50 runs only. However, the reader should be aware that Figure <ref type="figure">2</ref> shows the average development of ds measure and thus there are enough cases where the agent individually decides to exploit a found heuristic that leads to an optimal policy.</p><p>As for the fourth scenario, the agent never considers to exploit any heuristics, since the subjective strategic depth does not decrease below th = 0.2. This can be interpreted in a way that the scenario comprises too many exceptions, such that an exploitation of heuristics does not make sense from the agent's point of view (this point of view is in fact what the parameter th controls, depending on whether the desired agent should be more "heuristics-affine" or more "conservative"). against a standard Q-Learning agent with the same parameters for the learning part. In addition, for reasons of comparison, Table <ref type="table" target="#tab_2">1</ref> also provides results of a heuristics-only version of the agent model<ref type="foot" target="#foot_5">5</ref> , where the agent was enforced to rely its decisions always on the extracted heuristics (i. e., in the initialization phase in Figure <ref type="figure" target="#fig_1">1</ref> the action selection mode is always set to HKB action selection and the weights contained in the Q matrix are never considered directly for action selection in the agent cycle).</p><p>As can be seen in Table <ref type="table" target="#tab_2">1</ref>, standard Q-Learning rarely manages to solve the grid worlds during the first 50 runs, whereas our heuristics approach clearly outperforms these results. As for the heuristics-only version, the results are slightly worse: This seems to be the case, since the agent starts too early to rely its decisions on the heuristics (which are nearly random in the beginning of the learning process). This may prevent the agent from further exploring the environment to an adequate extent and may result in local optimal policies.</p><p>One could argue that standard Q-Learning is a rather old approach and there are nowadays more elaborate learning approaches available. However, if a better learning approach will be chosen in our setup (better in the sense that it will converge faster to the optimal policy), the extracted HKBs and hence the measure ds will be more accurate as well and therefore, better heuristics could be exploited earlier during the learning process. 
This means that, in our approach, both the heuristics and the measure d̃ s (used to decide when to exploit them) will improve with the quality of the learning approach used for the agent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Evaluation in the Context of a GVGAI Game</head><p>We evaluate the agent model now in a more realistic example by consider the game Camel Race from the GVGAI framework <ref type="bibr" target="#b9">(Perez-Liebana et al. 2016)</ref>. In this game, the agent has to control a camel which must be moved to the opposite site of the screen-faster than any other camel in the game (additionally avoiding obstacles in the higher levels). Figure <ref type="figure" target="#fig_2">3</ref> shows the first and the third level of the game (with obstacles).<ref type="foot" target="#foot_6">6</ref>   </p><formula xml:id="formula_3">× S C2 x × S C2 y × S C3</formula><p>x containing the coordinates of the camels in the game (where C 1 , ..., C 3 correspond to the three camels, C 2 is the camel in the middle controlled by the agent and C 1 , C 3 only move in x direction) and A = {up, down, left, right, none} are the five actions that can be performed by the agent. As reward, the agent perceives the current distance to the fastest opponent camel in x direction.</p><p>Even if the game is rather simple, it has an interesting aspect: Due to the dynamics of the environment caused by the movement of the two opponent camels, our agent perceives new and previously unseen states in nearly every game tick of the early learning phase. Thus, the agent has to explore large parts of the state-action space to improve its behavior, although the game could be easily won by applying a rough heuristic like always move right ( ⇒ right [1.0]) for the first level. This renders the game an eligible test environment for our agent model.</p><p>We perform 100 runs for each level and measure the percentage of wins by the agent over 30 repetitions. Again, standard Q-Learning with and without the heuristics approach, as well as the heuristics-only approach are used, given the same standard parameters as provided in Section 5.2. Table <ref type="table" target="#tab_3">2</ref> shows the results for the first and the third level of the game: With standard Q-Learning, the agent is not able to win the game within 100 runs, whereas using the heuristics approach, the agent wins both levels in over 50% of the cases. Surprisingly, the heuristics-only approach performs even better here. This seems to be the case since-although the game has some dynamics and can thus be considered more complex than the grid world scenarios-the considered levels allow for rougher heuristics than the grid worlds: In contrast to the grid worlds, the goal to be reached is not located in a single corner, instead, the game can be won by simply reaching the rightmost side of the screen (which can already be achieved by a very rough heuristic).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>In this paper, we described a model of a learning agent which is able to find and to exploit heuristics without a priori knowledge in unknown environments. We showed on several examples (four grid world scenarios and one game from the GVGAI framework) that our approach drastically accelerates the learning process to find optimal policies.</p><p>However, the approach is not perfect yet and leaves room for future work: (1) The runtime to extract an HKB should be further improved (a faster algorithm has been proposed in (Apeldoorn and Kern-Isberner 2017) already; cf. footnote 4) and (2) a quality criterion could be incorporated into the measure that helps to decide whether heuristics should be exploited, depending on previous experiences with applying these heuristics. The latter could possibly be useful to help preventing an agent from repeatedly trying out bad heuristics.</p><p>Furthermore, it could be interesting to extend the model with revision techniques for the found heuristics and to incorporate mechanisms to store and reuse heuristics.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Definition 5 (</head><label>5</label><figDesc>Normalized Strategic Depth ds ) Let KB, S, n, b and d s be as in Definition 5. Then the normalized strategic depth is defined as a function ds (KB, S) := d s (KB, S)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Learning Agent Model with the Ability of Finding and Exploiting Heuristics</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: First Level (a) and Third Level (b) of the GVGAI Game Camel Race</figDesc><graphic coords="7,306.84,63.43,254.14,50.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 :</head><label>1</label><figDesc>Table 1 contains the comparison of our agent model Standard Q-Learning vs Heuristics Approach in the Context of the Grid Worlds Scenarios-The table shows the percentage of 100 repetitions in which the optimal policy was found by the agent during the first 50 runs.</figDesc><table><row><cell>(a)</cell><cell cols="2">Avg. Subjective Norm.</cell><cell>Strategic Depth</cell><cell>0.0 0.2 0.4 0.6 0.8 1.0</cell><cell>1</cell><cell>50</cell><cell>100</cell><cell>150</cell><cell>200</cell><cell>250</cell><cell>Scenario 1] Scenario 2] Scenario 3] Scenario 4]</cell><cell>With Heuristics Heuristics Q-Learning (th = 0.2) Standard Only 3.5% 66.5% 43.0% 4.5% 20.0% 10.0% 0.0% 10.5% 0.03% 0.0% 0.0% 0.0%</cell></row><row><cell>(b)</cell><cell cols="2">Avg. Subjective Norm.</cell><cell>Strategic Depth</cell><cell>0.0 0.2 0.4 0.6 0.8 1.0</cell><cell>1</cell><cell>50</cell><cell>100</cell><cell>150</cell><cell>200</cell><cell>250</cell><cell></cell></row><row><cell>(c)</cell><cell>Avg. Subjective Norm.</cell><cell cols="2">Strategic Depth</cell><cell>0.0 0.2 0.4 0.6 0.8 1.0</cell><cell>1</cell><cell>50</cell><cell>100</cell><cell>150</cell><cell>200</cell><cell>250</cell><cell></cell></row><row><cell>(d)</cell><cell cols="2">Avg. Subjective Norm.</cell><cell>Strategic Depth</cell><cell>0.0 0.2 0.4 0.6 0.8 1.0</cell><cell>1</cell><cell>50</cell><cell>100</cell><cell>150</cell><cell>200</cell><cell>250</cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Number of Runs r</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="11">Figure 2: Examples of Subjective Normalized Strategic</cell><cell></cell></row><row><cell cols="11">Depth for the Considered Grid Worlds Scenarios of Increas-</cell><cell></cell></row><row><cell cols="11">ing Structural Complexity with the Optimal Policies Indi-</cell><cell></cell></row><row><cell cols="11">cated by Arrows-Scenarios 1 (a) and 2 (b) are taken from</cell><cell></cell></row><row><cell cols="11">(Apeldoorn and Kern-Isberner 2016), Scenario 3 (c) is from</cell><cell></cell></row><row><cell cols="11">(Apeldoorn and Kern-Isberner 2017) and Scenario 4 (d)</cell><cell></cell></row><row><cell cols="8">from (Apeldoorn and Volz 2017).</cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>Standard Q-Learning vs Heuristics Approach in the Context of the GVGAI Game Camel Race-The table shows the percentage of 30 repetitions in which the agent won the game in the first 100 runs.</figDesc><table><row><cell>The agent's state space is described by the sensor value</cell></row><row><cell>sets S C1 x</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Note that in(Apeldoorn and Kern-Isberner  </note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2016" xml:id="foot_1">,</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2017" xml:id="foot_2">), exception-tolerant Hierarchical Knowledge Bases were originally called Hierarchical Knowledge Bases-but to avoid confusion with the Hierarchical Knowledge Bases by Borgida et al.<ref type="bibr" target="#b4">(Borgida and Etherington 1989)</ref> and in order to further outline the primary purpose of our approach, we call them exception-tolerant here.2 Note that the used learning approach is not of importance as long as it results in a weighted state-action pair representation, where the weights determine which action has to be chosen given a perceived state (usually the one with the highest weight).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3">Note that a faster version of the algorithm improving its first step is proposed in (Apeldoorn and Kern-Isberner 2017) (cf. foot-</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4">note 4).4  Furthermore, since the number of possible premises of the rules can grow drastically with the size of the given problem instance, an efficient algorithm has to be used here. An attempt for efficiently preselecting the rules using adapted ideas from the APRIORI algorithm<ref type="bibr" target="#b0">(Agrawal et al. 1996)</ref> has been made in (Apeldoorn and Kern-Isberner 2017) to increase the overall performance of the extraction approach.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5">Thanks to an anonymous reviewer for proposing this idea.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6">The game was slightly modified for our purpose by reducing the number of camels, in order to reduce the dimensionality of the state-action space without changing the basic game mechanics.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Fast Discovery of Association Rules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mannila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Toivonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Verkamo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Knowledge discovery and Data Mining</title>
				<meeting><address><addrLine>Cambridge, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="307" to="328" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">When should learning agents switch to explicit knowledge?</title>
		<author>
			<persName><forename type="first">D</forename><surname>Apeldoorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kern-Isberner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">GCAI 2016. 2nd Global Conference on Artificial Intelligence</title>
		<title level="s">EPiC Series in Computing</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Benzmüller</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Sutcliffe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Rojas</surname></persName>
		</editor>
		<imprint>
			<publisher>EasyChair Publications</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="174" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Towards an understanding of what is learned: Extracting multi-abstractionlevel knowledge from learning agents</title>
		<author>
			<persName><forename type="first">D</forename><surname>Apeldoorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kern-Isberner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Rus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Markov</surname></persName>
		</editor>
		<meeting>the Thirtieth International Florida Artificial Intelligence Research Society Conference<address><addrLine>Palo Alto, California</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="764" to="767" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Measuring strategic depth in games using hierarchical knowledge bases</title>
		<author>
			<persName><forename type="first">D</forename><surname>Apeldoorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Volz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computational Intelligence and Games (CIG)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="9" to="16" />
		</imprint>
	</monogr>
	<note>To be published</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hierarchical knowledge bases and efficient disjunctive reasoning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Borgida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Etherington</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First International Conference on Principles of Knowledge Representation Reasoning</title>
				<editor>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Brachman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Levesque</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Reiter</surname></persName>
		</editor>
		<meeting>the First International Conference on Principles of Knowledge Representation and Reasoning</meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="33" to="43" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Dörner</surname></persName>
		</author>
		<title level="m">Die Logik des Mißlingens -Strategisches Denken in komplexen Situationen</title>
				<meeting><address><addrLine>Reinbek; Hamburg</addrLine></address></meeting>
		<imprint>
			<publisher>Rowohlt Taschenbuch Verlag</publisher>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Learning tools for agentbased modeling and simulation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Junges</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Klügl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KI -Künstliche Intelligenz</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="273" to="280" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Belief Revision with Reinforcement Learning for Interactive Object Recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Leopold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kern-Isberner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Peters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECAI 2008 -18th European Conference on Artificial Intelligence Proceedings</title>
				<meeting><address><addrLine>Amsterdam</addrLine></address></meeting>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="65" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Causal Learnability</title>
		<author>
			<persName><forename type="first">L</forename><surname>Michael</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), 1014-1020</title>
				<meeting>the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), 1014-1020<address><addrLine>Palo Alto, California</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">General video game ai: Competition, challenges and opportunities</title>
		<author>
			<persName><forename type="first">D</forename><surname>Perez-Liebana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Samothrakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Togelius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schaul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lucas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Wellman</surname></persName>
		</editor>
		<meeting>the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference<address><addrLine>Palo Alto, California</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4335" to="4337" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Using Background Knowledge to Speed Reinforcement Learning in Physical Agents</title>
		<author>
			<persName><forename type="first">D</forename><surname>Shapiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Langley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shachter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AGENTS &apos;01 Proceedings of the fifth international conference on Autonomous agents</title>
				<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="254" to="261" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Integrating learning into a bdi agent for environments with changing dynamics</title>
		<author>
			<persName><forename type="first">D</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sardina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Padgham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>James</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Walsh</surname></persName>
		</editor>
		<meeting>the Twenty-Second International Joint Conference on Artificial Intelligence<address><addrLine>Menlo Park, California</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press/International Joint Conferences on Artificial Intelligence</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="2525" to="2530" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A hybrid architecture for situated learning of reactive sequential decision making</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Peterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Merrill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="109" to="127" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Knowledge Extraction from Reinforcement Learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">New Learning Paradigms in Soft Computing</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="170" to="180" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Reinforcement Learning: An Introduction</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barto</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>The MIT Press</publisher>
			<pubPlace>Cambridge, Massachusetts; London, England</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning from Delayed Rewards</title>
		<author>
			<persName><forename type="first">C</forename><surname>Watkins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">England</title>
				<imprint>
			<date type="published" when="1989">1989</date>
		</imprint>
		<respStmt>
			<orgName>University of Cambridge</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
