An Agent-Based Learning Approach for Finding and Exploiting Heuristics in Unknown Environments

Daan Apeldoorn, Gabriele Kern-Isberner
Technische Universität Dortmund
daan.apeldoorn@tu-dortmund.de, gabriele.kern-isberner@cs.tu-dortmund.de

Abstract

This paper presents an agent-based hybrid symbolic/sub-symbolic learning approach as a basic model for the human ability of quickly recognizing and exploiting heuristic rules in an initially unknown environment. Instead of mechanically iterating a task until the optimal policy is found (as usually done by common Reinforcement Learning techniques), the agent finds heuristics expressed by symbolic knowledge bases which allow for an intelligent exploitation of knowledge gained from few experiences. For this purpose, a measure is proposed which allows an agent to decide on its own at which point during the learning process its decisions should rely on the recognized heuristics rather than on a learned state-action pair representation. The approach is evaluated both in the context of several grid world scenarios and a game from the GVGAI framework. It is shown that our approach outperforms standard Q-Learning in the selected scenarios, and arguments are provided that other agent-based machine learning techniques could profit from the proposed approach as well.

1 Introduction

Common machine learning techniques used in the context of learning agents (e.g., Reinforcement Learning based approaches like Q-Learning (Watkins 1989; Sutton and Barto 1998)) are usually based on a trial-and-error principle: In unknown environments, agents based on such approaches start with purely random actions and successively improve their behavior as the learning process progresses. Having no a priori knowledge about the environment, such agents have to explore large amounts of the state-action space, even if the underlying structure of the environment follows simple rules.
Humans, in contrast, are known for successfully applying rough heuristics learned from very few examples. This can, of course, lead to wrong decisions (see, e.g., (Dörner 1992) for several examples); in the common case, however, many problems can be quickly solved by applying such heuristics derived from few examples. One reason for this is that many problem domains (e.g., games) are geared towards the commonsense human way of thinking, and these kinds of problems can therefore be quickly solved by exploiting such heuristics.

In this paper, we propose a hybrid symbolic/sub-symbolic agent model which is able to autonomously find and exploit useful heuristics in a previously unknown environment. During a Reinforcement Learning process, the agent creates a symbolic knowledge base from learned experiences which comprises rule-based knowledge on multiple levels of abstraction as a model for heuristics. We do not incorporate any a priori knowledge (as done, e.g., in (Shapiro, Langley, and Shachter 2001)—although this would be possible in our model); instead, the heuristics are inferred by the agent itself during the learning process. Furthermore, the agent is able to decide itself at which point during the learning process its decisions should rely on the found heuristics rather than on the weighted state-action pairs of the Reinforcement Learning process. More concretely, the following contributions are made:

• A hybrid symbolic/sub-symbolic agent model which incorporates both symbolic rule-based knowledge and weighted state-action pair representations learned by a sub-symbolic learning approach.

• An approach for estimating the subjective structural complexity of a previously unknown environment from an agent perspective during a learning process (based on a "commonsense" measure for strategic depth (Apeldoorn and Volz 2017)).

Compared to our previous works on knowledge base extraction presented in (Apeldoorn and Kern-Isberner 2016) and (Apeldoorn and Kern-Isberner 2017), the novelty of this paper lies in the incorporation of our extraction approach into a learning agent model which is able to decide on its own when to exploit the extracted knowledge as a heuristic in an unknown environment. This is realized by incorporating the strategic depth measure introduced in (Apeldoorn and Volz 2017) as a decision criterion.

As a side product, the resulting agent model is still able to explain the learned knowledge and the decisions made to some extent at any point during the learning process, as described in (Apeldoorn and Kern-Isberner 2017).

In Section 2, related approaches will be discussed. A brief introduction to the fundamental preliminary works will be given in Section 3. The agent model will be introduced in Section 4. In Section 5, the model will be evaluated in the context of several simple grid world scenarios and later in different levels of a game from the General Video Game Playing Artificial Intelligence (GVGAI) framework, which is usually used for the GVGAI competition where agents have to compete in playing previously unknown video games (Perez-Liebana et al. 2016).

2 Related Work

Several attempts have been made so far to relate symbolic knowledge representation and sub-symbolic machine learning approaches more closely in the context of agents. The basic ideas can be roughly assigned to one of the following groups (representatives for every group will be discussed subsequently):

1. Extraction techniques to gain different knowledge representations (e.g., propositional rules) learned from data through machine learning techniques

2. Methods to integrate a priori knowledge into machine learning approaches to support the learning process (e.g., by shaping the search space with a priori knowledge and later refining this knowledge with learning techniques)

3. Cognitive architectures that combine 1. and 2.
Representatives of the first group are, e.g., (Sun 2002), where extraction techniques have been proposed to gain simple rules or plans from reinforcement learning. In (Junges and Klügl 2013), decision trees are created from learned weighted state-action representations with the primary goal of supporting agent developers in the implementation of adequate agent behavior. These works focus on the extraction of knowledge in different forms. In contrast, in this paper, we propose to (re)incorporate the extracted knowledge back into the learning process as a model for finding and exploiting rough heuristics in a previously unknown environment.

As representatives of the second group, e.g., (Singh et al. 2011) and (Shapiro, Langley, and Shachter 2001) can be considered: In these approaches, machine learning techniques are combined with mechanisms to incorporate a priori knowledge to accelerate the learning process, or to later refine and adapt the a priori knowledge to a dynamically changing environment (in (Singh et al. 2011), this is realized in the context of a BDI agent model with initial beliefs that can be refined during a learning process). In contrast, our approach works (nearly) without any a priori knowledge about the environment.

The third group is represented, e.g., by models such as CLARION (Sun, Peterson, and Merrill 1999), a cognitive architecture which incorporates both sub-symbolic learning and symbolic knowledge during learning and decision making. In this model, a sub-symbolic representation is combined with a rule network on top, and both the sub-symbolic and the rule level are used in the agent's decision making process. In our approach, however, we assume a reasonable point during a learning process where a transition from a sub-symbolic (implicit) to a symbolic (explicit) knowledge representation should take place, with the primary objective to quickly find useful heuristics in a previously unknown environment. Another representative related to the third group, the SPHINX approach, is described in (Leopold, Kern-Isberner, and Peters 2008), where reinforcement learning is combined with ranking functions and belief revision to support the learning process. It was shown on an object recognition task that the considered learning agent can benefit from the combined approach (with and without involving additional background knowledge). However, in contrast to our approach, it was not investigated in which phase during the learning process the agent should base its behavior more on the explicit, symbolic part of the knowledge (i.e., the ranking function) rather than on the implicit representation of the weights learned through the reinforcement learning part. Furthermore, the symbolic part is not primarily dedicated to representing a model for heuristics.

3 Preliminaries

This section introduces the two preliminary works needed for our approach: First, the concept of exception-tolerant Hierarchical Knowledge Bases (HKBs) (Apeldoorn and Kern-Isberner 2016, 2017) will be briefly introduced, together with how they can be retrieved (Sections 3.1 and 3.2).¹ HKBs will be used later as a model for heuristics. After that, a "commonsense" strategy measure (recently introduced in (Apeldoorn and Volz 2017)) is described, which will be used later in our agent model to let an agent subjectively estimate whether or not heuristics should be exploited (Section 3.3).

¹ Note that in (Apeldoorn and Kern-Isberner 2016, 2017), exception-tolerant Hierarchical Knowledge Bases were originally called Hierarchical Knowledge Bases—but to avoid confusion with the Hierarchical Knowledge Bases by Borgida et al. (Borgida and Etherington 1989), and in order to further outline the primary purpose of our approach, we call them exception-tolerant here.

3.1 Definition of HKBs

To introduce HKBs and how they can be learned, we closely follow (Apeldoorn and Kern-Isberner 2016, 2017). There, an extraction approach is proposed which is able to extract an HKB from a weighted state-action pair representation learned by a common reinforcement learning approach like Q-Learning (Watkins 1989).² This section only provides the main definitions needed to understand the basic idea of HKBs. For more details on HKBs, the reader should refer to the original literature.

² Note that the used learning approach is not of importance as long as it results in a weighted state-action pair representation where the weights determine which action has to be chosen given a perceived state (usually the one with the highest weight).

Basics of the Agent Model We consider an agent which is learning a policy by acting autonomously in a previously unknown environment. The agent is equipped with n sensors through which it can perceive its current state in the environment. The agent is able to perform actions from a predefined action space and can furthermore perceive, in the form of (numeric) rewards, whether or not the performed actions were good. The perceived rewards are then used to learn a weighted state-action pair representation where the weights determine which action has to be performed, given a perceived state (usually the one with the highest weight).

More formally, in such a representation, a state s is an element of a multi-dimensional state space S = S_1 × ... × S_n, where n is the number of the agent's sensors (through which the agent is able to perceive its state in the environment) and every S_i is a set of possible sensor values of the corresponding sensor. Furthermore, the agent selects actions from a predefined action set A, and the learned weights are stored in a multi-dimensional matrix Q̂ = (q_{s_1,...,s_n,a}) with s_i ∈ S_i and a ∈ A. The weights can be learned by different machine learning approaches, provided that the learning approach converges such that, given a state, the highest weight determines the best action to be selected (i.e., a^max_{s_1,...,s_n} = arg max_{a' ∈ A} q_{s_1,...,s_n,a'}).
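To make this representation concrete, the following minimal Python sketch encodes Q̂ as a dictionary keyed by (state, action) pairs, with action selection via arg max; the names and toy values are ours, not from the original paper:

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]  # a predefined action set A

# Q-hat: maps a pair (complete state, action) to a learned weight.
q_hat = defaultdict(float)

def best_action(state):
    """Select a^max = arg max_{a' in A} q_hat[(state, a')]."""
    return max(ACTIONS, key=lambda a: q_hat[(state, a)])

# After some learning, the highest weight determines the preferred action:
q_hat[((1, 2), "right")] = 0.9
q_hat[((1, 2), "up")] = 0.4
print(best_action((1, 2)))  # -> right
```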
Exception-Tolerant Hierarchical Knowledge Bases An HKB consists of rules which are organized on different levels of abstraction. In contrast to Exception Lists (Michael 2011), an HKB can handle multiple rules per level, and the rules also comprise weights. To be able to define these rules, two different kinds of states and two different kinds of rules will be distinguished (Apeldoorn and Kern-Isberner 2017):

Definition 1 (Complete States/Partial States) A complete state is a conjunction s := s_1 ∧ ... ∧ s_n of all values s_i currently perceived by an agent's sensors, where n is the number of sensors (and every perceived sensor value s_i ∈ S_i of the corresponding sensor value set S_i is assumed to be a fact in the agent's current state). A partial state is a conjunction of a proper subset of the currently perceived sensor values.

Definition 2 (Complete Rules/Generalized Rules) Complete rules and generalized rules are of the form p_ρ ⇒ a_ρ [w_ρ], where p_ρ is either a complete state (in the case of a complete rule) or a partial state (in the case of a generalized rule), a_ρ ∈ A is an action serving as conclusion, and w_ρ is the rule's weight.

Definition 3 (Exceptions) A rule ρ ∈ R_j with premise p_ρ = ⋀_{s∈S_ρ} s, action a_ρ as conclusion and weight w_ρ is an exception to a rule τ ∈ R_{j−1} with premise p_τ = ⋀_{s∈S_τ} s, action a_τ as conclusion and weight w_τ, if S_τ ⊂ S_ρ and a_ρ ≠ a_τ. The exception is needed if there exists no other rule υ ∈ R_{j−1} with premise p_υ = ⋀_{s∈S_υ} s and action a_υ as conclusion where S_υ ⊂ S_ρ, a_υ = a_ρ and w_υ > w_τ.

An HKB is then a set KB = {R_1, ..., R_{n+1}} of rule sets, where level R_i comprises the rules whose premises consist of i − 1 sensor values (R_1 containing the most general rules and R_{n+1} the complete rules).

3.2 Learning an HKB

An HKB can be extracted from a weighted state-action pair representation Q̂ (that is learned, e.g., through a Reinforcement Learning technique) using the following approach (Apeldoorn and Kern-Isberner 2016): The approach takes a weighted state-action pair representation Q̂ as input and returns an HKB KB_Q̂ which reflects the knowledge contained in Q̂ by performing the following steps:

1. Initial creation of rule sets: In the first step, the multiple abstraction levels R_1, ..., R_{n+1} of the knowledge base are initially filled with rules. We only consider state-action pairs here that contribute to the best policy found during the learning process so far.⁴

2. Removal of worse rules: In all sets R_j, a rule ρ ∈ R_j is removed if there exists another rule σ ∈ R_j with the same partial state as premise having a higher weight (i.e., in every set R_j, only the best rules for a given partial state are kept).

3. Removal of worse more specific rules: In all sets R_{j>1}, a rule ρ ∈ R_j with premise p_ρ = ⋀_{s∈S_ρ} s, conclusion a_ρ and weight w_ρ is removed if there exists a more general rule σ ∈ R_{j'} with j' < j, premise p_σ = ⋀_{s∈S_σ} s and conclusion a_σ such that S_σ ⊂ S_ρ and a_σ = a_ρ (i.e., more specific rules that are no exceptions to more general ones are discarded).

⁴ Since the number of rules can grow drastically with the size of the given problem instance, an efficient algorithm has to be used here. An attempt for efficiently preselecting the rules using adapted ideas from the APRIORI algorithm (Agrawal et al. 1996) has been made in (Apeldoorn and Kern-Isberner 2017) to increase the overall performance of the extraction approach.
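The following Python sketch illustrates the three steps on the dictionary representation from above. It is a simplified reading of the extraction procedure, not the authors' implementation; the rule encoding and the redundancy test in step 3 are our assumptions:

```python
from itertools import combinations

def extract_hkb(q_hat, n):
    """Extract a simplified HKB from a weighted state-action pair dict.

    Returns a list of levels R_1, ..., R_{n+1}: levels[i] maps a premise
    (a frozenset of (sensor index, sensor value) literals, i literals on
    level i+1) to the pair (action, weight).
    """
    # Step 1: consider only the best state-action pairs found so far.
    best = {}
    for (state, action), w in q_hat.items():
        if state not in best or w > best[state][1]:
            best[state] = (action, w)
    # Fill all abstraction levels; step 2 (keep only the best rule per
    # partial state) is merged into the filling loop.
    levels = [dict() for _ in range(n + 1)]
    for state, (action, w) in best.items():
        literals = list(enumerate(state))
        for i in range(n + 1):
            for premise in map(frozenset, combinations(literals, i)):
                kept = levels[i].get(premise)
                if kept is None or w > kept[1]:
                    levels[i][premise] = (action, w)
    # Step 3: drop a more specific rule if a more general rule on a lower
    # level already draws the same conclusion (it is no exception then).
    for i in range(1, n + 1):
        for premise, (action, _) in list(levels[i].items()):
            if any(gen < premise and gen_action == action
                   for j in range(i)
                   for gen, (gen_action, _) in levels[j].items()):
                del levels[i][premise]
    return levels
```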
Given a perceived state, conclusions are drawn from an HKB by considering the rule levels from the most specific one downwards (see (Apeldoorn and Kern-Isberner 2016) for details). By this, the algorithm selects the most specific rule that fits the perceived sensor values and falls back to the next more unspecific rule as a heuristic, in case no more specific rule with a fitting premise could be found.

Due to the possibility of falling back to more general rules in case no matching rule could be found for the given sensor values, the reasoning mechanism on an HKB allows for drawing meaningful conclusions about states that have not been seen before by an agent. The idea that a more general rule serves as a rough heuristic for unknown states (for which no better knowledge is available) will later contribute to accelerating the learning process.
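A minimal sketch of this fallback reasoning, based on the rule encoding used in the extraction sketch above (again an illustration, not the original algorithm):

```python
def infer_action(levels, state):
    """Return the conclusion of the most specific rule matching the state.

    Levels are scanned from the most specific (complete rules) down to the
    most general one, so that a general rule serves as a rough heuristic
    for states the agent has never seen before.
    """
    literals = frozenset(enumerate(state))
    for level in reversed(levels):
        matching = [(w, a) for premise, (a, w) in level.items()
                    if premise <= literals]
        if matching:
            return max(matching)[1]  # best-weighted matching rule
    return None  # empty knowledge base
```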
We use doorn and Volz 2017), the measure is defined as follows: standard Q-Learning (Watkins 1989) as a learning approach Definition 4 (Strategic Depth Measure ds ) Let KB = with a learning rate of α = 0.1, a random action probabil- {R1 , ..., Rn+1 } be an HKB and S = {S1 , ..., Sn } be the set ity of  = 0.1 and a discount factor γ = 0.9. We run every of all sensor value sets of an agent, then the strategic depth scenario for 250 runs and we measure the normalized strate- is defined as a function gic depth subjectively perceived by the agent (averaged over 30 repetitions) in dependency of the number of runs already n+1 X n  |Ri | performed. Two phenomena can be observed in Figure 2: ds (KB, S) := bi−1 P Q , (1) i−1 |S| • The subjective strategic depth converges to the de facto i=1 S⊆S S∈S strategic depth of the respective scenario as the agent be- |S|=i−1 havior converges to the optimal policy. where n + 1 = |KB| = |S| + 1 is the number of levels in • The subjective strategic depth decreases in general as the KB, b is a weighting constant, and Ri ∈ KB is the i-th level learning process progresses, since the agent’s knowledge of KB. about the scenario increases, and therefore the scenarios appears to be simpler to the agent. (In case of the fourth As rules on a level Rj>1 can be considered exceptions scenario, after the first runs, the agent seems to underesti- to the rules on level Rj−1 , the intuition behind ds is that mate the depth of the scenario, since the scenario locally it measures the weighted relative number of rules (hence, has straight path structures that form together a more com- exceptions) represented by an HKB. plex path from a global point of view.) In (Apeldoorn and Volz 2017), also a modified version of the measure is proposed, where ds is additionally divided by the sum of weights to normalize it to values of range [0, 1]. 4 Agent Model For our purpose, this version of the measure is extremely This section introduces our model of a learning agent which useful, since it allows our agent model to estimate the strate- is able to determine and exploit heuristics based on HKBs gic depth of a previously unknown environment relatively during a learning process in a previously unknown environ- to the maximum strategic depth that can be expected, given ment. In our agent model, a sub-symbolic learning approach the sets of possible sensor values S1 , ..., Sn . Thus, in the fol- can be used to learn weighted state-action pairs (where the lowing, the normalized version of the measure (according to maximum weight determines the preferred action, given a (Apeldoorn and Volz 2017)) will be used: state). Figure 1 illustrates the main ideas of our agent model which mainly consists of two parts: Definition 5 (Normalized Strategic Depth d¯s ) Let KB, S, n, b and ds be as in Definition 5. Then the normalized • an initialization part which is executed every time before strategic depth is defined as a function a new learning episode starts, and ds (KB, S) • a common agent cycle which is executed every time step. d¯s (KB, S) :=   . (2) Pn+1 n i−1 In the former, the agent decides, whether it relies its deci- b i=1 i−1 sions on the weighted state-action pairs (represented by the Q̂ matrix) or on the heuristics (represented by the current if HKB action Agent selection and glob. reward if < th < max. glob. Initialization reward (executed before a Percepts ; Reward new episode starts) Set action selection Set HKB action selection Update Weights if no random action (with prob. 
In the following, d̄_s will be used in our agent model to allow the agent to estimate subjectively when it could be useful to exploit heuristics during a learning process in an a priori unknown environment: If d̄_s falls below a certain threshold th, then this indicates that heuristics could possibly be applied successfully.

The intuition behind this is that the (normalized) strategic depth based on an HKB which was extracted from a weighted state-action pair representation learned by an agent will successively decrease as the agent's learning process progresses: Since the agent's knowledge about the problem increases, the problem appears successively less complex to the agent, and the subjectively measured normalized strategic depth converges to the de facto strategic depth of the task to be learned (i.e., the strategic depth according to the HKB of the optimal policy to solve the task). We will demonstrate this in the following on several examples.

Examples for Subjective Normalized Strategic Depth We consider four different grid worlds of different structural complexities where an agent has to learn to navigate from a starting point A to a target point B (see Figure 2). We use standard Q-Learning (Watkins 1989) as a learning approach with a learning rate of α = 0.1, a random action probability of ε = 0.1 and a discount factor of γ = 0.9. We run every scenario for 250 runs and measure the normalized strategic depth subjectively perceived by the agent (averaged over 30 repetitions) in dependency of the number of runs already performed. Two phenomena can be observed in Figure 2:

• The subjective strategic depth converges to the de facto strategic depth of the respective scenario as the agent behavior converges to the optimal policy.

• The subjective strategic depth decreases in general as the learning process progresses, since the agent's knowledge about the scenario increases, and the scenario therefore appears to be simpler to the agent. (In case of the fourth scenario, after the first runs, the agent seems to underestimate the depth of the scenario, since the scenario locally has straight path structures that together form a more complex path from a global point of view.)

[Figure 2: Examples of Subjective Normalized Strategic Depth for the Considered Grid World Scenarios of Increasing Structural Complexity with the Optimal Policies Indicated by Arrows—four panels (a)–(d), each plotting the Avg. Subjective Norm. Strategic Depth (0.0–1.0) against the Number of Runs r (1–250). Scenarios 1 (a) and 2 (b) are taken from (Apeldoorn and Kern-Isberner 2016), Scenario 3 (c) is from (Apeldoorn and Kern-Isberner 2017) and Scenario 4 (d) from (Apeldoorn and Volz 2017).]

4 Agent Model

This section introduces our model of a learning agent which is able to determine and exploit heuristics based on HKBs during a learning process in a previously unknown environment. In our agent model, a sub-symbolic learning approach can be used to learn weighted state-action pairs (where the maximum weight determines the preferred action, given a state). Figure 1 illustrates the main ideas of our agent model, which mainly consists of two parts:

• an initialization part which is executed every time before a new learning episode starts, and

• a common agent cycle which is executed every time step.

[Figure 1: Learning Agent Model with the Ability of Finding and Exploiting Heuristics—flow diagram: the initialization part (executed before a new episode starts) extracts the HKB and sets the action selection mode to HKB action selection if d̄_s < th, to Q̂ action selection otherwise; the agent cycle (executed every time step) performs random action selection with probability ε, otherwise action selection by Q̂ or by HKB, passes the action a to the environment, and updates the weights from the received percepts and reward.]

In the former, the agent decides whether it bases its decisions on the weighted state-action pairs (represented by the Q̂ matrix) or on the heuristics (represented by the currently extracted HKB) in the upcoming learning episode. This decision is made depending on the d̄_s value calculated from the currently extracted HKB.

In the latter, the agent either chooses a random action (for exploration purposes) or decides to choose its action according to the Q̂ matrix or the extracted HKB, respectively, depending on the decision made during the initialization. Note that in Figure 1, components belonging to the sub-symbolic learning approach are indicated by a gray coloring, whereas the parts belonging to our heuristics extension are colored white. The gray parts could be replaced by a different learning approach.
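Putting the pieces together, the two parts of the agent model can be sketched as follows. This is an illustrative composition of the earlier sketches; env with its reset/step interface is a placeholder, and a minimal q_update is sketched at the end of Section 5.1 below:

```python
import random

def run_episode(env, q_hat, sensor_value_sets, th=0.2, epsilon=0.1):
    """One learning episode of the agent model (illustrative sketch)."""
    # Initialization part (before a new episode starts): extract the
    # current HKB and fix the action selection mode for this episode.
    levels = extract_hkb(q_hat, len(sensor_value_sets))
    use_hkb = normalized_strategic_depth(levels, sensor_value_sets) < th

    # Common agent cycle (executed every time step).
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:            # random action (prob. eps)
            action = random.choice(ACTIONS)
        elif use_hkb:                            # action selection by HKB
            action = infer_action(levels, state)
        else:                                    # action selection by Q-hat
            action = best_action(state)
        next_state, reward, done = env.step(action)
        q_update(q_hat, state, action, reward, next_state)  # update weights
        state = next_state
```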
5 Evaluation

This section first evaluates the approach in the four grid world scenarios from Figure 2 in Section 3.3. Later, the approach will additionally be evaluated in a more realistic example: a game from the GVGAI framework (Perez-Liebana et al. 2016).

5.1 Evaluation in the Context of Grid World Scenarios

We run the four grid world scenarios from Figure 2 in Section 3.3 for 50 runs each and measure the percentage of how many times the optimal policy was found by the agent over 200 repetitions of the experiment. As learning approach, we again use Q-Learning with the same parameters used for measuring the subjective strategic depth at the end of Section 3.3. As threshold parameter to determine the exploitation of found heuristics by means of the subjective strategic depth d̄_s, we choose th = 0.2. According to Figure 2, this means that—on average—the agent would try to exploit the found heuristics after

• ≈ 25 runs in case of the first scenario,

• ≈ 75 runs in case of the second scenario and

• ≈ 100 runs in case of the third scenario.

For the second and the third scenario, this may sound confusing, since we perform a total number of 50 runs only. However, the reader should be aware that Figure 2 shows the average development of the d̄_s measure, and thus there are enough cases where the agent individually decides to exploit a found heuristic that leads to an optimal policy.

As for the fourth scenario, the agent never considers exploiting any heuristics, since the subjective strategic depth does not decrease below th = 0.2. This can be interpreted in a way that the scenario comprises too many exceptions, such that an exploitation of heuristics does not make sense from the agent's point of view (this point of view is in fact what the parameter th controls, depending on whether the desired agent should be more "heuristics-affine" or more "conservative").

Table 1 contains the comparison of our agent model against a standard Q-Learning agent with the same parameters for the learning part. In addition, for reasons of comparison, Table 1 also provides results of a heuristics-only version of the agent model⁵, where the agent was forced to always base its decisions on the extracted heuristics (i.e., in the initialization phase in Figure 1, the action selection mode is always set to HKB action selection, and the weights contained in the Q̂ matrix are never considered directly for action selection in the agent cycle).

⁵ Thanks to an anonymous reviewer for proposing this idea.

            | Standard Q-Learning | With Heuristics (th = 0.2) | Heuristics Only
Scenario 1  | 3.5%                | 66.5%                      | 43.0%
Scenario 2  | 4.5%                | 20.0%                      | 10.0%
Scenario 3  | 0.0%                | 10.5%                      | 0.03%
Scenario 4  | 0.0%                | 0.0%                       | 0.0%

Table 1: Standard Q-Learning vs. Heuristics Approach in the Context of the Grid World Scenarios—The table shows the percentage of 100 repetitions in which the optimal policy was found by the agent during the first 50 runs.

As can be seen in Table 1, standard Q-Learning rarely manages to solve the grid worlds during the first 50 runs, whereas our heuristics approach clearly outperforms these results. As for the heuristics-only version, the results are slightly worse: This seems to be because the agent starts relying on the heuristics too early, while they are still nearly random at the beginning of the learning process. This may prevent the agent from further exploring the environment to an adequate extent and may result in locally optimal policies.

One could argue that standard Q-Learning is a rather old approach and that more elaborate learning approaches are available nowadays. However, if a better learning approach is chosen in our setup (better in the sense that it converges faster to the optimal policy), the extracted HKBs and hence the measure d̄_s will be more accurate as well, and therefore better heuristics could be exploited earlier during the learning process. This means that in our approach, both the heuristics and the measure d̄_s (to decide when to exploit them) improve with the quality of the learning approach used for the agent.
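For completeness, the weight update behind these experiments is plain tabular Q-Learning; a minimal version with the parameters stated in Section 3.3 (α = 0.1, γ = 0.9), compatible with the sketches above, might look like this:

```python
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (Section 3.3)

def q_update(q_hat, state, action, reward, next_state):
    """Standard Q-Learning update (Watkins 1989) on the Q-hat dictionary."""
    best_next = max(q_hat[(next_state, a)] for a in ACTIONS)
    q_hat[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                       - q_hat[(state, action)])
```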
5.2 Evaluation in the Context of a GVGAI Game

We now evaluate the agent model in a more realistic example by considering the game Camel Race from the GVGAI framework (Perez-Liebana et al. 2016). In this game, the agent has to control a camel which must be moved to the opposite side of the screen—faster than any other camel in the game (additionally avoiding obstacles in the higher levels). Figure 3 shows the first and the third level of the game (with obstacles).⁶

⁶ The game was slightly modified for our purpose by reducing the number of camels, in order to reduce the dimensionality of the state-action space without changing the basic game mechanics.

[Figure 3: First Level (a) and Third Level (b) of the GVGAI Game Camel Race]

The agent's state space is described by the sensor value sets S_x^{C_1} × S_x^{C_2} × S_y^{C_2} × S_x^{C_3} containing the coordinates of the camels in the game (where C_1, ..., C_3 correspond to the three camels, C_2 is the camel in the middle controlled by the agent, and C_1, C_3 only move in x direction), and A = {up, down, left, right, none} are the five actions that can be performed by the agent. As reward, the agent perceives the current distance to the fastest opponent camel in x direction.

Even if the game is rather simple, it has an interesting aspect: Due to the dynamics of the environment caused by the movement of the two opponent camels, our agent perceives new and previously unseen states in nearly every game tick of the early learning phase. Thus, the agent has to explore large parts of the state-action space to improve its behavior, although the game could easily be won by applying a rough heuristic like always move right (⊤ ⇒ right [1.0]) for the first level. This renders the game an eligible test environment for our agent model.

We perform 100 runs for each level and measure the percentage of wins by the agent over 30 repetitions. Again, standard Q-Learning with and without the heuristics approach, as well as the heuristics-only approach, are used, given the same standard parameters as provided in Section 5.1. Table 2 shows the results for the first and the third level of the game: With standard Q-Learning, the agent is not able to win the game within 100 runs, whereas using the heuristics approach, the agent wins both levels in over 50% of the cases. Surprisingly, the heuristics-only approach performs even better here.

            | Standard Q-Learning | With Heuristics (th = 0.2) | Heuristics Only
1st Level   | 0.0%                | 63.3%                      | 66.0%
3rd Level   | 0.0%                | 53.3%                      | 60.0%

Table 2: Standard Q-Learning vs. Heuristics Approach in the Context of the GVGAI Game Camel Race—The table shows the percentage of 30 repetitions in which the agent won the game in the first 100 runs.
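To illustrate the state encoding just described, the Camel Race percepts can be flattened into the state tuples used by the sketches above; the coordinate extraction from the GVGAI framework is elided, and the helpers below are hypothetical:

```python
# The five actions available to the agent in Camel Race.
CAMEL_RACE_ACTIONS = ["up", "down", "left", "right", "none"]

def camel_race_state(c1_x, c2_x, c2_y, c3_x):
    """Complete state from S_x^C1 x S_x^C2 x S_y^C2 x S_x^C3.

    C2 is the agent's camel (moving in x and y direction); the opponent
    camels C1 and C3 only move in x direction.
    """
    return (c1_x, c2_x, c2_y, c3_x)

def camel_race_reward(c2_x, fastest_opponent_x):
    """Reward: the current distance to the fastest opponent camel in
    x direction (the sign convention is our assumption)."""
    return c2_x - fastest_opponent_x
```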
This seems to be the case since—although the game has some dynamics and can thus be considered more complex than the grid world scenarios—the considered levels allow for rougher heuristics than the grid worlds: In contrast to the grid worlds, the goal to be reached is not located in a single corner; instead, the game can be won by simply reaching the rightmost side of the screen (which can already be achieved by a very rough heuristic).

6 Conclusion and Future Work

In this paper, we described a model of a learning agent which is able to find and exploit heuristics without a priori knowledge in unknown environments. We showed on several examples (four grid world scenarios and one game from the GVGAI framework) that our approach drastically accelerates the learning process of finding optimal policies.

However, the approach is not perfect yet and leaves room for future work: (1) The runtime to extract an HKB should be further improved (a faster algorithm has been proposed in (Apeldoorn and Kern-Isberner 2017) already; cf. footnote 4), and (2) a quality criterion could be incorporated into the measure that helps to decide whether heuristics should be exploited, depending on previous experiences with applying these heuristics. The latter could possibly be useful to help prevent an agent from repeatedly trying out bad heuristics.

Furthermore, it could be interesting to extend the model with revision techniques for the found heuristics and to incorporate mechanisms to store and reuse heuristics.

References

Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, A. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining. Cambridge, MA, USA: MIT Press. 307–328.

Apeldoorn, D., and Kern-Isberner, G. 2016. When should learning agents switch to explicit knowledge? In Benzmüller, C.; Sutcliffe, G.; and Rojas, R., eds., GCAI 2016. 2nd Global Conference on Artificial Intelligence, volume 41 of EPiC Series in Computing, 174–186. EasyChair Publications.

Apeldoorn, D., and Kern-Isberner, G. 2017. Towards an understanding of what is learned: Extracting multi-abstraction-level knowledge from learning agents. In Rus, V., and Markov, Z., eds., Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference, 764–767. Palo Alto, California: AAAI Press.

Apeldoorn, D., and Volz, V. 2017. Measuring strategic depth in games using hierarchical knowledge bases. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), 9–16. IEEE. To be published.

Borgida, A., and Etherington, D. W. 1989. Hierarchical knowledge bases and efficient disjunctive reasoning. In Brachman, R. J.; Levesque, H. J.; and Reiter, R., eds., Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, 33–43. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Dörner, D. 1992. Die Logik des Mißlingens – Strategisches Denken in komplexen Situationen. Reinbek bei Hamburg: Rowohlt Taschenbuch Verlag.

Junges, R., and Klügl, F. 2013. Learning tools for agent-based modeling and simulation. KI – Künstliche Intelligenz 27(3):273–280.

Leopold, T.; Kern-Isberner, G.; and Peters, G. 2008. Belief revision with reinforcement learning for interactive object recognition. In ECAI 2008 – Proceedings of the 18th European Conference on Artificial Intelligence, 65–69. Amsterdam: IOS Press.

Michael, L. 2011. Causal learnability. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), 1014–1020. Palo Alto, California: AAAI Press.

Perez-Liebana, D.; Samothrakis, S.; Togelius, J.; Schaul, T.; and Lucas, S. M. 2016. General video game AI: Competition, challenges and opportunities. In Schuurmans, D., and Wellman, M., eds., Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference, 4335–4337. Palo Alto, California: AAAI Press.

Shapiro, D.; Langley, P.; and Shachter, R. 2001. Using background knowledge to speed reinforcement learning in physical agents. In AGENTS '01: Proceedings of the Fifth International Conference on Autonomous Agents, 254–261. New York: ACM.

Singh, D.; Sardina, S.; Padgham, L.; and James, G. 2011. Integrating learning into a BDI agent for environments with changing dynamics. In Walsh, T., ed., Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2525–2530. Menlo Park, California: AAAI Press/International Joint Conferences on Artificial Intelligence.

Sun, R.; Peterson, T.; and Merrill, E. 1999. A hybrid architecture for situated learning of reactive sequential decision making. Applied Intelligence 11(1):109–127.

Sun, R. 2002. Knowledge extraction from reinforcement learning. In New Learning Paradigms in Soft Computing. Berlin, Heidelberg: Springer. 170–180.

Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, Massachusetts; London, England: The MIT Press.

Watkins, C. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, University of Cambridge, England.