<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Agent-Based Learning Approach for Finding and Exploiting Heuristics in Unknown Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daan Apeldoorn</string-name>
          <email>daan.apeldoorn@tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Kern-Isberner</string-name>
          <email>gabriele.kern-isberner@cs.tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Technische Universität Dortmund</institution>
        </aff>
      </contrib-group>
      <pub-date>
<year>2017</year>
      </pub-date>
      <abstract>
<p>This paper presents an agent-based hybrid symbolic/sub-symbolic learning approach as a basic model for the human ability of quickly recognizing and exploiting heuristic rules in an initially unknown environment. Instead of mechanically iterating a task until the optimal policy is found (as usually done by common Reinforcement Learning techniques), the agent finds heuristics expressed by symbolic knowledge bases which allow for an intelligent exploitation of knowledge gained from few experiences. For this purpose, a measure is proposed which allows an agent to decide on its own at which point during the learning process its decisions should rely on the recognized heuristics rather than on a learned state-action pair representation. The approach will be evaluated both in the context of several grid world scenarios and a game from the GVGAI framework. It will be shown that our approach outperforms standard Q-Learning in the selected scenarios, and arguments will be provided that other agent-based machine learning techniques could also profit from the proposed approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Common machine learning techniques used in the context
of learning agents (e. g., Reinforcement Learning based
approaches like Q-Learning
        <xref ref-type="bibr" rid="ref18">(Watkins 1989)</xref>
        ,
        <xref ref-type="bibr" rid="ref17">(Sutton and Barto
1998)</xref>
        ) are usually based on a trial-and-error principle: In
unknown environments, agents based on such approaches
start with purely random actions and successively improve
their behavior as the learning process progresses. Having
no a priori knowledge about the environment, such agents
have to explore large amounts of the state-action space,
even if the underlying structure of the environment follows
simple rules.
      </p>
      <p>Humans, in contrast, are known for successfully applying
rough heuristics learned from very few examples. This, of
course, can lead to wrong decisions (see, e. g., (Dörner 1992)
for several examples); however, in common cases many
problems can be quickly solved by applying such
heuristics derived from few examples. One reason for this is that
many problem domains (e. g., games) are geared towards
the commonsense human way of thinking, and these kinds
of problems can therefore be quickly solved by exploiting
such heuristics.</p>
      <p>
        In this paper, we propose a hybrid symbolic/sub-symbolic
agent model which is able to autonomously find and exploit
useful heuristics in a previously unknown environment.
During a Reinforcement Learning process, the agent creates a
symbolic knowledge base from learned experiences which
comprises rule-based knowledge on multiple levels of
abstraction as a model for heuristics. We do not incorporate
any a priori knowledge (as done e. g. in
        <xref ref-type="bibr" rid="ref11">(Shapiro, Langley,
and Shachter 2001)</xref>
—although this would be possible in our
model); instead, the heuristics are inferred by the agent itself
during the learning process. Furthermore, the agent is able
to decide on its own at which point during the learning process
its decisions should rely on the found heuristics rather
than on the weighted state-action pairs of the Reinforcement
Learning process. More concretely, the following
contributions are made:
      </p>
      <p>A hybrid symbolic/sub-symbolic agent model which
incorporates both symbolic rule-based knowledge and
weighted state-action pair representations learned by a
sub-symbolic learning approach.</p>
      <p>
        An approach for estimating the subjective structural
complexity of a previously unknown environment from an
agent's perspective during a learning process (based on a
“commonsense” measure for strategic depth
        <xref ref-type="bibr" rid="ref5">(Apeldoorn
and Volz 2017)</xref>
        ).
      </p>
      <p>
        Compared to our previous works on knowledge base
extraction presented in
        <xref ref-type="bibr" rid="ref3">(Apeldoorn and Kern-Isberner 2016)</xref>
        and
        <xref ref-type="bibr" rid="ref4 ref5">(Apeldoorn and Kern-Isberner 2017)</xref>
        , the novelty of
this paper lies in the incorporation of our extraction
approach into a learning agent model which is able to
decide on its own when to exploit the extracted knowledge
as a heuristic in an unknown environment. This is realized
by incorporating the strategic depth measure introduced in
        <xref ref-type="bibr" rid="ref4 ref5">(Apeldoorn and Volz 2017)</xref>
        as a decision criterion.
      </p>
      <p>
        As a side product, the resulting agent model is still able
to explain the learned knowledge and the decisions made
to some extent at any point during the learning process, as
described in
        <xref ref-type="bibr" rid="ref4">(Apeldoorn and Kern-Isberner 2017)</xref>
        .
      </p>
      <p>
        In Section 2, related approaches will be discussed.
A brief introduction to the fundamental preliminary works
will be given in Section 3. The agent model will be
introduced in Section 4. In Section 5, the model will be
evaluated in the context of several simple grid world
scenarios and later in different levels of a game from the
General Video Game Playing Artificial Intelligence (GVGAI)
framework, which is usually used for the GVGAI
competition, where agents have to compete in playing previously
unknown video games
        <xref ref-type="bibr" rid="ref10">(Perez-Liebana et al. 2016)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Several attempts have been made so far to relate
symbolic knowledge representation and sub-symbolic machine
learning approaches more closely in the context of agents. The basic ideas
can be roughly assigned to one of the following groups
(representatives of every group will be discussed subsequently):
1. Extraction techniques to gain different knowledge
representations (e. g., propositional rules) learned from data
through machine learning techniques
2. Methods to integrate a priori knowledge into machine
learning approaches to support the learning process (e. g.,
by shaping the search space with a priori knowledge and
later refining this knowledge with learning techniques)
3. Cognitive architectures that combine 1. and 2.
A representative of the first group is, e. g.,
        <xref ref-type="bibr" rid="ref15">(Sun 2002)</xref>
        ,
where extraction techniques have been proposed to gain
simple rules or plans from reinforcement learning. In
        <xref ref-type="bibr" rid="ref7">(Junges
and Klu¨gl 2013)</xref>
        , decision trees are created from learned
weighted state-action representations with the primary goal
of supporting agent developers in the implementation of
adequate agent behavior. These works focus on the extraction
of knowledge in different forms. In contrast, in this paper,
we propose to (re)incorporate the extracted knowledge back
into the learning process as a model for finding and
exploiting rough heuristics in a previously unknown environment.
      </p>
      <p>
        As representatives for the second group, e. g.,
        <xref ref-type="bibr" rid="ref12">(Singh et
al. 2011)</xref>
        and
        <xref ref-type="bibr" rid="ref11">(Shapiro, Langley, and Shachter 2001)</xref>
        can
be considered: In these approaches, machine learning
techniques are combined with mechanisms to incorporate a
priori knowledge to accelerate the learning process or to later
refine and adapt the a priori knowledge to a dynamically
changing environment (in
        <xref ref-type="bibr" rid="ref12">(Singh et al. 2011)</xref>
        this is realized
in the context of a BDI agent model whose initial beliefs
can be refined during a learning process). In contrast,
our approach works (nearly) without any a priori knowledge
about the environment.
      </p>
      <p>
        The third group is represented, e. g., by models such as
CLARION
        <xref ref-type="bibr" rid="ref14">(Sun, Peterson, and Merrill 1999)</xref>
        , a cognitive
architecture which incorporates both sub-symbolic learning
and symbolic knowledge during learning and decision
making. In this model, a sub-symbolic representation is
combined with a rule network on top, and both the sub-symbolic
and the rule level are used in the agent’s decision making
process. In our approach, however, we assume a reasonable
point during a learning process, where a transition from a
sub-symbolic (implicit) knowledge to a symbolic (explicit)
knowledge representation should take place with the
primary objective to quickly find useful heuristics in a
previously unknown environment. Another representative related
to the third group, the SPHINX approach, is described in
        <xref ref-type="bibr" rid="ref8">(Leopold, Kern-Isberner, and Peters 2008)</xref>
        , where
reinforcement learning is combined with ranking functions and belief
revision to support the learning process. It was shown on
an object recognition task that the considered learning agent
can benefit from the combined approach (with and without
involving additional background knowledge). However, in
contrast to our approach, it was not investigated in which
phase during the learning process the agent should rely its
behavior more on the explicit, symbolic part of the
knowledge (i. e., the ranking function) rather than on the implicit
representation of the weights learned through the
reinforcement learning part. Furthermore, the symbolic part is not
primarily dedicated to representing a model for heuristics.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Preliminaries</title>
      <p>
        This section introduces the two preliminary works needed
for our approach: First, the concept of exception-tolerant
Hierarchical Knowledge Bases (HKBs)
        <xref ref-type="bibr" rid="ref3 ref4 ref5">(Apeldoorn and
Kern-Isberner 2016, 2017)</xref>
        will be briefly introduced and
how they can be retrieved (Sections 3.1 and 3.2).1 HKBs
will be used later as a model for heuristics. After that, a
“commonsense” strategy measure (recently introduced in
        <xref ref-type="bibr" rid="ref4 ref5">(Apeldoorn and Volz 2017)</xref>
        ), is described which will be used
later in our agent model to let an agent subjectively estimate
whether or not heuristics should be exploited (Section 3.3).
3.1
      </p>
      <sec id="sec-3-1">
        <title>3.1 Definition of HKBs</title>
        <p>
          To introduce HKBs and how they can be learned, we closely
follow
          <xref ref-type="bibr" rid="ref3 ref4 ref5">(Apeldoorn and Kern-Isberner 2016, 2017)</xref>
          . There,
an extraction approach is proposed which is able to extract
an HKB from a weighted state-action pair representations
learned by a common reinforcement learning approach like
Q-Learning
          <xref ref-type="bibr" rid="ref18">(Watkins 1989)</xref>
          .2
        </p>
        <p>This section only provides the main definitions needed
to understand the basic idea of HKBs. For more details on
HKBs, the reader should refer to the original literature.</p>
        <p>Basics of the Agent Model We consider an agent which is
learning a policy by acting autonomously in a previously
unknown environment. The agent is equipped with $n$ sensors
through which it can perceive its current state in the
environment. The agent is able to perform actions from a
predefined action space and can furthermore perceive whether or
not the performed actions were good, in the form of (numeric)
rewards. The perceived rewards are then used to learn a
weighted state-action pair representation, where the weights
determine which action is to be performed, given a
perceived state (usually the one with the highest weight).</p>
        <p>
          More formally, in such a representation, a state $s$ is an
element of a multi-dimensional state space $S = S_1 \times \dots \times S_n$,
where $n$ is the number of the agent's sensors (through which
the agent is able to perceive its state in the environment)
and every $S_i$ is a set of possible sensor values of the
corresponding sensor. Furthermore, the agent selects actions
from a predefined action set $A$ and the learned weights are
stored in a multi-dimensional matrix $\hat{Q} = (q_{s_1,\dots,s_n,a})$ with
$s_i \in S_i$ and $a \in A$. The weights can be learned by
different machine learning approaches, provided that the
learning approach converges such that, given a state, the
highest weight determines the best action to be selected (i. e.,
$a^{\max}_{s_1,\dots,s_n} = \arg\max_{a' \in A} q_{s_1,\dots,s_n,a'}$).
        </p>
        <p>1Note that in
          <xref ref-type="bibr" rid="ref3 ref4">(Apeldoorn and Kern-Isberner 2016, 2017)</xref>
          , exception-tolerant Hierarchical Knowledge Bases were originally
called Hierarchical Knowledge Bases—but to avoid confusion with
the Hierarchical Knowledge Bases by Borgida et al.
          <xref ref-type="bibr" rid="ref6">(Borgida and
Etherington 1989)</xref>
          and in order to further outline the primary
purpose of our approach, we call them exception-tolerant here.</p>
        <p>2Note that the used learning approach is not of importance as
long as it results in a weighted state-action pair representation,
where the weights determine which action has to be chosen given
a perceived state (usually the one with the highest weight).</p>
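        <p>To make this representation concrete, the following Python sketch (illustrative only; the sparse dictionary layout and all names are our assumptions, not taken from the original work) stores $\hat{Q}$ as a dictionary and implements the greedy selection $a^{\max}_{s_1,\dots,s_n}$ together with, as one possible way of learning the weights, the standard Q-Learning update:</p>
        <preformat>
# Q^ stored sparsely as a dictionary mapping (state, action) -> weight,
# where a state is the tuple (s_1, ..., s_n) of perceived sensor values.

def best_action(q, state, actions):
    # a^max = arg max over a' in A of q_{s_1,...,s_n,a'}
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    # Standard Q-Learning update (Watkins 1989) as one possible way of
    # learning the weights; alpha is the learning rate, gamma the discount.
    old = q.get((state, action), 0.0)
    target = reward + gamma * max(q.get((next_state, a), 0.0) for a in actions)
    q[(state, action)] = old + alpha * (target - old)
        </preformat>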
      </sec>
      <sec id="sec-3-2">
        <title>Exception-Tolerant Hierarchical Knowledge Bases</title>
        <p>
          An HKB consists of rules which are organized on
different levels of abstraction. In contrast to Exception Lists
          <xref ref-type="bibr" rid="ref9">(Michael 2011)</xref>
          , an HKB can handle multiple rules per
level and the rules also comprise weights. To be able to
define these rules, two different kinds of states and two
different kinds of rules will be distinguished
          <xref ref-type="bibr" rid="ref4 ref5">(Apeldoorn
and Kern-Isberner 2017)</xref>
          :
Definition 1 (Complete States/Partial States) A
complete state is a conjunction s := s1 ^ ::: ^ sn of all
values si currently perceived by an agent’s sensors, where n
is the number of sensors (and every perceived sensor value
si 2 Si of the corresponding sensor value set Si is assumed
to be a fact in the agent’s current state). A partial state is a
conjunction s := Vs02S s0 of a subset S fs1; :::; sng of
the sensor values of a complete state.
        </p>
        <p>Definition 2 (Complete Rules/Generalized Rules) Complete
rules and generalized rules are of the form $p \Rightarrow a\ [w]$,
where $p$ is either a complete state (in case of a
complete rule) or a partial state (in case of a generalized
rule), the conclusion $a \in A$ is an action of an agent's
action space $A$ and $w \in [0, 1]$ is the rule's weight.</p>
        <p>Thus, complete rules map complete states to actions and
generalized rules map partial states to actions. An HKB can
now be defined as follows:</p>
        <p>Definition 3 (Exception-Tolerant Hierarchical
Knowledge Base) An exception-tolerant Hierarchical
Knowledge Base (HKB) is an ordered set $KB := \{R_1, \dots, R_{n+1}\}$
of $n + 1$ rule sets, where $n$ is the number of sensors (i. e.,
the number of state space dimensions). Every set $R_{i&lt;n+1}$
contains generalized rules and the set $R_{n+1}$ contains
complete rules, such that every premise $p = \bigwedge_{s \in S^*} s$ of a rule
$\rho \in R_i$ is of length $|S^*| = i - 1$.</p>
        <p>According to Definition 3, the set $R_1$ contains the most
general rules (with empty premises) and the set $R_{n+1}$
contains the most specific (i. e., complete) rules.</p>
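        <p>For illustration, the following toy example (with invented sensor values, actions and weights) shows how such an HKB can be represented in Python; this representation is also used in the sketches below:</p>
        <preformat>
# A toy HKB for an agent with n = 2 sensors: a list of levels R_1, ..., R_{n+1},
# each mapping a premise (a frozenset of (sensor index, sensor value) facts)
# to an (action, weight) pair. All concrete values are invented.
toy_hkb = [
    {frozenset(): ("right", 0.8)},                       # R_1: premises of length 0
    {frozenset({(0, "wallAhead")}): ("down", 0.9)},      # R_2: premises of length 1
    {frozenset({(0, "wallAhead"), (1, "atCorner")}): ("left", 0.7)},  # R_3: complete
]
        </preformat>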
        <p>
          For the relation between rules, the notion of a needed exception
will be used, according to the following definition (cf.
          <xref ref-type="bibr" rid="ref4">(Apeldoorn and Kern-Isberner 2017)</xref>
          ):
        </p>
        <sec id="sec-3-3-1">
          <title>Definition 4 (Needed Exception) A rule 2 Rj&gt;1 is an</title>
          <p>eaxcctieopntioan taosacrounlcelusi2onRajnd1 wweiitghhptrewmi,seif pS = VSs2Sansd,
a 6= a . The exception is needed, if there exists no other
rule 2 Rj 1 with premise p = V s and action a as
conclusion where S S , a = a sa2nSd w &gt; w .
3.2</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.2 Learning an HKB</title>
        <p>
          An HKB can be extracted from a weighted state-action
pair representation Q^ (that is learned, e. g., through a
Reinforcement Learning technique) using the following
approach
          <xref ref-type="bibr" rid="ref3">(Apeldoorn and Kern-Isberner 2016)</xref>
          :3
        </p>
        <p>The approach takes a weighted state-action pair
representation $\hat{Q}$ as input and returns an HKB $KB_{\hat{Q}}$ which
reflects the knowledge contained in $\hat{Q}$ by performing the
following steps:
1. Initial creation of rule sets: In the first step, the multiple
abstraction levels $R_1, \dots, R_{n+1}$ of the knowledge base are
initially filled with rules. We only consider state-action
pairs here that contribute to the best policy found during
the learning process so far.4
2. Removal of worse rules: In all sets $R_j$, a rule $\rho \in R_j$
is removed, if there exists another rule $\rho' \in R_j$ with the
same partial state as premise having a higher weight (i. e.,
in every set $R_j$ only the best rules for a given partial state
are kept).
3. Removal of worse more specific rules: In all sets $R_{j&gt;1}$, a
rule $\rho \in R_j$ with premise $p = \bigwedge_{s \in S} s$, conclusion $a$
and weight $w$ is removed, if there exists a more general
rule $\rho' \in R_{j'&lt;j}$ with premise $p' = \bigwedge_{s \in S'} s$ where
$S' \subset S = \{s_1, \dots, s_{j-1}\}$ and weight $w' \geq w$.
4. Removal of too specific rules: In all sets $R_{j&gt;1}$, a rule
$\rho \in R_j$ with premise $p = \bigwedge_{s \in S} s$ and conclusion $a$ is
removed, if there exists a more general rule $\rho' \in R_{j'&lt;j}$
with the same action $a' = a$ as conclusion and with
premise $p' = \bigwedge_{s \in S'} s$ where $S' \subset S = \{s_1, \dots, s_{j-1}\}$,
and if $\rho$ is not a needed exception to a rule $\rho'' \in R_{j-1}$.
5. Optional filter step: Optionally, filters may be applied to
filter out further rules which are, e. g., helpful to explain
the knowledge contained in $\hat{Q}$ through the best found
policy so far, but which are not needed for reasoning later.</p>
        <p>After performing these steps on $\hat{Q}$, the knowledge base
$KB_{\hat{Q}}$ comprises all sets $R_j \neq \emptyset$ with the extracted rules,
representing the implicit knowledge contained in the learned
weights of $\hat{Q}$ in a compact way.</p>
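        <p>The following Python sketch (a simplification under our own assumptions; it omits the needed-exception guard of step 4, the optional filter step 5, and the efficient premise preselection mentioned in footnote 4) illustrates the extraction on the HKB representation introduced above:</p>
        <preformat>
from itertools import combinations

def extract_hkb(best_pairs, n):
    # Simplified sketch of the extraction: `best_pairs` maps a complete state
    # (tuple of n sensor values) to its best (action, weight) pair, i.e., the
    # best policy found so far (step 1 restriction).
    levels = [dict() for _ in range(n + 1)]  # levels[i]: premises of length i
    # Steps 1 and 2: fill all abstraction levels, keeping only the
    # highest-weighted rule per partial state.
    for state, (action, weight) in best_pairs.items():
        facts = list(enumerate(state))
        for size in range(n + 1):
            for subset in combinations(facts, size):
                key = frozenset(subset)
                if key not in levels[size] or weight &gt; levels[size][key][1]:
                    levels[size][key] = (action, weight)
    # Steps 3 and 4 (simplified): remove a rule if some more general rule is
    # at least as good, or already concludes the same action (the check
    # whether the rule is a needed exception is omitted here).
    for size in range(n, 0, -1):
        for key, (action, weight) in list(levels[size].items()):
            dominated = any(
                gkey &lt; key and (gweight &gt;= weight or gaction == action)
                for gsize in range(size)
                for gkey, (gaction, gweight) in levels[gsize].items()
            )
            if dominated:
                del levels[size][key]
    return levels
        </preformat>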
        <p>
          Reasoning on HKBs A reasoning algorithm on HKBs
was proposed in
          <xref ref-type="bibr" rid="ref3">(Apeldoorn and Kern-Isberner 2016)</xref>
          :
Given perceived sensor values $s_1, \dots, s_n$, this algorithm
searches an HKB upwards, starting from the bottom-most
level $R_{n+1}$, for the first rule whose premise is fulfilled. The
action concluded by this rule is then returned (see
          <xref ref-type="bibr" rid="ref3">(Apeldoorn and Kern-Isberner 2016)</xref>
          for details). By this, the algorithm
selects the most specific rule that fits the perceived sensor
values and falls back to the next, more unspecific rule as a
heuristic, in case no more specific rule with a fitting premise
could be found.
        </p>
        <p>3Note that a faster version of the algorithm improving its first
step is proposed in
          <xref ref-type="bibr" rid="ref4">(Apeldoorn and Kern-Isberner 2017)</xref>
          (cf. footnote 4).</p>
        <p>4Furthermore, since the number of possible premises of the
rules can grow drastically with the size of the given problem
instance, an efficient algorithm has to be used here. An attempt
for efficiently preselecting the rules using adapted ideas from the
APRIORI algorithm
          <xref ref-type="bibr" rid="ref1">(Agrawal et al. 1996)</xref>
          has been made in
          <xref ref-type="bibr" rid="ref4">(Apeldoorn and Kern-Isberner 2017)</xref>
          to increase the overall performance
of the extraction approach.</p>
        <p>Due to the possibility of falling back to more general rules
in case no matching rule could be found for the given
sensor values, the reasoning mechanism on an HKB allows for
drawing meaningful conclusions about states that have not
been seen before by an agent. The idea that a more
general rule serves as a rough heuristic for unknown states (for
which no better knowledge is available) will later contribute
to accelerating the learning process.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.3 Strategic Depth</title>
        <p>
          Based on an HKB, a
strategy measure was introduced in
          <xref ref-type="bibr" rid="ref5">(Apeldoorn and Volz 2017)</xref>
          which seems to adequately
reflect the commonsense strategic depth of games. The
measure was evaluated in the context of a survey. This measure
will be used later in our agent model, to let the agent judge
subjectively when to apply heuristics. According to
          <xref ref-type="bibr" rid="ref5">(Apeldoorn and Volz 2017)</xref>
          , the measure is defined as follows:
        </p>
        <p>Definition 5 (Strategic Depth Measure $d_s$) Let $KB =
\{R_1, \dots, R_{n+1}\}$ be an HKB and $\mathcal{S} = \{S_1, \dots, S_n\}$ be the set
of all sensor value sets of an agent, then the strategic depth
is defined as a function
$$d_s(KB, \mathcal{S}) := \sum_{i=1}^{n+1} \binom{n}{i-1} b^{i-1} \frac{|R_i|}{\sum_{\mathcal{S}^* \subseteq \mathcal{S},\, |\mathcal{S}^*| = i-1} \prod_{S \in \mathcal{S}^*} |S|}, \quad (1)$$
where $n + 1 = |KB| = |\mathcal{S}| + 1$ is the number of levels in
$KB$, $b$ is a weighting constant, and $R_i \in KB$ is the $i$-th level
of $KB$.</p>
        <p>As rules on a level $R_{j&gt;1}$ can be considered exceptions
to the rules on level $R_{j-1}$, the intuition behind $d_s$ is that
it measures the weighted relative number of rules (hence,
exceptions) represented by an HKB.</p>
        <p>
          In
          <xref ref-type="bibr" rid="ref5">(Apeldoorn and Volz 2017)</xref>
          , also a modified version of
the measure is proposed, where $d_s$ is additionally divided by
the sum of weights to normalize it to values in the range $[0, 1]$.
For our purpose, this version of the measure is extremely
useful, since it allows our agent model to estimate the
strategic depth of a previously unknown environment relative
to the maximum strategic depth that can be expected, given
the sets of possible sensor values $S_1, \dots, S_n$. Thus, in the
following, the normalized version of the measure (according to
          <xref ref-type="bibr" rid="ref5">(Apeldoorn and Volz 2017)</xref>
          ) will be used:
        </p>
        <sec id="sec-3-5-1">
          <title>Definition 5 (Normalized Strategic Depth ds) Let KB,</title>
          <p>S, n, b and ds be as in Definition 5. Then the normalized
strategic depth is defined as a function
ds(KB; S) :=</p>
          <p>Pn+1
i=1
ds(KB; S)</p>
          <p>n
i
1
bi 1
:
(2)</p>
        <p>In the following, $\bar{d}_s$ will be used in our agent model to
allow the agent to subjectively estimate when it could be useful
to exploit heuristics during a learning process in an a priori
unknown environment: If $\bar{d}_s$ falls below a certain threshold
$th$, then this indicates that heuristics could possibly be
applied successfully.</p>
        <p>The intuition behind this is that the (normalized)
strategic depth based on an HKB which was extracted from
a weighted state-action pair representation learned by an
agent will successively decrease as the agent's learning
process progresses: Since the agent's knowledge about the
problem increases, the problem appears successively less
complex to the agent, and the subjectively measured normalized
strategic depth converges to the de facto strategic depth of
the task to be learned (i. e., the strategic depth according to
the HKB of the optimal policy for solving the task). We will
demonstrate this in the following on several examples.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Examples for Subjective Normalized Strategic Depth</title>
        <p>
          We consider four different grid worlds of different structural
complexities where an agent has to learn to navigate from a
starting point A to a target point B (see Figure 2). We use
standard Q-Learning
          <xref ref-type="bibr" rid="ref18">(Watkins 1989)</xref>
          as a learning approach
with a learning rate of $\alpha = 0.1$, a random action
probability of $\epsilon = 0.1$ and a discount factor of $\gamma = 0.9$. We run every
scenario for 250 runs and we measure the normalized
strategic depth subjectively perceived by the agent (averaged over
30 repetitions) as a function of the number of runs already
performed. Two phenomena can be observed in Figure 2:
The subjective strategic depth converges to the de facto
strategic depth of the respective scenario as the agent
behavior converges to the optimal policy.
        </p>
        <p>The subjective strategic depth decreases in general as the
learning process progresses, since the agent's knowledge
about the scenario increases and the scenarios
therefore appear simpler to the agent. (In case of the fourth
scenario, after the first runs, the agent seems to
underestimate the depth of the scenario, since the scenario locally
has straight path structures that together form a more
complex path from a global point of view.)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Agent Model</title>
      <p>This section introduces our model of a learning agent which
is able to determine and exploit heuristics based on HKBs
during a learning process in a previously unknown
environment. In our agent model, a sub-symbolic learning approach
can be used to learn weighted state-action pairs (where the
maximum weight determines the preferred action, given a
state). Figure 1 illustrates the main ideas of our agent model,
which mainly consists of two parts:
an initialization part, which is executed every time before
a new learning episode starts, and
a common agent cycle, which is executed every time step.</p>
      <fig id="fig1">
        <caption><p>Figure 1: Overview of the agent model. In the initialization (executed before a new episode starts), the agent extracts an HKB and sets the action selection mode: HKB action selection if $\bar{d}_s &lt; th$; if HKB action selection is active and the global reward falls below the maximum global reward, the mode is set back to selection by $\hat{Q}$. In the agent cycle (executed every time step), the agent performs a random action with probability $\epsilon$ or, with probability $1 - \epsilon$, selects an action by $\hat{Q}$ or by the extracted HKB (depending on the mode), perceives percepts and reward from the environment, and updates the weights.</p></caption>
      </fig>
      <p>In the former, the agent decides whether it relies its
decisions on the weighted state-action pairs (represented by the
$\hat{Q}$ matrix) or on the heuristics (represented by the currently
extracted HKB) in the upcoming learning episode. This
decision is made depending on the $\bar{d}_s$ value calculated from
the currently extracted HKB.</p>
      <p>In the latter, the agent either chooses a random action (for
exploration purposes) or decides to choose its action
according to the $\hat{Q}$ matrix or the extracted HKB, respectively,
depending on the decision made during the initialization. Note
that in Figure 1, components belonging to the sub-symbolic
learning approach are indicated by gray coloring, whereas
the parts belonging to our heuristics extension are colored
white. The gray parts could be replaced by a different
learning approach.</p>
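      <p>Putting the sketches from the previous sections together, one learning episode of the agent model could look as follows in Python (the environment interface and the attribute names are assumptions made for illustration, not part of the original model description):</p>
      <preformat>
import random

def run_episode(env, agent, th=0.2, eps=0.1):
    # Initialization (before the episode starts): extract the HKB and set
    # the action selection mode from the normalized strategic depth.
    agent.hkb = extract_hkb(agent.best_pairs, n=len(agent.sensor_value_sets))
    use_hkb = normalized_strategic_depth(agent.hkb,
                                         agent.sensor_value_sets) &lt; th
    state, done = env.reset(), False
    while not done:                           # agent cycle, every time step
        if random.random() &lt; eps:             # random action (probability eps)
            action = random.choice(agent.actions)
        elif use_hkb:                         # action selection by the HKB
            action = conclude_action(agent.hkb, state)
        else:                                 # greedy selection on Q^
            action = best_action(agent.q, state, agent.actions)
        next_state, reward, done = env.step(action)
        q_update(agent.q, state, action, reward, next_state, agent.actions)
        state = next_state
      </preformat>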
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>
        This section first evaluates the approach in the four grid
world scenarios from Figure 2 in Section 3.3. Later, the
approach will be evaluated additionally in a more realistic
example: a game from the GVGAI framework
        <xref ref-type="bibr" rid="ref10">(Perez-Liebana
et al. 2016)</xref>
        .
      </p>
      <sec id="sec-5-1">
        <title>Evaluation in the Context of Grid World</title>
      </sec>
      <sec id="sec-5-2">
        <title>Scenarios</title>
        <p>We run the four grid world scenarios from Figure 2 in
Section 3.3 for 50 runs each and measure the percentage of how
many times the optimal policy was found by the agent over
200 repetitions of the experiment. As learning approach, we
again use Q-Learning with the same parameters used for
measuring the subjective strategic depth at the end of
Section 3.3. As threshold parameter to determine the
exploitation of found heuristics by means of the subjective strategic
depth $\bar{d}_s$, we choose $th = 0.2$. According to
Figure 2, this means that—on average—the agent would try
to exploit the found heuristics after
25 runs in case of the first scenario,
75 runs in case of the second scenario and
100 runs in case of the third scenario.</p>
        <p>For the second and the third scenario, this may sound
confusing since we perform a total of only 50 runs.
However, the reader should be aware that Figure 2 shows
the average development of the $\bar{d}_s$ measure and thus there are
enough cases where the agent individually decides to exploit
a found heuristic that leads to an optimal policy.</p>
        <p>As for the fourth scenario, the agent never considers
exploiting any heuristics, since the subjective strategic depth
does not decrease below $th = 0.2$. This can be interpreted in
a way that the scenario comprises too many exceptions, such
that an exploitation of heuristics does not make sense from
the agent's point of view (this point of view is in fact what
the parameter $th$ controls, depending on whether the desired
agent should be more “heuristics-affine” or more
“conservative”). Table 1 contains the comparison of our agent model
against a standard Q-Learning agent with the same
parameters for the learning part. In addition, for reasons of
comparison, Table 1 also provides results of a heuristics-only version
of the agent model5, where the agent was enforced to always rely its
decisions on the extracted heuristics (i. e., in the
initialization phase in Figure 1 the action selection mode is
always set to HKB action selection and the weights contained
in the $\hat{Q}$ matrix are never considered directly for action
selection in the agent cycle).</p>
        <p>As can be seen in Table 1, standard Q-Learning rarely
manages to solve the grid worlds within the first 50 runs,
whereas our heuristics approach clearly outperforms these
results. As for the heuristics-only version, the results are
slightly worse: This seems to be the case because the agent
starts relying its decisions on the heuristics too early (when
they are still nearly random at the beginning of the learning
process). This may prevent the agent from further exploring
the environment to an adequate extent and may result in
locally optimal policies.</p>
        <p>One could argue that standard Q-Learning is a rather old
approach and that more elaborate learning approaches are
available nowadays. However, if a better learning approach
is chosen in our setup (better in the sense that it
converges faster to the optimal policy), the extracted HKBs
and hence the measure $\bar{d}_s$ will be more accurate as well,
and therefore better heuristics could be exploited earlier during
the learning process. This means that in our approach, both
the heuristics and the measure $\bar{d}_s$ (to decide when to exploit
them) will improve with the quality of the learning approach
used for the agent.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Evaluation in the Context of a GVGAI Game</title>
        <p>
          We now evaluate the agent model in a more realistic
example by considering the game Camel Race from the GVGAI
framework
          <xref ref-type="bibr" rid="ref10">(Perez-Liebana et al. 2016)</xref>
          . In this game, the
agent has to control a camel which must be moved to the
opposite side of the screen—faster than any other camel in
the game (additionally avoiding obstacles in the higher
levels). Figure 3 shows the first and the third level of the game
(with obstacles).6
        </p>
        <p>5Thanks to an anonymous reviewer for proposing this idea.</p>
        <p>6The game was slightly modified for our purpose by reducing
the number of camels, in order to reduce the dimensionality of the
state-action space without changing the basic game mechanics.</p>
        <p>The agent’s state space is described by the sensor value
sets SxC1 SxC2 SyC2 SxC3 containing the coordinates
of the camels in the game (where C1; :::; C3 correspond to
the three camels, C2 is the camel in the middle controlled
by the agent and C1; C3 only move in x direction) and
A = fup; down; left; right; noneg are the five actions that
can be performed by the agent. As reward, the agent
perceives the current distance to the fastest opponent camel in
x direction.</p>
        <p>Even though the game is rather simple, it has an interesting
aspect: Due to the dynamics of the environment caused by the
movement of the two opponent camels, our agent perceives
new and previously unseen states in nearly every game tick
of the early learning phase. Thus, the agent has to explore
large parts of the state-action space to improve its behavior,
although the game could easily be won by applying a rough
heuristic like always move right ($\top \Rightarrow right\ [1.0]$) for the
first level. This renders the game an eligible test environment
for our agent model.</p>
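        <p>In the HKB representation used in the sketches above, this heuristic would simply be a single most-general rule (shown here for illustration only):</p>
        <preformat>
# "Always move right" as a one-rule HKB: only R_1, with an empty premise.
camel_race_heuristic = [{frozenset(): ("right", 1.0)}]   # top => right [1.0]
        </preformat>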
        <p>We perform 100 runs for each level and measure the
percentage of wins by the agent over 30 repetitions. Again,
standard Q-Learning with and without the heuristics approach,
as well as the heuristics-only approach, are used, given
the same standard parameters as provided in Section 5.1.
Table 2 shows the results for the first and the third level of
the game: With standard Q-Learning, the agent is not able
to win the game within 100 runs, whereas using the
heuristics approach, the agent wins both levels in over 50% of the
cases. Surprisingly, the heuristics-only approach performs
even better here. This seems to be the case since—although
the game has some dynamics and can thus be considered
more complex than the grid world scenarios—the
considered levels allow for rougher heuristics than the grid worlds:
In contrast to the grid worlds, the goal to be reached is not
located in a single corner; instead, the game can be won by
simply reaching the rightmost side of the screen (which can
already be achieved by a very rough heuristic).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this paper, we described a model of a learning agent which
is able to find and to exploit heuristics without a priori
knowledge in unknown environments. We showed on
several examples (four grid world scenarios and one game from
the GVGAI framework) that our approach drastically
accelerates the learning process to find optimal policies.</p>
      <p>
        However, the approach is not perfect yet and leaves
room for future work: (1) The runtime to extract an HKB
should be further improved (a faster algorithm has been
proposed in
        <xref ref-type="bibr" rid="ref4">(Apeldoorn and Kern-Isberner 2017)</xref>
        already;
cf. footnote 4) and (2) a quality criterion could be
incorporated into the measure that helps to decide whether
heuristics should be exploited, depending on previous experiences
with applying these heuristics. The latter could possibly be
useful to help prevent an agent from repeatedly trying out
bad heuristics.
      </p>
        <p>Furthermore, it could be interesting to extend the model
with revision techniques for the found heuristics and to
incorporate mechanisms to store and reuse heuristics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Mannila,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Srikant,
          <string-name>
            <given-names>R.</given-names>
            ; Toivonen, H.; and
            <surname>Verkamo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>1996</year>
          .
          <article-title>Fast Discovery of Association Rules</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Advances in Knowledge discovery and Data Mining</article-title>
          . Cambridge, MA, USA: MIT Press.
          <volume>307</volume>
          -
          <fpage>328</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Apeldoorn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kern-Isberner</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>When should learning agents switch to explicit knowledge?</article-title>
          In Benzmüller, C.;
          <string-name><surname>Sutcliffe</surname>, <given-names>G.</given-names></string-name>; and Rojas, R., eds.,
          <source>GCAI 2016. 2nd Global Conference on Artificial Intelligence</source>
          , volume
          <volume>41</volume>
          of EPiC Series in Computing,
          <volume>174</volume>
          -
          <fpage>186</fpage>
          . EasyChair Publications.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Apeldoorn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kern-Isberner</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Towards an understanding of what is learned: Extracting multi-abstraction-level knowledge from learning agents</article-title>
          . In
          <string-name>
            <surname>Rus</surname>
          </string-name>
          , V., and
          <string-name>
            <surname>Markov</surname>
          </string-name>
          , Z., eds.,
          <source>Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference</source>
          ,
          <volume>764</volume>
          -
          <fpage>767</fpage>
          . Palo Alto, California: AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Apeldoorn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Volz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Measuring strategic depth in games using hierarchical knowledge bases</article-title>
          .
          <source>In 2017 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          ,
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          . IEEE. To be published.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Borgida</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Etherington</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Hierarchical knowledge bases and efficient disjunctive reasoning</article-title>
          . In Brachman, R. J.; Levesque, H. J.; and Reiter, R., eds.,
          <source>Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning</source>
          ,
          <volume>33</volume>
          -
          <fpage>43</fpage>
          . San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Junges</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name><surname>Klügl</surname>, <given-names>F.</given-names></string-name>
          <year>2013</year>
          .
          <article-title>Learning tools for agent-based modeling and simulation</article-title>
          .
          <source>KI - Künstliche Intelligenz</source>
          <volume>27</volume>
          (
          <issue>3</issue>
          ):
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Leopold</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Kern-Isberner</surname>, <given-names>G.</given-names></string-name>; and
          <string-name><surname>Peters</surname>, <given-names>G.</given-names></string-name>
          <year>2008</year>
          .
          <article-title>Belief Revision with Reinforcement Learning for Interactive Object Recognition</article-title>
          .
          <source>ECAI 2008 - 18th European Conference on Artificial Intelligence Proceedings</source>
          . Amsterdam: IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Causal Learnability</article-title>
          .
          <source>In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI</source>
          <year>2011</year>
          ),
          <fpage>1014</fpage>
          -
          <lpage>1020</lpage>
          . Palo Alto, California: AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Perez-Liebana</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Samothrakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Togelius</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name><surname>Schaul</surname>, <given-names>T.</given-names></string-name>; and
          <string-name><surname>Lucas</surname>, <given-names>S. M.</given-names></string-name>
          <year>2016</year>
          .
          <article-title>General video game AI: Competition, challenges and opportunities</article-title>
          . In Schuurmans, D., and
          <string-name>
            <surname>Wellman</surname>
          </string-name>
          , M., eds.,
          <source>Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference</source>
          ,
          <volume>4335</volume>
          -
          <fpage>4337</fpage>
          . Palo Alto, California: AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Shapiro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and Shachter,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2001</year>
          .
          <article-title>Using Background Knowledge to Speed Reinforcement Learning in Physical Agents</article-title>
          .
          <source>AGENTS '01 Proceedings of the fifth international conference on Autonomous agents. New York: ACM</source>
          .
          <fpage>254</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sardina</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Padgham</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name><surname>James</surname>, <given-names>G.</given-names></string-name>
          <year>2011</year>
          .
          <article-title>Integrating learning into a BDI agent for environments with changing dynamics</article-title>
          . In Walsh, T., ed.,
          <source>Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence</source>
          ,
          <fpage>2525</fpage>
          -
          <lpage>2530</lpage>
          . Menlo Park, California: AAAI Press/International Joint Conferences on Artificial Intelligence.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name><surname>Peterson</surname>, <given-names>T.</given-names></string-name>; and
          <string-name><surname>Merrill</surname>, <given-names>E.</given-names></string-name>
          <year>1999</year>
          .
          <article-title>A hybrid architecture for situated learning of reactive sequential decision making</article-title>
          .
          <source>Applied Intelligence</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>109</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Knowledge Extraction from Reinforcement Learning</article-title>
          . In
          <source>New Learning Paradigms in Soft Computing</source>
          ,
          <fpage>170</fpage>
          -
          <lpage>180</lpage>
          . Berlin Heidelberg: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . Cambridge, Massachusetts; London, England: The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Learning from Delayed Rewards</article-title>
          . Ph.D. Dissertation, University of Cambridge, England.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>