An Agent-Based Learning Approach for Finding and Exploiting Heuristics in
                           Unknown Environments

                                         Daan Apeldoorn, Gabriele Kern-Isberner
                                               Technische Universität Dortmund
                            daan.apeldoorn@tu-dortmund.de, gabriele.kern-isberner@cs.tu-dortmund.de


                            Abstract                                 useful heuristics in a previously unknown environment. Dur-
                                                                     ing a Reinforcement Learning process, the agent creates a
  This paper presents an agent-based hybrid symbolic/sub-
  symbolic learning approach as a basic model for the human          symbolic knowledge base from learned experiences which
  ability of quickly recognizing and exploiting heuristic rules in   comprises rule-based knowledge on multiple levels of ab-
  an initially unknown environment. Instead of mechanically          straction as a model for heuristics. We do not incorporate
  iterating a task until the optimal policy was found (as usu-       any a priori knowledge (as done e. g. in (Shapiro, Langley,
  ally done by common Reinforcement Learning techniques),            and Shachter 2001)—although this would be possible in our
  the agent finds heuristics expressed by symbolic knowledge         model), instead the heuristics are inferred by the agent itself
  bases which allow for an intelligent exploitation of knowl-        during the learning process. Furthermore, the agent is able
  edge gained from few experiences. For this purpose, a mea-         to decide itself at which point during the learning process
  sure is proposed which allows an agent to decide on its own at     the decisions should be relied on the found heuristics rather
  which point during the learning process, its decisions should
                                                                     than on the weighted state-action pairs of the Reinforcement
  rely on the recognized heuristics rather than on a learned
  state-action pair representation. The approach will be eval-       Learning process. More concretely, the following contribu-
  uated both in the context of several grid world scenarios          tions are made:
  and a game from the GVGAI framework. It will be shown              • A hybrid symbolic/sub-symbolic agent model which
  that our approach outperforms standard Q-Learning in the             incorporates both symbolic rule-based knowledge and
  selected scenarios and arguments will be provided that also
                                                                       weighted state-action pair representations learned by a
  other agent-based machine learning techniques could profit
  from the proposed approach.                                          sub-symbolic learning approach.
                                                                     • An approach for estimating the subjective structural com-
                     1     Introduction                                plexity of a previously unknown environment from an
                                                                       agent perspective during a learning process (based on a
Common machine learning techniques used in the context
                                                                       “commonsense” measure for strategic depth (Apeldoorn
of learning agents (e. g., Reinforcement Learning based ap-
                                                                       and Volz 2017)).
proaches like Q-Learning (Watkins 1989), (Sutton and Barto
1998)) are usually based on a trial-and-error principle: In             Compared to our previous works on knowledge base ex-
unknown environments, agents based on such approaches                traction presented in (Apeldoorn and Kern-Isberner 2016)
start with purely random actions and successively improve            and (Apeldoorn and Kern-Isberner 2017), the novelty of
their behavior as the learning process progresses. Having            this paper lies in the incorporation of our extraction ap-
no a priori knowledge about the environment, such agents             proach into a learning agent model which is able to de-
have to explore large amounts of the state-action space,             cide on its own when to exploit the extracted knowledge
even if the underlying structure of the environment follows          as a heuristic in an unknown environment. This is realized
simple rules.                                                        by incorporating the strategic depth measure introduced in
   Humans, in contrast, are known for successfully applying          (Apeldoorn and Volz 2017) as a decision criterion.
rough heuristics learned from very few examples. This, of               As a side-product, the resulting agent model is still able
course, can lead to wrong decisions (see, e. g., (Dörner 1992)      to explain the learned knowledge and the made decisions
for several examples), however, in the common cases many             to some extent at any point during the learning process, as
problems can be quickly solved by applying such heuris-              described in (Apeldoorn and Kern-Isberner 2017).
tics derived from few examples. One reason for that is, that            In Section 2 related similar approaches will be discussed.
many problem domains (e. g., games) are geared towards               A brief introduction to the fundamental preliminary works
the commonsense human way of thinking, and these kinds               will be given in Section 3. The agent model will be intro-
of problems can therefore be quickly solved by exploiting            duced in Section 4. In Section 5 the model will be eval-
such heuristics.                                                     uated in the context of several simple grid world scenar-
   In this paper, we propose a hybrid symbolic/sub-symbolic          ios and later in different levels of a game from the Gen-
agent model which is able to autonomously find and exploit           eral Video Game Playing Artificial Intelligence (GVGAI)
framework, which is usually used for the GVGAI compe-            a object recognition task that the considered learning agent
tition where agents have to compete in playing previously        can benefit from the combined approach (with and without
unknown video games (Perez-Liebana et al. 2016).                 involving additional background knowledge). However, in
                                                                 contrast to our approach, it was not investigated, in which
                   2    Related Work                             phase during the learning process the agent should rely its
                                                                 behavior more on the explicit, symbolic part of the knowl-
Several attempts have been made so far to closer relate sym-
                                                                 edge (i. e., the ranking function) rather than on the implicit
bolic knowledge representation and sub-symbolic machine
                                                                 representation of the weights learned through the reinforce-
learning approaches in the context of agents. The basic ideas
                                                                 ment learning part. Furthermore, the symbolic part is not pri-
can be roughly assigned to one of the following groups (rep-
                                                                 marily dedicated to represent a model for heuristics.
resentatives for every group will be discussed subsequently):
1. Extraction techniques to gain different knowledge rep-                            3    Preliminaries
   resentations (e. g., propositional rules) learned from data
                                                                 This section introduces the two preliminary works needed
   through machine learning techniques
                                                                 for our approach: First, the concept of exception-tolerant
2. Methods to integrate a priori knowledge into machine          Hierarchical Knowledge Bases (HKBs) (Apeldoorn and
   learning approaches to support the learning process (e. g.,   Kern-Isberner 2016, 2017) will be briefly introduced and
   by shaping the search space with a priori knowledge and       how they can be retrieved (Sections 3.1 and 3.2).1 HKBs
   later refining this knowledge with learning techniques)       will be used later as a model for heuristics. After that, a
3. Cognitive architectures that combine 1. and 2.                “commonsense” strategy measure (recently introduced in
                                                                 (Apeldoorn and Volz 2017)), is described which will be used
Representatives of the first group are, e. g., (Sun 2002),       later in our agent model to let an agent subjectively estimate
where extraction techniques have been proposed to gain sim-      whether or not heuristics should be exploited (Section 3.3).
ple rules or plans from reinforcement learning. In (Junges
and Klügl 2013), decision trees are created from learned        3.1    Definition of HKBs
weighted state-action representations with the primary goal      To introduce HKBs and how they can be learned, we closely
of supporting agent developers in the implementation of ad-      follow (Apeldoorn and Kern-Isberner 2016, 2017). There,
equate agent behavior. These works focus on the extraction       an extraction approach is proposed which is able to extract
of knowledge in different forms. In contrast, in this paper,     an HKB from a weighted state-action pair representations
we propose to (re)incorporate the extracted knowledge back       learned by a common reinforcement learning approach like
into the learning process as a model for finding and exploit-    Q-Learning (Watkins 1989).2
ing rough heuristics in a previously unknown environment.
                                                                    This section only provides the main definitions needed
   As representatives for the second group, e. g., (Singh et     to understand the basic idea of HKBs. For more details on
al. 2011) and (Shapiro, Langley, and Shachter 2001) can          HKBs, the reader should refer to the original literature.
be considered: In these approaches, machine learning tech-
niques are combined with mechanisms to incorporate a pri-        Basics of the Agent Model We consider an agent which is
ori knowledge to accelerate the learning process or to later     learning a policy by acting autonomously in a previously un-
refine and adapt the a priori knowledge to a dynamically         known environment. The agent is equipped with n sensors
changing environment (in (Singh et al. 2011) this is realized    through which it can perceive its current state in the envi-
in the context of a BDI agent model where initial beliefs        ronment. The agent is able to perform actions from a prede-
which can be refined during a learning process). In contrast,    fined action space and can furthermore perceive, whether or
our approach works (nearly) without any a priori knowledge       not the performed actions were good, in form of (numeric)
about the environment.                                           rewards. The perceived rewards are then used to learn a
   The third group is represented, e. g., by models such as      weighted state-action pair representation where the weights
C LARION (Sun, Peterson, and Merrill 1999), a cognitive          determine which action has to be performed, given a per-
architecture which incorporates both sub-symbolic learning       ceived state (usually the one with the highest weight).
and symbolic knowledge during learning and decision mak-            More formally, in such a representation, a state s is an el-
ing. In this model, a sub-symbolic representation is com-        ement of a multi-dimensional state space S = S1 × ... × Sn
bined with a rule network on top, and both the sub-symbolic      where n is the number of the agent’s sensors (through which
and the rule level are used in the agent’s decision making       the agent is able to perceive its state in the environment)
process. In our approach, however, we assume a reasonable           1
point during a learning process, where a transition from a             Note that in (Apeldoorn and Kern-Isberner 2016, 2017),
sub-symbolic (implicit) knowledge to a symbolic (explicit)       exception-tolerant Hierarchical Knowledge Bases were originally
                                                                 called Hierarchical Knowledge Bases—but to avoid confusion with
knowledge representation should take place with the pri-
                                                                 the Hierarchical Knowledge Bases by Borgida et al. (Borgida and
mary objective to quickly find useful heuristics in a previ-     Etherington 1989) and in order to further outline the primary pur-
ously unknown environment. Another representative related        pose of our approach, we call them exception-tolerant here.
to the third group, the S PHINX approach, is described in            2
                                                                       Note that the used learning approach is not of importance as
(Leopold, Kern-Isberner, and Peters 2008), where reinforce-      long as it results in a weighted state-action pair representation,
ment learning is combined with ranking functions and belief      where the weights determine which action has to be chosen given
revision to support the learning process. It was shown on        a perceived state (usually the one with the highest weight).
and every Si is a set of possible sensor values of the cor-        3.2     Learning an HKB
responding sensor. Furthermore, the agent selects actions          An HKB can be extracted from a weighted state-action
from a predefined action set A and the learned weights are         pair representation Q̂ (that is learned, e. g., through a Re-
stored in a multi-dimensional matrix Q̂ = (qs1 ,...,sn ,a ) with   inforcement Learning technique) using the following ap-
si ∈ Si and a ∈ A. The weights can be learned by differ-           proach (Apeldoorn and Kern-Isberner 2016):3
ent machine learning approaches, provided that the learn-             The approach takes a weighted state-action pair repre-
ing approach converges such that given a state, the high-
est weight determines the best action to be selected (i. e.,       sentation Q̂ as input and returns an HKB KB Q̂ which re-
amax                                                               flects the knowledge contained in Q̂ by performing the
  s1 ,...,sn = arg 0max qs1 ,...,sn ,a0 ).
               a ∈A                                                following steps:
Exception-Tolerant Hierarchical Knowledge Bases                    1. Initial creation of rule sets: In the first step, the multiple
An HKB consists of rules which are organized on differ-               abstraction levels R1 , ..., Rn+1 of the knowledge base are
ent levels of abstraction. In contrast to Exception Lists             initially filled with rules. We only consider state-action
(Michael 2011), an HKB can handle multiple rules per                  pairs here that contribute to the best policy found during
level and the rules also comprise weights. To be able to              the learning process so far.4
define these rules, two different kinds of states and two
different kinds of rules will be distinguished (Apeldoorn          2. Removal of worse rules: In all sets Rj , a rule ρ ∈ Rj
and Kern-Isberner 2017):                                              is removed, if there exists another rule σ ∈ Rj with the
                                                                      same partial state as premise having a higher weight (i. e.,
Definition 1 (Complete States/Partial States) A com-                  in every set Rj only the best rules for a given partial state
plete state is a conjunction s := s1 ∧ ... ∧ sn of all val-           are kept).
ues si currently perceived by an agent’s sensors, where n
is the number of sensors (and every perceived sensor value                                               V In all sets Rj>1 , a
                                                                   3. Removal of worse more specific rules:
                                                                      rule ρ ∈ Rj with premise pρ = s∈Sρ s, conclusion aρ
si ∈ Si of the corresponding sensor value set Si is assumed
                                                                      and weight wρ is removed, if there V exists a more general
to be a fact in the V
                    agent’s current state). A partial state is a
                                                                      rule σ ∈ Rj 0 <j with premise pσ = s∈Sσ s where Sσ ⊂
conjunction s := s0 ∈S s0 of a subset S ⊂ {s1 , ..., sn } of
the sensor values of a complete state.                                Sρ = {s1 , ..., sj−1 } and weight wσ ≥ wρ .

Definition 2 (Complete Rules/Generalized Rules) Com-                                              V In all sets Rj , a rule ρ ∈
                                                                   4. Removal of too specific rules:
                                                                      Rj>1 with premise pρ = s∈Sρ and conclusion aρ is
plete rules and generalized rules are of the form pρ ⇒
aρ [wρ ], where pρ is either a complete state (in case of an          removed, if there exists a more general rule σ ∈ Rj 0 <j
complete rule) or a partial state (in case of a generalized           with the same V action aσ = aρ as conclusion and with
rule), the conclusion aρ ∈ A is an action of an agent’s ac-           premise pσ = s∈Sσ s where Sσ ⊂ Sρ = {s1 , ..., sj−1 }
tion space A and wρ ∈ [0, 1] is the rule’s weight.                    and if ρ is not a needed exception to a rule τ ∈ Rj−1 .
                                                                   5. Optional filter step: Optionally, filters may be applied to
  Thus, complete rules map complete states to actions and             filter out further rules which are, e. g., helpful to explain
generalized rules map partial states to actions. An HKB can
                                                                      the knowledge contained in Q̂ through the optimal found
now be defined as follows:
                                                                      policy so far, but which are not needed for reasoning later.
Definition 3 (Exception-Tolerant Hierarchical Knowl-                 After performing these steps on Q̂, the knowledge base
edge Base) An exception-tolerant Hierarchical Knowl-
edge Base (HKB) is an ordered set KB := {R1 , ..., Rn+1 }          KB Q̂ comprises all sets Rj 6= ∅ with the extracted rules
of n + 1 rule sets, where n is the number of sensors (i. e.,       representing the implicit knowledge contained in the learned
the number of state space dimensions). Every set Ri<n+1            weights of Q̂ in a compact way.
contains generalized rules and the set Rn+1
                                         V contains com-           Reasoning on HKBs A reasoning algorithm on HKBs
plete rules, such that every premise pρ = s∈Sρ s of a rule
                                                                   was proposed in (Apeldoorn and Kern-Isberner 2016):
ρ ∈ Ri is of length |Sρ | = i − 1.                                 Given perceived sensor values s1 , ..., sn , this algorithm
   According to Definition 3, the set R1 contains the most         searches an HKB upwards, starting from the bottom-most
general rules (with empty premises) and the set Rn+1 con-          level Rn+1 , for the first rule whose premise is fulfilled. This
tains the most specific (i. e., complete) rules.                   rule is then returned as concluding action (see (Apeldoorn
   For the relations of rules, the term of needed exception            3
                                                                         Note that a faster version of the algorithm improving its first
will be used, according to the following definition (cf. (Apel-    step is proposed in (Apeldoorn and Kern-Isberner 2017) (cf. foot-
doorn and Kern-Isberner 2017)):                                    note 4).
                                                                       4
                                                                         Furthermore, since the number of possible premises of the
Definition 4 (Needed Exception) A rule ρ ∈ Rj>1   V is an          rules can grow drastically with the size of the given problem in-
exception to a rule τ ∈ Rj−1 with premise pτ = s∈Sτ s,             stance, an efficient algorithm has to be used here. An attempt
action aτ as conclusion and weight wτ , if Sτ ⊂ Sρ and             for efficiently preselecting the rules using adapted ideas from the
aρ 6= aτ . The exception is needed,
                                 V if there exists no other        A PRIORI algorithm (Agrawal et al. 1996) has been made in (Apel-
rule υ ∈ Rj−1 with premise pυ = s∈Sυ s and action aυ as            doorn and Kern-Isberner 2017) to increase the overall performance
conclusion where Sυ ⊂ Sρ , aυ = aρ and wυ > wτ .                   of the extraction approach.
and Kern-Isberner 2016) for details). By this, the algorithm             In the following, d¯s will be used in our agent model to al-
selects the most specific rule that fits to the perceived sensor      low the agent to estimate subjectively when it could be useful
values and falls back to the next more unspecific rule as a           to exploit heuristics during a learning process in an a priori
heuristic, in case no more specific rule with a fitting premise       unknown environment: If d¯s falls below a certain threshold
could be found.                                                       th then this indicates that heuristics could possibly be ap-
   Due to the possibility of falling back to more general rules       plied successfully.
in case no matching rule could be found for the given sen-               The intuition behind this is that the (normalized) strate-
sor values, the reasoning mechanism on an HKB allows for              gic depth based on an HKB which was extracted from
drawing meaningful conclusions over states that have not              a weighted state-action pair representation learned by an
been seen before by an agent. The idea that a more gen-               agent, will successively decrease as the agent’s learning pro-
eral rule serves as a rough heuristic for unknown states (for         cess progresses: Since the agent’s knowledge about the prob-
which no better knowledge is available) will later contribute         lem increases, the problem appears successively less com-
to accelerate the learning process.                                   plex to the agent and the subjectively measures normalized
                                                                      strategy depth converges to the de facto strategic depth of
3.3   Strategic Depth                                                 the task to be learned (i. e., the strategic depth according to
Based on an HKB, in (Apeldoorn and Volz 2017), a strat-               the HKB of the optimal policy to solve the task). We will
egy measure was introduced which seems to adequately re-              demonstrate this in the following on several examples.
flect the commonsense strategic depth of games. The mea-              Examples for Subjective Normalized Strategic Depth
sure was evaluated in the context of a survey. This measure           We consider four different grid worlds of different structural
will be used later in our agent model, to let the agent judge         complexities where an agent has to learn to navigate from a
subjectively when to apply heuristics. According to (Apel-            starting point A to a target point B (see Figure 2). We use
doorn and Volz 2017), the measure is defined as follows:              standard Q-Learning (Watkins 1989) as a learning approach
Definition 4 (Strategic Depth Measure ds ) Let KB =                   with a learning rate of α = 0.1, a random action probabil-
{R1 , ..., Rn+1 } be an HKB and S = {S1 , ..., Sn } be the set        ity of  = 0.1 and a discount factor γ = 0.9. We run every
of all sensor value sets of an agent, then the strategic depth        scenario for 250 runs and we measure the normalized strate-
is defined as a function                                              gic depth subjectively perceived by the agent (averaged over
                                                                      30 repetitions) in dependency of the number of runs already
                       n+1
                       X n                 |Ri |                    performed. Two phenomena can be observed in Figure 2:
       ds (KB, S) :=                bi−1 P Q            , (1)
                             i−1                    |S|               • The subjective strategic depth converges to the de facto
                       i=1
                                             S⊆S S∈S                    strategic depth of the respective scenario as the agent be-
                                          |S|=i−1
                                                                        havior converges to the optimal policy.
where n + 1 = |KB| = |S| + 1 is the number of levels in               • The subjective strategic depth decreases in general as the
KB, b is a weighting constant, and Ri ∈ KB is the i-th level            learning process progresses, since the agent’s knowledge
of KB.                                                                  about the scenario increases, and therefore the scenarios
                                                                        appears to be simpler to the agent. (In case of the fourth
   As rules on a level Rj>1 can be considered exceptions                scenario, after the first runs, the agent seems to underesti-
to the rules on level Rj−1 , the intuition behind ds is that            mate the depth of the scenario, since the scenario locally
it measures the weighted relative number of rules (hence,               has straight path structures that form together a more com-
exceptions) represented by an HKB.                                      plex path from a global point of view.)
   In (Apeldoorn and Volz 2017), also a modified version of
the measure is proposed, where ds is additionally divided by
the sum of weights to normalize it to values of range [0, 1].                             4    Agent Model
For our purpose, this version of the measure is extremely             This section introduces our model of a learning agent which
useful, since it allows our agent model to estimate the strate-       is able to determine and exploit heuristics based on HKBs
gic depth of a previously unknown environment relatively              during a learning process in a previously unknown environ-
to the maximum strategic depth that can be expected, given            ment. In our agent model, a sub-symbolic learning approach
the sets of possible sensor values S1 , ..., Sn . Thus, in the fol-   can be used to learn weighted state-action pairs (where the
lowing, the normalized version of the measure (according to           maximum weight determines the preferred action, given a
(Apeldoorn and Volz 2017)) will be used:                              state). Figure 1 illustrates the main ideas of our agent model
                                                                      which mainly consists of two parts:
Definition 5 (Normalized Strategic Depth d¯s ) Let KB,
S, n, b and ds be as in Definition 5. Then the normalized             • an initialization part which is executed every time before
strategic depth is defined as a function                                a new learning episode starts, and
                                ds (KB, S)                            • a common agent cycle which is executed every time step.
            d¯s (KB, S) :=                     .             (2)
                             Pn+1    n       i−1                      In the former, the agent decides, whether it relies its deci-
                                           b
                              i=1   i−1                               sions on the weighted state-action pairs (represented by the
                                                                      Q̂ matrix) or on the heuristics (represented by the current
                                                                        if HKB action                           Agent
                                                                        selection and
                                                                        glob. reward                     if   < th
                                                                        < max. glob.
                                                                                                                             Initialization
                                                                        reward
                                                                                                                             (executed before a
             Percepts         ; Reward
                                                                                                                             new episode starts)
                                                                       Set     action selection   Set HKB action selection


                                                    Update Weights

                                                                       if no random action
                                                                          (with prob. 1  )
                                                                                                   if HKB action
                                                                                                        selection
                                              if random
                                                                                                     Extract HKB
Environment                                        action                    if    action
                                                    (with                                                                    Agent cycle
                                                                             selection
                                                prob. )                                                                      (executed every
                                                                                                         HKB                 time step)


                                             Random action selection   Action selection by        Action selection by HKB


                                         Action a

                        Figure 1: Learning Agent Model with the Ability of Finding and Exploiting Heuristics


extracted HKB) in the upcoming learning episode. This de-                     use again Q-Learning with the same parameters used for
cision is made depending on the d¯s value calculated from                     measuring the subjective strategic depth at the end of Sec-
the current extracted HKB.                                                    tion 3.3. As threshold parameter to determine the exploita-
   In the latter, the agent either chooses a random action (for               tion of found heuristics by means of the subjective strategic
exploration purpose) or decides to choose its action accord-                  depth d¯s , we choose a threshold of th = 0.2. According to
ing to the Q̂ matrix or the extracted HKB, respectively, de-                  Figure 2, this means that—in average—the agent would try
pending on the decision made during the initialization. Note                  to exploit the found heuristics after
that in Figure 1, components belonging to the sub-symbolic                    • ≈ 25 runs in case of the first scenario,
learning approach are indicated by a gray coloring, whereas
those parts belonging to our heuristics extension are colored                 • ≈ 75 runs in case of the second scenario and
white. The gray parts could be replaced by a different learn-                 • ≈ 100 runs in case of the third scenario.
ing approach.
                                                                                 For the second and the third scenario, this may sound con-
                        5    Evaluation                                       fusing since we perform a total number of 50 runs only.
                                                                              However, the reader should be aware that Figure 2 shows
This Section first evaluates the approach in the four grid                    the average development of d¯s measure and thus there are
world scenarios from Figure 2 in Section 3.3. Later, the ap-                  enough cases where the agent individually decides to exploit
proach will be evaluated additionally in a more realistic ex-                 a found heuristic that leads to an optimal policy.
ample: a game from the GVGAI framework (Perez-Liebana
                                                                                 As for the fourth scenario, the agent never considers to
et al. 2016).
                                                                              exploit any heuristics, since the subjective strategic depth
                                                                              does not decrease below th = 0.2. This can be interpreted in
5.1   Evaluation in the Context of Grid World                                 a way that the scenario comprises too many exceptions, such
      Scenarios                                                               that an exploitation of heuristics does not make sense from
We run the four grid world Scenarios from Figure 2 in Sec-                    the agent’s point of view (this point of view is in fact what
tion 3.3 for 50 runs each and measure the percentage of how                   the parameter th controls, depending on whether the desired
many times the optimal policy was found by the agent over                     agent should be more “heuristics-affine” or more “conser-
200 repetitions of the experiment. As learning approach we                    vative”). Table 1 contains the comparison of our agent model
(a)                           1.0                                                                              With
      Avg. Subjective Norm.
                                                                                             Standard        Heuristics     Heuristics
                              0.8                                                           Q-Learning       (th = 0.2)       Only
       Strategic Depth
                              0.6                                            Scenario 1]         3.5%           66.5%          43.0%
                              0.4                                            Scenario 2]         4.5%           20.0%          10.0%
                                                                             Scenario 3]         0.0%           10.5%           0.03%
                              0.2
                                                                             Scenario 4]         0.0%            0.0%           0.0%
                              0.0
                                    1   50      100    150    200   250   Table 1: Standard Q-Learning vs Heuristics Approach in the
                                                                          Context of the Grid Worlds Scenarios—The table shows the
(b)                           1.0                                         percentage of 100 repetitions in which the optimal policy
      Avg. Subjective Norm.


                              0.8                                         was found by the agent during the first 50 runs.
       Strategic Depth


                              0.6                                         against a standard Q-Learning agent with the same parame-
                                                                          ters for the learning part. In addition, for reasons of compari-
                              0.4                                         son, Table 1 also provides results of a heuristics-only version
                                                                          of the agent model5 , where the agent was enforced to rely its
                              0.2
                                                                          decisions always on the extracted heuristics (i. e., in the ini-
                              0.0                                         tialization phase in Figure 1 the action selection mode is al-
                                    1   50      100    150    200   250   ways set to HKB action selection and the weights contained
                                                                          in the Q̂ matrix are never considered directly for action se-
(c)                           1.0
                                                                          lection in the agent cycle).
      Avg. Subjective Norm.


                              0.8                                            As can be seen in Table 1, standard Q-Learning rarely
       Strategic Depth


                                                                          manages to solve the grid worlds during the first 50 runs,
                              0.6                                         whereas our heuristics approach clearly outperforms these
                                                                          results. As for the heuristics-only version, the results are
                              0.4                                         slightly worse: This seems to be the case, since the agent
                                                                          starts too early to rely its decisions on the heuristics (which
                              0.2                                         are nearly random in the beginning of the learning process).
                                                                          This may prevent the agent from further exploring the en-
                              0.0
                                    1   50      100    150   200    250   vironment to an adequate extent and may result in local
                                                                          optimal policies.
(d)                           1.0                                            One could argue that standard Q-Learning is a rather old
      Avg. Subjective Norm.


                                                                          approach and there are nowadays more elaborate learning
                              0.8                                         approaches available. However, if a better learning approach
       Strategic Depth


                                                                          will be chosen in our setup (better in the sense that it will
                              0.6
                                                                          converge faster to the optimal policy), the extracted HKBs
                              0.4                                         and hence the measure d¯s will be more accurate as well and
                                                                          therefore, better heuristics could be exploited earlier during
                              0.2                                         the learning process. This means that in our approach, both
                                                                          the heuristics and the measure d¯s (to decide when to exploit
                              0.0                                         them) will improve with the quality of the learning approach
                                    1   50     100     150    200   250   used for the agent.
                                             Number of Runs r
                                                                          5.2    Evaluation in the Context of a GVGAI Game
                                                                          We evaluate the agent model now in a more realistic ex-
                                                                          ample by consider the game Camel Race from the GVGAI
Figure 2: Examples of Subjective Normalized Strategic                     framework (Perez-Liebana et al. 2016). In this game, the
Depth for the Considered Grid Worlds Scenarios of Increas-                agent has to control a camel which must be moved to the
ing Structural Complexity with the Optimal Policies Indi-                 opposite site of the screen—faster than any other camel in
cated by Arrows—Scenarios 1 (a) and 2 (b) are taken from                  the game (additionally avoiding obstacles in the higher lev-
(Apeldoorn and Kern-Isberner 2016), Scenario 3 (c) is from                els). Figure 3 shows the first and the third level of the game
(Apeldoorn and Kern-Isberner 2017) and Scenario 4 (d)                     (with obstacles).6
from (Apeldoorn and Volz 2017).                                              5
                                                                               Thanks to an anonymous reviewer for proposing this idea.
                                                                             6
                                                                               The game was slightly modified for our purpose by reducing
                                                                          the number of camels, in order to reduce the dimensionality of the
                                                                          state-action space without changing the basic game mechanics.
                         Figure 3: First Level (a) and Third Level (b) of the GVGAI Game Camel Race

                                    With                                   6   Conclusion and Future Work
                 Standard        Heuristics    Heuristics
                Q-Learning       (th = 0.2)      Only             In this paper, we described a model of a learning agent which
                                                                  is able to find and to exploit heuristics without a priori
   1st Level]       0.0%           63.3%          66.0%           knowledge in unknown environments. We showed on sev-
   3rd Level]       0.0%           53.3%          60.0%           eral examples (four grid world scenarios and one game from
                                                                  the GVGAI framework) that our approach drastically accel-
Table 2: Standard Q-Learning vs Heuristics Approach in the        erates the learning process to find optimal policies.
Context of the GVGAI Game Camel Race—The table shows                 However, the approach is not perfect yet and leaves
the percentage of 30 repetitions in which the agent won the       room for future work: (1) The runtime to extract an HKB
game in the first 100 runs.                                       should be further improved (a faster algorithm has been
                                                                  proposed in (Apeldoorn and Kern-Isberner 2017) already;
                                                                  cf. footnote 4) and (2) a quality criterion could be incorpo-
   The agent’s state space is described by the sensor value       rated into the measure that helps to decide whether heuris-
sets SC         C2      C2      C3                                tics should be exploited, depending on previous experiences
        x × Sx × Sy × Sx containing the coordinates
         1

of the camels in the game (where C1 , ..., C3 correspond to       with applying these heuristics. The latter could possibly be
the three camels, C2 is the camel in the middle controlled        useful to help preventing an agent from repeatedly trying out
by the agent and C1 , C3 only move in x direction) and            bad heuristics.
A = {up, down, left, right, none} are the five actions that          Furthermore, it could be interesting to extend the model
can be performed by the agent. As reward, the agent per-          with revision techniques for the found heuristics and to in-
ceives the current distance to the fastest opponent camel in      corporate mechanisms to store and reuse heuristics.
x direction.
   Even if the game is rather simple, it has an interesting as-                          References
pect: Due to the dynamics of the environment caused by the        Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and
movement of the two opponent camels, our agent perceives          Verkamo, A. 1996. Fast Discovery of Association Rules.
new and previously unseen states in nearly every game tick        Advances in Knowledge discovery and Data Mining. Cam-
of the early learning phase. Thus, the agent has to explore       bridge, MA, USA: MIT Press. 307–328.
large parts of the state-action space to improve its behavior,    Apeldoorn, D., and Kern-Isberner, G. 2016. When
although the game could be easily won by applying a rough         should learning agents switch to explicit knowledge? In
heuristic like always move right (> ⇒ right [1.0]) for the        Benzmüller, C.; Sutcliffe, G.; and Rojas, R., eds., GCAI
first level. This renders the game an eligible test environment   2016. 2nd Global Conference on Artificial Intelligence, vol-
for our agent model.                                              ume 41 of EPiC Series in Computing, 174–186. EasyChair
   We perform 100 runs for each level and measure the per-        Publications.
centage of wins by the agent over 30 repetitions. Again, stan-    Apeldoorn, D., and Kern-Isberner, G. 2017. Towards an un-
dard Q-Learning with and without the heuristics approach,         derstanding of what is learned: Extracting multi-abstraction-
as well as the heuristics-only approach are used, given           level knowledge from learning agents. In Rus, V., and
the same standard parameters as provided in Section 5.2.          Markov, Z., eds., Proceedings of the Thirtieth International
Table 2 shows the results for the first and the third level of    Florida Artificial Intelligence Research Society Conference,
the game: With standard Q-Learning, the agent is not able         764–767. Palo Alto, California: AAAI Press.
to win the game within 100 runs, whereas using the heuris-
tics approach, the agent wins both levels in over 50% of the      Apeldoorn, D., and Volz, V. 2017. Measuring strategic
cases. Surprisingly, the heuristics-only approach performs        depth in games using hierarchical knowledge bases. In 2017
even better here. This seems to be the case since—although        IEEE Conference on Computational Intelligence and Games
the game has some dynamics and can thus be considered             (CIG), 9–16. IEEE. To be published.
more complex than the grid world scenarios—the consid-            Borgida, A., and Etherington, D. W. 1989. Hierarchi-
ered levels allow for rougher heuristics than the grid worlds:    cal knowledge bases and efficient disjunctive reasoning. In
In contrast to the grid worlds, the goal to be reached is not     Brachman, R. J.; Levesque, H. J.; and Reiter, R., eds., Pro-
located in a single corner, instead, the game can be won by       ceedings of the First International Conference on Principles
simply reaching the rightmost side of the screen (which can       of Knowledge Representation and Reasoning, 33–43. San
already be achieved by a very rough heuristic).                   Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Dörner, D. 1992. Die Logik des Mißlingens – Strategisches
Denken in komplexen Situationen. Reinbek bei Hamburg:
Rowohlt Taschenbuch Verlag.
Junges, R., and Klügl, F. 2013. Learning tools for agent-
based modeling and simulation. KI – Künstliche Intelligenz
27(3):273–280.
Leopold, T.; Kern-Isberner, G.; and Peters, G. 2008. Belief
Revision with Reinforcement Learning for Interactive Object
Recognition. ECAI 2008 – 18th European Conference on
Artificial Intelligence Proceedings. Amsterdam: IOS Press.
65–69.
Michael, L. 2011. Causal Learnability. In Proceedings
of the 22nd International Joint Conference on Artificial In-
telligence (IJCAI 2011), 1014–1020. Palo Alto, California:
AAAI Press.
Perez-Liebana, D.; Samothrakis, S.; Togelius, J.; Schaul, T.;
and Lucas, S. M. 2016. General video game ai: Competi-
tion, challenges and opportunities. In Schuurmans, D., and
Wellman, M., eds., Proceedings of the Thirtieth AAAI Con-
ference on Artificial Intelligence and the Twenty-Eighth In-
novative Applications of Artificial Intelligence Conference,
4335–4337. Palo Alto, California: AAAI Press.
Shapiro, D.; Langley, P.; and Shachter, R. 2001. Using
Background Knowledge to Speed Reinforcement Learning
in Physical Agents. AGENTS ’01 Proceedings of the fifth
international conference on Autonomous agents. New York:
ACM. 254–261.
Singh, D.; Sardina, S.; Padgham, L.; and James, G. 2011.
Integrating learning into a bdi agent for environments with
changing dynamics. In Walsh, T., ed., Proceedings of the
Twenty-Second International Joint Conference on Artificial
Intelligence, 2525–2530. Menlo Park, California: AAAI
Press/International Joint Conferences on Artificial Intelli-
gence.
Sun, R.; Peterson, T.; and Merrill, E. 1999. A hybrid archi-
tecture for situated learning of reactive sequential decision
making. Applied Intelligence 11(1):109–127.
Sun, R. 2002. Knowledge Extraction from Reinforcement
Learning. New Learning Paradigms in Soft Computing.
Berlin Heidelberg: Springer. 170–180.
Sutton, R., and Barto, A. 1998. Reinforcement Learning:
An Introduction. Camebridge, Massachusetts; London, Eng-
land: The MIT Press.
Watkins, C. 1989. Learning from Delayed Rewards. Eng-
land: University of Cambridge.