<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autonomous Generation of Symbolic Knowledge via Option Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriele Sartor</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Zollo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Cialdea Mayer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Oddi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Rasconi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vieri Giuliano Santucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Cognitive Sciences and Technologies (ISTC-CNR)</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Roma Tre University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this work we present an empirical study in which we demonstrate the possibility of developing an artificial agent that is capable of autonomously exploring an experimental scenario. During the exploration, the agent discovers and learns interesting options that allow it to interact with the environment without any assigned task, and then abstracts and re-uses the acquired knowledge to solve the assigned tasks. We test the system in the so-called Treasure Game domain described in the recent literature, and we empirically demonstrate that the discovered options can be abstracted into a probabilistic symbolic planning model (using the PPDDL language), which allows the agent to generate symbolic plans to achieve extrinsic goals.</p>
      </abstract>
      <kwd-group>
        <kwd>options</kwd>
        <kwd>intrinsic motivations</kwd>
        <kwd>automated planning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        If we want robots to be able to interact with complex and unstructured environments like the
real-life scenarios in which humans live, or if we want artificial agents to be able to explore
and operate in unknown environments, a crucial feature is to give these robots the ability
to autonomously acquire knowledge that can be used to solve human requests and adapt
to unpredicted new contexts and situations. At the same time, these robots should be able
to represent the acquired knowledge in structures that facilitate and speed up its reuse and
eventually facilitate human-robot interactions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The field of Intrinsically Motivated Open-ended Learning (IMOL, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) is showing promising
results in the development of versatile and adaptive artificial agents. Intrinsic Motivations
(IMs, [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]) are a class of self-generated signals that have been used, depending on different
implementations, to provide robots with autonomous guidance for several different processes,
from state-and-action space exploration [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], to the autonomous discovery, selection and learning
of multiple goals [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. In general, IMs guide the agent in the acquisition of new knowledge
independently (or even in the absence) of any assigned task: this knowledge will then be
available to the system to solve user-assigned tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or as scaffolding to acquire new
knowledge in a cumulative fashion (similar to what has been called curriculum learning
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Notwithstanding the advancements in this field, IMOL systems are still limited in acquiring
long sequences of skills that can generate complex action plans. In addition to the specific
complexity of the problem, this is also due to the fact that most of these systems store the
acquired knowledge (e.g., contexts, actions, goals) in low-level representations that poorly
support the higher-level reasoning that would guarantee a more effective reuse of such knowledge.
Even if some works have shown interesting results with architectural solutions [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], the
use of long-term memory [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], simplified planning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or representational redescription [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
the need to use constructs that ensure greater abstraction and thus higher-level reasoning still
seems crucial to exploiting the full potential of autonomous learning.
      </p>
      <p>
        Within reinforcement learning (RL, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) the option framework [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] implements temporally
extended high-level actions (the options), defined as triplets composed of an initiation set (the
low-level states from which the option can be executed), the actual policy, and the termination
conditions (describing the probability of the option to end in specific low-level states). Options
can be handled at a higher level with respect to classical RL policies and, as shown within
hierarchical RL (HRL) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], they can be chunked together to form longer chains. Moreover,
HRL has also been combined with IMs to allow for different autonomous processes including, for
example, the formation of skill sequences [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the learning of sub-goals [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and, together with
deep RL techniques, to improve trajectory exploration [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. However, despite the theoretical
and operational power of this framework, options alone do not provide a complete abstraction
of all the necessary elements to allow high-level reasoning and planning. As opposed to the
low-level processes typical of IMOL and RL approaches, in classical planning frameworks [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
the states, actions and goals are represented as symbols that can be easily handled and composed
to perform complex sequences of behaviours to solve assigned tasks. However, in symbolic
planning systems, the knowledge of the domain is commonly fixed and provided by an expert
at design time, thus preventing the possibility of exploiting this approach in truly autonomous
systems.
      </p>
      <p>
        Finding a bridge between autonomous approaches gathering low-level knowledge on the
basis of IMs and high-level symbolic decision making is thus a crucial research topic towards
the development of a new and more complete generation of artificial agents. In a seminal
work, Konidaris and colleagues [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] presented an algorithm for the autonomous “translation” of
low-level knowledge into symbols for a PDDL domain, which were then used to solve complex high-level
goal-achievement tasks such as the Treasure Game (see Figure 1) by creating sequences of
operators (or symbolic plans). However, in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] the set of options to be abstracted into symbols
is given to the system, thus failing to “close the loop” between the first phase of autonomous
exploration and learning, and the second phase of exploitation of the acquired knowledge.
      </p>
      <p>
        In this work, we will present an empirical study in which we extend the results obtained
in our previous research [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In particular, we deploy the data abstraction procedure on a
more complex domain characterized by a higher number of variables, hence focusing on the
probabilistic version of the PDDL (Probabilistic Planning Domain Definition Language - PPDDL
[
        <xref ref-type="bibr" rid="ref25">25</xref>
]), demonstrating the possibility of developing an artificial agent that is capable of autonomously
exploring the experimental scenario, discovering and learning interesting options that allow it to
interact with the environment without any pre-assigned task, and then abstracting the acquired
knowledge for potential re-use in reaching high-level goals. In particular, we will test the system in
the so-called Treasure Game domain described in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], where the agent can move through
corridors, climb up and down stairs, interact with handles, bolts, keys and a treasure. Two
different results have been achieved by empirically following the proposed approach: on the
one hand, we experimentally verified how the agent is able to find a set of options and generate
symbolic plans to achieve high-level goals (e.g., open a door by using a key); on the other hand,
we analyzed a number of technicalities inherently connected to the task of making abstracted
knowledge explicit while directly exploring the environment (e.g., synthesizing the correct
preconditions of a discovered option).
      </p>
      <p>
        The paper is organized as follows: in Section 2 we introduce the basic notation and the option
framework, and we describe our algorithm for automatically discovering options. Indeed, in
this paper we describe a two-phase learning process: the first phase generates options from scratch
and creates a preliminary action abstraction (Section 2), while the second produces a higher-level
representation that partitions the options and highlights their causal effects. The latter is
described in Section 3, where we briefly describe the abstraction procedure introduced in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
Section 4 describes our empirical results for the Treasure Game domain; finally, in Section 5 we
give some conclusions and discuss some possible directions of future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Finding Options</title>
      <p>
        Options are temporally-extended actions defined as (I, π, β) [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], in which π is the policy
executed, I the set of states in which the policy can run, and β the termination condition of
the option. The options framework has proven to be an effective tool for abstracting actions and
extending them with a temporal component. The use of this kind of action has been shown to
significantly improve the performance of model-based Reinforcement Learning compared to
older models, such as one-step models in which the actions employed are the primitives of
the agent [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Intuitively, these low-level single-step actions, or primitives, can be repeatedly
exploited to create more complex behaviours.
      </p>
      <p>In this section, we describe a possible way to discover and build a set of options from scratch,
using the low-level actions available in the Treasure Game environment (see Figure 1). In this
environment, an agent starts from its initial position (home), moves through corridors and
climbs ladders over different floors, while interacting with a series of objects (e.g., keys, bolts,
and levers) with the goal of reaching a treasure placed in the bottom-right corner and bringing it
back home.</p>
      <p>In order to build new behaviours, the agent can execute the following primitives: 1) go_up,
2) go_down, 3) go_left, 4) go_right, and 5) interact, respectively used to move the agent up,
down, left or right by 2-4 pixels (the exact value is randomly selected with a uniform distribution)
and to interact with the closest object. In particular, the interaction with a lever changes the
state (open/closed) of the doors associated with that lever (either on the same floor or on different
floors), while the interaction with the key and/or the treasure simply collects the key and/or the
treasure into the agent’s bag. Once the key is collected, the interaction with the bolt unlocks
the last door, thus granting the agent access to the treasure.</p>
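      <p>The interaction rules above can be pictured as a tiny state machine. The following sketch is purely illustrative: the World class and interact function are hypothetical names, not part of the actual Treasure Game package.</p>
      <p>
```python
# Illustrative sketch of the interaction rules described above; the names
# (World, interact) are hypothetical, not the actual Treasure Game API.
from dataclasses import dataclass

@dataclass
class World:
    doors_open: bool = False      # doors wired to a lever
    has_key: bool = False
    bolt_unlocked: bool = False
    has_treasure: bool = False

def interact(world, obj):
    """Effect of the interact primitive on the closest object."""
    if obj == "lever":
        # a lever toggles the open/closed state of its associated doors
        world.doors_open = not world.doors_open
    elif obj == "key":
        world.has_key = True          # the key goes into the agent's bag
    elif obj == "bolt" and world.has_key:
        world.bolt_unlocked = True    # possible only once the key is held
    elif obj == "treasure" and world.bolt_unlocked:
        world.has_treasure = True     # reachable only after unlocking the bolt
```
      </p>
      <p>Note how the ordering constraint of the domain (key before bolt, bolt before treasure) is encoded purely in the guards of the interact function.</p>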
      <p>In our experiment, primitives are used as building blocks in the construction of the options,
participating in the definition of π, I and β. In more detail, we create new options from scratch,
considering a slightly different definition of option (p, t, π, I, β) made up of the following
components:
• p, the primitive used by the execution of π;
• t, the primitive which, when available, stops the execution of π;
• π, the policy applied by the option, consisting in repeatedly executing p until t is available
or p can no longer be executed;
• I, the set of states from which π can run;
• β, the termination condition of the option, corresponding to the availability of the primitive
t or to the impossibility of further executing p.</p>
      <p>Consequently, this definition of option requires p, to describe the policy and where it can run,
and t, to define the condition stopping its execution, while maintaining the option’s characteristic
temporal abstraction. For the sake of simplicity, the option definition will follow the more compact
syntax (p, t) in the remainder of the paper.</p>
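      <p>A minimal sketch of this compact (p, t) representation in Python (the language the system was implemented in); the environment interface (is_available, execute) is an assumption made for illustration only.</p>
      <p>
```python
# Minimal sketch of the extended option (p, t, pi, I, beta): only p and t are
# stored, since the policy pi and the termination condition beta are derived
# from them. The environment interface is a hypothetical stand-in.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Option:
    p: str                   # primitive repeatedly executed by the policy pi
    t: Optional[str] = None  # primitive whose availability stops the option;
                             # None plays the role of the empty set {}

    def run(self, env):
        # pi: execute p until t becomes available or p can no longer run (beta)
        while env.is_available(self.p):
            if self.t is not None and env.is_available(self.t):
                break
            env.execute(self.p)
```
      </p>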
      <p>Algorithm 1 Discovery option algorithm
1: procedure DISCOVER(env, max_episodes, max_steps)
2:     O ← {}
3:     e ← 0
4:     while e &lt; max_episodes do
5:         steps ← 0
6:         env.RESET_GAME()
7:         while steps &lt; max_steps do
8:             s ← env.GET_STATE()
9:             p ← env.GET_AVAILABLE_PRIMITIVE()
10:            while env.IS_AVAILABLE(p) and not (env.NEW_AVAILABLE_PRIM()) do
11:                env.EXECUTE(p)
12:            s′ ← env.GET_STATE()
13:            if s ≠ s′ then
14:                if env.NEW_AVAILABLE_PRIM() then
15:                    t ← env.GET_NEW_AVAILABLE_PRIM()
16:                    o ← CREATE_NEW_OPTION(p, t)
17:                else
18:                    o ← CREATE_NEW_OPTION(p, {})
19:                O ← O ∪ {o}
20:    return O</p>
      <p>Algorithm 1 describes the process utilized to discover new options autonomously inside the
simulated environment. The procedure runs for max_episodes episodes of at most max_steps
steps each. Until the maximum number of steps of the current episode is reached, the function
keeps track of the starting state s and randomly selects an available primitive p, such that p can be
executed in s (lines 7-9). Then, as long as p is available and there is no new available primitive
(line 10), the primitive p is repeatedly executed, and the final state s′ of the current potential
option is updated. The function NEW_AVAILABLE_PRIM returns True when a primitive which was
not previously executable becomes available while executing p; the function returns False in all
the other cases. For instance, if the agent finds out that there is a ladder above it while executing
the go_right option, the go_up primitive becomes available and the function returns True. In other
words, NEW_AVAILABLE_PRIM detects the interesting event, thus implementing the surprise
element that catches the agent’s curiosity. For this reason, the primitive representing the
exact reverse of the one currently being executed is not considered interesting by the agent,
i.e., the agent will not get interested in the go_right primitive while executing go_left. The
same treatment applied to the (go_left, go_right) primitive pair is also used with the pair
(go_up, go_down).</p>
      <p>When the stopping condition of the innermost while loop is verified and s ≠ s′, a new option
can be generated according to the following rationale. If the loop exits because of the
availability of a new primitive t in the new state s′, a new option (p, t) is created (line 16);
otherwise, if the loop exits because the primitive under execution is no longer available, a new
option (p, {}) is created, meaning "execute p while it is possible" (line 18). In either case, the
created option o is added to the list O (line 19), which is the output of the function.</p>
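      <p>Algorithm 1 can be transcribed almost line by line into Python. In the sketch below the environment methods mirror the pseudocode and are assumptions about the actual interface; the step counter counts loop iterations rather than executed primitives, which is one possible reading of the pseudocode.</p>
      <p>
```python
# Hypothetical Python transcription of Algorithm 1; env methods mirror the
# pseudocode (RESET_GAME, GET_STATE, ...) and are assumed, not the real API.
def discover(env, max_episodes, max_steps):
    options = set()
    for _ in range(max_episodes):
        env.reset_game()
        for _ in range(max_steps):
            s = env.get_state()
            p = env.get_available_primitive()   # random executable primitive
            # run p until a new primitive appears or p is no longer available
            while env.is_available(p) and not env.new_available_prim():
                env.execute(p)
            s2 = env.get_state()
            if s != s2:
                if env.new_available_prim():
                    t = env.get_new_available_prim()  # the "surprise" event
                    options.add((p, t))
                else:
                    options.add((p, None))  # "execute p while it is possible"
    return options
```
      </p>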
      <p>
        In our test scenario, the algorithm generated 11 working options (see Section 4), suitable for
solving the environment, and collected experience data to be abstracted in PPDDL [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] format at a later stage.
Consequently, as we introduced above, the agent performs two learning phases:
the first, to generate options from scratch and create a preliminary action abstraction, and
the second, to produce a higher-level representation partitioning the options and highlighting their
causal effects. The latter phase, producing a symbolic representation suitable for planning, is
analyzed in the next section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Abstracting options in PPDDL</title>
      <p>
        In this section we provide a summary description of the knowledge abstraction procedure, in
order to allow the reader to get a grasp of the rationale behind the synthesis of the PPDDL
domain. A thorough description of the abstraction algorithm is beyond the scope of this paper;
for further details, the reader is referred to [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        The procedure basically executes the following five steps:
1. Data collection: during this step, the options learned according to Section 2 are
repeatedly executed in the environment, and the information about the initial and final state
(respectively before and after the execution of each option) is collected. Such data are
successively aggregated, and two data structures are returned from the Data collection
phase: the initiation data and the transition data, both to be used in the following steps.
2. Option partition: this step is dedicated to partitioning the learned options in terms
of abstract subgoal options. This operation is necessary because the (P)PDDL operators are
characterized by a single precondition set and a single effect set; therefore, options that
have multiple termination conditions starting from the same initiation set cannot be
correctly captured in terms of (P)PDDL operators. As a consequence, before launching
the abstraction procedure it is necessary to generate a set of options each of which is
guaranteed to produce a single effect (partial subgoal option). This operation utilizes
the transition data set computed in Step 1, as it captures the information about the
domain segment the option modifies. Option partition is ultimately obtained by properly
clustering the transition data through the DBSCAN algorithm ([
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]) present in the
scikit-learn toolkit ([30]).
3. Precondition estimation: this step is dedicated to learning the symbols that will
constitute the preconditions of the PPDDL operators associated to all the options. This operation
utilizes the initiation data set computed in Step 1, and is performed utilizing the support
vector machine ([31]) classifier implementation in scikit-learn.
4. Effect estimation: analogously, this step is dedicated to learning the symbols that
will constitute the effects of the PPDDL operators. The effect distribution was modelled
through kernel density estimation ([32, 33]).
5. PPDDL Domain synthesis: finally, this step is dedicated to the synthesis of the PPDDL
domain, characterized by the complete definition of all the operators associated to the
learned options, in terms of precondition and effect symbols.
      </p>
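      <p>Steps 2-4 can be sketched with the scikit-learn tools named above. The toy data, parameter values, and variable names below are illustrative assumptions, not the actual configuration used for the experiments.</p>
      <p>
```python
# Sketch of steps 2-4 using the scikit-learn tools named in the text; the toy
# data and parameters are illustrative, not the configuration used in the paper.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KernelDensity
from sklearn.svm import SVC

# Step 2 (option partition): cluster the terminal states of one option so
# that each cluster corresponds to a single abstract effect.
terminal_states = np.array([[0.0], [0.1], [0.05], [5.0], [5.1], [5.05]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(terminal_states)

# Step 3 (precondition estimation): classify from which low-level states the
# option can be initiated (label 1) or not (label 0).
init_states = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
can_run = np.array([1, 1, 1, 0, 0, 0])
precondition = SVC().fit(init_states, can_run)

# Step 4 (effect estimation): model the terminal-state distribution of one
# partitioned option with a kernel density estimator.
effect = KernelDensity(bandwidth=0.2).fit(terminal_states[labels == 0])
```
      </p>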
      <p>For instance, Figure 3 depicts an example operator (option-0) whose action corresponds to
modifying the agent’s position as it has to climb up a stair to reach a new location.</p>
      <p>(Figure 2 panels: (a) symbol_19; (b) symbol_29; (c) symbol_30.)</p>
      <p>(:action option-0
:parameters ()
:precondition (and (symbol_19) (symbol_29))
:effect (and (symbol_30) (not (symbol_29)) (decrease (reward) 17.60))
)
The operator’s formalization follows the standard Probabilistic PDDL (PPDDL), where the precondition set is
composed of the symbols {symbol_19, symbol_29}, the effect set is composed of the symbol
{symbol_30}, and the negative effect set contains the symbol {symbol_29} (note that the
names of the symbols are automatically generated). The reader should also consider that the
PPDDL operators returned by the abstraction procedure are grounded; automatically abstracting
parametric PPDDL representations is beyond the scope of this work and will be the object of
future work. Finally, each PPDDL operator is associated with a reward (17.60 in this case).</p>
      <p>In order to provide the reader with some information about the meaning of the symbols
that populate the previous operator, the semantics of all the symbols is graphically presented in
Figure 2. In particular, the symbol_19 proposition in the operator’s preconditions has the following
semantics: “the agent’s x coordinate is vertically aligned with the stairs”, while the semantics of
the symbol_29 proposition is “the agent’s y coordinate positions it at a level equivalent to being at
the bottom of the stairs”. From the description above, it is clear that the intersection (i.e., the
logical AND) of the previous two symbols places the agent exactly at the bottom of the stairs.
Relative to the operator’s effects, we see that symbol_29 gets negated (the agent is no longer
at the bottom of the stairs) and is replaced by symbol_30, whose meaning is “the agent’s y
coordinate positions it at a level equivalent to being at the top of the stairs”. Lastly, the reader
should note that symbol_19 remains valid throughout the whole execution of the operator,
and that the logical intersection of symbol_19 and symbol_30 clearly describes the situation
where the agent has climbed the stairs. (Although the operator selected in this particular example
does not make use of probabilities, it was chosen for its simplicity in exemplifying the utilization
of the automatically generated symbols.)</p>
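      <p>The semantics of such a grounded operator can be checked mechanically: a symbolic state is just the set of propositions currently true, and applying the operator removes the negated symbols and adds the positive ones. The small applier below is an illustrative sketch, not part of the Skills to Symbols code.</p>
      <p>
```python
# Hypothetical applier for grounded operators such as option-0 above; a state
# is the set of propositions currently true.
def applicable(state, precondition):
    return precondition.issubset(state)

def apply_operator(state, precondition, add_effects, del_effects):
    if not applicable(state, precondition):
        raise ValueError("preconditions not satisfied")
    # delete the negated symbols, then add the positive effects
    return (state - del_effects) | add_effects

# option-0: at the bottom of the stairs, aligned with them, move to the top
pre = {"symbol_19", "symbol_29"}
add = {"symbol_30"}
delete = {"symbol_29"}
```
      </p>
      <p>Applied to the state {symbol_19, symbol_29}, the operator yields {symbol_19, symbol_30}, matching the “agent has climbed the stairs” situation described above.</p>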
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Analysis</title>
      <p>
        In this section we describe the results obtained from a preliminary empirical study, carried out
by testing Algorithm 1 in the context of the Treasure Game domain [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The algorithm
was implemented in Python 3.7 under Linux Ubuntu 16.04 as an additional module of the Skills
to Symbols software, using the Treasure Game Python package. As previously stated, the
Treasure Game domain defines an environment that can be explored by the agent by moving
through corridors and doors, climbing stairs, interacting with handles (necessary to open/close
the doors), bolts, keys (necessary to unlock the bolts) and a treasure.
      </p>
      <p>In our experimentation, the agent starts endowed with no previous knowledge about the
possible actions that can be executed in the environment; the agent is only aware of the basic
motion primitives at its disposal, as described in Section 2. The goal of the analysis is to
assess the correctness, usability and quality of the abstract knowledge of the environment
autonomously obtained by the agent.</p>
      <p>The experiment starts by using Algorithm 1, whose application endows the agent with the
following set of learned options (11 in total):
O = {(go_up, {}), (go_down, {}), (go_left, {}), (go_left, go_up), (go_left, go_down),
(go_left, interact), (go_right, {}), (go_right, go_up), (go_right, go_down),
(1)
(go_right, interact), (interact, {})}
The test has been run on an Intel I7 3.4GHz machine, and the whole process took 30 minutes. All
the options are expressed in the compact syntax (p, t) described in Section 2, where p represents
the primitive action corresponding to the option’s behavior, and t represents the option’s stop
condition (i.e., the newly discovered primitive action, or an empty set).</p>
      <p>Once the set of learned options has been obtained, the test proceeds by applying the knowledge
abstraction procedure described in the previous Section 3. In our specific case, the procedure
eventually generated a final PPDDL domain composed of a set of 1528 operators. In order to
empirically verify the correctness of the obtained PPDDL domain, we tested the domain with
the off-the-shelf mGPT probabilistic planner [34]. The selected planning goal was to find the
treasure, located in a hidden position of the environment (i.e., behind a locked door that could
be opened only by operating on a bolt with a key), and bring it back to the agent’s starting
position, in the upper part of the Treasure Game environment. The agent’s initial position is by
the small stairs located on the environment’s 5th floor (up left).</p>
      <p>The symbolic plan depicted in Figure 3 was successfully generated and, as readily observable,
reaches the previous goal. The plan is composed of 33 actions, which confirms the correctness
of the proposed methodology (note that the PPDDL operators have been manually renamed after
their exact semantics, in order to facilitate their interpretation for the reader). We also discovered
a number of difficulties inherently connected to the task of explicitly abstracting the knowledge
by means of a direct exploration of the environment. One consequence of such difficulties
is evident from the quality of the plan outlined above. In fact, it is clear that the plan is
not optimal, as the agent sometimes performs useless actions, such as going left to the bolt
and then right to the stairs (22. go_left[ ] and 23. go_right[ ], respectively)
instead of directly executing a go_left[ ]. Other examples of redundant actions are
25. go_right[ ], 26. go_left[ ] instead of directly executing a go_right[ ].
It can be easily seen that the optimal plan is composed of 31 actions. (We thank George Konidaris
and Steve James for making both the Skills to Symbols and the Treasure Game software available.)</p>
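      <p>One conceivable post-processing pass for redundancies of this kind is to cancel adjacent pairs of mutually inverse movement actions. The sketch below is only illustrative: the primitive names are assumed, and cancelling a pair is sound only when the two moves exactly undo each other, which is not guaranteed in general.</p>
      <p>
```python
# Illustrative pass that cancels adjacent inverse movement actions, the kind of
# redundancy observed in the 33-action plan; the names and the assumption that
# the two moves exactly undo each other are hypothetical simplifications.
INVERSE = {"go_left": "go_right", "go_right": "go_left",
           "go_up": "go_down", "go_down": "go_up"}

def cancel_inverse_pairs(plan):
    out = []
    for action in plan:
        if out and INVERSE.get(out[-1]) == action:
            out.pop()           # the pair undoes itself, drop both actions
        else:
            out.append(action)
    return out
```
      </p>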
      <p>The previous analysis is still ongoing work. In this paper, we present the encouraging
results obtained so far, though we have observed that a number of improvements are worth being
studied. One observation can be made about the quality of the obtained PPDDL domain: although
we have demonstrated that such a domain can be successfully used for automated planning,
we have also observed that it contains a number of infeasible operators (i.e., operators characterized
by mutually conflicting preconditions) as well as operators characterized by a high failure
probability. Of course, the presence of such operators does not hinder the feasibility of the
produced plan (the former operators will always be discarded by the planner, while the
latter will at most make the planning process more demanding, thus decreasing the probability
of obtaining an optimal solution); yet, further work must be done to arrive at “crisper” domain
representations.</p>
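      <p>As a first step toward such “crisper” domains, operators with mutually conflicting preconditions can be filtered out syntactically. Below is a minimal sketch, under the assumption that preconditions are represented as sets of (proposition, truth-value) literals; this representation and the function names are illustrative, not the paper's implementation.</p>
      <p>
```python
# Minimal sanity filter for a generated domain: an operator is infeasible when
# its precondition requires some proposition to be both true and false.
# The literal representation (proposition, value) is an assumption.
def feasible(precondition):
    positive = {p for p, v in precondition if v}
    negative = {p for p, v in precondition if not v}
    return not positive.intersection(negative)

def prune_infeasible(operators):
    # operators: mapping from operator name to its precondition literal set
    return {name: pre for name, pre in operators.items() if feasible(pre)}
```
      </p>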
      <p>In this respect, there are at least two research lines to investigate. The first line entails
the study of different fine-tuning strategies for all the parameters utilized in the previously
mentioned Machine Learning tools (such as DBSCAN, SVM, Kernel Density Estimator) involved
in the knowledge-abstraction process. The second line is about analyzing the most efficient
environment exploration strategy used to collect all the transition data that will be used for
the classification tasks that are part of the abstraction procedure, as both the quantity and the
quality of the collected data may be essential at this stage.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>
        In this paper we tested an option discovery algorithm driven by intrinsic motivations for an
agent operating in the Treasure Game domain [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. We experimentally demonstrated that the
discovered options can be abstracted into a probabilistic symbolic planning model (using the
PPDDL language), which allowed the agent to generate symbolic plans to achieve extrinsic
goals. One of the possible directions of future work will be the exploration of innovative iterative
procedures to incrementally refine [35] the generated PPDDL model.
      </p>
      <p>[29] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters
in large spatial databases with noise, in: Proceedings of the Second International Conference
on Knowledge Discovery and Data Mining, KDD’96, AAAI Press, 1996, pp. 226–231.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12
(2011) 2825–2830.
[31] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297. URL:
https://doi.org/10.1023/A:1022627411411. doi:10.1023/A:1022627411411.
[32] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, The Annals
of Mathematical Statistics 27 (1956) 832–837. URL: http://www.jstor.org/stable/2237390.
[33] E. Parzen, On estimation of a probability density function and mode, The Annals of
Mathematical Statistics 33 (1962) 1065–1076. URL: http://www.jstor.org/stable/2237880.
[34] B. Bonet, H. Geffner, mGPT: A probabilistic planner based on heuristic search, J. Artif. Int.
Res. 24 (2005) 933–944.
[35] Y. Hayamizu, S. Amiri, K. Chandan, K. Takadama, S. Zhang, Guiding robot exploration
in reinforcement learning via automated planning, Proceedings of the International
Conference on Automated Planning and Scheduling 31 (2021) 625–633. URL: https://ojs.
aaai.org/index.php/ICAPS/article/view/16011.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eberhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Towards bridging the neuro-symbolic gap: Deep deductive reasoners</article-title>
          ,
          <source>Applied Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          , P.-Y. Oudeyer,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barto</surname>
          </string-name>
          , G. Baldassarre,
          <article-title>Intrinsically motivated open-ended learning in autonomous robots</article-title>
          ,
          <source>Frontiers in neurorobotics 13</source>
          (
          <year>2020</year>
          )
          <fpage>115</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <article-title>Intrinsic motivation systems for autonomous mental development</article-title>
          ,
          <source>IEEE transactions on evolutionary computation 11</source>
          (
          <year>2007</year>
          )
          <fpage>265</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirolli</surname>
          </string-name>
          ,
          <source>Intrinsically Motivated Learning in Natural and Artificial Systems</source>
          , Springer Science &amp; Business Media,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stollenga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Förster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Curiosity driven reinforcement learning for motion planning on humanoids</article-title>
          ,
          <source>Frontiers in neurorobotics 7</source>
          (
          <year>2014</year>
          )
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baranes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          ,
          <article-title>Active learning of inverse models with intrinsically motivated goal exploration in robots</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>61</volume>
          (
          <year>2013</year>
          )
          <fpage>49</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirolli</surname>
          </string-name>
          ,
          <article-title>Grail: A goal-discovering robotic architecture for intrinsically-motivated learning</article-title>
          ,
          <source>IEEE Transactions on Cognitive and Developmental Systems</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <fpage>214</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Colas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fournier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sigaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          ,
          <article-title>Curious: intrinsically motivated modular multi-goal reinforcement learning</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1331</fpage>
          -
          <lpage>1340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Seepanomwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <article-title>Intrinsically motivated discovered outcomes boost user's goals achievement in a humanoid robot</article-title>
          ,
          <source>in: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          <source>in: Proceedings of the 26th annual international conference on machine learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Forestier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Portelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          ,
          <article-title>Intrinsically motivated goal exploration processes with automatic curriculum learning</article-title>
          ,
          <source>arXiv preprint arXiv:1708.02190</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cartoni</surname>
          </string-name>
          ,
          <article-title>Autonomous reinforcement learning of multiple interrelated tasks</article-title>
          ,
          <source>in: 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Becerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Duro</surname>
          </string-name>
          ,
          <article-title>Motivational engine and long-term memory coupling within a cognitive architecture for lifelong open-ended learning</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>452</volume>
          (
          <year>2021</year>
          )
          <fpage>341</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Granato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <article-title>An embodied agent learning affordances with intrinsic motivations and solving extrinsic tasks with attention and one-step planning</article-title>
          ,
          <source>Frontiers in neurorobotics 13</source>
          (
          <year>2019</year>
          )
          <fpage>45</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doncieux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filliat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Duro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coninx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Roijers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Girard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Perrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sigaud</surname>
          </string-name>
          ,
          <article-title>Open-ended learning: a conceptual framework based on representational redescription</article-title>
          ,
          <source>Frontiers in neurorobotics 12</source>
          (
          <year>2018</year>
          )
          <fpage>59</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: An Introduction</source>
          , MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Intra-option learning about temporally abstract actions</article-title>
          ,
          <source>in: Proc. 15th International Conf. on Machine Learning</source>
          , Morgan Kaufmann, San Francisco, CA,
          <year>1998</year>
          , pp.
          <fpage>556</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahadevan</surname>
          </string-name>
          ,
          <article-title>Recent advances in hierarchical reinforcement learning</article-title>
          ,
          <source>Discrete event dynamic systems 13</source>
          (
          <year>2003</year>
          )
          <fpage>41</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Vigorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Intrinsically motivated hierarchical skill learning in structured environments</article-title>
          ,
          <source>IEEE Transactions on Autonomous Mental Development</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>132</fpage>
          -
          <lpage>143</lpage>
          . doi:10.1109/TAMD.2010.2050205.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rafati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Noelle</surname>
          </string-name>
          ,
          <article-title>Learning representations in model-free hierarchical reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>10009</fpage>
          -
          <lpage>10010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saeedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          )
          <fpage>3675</fpage>
          -
          <lpage>3683</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghallab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Traverso</surname>
          </string-name>
          ,
          <source>Automated Planning: Theory &amp; Practice</source>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Konidaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lozano-Perez</surname>
          </string-name>
          ,
          <article-title>From skills to symbols: Learning symbolic representations for abstract high-level planning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>215</fpage>
          -
          <lpage>289</lpage>
          . URL: http://lis.csail.mit.edu/pubs/konidaris-jair18.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rasconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sartor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cartoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mannella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <article-title>Integrating open-ended learning in the sense-plan-act robot control paradigm</article-title>
          ,
          <source>in: ECAI 2020, the 24th European Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Younes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <article-title>PPDDL1.0: An Extension to PDDL for Expressing Planning Domains with Probabilistic Effects</article-title>
          ,
          <source>Technical Report</source>
          , Carnegie Mellon University,
          <year>2004</year>
          . CMU-CS-04-167.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: An Introduction</source>
          , MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning</article-title>
          ,
          <source>Artif. Intell</source>
          .
          <volume>112</volume>
          (
          <year>1999</year>
          )
          <fpage>181</fpage>
          -
          <lpage>211</lpage>
          . URL: http://dx.doi.org/10.1016/S0004-3702(99)00052-1. doi:10.1016/S0004-3702(99)00052-1.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>G.</given-names>
            <surname>Konidaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lozano-Perez</surname>
          </string-name>
          ,
          <article-title>From skills to symbols: Learning symbolic representations for abstract high-level planning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>215</fpage>
          -
          <lpage>289</lpage>
          . doi:https://doi.org/10.1613/jair.5575.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          ,
          <source>in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96)</source>
          ,
          <year>1996</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>