<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>WOA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Capturing a Recursive Pattern in Neural-Symbolic Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Beretta</string-name>
          <email>davide.beretta@unipr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Monica</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Bergenti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Scienze Matematiche, Fisiche e Informatiche, Università degli Studi di Parma</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>24</volume>
      <fpage>6</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Neural-symbolic methods have gained considerable attention in recent years because they are valid approaches to obtain a synergistic integration between deep reinforcement learning and symbolic reinforcement learning. Along these lines of research, this paper presents an extension to a recent neural-symbolic method for reinforcement learning. The original method, called State-Driven Neural Logic Reinforcement Learning, generates sets of candidate logic rules from the states of the environment, and it uses a differentiable architecture to select the best subsets of the generated rules that solve the considered training tasks. The proposed extension modifies the rule generation procedure of the original method to effectively capture a recursive pattern among the states of the environment. The experimental results presented in the last part of this paper provide empirical evidence that the proposed approach is beneficial to the learning process. In particular, the proposed extended method is able to tackle diverse tasks while ensuring good generalization capabilities, even in tasks that are problematic for the original method because they exhibit recursive patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Neural-symbolic learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Neural-symbolic methods for reinforcement learning have been extensively studied in the last
few years (e.g., [<xref ref-type="bibr" rid="ref1">1</xref>]). These methods try to overcome the limitations of deep reinforcement
learning methods while preserving their ability to effectively cope with partially observable and
stochastic environments. Among the mentioned limitations, it is worth noting that, even if deep
reinforcement learning methods obtained notable results on several complex tasks (e.g., [<xref ref-type="bibr" rid="ref2">2</xref>, <xref ref-type="bibr" rid="ref3">3</xref>, <xref ref-type="bibr" rid="ref4">4</xref>]),
they typically have limited generalization capabilities (e.g., [<xref ref-type="bibr" rid="ref5">5</xref>, <xref ref-type="bibr" rid="ref6">6</xref>]). Moreover, the models that
deep reinforcement learning methods produce are not easily interpretable. Neural-symbolic
methods for reinforcement learning are expected to overcome both limitations [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>
        Three interesting neural-symbolic methods for reinforcement learning are Differentiable Logic
Machine (DLM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], differentiable Neural Logic (dNL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and Neural Logic Reinforcement Learning
(NLRL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These methods combine differentiable architectures with first-order logic, and they
can be used to define agents that learn sets of logic rules that represent strategies to effectively
solve the target tasks in environments that present uncertainty. However, the mentioned
methods require a large amount of information to guide the search for an effective strategy,
which prevents these methods from being used in real-world applications. It is often difficult,
or even impossible, to define the structure, or even the characteristics, of the final solution for
many complex tasks. In addition, these methods often require relevant computational resources
even to deal with toy problems.</p>
      <p>
        This paper presents an extension of a recently proposed neural-symbolic method for
reinforcement learning, called State-Driven Neural Logic Reinforcement Learning (SD-NLRL) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
goal of this work is to extend SD-NLRL to capture the recursive patterns among the states of
the environment. A recursive pattern can be defined as a set of atoms in which each pair of
atoms in the set shares at least a constant, and the set can be replaced with a recursive predicate
to obtain a new state which is equivalent to the initial one. In particular, SD-NLRL is based on
NLRL, which is an adaptation of ∂ILP [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for reinforcement learning tasks. In order to solve a
specific task, NLRL requires a large amount of information from the user. The user must specify
different characteristics of the generated rules using a set of hyper-parameters. On the contrary,
SD-NLRL requires information that can be directly obtained from the environment when the set
of the possible states of the environment is known. The algorithm that is the base of this work,
namely SD-NLRL, generates a set of candidate rules directly from the states of the environment.
Then, SD-NLRL employs a modified version of the differentiable architecture of NLRL to select
the best subset of the generated rules that are able to solve the training task.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], SD-NLRL is compared with NLRL on five tasks, but SD-NLRL is not able to solve three
block manipulation tasks because these tasks present recursive patterns, and SD-NLRL is not
able to generate compact and sufficiently general rules from the states. The proposed extension
is able to tackle even block manipulation tasks, as shown in Section 3.
      </p>
      <p>SD-NLRL is closely related to another neural-symbolic method for reinforcement learning
called dNL. dNL builds a neural network that is used to learn a logical formula in conjunctive
or disjunctive normal form. dNL requires the user to specify both the structure of the network
and the number of free variables that can be used in the learned rules. Therefore, as discussed
for NLRL, SD-NLRL requires less information from the user about the structure of the learned
rules, and it does not require the user to size a neural network.</p>
      <p>
        SD-NLRL is also related to Relational Reinforcement Learning (RRL) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which combines
Q-learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with a logical representation of states and actions, because they both work
on reinforcement learning using logic. In particular, RRL induces a relational regression tree
that maps a Q-value, which represents the expected rewards for an action taken in a given
state, to each pair of state and action. Several extensions of the original RRL method have
been proposed (e.g., [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]). However, these extensions do not exploit deep reinforcement
learning and related innovations. On the contrary, SD-NLRL can be used to learn logic rules
using a differentiable architecture, which can take advantage of deep reinforcement learning.
      </p>
      <p>Finally, another recent and notable neural-symbolic method for reinforcement learning is
DLM, which defines a neural architecture that mimics a logical deduction process. In order
to obtain interpretable strategies, DLM can be used to learn ground rules only, and it requires
the user to properly size the neural network. On the contrary, SD-NLRL requires the set of all
states, which is a strong requirement, but it does not require the user to tune a neural network
for the specific purpose of mimicking the deduction process.</p>
      <p>The remainder of the paper is organized as follows. Section 2 briefly describes SD-NLRL.
Section 2 also reports the experimental results of SD-NLRL on cliff-walking tasks, for completeness.
Section 3 presents the proposed extension of SD-NLRL, which is useful when the states of the
environment are characterized by recursive patterns. Section 3 also discusses a comparison
between the proposed extension of SD-NLRL and NLRL on block manipulation tasks. Finally,
Section 4 concludes this paper and outlines possible future developments.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>This section summarizes SD-NLRL, as described in [<xref ref-type="bibr" rid="ref9">9</xref>], and it briefly introduces Datalog, which
is the language considered throughout this paper. Then, it presents NLRL, which is the
neural-symbolic method for reinforcement learning that is the base of SD-NLRL. Then, SD-NLRL is
briefly presented, discussing the main differences with NLRL. Finally, the experimental results
reported in [<xref ref-type="bibr" rid="ref9">9</xref>] are presented, for completeness.</p>
      <sec id="sec-3-1">
        <title>2.1. Datalog</title>
        <p>SD-NLRL can be used to learn rules that are written in a subset of first-order logic, called Datalog.
In Datalog, there are three main primitives: predicate symbols, variable symbols, and constant
symbols. Predicate symbols, or predicates, denote relations among objects in the considered
domain. Constant symbols, or constants, denote the objects of the considered domain. Variable
symbols, or variables, denote references to unspecified objects of the considered domain. An
atom is defined as $p(t_1, \ldots, t_n)$, where $p$ is a predicate and $t_1, \ldots, t_n$ are terms, which can be
either variables or constants. An atom is called ground when the terms $t_1, \ldots, t_n$ are constants.
A definite clause is a rule defined as $\alpha \leftarrow \beta_1, \ldots, \beta_n$, where $\alpha$ is the head of the rule, and $\beta_1, \ldots, \beta_n$
is the body of the rule. In this work, the word rule is used as a synonym of definite clause to
improve readability. A predicate defined using ground atoms only is called extensional predicate.
On the contrary, a predicate defined using a set of definite clauses is called intensional predicate.
For example, on(a,b) is a ground atom, while pile(X,Y) ← on(X,Z), pile(Z,Y) is a definite
clause that contributes to the definition of the intensional predicate pile/2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. NLRL</title>
        <p>NLRL is an adaptation of ∂ILP, a neural-symbolic method for Inductive Logic Programming [<xref ref-type="bibr" rid="ref16">16</xref>],
for reinforcement learning tasks. NLRL requires that both the environment and the background
knowledge are presented using ground atoms, and it produces sets of logic rules using program
templates, which are sets of hyper-parameters that describe the form of the generated rules.
In particular, NLRL generates sets of rules before the training phase, and then it can be used
to learn the subset of the generated rules that represent a good strategy to solve the task. In
order to perform rule generation, the user must define both the action predicates, representing
the possible actions, and the auxiliary predicates, used for predicate invention. During the
training phase, NLRL performs a predetermined number of forward chaining steps to obtain
the truth value of the action atoms. A program template indicates both the number and the
arity of the auxiliary predicates, the number of deduction steps $T$, and a set of rule templates.
A program template must specify a rule template for each intensional predicate, either action
or auxiliary. A rule template indicates the number of free variables in the corresponding rule
and whether intensional predicates are allowed in the body of the rule. NLRL generates the
candidate rules using a top-down approach, and it enforces several constraints on the produced
rules. Besides the limitations imposed by the program template, NLRL requires that the body of
each rule contains exactly two atoms, and many additional constraints are applied to further
reduce the space of generated rules. Therefore, using a program template, NLRL requires the
user to specify many details on the form of the final solution, which are not always available
for many real-world problems.</p>
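        <p>For illustration, the hyper-parameters of a program template can be sketched as plain data structures. The following Python sketch is only illustrative: the class names, field names, and example values are hypothetical, and they are not taken from the NLRL code base.</p>
        <preformat>
# Illustrative sketch of the program-template hyper-parameters described above;
# all names and values are hypothetical, not from the official NLRL implementation.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class RuleTemplate:
    n_free_variables: int    # free variables allowed in the rule
    allow_intensional: bool  # whether intensional predicates may appear in the body

@dataclass
class ProgramTemplate:
    aux_predicates: List[Tuple[str, int]]    # (name, arity) of each auxiliary predicate
    n_deduction_steps: int                   # the number T of forward chaining steps
    rule_templates: Dict[str, RuleTemplate]  # one rule template per intensional predicate

# A template for a hypothetical task with one invented auxiliary predicate aux1/2
template = ProgramTemplate(
    aux_predicates=[("aux1", 2)],
    n_deduction_steps=4,
    rule_templates={
        "move": RuleTemplate(n_free_variables=1, allow_intensional=True),
        "aux1": RuleTemplate(n_free_variables=1, allow_intensional=False),
    },
)
</preformat>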
      <p>
        In order to select the best subset of rules that solve the training task, NLRL defines a trainable
weight for each generated rule. In particular, the algorithm defines a vector of weights $\theta_j$ for
each predicate and for each rule template $\tau_j$. Moreover, NLRL defines a valuation vector [<xref ref-type="bibr" rid="ref10">10</xref>]
$e \in E = [0, 1]^{|G|}$, where $G$ is the set of ground atoms. Each component of $e$ represents how much
the algorithm believes that the corresponding atom is true. In order to obtain the truth value of
each ground atom, NLRL performs $T$ forward chaining steps, and the sequence of deduction
steps for iteration $t$ can be formalized as $g_t \colon E \times E \to E$, defined as follows:
$$g_t(e, e_0) = \begin{cases} e_0 &amp; \text{if } t = 0 \\ f(g_{t-1}(e, e_0), e_0) &amp; \text{if } t &gt; 0 \end{cases} \tag{1}$$
where $e_0$ represents the initial valuation vector, and $f$ describes a single deduction step, which
can be formalized as:
$$f(e, e_0) = e_0 \oplus \bigoplus_{0 \le j \le n} \sum_{0 \le k \le m_j} w_{j,k} \, h_{j,k}(e) \tag{2}$$
where $n$ represents the number of rule templates defined for predicate $p$, $m_j$ is the number of
rules produced using the $j$-th rule template, $w_{j,k}$ is the weight of the $k$-th rule produced using
the $j$-th rule template for predicate $p$, and $h_{j,k}(e)$ denotes a forward chaining step using the
$k$-th rule produced using the $j$-th rule template for predicate $p$. The symbol $\oplus$ denotes the
probabilistic sum, which is defined as $a \oplus b = a + b - a \odot b$, where $\odot$ represents the
component-wise multiplication. In order to select a single rule for each rule template, NLRL uses a softmax
function to obtain each weight $w_{j,k}$ from the underlying vector of weights $\theta_j$:
$$w_{j,k} = \mathrm{softmax}_k(\theta_j), \qquad C_j = \{(j, k) \mid r_{j,k} \text{ satisfies } \tau_j \wedge \mathrm{head}(r_{j,k}) = p\} \tag{3}$$
where $C_j$ is the set of candidate rules produced using the $j$-th rule template for predicate $p$,
and $\mathrm{softmax}_k(\theta) = \exp(\theta_k) / \sum_i \exp(\theta_i)$.
      </p>
      <p>
        In order to complete the formalization of NLRL, the function $h_{j,k} \colon E \to E$, identified by the
triple $(p, j, k)$, must be defined. Let $r$ be a rule. The corresponding function $h_r$ takes a valuation
vector containing the expected truth values of the ground atoms, and it outputs a valuation vector
that contains the truth values of the atoms after the application of $r$. Each function $h_r$ is implemented by
a matrix $X_r \in \mathbb{N}^{|G| \times w \times 2}$, where $w$ denotes the maximum number of pairs of atoms that entail a
ground atom. Each $X_r$ is built before the training phase, and it is defined as follows:
$$X_r[i] = \{(j, k) \mid (G_j, G_k) \text{ logically entails } G_i \text{ using } r\} \tag{4}$$
where $G_i$ denotes the $i$-th ground atom, $X_r[i]$ is a set of index pairs, and each pair is a reference
to a pair of atoms that logically entails the ground atom of index $i$. Each matrix $X_r$ typically
contains many unused index pairs $(0, 0)$, which must refer to a pair of atoms. Therefore, the
algorithm adds the falsum atom $\perp$ to $G$, and it associates $(0, 0)$ with $(\perp, \perp)$. During the
training phase, NLRL extracts two slices of $X_r$, namely $X_1, X_2 \in \mathbb{N}^{|G| \times w}$, such that $X_1$ and
$X_2$ extract all the indexes of the first atoms from $X_r$ and all the indexes of the second atoms
from $X_r$, respectively:
$$X_1 = X_r[\_, \_, 0] \quad \text{and} \quad X_2 = X_r[\_, \_, 1]. \tag{5}$$
Then, the algorithm obtains the truth value of each ground atom, obtaining $Y_1, Y_2 \in [0, 1]^{|G| \times w}$,
using the function $\mathrm{gather} \colon E \times \mathbb{N}^{|G| \times w} \to \mathbb{R}^{|G| \times w}$:
$$Y_1 = \mathrm{gather}(e, X_1) \quad \text{and} \quad Y_2 = \mathrm{gather}(e, X_2), \tag{6}$$
where $\mathrm{gather}$, as defined in [<xref ref-type="bibr" rid="ref5">5</xref>], is a function that retrieves the values from a valuation vector
given a matrix of indexes. Then, NLRL computes $Z \in [0, 1]^{|G| \times w}$, defined as follows:
$$Z = Y_1 \odot Y_2. \tag{7}$$
      </p>
      <sec id="sec-3-1">
        <title>Finally, the algorithm compuℎte(s) as:</title>
        <p>ℎ () =  ′
where
 ′[] =
max   [, ].
1≤≤
It is worth noting that the maximum operator realizes a fuzzy disjunction, while the
componentwise multiplication implements a fuzzy conjunction.
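      <p>To make equations (5)-(8) concrete, the following sketch replays a single forward chaining step for the rule pile(X,Y) ← on(X,Z), on(Z,Y) on a tiny hand-made valuation vector. It is a minimal numerical illustration assuming NumPy: the state, the shapes, and the variable names are invented for the example, and the code is not taken from the NLRL implementation.</p>
      <preformat>
# Minimal numerical sketch of equations (5)-(8); state and shapes are invented.
import numpy as np

# Ground atoms; index 0 is the falsum atom added by the algorithm
G = ["falsum", "on(a,b)", "on(b,floor)", "pile(a,floor)"]
e0 = np.array([0.0, 1.0, 1.0, 0.0])       # initial valuation vector

# X_r[i] lists the index pairs whose conjunction entails atom i; unused slots
# hold (0, 0), i.e., they point to the pair (falsum, falsum)
X_r = np.zeros((len(G), 2, 2), dtype=int)  # shape |G| x w x 2, with w = 2
X_r[3, 0] = (1, 2)                         # on(a,b) and on(b,floor) entail pile(a,floor)

X1, X2 = X_r[:, :, 0], X_r[:, :, 1]        # equation (5): slice the index matrix
Y1, Y2 = e0[X1], e0[X2]                    # equation (6): gather the truth values
Z = Y1 * Y2                                # equation (7): fuzzy conjunction
h_e = Z.max(axis=1)                        # equation (8): fuzzy disjunction

# One deduction step for a single rule with weight 1, amalgamated with the
# probabilistic sum of equation (2): a ⊕ b = a + b - a*b
e1 = e0 + h_e - e0 * h_e
print(e1)                                  # pile(a,floor) is now deduced to be true
</preformat>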
      </sec>
      <sec id="sec-3-3">
        <title>2.3. SD-NLRL</title>
        <p>SD-NLRL generates the candidate rules directly from the states of the environment. It uses the
list of all possible states, which can be provided by the environment in many situations, and the
number of forward chaining steps $T$.</p>
        <p>The rule generation function of SD-NLRL, called generates_rules, is shown in Algorithm 1.
The function has three arguments: the set of action atoms, the set of states, and the set of
background atoms. Each state is subdivided into a set of groups using get_groups, and
each group includes a subset of the atoms that are included in the state. In particular, each
atom in a group is directly or indirectly related with other atoms in the same group through
its constants. Therefore, each group has a different set of constants, and it describes a logical
characteristic of the state that gives a contribution to the truth value of the action atom. The
generates_rules function then uses get_shared to obtain, for each pair of group and action,
the constants that are present both in the action atom and in the group. Then, generates_rules
adds the background atoms that include at least one shared constant into the group. Here, only
the shared constants are considered because they represent a connection between the groups
of the state and the action to be performed. The function then obtains the set of unshared
constants, which are the constants that are not shared between the action atom and the group,
after the related background atoms are included. Now, when the unshared constants are not
present, the final rule is produced using unground, which converts each ground term into a
new variable. Otherwise, a rule is generated for each unshared constant $c$. In particular, the
set of unshared constants that does not include $c$ is immediately transformed into variables.
Then, the function adds the background atoms that include $c$ to the body of the rule. Finally,
the remaining constants are converted into variables.</p>
        <p>The generation of a new rule for each unshared constant is useful to decrease the complexity
of each rule. In fact, the algorithm tries to obtain the simplest characteristics from each
state. Then, the differentiable architecture picks the best subset of rules that solves the training
task. When converting constants into variables, unground starts from the body, which is the
second argument, and it transforms the rule going from left to right. Finally, it changes the
head of the rule, which is the first argument. The last argument is a set of constants, and the
function converts only the constants in the set when the latter is not empty.</p>
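        <p>The following Python sketch reimplements the core of the rule generation idea from the prose above. It is a simplified illustration: get_groups, get_shared, and unground are hypothetical reimplementations that ignore background atoms and the unshared-constant branch, and they are not taken from the SD-NLRL code base.</p>
        <preformat>
# Hedged sketch of the rule generation idea; atoms are (predicate, args) tuples.
from itertools import count

def get_groups(state):
    """Partition a state into groups of atoms related through shared constants."""
    groups = []
    for atom in state:
        consts = set(atom[1])
        linked = [g for g in groups if any(consts.intersection(a[1]) for a in g)]
        merged = [atom] + [a for g in linked for a in g]
        groups = [g for g in groups if g not in linked] + [merged]
    return groups

def get_shared(action, group):
    """Constants appearing both in the action atom and in the group."""
    group_consts = {c for _, args in group for c in args}
    return set(action[1]).intersection(group_consts)

def unground(head, body):
    """Replace constants with fresh variables: body first, left to right, then head."""
    mapping, fresh = {}, (f"V{i}" for i in count())
    def convert(atom):
        return (atom[0], tuple(mapping.setdefault(c, next(fresh)) for c in atom[1]))
    new_body = [convert(a) for a in body]
    return convert(head), new_body

state = [("top", ("a",)), ("on", ("a", "b")), ("on", ("b", "floor"))]
action = ("move", ("a", "floor"))
for group in get_groups(state):
    if get_shared(action, group):
        head, body = unground(action, group)
        print(head, ":-", body)
# -> ('move', ('V0', 'V2')) :- [('top', ('V0',)), ('on', ('V0', 'V1')),
#                               ('on', ('V1', 'V2'))]
</preformat>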
        <p>SD-NLRL allows an action predicate to be described by an unpredetermined number of rules.
Moreover, there is no limit on either the arity of the predicates or the number of atoms that
are included in the body of the rules. As a consequence, SD-NLRL defines an architecture that
is different from the one used by NLRL, and it defines a single deduction step $f$ as:
$$f(e, e_0) = e_0 \oplus \bigoplus_{i} w_i \, h_i(e), \tag{9}$$
where $w_i$ is the weight of rule $i$ for predicate $p$, and $h_i(e)$ denotes a forward chaining step
using the $i$-th rule defined for predicate $p$. The values of the weights $w_i$ are restricted to $[0, 1]$
by applying the following normalization function:
$$w_i = \frac{\theta_i - \min(\theta)}{\max(\theta) - \min(\theta)}. \tag{10}$$
In order to implement $h_i(e)$ for rule $i$, SD-NLRL defines a matrix $X_i \in \mathbb{N}^{|G| \times w \times c}$ for each
rule, where $c$ is the number of atoms in the body of the rule. Then, SD-NLRL computes the matrices
$X_1, \ldots, X_c \in \mathbb{N}^{|G| \times w}$ before starting the training process. The algorithm then computes
$Y_1, \ldots, Y_c \in \mathbb{R}^{|G| \times w}$ using $\mathrm{gather}$, and it obtains $Z$, defined as:
$$Z = Y_1 \odot \cdots \odot Y_c. \tag{11}$$
Finally, the algorithm defines $h_i$ using (8).</p>
        <p>SD-NLRL is able to associate the same value to different rule weights. Note that the proposed
normalization (10) has some problems. In fact, it is not applicable if $\max(\theta) = \min(\theta)$, e.g., with
a single weight. Moreover, using this normalization imposes that both a rule with weight 0
and a rule with weight 1 must always be present. These problems can be detrimental in some
situations, although for most real-world applications these problems should not be present.</p>
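        <p>A minimal sketch of the normalization of equation (10), assuming NumPy, illustrates both issues: the division is undefined when the minimum and maximum weights coincide, and one weight is always mapped to 0 while another is mapped to 1. The values below are invented for the example.</p>
        <preformat>
# Sketch of the min-max normalization of equation (10), with invented values.
import numpy as np

def normalize(theta):
    span = theta.max() - theta.min()
    if span == 0.0:  # e.g., a single weight: equation (10) is not applicable
        raise ValueError("normalization undefined when max(theta) equals min(theta)")
    return (theta - theta.min()) / span

print(normalize(np.array([0.2, 1.5, 3.0])))  # -> [0.    0.464 1.   ]
# One rule is always forced to weight 0 and another to weight 1.
</preformat>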
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Experimental results</title>
        <p>
          In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the performance of SD-NLRL has been evaluated on the same reinforcement learning
tasks that are discussed in the paper that originally introduced NLRL [<xref ref-type="bibr" rid="ref5">5</xref>]. In particular,
SD-NLRL has been evaluated on two cliff-walking tasks and three block manipulation tasks. The
cliff-walking tasks are CLIFFWALKING and WINDYCLIFFWALKING, where the agent must
move from an initial position to a goal position on a 5x5 grid without going to a cliff position
and using as few moves as possible. The difference between CLIFFWALKING and
WINDYCLIFFWALKING is that in the second task the agent has a 10% chance of moving downward on the
grid, independently from the taken action.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the block manipulation tasks are STACK, UNSTACK, and ON. STACK requires the
agent to build a single column of blocks, UNSTACK requires the agent to put all the blocks
on the floor, while in ON the agent must put a specific block onto another specific block. The
evaluation of the learned strategies was made using the same variations of the tasks described
in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In particular, many variations of the same task are used to evaluate the generalization
capabilities of the method. The SD-NLRL algorithm fails to solve block manipulation tasks [<xref ref-type="bibr" rid="ref9">9</xref>]
because it is not possible to subdivide states into small groups of atoms. In particular, in block
manipulation tasks the atoms of the states are connected to each other because all the blocks
are placed upon other blocks or they are on the floor. Therefore, they are connected to the other
blocks directly through the on/2 predicate, or indirectly through the floor constant, which is
common to every pile of blocks. For example, the following state represents a pile of blocks:
top(a), on(a,b), on(b,c), on(c,floor)
in which block c is on the floor, block b is on block c, and block a is on block b at the top of
the pile. This state produces a single group of atoms. This is a characteristic of the states of
the considered block manipulation tasks. Therefore, SD-NLRL produces complex rules, and the
learning process becomes more computationally demanding as both the number and the size of
the generated rules increase.
        </p>
        <p>Table 1, taken from [<xref ref-type="bibr" rid="ref9">9</xref>], reports the average returns of SD-NLRL. In order to obtain the results,
5 runs have been made for each training task, and the resulting models have been evaluated
on 100 episodes for each task. Finally, the returns have been averaged over all runs for each
task. The variants of CLIFFWALKING and WINDYCLIFFWALKING, which are used to measure
the generalization capabilities of the algorithm, are: top-left, top-right, and center impose a
different starting position of the agent, while 6x6 and 7x7 increase the size of the grid. From the
results shown in Table 1, it emerges that SD-NLRL is able to effectively solve CLIFFWALKING, and
it is able to generalize to unseen states. However, the results show great variance, and SD-NLRL
is not able to learn an effective strategy for WINDYCLIFFWALKING. The agent often remains
trapped into a local optimum. This behaviour is justified by the rule generation strategy, which
produces rules from the states of the environment, and it is not able to generate the best rules
for each action predicate.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Handling a recursive pattern</title>
      <p>One of the main issues of SD-NLRL is that it is not able to handle deeply connected states
because compact rules cannot be generated. This prevented SD-NLRL from generating an
effective strategy for block manipulation tasks. This section presents the proposed extension of
SD-NLRL to deal with a form of deeply connected states. In particular, the considered states
exhibit recursive patterns. The proposed extension can be easily adapted to capture other
recursive patterns within the states. Note that the proposed algorithm behaves as the original
one on cliff-walking tasks because the states of the cliff-walking tasks do not contain recursive
patterns. Therefore, the results shown in Table 1 are still valid for the proposed extension.</p>
      <p>As previously discussed, the states of the block manipulation tasks discussed in [<xref ref-type="bibr" rid="ref5">5</xref>] follow a
specific pattern. For example, a pile of blocks is always defined by:</p>
      <p>top(X1), on(X1,X2), on(X2,X3), ..., on(Xn,floor),
where n is a number and on(X,Y) is a predicate which specifies that a block X is on Y, where Y
is either a block or the floor. The other predicate, top(X), specifies that block X is the top block
of a pile. In order to reduce the size and the number of the generated rules, it is important to
capture the recursive patterns within states.</p>
      <p>The discussed block manipulation tasks require the agent to capture the concept of a pile of
blocks, which can be modeled as:
pile(X,Y) :- on(X,Z), on(Z,Y).</p>
      <p>pile(X,Y) :- on(X,Z), pile(Z,Y).</p>
      <p>The predicate pile/2 is able to capture the chain relation among the blocks. The proposed
extension of SD-NLRL defines the auxiliary predicate pile/2 during the rule generation phase,
and it replaces the occurrences of on/2 in the generated rules with pile(X,Y), where X and Y
are the constants that limit the sequence on the left and on the right, respectively. An example
of the application of replace_chain is provided: the state
top(a), on(a,b), on(b,c), on(c,floor), top(d), on(d,floor)
is rewritten as a more compact set of atoms:</p>
      <p>top(a), pile(a,floor), top(d), on(d,floor).</p>
      <p>In particular, Algorithm 1 is modified adding two lines, as shown in Algorithm 2. The
function replace_chain performs the substitution of the connected predicates, and it applies a
sorting algorithm to the group of atoms. The function sorts the atoms looking at the names of
the predicates. When two predicates have the same name, the function uses the arguments,
proceeding from left to right. The ordering on the atoms is enforced to reduce the number of
generated rules after the call to unground. For example, starting from the following groups of
atoms, which both define a pile of blocks and a single block on the floor:
top(a), pile(a,floor), top(d), on(d,floor),
top(a), top(d), pile(a,floor), on(d,floor),
and considering the action atom move(a,d), if the ordering is not enforced on the body of the
rule the method generates the following rules:
move(X,Z) :- top(X), pile(X,Y), top(Z), on(Z,Y).</p>
      <p>move(X,Y) :- top(X), top(Y), pile(X,Z), on(Y,Z).</p>
      <p>Therefore, the algorithm would generate two rules that are equivalent. On the contrary, after
ordering the atoms and applying unground, the body of the generated rules would be:
move(Z,X) :- on(X,Y), pile(Z,Y), top(Z), top(X).</p>
      <p>move(Z,X) :- on(X,Y), pile(Z,Y), top(Z), top(X).</p>
      <p>Therefore, the algorithm generates two equal rules, and one rule can be discarded.</p>
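      <p>The following Python sketch illustrates a possible implementation of the chain replacement and of the canonical ordering described above. It is a hypothetical reimplementation from the prose, restricted to chains of on/2 atoms, and it is not the authors' replace_chain.</p>
      <preformat>
# Hedged sketch of replace_chain: collapse a maximal chain of on/2 atoms into a
# single pile/2 atom, then sort the group canonically. Atoms are (pred, args).
def replace_chain(group):
    ons = {args[0]: args[1] for pred, args in group if pred == "on"}
    tops = {args[0] for pred, args in group if pred == "top"}
    out = [a for a in group if a[0] != "on"]
    for start in tops:
        end, length = start, 0
        while end in ons:               # follow on(X,Y) links to the right
            end, length = ons[end], length + 1
        if length >= 2:                 # a chain: replace it with pile(start, end)
            out.append(("pile", (start, end)))
        elif length == 1:               # a single on/2 atom is kept as it is
            out.append(("on", (start, ons[start])))
    # sort by predicate name, then by the arguments from left to right
    return sorted(out, key=lambda a: (a[0], a[1]))

group = [("top", ("a",)), ("on", ("a", "b")), ("on", ("b", "c")),
         ("on", ("c", "floor")), ("top", ("d",)), ("on", ("d", "floor"))]
print(replace_chain(group))
# -> [('on', ('d', 'floor')), ('pile', ('a', 'floor')),
#     ('top', ('a',)), ('top', ('d',))]
</preformat>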
      <p>Replacing low-level concepts with more abstract concepts allows the proposed extension to
reduce both the number and the size of the generated rules. Therefore, the required
computational resources are heavily reduced, and the algorithm is consequently able to complete the
training process for block manipulation tasks. Moreover, the agent is able to generalize to piles
of blocks of unpredetermined length.</p>
      <p>The proposed extension also changes the definition of get_shared, which now returns the
set of shared constants among all the atoms in the group together with the action atom. The
function get_unshared has been changed as well. This modification has been made to further
reduce the number of generated rules, although their length can grow because the number of
shared constants is bigger. In fact, the algorithm may add more background knowledge to each
rule. In general, it is not clear which definition should be preferred, and this can represent an
interesting research direction for the future.</p>
      <p>Table 2 shows the performance of the proposed extension for block manipulation tasks. The
results have been obtained following the same procedure that is discussed in the previous
section. The base environment of UNSTACK starts with a single column of blocks. Five variants
of the task are defined: swap-2, divide-2, 5-blocks, 6-blocks, and 7-blocks. swap-2 and divide-2
swap the top two blocks and divide the blocks in two columns, respectively. The other three
variants increase the number of blocks. The base STACK environment starts with all the blocks
on the floor. Five variations of the task are defined: swap-2, divide-2, 5-blocks, 6-blocks, and
7-blocks. swap-2 and divide-2 swap the two blocks on the right and divide the blocks in two
columns, respectively. Similarly to the UNSTACK problem, the other three variants increase the
number of blocks. The base environment of ON starts with a single column of blocks. Similar to
UNSTACK and STACK, three variants are defined with an increasing number of blocks: 5-blocks,
6-blocks, and 7-blocks. Two additional variants of the task are swap-2 and swap-middle, which
swap the top two blocks and the middle two blocks, respectively.</p>
      <p>Both SD-NLRL and the proposed extension are based on the official implementation of NLRL,
which is available at github.com/ZhengyaoJiang/NLRL, and they all share the same
hyperparameters. However, the learning rate and $T$ are specific to each task. In particular, $T$ was set
to 4 for block manipulation tasks during the training phase. In the evaluation phase, $T$ was set
to 7 because more forward chaining steps are needed to obtain the correct truth value of the
action predicates. In fact, the pile/2 predicate presents a recursive definition, and the number
of deduction steps must be greater than the number of recursive calls. The learning rate was
set to 0.05 for all tasks.</p>
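      <p>The following plain-Python fixpoint sketch illustrates why the number of deduction steps must exceed the number of recursive calls of pile/2: each forward chaining step extends the deduced pile/2 atoms by a single on/2 link. The 4-block state is invented for the example, and the sketch is not the differentiable architecture used by SD-NLRL.</p>
      <preformat>
# Plain-Python forward chaining for the two pile/2 rules; invented 4-block state.
facts = {("on", ("a", "b")), ("on", ("b", "c")), ("on", ("c", "d")),
         ("on", ("d", "floor"))}

def step(atoms):
    """One forward chaining step for the two pile/2 rules defined above."""
    new = set(atoms)
    ons = [args for p, args in atoms if p == "on"]
    piles = [args for p, args in atoms if p == "pile"]
    for x, z in ons:
        for z2, y in ons:
            if z == z2:                 # pile(X,Y) :- on(X,Z), on(Z,Y).
                new.add(("pile", (x, y)))
        for z2, y in piles:
            if z == z2:                 # pile(X,Y) :- on(X,Z), pile(Z,Y).
                new.add(("pile", (x, y)))
    return new

atoms = facts
for t in range(1, 4):
    atoms = step(atoms)
    print(t, sorted(args for p, args in atoms if p == "pile"))
# pile(a,floor) is deduced only at t = 3: taller piles need more deduction steps
# before the action predicates obtain the correct truth values.
</preformat>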
      <p>As shown in Table 2, the results obtained by NLRL and by SD-NLRL with the proposed
extension are comparable on both STACK and UNSTACK, but NLRL shows an overall better
performance. Moreover, SD-NLRL with the proposed extension shows worse generalization
capabilities than NLRL on STACK. Finally, the results on the ON task show that NLRL
outperforms the presented algorithm by a considerable margin, and SD-NLRL with the proposed
extension does not effectively generalize to slightly different tasks. In fact, the proposed
extension fails to complete the evaluation process on three variants of ON because the realized
computational graph exceeds the technical limitations of the implementation. In the case of
STACK, the rules that are learned using SD-NLRL suggest that the rule generation process fails
to produce a general rule for the initial state, where all the blocks are on the floor. Therefore, the
algorithm performs worse as the number of blocks increases because more random moves are
needed to build the first pile of blocks. In the case of ON, the task is clearly more difficult than
the other tasks, and both the number and size of the generated rules increase because an additional
predicate, goal_on(X,Y), is required to specify that block X must be placed on block Y. Therefore,
it is even more difficult to generate good candidate rules from states. It is worth noting that,
in all cases, only the rules with a weight greater than 0.3 are reported, for space limitations.
In summary, the presented experimental results provide empirical evidence that the proposed
extension to SD-NLRL is beneficial, even if a much wider experimental campaign is needed to
assess convincing and reliable results.</p>
      <p>The following example shows another application of this technique, which could be a task
where a state of the environment describes a sequence of rectangles and circles. In particular,
let rect(X,Y) and circle(X,Y) represent two types of cells that connect positions X and Y
on a path, respectively. Then, the path(X,Y) predicate can be used to define a specific path
structure, represented as a sequence of rectangles and circles:
path(X,Y) :- rect(X,Z), circle(Z,K), rect(K,Y).</p>
      <p>path(X,Y) :- circle(X,Z), path(Z,K), circle(K,Y).</p>
      <p>Note that this technique can be extended to every recursive pattern within the states, even with
predicates with arity higher than 2 and different forms of recursion. However, the presented
method is not able to automatically detect recursion within states, and it does not automatically
define the recursive predicate.</p>
      <p>In order to provide a comprehensive view of the proposed method, the following part reports
the learned rules of the best model for each block manipulation task. In particular, the learned
rule for STACK is:
move(Y,M) :- floor(X),on(Y,X),on(Z,X),pile(M,X),top(Y),top(M),top(Z).</p>
      <p>The learned rule moves a block that is on the floor onto a pile of at least two blocks. Therefore,
SD-NLRL can be used to learn a compact strategy that effectively solves the task, but the learned
strategy is not the best one because some atoms, namely on(Z,X) and top(Z), may be omitted.
The learned rules for UNSTACK are instead:
move(Y,X) :- floor(X),pile(Y,X),pile(Z,X),top(Y),top(Z).
move(Z,X) :- floor(X),pile(Y,X),pile(Z,X),top(Y),top(Z).
move(Y,X) :- floor(X),pile(Y,X),top(Y).
move(Z,X) :- floor(X),on(Y,X),pile(Z,X),top(Z),top(Y).
move(M,X) :- floor(X),on(Y,X),on(Z,X),pile(M,X),top(Y),top(M),top(Z).
In this case, SD-NLRL can be used to learn an effective strategy, but the number of learned rules
is even bigger. In fact, the presented strategy is to move a top block on the floor when there is
a single pile of blocks, two different piles of blocks, or a pile of blocks and one or two blocks
on the floor. It is evident that the learned strategy is redundant, and the third rule would be
sufficient to solve the task. Finally, the learned rules for ON are:
move(M,X) :- floor(X),goal_on(Y,Z),pile(M,X),pile(N,X),top(M),top(N).
move(Y,Z) :- floor(X),goal_on(Y,Z),on(Y,X),on(Z,X),on(M,X),on(N,X),top(Y),top(Z),top(M),top(N).
move(Y,Z) :- floor(X),goal_on(Y,Z),on(Y,X),on(M,X),pile(Z,X),top(Y),top(Z),top(M).
move(Y,Z) :- floor(X),goal_on(Y,Z),on(Y,X),pile(Z,X),top(Y),top(Z).</p>
      <p>The proposed method can be used to learn stacking the goal blocks when they are top blocks.
When it is not possible to directly stack the goal blocks, SD-NLRL can be used to learn to put
a top block on the floor. This strategy is effective but, similarly to the previously discussed
strategies, the number of learned rules is excessive. In particular, the third rule should be
omitted, and the strategy lacks a rule that moves the top block of a pile onto a block on the floor
when the two blocks are the goal blocks. Moreover, the learned rules contain many unnecessary
atoms, e.g., pile(N,X) and top(N) in the first rule. Another example is the second rule that has
been learned for the ON task, which requires that 4 blocks are on the floor. The learned rule is
not only redundant, but also dangerous because it cannot be used in an environment with only
3 blocks. Both these examples show that the learned rules are too specific with respect
to the states of the training environment. In fact, the training environment defines 4 blocks,
and the generated rules often represent specific situations only. In summary, from this analysis
emerges that, despite being able to learn an effective strategy, many rules that are learned using
SD-NLRL are unnecessary or overly specific. In fact, the agent learns a strategy that is specific
to the training environment. Therefore, generating more compact and general, yet useful, rules
is another important research direction for the future.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>This paper proposed an extension of SD-NLRL, which is a neural-symbolic method for
reinforcement learning. The proposed extension modifies the rule generation process of the original
algorithm to capture recursive patterns within the states of the environment. The experimental
results show that the proposed extension can be used to learn an effective strategy for block
manipulation tasks, which are problematic for the original method. Moreover, the algorithm
effectively generalizes to unforeseen situations. In particular, when the states exhibit a recursive
pattern, the proposed method can be useful to reduce the number of generated rules and the
required computational resources. Moreover, SD-NLRL performs worse than NLRL, but in many
tasks the difference between the algorithms is small in terms of performance, and SD-NLRL
solves the tasks using the states of the environment. On the contrary, NLRL requires the user
to specify the characteristics of the generated rules. Both are strong requirements but, in many
cases, the states of the environment can be automatically obtained from the problem definition,
while it can be difficult to manually tune the program templates for complex tasks. Note that
the proposed extension could be easily generalized to other tasks.</p>
      <p>Despite consistently solving the considered tasks, the proposed extension performs worse
than NLRL, and it presents several limitations. In particular, it does not generate sufficiently
general and abstract rules, and this also affects the generalization capabilities of the method.
Moreover, the proposed extension captures only a specific pattern among states, and it is not
able to automatically detect recursive patterns within states. The proposed method is also not
able to tackle complex problems because it builds large models, and it requires a large amount
of computational resources to solve simple tasks. Moreover, the experimental results suggest
that the proposed extension can be used to learn rules that are either redundant or overly
case-specific. Finally, many limitations of the original SD-NLRL algorithm are still present. In
fact, the proposed extension still needs the set of possible states, and it sometimes remains
trapped into a local optimum during training.</p>
      <p>There are many interesting research directions for the future. For example, more effort is
needed to generate abstract and general rules from deeply connected states, and the presented
approach could be extended to automatically detect different types of recursive patterns within
the states. Moreover, the proposed method can be improved by generating the rules iteratively
as the agent visits new states. This removes the need for the environment to provide the set of all states,
and it is useful to reduce the number of generated rules, allowing the method to tackle more
complex tasks. Finally, the learning stability should be improved, and more effort is needed to
prevent the agent from learning redundant rules. In summary, despite covering a specific case,
this work provides convincing experimental evidence of the validity of the approach, which
can be a viable means towards successful neural-symbolic reinforcement learning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dourado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Velasquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic reinforcement learning and planning: A survey</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          , I. Antonoglou,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint:1707.06347</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          , G. van den Driessche, J. Schrittwieser,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dieleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grewe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>
          ,
          <source>Nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          )
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Neural logic reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3110</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <article-title>Deep learning: A critical appraisal</article-title>
          ,
          <source>arXiv preprint:1801.00631</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Glanois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          , W. Liu,
          <article-title>Differentiable logic machines</article-title>
          ,
          <source>arXiv preprint:2102.11529</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Payani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fekri</surname>
          </string-name>
          ,
          <article-title>Incorporating relational background knowledge into reinforcement learning via differentiable inductive logic programming</article-title>
          ,
          <source>arXiv preprint:2003.10386</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Beretta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Monica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bergenti</surname>
          </string-name>
          ,
          <article-title>Preliminary results on a state-driven method for rule construction in neural-symbolic reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 17th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy</source>
          <year>2023</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <article-title>Learning explanatory rules from noisy data</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          , L. De Raedt,
          <string-name>
            <given-names>K.</given-names>
            <surname>Driessens</surname>
          </string-name>
          ,
          <article-title>Relational reinforcement learning</article-title>
          ,
          <source>Machine learning 43</source>
          (
          <year>2001</year>
          )
          <fpage>7</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          ,
          <source>Machine learning 8</source>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Driessens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Blockeel</surname>
          </string-name>
          ,
          <article-title>Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner</article-title>
          ,
          <source>in: Proceedings of the 12th European Conference on Machine Learning (ECML</source>
          <year>2001</year>
          ), Springer,
          <year>2001</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Driessens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramon</surname>
          </string-name>
          ,
          <article-title>Relational instance based regression for relational reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 20th International Conference on Machine Learning (ICML</source>
          <year>2003</year>
          ), AAAI Press,
          <year>2003</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gärtner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Driessens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramon</surname>
          </string-name>
          ,
          <article-title>Graph kernels and Gaussian processes for relational reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Inductive Logic Programming (ILP</source>
          <year>2003</year>
          ), Springer,
          <year>2003</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          ,
          <article-title>Inductive logic programming</article-title>
          ,
          <source>New generation computing 8</source>
          (
          <year>1991</year>
          )
          <fpage>295</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>