Preliminary Results on a State-Driven Method for Rule Construction in Neural-Symbolic Reinforcement Learning

Davide Beretta1,∗, Stefania Monica2 and Federico Bergenti1
1 Dipartimento di Scienze Matematiche, Fisiche e Informatiche, Università degli Studi di Parma, Italy
2 Dipartimento di Scienze e Metodi dell'Ingegneria, Università degli Studi di Modena e Reggio Emilia, Italy

NeSy 2023, 17th International Workshop on Neural-Symbolic Learning and Reasoning, Certosa di Pontignano, Siena, Italy
∗ Corresponding author.
davide.beretta@unipr.it (D. Beretta); stefania.monica@unimore.it (S. Monica); federico.bergenti@unipr.it (F. Bergenti)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Deep reinforcement learning has obtained impressive results in the last few years. However, the limitations of deep reinforcement learning with respect to interpretability and generalization have been clearly identified and discussed. In order to overcome these limitations, neural-symbolic methods for reinforcement learning have recently been proposed. This paper presents preliminary results on a new neural-symbolic method for reinforcement learning called State-Driven Neural Logic Reinforcement Learning. The discussed method generates sets of candidate logic rules directly from the states of the environment. Then, it uses a differentiable architecture to select a good subset of the generated rules to successfully complete the training task. The experimental results presented in this paper provide empirical evidence that the discussed method can achieve good performance without requiring the user to specify the structure of the generated rules. Besides being preliminary, the experimental results also suggest that the presented method has sufficient generalization capabilities to allow using learned rules in environments that are sufficiently similar to the training environment. However, this is a preliminary work, and the experimental results show that the proposed method is not yet sufficiently effective.

Keywords
Artificial intelligence, Machine learning, Reinforcement learning, Neural-symbolic learning

1. Introduction

Deep reinforcement learning has received increasing interest from the research community in recent years because its methods have obtained impressive results on many complex tasks (e.g., [1, 2, 3, 4, 5]). However, the limitations of deep reinforcement learning from the perspective of the interpretability of learned strategies are well known. Moreover, these methods are expected to have limited generalization capabilities because they are unable to reuse learned strategies on tasks that are different from the ones used for training [6, 7]. In order to overcome these two major limitations, neural-symbolic methods for reinforcement learning have gained relevant interest and appreciation. Many research proposals have tried to combine neural approaches with first-order logic, such as DLM [8], dNL [9], and NLRL [7]. These three methods use differentiable architectures [10] to learn sets of logic rules that solve the tasks used for training. Therefore, these methods are able to effectively combine the benefits of neural architectures, such as the ability to cope with uncertain environments, with good interpretability and generalization capabilities.
However, these methods require the user to provide large amounts of data to guide the search. Moreover, they have not yet been tested on complex reinforcement learning tasks. Finally, they often require large amounts of computational resources to solve even simple tasks.

This paper proposes a novel neural-symbolic method for reinforcement learning, called SD-NLRL (State-Driven Neural Logic Reinforcement Learning), that is largely based on NLRL. For a given learning task, NLRL generates a set of candidate rules using a top-down approach that requires the user to specify a set of hyper-parameters, called a program template, to guide the generation of rules. On the contrary, SD-NLRL generates its candidate rules directly from the states of the environment. Then, SD-NLRL uses a modified version of the differentiable architecture of NLRL to select the best subset of the generated rules that solves the task under investigation. In order to fairly compare SD-NLRL with NLRL, SD-NLRL was used on the tasks that are described in the paper [7] that introduced NLRL. Specifically, the two methods are compared using two cliff-walking tasks and three block manipulation tasks.

The work presented in this paper is closely related to relational reinforcement learning [11], which represents states and actions using first-order logic and induces a relational regression tree that associates each state-action pair with a Q-value. The literature documents several extensions (e.g., [12, 13, 14]) of the original relational reinforcement learning method. However, these extensions do not use differentiable architectures, and therefore they are not able to exploit the most recent breakthroughs of deep reinforcement learning. A recent work on neural-symbolic reinforcement learning that is closely related to SD-NLRL is documented in [9], which proposes an adaptation of dNL [15] for reinforcement learning tasks. dNL uses neural networks that learn logic formulae in conjunctive or disjunctive normal form. dNL requires the user to provide the structure of the neural network together with the number of free variables to be used in the learned rules. On the contrary, SD-NLRL uses a differentiable architecture based on the states of the environment, and it does not require the user to provide additional parameters that define the structure of the learned rules. Another method that is closely related to SD-NLRL is DLM [8], which is based on NLM [16]. NLM obtained notable results on some reinforcement learning tasks using a deep neural network to mimic logical deduction. However, the learned policies that NLM provides are not directly interpretable. DLM extends NLM by replacing neural operators with fuzzy operators, and therefore it is able to learn interpretable policies. However, DLM learns ground rules only. Moreover, both NLM and DLM require the user to specify the architecture of the neural network, which is not required by SD-NLRL.

The remainder of this paper is structured as follows. Section 2 discusses the proposed method. Section 3 presents an experimental comparison of SD-NLRL and NLRL. Finally, Section 4 concludes the paper by discussing some of the current limitations of SD-NLRL and outlining possible research directions for the future.

2. SD-NLRL

The first part of this section briefly recalls the subset of first-order logic, namely Datalog, that is considered in this work.
Then, this section proceeds with a brief introduction to NLRL, which is the neural-symbolic method for reinforcement learning that represents the basis of SD-NLRL. Finally, SD-NLRL is described, focusing on its main differences from NLRL.

2.1. Datalog

Datalog is a subset of first-order logic, and its syntax comprises three main primitives: predicate symbols, variable symbols, and constant symbols. Predicate symbols, or predicates, represent relations among objects in the domain of discourse. Constant symbols, or constants, represent the objects of the domain of discourse. Variable symbols, or variables, represent references to unspecified objects of the domain of discourse. An atom $\beta$ is defined as $p(t_1, \ldots, t_n)$, where $p$ is a predicate and $t_1, \ldots, t_n$ are terms, which can be either variables or constants. For example, if on(X, table) is an atom that is true if the first object is on the second object, then on is the predicate symbol, X is a variable, and table is a constant. When the terms $t_1, \ldots, t_n$ are all constants, the atom is called ground. A rule of the form $\beta \leftarrow \beta_1, \ldots, \beta_n$ is called a definite clause. For the sake of simplicity, in this work, the word rule is considered a synonym of definite clause. When a predicate is defined using ground atoms only, it is called an extensional predicate. On the contrary, when a predicate is defined using a set of definite clauses, it is called an intensional predicate.

2.2. NLRL

NLRL is an adaptation for reinforcement learning tasks of ∂ILP [10], a neural-symbolic method for inductive logic programming. In particular, in NLRL, both the environment and the background knowledge are described using ground atoms, and, for a given task, NLRL generates a set of candidate rules using a set of hyper-parameters called a program template. Formally, a program template is defined as a quadruple:

\[
\langle \mathrm{Pred}_a, \mathrm{arity}_a, \Pi, T \rangle, \tag{1}
\]

where $\mathrm{Pred}_a$ and $\mathrm{arity}_a$ denote, respectively, the number and the arity of the auxiliary predicates, $T$ denotes the number of forward chaining steps, and $\Pi$ denotes a set of rule templates for each intensional predicate, either target or auxiliary. Each rule template is formally defined as $\langle v, int \rangle$, where $v$ denotes the number of free variables in the rule, and $int$ indicates whether the body of the rule can contain intensional predicates. Therefore, NLRL is based on strong assumptions on the rule generation process, and the user must specify many hyper-parameters to guide this process. Moreover, in NLRL, the body of each rule must contain exactly two atoms, and several additional constraints are enforced to further limit the space of generated rules. The needed program template provides many details on the form of the generated rules, and it is often difficult to appropriately tune all these hyper-parameters in real-world applications. It is worth noting that the program template must be accurately designed. In fact, the number of generated rules grows very quickly with the complexity of the program template, which is determined, for example, by the number of auxiliary predicates.

In order to solve a reinforcement learning task, NLRL associates a trainable weight with each generated rule. In particular, NLRL defines a valuation vector in $E = [0, 1]^{|G|}$, where $G$ is the set of ground atoms. Each component of a valuation vector represents how likely the related ground atom is expected to be true.
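To make the notions of ground atoms and valuation vectors concrete, the following minimal Python sketch enumerates the ground atoms induced by a small, hypothetical signature and stores a valuation as a vector aligned with that enumeration. The predicates, constants, and truth values are invented for illustration and are not taken from the tasks considered later in the paper.

from itertools import product
import numpy as np

# Hypothetical signature: predicates with their arities, and the constants.
predicates = {"on": 2, "top": 1}
constants = ["a", "b", "table"]

# Enumerate the set G of ground atoms; NLRL also adds a falsum atom used as padding.
ground_atoms = ["falsum()"]
for pred, arity in predicates.items():
    for args in product(constants, repeat=arity):
        ground_atoms.append(f"{pred}({','.join(args)})")
index = {atom: i for i, atom in enumerate(ground_atoms)}

# A valuation vector e in [0, 1]^{|G|}: component i is the degree of truth of atom i.
e = np.zeros(len(ground_atoms))
e[index["on(a,table)"]] = 1.0   # crisp facts describing the current state
e[index["on(b,a)"]] = 1.0
e[index["top(b)"]] = 0.9        # an uncertain (fuzzy) truth value

During training, the differentiable architecture updates such valuations through repeated forward chaining steps, as formalized next.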
The deductive process can be formalized as a mapping $f : E \times E \to E$ defined, for the $t$-th iteration, as:

\[
f^t(e, e_0) =
\begin{cases}
g(f^{t-1}(e, e_0), e_0) & \text{if } t > 0\\
e_0 & \text{if } t = 0,
\end{cases}
\tag{2}
\]

where $e_0$ is the initial valuation vector, and $g$ represents a single forward chaining step. Let $\oplus$ denote the probabilistic sum, which is defined as $a \oplus b = a + b - a \odot b$, where $\odot$ denotes element-wise multiplication. Then, $g$ can be defined as:

\[
g(e, e_0) = e_0 + \sum_{p} \bigoplus_{0 \le i \le n} \sum_{0 \le j \le k} w^p_{i,j}\, h^p_{i,j}(e),
\tag{3}
\]

where $n$ is the number of rule templates that define predicate $p$, $k$ is the number of rules generated from the $i$-th rule template, $w^p_{i,j}$ is the trainable weight associated with the $j$-th rule generated from the $i$-th rule template for predicate $p$, and $h^p_{i,j}(e)$ represents a single deduction step using the $j$-th rule generated from the $i$-th rule template for predicate $p$. For a given learning task, NLRL generates a set of candidate rules for each rule template, but only one rule can be selected for each rule template. Therefore, each weight $w^p_{i,j}$ is obtained by applying a softmax function to the underlying vector of weights $\theta^p_i$:

\[
w^p_{i,j} = \mathrm{softmax}(\theta^p_i)[j].
\tag{4}
\]

Given a rule $c$, each function $h_c : E \to E$ takes a valuation representing the truth values of the ground atoms and computes a valuation that represents the truth values of the atoms resulting from the application of the rule. For each function $h_c$, NLRL builds the following matrix $X_c$:

\[
X_c[r, m] =
\begin{cases}
x_r[m] & \text{if } m < |x_r|\\
(0, 0) & \text{otherwise}
\end{cases}
\qquad
x_r = \{(a, b) \mid \mathrm{satisfies}_c(\gamma_a, \gamma_b) \wedge \mathrm{head}_c(\gamma_a, \gamma_b) = \gamma_r\},
\tag{5}
\]

where $x_r$ is a set of pairs of indexes, and each pair refers to two ground atoms that entail the ground atom of index $r$. Each matrix $X_c$ is an $n \times w \times 2$ matrix, where $w$ is the maximum number of ground atoms that entail each ground atom. It is worth noting that the unused index pair $(0, 0)$ must reference a pair of atoms. Therefore, NLRL introduces the falsum atom $\bot$ in $G$, and it maps $(0, 0)$ to $(\bot, \bot)$. In order to perform the forward chaining steps during the training phase, NLRL computes two slices of $X_c$, namely $X_1, X_2 \in \mathbb{N}^{n \times w}$, as:

\[
X_1 = X_c[\_, \_, 0] \qquad X_2 = X_c[\_, \_, 1].
\tag{6}
\]

NLRL retrieves the actual truth values of each ground atom to obtain $Y_1, Y_2 \in [0, 1]^{n \times w}$ using $\mathrm{gather} : E \times \mathbb{N}^{n \times w} \to \mathbb{R}^{n \times w}$ as follows:

\[
Y_1 = \mathrm{gather}(e, X_1) \qquad Y_2 = \mathrm{gather}(e, X_2).
\tag{7}
\]

Then, NLRL builds a new matrix $Z_c \in [0, 1]^{n \times w}$ defined as:

\[
Z_c = Y_1 \odot Y_2.
\tag{8}
\]

Finally, NLRL defines $h_c(e)$ as:

\[
h_c(e) = e' \quad \text{where} \quad e'[k] = \max_{1 \le j \le w} Z_c[k, j].
\tag{9}
\]

Note that the maximum operator implements fuzzy disjunction, while element-wise multiplication provides fuzzy conjunction.
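The following NumPy sketch illustrates Equations (6)-(9) for a single rule. It mirrors the notation above but is only an illustration under simplified assumptions (dense index tensors, falsum atom at index 0), not an excerpt from the official NLRL implementation.

import numpy as np

def deduction_step(e, X_c):
    # One application of h_c: X_c has shape (n, w, 2) and lists, for every ground
    # atom r, up to w pairs of ground-atom indexes whose conjunction entails r
    # (padded with the index of the falsum atom, here assumed to be 0).
    X1, X2 = X_c[..., 0], X_c[..., 1]   # Eq. (6): the two slices of X_c
    Y1, Y2 = e[X1], e[X2]               # Eq. (7): gather the truth values
    Z = Y1 * Y2                         # Eq. (8): fuzzy conjunction
    return Z.max(axis=1)                # Eq. (9): fuzzy disjunction over the w pairs

def prob_sum(a, b):
    # Probabilistic sum used in Eq. (3) to accumulate deductions.
    return a + b - a * b

# Toy example: G = [falsum, p(a), p(b), q(a), q(b), r(a), r(b)] and rule r(X) :- p(X), q(X).
X_c = np.array([[[0, 0]], [[0, 0]], [[0, 0]], [[0, 0]], [[0, 0]], [[1, 3]], [[2, 4]]])
e = np.array([0.0, 0.9, 0.2, 0.8, 0.1, 0.0, 0.0])
print(deduction_step(e, X_c))           # r(a) = 0.72, r(b) = 0.02

In the actual architecture, the outputs of all candidate rules are weighted by the softmax of Equation (4) and accumulated as in Equation (3).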
2.3. The proposed method

In order to generate a set of suitable candidate rules to solve a reinforcement learning task, NLRL requires the user to specify a large amount of data about the structure of the solution. On the contrary, SD-NLRL requires only the complete set of possible states, and it automatically generates a set of abstract rules that can represent the solutions of the learning task. The rule generation process that characterizes SD-NLRL is described as the generate_rules function in Algorithm 1. This function requires the set of possible action atoms A, the set of possible states S, and the set of background atoms B. The function subdivides each state into a set of groups V using the get_groups function. Each group contains a subset of the state, and each atom in a group is, directly or indirectly, connected with the other atoms in the same group through its constants. Therefore, different groups have different constant sets, and each group represents a feature of the state that contributes to the truth value of the action predicate. Then, for each pair of group and action, the generate_rules function retrieves the constants that are shared between the action atom and the group using the get_shared function. The generate_rules function inserts into the group the background atoms that include at least one shared constant. In fact, the shared constants are the most significant constants because they relate the features of the state with the action to be taken. Then, the generate_rules function computes, using the get_free function, the set of free constants, which are the constants that are not shared between the action atom and the group after the inclusion of the related background atoms. If the set of free constants is empty, the generate_rules function computes the final rules using the unground function, which transforms each ground term of the rule into a corresponding variable. Otherwise, the function produces a rule for each free constant. In particular, the function transforms into variables the free constants that are different from the one taken into consideration. Then, it includes in the body of the generated rule the background atoms that contain the considered free constant. Finally, the unground function transforms the remaining constants in the rule into variables. The generation of a new rule for each free constant is useful to reduce the complexity of each rule. In fact, the goal of the function is to extract the simplest features from each state and to let the differentiable architecture select the subset of the rules that is capable of solving the learning task. The unground function starts from the body of the rule given as its second argument, and it proceeds from left to right. Then, it transforms the head of the rule given as its first argument. The third argument of the unground function is a set of constants. If this set is not empty, the function transforms only the constants that are included in it.

Algorithm 1: The function that performs the generation of rules in SD-NLRL

function generate_rules(A, S, B)
    rules ← {}
    for each s ∈ S do
        V ← get_groups(s)
        for each g ∈ V do
            for each a ∈ A do
                shared ← get_shared(a, g)
                insert atoms b ∈ B into g if constants(b) ∩ shared ≠ ∅
                free ← get_free(a, g)
                if free = ∅ then
                    r ← unground(a, g, {})
                    rules ← rules ∪ {r}
                else
                    for each f ∈ free do
                        fixed ← free ∖ {f}
                        partial_rule ← unground(a, g, fixed)
                        insert b ∈ B into the body of partial_rule if f ∈ constants(b)
                        r ← unground(a, body(partial_rule), {})
                        rules ← rules ∪ {r}
                    end for
                end if
            end for
        end for
    end for
    return rules
end function

The discussed function used by SD-NLRL for the generation of rules allows an action predicate to be defined by a number of rules that is not fixed in advance. Moreover, the body of each rule can consist of an arbitrary number of atoms, and no limit is fixed on the arity of the predicates in these atoms.
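As a minimal illustration of the grouping and ungrounding steps used by Algorithm 1, the following Python sketch represents atoms as plain strings and implements get_groups as a connected-components computation over shared constants, together with a simple unground helper. The helper names mirror Algorithm 1, but the code is a sketch under simplifying assumptions, not the implementation of SD-NLRL.

from itertools import count

def constants_of(atom):
    # Arguments of a ground atom written as 'pred(c1,c2,...)'.
    args = atom[atom.index("(") + 1 : atom.rindex(")")]
    return [a.strip() for a in args.split(",")] if args else []

def get_groups(state):
    # Partition the atoms of a state into groups that are directly or indirectly
    # connected through shared constants.
    groups = []
    for atom in state:
        merged = {"atoms": [atom], "consts": set(constants_of(atom))}
        rest = []
        for g in groups:
            if g["consts"] & merged["consts"]:
                merged["atoms"] += g["atoms"]
                merged["consts"] |= g["consts"]
            else:
                rest.append(g)
        groups = rest + [merged]
    return [g["atoms"] for g in groups]

def unground(head, body, only=frozenset()):
    # Replace constants with variables, scanning the body left to right and then
    # the head; if `only` is non-empty, transform only the constants it contains.
    fresh, mapping = (f"V{i}" for i in count()), {}
    def lift(atom):
        name = atom[: atom.index("(")]
        args = []
        for c in constants_of(atom):
            if not only or c in only:
                if c not in mapping:
                    mapping[c] = next(fresh)
                args.append(mapping[c])
            else:
                args.append(c)
        return f"{name}({','.join(args)})"
    lifted_body = [lift(a) for a in body]
    return f"{lift(head)} :- {', '.join(lifted_body)}."

state = ["current(0,1)", "succ(0,1)", "last(4)"]
print(get_groups(state))                                   # two groups: {current, succ} and {last}
print(unground("right()", ["current(0,1)", "succ(0,1)"]))  # right() :- current(V0,V1), succ(V0,V1).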
The differentiable architecture of NLRL is modified in SD-NLRL to coherently support these changes to the rule generation process. In particular, the deductive process is modified, and SD-NLRL defines a single deduction step $g$ as:

\[
g(e, e_0) = e_0 + \sum_{p} \bigoplus_{0 \le j \le k} w^p_j\, h^p_j(e),
\tag{10}
\]

where $w^p_j$ is the weight associated with the $j$-th rule defined for predicate $p$, and $h^p_j(e)$ represents a single deduction step that uses the $j$-th rule defined for predicate $p$. Note that, in order to ensure that the weights $w^p_j$ are in $[0, 1]$, each weight $w^p_j$ is normalized as:

\[
w^p_j = \frac{\max(w^p) - w^p_j}{\max(w^p) - \min(w^p)}.
\tag{11}
\]

Moreover, the definition of the deduction step, represented by $h^p_j(e)$, has been coherently modified in SD-NLRL. In particular, SD-NLRL defines a matrix $X_c \in \mathbb{N}^{n \times w \times d}$ for each rule $c$, as follows:

\[
X_c[r, m] =
\begin{cases}
x_r[m] & \text{if } m < |x_r|\\
(0, \ldots, 0) & \text{otherwise}
\end{cases}
\qquad
x_r = \{(a_1, \ldots, a_d) \mid \mathrm{satisfies}_c(\gamma_{a_1}, \ldots, \gamma_{a_d}) \wedge \mathrm{head}_c(\gamma_{a_1}, \ldots, \gamma_{a_d}) = \gamma_r\}.
\tag{12}
\]

SD-NLRL builds the matrices $X_1, \ldots, X_d \in \mathbb{N}^{n \times w}$ before the training phase, where $X_i$ represents the $i$-th element of the body of the rule $c$. Then, SD-NLRL obtains the matrices $Y_1, \ldots, Y_d \in \mathbb{R}^{n \times w}$ using the gather function. These matrices are used to build the matrix $Z_c$ as:

\[
Z_c = Y_1 \odot \cdots \odot Y_d.
\tag{13}
\]

Finally, SD-NLRL defines $h_c(e)$ in the same way as NLRL does.

The weight normalization that SD-NLRL uses allows an action predicate to be defined by multiple rules. In particular, each weight depends only on the minimum weight and on the maximum weight defined for the predicate. Therefore, SD-NLRL can associate the same weight with different rules. However, the adopted normalization technique does not work well when SD-NLRL generates only one rule for an action predicate. Moreover, this normalization technique implies that a rule with weight 0 and a rule with weight 1 are always learned, which can prevent SD-NLRL from learning the optimal strategy in some cases.
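The two modifications above can be summarized with a short NumPy sketch: a d-ary version of the gather-based deduction step (Equations (12)-(13)) and the min-max normalization of Equation (11). As before, this is a schematic illustration under simplified assumptions, not the actual SD-NLRL code.

import numpy as np

def normalize_weights(raw):
    # Min-max normalization of the raw weights of the rules defined for one
    # predicate (Eq. (11)); one rule always gets weight 1 and one always gets 0.
    raw = np.asarray(raw, dtype=float)
    return (raw.max() - raw) / (raw.max() - raw.min())

def deduction_step_dary(e, X_c):
    # X_c has shape (n, w, d): for every ground atom r it lists up to w tuples of
    # d ground-atom indexes whose conjunction entails r (padded with the falsum).
    Y = e[X_c]             # gather the truth values of the d body atoms
    Z = Y.prod(axis=2)     # Eq. (13): fuzzy conjunction over the body atoms
    return Z.max(axis=1)   # fuzzy disjunction over the w tuples, as in NLRL

print(normalize_weights([0.2, 1.5, -0.3]))   # approximately [0.72, 0.0, 1.0]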
3. Experimental results

In order to assess the performance of SD-NLRL, the proposed method has been used on the reinforcement learning tasks that are discussed in [7]. In particular, the considered tasks are: CLIFFWALKING, WINDYCLIFFWALKING, UNSTACK, STACK, and ON. In the CLIFFWALKING task, the agent must go from a start position to a goal position on a 5 × 5 grid without reaching a cliff. The WINDYCLIFFWALKING task is similar to CLIFFWALKING, but the agent has a 10% chance of moving downwards, no matter which action it takes. The other three tasks require the agent to manipulate blocks. In particular, STACK requires the agent to pile up all the blocks in a single column, UNSTACK requires the agent to move all blocks to the floor, and ON requires the agent to move a specified block onto another specified block. For a fair comparison between NLRL and SD-NLRL, SD-NLRL was tested using the task configurations documented in [7]. Note that, as previously discussed, SD-NLRL tries to subdivide each state into disjoint groups of atoms, but the states of block manipulation tasks are composed of densely connected atoms. Therefore, SD-NLRL generates a large number of complex rules. Unfortunately, the required computational resources increase considerably as the size and the number of the rules increase. Therefore, SD-NLRL fails to complete the training phases of the block manipulation tasks, and only the cliff-walking tasks are discussed in the remainder of this section.

The implementation of SD-NLRL is based on the official implementation of NLRL (github.com/ZhengyaoJiang/NLRL). In particular, the implementation of SD-NLRL and the implementation of NLRL share the same hyper-parameters. The only exceptions are the learning rate and the number of deduction steps. The learning rate of SD-NLRL was set to 0.0005 and 0.0001 for CLIFFWALKING and WINDYCLIFFWALKING, respectively. The number of deduction steps of SD-NLRL was set to 1 for both tasks because a single forward chaining step is sufficient to obtain the correct truth values of the action predicates.

Table 1 shows the performance of SD-NLRL on the two considered cliff-walking tasks. The results shown in Table 1 were obtained by performing 5 runs for each task, and each trained model was evaluated on 100 episodes for each task. Then, the results were averaged over all runs.

Table 1: A performance comparison between NLRL and SD-NLRL on the CLIFFWALKING and WINDYCLIFFWALKING tasks. The optimal values are taken from [7].

Environment         Task       NLRL             SD-NLRL           Optimal
CLIFFWALKING        training   0.862 ± 0.026    0.674 ± 0.292     0.880
                    top left   0.749 ± 0.057    0.652 ± 0.200     0.840
                    top right  0.809 ± 0.064    0.781 ± 0.038     0.920
                    center     0.859 ± 0.050    0.775 ± 0.198     0.920
                    6 by 6     0.841 ± 0.024    0.631 ± 0.381     0.860
                    7 by 7     0.824 ± 0.024    0.520 ± 0.514     0.840
WINDYCLIFFWALKING   training   0.663 ± 0.377    −0.808 ± 0.341    0.769 ± 0.162
                    top left   0.726 ± 0.075    −0.536 ± 0.480    0.837 ± 0.068
                    top right  0.834 ± 0.061    −0.290 ± 0.548    0.920 ± 0.000
                    center     0.672 ± 0.579    −0.350 ± 0.430    0.868 ± 0.303
                    6 by 6     0.345 ± 0.736    −0.991 ± 0.093    0.748 ± 0.135
                    7 by 7     0.506 ± 0.528    −1.012 ± 0.047    0.716 ± 0.181

In order to evaluate the generalization capabilities of SD-NLRL, the trained models were also tested on tasks that are slightly different from the ones used for training. In particular, the algorithm was evaluated on the same task variants discussed in [7]. The considered variants of the cliff-walking tasks are five: top left, top right, and center, which change the starting position of the agent, and 6 by 6 and 7 by 7, which use a larger grid.

The results reported in Table 1 show that SD-NLRL is able to learn an effective strategy that solves CLIFFWALKING. Moreover, the learned model is able to generalize to different tasks. However, the results suffer from high variance. In fact, in 1 of the 5 trials, SD-NLRL learned a strategy that is only occasionally able to obtain a positive reward. In particular, when the agent does not quickly learn a good strategy, it remains trapped in a local optimum and stops learning. The results for WINDYCLIFFWALKING show even worse performance, and SD-NLRL consistently fails to learn a good strategy. The algorithm learns over time, but the learning speed is very slow, and it always remains trapped in a local optimum. This behavior is explained by the fact that, in NLRL, rules are generated from a carefully hand-crafted program template. On the contrary, SD-NLRL automatically generates the rules from the states of the environment, trying to limit both the size and the number of generated rules. Therefore, SD-NLRL does not succeed in generating the best rules for each action predicate. Both these problems represent interesting challenges for the future.

In order to offer a comprehensive view of SD-NLRL, the learned rules that obtain the best performance in the CLIFFWALKING task (on the right) are reported together with their corresponding weights (on the left):

1.0   down()  :- current(Y,X), last(Y), succ(Z,Y).
1.0   left()  :- current(X,Y), succ(Y,Z), zero(Y).
1.0   right() :- current(Y,X), succ(Z,Y), succ(Y,M).
0.99  right() :- current(X,Y), succ(Z,Y), succ(Y,M).
0.98  right() :- current(X,Y), last(Y), succ(Z,Y).
0.70  right() :- current(X,X), succ(Y,X), succ(X,Z).
1.0   up()    :- current(X,X), succ(X,Y), zero(X).
0.44  up()    :- current(X,X), last(X), succ(Y,X).

The learned rules represent an effective strategy, and the agent obtains high returns in the training environment (0.844 ± 0.034). However, the learned strategy is not compact, and some rules are unnecessary. For example, the last two rules defined for right() and the last rule defined for up() can be safely omitted. SD-NLRL generates the rules directly from the states of the environment, and many generated rules are unnecessary, or even harmful, and should be avoided. Therefore, an interesting research direction for the future regards the reduction of unnecessary rules, either by refining the rule generation process or by introducing a penalization for redundant rules. It is worth noting that the only rule learned for left() can compromise the performance of the strategy because the agent often moves left before moving up. Therefore, the learning of strategies when some actions are forbidden should be further investigated.

Finally, the learned rules, together with their weights, that obtain the worst performance in the CLIFFWALKING task are reported for completeness:

0.58  down()  :- current(X,X), last(X), succ(Y,X).
1.0   down()  :- current(Y,X), last(Y), succ(Z,Y).
0.4   down()  :- current(X,X), succ(Y,X), succ(X,Z).
0.67  down()  :- current(X,Y), succ(Y,Z), zero(Y).
0.51  left()  :- current(X,X), succ(X,Y), zero(X).
1.0   left()  :- current(Y,X), succ(Z,Y), succ(Y,M).
1.0   right() :- current(Y,X), succ(Z,Y), succ(Y,M).
0.82  right() :- current(X,X), last(X), succ(Y,X).
1.0   right() :- current(X,Y), last(Y), succ(Z,Y).
0.42  right() :- current(X,X), succ(Y,X), succ(X,Z).
1.0   up()    :- current(X,Y), succ(Y,Z), zero(Y).
0.49  up()    :- current(Y,X), succ(Y,Z), zero(Y).
1.0   up()    :- current(X,X), succ(Y,X), succ(X,Z).

In this case, the number of learned rules is higher, and the agent obtains low returns in the training environment (0.154 ± 0.629). In fact, many rules are unnecessary or even wrong. Note that, in both cases, only the rules with a weight greater than 0.3 are shown because of space limitations.

From the point of view of the required computational resources, SD-NLRL drastically reduces the number of generated rules with respect to NLRL. Moreover, SD-NLRL performs only 1 deduction step, thus further reducing the required computational resources. In fact, NLRL generates 2813 rules for the cliff-walking tasks, while SD-NLRL generates only 36 rules for the same tasks. The number of generated rules directly influences the required amount of computational resources. Indeed, NLRL requires approximately 10 days to complete a training process, while SD-NLRL completes a training process in less than 2 hours. The required memory is reduced as well: SD-NLRL uses approximately 1.6 GB, while NLRL uses approximately 17 GB. These results are preliminary, and a complete analysis of the computational resources that SD-NLRL requires is left for the future. However, it is worth noting that SD-NLRL does not always reduce the required computational resources because it struggles with densely connected states. In fact, in many cases, SD-NLRL generates a large number of complex rules because it does not limit the size and the number of generated rules. Therefore, the generation of compact and abstract rules even in the worst cases represents an important direction for future research.
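Since only the rules whose learned weight exceeds 0.3 are listed above, a small helper such as the following (a hypothetical post-processing utility, not part of SD-NLRL) can be used to prune and print a learned weighted rule set:

def prune_rules(weighted_rules, threshold=0.3):
    # Keep only the rules whose weight exceeds the threshold, sorted by weight.
    kept = [(w, r) for w, r in weighted_rules if w > threshold]
    return sorted(kept, key=lambda wr: wr[0], reverse=True)

learned = [
    (1.0, "down() :- current(Y,X), last(Y), succ(Z,Y)."),
    (0.42, "right() :- current(X,X), succ(Y,X), succ(X,Z)."),
    (0.05, "up() :- current(X,Y), succ(Y,Z), zero(Y)."),  # hypothetical low-weight rule, dropped
]
for w, r in prune_rules(learned):
    print(f"{w:.2f}  {r}")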
4. Conclusion

This paper introduced a novel neural-symbolic method for reinforcement learning called SD-NLRL. The proposed method is based on NLRL, but it generates candidate rules directly from the states of the environment. The experimental results discussed in the final part of this paper show that the proposed method is able to effectively learn a good strategy in one of the cliff-walking tasks, and that the method is also able to generalize this strategy to tasks that are slightly different from the one used for training. However, SD-NLRL fails to solve block manipulation tasks because it is not able to generate compact rules from the densely connected states that characterize these tasks. The generation of compact, yet general, rules from densely connected states represents an important challenge for the future. Moreover, SD-NLRL often remains trapped in local optima, and this becomes increasingly evident as the difficulty of the task under investigation increases. Another important challenge for the future is to make SD-NLRL more robust against the traps represented by local optima. In addition, the analysis of the learned rules suggests that SD-NLRL learns many unnecessary rules and that it is not able to learn an effective strategy when an action is forbidden. These problems represent two other relevant challenges for the future. Finally, the assumption that the environment is able to provide the set of all possible states represents another important limitation of SD-NLRL. Therefore, a relevant improvement of the method regards the design of an iterative rule-generation process in which new rules are generated only when new states are encountered. In summary, this is a preliminary work, and the current form of the proposed method cannot be considered competitive with other neural-symbolic methods for reinforcement learning like NLRL. However, the presented experimental results are encouraging, and an improved version of SD-NLRL can be considered a viable means toward effective neural-symbolic reinforcement learning.

References

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, Preprint arXiv:1312.5602 (2013).
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of Go without human knowledge, Nature 550 (2017) 354–359.
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, Preprint arXiv:1707.06347 (2017).
[6] G. Marcus, Deep learning: A critical appraisal, Preprint arXiv:1801.00631 (2018).
[7] Z. Jiang, S. Luo, Neural logic reinforcement learning, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, 2019, pp. 3110–3119.
[8] M. Zimmer, X. Feng, C. Glanois, Z. Jiang, J. Zhang, P. Weng, D. Li, J. Hao, W. Liu, Differentiable logic machines, Preprint arXiv:2102.11529 (2021).
[9] A. Payani, F. Fekri, Incorporating relational background knowledge into reinforcement learning via differentiable inductive logic programming, Preprint arXiv:2003.10386 (2020).
[10] R. Evans, E. Grefenstette, Learning explanatory rules from noisy data, Journal of Artificial Intelligence Research 61 (2018) 1–64.
[11] S. Džeroski, L. De Raedt, K. Driessens, Relational reinforcement learning, Machine Learning 43 (2001) 7–52.
[12] K. Driessens, J. Ramon, H. Blockeel, Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner, in: Proceedings of the 12th European Conference on Machine Learning (ECML 2001), Springer, 2001, pp. 97–108.
[13] K. Driessens, J. Ramon, Relational instance based regression for relational reinforcement learning, in: Proceedings of the 20th International Conference on Machine Learning (ICML 2003), AAAI Press, 2003, pp. 123–130.
[14] T. Gärtner, K. Driessens, J. Ramon, Graph kernels and Gaussian processes for relational reinforcement learning, in: Proceedings of the 13th International Conference on Inductive Logic Programming (ILP 2003), Springer, 2003, pp. 146–163.
[15] A. Payani, F. Fekri, Inductive logic programming via differentiable deep neural logic networks, Preprint arXiv:1906.03523 (2019).
[16] H. Dong, J. Mao, T. Lin, C. Wang, L. Li, D. Zhou, Neural logic machines, Preprint arXiv:1904.11694 (2019).