An Evolutionary Algorithm for Rule Learning over Knowledge Graphs

Lianlong Wu1, Emanuel Sallinger2,1, Evgeny Sherkhonov1, Sahar Vahdati1, and Georg Gottlob2,1

1 University of Oxford
2 TU Wien

Abstract. Logical rules allow us to declaratively encode expert knowledge, express patterns, and infer new knowledge from existing knowledge in Knowledge Graphs. However, the construction of rules is often a costly manual process. In this work, we address the problem of learning rules from already existing knowledge. Most of the existing rule learning algorithms for Knowledge Graphs are based on inefficient full search plus greedy pruning strategies that cover a limited search space by considering rules of a predetermined shape. We propose a learning algorithm where a set of rules is learned via a process inspired by biological evolution and geared towards optimizing a fitness function. Such evolutionary algorithms typically cover a larger search space efficiently and provide multiple near-optimal solutions. We evaluate the proposed algorithm on a number of public Knowledge Graphs and compare it to other rule learning algorithms.

Keywords: Rule Learning · Knowledge Graphs · Evolutionary Algorithms · Genetic Programming

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Knowledge Graphs (KGs) have been among the most hyped technologies of recent years and have gained huge interest in multiple AI-based applications. Logical rules are used as an intrinsic part of KGs to encode knowledge, and for Knowledge Graph construction and completion, as well as error detection and noise identification, entity resolution, and query answering, among other tasks. The main characteristic added to such downstream tasks originates from the explainability power of rule-based languages, as they are both human- and machine-friendly. This makes reasoning, i.e., deriving new knowledge from existing knowledge, fully explainable. However, obtaining logical rules is costly, as the traditional process for obtaining high-quality rules for a KG (often referred to as knowledge engineering) requires significant human effort. Rule learning is the process of automatically learning rules from underlying KGs. The manual and automated processes are often complementary, allowing engineered background knowledge to be used at the same time as rules learned from data, jointly obtaining a higher-quality result compared to each process used in isolation.

Among the existing methods, the classical Inductive Logic Programming (ILP) methods for rule learning are based on the Closed World Assumption (CWA) and consider relatively small-scale datasets. Modern large-scale KGs, however, pose huge efficiency challenges for the straightforward use of ILP methods [4].

There are recent tools which are specifically designed for Knowledge Graph rule extraction, considering the Open World Assumption (OWA). For example, Ontological Pathfinding (OP) [2] and AMIE [4] are among the recent methods based on exhaustive search with pruning strategies. Another tool, RuDiK [8], uses a greedy algorithm and an A* graph traversal algorithm over the entire possible rule space. These tools have shown feasible performance over modern large-scale KGs. However, greedy algorithms or search with pruning may suffer from being trapped in local optima.
Also, the aforementioned tools often have high running times over large-scale KGs even with a limited rule length, and the supported rule types are tied to the algorithm in these systems.

Approach and contribution. In this paper we propose an evolutionary rule learning algorithm that provides multiple near-optimal solutions efficiently and, moreover, imposes fewer syntactic restrictions on the form of learned rules. Rule learning can be considered as heuristic search over the concept description space. Evolutionary Algorithms (EA) are a good choice for this intrinsic search mechanism and the symbolic representation. In order to extend the search space, this work focuses on studying Evolutionary Algorithms, specifically Genetic Programming (GP), for the rule learning problem over large-scale Knowledge Graph datasets. EAs are a family of biology-inspired search algorithms that optimize the most promising preliminary solutions while exploring a wide search space at the same time. In particular, in our setting, one rule or a set of rules can be treated as a chromosome, from which a new population of chromosomes can be derived via the mutation, crossover and selection operators. While performing these operations, a quality measure, called the fitness function, is computed to judge whether an obtained new generation of chromosomes is fit for continuing the search.

Such evolutionary algorithms typically do not need to perform exhaustive search and are less likely to fall into local optima [3]. Furthermore, they are flexible in that they might not require an imposed template on the shape of rules, as is typically the case in other approaches. In addition, EAs support multiple metrics as optimization targets, e.g., precision, support (recall), and any combination of them. EAs are well balanced, optimizing the most promising solutions while exploring the wider search space at the same time. While greedy search may be trapped in local optima, the parallel search mechanism and mutation operations of Genetic Programming can avoid such traps. The robustness of EAs imposes fewer limitations on the rule learning system compared to other methods: it is easy to incorporate arbitrary constraints into the search space. Furthermore, EAs are adaptable in a complex search space, which makes them increasingly utilized in different learning systems [6]. In this work, we combine the learning power of Genetic Programming (GP) with Knowledge Representation (KR) from logic-based languages.

The main contributions of this paper are:

– We introduce a new evolutionary algorithm for rule learning which, to the best of our knowledge, is the first genetic programming algorithm applied in the context of learning rules over Knowledge Graphs.
– We report preliminary experimental results of the new algorithm on the publicly available Knowledge Graphs YAGO2 [10] and YAGO2s [9].

Organization. Section 2 introduces the necessary preliminaries. In Section 3 we introduce the approach and the algorithm. Section 4 presents the initial experimental evaluation. We give concluding remarks in Section 5.

2 Preliminaries

In this section we present the concepts that are required for a better understanding of our rule learning algorithm. Let us assume pairwise disjoint sets of entities E, literals L, predicates P of arity two, and variables V.
Knowledge is represented by atoms of the form p(X, Y), where p ∈ P, X ∈ E ∪ V and Y ∈ E ∪ V ∪ L. For example, livesIn(X, Oxford) is an atom, where livesIn is a predicate, Oxford is an entity and X is a variable; it denotes that person X lives in Oxford. Knowledge Graphs (KGs) are sets of ground atoms, denoted K, i.e., atoms p(s, o) such that s ∈ E and o ∈ E ∪ L. For instance, K1 = {livesIn(John, Oxford), isMarriedTo(John, Mary)} is a small KG stating that John lives in Oxford and that he is married to Mary.

A Horn rule, or simply a rule, is an expression of the form H ← B, where B is a set of atoms {B1, ..., Bn} called body atoms, and H is an atom called the head atom. For instance, the expression

    livesIn(Y, Z) ← livesIn(X, Z), isMarriedTo(X, Y)    (1)

is a rule expressing that whenever X lives in Z and X is married to Y, then Y also lives in Z. A program is a set of rules.

A mapping from V to E ∪ L ∪ V is called a substitution. For a rule r of the form H ← A1, ..., An and a substitution σ, by σ(r) we denote the result of applying σ to r, i.e., the rule obtained by substituting the variables in H, A1, ..., An according to σ. For a KG K, if for every i ∈ {1, ..., n} it holds that σ(Ai) ∈ K, then σ(H) is called a prediction of r over K. If additionally K contains σ(H), then it is a true prediction, otherwise it is a false prediction. In our example, livesIn(Mary, Oxford) is a false prediction of rule (1) over K1.

We define the support of a rule r in a KG K, denoted S_rule(r, K), as the number of true predictions of r over K. Additionally, for a rule r = H(X, Y) ← B, the support of the body B, denoted S_body(r, K), is defined as the number of substitutions σ defined over X and Y that can be extended to a substitution σ′ (i.e., σ′(X) = σ(X) and σ′(Y) = σ(Y)) such that σ′(B) ⊆ K. In our example, for rule (1) and KG K1, we have S_rule(r, K1) = 0 and S_body(r, K1) = 1.

The standard confidence is defined as the ratio S_rule(r, K)/S_body(r, K) and reflects how well the rule r fits the KG K. However, it has been argued in [4] that the standard confidence is not particularly well suited to measure the quality of rules over KGs due to the open world assumption for KGs. Instead, we resort to the PCA confidence [4]. The Partial Completeness Assumption (PCA) asserts that if for some a and b it holds that p(a, b) ∈ K, then for every fact p(a, b′) that is true it holds that p(a, b′) ∈ K. In other words, if a KG contains a p-fact of a, then it contains all true p-facts of a. Moreover, if no p-fact of a is known, then we consider all p-facts of a unknown. For instance, under the PCA, our KG K1 contains all the true livesIn-facts about John, while the livesIn-facts about Mary are unknown.

The PCA confidence C_pca(r, K) is then a refined version of the standard confidence, where the denominator does not take into account the unknown H-facts of the first argument:

    C_pca(r, K) = S_rule(r, K) / #{(a, b) : ∃c. B ⊆ K ∧ ∃b′. H(a, b′) ∈ K}    (2)

In our evolutionary algorithm, introduced in the next section, we use the PCA confidence as the fitness function.
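To make these measures concrete, the following is a minimal Python sketch, not the authors' implementation (which relies on a reasoning engine, as described in Section 3). It assumes a KG stored as a set of (predicate, subject, object) triples, atoms as (predicate, term, term) triples whose variables are '?'-prefixed strings, and the helper names substitute, match, body_substitutions and pca_confidence are hypothetical; it computes C_pca of Eq. (2) by enumerating body substitutions.

    # Illustrative sketch only. A KG is a set of (predicate, subject, object)
    # facts; an atom is a (predicate, term, term) triple; '?'-prefixed terms
    # are variables, everything else is a constant.

    def substitute(term, subst):
        """Apply a substitution to a single term."""
        return subst.get(term, term) if term.startswith('?') else term

    def match(atom, fact, subst):
        """Extend the substitution so that the atom matches the fact, or return None."""
        (pred, s, o), (fpred, fs, fo) = atom, fact
        if pred != fpred:
            return None
        new = dict(subst)
        for term, val in ((s, fs), (o, fo)):
            if term.startswith('?'):
                if new.get(term, val) != val:      # clashes with an earlier binding
                    return None
                new[term] = val
            elif term != val:                      # constant mismatch
                return None
        return new

    def body_substitutions(body, kg, subst=None):
        """Enumerate all substitutions sigma such that sigma(B) is contained in the KG."""
        subst = {} if subst is None else subst
        if not body:
            yield subst
            return
        for fact in kg:
            extended = match(body[0], fact, subst)
            if extended is not None:
                yield from body_substitutions(body[1:], kg, extended)

    def pca_confidence(head, body, kg):
        """C_pca as in Eq. (2): true predictions divided by the number of head
        bindings (a, b) for which some fact p(a, b') is known in the KG."""
        p, x, y = head
        bindings = {(substitute(x, s), substitute(y, s))
                    for s in body_substitutions(body, kg)}
        s_rule = sum((p, a, b) in kg for a, b in bindings)    # S_rule(r, K)
        known = {a for (q, a, _) in kg if q == p}             # subjects with a known p-fact
        denominator = sum(a in known for a, _ in bindings)
        return s_rule / denominator if denominator else 0.0   # 0.0 by convention if undefined

    # Rule (1) over the example KG K1: the only prediction, livesIn(Mary, Oxford),
    # is unknown under the PCA, so the denominator is 0 and the sketch returns 0.0.
    K1 = {("livesIn", "John", "Oxford"), ("isMarriedTo", "John", "Mary")}
    head = ("livesIn", "?Y", "?Z")
    body = [("livesIn", "?X", "?Z"), ("isMarriedTo", "?X", "?Y")]
    print(pca_confidence(head, body, K1))

Running the last lines reproduces the example above: rule (1) has a single body substitution over K1, and since no livesIn-fact about Mary is known, the PCA denominator is empty.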
3 Algorithm

In this section we introduce an algorithm that belongs to the family of Evolutionary Algorithms (EA), also known as Genetic Algorithms (GA). As outlined in Section 1, EAs are biology-inspired parallel search algorithms that optimize the most promising preliminary solutions while exploring a wide search space at the same time. In particular, in our setting, programs can be treated as chromosomes, from which a new population of chromosomes can be derived via the operations of selection, mutation and crossover. We first introduce the basic elements of our algorithm, the operators that transform a given set of programs: selection, mutation and crossover.

Selection Operator. The selection operator is used to take one or two chromosomes from the current population. We use fitness-proportional (roulette-wheel) selection, where the chance of each individual being selected is proportional to its fitness score. Each individual x_i is assigned the probability

    p_i = f(x_i) / Σ_{j=1}^{N} f(x_j),    (3)

where f is a given fitness function. By selector(S, f) we denote the result of applying the selection operator to a population S with the fitness function f.

Mutation Operator. We say that a rule is obtained by mutating the input rule if either a variable or a predicate that occurs in the input rule is mutated. Mutating a variable Xi means replacing it with another variable Xj, i ≠ j. For example, the rule livesIn(X1, X2) ← wasBornIn(X1, X2) is obtained from livesIn(X1, X2) ← wasBornIn(X1, X3) by mutating the variable X3 to X2. Similarly, mutating a predicate Ri means replacing it with another predicate Rj, i ≠ j. For example, the rule livesIn(X1, X2) ← worksAt(X1, X2) is obtained from livesIn(X1, X2) ← wasBornIn(X1, X2) by mutating the body predicate wasBornIn to worksAt. We say that a program (chromosome) P1 is obtained from a program P2 by mutation if it is obtained by mutating a rule in P2. For a given chromosome x, by mutate(x, P_m) we denote the result of applying a mutation to x with probability P_m.

Crossover Operator. The crossover operator is applied to two given chromosomes. The operator randomly chooses a rule from each chromosome and a subset of atoms in the body of each of these rules, and swaps the two subsets, respecting the variable renaming. For instance, consider the programs {r1} and {r2}, where

    r1: isLeaderOf(X1, X2) ← wasBornIn(X1, X2), isCitizenOf(X1, X2).    (4)
    r2: isLeaderOf(X1, X2) ← livesIn(X1, X2).

The crossover operator can select {isCitizenOf(X1, X2)} from the body of r1 and {livesIn(X1, X2)} from the body of r2. The result of swapping these sets is the programs {r1′} and {r2′}, where

    r1′: isLeaderOf(X1, X2) ← wasBornIn(X1, X2), livesIn(X1, X2).    (5)
    r2′: isLeaderOf(X1, X2) ← isCitizenOf(X1, X2).

For two given chromosomes x1 and x2, by crossover(x1, x2, P_c) we denote the result of applying the crossover operator to x1 and x2 with probability P_c.

Initial Population. The initial population consists of |P| chromosomes; each chromosome has one rule with a single body atom, and each rule has a different predicate.
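As a rough illustration of how these operators could be realized, the following sketch reuses the atom and rule representation from the snippet at the end of Section 2. It is a simplification, not the authors' code: it operates on single-rule chromosomes, the crossover swaps one body atom instead of arbitrary subsets, variable renaming is omitted, and select takes a precomputed list of fitness scores; the names select, mutate and crossover mirror the operators above, but their signatures and the candidate lists of variables and predicates are assumptions of this sketch.

    import random

    def select(population, scores):
        """Fitness-proportional (roulette-wheel) selection as in Eq. (3):
        p_i = f(x_i) / sum_j f(x_j); uniform fallback if all scores are zero."""
        total = sum(scores)
        if total == 0:
            return random.choice(population)
        return random.choices(population, weights=scores, k=1)[0]

    def mutate(rule, variables, predicates, p_m):
        """With probability p_m, replace one variable occurrence or one body predicate."""
        if random.random() >= p_m:
            return rule
        head, body = rule
        body = [list(atom) for atom in body]
        i = random.randrange(len(body))
        if random.random() < 0.5:                 # mutate a variable, e.g. X3 -> X2
            j = random.choice([1, 2])
            if body[i][j].startswith('?'):
                body[i][j] = random.choice(variables)
        else:                                     # mutate a predicate, e.g. wasBornIn -> worksAt
            body[i][0] = random.choice(predicates)
        return (head, [tuple(atom) for atom in body])

    def crossover(rule_a, rule_b, p_c):
        """With probability p_c, swap one randomly chosen body atom between two
        rules (a single-atom version of the isCitizenOf / livesIn example above)."""
        if random.random() >= p_c:
            return rule_a, rule_b
        (head_a, body_a), (head_b, body_b) = rule_a, rule_b
        body_a, body_b = list(body_a), list(body_b)
        i, j = random.randrange(len(body_a)), random.randrange(len(body_b))
        body_a[i], body_b[j] = body_b[j], body_a[i]
        return (head_a, body_a), (head_b, body_b)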
We are now ready to describe our main algorithm, shown as Algorithm 1. As input it takes the chromosome population size N, the maximum generation number M, the crossover probability P_c, and the mutation probability P_m. The initial population is constructed as described above, one single-body-atom rule per predicate. For each generation, a set of chromosomes (or programs) is generated, and a fitness score for each of them is calculated using a rule reasoning engine. The reasoning engine stores the knowledge graph facts, evaluates candidate programs, and computes the fitness function. In our experiments, the Vadalog reasoning engine is used [1]. Once all the fitness scores are calculated, pairs of chromosomes are selected, the crossover operator produces two new offspring, and the mutation operator is applied to them. The new chromosomes are placed in the new population. Duplicated chromosomes are removed in order to preserve the diversity of the population. These steps are repeated until the new population size reaches N. The algorithm iterates until the maximum generation M is reached.

Algorithm 1 Rule Learning Evolutionary Algorithm

    Input: Knowledge Graph K, population size N, maximum generation M
    Input: fitness function f(x) = C_pca(x, K)
    Input: crossover probability P_c, mutation probability P_m
    Initialize the population set S′ ← {r_1, ..., r_|P|}
    for g = 1 to M do
        S ← S′
        S′ ← {x_m}, where x_m = argmax f(x), x ∈ S
        repeat
            x_i ← selector(S, f);  x_j ← selector(S, f)
            x′_i, x′_j ← crossover(x_i, x_j, P_c)
            x′_i ← mutate(x′_i, P_m);  x′_j ← mutate(x′_j, P_m)
            S′ ← S′ ∪ {x′_i, x′_j}
        until |S′| ≥ N
    end for
    Output: argmax f(x), x ∈ S′
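For concreteness, the following is a compact and simplified Python rendering of Algorithm 1, again an illustrative sketch rather than the authors' implementation: it reuses the hypothetical select, mutate and crossover helpers from the previous snippet, works with single-rule chromosomes, and computes the fitness C_pca directly with the in-memory pca_confidence function from Section 2, whereas the actual system delegates fitness computation to the Vadalog engine [1].

    def initial_population(head, predicates):
        """One chromosome per predicate: a single rule whose body atom reuses the head variables."""
        _, x, y = head
        return [(head, [(p, x, y)]) for p in predicates]

    def evolve(kg, head, predicates, variables, N, M, p_c, p_m):
        """A compact rendering of Algorithm 1 with single-rule chromosomes."""
        def fitness(rule):
            # C_pca of the rule over the KG; the paper computes this via the Vadalog engine
            rule_head, rule_body = rule
            return pca_confidence(rule_head, rule_body, kg)

        population = initial_population(head, predicates)
        for _ in range(M):
            scores = [fitness(rule) for rule in population]
            new_pop = [population[scores.index(max(scores))]]  # elitism: keep the best rule
            while len(new_pop) < N:
                x_i = select(population, scores)
                x_j = select(population, scores)
                x_i, x_j = crossover(x_i, x_j, p_c)
                x_i = mutate(x_i, variables, predicates, p_m)
                x_j = mutate(x_j, variables, predicates, p_m)
                for child in (x_i, x_j):
                    if child not in new_pop:                   # drop duplicates to keep diversity
                        new_pop.append(child)
            population = new_pop
        scores = [fitness(rule) for rule in population]
        return population[scores.index(max(scores))]

    # Hypothetical usage over the toy KG K1 from the Preliminaries sketch:
    # best = evolve(K1, ("livesIn", "?X1", "?X2"), ["livesIn", "isMarriedTo"],
    #               ["?X1", "?X2", "?X3"], N=10, M=20, p_c=0.8, p_m=0.2)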
4 Experiments

In this section, we present the results of evaluating the proposed algorithm on a set of benchmarks.

Dataset. As our dataset, we use the YAGO KG, which contains entities such as persons, organizations, and cities, and relations (facts) between these entities. YAGO2 has nearly 1 million facts [10], and we use a complete version of it. We also included a different version of it, namely YAGO2s [9], with about 4 million facts. Table 1 provides detailed statistics about these KGs.

Table 1. Knowledge Graph Dataset Characteristics

    KG Name    # Triples    # Predicates
    YAGO2        948,358              33
    YAGO2s     4,122,426              37

Experimental Results. As shown in Table 2, our algorithm found 18 rules for YAGO2 and 20 rules for YAGO2s, with a PCA confidence threshold of 10% and no explicit rule length limitation. The PCA confidence for the top 5 rules is 97.14% for YAGO2 and 85.26% for YAGO2s; the average PCA confidence over all rules is 57.72% and 52.34%, respectively. The precision is evaluated by sampling five facts predicted by each rule, which are then verified through human judgement. For reference, the top 10 rules from AMIE on YAGO2 have a precision of 39% [4], and the top 5 rules from RuDiK on YAGO3 have a precision of 79.17%, with an average precision over all rules of 62.86% [8]. In general, compared with RuDiK and AMIE, we find fewer rules, but with higher confidence and higher quality in terms of precision.

Table 2. Knowledge Graph Rule Learning Precision

                           PCA Confidence          Precision
    KG Name    # Rules    Top-5    Average      Top-5    Average
    YAGO2           18   97.14%     57.72%        48%     61.11%
    YAGO2s          20   85.26%     52.34%        80%     64.00%

Hardware Specification. In the standalone setup, experiments are run on a 48-core machine with 2 × Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.5GHz and 64GB RAM.

Execution Time. As one of the comparison criteria, we considered the execution time. As shown in Table 3, rule learning for all 33 predicates of YAGO2 takes 12 minutes, and for the YAGO2s dataset it takes 21 minutes. The run times of AMIE, OP and RuDiK are quoted from the typical setups in their publications³. This shows that our running time is on the same level as the state-of-the-art systems, and that the running time of our algorithm grows more slowly than that of the other systems as the KG gets larger. This is particularly important in real-world scenarios with large-scale KGs.

Table 3. Rule Learning Execution Time Comparison

              AMIE [4]    AMIE+ [5]    OP [2]    RuDiK [7]    This Work
    YAGO2        3.62m        8.35m     3.59m          18m          12m
    YAGO2s           -       59.38m     19.4m          47m          21m

³ Note: Considering the differences in the problem definition, the outcome of the learning specifications, the algorithm parameters and the database implementations, the run time comparison is for illustration purposes only.

5 Conclusion

In this work, we highlighted the problem of rule learning in knowledge graphs and provided a formal definition for it. The ultimate goal is to emphasize the importance of explicit rule learning. A novel genetic programming algorithm is proposed for rule mining over large-scale Knowledge Graphs. The initial experiments confirm the feasibility and efficiency of this algorithm. This work provides the foundation for the development of a comprehensive framework for rule learning with fewer restrictions on rule length. Our approach imposes fewer language restrictions, so that a larger hypothesis space can be explored in order to provide high-quality rules. In future work, we aim at addressing complex rule extraction and learning within a single framework. The experiments are planned to be extended, and comparisons with other rule extraction and learning tools will be provided.

Acknowledgements. The work on this paper was supported by EPSRC programme grant EP/M025268/1, the EU H2020 grant 809965, and the Vienna Science and Technology Fund (WWTF) grant VRG18-013.

References

1. Bellomarini, L., Sallinger, E., Gottlob, G.: The Vadalog system: Datalog-based reasoning for Knowledge Graphs. Proceedings of the VLDB Endowment 11, 975–987 (2018). https://doi.org/10.14778/3213880.3213888
2. Chen, Y., Goldberg, S., Wang, D.Z., Johri, S.S.: Ontological pathfinding: Mining first-order knowledge from large knowledge bases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 835–846 (2016). https://doi.org/10.1145/2882903.2882954
3. Fogel, D.B.: Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, Third Edition. John Wiley & Sons (2005). https://doi.org/10.1002/0471749214
4. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In: Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), pp. 413–422 (2013)
5. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with AMIE+. VLDB Journal 24(6), 707–730 (2015). https://doi.org/10.1007/s00778-015-0394-1
6. Giordana, A., Saitta, L., Campidoglio, M.E., Bello, G.L.: Learning relations using genetic algorithms. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 728, pp. 218–229 (1993). https://doi.org/10.1007/3-540-57292-9_60
7. Ortona, S., Meduri, V.V., Papotti, P.: Robust discovery of positive and negative rules in knowledge bases. In: Proceedings of the IEEE 34th International Conference on Data Engineering (ICDE 2018), pp. 1180–1191 (2018). https://doi.org/10.1109/ICDE.2018.00108
8. Ortona, S., Meduri, V.V., Papotti, P.: RuDiK: Rule discovery in knowledge bases. Proceedings of the VLDB Endowment 11(12), 1946–1949 (2018). https://doi.org/10.14778/3229863.3236231
9. Suchanek, F.M., Hoffart, J., Kuzey, E., Lewis-Kelham, E.: YAGO2s: Modular high-quality information extraction with an application to flight planning. In: Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI), vol. P-214, pp. 515–518 (2013)
10. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pp. 697–706 (2007). https://doi.org/10.1145/1242572.1242667