Learning Rules With Attributes and Relations in Knowledge Graphs

Pouya Ghiasnezhad Omran (1), Zhe Wang (2) and Kewen Wang (2)
(1) The Australian National University, Canberra, ACT, Australia
(2) Griffith University, Brisbane, QLD, Australia

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022), Stanford University, Palo Alto, California, USA, March 21-23, 2022.
email: p.g.omran@anu.edu.au (P. G. Omran); zhe.wang@griffith.edu.au (Z. Wang); k.wang@griffith.edu.au (K. Wang)

Abstract
Rules capture high-level patterns in knowledge graphs and can sometimes provide virtual schemata for the data. Recent research has witnessed an increasing interest in learning rules automatically and applying the learned rules in knowledge graph inference. While quite a number of rule learning systems have been developed, the formats of learned rules are still restricted, mostly resembling paths in the knowledge graph. Such methods focus on the graph structure of the knowledge graph, assuming that all vertices and all edges in the graph play an equal role (differing only in name or label). In this paper, we propose a method, HARL (Hub Augmented Rule Learning), for learning rules containing attributes by treating certain binary predicates as unary predicates. A major component of HARL is an algorithm for identifying attributes in a given knowledge graph. HARL has been implemented on top of the scalable rule learner RLvLR. Our experimental results show that HARL is scalable and outperforms RLvLR in most cases on major benchmark datasets.

Keywords
Rule Learning, Knowledge Graph, Knowledge Graph Completion

1. Introduction

Knowledge graphs (KGs) have proven useful for building interlinks between information sources by connecting entities of diverse types such as people, universities, movies, and animals. This is particularly useful for information synthesis over multimodal media data, such as the fusion of medical documents consisting of X-ray images and medical reports (texts and tables). In artificial intelligence, a KG is usually a set of triples like (Barty, bornInCity, Brisbane), meaning that Barty was born in the city of Brisbane. Tools are available for retrieving such triples from various sources [1]. Thus, solutions for retrieving information can be obtained through KG technologies, such as efficient algorithms for link prediction. The objective of link prediction is to determine whether a pair of entities is connected via a relation. For instance, the link prediction task answers a query (Barty, bornInCity, ?) that asks for the city where Barty was born. Existing methods for link prediction are mostly based on embedding representations and graph neural networks (GNNs) [2], which adopt a black-box approach and are known to suffer from issues of explainability and flexibility. As a result, some researchers have advocated providing explainable solutions to KG link prediction by learning rules. In particular, rules that contain constants are useful for both link prediction and reasoning.
However, existing scalable rule learners based on embeddings, e.g. [3, 4], can only learn the class of closed-path rules (CP rules) and are unable to learn rules containing constants. This restriction is vital for scalability, since large KGs contain millions of entities, and admitting arbitrary constants causes an explosion of the search space. On the other hand, in KGs, specific entities in specific predicates act as attributes. For example, in FB15K, the entity USA in the second argument of the predicate /film/film/country(x, y) acts as a class that distinguishes all films x produced in the USA from those that are not. Such a binary attribute reflects the global dominance of the US film industry. We refer to such entity-predicate pairs as hubs and model them through attribute facts, and we propose a method to identify a set of hubs automatically. Consider an example from the YAGO2 KG, where P:isCitizenOf_E:Netherlands(x) is an attribute, or hub. With such attributes we can learn a number of patterns that cannot be expressed otherwise. For instance, without the attribute atom, the rule P:isCitizenOf_E:Netherlands(x) ∧ isCitizenOf(x, y) → livesIn(x, y) degrades to isCitizenOf(x, y) → livesIn(x, y), whose confidence degree is about three times lower. Thus, we consider more specific rules such as P:isCitizenOf_E:Netherlands(x) ∧ isCitizenOf(x, y) → livesIn(x, y) instead of the more general rule isCitizenOf(x, y) → livesIn(x, y).

In this paper, we propose a method, HARL (Hub Augmented Rule Learning), for learning rules containing attributes. The idea is to regard such a rule as a closed-path rule so that existing scalable rule learners can be employed. This is achieved by treating a binary predicate as a unary predicate; for example, an atom /film/film/country(x, USA) can be considered as an attribute /film/film/country_USA(x). A major component of HARL is thus an algorithm for identifying attributes in a given KG. HARL keeps the search space at a manageable size while not missing entities that are essential for learning rules. HARL has been implemented on top of the scalable rule learner RLvLR [4]. Our experimental results show that HARL outperforms RLvLR in most cases on major benchmark datasets.

2. Preliminaries

A knowledge graph consists of a set of entities $\mathcal{E}$ as its vertices; its edges are directed and labelled with predicates from a set $\mathcal{P}$. An entity $e$ is an object such as a place, a person, etc., and a fact (or link) is a triple $(e, p, e')$, meaning that the entity $e$ is related to the entity $e'$ via the binary predicate $p$. Following the convention in knowledge representation, we denote such a fact as $p(e, e')$.

The format of the rules we aim to learn extends closed-path rules (CP rules), a language bias widely adopted in the rule learning literature for knowledge graphs; for instance, CP rules are the underlying formalism of Path Ranking Algorithms [5], RuleEmbedding [6], RDF2Rules [7] and ScaLeKB [3]. We extend CP rules with atoms about entity attributes. An attribute closed-path rule (ACP rule, or simply a rule) $r$ is of the form

$$p_1(x_0, x_1) \wedge p_2(x_1, x_2) \wedge \cdots \wedge p_n(x_{n-1}, x_n) \wedge a_1(x_{l_1}) \wedge \cdots \wedge a_m(x_{l_m}) \rightarrow p(x_0, x_n). \quad (1)$$
Here each $x_i$ ($0 \le i \le n$) is a variable and $0 \le l_j \le n$ for each $1 \le j \le m$; each $p(u, v)$ is called an atom, where $u$ and $v$ are called the subject and object arguments of $p$, respectively; and each $a(u)$ is an atom with $a$ a unary predicate. Intuitively, the rule $r$ reads that if $p_1(x_0, x_1), p_2(x_1, x_2), \ldots, p_n(x_{n-1}, x_n), a_1(x_{l_1}), \ldots, a_m(x_{l_m})$ hold, then $p(x_0, x_n)$ holds too. The atom $p(x_0, x_n)$ is the head of $r$ and the set of atoms $\{p_1(x_0, x_1), p_2(x_1, x_2), \ldots, p_n(x_{n-1}, x_n), a_1(x_{l_1}), \ldots, a_m(x_{l_m})\}$ is the body of $r$. The rule obtained from $r$ by removing the unary predicates in the body, that is, $p_1(x_0, x_1) \wedge p_2(x_1, x_2) \wedge \cdots \wedge p_n(x_{n-1}, x_n) \rightarrow p(x_0, x_n)$, is usually referred to as a closed-path (CP) rule, as the sequence of predicates in the rule body forms a path from the subject argument to the object argument of the head predicate.

To assess the quality of mined rules, we recall measures used in major approaches to rule learning [3, 8]. Let $r$ be a rule of the form (1). A pair of entities $(e, e')$ satisfies the body of $r$, denoted $body(r)(e, e')$, if there is a way of substituting the variables in $body(r)$ with entities in the KG such that (i) all atoms in $body(r)$ (after substitution) are facts in the KG, and (ii) $x_0$ and $x_n$ are substituted with $e$ and $e'$, respectively. The pair $(e, e')$ satisfies the head of $r$, denoted $p(e, e')$, if $p(e, e')$ is a fact in the KG. The support degree of $r$ is then defined as

$$supp(r) = \#\{(e, e') : body(r)(e, e') \wedge p(e, e')\},$$

that is, $supp(r)$ is the number of entity pairs that satisfy both the head and the body of $r$. The degrees of standard confidence (SC) and head coverage (HC) are defined as two normalisations of $supp(r)$:

$$SC(r) = \frac{supp(r)}{\#\{(e, e') : body(r)(e, e')\}}, \qquad HC(r) = \frac{supp(r)}{\#\{(e, e') : p(e, e')\}}.$$

So $SC(r)$ normalises $supp(r)$ by the number of entity pairs that satisfy the body, while $HC(r)$ normalises $supp(r)$ by the number of entity pairs that satisfy the head.
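To make these measures concrete, the following is a minimal sketch of how $supp$, $SC$ and $HC$ can be computed for a CP-rule body over a KG stored as a set of (predicate, subject, object) triples. The representation, the helper names, and the toy facts are our own illustrative choices, not part of any system described in this paper.

```python
# A toy KG as a set of facts p(e, e'), stored as (p, e, e') triples.
kg = {
    ("isCitizenOf", "anna", "netherlands"),
    ("livesIn", "anna", "netherlands"),
    ("isCitizenOf", "ben", "netherlands"),
    ("livesIn", "ben", "germany"),
}

def body_pairs(kg, body):
    """Entity pairs (e, e') satisfying a CP body p1(x0,x1), ..., pn(x(n-1),xn)."""
    # Start from all facts of p1, then join each following predicate
    # on the shared intermediate variable.
    pairs = {(e, e2) for (p, e, e2) in kg if p == body[0]}
    for pred in body[1:]:
        pairs = {(e, e3) for (e, e2) in pairs
                 for (p, f, e3) in kg if p == pred and f == e2}
    return pairs

def quality(kg, body, head):
    """Return (supp, SC, HC) for the CP rule body -> head."""
    satisfied = body_pairs(kg, body)
    head_pairs = {(e, e2) for (p, e, e2) in kg if p == head}
    supp = len(satisfied & head_pairs)
    sc = supp / len(satisfied) if satisfied else 0.0    # standard confidence
    hc = supp / len(head_pairs) if head_pairs else 0.0  # head coverage
    return supp, sc, hc

# isCitizenOf(x, y) -> livesIn(x, y): supp = 1, SC = 0.5, HC = 0.5
print(quality(kg, ["isCitizenOf"], "livesIn"))
```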
3. Learning Rules with Attributes

In this section, we present a simple and effective approach to learning rules with attributes. We first describe how to automatically identify facts in the KG that express entity attributes, which we refer to as attribute facts, and then explain how to transform such attribute facts into a form that can be easily processed by existing rule learning methods.

3.1. Identifying Attribute Facts

As argued in [9, 10], the facts in existing KGs can be separated into (at least) two categories: those representing entity relations and those representing entity attributes (hubs). For example, hasFather(Joe Biden, Joseph R. Biden) expresses the relationship between the two entities Joe Biden and Joseph R. Biden, whereas gender(Joe Biden, Male) is naturally considered an attribute of the entity Joe Biden. Sometimes it is not easy to draw a clear line between the two types of facts, but in general, entity relations often involve large numbers of entities, like Joe Biden and Joseph R. Biden, whereas entity attributes often involve a small number of entities representing attribute values, such as the gender Male, each of which is linked to a huge number of other entities.

Existing rule learning approaches often do not distinguish entity attributes from entity relations, as they can only learn the class of CP rules like hasFriend(x, y) ∧ gender(y, z) → gender(x, z). However, in many cases it makes more sense to consider gender an attribute, so that rules of the following form can be learned:

hasParent(x, y) ∧ gender(y, Male) → hasFather(x, y).

To identify attribute facts $p(e, e')$, we need to identify entities $e'$ that can be used as attribute values, such as Male, and predicates $p$ that are suitable for representing attributes, such as gender. For attribute values, we observe that they are often linked to a huge number of entities and thus form "hubs" in the network. Such hubs can be identified by their connection densities. We define the density of a hub, $den(e, p)$, as the (in and out) degree of an entity $e$ w.r.t. a predicate $p$:

$$den_{in}(e, p) = \frac{\#\{e' : p(e', e)\}}{\#\{(e', e'') : p(e', e'')\}}, \qquad den_{out}(e, p) = \frac{\#\{e' : p(e, e')\}}{\#\{(e', e'') : p(e', e'')\}}. \quad (2)$$

In the definition of density, $\#\{e' : p(e, e')\}$ (resp. $\#\{e' : p(e', e)\}$) is the number of occurrences of $e$ as a subject (resp. object) in a fact about $p$, and it is normalised by the total number of facts about $p$ in the KG.

Yet a hub $e$ with high density is not necessarily suitable as an attribute value. Consider an extreme case where all other entities $e'$ are connected with $e$ through a predicate $p$; then $e$ is not an attribute value that can distinguish some entities $e'$ from others. From an information-theoretic perspective, such an attribute provides no information gain, as it expresses no distinguishing feature of the associated entities. Thus, we need to select hubs that both have high density and provide distinguishing features. We compute the entropy of hubs as follows:

$$ent_c(e, p) = -\big(d \log d + (1 - d) \log(1 - d)\big), \quad \text{where } c \in \{in, out\},\ d = den_c(e, p). \quad (3)$$

Next, we want to identify the predicates that are suitable for attributes. This is based on the observation that such a predicate often associates an entity with its attribute values, and the number of possible values (like Male) is often far smaller than the number of entities that can be associated with the attribute (like Joe Biden). So we define the imbalance degree (imb) of a predicate $p$ as follows:

$$imb_{in}(p) = \frac{\#\{e : p(e, e')\}}{\#\{e' : p(e, e')\} \cdot |\mathcal{E}|}, \qquad imb_{out}(p) = \frac{\#\{e : p(e', e)\}}{\#\{e' : p(e', e)\} \cdot |\mathcal{E}|}. \quad (4)$$

Figure 1: Data flow of enriching the KG with hubs.

Intuitively, a large $imb_{in}(p)$ means that $p$ has more potential subjects than objects, which suggests that the objects of $p$ could be attribute values (whereas the subjects are more likely to be ordinary entities). In other words, the vertices with $p$ as an incoming edge are more likely to be attribute values than those with $p$ as an outgoing edge. Finally, we combine the two measures to determine the likelihood of a fact with hub $e$ and predicate $p$ being an attribute fact:

$$att_c(e, p) = ent_c(e, p) \cdot imb_c(p), \quad \text{where } c \in \{in, out\}. \quad (5)$$

The larger $att_{out}(e, p)$ (resp. $att_{in}(e, p)$) is, the more likely a fact $p(e, e')$ (resp. $p(e', e)$) is an attribute fact, with $e'$ being an entity and $e$ representing its attribute value.
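The sketch below puts Equations (2) to (5) together, assuming the same (predicate, subject, object) triple representation as the earlier example. The function name hub_scores, the dictionary layout, and the use of the natural logarithm (the paper does not fix a base for Equation (3)) are our own assumptions.

```python
import math
from collections import Counter, defaultdict

def hub_scores(kg):
    """Score candidate hubs (p, e, direction) by att = ent * imb; Eqs. (2)-(5).

    kg: set of (p, e, e') triples. Returns {(p, e, 'in'|'out'): att value}.
    """
    facts_per_pred = Counter(p for (p, _, _) in kg)
    subj = defaultdict(set)   # predicate -> distinct subject entities
    obj = defaultdict(set)    # predicate -> distinct object entities
    deg_out = Counter()       # (p, e): occurrences of e as subject of p
    deg_in = Counter()        # (p, e): occurrences of e as object of p
    entities = set()
    for p, e, e2 in kg:
        subj[p].add(e); obj[p].add(e2)
        deg_out[(p, e)] += 1; deg_in[(p, e2)] += 1
        entities.update((e, e2))

    def entropy(d):
        # Binary entropy of the density; 0 at the degenerate values (Eq. (3)).
        # Natural log is an assumption; the paper leaves the base unspecified.
        return 0.0 if d in (0.0, 1.0) else -(d * math.log(d) + (1 - d) * math.log(1 - d))

    n = len(entities)
    scores = {}
    for p in facts_per_pred:
        imb_in = len(subj[p]) / (len(obj[p]) * n)    # many subjects, few objects (Eq. (4))
        imb_out = len(obj[p]) / (len(subj[p]) * n)   # many objects, few subjects
        for e in obj[p]:                             # e as object: incoming edges
            den = deg_in[(p, e)] / facts_per_pred[p]     # Eq. (2), den_in
            scores[(p, e, "in")] = entropy(den) * imb_in  # Eq. (5), att_in
        for e in subj[p]:                            # e as subject: outgoing edges
            den = deg_out[(p, e)] / facts_per_pred[p]     # Eq. (2), den_out
            scores[(p, e, "out")] = entropy(den) * imb_out
    return scores
```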
3.2. KGs with Attributes

Unlike existing embedding methods that embed attributes separately from the other relations [3, 11, 10], we propose to transform the attribute facts so that existing rule learners like RLvLR can be easily adapted to learn rules with attributes from them. We illustrate the data flow of our transformation in Figure 1, where an enriched KG is obtained through the identified attribute facts.

In general, attributes can be seen as unary predicates, whereas other relations are binary ones. An attribute fact gender(Joe Biden, Male) can be seen as a unary fact gender_Male(Joe Biden). Yet such a unary fact cannot be directly processed by existing rule learners, as they only accept binary facts (i.e., triples) as input. Hence, we represent the unary fact gender_Male(Joe Biden) as a fact that is essentially a self-loop in the graph, i.e., gender_Male(Joe Biden, Joe Biden). Formally, each attribute fact in the original KG of the form $p(e, e')$ (or $p(e', e)$) with $e'$ being the attribute value is transformed into a fact P:p_E:e'(e, e) (resp. E:e'_P:p(e, e)), where P:p_E:e' (resp. E:e'_P:p) is a new predicate. Figure 2 illustrates how attribute facts are transformed into self-loops in the KG.

Figure 2: Example of enriching a KG with hubs.

The intuition of this transformation is to represent attributes as self-loops in the KG, so that path-based rule learning methods can be easily adapted to handle such a KG, while the attribute atoms (as self-loops) can be conveniently identified in the learned rules. For example, a path hasParent(Joe Biden, Joseph R. Biden), gender(Joseph R. Biden, Male) is transformed into a path hasParent(Joe Biden, Joseph R. Biden), P:gender_E:Male(Joseph R. Biden, Joseph R. Biden), which together with many other similar paths may induce the rule

hasParent(x, y) ∧ P:gender_E:Male(y, y) → hasFather(x, y),

which can be rewritten as

hasParent(x, y) ∧ P:gender_E:Male(y) → hasFather(x, y).
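A minimal sketch of this transformation, continuing the triple representation used above. Encoding a selected hub as a (predicate, entity, direction) triple anticipates the notion of a hub introduced in Section 3.3; the helper name is our own.

```python
def enrich_with_hubs(kg, hubs):
    """Add a self-loop fact for every attribute fact matched by a selected hub.

    kg: set of (p, e, e') triples; hubs: set of (p, e, 'in'|'out') triples.
    E.g. the fact gender(JoeBiden, Male) with hub (gender, Male, 'in')
    yields the self-loop P:gender_E:Male(JoeBiden, JoeBiden).
    """
    new_facts = set()
    for p, e, e2 in kg:
        if (p, e2, "in") in hubs:    # the object e2 is the attribute value
            new_facts.add((f"P:{p}_E:{e2}", e, e))
        if (p, e, "out") in hubs:    # the subject e is the attribute value
            new_facts.add((f"E:{e}_P:{p}", e2, e2))
    return kg | new_facts
```

Note that the original facts are kept alongside the new self-loops, so an entity such as USA can still act as an ordinary entity in other predicates.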
3.3. Learning Rules with Attributes

Algorithm 1 gives the complete process of identifying attribute facts, expanding the KG with transformed facts, and learning rules with attributes. Intuitively, we first select fact patterns p(e, Β·) or p(Β·, e) that can be considered attribute facts. We represent such a pattern as a triple (p, e, c) with c ∈ {in, out}, which should not be confused with a triple in the KG; in what follows, we call such a triple a hub. We select hubs such that the total number of facts they add to the KG is bounded by an expansion degree Exp.

In line 1, we compute the frequency of all potential hubs. For each candidate hub (p, e, c), its frequency is the unnormalised density, i.e., the numerator of Equation (2). Computing the frequencies requires only a single pass over the facts in the KG. We select the top n candidate hubs, where n = TopFreqNum, to form the Hublist. In lines 2 to 3, we compute the att measure of Equation (5) for all potential hubs in the Hublist and sort the list of candidate hubs by their att values. In lines 5 to 9, we consider the potential hubs one by one from the Hublist, starting with the highest att degree, and create the corresponding new attribute predicates and facts. We stop this process when we hit the maximum expansion degree. The expansion degree is computed as the number of new facts divided by the number of facts in the original KG; the maximum expansion degree is given as an input parameter MaxExp. In line 10, we feed the enriched set of facts to RLvLR, and in line 11 we return the resulting set of rules with attributes.

Algorithm 1: Learn Rules with Attributes
Input: a KG as a set of facts F
Parameters: a max expansion degree MaxExp and an integer TopFreqNum
Output: a set of rules with attributes R
1: Hublist := selectTopFrequent(F, TopFreqNum)
2: computeAtt(Hublist)
3: Hublist := sortAtt(Hublist)
4: F' := F
5: let NewFacts consist of attribute facts for the next top hubs from Hublist
6: while (|F'| + |NewFacts| - |F|) / |F| ≀ MaxExp do
7:     F' := F' βˆͺ NewFacts
8:     let NewFacts consist of attribute facts for the next top hubs from Hublist
9: end while
10: R := RLvLR(F')
11: return R
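A rough Python rendering of Algorithm 1, reusing the hub_scores and enrich_with_hubs sketches above. The default parameter values, the per-hub expansion bookkeeping, and rlvlr (a hypothetical stand-in for invoking the actual RLvLR learner) are all assumptions for illustration.

```python
from collections import Counter

def harl(kg, max_exp=0.1, top_freq_num=1000):
    """Illustrative rendering of Algorithm 1 over a set of (p, e, e') triples."""
    # Line 1: keep the most frequent candidate hubs
    # (frequency = unnormalised density, the numerator of Eq. (2)).
    freq = Counter()
    for p, e, e2 in kg:
        freq[(p, e, "out")] += 1
        freq[(p, e2, "in")] += 1
    hublist = [h for h, _ in freq.most_common(top_freq_num)]
    # Lines 2-3: score the candidates by att and sort in descending order.
    scores = hub_scores(kg)
    hublist.sort(key=lambda h: scores[h], reverse=True)
    # Lines 4-9: add attribute facts hub by hub while the bound allows.
    enriched = set(kg)
    for hub in hublist:
        new_facts = enrich_with_hubs(kg, {hub}) - set(kg)
        if (len(enriched) + len(new_facts) - len(kg)) / len(kg) > max_exp:
            break  # maximum expansion degree reached
        enriched |= new_facts
    # Lines 10-11: feed the enriched KG to the rule learner.
    return rlvlr(enriched)  # hypothetical wrapper around RLvLR
```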
4. Context-Aware Inference

There is a body of work on applying learned rules to the known facts of a KG to infer new facts. Most of these works focus on applying the rules in a restricted iterative way [12]. Another stream of work in the knowledge graph community applies the uncertain rules once [4, 13, 8]. In the latter stream, each rule is augmented with a degree of confidence, and the inference module takes these degrees into account when inferring a new fact (which is also assigned a degree of confidence). To control the quality and number of newly inferred facts, these systems impose a maximum threshold on the number of new facts and/or a minimum threshold on their confidence. The problem with this method is that it does not consider the demand of the KG for each new fact; it merely focuses on the process that produces the new facts.

For example, consider the predicate isParentOf. For each person, we usually have at most two entities as parents. If we want to infer new facts about this predicate for a specific person, Ann, we consider all rules whose conclusions match the partially grounded fact isParentOf(x, Ann). On the other hand, we need to consider how many instantiations of this fact are needed. If we already have two entities as parents of Ann, we should reject any further facts, even ones inferred via quite confident rules. For another entity, Pit, for whom no parent facts are known, we should accept a newly inferred fact even if the confidences of the corresponding rules are not high. Thus, the context-aware inference method considers both the confidence of the fact-production process and the demand for the new fact in the KG.

In context-aware inference, we infer new facts by taking into account the related facts already in the KG. We can obtain an average number of facts for each predicate and entity in a specific predicate argument. Later, when we want to infer new facts, if there are already enough facts in the KG, inferring more is likely to be overpredicting. To avoid overpredicting, we keep only the predictions that securely fill the gap between the current number of facts and the average number. We give higher priority to facts with higher SC when there are alternatives. So we sort all predictions by their SCs, and for each prediction $p_1(e_1, e_2)$ we consider the gap between the average number of facts for the specific entity and predicate and the new fact in question. We obtain the Average Number of Facts (ANF) for each predicate and argument position:

$$ANF_{p_1}^{Sub} = \frac{\#\{(e, e') : p_1(e, e')\}}{\#\{e : \exists y\ p_1(e, y)\}} \quad (6)$$

$$ANF_{p_1}^{Obj} = \frac{\#\{(e, e') : p_1(e, e')\}}{\#\{e' : \exists x\ p_1(x, e')\}} \quad (7)$$

Although this averaging ignores the variation in the number of facts for a specific entity under a specific predicate, we propose it so that new facts are inferred in an extra-cautious fashion. We do not claim that the rejected new facts are invalid; rather, under this mechanism we lack strong support that they are needed. We accept a new fact $p_1(e_1, e_2)$ if the following condition is satisfied:

$$\big(\#\{e' : p_1(e_1, e')\} < ANF_{p_1}^{Sub}\big) \wedge \big(\#\{e : p_1(e, e_2)\} < ANF_{p_1}^{Obj}\big) \quad (8)$$

This condition admits the new fact only when both its subject and its object have fewer than the average number of facts for the predicate. With this extra-cautious method we avoid overpopulating the KG with too many predicted facts for any specific subject-predicate or object-predicate pair, at the cost of excluding many facts that might be valid. For example, in the case of nationality, the number of entities with the same nationality varies, in real-world data, from hundreds of thousands to a billion. By using the average, we might exclude many valid new facts for a more populous country. Hence, the facts excluded by this mechanism are not necessarily wrong; but when inferring fewer facts has high priority, we exclude them.
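A minimal sketch of this acceptance test, under the same triple representation as before. Counting an accepted prediction toward the subject's and object's degrees is our reading of "securely fill the gap", and we assume every predicted predicate already occurs in the KG; the function name and input format are illustrative.

```python
from collections import Counter, defaultdict

def context_aware_filter(kg, predictions):
    """Accept a predicted fact only while both its subject and object sit
    below the average number of facts (ANF) for the predicate; condition (8).

    kg: set of (p, e, e') triples (p of every prediction must occur in kg).
    predictions: list of (sc, p, e1, e2), processed in descending SC order.
    """
    n_facts = Counter()
    subjects, objects = defaultdict(set), defaultdict(set)
    out_deg, in_deg = Counter(), Counter()
    for p, e, e2 in kg:
        n_facts[p] += 1
        subjects[p].add(e); objects[p].add(e2)
        out_deg[(p, e)] += 1; in_deg[(p, e2)] += 1

    accepted = []
    for sc, p, e1, e2 in sorted(predictions, key=lambda t: t[0], reverse=True):
        anf_sub = n_facts[p] / len(subjects[p])  # Eq. (6): avg facts per subject
        anf_obj = n_facts[p] / len(objects[p])   # Eq. (7): avg facts per object
        if out_deg[(p, e1)] < anf_sub and in_deg[(p, e2)] < anf_obj:
            accepted.append((p, e1, e2))
            out_deg[(p, e1)] += 1  # count the accepted fact toward the gap
            in_deg[(p, e2)] += 1   # (our assumption about the bookkeeping)
    return accepted
```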
5. Experiments

Based on the methods presented in the previous sections, we have implemented Hub Augmented Rule Learning (HARL) and conducted three sets of experiments to evaluate the new system. The first set of experiments evaluates the effect of the new hub predicates on rule learning. In the second set, we use the learned rules to infer new facts via both the original inference module of RLvLR and the new context-aware inference module; the results show the usefulness of hub-augmented rules for inferring more quality facts. The third set evaluates the improvement of RLvLR on the link prediction task when hubs are used as new predicates.

Table 1 shows the statistics of the three widely used knowledge graphs. For FB15K and FB15KSE, we randomly selected 50 target predicates, and for YAGO2 core, we used all 32 predicates as target predicates. All results reported below are averages over the corresponding target predicates.

Table 1: Benchmark KG specifications
KG         | # Facts | # Entities | # Predicates
FB15K      | 542K    | 15K        | 1345
FB15KSE    | 272K    | 15K        | 237
YAGO2 core | 948K    | 470K       | 32

To show the advantage of identifying hubs and populating the KG with them for rule learning, we first introduce a new kind of rule redundancy elimination, the concise rule set.

5.1. Eliminating Redundant Rules: The Concise Rule Set

Rule learners can be evaluated by the number of distinct quality rules they learn. AMIE+ [8] eliminates duplicate rules; in that process, two rules are considered duplicates only if they have the same head predicate, the same number of body atoms (body length), and the same head coverage (or support). Such elimination treats only rules with the same syntax as the same rule (e.g., $p_3(x, y) \wedge p_2(x, y) \rightarrow p_1(x, y)$ and $p_2(x, y) \wedge p_3(x, y) \rightarrow p_1(x, y)$); it does not account for semantic similarity, as between livesIn(x, y) → isCitizenOf(x, y) and livesIn(x, y) ∧ Person(x) → isCitizenOf(x, y). Intuitively, the atom Person(x) in the second rule is obvious and adds no information, yet it yields a (syntactically) new rule. This kind of semantic redundancy can discredit evaluations of rule learning systems that are based on the number of quality rules. To overcome this challenge, we introduce a novel method to obtain a set of rules with minimum redundancy.

We deploy a greedy search to find a set of rules with minimum redundancy w.r.t. fact inference. Each rule $r$ is associated with the set of facts that can be inferred from it, denoted $fact(r)$. For a set of rules $\mathcal{R}$, $fact(\mathcal{R}) = \bigcup_{r \in \mathcal{R}} fact(r)$. To obtain a concise set of rules, we start with an empty set $\mathcal{R}$ and add rules into $\mathcal{R}$ one by one from the set of all learned rules, following the descending order of the confidences of the rules (i.e., their SCs). Before adding a rule $r$ to $\mathcal{R}$, we first check whether the facts inferred by $r$ can already be inferred by those in $\mathcal{R}$. We define the novelty degree of a rule $r$ as

$$noveltyDegree(r) = \frac{|fact(r)| - |fact(r) \cap fact(\mathcal{R})|}{|fact(\mathcal{R})|}.$$

We add the rule $r$ to $\mathcal{R}$ only if its novelty degree is higher than a threshold.
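A sketch of this greedy selection. The threshold default, the convention of accepting the first rule unconditionally (the novelty degree is undefined for an empty set), and the input format are our own assumptions.

```python
def concise_rule_set(rules, facts_of, threshold=0.01):
    """Greedy selection of a concise rule set (Section 5.1).

    rules: iterable of (sc, rule) pairs; facts_of: rule -> set of inferred facts.
    A rule enters the set only if its noveltyDegree exceeds the threshold.
    """
    selected, covered = [], set()
    for sc, rule in sorted(rules, key=lambda t: t[0], reverse=True):  # descending SC
        new = facts_of(rule) - covered
        # noveltyDegree(r) = |fact(r) \ fact(R)| / |fact(R)|; the first rule
        # (empty R) is accepted by convention, an assumption on our part.
        if not covered or len(new) / len(covered) > threshold:
            selected.append(rule)
            covered |= facts_of(rule)
    return selected
```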
5.2. Learning Rules with Attributes

Table 2 shows the average number of rules (#R) meeting the minimum quality SC β‰₯ 0.1 and HC β‰₯ 0.01, the thresholds used in RLvLR [4] and AMIE+ [8]. #QR is the average number of rules with SC β‰₯ 0.8 and HC β‰₯ 0.01. #CR is the average size of the concise rule sets (obtained from all #R learned rules). #QCR is the average size of the quality concise rule sets (obtained from the #QR high-quality rules). #CR-H is the average number of concise rules that contain at least one hub predicate (out of the #CR concise rules).

Table 2: Performance of HARL and RLvLR on the rule learning task
           | HARL                       | RLvLR
Benchmark  | #R  #QR #CR #QCR #CR-H     | #R  #QR #CR #QCR #CR-H
FB15K      | 267 123 190  48   66       | 153  15 117  10    0
FB15KSE    | 252  80 205  34   56       | 171  21 165  17    0
YAGO2core  |  70  32  52  14   13       |  33   4  32   4    0

Table 3: Performance of HARL and RLvLR on the fact inference task
           | HARL                  | RLvLR
Benchmark  | #UNF   #NF   #QNF     | #UNF   #NF   #QNF
FB15K      | 66152  578   314      | 66215  455   276
FB15KSE    | 56245  815   293      | 55952  627   216
YAGO2core  | 182676 2053  279      | 175379 1983  265

Table 4: Running characteristics of HARL and RLvLR
           | HARL                               | RLvLR
Benchmark  | Time  #Hub  EXP      Min att       | Time
FB15K      | 389   73    0.09767  0.00138       | 310
FB15KSE    | 299   55    0.09118  0.0014        | 276
YAGO2core  | 403   255   0.09889  0.0000003     | 292

Compared to RLvLR, HARL showed significantly better performance under both ordinary filtering and concise filtering. The superiority of HARL is most obvious in learning high-quality rules: among all the rules learned, HARL consistently yields a higher number of high-quality rules.

5.3. Inferring New Facts via Learnt Rules

To estimate the predictive power of the corpus of learned rules, we calculated the number of facts that can be predicted by applying the learned rules to the given facts, via the RLvLR inference module and the context-aware inference module. To aggregate different rules with the same target predicate, we use Noisy-OR, as in RLvLR [13] and AMIE+ [8]. Table 3 shows the number of facts predicted by RLvLR inference (#UNF), the number of predictions accepted by context-aware inference (#NF), and the number of context-aware inferred facts with CD β‰₯ 0.8 (#QNF).

Table 4 reports the running characteristics of the two systems, including running time. #Hub is the number of hubs considered before the number of new facts hits the maximum expansion degree (MaxExp in Algorithm 1), which we set to 0.1 in these experiments. EXP is the actual expansion degree, and Min att is the att value of the last hub considered.

5.4. Link Prediction via Learnt Rules

We additionally evaluated the quality of rules through link prediction. For each fact p(e, e') in the testing data of each benchmark, we consider two queries, p(e, ?) and p(?, e'). For FB15K and FB15KSE, the separation of test and training data is given; for YAGO2core there is no previously prepared split, so we randomly separated the facts into 90% for training and 10% for testing. We applied the learnt rules to the training data to predict the missing entities, and the testing data verified the correctness of such predictions. The predictions were measured using Mean Rank (MR), where a lower value indicates better performance, and Mean Reciprocal Rank (MRR), where a higher value indicates better performance. We also deployed Hits@1 and Hits@10. These four quality measures are commonly used in the knowledge graph completion literature [2]. In particular, we first ranked each inferred fact by the number of rules inferring it. MR is the average rank, MRR is the average of the reciprocal ranks of the correctly predicted entities, and Hits@1 and Hits@10 are the proportions of correctly predicted entities ranked first and among the top ten, respectively.

Table 5: Performance of HARL and RLvLR on the link prediction task
           | HARL                       | RLvLR
Benchmark  | MRR  MR      H1    H10     | MRR  MR      H1    H10
FB15K      | 0.36 169     0.23  0.58    | 0.3  207     0.2   0.48
FB15KSE    | 0.21 1171    0.14  0.34    | 0.21 1180    0.15  0.34
YAGO2core  | 0.06 332820  0.04  0.1     | 0.06 333778  0.04  0.09

Table 5 summarises the results, where H1 and H10 are the Hits@1 and Hits@10 scores. HARL significantly outperforms RLvLR on FB15K in all quality measures, while on the other two benchmarks it performs on par. Since FB15KSE and YAGO2core are more difficult link prediction tasks, it may be that introducing hubs could not help link prediction there.

To see how hubs affect the mined rules, consider the following two rules from YAGO2core:

SC: 0.03, HC: 0.07, wasBornIn(x, z) ∧ isLocatedIn(z, y) → isCitizenOf(x, y).
SC: 0.61, HC: 0.01, P:<graduatedFrom>_E:<University_College_Dublin>(x) ∧ wasBornIn(x, z) ∧ isLocatedIn(z, y) → isCitizenOf(x, y).

The standard confidence of the second rule is about 20 times higher. By adding a condition to the body of the rule, we make it more specific, so it fires on fewer facts, but those firings are more accurate and improve the SC significantly. On the other hand, since the second rule infers fewer facts, its head coverage drops, making the rule less general. While the first rule is filtered out due to its low SC, its specialised version has a higher SC and qualifies for the output set of rules.

From the FB15K KG, the following hubs are created, as they have the top att values:
1. P:</people/person/gender>_E:<male>(x)
2. P:</people/person/spouse_s./people/marriage/type_of_union>_E:<matrimony>(x)
3. E:<matrimony>_P:</people/marriage_union_type/unions_of_this_type./people/marriage/spouse>(x)
4. P:</people/person/nationality>_E:<USA>(x)
5. P:</people/person/profession>_E:<actors>(x)
6. E:<actors>_P:</people/profession/people_with_this_profession>(x)
7. P:</film/film/language>_E:<eng>(x)
8. P:</film/film/country>_E:<USA>(x)
9. P:</film/film/release_date_s./film/film_regional_release_date/film_release_distribution_medium>_E:<DVD>(x)
10. P:</film/film/release_date_s./film/film_regional_release_date/film_release_region>_E:<USA>(x)

From this example, we can observe that the identified hubs in FB15K are intuitive and show which binary attributes can separate the related entities into two sets. In a real-world scenario, the film industry has some dominant categories that distinguish the entities, such as movies produced in the USA in contrast with those produced in any other country. With our proposed modelling, we can reflect this disparity between different values of one attribute.

6. Related Works

Several rule learning systems, including AMIE+ [8], RuDiK [14] and AnyBURL [15], allow a set of entities to be considered in partially grounded rules. Gu et al. [16] introduce a language bias for their rule learning method in which the head atom and the last atom of the rule body are partially grounded, but no entities are allowed in other atoms. However, these systems treat all entities equally in the partially grounded rules that are learned, and they have no mechanism for selecting a set of entity-predicate pairs as attribute values. Consequently, the search space of possible rules grows vastly, and these systems struggle to scale to massive real-world KGs.

The other related stream of work is attribute knowledge graph modelling. In general, our work differs from this line of work for at least two reasons: (1) they assume the input KG contains attributes, while we create attributes from the given KG; (2) the attributes they consider are numeric or categorical, while ours are a particular case of categorical ones. Attribute KGs differ in the nature of their attributes. [11] and [10] can model KGs that contain numeric feature vectors as attributes of nodes (entities). In [11], the authors treat text related to an entity, such as the title of a paper, as a feature of the paper node; they use natural language processing methods to construct a numeric feature vector representing the title, rather than a string of characters. In [9], the authors distinguish relations from attributes and consider only categorical attributes. The attribute KG they consider, FB24K, is extracted from Freebase; it has 314 attributes, whose numbers of categorical values range from 1 to 877. While an attribute in their approach is similar to a hub in our method, our attributes are binary. Furthermore, we do not remove the original facts, and an entity can act as both an attribute value and an entity. For example, the USA acts as an attribute value with respect to the predicate nationality, while at the same time it acts as an entity in the predicate neighbour.
However, [9] does not consider attribute values as entities, thus ignoring the dual role of a thing as both an entity and an attribute value.

7. Conclusion

In this paper, we proposed an approach to learning attributed rules over KGs. In contrast to existing attribute KG modelling approaches that assume attributes as input, we mine attributed rules from a KG without an explicit set of attributes. We achieved this by proposing a binary attribute that can easily be embedded into closed-path rules, together with an identification mechanism that automatically mines such attributes from the KG. We enrich the original KG with new predicates (hubs) and the corresponding facts, and then feed the enriched KG to a state-of-the-art rule learner, RLvLR, to learn attributed rules. We also proposed a novel inference mechanism that considers both the demand for a newly inferred fact and the quality of the rules that infer it. Finally, we evaluated our system, HARL, on widely used benchmarks for rule learning and link prediction. Our experimental results demonstrate that HARL improves on RLvLR in terms of the number of mined quality rules, the number of inferred facts, and accuracy in link prediction.

References

[1] T. Pellissier-Tanon, G. Weikum, F. Suchanek, YAGO 4: A Reason-able Knowledge Base, in: ESWC, 2020.
[2] Z. Chen, Y. Wang, B. Zhao, J. Cheng, X. Zhao, Z. Duan, Knowledge graph completion: A review, IEEE Access 8 (2020) 192435-192456.
[3] Y. Chen, D. Z. Wang, S. Goldberg, ScaLeKB: scalable learning and inference over large knowledge bases, The International Journal on Very Large Data Bases 25 (2016) 893-918.
[4] P. G. Omran, K. Wang, Z. Wang, Scalable rule learning via learning representation, in: IJCAI, 2018.
[5] M. Gardner, T. Mitchell, Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction, in: Conference on Empirical Methods on Natural Language Processing, September, 2015, pp. 1488-1498.
[6] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: ICLR, 2015, p. 12.
[7] Z. Wang, J.-Z. Li, RDF2Rules: Learning Rules from RDF Knowledge Bases by Mining Frequent Predicate Cycles, arXiv preprint (2015).
[8] L. GalΓ‘rraga, C. Teflioudi, K. Hose, F. M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, The International Journal on Very Large Data Bases 24 (2015) 707-730.
[9] Y. Lin, Z. Liu, M. Sun, Knowledge representation learning with entities, attributes and relations, in: IJCAI, 2016.
[10] W. Liu, H. Cai, X. Cheng, S. Xie, Y. Yu, H. Zhang, Learning High-order Structural and Attribute information by Knowledge Graph Attention Networks for Enhancing Knowledge Graph Embedding, arXiv preprint (2019).
[11] Z. Hu, Y. Dong, K. Wang, K. W. Chang, Y. Sun, GPT-GNN: Generative Pre-Training of Graph Neural Networks, in: SIGKDD, ACM, 2020, pp. 1857-1867.
[12] M. Svatoő, S. Schockaert, J. Davis, STRiKE: Rule-Driven Relational Learning Using Stratified k-Entailment, in: ECAI, 2020.
[13] P. G. Omran, K. Wang, Z. Wang, An Embedding-based Approach to Rule Learning in Knowledge Graphs, TKDE 33 (2021) 1348-1359.
[14] S. Ortona, V. V. Meduri, P. Papotti, Robust discovery of positive and negative rules in knowledge bases, in: International Conference on Data Engineering (ICDE), IEEE, 2018, pp. 1180-1191.
[15] C. Meilicke, M. W. Chekol, D. Ruffinelli, H. Stuckenschmidt, Anytime bottom-up rule learning for knowledge graph completion, in: IJCAI, 2019.
[16] Y. Gu, Y. Guan, P. Missier, Towards Learning Instantiated Logical Rules from Knowledge Graphs, arXiv preprint (2020).