Learning Rules With Attributes and Relations in Knowledge Graphs

Pouya Ghiasnezhad Omran (1), Zhe Wang (2) and Kewen Wang (2)
(1) The Australian National University, Canberra, ACT, Australia
(2) Griffith University, Brisbane, QLD, Australia

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022), Stanford University, Palo Alto, California, USA, March 21-23, 2022.
email: p.g.omran@anu.edu.au (P. G. Omran); zhe.wang@griffith.edu.au (Z. Wang); k.wang@griffith.edu.au (K. Wang)

Abstract
Rules capture high-level patterns in knowledge graphs and can sometimes provide virtual schemata for the data. Recent research has witnessed an increasing interest in learning rules automatically and applying the learned rules in knowledge graph inference. While quite a number of rule learning systems have been developed, the formats of learned rules are still restricted, mostly resembling paths in the knowledge graph. Such methods focus on the graph structure of the knowledge graph, assuming that all vertices and all edges in the graph play an equal role (differing only in name or label). In this paper, we propose a method, HARL (Hub Augmented Rule Learning), for learning rules containing attributes by treating certain binary predicates as unary predicates. A major component of HARL is an algorithm for identifying attributes in a given knowledge graph. HARL has been implemented on top of the scalable rule learner RLvLR. Our experimental results show that HARL is scalable and outperforms RLvLR in most cases on major benchmark datasets.

Keywords
Rule Learning, Knowledge Graph, Knowledge Graph Completion

1. Introduction

Knowledge graphs (KGs) have proven useful for building interlinks between information sources by connecting entities of diverse types such as people, universities, movies, and animals. This is particularly useful for information synthesis over multimodal media data, such as the fusion of medical documents consisting of X-ray images and medical reports (texts and tables). In artificial intelligence, a KG is usually a set of triples like (Barty, bornInCity, Brisbane), meaning that Barty was born in the city of Brisbane. Tools are available for retrieving such triples from various sources [1]. Thus, solutions for retrieving information can be obtained through KG technologies, such as efficient algorithms for link prediction. The objective of link prediction is to determine whether a pair of entities is connected via a relation. For instance, the link prediction task answers a query (Barty, bornInCity, ?) that asks for the city where Barty was born. Existing methods for link prediction are mostly based on embedding representations and graph neural networks (GNNs) [2], which adopt a black-box approach and are known to suffer from issues of explainability and flexibility. As a result, some researchers have advocated providing explainable solutions to KG link prediction by learning rules. In particular, rules that contain constants are useful for both link prediction and reasoning.
However, existing scalable rule learners based on embeddings, e.g. [3, 4], can only learn the class of closed-path rules (CP rules) and are unable to learn rules containing constants. This restriction is vital for scalability, since large KGs contain millions of entities, and admitting arbitrary constants causes an explosion of the search space. On the other hand, in KGs, specific entities in specific predicates act as attributes. For example, in FB15K, the entity USA in the second argument of the predicate /film/film/country(x, y) acts as a class that distinguishes all films x produced in the USA from those that are not. Such a binary attribute reflects the global dominance of the US film industry. We refer to such entity-predicate pairs as hubs and model them through attribute facts, and we propose a method to identify a set of hubs automatically. Consider an example from the YAGO2 KG, where P:isCitizenOf_E:Netherlands(x) is an attribute, or hub. With such attributes we can learn a number of patterns that cannot be expressed otherwise. For instance, without the attribute atom, the rule P:isCitizenOf_E:Netherlands(x) ∧ isCitizenOf(x, y) → livesIn(x, y) degrades to isCitizenOf(x, y) → livesIn(x, y), whose confidence degree is about three times lower. Thus, we consider more specific rules such as P:isCitizenOf_E:Netherlands(x) ∧ isCitizenOf(x, y) → livesIn(x, y) instead of the more general rule isCitizenOf(x, y) → livesIn(x, y).

In this paper, we propose a method, HARL (Hub Augmented Rule Learning), for learning rules containing attributes. The idea is to regard such a rule as a closed-path rule so that existing scalable rule learners can be employed. This is achieved by treating a binary predicate as a unary predicate; for example, an atom /film/film/country(x, USA) can be considered as an attribute /film/film/country_USA(x). A major component of HARL is thus an algorithm for identifying attributes in a given KG. HARL keeps the search space at a manageable size while not missing entities that are essential for learning rules. HARL has been implemented on top of the scalable rule learner RLvLR [4]. Our experimental results show that HARL outperforms RLvLR in most cases on major benchmark datasets.

2. Preliminaries

A knowledge graph consists of a set of entities $\mathcal{E}$ as its vertices; its edges are directed and labelled with predicates from a set $\mathcal{P}$. An entity $e$ is an object such as a place, a person, etc., and a fact (or link) is a triple $(e, p, e')$, meaning that the entity $e$ is related to the entity $e'$ via the binary predicate $p$. Following the convention in knowledge representation, we denote such a fact as $p(e, e')$.

The format of the rules we aim to learn extends closed-path rules (CP rules), a language bias widely adopted in the rule learning literature for knowledge graphs; for instance, CP rules are the underlying formalism of Path Ranking Algorithms [5], RuleEmbedding [6], RDF2Rules [7] and ScaLeKB [3]. We extend CP rules with atoms about entity attributes. An attribute closed-path rule (ACP rule, or simply a rule) $r$ is of the form

$$p_1(x_0, x_1) \wedge p_2(x_1, x_2) \wedge \cdots \wedge p_n(x_{n-1}, x_n) \wedge a_1(x_{l_1}) \wedge \cdots \wedge a_m(x_{l_m}) \rightarrow p(x_0, x_n). \quad (1)$$
Here each $x_i$ ($0 \le i \le n$) is a variable and $0 \le l_j \le n$ for each $1 \le j \le m$; each $p(u, v)$ is called an atom, where $u$ and $v$ are called the subject and object arguments of $p$, respectively; and each $a(u)$ is an atom with $a$ a unary predicate. Intuitively, the rule $r$ reads that if $p_1(x_0, x_1), p_2(x_1, x_2), \ldots, p_n(x_{n-1}, x_n), a_1(x_{l_1}), \ldots, a_m(x_{l_m})$ hold, then $p(x_0, x_n)$ holds too. The atom $p(x_0, x_n)$ is the head of $r$ and the set of atoms $\{p_1(x_0, x_1), p_2(x_1, x_2), \ldots, p_n(x_{n-1}, x_n), a_1(x_{l_1}), \ldots, a_m(x_{l_m})\}$ is the body of $r$. The rule obtained from $r$ by removing the unary predicates in the body, that is, $p_1(x_0, x_1) \wedge p_2(x_1, x_2) \wedge \cdots \wedge p_n(x_{n-1}, x_n) \rightarrow p(x_0, x_n)$, is usually referred to as a closed-path (CP) rule, as the sequence of predicates in the rule body forms a path from the subject argument to the object argument of the head predicate.

To assess the quality of mined rules, we recall measures used in major approaches to rule learning [3, 8]. Let $r$ be a rule of the form (1). A pair of entities $(e, e')$ satisfies the body of $r$, denoted $body(r)(e, e')$, if there is a way of substituting the variables in $body(r)$ with entities in the KG such that (i) all atoms in $body(r)$ (after substitution) are facts in the KG, and (ii) $x_0$ and $x_n$ are substituted with $e$ and $e'$, respectively. The pair $(e, e')$ satisfies the head of $r$, denoted $p(e, e')$, if $p(e, e')$ is a fact in the KG. The support degree of $r$ is then defined as

$$supp(r) = \#\{(e, e') : body(r)(e, e') \wedge p(e, e')\},$$

that is, $supp(r)$ is the number of entity pairs that satisfy both the head and the body of $r$. The degrees of standard confidence (SC) and head coverage (HC) are defined as two normalisations of $supp(r)$:

$$SC(r) = \frac{supp(r)}{\#\{(e, e') : body(r)(e, e')\}}, \qquad HC(r) = \frac{supp(r)}{\#\{(e, e') : p(e, e')\}}.$$

So $SC(r)$ normalises $supp(r)$ by the number of entity pairs that satisfy the body, while $HC(r)$ normalises $supp(r)$ by the number of entity pairs that satisfy the head.
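To make these measures concrete, the following is a minimal sketch of how $supp$, $SC$ and $HC$ can be computed for a CP-rule body over a KG stored as a set of (predicate, subject, object) triples. The representation, the helper names, and the toy facts are our own illustrative choices, not part of any system described in this paper.

```python
# A toy KG as a set of facts p(e, e'), stored as (p, e, e') triples.
kg = {
    ("isCitizenOf", "anna", "netherlands"),
    ("livesIn", "anna", "netherlands"),
    ("isCitizenOf", "ben", "netherlands"),
    ("livesIn", "ben", "germany"),
}

def body_pairs(kg, body):
    """Entity pairs (e, e') satisfying a CP body p1(x0,x1), ..., pn(x(n-1),xn)."""
    # Start from all facts of p1, then join each following predicate
    # on the shared intermediate variable.
    pairs = {(e, e2) for (p, e, e2) in kg if p == body[0]}
    for pred in body[1:]:
        pairs = {(e, e3) for (e, e2) in pairs
                 for (p, f, e3) in kg if p == pred and f == e2}
    return pairs

def quality(kg, body, head):
    """Return (supp, SC, HC) for the CP rule body -> head."""
    satisfied = body_pairs(kg, body)
    head_pairs = {(e, e2) for (p, e, e2) in kg if p == head}
    supp = len(satisfied & head_pairs)
    sc = supp / len(satisfied) if satisfied else 0.0    # standard confidence
    hc = supp / len(head_pairs) if head_pairs else 0.0  # head coverage
    return supp, sc, hc

# isCitizenOf(x, y) -> livesIn(x, y): supp = 1, SC = 0.5, HC = 0.5
print(quality(kg, ["isCitizenOf"], "livesIn"))
```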
3. Learning Rules with Attributes

In this section, we present a simple and effective approach to learning rules with attributes. We first describe how to automatically identify facts in the KG that express entity attributes, which we refer to as attribute facts, and then explain how to transform such attribute facts into a form that can be easily processed by existing rule learning methods.

3.1. Identifying Attribute Facts

As argued in [9, 10], the facts in existing KGs can be separated into (at least) two categories: those representing entity relations and those representing entity attributes (hubs). For example, hasFather(Joe Biden, Joseph R. Biden) expresses the relationship between the two entities Joe Biden and Joseph R. Biden, whereas gender(Joe Biden, Male) is naturally considered an attribute of the entity Joe Biden. Sometimes it is not easy to draw a clear line between the two types of facts, but in general, entity relations often involve large numbers of entities, like Joe Biden and Joseph R. Biden, whereas entity attributes often involve a small number of entities representing attribute values, such as the gender Male, each of which is linked to a huge number of other entities.

Existing rule learning approaches often do not distinguish entity attributes from entity relations, as they can only learn the class of CP rules like hasFriend(x, y) ∧ gender(y, z) → gender(x, z). However, in many cases it makes more sense to consider gender an attribute, so that rules of the following form can be learned:

hasParent(x, y) ∧ gender(y, Male) → hasFather(x, y).

To identify attribute facts $p(e, e')$, we need to identify entities $e'$ that can be used as attribute values, such as Male, and predicates $p$ that are suitable for representing attributes, such as gender. For attribute values, we observe that they are often linked to a huge number of entities and thus form "hubs" in the network. Such hubs can be identified by their connection densities. We define the density of a hub, $den(e, p)$, as the (in and out) degree of an entity $e$ w.r.t. a predicate $p$:

$$den_{in}(e, p) = \frac{\#\{e' : p(e', e)\}}{\#\{(e', e'') : p(e', e'')\}}, \qquad den_{out}(e, p) = \frac{\#\{e' : p(e, e')\}}{\#\{(e', e'') : p(e', e'')\}}. \quad (2)$$

In the definition of density, $\#\{e' : p(e, e')\}$ (resp. $\#\{e' : p(e', e)\}$) is the number of occurrences of $e$ as a subject (resp. object) in a fact about $p$, and it is normalised by the total number of facts about $p$ in the KG.

Yet a hub $e$ with high density is not necessarily suitable as an attribute value. Consider an extreme case where all other entities $e'$ are connected with $e$ through a predicate $p$; then $e$ is not an attribute value that can distinguish some entities $e'$ from others. From an information-theoretic perspective, such an attribute provides no information gain, as it expresses no distinguishing feature of the associated entities. Thus, we need to select hubs that both have high density and provide distinguishing features. We compute the entropy of hubs as follows:

$$ent_c(e, p) = -\big(d \log d + (1 - d) \log(1 - d)\big), \quad \text{where } c \in \{in, out\},\ d = den_c(e, p). \quad (3)$$

Next, we want to identify the predicates that are suitable for attributes. This is based on the observation that such a predicate often associates an entity with its attribute values, and the number of possible values (like Male) is often far smaller than the number of entities that can be associated with the attribute (like Joe Biden). So we define the imbalance degree (imb) of a predicate $p$ as follows:

$$imb_{in}(p) = \frac{\#\{e : p(e, e')\}}{\#\{e' : p(e, e')\} \cdot |\mathcal{E}|}, \qquad imb_{out}(p) = \frac{\#\{e : p(e', e)\}}{\#\{e' : p(e', e)\} \cdot |\mathcal{E}|}. \quad (4)$$

Figure 1: Data flow of enriching the KG with hubs.

Intuitively, a large $imb_{in}(p)$ means that $p$ has more potential subjects than objects, which suggests that the objects of $p$ could be attribute values (whereas the subjects are more likely to be ordinary entities). In other words, the vertices with $p$ as an incoming edge are more likely to be attribute values than those with $p$ as an outgoing edge. Finally, we combine the two measures to determine the likelihood of a fact with hub $e$ and predicate $p$ being an attribute fact:

$$att_c(e, p) = ent_c(e, p) \cdot imb_c(p), \quad \text{where } c \in \{in, out\}. \quad (5)$$

The larger $att_{out}(e, p)$ (resp. $att_{in}(e, p)$) is, the more likely a fact $p(e, e')$ (resp. $p(e', e)$) is an attribute fact, with $e'$ being an entity and $e$ representing its attribute value.
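The sketch below puts Equations (2) to (5) together, assuming the same (predicate, subject, object) triple representation as the earlier example. The function name hub_scores, the dictionary layout, and the use of the natural logarithm (the paper does not fix a base for Equation (3)) are our own assumptions.

```python
import math
from collections import Counter, defaultdict

def hub_scores(kg):
    """Score candidate hubs (p, e, direction) by att = ent * imb; Eqs. (2)-(5).

    kg: set of (p, e, e') triples. Returns {(p, e, 'in'|'out'): att value}.
    """
    facts_per_pred = Counter(p for (p, _, _) in kg)
    subj = defaultdict(set)   # predicate -> distinct subject entities
    obj = defaultdict(set)    # predicate -> distinct object entities
    deg_out = Counter()       # (p, e): occurrences of e as subject of p
    deg_in = Counter()        # (p, e): occurrences of e as object of p
    entities = set()
    for p, e, e2 in kg:
        subj[p].add(e); obj[p].add(e2)
        deg_out[(p, e)] += 1; deg_in[(p, e2)] += 1
        entities.update((e, e2))

    def entropy(d):
        # Binary entropy of the density; 0 at the degenerate values (Eq. (3)).
        # Natural log is an assumption; the paper leaves the base unspecified.
        return 0.0 if d in (0.0, 1.0) else -(d * math.log(d) + (1 - d) * math.log(1 - d))

    n = len(entities)
    scores = {}
    for p in facts_per_pred:
        imb_in = len(subj[p]) / (len(obj[p]) * n)    # many subjects, few objects (Eq. (4))
        imb_out = len(obj[p]) / (len(subj[p]) * n)   # many objects, few subjects
        for e in obj[p]:                             # e as object: incoming edges
            den = deg_in[(p, e)] / facts_per_pred[p]     # Eq. (2), den_in
            scores[(p, e, "in")] = entropy(den) * imb_in  # Eq. (5), att_in
        for e in subj[p]:                            # e as subject: outgoing edges
            den = deg_out[(p, e)] / facts_per_pred[p]     # Eq. (2), den_out
            scores[(p, e, "out")] = entropy(den) * imb_out
    return scores
```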
3.2. KGs with Attributes

Unlike existing embedding methods that embed attributes separately from the other relations [3, 11, 10], we propose to transform the attribute facts so that existing rule learners like RLvLR can be easily adapted to learn rules with attributes from them. We illustrate the data flow of our transformation in Figure 1, where an enriched KG is obtained through the identified attribute facts.

In general, attributes can be seen as unary predicates, whereas other relations are binary ones. An attribute fact gender(Joe Biden, Male) can be seen as a unary fact gender_Male(Joe Biden). Yet such a unary fact cannot be directly processed by existing rule learners, as they only accept binary facts (i.e., triples) as input. Hence, we represent the unary fact gender_Male(Joe Biden) as a fact that is essentially a self-loop in the graph, i.e., gender_Male(Joe Biden, Joe Biden). Formally, each attribute fact in the original KG of the form $p(e, e')$ (or $p(e', e)$) with $e'$ being the attribute value is transformed into a fact P:p_E:e'(e, e) (resp. E:e'_P:p(e, e)), where P:p_E:e' (resp. E:e'_P:p) is a new predicate. Figure 2 illustrates how attribute facts are transformed into self-loops in the KG.

Figure 2: Example of enriching a KG with hubs.

The intuition of this transformation is to represent attributes as self-loops in the KG, so that path-based rule learning methods can be easily adapted to handle such a KG, while the attribute atoms (as self-loops) can be conveniently identified in the learned rules. For example, a path hasParent(Joe Biden, Joseph R. Biden), gender(Joseph R. Biden, Male) is transformed into a path hasParent(Joe Biden, Joseph R. Biden), P:gender_E:Male(Joseph R. Biden, Joseph R. Biden), which together with many other similar paths may induce the rule

hasParent(x, y) ∧ P:gender_E:Male(y, y) → hasFather(x, y),

which can be rewritten as

hasParent(x, y) ∧ P:gender_E:Male(y) → hasFather(x, y).
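A minimal sketch of this transformation, continuing the triple representation used above. Encoding a selected hub as a (predicate, entity, direction) triple anticipates the notion of a hub introduced in Section 3.3; the helper name is our own.

```python
def enrich_with_hubs(kg, hubs):
    """Add a self-loop fact for every attribute fact matched by a selected hub.

    kg: set of (p, e, e') triples; hubs: set of (p, e, 'in'|'out') triples.
    E.g. the fact gender(JoeBiden, Male) with hub (gender, Male, 'in')
    yields the self-loop P:gender_E:Male(JoeBiden, JoeBiden).
    """
    new_facts = set()
    for p, e, e2 in kg:
        if (p, e2, "in") in hubs:    # the object e2 is the attribute value
            new_facts.add((f"P:{p}_E:{e2}", e, e))
        if (p, e, "out") in hubs:    # the subject e is the attribute value
            new_facts.add((f"E:{e}_P:{p}", e2, e2))
    return kg | new_facts
```

Note that the original facts are kept alongside the new self-loops, so an entity such as USA can still act as an ordinary entity in other predicates.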
3.3. Learning Rules with Attributes

Algorithm 1 gives the complete process of identifying attribute facts, expanding the KG with transformed facts, and learning rules with attributes. Intuitively, we first select fact patterns p(e, Β·) or p(Β·, e) that can be considered attribute facts. We represent such a pattern as a triple (p, e, c) with c ∈ {in, out}, which should not be confused with a triple in the KG; in what follows, we call such a triple a hub. We select hubs such that the total number of facts they add to the KG is bounded by an expansion degree Exp.

In line 1, we compute the frequency of all potential hubs. For each candidate hub (p, e, c), its frequency is the unnormalised density, i.e., the numerator of Equation (2). Computing the frequencies requires only a single pass over the facts in the KG. We select the top n candidate hubs, where n = TopFreqNum, to form the Hublist. In lines 2 to 3, we compute the att measure of Equation (5) for all potential hubs in the Hublist and sort the list of candidate hubs by their att values. In lines 5 to 9, we consider the potential hubs one by one from the Hublist, starting with the highest att degree, and create the corresponding new attribute predicates and facts. We stop this process when we hit the maximum expansion degree. The expansion degree is computed as the number of new facts divided by the number of facts in the original KG; the maximum expansion degree is given as an input parameter MaxExp. In line 10, we feed the enriched set of facts to RLvLR, and in line 11 we return the resulting set of rules with attributes.

Algorithm 1: Learn Rules with Attributes
Input: a KG as a set of facts F
Parameters: a max expansion degree MaxExp and an integer TopFreqNum
Output: a set of rules with attributes R
1: Hublist := selectTopFrequent(F, TopFreqNum)
2: computeAtt(Hublist)
3: Hublist := sortAtt(Hublist)
4: F' := F
5: let NewFacts consist of attribute facts for the next top hubs from Hublist
6: while (|F'| + |NewFacts| - |F|) / |F| ≀ MaxExp do
7:     F' := F' βˆͺ NewFacts
8:     let NewFacts consist of attribute facts for the next top hubs from Hublist
9: end while
10: R := RLvLR(F')
11: return R
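A rough Python rendering of Algorithm 1, reusing the hub_scores and enrich_with_hubs sketches above. The default parameter values, the per-hub expansion bookkeeping, and rlvlr (a hypothetical stand-in for invoking the actual RLvLR learner) are all assumptions for illustration.

```python
from collections import Counter

def harl(kg, max_exp=0.1, top_freq_num=1000):
    """Illustrative rendering of Algorithm 1 over a set of (p, e, e') triples."""
    # Line 1: keep the most frequent candidate hubs
    # (frequency = unnormalised density, the numerator of Eq. (2)).
    freq = Counter()
    for p, e, e2 in kg:
        freq[(p, e, "out")] += 1
        freq[(p, e2, "in")] += 1
    hublist = [h for h, _ in freq.most_common(top_freq_num)]
    # Lines 2-3: score the candidates by att and sort in descending order.
    scores = hub_scores(kg)
    hublist.sort(key=lambda h: scores[h], reverse=True)
    # Lines 4-9: add attribute facts hub by hub while the bound allows.
    enriched = set(kg)
    for hub in hublist:
        new_facts = enrich_with_hubs(kg, {hub}) - set(kg)
        if (len(enriched) + len(new_facts) - len(kg)) / len(kg) > max_exp:
            break  # maximum expansion degree reached
        enriched |= new_facts
    # Lines 10-11: feed the enriched KG to the rule learner.
    return rlvlr(enriched)  # hypothetical wrapper around RLvLR
```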
4. Context-Aware Inference

There is a body of work on applying learned rules to the known facts of a KG to infer new facts. Most of these works focus on applying the rules in a restricted iterative way [12]. Another stream of work in the knowledge graph community applies the uncertain rules once [4, 13, 8]. In the latter stream, each rule is augmented with a degree of confidence, and the inference module takes these degrees into account when inferring a new fact (which is also assigned a degree of confidence). To control the quality and number of newly inferred facts, these systems impose a maximum threshold on the number of new facts and/or a minimum threshold on their confidence. The problem with this method is that it does not consider the demand of the KG for each new fact; it merely focuses on the process that produces the new facts.

For example, consider the predicate isParentOf. For each person, we usually have at most two entities as parents. If we want to infer new facts about this predicate for a specific person, Ann, we consider all rules whose conclusions match the partially grounded fact isParentOf(x, Ann). On the other hand, we need to consider how many instantiations of this fact are needed. If we already have two entities as parents of Ann, we should reject any further facts, even ones inferred via quite confident rules. For another entity, Pit, for whom no parent facts are known, we should accept a newly inferred fact even if the confidences of the corresponding rules are not high. Thus, the context-aware inference method considers both the confidence of the fact-production process and the demand for the new fact in the KG.

In context-aware inference, we infer new facts by taking into account the related facts already in the KG. We can obtain an average number of facts for each predicate and entity in a specific predicate argument. Later, when we want to infer new facts, if there are already enough facts in the KG, inferring more is likely to be overpredicting. To avoid overpredicting, we keep only the predictions that securely fill the gap between the current number of facts and the average number. We give higher priority to facts with higher SC when there are alternatives. So we sort all predictions by their SCs, and for each prediction $p_1(e_1, e_2)$ we consider the gap between the average number of facts for the specific entity and predicate and the new fact in question. We obtain the Average Number of Facts (ANF) for each predicate and argument position:

$$ANF_{p_1}^{Sub} = \frac{\#\{(e, e') : p_1(e, e')\}}{\#\{e : \exists y\ p_1(e, y)\}} \quad (6)$$

$$ANF_{p_1}^{Obj} = \frac{\#\{(e, e') : p_1(e, e')\}}{\#\{e' : \exists x\ p_1(x, e')\}} \quad (7)$$

Although this averaging ignores the variation in the number of facts for a specific entity under a specific predicate, we propose it so that new facts are inferred in an extra-cautious fashion. We do not claim that the rejected new facts are invalid; rather, under this mechanism we lack strong support that they are needed. We accept a new fact $p_1(e_1, e_2)$ if the following condition is satisfied:

$$\big(\#\{e' : p_1(e_1, e')\} < ANF_{p_1}^{Sub}\big) \wedge \big(\#\{e : p_1(e, e_2)\} < ANF_{p_1}^{Obj}\big) \quad (8)$$

This condition admits the new fact only when both its subject and its object have fewer than the average number of facts for the predicate. With this extra-cautious method we avoid overpopulating the KG with too many predicted facts for any specific subject-predicate or object-predicate pair, at the cost of excluding many facts that might be valid. For example, in the case of nationality, the number of entities with the same nationality varies, in real-world data, from hundreds of thousands to a billion. By using the average, we might exclude many valid new facts for a more populous country. Hence, the facts excluded by this mechanism are not necessarily wrong; but when inferring fewer facts has high priority, we exclude them.
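A minimal sketch of this acceptance test, under the same triple representation as before. Counting an accepted prediction toward the subject's and object's degrees is our reading of "securely fill the gap", and we assume every predicted predicate already occurs in the KG; the function name and input format are illustrative.

```python
from collections import Counter, defaultdict

def context_aware_filter(kg, predictions):
    """Accept a predicted fact only while both its subject and object sit
    below the average number of facts (ANF) for the predicate; condition (8).

    kg: set of (p, e, e') triples (p of every prediction must occur in kg).
    predictions: list of (sc, p, e1, e2), processed in descending SC order.
    """
    n_facts = Counter()
    subjects, objects = defaultdict(set), defaultdict(set)
    out_deg, in_deg = Counter(), Counter()
    for p, e, e2 in kg:
        n_facts[p] += 1
        subjects[p].add(e); objects[p].add(e2)
        out_deg[(p, e)] += 1; in_deg[(p, e2)] += 1

    accepted = []
    for sc, p, e1, e2 in sorted(predictions, key=lambda t: t[0], reverse=True):
        anf_sub = n_facts[p] / len(subjects[p])  # Eq. (6): avg facts per subject
        anf_obj = n_facts[p] / len(objects[p])   # Eq. (7): avg facts per object
        if out_deg[(p, e1)] < anf_sub and in_deg[(p, e2)] < anf_obj:
            accepted.append((p, e1, e2))
            out_deg[(p, e1)] += 1  # count the accepted fact toward the gap
            in_deg[(p, e2)] += 1   # (our assumption about the bookkeeping)
    return accepted
```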
5. Experiments

Based on the methods presented in the previous sections, we have implemented Hub Augmented Rule Learning (HARL) and conducted three sets of experiments to evaluate the new system. The first set of experiments evaluates the effect of the new hub predicates on rule learning. In the second set, we use the learned rules to infer new facts via both the original inference module of RLvLR and the new context-aware inference module; the results show the usefulness of hub-augmented rules for inferring more quality facts. The third set evaluates the improvement of RLvLR on the link prediction task when hubs are used as new predicates.

Table 1 shows the statistics of the three widely used knowledge graphs. For FB15K and FB15KSE, we randomly selected 50 target predicates, and for YAGO2 core, we used all 32 predicates as target predicates. All results reported below are averages over the corresponding target predicates.

Table 1: Benchmark KG specifications
KG         | # Facts | # Entities | # Predicates
FB15K      | 542K    | 15K        | 1345
FB15KSE    | 272K    | 15K        | 237
YAGO2 core | 948K    | 470K       | 32

To show the advantage of identifying hubs and populating the KG with them for rule learning, we first introduce a new kind of rule redundancy elimination, the concise rule set.

5.1. Eliminating Redundant Rules: The Concise Rule Set

Rule learners can be evaluated by the number of distinct quality rules they learn. AMIE+ [8] eliminates duplicate rules; in that process, two rules are considered duplicates only if they have the same head predicate, the same number of body atoms (body length), and the same head coverage (or support). Such elimination treats only rules with the same syntax as the same rule (e.g., $p_3(x, y) \wedge p_2(x, y) \rightarrow p_1(x, y)$ and $p_2(x, y) \wedge p_3(x, y) \rightarrow p_1(x, y)$); it does not account for semantic similarity, as between livesIn(x, y) → isCitizenOf(x, y) and livesIn(x, y) ∧ Person(x) → isCitizenOf(x, y). Intuitively, the atom Person(x) in the second rule is obvious and adds no information, yet it yields a (syntactically) new rule. This kind of semantic redundancy can discredit evaluations of rule learning systems that are based on the number of quality rules. To overcome this challenge, we introduce a novel method to obtain a set of rules with minimum redundancy.

We deploy a greedy search to find a set of rules with minimum redundancy w.r.t. fact inference. Each rule $r$ is associated with the set of facts that can be inferred from it, denoted $fact(r)$. For a set of rules $\mathcal{R}$, $fact(\mathcal{R}) = \bigcup_{r \in \mathcal{R}} fact(r)$. To obtain a concise set of rules, we start with an empty set $\mathcal{R}$ and add rules into $\mathcal{R}$ one by one from the set of all learned rules, following the descending order of the confidences of the rules (i.e., their SCs). Before adding a rule $r$ to $\mathcal{R}$, we first check whether the facts inferred by $r$ can already be inferred by those in $\mathcal{R}$. We define the novelty degree of a rule $r$ as

$$noveltyDegree(r) = \frac{|fact(r)| - |fact(r) \cap fact(\mathcal{R})|}{|fact(\mathcal{R})|}.$$

We add the rule $r$ to $\mathcal{R}$ only if its novelty degree is higher than a threshold.
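A sketch of this greedy selection. The threshold default, the convention of accepting the first rule unconditionally (the novelty degree is undefined for an empty set), and the input format are our own assumptions.

```python
def concise_rule_set(rules, facts_of, threshold=0.01):
    """Greedy selection of a concise rule set (Section 5.1).

    rules: iterable of (sc, rule) pairs; facts_of: rule -> set of inferred facts.
    A rule enters the set only if its noveltyDegree exceeds the threshold.
    """
    selected, covered = [], set()
    for sc, rule in sorted(rules, key=lambda t: t[0], reverse=True):  # descending SC
        new = facts_of(rule) - covered
        # noveltyDegree(r) = |fact(r) \ fact(R)| / |fact(R)|; the first rule
        # (empty R) is accepted by convention, an assumption on our part.
        if not covered or len(new) / len(covered) > threshold:
            selected.append(rule)
            covered |= facts_of(rule)
    return selected
```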
5.2. Learning Rules with Attributes

Table 2 shows the average number of rules (#R) meeting the minimum quality SC β‰₯ 0.1 and HC β‰₯ 0.01, the thresholds used in RLvLR [4] and AMIE+ [8]. #QR is the average number of rules with SC β‰₯ 0.8 and HC β‰₯ 0.01. #CR is the average size of the concise rule sets (obtained from all #R learned rules). #QCR is the average size of the quality concise rule sets (obtained from the #QR high-quality rules). #CR-H is the average number of concise rules that contain at least one hub predicate (out of the #CR concise rules).

Table 2: Performance of HARL and RLvLR on the rule learning task
           | HARL                       | RLvLR
Benchmark  | #R  #QR #CR #QCR #CR-H     | #R  #QR #CR #QCR #CR-H
FB15K      | 267 123 190  48   66       | 153  15 117  10    0
FB15KSE    | 252  80 205  34   56       | 171  21 165  17    0
YAGO2core  |  70  32  52  14   13       |  33   4  32   4    0

Table 3: Performance of HARL and RLvLR on the fact inference task
           | HARL                  | RLvLR
Benchmark  | #UNF   #NF   #QNF     | #UNF   #NF   #QNF
FB15K      | 66152  578   314      | 66215  455   276
FB15KSE    | 56245  815   293      | 55952  627   216
YAGO2core  | 182676 2053  279      | 175379 1983  265

Table 4: Running characteristics of HARL and RLvLR
           | HARL                               | RLvLR
Benchmark  | Time  #Hub  EXP      Min att       | Time
FB15K      | 389   73    0.09767  0.00138       | 310
FB15KSE    | 299   55    0.09118  0.0014        | 276
YAGO2core  | 403   255   0.09889  0.0000003     | 292

Compared to RLvLR, HARL showed significantly better performance under both ordinary filtering and concise filtering. The superiority of HARL is most obvious in learning high-quality rules: among all the rules learned, HARL consistently yields a higher number of high-quality rules.

5.3. Inferring New Facts via Learnt Rules

To estimate the predictive power of the corpus of learned rules, we calculated the number of facts that can be predicted by applying the learned rules to the given facts, via the RLvLR inference module and the context-aware inference module. To aggregate different rules with the same target predicate, we use Noisy-OR, as in RLvLR [13] and AMIE+ [8]. Table 3 shows the number of facts predicted by RLvLR inference (#UNF), the number of predictions accepted by context-aware inference (#NF), and the number of context-aware inferred facts with CD β‰₯ 0.8 (#QNF).

Table 4 reports the running characteristics of the two systems, including running time. #Hub is the number of hubs considered before the number of new facts hits the maximum expansion degree (MaxExp in Algorithm 1), which we set to 0.1 in these experiments. EXP is the actual expansion degree, and Min att is the att value of the last hub considered.

5.4. Link Prediction via Learnt Rules

We additionally evaluated the quality of rules through link prediction. For each fact p(e, e') in the testing data of each benchmark, we consider two queries, p(e, ?) and p(?, e'). For FB15K and FB15KSE, the separation of test and training data is given; for YAGO2core there is no previously prepared split, so we randomly separated the facts into 90% for training and 10% for testing. We applied the learnt rules to the training data to predict the missing entities, and the testing data verified the correctness of such predictions. The predictions were measured using Mean Rank (MR), where a lower value indicates better performance, and Mean Reciprocal Rank (MRR), where a higher value indicates better performance. We also deployed Hits@1 and Hits@10. These four quality measures are commonly used in the knowledge graph completion literature [2]. In particular, we first ranked each inferred fact by the number of rules inferring it. MR is the average rank, MRR is the average of the reciprocal ranks of the correctly predicted entities, and Hits@1 and Hits@10 are the proportions of correctly predicted entities ranked first and among the top ten, respectively.

Table 5: Performance of HARL and RLvLR on the link prediction task
           | HARL                       | RLvLR
Benchmark  | MRR  MR      H1    H10     | MRR  MR      H1    H10
FB15K      | 0.36 169     0.23  0.58    | 0.3  207     0.2   0.48
FB15KSE    | 0.21 1171    0.14  0.34    | 0.21 1180    0.15  0.34
YAGO2core  | 0.06 332820  0.04  0.1     | 0.06 333778  0.04  0.09

Table 5 summarises the results, where H1 and H10 are the Hits@1 and Hits@10 scores. HARL significantly outperforms RLvLR on FB15K in all quality measures, while on the other two benchmarks it performs on par. Since FB15KSE and YAGO2core are more difficult link prediction tasks, it may be that introducing hubs could not help link prediction there.

To see how hubs affect the mined rules, consider the following two rules from YAGO2core:

SC: 0.03, HC: 0.07, wasBornIn(x, z) ∧ isLocatedIn(z, y) → isCitizenOf(x, y).
SC: 0.61, HC: 0.01, P:<graduatedFrom>_E:<University_College_Dublin>(x) ∧ wasBornIn(x, z) ∧ isLocatedIn(z, y) → isCitizenOf(x, y).

The standard confidence of the second rule is about 20 times higher. By adding a condition to the body of the rule, we make it more specific, so it fires on fewer facts, but those firings are more accurate and improve the SC significantly. On the other hand, since the second rule infers fewer facts, its head coverage drops, making the rule less general. While the first rule is filtered out due to its low SC, its specialised version has a higher SC and qualifies for the output set of rules.

From the FB15K KG, the following hubs are created, as they have the top att values:
1. P:</people/person/gender>_E:<male>(x)
2. P:</people/person/spouse_s./people/marriage/type_of_union>_E:<matrimony>(x)
3. E:<matrimony>_P:</people/marriage_union_type/unions_of_this_type./people/marriage/spouse>(x)
4. P:</people/person/nationality>_E:<USA>(x)
5. P:</people/person/profession>_E:<actors>(x)
6. E:<actors>_P:</people/profession/people_with_this_profession>(x)
7. P:</film/film/language>_E:<eng>(x)
8. P:</film/film/country>_E:<USA>(x)
9. P:</film/film/release_date_s./film/film_regional_release_date/film_release_distribution_medium>_E:<DVD>(x)
10. P:</film/film/release_date_s./film/film_regional_release_date/film_release_region>_E:<USA>(x)

From this example, we can observe that the identified hubs in FB15K are intuitive and show which binary attributes can separate the related entities into two sets. In a real-world scenario, the film industry has some dominant categories that distinguish the entities, such as movies produced in the USA in contrast with those produced in any other country. With our proposed modelling, we can reflect this disparity between different values of one attribute.

6. Related Works

Several rule learning systems, including AMIE+ [8], RuDiK [14] and AnyBURL [15], allow a set of entities to be considered in partially grounded rules. Gu et al. [16] introduce a language bias for their rule learning method in which the head atom and the last atom of the rule body are partially grounded, but no entities are allowed in other atoms. However, these systems treat all entities equally in the partially grounded rules that are learned, and they have no mechanism for selecting a set of entity-predicate pairs as attribute values. Consequently, the search space of possible rules grows vastly, and these systems struggle to scale to massive real-world KGs.

The other related stream of work is attribute knowledge graph modelling. In general, our work differs from this line of work for at least two reasons: (1) they assume the input KG contains attributes, while we create attributes from the given KG; (2) the attributes they consider are numeric or categorical, while ours are a particular case of categorical ones. Attribute KGs differ in the nature of their attributes. [11] and [10] can model KGs that contain numeric feature vectors as attributes of nodes (entities). In [11], the authors treat text related to an entity, such as the title of a paper, as a feature of the paper node; they use natural language processing methods to construct a numeric feature vector representing the title, rather than a string of characters. In [9], the authors distinguish relations from attributes and consider only categorical attributes. The attribute KG they consider, FB24K, is extracted from Freebase; it has 314 attributes, whose numbers of categorical values range from 1 to 877. While an attribute in their approach is similar to a hub in our method, our attributes are binary. Furthermore, we do not remove the original facts, and an entity can act as both an attribute value and an entity. For example, the USA acts as an attribute value with respect to the predicate nationality, while at the same time it acts as an entity in the predicate neighbour.
However, [9] does not consider attribute values as entities, thus ignoring the dual role of a thing as both an entity and an attribute value.

7. Conclusion

In this paper, we proposed an approach to learning attributed rules over KGs. In contrast to existing attribute KG modelling approaches that assume attributes as input, we mine attributed rules from a KG without an explicit set of attributes. We achieved this by proposing a binary attribute that can easily be embedded into closed-path rules, together with an identification mechanism that automatically mines such attributes from the KG. We enrich the original KG with new predicates (hubs) and the corresponding facts, and then feed the enriched KG to a state-of-the-art rule learner, RLvLR, to learn attributed rules. We also proposed a novel inference mechanism that considers both the demand for a newly inferred fact and the quality of the rules that infer it. Finally, we evaluated our system, HARL, on widely used benchmarks for rule learning and link prediction. Our experimental results demonstrate that HARL improves on RLvLR in terms of the number of mined quality rules, the number of inferred facts, and accuracy in link prediction.

References

[1] T. Pellissier-Tanon, G. Weikum, F. Suchanek, YAGO 4: A Reason-able Knowledge Base, in: ESWC, 2020.
[2] Z. Chen, Y. Wang, B. Zhao, J. Cheng, X. Zhao, Z. Duan, Knowledge graph completion: A review, IEEE Access 8 (2020) 192435-192456.
[3] Y. Chen, D. Z. Wang, S. Goldberg, ScaLeKB: scalable learning and inference over large knowledge bases, The International Journal on Very Large Data Bases 25 (2016) 893-918.
[4] P. G. Omran, K. Wang, Z. Wang, Scalable rule learning via learning representation, in: IJCAI, 2018.
[5] M. Gardner, T. Mitchell, Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction, in: Conference on Empirical Methods on Natural Language Processing, September, 2015, pp. 1488-1498.
[6] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: ICLR, 2015, p. 12.
[7] Z. Wang, J.-Z. Li, RDF2Rules: Learning Rules from RDF Knowledge Bases by Mining Frequent Predicate Cycles, arXiv preprint (2015).
[8] L. GalΓ‘rraga, C. Teflioudi, K. Hose, F. M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, The International Journal on Very Large Data Bases 24 (2015) 707-730.
[9] Y. Lin, Z. Liu, M. Sun, Knowledge representation learning with entities, attributes and relations, in: IJCAI, 2016.
[10] W. Liu, H. Cai, X. Cheng, S. Xie, Y. Yu, H. Zhang, Learning High-order Structural and Attribute information by Knowledge Graph Attention Networks for Enhancing Knowledge Graph Embedding, arXiv preprint (2019).
[11] Z. Hu, Y. Dong, K. Wang, K. W. Chang, Y. Sun, GPT-GNN: Generative Pre-Training of Graph Neural Networks, in: SIGKDD, ACM, 2020, pp. 1857-1867.
[12] M. Svatoő, S. Schockaert, J. Davis, STRiKE: Rule-Driven Relational Learning Using Stratified k-Entailment, in: ECAI, 2020.
[13] P. G. Omran, K. Wang, Z. Wang, An Embedding-based Approach to Rule Learning in Knowledge Graphs, TKDE 33 (2021) 1348-1359.
[14] S. Ortona, V. V. Meduri, P. Papotti, Robust discovery of positive and negative rules in knowledge bases, in: International Conference on Data Engineering (ICDE), IEEE, 2018, pp. 1180-1191.
[15] C. Meilicke, M. W. Chekol, D. Ruffinelli, H. Stuckenschmidt, Anytime bottom-up rule learning for knowledge graph completion, in: IJCAI, 2019.
[16] Y. Gu, Y. Guan, P. Missier, Towards Learning Instantiated Logical Rules from Knowledge Graphs, arXiv preprint (2020).