                         Logic Programming for Knowledge Graph Completion
                         Damiano Azzolini1 , Matteo Bonato2 , Elisabetta Gentili2,* and Fabrizio Riguzzi3
                         1
                           Department of Environmental and Prevention Sciences – University of Ferrara, Ferrara, Italy
                         2
                           Department of Engineering – University of Ferrara, Ferrara, Italy
                         3
                           Department of Mathematics and Computer Science – University of Ferrara, Ferrara, Italy


                                      Abstract
                                      A knowledge graph (KG) represents a domain of interest with a graph where some of the involved entities are
                                      linked with an edge. Knowledge Graph Completion (KGC) is a well-known task for KGs which requires finding
                                      missing connections. KGC has been studied for many years with multiple solutions available based on both
                                      symbolic and sub-symbolic techniques. In this paper, we would like to answer the question: can parameter
                                      learning for Probabilistic Logic Programming be a competitive algorithm to solve the KGC task? An empirical
                                      evaluation on the most common KGC datasets allows us to provide a negative answer to such a question.

                                      Keywords
                                      Knowledge Graph Completion, Parameter Learning, Statistical Relational Artificial Intelligence, Logic Programming




                         1. Introduction
                         Large Knowledge Graphs (KGs) pose interesting challenges, especially in extracting information from
                         them. One of them goes under the name of Knowledge Graph Completion (KGC): given a domain
                         represented with a graph, the goal is to find missing edges among the involved entities. There are
                         many existing solutions to solve this task, mainly based on deep learning models [1]. However, despite
                         their success in solving KGC, these models are considered black boxes, since their predictions cannot be
                         explained to the user.
                            Probabilistic logic languages [2], and in particular Probabilistic Logic Programming (PLP), are interest-
                         ing formalisms to express uncertainty with a logic program extended with probabilistic atoms and rules.
                         These are inherently explainable, since they represent information with easily interpretable logical
                         rules. Recently, the authors of [3] proposed to learn logical rules from a KG and associate a numerical
                         parameter with each of them. Motivated by their success, in this paper, we cast the KGC problem as a parameter
                         learning problem in PLP: given a set of probabilistic logical rules, the goal is to associate probabilities
                         to them such that the probabilities of the provided examples are maximized. For KGC, the examples
                         are the edges existing in the KG. This can be solved, for example, with Expectation Maximization. We
                         ran experiments on multiple datasets with different configurations and different ways to generate rules, to check
                         whether the proposed approach is competitive with state-of-the-art solutions. Unfortunately, we found
                         that this is not the case. However, this provides an interesting direction for future improvement of
                         existing algorithms for parameter learning in PLP.
                            The paper is structured as follows: Section 2 introduces the background notions; in Section 3 we
                         describe the approaches used for generating the rules; Section 4 shows the results of the experiments;
                         in Section 5 we discuss related work; lastly, in Section 6 we draw the conclusions and propose future
                         work directions.




                          CILC 2024: 39th Italian Conference on Computational Logic, June 26-28, 2024, Rome, Italy
                         *
                           Corresponding author.
                          $ damiano.azzolini@unife.it (D. Azzolini); matteo01.bonato@edu.unife.it (M. Bonato); elisabetta.gentili1@unife.it
                          (E. Gentili); fabrizio.riguzzi@unife.it (F. Riguzzi)
                           0000-0002-7133-2673 (D. Azzolini); 0009-0006-6901-0540 (E. Gentili); 0000-0003-1654-9703 (F. Riguzzi)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


2. Background
KGs are graph-based representations of knowledge in terms of relationships between entities. A KG
can be represented as a set of triples (ℎ, 𝑟, 𝑡) where r is the relation, and h and t are the head (start) and
tail (end) entities of the relation, respectively.
   Real-world KGs are usually incomplete and sparse, thus inference of missing information (entities
or relationships) is often required. This task is referred to as KGC. Depending on the information to
be inferred, KGC can be divided into specific tasks [4], such as link prediction, entity prediction, or
relation prediction. More formally, link prediction is the task of predicting either the tail 𝑡 of a triple
(ℎ, 𝑟, ?), or the head ℎ of a triple (?, 𝑟, 𝑡).
   Given a completion task (ℎ, 𝑟, ?) on a graph 𝐺, the goal is to find an entity 𝑡 such that the triple
(ℎ, 𝑟, 𝑡) holds but (ℎ, 𝑟, 𝑡) ∉ 𝐺. Candidate triples are associated with a score and sorted in descending order. Given an
answer 𝑡, its rank 𝑟𝑎𝑛𝑘(𝑡) is its position in the list of answers. Special care has to be taken when
several answers share the same score: in this case, there are various approaches for computing 𝑟𝑎𝑛𝑘(𝑡).
For example, minimum and maximum ranking take respectively the lowest and the highest rank in the
group of tied answers, while average ranking assigns the average rank of the group. In our
implementation, we always consider the average ranking, even if other tested tools may adopt different
strategies.
   The graph G is often split into three datasets, training set 𝑇𝑡𝑟𝑎𝑖𝑛 , test set 𝑇𝑡𝑒𝑠𝑡 , and validation set
𝑇𝑣𝑎𝑙𝑖𝑑 , used to train the model and evaluate its performance. Given a set of test triples 𝑇𝑡𝑒𝑠𝑡 , KGC
algorithms are usually evaluated on link prediction tasks in terms of the following metrics:

    • Mean Rank (MR), which is the average rank of the test triples:

      $$MR = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} rank(t)$$

    • Mean Reciprocal Rank (MRR), which is the average reciprocal rank of the test triples:

      $$MRR = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \frac{1}{rank(t)}$$

    • Hits@K, which represents the proportion of the test triples ranked in the top 𝐾 positions:

      $$Hits@K = \frac{|\{t \in T_{test} \mid rank(t) \leq K\}|}{|T_{test}|}$$

   Often, the ranking for a test triple (ℎ, 𝑟, 𝑡) includes a candidate 𝑡′ such that (ℎ, 𝑟, 𝑡′) ∉ 𝑇𝑡𝑒𝑠𝑡. Even if
𝑡′ is not the correct prediction, the triple (ℎ, 𝑟, 𝑡′) may appear in the training or validation set. To avoid
penalizing the rank of the correct candidate 𝑡, predictions that appear in triples of the training or
validation set can be filtered out. By doing so, a filtered version of the metrics is computed [5].
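   For concreteness, the following SWI-Prolog sketch computes MRR and Hits@K from a list of (possibly
fractional, average-style) ranks; it is a minimal illustration of the definitions above, not the code used in
the experiments.

:- use_module(library(apply)).
:- use_module(library(yall)).

% Mean Reciprocal Rank of a list of ranks.
mrr(Ranks, MRR) :-
    length(Ranks, N), N > 0,
    foldl([R,A0,A]>>(A is A0 + 1/R), Ranks, 0, Sum),
    MRR is Sum / N.

% Fraction of ranks that fall within the top K positions.
hits_at_k(Ranks, K, Hits) :-
    length(Ranks, N), N > 0,
    include([R]>>(R =< K), Ranks, InTopK),
    length(InTopK, NK),
    Hits is NK / N.

For example, mrr([1, 2, 10.5], MRR) yields MRR ≈ 0.53.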

2.1. AnyBURL
AnyBURL [3] is an anytime algorithm that learns rules from KGs following a bottom-up paradigm.
It iteratively samples from a KG 𝐺 random paths of length 𝑛, where 𝑛 is the number of edges,
starting from 𝑛 = 2. These paths are then assembled into ground logical rules where the first edge
becomes the head and the remaining ones the body.
   Given a set of triples {(𝑒0, ℎ, 𝑒1), (𝑒1, 𝑏1, 𝑒2), . . . , (𝑒𝑛, 𝑏𝑛, 𝑒𝑛+1)}, we can construct a ground path
rule 𝑅 of the form ℎ(𝑒0, 𝑒1) ← 𝑏1(𝑒1, 𝑒2), . . . , 𝑏𝑛(𝑒𝑛, 𝑒𝑛+1). AnyBURL extracts three types of rules
(listed in Table 1): i) rules that generalize acyclic ground path rules, i.e., rules where 𝑒0 ̸= 𝑒𝑛+1 (called
AC2); ii) rules that generalize cyclic ground path rules, i.e., rules where 𝑒0 = 𝑒𝑛+1 (called C); iii) rules
that generalize both acyclic and cyclic ground path rules (called AC1). This process continues until
a certain saturation parameter, which depends on the contribution of the considered rules to the
overall performance, is reached.
Table 1
Rules extracted by AnyBURL from paths on KGs. 𝑋 and 𝑌 are variables that appear in the head, while 𝐴𝑖 can
appear only in the body. Lowercase letters indicate constants.
                            Name     Structure
                            C        ℎ(𝑌, 𝑋) ← 𝑏1 (𝑋, 𝐴2 ), . . . , 𝑏𝑛 (𝐴𝑛 , 𝑌 )
                            AC1      ℎ(𝑒0 , 𝑋) ← 𝑏1 (𝑋, 𝐴2 ), . . . , 𝑏𝑛 (𝐴𝑛 , 𝑒𝑛+1 )
                            AC2      ℎ(𝑒0 , 𝑋) ← 𝑏1 (𝑋, 𝐴2 ), . . . , 𝑏𝑛 (𝐴𝑛 , 𝐴𝑛+1 )


At each iteration, the length of the considered path is increased by 1. The
score of a rule 𝑅 is the confidence, computed as follows [6]:

$$support(R) = |\{\theta_{XY} \mid \exists \theta_Z\; R_b\theta_{XYZ} \wedge R_h\theta_{XY}\}|$$

$$conf(R) = \frac{support(R)}{|\{\theta_{XY} \mid \exists \theta_Z\; R_b\theta_{XYZ}\}|}$$
where 𝜃𝑋𝑌 is a grounding for variables 𝑋 and 𝑌 appearing in the head 𝑅ℎ of 𝑅, 𝜃𝑍 is a grounding for
the variables appearing in the body 𝑅𝑏 of 𝑅 other than 𝑋 and 𝑌 , and 𝜃𝑋𝑌 𝑍 is the union of 𝜃𝑋𝑌 and
𝜃𝑍 .
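   As an illustration of these counts, here is a minimal SWI-Prolog sketch (not AnyBURL's code) computing
the confidence of one fixed rule h(X,Y) ← b1(X,A), b2(A,Y), where h/2, b1/2, and b2/2 are hypothetical
KG relations stored as facts.

:- use_module(library(aggregate)).
:- use_module(library(solution_sequences)).
:- dynamic h/2, b1/2, b2/2.   % hypothetical KG relations

% Body of the fixed example rule h(X,Y) :- b1(X,A), b2(A,Y).
rule_body(X, Y) :- b1(X, A), b2(A, Y).

% conf(R) = support(R) / number of distinct head groundings of the body;
% distinct/2 avoids counting a grounding of X,Y more than once.
confidence(Conf) :-
    aggregate_all(count, distinct([X,Y], rule_body(X, Y)), Bodies),
    aggregate_all(count, distinct([X,Y], (rule_body(X, Y), h(X, Y))), Support),
    Bodies > 0,
    Conf is Support / Bodies.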
  To score entities, AnyBURL adopts the maximum aggregation, in which candidate entities 𝑒𝑖 are
ordered according to the maximum confidence of all the rules 𝑅𝑖 that generated them, i.e.,

$$score(e_i) = \max\{conf(R_1), \ldots, conf(R_n)\}$$

where 𝑒𝑖 is the entity and 𝑅𝑖 are the rules that predicted it.
  When the number of rules is high, computing KGC metrics is very expensive. To overcome this,
AnyBURL exploits multiple threads and limits the search for candidates if, during the grounding of a
rule, a branch with more than 10^4 children is found [7]. For example, consider the rule

              ℎ𝑎𝑠𝐺𝑒𝑛𝑑𝑒𝑟(𝑋, 𝑌 ) ← 𝑏𝑜𝑟𝑛𝐼𝑛(𝑋, 𝐴), 𝑏𝑜𝑟𝑛𝐼𝑛(𝐵, 𝐴), ℎ𝑎𝑠𝐺𝑒𝑛𝑑𝑒𝑟(𝐵, 𝑌 )

and the query ℎ𝑎𝑠𝐺𝑒𝑛𝑑𝑒𝑟(𝑠𝑢𝑠𝑖, 𝑇) where 𝑠𝑢𝑠𝑖 was born in the USA. The second body atom unifies
𝐵 with each person known to be born in the USA, so there can be more than 10^4 instantiations. In
this case, AnyBURL only retrieves the first 10^4. There is also a parameter TOP_K_OUTPUT that
indicates the maximum number of answers to consider for KGC queries; for example, when
computing Hits@10, it is not necessary to compute all answers. Usually, this parameter is set to 100.
However, this only provides an approximation of the metrics, because the rank of the correct answer
is determined only by the answers that have actually been computed. Even if we first find the correct
answer and 99 answers with a lower score, we cannot be sure that the rank of the correct answer is 1,
as there can be answers not yet computed with a larger score. Nevertheless, this provides a good trade-off
between speed and precision. There are also highly optimized libraries for metrics computations such
as PyClause [8], which also offers an interface for AnyBURL.

2.2. SAFRAN
One drawback of maximum aggregation is that it considers only a single rule instead of a weighted
combination of all the rules that predict a certain entity. Noisy-Or aggregation tries to address this by
assuming that the confidence of each rule is a probability and by taking the Noisy-Or of the
predictions:
$$score(e) = 1 - \prod_{i=1}^{k}(1 - conf(R_i))$$
   However, Noisy-Or aggregation does not account for rule redundancy, which often occurs in real-
world KGs: double counting leads to an overestimation of the confidence, resulting in worse
performance than maximum aggregation [9]. The authors of [9] proposed SAFRAN, a framework
that adopts a novel aggregation approach called non-redundant Noisy-Or. This approach clusters
redundant rules based on their redundancy degree; then, it applies maximum aggregation to the rules
within each cluster; lastly, it aggregates the predictions of the different clusters with Noisy-Or. Two
rules 𝑅𝑖, 𝑅𝑗 are assigned to the same cluster if 𝑠𝑖𝑚(𝑅𝑖, 𝑅𝑗) > 𝑡ℎ, where 𝑠𝑖𝑚(𝑅𝑖, 𝑅𝑗) is the Jaccard
index of the sets of triples the two rules infer and 𝑡ℎ is a threshold. To find the optimal value of 𝑡ℎ,
which depends on the relation and rule type, SAFRAN adopts grid search or random search.
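   A minimal SWI-Prolog sketch of the two aggregation strategies and of the Jaccard index used for
clustering follows (our illustration, not SAFRAN's implementation); triple sets are assumed to be given
as sorted, duplicate-free lists.

:- use_module(library(apply)).
:- use_module(library(lists)).
:- use_module(library(ordsets)).
:- use_module(library(yall)).

% Maximum aggregation: the best confidence wins.
max_score(Confs, S) :- max_list(Confs, S).

% Noisy-Or aggregation: probability that at least one rule fires.
noisy_or_score(Confs, S) :-
    foldl([C,A0,A]>>(A is A0 * (1 - C)), Confs, 1, NoneFires),
    S is 1 - NoneFires.

% Jaccard index of the sets of triples inferred by two rules.
jaccard(S1, S2, J) :-
    ord_intersection(S1, S2, I),
    ord_union(S1, S2, U),
    length(I, NI), length(U, NU), NU > 0,
    J is NI / NU.

For two fully redundant rules with confidence 0.8, noisy_or_score([0.8, 0.8], S) gives 0.96 while
max_score gives 0.8, which illustrates the double counting that non-redundant Noisy-Or avoids.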

2.3. LIFTCOVER+
PLP combines logic-based languages and uncertainty [2]. Thanks to its expressiveness, PLP under the
distribution semantics [10] has been adopted in many domains where uncertainty is relevant [11, 12, 13].
Logic Programs with Annotated Disjunctions (LPADs) [14] are a PLP language under the distribution
semantics. In LPADs, heads of clauses are disjunctions in which each atom is annotated with a probability.
Liftable Probabilistic Logic Programs [15] are a restriction of probabilistic logic programs in which
inference can be performed in a lifted way [16] by considering populations of individuals instead of
each individual separately. Liftable PLP programs contain clauses with a single annotated atom in the
head and the predicate of this atom is the same for all clauses:

$$h_i : \Pi_i \ \text{:-}\ b_{i1}, \ldots, b_{iu_i}$$

where the single atom in the head is built over predicate target/a, which is the target of learning, and
where 𝑎 is the arity. Bodies of the clauses may contain other predicates than target/a, called input
predicates, that are certain, meaning that their facts and rules have a single atom in the head with
probability 1. To compute the probability of a query 𝑞, it is only necessary to count the ground
instantiations of clauses such that the body is true and the head is equal to 𝑞. If clause 𝐶𝑖 has 𝑚𝑖 such
instantiations {𝜃𝑖1, . . . , 𝜃𝑖𝑚𝑖}, every 𝜃𝑖𝑗 corresponds to a random variable 𝑋𝑖𝑗 with 𝑃(𝑋𝑖𝑗 = 1) = Π𝑖
and 𝑃(𝑋𝑖𝑗 = 0) = 1 − Π𝑖. A query 𝑞 is true if at least one of these random variables is true, and false
only if none of them is true. Since all the random variables are mutually independent,

$$P(q) = 1 - \prod_{i=1}^{n}(1 - \Pi_i)^{m_i}$$
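   The formula translates directly into a few lines of Prolog; the sketch below (our illustration) computes
𝑃(𝑞) from a list of Π𝑖-𝑚𝑖 pairs, one per clause.

:- use_module(library(apply)).
:- use_module(library(yall)).

% P(q) = 1 - prod_i (1 - Pi)^Mi, given pairs Pi-Mi of clause
% probability and number of instantiations with true body.
query_prob(Pairs, P) :-
    foldl([Pi-Mi,A0,A]>>(A is A0 * (1 - Pi)**Mi), Pairs, 1, Neg),
    P is 1 - Neg.

For instance, query_prob([0.3-2, 0.5-1], P) gives P = 1 − 0.7² · 0.5 = 0.755.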
   Given a set 𝐸 + = {𝑒1 , . . . , 𝑒𝑄 } of positive examples (i.e., a set of atoms), a set 𝐸 − = {𝑒𝑄+1 , . . . , 𝑒𝑅 }
of negative examples (i.e., a set of atoms), and a background knowledge 𝐵 defining the input predicates,
LIFTCOVER learns the parameters of the clauses using either Expectation-Maximization (EM) or
Limited-memory BFGS (LBFGS). The likelihood 𝐿 of the examples can be unfolded as
$$L = \prod_{l=1}^{n}(1 - \Pi_l)^{m_{l-}} \prod_{q=1}^{Q}\left(1 - \prod_{l=1}^{n}(1 - \Pi_l)^{m_{lq}}\right)$$

where $m_{lq}$ ($m_{lr}$) is the number of instantiations of $C_l$ whose head is $e_q$ ($e_r$) and whose body is true,
and $m_{l-} = \sum_{r=Q+1}^{R} m_{lr}$. The gradient of the likelihood is computed as:
$$\frac{\partial L}{\partial \Pi_i} = \frac{L}{1 - \Pi_i}\left(\sum_{q=1}^{Q} m_{iq}\left(\frac{1}{P(e_q)} - 1\right) - m_{i-}\right)$$

The equation $\frac{\partial L}{\partial \Pi_i} = 0$ does not admit a closed-form solution, thus optimization is needed to find the
maximum of 𝐿. Finally, the clauses with a probability below a user-defined threshold are discarded.
  The EM algorithm [17] is used to find the maximum likelihood estimates of parameters. First, in
the Expectation step, the distribution of the unseen variables in each instance is computed given the
observed data and the current value of the parameters. Then, during the Maximization step, the new
parameters are computed so that the likelihood is maximized. The algorithm repeats these two steps
until there are no more improvements in the likelihood. To use the EM algorithm, the distribution of
the hidden variables given the observed ones, 𝑃 (𝑋𝑖𝑗 = 1|𝑒) and 𝑃 (𝑋𝑖𝑗 = 1|¬𝑒), has to be computed.
Given that 𝑃 (𝑋𝑖𝑗 = 1, 𝑒) = 𝑃 (𝑒|𝑋𝑖𝑗 = 1) · 𝑃 (𝑋𝑖𝑗 = 1) = 𝑃 (𝑋𝑖𝑗 = 1) = Π𝑖 since 𝑃 (𝑒|𝑋𝑖𝑗 = 1) = 1,

$$P(X_{ij} = 1 \mid e) = \frac{P(X_{ij} = 1, e)}{P(e)} = \frac{\Pi_i}{1 - \prod_{i=1}^{n}(1 - \Pi_i)^{m_i}}$$

$$P(X_{ij} = 0 \mid e) = 1 - \frac{\Pi_i}{1 - \prod_{i=1}^{n}(1 - \Pi_i)^{m_i}}$$
Since $P(X_{ij} = 1, \lnot e) = P(\lnot e \mid X_{ij} = 1) \cdot P(X_{ij} = 1) = 0$ because $P(\lnot e \mid X_{ij} = 1) = 0$,

$$P(X_{ij} = 1 \mid \lnot e) = 0$$

$$P(X_{ij} = 0 \mid \lnot e) = 1$$
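   For illustration, a small Prolog sketch of these E-step quantities follows (our own, hedged reading of the
formulas above): for clause 𝑖 with probability Pi, one positive example with probability Pe and Mi
true-body instantiations contributes Mi · P(Xij = 1 | e) expected occurrences of Xij = 1, while a negative
example contributes only occurrences of Xij = 0.

% Expected counts contributed by one positive example e:
% P(Xij = 1 | e) = Pi / Pe, with Pe = P(e) as defined above.
e_step_pos(Pi, Mi, Pe, N1, N0) :-
    P1 is Pi / Pe,
    N1 is Mi * P1,
    N0 is Mi * (1 - P1).

% A negative example: P(Xij = 1 | \+e) = 0, so all Mi instantiations
% contribute to the count of Xij = 0.
e_step_neg(Mi, 0, Mi).

Summing these contributions over all examples yields the expected counts 𝑁1 and 𝑁0 used in the
Maximization step below.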
   LIFTCOVER+ [18, 19] is a modified version of LIFTCOVER that learns the parameters and the
structure of liftable probabilistic logic programs using regularization to prevent overfitting. Moreover,
it replaces LBFGS with gradient descent to optimize the likelihood.
   Regularization is a widely used strategy to inhibit overfitting: a penalty term added to the loss
function penalizes large weights. Since clauses with small weights have little influence on the
probability of the query and are removed, the result should be a smaller theory with fewer,
large-weight clauses. Bayesian, L1, or L2 regularization can be used in the Maximization step of
the EM algorithm. The main difference between L1 and L2 regularization lies in the penalty
introduced in the loss function: the L1 penalty term is the sum of the absolute values of the
parameters, while the L2 penalty term is the sum of their squares. L1 pushes the parameters towards
0 to create a sparse model, i.e., a model with few non-zero parameters, but it is computationally
inefficient; L2, on the other hand, is efficient but produces non-sparse solutions. In both cases, the
impact of regularization is controlled by a regularization coefficient: the higher it is, the stronger the
regularization.
   The L1 objective function [20] is:

$$J_1(\theta) = N_1 \cdot \log \theta + N_0 \cdot \log(1 - \theta) - \gamma\theta$$

while the L2 objective function [20] is:

$$J_2(\theta) = N_1 \cdot \log \theta + N_0 \cdot \log(1 - \theta) - \frac{\gamma}{2}\theta^2$$
where 𝜃 = Π𝑖, 𝑁0 and 𝑁1 are the expected numbers of occurrences of 𝑋𝑖𝑗 = 0 and 𝑋𝑖𝑗 = 1
respectively, computed during the Expectation step, and 𝛾 is the regularization coefficient.
  The value of 𝜃 that maximizes the objective function is computed in the Maximization step by solving
the equation $\frac{\partial J(\theta)}{\partial \theta} = 0$ [20]. $J_1(\theta)$ is maximum at:

$$\theta_1 = \frac{4N_1}{2\left(\gamma + N_0 + N_1 + \sqrt{(N_0 + N_1)^2 + \gamma^2 + 2\gamma(N_0 - N_1)}\right)}$$

while $J_2(\theta)$ is maximum at:

$$\theta_2 = \frac{2\sqrt{\frac{3N_0 + 3N_1 + \gamma}{\gamma}}\cos\left(\frac{1}{3}\arccos\left(\sqrt{\frac{\gamma}{3N_0 + 3N_1 + \gamma}}\,\frac{\frac{9N_0}{2} - 9N_1 + \gamma}{3N_0 + 3N_1 + \gamma}\right) - \frac{2\pi}{3}\right)}{3} + \frac{1}{3}$$
   In Bayesian regularization, parameters are updated assuming a prior distribution in the form of a
Dirichlet probability density with parameters [𝑎, 𝑏], where a and b are the number of extra occurrences
observed of 𝑋𝑖𝑗 = 1 and 𝑋𝑖𝑗 = 0 respectively. Parameters are shrunk when b is much larger than a.
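   To make the M-step concrete, here is a small Prolog sketch of two of these updates; theta_l1/4
transcribes the closed form for 𝐽1 above, while theta_bayes/5 is our reading of the Bayesian update as
the posterior mean under a Dirichlet prior with parameters [𝑎, 𝑏], which is an assumption rather than a
transcription of LIFTCOVER+'s code.

% M-step update maximizing the L1-regularized objective J1.
theta_l1(N1, N0, Gamma, Theta) :-
    D is 2*(Gamma + N0 + N1
            + sqrt((N0 + N1)**2 + Gamma**2 + 2*Gamma*(N0 - N1))),
    Theta is 4*N1 / D.

% Assumed Bayesian update: posterior mean under a Dirichlet([A,B]) prior;
% B much larger than A shrinks Theta towards 0.
theta_bayes(N1, N0, A, B, Theta) :-
    Theta is (N1 + A) / (N1 + N0 + A + B).

With 𝛾 = 0, theta_l1 reduces to the unregularized estimate 𝑁1/(𝑁0 + 𝑁1), as expected.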
   To apply EM, we need to store the values 𝑚𝑙− for each clause 𝐶𝑙 and the values 𝑚𝑙𝑞 for each clause 𝐶𝑙
and each positive example 𝑒𝑞. We store these values in a vector 𝑀− of size 𝑛 and a matrix 𝑀+ of size
𝑛 × 𝑄, respectively. We have implemented the operations to be performed in the Expectation and
Maximization phases as vector/matrix operations on 𝑀− and 𝑀+. LIFTCOVER+ now offers
implementations of the algorithm in Prolog and in Python using the Janus interface [21, 22]. The
Python version can exploit either the CPU with the package numpy or the GPU with the package cupy.


3. Approaches for Rule Generation
In this section, we describe the solutions we considered to generate rules whose probabilities are learnt
via LIFTCOVER+.

3.1. Generating Cyclic Rules by Computing Paths in the KG
Here we generate only cyclic rules, by first generating tuples of relations 𝑟0, 𝑟1, . . . and then converting
them into rules. The tuples of relations have 𝑛 + 1 elements: the first is the relation that goes into the
head of the clause and the other 𝑛 are those that go into the body, thus obtaining a body with 𝑛 atoms.
For this phase, each triple (ℎ, 𝑟𝑒𝑙, 𝑡) in the dataset is translated into an atom 𝑡(ℎ, 𝑟𝑒𝑙, 𝑡).
For example, the tuple (r0,r1,r2,r3) is translated into the rule

r(A,r0,B):-r(A,r1,C),r(C,r2,D),r(D,r3,B).

where 𝑟/3 is defined as

r(S,R,T):-
  t(S,R,T).

r(S,i(R),T):-
  t(T,R,S).

so that both the relations (R) and their inverses (i(R)) are considered. The tuples of relations are obtained
by starting from triples in the training set. For example, if we want to extract 4 relations we look for
values of R1,R2,R3,R4 such that the query

r(h,R1,C),r(C,R2,D),r(D,R3,E),r(E,R4,t).

succeeds at least once. Then duplicate tuples are removed.
   Removing duplicates can be very expensive, so we looked for different ways to do it. In the following,
we consider SWI-Prolog [23] as the inference engine. We tried tabling [24], which has the property
of storing each answer only once, but it was faster to directly use tries
(https://www.swi-prolog.org/pldoc/man?section=trie). Moreover, we exploited
parallelism in order to speed up the computation. In particular, we used the SWI library predicate
concurrent_maplist(Goal,List) that applies Goal to each element of List in parallel using
threads. We also wrote a helper predicate chunks(List,N,SubLists) that splits the elements of
List into N lists of approximately equal length and returns them in SubLists. For example, to generate
cyclic rules with 3 atoms in the body that contain the training tuples, we used the code below:

main:-
  tell('tuples3.pl'),                        % redirect output to a file
  rels(Rels),                                % list of all relations
  chunks(Rels,32,Chunks),                    % split into 32 chunks
  trie_new(Tr),
  concurrent_maplist(genpaths3(Tr),Chunks),  % fill the trie in parallel
  trie_gen(Tr,q(R,R1,R2,R3)),                % enumerate the distinct tuples
  write('(tt(A,'), write_canonical(R), write(',B):0.5 :- r(A,'),
  write_canonical(R1), write(',C), r(C,'),
  write_canonical(R2), write(',D), r(D,'), write_canonical(R3),
  writeln(',B)),'),
  fail.

main:-
   told.                                     % close the output file

genpaths3(Tr,Rels):-
  member(R,Rels),
  path3(R,R1,R2,R3),
  trie_insert(Tr,q(R,R1,R2,R3)),   % the trie stores each tuple only once
  fail.

genpaths3(_,_).

path3(R,R1,R2,R3):-
    r(S,R,T), r(S,R1,I), T\=I,
    r(I,R2,J), r(J,R3,T), J\=S.

where 𝑟𝑒𝑙𝑠/1 is a predicate that computes the list of all possible relations. These are then split into
32 chunks and each chunk is processed in parallel by a separate thread. Since operations on tries are
thread-safe, parallelization does not pose a problem. The reason for the tests T\=I and J\=S is to avoid
generating rules such as

r(S,r,T):-r(S,r,T),r(T,r1,J),r(J,i(r1),T).
r(S,r,T):-r(S,r1,I),r(T,i(r1),S),r(S,r,T).

This approach allows a quick sampling of a fraction 𝑓 of the tuples by generating a random number 𝑋
in [0,1] and checking whether 𝑋 < 𝑓 , as in, for example,

trie_gen(Tr,q(R,R1,R2,R3)), random(X), X < 0.05,

where we sample 5% of the tuples.
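
The helper predicate chunks/3 mentioned above is not shown in the paper; a possible definition (a
sketch under the stated behavior of splitting a list into N sublists of approximately equal length) is:

% chunks(List, N, SubLists): split List into N sublists of
% approximately equal length.
chunks(List, N, SubLists) :-
    length(List, Len),
    ChunkLen is max(1, ceiling(Len / N)),
    split_chunks(List, ChunkLen, SubLists).

split_chunks([], _, []) :- !.
split_chunks(List, Len, [Chunk|Rest]) :-
    length(Chunk, Len),
    append(Chunk, Tail, List), !,
    split_chunks(Tail, Len, Rest).
split_chunks(List, _, [List]).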

3.2. Generating Rules as in AnyBURL
With this approach, for each relation in the KG, we find the lists of start (head) and end (tail) nodes.
Then, we find a user-defined number of paths starting from each start (end) node up to a maximum
length also specified by the user. These paths are used to generate AC1, AC2, and C rules. Here as well,
duplicates are removed, and each rule is associated with an initial probability of 10^-4. This set of
rules is passed as the input theory to LIFTCOVER+ to tune the probabilities. The rankings of the
candidates for an entity are computed via Noisy-Or aggregation.

3.3. Generating Rules by Weighted Sampling
We tried another approach to generate rules: for each dataset, we extract all the relations together
with their number of occurrences. Call 𝐿𝑟 the list of all the relations. We associate each relation
with a probability proportional to its number of occurrences. To generate a rule, we proceed as follows:
we first sample a relation from 𝐿𝑟 to be used as the head relation. Then, we select a random number between two
Table 2
Description of the datasets used for the experiments. For each dataset, we report the number of triples in the
training, validation, test sets, and total, and the number of relations and entities.

                   Dataset       Train     Valid      Test     Total   Entities   Relations
                   Nations        1592       199       201     1992         28          55
                   WN18RR        86835      3034      3134    93003      71453          11
                   FB15k-237    272115     17535    20466    310116      14505         237
                   Nell         228426    129810   144307    502543      63361         400


and four and sample that many relations from 𝐿𝑟 to be used in the body. Both sampling phases take
into account the probability associated with each relation. Starting from the selected relations, we
generate rules by creating an atom with two variables for each relation and then connecting the atoms
to form a chain rule. To clarify, suppose we have 𝑟0 for the head and 𝑟1 and 𝑟2 for the body. We obtain
the rule r(A,r0,B) :- r(B,r1,C), r(C,r2,D). We repeat this process for a fixed number of rules.
Once we have all the rules, we compute the cumulative number of instantiations of all the bodies, call
it 𝑛𝑖, and associate each rule with a score given by the ratio between its number of body instantiations
and 𝑛𝑖. This method is implemented in Prolog as well; a sketch of the weighted sampling step follows.
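
The following is a minimal sketch (ours, not the actual implementation) of sampling a relation with
probability proportional to its number of occurrences, given a precomputed list of Rel-Count pairs.

:- use_module(library(random)).
:- use_module(library(apply)).
:- use_module(library(yall)).

% Sample one relation with probability proportional to its count.
sample_relation(RelCounts, Rel) :-
    foldl([_-C,A0,A]>>(A is A0 + C), RelCounts, 0, Total),
    random(X),                 % X uniform in [0,1)
    Target is X * Total,
    pick(RelCounts, Target, Rel).

pick([Rel-C|_], Target, Rel) :- Target < C, !.
pick([_-C|Rest], Target, Rel) :-
    Target1 is Target - C,
    pick(Rest, Target1, Rel).

For example, sample_relation([r0-10, r1-30, r2-60], R) returns r2 about 60% of the time.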


4. Experimental Evaluation
In this section, we report the experimental evaluation of our approach on well-known KGC datasets.

4.1. Datasets
In our experiments, we consider the following datasets: FB15K-237 [25] which contains entity pairs
extracted from FreeBase, WN18RR [26] which contains WordNet entities and explicit relations, NELL [27]
which was built with an agent (called Never-Ending Language Learner) that reads the web and learns
over time, and Nations [28] which contains data about economic, diplomatic, military, and social dyadic
interactions among a reduced set of 182 nation-dyads in the years between 1950 and 1965. The numbers
of triples, relations, and entities for each dataset are listed in Table 2.

4.2. Setup of Parameter Learning with LIFTCOVER+
We learn the parameters of the rules using the EM algorithm of LIFTCOVER+, because it gives solutions
of better quality than gradient descent and LBFGS, and we run it on GPUs to speed up computations.
However, the need to compute and store the matrix 𝑀+ for the positive examples (see Section 2.3)
requires limiting the number of positive examples and rules. In fact, with 𝑛 rules and 𝑄 positive examples,
if we assume that an integer requires 32 bits (4 bytes) of storage, storing the matrix 𝑀+ requires 𝑛 · 𝑄 · 4
bytes. For example, with 10^5 rules and 10^5 positive examples, we need 40 GB of memory, close to the
limit of the hardware available to us (64 GB). Thus we had to subsample examples for
the FB15k-237 and NELL datasets, since they have an excessive number of examples. In particular, we
sampled 81,196 positive examples for FB15k-237 and 4,670 for NELL. Moreover, we randomly generated
10^5 negative examples for FB15k-237, 2 · 10^5 for NELL, 2 · 10^5 for WN18RR, and 5 · 10^3 for Nations
by repeatedly sampling a relation 𝑟, a head ℎ, and a tail 𝑡 at random and checking that the triple
(ℎ, 𝑟, 𝑡) is not present in the KG, until the required number of examples is obtained.
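   A minimal sketch of this negative-example generation by rejection sampling follows (our illustration;
the entity and relation lists and the KG predicate t/3 are assumed to be given).

:- use_module(library(random)).
:- dynamic t/3.                    % the KG triples, assumed loaded

% negatives(N, Es, Rs, Neg): sample N triples absent from the KG,
% with Es and Rs the lists of entities and relations.
negatives(0, _, _, []) :- !.
negatives(N, Es, Rs, [t(H,R,T)|Neg]) :-
    random_member(H, Es),
    random_member(R, Rs),
    random_member(T, Es),
    \+ t(H, R, T), !,              % keep the triple only if not in the KG
    N1 is N - 1,
    negatives(N1, Es, Rs, Neg).
negatives(N, Es, Rs, Neg) :-       % collision with the KG: resample
    negatives(N, Es, Rs, Neg).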

4.3. Metrics Computation
We implemented in Prolog an algorithm for computing the Hits@1, Hits@3, Hits@5, Hits@10, and
MRR metrics. Algorithm 1 shows the pseudocode of function Metrics that, given the test set and the
Algorithm 1 Algorithm for computing metrics.
 1: function Metrics(TestSet, TrainingSet, ValidationSet, Theory)
 2:    𝑅𝑎𝑛𝑘𝑠 = ∅
 3:    for all 𝑟(ℎ, 𝑟, 𝑡) ∈ TestSet do
 4:        SortedAns ← ComputeAns(𝑟(ℎ, 𝑟, 𝑡), Theory, TrainingSet, ValidationSet)
 5:        rank ← ComputeRank(𝑡, SortedAns)
 6:        Ranks ← Ranks ∪ {rank }
 7:    end for
 8:    return ComputeHitsMRR(Ranks)
 9: end function
10:    function ComputeAns(𝑟(ℎ, 𝑟, 𝑡), Theory, TrainingSet, ValidationSet)
11:    ProbCorrAns ← ComputeProb(𝑟(ℎ, 𝑟, 𝑡), Theory)
12:    Inst ← FindInstantiations(𝑟(ℎ, 𝑟, 𝑇 ), Theory)
13:    FilteredInst ← Filter(Inst, TrainingSet, ValidationSet)
14:    Answers ← {(ProbCorrAns, 𝑡)}
15:    𝑁 =1
16:    for all 𝑟(ℎ, 𝑟, 𝑡′ ) ∈ FilteredInst do
17:        Prob ← ComputeProb(𝑟(ℎ, 𝑟, 𝑡′ ), Theory)
18:        if Prob ≥ ProbCorrAns then
19:            Answers ← Answers ∪ {(Prob, 𝑡′ )}
20:            𝑁 ←𝑁 +1
21:        end if
22:        if 𝑁 ≥ 20 then
23:            return Sort(Answers)
24:        end if
25:    end for
26:    return Sort(Answers)
27: end function


The function, for each triple 𝑟(ℎ, 𝑟, 𝑡) in the test set, calls function ComputeAns, which returns a set
of pairs (𝑃, 𝑡′) for answers 𝑡′ to 𝑟(ℎ, 𝑟, 𝑇). ComputeAns
first computes the probability of the correct answer 𝑡. Then, it finds all possible instantiations of the
query 𝑟(ℎ, 𝑟, 𝑇 ) using the Prolog code
setof(r(h,r,T),Prog^find_inst(Prog,r(h,r,T)),Atoms)

where find_inst/2 is defined as
find_inst(Prog,Ex):-
  member(((Head : _P) :- Body),Prog),   % pick a probabilistic clause
  Head = Ex,
  call(Body).                           % check that its body succeeds

After this, ComputeAns removes the instantiations present in the training or validation set and cycles
over the remaining ones, computing their probability and keeping a counter 𝑁 of the number of
answers with a probability greater or equal to 𝑃 𝑟𝑜𝑏𝐶𝑜𝑟𝑟𝐴𝑛𝑠, the probability of the correct answer.
If the probability 𝑃 𝑟𝑜𝑏 of an instantiation (ℎ, 𝑟, 𝑡′ ) is greater or equal to 𝑃 𝑟𝑜𝑏𝐶𝑜𝑟𝑟𝐴𝑛𝑠, then the
pair (𝑃 𝑟𝑜𝑏, 𝑡′ ) is added to the current set of scored answers, and the counter 𝑁 is updated. In fact, if
𝑃 𝑟𝑜𝑏 < 𝑃 𝑟𝑜𝑏𝐶𝑜𝑟𝑟𝐴𝑛𝑠, 𝑡′ need not be stored, as it does not influence the ranking of 𝑡. Then we test 𝑁 :
if it is 20 or more, we can stop examining instantiations, as the rank of 𝑡 is surely greater than 10.
In fact, the worst case is that there are 20 answers all with the same probability: in that case, the
rank of each answer would be (20 + 1)/2 = 10.5 according to the average approach to ranking computation.
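   For reference, the average-ranking computation can be sketched in a few lines of Prolog (ours, not the
paper's code), given the scores of the other candidates and the score of the correct answer.

:- use_module(library(apply)).
:- use_module(library(yall)).

% Average rank of the correct answer: strictly better candidates push
% it down, tied candidates are averaged with it.
average_rank(OtherScores, CorrectScore, Rank) :-
    include([S]>>(S > CorrectScore), OtherScores, Higher),
    include([S]>>(S =:= CorrectScore), OtherScores, Ties),
    length(Higher, NH),
    length(Ties, NT),
    Rank is NH + (NT + 2) / 2.

With 19 tied candidates and none strictly better, average_rank returns 21/2 = 10.5, matching the worst
case above.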
    Function Metrics computes exact values for Hits@1, Hits@3, Hits@5, and Hits@10, but only a
pessimistic approximation of MRR: since we stop as soon as we have found 19 other answers with a
probability greater than or equal to that of the correct answer, the actual rank of the correct answer
could be lower than the one we assign.
Table 3
Maximum length of the body of the rules and the number of rules for each dataset and each rule generation
method. Paths denotes the approach described in Section 3.1.

                      Dataset      Rule Generation Method     Max Length      #Rules
                                          Paths                    3         305,365
                      Nations          AC1 + AC2 + C               3           2,243
                                      Weighted Sampling            4            10^5
                                          Paths                    3           3,076
                      WN18RR
                                      Weighted Sampling            4            10^5
                                          Paths                    3          33,696
                      FB15K-237
                                      Weighted Sampling            4            10^5
                                          Paths                    3         101,242
                      Nell
                                      Weighted Sampling            4            10^5

Table 4
For each dataset and algorithm, we report the metrics Hits@1, Hits@3, Hits@5, Hits@10 and MRR together
with the time taken by parameter learning. For Paths+Conf we used the confidence as the parameters of the
rules. We do not report the configurations that were not able to solve the task.

     Dataset          Approach          H@1      H@3       H@5     H@10      MRR       Learning Time (s)
                      Paths+EM         0.4577   0.7861    0.8607   0.9950   0.6389               467.41
                     Paths+Conf        0.4926   0.7910    0.8557   0.9701   0.6586                     -
     Nations
                   AC1+AC2+C+EM        0.2189   0.5721    0.7612   0.9104   0.4404                  6.69
                  Weighted Sampling    0.5075   0.7811    0.8458   0.9652   0.6472                     -
                      Paths+EM         0.3405   0.4537    0.4936   0.5378   0.4193               189.74
     WN18RR          Paths+Conf        0.4371   0.5057    0.5370   0.5731   0.4894                     -
                  Weighted Sampling    0.0306   0.0826    0.1107   0.1512   0.0741                     -
                      Paths+EM         0.2143   0.3093    0.3530   0.4263   0.2844               936.36
     FB15K-237
                     Paths+Conf        0.2448   0.3528    0.4034   0.4801   0.3232                     -
                      Paths+EM         0.0004   0.0018    0.0028   0.0053   0.0019             19,713.04
     Nell
                     Paths+Conf        0.0010   0.0025    0.0038   0.0067   0.0033                     -


Since the values of the Hits metrics are exact, this function differs from the one of AnyBURL
(also implemented in PyClause). Note that function Metrics can also be parallelized, by executing
the loop over the triples of the test set concurrently.

4.4. Results
Parameter learning was performed on the Leonardo HPC system of Cineca, using machines with one
Intel Xeon Platinum 8358, 2.60GHz, 32 cores, 481GB of RAM, and Nvidia A100 GPUs with 64GB of
memory. The Weighted Sampling algorithm was run on a machine running at 2.40 GHz with 16 GB of
RAM with a time limit of 8 hours.
  Table 3 shows, for each dataset, the maximum length of the body of the rules and the number of rules
generated with the rule generation methods described in Section 3.1, Section 3.2, and Section 3.3, which
here we call Paths, AC1+AC2+C, and Weighted Sampling respectively.
   Table 4 shows the results of the experiments: for each dataset and algorithm, we report the metrics
Hits@1, Hits@3, Hits@5, Hits@10, and MRR together with the time taken by parameter learning.
The algorithm Paths+Conf means that we used the confidence, computed using PyClause, as the
parameters of the rules, without performing EM. Note that the Weighted Sampling algorithm does not
have an EM phase either: the reported values are the ones obtained only by sampling rules and
weighting them as described in Section 3.3.
   From Table 4 we can see that, for the rules found using the method described in Section 3.1 (called
Paths), performing parameter learning using EM is not beneficial. This seems surprising, as tuning
the parameters to improve the log-likelihood should produce a better classifier. It may be due to the
fact that we had to subsample the positive examples and to artificially generate the negative examples,
for which absence from the training and validation sets does not ensure absence from the test set, or
falsity in general. Weighted Sampling also has limited performance, since it can only handle two of
the four datasets. This is possibly due to the high number of instantiations that need to be found to
associate weights with the rules. Lastly, Table 4 shows that the approach described in Section 3.2, in
which an input theory composed of AC1, AC2, and C rules has its parameters learned with
LIFTCOVER+, could be used only in part. Specifically, we found that although the approach works for
small datasets such as Nations, with very large datasets such as the other three considered, parameter
learning with LIFTCOVER+ requires too much memory.
   In order to compute the metrics for Paths, we used PyClause with the Noisy-Or aggregation strategy,
because it was significantly faster than our Prolog implementation of Algorithm 1. This result is
surprising as well, as Prolog systems such as SWI-Prolog embody decades of software engineering
expertise in speeding up the answering of this kind of query, with advanced techniques such as
argument indexing. This may be due to the fact that the queries are particularly simple and involve
terms that are constants, for which the generality of the unification algorithm probably incurs
excessive overhead. Moreover, PyClause implements the optimizations discussed in Section 4.3
that, while leading to approximate results, could explain its speed. Finally, another possible
reason is that metrics computation in PyClause is implemented in C++.


5. Related Work
The KGC task has been studied for many years and a plethora of techniques has been developed to
address it. A comprehensive survey can be found in [1]. Most of these approaches are based on deep
learning. However, some notable exceptions exist, such as AnyBURL and SAFRAN, discussed in
previous sections. Another approach can be found in [29] where the authors propose the adoption of
so-called “soft” rules that are integrated within a probabilistic model. They address the KGC task using
EM where the E step is solved via belief propagation while they develop an ad-hoc solution for the M
step. With this methodology, they obtain competitive results w.r.t. other approaches such as AMIE [30]
and TransH [31].
   There is also a large body of research on parameter learning in the StarAI literature, and in particular
on probabilistic logic languages. Most approaches are based on EM, such as [10, 32, 33, 34], even if some
exceptions exist [35, 36]. Here, we consider LIFTCOVER+ because it adopts the class of liftable PLP,
where the inference process can be performed in a much faster way than in traditional PLP, due to
the restricted types of clauses allowed. This, on the one hand, benefits the parameter learning process.
However, the restriction on the types of rules may limit the expressiveness of the language and not
capture complex relations. Further exploration of this aspect, possibly considering multiple layers of
rules, even from a theoretical perspective, is an interesting future work.


6. Conclusions
In this paper, we leveraged Expectation Maximization algorithms for parameter learning in Probabilistic
Logic Programming to solve the Knowledge Graph Completion task. We presented three different
approaches to generate rules and learn their probabilities with LIFTCOVER+. Empirical results show
that EM applied to liftable probabilistic logic programs to solve the KGC task does not seem to be
beneficial. Furthermore, the computation of the full rank is often infeasible for all but the smallest datasets.
Future work includes approaching the KGC task with other PLP languages as well as studying alternative
ways to generate rules.


Acknowledgments
This work has been partially supported by the Spoke 1 “FutureHPC & BigData” of the Italian Research
Center on High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR
Missione 4 - Next Generation EU (NGEU), and by TAILOR, a project funded by EU Horizon 2020
research and innovation programme under GA No. 952215. Damiano Azzolini, Elisabetta Gentili, and
Fabrizio Riguzzi are members of the Gruppo Nazionale Calcolo Scientifico – Istituto Nazionale di Alta
Matematica (GNCS-INdAM). We acknowledge the CINECA award under the ISCRA initiative, for the
availability of high-performance computing resources and support. Elisabetta Gentili contributed to
this paper while attending the PhD programme in Engineering Science at the University of Ferrara,
Cycle XXXVIII, with the support of a scholarship financed by the Ministerial Decree no. 351 of 9th April
2022, based on the NRRP - funded by the European Union - NextGenerationEU - Mission 4 “Education
and Research”, Component 1 “Enhancement of the offer of educational services: from nurseries to
universities” - Investment 4.1 “Extension of the number of research doctorates and innovative doctorates
for public administration and cultural heritage”.


References
 [1] T. Shen, F. Zhang, J. Cheng, A comprehensive overview of knowledge graph completion,
     Knowledge-Based Systems 255 (2022) 109597. doi:10.1016/j.knosys.2022.109597.
 [2] F. Riguzzi, Foundations of Probabilistic Logic Programming Languages, Semantics, Inference and
     Learning, Second Edition, River Publishers, Gistrup, Denmark, 2022.
 [3] C. Meilicke, M. W. Chekol, D. Ruffinelli, H. Stuckenschmidt, Anytime bottom-up rule learning for
     knowledge graph completion., in: IJCAI, 2019, pp. 3137–3143.
 [4] Z. Chen, Y. Wang, B. Zhao, J. Cheng, X. Zhao, Z. Duan, Knowledge graph completion: A review,
     IEEE Access 8 (2020) 192435–192456.
 [5] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for
     modeling multi-relational data, Advances in neural information processing systems 26 (2013).
 [6] C. Meilicke, M. W. Chekol, P. Betz, M. Fink, H. Stuckenschmidt, Anytime bottom-up rule learning
     for large-scale knowledge graph completion, The VLDB Journal 33 (2024) 131–161. doi:10.1007/
     s00778-023-00800-5.
 [7] C. Meilicke, Personal communication, 2024.
 [8] P. Betz, C. Meilicke, S. Ott, L. Galárraga, PyClause, 2024. URL: https://github.com/symbolic-kg/
     PyClause/.
 [9] S. Ott, C. Meilicke, M. Samwald, Safran: An interpretable, rule-based link prediction method
     outperforming embedding models, arXiv preprint arXiv:2109.08002 (2021).
[10] T. Sato, A statistical learning method for logic programs with distribution semantics, in: L. Sterling
     (Ed.), Logic Programming, Proceedings of the Twelfth International Conference on Logic Program-
     ming, Tokyo, Japan, June 13-16, 1995, MIT Press, 1995, pp. 715–729. doi:10.7551/mitpress/
     4298.003.0069.
[11] L. De Raedt, A. Kimmig, Probabilistic (logic) programming concepts, Machine Learning 100 (2015)
     5–47. doi:10.1007/s10994-015-5494-z.
[12] F. Riguzzi, E. Lamma, M. Alberti, E. Bellodi, R. Zese, G. Cota, Probabilistic logic programming
     for natural language processing, in: F. Chesani, P. Mello, M. Milano (Eds.), Workshop on Deep
     Understanding and Reasoning, URANIA 2016, volume 1802 of CEUR Workshop Proceedings, Sun
     SITE Central Europe, 2017, pp. 30–37.
[13] A. Nguembang Fadja, F. Riguzzi, Probabilistic logic programming in action, in: A. Holzinger,
     R. Goebel, M. Ferri, V. Palade (Eds.), Towards Integrative Machine Learning and Knowledge
     Extraction, volume 10344 of Lecture Notes in Computer Science, Springer, 2017, pp. 89–116. doi:10.
     1007/978-3-319-69775-8_5.
[14] J. Vennekens, S. Verbaeten, M. Bruynooghe, Logic Programs With Annotated Disjunctions, in:
     20th International Conference on Logic Programming (ICLP 2004), volume 3132 of Lecture Notes
     in Computer Science, Springer, 2004, pp. 431–445.
[15] A. Nguembang Fadja, F. Riguzzi, Lifted discriminative learning of probabilistic logic programs,
     Machine Learning 108 (2019) 1111–1135.
[16] D. Poole, First-order probabilistic inference, in: G. Gottlob, T. Walsh (Eds.), IJCAI-03, Proceedings
     of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico,
     August 9-15, 2003, Morgan Kaufmann Publishers, 2003, pp. 985–991.
[17] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM
     algorithm, Journal of the Royal Statistical Society: Series B 39 (1977) 1–38.
[18] D. Azzolini, E. Gentili, F. Riguzzi, Link Prediction in Knowledge Graphs with Probabilistic Logic
     Programming: Work in Progress, in: J. Arias, S. Batsakis, W. Faber, G. Gupta, F. Pacenza, E. Pa-
     padakis, L. Robaldo, K. Ruckschloss, E. Salazar, Z. G. Saribatur, I. Tachmazidis, F. Weitkamper,
     A. Wyner (Eds.), Proceedings of the International Conference on Logic Programming 2023 Work-
     shops co-located with the 39th International Conference on Logic Programming (ICLP 2023),
     volume 3437 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 1–4.
[19] E. Gentili, Knowledge Graph Completion with Probabilistic Logic Programming, in: V. Poggioni,
     S. Rossi (Eds.), Proceedings of the AIxIA Doctoral Consortium 2023 co-located with the 22nd
     International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023), volume
     3670 of CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[20] A. Nguembang Fadja, F. Riguzzi, E. Lamma, Learning hierarchical probabilistic logic programs,
     Machine Learning 110 (2021) 1637–1693. doi:10.1007/s10994-021-06016-4.
[21] C. Andersen, T. Swift, The Janus System: A Bridge to New Prolog Applications, in: D. S. Warren,
     V. Dahl, T. Eiter, M. V. Hermenegildo, R. Kowalski, F. Rossi (Eds.), Prolog: The Next 50 Years,
     Springer Nature Switzerland, Cham, 2023, pp. 93–104. doi:10.1007/978-3-031-35254-6_8.
[22] T. Swift, C. Andersen, The Janus System: Multi-paradigm Programming in Prolog and Python,
     Electronic Proceedings in Theoretical Computer Science 385 (2023) 241–255. doi:10.4204/eptcs.
     385.24.
[23] J. Wielemaker, T. Schrijvers, M. Triska, T. Lager, SWI-Prolog, Theory and Practice of Logic
     Programming 12 (2012) 67–96. doi:10.1017/S1471068411000494.
[24] T. Swift, D. S. Warren, Tabling with answer subsumption: Implementation, applications and per-
     formance, in: T. Janhunen, I. Niemelä (Eds.), Logics in Artificial Intelligence - 12th European Con-
     ference, JELIA 2010, Helsinki, Finland, September 13-15, 2010. Proceedings, volume 6341 of Lecture
     Notes in Computer Science, Springer, 2010, pp. 300–312. doi:10.1007/978-3-642-15675-5_26.
[25] K. Toutanova, D. Chen, Observed versus latent features for knowledge base and text inference, in:
     A. Allauzen, E. Grefenstette, K. M. Hermann, H. Larochelle, S. W.-t. Yih (Eds.), Proceedings of the
     3rd Workshop on Continuous Vector Space Models and their Compositionality, Association for
     Computational Linguistics, Beijing, China, 2015, pp. 57–66. doi:10.18653/v1/W15-4007.
[26] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for
     modeling multi-relational data, in: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger
     (Eds.), Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc.,
     2013.
[27] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka, T. Mitchell, Toward an architecture for
     never-ending language learning, in: Proceedings of the AAAI conference on artificial intelligence,
     volume 24, 2010, pp. 1306–1313.
[28] R. J. Rummel, Dimensionality of Nations Project: Attributes of Nations and Behavior of Nation
     Dyads, 1950-1965, 1992. doi:10.3886/ICPSR05409.v1.
[29] R. Zhang, Y. Mao, W. Zhao, Knowledge graphs completion via probabilistic reasoning, Information
     Sciences 521 (2020) 144–159. doi:10.1016/j.ins.2020.02.016.
[30] L. A. Galárraga, C. Teflioudi, K. Hose, F. Suchanek, AMIE: association rule mining under incomplete
     evidence in ontological knowledge bases, in: Proceedings of the 22nd international conference on
     World Wide Web, 2013, pp. 413–422.
[31] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes,
     in: Proceedings of the AAAI conference on artificial intelligence, volume 28, 2014.
[32] E. Bellodi, F. Riguzzi, Expectation maximization over binary decision diagrams for probabilistic
     logic programs, Intelligent Data Analysis 17 (2013) 343–363. doi:10.3233/IDA-130582.
[33] D. Fierens, G. Van den Broeck, J. Renkens, D. S. Shterionov, B. Gutmann, I. Thon, G. Janssens, L. De
     Raedt, Inference and learning in probabilistic logic programs using weighted Boolean formulas,
     Theory and Practice of Logic Programming 15 (2015) 358–401.
[34] D. Azzolini, E. Bellodi, F. Riguzzi, Learning the parameters of probabilistic answer set programs,
     in: S. H. Muggleton, A. Tamaddoni-Nezhad (Eds.), Inductive Logic Programming, Springer Nature
     Switzerland, Cham, 2024, pp. 1–14. doi:10.1007/978-3-031-55630-2_1.
[35] B. Gutmann, A. Kimmig, K. Kersting, L. D. Raedt, Parameter learning in probabilistic databases: A
     least squares approach, in: ECMLPKDD-2008, volume 5211 of Lecture Notes in Computer Science,
     Springer, 2008, pp. 473–488. doi:10.1007/978-3-540-87479-9_49.
[36] D. Azzolini, F. Riguzzi, Optimizing probabilities in probabilistic logic programs, Theory and
     Practice of Logic Programming 21 (2021) 543–556. doi:10.1017/S1471068421000260.