=Paper=
{{Paper
|id=None
|storemode=property
|title=Parameter and Structure Learning Algorithms for Statistical Relational Learning
|pdfUrl=https://ceur-ws.org/Vol-926/paper1.pdf
|volume=Vol-926
|dblpUrl=https://dblp.org/rec/conf/aiia/BellodiR12
}}
==Parameter and Structure Learning Algorithms for Statistical Relational Learning==
Elena Bellodi
Supervisor: Fabrizio Riguzzi
ENDIF, Università di Ferrara, Via Saragat 1, 44122 Ferrara, Italy
elena.bellodi@unife.it

1 Introduction

My research activity focuses on the field of Machine Learning. Two key challenges in most machine learning applications are uncertainty and complexity. The standard framework for handling uncertainty is probability; for complexity, it is first-order logic. We would therefore like to be able to learn and perform inference in representation languages that combine the two. This is the focus of the field of Statistical Relational Learning.

My research builds on the large body of techniques developed in the field of Logic Programming, in which the distribution semantics [16] is one of the most prominent approaches. This semantics underlies, e.g., Probabilistic Logic Programs, Probabilistic Horn Abduction, PRISM [16], Independent Choice Logic, pD, Logic Programs with Annotated Disjunctions (LPADs) [17], ProbLog [5] and CP-logic. These languages have the same expressive power: there are linear transformations from one to the others. LPADs offer the most general syntax, so my research and experimentation have focused on this formalism. An LPAD consists of a finite set of disjunctive clauses, where each disjunct in the head of a clause is annotated with the probability that it holds if the body of the clause holds. LPADs are particularly suitable for reasoning about actions and effects where there is causal independence among the possible outcomes of a given action.

Various works have appeared for solving three types of problems for languages under the distribution semantics:
– inference: computing the probability of a query given the model and some evidence; most algorithms find explanations for queries and compute their probability by building a Binary Decision Diagram (BDD) [5,15,10];
– learning the parameters: for instance, LeProbLog [6] uses gradient descent, while LFI-ProbLog [7] uses an Expectation Maximization approach where the expectations are computed directly on BDDs;
– learning the structure: in [13] a theory compression algorithm for ProbLog is presented, and in [12] ground LPADs are learned using Bayesian network techniques.

A different approach from the distribution semantics is represented by Markov Logic: a Markov Logic Network (MLN) is a first-order knowledge base with a weight attached to each clause, reflecting “how strong” it is. For this language, inference, parameter and structure learning algorithms are available as well; see [14] and [11].

By developing systems for solving the above problems, one would like to handle classical machine learning tasks, such as Text Classification, Entity Resolution, Link Prediction, Information Extraction, etc., in real-world domains. An example of a Text Classification problem is given by the WebKB dataset, containing the text of web pages from 4 universities; each page belongs to one of four classes: course, faculty, research project or student. Given the words on the pages, one wishes to infer the class. Entity Resolution concerns the problem of integrating information from multiple sources, where the same entities can be described differently. An example is represented by the Cora dataset, containing citations of computer science publications, since citations of the same paper often appear in different forms.
I have considered the problems of learning the parameters, and of learning both the parameters and the structure, of LPADs by developing three learning systems: Section 2 presents a parameter learning algorithm based on Expectation Maximization, and Section 3 presents two algorithms for structure learning that exploit the first.

2 Learning LPADs Parameters with the EM Algorithm

An LPAD is composed of annotated disjunctive clauses C_i of the form

  h_i1 : Π_i1 ; ... ; h_in_i : Π_in_i :- b_i1, ..., b_im_i.

where h_i1, ..., h_in_i are logical atoms, b_i1, ..., b_im_i are logical literals, and Π_i1, ..., Π_in_i are real numbers in the interval [0, 1] representing probabilities that sum to 1.

An LPAD rule containing variables represents a number of “simple experiments”, one for each ground instantiation of the rule (obtained by replacing variables with constants of the domain), with the disjuncts as the possible outcomes. For each ground instantiation of a rule only one pair (h : Π) is chosen; in this way a normal, non-disjunctive logic program is obtained, called a possible world. A probability distribution is defined over the space of possible worlds by assuming independence among the selections made for each rule. The probability of a possible world is the product of the probabilities of the individual heads chosen in each rule. The probability of a query Q according to an LPAD is given by the sum of the probabilities of the possible worlds where the query is true.

Example 1. The following LPAD T encodes the result of tossing a coin depending on whether it is biased or not:

  C1 = heads(Coin) : 0.5 ; tails(Coin) : 0.5 :- toss(Coin), ¬biased(Coin).
  C2 = heads(Coin) : 0.6 ; tails(Coin) : 0.4 :- toss(Coin), biased(Coin).
  C3 = fair(coin) : 0.9 ; biased(coin) : 0.1.
  C4 = toss(coin) : 1.

This program models the fact that a fair coin lands on heads or on tails with probability 0.5, while a biased coin lands on heads with probability 0.6 and on tails with probability 0.4. The third clause states that a specific coin, coin, has probability 0.9 of being fair and 0.1 of being biased; the fourth states that coin is certainly tossed.

Each selection of a disjunct in a ground clause of an LPAD can be represented by the equation X_ij = k, where k ∈ {1, ..., n_i} indicates the head chosen and X_ij is a multivalued random variable whose indices i and j identify the clause and the grounding. A function f(X), built on a set of multivalued variables and taking Boolean values, can be represented by a Multivalued Decision Diagram (MDD), a rooted graph that has one level for each variable. Each node has one child for each possible value of the associated variable. The leaves store either 0 or 1, the possible values of f(X). Given values for all the variables X, an MDD can be used to compute the value of f(X) by traversing the graph starting from the root and returning the value associated with the leaf that is reached. An example of such a function is f(X) = (X_11 = 1) ∨ (X_21 = 2 ∧ X_31 = 1) ∨ (X_22 = 3 ∧ X_31 = 1). An MDD can be used to represent the set of selections over rules, and will have a path to a 1-leaf for each possible world where a query Q is true. It is often unfeasible to find all the worlds where the query is true, so inference algorithms find instead explanations for the query, i.e., sets of selections such that the query is true in all the worlds whose selection is a superset of them.
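To make the semantics of Example 1 concrete, the following minimal Python sketch (not part of the original paper) enumerates the selections of the four ground clauses, builds the corresponding possible worlds, and sums the probabilities of those in which heads(coin) is true; the atom encoding and the fixed evaluation order are simplifying assumptions that work because the program is acyclic.

```python
from itertools import product

# Ground clauses of Example 1: (list of (head, prob), body),
# where a body is a list of (atom, must_be_true) literals.
clauses = [
    ([("heads(coin)", 0.5), ("tails(coin)", 0.5)],
     [("toss(coin)", True), ("biased(coin)", False)]),          # C1
    ([("heads(coin)", 0.6), ("tails(coin)", 0.4)],
     [("toss(coin)", True), ("biased(coin)", True)]),           # C2
    ([("fair(coin)", 0.9), ("biased(coin)", 0.1)], []),         # C3
    ([("toss(coin)", 1.0)], []),                                 # C4
]

def world(selection):
    """Facts of the possible world given one head choice per clause.
    Bodyless clauses (C3, C4) are evaluated first; the program is acyclic,
    so a single pass in this fixed order is enough."""
    facts = set()
    for i in (2, 3, 0, 1):
        heads, body = clauses[i]
        if all((atom in facts) == positive for atom, positive in body):
            facts.add(heads[selection[i]][0])
    return facts

def prob(query):
    """P(query): sum of the probabilities of the worlds where it holds."""
    total = 0.0
    for sel in product(*(range(len(h)) for h, _ in clauses)):
        w = 1.0
        for i, k in enumerate(sel):
            w *= clauses[i][0][k][1]
        if query in world(sel):
            total += w
    return total

print(prob("heads(coin)"))  # 0.9*0.5 + 0.1*0.6 = 0.51
```

Enumerating all possible worlds is exponential in the number of ground clauses, which is exactly why the inference algorithms mentioned above resort to explanations and decision diagrams instead.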
Since MDDs split paths on the basis of the values of a variable, the branches are mutually disjoint, so a dynamic programming algorithm can be applied to compute the probability of a query by a summation. In practice one works on MDDs with a Binary Decision Diagram package, so multivalued variables have to be represented by means of Boolean variables.

The problem I faced is how to efficiently perform “parameter learning”, i.e., how to use training data to learn correct probabilities Π_ik. The technique applied exploits the EM (Expectation Maximization) algorithm over BDDs proposed in [9,8] and has been implemented in the system EMBLEM, for “EM over BDDs for probabilistic Logic programs Efficient Mining” [1,2,3]. EMBLEM takes as input a set of interpretations (sets of ground facts), each describing a portion of the domain of interest, and a theory (an LPAD). The user has to indicate which, among all predicates, are target predicates: the facts for these predicates form the queries for which an SLD-proof is computed; from these proofs a BDD is built, encoding the Boolean formula consisting of the disjunction of the explanations for the query. Then EMBLEM performs an EM cycle, in which the Expectation and Maximization steps are repeated until the log-likelihood of the examples reaches a local maximum. Expectations are computed directly over the BDDs.

EM is needed to determine the parameters Π_ik because estimating them requires the number of times each head h_ik is chosen; the information about which selection was used is unknown, so the “choice” variables are latent and these counts are the sufficient statistics. Decision diagrams make it possible to evaluate the expectations efficiently, since the set of selections used for the derivation of the examples can be represented as the set of paths from the root to the 1-leaf.
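The probability computation that BDD-based inference relies on, and over which EMBLEM evaluates its expectations, can be illustrated with a small sketch (not part of the original paper). The diagram below is a hand-built encoding for the query heads(coin) of Example 1, using one Boolean variable per relevant choice; because the two branches of every node are mutually exclusive, the probability of reaching the 1-leaf is obtained by a simple recursive summation, and it matches the 0.51 obtained above by world enumeration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    """Internal BDD node: prob is P(var = true); children are Node or a 0/1 leaf."""
    var: str
    prob: float
    high: Union["Node", int]   # child followed when var is true
    low: Union["Node", int]    # child followed when var is false

def bdd_prob(node):
    """Probability of reaching the 1-leaf. The two branches of a node are
    mutually disjoint, so their probabilities simply add."""
    if isinstance(node, int):                      # 0-leaf or 1-leaf
        return float(node)
    return node.prob * bdd_prob(node.high) + (1.0 - node.prob) * bdd_prob(node.low)

# Hand-built diagram for the query heads(coin):
#   X31 = "clause C3 selects fair" (0.9),
#   X11 = "clause C1 selects heads" (0.5),
#   X21 = "clause C2 selects heads" (0.6).
root = Node("X31", 0.9,
            high=Node("X11", 0.5, high=1, low=0),
            low=Node("X21", 0.6, high=1, low=0))
print(bdd_prob(root))  # 0.9*0.5 + 0.1*0.6 = 0.51
```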
3 Learning LPADs Structure

The first system developed for LPAD structure learning is SLIPCASE, for “Structure LearnIng of ProbabilistiC logic progrAmS with Em over bdds” [4]. It learns an LPAD by starting from an initial theory and performing a beam search in the space of refinements of the theory. The initial theory is inserted into the beam and, at each step, the theory with the highest log-likelihood is removed from the beam and the set of its refinements, allowed by the language bias, is built. The possible refinements are: the addition or removal of a literal in a clause, the addition of a clause with an empty body, and the removal of a clause. For each refinement an estimate of the log-likelihood of the data is computed by running a limited number of iterations of EMBLEM. The best theory found so far is updated and each refinement is inserted into the beam in order of log-likelihood.

I am now working on SLIPCOVER, an evolution of SLIPCASE which first searches the space of clauses and then the space of theories. SLIPCOVER performs a cycle for each predicate that can appear in the head of clauses, in which a beam search in the space of clauses is carried out: each clause (built according to a language bias) is tested on the examples for the predicate, its head parameters are learned with EMBLEM, and the log-likelihood of the data is used as its score. The clause is then inserted into one of two lists of promising clauses: a list of target clauses, for the predicates we want to predict, and a list of background clauses, for the other predicates. Then a greedy search in the space of theories is performed, in which each target clause is added to the current theory and the score is computed. If the score is greater than the current best, the clause is kept in the theory; otherwise it is discarded. Finally, parameter learning with EMBLEM is run on the target theory plus the clauses for background predicates. The two systems can learn general LPADs, including non-ground programs.
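The following sketch (not from the paper, with placeholder names) shows the shape of this greedy theory-search phase, under the simplifying assumption that a score(theory) function performs parameter learning and returns the log-likelihood of the data, as EMBLEM does in SLIPCOVER; the beam search over clauses and the language bias are left out.

```python
def greedy_theory_search(target_clauses, background_clauses, score):
    """Greedy second phase in the style of SLIPCOVER (illustrative sketch).
    `target_clauses` are assumed to be sorted by the score they obtained
    during the clause search; `score(theory)` is a placeholder that learns
    the parameters of `theory` and returns the log-likelihood of the data."""
    theory, best = [], score([])
    for clause in target_clauses:
        candidate = theory + [clause]
        s = score(candidate)
        if s > best:                 # keep the clause only if it improves the score
            theory, best = candidate, s
    # Final parameter learning is run on the target theory plus the
    # clauses for the background predicates.
    final_theory = theory + background_clauses
    return final_theory, score(final_theory)

# Toy usage with a dummy score that simply prefers longer theories:
print(greedy_theory_search(["c1", "c2"], ["b1"], score=len))
```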
4 Experiments

We evaluated EMBLEM on the real-world datasets IMDB, Cora, UW-CSE, WebKB, MovieLens and Mutagenesis, measuring its performance by the area under the Precision-Recall (PR) curve and the area under the ROC curve, in comparison with five probabilistic logic learning systems. It achieves higher areas in all cases except two and uses less memory, allowing it to solve larger problems, often in less time. We tested SLIPCASE on the real-world datasets HIV, UW-CSE and WebKB, and evaluated it with the same metrics in comparison with [11] and [12], obtaining the highest area values under both. We tested the second structure learning algorithm on HIV, UW-CSE, WebKB, MovieLens, Mutagenesis and Hepatitis and evaluated it with the same metrics in comparison with SLIPCASE, [11] and [12]; it outperformed them in all cases. In the future we plan to test the systems on other datasets and to experiment with other search strategies.

References

1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic programs. In: Proceedings of the 26th Italian Conference on Computational Logic (CILC 2011), Pescara, Italy, 31 August - 2 September 2011. pp. 229–243. No. 810 in CEUR Workshop Proceedings, Sun SITE Central Europe, Aachen, Germany (2011), http://www.ing.unife.it/docenti/FabrizioRiguzzi/Papers/BelRig-CILC11.pdf
2. Bellodi, E., Riguzzi, F.: Expectation Maximization over binary decision diagrams for probabilistic logic programs. Intell. Data Anal. 16(6) (2012), to appear
3. Bellodi, E., Riguzzi, F.: Experimentation of an expectation maximization algorithm for probabilistic logic programs. Intelligenza Artificiale (2012), to appear
4. Bellodi, E., Riguzzi, F.: Learning the structure of probabilistic logic programs. In: Inductive Logic Programming: 21st International Conference, ILP 2011, London, UK, July 31 - August 3, 2011, Revised Papers. LNCS, Springer (2012), to appear
5. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: International Joint Conference on Artificial Intelligence. pp. 2462–2467. AAAI Press (2007)
6. Gutmann, B., Kimmig, A., Kersting, K., De Raedt, L.: Parameter learning in probabilistic databases: A least squares approach. In: European Conference on Machine Learning and Knowledge Discovery in Databases. LNCS, vol. 5211, pp. 473–488. Springer (2008)
7. Gutmann, B., Thon, I., De Raedt, L.: Learning the parameters of probabilistic logic programs from interpretations. In: European Conference on Machine Learning and Knowledge Discovery in Databases. LNCS, vol. 6911, pp. 581–596. Springer (2011)
8. Inoue, K., Sato, T., Ishihata, M., Kameya, Y., Nabeshima, H.: Evaluating abductive hypotheses using an EM algorithm on BDDs. In: Boutilier, C. (ed.) Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI). pp. 810–815. Morgan Kaufmann Publishers Inc. (2009)
9. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algorithm by BDDs. In: Late Breaking Papers of the International Conference on Inductive Logic Programming. pp. 44–49 (2008)
10. Kimmig, A., Demoen, B., De Raedt, L., Santos Costa, V., Rocha, R.: On the implementation of the probabilistic logic programming language ProbLog. Theory and Practice of Logic Programming 11(2-3), 235–262 (2011)
11. Kok, S., Domingos, P.: Learning Markov logic networks using structural motifs. In: International Conference on Machine Learning. pp. 551–558. Omnipress (2010)
12. Meert, W., Struyf, J., Blockeel, H.: Learning ground CP-Logic theories by leveraging Bayesian network learning techniques. Fundam. Inform. 89(1), 131–160 (2008)
13. De Raedt, L., Kersting, K., Kimmig, A., Revoredo, K., Toivonen, H.: Compressing probabilistic Prolog programs. Mach. Learn. 70(2-3), 151–168 (2008)
14. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1-2), 107–136 (2006)
15. Riguzzi, F.: Extended semantics and inference for the Independent Choice Logic. Logic Journal of the IGPL 17(6), 589–629 (2009)
16. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: International Conference on Logic Programming. pp. 715–729. MIT Press (1995)
17. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated disjunctions. In: International Conference on Logic Programming. LNCS, vol. 3131, pp. 195–209. Springer (2004)