=Paper=
{{Paper
|id=None
|storemode=property
|title=Assertion Role in a Hybrid Link Prediction Approach through Probabilistic Ontology
|pdfUrl=https://ceur-ws.org/Vol-1041/ontobras-2013_paper33.pdf
|volume=Vol-1041
|dblpUrl=https://dblp.org/rec/conf/ontobras/ArmadaRLC13
}}
==Assertion Role in a Hybrid Link Prediction Approach through Probabilistic Ontology==
<pdf width="1500px">https://ceur-ws.org/Vol-1041/ontobras-2013_paper33.pdf</pdf>
<pre>
Assertion Role in a Hybrid Link Prediction Approach through
                    Probabilistic Ontology
         Marcius Armada1 , Kate Revoredo1 , José Eduardo Ochoa Luna2 ,
                           Fabio Gagliardi Cozman 3
                     1
                         Departamento de Informática Aplicada, Unirio
                          Av. Pasteur, 458, Rio de Janeiro, RJ, Brazil
                               2
                           Universidad Católica San Pablo
             Quinta Vivanco s/n , Urb. Campiña Paisajista, Arequipa, Perú
                     3
                     Escola Politécnica, Universidade de São Paulo,
                  Av. Prof. Mello Morais 2231, São Paulo - SP, Brazil
      {marcius.oliveira, katerevoredo}@uniriotec.br, eduardo.ol@gmail.com, fgcozman@usp.br

    Abstract. Link prediction in a network is mostly based on information about the
    neighborhood topology of the nodes. Recently, the interest for hybrid link pre-
    diction approaches that combine topology information with information about
    the network individuals, has grown. However, considering the whole set of in-
    dividuals may not be necessary and sometimes not even suitable. Therefore,
    mechanisms to automatically discover the relevant set of individuals are de-
    manding. In this paper, we encompass this problem by proposing an algorithm
    that combines structure and semantic metrics to find the set of relevant indi-
    viduals. We empirically evaluate this proposal analyzing the assertion role of
    these individuals when predicting a link through a probabilistic ontology.

1. Introduction
Many social, biological, and information systems can be well described by networks,
where nodes represent objects (individuals), and links denote the relations or interactions
between nodes. These networks have a dynamic behavior, thus nodes and links can appear
and disappear rapidly. In this scenario, predicting a possible link in a network, this is
predicting a future occurrence of a not yet existing relationship, is an interesting issue
that has received significant attention. For instance, one may be interested on finding
potential friendship between two persons in a social network, or a potential collaboration
between two researchers. In short, link prediction aims at predicting whether two nodes
should be connected given previous information about their relationships or interests.
        Hasan and Zaki [Al Hasan and Zaki 2011] survey representative link prediction
methods, classifying them in three groups. In the first group, feature-based meth-
ods construct pairwise features to use in classification. The majority of the features
are extracted from the graph topology by computing similarity based on the neighbor-
hood of the pair of nodes, or based on ensembles of paths between the pair of nodes
[Liben-Nowell and Kleinberg 2007]. Semantic information has also been used as fea-
tures [Sachan and Ichise 2011, Wohlfarth and Ichise 2008]. The second group includes
probabilistic approaches that model the joint probability for entities in a network by
Bayesian graphical models [Wang et al. 2007]. The third group employs linear algebraic
approaches that compute the similarity between nodes in a network by rank-reduced sim-
ilarity matrices [Kunegis and Lommatzsch 2009].
         In [Ochoa-Luna et al. 2013], an approach for link prediction that combines
Bayesian graphical models and semantic-based features was proposed. To repre-
sent semantic-based features, a probabilistic ontology represented with the proba-
bilistic description logic called Credal ALC (CRALC) [Cozman and Polastro 2009]
was used.      This probabilistic description logic extends the popular logic ALC
[Schmidt-Schauß and Smolka 1991] with probabilistic inclusions. These are sentences,
such as P (Professor|Researcher) = 0.4, specifying the probability that an element of
the domain is a Professor given that it is a Researcher. Exact and approximate inference
algorithms for CRALC have been proposed [Cozman and Polastro 2009], using ideas in-
herited from the theory of Relational Bayesian Networks [Jaeger 2002].
        When using semantic features, information about the individuals of the domain
are considered. However, information about all individuals may not be necessary and
sometimes not even suitable. Therefore, mechanisms that automatically select the relevant
individuals are important. In [Ochoa-Luna et al. 2013], a first discussion about this matter
was done, where structure features were considered to select the most relevant individuals.
In this paper, we extend this idea and evaluate alternative methods for selecting the set of
relevant individuals. We empirically evaluate our proposal using a probabilistic ontology,
represented in CRALC, for modeling the domain.
         The paper is organized as follows. Section 2 reviews basic concepts of probabi-
listic description logics and link prediction. Our proposal for selecting the most relevant
individuals related to the two being analyzed for link prediction is presented in Section 3.
Section 4 describes experiments, and Section 5 concludes the paper and discusses some
future work.

2. Background
This section briefly review probabilistic description logics and link prediction methods,
with a focus on concepts and techniques that are later used.

2.1. Probabilistic Description Logics and CRALC
Description logics (DLs) form a family of representation languages that are typically de-
cidable fragments of first order logic (FOL) [Baader and Nutt 2002]. Knowledge is ex-
pressed in terms of individuals, concepts, and roles. The semantics of a description is
given by a domain D (a set) and an interpretation ·I (a functor). Individuals represent ob-
jects through names from a set NI = {a, b, . . .}. Each concept in the set NC = {C, D, . . .}
is interpreted as a subset of a domain D. Each role in the set NR = {r, s, . . .} is inter-
preted as a binary relation on the domain. An assertion states that an individual belongs
to a concept of that a pair of individuals satisfies a role. An ABox is a set of assertions.
       A popular description logic is ALC [Schmidt-Schauß and Smolka 1991]; given
its importance to our proposal, we briefly review it here. Constructors in ALC are con-
junction (C u D), disjunction (C t D), negation (¬C), existential restriction (∃r.C), and
value restriction (∀r.C). Concept inclusions and definitions are denoted respectively by
C v D and C ≡ D, where C and D are concepts. Concept C t ¬C is denoted by >, and
concept C u ¬C is denoted by ⊥. The semantics of these constructs is given by a domain
D and an interpretation I as follows: each individual a is mapped into an element aI ;
each concept C is mapped into a subset C I of the domain; each role r is mapped into a
binary relation rI in the domain; moreover,
     • (C u D)I = C I ∩ DI ;
     • (C t D)I = C I ∪ DI ;
     • (¬C)I = D\C I ;
     • (∃r.C)I = {x ∈ D|∃y : (x, y) ∈ rI ∧ y ∈ C I };
     • (∀r.C)I = {x ∈ D|∀y : (x, y) ∈ rI → y ∈ C I }.
Finally, C v D is interpreted as C I ⊆ DI and C ≡ D is interpreted as C I = DI .
       An example may be useful. Consider the following concept definition:

                   Researcher ≡ Person u ∃hasPublication.BibItem                          (1)

specifying that researchers are individuals who are persons and who have published a
bibliographic item.
        Several probabilistic description logics have appeared in the literature
[Lukasiewicz and Straccia 2008, Klinov 2008]. An example is the probabilistic descrip-
tion logic CRALC , which is a probabilistic extension of the description logic ALC. It
keeps all constructors of ALC, but only allows concept names on the left hand side of in-
clusions/definitions. Additionally, in CRALC one can have probabilistic inclusions such
as P (C|D) = α or P (r) = β for concepts C and D, and for role r (in this paper we only
consider equality in probabilistic inclusions/definitions). If the interpretation of D is the
whole domain, then we simply write P (C) = α. The semantics of these inclusions is
roughly (a formal definition can be found in Ref. [Cozman and Polastro 2009]) given by:

                              ∀x ∈ D : P (C(x)|D(x)) = α,

                            ∀x ∈ D, y ∈ D : P (r(x, y)) = β.
We assume that every terminology is acyclic: no concept uses itself (where “use” is the
transitive closure of “directly use”; we say that C directly uses D if D appears in the
right hand side of an inclusion/definition, or in the conditioning side of a probabilistic
inclusion). This assumption allows one to represent any terminology T through a directed
acyclic graph. Such a graph, denoted by G(T ), has each concept name and role name as
a node, and if a concept C directly uses concept D, that is if C and D appear respectively
in the left and right hand sides of an inclusion/definition, then D is a parent of C in G(T ).
Each existential restriction ∃r.C and each value restriction ∀r.C is added to the graph
G(T ) as a node, with an edge from r and C to each restriction directly using it. Each
restriction node is a deterministic node in that its value is completely determined by its
parents.
        Consider, as an example, a terminology TR containing the sentence in Expression
(1), plus P (Person) = 0.2, P (BibItem) = 0.6, P (hasPublication) = 0.1; its graph is
depicted in the left of Figure 1.
        The semantics of CRALC is based on probability measures over the space of in-
terpretations, for a fixed domain. To make sure a terminology specifies a single proba-
bility measure, a number of additional assumptions are adopted: the domain is assumed
                        
                      hasPublication
                        
       Person 
              BibItem                                                           
       
                                
                                              hP(b, b)     hP(b, p)     hP(p, p)     hP(p, b)
                                                                    
                     XX    
           @           Xz
                        X 
                            ∃                  @                @
           @R
             @          
                                        P(b) BI(b)@XX      P(p) BI(p)@
           Researcher 9                         XX@ XX           XX@
                                            @         
                                                       R ? XX
                                                       X
                                                       @
                                                       z
                                                       X        @       
                                                                          X
                                                                          R
                                                                          @
                                                                          z?
                                                                          X
                                                                X
                                                       ∃(b)  @         - ∃(p)
                                                                  XX
                                                                
                                                            )
                                                                  X
                                                   9               9
                                               @
                                               R
                                               @                  R
                                                                  @
                                                  R(b)                     R(p)
                                                                     
                                                                         

     Figure 1. Graph G(TR ) (left Figure) and Bayesian network over indicator functions
     of assertions, produced by grounding the terminology TR (right figure)


finite, fixed, and known; the unique-name assumption and the rigidity assumption for
individuals (as usual in first-order probabilistic logic [Fagin et al. 1990]) are assumed; a
single concept name appears in the left hand side of any inclusion or definition and in the
conditioned side of any probabilistic inclusion; and finally a Markov condition imposes
independence of any grounding of concept/role conditional on the groundings of its cor-
responding parents in the graph G(T ) [Cozman and Polastro 2009]. Given these assump-
tions, a set of sentences T in CRALC defines a relational Bayesian network [Jaeger 2002]
whose underlying graph is exactly G(T ).
        Considering the domain D = {bob, paper} and the set of assertions A = {
Person(bob), Researcher(bob), BibItem(paper), hasPublication(bob, paper) }, inferences
such as P (Ao (a0 )|A) can be computed by grounding the terminology, where grounding
means that all existing variables must be replaced by constants. In our case they are re-
placed by the individuals in the domain and the grounding process generates a “slice” for
each individual. The right Bayesian network in Figure 1 shows a grounding for termi-
nology T where two slices, one for individual bob and another for individual paper, are
built (for the sake of space, names are abbreviated). At first sight the resulting Bayesian
network may seem odd, with nodes like Bibitem(bob) or Person(paper), but since we
are not based on the “closed world” assumption then anything we not currently known
can be either true or false. For large domains, exact probabilistic inference is in gen-
eral quite hard due to the complexity of the resulting grounded Bayesian network but
variational algorithms that approximate such probabilities are available in the literature
[Cozman and Polastro 2009] in an attempt to deal whith this problem.

2.2. Link Prediction
The      task    we     are   interested    in     can     be    defined     as   follows
[Liben-Nowell and Kleinberg 2007]. One is given a network (a graph) G consist-
ing of a set of nodes V (represented by letters a, b, etc) and a set of edges E, where an
edge represents an interaction between nodes. Interactions may be tagged with times,
and the link prediction problem may be one of predicting the existence of edges in a time
interval, given the edges observed in another time interval. Here we are interested in a
static problem where we are given nodes and edges, except for the edge between two
nodes a and b, and we must then predict whether there is an edge between a and b.
        Many different tools are used for link prediction, some of which, like matrix fac-
torization, are related to the massive size of datasets; other tools are directly related to the
existence of links between nodes. One can use classifiers that, based on network features
and measures, classify each tentative link as existing or not [Al Hasan and Zaki 2011];
one may also resort to collective classification over the whole set of possible links
[Getoor and Diehl 2005]. Several such techniques are based on computing measures
of proximity/similarity between nodes in a network [Liben-Nowell and Kleinberg 2007,
Lü and Zhou 2011].
        Other approaches consider semantic features. The degree of semantic similarity
among entities can be useful to predict links that might be missed by simple topological or
frequency-based features [Wang et al. 2007]. One way of capturing semantic similarity is
by considering documents related to nodes in the network. A simple example of semantic
similarity is the keyword match count between two authors [Hasan et al. 2006]. A more
sophisticated method makes use of the well-known techniques such as TFIDF feature
vector representation and the cosine measure to compute similarity [Wang et al. 2007].
The latter measure, for documents d1 and d2 , is obtained by creating vector representations
→
−            →
             −
V (d1 ) and V (d2 ) that countain word counts weighted by their TFIDF (Term Frequency
- Inverse Document Frequency) measures. The similarity measure is then
                                                →
                                                −          →
                                                           −
                                                V (d1 ) · V (d2 )
                            cosine(d1 , d2 ) = →−         →
                                                          −         ,
                                               | V (d1 )|| V (d2 )|
where the dot product is used in the numerator and the Euclidean distance is used in
the denominator. To recall, the TFIDF weighting scheme assigns to term t a weight in
document d given by TFIDFt,d = TFt,d × IDFt , where TFt,d is the term frequency in d,
                                                                      N , for N the total
and IDFt is the inverse document frequency of t, given by IDFt = log DF
                                                                        t
number of documents and DFt the number of documents containing the term.
        Approaches to link prediction can be understood not only by considering the
kinds of tools employed, but also by examining the model that is used to represent
the network as a whole. Typically one assumes some sort of probabilistic mech-
anism that at least partially explains the existence of edges, perhaps together with
domain-specific knowledge (for instance, domain theories about human relationships)
[Goldenberg et al. 2010, Newman 2003]. Thus the simplest network model is the Erdös-
Rènyi random graph: each pair of nodes can be connected with identical probability.
More sophisticated models resort to hierarchical specification of link probabilities, or to
grouping of nodes within blocks of varying probability.
         One way to capture the probabilistic structure of a network is through graph-based
models such as Markov random fields or Bayesian networks [Pearl 1988]. However, these
languages are well suited to express independence relations between a fixed set of random
variables; when nodes and links are to be dealt within graphs, it is best to consider mod-
eling languages that can specify Markov random fields and Bayesian networks over rela-
tional structures. Indeed many proposals for link prediction resort to such languages, from
seminal work by Getoor et al [Getoor et al. 2002] and Taskar et al [Taskar et al. 2003].
The presence of relational structure lets one to represent properties of individuals nodes,
of links, of communities; one can then compute the probability of specific links, and
estimate such probabilities from data.
        In [Ochoa-Luna et al. 2013], this modeling strategy was followed using the pro-
babilistic description logics CRALC. The interest in models based on description log-
ics is justified given recent results on the importance of ontologies in organizing in-
formation that can be used in link prediction [Aljandal et al. 2009, Caragea et al. 2009,
Thor et al. 2011]. While other link prediction implementation usually focus in one kind
of feature, the one using CRALC showed to be able to mix different features such as se-
mantic, numeric and topological. Being a versatile solution doesn’t make it easier to be
modeled than other solutions, but as a novel approach there is still room for evolution and
further experimentation.

3. Assertion Role in Link Prediction through a Probabilistic Ontology
Given a network (a graph) G consisting of a set of nodes V and a set of edges E, where an
edge represents an interaction between nodes. For a link prediction task considering se-
mantic features, we follow the approach proposed in [Ochoa-Luna et al. 2013] and model
the domain using a probabilistic ontology (O) represented in CRALC. Nodes in G are
individuals of a concept C in O and edges are instances of a role R in O. Thus, the net-
work G is built encompassing assertions about concept C and role R. For instance, in
a co-authorship network, assertions for concept Researcher are represented by nodes and
assertions for role sharePublication are represented by relationships between two nodes.
Figure 2 depicts a network for the assertions shown in Figure 3.


                                             Researcher(john). Researcher(ann). Researcher(carl).
                                             Researcher(emily). sharePublication(john, ann).
                                             sharePublication(john, carl). sharePublication(carl, emily).
    Figure 2. Network encom-
    passing assertions of the
    ABox in Figure 3.                            Figure 3. Example of an ontology ABox.


       The probabilistic ontology O can model the domain widely, thus having other
concepts and roles beyond the ones encompassing the network. For instance, an ontology
describing the co-authorship domain is shown in Figure 4.

     TBox:
     P (Publication)              = 0.3
     P (sharePublication)         = 0.22
     P (hasSameInstitution)       = 0.14
     Researcher                   ≡ Person u ∃hasPublication.BibItem
     P (PublicationCollaborator   | Researcher u ∃sharePublication.Researcher) = 0.91

     ABox:                           Researcher(john). Researcher(ann). Researcher(carl).
                                     Researcher(emily). sharePublication(john, ann).
                                     sharePublication(john, carl). sharePublication(carl, emily).
                                     Publication(p1). Publication(p2)

             Figure 4. A probabilistic ontology for the co-authorship domain.

      Predicting a link between two nodes a and b in a network G concerns evaluating
whether an edge between a and b should be included. In the semantic link prediction task,
where the domain is modeled through CRALC, the problem can be rewritten as evaluating
if the considered role between individuals a and b may exist in a given ontology. Thus, the
semantic link prediction task considered in this paper can be described as: compute the
probability of an assertion concerning the role that provides the semantic of relationships
in the network G given an ABox of asserted concepts and roles of the domain.
        Because domain knowledge is expressed with CRALC, questions about probabi-
lity of assertions can be answered by inference in CRALC. For instance, the question
“what is the probability of Emily and Ann share a publication given some information
about the domain?” can be translated into P (sharePublication(emily, ann)|A), where A
represents the ABox with assertions about the domain. If this probability is higher than a
suitable threshold then the assertion may be considered true and a link introduced in G.
        Intuitively, the inference quality of any assertion’s probability rests in the used
assertions contained in A. While one can suppose that more assertions leads to more
accurate calculated probabilities, this is not always true. Some individuals may not be
related to the ones being analyzed and therefore their assertions may not impact the eval-
uation. Thus it is unnecessary to consider evidence (assertion) about them. Moreover, in
some case may even be impractical to reason about all individuals of the domain due to
limits in computational resources or long response times. Hence it is important to filter
out assertions and to focus on the most relevant ones.
        We are interested in predicting a relationship between two specific nodes, a and b.
Therefore, we argue that assertions directly related to these two individuals, and to other
individuals strongly related to them in the network, are more relevant for link prediction
than assertions on other individuals in the network. The link prediction algorithm (see
Algorithm 1) will not only be scalable but will be more accurate if we only consider
assertions about a, b and the individuals strongly related to them in our inferences. To do
so, we must specify the set A(a, b) of elements of the domain that are deemed strongly
related to a and b.

   Algorithm 1: Algorithm for link prediction (adapted from
   [Ochoa-Luna et al. 2013]).
      Require: a network G, an ontology O, a role r̂ representing links in the network, a
          concept Ĉ specifying the nodes in the network and a threshold γ.
      Ensure: a set of predicted links L
       1: initialize L = ∅;
       2: for all pair of instances (a, b) of nodes in G do
       3:    if there is no link between nodes a and b in G then
       4:       find A(a, b);
       5:       E=assertions about A(a, b);
       6:       infer probability P (r(a, b)|E) using the relational Bayesian network created
                from the ontology O;
       7:       if P (r(a, b)|E) > γ then
       8:          add link between a and b to L.
       9:       end if
      10:    end if
      11: end for
        In [Ochoa-Luna et al. 2013] the strategy adopted to define A(a, b) was to consider
nodes along paths between a and b. In this paper, we argue that not only structural metrics
can define the best set A(a, b) and we evaluate the performance of structural and semantic
approaches for selecting the most relevant individuals for a link prediction task. The
following approaches were considered:
     i) A(a, b) = Aadj (a, b), where Aadj (a, b) = adjacent(a) ∪ adjacent(b). Defines
        A(a, b) as the set of nodes adjacent to a union the set of nodes adjacent to b.
    ii) A(a, b) = AP adj (a, b), where AP adj (a, b) = A0 (a, b) ∪i∈A0 (a,b) adjacent(i) and
        A0 (a, b) = {a} ∪ {b} ∪ path(a, b). Defines A(a, b) as the set of all nodes in the
        path between a and b union their adjacent nodes and the adjacents of a and b.
   iii) A(a, b) = fsemantic (Aadj (a, b)). Defines A(a, b) as the set of nodes contained in
        Aadj (a, b) that are most semantically related to a and b considering a semantic
        function fsemantic .
   iv) A(a, b) = fsemantic (AP adj (a, b)). Defines A(a, b) as the set of nodes contained in
        AP adj (a, b) that are most semantically related to a and b considering a semantic
        function fsemantic .
        An experimental evaluation was conducted and will be described in the next sec-
tion to evaluate the benefits of these metrics. Moreover, a discussion around the role of
the assertions about individuals for the semantic link prediction task is also presented.

4. Experiments
Experiments have been conducted to evaluate the benefits of considering structural and
semantic metrics for selecting the most relevant individuals for the semantic link predic-
tion task. A real world data repository, the Lattes curriculum platform, was used. This
section reports the steps involved in this process and the results found.

4.1. Scenario Description
The Lattes platform is the public repository of brazilian scientific curricula that consists of
approximately a million registered documents. Information is encoded in HTML format,
ranging from personal information such as name and professional address to publication
lists, administrative tasks, research areas, research projects and advising/advisor informa-
tion. There is implicit relational information in these web pages, for instance collaboration
networks are built by advising/adviser links, shared publications, and so on.
        To perform experiments we have randomly selected eight thousand researchers
and their relationships from the Lattes platform. Assertions were extracted concerning
these researchers. For instance, if a parser finds that a researcher John has two publica-
tions (p1 , p2 ) and a researcher Ann has two (p2 , p3 ), where p2 was done in collaboration
with John, then assertions, as the following, are extracted:
         Researcher(john), Researcher(ann),
         Publication(p1 ), Publication(p2 ), Publication(p3 ),
         hasPublication(john, p1 ), hasPublication(john, p2 ),
         hasPublication(ann, p2 ), hasPublication(ann, p3 )
         sharePublication(john, ann).
        A probabilistic ontology was then learned using algorithms in the literature
[Ochoa-Luna et al. 2011, Revoredo et al. 2010]. This ontology is comprised by 24 pro-
babilistic inclusions and 17 concept definitions.
        The concept Researcher indicates whether an element of the domain is a node
in the network (hence for each assertion of concept Researcher a node exists in the net-
work) and the role sharePublication indicates whether a pair of elements of the domain
are linked in the network (hence for each assertion of role sharePublication a link exists
in the network). Using this data, link probabilities were computed through inference in
the CRALC ontology.

4.2. Methodology
In this section, we describe our main design choices to run experiments. Given the 8000
selected researchers, there exist 31996000 possible link relationships. To perform link
prediction we have considered collaborations based on co-authorship on publications
(there are 2837206 publications). After analysing these publications we identified 95011
true positive links among researchers based on co-authorship. From the available data
we randomly selected links so that the used dataset in the experiments was comprised by
1000 positive links and 1000 negative links (balanced datasets).
        Although we can use probabilistic inference to decide whether there is a link be-
tween two nodes, to perform comparisons among the structural and semantic metrics des-
cribed in Section 3 we resort to a classification algorithm approach through the Logistic
regression algorithm.
       Beyond the 4 metrics described in Section 3 we also considered:
    v) the metric proposed in [Ochoa-Luna et al. 2013]: A(a, b) = Apath (a, b), where
       Apath (a, b) defines the set of nodes in the paths between a and b .
   vi) A(a, b) = random selection of 10 nodes in the network.
       The metric v will permit us compare our proposal with the previous one presented
in [Ochoa-Luna et al. 2013]. For this metric, since computing all paths (∞) is expensive,
we follow Ochoa et al. and only considered paths of length at most four (i ≤ 4).
       The semantic feature we considered was keyword match. For each researcher
a document with the words appearing in the title of his publications (removing stop
words) is considered. Thus, a researcher is represented as a set of words, which allows
us to compute a semantic feature: the keyword match count between two researchers
[Hasan et al. 2006]. Using this feature we were able to select the top 10 researchers with
the most words in common with a an b.
       Finally, the probability P (r(x, y)|E), given by our probabilistic description logic
model, is used as a numerical feature in the classification model, in order to investigate
whether it can improve the classification approach for link prediction.

4.3. Results
In order to evaluate suitability of our approach in predicting co-authorships in the Lattes
dataset, several experiments were conducted. Each metric, through the probabilistic logic
scores found, has been considered as isolated features in our clasification algorithm. After
     Table 1. Classification results for dataset Lattes on accuracy (%) for baseline fea-
     tures used for selecting individuals used for generating assertions for inference
     in CRALC : metric i, metric ii, metric iii, metric iv, metric v, metric vi.

           Feature                 Lattes (acc.) Avg(#) of selected individuals
           CR ALC + metric i         99.93%         501
           CR ALC + metric ii        99.86%         545
           CR ALC + metric iii       99.88%         10
           CR ALC + metric iv        99.65%         10
           CR ALC + metric v         92.41%         26
           CR ALC + metric vi        71.14%         10

a ten-fold cross validation process, the classification algorithm yielded results on accuracy
for the dataset which are depicted in Table 1.
        The results shows us that randomly selecting individuals for assertion generation
(metric vi) obtained the worse accuracy in comparison to the other metrics with only 71%
while all the other obtained accuracies greater than 90%. Thus, it is important to use the
best possible assertions in the inference.
        All other results show little differences in accuracy between each other but those
metrics which don’t use the semantic feature (metric i and ii) needed about 50 times more
individuals to obtain near the same results. This demonstrates that the quality of the
selected individuals, using the semantic feature, and the assertions generated from them
were able to keep the CRALC link prediction algorithm scalable and the quality of the
predictions high.

5. Conclusion
In this paper, we have evaluated the role of assertions about individuals for the semantic
link prediction task. We follow the approach introduced in [Ochoa-Luna et al. 2013] and
considered a probabilistic ontology, represented with the probabilistic description logic
CR ALC, for modeling the domain. Thus, given a collaborative network, interests and
graph features are encoded through the probabilistic ontology.
        To predict links, probabilistic inference is used. Structural and semantic metrics
are combined in order to select the most relevant individuals for the prediction link task.
Therefore, only the necessary individuals are used and results have shown the importance
of selecting the best individuals from the avaiable ones. Moreover, this approach makes
the proposal scalable. Our proposal was evaluated on an academic domain, where links
among researchers were predicted and was able to attain accuracies greater than 90% as
shown in Table 1.
        Compared to previous work, our approach employs a rich ontology (as opposed
to simple is-a terminologies) that can encode substantial information about the domain.
Hierarchical structure can be encoded together with knowledge about specific nodes in
a network — we plan to explore richer ontologies in the future. Our proposal attains
better scalability than previous proposals that have tried to explore probabilistic relational
models for similar purposes but we plan to experiment with other new and state-of-the-
art selection algorithms in the search for the best set of assertions to be used in the link
prediction task.

6. Acknowledgment
This work is being accomplished in the context of the “Infrastructure for the Manage-
ment of Scientific Experiments in Computational Modeling” project, granted by CNPq,
No. 559998/2010-4.

References
Al Hasan, M. and Zaki, M. J. (2011). A survey of link prediction in social networks. In
   Social network data analytics, pages 243–275. Springer.
Aljandal, W., Bahirwani, V., Caragea, D., and Hsu, H. (2009). Ontology-aware classifi-
   cation and association rule mining for interest and link prediction in social networks.
   In AAAI 2009 Spring Symposium on Social Semantic Web: Where Web 2.0 Meets Web
   3.0, Standford,CA.
Baader, F. and Nutt, W. (2002). Basic description logics. In Description Logic Handbook,
  pages 47–100. Cambridge University Press.
Caragea, D., Bahirwani, V., Aljandal, W., and Hsu, W. H. (2009). Ontology-based link
  prediction in the livejournal social network. In SARA’09, pages 1–1.
Cozman, F. G. and Polastro, R. B. (2009). Complexity analysis and variational inference
  for interpretation-based probabilistic description logics. In Conference on Uncertainty
  in Artificial Intelligence.
Fagin, R., Halpern, J. Y., and Megiddo, N. (1990). A logic for reasoning about probabili-
  ties. Information and Computation, 87:78–128.
Getoor, L. and Diehl, C. P. (2005). Link mining: a survey. SIGKDD Explor. Newsl.,
  7(2):3–12.
Getoor, L., Friedman, N., Koller, D., and Taskar, B. (2002). Learning probabilistic models
  of link structure. Journal of Machine Learning Research, 3:679–707.
Goldenberg, A., Fienberg, S. E., Zheng, A. X., and Airoldi, E. M. (2010). A survey of
  statistical network models.
Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. (2006). Link prediction using super-
  vised learning. In In Proc. of SDM 06 workshop on Link Analysis, Counterterrorism
  and Security.
Jaeger, M. (2002). Relational Bayesian networks: a survey. Linkoping Electronic Articles
   in Computer and Information Science, 6.
Klinov, P. (2008). Pronto: A non-monotonic probabilistic description logic reasoner. In
   The Semantic Web: Research and Applications, pages 822–826. Springer.
Kunegis, J. and Lommatzsch, A. (2009). Learning spectral graph transformations for link
  prediction. In Proceedings of the 26th Annual International Conference on Machine
  Learning, pages 561–568. ACM.
Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social
   networks. Journal of the American society for information science and technology,
   58(7):1019–1031.
Lü, L. and Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A:
   Statistical Mechanics and its Applications, 390(6):1150–1170.
Lukasiewicz, T. and Straccia, U. (2008). Managing uncertainty and vagueness in descrip-
  tion logics for the semantic web. Web Semantics: Science, Services and Agents on the
  World Wide Web, 6(4):291–308.
Newman, M. E. (2003). The structure and function of complex networks. SIAM review,
  45(2):167–256.
Ochoa-Luna, J. E., Revoredo, K., and Cozman, F. G. (2011). Learning probabilistic
  description logics: A framework and algorithms. In Batyrshin, I. and Sidorov, G.,
  editors, Advances in Artificial Intelligence, volume 7094 of Lecture Notes in Computer
  Science, pages 28–39. Springer.
Ochoa-Luna, J. E., Revoredo, K., and Cozman, F. G. (2013). Link prediction using a
  probabilistic description logic. Journal of the Brazilian Computer Society, pages 1–
  13.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: networks of plausible
  inference. Morgan Kaufman.
Revoredo, K., Ochoa-Luna, J., and Cozman, F. (2010). Learning terminologies in pro-
  babilistic description logics. In da Rocha Costa, A., Vicari, R., and Tonidandel, F.,
  editors, Advances in Artificial Intelligence SBIA 2010, volume 6404 of Lecture Notes
  in Computer Science, pages 41–50. Springer / Heidelberg, Berlin.
Sachan, M. and Ichise, R. (2011). Using semantic information to improve link prediction
  results in network datasets. International Journal of COmputer Theory and Engeneer-
  ing, 3:71–76.
Schmidt-Schauß, M. and Smolka, G. (1991). Attributive concept descriptions with com-
  plements. Artificial intelligence, 48(1):1–26.
Taskar, B., Wong, M.-F., Abbeel, P., and Koller, D. (2003). Link prediction in relational
  data. In Proceedings of the 17th Neural Information Processing Systems (NIPS).
Thor, A., Anderson, P., Raschid, L., Navlakha, S., Saha, B., Khuller, S., and Zhang, X.-
  N. (2011). Link prediction for annotation graphs using graph summarization. In The
  Semantic Web–ISWC 2011, pages 714–729. Springer.
Wang, C., Satuluri, V., and Parthasarathy, S. (2007). Local probabilistic models for link
  prediction. In Proceedings of the 2007 Seventh IEEE International Conference on Data
  Mining, ICDM ’07, pages 322–331, Washington, DC, USA. IEEE Computer Society.
Wohlfarth, T. and Ichise, R. (2008). Semantic and event-based approach for link pre-
  diction. In Proceedings of the 7th International Conference on Practical Aspects of
  Knowledge Management.

</pre>