=Paper=
{{Paper
|id=None
|storemode=property
|title=An Experimental Evaluation of a Scalable Probabilistic Description Logic Approach for Semantic Link Prediction
|pdfUrl=https://ceur-ws.org/Vol-900/paper6.pdf
|volume=Vol-900
|dblpUrl=https://dblp.org/rec/conf/semweb/LunaRC12
}}
==An Experimental Evaluation of a Scalable Probabilistic Description Logic Approach for Semantic Link Prediction==
An Experimental Evaluation of a Scalable Probabilistic Description Logic Approach for Semantic Link Prediction José Eduardo Ochoa Luna1 , Kate Revoredo2 , and Fabio Gagliardi Cozman1 1 Escola Politécnica, Universidade de São Paulo, Av. Prof. Mello Morais 2231, São Paulo - SP, Brazil 2 Departamento de Informática Aplicada, Unirio Av. Pasteur, 458, Rio de Janeiro, RJ, Brazil eduardo.ol@gmail.com,katerevoredo@uniriotec.br,fgcozman@usp.br Abstract. In previous work, we presented an approach for link predic- tion using a probabilistic description logic, named crALC. Inference in crALC, considering all the social network individuals, was used for sug- gesting or not a link. Despite the preliminary experiments have shown the potential of the approach, it seems unsuitable for real world scenar- ios, since in the presence of a social network with many individuals and evidences about them, the inference was unfeasible. Therefore, we ex- tended our approach through the consideration of graph-based features to reduce the space of individuals used in inference. In this paper, we evaluate empirically this modification comparing it with standard pro- posals. It was possible to verify that this strategy does not decrease the quality of the results and makes the approach scalable. 1 Introduction Many social, biological, and information systems can be well described by net- works, where nodes represent objects (individuals), and links denote the relations or interactions between nodes. Predicting a possible link in a network is an in- teresting issue that has gained attention, due to the growing interest in social networks. For instance, one may be interested on finding potential friendship between two persons in a social network, or a potential collaboration between two researchers. Thus link prediction [13] aims at predicting whether two nodes (i.e. people) should be connected given that we know previous information about their relationships or interests. In [13] a survey with some representative link prediction methods, catego- rized in three groups, was presented. In the first group, feature-based methods construct pair-wise features to use in a classification task. The majority of the features are extracted from the graph topology computing the similarity based on the neighborhood of the pair of nodes or based on ensembles of paths between the pair of nodes [10]. Recently, semantic informations have also being used as features [21, 17]. The second group includes probabilistic approaches which model the joint-probability among the entities in a network by Bayesian graphi- cal models [20]. And, finally the third group concerns linear algebraic approaches which computes the similarity between the nodes in a network by rank-reduced similarity matrices [9]. In [15] we presented an approach that uses a Bayesian graphical model to- gether with semantic-based features for semantic link prediction. Therefore, our proposal lies on the first two categories described previously. To model the do- main and thus consider semantic-based features, the proposal adopted a proba- bilistic description logic called Credal ALC (crALC) [5], that extends the pop- ular logic ALC [3] with probabilistic inclusions. These are sentences, such as P (Professor|Researcher) = 0.4, indicating the probability that an element of the domain is a Professor given that it is a Researcher. Exact and approximate in- ference algorithms have been proposed [5], using ideas inherited from the theory of Relational Bayesian Networks (RBN) [8]. In [14], we extended our proposal to also consider graph-based approaches in order to scale for large social net- work. In this paper we conduct some experimental analisis in order to verify the benefits of our proposal. The paper is organized as follows. Section 2.1 reviews basic concepts of prob- abilistic description logics, crALC and our proposal for a scalable semantic link prediction approach. Section 3 describes the experiments we conducted bringing some discussions. Section 4 concludes the paper. 2 Background In this section, probabilistic description logic crALC and our former proposal for semantic link prediction are reviewed. 2.1 Probabilistic Description Logics and crALC Description logics (DLs) form a family of representation languages that are typ- ically decidable fragments of first order logic (FOL) [3]. Knowledge is expressed in terms of individuals, concepts, and roles. The semantics of a description is given by a domain D (a set) and an interpretation ·I (a functor). Individuals represent objects through names from a set NI = {a, b, . . .}. Each concept in the set NC = {C, D, . . .} is interpreted as a subset of a domain D. Each role in the set NR = {r, s, . . .} is interpreted as a binary relation on the domain. Several probabilistic descriptions logics have appeared in the literature [11]. Heinsohn [7] and Sebastiani [18] consider probabilistic inclusion axioms such as PD (Professor) = α, meaning that a randomly selected object is a Professor with probability α. This characterizes a domain-based semantics: probabilities are assigned to subsets of the domain D. Sebastiani also allows inclusions such as P (Professor(John)) = α, specifying probabilities over the interpretations them- selves. For example, one interprets P (Professor(John)) = 0.001 as assigning 0.001 to be the probability of the set of interpretations where John is a Professor. This characterizes an interpretation-based semantic. The probabilistic description logic crALC is a probabilistic extension of the DL ALC that adopts an interpretation-based semantics. It keeps all construc- tors of ALC, but only allows concept names on the left hand side of inclu- sions/definitions. Additionally, in crALC one can have probabilistic inclusions such as P (C|D) = α or P (r) = β for concepts C and D, and for role r. If the interpretation of D is the whole domain, then we simply write P (C) = α. The semantics of these inclusions is roughly (a formal definition can be found in [5]) given by: ∀x ∈ D : P (C(x)|D(x)) = α, ∀x ∈ D, y ∈ D : P (r(x, y)) = β. We assume that every terminology is acyclic; no concept uses itself. This as- sumption allows one to represent any terminology T through a directed acyclic graph. Such a graph, denoted by G(T ), has each concept name and role name as a node, and if a concept C directly uses concept D, that is if C and D appear respectively in the left and right hand sides of an inclusion/definition, then D is a parent of C in G(T ). Each existential restriction ∃r.C and value restriction ∀r.C is added to the graph G(T ) as nodes, with an edge from r and C to each restriction directly using it. Each restriction node is a deterministic node in that its value is completely determined by its parents. The graph G(T ) is a Relational Bayesian Network (RBN) [8]. Example 1. Consider a terminology T1 with concepts A, B, C, D. Suppose P (A) = 0.9, B v A, C v B t ∃r.D, P (B|A) = 0.45, P (C|B t ∃r.D) = 0.5, and P (D|∀r.A) = 0.6. The last three assessments specify beliefs about partial overlap among concepts. Suppose also P (D|¬∀r.A) = ≈ 0 (conveying the existence of exceptions to the inclusion of D in ∀r.A). Figure 1 in the left depicts G(T ), while the graph in the right illustrates the grounding of G(T ) for a domain with two individuals (D = {a, b}). Fig. 1. G(T ) for terminology T in Example 1 and its grounding for domainD = {a, b}. The semantics of crALC is based on probability measures over the space of interpretations, for a fixed domain. Inferences, such as P (Ao (a0 )|A) for an ABox A, can be computed by propositionalization, generating a grounding RBN, where one slice is built for each individual. Therefore, not always exact probabilistic in- ference is possible. In [5], a first order loopy propagation algorithm was proposed for approximate calculations. 2.2 Link Prediction with crALC Given a social network N , where nodes are entities (represented by letters a, b, c, . . .), one is interested in defining whether a link between a and b is suit- able given that there is no link between these nodes in N . In [15], interests, i.e., semantics between the nodes were modeled through a probabilistic ontology rep- resented by the probabilistic description logic crALC. In addition, in [14] graph path information was used to improve probabilistic inference. In summary, the semantic link prediction task proposed in [15] (and improved in [14]) can be described as: Given: • a network N defining relationship between objects; • an ontology O in crALC describing the domain of the objects; • the ontology concept C that defines the semantics of the network objects; • the ontology role r( , ) that defines the semantics of the relationship between network objects; Find: • a revised network Nf with new relationship between objects. The proposed algorithm for link prediction receives a network of a spe- cific domain. For instance, in a co-authorship network the nodes repre- sent researchers and the relationship can have the semantics ”has a pub- lication with” or ”is advised by”. Therefore, the ontology represented by crALC describes the domain of publications between researchers, hav- ing concepts like Researcher and Publication and roles like hasPublication, hasSameInstitution and sharePublication. This ontology can be learned automati- cally through a learning algorithm as the ones proposed in [16]. Thus, the nodes represent instances of one of the concepts described in the probabilistic descrip- tion logic crALC and the semantics of the links is described by one of the roles in the probabilistic description logic crALC. These concept and role must be informed as inputs to the proposed algorithm. The link prediction algorithm is described in Algorithm 1. The algorithm starts looking for all pairs of instances of the concept C defined as the concept that provides the semantics for the network nodes — this is a general setting, as a rule the set of possible pairs is restricted. For each pair, it checks whether a link between the corresponding nodes exist in the network. If not the probability of the link is calculated through the probability of the de- fined role conditioned on evidences (step 5). The evidences are provided by the instances of the ontology. The number of instances in an ontology has a great impact in inference. Usually one considers that more instances better inference. However, evidences for different individuals can turn out the inference process computationally expensive, since in a RBN a slice is created for each individual, Require: a network N , an ontology O, the role r( , ) representing the semantics of the network link, the concept C describing the objects of the network and a threshold. Ensure: a revised network Nf 1: define Nf as N ; 2: for all pair of instances (a, b) of concept C do 3: if does not exist a link between nodes a and b in the network N then 4: compute evidence based on a, b and nodes in their path; 5: infer probability P (r(a, b)|evidence) using the RBN created through the ontology O; 6: if P (r(a, b)|evidence) > threshold then 7: add a link between a and b in network Nf ; 8: end if 9: end if 10: end for Algorithm 1: Algorithm for link prediction through crALC. and then inference should be done for each slice. In [5], an approximate infer- ence algorithm was proposed where all slices without evidence are consolidated in a unique slice, thus making inference feasible in real domains. Therefore, less individuals with evidence faster inference is. From another perspective we are interested in predicting a relationship between two individuals, a and b. There- fore, evidences for these two individuals and other individuals strongly related to them are more relevant for link prediction than evidences from other individuals in the network. Thus, in [14] we extended our semantic link prediction approach in order to consider evidences about a, b and the individuals in their path, which makes the link prediction problem scalable for large networks. Therefore, in step 4 the nodes (individuals) belonging to the path between a and b are found. The inference is then performed through crALC lifted variational method on ontology O. If the probability inferred is greater than a threshold then the cor- responding link is added to the network. Alternatively, when the threshold to be considered is not known a priori, a rank of the inferred links based on their probability is done and the top-k, where k would be a parameter, are chosen. 3 Experiments In order to evaluate our previously proposed approach for semantic link pre- diction empirical experiments were performed. To do so, a real world dataset was used and our algorithm was combined with state-of-the-art measures on a classification model for link prediction. This section reports on steps involved in this process. 3.1 Scenario Description The Lattes Platform is the public repository of Brazilian scientific curriculum which is comprised by approximately a million of registered researchers. Infor- mation is given in HTML format, and ranges from personal information such as name and address to a list of publications, examination board participa- tions, research areas, research projects and advising/advisor information. There is implicit relational information in these web pages, for instance collaboration networks are given by advising/adviser links, shared publications and so on. We have randomly selected a set of 1100 researchers from engineering and math backgrounds and based on assertional data about these researchers a probabilis- tic ontology has been learned. To perform link prediction, this ontology has also been extended with some probabilistic roles — learning is mainly addressed to probabilistic inclusions and concepts. Part of the revised ontology is as follows. P (Publication) = 0.3 P (Board) = 0.33 P (sharePublication) = 0.22 P (wasAdvised) = 0.05 P (hasSameInstitution) = 0.14 P (sameExaminationBoard) = 0.31 ResearcherLattes ≡ Person u(∃hasPublication.Publication u∃advises.Person u ∃participate.Board) P (PublicationCollaborator | Researcher u ∃sharePublication.Researcher) = 0.91 P (SupervisionCollaborator | Researcher u ∃wasAdvised.Researcher) = 0.94 P (SameInstitution | Researcher u ∃hasSameInstitution.Researcher) = 0.92 P (SameBoard | Researcheru ∃sameExaminationBoard.Researcher) = 0.95 P (NearCollaborator | Researcher u ∃sharePublication.∃hasSameInstitution. ∃sharePublication.Researcher) = 0.95 FacultyNearCollaborator ≡ NearCollaborator u ∃sameExaminationBoard.Researcher P (NullMobilityResearcher | Researcher u ∃wasAdvised. ∃hasSameInstitution.Researcher) = 0.98 StrongRelatedResearcher ≡ Researcher u (∃sharePublication.Researcher u ∃wasAdvised.Researcher) InheritedResearcher ≡ Researcher u (∃sameExaminationBoard.Researcher u ∃wasAdvised.Researcher) In this probabilistic ontology concepts and probabilistic inclusions denote mutual research interests. For instance, a PublicationCollaborator inclusion refers to Researchers who shares a Publication, thus relates two nodes (instances of con- cept Researcher) in a collaboration graph. Therefore, the concept Researcher and the role sharePublication are inputs to the algorithm we proposed in Algorithm 1. Moreover, their instances were used to define a collaboration network, which was also provided to the algorithm. Topological graph information was computed accordingly. Figure 2 depicts a subset of collaborations among researchers. To perform inferences and therefore to obtain link predictions we resort to the lifted algorithm in crALC. If we carefully inspect this collaboration graph we could be interested, for instance, in predicting links among researchers from different groups. Since filling form is prone to errors, there is uncertainty regarding real collaborations. Thus, in Figure 2 one could further investigate whether a link between researcher R (red octagon node) and the researcher B (blue polygon node) is suitable. In order to infer this, the probability of a possible link between R and B is calculated, P (link(R, B)|E), where E denotes evidence about researchers such as Viviane Cristine Silva José Augusto Pereira da Silva Ivan Eduardo Chabu Giuseppe Pintaúde Célia Aparecida Lino dos Santos Bernardo Pinheiro de Alvarenga Newton Kiyoshi Fukumasu Cristian Camilo Viáfara Arango Yasmara Conceicao De Polli Migliano Fabio Henrique Pereira Roberto Moura Sales Daniel Rodrigues 1 Helio Goldenstein 1 1 1 1 Marcio Gustavo Di Vernieri Cuppari Wilson Luiz Guesser 1 1 1 1 1 Wanderlei Marinho da Silva Jose Jaime da Cruz 1 1 1 Julio Cesar Klein das Neves 1 Mauricio Barbosa de Camargo Salles 1 Renato Rufino Xavier 1 1 1 Sérgio Luís Lopes Verardi 1 1 1 Amilton Sinatora 1 1 Linilson Rodrigues Padovese Marcelo Nelson Páez Carreño José Roberto Cardoso 1 1 1 1 1 Pedro Pereira de Paula Renato Carlson Paulo Roberto Mei 1 1 1 1 1 Andre Paulo Tschiptschin 1 1 Roberto Martins de Souza 1 Andrea Murillo Betioli 1 Silvio Ikuyo Nabeta 1 1 1 Ricardo Fuoco Vanessa Gomes da Silva 1 Miguel Angelo de Carvalho Carlos Shiniti Muranaka Inés Pereyra 1 Antonio Carlos de Jesus Paes 1 Antonio Domingues de Figueiredo Celso Pupo Pesce Artemária Coêlho de Andrade Thiago Melanda Mendes Ivan Gilberto Sandoval Falleiros Marcos Antonio Ruggieri Franco Emilio Del Moral Hernandez Deniol Katsuki Tanaka Orestes Marraccini Goncalves Djonny Weinzierl Mario Leite Pereira Filho Henrique Kahn Vahan Agopyan Luiz Lebensztajn Cleber Marcos Ribeiro Dias Maurício Caldora Costa 1 1 1 Brunoro Leite Giordano 1 1 Sérgio Cirelli Angulo 1 Anne Komatsu Agopyan 1 1 Maria Elena Santos Taqueda 1 Racine Tadeu Araujo Prado 1 1 1 1 1 Silvio Burrattino Melhado Márcia Terra da Silva Mário Sousa Takeashi Neide 1 Matiko Nakata Sato Vanderley Moacyr John 1 Fernando Resende 1 Fred Borges da Silva 1 Carina Ulsen 1 1 Silvia Maria de Souza Selmo 1 1 1 1 Roberto Cesar de Oliveira Romano Viviane Miranda Araujo Carlos Torres Formoso Ubiraci Espinelli Lemes de Souza 1 1 1 1 1 1 1 1 1 Maristela Gomes da Silva Marcela Paula Maria Zanin Meneguetti Mercia Maria Semensato Bottura de Barros Miguel Aloysio Sattler Rafael Giuliano Pileggi 1 1 1 Adriana Gouveia Rodrigo 1 Henrique Lindenberg Neto 1 1 Antonio Carlos Vieira Coelho 1 Bruno Luís Damineli Alex Kenya Abiko Francisco Ferreira Cardoso 1 Heliana Comin Vargas Arnaldo Manoel Pereira Carneiro 1 Heloísa Cristina Fernandes Joao Paulo Sinnecker Antonio Riul Júnior Fernando Henrique Sabbatini 1 John Paul Hempel Lima 1 1 1 Rinaldo Gregorio Filho Patricia Alexandra Antunes Leni Campos Akcelrud 1 Paulo Sergio Santos 1 1 1 1 1 Mauro Zilbovicius 1 1 1 1 1 1 Marco Antônio Penido de Rezende Ely Antonio Tadeu Dirani Cláudia Nascimento de Jesus 1 Everaldo Carlos Venancio Tathyana Moratti 1 1 1 Gerson dos Santos Fernando Josepetti Fonseca 1 Antônio Otávio de Toledo Patrocínio 1 1 1 1 Marystela Ferreira Adriano Gameiro Vivancos 1 1 Leonardo Tolaine Massetto Rodrigo Fernando Bianchi 1 1 Vitor Angelo Fonseca Deichmann Eliana Kimie Taniguti Adnei Melges de Andrade Paulo Cesar de Morais Miguel Alexandre Novak Anna Helena Reali COSTA Erika Paiva Tenório de Holanda Cláudio Vicente Mitidieri Filho Maria Aparecida Godoy Soler Pajanian Rodrigo Andrade Martinez Fig. 2. Lattes collaboration network. publications, institution, examination board participations and so on. Since the role sharePublication defines the semantics of the links in the graph, it is through it that we must calculate P (link(R, B)|E). Concept PublicationCollaborator is defined by the role sharePublication and considering as evidence Researcher(R) u ∃hasSameInstitution.Researcher(B) one can infer P (link(R, B)|E) through: P (PublicationCollaborator(R) |Researcher(R) u∃hasSameInstitution.Researcher(B)) = 0.57. If we took a threshold of 0.60, the link between R and B would not be included. One could gain more evidence, such as information about nodes that in- directly connect these two groups (Figure 2), denoted by I1 , I2 . The inference would be P (PublicationCollaborator(R) |Researcher(R) u∃sharePublication(I1 ).∃sharePublication(B) u∃sharePublication(I2 ).∃sharePublication(B)) = 0.65. Because more information was provided the probability inferred was different. The same threshold now would preserve the link. In order to compare with existing baseline algorithms, topological and se- mantic features have also been defined. Further details are given as follows. 3.2 Methodology In this section we describe our main design choices to run experiments. According to cross validation principles, our dataset (1100 researchers which give rise to 1400 true co-authoring links) has been divided in training and validation sets. To avoid skeweness (due to unbalanced classes), every fold is comprised by balanced negative and positive instances, where positive instances correspond to a link between two nodes while negative instance means that there is not a link between these two nodes. In order to classify possible co-authoring links and therefore to perform com- parisons with previous approaches we resort to the Logistic regression classifica- tion algorithm. In a classification approach for link prediction, features are commonly ex- tracted from topological graph properties such as neighborhood and paths be- tween nodes. In addition, numerical features are also computed from joint prob- ability distributions and semantics. Two baseline graph-based numerical features have been used in our experi- ments. First, the Katz measure [10] is a weighted sum of the number of paths in the graph that connect two nodes, with higher weight for shorter paths. This leads to the following formula: ∞ X Katz(x, y) = β i pi i=1 where pi is the number of paths of length i connecting x and y, while β (≤ 1) parameter is used to regularize this feature. A small value of β considers only the shorter paths. Since computing all paths (∞) is expensive we only consider paths of length at most four (i ≤ 4). The second numerical feature is the Adamic-Adar measure [1] which com- putes the similarity between two nodes in a graph. Let Γ (x) be the set of all neighbors of node x. Then the similarity between two nodes x, y is given by X 1 Adamic-Adar(x, y) = log |Γ (z)| z∈Γ (x)∩Γ (y) The intuition behind the score is that instead of simply counting the number of neighbors shared by two nodes, we should weight the hub nodes less and rarer nodes more. In this way, Adamic-Adar weighs the common neighbors with smaller degree more heavily. We have also considered semantic features. The degree of semantic similarity among entities is something that can be useful to predict links that might not be captured by either topological or frequency-based features [20]. In this work, for each author a document with the words appearing in the title of his publications (removing stop words) is considered. Thus, an author is represented as a set of words, which allow us to compute two features based on semantic similarity: i The keyword match count between two authors [6]. ii The cosine between the TFIDF features vectors of two authors [20]. To compute (ii), we derive a bag of words representation for each author, weight- ing each word by its TFIDF (Term Frequency - Inverse Document Frequency) measure. The TFIDF weighting scheme assigns to term t a weight in document d given by TFIDFt,d = TFt,d × IDFt TFt,d is the term frequency in d, and IDFt denotes the inverse document fre- quency of t which is given by IDFt = log DF N , where N is the total number of t documents and DFt is the number of documents containing the term. The standard way of quantifying the similarity between two documents d1 → − and d2 is to compute the cosine similarity of their vector representations V (d1 ) → − and V (d2 ) → − → − V (d1 ) · V (d2 ) cosine(d1 , d2 ) = →− → − | V (d1 )|| V (d2 )| where the numerator represents the dot product (also known as the inner prod- → − → − uct) of the vectors V (d1 ) and V (d2 ), while the denominator is the product of their Euclidean lengths. Finally, we also use the probability, P (r(x, y)|evidence), given by our proba- bilistic description logic model, as a numerical feature in the classification model. We wish to investigate whether this probabilistic logic measure can improve the classification approach for link prediction. 3.3 Results In order to evaluate suitability of our approach in predicting co-authorships in the Lattes dataset, three experiments were run. In the first experiment two baseline scores, Katz and Adamic-Adar, have been used as features in the logistic regression algorithm. After a ten-fold cross validation process the classification algorithm yielded results on accuracy which are depicted in Table 1. Given the Lattes dataset, one can see that the Katz feature yields the best accuracy (75.49%) when the two topological features are used in isolation. Katz has been shown to be among the most effective topological measures for the link prediction task [10]. Furthermore, when we combine the Katz and the Adamic- Adar features, we improve the accuracy to 76.44%. Table 1. Classification results on accuracy (%) for baseline features: Adamic-Adar (Adamic), Katz and a combined one (Adamic+Katz) Adamic Katz Adamic+Katz Lattes dataset 72.75 ± 1.87 75.49 ± 2.07 76.44 ± 2.03 In the second experiment, we evalute two features based on semantic simi- larity and their combination with topological features. Results on accuracy for these semantic features are depicted in Table 2. The cosine similarity feature performs better than matching keyword feature and outerperforms the two for- mer topological features. This feature alone yields 82.45% on accuracy. When we combine all the four features together, there is an improvement in accuracy to 85.63%. Table 2. Classification results on accuracy(%) for semantic similarity features: match- ing keyword (match) and cosine similarity (cosine) and topological features. match cosine Adamic+Katz+match+cosine Lattes dataset 69.42 ± 2.66 82.45 ± 1.37 85.63 ± 1.23 In the third experiment, a probabilistic feature based on our probabilistic description logic approach was introduced into the model. Results on accuracy for this feature are depicted in Table 3. The probabilistic description logic feature performs better than the other features. This feature yields 87.72% on accuracy. When we combine all the five features together, there is an improvement in accuracy to 89.48%. Table 3. Classification results on accuracy(%) for probabilistic description logics and baseline features: crALC based (cralc) and Adamic-Adar, Katz, match, cosine, crALC (Adamic+Katz+match+cosine+cralc). cralc Adamic+Katz+match+cosine+cralc Lattes dataset 87.72 ± 0.52 89.48 ± 0.96 It is worth noting that the probabilistic logic feature probability outer per- forms all other features and allow us to improve the classification model for link prediction on accuracy. Nothing prevent us to define ad-hoc probabilistic networks to estimate link probabilities. However, by doing so we are expected to define a large proposition- alized network (a relational Bayesian network) [15] or estimate local probabilistic networks [20]. These approaches do not scale well since computing probabilistic inference for large networks is expensive. To overcome these performance and scalability issues, we resort to lifted inference in crALC which is based on variational methods — tunned by evidence defined according nodes’s neighborhood. Thus, for a ten thousand network, if evidence is given for 5 nodes, then there is only 6 slices which have messages interchanged. In our experiments, the average runtime for inference in crALC (1100 nodes network) was 135 milliseconds. Table 4 depicts some runtime results for larger networks which demonstrates the scalability of our approach. Table 4. Average runtime for inference. nodes runtime(milliseconds) 1100 135 10000 168 100000 175 1000000 185 On the other hand, a propositionalized relational Bayesian network fails to run inference due to out of memory issues. 4 Conclusion In [15, 14] we have presented an approach for predicting links that resorts to both graph-based and ontological information. Given a collaborative network, we encode interests and graph features through a crALC probabilistic ontol- ogy. In order to predict links we resort to probabilistic inference, where only information about two nodes being analyzed and the nodes in their path are used as evidence. Thus, making the proposal scalable. In this paper, we evalu- ated our proposal focused on an academic domain, and we aimed at predicting links among researchers. The approach was successfully compared with graph- based and semantic-based features. As future work we intend to consider other datasets. Previous combined approaches for link prediction [4, 2] have focused on ma- chine learning algorithms [12]. In such schemes, numerical graph-based features and ontology-based features are computed; then both features are input into a machine learning setting where prediction is performed. Unless from such ap- proaches, in our work we adopt a generic ontology (instead of a hierarchical ontol- ogy, expressing only is-a relationships among interests). Therefore, our approach uses more information about the domain to help the prediction. Moreover, in [19], a Probabilistic Relational Model is used for link prediction task. This is one of the approaches more closed to ours, since uses semantic features considering a probabilistic graphical model. However, inference is done in a propositionalized network that can not scale for large networks. Acknowledgements The third author is partially supported by CNPq. The work reported here has received substantial support by FAPESP grant 2008/03995-5 and FAPERJ grant E-26/111484/2010. References 1. L.A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2001. 2. W. Aljandal, V. Bahirwani, D. Caragea, and H.W. Hsu. Ontology-aware classifica- tion and association rule mining for interest and link prediction in social networks. In AAAI 2009 Spring Symposium on Social Semantic Web: Where Web 2.0 Meets Web 3.0, Standford,CA, 2009. 3. F. Baader and W. Nutt. Basic description logics. In Description Logic Handbook, pages 47–100. Cambridge University Press, 2007. 4. D. Caragea, V. Bahirwani, W. Aljandal, and W.H. Hsu. Ontology-based link prediction in the livejournal social network. In SARA’09, pages 1–1, 2009. 5. F. G. Cozman and R. B. Polastro. Complexity analysis and variational inference for interpretation-based probabilistic description logics. In Conference on Uncertainty in Artificial Intelligence, 2009. 6. Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. Link prediction using supervised learning. In In Proc. of SDM 06 workshop on Link Analysis, Counterterrorism and Security, 2006. 7. J. Heinsohn. Probabilistic description logics. In International Conf. on Uncertainty in Artificial Intelligence, pages 311–318, 1994. 8. M. Jaeger. Relational Bayesian networks: a survey. Linkoping Electronic Articles in Computer and Information Science, 6, 2002. 9. J. Kunegis and A. Lommatzsch. Learning spectral graph transformations for link prediction. In Proceedings of the ICML, pages 561–568, 2009. 10. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. American Society for Information Science and Technology, 7(58):1019–1031, 2007. 11. T. Lukasiewicz and U. Straccia. Managing uncertainty and vagueness in description logics for the semantic web. Semantic Web Journal, 6(4):291–308, 2008. 12. T. Mitchell. Machine Learning. McGraw-Hill, New York, 1997. 13. A.H. Mohammad and J.Z. Mohammed. A survey of link prediction in social net- works. In Social Network Data Analytics, pages 243–275. 2011. 14. J. Ochoa-Luna, K. Revoredo, and F.G. Cozman. A scalable semantic link prediction approach through probabilistic description logics. In In proceeding of 9th Artificial Intelligence National Meeting (ENIA) - to appear, 2012. 15. K. Revoredo, J. Ochoa-Luna, and F.G. Cozman. Semantic link prediction through probabilistic description logics. In Bobillo, F., et al. (eds.) Proceedings of the 7th International Workshop on URSW, volume 778, pages 87–97. 16. K. Revoredo, J. Ochoa-Luna, and F.G. Cozman. Learning probabilistic description logics: A framework and algorithms. In In proceedings of the MICAI, volume 7094 of LNCS, pages 28–39. Springer, 2011. 17. M. Sachan and R. Ichise. Using semantic information to improve link prediction results in network datasets. International Journal of COmputer Theory and En- geneering, 3:71–76, 2011. 18. F. Sebastiani. A probabilistic terminological logic for modelling information re- trieval. In ACM Conf. on Research and Development in Information Retrieval (SIGIR), pages 122–130, 1994. 19. B. Taskar, M. faiWong, P. Abbeel, and D. Koller. Link prediction in relational data. In Proceedings of the 17th NIPS, 2003. 20. C. Wang, V. Satuluri, and S. Parthasarathy. Local probabilistic models for link prediction. In Proceedings of the 2007 Seventh IEEE ICDM, pages 322–331, Wash- ington, DC, USA, 2007. IEEE Computer Society. 21. T. Wohlfarth and R. Ichise. Semantic and event-based approach for link predic- tion. In Proceedings of the 7th International Conference on Practical Aspects of Knowledge Management, 2008.