
GMap: Results for OAEI 2015

Weizhuo Li

liweizhuo@amss.ac.cn

Qilin Sun

sunqilin@amss.ac.cn

Institute of Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, P. R. China

GMap is an alternative probabilistic scheme for ontology matching that combines the sum-product network and the noisy-or model. More precisely, we employ the sum-product network to encode the similarities based on individuals and disjointness axioms, and the noisy-or model to encode the probabilistic matching rules, which describe the influences among entity pairs across ontologies. In this paper, we briefly introduce GMap and its results on four tracks (i.e., Benchmark, Conference, Anatomy and Ontology Alignment for Query Answering) of OAEI 2015.

1

Presentation of the system

1.1

State, purpose, general statement

State-of-the-art approaches such as OMEN [7], iMatch [1] and CODI [8] have applied probabilistic graphical models [5] to ontology matching. However, few of them keep inference tractable without losing inference accuracy. In this paper, we propose an alternative probabilistic scheme, called GMap, which combines the sum-product network (SPN) and the noisy-or model [6]. Besides tractable inference, these two graphical models have inherent advantages for ontology matching. With SPN, even if knowledge such as individuals or disjointness axioms is missing, its contribution can still be calculated by maximum a posteriori (MAP) inference. The noisy-or model is a reasonable approximation for incorporating probabilistic matching rules that describe the influences among entity pairs.

Figure 1 shows a sketch of GMap. Given two ontologies O1 and O2, we calculate the lexical similarity based on edit-distance, external lexicons and TFIDF [3], combined with the max strategy. Then, we employ SPN to encode the similarities based on individuals and disjointness axioms and calculate their contribution through MAP inference. After that, we utilize the noisy-or model to encode the probabilistic matching rules together with the value calculated by SPN. With the one-to-one constraint and the crisscross strategy in the refine module, GMap obtains initial matches. The whole matching procedure is iterative: it terminates when no additional matches are identified.

1.2

Specific techniques used

The similarities based on individuals and disjointness axioms

Under the open world assumption, individuals or disjointness axioms are missing at times. Therefore, we define a special assignment, "Unknown", for the similarities based on these individuals and disjointness axioms.

[Figure 1: Sketch of GMap. Recoverable labels: O1, O2; Computing lexical similarity; Using SPN to encode individuals and disjointness axioms; Using Noisy-Or model to encode probabilistic matching rules; Refining matches; Yes]

For individuals, we use string equality to decide whether two individuals are the same. When we calculate the similarity of concepts based on individuals across ontologies, we regard the individuals of each concept as a set and use the Ochiai coefficient (footnote 1) to measure the value. We use a boundary t to divide the value into three assignments (i.e., 1, 0 and Unknown). Assignment 1 (or 0) means that the pair matches (or mismatches). If the value lies between 0 and t, or the individuals of one concept are missing, the assignment is Unknown.
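The three-valued assignment described above can be sketched as follows. This is a minimal illustration, not GMap's implementation: the function names, the default boundary `t = 0.5`, and the choice to map a value of exactly 0 to assignment 0 and a value of at least t to assignment 1 are assumptions made for the example.

```python
import math
from typing import Optional, Set, Union

UNKNOWN = "Unknown"


def ochiai(a: Set[str], b: Set[str]) -> Optional[float]:
    """Ochiai coefficient |A ∩ B| / sqrt(|A| * |B|); None if a set is empty."""
    if not a or not b:
        return None
    # Individuals are compared by string equality, so set intersection suffices.
    return len(a & b) / math.sqrt(len(a) * len(b))


def individual_assignment(a: Set[str], b: Set[str], t: float = 0.5) -> Union[int, str]:
    """Map the Ochiai value to 1 (match), 0 (mismatch) or Unknown.

    Assumed reading: value >= t -> 1, value == 0 -> 0, otherwise Unknown.
    """
    value = ochiai(a, b)
    if value is None:      # individuals of one concept are missing
        return UNKNOWN
    if value >= t:
        return 1
    if value == 0.0:
        return 0
    return UNKNOWN         # strictly between 0 and t
```

For example, two concepts that share one of four individuals each have an Ochiai value of 0.25, which falls below the assumed boundary and therefore yields Unknown.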

For disjointness axioms, we use these axioms together with the subsumption relations within the ontologies and define some rules to determine the assignments of the similarity. For example, let x1 and y1 be concepts from O1 and x2 a concept from O2. If x1 matches x2 and x1 is disjoint with y1, then y1 is disjoint with x2, and the same holds for their descendants. This similarity also has three assignments. Assignment 1 (or 0) means that the pair mismatches (or overlaps). If none of the rules is satisfied, the assignment is Unknown.
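The example rule above can be sketched in a few lines. This is an illustrative reading only: the data layout (a `children` map from a class to its direct subclasses), the function names, and the interpretation that every descendant of y1 is treated as disjoint with every descendant of x2 are assumptions, not details given in the paper.

```python
from typing import Dict, Iterable, Set, Tuple


def descendants(cls: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return cls together with all of its transitive subclasses."""
    seen = {cls}
    stack = [cls]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def propagate_disjointness(
    match: Tuple[str, str],
    disjoint_o1: Iterable[Tuple[str, str]],
    children_o1: Dict[str, Set[str]],
    children_o2: Dict[str, Set[str]],
) -> Set[Tuple[str, str]]:
    """Cross-ontology pairs that receive assignment 1 (mismatch).

    If x1 matches x2 and x1 is disjoint with y1 in O1, then y1 (and its
    descendants) is taken to be disjoint with x2 (and its descendants).
    """
    x1, x2 = match
    derived: Set[Tuple[str, str]] = set()
    for a, b in disjoint_o1:
        if a == x1:
            y1 = b
        elif b == x1:
            y1 = a
        else:
            continue  # axiom does not involve the matched concept
        for c1 in descendants(y1, children_o1):
            for c2 in descendants(x2, children_o2):
                derived.add((c1, c2))
    return derived
```

With a hypothetical match (Doc, Document), an axiom stating that Doc is disjoint with Person in O1, and subclasses Student of Person and Paper of Document, the function returns the four pairs formed by {Person, Student} and {Document, Paper}.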

Using SPN to encode the similarities based on individuals and disjointness axioms

A sum-product network is a directed acyclic graph with weighted edges, where variables are leaves and internal nodes are sums and products [9]. As shown in Figure 2, we designed a sum-product network S to encode the above similarities and calculate their contribution. All the leaves, called indicators, are binary-valued. M represents the contribution of individuals and disjointness axioms, and the indicators M1, M2, M3 comprise its assignments. M1 = 1 (or M2 = 1) means that the contribution is positive (or negative). If M3 = 1, the contribution is Unknown. Similarly, the indicators D0, D1, I1, I2, I3 correspond to the assignments of the similarities based on individuals and disjointness axioms. The concrete assignment metrics are listed in Tables 1–2, and the assignment metric of M is similar to that of similarity D.

[Figure 2: The designed sum-product network S over the indicators I1, I2, I3, D0, D1, M1, M2, M3, annotated with SPN |= (I ⊥ M | D1)]

1 https://en.wikipedia.org/wiki/Cosine_similarity

With MAP inference in SPN [9], we can obtain the indicator values of the contribution M. MAP inference has three steps. First, replace the sum nodes with max nodes. Second, in a bottom-up pass, compute the maximum weighted value at each max node. Finally, in a downward pass starting from the root, recursively select the highest-valued child of each max node; the indicator values of M are then obtained. Moreover, even if individuals or disjointness axioms are missing, we can still calculate the contribution M by MAP inference. Assume I = 1 and D = Unknown for one pair. With the defined similarities and the assignment metrics of the SPN, we obtain I1 = 1, I2 = 0, I3 = 0, D0 = 1, D1 = 1. As the contribution M is not given, we set M1 = 1, M2 = 1, M3 = 1. After MAP inference, we observe M1 = 1, which means that the contribution is positive. In addition, we can infer D0 = 1, which means that the pair overlaps.
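The three steps can be illustrated on a small toy network. The sketch below is not the network S of Figure 2: its structure (a root sum over three products, each pairing one indicator of I with a sum over M1, M2, M3) and all weights are hypothetical, chosen only to show the max-product upward pass and the downward selection. Indicators of an unobserved variable are all set to 1, as in the example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class Leaf:
    name: str  # indicator name, e.g. "M1"


@dataclass
class Product:
    children: List["Node"]


@dataclass
class Sum:
    children: List["Node"]
    weights: List[float]


Node = Union[Leaf, Product, Sum]


def upward(node: Node, evidence: Dict[str, int], cache: Dict[int, float]) -> float:
    """Bottom-up pass in which every sum node acts as a max node."""
    key = id(node)
    if key in cache:
        return cache[key]
    if isinstance(node, Leaf):
        value = float(evidence[node.name])
    elif isinstance(node, Product):
        value = 1.0
        for child in node.children:
            value *= upward(child, evidence, cache)
    else:  # Sum node, evaluated as max over weighted children
        value = max(w * upward(c, evidence, cache)
                    for w, c in zip(node.weights, node.children))
    cache[key] = value
    return value


def downward(node: Node, cache: Dict[int, float], selected: Set[str]) -> None:
    """Top-down pass: follow the highest-valued child of each max node."""
    if isinstance(node, Leaf):
        selected.add(node.name)
    elif isinstance(node, Product):
        for child in node.children:
            downward(child, cache, selected)
    else:
        _, best = max(zip(node.weights, node.children),
                      key=lambda wc: wc[0] * cache[id(wc[1])])
        downward(best, cache, selected)


def map_inference(root: Node, evidence: Dict[str, int]) -> Set[str]:
    """Return the names of the indicators selected by MAP inference."""
    cache: Dict[int, float] = {}
    upward(root, evidence, cache)
    selected: Set[str] = set()
    downward(root, cache, selected)
    return selected


def build_toy_spn() -> Node:
    """Hypothetical complete and decomposable SPN over variables I and M."""
    i1, i2, i3 = Leaf("I1"), Leaf("I2"), Leaf("I3")
    m = [Leaf("M1"), Leaf("M2"), Leaf("M3")]
    return Sum(
        children=[
            Product([i1, Sum(m, [0.8, 0.1, 0.1])]),
            Product([i2, Sum(m, [0.1, 0.8, 0.1])]),
            Product([i3, Sum(m, [0.2, 0.2, 0.6])]),
        ],
        weights=[0.4, 0.3, 0.3],
    )


# I = 1 is observed (I1 = 1, I2 = I3 = 0); M is unobserved (all indicators 1).
evidence = {"I1": 1, "I2": 0, "I3": 0, "M1": 1, "M2": 1, "M3": 1}
result = map_inference(build_toy_spn(), evidence)
```

With these toy weights the downward pass selects I1 and M1, which corresponds to a positive contribution. Each node is visited once per pass thanks to the cache, which mirrors the linear-time behaviour expected of a complete and decomposable network.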

As the network S is complete and decomposable, inference in S can be computed in time linear in the number of edges [4], so MAP inference is tractable.

Combining the lexical similarity and the contribution calculated by SPN

Considering the range of the lexical similarity, we define a scaling factor (denoted α here) to limit the contribution of the lexical similarity. It helps us to analyze the sources of the different contributions. The SPN-based similarity (S0) is defined in Eq. 1 and is calculated according to the indicator values of M and D.

$$
S_0(x_1, x_2) =
\begin{cases}
0 & M_2 = 1,\ D_1 = 1 \\
\alpha \cdot \mathrm{lexSim}(x_1, x_2) + \delta & M_1 = 1,\ D_0 = 1 \\
\alpha \cdot \mathrm{lexSim}(x_1, x_2) - \delta & M_2 = 1,\ D_0 = 1 \\
\alpha \cdot \mathrm{lexSim}(x_1, x_2) & M_3 = 1,\ D_0 = 1
\end{cases}
\qquad (1)
$$

where δ is a contribution factor that represents the contribution based on disjointness axioms and individuals. If the contribution is positive (negative) and the pair overlaps, the SPN-based similarity is equal to the scaled lexical similarity plus (minus) δ. If the contribution is Unknown and the pair overlaps, the SPN-based similarity is equal to the scaled lexical similarity. If the pair mismatches, the inferred contribution is negative and the SPN-based similarity is equal to 0.
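The piecewise definition of S0 can be written as a small function. This is a sketch under stated assumptions: the scaling factor is taken to multiply the lexical similarity, the parameter names `alpha` and `delta` and their default values are hypothetical, and the inferred indicators are passed in as the strings "M1"/"M2"/"M3" and "D0"/"D1".

```python
def spn_similarity(lex_sim: float, m: str, d: str,
                   alpha: float = 0.8, delta: float = 0.2) -> float:
    """SPN-based similarity S0 from the inferred indicators of M and D.

    m: "M1" (positive), "M2" (negative) or "M3" (Unknown contribution).
    d: "D0" (pair overlaps) or "D1" (pair mismatches).
    alpha, delta: assumed scaling and contribution factors.
    """
    if d == "D1":
        return 0.0                 # mismatching pair
    scaled = alpha * lex_sim       # scaled lexical similarity
    if m == "M1":
        return scaled + delta      # positive contribution
    if m == "M2":
        return scaled - delta      # negative contribution
    return scaled                  # Unknown contribution


positive = spn_similarity(0.5, "M1", "D0")
unknown = spn_similarity(0.5, "M3", "D0")
mismatch = spn_similarity(0.5, "M2", "D1")
```

Keeping the lexical term and the contribution term separate in this way makes it easy to see which source moved a given pair above or below the matching threshold.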

Using Noisy-Or model to encode probabilistic matching rules

As listed in Table 3, we utilize probabilistic matching rules to describe the influences among the related pairs across ontologies.

ID   Probabilistic matching rules
R1   two classes probably match if their fathers match
R2   two classes probably match if their children match
R3   two classes probably match if their siblings match
R4   two classes about domain probably match if related objectproperties match and range of these properties match
R5   two classes about range probably match if related objectproperties match and domain of these properties match
R6   two classes about domain probably match if related dataproperties match and value of these properties match

Considering the matching probability of one pair, we observe that the condition of each rule has two values (i.e., T or F) and that the matching rules are approximately independent of each other. Moreover, each of them, when satisfied, increases the matching probability of the pair. Therefore, we utilize the noisy-or model [5] to encode them.
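The standard noisy-or combination can be sketched as below. The paper's concrete parameterisation is not recoverable from this text, so the per-rule strengths, the function name, and the use of a leak term (which could, for instance, stand in for the SPN-based similarity) are assumptions of this example. The formula itself is the textbook noisy-or: the pair fails to match only if the leak and every satisfied rule independently fail.

```python
from typing import Sequence


def noisy_or(active: Sequence[bool], strengths: Sequence[float],
             leak: float = 0.0) -> float:
    """P(match) = 1 - (1 - leak) * prod_i (1 - strength_i) over satisfied rules.

    active[i]    : whether the condition of rule R_i holds (T or F).
    strengths[i] : assumed probability that rule R_i alone causes a match.
    leak         : assumed baseline probability when no rule is satisfied.
    """
    if len(active) != len(strengths):
        raise ValueError("active and strengths must have the same length")
    p_fail = 1.0 - leak
    for on, strength in zip(active, strengths):
        if on:
            p_fail *= 1.0 - strength
    return 1.0 - p_fail


# Two satisfied rules with strength 0.5 each, one unsatisfied rule.
p = noisy_or([True, False, True], [0.5, 0.9, 0.5])  # 0.75
```

Because every satisfied rule can only shrink the failure probability, the model captures positive influences well but has no way to express that a failed condition should lower the matching probability.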

[Figure: noisy-or network. Recoverable node labels: R1, S1, R2, S2, OR]

There are two kinds of parameters that need to be set. One kind mainly comes from the networks and is set manually based on some considerations [2]. The others, such as the scaling factor and the contribution factor in Eq. 1 and the threshold, are adapted on the I3CON data set (footnote 2). Nevertheless, we did not make any specific adaptation for the OAEI 2015 evaluation campaign, and all parameters are the same for the different tracks.

2

Results

In this section, we present the results achieved by GMap in OAEI 2015. Our system mainly focuses on Benchmark, Anatomy and Conference. In addition, we present the results on the Ontology Alignment for Query Answering test, which does not follow the classical ontology alignment evaluation on the SEALS platform.

2.1

Benchmark

The goal of Benchmark is to evaluate matching systems in scenarios where the input ontologies lack important information. Table 4 summarizes the average results (footnote 3).

GMap performed well on biblio, ranking third in F-measure, because it makes use of string resources such as identifiers, labels and comments. In particular, in ontologies 201–210 of biblio, where the mapped concepts have the same group of individuals but different names, SPN helps to improve the alignment quality of GMap.

2 http://www.atl.external.lmco.com/projects/ontology/i3con.html

3 There were some problems with the new energy test set.

2.2

Anatomy

The Anatomy track consists of finding an alignment between the Adult Mouse Anatomy (2744 classes) and a part of the NCI Thesaurus (3304 classes) describing the human anatomy. The results are shown in Table 5.

GMap ranked fifth in the Anatomy track. Our analysis is that GMap does not focus on language techniques such as handling abbreviations, and that it enforces the one-to-one constraint. Both may cause low recall. In addition, the top-ranked systems employ alignment debugging techniques, which help to improve alignment quality. We do not employ these techniques in the current version.

2.3

Conference

The Conference track contains sixteen ontologies from the conference organization domain. There are two versions of the reference alignment. The original reference alignment is labeled RA1, and the new reference alignment, generated as a transitive closure computed on the original reference alignment, is labeled RA2. Table 6 shows the results of our system in this track.

In the Conference track, GMap ranked sixth among the 14 participants. It outperforms all other systems except AML in recall, but its precision is lower than theirs. There are mainly two reasons. One is the lexical similarity, which combines the similarities based on edit-distance, external lexicons and TFIDF with the max strategy. The other is the noisy-or model, which can hardly describe negative effects on the matching of pairs [5]. Both retain some false positive matches after matching finishes. In particular, for property pairs, GMap cannot describe the negative impact even when their domains and ranges mismatch. Therefore, employing alignment debugging techniques is a comparatively suitable way to deal with this problem.

2.4

Ontology Alignment for Query Answering (OA4QA)

The aims of OA4QA are to investigate the effects of logical violations affecting computed alignments and to evaluate the effectiveness of the repair strategies employed by the matchers. In OAEI 2015, the ontologies and the reference alignment (RA1) are based on the Conference track. RAR1 is a repaired version of RA1, different from RA2 in the Conference track. Table 7 presents the results for the whole set of queries.

Since GMap does not consider mapping repair techniques, it was only able to answer half of the queries, which affected the precision and recall it finally obtained.

3

General comments

3.1

Comments on the results

GMap achieved satisfactory results in its first participation in OAEI and is competitive with other systems in some tracks such as Benchmark, Conference and Anatomy. Both of the employed graphical models are able to improve the quality of alignment relative to the defined lexical similarity [6]. Most of the improvement is attributed to the noisy-or model because it makes use of the rich relations specified in the ontologies, as in the Anatomy track. If individuals and disjointness axioms are declared in the ontologies, SPN also contributes, as for biblio (201–210) in the Benchmark track. More importantly, combining SPN and the noisy-or model is able to increase precision and recall further.

However, some weaknesses remain. For example, the alignment incoherence of GMap is unsolved, which affects its performance. In addition, it is important for us to consider the efficiency of GMap, such as running time and memory usage, for large-scale mapping problems.

3.2

Discussions on the way to improve the proposed system

GMap still has a lot of room for improvement. Employing alignment debugging techniques can resolve alignment incoherence and remove some false positive matches, such as the pair {Conference:has_members, edas:hasMember} in the Conference track. In addition, seeking available data sets to learn the parameters of the sum-product network and the noisy-or model is one direction of our future work.

4

Conclusion

In this paper, we have presented GMap and its results on four tracks (i.e., Benchmark, Conference, Anatomy and Ontology Alignment for Query Answering) of OAEI 2015. The results show that GMap is competitive with the top-ranked systems in some tracks by combining two special graphical models (i.e., SPN and the noisy-or model). For the weaknesses exposed, we have discussed possible solutions. In the future, we would like to participate in more tracks and hope to efficiently solve the instance matching and large biomedical ontologies matching challenges.

Acknowledgments. This research was partly supported by the Natural Science Foundation of China (No. 61232015), the National Key Research and Development Program of China (Grant No. 2002CB312004), the Knowledge Innovation Program of the Chinese Academy of Sciences, Key Lab of Management, Decision and Information Systems of CAS, Institute of Computing Technology of CAS, and the Key Laboratory of Multimedia and Intelligent Software at Beijing University of Technology.

1. Albagli, S., Ben-Eliyahu-Zohary, R., Shimony, S.E.: Markov network based ontology matching. Journal of Computer and System Sciences 78(1) (2012) 105-118

2. Ding, L., Finin, T.: Characterizing the semantic web on the web. In: The Semantic Web-ISWC 2006. Springer (2006) 242-257

3. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer Science & Business Media (2013)

4. Gens, R., Domingos, P.: Learning the structure of sum-product networks. In: Proceedings of The 30th International Conference on Machine Learning. (2013) 873-880

5. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques. MIT press (2009)

6. Li, W.: Combining sum-product network and noisy-or model for ontology matching

7. Mitra, P., Noy, N.F., Jaiswal, A.R.: Omen: A probabilistic ontology mapping tool. In: The Semantic Web-ISWC 2005. Springer (2005) 537-547

8. Niepert, M., Meilicke, C., Stuckenschmidt, H.: A probabilistic-logical framework for ontology matching. In: AAAI, Citeseer (2010)

9. Poon, H., Domingos, P.: Sum-product networks: A new deep architecture. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, IEEE (2011) 689-690