=Paper=
{{Paper
|id=Vol-1766/oaei16_paper13
|storemode=property
|title=RiMOM results for OAEI 2016
|pdfUrl=https://ceur-ws.org/Vol-1766/oaei16_paper13.pdf
|volume=Vol-1766
|authors=Yan Zhang,Hailong Jin,Liangming Pan,Juanzi Li
|dblpUrl=https://dblp.org/rec/conf/semweb/ZhangJPL16
}}
==RiMOM results for OAEI 2016==
RiMOM Results for OAEI 2016 Yan Zhang, Hailong Jin, Liangming Pan, Juanzi Li Tsinghua University, Beijing, China. {z-y14, jinhl, panlm14}@mails.tsinghua.edu.cn ljz@keg.tsinghua.edu.cn Abstract. This paper presents the results of RiMOM in the Ontology Alignment Evaluation Initiative (OAEI) 2016. RiMOM participated in all three tracks of Instance Matching this year. In this paper, we first describe the overall framework of our system (RiMOM). Then we detail the techniques used in the framework for instance matching. Last, we give a thorough analysis on our results and discuss some future work on RiMOM. 1 Presentation of the system With the rapid development of the Semantic Web, knowledge base has become a domi- nant mechanism to represent the data semantics on the Web. In practice, data is always distributed on heterogeneous data sources. For example, there are a large number of ontological knowledge bases nowadays, such as DBpedia[1]. , YAGO [2, 3], Xlore [4], etc. It is inevitable that the knowledge about the same real-world entity may be stored in different knowledge bases. Therefore, data integration process requires the detection of such heterogeneous instances to ensure the integrity and consistency. Most recently, it should be noticed that there are many knowledge bases described in different languages. For example, Wikipedia, a well-known public encyclopedia, con- tains 281 language versions. It is going to be norm that the same real-world entities are described by different language. Thus, there is a growing need to align instances in a cross-lingual environment so that we can share knowledge from all over the world. In consideration of this circumstance, based on previous version of RiMOM[5], we pro- pose an extended version, which provides support for cross-lingual instance matching in a supervised or an unsupervised way. There are three major techniques in our system, blocking, multi-strategy, machine learning: 1. Blocking: We index the instances based on their objects in two knowledge bases respectively, and then select the instances which contain the same keys as candidate instance pairs. We limit the number of pairs to be compared by this step, which significantly improve the efficiency of the system. 2. Multi-strategy: We implement several matchers in our instance matching system, we can execute these matchers in parallel and then aggregate the result according to the characteristics of the source ontologies. 3. Machine learning:In general, there are some existing alignments. For exmaple, there are a number of cross-lingual links between two different language versions of Wikipedia. To make full use of these data, we formalize the instance matching as a binary classification problem, and use the reference mappings to train a classifier, which will determine whether an instance pair is equivalent or not. Faced with challenges in large-scale instance matching, we propose an novel data integration framework RiMOM-2016 (the latest version of RiMOM), which is based on our former ontology and instance matching system RiMOM [5, 6]. The RiMOM- 2016 framework is designed for large-scale and cross-lingual instance matching task specially. It presents a novel multi-strategy method to be fit for different kinds of ontol- ogy and employs a learning-based approach to get instance alignments in multilingual environments. 1.1 State, purpose, general statement This section describes the overall framework of RiMOM2016. The overview of the instance matching system is shown in Fig. 1. The system includes seven modules, i.e., Preprocess, Predicate Alignment, Mathcher Choosing, Candidate Pair Generation, Matching Score Calculation, Instance Alignment and Validation. The sequences of the process are shown in the Fig. 1. We illustrate the process as follows. Fig. 1. Framework of RiMOM 2016 1. Preprocess: The system begins with Preprocess, which loads the ontologies and parameters into system. In the meantime, preprocessor can get some meta data about the two ontologies, which will be used in the later processes, Predicate align- ment and Matcher choosing 2. Predicate Alignment: In this process, we will get the alignments of the predicates between the two ontologies. 3. Matcher choosing: The system will choose the most suitable one or more match- ers according to the meta data of the ontologies. 4. Candidate Pairs Generation: In this step, we get candidate pairs when the in- stances have the same literal objects on some discriminatory predicates. 5. Matching Score Calculation & Instance Alignment: This procedure is the most striking difference with the last version of RiMOM. In RiMOM-2016, we get align- ments in a supervised or an unsupervised way which depends on whether there exist reference alignments or not. In case of unsupervised method, we calculate similarities between two instances on each property, and then we aggregate these similarities according to the degree of identifying obtained in step 1. On the con- trary, we conduct a supervised method when there exist reference alignments. For each instance pairs, we also calculate the similarities as unsupervised way. Then we construct a similarity vector for each pairs and train a logistic regression model [7]. For each candidate instance pair, we use this model to determine whether it is equivalent or not. 6. Validation: We will evaluate the alignment result on Precision, Recall and F1- Measure if there is validation data set. 1.2 Specific techniques used This year we participate in all of three subtasks in the Instance Matching track. We will describe specific techniques in this section. Data Preprocessing: First, we remove some stop words like ”a, of, the”, etc. Af- terwards, we calculate the TF-IDF values of words in each knowledge base. We also calculate some information of each predicate, in order to obtain the degree of identify- ing of predicates which will be used in similarity aggregation. Predicate Alignment: The predicates can express rich semantics, and there exist one-to-one, one-to-many, or many-to-many relationships among these predicates. It is apparent that we should get the alignments of the predicates before we calculate the similarity of instances. In RiMOM-2016, we use an object-based method to align pred- icates, which is similar with RiMOM-2015 [5]. Blocking: This step aims to pick a relatively small set of candidate pairs from all pairs. Due to the large scale of knowledge bases, it is impossible to calculate match- ing scores of all instance pairs. In our method, we firstly generate the inverted index on the objects. instance pairs are selected into the candidate set when they have com- mon objects. This method may reduce the recall slightly, but it also reduce the scale of computation significantly. Multi-Strategy: We implement several matchers in our system, e.g. label-based approach and structure-based approach. In the preprocess step, we will compare the schema of the two ontologies. If the range of predicates is similar, the label-based ap- proach will play a key role in the matching process. Otherwise, the literal properties are not similar (e.g. the two ontologies are defined in different languages or the intersection of values is really small), label-based approach will not be effective. In this case, we will get some supplementary information (e.g. machine translation, WordNet), or use structure-based appraoch (or use the structure similarity as a feature). In addition, we will use a learning-based method if we have data for training. Similarity Calculation & Instance Alignment: In OAEI 2016 instance matching track, some of subtasks are defined in the same language, while others use multilingual data sets (e.g. SABINE Task). Unsupervised method: we use a object-based method to get alignments, it is de- fined as follows: 0 p fpn (i1 , i2 ) = Sim(Oip1n , Oi2n ) (1) where i1 and i2 are instances from two data sets respectively. Oip1n represent the 0 p object value of instance i1 on property pn . Sim(Oip1n , Oi2n ) represent the similarity of object values between these two instances on property pn and its corresponding proper- 0 ty pn . The computing method of this similarity depends on the data type. For example, we use Levenshtein distance for type:text and indicator function for type:int. Sim(i1 , i2 ) = ω1 × fp1 (i1 , i2 ) + ω2 × fp2 (i1 , i2 ) + ... + ωn × fpn (i1 , i2 ) (2) For each property pj , we calculate the similarity according to equation 1 and aggre- gate them by weights ωj which indicate the importance of properties. Supervised method: In equation 2, the weight wi is determined by meta-data of ontology or manual. Intuitively, it could be improved by a learning-based method if we have some existing alignments. So, basically, we formulate this instance matching problem as a binary classification problem. For a pair of instance i1 and i2 , the feature vector f = {fpi }ni=1 . Thus, we can use a sigmoid function to compute the probability that instances i1 is equivalent with i2 . 1 P (i1 ≡ i2 ) = (3) 1 + ew·f (i1 ,i2 ) If i1 ≡ i2 , P (i1 ≡ i2 ) > 0.5; otherwise P (i1 ≡ i2 ) < 0.5 In this case, the weights w can be determined by the maximum likelihood estimation technique for logistic re- gression. The assumption in this model is that we can use the machine learning method to determine which property is more important for instance matching problem. 1.3 Link to the system and parameters file The RiMOM system and configuration files (2016 version) can be found at https:// drive.google.com/file/d/0BzqVVt4Q8YUuaHpseWJOZkI4MnM/view? usp=sharing. 2 Results The Instance Matching track contains three tracks and seven subtasks. RiMOM-2016 participate in all of these tracks, and we will present the results and related analysis in this section. 2.1 SABINE Track There are two subtasks in this track: Inter-linguistic mapping and Data linking. Table 1 is the result for Inter-linguistic mapping task and Table 2 is for Data linking task. Inter-linguistic mapping is a cross-lingual task between English and Italian. As shown in the result, RiMOM preform well in this task. Data linking task requires participants to link the entity to DBpedia, and RiMOM get high Recall but low Precision in this task. Tool Precision Recall F-measure Tool Precision Recall F-measure LogMapIm 0.012 0.016 0.014 LogMapIm NaN 0.000 NaN AML 0.919 0.916 0.917 AML 0.926 0.855 0.889 LogMapLite 0.358 0.153 0.214 LogMapLite NaN 0.000 NaN RiMOM 0.955 0.932 0.943 RiMOM 0.424 0.917 0.580 Table 1. The result for Inter-linguistic mapping Table 2. The result for Data linking 2.2 SYNTHETIC Track There are two subtasks in this track: UOBM and SPIMBENCH. Each subtask contains two data set in different size: sandbox is small data set while mainbox is a large one. Table 3 4 5 6 show the final results in this track. We think RiMOM produce satisfactory results in all of the subtasks. Tool Precision Recall F-measure Tool Precision Recall F-measure LogMapIm 0.701 0.207 0.320 LogMapIm 0.625 0.023 0.044 AML 0.785 0.577 0.665 AML 0.509 0.515 0.512 RiMOM 0.771 0.877 0.821 RiMOM 0.443 0.516 0.477 Table 3. The result for UOBM sandbox Table 4. The result for UOBM mainbox 2.3 DOREMUS Track This track contains three subtasks: 9-heterogeneities, 4-heterogeneities, and False Pos- itive Trap. Table. 7 shows the final result in this track. Tool Precision Recall F-measure Tool Precision Recall F-measure LogMapIm 0.958 0.766 0.851 LogMapIm 0.981 0.695 0.814 AML 0.907 0.749 0.820 AML 0.900 0.747 0.816 RiMOM 0.984 1.000 0.992 RiMOM 0.991 1.000 0.995 Table 5. The result for SPIMBENCH sandbox Table 6. The result for SPIMBENCH mainbox Sub-task Precision Recall F-measure 9-heterogeneities 0.813 0.813 0.813 4-heterogeneities 0.746 0.746 0.746 False Positive Trap 0.707 0.707 0.707 Table 7. The result for DOREMUS Track 2.4 Discussions on the way to improve the proposed system Our system can only align two ontologies at a time, and we think it will be a significant improvement if we can develop a system which is able to align several ontologies simul- taneously. In addition, in cross-lingual environment, our system still rely on the machine translation. In this case, we hope to develop a method which is language-independent. 3 Conclusion and future work In this paper, we present the system of RiMOM in OAEI 2016 Campaign. We partici- pate all of the three tracks in instance matching track this year. We described specific techniques we used in the task. In our project, we design a new framework to align instances in different languages. The results turn out that our method is effective. In the future, we will make great efforts to improve our system continuously. 4 Acknowledgement The work is supported by 973 Program (No.2014CB340504), NSFC-ANR (No.61261130588),and NSFC key project(No.61533018), Tsinghua University Initiative Scientific Research Program (No.20131089256) and THU-NUS NExT Co-Lab. References 1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: Db- pedia - A crystallization point for the web of data. J. Web Sem. 7(3) (2009) 154–165 2. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194 (2013) 28–61 3. Mahdisoltani, F., Biega, J., Suchanek, F.: Yago3: A knowledge base from multilingual wikipedias. In: 7th Biennial Conference on Innovative Data Systems Research, CIDR Con- ference (2014) 4. Wang, Z., Li, J., Wang, Z., Li, S., Li, M., Zhang, D., Shi, Y., Liu, Y., Zhang, P., Tang, J.: Xlore: A large-scale english-chinese bilingual knowledge graph. In: Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013. (2013) 121–124 5. Zhang, Y., Li, J.: Rimom results for oaei 2015. Ontology Matching (2015) 185 6. Li, J., Tang, J., Li, Y., Luo, Q.: Rimom: A dynamic multistrategy ontology alignment frame- work. IEEE Trans. Knowl. Data Eng. 21(8) (2009) 1218–1232 7. Hosmer, D.W., Lemeshow, S.: Introduction to the logistic regression model. Applied Logistic Regression, Second Edition (2000) 1–30