RiMOM Results for OAEI 2015

Yan Zhang, Juanzi Li

Tsinghua University, Beijing, China.
z-y14@mails.tsinghua.edu.cn, ljz@keg.tsinghua.edu.cn

Abstract. This paper presents the results of RiMOM in the Ontology Alignment Evaluation Initiative (OAEI) 2015. We participated only in Instance Matching@OAEI2015. We first describe the overall framework of our matching system (RiMOM), then detail the techniques used in the framework for instance matching. Finally, we analyze our results and discuss future work on RiMOM.

1 Presentation of the system

As the infrastructure of the Semantic Web, the knowledge base has become a dominant mechanism for representing data semantics on the Web. Consequently, a large number of ontological knowledge bases have been built and published, such as DBpedia [1], YAGO [2] and Xlore [3]. In a real Semantic Web environment, data is distributed across heterogeneous data sources (ontologies), so knowledge about the same real-world entity is inevitably stored in different knowledge bases. There is therefore a growing need to align different knowledge bases so that complete information about an entity of interest can be obtained easily.

Good results have been achieved in the field of ontology matching [4]. Previous research has mostly focused on aligning the schema elements (i.e. concepts and properties) of knowledge bases. More recently, with the rapid development of the Semantic Web, many large-scale ontologies containing millions of entities have appeared. In these ontologies the number of instances is much larger than the number of other elements (e.g. concepts and properties). For example, DBpedia contains 882,000 instances of 6 main concepts. Large-scale instance matching has thus become a key issue for ontology matching systems.

Compared with schema matching, instance matching has the following characteristics:

1. The number of instances may be enormous.
2. The schema is straightforward.
3. In practice, the knowledge base is updated dynamically.

In consideration of these differences, we propose a large-scale instance matching system, RiMOM. It relies on two major techniques, the inverted index and multi-strategy matching:

1. We index the instances of each of the two knowledge bases by their objects, and then select the instances that share the same keys as candidate instance pairs. This step limits the number of pairs to be compared, which significantly improves the efficiency of the system.
2. We implement several matchers in our instance matching system. These matchers can be executed in parallel, and their results are aggregated according to the characteristics of the source ontologies.

To address the challenges of large-scale instance matching, we propose an instance matching framework, RiMOM-2015 (RiMOM-Instance Matching), which is based on our former ontology matching system RiMOM [5]. The RiMOM-2015 framework is designed specifically for the large-scale instance matching task. It presents a novel multi-strategy method that adapts to different kinds of ontologies, and it employs an inverted index to improve efficiency.

1.1 State, purpose, general statement

This section describes the overall framework of RiMOM. An overview of the instance matching system is shown in Fig. 1. The system includes seven modules: Preprocess, Predicate Alignment, Matcher Choosing, Candidate Pair Generation, Matching Score Calculation, Instance Alignment and Validation. The order in which they run is shown in Fig. 1, and the process is as follows.

Fig. 1. Framework of RiMOM 2015

1. Preprocess: The system begins with Preprocess, which loads the ontologies and parameters into the system. At the same time, the preprocessor collects metadata about the two ontologies, which is used in the later Predicate Alignment and Matcher Choosing steps.
2. Predicate Alignment: In this step, we obtain the alignments of the predicates between the two ontologies. Currently, this step is semi-automatic in our system.
3. Matcher Choosing: The system chooses the one or more most suitable matchers according to the metadata of the ontologies.
4. Candidate Pair Generation: In this step, two instances form a candidate pair when they have the same literal objects on some discriminative predicate.
5. Matching Score Calculation: After the candidate set has been generated, we calculate a more accurate similarity using the algorithm chosen in step 3. In this task, the vector distance similarity is calculated for each candidate pair.
6. Instance Alignment: According to the similarity calculated in step 5, we obtain the final instance alignment.
7. Validation: If a validation data set is available, we evaluate the alignment result in terms of Precision, Recall and F1-Measure.

1.2 Specific techniques used

This year we only participate in the Instance Matching track. We describe the specific techniques used in this track below.

Data Preprocessing: First, we remove stop words such as "a", "of" and "the". Afterwards, we calculate the TF-IDF values of the words in each knowledge base. We also compute some statistics for each predicate in order to identify the important predicates.

Predicate Alignment: The predicates must be aligned before the similarity of instances can be calculated. Predicates can express rich semantics, and one-to-one, one-to-many and many-to-many relationships exist among them. We can find some of the one-to-one relationships by calculating the Jaccard similarity of two predicates, i.e.

\[ \mathrm{sim}(p_i, p_j) = \frac{|O_{p_i} \cap O_{p_j}|}{|O_{p_i} \cup O_{p_j}|} \]

where $p_i$ and $p_j$ are predicates in the two ontologies respectively, and $O_{p_j}$ is the range of the predicate $p_j$.

There are also some one-to-many relationships. We obtain their alignments through manually defined rules, e.g.

\[ \mathrm{object}(p_i) = \sum_{j=1}^{n} \mathrm{object}(p_j) \]

\[ \mathrm{object}(p_i) = \max_{j=1..n} \mathrm{object}(p_j) \]

\[ \mathrm{object}(p_i) = \min_{j=1..n} \mathrm{object}(p_j) \]

Candidate Pair Generation: This step aims to pick a relatively small set of candidate pairs from all possible pairs. Because of the large scale of the knowledge bases, it is infeasible to calculate matching scores for all instance pairs. In our method, we first build an inverted index on the objects. Instance pairs are added to the candidate set when they have common objects. This method may reduce recall, but it also reduces the amount of computation significantly.

Multi-Strategy: We implement several matchers in our system, e.g. a label-based approach and a structure-based approach. In the preprocess step, we compare the schemas of the two ontologies. If the ranges of the predicates are similar, the label-based approach plays the key role in the matching process. Otherwise, when the literal properties are not similar (e.g. the two ontologies are defined in different languages), the label-based approach is not effective. In this case, we obtain supplementary information (e.g. from machine translation or WordNet) or use the structure-based approach.

Similarity Calculation: In the OAEI 2015 instance matching track, the ontologies are all defined in the same language, English. In the tasks in which we took part, author-dis and author-rec, the schemas of the ontologies tend to be similar. We therefore choose the label-based vector distance matcher to calculate the similarity of the instances. It is defined as follows:

\[ L_a = \mathrm{Objects}(I_a) \]

where $I_a$ is an instance and $L_a$ is a list that contains all of the objects of the instance $I_a$.

\[ \mathrm{Sim}(I_a, I_b) = \mathrm{Sim}(L_a, L_b) = \frac{1}{|L_a|} \sum_{O_a \in L_a} \max\{\mathrm{Sim}(O_a, O_b) \mid O_b \in L_b\} \]

where $O_a$ is one of the objects in the list $L_a$. We define the similarity of two instances to be the similarity of their object lists: for each $O_a$ in $L_a$, we find the most similar object $O_b$ in $L_b$. The algorithm used for $\mathrm{Sim}(O_a, O_b)$ varies with the data type of the object.
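The predicate Jaccard similarity, the inverted-index candidate generation and the instance similarity formula above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the RiMOM implementation: the function names (jaccard, candidate_pairs, instance_similarity), the dictionary-based data layout and the exact-match object similarity used in the example are all hypothetical choices made for this sketch.

```python
from collections import defaultdict


def jaccard(objects_i, objects_j):
    """Jaccard similarity of the object sets of two predicates."""
    a, b = set(objects_i), set(objects_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(source, target):
    """Return (source_id, target_id) pairs that share at least one object.

    `source` and `target` map an instance id to its list of objects
    (an assumed layout, chosen for illustration only).
    """
    index = defaultdict(set)  # inverted index: object -> target instance ids
    for inst, objects in target.items():
        for obj in objects:
            index[obj].add(inst)
    pairs = set()
    for inst, objects in source.items():
        for obj in objects:
            for other in index.get(obj, ()):
                pairs.add((inst, other))
    return pairs


def instance_similarity(objs_a, objs_b, object_sim):
    """Average, over the objects of a, of the best match among the objects of b."""
    if not objs_a or not objs_b:
        return 0.0
    best = (max(object_sim(oa, ob) for ob in objs_b) for oa in objs_a)
    return sum(best) / len(objs_a)


if __name__ == "__main__":
    # Toy data, invented for this example.
    source = {"a1": ["Ontology Matching", "2013"], "a2": ["Deep Learning", "2015"]}
    target = {"b1": ["Ontology Matching", "2013"], "b2": ["Graph Mining", "2015"]}

    def exact(x, y):
        # Placeholder object similarity: 1 for identical values, else 0.
        return 1.0 if x == y else 0.0

    for a, b in sorted(candidate_pairs(source, target)):
        print(a, b, instance_similarity(source[a], target[b], exact))
```

Only pairs that share an object are scored, so the number of similarity computations depends on the index rather than on the product of the two instance counts. Note that the formula is asymmetric: it averages over the objects of the first instance only.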
For dates, we use the indicator function, which is 1 when the two dates are the same and 0 otherwise. For some literal properties, such as "title", we compute the cosine similarity of their TF-IDF vectors.

Instance Alignment: After the similarity has been calculated, for each instance in the source ontology we choose the instance with the best score in the target ontology. We then filter the results by a threshold to obtain the final instance alignment.

1.3 Link to the system and parameters file

The RiMOM system (2015 version) can be found at https://www.dropbox.com/s/6bx4pb46ytvddvy/RiMOM.zip?oref=e.

2 Results

The Instance Matching track contains five subtasks. We present the results and related analysis for the two subtasks in which we participated (author-disambiguation and author-recognition) in the following subsections.

2.1 Author Disambiguation sub-task

The goal of the author-dis task is to link OWL instances referring to the same person (i.e., author) based on their publications. The Sandbox (a small-scale data set) can be used to tune our parameters. The class 'author' has only one literal property, 'name', so we must also obtain alignments on the class 'publication'. In the end, we obtain 854 pairs for the Sandbox task and 8428 pairs for the Mainbox task.

System   Expected mappings   Retrieved mappings   Precision   Recall   F-measure
EXONA    854                 854                  0.941       0.941    0.941
InsMT+   854                 722                  0.834       0.705    0.764
Lily     854                 854                  0.981       0.981    0.981
LogMap   854                 779                  0.994       0.906    0.948
RiMOM    854                 854                  0.929       0.929    0.929

Table 1. Results for the Author-dis sandbox

System   Expected mappings   Retrieved mappings   Precision   Recall   F-measure
EXONA    8428                144827               0           0        NaN
InsMT+   8428                7372                 0.76        0.665    0.709
Lily     8428                8428                 0.964       0.964    0.964
LogMap   8428                7030                 0.996       0.831    0.906
RiMOM    8428                8428                 0.911       0.911    0.911

Table 2. Results for the Author-dis mainbox

The reference alignments of the sandbox are provided by the organizers, so we focus our analysis on the mainbox.
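As a reading aid for the tables above, and assuming that the reported F-measure is the usual harmonic mean of precision $P$ and recall $R$ (the formula is not spelled out in this paper), the LogMap mainbox row can be checked as follows:

```latex
F = \frac{2 P R}{P + R},
\qquad
F_{\mathrm{LogMap}} = \frac{2 \times 0.996 \times 0.831}{0.996 + 0.831}
                    = \frac{1.655352}{1.827} \approx 0.906
```

Under the same assumption, the NaN reported for EXONA on the mainbox corresponds to the undefined case $P = R = 0$, where the formula evaluates to $0/0$.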
As shown in Table 2, RiMOM achieves a precision of 0.911, a recall of 0.911 and an F-measure of 0.911 on the author-dis mainbox task, slightly lower than on the sandbox (0.929). Afterwards, we found that the property 'title' plays a key role for publications, so we expect a better result from deeper work on this property.

2.2 Author Recognition sub-task

The goal of the author-rec task is to associate a person (i.e., author) with the corresponding publication report containing aggregated information about the publication activity of the person, such as the number of publications, h-index, years of activity and number of citations. The final goal is similar to that of the author-dis task, but there are some changes in the schema of the ontology. The most notable change is that one-to-many relationships exist between the properties, so we added some manual rules to handle them.

As shown in Table 4, RiMOM obtains an excellent result on the author-rec task. On the author-rec mainbox, precision, recall and F-measure are all 0.999, which indicates that the algorithm we implemented is well suited to this task.

System   Expected mappings   Retrieved mappings   Precision   Recall   F-measure
EXONA    854                 854                  0.518       0.518    0.518
InsMT+   854                 90                   0.556       0.059    0.106
Lily     854                 854                  1.0         1.0      1.0
LogMap   854                 854                  1.0         1.0      1.0
RiMOM    854                 854                  1.0         1.0      1.0

Table 3. Results for the Author-rec sandbox

System   Expected mappings   Retrieved mappings   Precision   Recall   F-measure
EXONA    8428                8428                 0.409       0.409    0.409
InsMT+   8428                961                  0.246       0.028    0.05
Lily     8428                8424                 0.999       0.998    0.999
LogMap   8428                8436                 0.999       1.0      0.999
RiMOM    8428                8428                 0.999       0.999    0.999

Table 4. Results for the Author-rec mainbox

2.3 Discussions on the way to improve the proposed system

Our system needs to align the predicates before instance matching. This process requires scanning all of the instances in the ontology, which can be time-consuming.
In addition, the Predicate Alignment process is semi-automatic: we have to add manual rules to deal with the one-to-many relationships.

We therefore hope to improve our system by devising an algorithm that aligns the predicates automatically and iteratively. First, we use the values of the predicates to align the instances; in turn, the aligned instances help us update the similarity of the predicates. In this way, we gradually obtain the final alignment result.

2.4 Comments on the OAEI 2015 measures

Both tasks are instance matching tasks on a publication data set. We used the reference alignment of the sandbox to tune the parameters, and the results show that our approach is effective. We also found that the inverted index not only improves efficiency but also reduces errors and increases precision.

There are also some aspects we are not satisfied with. Due to time constraints, we did not take part in the other three tasks. We look forward to making progress in the next OAEI campaign.

3 Conclusion and future work

In this paper, we presented the RiMOM system in the OAEI 2015 campaign. We participated in the instance matching track this year and described the specific techniques we used in the task. In our project, we designed a new framework to deal with the instance matching task. The results show that our method is effective and efficient.

In the future, we will develop an iterative algorithm to align the predicates automatically.

References

1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - A crystallization point for the web of data. J. Web Sem. 7(3) (2009) 154–165
2. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194 (2013) 28–61
3. Wang, Z., Li, J., Wang, Z., Li, S., Li, M., Zhang, D., Shi, Y., Liu, Y., Zhang, P., Tang, J.: Xlore: A large-scale English-Chinese bilingual knowledge graph. In: Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013. (2013) 121–124
4. Shvaiko, P., Euzenat, J.: Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng. 25(1) (2013) 158–176
5. Li, J., Tang, J., Li, Y., Luo, Q.: RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Trans. Knowl. Data Eng. 21(8) (2009) 1218–1232