Wiktionary Matcher Results for OAEI 2021

Jan Portisch¹,² [0000-0001-5420-0663] and Heiko Paulheim¹ [0000-0003-4386-8195]

¹ Data and Web Science Group, University of Mannheim, Germany
{jan, heiko}@informatik.uni-mannheim.de
² SAP SE Business Technology Platform - One Domain Model, Walldorf, Germany
jan.portisch@sap.com

Abstract. This paper presents the results of the Wiktionary Matcher in the Ontology Alignment Evaluation Initiative (OAEI) 2021. Wiktionary Matcher is an ontology matching tool that exploits Wiktionary as an external background knowledge source. Wiktionary is a large lexical knowledge resource that is collaboratively built online. Multiple current language versions of Wiktionary are merged and used for monolingual ontology matching by exploiting synonymy relations and for multilingual matching by exploiting the translations given in the resource. This is the third OAEI participation of the matching system.³

Keywords: Ontology Matching · Ontology Alignment · External Resources · Background Knowledge · Wiktionary

1 Presentation of the System

1.1 State, Purpose, General Statement

The Wiktionary Matcher is an element-level, label-based matcher which uses an online lexical resource, namely Wiktionary. The latter is "[a] collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language"⁴. The dictionary is organized similarly to Wikipedia: Everybody can contribute to the project and the content is reviewed in a community process. Compared to WordNet [3], Wiktionary is significantly larger and also available in languages other than English. This matcher uses DBnary [13], an RDF version of Wiktionary that is publicly available⁵. The DBnary dataset makes use of an extended LEMON model [6] to describe the data. For this matcher, recent DBnary datasets for 8 Wiktionary languages⁶ have been downloaded and merged into one RDF graph.
Triples not required for the matching algorithm, such as glosses, were removed in order to increase the performance of the matcher and to lower its memory requirements. As Wiktionary contains translations, this matcher can work on monolingual and multilingual matching tasks. This is the third OAEI participation of this matching system: Wiktionary Matcher previously participated in the OAEI 2019 [9] and in the OAEI 2020 [11]. The matcher has been implemented and packaged using the Matching EvaLuation Toolkit (MELT)⁷, a Java framework for matcher development, tuning, evaluation, and packaging [5,8].

Fig. 1. High-level overview of the Wiktionary Matcher. KG1 and KG2 represent the input ontologies and optionally instances. The final alignment is referred to as A.

³ Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
⁴ see https://web.archive.org/web/20190806080601/https://en.wiktionary.org/wiki/Wiktionary
⁵ see http://kaiko.getalp.org/about-dbnary/download/
⁶ Namely: Dutch, English, French, Italian, German, Portuguese, Russian, and Spanish.
⁷ see https://github.com/dwslab/melt

1.2 Specific Techniques Used

This matching system was initially introduced at the OAEI 2019 [9]. An overview of the matching system is provided in Figure 1. The main techniques used for matching are summarized below.

Monolingual Matching

For monolingual ontologies, the matching system first applies multiple string matching techniques. Afterwards, the synonym matcher module links labels to concepts in Wiktionary and then checks whether the concepts are synonymous in the external dataset. This approach is conceptually similar to an upper ontology matching approach. Concerning the usage of a collaboratively built knowledge source, the approach is similar to WikiMatch [4], which exploits the Wikipedia search engine.
Wiktionary Matcher adds a correspondence to the final alignment purely based on the synonymy relation, independently of the actual word sense. This is done in order to avoid word sense disambiguation on the ontology side but also on the Wiktionary side: Some Wiktionary language versions do not annotate synonyms and translations per sense but rather on the level of the lemma. Hence, many synonyms are given independently of the word sense. In such cases, word sense disambiguation would also have to be performed on Wiktionary [7].

Linking labels of entities to Wiktionary is carried out as follows: The full label is looked up in the knowledge source. If the label cannot be found, labels consisting of multiple word tokens are truncated from the right and the process is repeated to check for sub-concepts. This makes it possible to detect long sub-concepts even if the full string cannot be found. The label conference banquet of concept http://ekaw#Conference_Banquet from the Conference track, for example, cannot be linked to the background dataset using the full label. However, by applying right-to-left truncation, the label can be linked to two concepts, namely conference and banquet, and subsequently also be matched to the correct concept http://edas#ConferenceDinner, which is linked in the same fashion. For multi-linked concepts (such as conference dinner), a match is only annotated if every linked component of the label is synonymous to a component in the other label. Therefore, lens (http://mouse.owl#MA_0000275) is not mapped to crystalline lens (http://human.owl#NCI_C12743) due to a missing synonymous partner for crystalline, whereas urinary bladder neck (http://mouse.owl#MA_0002491) is matched to bladder neck (http://human.owl#NCI_C12336) because urinary bladder is synonymous to bladder.

Multilingual Matching

For every matching task, the system first determines the language distributions in the ontologies.
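As an illustration of the monolingual label-linking step described earlier in this section (full-label lookup, right-to-left truncation, and the rule for multi-linked labels), the following is a minimal Python sketch, not the matcher's actual Java implementation. The toy dictionary TOY_ENTRIES, the synonym pairs in TOY_SYNONYMS, and the function names link_label and labels_match are hypothetical. The sketch further assumes that identical entries count as synonymous, that a label with an unlinkable token is not linked at all, and that the "every component needs a synonymous partner" rule is checked in both directions.

```python
from typing import FrozenSet, List, Optional, Set

# Toy stand-ins for the Wiktionary/DBnary lookup (illustrative data only).
TOY_ENTRIES: Set[str] = {
    "conference", "banquet", "dinner",
    "urinary bladder", "bladder", "neck", "lens", "crystalline",
}
TOY_SYNONYMS: Set[FrozenSet[str]] = {
    frozenset({"banquet", "dinner"}),
    frozenset({"urinary bladder", "bladder"}),
}


def link_label(label: str, entries: Set[str]) -> Optional[List[str]]:
    """Link a label to dictionary entries, truncating from the right on a miss."""
    tokens = label.lower().split()
    if not tokens:
        return None
    linked: List[str] = []
    start = 0
    while start < len(tokens):
        end = len(tokens)
        # Try the longest remaining token span first, then drop tokens from the right.
        while end > start and " ".join(tokens[start:end]) not in entries:
            end -= 1
        if end == start:
            return None  # assumption: an unlinkable token makes the label unlinkable
        linked.append(" ".join(tokens[start:end]))
        start = end  # repeat the process on the remaining tokens
    return linked


def are_synonymous(a: str, b: str, synonyms: Set[FrozenSet[str]]) -> bool:
    # Assumption: identical entries are treated as trivially synonymous.
    return a == b or frozenset({a, b}) in synonyms


def labels_match(label_a: str, label_b: str,
                 entries: Set[str], synonyms: Set[FrozenSet[str]]) -> bool:
    """Match only if every linked component has a synonymous partner on the other side."""
    links_a = link_label(label_a, entries)
    links_b = link_label(label_b, entries)
    if links_a is None or links_b is None:
        return False

    def covered(source: List[str], target: List[str]) -> bool:
        return all(any(are_synonymous(s, t, synonyms) for t in target) for s in source)

    return covered(links_a, links_b) and covered(links_b, links_a)
```

With this toy data, conference banquet is linked to the two entries conference and banquet and matches conference dinner, whereas lens does not match crystalline lens because crystalline has no synonymous partner.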
If the ontologies appear to be in different languages, the system automatically enables the multilingual matching module, which exploits Wiktionary translations: A match is created if one label can be translated to the other one according to at least one Wiktionary language version, such as the Spanish label ciudad and the French label ville (both meaning city). This process is depicted in Figure 2: The Spanish label is linked to the entry in the Spanish Wiktionary and from the entry the translation is derived.

If there is no Wiktionary version for the languages to be matched or the approach described above yields very few results, it is checked whether the two labels appear as a translation for the same word. The Chinese label 决定 (juédìng), for instance, is matched to the Arabic label قرار (qrār) because both appear as a translation of the English word decision on Wiktionary. This (less precise) approach is particularly important for language pairs for which no Wiktionary dataset is available to the matcher (such as Chinese and Arabic). The process is depicted in Figure 3: The Arabic and Chinese labels cannot be linked to Wiktionary entries but, instead, appear as translations for the same concept.

Fig. 2. Translation via the Wiktionary headword (using the DBnary RDF graph). Here: One of several French translations for the Spanish word ciudad in the Spanish Wiktionary.

Instance Matching

The matcher presented in this paper can also be used for combined schema and instance matching tasks. If instances are available in the given datasets, the matcher applies a two-step strategy: After aligning the schemas, instances are matched using a string index. As there are typically many instances, Wiktionary is not used for the instance matching task in order to improve the matching runtime performance.
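The two translation-based strategies of the multilingual matching module described earlier in this section (translation via the headword, and two labels appearing as translations of the same word) can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the table TOY_TRANSLATIONS and all function names are hypothetical, the data is illustrative rather than taken from DBnary, and the fallback is applied per label pair, whereas the system described here switches to it when no suitable Wiktionary version exists or the first approach yields very few results overall.

```python
from typing import Dict, Optional, Set, Tuple

Term = Tuple[str, str]  # (language code, written form)

# Toy translation table: headword -> translations listed for it (illustrative only).
TOY_TRANSLATIONS: Dict[Term, Set[Term]] = {
    ("es", "ciudad"): {("fr", "ville")},
    ("en", "decision"): {("zh", "决定"), ("ar", "قرار")},
}


def direct_translation_match(a: Term, b: Term,
                             translations: Dict[Term, Set[Term]]) -> bool:
    """True if one label is listed as a translation of the other label's entry."""
    return b in translations.get(a, set()) or a in translations.get(b, set())


def shared_headword_match(a: Term, b: Term,
                          translations: Dict[Term, Set[Term]]) -> bool:
    """True if both labels appear as translations of the same headword."""
    return any(a in targets and b in targets for targets in translations.values())


def multilingual_match(a: Term, b: Term,
                       translations: Dict[Term, Set[Term]]) -> Optional[str]:
    """Return which strategy produced the match, or None if neither applies."""
    if direct_translation_match(a, b, translations):
        return "direct"
    # Simplification: fall back per label pair to the less precise strategy.
    if shared_headword_match(a, b, translations):
        return "shared-headword"
    return None
```

In this sketch, the pair ciudad/ville is matched by the direct strategy, while the Chinese and Arabic labels are matched only through the shared English headword decision.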
Moreover, the coverage of schema-level concepts in Wiktionary is much higher than that of instance-level concepts: For example, there is a sophisticated representation of the concept movie⁸, but hardly any individual movies in Wiktionary. For correspondences where the instances belong to classes that were matched before, a higher confidence is assigned. If one instance matches multiple other instances, the correspondence is preferred where both their classes were matched before.

Explainability

Unlike many other ontology matchers, this matcher uses the extension capabilities of the alignment format [1] in order to provide a human-readable explanation of why a correspondence was added to the final alignment. Such explanations can help to interpret and to trust a matching system's decision. Similarly, explanations also make it possible to comprehend why a correspondence was falsely added to the final alignment: The explanation for the false positive match (http://confOf#Contribution, http://iasted#Tax), for instance, is given as follows: "The first concept was mapped to dictionary entry [contribution] and the second concept was mapped to dictionary entry [tax]. According to Wiktionary, those two concepts are synonymous." Here, it can be seen that the matcher was successful in linking the labels to Wiktionary but failed due to the missing word sense disambiguation. In order to explain a correspondence, the description property⁹ of the Dublin Core Metadata Initiative is used.

Fig. 3. Translation via the written forms of Wiktionary entries (using the DBnary RDF graph). Here: An Arabic and a Chinese label appear as translation for the same Wiktionary entry (decision in the English Wiktionary).

⁸ see https://en.wiktionary.org/wiki/movie

1.3 Extensions to the Matching System for the 2021 Campaign

For the 2021 campaign, the background knowledge has been updated: The system uses DBnary dumps as of late July 2021.
The Wiktionary knowledge source grew significantly compared to the version used in the OAEI 2020: In total, 7 million new triples were added, an increase of roughly 9%. Besides upgrading the background knowledge, the underlying architecture was also improved: Rather than using a custom Wiktionary component, the 2021 version of the matching system was adapted to use the background knowledge modules that were made available with the release of MELT 3.0 [10]. With these changes, the code base is cleaner and better modularized. Improvements to the Wiktionary module will now benefit all MELT users. It is important to emphasize that these architectural improvements do not change the matching algorithm compared to the 2020 version. The system was, furthermore, adapted to be packaged as a MELT Web Docker¹⁰ container. The implementation is publicly available on GitHub.¹¹

⁹ see http://purl.org/dc/terms/description

2 Results

2.1 Anatomy Track

On the anatomy track, recall and F1 could be slightly improved compared to the 2020 version of the matcher. The system performs at the median of all 2021 systems with an F1 score of 0.843 (precision = 0.956, recall = 0.753).

2.2 Conference Track

The matching system achieves almost the same results as in 2020 on the conference track, with a slightly improved recall. With an F1 score of 0.59 on rar2-M3, the system performs above the median in terms of F1.

2.3 Multifarm Track

The largest overall improvements compared to last year could be observed on the Multifarm track: Here, the F1 score could be improved through a higher recall (the precision fell slightly). Like in the 2020 campaign, Wiktionary Matcher was the system with the overall highest precision and scored third place behind AML and LogMap.

2.4 LargeBio Track

As of writing this article, the results of the LargeBio track were not yet published.
2.5 Knowledge Graph Track

As in 2020, Wiktionary Matcher is the best matching system on the knowledge graph track.¹² The performance numbers did not change compared to the 2020 version of the matcher.

2.6 Common Knowledge Graph Track

This year, a new track was added to the OAEI: the Common Knowledge Graph Track [2]. Although not optimized for this track, Wiktionary Matcher achieved the second-best result in terms of F1 with a score of 0.89.

¹⁰ see https://dwslab.github.io/melt/matcher-packaging/web
¹¹ see https://github.com/janothan/WiktionaryMatcher
¹² ALOD2vec Matcher 2021 [12] achieves the same F1 score; however, as the performance of the latter matcher on classes and properties is slightly worse, Wiktionary Matcher comes in first.

3 General Comments

It is important to note that the matching system currently exploits only a small share of the semantic relations available on Wiktionary. The system is restricted by the available relations extracted by the DBnary project. The additional exploitation of the relations alternative forms or derived terms, for instance, would likely improve the system. However, those are not yet extracted and are consequently not used for the matching task as of today. The improvements observed on Anatomy and Conference are completely due to the updated Wiktionary version since the core matching code was left unchanged.

4 Conclusion

In this paper, we presented the Wiktionary Matcher, a matcher utilizing a collaboratively built lexical resource, as well as the results of the system in the 2021 OAEI campaign. Given Wiktionary's continuous growth, it can be expected that the matching results will continue to improve over time, for example when additional synonyms and translations are added. In addition, improvements to the DBnary dataset, such as the addition of alternative word forms, may also improve the overall matcher performance in the future.

References

1. David, J., Euzenat, J., Scharffe, F., dos Santos, C.T.: The alignment API 4.0. Semantic Web 2(1), 3–10 (2011). https://doi.org/10.3233/SW-2011-0028
2. Fallatah, O., Zhang, Z., Hopfgartner, F.: A gold standard dataset for large knowledge graphs matching. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.) Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 2, 2020. CEUR Workshop Proceedings, vol. 2788, pp. 24–35. CEUR-WS.org (2020), http://ceur-ws.org/Vol-2788/om2020_LTpaper3.pdf
3. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. Language, Speech, and Communication, MIT Press, Cambridge, Massachusetts (1998)
4. Hertling, S., Paulheim, H.: WikiMatch - Using Wikipedia for Ontology Matching. In: Shvaiko, P., Euzenat, J., Kementsietsidis, A., Mao, M., Noy, N., Stuckenschmidt, H. (eds.) OM-2012: Proceedings of the ISWC Workshop. vol. 946, pp. 37–48 (2012)
5. Hertling, S., Portisch, J., Paulheim, H.: MELT - matching evaluation toolkit. In: Acosta, M., Cudré-Mauroux, P., Maleshkova, M., Pellegrini, T., Sack, H., Sure-Vetter, Y. (eds.) Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11702, pp. 231–245. Springer (2019). https://doi.org/10.1007/978-3-030-33220-4_17
6. McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D., Wunner, T.: Interchanging Lexical Resources on the Semantic Web. Language Resources and Evaluation 46(4), 701–719 (Dec 2012). https://doi.org/10.1007/s10579-012-9182-3
7. Meyer, C.M., Gurevych, I.: Worth its weight in gold or yet another resource - A comparative study of Wiktionary, OpenThesaurus and GermaNet. In: Gelbukh, A.F. (ed.) Computational Linguistics and Intelligent Text Processing, 11th International Conference, CICLing 2010, Iasi, Romania, March 21-27, 2010. Proceedings. Lecture Notes in Computer Science, vol. 6008, pp. 38–49. Springer (2010). https://doi.org/10.1007/978-3-642-12116-6_4
8. Portisch, J., Hertling, S., Paulheim, H.: Visual analysis of ontology matching results with the MELT dashboard. In: The Semantic Web: ESWC 2020 Satellite Events (2020)
9. Portisch, J., Hladik, M., Paulheim, H.: Wiktionary matcher. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.) Proceedings of the 14th International Workshop on Ontology Matching co-located with the 18th International Semantic Web Conference (ISWC 2019), Auckland, New Zealand, October 26, 2019. CEUR Workshop Proceedings, vol. 2536, pp. 181–188. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2536/oaei19_paper15.pdf
10. Portisch, J., Hladik, M., Paulheim, H.: Background knowledge in schema matching: Strategy vs. data. In: Proceedings of the International Semantic Web Conference, ISWC 2021 (2021), to appear
11. Portisch, J., Paulheim, H.: Wiktionary matcher results for OAEI 2020. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.) Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 2, 2020. CEUR Workshop Proceedings, vol. 2788, pp. 225–232. CEUR-WS.org (2020), http://ceur-ws.org/Vol-2788/oaei20_paper14.pdf
12. Portisch, J., Paulheim, H.: ALOD2vec Matcher results for OAEI 2021. In: OM@ISWC 2021 (2021), to appear
13. Sérasset, G.: DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web 6(4), 355–361 (2015). https://doi.org/10.3233/SW-140147