
ATBox Results for OAEI 2021

Data and Web Science Group, University of Mannheim, Germany

ATBox matcher is a system for matching the instances (Abox) as well as the schema (Tbox) of two given knowledge graphs (KGs). The focus of this matcher is on scalability, so that it can handle large tasks such as the Knowledge Graph and Large Bio tracks. ATBox participates in the OAEI for the second time. This paper describes the basic system as well as its improvements. For matching, two pipelines (schema and instance) generate candidates. The schema matches are then used to further improve the instance alignments.

Keywords: Ontology Matching · Knowledge Graph

1.1 State, purpose, general statement

The overall matching strategy of ATBox is shown in Figure 1. The Tbox and Abox have different processing pipelines, but the correspondences are combined in the end to get the final alignment. One of the main differences in comparison to the system submitted last year is the additional bounded path matching for classes.

[Figure 1: Matching strategy of ATBox. Inputs: TBox 1, TBox 2, ABox 1, ABox 2. Components shown: Stopword Extraction, String Matching, Synonym Extension, Bounded Path Matching, String Matching, Instance Filter, Type Filter, Common Properties Filter, Similar Neighbors Filter, Cosine Similarity Filter, Cardinality Filter. Output: final alignment.]

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

We first look at the Tbox matching. It is applied to all classes and properties (owl:ObjectProperty, owl:DatatypeProperty, and rdf:Property). They are retrieved with the Jena¹ methods OntModel.listClasses() and OntModel.listAllOntProperties().

The first step is to extract KG-specific stopwords, because in some cases the labels and/or fragments contain tokens that appear very often, such as class or infobox. If a token appears in more than 20% of all classes/properties, it is assumed to be a stopword.
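The stopword rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ATBox implementation: the tokenizer, the function names `tokenize` and `extract_stopwords`, and the toy classes are hypothetical. Only the 20% threshold is taken from the text.

```python
import re
from collections import Counter


def tokenize(label):
    """Split a label or URI fragment into lowercase tokens (hypothetical tokenizer)."""
    # Insert a space at camelCase boundaries, then split on non-alphanumerics.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def extract_stopwords(resource_labels, threshold=0.2):
    """Return tokens that occur in more than `threshold` of all resources.

    resource_labels maps a resource identifier to its labels/fragments.
    Each token is counted at most once per resource.
    """
    resource_freq = Counter()
    for labels in resource_labels.values():
        tokens = set()
        for label in labels:
            tokens.update(tokenize(label))
        resource_freq.update(tokens)
    total = len(resource_labels)
    return {tok for tok, n in resource_freq.items() if total and n / total > threshold}


# Toy data (made up for illustration): "class" occurs in 3 of 6 resources.
classes = {
    "ex:PersonClass": ["Person class"],
    "ex:PlaceClass": ["Place class"],
    "ex:EventClass": ["Event class"],
    "ex:Book": ["Book"],
    "ex:Film": ["Film"],
    "ex:Song": ["Song"],
}
stopwords = extract_stopwords(classes)
print(stopwords)  # {'class'}
```

Counting each token once per resource (rather than once per occurrence) matches the "share of all classes/properties" wording; whether ATBox counts this way is an assumption.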

The synonyms are extracted from the English Wiktionary via DBnary [11]. The extraction process, like the string matching component, is detailed in the previous results paper [3]. After these components, the new bounded path matching is executed. This component matches classes that lie between two already matched classes in a hierarchy. It is thus a structural approach that requires already matched resources. Figure 2 shows an example: the class Book is matched to the class Books, and novel to Novel. With this information, the classes in between become a candidate for another correspondence, which is added with the average confidence of the other two correspondences.
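A minimal sketch of the bounded path idea follows. It rests on several assumptions that the text does not confirm: each class has a single direct superclass, and a new correspondence is only created when exactly one class lies between the matched classes on each side (as in the Figure 2 example). The function names and the confidence values are hypothetical.

```python
def classes_between(descendant, ancestor, parent_of):
    """Return the classes strictly between descendant and ancestor, or None if no path.

    parent_of maps a class to its single direct superclass (simplifying assumption).
    """
    between, seen = [], {descendant}
    current = parent_of.get(descendant)
    while current is not None and current != ancestor:
        if current in seen:  # guard against cyclic hierarchies
            return None
        seen.add(current)
        between.append(current)
        current = parent_of.get(current)
    return between if current == ancestor else None


def bounded_path_matches(alignment, parent_one, parent_two):
    """Propose new correspondences for classes enclosed by two existing matches.

    alignment maps (class_one, class_two) to a confidence value.
    """
    proposed = {}
    items = list(alignment.items())
    for (top1, top2), conf_top in items:
        for (bottom1, bottom2), conf_bottom in items:
            if (top1, top2) == (bottom1, bottom2):
                continue
            mid1 = classes_between(bottom1, top1, parent_one)
            mid2 = classes_between(bottom2, top2, parent_two)
            # Assumption: only the unambiguous case of one class in between is handled.
            if mid1 and mid2 and len(mid1) == 1 and len(mid2) == 1:
                pair = (mid1[0], mid2[0])
                if pair not in alignment:
                    proposed[pair] = (conf_top + conf_bottom) / 2
    return proposed


# Hierarchies modeled on the Figure 2 example; confidences are made up.
parent_one = {"one:novel": "one:Fiction Book", "one:Fiction Book": "one:Book"}
parent_two = {"two:Novel": "two:Entertainment", "two:Entertainment": "two:Books"}
alignment = {("one:Book", "two:Books"): 1.0, ("one:novel", "two:Novel"): 0.5}

new_matches = bounded_path_matches(alignment, parent_one, parent_two)
print(new_matches)  # {('one:Fiction Book', 'two:Entertainment'): 0.75}
```

How longer or unequal paths are treated is left open here, since the text only describes the single-class case.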

The instance matching (Abox, shown in the lower part of Figure 1) is unchanged from the last submission. As a last step, all correspondences are combined, and a final cardinality filter ensures a one-to-one alignment by comparing the confidence scores.
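One simple way to obtain a one-to-one alignment from confidence scores is a greedy filter, sketched below. The greedy strategy is an assumption: the text only states that confidences are compared, and other strategies (for example an optimal assignment) are possible. The function name and the toy correspondences are hypothetical.

```python
def cardinality_filter(correspondences):
    """Keep a one-to-one subset of (source, target, confidence) triples.

    Greedy strategy: process correspondences by descending confidence and
    keep one only if neither its source nor its target is already used.
    """
    used_sources, used_targets = set(), set()
    kept = []
    for source, target, confidence in sorted(
        correspondences, key=lambda c: c[2], reverse=True
    ):
        if source in used_sources or target in used_targets:
            continue
        kept.append((source, target, confidence))
        used_sources.add(source)
        used_targets.add(target)
    return kept


# Toy correspondences (made up for illustration).
candidates = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.7),
    ("a2", "b2", 0.6),
]
filtered = cardinality_filter(candidates)
print(filtered)  # [('a1', 'b1', 0.9), ('a2', 'b2', 0.6)]
```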

1 https://jena.apache.org

[Figure 2: Example of bounded path matching. Ontology one contains one:Book, one:Fiction Book, and one:novel; ontology two contains two:Books, two:Entertainment, and two:Novel. The classes within each ontology are connected by rdfs:subClassOf.]

ATBox matcher is also available as a Docker-based matcher that runs an HTTP endpoint. The matcher is packaged with the MELT framework [5], which generates a Docker image that also contains the code for running a small server. ATBox matcher can be downloaded from https://www.dropbox.com/s/l344aawh0mw6rjm/atmatcher-1.0-web-latest.tar.gz?dl=0.

2 Results

This section discusses the results of ATBox for each track of OAEI 2021 where the matcher is able to produce results. The following tracks are included: anatomy, conference, largebio, phenotype, biodiv, commonKG and knowledge graph track. The results for were not reported this year.

Specific matching strategies and interfaces for the interactive and complex tracks are still not implemented and thus not described. Because ATBox has no multi-language support, the multifarm track is also excluded from the results discussion.

2.1 Anatomy

In comparison to last year's participation, the F-measure slightly decreased from 0.799 to 0.794 but still beats the baseline by a small margin. The matcher is rather precision-oriented and achieves the third highest value after the string baseline, LSMatch, and ALIN. Recall should be improved beyond what synonyms alone provide, and an alignment repair step could be introduced to make the alignment coherent (which is not yet the case).

2.2 Conference

In the conference track, ATBox matcher increased the F-measure from 0.57 to 0.59 using the rar2-M3 evaluation setup [12] (a violation-free version of the entailed reference alignment for classes and properties). This is the third highest value after AML, LogMap, and GMap. Again, recall (0.51) is lower than precision (0.69).

2.3 Largebio

ATBox matcher is able to run on three out of six tasks in largebio. In the first task (FMA-NCI), the presented system returned 2,332 correspondences and scored 0.867 in terms of F-measure.

The third task (FMA-SNOMED) could be solved in 30 seconds, which is the third best time in this test case. In this short time, the matcher returned 6,226 correspondences. Only the LogMap matcher family and AML have better results, but they also need more time.

FMA-SNOMED is also the only task in which the whole ontologies could be matched. This results in a higher runtime of 77 seconds. Unfortunately, recall was low (0.206), so only a small share of the correct mappings was found.

Overall, the system needs to be tuned to find more correspondences (also in larger ontologies).

2.4 Phenotype

In the phenotype track, the presented matcher is able to run on the HP-MP task but not on DOID-ORDO. We will investigate which components prevent a successful run on the latter task.

For the HP-MP task, the matcher was again quite fast. Only AML and LogMap achieve better results, and the differences in terms of F-measure are quite large (0.454 in comparison to AML with 0.804 and LogMap with 0.818).

2.5 Biodiv

In the Biodiv track, ATBox scored differently on the four given tasks. For the envo-sweet task, only a score of 0.671 could be achieved, but for the anaeethes-gemet task ATBox is the second best matcher with 0.748. Furthermore, it is by far among the fastest matchers, together with LogMapLt (which has a much lower F-measure on the second task).

For the agrovoc-nalt and ncbitaxon-taxrefld tasks, our matcher could not produce any result. We will investigate this further so that the system is able to match these tasks in the upcoming campaign.

2.6 Common Knowledge Graphs

This is a new track which was introduced in OAEI 2021. The task is to align classes between NELL and DBpedia. NELL has 134 classes and 1,184,377 instances whereas DBpedia has 138 classes and 631,461 instances.

ATMatcher is the second best matcher, together with ALOD2Vec and Wiktionary, with an F-measure of 0.89. Only KGMatcher (0.94) could find more correct correspondences. For this track, it would help to find classes based on the instance matches, as already done by the DOME matcher. The current version of ATMatcher only uses the classes to improve the instance correspondences. In the next version, we plan to add this component to increase the capabilities of the matcher.

2.7 Knowledge Graph

The results of ATBox are similar to those of previous years because the class hierarchy in this track is not deep, which leaves little room for the new bounded path matching. One possibility would be to use the categories (connected with the property dcterms:subject²) as an additional type of class information.

The F-Measure is 0.85 which is only slightly higher than the baseline using label and alternative label (0.84). Only ALOD2Vec and Wiktionary can improve on these results (both 0.87).

Regarding runtime, ATMatcher is the fastest system, needing only 20 minutes for all test cases. Only the baselines are faster; they usually need 11 minutes.

The confidences of the overall KG track alignment are visualized in Figure 3 (generated with the MELT dashboard [9]). The different hard-coded confidence values are clearly visible; the figure shows that the values 0.4 and 0.5, like 0.8, yield many false positives.

² http://purl.org/dc/terms/subject

3 General comments

3.1 Discussions on the way to improve the proposed system

We would like to extend the matching pipeline with further components such as a transformer-based [1,6] comparison between textual representations of resources.

Such a comparison only works when already created correspondences need a more precise, text-based confidence. It does not retrieve any new correspondences, because comparing all resources in a cross-product manner is too expensive. One way to mitigate this problem is to use sentence transformers [10]. They embed the text in a high-dimensional space and thus allow retrieving the top-k neighbors of a given resource.
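The retrieval step can be illustrated with a small top-k search over embedding vectors. The sketch below is an assumption-laden toy: the three-dimensional vectors are hard-coded stand-ins for embeddings that a sentence transformer would produce, the function names are hypothetical, and a real system would likely use a nearest-neighbor index instead of scanning all candidates.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(query, candidates, k=2):
    """Return the k candidates most similar to query as (id, similarity) pairs.

    candidates maps a resource identifier to its embedding vector.
    """
    scored = ((cosine(query, vec), rid) for rid, vec in candidates.items())
    return [(rid, sim) for sim, rid in heapq.nlargest(k, scored)]


# Toy embeddings (made up); real ones would come from a sentence transformer.
query_embedding = [1.0, 0.0, 0.0]
target_embeddings = {
    "two:Novel": [0.9, 0.1, 0.0],
    "two:Books": [0.5, 0.5, 0.0],
    "two:Film": [0.0, 0.0, 1.0],
}
neighbors = top_k_neighbors(query_embedding, target_embeddings, k=2)
print([rid for rid, _ in neighbors])  # ['two:Novel', 'two:Books']
```

Only the retrieved neighbors would then be passed on as candidate correspondences, which avoids scoring every pair of resources with the more expensive transformer comparison.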

Because most of the returned alignments are not consistent with the ontology, we also plan to include alignment repair steps [7] such as the ALCOMO component [8].

If the resources have images attached, it would also be interesting to compare those; in the KG track, for example, there are instances with an image depicting the concept. With a visual comparison (e.g., whether the same person is shown), the confidence of a correspondence can be further increased.

Furthermore, the schema matches could be improved with the help of instance correspondences, as already shown by the DOME matcher [2].

4 Conclusions

In this paper, we have analyzed the results of ATBox matcher in OAEI 2021. They show that the system is very scalable and can generate class, property, and instance alignments.

Furthermore, most of the matching components used are included in the MELT framework [5] so that other system developers can reuse them.

References

1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

2. Hertling, S., Paulheim, H.: DOME results for OAEI 2019. OM@ISWC 2536, 123–130 (2019)

3. Hertling, S., Paulheim, H.: ATBox results for OAEI 2020. OM@ISWC 2788, 168–175 (2020)

4. Hertling, S., Paulheim, H.: The knowledge graph track at OAEI - gold standards, baselines, and the golden hammer bias. In: The Semantic Web: ESWC 2020. pp. 343–359 (2020)

5. Hertling, S., Portisch, J., Paulheim, H.: MELT - matching evaluation toolkit. In: SEMANTiCS. Karlsruhe (2019)

6. Hertling, S., Portisch, J., Paulheim, H.: Matching with transformers in MELT. In: OM@ISWC (2021)

7. Jimenez-Ruiz, E., Meilicke, C., Grau, B.C., Horrocks, I.: Evaluating mapping repair systems with large biomedical ontologies. Description Logics 13, 246–257 (2013)

8. Meilicke, C.: Alignment incoherence in ontology matching (2011)

9. Portisch, J., Hertling, S., Paulheim, H.: Visual analysis of ontology matching results with the MELT dashboard. In: European Semantic Web Conference. pp. 186–190. Springer (2020)

10. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: EMNLP (2019)

11. Serasset, G.: DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web 6(4), 355–361 (2015)

12. Zamazal, O., Svatek, V.: The ten-year OntoFarm and its fertilization within the onto-sphere. Journal of Web Semantics 43, 46–53 (2017)