ATBox Results for OAEI 2020 Sven Hertling[0000−0003−0333−5888] and Heiko Paulheim[0000−0003−4386−8195] Data and Web Science Group, University of Mannheim, Germany {sven,heiko}@informatik.uni-mannheim.de Abstract. ATBox matcher is a scalable system for instance (Abox) and schema (Tbox) matching. It uses two pipelines for generating candidates for the schema and instance matching, and utilizes the schema matches to further improve the instance correspondences. Using a string blocking method, ATBox is able to align large ontologies and can run on OAEI tracks like largebio and knowledge graph. The results look promising, but further features for better finding correct instance matches can be developed. Keywords: Ontology Matching · Knowledge Graph 1 Presentation of the system Nearly all systems submitted to the Ontology alignment Evaluation Initiative (OAEI) are able to align ontologies, schemas, or Tboxes, as they are called in de- scription logics (DL). On the other hand, there are more and more instance tracks like spimbench, link discovery, geolink cruise, and knowledge graph, matching instances, or Aboxes, becomes equally important. The matcher presented in this paper, called ATBox, focuses on both the Abox and Tbox. Especially the knowledge graph track needs scalable systems which can deal with hundred of thousands of instances [4]. Thus, the basis of this matcher is a blocking approach, which focuses on high recall. Its result is succesively fine tuned to increase the precision. Given this design, ATBox is also able to match large knowledge graphs like DBpedia [1] or YAGO [6]. 1.1 State, purpose, general statement The overall matching strategy of ATBox is shown in figure 1. The Tbox and Abox have different processing pipelines but the correspondences are combined in the end to get the final alignment. Tbox matching is applied for all classes and properties (owl:ObjectProp- erty, owl:DatatypeProperty, and rdf:Property). They are retrieved by the jena1 methods OntModel.listClasses() and OntModel.listAllOntProperties(). 0 Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 1 https://jena.apache.org if #entities < 10,000 TBox1 Stopword Synonym String Matching Extraction Extension TBox 2 final alignment Cardinality Filter Instance Filter ABox1 Similar Neighbors String Matching Type Filter Filter ABox 2 Cosine Similarity Common Filter Properties Filter Fig. 1. Overview of the ATBox matcher strategy. The Tbox matching (classes and properties) starts with the stopword ex- traction. In some cases the labels and/or fragments (which we define as the part after the last hashtag symbol # or slash /) contains tokens which appears very often like class, infobox etc. If such tokens appears in more than 20 % of all classes/properties (considered separately), then it is extracted as a corpus spe- cific stop word. In case there are many such stop words, they are restricted to the five most occurring ones. The synonyms (used during string matching) are extracted from the English Wiktionary to cover many different domains. The extraction is done with DB- nary [8], a dataset containing Wiktionary as RDF. The extraction process starts with all resources of type dbnary:Page 2 within the English domain 3 . Then we follow the describes relation and extract all resources connected with property synonym. Furthermore we follow the relation sense to also find all the given senses and their synonyms. The lemmas are extracted directly from the URI. Table 1. String processing steps in ATBox matcher for schema matches. Processsing Confidence Levenshtein equality 1.0 no normalize 0.9 no normalizeParentheses 0.8 no defaultStopwords 0.7 no corpusStopwords 0.6 yes synonyms 0.5 no 2 http://kaiko.getalp.org/dbnary#Page 3 http://kaiko.getalp.org/dbnary/eng/ The string matching contains multiple different steps which are shown in ta- ble 1. All processing applies to rdfs:label and in case it is missing to the URI fragment. If the extracted text is exactly the same, the generated correspon- dence has a confidence of 1.0. During the normalization process, a word written in camel case4 is separated with whitespace (e.g. hasAge to has Age) and after- wards lowercased. In case some UTF-8 characters are not normalized, we apply a normalization step for them (e.g. an accented character can be encoded in multiple different ways in UTF-8). All possible punctuations are furthermore removed and multiple whitespaces are combined into one. In case the normal- ized text matches, a confidence of 0.9 is assigned. In the normalizeParentheses step, all text within parentheses is removed. If the remaining normalized text (same as in normalize step) is equal, it assigns a confidence of 0.8. The reason behind is that many articles in KGs define concepts with same names to have the discriminating term in parentheses e.g. “Harry Potter (character)” and “Harry Potter (film series)”. DefaultStopwords removes a given set of stopwords while keeping all other processing steps as before (confidence is 0.7). In the last pro- cessing step, the corpus specific stopwords, extracted before, are also removed and additionally allow a levenshtein distance[7] of 1 (but only in case the text is longer than 6 characters). In case it matches a correspondence with confidence of 0.6 is generated. If the amount of concepts are less than 10,000 for source and target, then a synonym step is added with a confidence of 0.5. In this step, the extracted synonyms are used to replace (possibly multiple) tokens with all available synonyms. All string processing steps are executed in order starting with the highest con- fidence. If a match is found the remaining steps are also executed to find possible other candidates. As an example, a correspondence like is already found, then the processing continues and also add to the resulting alignment. The instance matching (Abox - shown in the lower part of the figure 1) starts directly with the string matching component. It reuses the processing steps described in the previous section without the corpus dependent stopword removal and synonym replacement. The applied steps are shown in table 2. The first four steps applies to the rdfs:label and if it is missing to the fragment of the URI. The confidence is decreasing with a step size of 0.1 starting with 1.0. In the second part, the additional properties skos:prefLabel and skos:altLabel are taken into account. If they match, the confidence is set to maximally 0.6 depending in which preprocessing step the match occurs. Once again, we allow matches which a lower confidence, even when a correspondence with a higher confidence is found. This increases the recall because it might be the case that the matched entity with a high confidence is not the best available match. The string processing step generated an alignment with a high recall. All following steps try to increase the precision by generating additional confi- dences for each correspondence. This helps at the end of the processing pipeline to enforce a one to one alignment and selecting the right correspondence in 4 https://en.wikipedia.org/wiki/Camel_case Table 2. Processing steps for generating instance matches. Processsing Confidence Property equality 1.0 rdfs:label (or fragment) normalize 0.9 rdfs:label (or fragment) normalizeParentheses 0.8 rdfs:label (or fragment) defaultStopwords 0.7 rdfs:label (or fragment) equality 0.6 + skos:preflabel, skos:altLabel normalize 0.5 + skos:preflabel, skos:altLabel normalizeParentheses 0.4 + skos:preflabel, skos:altLabel defaultStopwords 0.3 + skos:preflabel, skos:altLabel case there are multiple target entities for one source entity (or the other way around). Thus the following filters only add additional confidences (with the addAdditionalConfidence function of YAAA [5]) and do not yet remove any correspondences: – Similar Neighbors Filter – Cosine Similarity Filter – Common Properties Filter – Type Filter All these filters are explained in the following. The similar neighbors filter uses the instance alignment (generated by the previous string processing step) to count for each instance correspondence how many resources or literals are shared between the two instances. Figure 2 shows an example where two neighbors are detected for correspondence because the literal “blue” and the resource “Gryffindor” is shared. Note that the properties are not taken into account (which is done later by the common properties filter). Thus we do not need a mapping of property “eyeColor” to “eye”. We further exclude the properties rdfs:label and skos:altLabel and all properties which have the same literal as those. This will not count the literals which just repeats the name of the resource with a different (maybe not matched) property like “name”. Two literals are the same when their lowercased lexical value is equal. The additional confidence is the absolute amount of neighbors. The cosine similarity filter compares text which is extracted from instances. It is generated by iterating over all literals and checking if the datatype of it is xsd:string,rdf:langString or if the literal has a language tag. All lexical representations of such literals are concatenated to generate a textual represen- tation. These representations are then compared with a cosine similarity which is added to the correspondence. The common properties filter checks for each instance correspondence the number of shared properties. This heavily relies on already matched schema because all properties with the same URI are excluded beforehand. Thus we only check if the instances share some matched properties regardless of their objects. The number of overlap is then added to the correspondence. KG 1 KG 2 two:Gryffindor one:Gryffindor (House) “blue“ one:house two:house one:Harry_ two:Harry_ Potter Potter one:bloodStatus two:father “blue“ one:Half two:James_ -blood Potter Fig. 2. The similar neighbors filter would assign two neighbors for the correspon- dence because of literal “blue” and the already matched entites one:Gryffindor and two:Gryffindor(House). The type filter is similar to the neighbors filter but only checks if the types (retrieved by rdf:type) actually overlap. This again requires already matched classes. The absolute overlap is added as an additional confidence. The final step during instance matching is to actually filter these correspon- dences and create a one to one alignment. This instance filter sorts the correspon- dences by confidence (which is initially set by the string matching) and iterating over it. If a source or target resource is already matched, then it continues with the next correspondence. In all other cases it checks if there is a correspondence in the whole instance alignment which should be used instead. The criteria for being better is fixed to have greater values in two additional confidences. As a last step, all correspondences are combined and a final cardinality filter ensures a one to one alignment by comparing the confidence scores. 1.2 Specific techniques used We used the following matching components of MELT [5]: – ScalableStringProcessingMatcher – StopwordExtraction – SimilarNeighborsFilter – CommonPropertiesFilter – CosineSimilarityConfidenceMatcher – SimilarTypeFilter – NaiveDescendingExtractor 1.3 Adaptations made for the evaluation ATBox matcher is also available as a SEALS package. Due to clashes of depen- decies of SEALS and ATBox, we decided to use the external SEALS packaging mechanism of the MELT framework[5]. It generates an intermediate matcher which executes an external process which runs in its own java virtual machine (JVM). Thus different versions of dependencies are not a problem. 1.4 Link to the system and parameters file ATBox matcher can be downloaded from https://www.dropbox.com/s/q57rzoec9zeumi2/ATBox.zip?dl=0. 2 Results This section discusses the results of ATBox for each track of OAEI 2020 where the matcher is able to produce results. The following tracks are included: anatomy, conference, largebio, phenotype, and knowledge graph track. Specific matching strategies and interfaces for the interactive and complex track are currently not implemented and are thus not described. Due to no multi language support, the multifarm track is also excluded. 2.1 Anatomy ATBox could achive a slightly higher F-measure than the baseline (0.799 vs 0.766). Even though a synonym step is included in the matcher, the recall is only at 0.671 but therefor a high precision of 0.987 could be achieved (third best value). Some examples were the matcher could find some non-trivial matches are: – The first three have a confidence of 0.5 and thus the matches are mainly generated by synonym replacements. The last two contain different spellings like “grey” and “gray”. They are matched because the levenshtein distance is one between the two strings. Some examples where the synonym step yields wrong results are: – This shows that not only true positives are generated and it is also the reason why the correspondence has a low confidence. 2.2 Conference In the conference track ATBox matcher (0.56) is a bit better in terms of F- Measure than the baselines edna (0.54) and StringEquiv (0.52) when using the ra2-M3 evaluation. It covers the class and property alignments (M3) and uses the ra2 reference alignment which is a transitive closure of the original refer- ence alignment ra1[9]. Analyzing the precision/recall triangular graph which is based on the same evaluation dataset is can easily be seen that ATBox matcher has the best tradeoff between recall and precision. The reason is mainly the higher recall and the lower precision which is not easily avoidable. The schema matching capabilities of ATBox are rather limited and thus only the synonym expansion helps a lot. The ontology specific stopwords do not help here be- cause they do not exist in the given dataset. Some examples where the synonym step help: , , , and . The levenshtein distance helps finding and . Furthermore ATBox is one of the seven matching systems which returns a wide variation of confidence values. 2.3 Largebio ATBox matcher is one of six systems which are able to run on all six test cases and return meaningful alignments. It was consequently the second fastest system after LogMapLt. The results are very good in terms of precision but the recall is to low to compete with the other participants. Only in the FMA-SNOMED small fragments test cases the presented matcher could perform better than Wiktionary and LogMapLt. 2.4 Phenotype In this track the presented matcher only returns 759 correspondences for the first task HP-MP and 1,318 correspondences for the second task DOID-ORDO. The evaluation result thus contains a low recall of 0.298 respectively 0.333. Together with a high precision, a F-measure of 0.457 and 0.498 can be achieved. This is probably due to the missing background knowledge because LogMapBio uses BioPortal, LogMap uses spelling variants of SPECIALIST lexicon, and AML uses three sources (Uberon, DOID, and MeSH). All these systems achieve a higher recall than ATBox. Nevertheless in task HP-MP we could rank higher than ALOD2Vec and Wiktionary. 2.5 Biodiv In the Biodiv track ATBox could only return results in FLOPO-PTO test case. Once again the F-measure of 0.714 is much better than those of Wiktionary and ALOD2Vec but less than all LogMap variants and AML. 2.6 Knowledge Graph ATBox could score int the overall evaluation(which contains classes, proper- ties, and instances) the second highest F-measure score of 0.85 together with AML. Only ALOD2Vec and Wiktionary scores 0.01 better. When matching only classes, the presented matcher is the second best system after AML and for properties it is the best matcher. The instance matching pipeline is helpful for finding the correct correspondences but with 0.84 it is a bit below AML (0.85), ALOD2Vec (0.87), and Wiktionary (0.87). 3 General comments 3.1 Discussions on the way to improve the proposed system We would like to increase the number of feature generators. For example, all texts connected to an instance could be compared not only with cosine simi- larity but also with a BERT classifier[2]. Another feature would be to compare images associated with the instances to further distinguish true positive from false positive correspondences. Furthermore the schema matches could be improved with the help of all instance correspondences as already shown in DOME matcher [3]. 4 Conclusions In this paper, we have analyzed the results of ATBox matcher in OAEI 2020. It shows that the system is very scalable and can generate class, property and instance alignments. It usually has a high precision but on some tracks like Largebio, Phenotype, and Biodiv the recall can be increased by utilizing external knowledge despite the already used synonym lexicon from Wiktionary. Most of the used matching components are furthermore included in the MELT framework[5] to allow other system developers to reuse them. References 1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: A nucleus for a web of open data. In: The semantic web, pp. 722–735. Springer (2007) 2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 3. Hertling, S., Paulheim, H.: Dome results for oaei 2019. OM@ ISWC 2536, 123–130 (2019) 4. Hertling, S., Paulheim, H.: The knowledge graph track at oaei - gold standards, baselines, and the golden hammer bias. In: The Semantic Web: ESWC 2020. pp. 343–359 (2020) 5. Hertling, S., Portisch, J., Paulheim, H.: Melt - matching evaluation toolkit. In: SEMANTICS. Karlsruhe. (2019) 6. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: A spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence 194, 28–61 (2013) 7. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and re- versals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966) 8. Sérasset, G.: Dbnary: Wiktionary as a lemon-based multilingual lexical resource in rdf. Semantic Web 6(4), 355–361 (2015) 9. Zamazal, O., Svátek, V.: The ten-year ontofarm and its fertilization within the onto-sphere. Journal of Web Semantics 43, 46–53 (2017)