=Paper=
{{Paper
|id=Vol-3324/om2022_LTpaper6
|storemode=property
|title=A simple standard for ontological mappings 2022: updates of data model and outlook
|pdfUrl=https://ceur-ws.org/Vol-3324/om2022_LTpaper6.pdf
|volume=Vol-3324
|authors=Nicolas Matentzoglu,Joe Flack,John Graybeal,Nomi L. Harris,Harshad B. Hegde,Charles T. Hoyt,Hyeongsik Kim,Sabrina Toro,Nicole Vasilevsky,Christopher J. Mungall
|dblpUrl=https://dblp.org/rec/conf/semweb/MatentzogluFGHH22
}}
==A simple standard for ontological mappings 2022: updates of data model and outlook==
Nicolas Matentzoglu 1, Joe Flack 2, John Graybeal 3, Nomi L. Harris 4, Harshad B. Hegde 4, Charles T. Hoyt 5, Hyeongsik Kim 6, Sabrina Toro 7, Nicole Vasilevsky 7 and Christopher J. Mungall 4

1 Semanticly, Athens, Greece
2 Johns Hopkins University, Baltimore, MD 21218, USA
3 Stanford University, Stanford, CA 94305, USA
4 Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
5 Harvard Medical School, Boston, MA 02115, USA
6 Robert Bosch LLC
7 University of Colorado Anschutz Medical Campus, Aurora, CO 80217, USA

Abstract

The Simple Standard for Ontological Mappings (SSSOM) was first published in December 2021 (v. 0.9). After a number of revisions prompted by community feedback, we published version 0.10.1 in August 2022. One of the key new features is the use of a controlled vocabulary for mapping-related processes, such as preprocessing steps and matching approaches. In this paper, we give an update on the development of SSSOM since v. 0.9, introduce the Semantic Mapping Vocabulary (SEMAPV) and outline some of our thoughts on the establishment of mapping commons in the future.

Keywords: standards, mappings, ontologies, ontology mapping, FAIR data

1. Introduction

The problem of mapping between entities in databases and ontologies is ubiquitous - from automatically establishing mappings using ontology matching or entity resolution techniques to applying them in the context of data transformation, ontology merging or knowledge graph integration. We define a mapping as the correspondence of one entity (a record in a database, a class in an ontology), i.e., the subject, to another entity, i.e., the object. A semantic mapping is a mapping that further specifies a predicate describing how the subject maps to the object: e.g., is it an “exact match”, or merely “close”? Are the two entities “logically equivalent” in the sense of the Web Ontology Language (OWL)?
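In code, such a semantic mapping is simply a subject-predicate-object triple of entity references. A minimal sketch follows; the field names mirror SSSOM's metadata elements, and the identifiers (a MeSH term and a Disease Ontology term for Alzheimer's disease) are purely illustrative, not taken from the paper:

```python
from dataclasses import dataclass

# A minimal in-memory sketch of a semantic mapping as defined above:
# subject and object are entity references (CURIEs), and the predicate
# states how the subject corresponds to the object. This illustrates
# the idea only; it is not the SSSOM schema itself.
@dataclass(frozen=True)
class SemanticMapping:
    subject_id: str    # e.g. a term in one vocabulary
    predicate_id: str  # how the subject maps to the object
    object_id: str     # e.g. a class in another ontology

m = SemanticMapping("MESH:D000544", "skos:exactMatch", "DOID:10652")
print(m.predicate_id)
```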
Despite their importance, semantic mappings are rarely shared widely outside the tool-developing communities. Standards like EDOAL [1] and the Alignment API [2] have had a huge impact on the field of ontology alignment, being the reference format for the Ontology Alignment Evaluation Initiative (OAEI) community, but they are conceptually focused on ontologies and have only a very limited ability to express detailed metadata on mapping sets (alignments), such as provenance, licensing information and attribution. The Simple Standard for Ontological Mappings (SSSOM) has been proposed as a community standard for sharing semantic mappings between information entities [3]. A mapping in this context is a statement that establishes a correspondence of a subject entity (s) to an object entity (o) via a mapping predicate (p). Individual mappings can be grouped into mapping sets. A special kind of mapping set is an “alignment”, comprising all mappings between two data spaces (ontologies/databases). SSSOM specifies a schema and data model for mapping sets and mappings, and a rich set of metadata elements to describe them.

Proceedings of the 17th International Workshop on Ontology Matching. EMAIL: cjmungall@lbl.gov (C. J. Mungall). ORCID: 0000-0002-7356-1779 (A. 1); 0000-0002-2906-7319 (A. 2); 0000-0001-6875-5360 (A. 3); 0000-0001-6315-3707 (A. 4); 0000-0002-2411-565X (A. 5); 0000-0003-4423-4370 (A. 6); 0000-0002-3002-9838 (A. 7); 0000-0002-4142-7153 (A. 8); 0000-0001-5208-3432 (A. 9); 0000-0002-6601-2165 (A. 10). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).
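For orientation, a small SSSOM mapping set could look like the following sketch: set-level metadata embedded as comment lines, followed by one TSV row per mapping. The header keys, column selection and identifiers shown here are illustrative, not a normative excerpt of the specification:

```text
#curie_map:
#  HP: http://purl.obolibrary.org/obo/HP_
#  MP: http://purl.obolibrary.org/obo/MP_
#  semapv: https://w3id.org/semapv/vocab/
#mapping_set_id: https://example.org/mappings/hp-mp-demo.sssom.tsv
#license: https://creativecommons.org/licenses/by/4.0/
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification
HP:0000478	Abnormality of the eye	skos:exactMatch	MP:0005391	vision/eye phenotype	semapv:LexicalMatching
```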
Publishing mappings in a standard format with appropriate provenance, whether they are automatically generated by ontology matchers, manually curated or both, drastically reduces the burden of re-curating mappings between the same ontologies over and over again, and provides a “point of convergence”: standard mapping sets curated and maintained in much the same way as ontologies and vocabularies are curated - as FAIR semantic artefacts [4]. In this paper, we describe the most significant changes to the SSSOM standard since its initial publication in December 2021, including:

1. The move from modelling mapping metadata as “type” information, e.g. this mapping corresponds to a “lexical match”, to modelling it as “mapping justifications”, e.g. this mapping is supported by “lexical matching”.
2. The establishment of a bespoke semantic mapping vocabulary that formally defines certain key mapping activities, such as types of matching approaches.
3. A first concrete proposal to define mapping registries, collections of mapping sets, as a backbone for what we envision as mapping commons - community-driven spaces that curate and reconcile mappings and enrich them with metadata.

2. Updates to the SSSOM standard since December 2021

The following changes have been made since version 0.9 of the SSSOM standard. At the time of writing, the current version is 0.10.1.

2.1. Key changes to the SSSOM data model

Requiring sources to be recorded as entity references. We now require entity references instead of strings to denote source information for the subjects and objects in a mapping (subject_source, object_source). For example, we previously permitted the string “UBERON” to be used to declare that the subject is part of the Uberon ontology [5], but now require (ideally resolvable) entity references instead, e.g. obo:uberon to denote the Uberon ontology or wikidata:Q465 to denote DBpedia.
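Whether a value is a bare string or an entity reference can be checked mechanically. Below is a minimal sketch, assuming a CURIE-shaped reference of the form prefix:localname; the regular expression is an approximation for illustration, not the SSSOM grammar:

```python
import re

# A CURIE-shaped entity reference: a prefix, a colon, then a non-empty
# local part. Illustrative approximation, not the normative grammar.
CURIE = re.compile(r"^[A-Za-z_][\w.-]*:\S+$")

def is_entity_reference(value: str) -> bool:
    """True if value looks like a CURIE such as 'obo:uberon'."""
    return bool(CURIE.match(value))

print(is_entity_reference("obo:uberon"))  # entity reference: accepted
print(is_entity_reference("UBERON"))      # bare string: rejected
```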
While this practice by itself does not guarantee that sources are documented uniformly across mapping sets (both obo:uberon and wikidata:Q7876491 refer to the Uberon ontology), it does ensure that users can at least figure out which source was intended without having to resort to manual web searches. We are currently discussing how to ensure that “sources” can be referred to by a unique, consistent identifier scheme.

Requiring entity references for documenting preprocessing techniques. Previously we allowed preprocessing techniques, for example those applied during lexical matching activities (stemming, lemmatisation, etc.), to be recorded using natural language strings. To facilitate the standardisation of these references, we now require preprocessing techniques to be recorded using controlled vocabularies. One such vocabulary is the Semantic Mapping Vocabulary (SEMAPV), which we introduce later in this report.

Splitting match term type into separate type fields for the subject and the object of the mapping. To facilitate mappings between different ontological types (classes and individuals, for example), we split the previous match_term_type property (which used a controlled vocabulary with terms like ConceptMatch, ClassMatch, IndividualMatch) into separate metadata elements for the subject (subject_term_type) and object (object_term_type), using the standard Semantic Web types as values, such as owl:Class, rdfs:Resource or skos:Concept. This ensures, for example, that we can map between elements that are not strictly of the same type, such as a skos:Concept (for example “Alzheimers” in a SKOS taxonomy) and an owl:Class (for example “Alzheimers” in an OWL ontology).
As another positive side effect, we can use this kind of information to make assumptions about the intended semantic framework for the mapping: if, for example, a mapping property is used that is formally an owl:ObjectProperty, and both the subject and the object are owl:Class, then we can export the mapping as an owl:someValuesFrom restriction when the user uses the SSSOM toolkit to transform their mapping into OWL.

Modelling mapping metadata as “mapping justifications”, rather than “match type” information. In SSSOM v. 0.9, the “match_type” property was used to express that the mapping between the subject entity (subject_id) and the object entity (object_id), using a particular mapping predicate (predicate_id), is of type “lexical match”. During the first phase of the design, this felt more natural, as we would often use phrases like “this is a lexical match” when talking about a specific mapping. But it turned out that there were a few problems with this metadata element.

Firstly, we realised that the “matching” vs “mapping” terminology was used confusingly inconsistently across the SSSOM specification. For example, the name “SSSOM” suggests that we are talking about a standard for mappings - but one of its central metadata elements (“match_type”) uses the term “match” instead. The core team has settled on the following conventions for using the terms “mapping” and “matching” across the specification: a term mapping is a statement comprising a subject entity (s), an object entity (o) and a mapping predicate (p); this is synonymous with what the ontology matching literature refers to as a correspondence. We refer to the process that results in a mapping between a subject and an object entity as “matching”. Obviously, these are merely conventions to ensure consistent communication when talking about SSSOM, and not in any way normative - there are many different valid uses of the terms “mapping”, “match” and “matching” across the literature.
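The type-driven OWL export rule described above (an owl:ObjectProperty predicate between two owl:Class entities becomes an existential restriction) can be sketched as follows. The function and entity names are hypothetical, and this is a simplified illustration, not the sssom-py implementation:

```python
def export_as_owl(subject_id: str, predicate_id: str, object_id: str,
                  subject_type: str, object_type: str,
                  object_properties: set) -> str:
    """Render one mapping as a Manchester-syntax-style axiom string.

    If the predicate is a known owl:ObjectProperty and both ends are
    owl:Class, the mapping becomes an existential ('some') restriction;
    anything else falls back to a plain annotation assertion.
    """
    if (predicate_id in object_properties
            and subject_type == "owl:Class"
            and object_type == "owl:Class"):
        return f"{subject_id} SubClassOf: {predicate_id} some {object_id}"
    # Fallback: keep the mapping as an annotation assertion.
    return f"{subject_id} Annotations: {predicate_id} {object_id}"

axiom = export_as_owl("ex:Eye", "ex:overlaps", "ex:VisualSystem",
                      "owl:Class", "owl:Class", {"ex:overlaps"})
print(axiom)
```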
Secondly, a single mapping can be lexical, logical and expert-curated at the same time. Stating that the mapping is of “multiple types” is awkward and confusing. Instead, aligning SSSOM a bit more closely with the provenance model of PROV (another request from the community) appeared more natural: describing how the mapping came into being as “activities that generate or confirm a mapping” (“activity” is a term from the PROV data model denoting a process “that occurs over a period of time and acts upon or with entities”). We decided to refer to these processes as “mapping justifications”.

2.2. Tooling-related updates

The SSSOM toolkit offers a number of utility methods such as “merge” (to merge two mapping sets), “parse” (to convert a different format, such as EDOAL, into SSSOM) and “validate” (to check that a mapping set is legal SSSOM). While it was always part of the design philosophy of SSSOM not to require any special tooling for reading and writing SSSOM files (i.e. it should be possible to use the normal data science toolbox, such as pandas), it can be convenient to have a dedicated toolkit that covers some of the more frequently used operations on mapping sets on the command line and provides a convenient API for data pipelines in Python. Since SSSOM 0.9, a number of new features have been added. The “annotate” function allows adding mapping set level metadata, such as a license or a version, directly on the command line. This supports, for example, use cases like automated mapping extraction or matching pipelines which automatically assign versions. The “validate” command has been significantly improved and now covers JSON Schema validation. Lastly, the “filter” command allows filtering a mapping set based on any of its metadata elements: for example, mapping sets can easily be filtered by predicate id, subject id prefix or mapping provider. Some infrastructure developers have also started making pull requests.
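The "no special tooling required" philosophy mentioned above is easy to demonstrate: filtering a mapping set by predicate_id needs nothing beyond the standard library. The rows below are made up for illustration, and the real "filter" command in the SSSOM toolkit is more general than this sketch:

```python
import csv
import io

# A tiny mapping set as TSV text; the '#'-prefixed metadata line is
# skipped the same way a reader skips comments.
TSV = (
    "#mapping_set_id: https://example.org/demo.sssom.tsv\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "HP:0000478\tskos:exactMatch\tMP:0005391\n"
    "HP:0000478\tskos:closeMatch\tMP:0001304\n"
)

def filter_by_predicate(tsv_text: str, predicate_id: str) -> list:
    """Return the mapping rows whose predicate_id equals the given value."""
    lines = [l for l in tsv_text.splitlines() if not l.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return [row for row in reader if row["predicate_id"] == predicate_id]

exact = filter_by_predicate(TSV, "skos:exactMatch")
print(len(exact))  # one exact match in this toy set
```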
For example, a converter from SSSOM to OntoPortal mappings has been provided by the AgroPortal developers, and a converter for the FHIR ConceptMap is being provided by a FHIR developer (under construction). The Ontology Access Kit (OAK) implements functionality for basic lexical matching based on term synonyms, including a system for specifying mapping rules such as: if a label of the subject matches an exact synonym, then we declare the presence of a skos:exactMatch. Term matches are exported as SSSOM mapping sets, including detailed justifications such as “subject match field” and “match string”. OAK furthermore allows retrieving ontology mappings from various endpoints such as OxO or BioPortal and exporting them as SSSOM mapping sets. The next version of the Ontology Development Kit (ODK) will implement direct support for managing SSSOM mappings alongside ontologies by providing functionality for automatically exporting mappings curated as part of the ontology into SSSOM for easier consumption, copying mappings relevant to the ontology sourced from elsewhere, and curating mappings (manually or with matching tools).

2.3. Key changes related to outreach and governance

Contribution guidelines and Code of Conduct. Since April 2022, we have defined contribution guidelines [6], a Code of Conduct [7] and proposed a few guidelines for general governance, such as for joining the core team, voting on changes to the data standard and resolving conflicts between team members.

New SSSOM tutorial. We also developed our first comprehensive tutorial for the curation of mappings [8]. The tutorial is the first in a series for getting familiar with SSSOM as a standard to capture mappings and for improving practices for mapping curation in general. We believe that it is key to sensitize curators to the idea of mapping precision (exact, narrow, close, broad) and to capture this precision as part of the metadata, in particular through their choice of a concrete mapping predicate such as skos:exactMatch.
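Returning to the OAK-style mapping rules described in section 2.2 (a subject label matching an exact synonym yields skos:exactMatch), the idea can be sketched as follows. This is a deliberate simplification with hypothetical function names, not the OAK implementation:

```python
from typing import List, Optional

def lexical_match(subject_label: str,
                  object_exact_synonyms: List[str],
                  object_broad_synonyms: List[str]) -> Optional[str]:
    """Pick a mapping predicate based on which synonym field matched.

    Matching here is case-insensitive string equality; a real pipeline
    would normally add preprocessing such as stemming first.
    """
    label = subject_label.lower()
    if label in (s.lower() for s in object_exact_synonyms):
        return "skos:exactMatch"   # exact synonym matched the label
    if label in (s.lower() for s in object_broad_synonyms):
        return "skos:broadMatch"   # broad synonym: weaker predicate
    return None                    # no lexical evidence for a mapping

pred = lexical_match("heart", ["Heart", "cardiac organ"], [])
print(pred)
```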
Much of our training efforts focus on teaching curators how to choose appropriate mapping predicates for each use case [9].

SSSOM logo design. After a few rounds of proposals and feedback, we settled on a circular Sankey-diagram-style design developed by Julie McMurry that illustrates cross-mappings between resources, similar to what the Ontology Xref Service uses [10].

3. The Semantic Mapping Vocabulary (SEMAPV)

In the first public version of SSSOM, most metadata elements were either defined as enums (hardcoded values that are part of the SSSOM standard itself; see the above discussion on mapping preprocessing and justifications) or as simple open-ended strings (for example the “source” element). Now, to make certain metadata more easily customisable and extensible (while still being semantically meaningful, with rich metadata), we decided to develop a new controlled vocabulary independently of SSSOM to capture the required terms. The Semantic Mapping Vocabulary (SEMAPV) is at a very early stage of development, covering at the moment primarily the terms required for the use cases of the SSSOM user community, in particular terms for documenting mapping justifications and widely used preprocessing methods. We are, however, discussing the possibility of including additional mapping relationships, such as cross-species mappings for connecting biological databases covering diverse species from Drosophila and zebrafish to Homo sapiens. These are not covered well by the SKOS mapping vocabulary (which, alongside OWL, is the preferred vocabulary for semantic mappings in SSSOM), unless we give up a lot of precision and model them all as skos:closeMatch or skos:relatedMatch. An early version of SEMAPV can be inspected. SEMAPV is open for community feedback and new term requests.
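As an illustration of how such a vocabulary can be used programmatically, the sketch below checks mapping_justification values against a handful of SEMAPV terms. The term list here is a small illustrative sample and may not reflect the current vocabulary; consult SEMAPV itself for the authoritative set:

```python
# A handful of SEMAPV mapping-justification terms (comments paraphrase
# their intent; treat this as illustrative, not the vocabulary itself).
SEMAPV_JUSTIFICATIONS = {
    "semapv:LexicalMatching",        # match based on labels/synonyms
    "semapv:LogicalReasoning",       # match derived by a reasoner
    "semapv:ManualMappingCuration",  # asserted by a human curator
    "semapv:MappingChaining",        # derived by chaining other mappings
}

def check_justification(value: str) -> bool:
    """True if value is one of the known justification terms above."""
    return value in SEMAPV_JUSTIFICATIONS

print(check_justification("semapv:LexicalMatching"))
```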
4. Mapping registries and mapping commons: how to manage collections of mapping sets

Traditionally, a lot of the focus in and around modelling mappings has been on individual term mappings and mapping sets (or alignments). Very little work has been done on dealing with collections of mapping sets, either in terms of their collection, indexing and retrieval (mapping registries), or their reconciliation. Matching tools are typically designed to generate a one-off alignment, which is difficult to reconcile with partial alignments from other sources, such as human-curated subsets. The role of human review in general is typically seen as being in service to the automated approaches, either by providing a training set for machine-learning-based approaches, or a validation set for other, more traditional approaches (“my matcher is 70% correct compared to a human-reviewed subset”). In reality, mapping sets are co-developed by curators, automated matchers and, importantly, “crosswalks” or “mapping chains”. The latter exploit the fact that if a subject A is an exact match to two objects (B and C), then we can reasonably assume that B and C are also exact matches to each other. Manual curation may often not be feasible due to the scale of the data, but human input can often be captured through user feedback (“this does not make sense!”). It is our firm belief that all these mapping approaches should be applied in concert, which means that a good model for dealing with collections of mapping sets must be developed. Such a model must answer questions like how to reconcile conflicting mappings (picking the one with higher confidence?) or mappings whose application leads to nonsensical knowledge graphs and ontology structures (equivalence hairballs, inconsistencies). Many of the latter issues are covered by the ontology merging literature [11–15], but their outputs have yet to be standardised to be useful to a wider audience.
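The "mapping chain" inference described above (if A is an exact match to both B and C, then B and C are plausibly exact matches to each other) amounts to computing connected components over the exact-match graph. Below is a minimal sketch using a small union-find; all entity names are hypothetical:

```python
from itertools import combinations

def chain_exact_matches(pairs):
    """Derive new exactMatch pairs by chaining existing ones.

    exactMatch is treated as symmetric and transitive, so entities
    connected through any chain of exact matches end up mutually
    mapped. Uses a tiny union-find over the input pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # Group entities into connected components.
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)

    # All pairs within a component, minus the asserted ones.
    derived = set()
    for members in groups.values():
        for x, y in combinations(sorted(members), 2):
            derived.add((x, y))
    return derived - {tuple(sorted(p)) for p in pairs}

new = chain_exact_matches([("A", "B"), ("A", "C")])
print(sorted(new))  # the chained pair between B and C
```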
With the growing number of projects developing terminology servers, we have recently started looking into formalising collections of mapping sets as mapping registries. An early version of this is included in the current SSSOM release. Mapping registries allow capturing metadata about mapping sets from the perspective of the registry maintainers. For example, while a specific matching tool will output a confidence value for a mapping, the registry maintainer may trust the tool more or less (and curate their own confidence in the mapping set). These confidence values can then be used by reconciliation tools to create “harmonised mappings” (a concept that is not clearly defined; the authors usually take it to mean a mapping set that does not induce equivalence hairballs or lead to logical inconsistencies). An example of a mapping registry instance can be found on GitHub (https://github.com/mapping-commons/mh_mapping_initiative/blob/master/mappings.yml). Ultimately, we hope that mapping registries could form the backbone of mapping commons: social endeavours that seek to collect and harmonise all mappings covering a specific domain. For example, the mouse-human mapping commons (https://github.com/mapping-commons/mh_mapping_initiative) seeks to collect and maintain mappings relevant to the integration of mouse model organism research data with human clinical data.

5. Discussion and Conclusions

The adoption of SSSOM is still in its early stages. It is increasingly clear that precise and well-documented terminological mappings are required for many use cases, such as bridging the chasm between the world of clinical terminology, which dominates the domain of clinical data, and the world of biological research (such as genomics), which is dominated by open biomedical ontologies (such as OBO Foundry ontologies); simply relying on automated tooling won’t work for many of these use cases.
Furthermore, it is likely that millions of taxpayer dollars are wasted on recreating the exact same mappings over and over again due to the lack of a FAIR and open mapping culture along the lines of what we already have for biomedical ontologies. What is needed is a holistic approach that integrates (partial) mappings created by different groups for specific use cases with the results from automated matchers (which enable coverage and scalability) and human curation (user feedback, biocuration). To achieve this, our community (Ontology Matching) should evolve from a primarily tool-centric view (with a focus on algorithmic precision and fully automated matching) to a more data-centric view (integrated development processes, continuous updates of mapping sets, reuse and sharing, hybrid automated and manual mapping curation). Not only will the publication and curation of mappings in mapping commons improve the user experience; for the Ontology Matching community, manually curated mapping sets can ultimately evolve from silver to gold standard corpora for evaluation as well. Tools should be retrofitted to export “mapping justifications” alongside their results, documenting preprocessing steps, mapping decisions and more in a way that allows downstream users to accept or reject a mapping based on the justification alone.

6. References

[1] EDOAL: Expressive and declarative ontology alignment language. [cited 12 Aug 2022]. Available: https://moex.gitlabpages.inria.fr/alignapi/edoal.html
[2] David J, Euzenat J, Scharffe F, dos Santos CT. The Alignment API 4.0. Semantic Web. 2011. pp. 3–10. doi:10.3233/sw-2011-0028
[3] Matentzoglu N, Balhoff JP, Bello SM, Bizon C, Brush M, Callahan TJ, et al. A Simple Standard for Sharing Ontological Mappings (SSSOM). Database. 2022;2022:baac035.
[4] Franc YL, Coen G, Essen JP, Bonino L, Lehväslaiho H, Staiger C. D2.2 FAIR semantics: First recommendations. 2020 [cited 12 Aug 2022].
Available: https://www.narcis.nl/publication/RecordID/oai:pure.knaw.nl:publications%2F8e193436-dd29-40e5-8e60-1b6a3cf43e8f
[5] Haendel MA, Balhoff JP, Bastian FB, Blackburn DC, Blake JA, Bradford Y, et al. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Semantics. 2014;5:21.
[6] CONTRIBUTING.md at master · mapping-commons/sssom. GitHub. Available: https://github.com/mapping-commons/sssom
[7] CODE_OF_CONDUCT.md at master · mapping-commons/sssom. GitHub. Available: https://github.com/mapping-commons/sssom
[8] Basic tutorial - A Simple Standard for Sharing Ontology Mappings (SSSOM). [cited 12 Aug 2022]. Available: https://mapping-commons.github.io/sssom/tutorial/
[9] How to use mapping predicates - A Simple Standard for Sharing Ontology Mappings (SSSOM). [cited 12 Aug 2022]. Available: https://mapping-commons.github.io/sssom/mapping-predicates/
[10] Ontology Xref Service (OxO). [cited 12 Aug 2022]. Available: https://www.ebi.ac.uk/spot/oxo/
[11] Lambrix P, Edberg A. Evaluation of ontology merging tools in bioinformatics. Pac Symp Biocomput. 2003;589–600.
[12] Stumme G, Maedche A. Ontology merging for federated ontologies on the semantic web. OIS@IJCAI. Available: https://openreview.net/pdf?id=ryWUgQz_-B
[13] Dou D. Ontology Translation by Ontology Merging and Automated Reasoning. Yale University; 2004.
[14] Raunich S, Rahm E. Towards a Benchmark for Ontology Merging. On the Move to Meaningful Internet Systems: OTM 2012 Workshops. Springer Berlin Heidelberg; 2012. pp. 124–133.
[15] Noy NF, Musen MA. Algorithm and tool for automated ontology merging and alignment. Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-2000). Available: https://www.aaai.org/Papers/AAAI/2000/AAAI00-069.pdf