Comparison of ontology mapping techniques to map plant trait ontologies

Marie-Angélique Laporte, Léo Valette, Elizabeth Arnaud
Bioversity International
Montpellier, France
m.a.laporte@cgiar.org

Laurel Cooper, Austin Meier, Pankaj Jaiswal
Department of Botany and Plant Pathology
Oregon State University
Corvallis, USA

Christopher J. Mungall
Berkeley Bioinformatics Open-Source Projects
Lawrence Berkeley National Laboratory
Berkeley, USA

Abstract: Crop-specific ontologies for phenotype annotations in breeding have proliferated over the last 10 years. Across-crop data interoperability involves linking those ontologies together. For this purpose, the Planteome project is mapping the Crop Ontology traits (www.cropontology.org) to the reference ontology for plant traits, the Trait Ontology (TO). Manual mapping is time-consuming and not sustainable in the long run, as ontologies keep evolving and multiplying. We are thus working on developing reliable automated mapping techniques to assist curators in performing semantic integration. Our study shows the benefit of an ontology matching technique based on formal definitions and shared ontology design patterns, compared to a standard automatic ontology matching algorithm such as AML (AgreementMakerLight).

Keywords: ontology mapping; ontology design patterns; reference ontologies

I. INTRODUCTION

The development of improved crop varieties relies on both traditional breeding methods and next-generation methods such as high-throughput sequencing, molecular breeding and automated scoring of traits. In that context, a number of ontologies have been developed to address data interoperability issues. They fulfill the needs of specific communities, but they are species- or clade-specific ontologies [1] and therefore block data harmonization across disciplines and communities.

The crop breeding community, in particular, widely uses the Crop Ontology (CO; www.cropontology.org), which is composed of species-specific ontologies for field book editing and data annotation [1]. Because these ontologies grow in size and number, it is essential to develop efficient and reliable automated concept mapping techniques so that semantic channels can be applied for data integration and discovery.

The Planteome project (www.planteome.org) aims to support comparative plant biology, and provides integrated access to annotated datasets generated by inter- and intra-specific comparative analysis of transcriptomes, proteomics, phenomics and genome annotation. To address this objective, Planteome is currently developing, and promoting the use of, a set of reference ontologies for plants that propose species-neutral concepts, as well as common data annotation standards. Harmonization between the species-specific ontologies and the Planteome reference ontologies is currently done by mapping the Crop Ontology to the Plant Trait Ontology (TO) [2], which is the reference species-neutral ontology for plant traits and aims at integrating the many crop-specific trait ontologies.

The purpose of our study is to generate mappings efficiently, in order to ease the work of the ontology curators who create manual mappings. To this end, we have compared two automatic ontology mapping techniques. The first technique is widely used to align ontologies and consists in applying a standard automatic matching algorithm: AML (AgreementMakerLight) produces mappings based on both the string similarities of the ontology terms and the ontology structure. Considering the number of ontologies to be mapped and the inherent tendency of ontologies to evolve over time, the mappings created using such a technique can be hard to maintain automatically. Therefore, the second technique relies on formal definitions and shared ontology design patterns. The ontology design patterns are created using Web Ontology Language (OWL) axioms based on Entity-Quality (EQ) statements, leading to a post-composition of terms, similar to what has been proposed by the Ontology of Biological Attributes (OBA) [3]. The Entity (E) and Quality (Q) are sourced from the reference ontologies promoted by Planteome. The Q comes from the Phenotype and Trait Ontology (PATO), whereas the E comes from the Plant Ontology (PO) when it is related to plant structures, the Gene Ontology (GO) for subcellular components, Chemical Entities of Biological Interest (ChEBI) for chemical entities, or the Environment Ontology (EO) for environmental conditions. Automated reasoning engines are then used to generate the mappings between the species-specific ontologies and the reference ontologies, while guaranteeing the validity of the unified merged ontology (i.e. TO plus the species-specific CO). As a result, TO is being enriched with well-defined crop-specific terms of the Crop Ontology, and Planteome can integrate additional data annotated in a unified way by the breeding and the genetic communities.

II. RESULTS AND DISCUSSION

The AML algorithm and the design patterns approach have been applied to four crop Trait Dictionaries of the Crop Ontology so far: the cereals rice and wheat, the legume lentil, and the root and tuber crop cassava. Those ontologies are very different in terms of plant anatomy and morphology, but also in terms of the count and complexity of phenotypic traits. Table I summarizes the results of the mapping process on trait terms. Mapping using formal definitions resulted in a two-fold increase in successful mappings. On average, AML was able to propose mappings for ~40% of the CO classes in each ontology, compared to ~75% of terms mapped using the formal definition approach. This can be explained by the fact that crop-specific ontologies use very specific terminologies, especially for the Entity part of the EQ statement. Although the specific plant entities are defined in the Plant Ontology (PO) as synonyms of species-neutral entities, not all of the synonyms were added to TO and CO when the terms were pre-composed. The AML algorithm was thus not able to use this information, whereas the PO synonyms have been used to build the formal definitions of the CO terms. Furthermore, because the class hierarchy is quite simple in the different CO ontologies, AML was not able to use the ontology structures to improve the mapping results: only equivalent terms were found using AML.

TABLE I. MAPPING RESULTS

                   Rice        Wheat       Lentil      Cassava
# trait classes    157         238         66          175
AML                84 (54%)    73 (30%)    28 (42%)    59 (34%)
Design Patterns    121 (77%)   199 (84%)   47 (71%)    118 (67%)

Disease resistance traits are important for breeders. Because a disease results from the combination of a host species, a pathogen and an environment, disease resistance traits are crop-specific. Biotic stress traits include disease-related traits and can cover as much as 20% of an individual CO ontology. Those traits cannot have an exact correspondence in TO, so AML was not able to find mappings for those terms. Based on the formal definitions, a reasoner classified those terms as subclasses of one of the TO stress traits.

Finally, not all the classes in TO have been formally defined yet. Design patterns are hard to develop for very complex traits such as yield-related traits. This is why not all the CO classes can be mapped to TO classes using the design pattern technique, and manual mapping is still needed for those traits. The mapping coverage will be improved in the future. The mapped ontologies are available on www.planteome.org as well as on Planteome's GitHub repository (https://github.com/Planteome).

III. CONCLUSION

In an era of ontology proliferation, it is of vital importance to have reference ontologies and powerful tools that reduce the effort of ontology alignment. Standard mapping techniques do not fit the need of ontology evolution over time, as their results are difficult to maintain automatically. Developing the mapping process based on ontology design patterns and logical axioms ensures the validity and accuracy of, and confidence in, the resulting ontology mappings. Scientists from the breeding community can continue to use their preferred standards to annotate and record their data, which reduces the effort required of them. Planteome, through the TO, provides unified access to the breeding and the genetic data, opening up the possibility to perform large-scale analyses such as comparative genomics by promoting a species-neutral approach.

ACKNOWLEDGMENT

This work was supported by IOS:1340112 from the NSF. Additionally, CJM acknowledges the support of the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

REFERENCES

[1] Shrestha, R., Arnaud, E., Mauleon, R., Senger, M., Davenport, G.F., Hancock, D., Morrison, N., Bruskiewich, R. and McLaren, G., 2010. Multifunctional crop trait ontology for breeders' data: field book, annotation, data discovery and semantic enrichment of the literature. AoB plants, 2010, p.plq008.
[2] Arnaud, E., Cooper, L., Shrestha, R., Menda, N., Nelson, R.T., Matteis, L., Skofic, M., Bastow, R., Jaiswal, P., Mueller, L.A. and McLaren, G., 2012, October. Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes. In KEOD (pp. 220-225).
[3] https://github.com/obophenotype/bio-attribute-ontology, DOI:10.5281/zenodo.47337
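
The coverage figures quoted in the results (about 40% for AML, about 75% for the design pattern approach, and a roughly two-fold increase) can be rechecked from the raw counts in the mapping results table. The following is a minimal Python sketch of that arithmetic. The counts are copied from the table; the variable and function names are illustrative assumptions and are not part of the Planteome tooling. "On average" is assumed here to mean an unweighted mean across the four crops.

```python
# Counts copied from the mapping results table (names are illustrative only).
TRAIT_CLASSES = {"Rice": 157, "Wheat": 238, "Lentil": 66, "Cassava": 175}
AML_MAPPED = {"Rice": 84, "Wheat": 73, "Lentil": 28, "Cassava": 59}
PATTERN_MAPPED = {"Rice": 121, "Wheat": 199, "Lentil": 47, "Cassava": 118}


def coverage(mapped, totals):
    """Return the percentage of trait classes mapped, per crop."""
    return {crop: 100.0 * mapped[crop] / totals[crop] for crop in totals}


def mean(values):
    """Unweighted mean across crops (assumed reading of 'on average')."""
    values = list(values)
    return sum(values) / len(values)


aml_cov = coverage(AML_MAPPED, TRAIT_CLASSES)
pattern_cov = coverage(PATTERN_MAPPED, TRAIT_CLASSES)

for crop in TRAIT_CLASSES:
    print(f"{crop:8s} AML {aml_cov[crop]:5.1f}%   "
          f"design patterns {pattern_cov[crop]:5.1f}%")

print(f"mean AML coverage:            {mean(aml_cov.values()):.1f}%")
print(f"mean design-pattern coverage: {mean(pattern_cov.values()):.1f}%")

# Pooled ratio of mapped classes across the four crops.
ratio = sum(PATTERN_MAPPED.values()) / sum(AML_MAPPED.values())
print(f"design patterns / AML mapped classes: {ratio:.2f}")
```

Under these assumptions the unweighted means round to 40% and 75%, and the pooled ratio of mapped classes is close to 2, which is consistent with the figures stated in the text. Per-crop percentages may differ from the table by one point depending on the rounding convention used.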