Terminologies as a neglected part of research data: Making supplementary research data available through the GFBio Terminology Service David Fichtmüller1 , Maren Gleisberg1 , Naouel Karam2 , Claudia Müller-Birn2 , and Anton Güntsch1 1 Botanic Garden and Botanical Museum (BGBM), Freie Universität Berlin, Germany {d.fichtmueller@bgbm.de} 2 Institute of Computer Science, Freie Universität Berlin, Germany {naouel.karam@fu-berlin.de} Abstract. In many research projects, much more data are created than made publicly available. Keeping research data deliberately closed or publishing only selected subsections of the gathered data are unfortunately common practices in academia. Fortunately, such problems have been getting more and more attention in the past years. However, another issue that is still often overlooked concerns research data that are generated as part of a research project but that are generally not considered part of the primary research data. One example for such neglected research data are terminologies such as controlled vocabularies that are used to describe or classify primary research data. In this paper we will outline the process that is used by the Terminology Service of the German Federation for Biological Data (GFBio) to prepare and process terminologies so that they can be included in the GFBio Terminology Service where they are made available to researchers within and outside the original research project. We will also show how making such supplementary research data publicly available will benefit the researchers who share them as well as the scientific community as a whole. Keywords: GFBio, research data, terminology, ontology, terminology service 1 Introduction In recent years, primary research data have been getting more attention as part of the publication process. Funding agencies such as the German Research Foundation (DFG3 ) and publishers are pushing scientists to publish the underlying research data along with the corresponding papers, or at least upload them to research data repositories. The DFG-funded project GFBio4 (German Federation for Biological Data) is creating a dedicated repository for various kinds of biological research data and is developing supplementary tools for discovery and reuse of these data [2]. Various other initiatives are working on making research data publication and usage easier. One such initiative 3 www.dfg.de 4 www.gfbio.org Fig. 1: A simple example of a geographic classification as it is used in research data and as it exists on its own with its definitions and connections. is re3data.org5 that has created an extensive registry of research data repositories, so scientists can easily find the repository that best fits their data and their needs. Another example is DataCite6 who provides tools to make scientific data more citable and easier to find and reuse. Generally, the state of research data has significantly improved in recent years and will most likely continue to improve in the years to come. All of these tools and methods, however, generally only focus on the primary research data generated by the research projects. Another kind of data that is created during research projects is often overlooked: terminologies that are used to describe or classify records in the primary data. In scientific projects where several people are involved in the creation or gathering of the data, especially in large joint research cooperatives, it is vital to have a common understanding about the methods and categories used to describe these data. Ideally, this common understanding is expressed through written definitions of the terms prior to the collection of the first data. However, it is also possible that the conceptual agreement between the involved scientists was only achieved through ad-hoc discussions during data accumulation and was never formalized or documented. Even when common terms have been properly defined and documented, these documentations are often not published alongside the primary research data. This is a crucial loss of useful information, since definitions, synonyms and structural relations between terms usually cannot fully be extracted from the research data that is described using those terminologies, see Fig. 1. Terminologies that are used to describe or classify primary research data can therefore 5 www.re3data.org 6 www.datacite.org be considered as supplementary7 research data, data that is not the primary focus of a research project, but vital to the accumulation of the primary research data. 2 Context and Related Work 2.1 What are Terminologies In the context of the GFBio Terminology Service and this paper, a terminology is the overarching name for any set of fixed denotations that are used to describe something with the goal to reduce ambiguity and facilitate comparability. A terminology can range from a simple Controlled Vocabulary (a simple list of terms) to a complex Ontology (formal definitions of terms and their relations semantically expressed in a machine readable way). Terminologies can include translations and synonyms or aliases for individual terms. 2.2 What is GFBio and the GFBio Terminology Service The German Federation for Biological Data (GFBio) is a national data infrastructure to store and facilitate access to biological and environmental research data. It offers services and resources to researchers for the archiving and publication of their research data as well as an open access portal to provide access to the data stored in the various data centers. The Terminology Service8 (TS) of GFBio provides access to various terminologies for research data through one unified API [5]. Terminologies hosted at the TS can be distinguished into two groups: internal terminologies where data are locally stored and external terminologies9 where the TS provides access to terminologies hosted on remove servers, examples for the latter case would be large databases like the Catalogue of Life (CoL)10 , the World Register of Marine Species (WoRMS)11 or the GeoNames12 Database. On the GFBio data portal, search queries for taxonomic names are extended using the TS to include synonyms and names of higher taxa, resulting in more relevant results for the users. The TS is therefore a vital component of the GFBio infrastructure. The GFBio Terminology Service can handle all kinds of terminologies, independent of their complexity, though the authors of terminologies to be included are required to at least provide definitions of the terms. 7 Supplementary as in supplementary to the primary data, and not to be confused with supple- mentary data for journal publications where the supplementary refers to the primary research data being the supplement to the journal article. 8 terminologies.gfbio.org 9 In the context of this paper we will focus only on the preparation for terminologies to be imported as an internal terminology, as the process for connecting to an external terminology is completely different and beyond the scope of this paper. 10 www.catalogueoflife.org 11 www.marinespecies.org 12 www.geonames.org 2.3 Related Initiatives Different systems providing a comparable terminology service exist, the most widely used being Bioportal [7], a repository providing access to a large number of biomedical ontologies and Agroportal [4] its counterpart for agriculture and earth sciences. Finto (Finnish thesaurus and ontology service) [8] is a vocabulary service offering interfaces to ontologies from different domains, such as art, geography, science and medicine. The Ontology Lookup Service (OLS) [1] is a system integrating publicly available biomedical ontologies. And finally, Aber-OWL [3] is a framework that provides reasoning services over bio-ontologies. Specific project requirements motivated our decision of setting up our own solution, for instance regarding the range of heterogeneity of the considered terminologies or the necessity of combining ontology content with annotations to perform semantic search. More details about the requirements and a detailed comparison with existent systems can be found in [5]. 3 Terminology Preparation Steps If researchers want to have their terminology included in the GFBio Terminology Service, they need to contact the TS team, either directly or through the GFBio Submission Page13 . To make a terminology fit the requirements for import in the Terminology Service, several processing steps might be required. These steps are done in close cooperation between the TS team and the scientist(s) providing the terminologies. The steps strongly vary between the individual terminologies, their type and complexity, and the additional work already provided by the involved scientists. The simplest case is when a dedicated list of terms is available as part of the supplementary research data, ideally with definitions and connections between the terms. In cases where no dedicated list of terms or formal documentation is available, the terms are extracted from the primary research data. This can range from simply exporting individual columns or tables from the set of the primary research data to doing complex parsing operations on the data to filter out the desired terminologies. The software used to do these extractions depends on the original data, e.g. when the terminology is included in the form of geographic data files, a common GIS software is used to extract it. The goal of the extraction process is to end up with a tabular file of the individual terms and their corresponding information, like hierarchies, if they can be extracted as well. Once the extraction is done, the scientists are asked to review the information for the completeness and correctness and provide any missing information that were not part of the original research data, such as definitions, translations or hierarchical structures in cases where they could not be extracted. The next step of the terminology processing is the data refinement and cleanup, which again is done in close contact with the contributing scientist(s). The refinement is usually done using OpenRefine14 , to catch errors like spelling mistakes in the term names, resulting in two very similar but not identical terms. Different additional tools are sometimes also 13 https://www.gfbio.org/data/submit/generic; This is the same page as for the general GFBio data submission. 14 www.openrefine.org used to check for logical errors in the structure or other errors that cannot be checked using OpenRefine. Each term of the terminology will get an individual URI which makes them address- able as a resource in the Semantic Web context. To avoid creating additional URIs for the same concepts, similar terminologies are searched for and if available, their terms are compared to the terms of the current terminology. In cases where terms are identical, the already existing URI is used. If terms are comparable but not identical to terms from other terminologies, then the relation between the terms is recorded by using properties such as skos:broader or skos:related. There are two options for contributing scientists if new URIs for the terms are assigned. The terms can either get the GFBio TS prefix15 , or they can provide their own prefix. The URIs with the TS prefix are resolvable and provide both human and machine readable formats depending on who is resolving the link. Custom URI prefixes on the other hand can help the branding of research projects, but the researchers are responsible for resolving the terms if they wish to have this highly recommended feature. In the end, the metadata of the terminology itself are formalized and the terminology is exported. Depending on its complexity this is usually SKOS, OWL or another RDF-based format which can then be imported into the GFBio Terminology Service. The export is done by creating a template in which the individual terms can be imported and using the OpenRefine templating engine to generate the final RDF file. After a final check and validation, the file is then imported into the GFBio TS, where the terms are then accessible via the TS API. When several scientists from the original research project wish to collaboratively and simultaneously work on reviewing and extending the terminology during the different feedback steps mentioned above, the TS team can provide dedicated tools. 4 Advantages of accessible Research Terminologies There are several advantages that come with having research terminologies accessible. The foremost gain is that the primary research data itself becomes more understandable and reusable when the definitions and underlying hierarchies of the terms used to express it, are available as well. This is the primary use case of supplementary research data. These advantages can be further extended if the primary research data are served through a semantic aware search or portal, as this will allow for queries that also include synonyms or higher hierarchical terms, as demonstrated in [6]. Additional benefits could arise, if the primary research data not only uses the terms as a textual representation (i.e. copying its name) but as a semantic annotation, by using the concept URI to link to the term instead. Once the terms and their definitions are publicly available it strongly encourages their reuse. This could be in a subsequent project by the same researchers or even with researchers from other projects. Reusing terms not only saves time and effort for the people involved, but it makes the produced research data between the different projects more comparable, reusable and integrable. While journal publications of research papers and their subsequent number of citations are still the de facto standard to measure research impact, in recent years new approaches have come along to measure 15 The TS URIs are formatted like this: http://terminologies.gfbio.org/terms// other kinds of scientific output as well, such as data publications or continuous work on service infrastructure. All terminologies on the GFBio Terminology Service can be cited as a research product which will give credit to the researchers who invested time and effort in creating them. 5 Conclusion The GFBio Terminology Service is an important resource both for scientists who wish to share their terminologies that are used to describe and classify research data and for researchers who wish to apply existing terminologies and classifications to their own research data to improve their integrability. With reasonable additional effort the terminologies can be processed to be included in the TS and both the scientists who created the terminologies and the scientific community as a whole can benefit from this otherwise neglected research data. References 1. R. G. Côté, P. Jones, R. Apweiler, and H. Hermjakob. The ontology lookup service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics, 7(1):1– 7, 2006. 2. M. Diepenbroek, F. O. Glöckner, P. Grobe, A. Güntsch, R. Huber, B. König-Ries, I. Kostadinov, J. Nieschulze, B. Seeger, R. Tolksdorf, and D. Triebel. Towards an integrated biodiversity and ecological research data management and archiving platform: The german federation for the curation of biological data (gfbio). In 44. Jahrestagung der Gesellschaft für Informatik, Stuttgart, Germany. 3. R. Hoehndorf, L. Slater, P. N. Schofield, and G. V. Gkoutos. Aber-owl: a framework for ontology-based data access in biology. BMC Bioinformatics, 16(1):1–9, 2015. 4. C. Jonquet, A. Toulet, E. Arnaud, S. Aubin, E. Dzalé Yeumo, V. Emonet, V. Pesce, and P. Larmande. Reusing the NCBO BioPortal technology for agronomy to build AgroPortal. In ICBO : International Conference on Biomedical Ontologies, page 3 p., 2016. 5. N. Karam, C. Müller-Birn, M. Gleisberg, D. Fichtmüller, R. Tolksdorf, and A. Güntsch. A terminology service supporting semantic annotation, integration, discovery and analysis of interdisciplinary research data. Datenbank-Spektrum, 16(3):195–205, Nov 2016. 6. F. Löffler, K. Opasjumruskit, N. Karam, D. Fichtmüller, F. Klan, C. Müller-Birn, U. Schindler, and M. Diepnebroek. Honey bee versus apis mellifera: A semantic search for biological data. In E. Blomqvist, D. Maynard, A. Gangemi, R. Hoekstra, P. Hitzler, and O. Hartig, editors, The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part I, Lecture Notes in Computer Science, 2017. 7. N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M. D. Storey, C. G. Chute, and M. A. Musen. Bioportal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37(Web-Server-Issue):170–173, 2009. 8. O. Suominen, S. Pessala, J. Tuominen, M. Lappalainen, S. Nykyri, H. Ylikotila, M. Frosterus, and E. Hyvnen. Deploying national ontology services: From onki to finto. In Proceedings of the Industry Track at the International Semantic Web Conference 2014. CEUR Workshop Proceedings, October 2014.