<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Terminologies as a neglected part of research data: Making supplementary research data available through the GFBio Terminology Service</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Fichtmüller</string-name>
          <email>d.fichtmueller@bgbm.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maren Gleisberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naouel Karam</string-name>
          <email>naouel.karam@fu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Müller-Birn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton Güntsch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Botanic Garden and Botanical Museum (BGBM), Freie Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Freie Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In many research projects, far more data are created than are made publicly available. Keeping research data deliberately closed, or publishing only selected subsets of the gathered data, is unfortunately common practice in academia. Fortunately, these problems have received increasing attention in recent years. However, another issue is still often overlooked: data that are generated as part of a research project but are generally not considered part of the primary research data. One example of such neglected research data is terminologies, such as controlled vocabularies, that are used to describe or classify primary research data. In this paper we outline the process used by the Terminology Service of the German Federation for Biological Data (GFBio) to prepare and process terminologies so that they can be included in the service, where they are made available to researchers within and outside the original research project. We also show how making such supplementary research data publicly available benefits the researchers who share them as well as the scientific community as a whole.</p>
      </abstract>
      <kwd-group>
        <kwd>GFBio</kwd>
        <kwd>research data</kwd>
        <kwd>terminology</kwd>
        <kwd>ontology</kwd>
        <kwd>terminology service</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, primary research data have been receiving more attention as part of the
publication process. Funding agencies such as the German Research Foundation (DFG)<fn id="fn3"><label>3</label><p>www.dfg.de</p></fn>
and publishers are pushing scientists to publish the underlying research data along with
the corresponding papers, or at least to upload them to research data repositories. The
DFG-funded project GFBio<fn id="fn4"><label>4</label><p>www.gfbio.org</p></fn> (German Federation for Biological Data) is creating a
dedicated repository for various kinds of biological research data and is developing
supplementary tools for the discovery and reuse of these data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Various other initiatives
are working on making the publication and use of research data easier. One such initiative
is re3data.org, which has created an extensive registry of research data repositories so that
scientists can easily find the repository that best fits their data and their needs. Another
example is DataCite, which provides tools to make scientific data more citable and easier
to find and reuse. Overall, the state of research data has improved significantly in
recent years and will most likely continue to improve in the years to come.
      </p>
      <p>
        All of these tools and methods, however, generally focus only on the primary research data generated
by the research projects. Another kind of data that is created during research projects
is often overlooked: terminologies that are used to describe or classify records in the
primary data. In scientific projects where several people are involved in creating
or gathering the data, especially in large joint research cooperatives, it is vital to
have a common understanding of the methods and categories used to describe these
data. Ideally, this common understanding is expressed through written definitions of
the terms before the first data are collected. However, it is also possible that the
conceptual agreement between the involved scientists was reached only through ad hoc
discussions during data accumulation and was never formalized or documented. Even
when common terms have been properly defined and documented, this documentation
is often not published alongside the primary research data. This is a crucial loss of
useful information, since definitions, synonyms and structural relations between terms
usually cannot be fully extracted from the research data that are described using those
terminologies (see Fig. 1).
      </p>
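      <p>Terminologies that are used to describe or classify primary research data can therefore
be considered supplementary<fn id="fn7"><label>7</label><p>Supplementary as in supplementary to the primary data; not to be confused with supplementary data for journal publications, where the primary research data are the supplement to the journal article.</p></fn> research data: data that are not the primary focus of a
research project but are vital to the accumulation of the primary research data.</p>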
    </sec>
    <sec id="sec-2">
      <title>Context and Related Work</title>
      <sec id="sec-2-1">
        <title>What are Terminologies</title>
        <p>In the context of the GFBio Terminology Service and this paper, a terminology is the
overarching name for any set of fixed denotations that are used to describe something
with the goal of reducing ambiguity and facilitating comparability. A terminology can range
from a simple controlled vocabulary (a plain list of terms) to a complex ontology
(formal definitions of terms and their relations, semantically expressed in a
machine-readable way). Terminologies can include translations and synonyms or aliases for
individual terms.</p>
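        <p>The range from controlled vocabulary to richer terminology can be illustrated with a small data structure. The following is only a minimal sketch: the class <monospace>Term</monospace>, the function <monospace>resolve</monospace> and all example terms, definitions and translations are assumptions made for illustration and are not taken from the GFBio Terminology Service.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One entry of a terminology; only the label is mandatory."""
    label: str
    definition: str = ""
    synonyms: list = field(default_factory=list)
    translations: dict = field(default_factory=dict)  # language code -> label
    broader: str = ""  # label of the parent term, empty for top-level terms


# Simplest case: a controlled vocabulary is a plain list of terms.
controlled_vocabulary = [Term("forest"), Term("grassland"), Term("meadow")]

# Richer case: the same terms with definitions, synonyms, translations and a
# hierarchy. All values are made up for illustration.
terminology = [
    Term("forest", definition="Land dominated by trees.",
         synonyms=["woodland"], translations={"de": "Wald"}),
    Term("grassland", definition="Land dominated by grasses.",
         translations={"de": "Grasland"}),
    Term("meadow", definition="Grassland that is mown regularly.",
         translations={"de": "Wiese"}, broader="grassland"),
]


def resolve(name, terms):
    """Map a label, synonym or translation to the preferred label, ignoring case."""
    wanted = name.casefold()
    for term in terms:
        names = [term.label, *term.synonyms, *term.translations.values()]
        if wanted in (n.casefold() for n in names):
            return term.label
    return None
```
]]></preformat>
        <p>With the richer structure, a synonym or translation resolves to the preferred term, whereas the plain list can only match the exact label.</p>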
      </sec>
      <sec id="sec-2-2">
        <title>What is GFBio and the GFBio Terminology Service</title>
        <p>
          The German Federation for Biological Data (GFBio) is a national data infrastructure
for storing and facilitating access to biological and environmental research data. It offers
services and resources to researchers for the archiving and publication of their research
data as well as an open access portal that provides access to the data stored in the various
data centers. The Terminology Service<fn id="fn8"><label>8</label><p>terminologies.gfbio.org</p></fn> (TS) of GFBio provides access to various
terminologies for research data through one unified API [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Terminologies hosted at
the TS fall into two groups: internal terminologies, where the data are
stored locally, and external terminologies<fn id="fn9"><label>9</label><p>In the context of this paper we focus only on the preparation of terminologies to be imported as internal terminologies, as the process for connecting to an external terminology is completely different and beyond the scope of this paper.</p></fn>, where the TS provides access to terminologies
hosted on remote servers. Examples of the latter are large databases such as the
Catalogue of Life (CoL)<fn id="fn10"><label>10</label><p>www.catalogueoflife.org</p></fn>, the World Register of Marine Species (WoRMS)<fn id="fn11"><label>11</label><p>www.marinespecies.org</p></fn> and the
GeoNames<fn id="fn12"><label>12</label><p>www.geonames.org</p></fn> database. On the GFBio data portal, search queries for taxonomic names
are extended using the TS to include synonyms and names of higher taxa, resulting in
more relevant results for the users. The TS is therefore a vital component of the GFBio
infrastructure. The GFBio Terminology Service can handle all kinds of terminologies,
independent of their complexity, though the authors of terminologies to be included are
required to provide at least definitions of the terms.
        </p>
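        <p>The query extension described above can be sketched as follows. This is a toy in-memory illustration and does not call the TS API: the dictionary layout, the function name <monospace>expand_query</monospace> and the small taxon table are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Toy in-memory stand-in for a terminology lookup. This is NOT the GFBio TS API;
# the dictionary layout and the function name are assumptions made for this sketch.
TAXA = {
    "Apis mellifera": {"synonyms": ["honey bee", "western honey bee"], "parent": "Apis"},
    "Apis": {"synonyms": [], "parent": "Apidae"},
    "Apidae": {"synonyms": [], "parent": None},
}


def expand_query(name, taxa):
    """Return the search term followed by its synonyms and all higher taxa."""
    entry = taxa.get(name)
    if entry is None:
        return [name]  # unknown names are searched as typed
    expanded = [name, *entry["synonyms"]]
    parent = entry["parent"]
    while parent is not None:
        expanded.append(parent)
        # Stop climbing when a parent is missing from the table.
        parent = taxa.get(parent, {}).get("parent")
    return expanded


expanded_terms = expand_query("Apis mellifera", TAXA)
```
]]></preformat>
        <p>A search backend could then match records against any of the expanded terms instead of the single name the user typed.</p>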
      </sec>
      <sec id="sec-2-3">
        <title>Related Initiatives</title>
        <p>
          Several systems provide a comparable terminology service. The most widely
used is BioPortal [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a repository providing access to a large number of biomedical
ontologies; AgroPortal [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is its counterpart for agriculture and earth sciences. Finto
(Finnish thesaurus and ontology service) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a vocabulary service offering interfaces
to ontologies from different domains, such as art, geography, science and medicine. The
Ontology Lookup Service (OLS) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a system integrating publicly available biomedical
ontologies. Finally, Aber-OWL [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a framework that provides reasoning services
over bio-ontologies. Specific project requirements motivated our decision to set up
our own solution, for instance the degree of heterogeneity of the terminologies
considered and the need to combine ontology content with annotations to perform
semantic search. More details about the requirements and a detailed comparison with
existing systems can be found in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Terminology Preparation Steps</title>
      <p>Researchers who want to have their terminology included in the GFBio Terminology Service
need to contact the TS team, either directly or through the GFBio Submission Page<fn id="fn13"><label>13</label><p>https://www.gfbio.org/data/submit/generic; this is the same page as for general GFBio data submission.</p></fn>.
To make a terminology meet the requirements for import into the Terminology Service, several
processing steps might be required. These steps are carried out in close cooperation between
the TS team and the scientist(s) providing the terminology. The steps vary strongly
between individual terminologies, depending on their type and complexity and on the preparatory work
already done by the involved scientists.</p>
      <p>The simplest case is when a dedicated list
of terms is available as part of the supplementary research data, ideally with definitions
and connections between the terms. In cases where no dedicated list of terms or formal
documentation is available, the terms are extracted from the primary research data.
This can range from simply exporting individual columns or tables from the
primary research data to performing complex parsing operations on the data to filter out
the desired terms. The software used for these extractions depends on the
original data; for example, when the terminology is embedded in geographic data files,
common GIS software is used to extract it. The goal of the extraction process is to end
up with a tabular file of the individual terms and their corresponding information, such as
hierarchies, if these can be extracted as well. Once the extraction is done, the scientists are
asked to review the information for completeness and correctness and to provide any
missing information that was not part of the original research data, such as definitions,
translations or hierarchical structures in cases where they could not be extracted.</p>
      <p>The
next step of the terminology processing is data refinement and cleanup, which again
is done in close contact with the contributing scientist(s). The refinement is usually done
using OpenRefine<fn id="fn14"><label>14</label><p>www.openrefine.org</p></fn>, to catch errors such as spelling mistakes in term names that result
in two very similar but not identical terms. Additional tools are sometimes also
used to check for logical errors in the structure or other errors that cannot be detected
using OpenRefine.</p>
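      <p>The extraction and cleanup steps can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the sample data, the column name <monospace>land_use</monospace> and the similarity threshold are hypothetical, and the string comparison from the standard library stands in for the clustering offered by OpenRefine rather than reproducing it.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical primary research data; the column name "land_use" is an assumption.
PRIMARY_DATA = """plot_id,land_use
P1,meadow
P2,pasture
P3,medow
P4,meadow
P5,forest
"""


def extract_terms(csv_text, column):
    """Count the distinct values of one column; each value is a candidate term."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column].strip() for row in reader)


def near_duplicates(terms, threshold=0.8):
    """Return pairs of distinct terms whose similarity ratio reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(terms), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


terms = extract_terms(PRIMARY_DATA, "land_use")
suspects = near_duplicates(terms)
```
]]></preformat>
      <p>Pairs flagged in this way are candidates for review only; whether two similar names denote the same term remains a decision for the contributing scientists.</p>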
      <p>Each term of the terminology receives an individual URI, which makes it
addressable as a resource in the Semantic Web context. To avoid creating additional URIs for
the same concepts, similar terminologies are searched for and, if available, their terms
are compared to the terms of the current terminology. In cases where terms are identical,
the already existing URI is used. If terms are comparable but not identical to terms from
other terminologies, the relation between the terms is recorded using properties
such as skos:broader or skos:related. When new URIs are assigned to terms, contributing
scientists have two options: the terms can either get the GFBio TS prefix<fn id="fn15"><label>15</label><p>The TS URIs are formatted like this: http://terminologies.gfbio.org/terms/&lt;terminologyname&gt;/&lt;term-name&gt;</p></fn>,
or the scientists can provide their own prefix. URIs with the TS prefix are resolvable and
provide both human- and machine-readable formats, depending on who is resolving the
link. Custom URI prefixes, on the other hand, can help with the branding of research projects,
but the researchers are then responsible for making the URIs resolvable themselves if they
wish to have this highly recommended feature.</p>
      <p>Finally, the metadata of the terminology itself are
formalized and the terminology is exported. Depending on its complexity, the export format is usually
SKOS, OWL or another RDF-based format that can be imported into the GFBio
Terminology Service. The export is done by creating a template into which the individual
terms are inserted and using the OpenRefine templating engine to generate the final
RDF file. After a final check and validation, the file is imported into the GFBio TS,
where the terms are then accessible via the TS API. When several scientists from the
original research project wish to work collaboratively and simultaneously on reviewing
and extending the terminology during the feedback steps mentioned above, the
TS team can provide dedicated tools.</p>
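      <p>URI assignment and template-based export can be sketched as follows. The URI prefix follows the TS pattern quoted in this paper; everything else is an assumption made for illustration: the slug rule in <monospace>slugify</monospace>, the example terminology and its terms, and the use of plain Python string formatting in place of the OpenRefine templating engine. The output is SKOS in Turtle syntax.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

TS_PREFIX = "http://terminologies.gfbio.org/terms/"  # TS URI pattern quoted in the paper

TERM_TEMPLATE = """<{uri}> a skos:Concept ;
    skos:prefLabel {label} ;
    skos:definition {definition}{broader} .
"""


def slugify(name):
    """Lower-case a name and join its alphanumeric parts with hyphens (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def mint_uri(terminology, term, existing=None):
    """Reuse the existing URI of an identical term, otherwise mint a new TS URI."""
    if existing and term in existing:
        return existing[term]
    return f"{TS_PREFIX}{slugify(terminology)}/{slugify(term)}"


def turtle_literal(text, lang="en"):
    """Quote a string as a language-tagged Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"@{lang}'


def export_skos(terminology, terms, existing=None):
    """Fill the template once per term and return the complete Turtle document."""
    parts = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for term in terms:
        broader = ""
        if term.get("broader"):
            parent = mint_uri(terminology, term["broader"], existing)
            broader = f" ;\n    skos:broader <{parent}>"
        parts.append(TERM_TEMPLATE.format(
            uri=mint_uri(terminology, term["label"], existing),
            label=turtle_literal(term["label"]),
            definition=turtle_literal(term["definition"]),
            broader=broader,
        ))
    return "\n".join(parts)


# Hypothetical terminology used only for this example.
TERMS = [
    {"label": "grassland", "definition": "Land dominated by grasses."},
    {"label": "hay meadow", "definition": "Grassland mown for hay.", "broader": "grassland"},
]
turtle = export_skos("Land Use Types", TERMS)
```
]]></preformat>
      <p>Passing a mapping of already published terms as <monospace>existing</monospace> makes the sketch reuse those URIs instead of minting new ones, which mirrors the reuse rule described above.</p>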
    </sec>
    <sec id="sec-4">
      <title>Advantages of Accessible Research Terminologies</title>
      <p>
        Having research terminologies accessible brings several advantages.
The foremost gain is that the primary research data themselves become more understandable
and reusable when the definitions and underlying hierarchies of the terms used to
express them are available as well. This is the primary use case of supplementary research
data. These advantages can be extended further if the primary research data are served
through a semantics-aware search or portal, as this allows for queries that also include
synonyms or hierarchically higher terms, as demonstrated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Additional benefits can
arise if the primary research data use the terms not only as a textual representation
(i.e. by copying a term's name) but as a semantic annotation, using the concept URI to link to
the term instead.
      </p>
      <p>
        Making the terms and their definitions publicly available strongly
encourages their reuse, whether in a subsequent project by the same researchers
or by researchers from other projects. Reusing terms not only saves time and
effort for the people involved but also makes the research data produced by
different projects more comparable, reusable and integrable.
      </p>
      <p>
        While journal publications
of research papers and their subsequent number of citations are still the de facto standard
for measuring research impact, in recent years new approaches have emerged to measure
other kinds of scientific output as well, such as data publications or continuous work on
service infrastructure. All terminologies on the GFBio Terminology Service can be cited
as a research product, which gives credit to the researchers who invested time and
effort in creating them.
      </p>
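      <p>The difference between copying a term's name and annotating with its concept URI can be made concrete with a small sketch. The URIs below follow the TS pattern but are hypothetical, as are the terminology, the records and the function names; the example only illustrates the principle.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical concept URIs that follow the TS pattern; terminology and terms are made up.
GRASSLAND = "http://terminologies.gfbio.org/terms/land-use/grassland"
MEADOW = "http://terminologies.gfbio.org/terms/land-use/meadow"
BROADER = {MEADOW: GRASSLAND}  # child URI -> parent URI

# Two projects label the same concept differently but annotate it with the same URI.
RECORDS = [
    {"plot": "A1", "land_use": "Wiese", "land_use_uri": MEADOW},
    {"plot": "B7", "land_use": "meadow", "land_use_uri": MEADOW},
    {"plot": "B8", "land_use": "grassland", "land_use_uri": GRASSLAND},
]


def by_label(records, label):
    """Textual matching: finds only records that copied exactly this name."""
    return [r["plot"] for r in records if r["land_use"] == label]


def by_concept(records, concept, broader):
    """Semantic matching: finds records annotated with the concept or a narrower one."""
    hits = []
    for record in records:
        uri = record["land_use_uri"]
        # Walk up the hierarchy until the concept is reached or the top is passed.
        while uri is not None and uri != concept:
            uri = broader.get(uri)
        if uri == concept:
            hits.append(record["plot"])
    return hits
```
]]></preformat>
      <p>In this toy data set, matching on the label "meadow" misses the record that uses the German name, while matching on the concept URI finds both, and a query for the broader concept also returns the narrower records.</p>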
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The GFBio Terminology Service is an important resource both for scientists who wish
to share the terminologies they use to describe and classify research data and
for researchers who wish to apply existing terminologies and classifications to their
own research data to improve their integrability. With reasonable additional effort,
terminologies can be processed for inclusion in the TS, and both the scientists who
created the terminologies and the scientific community as a whole can benefit from these
otherwise neglected research data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Côté</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Apweiler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Hermjakob</surname>
          </string-name>
          .
          <article-title>The ontology lookup service, a lightweight cross-platform tool for controlled vocabulary queries</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>M.</given-names> <surname>Diepenbroek</surname></string-name>,
          <string-name><given-names>F. O.</given-names> <surname>Glöckner</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Grobe</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Güntsch</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Huber</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>König-Ries</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Kostadinov</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nieschulze</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Seeger</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Tolksdorf</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Triebel</surname></string-name>.
          <article-title>Towards an integrated biodiversity and ecological research data management and archiving platform: The German Federation for the Curation of Biological Data (GFBio)</article-title>
          . In
          <source>44. Jahrestagung der Gesellschaft für Informatik</source>
          , Stuttgart, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoehndorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Slater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Schofield</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Gkoutos</surname>
          </string-name>
          .
          <article-title>Aber-OWL: a framework for ontology-based data access in biology</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toulet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arnaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dzalé Yeumo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pesce</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Larmande</surname>
          </string-name>
          .
          <article-title>Reusing the NCBO BioPortal technology for agronomy to build AgroPortal</article-title>
          .
          In
          <source>ICBO: International Conference on Biomedical Ontologies</source>
          , 3 p.,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>N.</given-names> <surname>Karam</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Müller-Birn</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Gleisberg</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fichtmüller</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Tolksdorf</surname></string-name>, and
          <string-name><given-names>A.</given-names> <surname>Güntsch</surname></string-name>.
          <article-title>A terminology service supporting semantic annotation, integration, discovery and analysis of interdisciplinary research data</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>195</fpage>
          -
          <lpage>205</lpage>
          ,
          <year>Nov 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>F.</given-names> <surname>Löffler</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Opasjumruskit</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Karam</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fichtmüller</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Klan</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Müller-Birn</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Schindler</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Diepenbroek</surname></string-name>.
          <article-title>Honey bee versus Apis mellifera: A semantic search for biological data</article-title>
          . In
          <string-name><given-names>E.</given-names> <surname>Blomqvist</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Maynard</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gangemi</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hoekstra</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Hitzler</surname></string-name>, and
          <string-name><given-names>O.</given-names> <surname>Hartig</surname></string-name>, editors,
          <source>The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part I</source>
          , Lecture Notes in Computer Science,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Whetzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Griffith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>BioPortal: ontologies and integrated data resources at the click of a mouse</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>37</volume>
          (
          <issue>Web Server issue</issue>
          ):
          <fpage>170</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>O.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pessala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tuominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lappalainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nykyri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ylikotila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Frosterus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          .
          <article-title>Deploying national ontology services: From ONKI to Finto</article-title>
          .
          <source>In Proceedings of the Industry Track at the International Semantic Web Conference</source>
          <year>2014</year>
          . CEUR Workshop Proceedings,
          <year>October 2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>