[Dai:Si] - A Modular Dataset Retrieval Framework
with a Semantic Search for Biological Data
Fateme Shafiei1 , Felicitas Löffler1 , Sven Thiel1 , Kobkaew Opasjumruskit2 ,
Denis Grabiger1 , Pauline Rauh1 and Birgitta König-Ries1,3,4
1
  Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science,
Friedrich Schiller University Jena, Jena, Germany
2
  Software Systems for Digitalization, Institute of Data Science, German Aerospace Center (DLR), Jena, Germany
3
  German Center for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
4
  Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany


                                         Abstract
                                         Dataset search is receiving increasing attention in a scholar’s daily research practice. In biodiversity
                                         research, dataset retrieval in particular is a challenging and time-consuming task as most search services
                                         in current data portals only offer a simple keyword-based search. In this work we introduce [Dai:Si], a
                                         modular framework for dataset retrieval with a semantic search for biological data. [Dai:Si] is based on a
                                         former semantic search service developed within the scope of the GFBio project. It allows the expansion
                                         of query keywords with related terms using GFBio’s Terminology Service. This new version provides
                                         an enhanced user interface (UI) with explanations of related semantic terms upon demand. Due to its
                                         modular structure, [Dai:Si]’s semantic service can now be used independently of the user interface (UI).

                                         Keywords
                                         Dataset search, Dataset retrieval, Semantic search, Query expansion, Biodiversity informatics


1. Introduction
Dataset search is an increasingly important task in daily research practice. In particular, in
biodiversity research, the search for datasets and their reuse has steadily increased over the last
decade [1]. However, scholars report difficulties in finding relevant datasets [2, 3]. “Inadequate
search tools” [3] constitute one obstacle. Most data portals only offer a keyword-based search
along with a faceted search to look for scientific datasets, e.g., DataOne 1 or Zenodo 2 . In
these search systems, relevant datasets can only be found when a query keyword syntactically
matches the content of a dataset. As biological terms are often fuzzy [4], further related semantic
terms should be taken into account in dataset search. So far there are very few approaches that
S4BioDiv 2021: 3rd International Workshop on Semantics for Biodiversity, held at JOWO 2021: Episode VII The Bolzano
Summer of Knowledge, September 11–18, 2021, Bolzano, Italy
" fateme.shafiei@uni-jena.de (F. Shafiei); felicitas.loeffler@uni-jena.de (F. Löffler); sven.thiel@uni-jena.de
(S. Thiel); kobkaew.opasjumruskit@dlr.de (K. Opasjumruskit); denis.grabiger@uni-jena.de (D. Grabiger);
pauline.rauh@uni-jena.de (P. Rauh); birgitta.koenig-ries@uni-jena.de (B. König-Ries)
 0000-0001-9731-9496 (F. Shafiei); 0000-0001-6423-7427 (F. Löffler); 0000-0003-3093-5635 (S. Thiel);
0000-0002-9206-6896 (K. Opasjumruskit); 0000-0002-2382-9722 (B. König-Ries)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings         CEUR Workshop Proceedings (CEUR-WS.org)
                  http://ceur-ws.org
                  ISSN 1613-0073


                  1
                    DataOne, https://www.dataone.org/
                  2
                    Zenodo, https://zenodo.org/
Figure 1: [Dai:Si]’s architecture and the overall flow of the semantic search.


additionally support scholars with semantic services. In the biomedical domain, Datamed [5] 3
offers a search portal that expands query terms with related terms using the UMLS service 4 .
In biodiversity research, within the scope of the GFBio 5 project a semantic search based on
query expansion has been introduced [6]. The knowledge base behind the system is GFBio’s
own Terminology Service [7] offering tailored ontology services for biodiversity research.
   In this work, we present a new version of this semantic search for biological datasets. It is
part of a modular framework - [Dai:Si] (the name is an abbreviation of the phonetic spelling of
‘dataset search’) - that allows developers to use the semantic search independently of the user
interface. In addition, explanations of the expanded terms are now available on demand, and
the search can be expanded with narrower or broader terms. The code is publicly available in
our GitHub repository: https://github.com/fusion-jena/DaiSi.


2. Architecture
The architecture is presented in Figure 1. The framework consists of a middleware, implemented
with NodeJS 6 , and a front-end, implemented with Angular 7 . The NodeJS server communicates
through a REST API with both back-end applications, the GFBio Terminology service 8 and the
GFBio search index.
  Modularity is one of the main aims of the framework. Therefore, domain and business
specific logic are separated from functional components. This allows an easy integration of
additional search indexes. For each search index a new module is added to the middleware.
The search index only has to provide some fields that need to be mapped to the underlying
data model. More details are described on our GitHub page. The terminology service can be
replaced by other services in a configuration file in the middleware. However, as no protocols
    3
      Datamed, https://datamed.org/
    4
      UMLS, https://www.nlm.nih.gov/research/umls/
    5
      GFBio, https://www.gfbio.org
    6
      NodeJS, https://nodejs.org/en/
    7
      Angular, https://angular.io/
    8
      GFBio TS, https://terminologies.gfbio.org/
Figure 2: Screenshot of [Dai:Si]’s semantic search. A search for ’honeybee’ will extend the query to
related terms such as ’bee’ or ’Apis mellifica’.


or standards for terminology services exist yet, changing the service might require adjustments
to the middleware functions based on the API requests and responses.
   [Dai:Si]’s UI consists of four main components: filters, map, search input and search result.
For collecting documents during search, e.g. for later download, a data basket is provided.
When users look for datasets in the semantic search (1), all query terms are sent to GFBio’s
terminology service (2). Matching URIs are looked up per query term and are sent back to the
middleware (3). In the current version, only synonyms including scientific and common names
are considered. Afterwards, the expanded search terms are sent to the search index (4). The
result (5) is forwarded to the frontend and the returned datasets are displayed (6). For now all
search terms are combined with a logical OR. However, if the search index supports boosting
of results containing all or most search terms, datasets with the largest match are presented
on top. Figure 2 presents a screenshot of [Dai:Si]’s semantic search. The user can obtain more
information about the expanded terms by hovering over them. An explanation dialog displays
the URIs found, their ontologies and a description. This supports users in understanding the
relation between the originally entered keyword and the expanded terms. Users can also query
for further semantic relations such as child (narrower) or parent (broader) concepts on demand.
These related terms can be added to the search input field with a double-click.


3. Demonstration
We provide a demonstration of [Dai:Si] with GFBio’s search index: https://dev.gfbio.uni-jena.
de/daisi. Users can either search for datasets with the original search without query expansion,
or they can try out the semantic search. All middleware services, including the semantic search,
are also accessible separately: https://dev.gfbio.uni-jena.de/daisi-api/api-docs/.
4. Conclusion
In this work, we presented [Dai:Si] - a new modular dataset retrieval framework with a semantic
search for biological data. We aim to enhance the semantic search to permit the usage of AND,
OR, NOT and quotation marks in the search input field. We would also like to integrate further
semantic services to highlight important biological entities, e.g., species, environmental terms
or data parameters.


Acknowledgments
We would like to thank the following colleagues for their support in terms of the search index
and terminology services: U. Schindler, A. Behnken, F. Becker and N. Karam.


References
[1] GBIF, GBIF Science Review 2020, Technical Report, 2020. doi:10.35035/bezp-jj23.
[2] A. Culina, T. W. Crowther, J. J. C. Ramakers, P. Gienapp, M. E. Visser, How to do meta-
     analysis of open datasets, Nature Ecology & Evolution 2 (2018) 1053–1056. doi:10.1038/
     s41559-018-0579-2.
[3] K. Gregory, P. Groth, A. Scharnhorst, S. Wyatt, Lost or found? Discovering data needed for
     research, Harvard Data Science Review 2 (2020). doi:10.1162/99608f92.e38165eb.
[4] A. E. Thessen, H. Cui, D. Mozzherin, Applications of natural language processing in
     biodiversity science, Advances in Bioinformatics (2012). doi:10.1155/2012/391574.
[5] X. Chen, A. E. Gururaj, B. Ozyurt, R. Liu, E. Soysal, T. Cohen, F. Tiryaki, Y. Li, N. Zong,
     M. Jiang, D. Rogith, M. Salimi, H.-E. Kim, P. Rocca-Serra, A. Gonzalez-Beltran, C. Farcas,
    T. Johnson, R. Margolis, G. Alter, I. M. Fore, L. Ohno-Machado, J. S. Grethe, H. Xu, Datamed
    - an open source discovery index for finding biomedical datasets, Journal of the American
     Medical Informatics Association 25 (2018) 300–308. doi:10.1038/s41559-018-0579-2.
[6] F. Löffler, K. Opasjumruskit, N. Karam, D. Fichtmüller, U. Schindler, F. Klan, C. Müller-
     Birn, M. Diepenbroek, Honey bee versus Apis Mellifera: A semantic search for bio-
     logical data, Springer International Publishing, Cham, 2017, pp. 98–103. doi:10.1007/
     978-3-319-70407-4_19.
[7] N. Karam, C. Müller-Birn, M. Gleisberg, D. Fichtmüller, R. Tolksdorf, A. Güntsch, A
     terminology service supporting semantic annotation, integration, discovery and analysis
     of interdisciplinary research data, Datenbank-Spektrum 16 (2016) 195–205. doi:10.1007/
     s13222-016-0231-8.