Integrated Semantic Search on Structured and Unstructured Data in the ADOnIS System Friederike Klan, Erik Faessler,Alsayed Algergawy, Birgitta König-Ries, and Udo Hahn Friedrich-Schiller-Universität Jena, Jena, Germany firstname.lastname@uni-jena.de Abstract. We introduce ADOnIS, an information system which co- herently integrates two important, yet mostly disparate data sources, namely structured, tabular data, and unstructured data in terms of pub- lications. The integration is achieved by providing the underlying back- ground knowledge of the domains involved in terms of adequately tai- lored ontologies. Once the two basic data sources are semantically linked, entirely novel opportunities for cross-source information retrieval arise which we will highlight in this paper. 1 Introduction Two mutually separated “data cultures” have emerged over the years and still persist in the field of information systems. On the one hand, the database com- munity focuses on the structured representation of slices of the reality, typically in terms of relations and tables. On the other hand, the information retrieval community deals with, from a computational view, unstructured data, namely documents as streams of characters (and other media types, such as visual data) and tries to computationally interpret (and thus restructure) the meaning en- coded in these textual data carriers. Both worlds rest on solid mathematical foundations and stable technical implementations on the basis of which huge amounts of structured and unstructured data can be managed and searched on an industrial scale. Yet, with the exception of activities aiming at the Semantic Web (for a survey, cf. [20]) they currently lack crossover. This lack of integration hampers the usability of data at all levels. Consider, as a concrete example, an interdisciplinary research community such as the one established in the collaborative research center (CRC) AquaDiva, our research environment [16].1 AquaDiva explores the role of water (Aqua) and biodiver- sity (Diva) for shaping the structure, properties and functions of the earth’s subsurface. When a graduate student enters the CRC, she might be interested in the transport of viruses in the geological subsurface. In order to get started the student searches for an overview of the state of the art and hints what has been done on this topic in AquaDiva so far. So she searches for relevant publi- cations in portals like PubMed or Google Scholar and poses search queries 1 http://www.aquadiva.uni-jena.de/ to the BExIS 2 data portal, the central information system hub of the project to obtain data that have been collected already. Typically, the student will start with one query and then try to navigate results and find related entries. Her success will strongly depend on her familiarity with the special mix of domains, skills of interacting with search engines and data repositories (including SQL/SPARQL-style query languages), her knowledge of linguistic variants and the taxonomic structures of the relevant sublanguages. For instance, queries for “virus transport subsurface”, “virus transport soil”, and “phages transport soil” typically return only partially overlapping result sets in PubMed or standard data management systems. This is due to simplistic string matching criteria, the incapability to account for linguistic variations of the same content (inflection variants, phrasal paraphrases, or synonyms) and the general lack of conceptual background knowledge (e.g., the taxonomic or partonomic structure of the do- mains’ terminologies). In our work, we aim to account for these deficiencies in a systematic way. The solution we propose is implemented in ADOnIS, the AquaDiva Ontology-based Information System that provides integrated and seamless access to structured data and unstructured publications by making use of a variety of semantic tech- nologies such as ontologies and natural language processing (NLP) tools. With this, we hope to reduce the cognitive burden put on searchers while, at the same time, we intend to increase the coverage and quality of search results. In this paper, we briefly describe the methodologies underlying ADOnIS and the way users can interact with the system. 2 Related Work Data in general and scientific data specifically can be roughly categorized into structured and unstructured data. Unstructured data has no predefined data model and is typically text-heavy. Due to its unstructured nature, it is a challeng- ing task to extract specific and useful information [6, 10]. Retrieval algorithms for unstructured data often rely on keyword-based indexing and comparison tech- niques. They typically offer a search box query interface, where the searcher can input keywords of interest. Due to its simplicity, this kind of user interface, is very intuitive and easy to use. This comes at a cost. The semantics of the search query in terms of a set of input terms is not explicitly given and needs to be revealed by the information system. On the other hand, structured data is data that is organized according to a predefined (but not necessarily explicitly known) data model, such as a table in a relational database (known data model), a document in RDF format ((partially) known data model) or a spreadsheet (unknown implicit data model). This pre- defined data model (if known2 ) enables search based on structured queries (e.g. SQL or SPARQL queries) with a well-known semantics. Although these kind 2 In cases where the underlying data model is implicit (e.g. in spreadsheets), it needs to be provided by the data creator or has to be automatically extracted using machine- learning techniques. The latter can be particularly challenging, since in contrast to of query interfaces make it easy to effectively identify and discover a piece of information and access it in concise way, they are rather complex and thus less suited to users with a non computer science background. Recent approaches have therefore started to combine and integrate keyword-based search approaches for unstructured data and concept-based approaches for structured data [3, 6, 2, 18, 19]. K-search is one of the earliest works on hybrid search that supports the re- trieval of documents and knowledge [2]. The K-search approach aims at searching the Semantic Web as a collection of documents (unstructured data) and meta- data (structured data). To achieve this goal, a hybrid strategy is proposed, where keyword-based and metadata-based search strategies are combined. K-Search uses two separate indexes for the hybrid search and combines the results afterwards via result intersection [10]. An ontology-based retrieval system is proposed in [6]. It adapts the classical vector space representation to be suitable for large-scale information sources. An ontology-based scheme is used to semi-automatically produce document annotations that are used for a semantic search. To cope with incomplete information in the knowledge base, the semantic search is combined with a conventional keyword-based search. Gärtner et al. [10] suggest a semantic search system (HS 3 ) that aims at semantically bridging the gap between struc- tured and unstructured data. HS 3 is an automated system that augments an arbitrary knowledge base with additional information extracted from the Web. These information can then be used to build a document corpus and a combined index. This index is leveraged for a hybrid semantic search strategy that com- bines keyword-based and concept-based search. TextTile is a data visualization tool for datasets and query examination that requires a flexible analysis of struc- tured data and unstructured text [9]. The tool includes a set of operations that can be interchangeably applied to structured as well as to unstructured textual data parts to generate useful data summaries. The tool does not make use of ontologies and semantic reasoning during the search process. An semantic search architecture specifically designed for biodiversity data is suggested in [1]. The proposed system aims at improving the quality of the search results by exploiting ontologies and the contextual meaning of data. A mapping component links biodiversity data and concepts of a domain-specific on- tology, OntoBio. A web interface supports end users to access data via SPARQL endpoints. In order to achieve this, the tool transforms domain ontologies, taxo- nomic information as well as biodiversity data into a common format. This has two disadvantages: datasets are duplicated and it becomes harder to reason on such big data. The ELSEWEB framework [22] aims at facilitating the integra- tion of environmental data and providing semantic bridges between these data and species distribution models. text-based documents, e.g. data tables, often reveal only scarce information that might give a hint to its meaning. 3 Overview of ADOnIS We have implemented ADOnIS as an extension to the BExIS 2 data man- agement platform [7]3 . In the following, we describe its two basic subsystems, namely the one dealing with already structured, tabular data (Sect. 3.1), and the one dealing with unstructured textual input on the basis of the semantic document search engine SeMedico (Sect. 3.2). The two components are sup- plemented by a graphical user interface that allows users to enter search terms based on which ADOnIS retrieves relevant data stored in BExIS 2 as well as publications (Sect. 4). A comprehensive view of the whole architecture of ADO- nIS is provided in Fig. 1 which will be explained in the subsections to follow. 3.1 Handling Structured Data Scientific data stored in BExIS 2 typically refer to field observations and mea- surements and are organized in tables. Each table and its corresponding meta information is referred to as a dataset. In addition to the data table containing the data values, each dataset comprises the table schema (name, datatype and unit of measurement for each data column) and metadata such as information about the data provider. Both, the actual data values and the table schema, are stored in a relational database. To make the semantics of datasets explicit, we annotate each data table with conceptual knowledge encoded in ADOn , a domain-specific ontology expressed in Owl 2.4 The ontology is tailored to the needs of the description of obser- vational data from the life sciences domain. It only includes relevant classes and properties of these as TBox statements. Assertions about data values and data annotations, i.e. ABox statements, are not materialized in the ontology. Instead, we use the ontology-based data access system Ontop [5]. Based on a given ontology and a set of mappings that relate class and property symbols in the ontology to SQL views over the data in the database, Ontop provides a virtual RDF graph that can be queried using Sparql. This avoids duplication of instance data (that already reside in the relational database) and allows for sound and complete query answering in LOGSPACE under the OWL 2 QL en- tailment regime.5 In order to retrieve datasets relevant to a certain search query, we generate a set of proper Sparql queries from the user-provided keywords, thus removing the burden from the searcher to formulate queries using a formal query language. ADOn Ontology & Semantic Annotation. As core ontology, we use a mod- ified version of the Extensible Observation Ontology (Oboe) [17] (version 1.2) that provides classes and properties for the description of field observations and measurements. Sets of related observations are organized in oboe:Observation 3 http://bexis2.uni-jena.de/ 4 https://www.w3.org/TR/owl-syntax 5 https://www.w3.org/TR/owl-profiles/\#OWL_2_QL Fig. 1. System Architecture for ADOnIS Collections, which resemble the concept of a dataset in BExIS 2. Each data row in a BExIS 2 data table is modeled as one or more oboe:Observations. An observation refers to an oboe:Entity, e.g. a Tree, and a set of oboe:Measurements related to that entity. A measurement refers to an oboe:Characteristic, uses an oboe:Standard and results in a value. For instance, for a certain Tree entity, its Circumference (characteristic) might have been measured in meters (stan- dard) and the measured value is 0.8. Oboe allows to indicate contextual re- lationships between observations, e.g. a tree might have been observed within a certain forest and this forest is located in a certain area. Modeling observa- tions in this way enables logical inferences about entities and the relationships between them, as well as about measured characteristics of entities. In the life sciences domain, both observed entities and their characteristics are particularly important when trying to explain phenomena and thus play a key role when searching for datasets. To cover domain-specific characteristics and entities, we reuse concepts from domain ontologies such as Obi (biomedical investigations),6 Envo (environ- mental features),7 Ncit (biomedical concepts)8 and ChEBI (chemical enti- ties).9 These were selected using the Joyce tool for ontology selection and integrated into our ontology applying strict methodological criteria to guaran- tee non-redundancy, minimality, and optimal coverage [8]. These requirements were met by asserting subclass-relationships between concepts from a third-party ontology and either oboe:Characteristic or oboe:Entity. Since Ncit and ChEBI are huge in terms of the number of concepts they define, we used mod- ularization techniques [8] to reuse only needed parts of these ontologies. We also 6 http://obi-ontology.org 7 http://environmentontology.org 8 https://evs.nci.nih.gov/ 9 https://www.ebi.ac.uk/chebi defined additional properties of oboe:ObservationCollections, which directly relate datasets to observed entities, characteristics and standards (in contrast to Oboe, where these properties are related to individual observations). This enables efficient querying of these properties (instead of a potentially large set of observations (data rows) a much smaller number of datasets and their properties has to be inspected during search). Each BExIS 2 data value/data column was (manually10 ) annotated with an ontology class corresponding to the entity it refers to, an ontology class modeling the characteristic that was measured and a class referring to the measurement standard that was used. Moreover, for each dataset, we indicated contextual relationships between the observed entities. The semantic annotations are stored in a relational database. Ontop Mappings In order to enable Sparql queries over the conceptual view given by the ontology, we defined mappings that relate BExIS 2 datasets, the entities and characteristics they refer to, the measured values and the dataset annotations residing in the relational database to class and property symbols in the ontology. These mappings are fixed for a given ontology and database. The subsequent mapping for example, creates a (virtual) instance for each charac- teristic measured in some annotated BExIS 2 dataset. It indicates the type of this instance (some subclass of oboe:Characteristic) as given by the semantic annotation stored in the database table annotation (cf. mapping below), and relates it to dataset instances that refer to this characteristic (not depicted). mappingId CHARACTERISTIC-TYPE target :crct_{crct_id} a <{crct}> . source SELECT DISTINCT crct, chrct_id FROM annotation Query Generation Using this approach, we can pose Sparql queries about observational data stored in BExIS 2 on the schema level as well as on the level of individual data values. At the moment, we do not use the full poten- tial of this solution, but rather restrict ourselves to the retrieval of BExIS 2 datasets based on keyword queries. For that purpose, we translate the search terms into a set of Sparql queries. For each keyword that can be mapped to the label (via string comparison) of an ontology class C that is a subclass of oboe:Characteristic, we create the following SPARQL query (prefixes omit- ted) that returns all datasets that measure C. SELECT DISTINCT ?dset WHERE { 10 We are currently working on a data upload wizard which analyzes new datasets to (semi-)automatically identify semantically annotated data attributes (the type of measurement referred to in a dataset column, its datatype and unit of measurement) that are already known to and maintained by ADOnIS . Such a mechanism will enable semantic annotation with little user interaction. ?dset ad:refersToCharacteristic ?char. ?char a } For each keyword that can be mapped to the label of an ontology class E that is a subclass of oboe:Entity, this is done in a similar way, which also accounts for contextual relationships between entities. We create a Sparql query that asks for all datasets referring to entities of type E or to some entity that appears in the context of an entity of type E. SELECT DISTINCT ?dset WHERE { ?dset ad:refersToEntity ?ent. { ?ent a } UNION { ?ent ad:hasEntityContext ?entC. ?entC a } } If the label of a characteristic was entered directly before the label of an entity in the search box, we interpret this as a search for the given characteris- tic measured for the given entity. In case a keyword neither matches the label of an oboe:Characteristic nor the label of an oboe:Entity, we search for datasets containing data values matching the keyword. Finally, we return the union of the resulting datasets. The required information about the type of each provided keyword is delivered by an autocomplete function that provides sugges- tions while the user is typing words in the BExIS 2 search box. The suggestions are generated based on an index of entity and characteristic class labels defined in the underlying ontology. The keywords provided by the user as well as the keyword-related information are passed to the structured search module, which has been implemented as web service with a REST-API. 3.2 Handling Unstructured Data Unstructured data are handled by the SeMedico system which receives feeds from two sources, viz. more than 26 million life science abstracts from Med- line/PubMed1112 and more than 1.5 million life science full texts from PubMed Central from the open access subset. They are stored in a PostgreSQL database.13 Ontologies & Semantic Annotation. Terminological and ontological re- sources for the indexing of all documents come from various sources. Most no- table among them is the NCBI Gene database.14 SeMedico’s gene recognition and normalization engine maps gene mentions in the documents to unique NCBI 11 https://www.ncbi.nlm.nih.gov/pubmed 12 https://www.nlm.nih.gov/databases/download/pubmed_medline.html 13 https://www.postgresql.org/ 14 https://www.ncbi.nlm.nih.gov/gene Gene database entries to handle gene name synonymy and ambiguity. Addition- ally, SeMedico integrates the Gene Ontology (Go)15 and the Gene Reg- ulation Ontology (Gro)16 for the semantic description of different types of gene events. All resources are stored in a Neo4j17 graph database for direct access to their hierarchical structure. All terminologies, ontologies and databases are converted into a common JSON format. This format is then imported into Neo4j using a custom Neo4j server plugin. Natural Language Processing. Before Medline and PubMed Central documents are added to SeMedico’s index, they undergo an extensive linguistic analysis. The goal is to identify textual units referring to gene/protein mentions, ontology concepts, gene interaction events and factuality markers for them as expressed in the documents. To be able to recognize such higher-level semantic concepts, it is necessary to do basic linguistic analysis first like sentence and token segmentation, part-of-speech tagging and chunking. Semantic analysis includes species tagging by the Linnaeus tagger [11], gene mention tagging and normalization using GeNo [23], gene/protein event recog- nition with BioSem [4] and identification of event confidence ratings following the factuality rating as described by [13]. For BioSem, we use a model trained on the BioNLP Shared Task 2011 [15] training data that includes abstracts as well as full texts. MeSH, Go and Gro concepts are tagged by a dictionary component. All documents undergo linguistic processing employing the Uima18 compo- nent repository JCoRe [14]. The morpho-syntactic analysis includes the reso- lution of acronyms [21]. This step is crucial for the interactive disambiguation feature of SeMedico. We recognize textual mentions of ontology classes via pre- ferred names and their synonyms. When searching, also subclasses of query con- cepts are automatically included in the search, leveraging the ontology’s subclass hierarchy. Additionally, we employ dedicated named entity recognition tools for the detection of gene / protein mentions via GeNo [23] and species via the Lin- neaus species tagger [11]. We also look for textually expressed relations between genes / proteins in publications. We employ BioSem [4] to extract mentions of gene / protein interactions from sentences such as ”Here we show that recombinant Pnc1 stimulates Sir2 HDAC activity.” were semantic connections between genes, proteins or, in this case, enzymes are described. Such relations have a high information value for researchers who look for interaction data on specific entities of interest. Modern relation extraction engines such as BioSem are far superior to simpler approaches which identify co-occurrences of entity within formal text units (e.g., sentences). 15 http://www.geneontology.org/ 16 https://bioportal.bioontology.org/ontologies/GRO 17 https://neo4j.com/ 18 https://uima.apache.org/ However, mere interaction extraction does not take into account the confi- dence level the authors of a publication assign to these observational data. Con- sider the following sentence: ”These results may suggest that mTOR-mediated autophagy inhibition may result in mesangial cell proliferation in IgAN.” While the sentence expresses some interaction between mTOR and igAN, the authors carefully use speculative words like may and suggest. Such information should be integrated into a scientific data portal to serve as an indicator how trustworthy an information item really is. We store all these annotations together with the original, raw documents in the document database. In a last step, the analysis results required for semantic search are sent to an ElasticSearch cluster for indexing. We use a custom ElasticSearch plugin to have ElasticSearch accept a term format that allows to exactly specify index terms within the ElasticSearch index. We model the publication search module as a web service disclosing a REST- like API. The API accepts parameters for a query string, a sorting criterion and the range of result documents that should be returned. The server then returns a JSON encoded response, including document text and bibliographic information. 4 Implementation & Preliminary Results In this section, we introduce the GUI provided to the end user to facilitate the search process as well as preliminary evaluation results to demonstrate the effectiveness of the proposed method. To this end, we set up a running instance of the BExIS 2 system with the ADOnIS module that stores 55 real world datasets from the AquaDiva project19 . The datasets comprise 880 data columns and 539, 774 data rows in total. This results in 2, 420, 012 single data values. For the unstructured data search results Semedico stores more than 26M MEDLINE citations and approximately 1.5M PubMed Central full texts from the open access subset in its index. ADOnIS comes with a graphical user interface for the semantic search (Fig. 2). It is divided into three parts: the search box (top), where the user can enter keyword queries (one or more keywords), a section displaying publications (unstructured data) relevant to the query (left) and the list of retrieved BExIS 2 datasets (structured data) (right). An exemplary search using the keywords groundwater, concentration of and nitrate is shown in Fig. 2. The search delivers datasets that refer to the entity groundwater or entities that have been observed in the context of groundwater and datasets where the concentration of (characteristic) nitrate (entity) was measured. On the left-hand side, relevant publications are listed. To demonstrate the effectiveness of the search functionality of ADOnIS we compared its results to those of the original keyword-based search provided by 19 Currently, a subset of 15 datasets including 146 data attributes has been semantically annotated. Fig. 2. Search Interface BExIS 2 , which is powered by Apache Lucene20 indexing both datasets and its accompanying metadata. As a preliminary evaluation, we’ve run the sys- tem with keyword queries relevant within the AquaDiva project. We varied the query complexity by using one or more keywords. Exemplary results are re- ported in Table 1. In its current version, ADOnIS returns the union of both, the results returned by the semantic search and the results retrieved by the BExIS 2 standard search. This is to avoid an empty result set in cases where the se- mantic search does not retrieve any (exactly fitting) datasets. As a consequence, ADOnIS can just return additional datasets that have not been found by the original BExIS 2 search. For a single keyword, ADOnIS and BExIS 2 typically return the same results, since those keywords are often explicitly mentioned either in the datasets itself or in the metadata. However, if we consider more complex queries, ADOnIS delivers relevant results that BExIS 2 does not discover. As a next step, we will extend this preliminary evaluation. In particular, we plan to invite formal feedback from the AquaDiva researchers. This will cover both, an assessment of the relevance of the delivered search results21 as well as an evaluation of the user interface. In addition, we will evaluate how well the search scales with an increasing number of datasets. 5 Conclusion We introduced ADOnIS, an information system which coherently integrates two important, yet mostly disparate data sources, namely structured data from databases (or spreadsheets), on the one hand, and unstructured data in terms 20 https://lucene.apache.org/ 21 Note that, even if datasets are annotated correctly, the search might deliver results that the user did not expect, since ADOnIS interprets the user’s keywords in a certain way (cf. Sect. 3.1) that does not necessarily comply with the searcher’s query intend. Such a mismatch would be discovered by a user study with the AquaDiva researchers. Table 1. Search results Keywords # of ADOnIS results # of BExIS 2 results RNA 16 16 soil moisture 6 2 chemical upper aquifer 2 0 groundwater concentration of nitrate 6 0 of publications, on the other hand. The integration is achieved by providing the underlying background knowledge of the domains involved in terms of adequately tailored ontologies. Once the two basic data sources are semantically linked, entirely novel opportunities for cross-source information retrieval arise. 6 Acknowledgments This work has been mostly funded by the Deutsche Forschungsgemeinschaft (DFG) as part of the CRC 1076 AquaDiva. References 1. F. K. Amanqui, K. J. A. Serique, S. D. Cardoso, J. L. C. dos Santos, A. C. F. Albuquerque, and D. A. Moreira. Improving biodiversity data retrieval through semantic search and ontologies. In 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Warsaw, Poland, August 11-14, 2014 - Volume II, pages 274–281, 2014. 2. R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli. Hybrid search: Effectively combining keywords and semantic searches. In 5th European Semantic Web Conference, ESWC, pages 554–568, 2008. 3. N. Bikakis, G. Giannopoulos, T. Dalamagas, and T. K. Sellis. Integrating keywords and semantics on document annotation and search. In On the Move to Meaningful Internet Systems, OTM 2010 - Confederated International Conferences: CoopIS, IS, DOA and ODBASE, Hersonissos, Crete, Greece, October 25-29, 2010, Pro- ceedings, Part II, pages 921–938, 2010. 4. Q. C. Bui, E. M. van Mulligen, D. Campos, and J. A. Kors. A fast rule-based approach for biomedical event extraction. In Proceedings of the BioNLP 2013 Shared Task Workshop, pages 104–108, Sofia, Bulgaria, 2013. 5. D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-Muro, and G. Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic Web –Interoperability, Usability, Applicability, 8(3):471–487, 2017. 6. P. Castells, M. Fernández, and D. Vallet. An adaptation of the vector-space model for ontology-based information retrieval. IEEE Trans. Knowl. Data Eng., 19(2):261–272, 2007. 7. J. Chamanara and B. König-Ries. A conceptual model for data management in the field of ecology. Ecological Informatics, 24:261–272, 2014. 8. E. Faessler, F. Klan, A. Algergawy, B. König-Ries, and U. Hahn. Selecting and tai- loring ontologies with Joyce. In Proc. of the Intl. Conf. on Knowledge Engineering and Knowledge Management. Springer, 2017. 9. C. Felix, A. V. Pandey, and E. Bertini. Texttile: An interactive visualization tool for seamless exploratory analysis of structured data and unstructured text. IEEE Trans. Vis. Comput. Graph., 23(1):161–170, 2017. 10. M. Gärtner, A. Rauber, and H. Berger. Bridging structured and unstructured data via hybrid semantic search and interactive ontology-enhanced query formulation. Knowl. Inf. Syst., 41(3):761–792, 2014. 11. M. Gerner, G. Nenadic, and C. M. Bergman. Linnaeus: a species name identifica- tion system for biomedical literature. BMC Bioinformatics, 11:85, 2010. 12. R. V. Guha, R. McCool, and E. Miller. Semantic search. In Proceedings of the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hun- gary, May 20-24, 2003, pages 700–709, 2003. 13. U. Hahn and C. Engelmann. Grounding epistemic modality in speakers’ judgments. In D.-N. Pham and S.-B. Park, editors, Trends in Artificial Intelligence. PRICAI 2014 –Proceedings of the 13th Pacific Rim International Conference on Artificial Intelligence. Gold Coast, Australia, 1-5 Dec, 2014, number 8862 in Lecture Notes in Artificial Intelligence, pages 654–667. Springer, 2014. 14. U. Hahn, F. Matthies, E. Faessler, and J. Hellrich. UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. In Proc. of the Intl. Conf. on Language Resources and Evaluation, pages 2502–2509, Paris, 2016. 15. J. Kim, N. L. T. Nguyen, Y. Wang, J. Tsujii, T. Takagi, and A. Yonezawa. The genia event and protein coreference tasks of the bionlp shared task 2011. BMC Bioinformatics, 13(S-11):S1, 2012. 16. K. Küsel, K. U. Totsche, S. E. Trumbore, R. Lehmann, C. Steinhäuser, and M. Her- rmann. How deep can surface signals be traced in the critical zone? merging biodi- versity with biogeochemistry research in a central German Muschelkalk landscape. frontiers in Earth Science, 4:32, 2016. 17. J. Madin, S. Bowers, M. Schildhauer, S. Krivov, D. Pennington, and F. Villa. An ontology for describing and synthesizing ecological observation data. Ecological Informatics, 2(3):279–296, Oct. 2007. 18. P. Peng, L. Zou, and Z. Qin. Answering top-k query combined keywords and structural queries on RDF graphs. Inf. Syst., 67:19–35, 2017. 19. P. Peng, L. Zou, and D. Zhao. On the marriage of SPARQL and keywords. In Web Technologies and Applications - 17th Asia-PacificWeb Conference, APWeb 2015, Guangzhou, China, September 18-20, 2015, Proceedings, pages 3–16, 2015. 20. P. Ristoski and H. Paulheim. Semantic Web in data mining and knowledge dis- covery: A comprehensive survey. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 36:1–22, 2016. 21. A. S. Schwartz and M. A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In PSB 2003 – Proceedings of the Pacific Symposium on Biocomputing 2003. Kauai, Hawaii, USA, January 3-7, 2003, pages 451–462, 2003. 22. N. Villanueva-Rosales, N. R. D. Rio, D. Pennington, and L. G. Chavira. Semantic bridges for biodiversity sciences. In The Semantic Web - ISWC 2015 - 14th Inter- national Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II, pages 310–317, 2015. 23. J. Wermter, K. Tomanek, and U. Hahn. High-performance gene name normaliza- tion with geno. Bioinformatics, 25(6):815–821, 2009.