Building a Trophic Knowledge Graph to Support Soil Food Web Reconstruction Nicolas Le Guillarme1 , Mickaël Hedde2 and Wilfried Thuiller1 1 Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, LECA, Laboratoire d’Ecologie Alpine, F-38000 Grenoble, France 2 INRAE, UMR Eco & Sols, Montpellier, France Abstract While food webs are pivotal tools to understand the structure, dynamics and functioning of ecosys- tems, their reconstruction is not trivial since feeding relationships are not always known. To this end, soil ecologists often simplify the problem by either grouping morphologically similar organisms into trophic groups with known interactions or by assuming that feeding relationships are predictable from consumer diets (e.g. frugivore or bacterivore). Interestingly, the scientific community has collected a considerable amount of information on trophic interactions and feeding habits, some of it being available in structured databases, or disseminated in the scientific and grey literature. However, the large-scale ex- ploitation of these data for food web reconstruction is hampered by their dissemination in a multitude of heterogeneous datasets. The goal of our work is to propose a semantic data integration pipeline whose role will be to extract, aggregate, normalize, and integrate information about trophic interactions and diets into a trophic knowledge graph that will support the automatic reconstruction of soil food webs. Keywords soil ecology, food web, semantic data integration, knowledge graph 1. Introduction Food webs (also called trophic webs, trophic interaction networks) encode both the composition of ecological communities as well as the feeding relationships within the community. In soil ecology, food webs are often used to understand the structure and dynamics of soil assemblages, and their impact on decomposition processes and nutrient cycling [1]. Yet, reconstructing soil food webs is not a straightforward task. The nature of soil as a black-box ecosystem makes direct observations of most trophic interactions fairly impossible, and knowledge of resource preferences of many taxonomic groups of soil fauna are not well known. These preferences may be inferred from morphological similarities with species of known feeding habits or by phylogenetic proximity. The development of high-throughput species identification methods (e.g. eDNA metabarcoding), and the availability of massive amounts of data about trophic interactions and feeding habits collected by researchers over the past decades have paved the way for new knowledge-based methods to automate the reconstruction of food webs on an unprecedented scale [2]. However, despite recent efforts to facilitate access to large collections S4BioDiv 2021: 3rd International Workshop on Semantics for Biodiversity, held at JOWO 2021: Episode VII The Bolzano Summer of Knowledge, September 11–18, 2021, Bolzano, Italy " nicolas.leguillarme@univ-grenoble-alpes.fr (N. Le Guillarme); mickael.hedde@inrae.fr (M. Hedde)  0000-0003-4559-7579 (N. Le Guillarme); 0000-0002-6733-3622 (M. Hedde); 0000-0002-5388-5274 (W. Thuiller) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Figure 1: The same information about the feeding habits of saprotrophic fungus Hypholoma cap- noides as it can be found in fungaltraits (a fungal trait database), GloBI (the Global Biotic Interactions database), and an article published in Soil Biology & Biochemistry. Both fungaltraits and GloBI are struc- tured databases, while textual data is by nature unstructured. Fungaltraits focuses on trophic groups (Wood Saprotroph), while GloBI is a trophic interaction database, using terms from the RO ontology (e.g. eats) to describe the nature of the interaction. GloBI provides unambiguous species identification using taxon identifiers (e.g. GBIF:2533439), whereas in the other two data sources the consumer is referred to by its scientific name only. of structured data on species interactions [3], information about trophic interactions and diets (e.g. fungivory, xylophagy...) is still largely scattered in multiple heterogeneous datasets, as well as in the scientific and grey literature. These datasets may have different levels of structuredness and do not usually share common taxonomies and trophic classifications (Fig. 1), which hampers data interoperability and reuse and impose limits on the scale at which studies can be carried out. To address this issue, we are developing a semantic data integration pipeline whose role will be to integrate existing trophic datasets with a reference taxonomy and a new ontology of soil trophic webs. The resulting knowledge graph will offer unified access services over a large collection of trophic information from heterogeneous datasets. In addition, its formal semantics will allow additional knowledge about trophic group membership to be deduced from trophic interaction data, and vice versa, thus supporting food web reconstruction. 2. Semantic Integration of Trophic Data Our semantic integration pipeline receives data from multiple sources in different formats: structured data from open-access relational/graph databases or in-house datasets stored as tabular files, semi-structured data (XML, JSON), and ultimately, unstructured data as free-form text from scientific papers or wiki pages. The architecture of the pipeline is shown in Fig. 2. This architecture comprises three main building blocks: Data extraction. The role of this module is to fetch data from their respective sources and to store the extracted data as structured datasets in a local directory. The data extraction Figure 2: Our ontology-based data integration pipeline adopts an Extract-Transform-Load approach to build a trophic knowledge graph from heterogeneous data sources. module currently provides methods to access local or remote files (e.g. download dataset dumps from the Web), and to access RDF data via SPARQL endpoints. This module will also rely on text mining techniques (e.g. trophic relation extraction) to transform unstructured textual sources (scientific publications, wiki pages...) into structured datasets. Data transformation. The data transformation module is responsible for the normalization and RDF-ization of the extracted data. In particular, all taxonomic data (taxon names or identifiers) are mapped to entities in the NCBI Taxonomy using nomer [4], an open- source taxonomic alignment software. The NCBI Taxonomy was chosen as the reference taxonomy because it is the only taxonomy available in the form of an OWL ontology, which allows reasoning on taxonomic knowledge. Information concerning trophic interactions (consumer-resource relationships) and group membership is represented using terms from the Soil Trophic Web Ontology (STWO, [5]). This OWL ontology provides a formal description of trophic knowledge, including logical definitions of trophic groups that allow new facts to be deduced using OWL reasoning. STWO leverages existing resources by importing modules from existing OBO ontologies (e.g. RO, ECOCORE, ENVO...) and adds new classes for missing groups and resources. Semantically annotated data are then transformed into RDF triples using RMLMapper [6], an open-source Java library that executes RML mapping rules to generate RDF data from various input formats (e.g. CSV, JSON, XML, RDF...). These rules are dataset-specific, which means that a new set of rules have to be written manually for each new data source. At the end of the data transformation step, all datasets have been converted into named RDF graphs, the name of the graph being used to keep track of data provenance. Knowledge graph creation. This module simply loads all the datasets in N-Quads format into a triplestore. The result is a knowledge graph where taxonomic entities (consumers) are linked to resources (other taxonomic entities, anatomical entities, environmental mate- Figure 3: The trophic knowledge graph built using our semantic data integration pipeline provides a single access point to trophic group and interaction data from a multitude of sources, thus supporting the reconstruction of food webs from community composition data. An R package allows easy access to the knowledge contained in the graph by encapsulating SPARQL queries. rial...) by trophic interactions, either explicitly (e.g. the triple ncbi:CarabusHispanicus ro:eats ncbi:Gastropoda), or implictly through trophic group information (e.g. the triple ncbi:CarabusHispanicus rdf:type stwo:malacophage). Additional facts about trophic group membership or potential trophic interactions can be deduced from existing knowledge using OWL reasoning. For instance, the fact that C. hispanicus is a malacophagous organism can be inferred from the following facts: C. hispanicus consumes gastropods, Gastropoda is a taxonomic class within the phylum Mollusca, and the class of malacophagous organisms is logically defined in STWO as the set of all organisms consuming molluscs. Similarly, knowing that C. hispanicus is malacophage, it is possible to deduce that C. hispanicus interacts trophically with members of the phylum Mollusca. This type of reasoning can be carried out by the triplestore’s built-in reasoner (if any) or delegated to an external reasoner (e.g. Pellet, HermiT...). Our implementation relies on generic Python components that are instantiated and assembled into a dedicated pipeline for each new data source by providing a set of configuration files (including RML mapping rules for this specific dataset). External tools (e.g. nomer, RMLMapper) are run using Docker. Pipelines are scheduled and monitored using Apache Airflow. The entire integration workflow is fully automated and can be configured so that the knowledge graph is rebuilt on a regular basis (weekly, monthly...) to account for source dataset updates. Finally, end-users (mainly soil ecologists) can interact with the knowledge graph via an R package that implements the most common SPARQL queries, thus providing a friendly interface to users that do not have SPARQL knowledge (Fig 3). 3. Future Work Our vision is that of a semantic data integration pipeline that will enable ecologists to make the most of available information about soil trophic ecology, regardless of the initial format and location of this information. To achieve this, we are working on three fronts: the development of a domain ontology to represent knowledge about soil trophic ecology, the implementation of a user-friendly trophic knowledge graph construction pipeline, and last but not least, the development of trophic information extraction tools to handle unstructured data sources. We implemented a proof-of-concept of our data integration pipeline that is fully functional, easily extensible to new data sources, and generic enough so that it could be used to integrate other types of knowledge (e.g. organism traits, habitats, etc.), provided that the appropriate ontological resources exist. We are working on making the pipeline even more accessible to non-expert users. For instance, we would like to provide a easier way to generate RML mapping rules for new data sources. In a near future, we plan to focus on quality assessment by improving data provenance tracking and developing error detection methods. Trophic data integration at scale still faces a number of challenges: the existence of several reference taxonomies whose mapping to the NCBI Taxonomy (currently the only taxonomy available in ontology form) is far from trivial (e.g. the SILVA taxonomy), the limited lifetime of organism-related information due to taxonomic instability, the challenge of large-scale onto- logical reasoning... These are just a few examples of the many opportunities for collaboration between ecologists and knowledge engineers that could benefit biodiversity science. Acknowledgments The research received funding from the French Agence Nationale de la Recherche (ANR) through the GlobNets (ANR-16-CE02-0009) project and through MIAI@Grenoble Alpes (ANR-19-P3IA- 0003). References [1] S. Scheu, The soil food web: structure and perspectives, European journal of soil biology 38 (2002) 11–20. [2] Z. G. Compson, W. A. Monk, C. J. Curry, D. Gravel, A. Bush, C. J. Baker, M. S. Al Manir, A. Riazanov, M. Hajibabaei, S. Shokralla, et al., Linking DNA metabarcoding and text mining to create network-based biomonitoring tools: A case study on boreal wetland macroinvertebrate communities, Advances in ecological research 59 (2018) 33–74. [3] J. H. Poelen, J. D. Simons, C. J. Mungall, Global biotic interactions: An open infrastructure to share and analyze species-interaction datasets, Ecological Informatics 24 (2014) 148–159. [4] J. Poelen, globalbioticinteractions/nomer, 2021. URL: https://doi.org/10.5281/zenodo. 4925111. [5] N. Le Guillarme, M. Hedde, W. Thuiller, STWO : an ontology for soil food web reconstruction, S4BioDiv 2021: 3rd International Workshop on Semantics for Biodiversity (2021). [6] RML.io, RMLio/rmlmapper-java, 2021. URL: https://github.com/RMLio/rmlmapper-java.