Rule-Based Data Access: A Use Case in Agroecology

Elie Najm, Jean-François Baget and Marie-Laure Mugnier
LIRMM, Inria, University of Montpellier, CNRS, Montpellier, France

Abstract
There is a crucial need for tools to help design sustainable agrosystems. In this paper, we consider the issue of selecting plant species according to the ecosystem services they are likely to deliver. To do so, we rely on the one hand on recent scientific results in agronomy linking functional traits (i.e., measurable characteristics of plant species) to ecosystem services, and on the other hand on data collected by the research community in ecology. The architecture of our prototype is inspired by the ontology-based data access paradigm, which clearly distinguishes between the data level and the knowledge representation level, with mappings linking the two levels. Knowledge is represented in a rule language that extends plain Datalog with computed functions and stratified negation. We detail the construction of a knowledge base devoted to vine grassing, i.e., installing herbaceous service plants in vineyards, and briefly report on the experimental evaluation of the system’s results on this use case.

Keywords
Agroecology, Vine grassing, Ontology-Based Data Access, Datalog

1. Introduction

Sustainable agrosystems should not only produce goods but also ecosystem services, such as pollination, nitrogen production for crops, soil fertility preservation, etc. It is widely acknowledged that this requirement involves increasing biodiversity on agricultural plots [1]. As these systems become much more complex, there is a crucial need for tools to help their design [2]. In this paper, we consider the issue of helping to select service plants, i.e., plants associated with crops, according to the ecosystem services they are likely to deliver. We propose to rely on two pillars.
On the one hand, recent research in agronomy makes it possible to associate some measurable characteristics of plant species (called functional traits) with some functions of the agrosystem that contribute to the production of ecosystem services. For instance, several functional traits of the root system of a plant contribute to the function of soil structural stability, which supports the service of maintenance of soil quality [3]. On the other hand, rich data on functional traits has been collected by the international research community in ecology. In particular, the TRY initiative [4, 5] has built a very large dataset providing plant functional trait values measured in a wide range of environmental conditions (www.try-db.org). TRY currently integrates more than 400 datasets and contains experimental observations on 4 million individual plants concerning 2100 different traits and about 160k plant taxa (mostly species). We hypothesized that if we could associate functional traits with functions and services, this database, and possibly others, would allow us to identify species that support these functions and services. To study the feasibility of our approach, we implemented it on the use case of vine grassing, i.e., installing service plants in vineyards [6].

RuleML+RR’22: 16th International Rule Challenge and 6th Doctoral Consortium, September 26–28, 2022, Virtual
enajm@lirmm.fr (E. Najm); baget@lirmm.fr (J. Baget); mugnier@lirmm.fr (M. Mugnier)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Exploiting data on functional traits in the design of agrosystems is indeed a new approach. Existing decision support tools rely on field experiments, farmers’ know-how and workshops between agronomists from various domains.
As this is demanding in both time and budget, these tools are typically restricted to a small set of plant species. Moreover, the decision is a “black box”, in the sense that the computation of a recommendation is hardly explainable. On the other hand, these tools, which are intended for farmers and agricultural consultants, give a very accurate recommendation adapted to a specific cultivation context. As examples, let us cite SIMSERV [7], a tool for the selection of service plants (in a predefined list) to be associated with banana and yam crops, or a tool in agroforestry [8] to select shade tree species in coffee and cocoa agrosystems, based on an inventory of local practices. In contrast, our objective is to support the design activity of researchers and technicians in agroecology, with the aim of “opening the space of possibilities”; in particular, the tool should be able to suggest species that may not have been considered yet, while being able to explain why these species are likely to provide a desired package of services.

To sum up, our starting question was the following: can we exploit currently available data on plant functional traits and combine it with a suitable representation of scientific knowledge on the trait-function-service relationships, to assess the potential contribution of any plant species to some ecosystem service?

In this paper, we first present our system architecture (Section 2), the formal framework (Section 3) and the methodology to acquire expert knowledge from data sources (Section 4). Then, we detail the construction of a knowledge base devoted to the vine grassing case study (Section 5). Finally, we briefly report on the experimental evaluation of the system’s results on this use case and discuss the lessons learnt.

2. System Architecture

To integrate data and knowledge in a principled manner, we decided to rely on the paradigm of ontology-based data access (OBDA) [9, 10].
OBDA systems are structured in three layers: the conceptual level, organized around a domain ontology; the data level, composed of one or several data sources; and mappings from the data level to the conceptual level, which select relevant data and translate it into facts using the ontological vocabulary. Queries to the global system are expressed at the conceptual level.

[Figure 1: Overview of the global architecture. Components shown: at the data level, the sources DB1, ..., DBn and File 1, ..., File 𝑛, and the working database; at the conceptual level, the knowledge base (domain ontology, generic rules, fact base) and expert knowledge. Black arrows depict mappings and the green arrow the formalization of expert knowledge.]

The architecture of our system is outlined in Figure 1. A global working database is obtained by integrating several data sources. This integration step selects and aggregates relevant data, and translates it according to the global database schema, while keeping track of the data source provenance. The conceptual level is made of a knowledge base (KB), which comprises facts and rules. We further distinguish between two kinds of facts: data facts obtained from the working database (e.g., the fact that some functional trait for a given species has some normalized value according to a certain data source) and expert facts obtained from expert knowledge (e.g., the fact that a given ecosystem service is supported by some ecosystem functions, which themselves rely in some way on some functional traits). Expert knowledge is acquired in the form of diagrams, from which facts can be automatically built. As for rules, we distinguish between those defining the domain ontology, which provides the concepts and relations that are meaningful to a user (an expert in agroecology who builds diagrams or an end-user who queries the system), and more complex rules used to process data facts and combine them with expert facts to estimate the contribution of species to ecosystem functions and services.
Note that the latter rules are generic in the sense that they are independent of any specific use case (e.g., vine grassing).

Mappings select and aggregate information from a structure (here, a formatted text file, typically a CSV file, or a database) and translate the resulting information into the vocabulary of another structure (here, a database or a fact base). Importantly, they are specified in a declarative way. Query answering in OBDA usually follows a mediating (aka virtualization) approach, i.e., the fact base remains virtual, and user queries are first reformulated with the ontology, then rewritten with the mappings, to yield queries that are directly evaluated on the data [9, 11, 12, 13]. In contrast, we follow here a materialization approach: the fact base is first materialized by triggering the mappings, then saturated by rule applications; we finally store the part of the saturated fact base that is relevant to an end-user as a relational database, in order to benefit from the whole expressive power of SQL. There are several reasons for the choice of materialization: first, mediation has been mainly developed for simple queries (essentially unions of conjunctive queries), while our user queries are more complex (e.g., they may involve aggregations); second, some features of our KR language (computed functions, default negation) prevent the use of off-the-shelf reformulation techniques; third, most queries of interest require ranking species (e.g., find the k-best species for some service) and materialization is more appropriate to answer such queries efficiently. Finally, the main advantage of virtualization is the independence with respect to the evolution of data sources, yet this does not seem to be an issue in the target applications.

3. Formal Foundations

Regarding the KR language, we did not make any a priori choice.
We started by eliciting expert knowledge to identify the language that allowed us to express domain knowledge in a convenient way, while having a restricted expressivity in order to avoid needlessly costly inferences. Rule-based formalisms were natural candidates since expert knowledge is often expressed in the form of rules. Furthermore, compared to description logics (DL) [14], rules can express complex relationships between entities, whereas DL are essentially restricted to tree-shaped descriptions and binary predicates. Another important feature is the ability to incorporate computed functions into the logical formalism (the term function is used here in the sense of programming, i.e., a function outputs a value given a list of parameters). Such functions make it possible in particular to aggregate values of traits or ecosystem functions, and can be arbitrarily complex. Finally, default negation allows us, for instance, to process missing values or priorities. In the current state of the modeling, our rule language is an extension of plain Datalog with computed functions and stratified default negation [15], as formally defined next. Note that the rules do not involve disjunction in the head, as this feature was not required by the modeling.

3.1. The Rule Language

We consider finite sets of predicates and functional symbols of any arity. Besides standard predicates, there are predefined binary predicates like =, ≠ and <. Each functional symbol is linked to a function defined in a programming language. Constants may be objects or literals. A term may be simple or complex. A simple term is a variable or a constant. A complex term is of the form 𝑓 (𝑡1 , . . . , 𝑡𝑛 ), 𝑛 > 0, where 𝑓 is a functional symbol and each 𝑡𝑖 is a term. An atom is of the form 𝑝(𝑡1 , . . . , 𝑡𝑛 ), where 𝑝 is a predicate of arity 𝑛 and each 𝑡𝑖 is a term. A filter is of the form not 𝑝(𝑡1 , . . .
, 𝑡𝑛 ) (negated atom) or 𝑡1 ⋄ 𝑡2 , where ⋄ is a predefined binary predicate and 𝑡1 , 𝑡2 are variables or literals. Given an atom or filter 𝐴, or a set of these, we denote by terms(𝐴) and vars(𝐴) the terms and variables, respectively, that occur in 𝐴. A fact is an atom whose terms are constants. A query body is a conjunction of atoms on simple terms and filters, such that each variable occurring in a filter of the form 𝑡1 ⋄ 𝑡2 also occurs in an atom. A rule 𝑅 has the form $R = \forall \vec{X}\,(B[\vec{X}] \rightarrow H[\vec{X'}])$, where $B[\vec{X}]$ (the body of 𝑅) is a query body with $\mathrm{vars}(B) = \vec{X}$, and $H[\vec{X'}]$ (the head of 𝑅) is an atom such that $\mathrm{vars}(H) = \vec{X'} \subseteq \mathrm{vars}(B)$. Note that 𝐻 may contain complex terms.

In the examples, we omit quantifiers, ∧ is replaced by a comma, words starting with a capital letter are variables, and function symbols are prefixed by fct:. Figure 2 illustrates facts and rules.

% Facts
hasTraitValue("specific root length","dactylis glomerata",0.72).
hasTraitValue("root length density","dactylis glomerata", 0.38).
hasTraitValue("relative growth rate","dactylis glomerata", 0.54).
ecoSystemFunction("soil exploration and competition with vines").
isLinkedTo("soil exploration and competition with vines","specific root length","root length density","relative growth rate",fct:mean).
% Rule
isLinkedTo(EcoSystFunction,Trait1,Trait2,Trait3,Aggregation),
hasTraitValue(Trait1,Species,V1),
hasTraitValue(Trait2,Species,V2),
hasTraitValue(Trait3,Species,V3)
→ hasValue(EcoSystFunction,Species,fct:aggreg3(Aggregation,V1,V2,V3)).

Figure 2: Five facts and a (positive) rule

The first three facts are data facts specifying a value for some trait and some species (e.g., the first fact says that the trait “specific root length” of species “dactylis glomerata” has value 0.72). Note that the values of traits are normalized and range over the interval [0, 1]. The next two facts are expert facts, specifying that “soil exploration and competition with vines” is an ecosystem function, which is linked to traits “specific root length”, “root length density” and “relative growth rate”, with “mean” as the method of aggregation of these trait values. The rule says that when an ecosystem function EcoSystFunction is linked to traits Trait1, Trait2 and Trait3 with Aggregation as the aggregation method of these trait values, and Trait1, Trait2 and Trait3 respectively have values V1, V2 and V3 for a species Species, then the score of Species for EcoSystFunction is the aggregation of V1, V2 and V3 with method Aggregation. Here, fct:mean denotes a constant (5th fact), while fct:aggreg3 is a functional symbol associated with a computed function whose first parameter is the name of the aggregation method. Note that this is a simplified example: the actual facts and rules have additional arguments to specify the data source and the plant growing conditions for a measured trait value, the weight of the link between a trait and an ecosystem function, as well as the reliability of the aggregated result.

A knowledge base (KB) 𝒦 = (𝐹, ℛ) is composed of a finite set of facts 𝐹 (the fact base) and a finite set of rules ℛ (the rule set). A homomorphism from a query body 𝐵 to a set of facts 𝐹 is a substitution ℎ of vars(𝐵) by terms(𝐹 ) such that (1) for every atom 𝑝(𝑡1 , . . . , 𝑡𝑘 ) ∈ 𝐵, ℎ(𝑝) = 𝑝(ℎ(𝑡1 ), . . . , ℎ(𝑡𝑘 )) is an atom of 𝐹 , and (2) every filter in 𝐵 is evaluated to true in the context of substitution ℎ. In particular, a filter not 𝐴 is evaluated to true in the context of ℎ if there is no homomorphism from ℎ(𝐴) to 𝐹 ; note that when such a filter is evaluated some of its variables may not be instantiated by ℎ. A rule 𝑅 : 𝐵 → 𝐻 is applicable to a fact set 𝐹 if there is a homomorphism ℎ from 𝐵 to 𝐹 . The pair (𝐵, ℎ) is called a trigger for 𝑅 on 𝐹 .
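To make the notions of homomorphism and trigger concrete, the following minimal Python sketch enumerates the triggers of the rule of Figure 2 on the facts of Figure 2 and instantiates the rule head. It is an illustration only, not the prototype's implementation: the tuple encoding of atoms, the "?"-prefixed variable names, the helper names (`homomorphisms`, `apply_rule`) and the rounding to two decimals are assumptions made for this example, and filters (negation, comparisons) are left out.

```python
from statistics import mean

# Facts of Figure 2, encoded as tuples (predicate, arg1, ..., argn).
FACTS = {
    ("hasTraitValue", "specific root length", "dactylis glomerata", 0.72),
    ("hasTraitValue", "root length density", "dactylis glomerata", 0.38),
    ("hasTraitValue", "relative growth rate", "dactylis glomerata", 0.54),
    ("ecoSystemFunction", "soil exploration and competition with vines"),
    ("isLinkedTo", "soil exploration and competition with vines",
     "specific root length", "root length density", "relative growth rate",
     "fct:mean"),
}

# Body of the rule of Figure 2; variables are strings starting with "?".
BODY = [
    ("isLinkedTo", "?F", "?T1", "?T2", "?T3", "?Agg"),
    ("hasTraitValue", "?T1", "?S", "?V1"),
    ("hasTraitValue", "?T2", "?S", "?V2"),
    ("hasTraitValue", "?T3", "?S", "?V3"),
]

# Computed functions: the constant fct:mean names the aggregation method.
AGGREGATIONS = {"fct:mean": mean}


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def _bind(term, constant, h):
    """Bind a body term to a fact constant, consistently with h."""
    if is_var(term):
        return h.setdefault(term, constant) == constant
    return term == constant


def homomorphisms(body, facts, h=None):
    """Yield every substitution mapping all atoms of `body` into `facts`."""
    h = {} if h is None else h
    if not body:
        yield dict(h)
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        extended = dict(h)
        if all(_bind(t, c, extended) for t, c in zip(first[1:], fact[1:])):
            yield from homomorphisms(rest, facts, extended)


def aggreg3(method, v1, v2, v3):
    # Rounding to two decimals is an assumption of this sketch.
    return round(AGGREGATIONS[method]([v1, v2, v3]), 2)


def apply_rule(facts):
    """Return the head atoms produced by all triggers of the rule."""
    return {
        ("hasValue", h["?F"], h["?S"],
         aggreg3(h["?Agg"], h["?V1"], h["?V2"], h["?V3"]))
        for h in homomorphisms(BODY, facts)
    }


derived = apply_rule(FACTS)
```

On these facts there is a single trigger: the isLinkedTo atom fixes the three traits, and each hasTraitValue atom then has exactly one matching fact for the species "dactylis glomerata".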
The application of a rule according to trigger (𝐵, ℎ) produces the atom ℎ(𝐻), obtained by substituting each variable 𝑋𝑖 ∈ 𝐻 by ℎ(𝑋𝑖 ), then evaluating the complex terms; e.g., in Figure 2, the application of the rule to the set of facts produces the fact hasValue("soil exploration and competition with vines", "dactylis glomerata", 0.55), since the mean of 0.72, 0.38 and 0.54 is approximately 0.55. Given a KB 𝒦 = (𝐹, ℛ), a derivation (from 𝐹 ) is a sequence of fact sets (𝐹0 = 𝐹 ), 𝐹1 , . . . , 𝐹𝑛 such that, for all 0 < 𝑖 ≤ 𝑛, there is a rule 𝑅 : 𝐵 → 𝐻 ∈ ℛ and a trigger (𝐵, ℎ) for 𝑅 on 𝐹𝑖−1 such that 𝐹𝑖 = 𝐹𝑖−1 ∪ {ℎ(𝐻)} and ℎ(𝐻) ∉ 𝐹𝑖−1 . A derivation (𝐹0 = 𝐹 ), . . . , 𝐹𝑛 is complete (aka fair) if it cannot be extended; then 𝐹𝑛 is called the saturation of 𝐹 by ℛ.

Finally, we consider stratified rule sets [15], which ensures that each KB has a well-defined semantics, based on its (unique) saturated fact base. A rule set ℛ is stratifiable if there is a surjective mapping 𝜌 from its set of intensional predicates 𝑃 (i.e., predicates occurring in the rule heads) to a set of 𝑚 integers, such that for every rule 𝑅 ∈ ℛ with head predicate 𝑝 and every 𝑞 ∈ 𝑃 that occurs in the body of 𝑅: (1) if 𝑞 occurs in an atom of 𝑅 then 𝜌(𝑞) ≤ 𝜌(𝑝), and (2) if 𝑞 occurs in a negated atom of 𝑅 then 𝜌(𝑞) < 𝜌(𝑝); then a stratification of ℛ is a partition of ℛ into subsets ℛ1 . . . ℛ𝑚 such that each rule with head predicate 𝑝 belongs to the subset ℛ𝜌(𝑝) . A derivation complies with a stratification ℛ1 . . . ℛ𝑚 if, for all 1 ≤ 𝑖 < 𝑗 ≤ 𝑚, any application of a rule from ℛ𝑖 precedes any application of a rule from ℛ𝑗 . It is well-known that, for a stratifiable rule set ℛ and a KB 𝒦 = (𝐹, ℛ), all the complete derivations that comply with a stratification of ℛ lead to the same saturation. Semantically, stratified negation can be seen as a particular case of stable negation [16]. Answers to queries are defined with respect to the saturated fact base.

3.2. Mappings

Our architecture requires manipulating data coming from different data sources, possibly stored in different systems (e.g., relational or NoSQL databases, spreadsheet tables, etc.). To be able to do that, we rely upon mappings: these objects are akin to rules and are composed of a query on some data source 𝑆1 (the body, written in the query language associated with 𝑆1 ) and an insertion query on some other data source 𝑆2 (the head, written in the query language associated with 𝑆2 ). Though the body of a mapping can be written in a fairly expressive language such as SQL, we also have to handle, for instance, the very simple queries on spreadsheet files that only return tuples corresponding to lines in a table. In that case, constraints can be added to the mapping body to ensure that no required value is missing, and more generally, that retrieved data satisfies some integrity conditions. Furthermore, functions in the mapping head allow transforming values from 𝑆1 into values that fulfil the syntactic requirements of 𝑆2 (e.g., type casting). Formally, a mapping from 𝑆1 to 𝑆2 has the following general form:

$Q_1(\vec{X}),\ C_1(\vec{X_1}),\ \ldots,\ C_k(\vec{X_k}) \rightarrow Q_2(f_1(\vec{Y_1}), \ldots, f_p(\vec{Y_p}))$

where 𝑄1 is a query over 𝑆1 ; answers to 𝑄1 are substitutions of $\vec{X}$ by values from 𝑆1 ; the 𝐶𝑖 are constraints on $\vec{X_i} \subseteq \vec{X}$; and 𝑄2 is an insertion query over 𝑆2 , where each $\vec{Y_i} \subseteq \vec{X}$ and each 𝑓𝑖 is a function symbol. Such mappings can be seen as a generalization of the GAV relational mappings classically defined in OBDA to mappings over heterogeneous data sources, hence the addition of constraints to compensate for the lack of expressivity of some query languages.
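A mapping of this general form can be read operationally, as in the following minimal Python sketch. It is not the prototype's code: the source is a small in-memory CSV extract whose column names, rows and error-risk threshold are invented for illustration, the body query simply returns one substitution per line, and the target 𝑆2 is an in-memory SQLite table named `trait_value` (also hypothetical).

```python
import csv
import io
import sqlite3

# Hypothetical source S1: a CSV extract (columns and rows are made up).
S1 = io.StringIO(
    "species,trait_id,value,error_risk\n"
    "Dactylis glomerata,77,0.54,1.2\n"
    "Lolium perenne,77,,0.8\n"        # required value is missing
    "Trifolium repens,77,0.61,9.5\n"  # error risk above the threshold
)


def q1(source):
    """Body query Q1: one substitution (a dict) per line of the table."""
    return list(csv.DictReader(source))


# Constraints C1..Ck on the answers (the 4.0 threshold is an assumption).
CONSTRAINTS = [
    lambda row: row["value"] != "",
    lambda row: float(row["error_risk"]) < 4.0,
]

# Head functions f1..fp: normalize names and cast types for S2.
HEAD_FUNCTIONS = [
    lambda row: row["species"].lower(),
    lambda row: int(row["trait_id"]),
    lambda row: float(row["value"]),
]


def apply_mapping(source, conn):
    """Insert into S2 one tuple per answer that satisfies all constraints."""
    for sigma in q1(source):
        if all(constraint(sigma) for constraint in CONSTRAINTS):
            values = tuple(f(sigma) for f in HEAD_FUNCTIONS)
            # Insertion query Q2 over S2.
            conn.execute("INSERT INTO trait_value VALUES (?, ?, ?)", values)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trait_value (species TEXT, trait_id INTEGER, value REAL)"
)
apply_mapping(S1, conn)
rows = conn.execute("SELECT * FROM trait_value").fetchall()
```

Only the first line of the extract passes both constraints, so a single tuple reaches the target table. Keeping constraints and head functions as plain lists mirrors the declarative reading of a mapping: changing what is filtered or how values are cast does not touch the application procedure.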
The application of a mapping 𝑚 is as follows: an answer to the body of 𝑚 is a substitution 𝜎 that is an answer to 𝑄1 such that $C_i(\sigma(\vec{X_i}))$ evaluates to true for all 1 ≤ 𝑖 ≤ 𝑘; given an answer 𝜎 to the body of 𝑚, each $f_i(\sigma(\vec{Y_i}))$ (1 ≤ 𝑖 ≤ 𝑝) is evaluated, which yields a value 𝑣𝑖 , and the tuple (𝑣𝑖 )1≤𝑖≤𝑝 is inserted into 𝑆2 by the insertion query 𝑄2 . Given a set of mappings from 𝑆1 to 𝑆2 , the construction of 𝑆2 is obtained by performing all the applications of these mappings on 𝑆1 .

[Figure 3: Traits-Functions-Service for Nitrogen supply to the vine. Green and red arrows indicate positive and negative impact, respectively. The diagram links the service “Nitrogen supply to the vine” (aggregation: mean) to three functions, “Mineralization of organic matter”, “Soil exploration and competition with vines” and “Symbiotic fixation of atmospheric nitrogen” (each with aggregation: mean), which are in turn linked to traits annotated with their TRY and DB2 trait IDs: C/N ratio of the plant, shoot, leaf, litter; Specific leaf area; Specific root length; Root length density; Relative growth rate; Dry mass of plants, shoots, leaves; Dry matter content of leaves; Nitrogen content of shoots and leaves; Nitrogen fixation capacity.]

In our current prototype, we rely upon data sources available in spreadsheet tables. Cleaning mappings select and transform data from each source to store it into a relational database. In those mappings, constraints are used to discard irrelevant, doubtful or unusable lines, and functions are used for normalization purposes. Then, database mappings from the obtained databases lead to a single relational database (the working database): those mappings are used to select data relevant to a specific use case (here, vine grassing) and to aggregate values. Then, data-to-knowledge mappings normalize and aggregate values from the working database to produce the higher-level facts (so-called data facts).
Rules are applied to saturate the fact base, which also contains so-called expert facts. Finally, storage mappings from the saturated fact base to a relational database select the part available for querying, while benefiting from the expressive power of SQL.

4. Acquisition of Expert Knowledge

In this section, we present the acquisition of expert knowledge by means of diagrams, as well as the construction of correspondences between traits identified by experts and database trait IDs.

Expert diagrams. The experts provide diagrams structured in 3 levels: traits, functions and services, with links between elements of successive levels. Figure 3 depicts the diagram describing the service nitrogen supply to the vine.1 A green arrow denotes a positive impact of the trait (resp. function) value for the function (resp. service) rendered. A red arrow denotes a negative impact. To define services relevant to agricultural ecosystems, the experts relied on the reference study EFESE.2 Links from functional traits to ecosystem functions are based on scientific papers as well as grey literature. The experts were asked to specify methods of aggregation to pass from traits to functions (resp. from functions to services). Since no precise criteria could be derived from state-of-the-art domain knowledge, they decided to consider the mean of the normalized trait values, effectively giving the same importance to all traits (resp. functions).

1 Diagrams for two other services can be found in file Appendix.pdf at https://gitlab.inria.fr/enajm/related-doc
2 EFESE is a French national initiative to assess ecosystems and ecosystem services: https://www.inrae.fr/en/news/assessing-services-provided-agricultural-ecosystems-improve-their-management.

Correspondences with databases. The next step consists of associating each trait in a diagram with one or several trait IDs in available databases.
In its current state, the prototype is mainly based on the large plant trait database TRY presented in the introduction. It also uses a second database built by a French study on the interests of intermediate crops,3 which gives records for 58 species according to 7 traits. Although it is very small, this database has the advantage of being devoted to herbaceous species (relevant to vine grassing), and most trait values it provides for these species are missing from TRY. It allows us to implement the principle of a multi-database setting. In the following, this database is called DB2 as it is still confidential. Both databases use the same standard names for plant species, defined in The Plant List4 [17] in relationship with the Taxonomic Name Resolution Service5 [18].

Hence, each trait in a diagram is associated with one or several trait IDs in the databases TRY and DB2. As shown in Figure 3, some traits are associated with a single database trait ID, like “Relative growth rate” associated with ID 77 in TRY, or with a single ID in each database, like “Dry matter content of leaves” associated with ID 47 in TRY and ID 2 in DB2. However, most traits have more than one match in TRY. The reason is that traits can be measured according to different techniques, which are reported in TRY. Hence, the same trait for the same observed individual plant has different values according to the measurement techniques. Since we normalize trait values (on a [0,1] interval), it remains relevant to aggregate different values of a trait for a species, regardless of the measurement technique. Trait IDs associated with the same expert trait are called exchangeable. As there is no universally preferred measurement technique, they are very useful to mitigate the impact of missing values.

Finally, our modeling includes preferences between databases: there is a global default order, which can be overridden for specific traits (here, TRY is globally preferred to DB2).
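The interplay between exchangeable trait IDs and database preferences can be sketched as follows in Python. This is a hedged illustration rather than the system's actual rules: the trait IDs are those quoted in the text and in Figure 3, but the numeric values, the second species, the per-trait override and the use of `max` to aggregate exchangeable IDs are assumptions chosen for the example.

```python
# Global default order: TRY is preferred to DB2 (as stated in the text).
DEFAULT_ORDER = ["try", "db2"]

# Hypothetical per-trait override of the default order.
TRAIT_ORDER = {"dry matter content of leaves": ["db2", "try"]}

# Exchangeable trait IDs per (expert trait, source).
TRAIT_IDS = {
    ("nitrogen content of shoots and leaves", "try"): ["339", "408", "502", "1126"],
    ("nitrogen content of shoots and leaves", "db2"): ["5"],
    ("dry matter content of leaves", "try"): ["47"],
    ("dry matter content of leaves", "db2"): ["2"],
}

# Normalized values per (source, trait ID, species); numbers are made up.
VALUES = {
    ("try", "339", "dactylis glomerata"): 0.40,
    ("try", "502", "dactylis glomerata"): 0.46,
    ("db2", "5", "dactylis glomerata"): 0.70,
    ("db2", "5", "vicia sativa"): 0.81,
    ("try", "47", "dactylis glomerata"): 0.30,
    ("db2", "2", "dactylis glomerata"): 0.35,
}


def trait_value(trait, species, aggregate=max):
    """Return (source, value) from the highest-priority source that has
    at least one value among the exchangeable IDs, or None if none has."""
    for source in TRAIT_ORDER.get(trait, DEFAULT_ORDER):
        found = [
            VALUES[(source, trait_id, species)]
            for trait_id in TRAIT_IDS.get((trait, source), [])
            if (source, trait_id, species) in VALUES
        ]
        if found:
            return source, aggregate(found)
    return None


n_dactylis = trait_value("nitrogen content of shoots and leaves", "dactylis glomerata")
n_vicia = trait_value("nitrogen content of shoots and leaves", "vicia sativa")
dmc_dactylis = trait_value("dry matter content of leaves", "dactylis glomerata")
```

In this toy setting, the nitrogen content of "dactylis glomerata" comes from TRY even though DB2 holds a larger value, "vicia sativa" falls back to DB2 because TRY has no value for it, and the override makes DB2 win for dry matter content.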
These preferences make it possible to give a higher priority to a more reliable source. The value of a trait for a species is given by the highest-priority data source that provides one.

To summarize, our methodology for the acquisition of expert knowledge consists of the following main steps:
1. Build traits-functions-service diagrams, based on scientific sources and the grey literature.
2. Identify relevant databases and associate the diagrams’ traits with relevant IDs in these databases, together with the choice of an aggregation technique.
3. Define priorities among databases, globally and possibly for specific traits.
Going back and forth between steps 1 and 2 is necessary, as traits in the diagrams have to find counterparts in the databases.

Formalization. Expert knowledge is formalized into two types of knowledge:
• Generic rules handling the passage from database values to functional trait, function and service values for species. These rules are generic in the sense that they do not consider specific traits, functions or services, nor specific aggregation methods, hence they are not tied to specific diagrams, nor to specific data sources.
• Expert facts that describe specific diagrams, including their links to database trait IDs.

3 https://methode-merci.fr
4 http://www.theplantlist.org.
5 https://tnrs.biendata.org/.

Importantly, rules are independent of the use case, hence the evolution of diagrams, including the introduction of new diagrams, only requires regenerating the expert facts. Whereas rule-based expert systems often rely on carefully crafted rules that are tailored for a specific use case, our rules could in principle be applicable to any use case that follows the trait-function-service approach.

5. The Knowledge Base

The domain ontology provides the vocabulary meaningful to a user and is described by simple rules expressing concept subsumption and relation signatures.
Additional predicates are used at intermediate steps in the computation of trait, function and service values.

5.1. From Expert Diagrams to (Expert) Facts

Each entity of a diagram is typed by the corresponding concept (a unary predicate) and relationships between traits and functions (resp. functions and services) are captured by predicates that may have a high arity. E.g., the upper part of the diagram from Figure 3 (from the functions to the service) is described as follows:

biophysicalRegulationService("nitrogen supply to the vine").
function("mineralization of organic matter").
function("soil exploration and competition with vines").
function("symbiotic fixation of atmospheric nitrogen").
isLinkedTo3Functions("nitrogen supply to the vine",
    "mineralization of organic matter",1,
    "soil exploration and competition with vines",-1,
    "symbiotic fixation of atmospheric nitrogen",1,
    fct:mean).

The last fact, with predicate isLinkedTo3Functions, links the service to three functions, each followed by a number weighting its participation in the service (here, 1 for a positive participation, and -1 for a negative one), and specifies the aggregation method (here, mean). Other facts translate correspondences between traits and data IDs. E.g., the following facts express that the trait “nitrogen content of shoots and leaves” has 4 matches in TRY (ID: 339, 408, 502 and 1126) and one match in DB2 (ID 5); furthermore, the aggregation of these exchangeable IDs is made by taking their max value (from the highest-priority data source).

numberOfMatches("nitrogen content of shoots and leaves",4,try,max).
numberOfMatches("nitrogen content of shoots and leaves",1,db2,max).
hasTraitID("nitrogen content of shoots and leaves","339",try).
hasTraitID("nitrogen content of shoots and leaves","408",try).
hasTraitID("nitrogen content of shoots and leaves","502",try).
hasTraitID("nitrogen content of shoots and leaves","1126",try).
hasTraitID("nitrogen content of shoots and leaves","5",db2).

5.2. From Data to (Data) Facts

Using cleaning mappings. TRY data is available on request for specific traits (we asked for 52 traits, which yielded about 151k species) and comes as a formatted text file (similar to CSV). This file contains so-called observations (about 1.6M); each observation (identified by an ID) corresponds to measurements of a trait (identified by an ID) on an individual plant (associated with a species or genus ID). An observation is described by several lines, which provide measurements, as well as contextual information about the observation, original dataset provenance and an estimation of the (un)reliability of the values (called error risk). In practice, the content of the cells is very heterogeneous in terms of units of measure, values taken by a non-numeric field, kind of contextual information, etc. To illustrate, the trait Plant Life Span takes string values among ["Bisannual", "Annual", "Biennial", "Perennial"] but one also finds a lot of other values like "perennial < 20 years", "biasannual", "pere", "nope", "from few decades to more than 60 years", "1", "2", "3", "winter annual", "shrub", "woody", etc. Hence, a cleaning step is absolutely necessary before this rich source of information can be exploited in an automated way. This step, mainly based on string search, discards irrelevant, doubtful (cf. error risk) or unusable information, and transforms values for the retained fields. We also observed that the growing conditions of the observed plants may lead to very different trait values, hence we made the distinction between natural conditions and experimental conditions. At the end of this step, each tuple (observation ID, species ID, trait ID, growing conditions) occurring in the obtained database has a single trait value expressed in a standardized unit (i.e., we chose a unit of measure for each trait).
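As an illustration of string-search-based cleaning, the following Python sketch maps raw Plant Life Span strings to one of the four expected values, or discards them. The matching policy (whole-word search, keep a value only when exactly one expected label is found) is an assumption made for this example; the text only states that cleaning is mainly based on string search, so the actual decisions of the prototype may differ.

```python
import re

# Expected values for the trait, as listed in the text.
EXPECTED = ["Bisannual", "Annual", "Biennial", "Perennial"]

# One whole-word, case-insensitive pattern per expected value.
PATTERNS = [
    (re.compile(r"\b" + label.lower() + r"\b"), label) for label in EXPECTED
]


def clean_life_span(raw):
    """Return the expected label found in `raw`, or None to discard it.

    Hypothetical policy: a value is kept only if exactly one expected
    label occurs in it as a whole word.
    """
    text = raw.strip().lower()
    matches = [label for pattern, label in PATTERNS if pattern.search(text)]
    return matches[0] if len(matches) == 1 else None


# Some of the raw values quoted in the text.
RAW = [
    "Perennial", "perennial < 20 years", "biasannual", "pere", "nope",
    "from few decades to more than 60 years", "1", "winter annual", "shrub",
]

cleaned = {raw: clean_life_span(raw) for raw in RAW}
kept = {raw: label for raw, label in cleaned.items() if label is not None}
```

With this policy, "perennial < 20 years" and "winter annual" are retained under the labels "Perennial" and "Annual", whereas misspellings such as "biasannual" and uninformative values such as "pere", "nope" or "1" are discarded. A stricter or more lenient policy (e.g., correcting known misspellings) would only require changing the pattern table.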
Note that the trait (ID) occurring in an observation actually has several associated “subtraits”, whose values need to be aggregated, but for the sake of simplicity we do not detail this aspect in this paper.

Although cleaning could have covered the whole data and been independent of the specific vine grassing use case, we performed some selective cleaning for time reasons. In particular, only herbaceous species are relevant for the use case, although this is not a well-defined category from an ecology viewpoint; we constructed this category from specific values of trait IDs, in order to distinguish herbaceous species from the others. Finer categories could be built for other use cases. The data source DB2 considers solely herbaceous species in natural growing conditions and required no cleaning.

Using database mappings. The database mappings build the working database by retrieving only herbaceous species (about half of the species in TRY) and aggregating all the trait values coming from different observations for a given tuple (species ID, trait ID, growing conditions, source database). Hence, at the end of this step, the working database contains a single trait value for each tuple (species ID, trait ID, growing conditions, source database). Some additional queries compute views, in order to simplify the expression of data-to-knowledge mappings. In particular, for each trait ID, one computes its minimal and maximal values among all the values it takes in the working database with the same growing conditions.

Using data-to-knowledge mappings. The computation of species’ scores for ecosystem functions and services, which is performed at the conceptual level, requires aggregating values of different traits. To do so, we turn each value associated with a trait ID into a normalized

(R1): numberOfMatches(Trait,2,DB,Aggregation), hasTraitID(Trait,TraitID1,DB), hasTraitID(Trait,TraitID2,DB), TraitID1 𝑠𝑗 means that 𝑠𝑖 is deemed better than 𝑠𝑗 for the service.
To aggregate these different orders, we built a directed graph whose nodes are the species 𝑠𝑖, with an edge (𝑠𝑖, 𝑠𝑗) whenever at least one document states that 𝑠𝑖 > 𝑠𝑗. We found that the results in the literature were remarkably consistent, as the graph was circuit-free. Hence, the graph provided a partial order on the whole set of species. We then considered all the total orders compatible with this partial order, and assigned to each species a score equal to the average of its ranks in these total orders. Finally, we studied the correlation between the tool and expert scores, using the Pearson correlation coefficient 𝑟. In our case, a perfect correlation is reflected by 𝑟 = −1 (as the tool and the expert rank species in opposite orders), a perfect inverse correlation by 𝑟 = 1, and an absence of correlation by 𝑟 = 0. Globally, we found a good correlation (𝑟 = −0.67). Interestingly, the reliability of the tool's score appeared to be a crucial parameter: the value of 𝑟 drops to −0.74 (i.e., the correlation gets stronger) when we consider only species whose computed service value has a reliability of at least 50%, and to −0.81 for a reliability of at least 60% (i.e., for these species, at least 6 of the 9 traits actually have a value given by the predicate hasTraitValue). This highlights the crucial issue of missing values in the data sources.
7 https://gitlab.inria.fr/rules/graal-v2
8 We give here a brief outline of the evaluation; for more detail, see the file Appendix.pdf at https://gitlab.inria.fr/enajm/related-doc
6.2. Concluding Remarks
In this paper, we studied the feasibility of a new approach for the selection of plant species in agriculture according to ecosystem services. This approach exploits on the one hand scientific results on the relationships between functional traits, ecosystem functions and services, and on the other hand data collected by the research community in ecology.
Our modeling relies on generic rules, which allows for a smooth evolution of expert knowledge: expert facts can be automatically generated from the diagrams, without impact on the rules. To put this approach into practice, we first had to undertake significant efforts to clean the data from TRY (which integrates almost all datasets about functional traits). This was a mandatory step before any automated exploitation. The evaluation carried out with agronomists confirms that very satisfactory results can be obtained as long as the proportion of missing values is not too high. As the TRY initiative is developing, in both volume and standardization of the data, and there is a growing effort to produce trait data in agronomy, the approach is indeed promising. We expect the accuracy of the results to increase if contextual information (like climate and soil in particular) is taken into account. We did not exploit this kind of information because it would have further reduced the available data and worsened the problem of missing values. As future work, we plan to also consider agricultural practices, as they are decisive for the realisation or neutralisation of a potential ecosystem function. Finally, on a more practical side, our agenda includes a dedicated user interface, with explanation of query results in close relationship with expert diagrams.
Acknowledgments
We thank the experts in agronomy and agroecology who took part in this study: Christian Gary, Raphaël Métral, Léo Garcia and Aurélie Métay. We are also grateful to Sébastien Minette for providing us with the second traits database. This work was supported by a governmental grant managed by the Agence Nationale de la Recherche (ANR) within the framework of the "Investissements d’Avenir" program under the reference ANR-16-CONV-0004 (#DigitAg).
References
[1] M. Duru, O. Therond, G. Martin, R. Martin-Clouaire, M.-A. Magne, E. Justes, E.-P. Journet, J.-N. Aubertot, S. Savary, J.-E. Bergez, J.-P.
Sarthou, How to implement biodiversity-based agriculture to enhance ecosystem services: a review, Agronomy for Sustainable Development 35 (2015). doi:10.1007/s13593-015-0306-1.
[2] F. Lescourret, T. Dutoit, F. Rey, F. Côte, M. Hamelin, E. Lichtfouse, Agroecological engineering, Agronomy for Sustainable Development 35 (2015) 1191–1198. doi:10.1007/s13593-015-0335-9.
[3] L. Garcia, G. Damour, C. Gary, S. Follain, Y. Le Bissonnais, A. Metay, Trait-based approach for agroecology: contribution of service crop root traits to explain soil aggregate stability in vineyards, Plant and Soil 435 (2019). doi:10.1007/s11104-018-3874-4.
[4] J. Kattge, G. Bönisch, S. Díaz, S. Lavorel, I. Prentice, P. Leadley, S. Tautenhahn, G. Werner, T. Aakala, M. Abedi, A. Acosta, et al., TRY plant trait database – enhanced coverage and open access, Global Change Biology 26 (2020) 119–188. doi:10.1111/gcb.14904.
[5] J. Kattge, S. Díaz, S. Lavorel, I. C. Prentice, P. Leadley, G. Bönisch, E. Garnier, M. Westoby, P. B. Reich, I. J. Wright, et al., TRY – a global database of plant traits, Global Change Biology 17 (2011) 2905–2935.
[6] L. Garcia, F. Celette, C. Gary, A. Ripoche, H. Valdés-Gómez, A. Metay, Management of service crops for the provision of ecosystem services in vineyards: A review, Agriculture, Ecosystems & Environment 251 (2018) 158–170. URL: https://www.sciencedirect.com/science/article/pii/S0167880917304309. doi:10.1016/j.agee.2017.09.030.
[7] H. Ozier-Lafontaine, J.-M. Blazy, M. Publicol, C. Melfort, SIMSERV - Expert system of selection assistance of service plants, 2010. URL: https://hal.inrae.fr/hal-02820845.
[8] J. Van der Wolf, L. Jassogne, G. Gram, P. Vaast, Turning local knowledge on agroforestry into an online decision-support tool for tree selection in smallholders’ farms, Experimental Agriculture 55 (2019) 50–66.
[9] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, R. Rosati, Linking data to ontologies, in: S.
Spaccapietra (Ed.), Journal on Data Semantics X, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 133–173.
[10] G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, M. Zakharyaschev, Ontology-based data access: A survey, in: IJCAI, ijcai.org, 2018, pp. 5511–5519.
[11] D. Calvanese, B. Cogrel, E. G. Kalayci, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-Muro, G. Xiao, OBDA with the Ontop framework, in: SEBD, Citeseer, 2015, pp. 296–303.
[12] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodriguez-Muro, R. Rosati, M. Ruzzi, D. F. Savo, The Mastro system for ontology-based data access, Semantic Web 2 (2011) 43–53.
[13] M. Buron, F. Goasdoué, I. Manolescu, M. Mugnier, Ontology-based RDF integration of heterogeneous data, in: Proceedings of EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, 2020, pp. 299–310.
[14] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider (Eds.), The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press, 2003.
[15] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison Wesley, 1994.
[16] V. Lifschitz, Answer set programming, Springer Heidelberg, 2019.
[17] J. M. Kalwij, Review of ‘The Plant List, a working list of all plant species’, Journal of Vegetation Science 23 (2012) 998–1002.
[18] B. Boyle, N. Hopkins, Z. Lu, J. A. Raygoza Garay, D. Mozzherin, T. Rees, N. Matasci, M. L. Narro, W. H. Piel, S. J. Mckay, et al., The taxonomic name resolution service: an online tool for automated standardization of plant names, BMC Bioinformatics 14 (2013) 1–15.
[19] J. Baget, M. Leclère, M. Mugnier, S. Rocher, C. Sipieter, Graal: A toolkit for query answering with existential rules, in: N. Bassiliades, G. Gottlob, F. Sadri, A. Paschke, D. Roman (Eds.), Proceedings of RuleML 2015, Berlin, Germany, 2015, volume 9202 of Lecture Notes in Computer Science, Springer, 2015, pp. 328–344.