Global Agricultural Concept Scheme: A Hub for Agricultural Vocabularies

Tom Baker
Independent FAO consultant
Bonn, Germany

Caterina Caracciolo
Food and Agriculture Organization of the UN (FAO)
Rome, Italy

Elizabeth Arnaud
Bioversity International
Montpellier, France

Abstract: Thesauri are used to tag semi-structured documents and texts, while more complex semantic structures are used to describe (annotate) scientific data. We are creating a Global Agricultural Concept Scheme (GACS) by mapping AGROVOC, CABT and NALT, three major thesauri in the area of food and agriculture, with a beta release in May 2016. We see GACS as a hub linking user-oriented thesauri with semantically more precise domain ontologies linking, in turn, to datasets about food and agriculture, in order to make that data more interoperable and reusable.

Keywords: thesauri, ontologies, food, agriculture, GACS, AGROVOC, CABT, NALT, Crop Ontology

I. GLOBAL AGRICULTURAL CONCEPT SCHEME

The Food and Agriculture Organization of the United Nations (FAO), CAB International (CABI), and the National Agricultural Library of the USDA (NAL) have long maintained separate thesauri about agriculture, food and related topics (the AGROVOC Concept Scheme¹, CAB Thesaurus², and NAL Thesaurus³) for use in indexing their respective bibliographic databases: AGRIS (8 million records), CAB Abstracts (8.3 million), and Agricola (5.2 million). The thesauri provide globally identified concepts for use in automated indexing and retrieval, subject description, natural language processing, and translation. Having previously collaborated on mappings and common classifications, the three organizations resolved in 2013 to explore the feasibility of pooling their most frequently used concepts into a jointly maintained Global Agricultural Concept Scheme (GACS). GACS was seen as the first step towards improving the coherence and interoperability of agricultural data, a vision explored in a July 2015 workshop on “Agrisemantics”⁴ with support from the Gates Foundation, elaborated in the Chania Declaration⁵ of May 2016, and pursued by an Agrisemantics Working Group that is forming within the Research Data Alliance initiative.

GACS Core Beta 3.1⁶, soft-launched at the Open Harvest workshop of May 2016, provides 15,000 concepts formed by mapping and merging the most frequently used concepts from the three source thesauri. GACS Core concepts are labeled in multiple languages, with some in more than twenty-five languages. The soft launch opened a period of testing and feedback in preparation for the next phase of its development, which will begin around October 2016. GACS Core Beta 3.1 presents a set of concepts that is considered to be fairly stable, with URIs that are not expected to change (see an example of a concept in GACS in Fig. 1). Problems resulting from the integration process, such as overlapping labels, have been substantially fixed, though much detailed work remains to be done, notably the specification of a common hierarchical structure. During this test phase, implementers are encouraged to use GACS on an experimental basis and provide feedback.

Fig. 1 A concept in GACS

In the next phase of development, the scope of GACS will be broadened beyond the core. Concepts from some of the source thesauri that were not included in GACS Core may be given an id.agrisemantics.org URI in a GACS Extension to be maintained by their original owners or, optionally, in collaboration. The notion of a GACS Module anticipates a longer-term need to devolve maintenance of distinct types of concepts, such as organisms or geographical names, to communities of experts.

¹ http://aims.fao.org/agrovoc
² http://www.cabi.org/cabthesaurus/
³ http://agclass.nal.usda.gov/
⁴ http://aims.fao.org/sites/default/files/Report_workshop_Agrisemantics.pdf
⁵ http://blog.agroknow.com/?p=5067
⁶ http://agrisemantics.org/gacs

II. SEMANTIC ASSETS FOR FOOD AND AGRICULTURE

Information relevant to food and agriculture encompasses data collected on factors ranging from yield and climate to demographics and markets. Information is presented in forms ranging from narrative texts (policy, technical, and scientific documents) through structured datasets (empirical data). Information may be graphically visualized, e.g., plotted onto timelines or maps, or plugged into models for nowcasting or for forecasting trends. All types of data, from the analytical to the empirical, are required for achieving sustainable food systems.

Thesauri provide concepts for indicating the overall topic of information resources, usually semi-structured texts such as bibliographic abstracts and journal articles, but also videos and courseware. Empirical data is composed of data elements with precise definitions at defined levels of granularity. Datasets are typically serialized in formats specific to a particular software application, and their individual data elements are named within the context of that particular application. Interoperability across datasets is hampered by the sheer effort required to determine equivalences among differently named elements, then to extract sets of comparable elements from a diversity of applications and formats. Ontologies, focused sets of related concepts specified with precise definitions and global identifiers, are increasingly used to “annotate” data. However, ontologies too may embody ad-hoc semantics to different degrees, and are usually totally disconnected from the world of thesauri, which prevents seamless access to “hard” and “soft” data alike.

III. LINKING THESAURI TO DATA VIA ONTOLOGIES

The more fuzzily defined, globally identified concepts of general-purpose, search-oriented thesauri and concept schemes, such as GACS, may be mapped to the more precisely defined, globally identified, domain-specific, application-oriented ontologies and, from there, to locally defined data elements embedded in software-specific databases. An unbroken chain may be formed linking the most general concepts to the most specific data elements. Semantic authority control for data elements facilitates the re-use of datasets, and links from precise ontologies to search-oriented concepts facilitate the discovery of those datasets.

One path to data interoperability is to use appropriately defined ontologies, i.e., ontologies that not only enable the extraction of data from a database (a process often called “data annotation”), but that can also situate data within the appropriate “context”: a modeled set of data about the time and place of its collection along with any additional elements required for its correct interpretation. Another path is to place those ontologies in a network with other semantic assets, including the thesauri and concept schemes used to express the “topicality” of information resources. Such an integration of semantic assets may support, for example, an analysis of the yield gap in sub-Saharan African countries by providing well-connected data elements across a diversity of crop-related datasets from databases and repositories, along with multi-media information and relevant literature from the main bibliographic databases, such as those of AGRIS, CABI and NAL, with the goal of improving food security.

The Agrisemantics vision points in two directions. On the one hand, we aim to turn GACS into a more extensive network of thesauri and concept schemes to ensure appropriate coverage of our domain of interest. In particular, we are going to test the notion of a GACS Extension using AGROVOC as an example. On the other hand, we aim to establish tools and methodologies to connect GACS and its constellation of “extensions” to multiple domain-specific ontologies.

The first ontology we will be working with is the Crop Ontology [1], which supports data comparison and interpretation at a higher granularity by providing a means for annotating data elements with trait, measurement method, and unit or scale (see Fig. 2).

Fig. 2 Mapping from thesaurus to ontology

More specifically, a wheat data element labeled with the code “GW” in a phenotype dataset can be mapped to the general concept “grain weight” as defined, and given global identity (URI), in the CGIAR Crop Ontology⁷. The CO term ‘Grain Weight’ can, in turn, be mapped to ‘Grain’ in AGROVOC and GACS. A query system using this mapping can then discover more information, returning not only datasets related to grain weight but also references to published papers in which grain weight was studied.

⁷ http://www.cropontology.org

ACKNOWLEDGMENTS

Special thanks to the GACS Working Group: Tom Baker, Caterina Caracciolo, Anton Doroszenko, Lori Finch, Sujata Suri, and Osma Suominen.

REFERENCES

[1] Shrestha R., Matteis L., Skofic M., Portugal A., McLaren G., Hyman G., Arnaud E.: 2012. Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice. Frontiers in Physiology, vol. 3
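
As an illustrative aside to Section III, the chain from a local data element through an ontology term to a thesaurus concept could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the GACS or Crop Ontology implementation: the URIs are placeholders under example.org rather than real identifiers, and the dataset names, paper identifiers, and the `discover` function are hypothetical names introduced only for this example. Only the code “GW”, the term ‘Grain Weight’, and the concept ‘Grain’ come from the text.

```python
"""Sketch: following a local data-element code up to a thesaurus concept."""

# Hypothetical placeholder URIs: NOT real Crop Ontology or GACS identifiers.
CO_GRAIN_WEIGHT = "http://example.org/crop-ontology/grain-weight"
GACS_GRAIN = "http://example.org/gacs/grain"

# Link 1: locally defined data element (dataset name, code) -> ontology term.
ELEMENT_TO_TERM = {
    ("wheat_phenotypes", "GW"): CO_GRAIN_WEIGHT,
}

# Link 2: precisely defined ontology term -> search-oriented thesaurus concept.
TERM_TO_CONCEPT = {
    CO_GRAIN_WEIGHT: GACS_GRAIN,
}

# Hypothetical indexes: datasets annotated with a term, papers indexed with a concept.
DATASETS_BY_TERM = {
    CO_GRAIN_WEIGHT: ["wheat_phenotypes", "wheat_trials"],
}
PAPERS_BY_CONCEPT = {
    GACS_GRAIN: ["paper-A", "paper-B"],
}


def discover(dataset, code):
    """Follow the chain from a data-element code to related datasets and papers."""
    term = ELEMENT_TO_TERM.get((dataset, code))
    concept = TERM_TO_CONCEPT.get(term)
    return {
        "term": term,
        "concept": concept,
        "datasets": DATASETS_BY_TERM.get(term, []),
        "papers": PAPERS_BY_CONCEPT.get(concept, []),
    }


if __name__ == "__main__":
    for key, value in discover("wheat_phenotypes", "GW").items():
        print(key, "->", value)
```

Keeping the two links in separate tables mirrors the division of labour described in Section III: the element-to-term link would be maintained alongside each dataset, while the term-to-concept link would be maintained with the ontology and the concept scheme. In practice such links would more plausibly be published as RDF mappings than held in in-memory dictionaries.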