Data-driven Update of AGROVOC Using Agricultural Text Corpora

Hercules Panoutsopoulos 1 and Christopher Brewster 1,2

1 Maastricht University, Institute of Data Science, Paul-Henri Spaaklaan 1 (PHS1), Maastricht, 6229 EN, The Netherlands
2 TNO, Data Science Group, Kampweg 55, Soesterberg, 3769 DE, The Netherlands

Abstract
AGROVOC is a well-known multilingual controlled vocabulary covering the fields of agriculture, forestry, fisheries, and food. It is used for dataset annotation, indexing of literature, and automated text tagging, and its effective use depends on its continuous update. Currently, updates are done manually by a dispersed community of editors. In this paper, we present work towards automated update recommendations using large corpora of agricultural text (such as the AGRIS database). The work is based on extracting agricultural concept mentions from text with custom-trained Named Entity Recognition models and on using Graph Neural Networks to recommend concept and relation additions, with a view to predicting future AGROVOC states. The research questions and methodology are presented together with the results of an initial experiment. The next steps and future research directions are outlined. This work forms part of PhD research on monitoring and predicting changes in knowledge graphs utilising textual data.

Keywords
AGROVOC, knowledge graph, update, Named Entity Recognition, Graph Neural Networks

1. Motivation
AGROVOC is a multilingual, structured vocabulary of more than 40,000 agricultural concepts, concept definitions and relations, and concept labels. It is structured as a directed acyclic graph using the SKOS standard (see footnote 2) and represents associations between concepts by means of hierarchical and non-hierarchical relations. Utilising semantic web technology standards, AGROVOC provides knowledge organisation affordances enabling data retrieval.
It allows standardised indexing via the unambiguous identification of resources, thus making search operations more efficient [1].

AGROVOC is curated by FAO experts in collaboration with editors from affiliated organisations. However, the pace at which new information and data become available through various kinds of publications makes it challenging to keep the vocabulary up to date. Advances in Natural Language Processing and Machine Learning hold the promise of providing technological support for the manual work involved in AGROVOC’s maintenance and curation. In this context, the aim of this paper is to present work on the provision of automated recommendations for AGROVOC updates based on agricultural text corpora (such as the abstracts in the AGRIS database; see footnote 3). The goal is to identify concepts that are present in text but absent from AGROVOC and to recommend them for addition to an updated vocabulary version. Such recommendations include identifying where in the graph the new concepts should be added and specifying their links to existing concepts. This work will eventually lead to methods for predicting future AGROVOC states based on the computation of diachronic changes.

Proceedings of HAICTA 2022, September 22–25, 2022, Athens, Greece
EMAIL: herculespanoutsopoulos@gmail.com (A. 1); christopher.brewster@maastrichtuniversity.nl (A. 2)
ORCID: 0000-0002-8060-9750 (A. 1); 0000-0001-6594-9178 (A. 2)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2 https://www.w3.org/2004/02/skos/
3 https://agris.fao.org/agris-search/index.do

2. Background and Related Work
There is a growing body of research on the development of knowledge graphs utilising unstructured or structured data sources (cf. [2] for a review of literature on automated knowledge graph construction). However, less research has been undertaken on automated knowledge graph update [3].
A method combining Relational Graph Convolutional Networks (R-GCNs) [4], which capture an entity’s context in a graph, with bi-directional Gated Recurrent Units (bi-GRUs), which capture the context of a word’s appearance in text, is proposed in [5]. In that work, graph update is approached as a task of adding or deleting relations, assuming fixed sets of entities and relation types, to codify the information in the text. In essence, research on automated graph update methods has taken the form of link prediction (e.g., [4, 6, 7, 8]). In such a setting, however, important aspects such as the addition of new concepts are overlooked. There is also interest in temporal node and graph embeddings [9, 10, 11]. Within this context, there has been work on time-aware relational Graph Neural Networks (GNNs) that predict new relations based on diachronic changes in the graph [12].

3. Description of Proposed Research
AGROVOC provides affordances for annotation of agricultural data, information retrieval, literature indexing, and automated text tagging [13]. Given the pace at which new information becomes available, it is important to capture domain developments from food- and agriculture-related publications in a timely manner and to integrate them into AGROVOC, so as to ensure an up-to-date knowledge representation enabling accurate resource identification. AGROVOC has grown over the years following changes in the domain, as shown in Figures 1 and 2 (see footnote 4).

Figure 1: Number of concepts in AGROVOC per year

The number of concepts in AGROVOC (Figure 1) has increased over time, which is to be expected given the developments in the fields of food and agriculture. However, changes in the number of relation types (Figure 2) have not followed a similar pattern, and the observed drops in the recorded numbers require further explanation.
To acquire further insights into how AGROVOC is updated, the creation dates and the temporal distribution of concept occurrences in literature (abstracts from the AGRIS database) were computed for a random sample of concepts from the 2022 AGROVOC version (Table 1).

4 Figures 1 and 2 have been created using data from SPARQL queries submitted to the AGROVOC versions from 2013 and 2022. The queries are available in the paper’s GitHub repository.

Figure 2: Number of relation types in AGROVOC per year

Table 1
Dates of addition of a sample of concepts to AGROVOC and numbers of their occurrences in literature before and after their addition to AGROVOC

Concept       Date of addition   Occurrences in literature   Occurrences in literature
              to AGROVOC         before creation             after creation
c_5903        2011               422                         119
c_59e0f842    2019               1668                        34
c_25740       2011               143                         29
c_27140       2011               125                         19
c_786c0cff    2019               2843                        464
c_33193       1990               0                           494
c_62e403a1    2019               209                         327
c_41ce07e7    2017               231                         307

Despite the small sample size, it is evident that in many cases (five of the eight sampled concepts) a concept occurs more often in literature before its addition to AGROVOC than after it. This indicates that the addition of new concepts to AGROVOC is not based on their frequency of occurrence in literature. This is further supported by the temporal distribution of new concept additions illustrated in Figure 3 (see footnote 5). A high peak is observed in 2011 (26,667 concepts added), whereas the average number of concept additions per year is much lower both before 2011 (≅ 66 concepts) and after 2011 (≅ 800 concepts). Based on these findings, and considering the rapid pace of advances in agriculture, we argue that manual updates appear insufficient for the timely capture and representation of new knowledge. The proposed PhD research aims to develop, test, and evaluate methods for recommending automated AGROVOC updates based on text.
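As a rough illustration of the comparison summarised in Table 1, the sketch below splits the publication years of a concept's mentions around its creation year and then tallies the Table 1 rows. The function name `split_occurrences`, the variable names, and the choice to count mentions from the creation year itself as "after" are assumptions made for illustration only; the code actually used for Table 1 is the one in the paper's repository.

```python
from typing import Iterable, Tuple


def split_occurrences(mention_years: Iterable[int], created: int) -> Tuple[int, int]:
    """Count mentions strictly before the creation year and from that year onwards."""
    before = after = 0
    for year in mention_years:
        if year < created:
            before += 1
        else:
            # Assumption: a mention in the creation year counts as "after".
            after += 1
    return before, after


# Rows copied from Table 1: concept -> (year added, occurrences before, occurrences after)
TABLE_1 = {
    "c_5903": (2011, 422, 119),
    "c_59e0f842": (2019, 1668, 34),
    "c_25740": (2011, 143, 29),
    "c_27140": (2011, 125, 19),
    "c_786c0cff": (2019, 2843, 464),
    "c_33193": (1990, 0, 494),
    "c_62e403a1": (2019, 209, 327),
    "c_41ce07e7": (2017, 231, 307),
}

# Concepts that occur more often in literature before their addition than after it.
more_before = [
    concept
    for concept, (_, before, after) in TABLE_1.items()
    if before > after
]
```

For the eight sampled concepts, `more_before` holds five entries, which is the pattern the paragraph above refers to.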
This research forms part of a broader effort on monitoring and predicting changes in knowledge graphs utilising textual data. To this end, we have posed the following research questions:
1. How can we extract from text agricultural concepts that are absent from AGROVOC, and identify which ones to propose as new concepts to be integrated into AGROVOC?
2. Given a new concept to be integrated into AGROVOC, which relations, drawn from the existing relation types, also need to be added to link the new concept to existing concepts?

5 The code used to obtain the data shown in Table 1 and Figure 3 is available in the paper’s GitHub repository.

Figure 3: Temporal distribution of new concept additions in AGROVOC

4. Research Methodology and Experiments
The research methodology, depicted in Figure 4, has two phases: (i) extraction of novel agricultural concepts from text; and (ii) generation of recommendations for automated AGROVOC updates. Each phase involves an experiment. The experiments are described below.

Figure 4: PhD research methodology

Extraction of mentions of novel agricultural concepts from text: The focus is on the development of an agricultural term extraction tool to identify mentions of novel concepts (not seen in AGROVOC) in the corpus of texts. Given a version of AGROVOC available at a time point t and a corpus spanning a time frame t+Δt, the goal is to identify new concept mentions and recommend them for addition to the vocabulary. The term extraction tool is based on off-the-shelf Named Entity Recognition (NER) models. Abstracts of AGRIS publications are used as the tool’s training, validation, and test datasets. An initial version of the tool was built on the spaCy library’s Tok2Vec (see footnote 6) and NER (see footnote 7) components, using their default architectures (spacy.Tok2Vec.v2 and spacy.TransitionBasedParser.v2, respectively) and the language models shipped with spaCy (en_core_web_sm and en_core_web_lg).
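The novelty check described above, keeping only mentions whose label is not yet in the vocabulary available at time t, can be sketched in plain Python. This is a minimal sketch under stated assumptions: `normalise`, `novel_mentions`, and the toy labels and mentions are hypothetical, and exact matching on lower-cased, whitespace-collapsed strings is an assumption; the actual tool may match mentions to AGROVOC labels differently.

```python
from collections import Counter
from typing import Iterable, Set


def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(label.lower().split())


def novel_mentions(mentions: Iterable[str], vocabulary_labels: Iterable[str]) -> Counter:
    """Return counts of normalised mentions that match no vocabulary label."""
    known: Set[str] = {normalise(label) for label in vocabulary_labels}
    counts = Counter(normalise(mention) for mention in mentions)
    return Counter({m: n for m, n in counts.items() if m not in known})


# Toy data (hypothetical): labels in the vocabulary at time t,
# and mentions extracted from abstracts published afterwards.
labels_at_t = ["maize", "Soil fertility"]
extracted = ["Maize", "vertical farming", "soil  fertility", "Vertical farming"]

# Candidate new concepts with the number of times each was mentioned.
candidates = novel_mentions(extracted, labels_at_t)
```

Returning counts rather than a bare set keeps the mention frequency available, which could later serve as one signal when deciding which candidates to recommend.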
The tool was trained on a set of 617 AGRIS abstracts manually annotated with the labels of the agricultural concepts appearing in them. Table 2 lists the best precision, recall, and F1-score achieved in the initial experiment and the tool configurations giving those results. The results reveal the challenges of manually annotating text with agricultural terms: deciding whether a string is an agricultural term involves a high degree of vagueness, and hence subjectivity, which leaves room for different interpretations by human annotators and affects performance. Optimisation of the term extraction tool, based on transformer-based architectures and on agriculture-related vocabularies and ontologies used to annotate text unambiguously, is currently in progress.

Table 2
Best precision, recall, and F1-score and configurations of the term extraction tool giving those results

Model configuration                               Precision   Recall    F1-score
(language model - batch size - learning rate)
“en_core_web_lg” - 128 - 0.01                     50.73%      47.34%    48.97%
“en_core_web_sm” - 64 - 0.0001                    46.08%      54.52%    49.95%
“en_core_web_sm” - 64 - 0.0001                    50.70%      52.96%    51.81%

Generation of recommendations for automated AGROVOC updates: This experiment focuses on the generation of automated updates of AGROVOC, drawing upon recommendations for adding new concepts and relations (from a set of existing relation types) that link the new concepts to concepts already in AGROVOC. The method will be based on Deep Neural Network-based Natural Language Processing (DNN-based NLP), capturing the context of agricultural concept mentions in text, and Graph Neural Networks (GNNs), capable of capturing a concept’s context in the graph. Together, these make it possible to identify where in the graph a new concept should be added and how it should be linked to existing concepts. The available AGROVOC versions will be used as ground truth to evaluate the method’s performance.

5. Discussion
AGROVOC is an agriculture-related graph knowledge representation structure that can be used in various application scenarios. To facilitate accurate identification of resources based on its use, it is important to keep AGROVOC up to date. However, the rate at which new information and data become available, together with the issues emerging from AGROVOC’s current update methods (which appear not to keep pace with domain developments, as evidenced by the relevant literature), necessitates the adoption of automated, technology-supported update solutions. In this context, this paper has presented PhD research on automated AGROVOC updates based on the extraction of novel concept mentions from text. Further work is currently in progress on improving the performance of the tool for extracting agricultural terms from text. To this end, domain ontologies and vocabularies are intended to be used to annotate text automatically and unambiguously, in order to obtain the tool’s training, validation, and test datasets. Moreover, drawing upon transformer-based architectures is expected to improve performance. Future research will be concerned with the deployment of time-aware GNNs predicting future states of AGROVOC solely based on the computation of changes that have diachronically occurred in it.

6 https://spacy.io/api/tok2vec
7 https://spacy.io/api/entityrecognizer

6. Acknowledgements
The authors would like to thank FAO’s support facility for providing previous AGROVOC versions. This work has been partly supported by the H2020 EUREKA project, contract number 862790.

7. References
[1] I. Subirats-Coll, K. Kolshus, A. Turbati, A. Stellato, E. Mietzsch, D. Martini, and M. Zeng. AGROVOC: The linked data concept hub for food and agriculture. Computers and Electronics in Agriculture 196 (2022) p. 105965. doi: 10.1016/j.compag.2020.105965.
[2] M. Masoud, B. Pereira, J. McCrae, and P.
Buitelaar. Automatic Construction of Knowledge Graphs from Text and Structured Data: A Preliminary Literature Review, in D. Gromann, G. Sérasset, T. Declerck, J. P. McCrae, J. Gracia, J. Bosque-Gil, F. Bobillo, B. Heinisch (Eds.), Proceedings of the 3rd Conference on Language, Data and Knowledge (LDK 2021), Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany, 2021, Article No. 19; pp. 19:1-19:9. doi: 10.4230/OASIcs.LDK.2021.19.
[3] G. Weikum, X. L. Dong, S. Razniewski, and F. Suchanek. Machine knowledge: Creation and curation of comprehensive knowledge bases. Foundations and Trends in Databases 10 (2021) 108-490. arXiv:2009.11564v2.
[4] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling Relational Data with Graph Convolutional Networks. arXiv preprint (2017). arXiv:1703.06103v4.
[5] J. Tang, Y. Feng, and D. Zhao. Learning to Update Knowledge Graphs by Reading News, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2632-2641. doi: 10.18653/v1/D19-1265.
[6] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 855-864. doi: 10.1145/2939672.2939754.
[7] M. Zhang and Y. Chen. Link prediction based on graph neural networks. arXiv preprint (2018). arXiv:1802.09691v3.
[8] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1) (2020) 4-24. doi: 10.1109/TNNLS.2020.2978386.
[9] O. Michail. An introduction to temporal graphs: An algorithmic perspective. arXiv preprint (2015). arXiv:1503.00278v1.
[10] U. Singer, I. Guy, and K. Radinsky. Node embedding over temporal graphs. arXiv preprint (2019). arXiv:1903.08889v3.
[11] A. Taheri and T. Berger-Wolf. Predictive temporal embedding of dynamic graphs, in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 57-64. doi: 10.1145/3341161.3342872.
[12] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, and C. Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs, in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 5363-5370. arXiv:1902.10191v3.
[13] E. Mietzsch, D. Martini, K. Kolshus, A. Turbati, and I. Subirats-Coll. How Agricultural Digital Innovation Can Benefit from Semantics: The Case of the AGROVOC Multilingual Thesaurus. Engineering Proceedings 9 (1) (2020) 17. doi: 10.3390/engproc2021009017.