Using NLP to Support Terminology Extraction and Domain Scoping: Report on the H2020 DESIRA Project

Manlio Bacco, WN Lab, CNR-ISTI, Pisa, Italy, manlio.bacco@isti.cnr.it
Gianluca Brunori, PAGE, DISAAA, University of Pisa, Italy, gianluca.brunori@unipi.it
Felice Dell'Orletta, ItaliaNLP Lab, CNR-ILC, Pisa, Italy, felice.dellorletta@ilc.cnr.it
Alessio Ferrari, FMT Lab, CNR-ISTI, Pisa, Italy, alessio.ferrari@isti.cnr.it

Abstract

The ongoing phenomenon of digitisation is changing social and work life, with tangible effects on the socio-economic context. Understanding the impact, opportunities, and threats of digital transformation requires the identification of viewpoints from a large diversity of stakeholders, from policy makers to domain experts, and from engineers to common citizens. The DESIRA (Digitisation: Economic and Social Impacts in Rural Areas) EU H2020 project1 considers rural areas, with a strong focus on agricultural and forestry activities, and aims at assessing the impact of digital technologies in those domains by involving a large number of stakeholders, all across Europe, around 20 focal questions. Given the involvement of stakeholders with diverse backgrounds and skills, a primary goal of the project is to develop domain-specific and interactive reference taxonomies (i.e., structured classifications of terms) to facilitate a common understanding of the technologies in use in each domain today. The taxonomies, which aim at easing the learning of the meaning of technical and domain-specific terms, are going to be exploited by the stakeholders in 20 Living Labs built around the focal questions. This report paper focuses on the semi-automatic development of the taxonomies through natural language processing (NLP) techniques based on context-specific term extraction. Furthermore, we crawl Wikipedia to enrich the taxonomies with additional categories and definitions. We plan to validate the taxonomies through field studies within the Living Labs.
1 Introduction

The DESIRA project aims to assess and anticipate the impact of digital transformation in rural areas, with a specific focus on the fields of agriculture and forestry. The activities within the project include the creation of a common terminology in a highly multi-disciplinary consortium, the assessment of past and present game-changing effects due to digital technologies, and the set-up of a methodology to anticipate future effects. Those activities are going to be performed within 20 Living Labs (LLs) in different geographical areas of Europe, composed of several stakeholders, either part of the project consortium or providing external expertise. LLs can be defined as user-centered, open innovation ecosystems based on a systematic user co-creation approach, integrating research and innovation processes in real-life communities and settings. As shown in Fig. 1, LLs are at the core of the activities in DESIRA. Each LL is associated with a specific digitisation domain. In our case, three main domains are considered: agriculture, forestry, and rural areas.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: M. Sabetzadeh, A. Vogelsang, S. Abualhaija, M. Borg, F. Dalpiaz, M. Daneva, N. Fernández, X. Franch, D. Fucci, V. Gervasi, E. Groen, R. Guizzardi, A. Herrmann, J. Horkoff, L. Mich, A. Perini, A. Susi (eds.): Joint Proceedings of REFSQ-2020 Workshops, Doctoral Symposium, Live Studies Track, and Poster Track, Pisa, Italy, 24-03-2020, published at http://ceur-ws.org

1 Project website available at: http://desira2020.eu

Figure 1: Overview of the DESIRA Project. The NLP-based part of the project is colored in grey (for "Interview with External Experts", transcripts of the interviews are used as input for the process of NLP-based Taxonomy Synthesis).
This latter domain includes all the application environments for digitisation that do not belong to agriculture and forestry, such as the management of small towns, rural communities, roads, etc. Each LL embodies a focal question, i.e., a set of primary goals to be addressed in the specific context of the LL: for instance, in central Italy, the focal question is related to forest management activities, and to the need to counteract both illegal logging and the lack of information about wood provenance. The main purpose of a LL is to assess the past and present situation in the geographical area regarding its focal question (system-as-is in Fig. 1), identifying both drivers and obstacles in the current socio-technical system, and then to agree on a desired future situation (system-to-be), highlighting the role that the introduction of digital technologies may play in enabling it. For instance, reference [BBF+19] focuses on novel digital technologies ready for use in the agricultural field and on existing non-technical barriers hindering their wider adoption. During the DESIRA project, the stakeholders of each LL—about 15-20 people per LL—will physically meet in four workshops, and will continuously interact through Virtual Research Environments based on gCube [ACC+19], a collaborative online platform. The objective of the first two workshops is to gain a clear picture of the system-as-is within the LL, i.e., to deepen the description of the context around the focal question and to clearly identify the socio-economic pros and cons of the specific digital solutions already in use, if any. The next two workshops will focus on the actions to be undertaken to transition into a more desirable digital-enabled scenario, an activity to be strongly supported by the domain-dependent taxonomies and the impact model of digital technologies, as in Fig. 1.
The activities of each LL are managed by an appointed LL moderator, who facilitates the discussion by leveraging (i) an impact model of digital technologies (common to all LLs), and (ii) a set of domain-dependent taxonomies. The model, built upon a structured survey with internal experts, interviews with external ones, and literature analysis, provides guidelines and examples to try to assess the socio-economic impact of digital technologies, ultimately referring to the UN Sustainable Development Goals (SDGs)2. For instance, recalling the focal question of the LL in central Italy, the SDG under consideration is the 12th, i.e., responsible consumption and production. The impact model is used by LL participants to brainstorm on how certain technologies currently influence and will influence their specific context. The domain-dependent taxonomies (one for each digitisation domain) are used to learn about the different technologies in use in the domain, and the meaning of the technical terminology. These are synthesised with the support of natural language processing (NLP) tools. Specifically, the information extraction tool Text2Knowledge [DVCM14]3, developed by the ItaliaNLP Lab4, is used to identify relevant terms and relations from different knowledge sources, namely literature and reports concerning the application of digital technologies in each considered domain, and interviews with external experts. Furthermore, Wikipedia is used to enrich the taxonomies with additional concepts and descriptive content, to facilitate domain scoping and independent learning by the LL participants. In this paper, we focus on the description of the approach for NLP-based taxonomy synthesis in the context of DESIRA, which can be relevant for the RE community interested in NLP.

2 Details on the SDGs available at: sustainabledevelopment.un.org

Figure 2: Synthesis of the domain-dependent Taxonomy and relation with the impact model. Software tools are colored in grey.
The presentation of the paper at NLP4RE may also serve as a trigger to discuss with the workshop participants potential RE methodologies and tools to best manage the LLs.

2 Terminology Extraction and Domain Scoping

We describe here the proposed approach for NLP-based taxonomy synthesis and domain scoping. Fig. 2 provides an overview of the approach: given a Digitisation Domain, e.g., forestry, a set of documents and reports is selected around the topics of digital technologies and digitisation. Furthermore, a set of expert interviews on these topics is performed and transcribed. The tool Text2Knowledge (T2K) is used to extract a knowledge graph that represents the relevant terms and relations found in the input documents. The knowledge graph is expected to include technology-related terms, such as "blockchain", along with domain-specific terms, such as "illegal logging" or "provenance" (see the examples in italics in Fig. 2). The knowledge graph is used for two goals: (i) generation of a domain-dependent taxonomy; and (ii) support for the consolidation of the impact model of digital technologies.

(i) To generate a domain-dependent taxonomy, the knowledge graph is first loaded into Gephi5, an open source graph visualisation and manipulation tool. An appointed Platform Manager, in collaboration with the LL Moderators, will edit the knowledge graph to produce the domain-dependent taxonomy. This is performed with the support of a Wikipedia Crawler, which allows the users to identify informative Wikipedia pages related to the various technological terms in the knowledge graph, and to enrich the graph with links to those pages. The resulting domain-dependent taxonomies are exported through the Sigma.js6 tool, and can be visualised and navigated by LL participants by means of a common web browser.

(ii) To support the consolidation of the impact model of digital technologies, which will be used in the LLs, a consolidation workshop is foreseen.
This workshop will use the information collected from internal experts and external ones concerning the impact of digital technologies, according to their previous experience with technological applications in the different domains considered. Furthermore, the workshop will use the knowledge graph to identify potentially missing relations between technologies and socio-economic impacts due to digitisation. In the following, we focus on the generation of the knowledge graph with T2K (Sect. 2.1), and on the definition of the domain-dependent taxonomies with the support of Wikipedia (Sect. 2.2).

3 Tool accessible upon request at: http://www.italianlp.it/demo/t2k-text-to-knowledge/
4 Lab website available at: http://www.italianlp.it
5 Download the tool at: https://gephi.org
6 See: http://sigmajs.org

2.1 Generation of the Knowledge Graph with T2K

T2K is a tool to generate knowledge graphs from unstructured natural language documents. A knowledge graph is composed of nodes, representing relevant terms in the documents, and edges, representing relevant relations among the terms. Below, we briefly describe the principles used by T2K to extract terms and relations.

2.1.1 Identification of Relevant Terms

The NLP method for term extraction was developed by the second author and is named contrastive analysis [BDMV10]. In this context, a term is a conceptually independent linguistic unit, which can be composed of a single word or of multiple words. The contrastive analysis technology aims at detecting those terms in a document that are specific to the context of the document under consideration [BDMV10, Del09]. In our case, the context is given by the specific domain (e.g., forestry) combined with the topics of digitisation. Roughly, contrastive analysis compares the terms extracted from context-generic documents (e.g., newspapers) with the terms extracted from the context-specific documents under analysis.
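The contrastive intuition can be sketched as a toy frequency comparison. This is our illustration only, not the actual scoring used by T2K; the function name and the smoothing constant are our own choices:

```python
# Toy contrastive ranking: terms frequent in the domain-specific corpus
# but rare in a context-generic corpus (e.g., newspapers) score higher.
# Illustration only; T2K uses more sophisticated metrics.

def contrastive_rank(specific_freq, generic_freq, smoothing=1.0):
    """Rank candidate terms by their domain-corpus frequency relative
    to their frequency in a context-generic reference corpus."""
    scores = {}
    for term, f_spec in specific_freq.items():
        f_gen = generic_freq.get(term, 0)
        # High domain frequency and low generic frequency -> high score.
        scores[term] = f_spec / (f_gen + smoothing)
    return sorted(scores, key=scores.get, reverse=True)

# "illegal logging" occurs only in the forestry corpus, while
# "market" is common in generic text, so it is demoted.
specific = {"illegal logging": 12, "blockchain": 8, "market": 10}
generic = {"market": 450, "blockchain": 5}
ranking = contrastive_rank(specific, generic)
```

Here "illegal logging" is ranked first and "market" last, mirroring the re-ranking behaviour described in the text.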
If a term from the context-specific documents also occurs frequently in the context-generic documents, such a term is considered context-generic. On the other hand, if the term is not frequent in the context-generic documents, the term is considered context-specific. In our work, the documents from which we want to extract context-specific terms are the input documents that best represent the digitisation domains involved, namely agriculture, forestry and rural areas. The proposed method requires two main steps. First, conceptually independent expressions (i.e., terms) are identified. Then, contrastive analysis is applied to select the terms that are specific to the context of the document. The overall process includes the following four tasks.

1. POS Tagging: Part of Speech (POS) Tagging is performed with an English version of the tool in [Del09]. With POS Tagging, each word is associated with its grammatical category (noun, verb, adjective, etc.).

2. Linguistic Filters: after POS tagging, we select all those words or groups of words (referred to in the following as multi-words) that follow a set of specific POS patterns (i.e., sequences of POS) that we consider relevant in our context. For example, we are not interested in those multi-words that end with a preposition, while we are interested in multi-words such as "wearable device".

3. C-NC Value: terms are finally identified and ranked by computing a "termhood" metric, called C-NC value [BDMV10]. This metric establishes how likely a word or multi-word is to be conceptually independent from the context in which it appears. The computation of the metric is rather complex, and its explanation is beyond the scope of this paper. The interested reader can refer to [BDMV10] for further details.
After this analysis, we have a ranked list of words/multi-words that can be considered terms, together with their ranking according to the C-NC metric and their frequency (i.e., number of occurrences). The more a word/multi-word is likely to be a term, the higher its ranking.

4. Contrastive Analysis: the previous step leads to a ranked list of terms in which the terms might be either context-generic or context-specific. With the contrastive analysis step, terms are re-ranked according to their context-specificity. This is done by comparing the extracted terms with the terms extracted from the Penn Treebank corpus, which collects articles from the Wall Street Journal. The final ranking is analysed by the LL moderators, and non-representative terms are discarded.

2.1.2 Identification of Relevant Relations

In order to identify relevant relations among terms, we first select all the terms extracted in the previous step. Then, we search for possible relations among such terms. We state that there is a relation between two terms if the terms appear in the same sentence or in neighboring sentences. In order to rank such relations, we use the log-likelihood metric for binomial distributions as defined in [Dun93]. The explanation of this metric is beyond the scope of this paper; here, we convey its intuition. Roughly, a relation holds between two terms if the terms frequently appear together. Moreover, the relation is stronger if the two terms do not often occur with other terms; in other words, there is a sort of exclusive relation between the two terms. The relevant terms and relations are used to produce a knowledge graph, which can be visualised with T2K.

2.2 Domain Scoping with Wikipedia Crawling

The taxonomy is then enriched by crawling the Wikipedia pages associated with technologies that are relevant for the specific domain.
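A minimal sketch of this enrichment step is given below. The function names and the graph structure are our own illustrative assumptions (the DESIRA crawler, still under development, may work differently); the query URL targets the public MediaWiki `opensearch` endpoint:

```python
from urllib.parse import urlencode

WIKI_API = "https://en.wikipedia.org/w/api.php"

def wiki_search_url(term):
    """Build a MediaWiki opensearch query URL for a candidate term."""
    params = {"action": "opensearch", "search": term,
              "limit": 1, "format": "json"}
    return f"{WIKI_API}?{urlencode(params)}"

def enrich_graph(graph, lookup):
    """Attach a Wikipedia link to every node for which `lookup`
    (e.g., an HTTP call to the URL above) returns a page URL."""
    for node in graph["nodes"]:
        page = lookup(node["label"])
        if page:
            node["wikipedia"] = page
    return graph

# Offline example with a stubbed lookup; in the real pipeline the
# lookup would fetch wiki_search_url(term) over HTTP.
graph = {"nodes": [{"label": "blockchain"}, {"label": "wood provenance"}]}
stub = {"blockchain": "https://en.wikipedia.org/wiki/Blockchain"}.get
enriched = enrich_graph(graph, stub)
```

Nodes without a matching page (here, "wood provenance") are left untouched, so moderators can later add links to other informative webpages by hand.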
Specifically, the technology-related terms that appear in the taxonomy are searched for on Wikipedia, and the links to the corresponding pages are attached to the graph. This can help the participants of the LL to learn about those concepts they are not familiar with. Furthermore, LL Moderators can look up those concepts that do not have a Wikipedia definition, and add links to other informative webpages on the topic. The software for Wikipedia crawling in DESIRA—currently under development—is freely accessible7. Such software supports the automatic exploration and comparison of Wikipedia hierarchical categories. The graph is then exported with Sigma.js and can be visualised through a common web browser.

3 Conclusion and Research Plan

The DESIRA project started in June 2019, and, although most of the components of the toolchain, i.e., T2K, Gephi, the Wikipedia Crawler and Sigma.js, are already available, we still have to test their integration for our purposes. We are proceeding carefully at this stage, in order to properly evaluate the approach itself. Furthermore, we are discussing how to ease the integration of the knowledge graph into the impact model, ultimately a key tool for the LL workshops. To this end, we are currently experimenting with the tools for the generation of the domain-dependent taxonomies. If the output meets the expectations of the project, the taxonomies will be made accessible to the LL participants. Specifically, the participants will be able to navigate the taxonomies through a web browser, to learn about the different technologies, and to use the acquired knowledge within their LL. To validate the usefulness of the taxonomies, we plan to: (a) retrieve quantitative information in terms of the number of accesses to the web pages associated with the taxonomies, and (b) gather qualitative feedback from the participants on the practical usefulness of these taxonomies within the project.
This will give an indication of the applicability of the considered NLP technologies for knowledge extraction in the context of DESIRA.

Acknowledgements

This work was partially supported by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 818194.

References

[ACC+19] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Roberto Cirillo, Gianpaolo Coro, Luca Frosini, Lucio Lelii, Francesco Mangiacrapa, Valentina Marioli, Pasquale Pagano, et al. The gCube system: delivering virtual research environments as-a-service. Future Generation Computer Systems, 95:445-453, 2019.

[BBF+19] Manlio Bacco, Paolo Barsocchi, Erina Ferro, Alberto Gotta, and Massimiliano Ruggeri. The Digitisation of Agriculture: a Survey of Research Activities on Smart Farming. Array, 3-4:1-11, 2019.

[BDMV10] Francesca Bonin, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. A contrastive approach to multi-word extraction from domain-specific corpora. In Proc. of LREC'10, pages 19-21, 2010.

[Del09] Felice Dell'Orletta. Ensemble system for part-of-speech tagging. In Proc. of Evalita'09, Evaluation of NLP and Speech Tools for Italian, 2009.

[Dun93] Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.

[DVCM14] Felice Dell'Orletta, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. T2K^2: a system for automatically extracting and organizing knowledge from texts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2062-2070, 2014.

7 Public repository available at: https://github.com/alessioferrari/DESIRA-WikiAnalysis-Repo