<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Using NLP to Support Terminology Extraction and Domain Scoping: Report on the H2020 DESIRA Project</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Manlio Bacco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Brunori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Ferrari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>WN Lab, CNR-ISTI; CNR-ILC; PAGE, DISAAA, University of Pisa; FMT Lab, CNR-ISTI</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
<p>The ongoing phenomenon of digitisation is changing social and work life, with tangible effects on the socio-economic context. Understanding the impact, opportunities, and threats of digital transformation requires the identification of viewpoints from a large diversity of stakeholders, from policy makers to domain experts, and from engineers to common citizens. The DESIRA (Digitisation: Economic and Social Impacts in Rural Areas) EU H2020 project considers rural areas, with a strong focus on agricultural and forestry activities, and aims at assessing the impact of digital technologies in those domains by involving a large number of stakeholders, all across Europe, around 20 focal questions. Given the involvement of stakeholders with diverse backgrounds and skills, a primary goal of the project is to develop domain-specific and interactive reference taxonomies (i.e., structured classifications of terms) to facilitate a common understanding of the technologies in use in each domain today. The taxonomies, which aim at easing the learning of the meaning of technical and domain-specific terms, are going to be exploited by the stakeholders in 20 Living Labs built around the focal questions. This report paper focuses on the semi-automatic development of the taxonomies through natural language processing (NLP) techniques based on context-specific term extraction. Furthermore, we crawl Wikipedia to enrich the taxonomies with additional categories and definitions. We plan to validate the taxonomies through field studies within the Living Labs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The DESIRA project aims to assess and anticipate the impact of digital transformation in rural areas, with a specific focus on the fields of agriculture and forestry. The activities within the project include the creation of a common terminology in a highly multi-disciplinary consortium, the assessment of past and present game-changing effects of digital technologies, and the set-up of a methodology to anticipate future effects. These activities are going to be performed within 20 Living Labs (LLs) in different geographical areas of Europe, composed of several stakeholders who are either part of the project consortium or provide external expertise. LLs can be defined as user-centered, open innovation ecosystems based on a systematic user co-creation approach, integrating research and innovation processes in real-life communities and settings.</p>
<p>As shown in Fig. 1, LLs are at the core of the activities in DESIRA. Each LL is associated with a specific digitisation domain. In our case, three main domains are considered: agriculture, forestry, and rural areas. The latter domain includes all the application environments for digitisation that do not belong to agriculture and forestry, such as the management of small towns, rural communities, and roads. Each LL embodies a focal question, i.e., a set of primary goals to be addressed in the specific context of the LL: for instance, in central Italy, the focal question is related to forest management activities, and to the need to counteract both illegal logging and the lack of information about wood provenance.</p>
<p>The main purpose of an LL is to assess the past and present situation in the geographical area regarding its focal question (system-as-is in Fig. 1), identifying both drivers and obstacles in the current socio-technical system, and then to agree on a desired future situation (system-to-be), highlighting the role that the introduction of digital technologies may play in enabling it. For instance, reference [BBF+19] focuses on novel digital technologies ready for use in the agricultural field and on existing non-technical barriers hindering larger adoption. During the DESIRA project, the stakeholders of each LL (about 15-20 people per LL) will physically meet in four workshops, and will continuously interact through Virtual Research Environments based on gCube [ACC+19], a collaborative online platform.</p>
<p>The objective of the first two workshops is to gain a clear picture of the system-as-is within the LL, i.e., to deepen the description of the context around the focal question and to clearly identify the socio-economic pros and cons of the specific digital solutions already in use, if any.</p>
<p>The next two workshops will focus on the actions to be undertaken to transition into a more desirable digital-enabled scenario, an activity strongly supported by the domain-dependent taxonomies and the impact model of digital technologies, as in Fig. 1.</p>
<p>The activities of each LL are managed by an appointed LL moderator, who facilitates the discussion by leveraging (i) an impact model of digital technologies (common to all LLs), and (ii) a set of domain-dependent taxonomies. The model, built upon a structured survey with internal experts, interviews with external ones, and literature analysis, provides guidelines and examples to assess the socio-economic impact of digital technologies, ultimately referring to the UN Sustainable Development Goals (SDGs, see sustainabledevelopment.un.org). For instance, recalling the focal question of the LL in central Italy, the SDG under consideration is the 12th, i.e., responsible consumption and production.</p>
<p>The impact model is used by LL participants to brainstorm on how certain technologies do and will influence their specific context. The domain-dependent taxonomies (one for each digitisation domain) are used to learn about the different technologies in use in the domain, and the meaning of the technical terminology. These are synthesised with the support of natural language processing (NLP) tools. Specifically, the information extraction tool Text2Knowledge [DVCM14] (accessible upon request at http://www.italianlp.it/demo/t2k-text-to-knowledge/), developed by the ItaliaNLP Lab (http://www.italianlp.it), is used to identify relevant terms and relations from different knowledge sources, namely literature and reports concerning the application of digital technologies in each considered domain, and interviews with external experts. Furthermore, Wikipedia is used to enrich the taxonomies with additional concepts and descriptive content, to facilitate domain scoping and independent learning by the LL participants.</p>
<p>In this paper, we focus on the description of the approach for NLP-based taxonomy synthesis in the context of DESIRA, which can be relevant for the RE community interested in NLP. The presentation of the paper at NLP4RE may also serve as a trigger to discuss with the workshop participants potential RE methodologies and tools to best manage the LLs.</p>
    </sec>
    <sec id="sec-2">
      <title>Terminology Extraction and Domain Scoping</title>
<p>We describe here the proposed approach for NLP-based taxonomy synthesis and domain scoping. Fig. 2 provides an overview of the approach: given a Digitisation Domain, e.g., forestry, a set of documents and reports is selected around the topics of digital technologies and digitisation. Furthermore, a set of expert interviews on these topics is performed and transcribed. The tool Text2Knowledge (T2K) is used to extract a knowledge graph that represents relevant terms and relations found in the input documents. The knowledge graph is expected to include technology-related terms, such as "blockchain", along with domain-specific terms, such as "illegal logging" or "provenance" (see the examples in italic in Fig. 2). The knowledge graph is used for two goals: (i) generation of a domain-dependent taxonomy; and (ii) support for consolidation of the impact model of digital technologies.</p>
<p>(i) To generate a domain-dependent taxonomy, the knowledge graph is first loaded into Gephi (https://gephi.org), an open source graph visualisation and manipulation tool. An appointed Platform Manager, in collaboration with the LL Moderators, will edit the knowledge graph to produce the domain-dependent taxonomy. This is performed with the support of a Wikipedia Crawler, which allows the users to identify informative Wikipedia pages related to the various technological terms in the knowledge graph, and to enrich the graph with links to those pages. The resulting domain-dependent taxonomies are exported through the Sigma.js (http://sigmajs.org) tool, and can be visualised and navigated by LL participants by means of a common web browser.</p>
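<p>As an illustration of the export step, the sketch below serialises a tiny knowledge graph in the JSON shape consumed by classic Sigma.js v1 (nodes with id/label/x/y/size, edges with id/source/target). The terms, relations, and coordinates are invented placeholders, not output of the actual DESIRA toolchain; a layout tool such as Gephi would normally assign real node positions.</p>

```python
import json

# Hypothetical toy taxonomy: three terms and two relations.
terms = ["blockchain", "illegal logging", "provenance"]
relations = [("blockchain", "provenance"), ("illegal logging", "provenance")]

# Build the Sigma.js v1 graph structure; x/y are placeholder coordinates.
graph = {
    "nodes": [
        {"id": t, "label": t, "x": i, "y": i % 2, "size": 1}
        for i, t in enumerate(terms)
    ],
    "edges": [
        {"id": f"e{i}", "source": s, "target": t}
        for i, (s, t) in enumerate(relations)
    ],
}
print(json.dumps(graph, indent=2))
```

<p>A JSON file in this shape can be loaded directly by a Sigma.js instance in a web page, matching the browser-based navigation described above.</p>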
<p>(ii) To support the consolidation of the impact model of digital technologies, which will be used in the LLs, a consolidation workshop is foreseen. This workshop will use the information collected from internal and external experts concerning the impact of digital technologies, according to their previous experience with technological applications in the different domains considered. Furthermore, the workshop will use the knowledge graph to identify potentially missing relations between technologies and the socio-economic impacts of digitisation.</p>
<p>In the following, we focus on the generation of the knowledge graph with T2K (Sect. 2.1), and on the definition of the domain-dependent taxonomies with the support of Wikipedia (Sect. 2.2).</p>
      <sec id="sec-2-1">
        <title>Generation of the Knowledge Graph with T2K</title>
<p>T2K is a tool to generate knowledge graphs from unstructured natural language documents. A knowledge graph is composed of nodes, representing relevant terms in the documents, and edges, representing relevant relations among the terms. Below, we briefly describe the principles used by T2K to extract terms and relations.</p>
      </sec>
      <sec id="sec-2-2">
<title>Identification of Relevant Terms</title>
<p>The NLP method for term extraction was developed by the second author and is named contrastive analysis [BDMV10]. In this context, a term is a conceptually independent linguistic unit, which can be composed of a single word or multiple words. The contrastive analysis technology aims at detecting those terms in a document that are specific to the context of the document under consideration [BDMV10, Del09]. In our case, the context is given by the specific domain (e.g., forestry) combined with the topics of digitisation. Roughly, contrastive analysis compares the terms extracted from context-generic documents (e.g., newspapers) with the terms extracted from the context-specific documents under analysis. If a term in the context-specific document also occurs frequently in the context-generic documents, it is considered context-generic. On the other hand, if the term is not frequent in the context-generic documents, it is considered context-specific.</p>
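<p>The contrastive principle can be illustrated with a minimal sketch, assuming simple token lists as corpora: the score below is just an illustrative frequency ratio between the domain corpus and the generic corpus, not the actual contrastive function of [BDMV10], whose formulation is more elaborate.</p>

```python
from collections import Counter

def contrastive_rank(domain_tokens, generic_tokens):
    """Rank terms by how specific they are to the domain corpus.

    Illustrative score: relative frequency in the domain corpus divided
    by (smoothed) relative frequency in the generic corpus. Terms that
    are frequent in the domain but rare in generic text rank highest.
    """
    domain_freq = Counter(domain_tokens)
    generic_freq = Counter(generic_tokens)
    n_dom = len(domain_tokens)
    n_gen = len(generic_tokens)
    scores = {}
    for term, f in domain_freq.items():
        p_dom = f / n_dom
        # add-one smoothing so terms unseen in the generic corpus
        # do not cause a division by zero
        p_gen = (generic_freq[term] + 1) / (n_gen + 1)
        scores[term] = p_dom / p_gen
    return sorted(scores, key=scores.get, reverse=True)

# toy corpora: "logging" dominates the domain text, "market" the generic one
domain = "illegal logging logging blockchain provenance logging".split()
generic = "market market stocks market economy logging".split()
print(contrastive_rank(domain, generic)[0])  # prints "logging"
```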
<p>In our work, the documents from which we want to extract context-specific terms are the input documents that best represent the digitisation domains involved, namely agriculture, forestry and rural areas. The proposed method requires two main steps. First, conceptually independent expressions (i.e., terms) are identified. Then, contrastive analysis is applied to select the terms that are specific to the context of the document. The overall process includes the following four tasks.
1. POS Tagging: Part-of-Speech (POS) tagging is performed with an English version of the tool in [Del09]. With POS tagging, each word is associated with its grammatical category (noun, verb, adjective, etc.).
2. Linguistic Filters: after POS tagging, we select all those words or groups of words (referred to in the following as multi-words) that follow a set of specific POS patterns (i.e., sequences of POS tags) that we consider relevant in our context. For example, we are not interested in multi-words that end with a preposition, while we are interested in multi-words with a format like &lt;adjective, noun&gt; (such as "wearable device").
3. C-NC Value: terms are finally identified and ranked by computing a "termhood" metric, called the C-NC value [BDMV10]. This metric establishes how likely a word or multi-word is to be conceptually independent from the context in which it appears. The computation of the metric is rather complex, and its explanation is beyond the scope of this paper; the interested reader can refer to [BDMV10] for further details. After this analysis, we have a ranked list of words/multi-words that can be considered terms, together with their ranking according to the C-NC metric and their frequency (i.e., number of occurrences). The more a word/multi-word is likely to be a term, the higher its ranking.
4. Contrastive Analysis: the previous step leads to a ranked list of terms in which any term might be context-generic or context-specific. With the contrastive analysis step, terms are re-ranked according to their context-specificity. This is done by comparing the extracted terms with the terms extracted from the Penn Treebank corpus, which collects articles from the Wall Street Journal. The final ranking is analysed by the LL moderators, and non-representative terms are discarded.</p>
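<p>Task 2 above can be sketched as a pattern filter over POS-tagged text. The tag names, the allowed patterns, and the hard-coded tagged sentence below are illustrative assumptions, not T2K's actual tag set or filter list.</p>

```python
# Keep only multi-words whose POS sequence matches an allowed pattern,
# e.g. an adjective followed by a noun ("wearable device").
ALLOWED_PATTERNS = {
    ("ADJ", "NOUN"),           # e.g. "wearable device"
    ("NOUN", "NOUN"),          # e.g. "forest management"
    ("ADJ", "NOUN", "NOUN"),
}

def candidate_multiwords(tagged, max_len=3):
    """Return word n-grams whose POS sequence matches an allowed pattern."""
    out = []
    for n in range(2, max_len + 1):
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if tuple(tag for _, tag in gram) in ALLOWED_PATTERNS:
                out.append(" ".join(w for w, _ in gram))
    return out

# toy POS-tagged sentence (word, tag pairs)
tagged = [("illegal", "ADJ"), ("logging", "NOUN"), ("harms", "VERB"),
          ("forest", "NOUN"), ("management", "NOUN")]
print(candidate_multiwords(tagged))  # ['illegal logging', 'forest management']
```

<p>Note how "logging harms" is rejected because its POS sequence (noun, verb) is not in the allowed set, mirroring the exclusion of non-term-like patterns described above.</p>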
      </sec>
      <sec id="sec-2-3">
<title>Identification of Relevant Relations</title>
<p>In order to identify relevant relations among terms, we first select all the terms extracted in the previous step. Then, we search for possible relations among such terms. We state that there is a relation between two terms if the terms appear in the same sentence or in neighboring sentences. In order to rank such relations, we use the log-likelihood metric for binomial distributions as defined in [Dun93]. The full explanation of the metric is beyond the scope of this paper; here, we give an idea of its spirit. Roughly, a relation holds between two terms if the terms frequently appear together. Moreover, the relation is stronger if the two terms do not often occur with other terms. In other words, there is a sort of exclusive relation between the two terms.
The relevant terms and relations are used to produce a knowledge graph, which can be visualised with T2K. The taxonomy is then enriched by crawling the Wikipedia pages associated with technologies that are relevant for the specific domain. Specifically, the technology-related terms that appear in the taxonomy are searched in the Wikipedia pages, and the links to the pages are attached to the graph. This can help the participants of the LL to learn about those concepts that they are not familiar with. Furthermore, LL Moderators can look up those concepts that do not have a Wikipedia definition, and add links to other informative webpages on the topic. The software for Wikipedia crawling in DESIRA (currently under development) is freely accessible in a public repository (https://github.com/alessioferrari/DESIRA-WikiAnalysis-Repo). Such software supports automatic exploration and comparison of Wikipedia hierarchical categories. The graph is then exported with Sigma.js and can be visualised through a common web browser.</p>
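<p>The log-likelihood metric of [Dun93] can be sketched from a 2x2 contingency table over sentences. The implementation below follows the common entropy-based formulation of Dunning's ratio; the example counts are invented for illustration.</p>

```python
import math

def _x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    # unnormalised Shannon entropy term: N*ln(N) - sum_i k_i*ln(k_i)
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def dunning_llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio [Dun93] for a 2x2 contingency table.

    k11: sentences containing both terms, k12: only the first term,
    k21: only the second term, k22: neither. Higher values indicate a
    stronger (more 'exclusive') association between the two terms;
    independent terms score approximately zero.
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# invented counts: two terms co-occur in 8 of 100 sentences and rarely
# appear apart, so the score is a large positive value
print(dunning_llr(8, 2, 3, 87))   # strong association
print(dunning_llr(1, 1, 1, 1))    # independence: score is 0
```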
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Research Plan</title>
<p>The DESIRA project started in June 2019, and, although most of the components of the toolchain, i.e., T2K, Gephi, the Wikipedia Crawler and Sigma.js, are available, we still have to test their integration for our purposes. For this reason, we are proceeding carefully at this stage, in order to evaluate the approach in itself. Furthermore, we are discussing how to ease the integration of the knowledge graph into the impact model, ultimately a key tool for the LL workshops.</p>
<p>To this end, we are currently experimenting with the usage of the tools for the generation of the domain-dependent taxonomies. If the output meets the expectations of the project, the taxonomies will be made accessible to the LL participants. Specifically, the participants will be able to navigate the taxonomies through a web browser, to learn about the different technologies, and to use the acquired knowledge within their LL. To validate the usefulness of the taxonomies, we plan to: (a) retrieve quantitative information in terms of the number of accesses to the web pages associated with the taxonomies, and (b) gather qualitative feedback from the participants on the practical usefulness of these taxonomies within the project. This will give an indication of the applicability of the considered NLP technologies for knowledge extraction in the context of DESIRA.</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
<p>This work was partially supported by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 818194.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>
          [ACC+19]
          <string-name>
            <given-names>Massimiliano</given-names>
            <surname>Assante</surname>
          </string-name>
          , Leonardo Candela, Donatella Castelli, Roberto Cirillo, Gianpaolo Coro, Luca Frosini, Lucio Lelii, Francesco Mangiacrapa, Valentina Marioli,
          <string-name>
            <given-names>Pasquale</given-names>
            <surname>Pagano</surname>
          </string-name>
          , et al.
          <article-title>The gCube system: delivering virtual research environments as-a-service</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>95</volume>
          :
          <fpage>445</fpage>
          -
          <lpage>453</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [BDMV10]
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Bonin</surname>
          </string-name>
          , Felice Dell'Orletta,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
<article-title>A contrastive approach to multi-word extraction from domain-specific corpora</article-title>
          .
          <source>In Proc. of LREC'10</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [DVCM14]
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          , Giulia Venturi, Andrea Cimino, and Simonetta Montemagni.
<article-title>T2K^2: a system for automatically extracting and organizing knowledge from texts</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)</source>
          , pages
          <fpage>2062</fpage>
          -
          <lpage>2070</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[Del09] Felice Dell'Orletta. Ensemble system for part-of-speech tagging. In Proc. of Evalita'09, Evaluation of NLP and Speech Tools for Italian, 2009.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[Dun93] Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>