Isolating Terminology Layers in Complex Linguistic
Environments: a Study About Waste Management
(Short Paper)
Nicola Cirillo1
1 Università degli Studi di Salerno


Abstract
Automatic term extraction aims at extracting terminological units from specialized corpora in order to assist terminographers in developing lexicographic resources. In this paper, we introduce Domain Concept Relatedness, a novel term extraction technique meant to isolate the terminology of a given subject field. To evaluate our technique, we apply it to the extraction of waste management terms from a new Italian corpus about waste management legislation. We test it against Sketch Engine and the contrastive approach, showing that our technique effectively extracts multi-word terms belonging to a given subject field but still fails to extract single-word terms.

Keywords
Terminology Extraction, Waste Management, Domain-knowledge, Specialized Languages




1. Introduction and related work
Automatic Term Extraction (ATE) is the natural language processing task of extracting terms
from specialized corpora. Typically, an ATE tool extracts all the terms that occur in a corpus
regardless of their domain. While this approach is reasonable in most cases, in some contexts it
would be beneficial to distinguish terms related to different subject fields. This need has already
been pointed out by [1] and [2], who note that when ATE is applied to legislative documents it
becomes crucial to differentiate the terms of the regulated sector from legal terms; unfortunately,
only a few ATE methods address this task.
   It is too simplistic to assume that all the terms contained in a specialized corpus belong to
the same domain. In practice, many specialized languages include the terminology of multiple
subject fields. For instance, the terminology of institutional languages comes from different
domains of knowledge, namely law, administration, economy, and finance, plus all the technical
terms of the regulated sectors [3]. In addition, each field can be divided into more specific
sub-fields that can, in turn, be separated into smaller sub-fields, leading to a complex hierarchical
model. We will refer to the terms of a given subject field as a Terminology Layer (henceforth
TLR), regardless of its position in the hierarchy. Thus, the terminology of a specialized language
has a hierarchical structure: top-level layers contain terms related to broader domains, while
lower-level layers incorporate more topic-specific terms [4]. For example, a corpus of legislative
documents about waste management contains at least three layers: the law TLR, the waste
management TLR, and the waste management law TLR (which is an intersection of the first
two). The first TLR contains terms such as competent authority, member state, and regulation,
the second includes incineration plant, separate collection, and landfill, and the third comprises
competent authority of dispatch, country of destination, and list of wastes.
    We believe that ATE research has mostly overlooked this complex stratification of terminology.
To date, research has mainly focused on extracting all terms with ever-increasing accuracy,
tackling ATE through several strategies. Standard methods combine linguistic approaches with
simple corpus statistics [5, 6, 7, 8]. Other methods compare a focus corpus (i.e. a specialized
corpus) against a reference corpus (i.e. a general language corpus) [9, 10, 11]. It is also worth
mentioning that deep learning approaches are becoming increasingly popular and achieve notable
results [12, 13, 14]. Even though most approaches do not address the isolation of TLRs, there are
at least two exceptions that we are aware of, namely the contrastive approach [15, 2] and the
approach proposed by [4]. The idea behind the contrastive approach is that the distribution of a
term in the focus corpus is insufficient to determine its domain-specificity; it is also necessary to
examine its distribution across a contrastive corpus (i.e. a corpus containing documents about
different domains). [4] attempted to isolate the TLR of a vast domain such as the environment.
In particular, they test two measures: Specificity, which compares the distribution of terms in
the focus corpus with their distribution in a reference corpus, and Inverse Document Frequency.
They expect the terminology of the general environmental lexicon to obtain a low Inverse
Document Frequency because it is evenly distributed throughout the document collection.
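
For reference, inverse document frequency can be computed as in the sketch below (a standard formulation; [4] may use a slightly different variant). Terms that appear in most documents of the collection, such as a general environmental lexicon, receive low scores.

```python
import math

def inverse_document_frequency(term, documents):
    """Standard IDF: log of the collection size over the number of
    documents containing the term; evenly spread terms score low."""
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / doc_freq) if doc_freq else 0.0
```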
    The remainder of this paper is organized as follows. In section 2 we describe Domain Concept
Relatedness, a novel ATE technique meant to isolate the terminology of a given subject field.
Section 3 presents the corpora that we created to evaluate our technique and the results of the
evaluation. Finally, in section 4 we draw conclusions and propose future research directions. In
summary, the main contributions of this paper are:

    • We organize the terminology of a subject field into Terminology Layers.
    • We propose a novel ATE technique to isolate a specific Terminology Layer.
    • We create a focus corpus about waste management legislation and a contrastive corpus
      about legislation of other sectors.
    • We test our technique against existing ATE techniques.


2. Proposed Approach
We propose Domain Concept Relatedness (DCR), an ATE technique capable of isolating TLRs.
It is based on Key Concept Relatedness (KCR) [16] and addresses its limitations. DCR involves
four phases: candidate extraction, generation of concept embeddings, key concept extraction,
and relatedness computation. The code and the resources used for the evaluation are publicly
available on GitHub¹.
    ¹ https://github.com/nicolaCirillo/termdomain
2.1. Key Concept Relatedness
KCR [16] is an ATE technique based on the assumption that terms of a given domain must
be semantically related to already-known key concepts from that domain. To measure semantic
relatedness, KCR relies on cosine similarities between concept embeddings. In the latest
version of KCR [16], concept embeddings are generated from the hyperlinks of Wikipedia pages.
In particular, the Wikipedia dump is preprocessed by removing markup and by replacing
each hyperlink with a special token, and then the embeddings are generated via word2vec [17].
Next, to obtain the list of key concepts, KCR employs a keyword extraction algorithm, namely a
simplified version of KP-Miner [18], which automatically extracts key concepts from each document
in the corpus. Finally, for each candidate term 𝑡 for which a concept embedding exists, the
algorithm computes the semantic relatedness to a subset of 𝑘 key concepts (selected using the k
Nearest Neighbour algorithm, with only positive examples). Relatedness is computed as follows:
\[ \mathit{relatedness}(t, C_d) = \frac{1}{k} \sum_{i=1}^{k} \cos(v_t, v_i) \tag{1} \]

where 𝐶𝑑 is the set of key concepts sorted by cosine similarity to the candidate term 𝑡, 𝑘 is the
number of nearest key concepts considered (the kNN parameter), and 𝑣𝑡 and 𝑣𝑖 are the word
vectors of the term 𝑡 and of the key concept 𝑖, respectively. The relatedness of candidates that
do not have a concept embedding is set to 0.
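
As an illustration, equation 1 can be computed in a few lines of Python. The sketch below is not the original KCR implementation; it assumes the candidate term and the key concepts are already represented as numpy vectors, and the function names are ours.

```python
import numpy as np

def relatedness(term_vec, concept_vecs, k=5):
    """Equation 1: mean cosine similarity between a candidate term and
    its k most similar key concepts (positive-only kNN)."""
    if term_vec is None or len(concept_vecs) == 0:
        return 0.0  # candidates without a concept embedding get score 0
    sims = [
        float(np.dot(term_vec, c) / (np.linalg.norm(term_vec) * np.linalg.norm(c)))
        for c in concept_vecs
    ]
    top_k = sorted(sims, reverse=True)[:k]  # the k nearest key concepts
    return sum(top_k) / len(top_k)
```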
   Overall, KCR is a promising technique. According to the evaluation by [16], KCR is the
best-performing technique on the FAO dataset. Furthermore, when it comes to document-level
ATE, KCR clearly outperforms other techniques [19]. Nonetheless, it has two main limitations.
The first is that it is ill-suited to domains with low Wikipedia coverage [16]. The second concerns
the selection of the key concepts: selecting them via automatic keyword extraction has the
advantage of making KCR fully unsupervised, but it does not ensure that all the key concepts
adhere to the investigated domain, which prevents the use of KCR to isolate a TLR.

2.2. Domain Concept Relatedness
In the candidate extraction step, DCR employs syntactic patterns based upon RegexpParser from
the nltk Python library². Since our main contribution is the scoring mechanism, we decided to
keep the candidate extraction fairly simple. Thus, we extracted only nouns and noun phrases.
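
For illustration, a minimal candidate extractor along these lines could look as follows. The chunking pattern shown here is only an example of the kind of rule RegexpParser accepts, not necessarily the exact patterns used by DCR, and it assumes POS-tagged input with universal tags.

```python
from nltk import RegexpParser

# Example pattern: a noun phrase is an optional sequence of adjectives
# followed by one or more nouns (illustrative, not DCR's exact grammar).
parser = RegexpParser("NP: {<ADJ>*<NOUN>+}")

def extract_candidates(tagged_sentence):
    """Return noun and noun-phrase candidates from a POS-tagged sentence."""
    tree = parser.parse(tagged_sentence)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

# e.g. extract_candidates([("separate", "ADJ"), ("collection", "NOUN")])
# -> ["separate collection"]
```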
   To rank candidates, we employed a modified version of KCR. Our aim is to extract only the
terms that belong to the investigated domain, ignoring any other terms that may occur in the
focus corpus; however, KCR is not intended to isolate a given TLR and, worse still, it is not
suitable for domains with low Wikipedia coverage. To address the first issue, we made DCR
semi-supervised: the set of key concepts is taken from an existing thesaurus³ (or provided by the
user). This modification ensures that all the key concepts belong to the investigated domain, thus
enabling DCR to isolate a given TLR. To overcome the Wikipedia coverage problem, DCR employs
a different technique to produce concept embeddings.
    ² https://www.nltk.org/
    ³ In the evaluation (section 3), key concepts are taken from the following glossary:
https://www.zerosprechi.eu/index.php/glossario
    Corpus         Subjects            Documents   Sentences    Tokens
    focus          waste management         148      15,463    630,456
    contrastive    transport                 60       8,239    331,061
                   health                    28       3,271    130,708
                   safety                    15       1,959     71,005
                   agriculture               56       1,807     68,859
                   TOT                      159      15,276    601,633

Table 1
Structure of the corpora.


While KCR produces them from the Wikipedia dump, DCR produces concept embeddings
directly from the focus corpus by employing the Alacarte algorithm [20]. The advantage of
this technique is two-fold: it eliminates the Wikipedia coverage problem and it ensures that each
candidate has a vector representation. What makes Alacarte preferable to traditional models
is its ability to induce embeddings on the fly. Moreover, Alacarte generally produces more
robust representations than traditional models when dealing with rare words, which is a key
advantage in term extraction. To learn the transformation matrix, Alacarte needs only a set
of pre-trained embeddings and the corpus used to train them. For this purpose, we trained
a word2vec model on the Paisà corpus [21]. Once the embeddings are generated, relatedness
scores are computed as in equation 1 with 𝑘 set to 5.
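
A simplified sketch of the induction step is given below. It assumes that the transformation matrix A has already been learned from the pre-trained word2vec vectors and the corpus they were trained on (e.g. by least-squares regression, as in [20]), and that `word_vectors` behaves like a gensim KeyedVectors object; the helper names are ours, not part of any released tool.

```python
import numpy as np

def context_average(occurrences, word_vectors, window=5):
    """Average the pre-trained vectors of the words surrounding every
    occurrence of a candidate. `occurrences` is a list of
    (tokens, start, end) spans taken from the focus corpus."""
    context = []
    for tokens, start, end in occurrences:
        span = tokens[max(0, start - window):start] + tokens[end:end + window]
        context += [word_vectors[w] for w in span if w in word_vectors]
    return np.mean(context, axis=0) if context else None

def alacarte_embedding(occurrences, word_vectors, A, window=5):
    """Induce a candidate's embedding on the fly: average its context
    vectors and apply the learned linear transformation A."""
    avg = context_average(occurrences, word_vectors, window)
    return None if avg is None else A @ avg
```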


3. Evaluation
To test our novel technique, we compared it with two existing ATE techniques. The first is the
term extraction tool of Sketch Engine, which compares the frequency of terms in the focus corpus
with their frequency in a reference corpus. The second is the contrastive approach [2], a technique
that has already been employed to extract the terminology of the regulated sector from legislative
documents. Due to the lack of gold-labelled corpora, we constructed a corpus of EU directives
and regulations about waste management and applied the selected techniques to it. Then, the
list of extracted terms was evaluated by three annotators. This methodology has the drawback
of assessing only precision (i.e. the correctness of the extracted items) but not recall (i.e. the
fraction of terms that have been extracted). Therefore, we computed precision at k (P@k) and
average precision (AP): P@k is the proportion of correct terms among the top k extracted items,
while AP indicates how highly correct terms are ranked.
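
The two scores can be computed as in the sketch below (one common formulation of average precision; the snippet assumes a ranked list of extracted items and the set of items judged correct by the annotators).

```python
def precision_at_k(ranked_items, correct_items, k=200):
    """P@k: proportion of correct terms among the top-k extracted items."""
    return sum(1 for item in ranked_items[:k] if item in correct_items) / k

def average_precision(ranked_items, correct_items):
    """AP: mean of the precision values at every rank where a correct
    term occurs, so lists that rank correct terms higher score better."""
    hits, running_sum = 0, 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in correct_items:
            hits += 1
            running_sum += hits / rank
    return running_sum / hits if hits else 0.0
```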

3.1. Corpora
For our experiment, we constructed two corpora (ca. 600,000 tokens each), both composed of
EU directives and regulations. The focus corpus is about the waste management sector, while
the contrastive corpus (required to test the contrastive approach) covers four different subject
matters (agriculture and fisheries, public health, safety at work, and transport). To build the
focus corpus, we selected all the directives and regulations that are about a Eurovoc concept
related to the waste domain [22] (i.e. environmental policy, waste, and waste management) or
one of its hyponyms. On the other hand, directives and regulations that compose the contrastive
corpus concern different subject matters that, in our opinion, show enough variety to let the
contrastive approach capture terms that characterize directives and regulations in general.

3.2. Dataset
We produced the evaluation dataset by selecting the first 200 single-word terms and the first 200
multi-word terms extracted from the focus corpus by each tool (1039 unique items in total). Then,
these items were judged by three annotators, namely the author and two linguistics students. We
provided each annotator with annotation guidelines and with the focus corpus. Prior to giving
judgments, annotators had to check the usage of each item in the focus corpus. Each annotator
provided two judgments. Firstly, they decided which items are terms. Secondly, they specified
the domain to which terms belong. We defined five domains: legislation in general (LAW); waste
management legislation (WASTE LAW); waste management (WASTE); topics related to waste
management (WASTE REL); other domains (OTHER). Overall, the inter-annotator agreement
on term identification is moderate (Fleiss' κ = 0.54) and the agreement on domains is slightly
lower (Fleiss' κ = 0.47). Nonetheless, these figures are consistent with the literature: agreement
tends to be quite low due to the lack of clear boundaries between terminology and general
language [23]. In the final version of the dataset, an item is considered a valid term only if at
least two annotators judged it as such. The same holds for domain tags: if at least two annotators
assigned an item to the same domain, that tag was kept in the final dataset; otherwise, the main
annotator (the author of this paper) decided which tag to keep.
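
For reference, Fleiss' κ can be computed with off-the-shelf tooling such as statsmodels. The toy ratings matrix below (one row per item, one column per category, each cell counting how many of the three annotators chose that category) is purely illustrative and is not our actual annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Toy example: 4 items rated by 3 annotators as "term" vs "not a term".
ratings = np.array([
    [3, 0],  # all three annotators marked the item as a term
    [2, 1],
    [1, 2],
    [0, 3],  # no annotator marked the item as a term
])
print(fleiss_kappa(ratings))
```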

3.3. Results
We ran three evaluations. In the first one, all terms are considered correctly extracted regardless
of their domain. In the second one, only the terms belonging to the WASTE, WASTE LAW, and
WASTE REL domains (waste terms) are considered correctly extracted, and in the last one, only
the terms belonging to the LAW domain (legal terms) are considered correctly extracted. Results
are shown in Table 2. Looking at the P@200, it is clear that Sketch Engine is not capable of
discriminating waste terms from legal ones, since the two are extracted in similar proportions.
Conversely, the contrastive approach and DCR are better suited for isolating waste terminology,
notably the latter, for which the proportion of extracted legal terms is almost negligible. In
addition, DCR also obtained the highest AP in the extraction of multi-word waste terms while
keeping a P@200 similar to that of the contrastive approach: the two techniques extract almost
the same number of waste terms, but DCR tends to rank them higher. The same does not hold
in the single-word scenario, where DCR scores lower than both the contrastive approach and
Sketch Engine. Overall, DCR works well with multi-word terms, isolating the terminology of
the waste domain better than Sketch Engine and the contrastive approach, but it performs
poorly on single-word terms.
                                              All terms      Waste terms      Legal terms
      Term type              Tool
                                            P@200 AP         P@200 AP        P@200 AP
                        Sketch Engine        0.40     0.52    0.23    0.33    0.18     0.23
      single-word    contrastive approach    0.47     0.55    0.31    0.37     0.17    0.21
                             DCR             0.20     0.28    0.17    0.29     0.03    0.02
                        Sketch Engine        0.49     0.53    0.29    0.36    0.21     0.20
      multi-word     contrastive approach    0.54     0.67    0.40    0.54     0.15    0.15
                             DCR             0.43     0.62    0.37    0.62     0.06    0.05

Table 2
Results of the evaluation.


4. Conclusions and perspectives
The ability to isolate the terms of a specific subject field is an important feature of ATE tools
that has been underlined in the context of legislative documents, where it is crucial to separate
the terminology of the regulated sector from legal terminology. To this end, we propose Domain
Concept Relatedness (DCR), a novel ATE technique to isolate terminology layers. The main
difference between DCR and other existing ATE techniques with the same goal is that it is not
based on corpus statistics but on word embeddings. Due to the lack of gold-labelled data, we
create a corpus of EU directives and regulations about waste management and use it to test
DCR. Our evaluation suggests that DCR is effective at extracting multi-word terms belonging
to a given subject field but fails to extract single-word terms.
   Furthermore, it must be noted that DCR has several parameters that could still be optimized
to increase its effectiveness, notably the set of word embeddings employed. It would also be
interesting to combine DCR with traditional ATE techniques based on corpus statistics by using
it as a re-ranking technique.


References
 [1] A. Lenci, S. Montemagni, V. Pirrelli, G. Venturi, Ontology learning from Italian legal texts,
     in: Law, Ontologies and the Semantic Web, IOS Press, 2009, pp. 75–94.
 [2] F. Bonin, F. Dell’Orletta, S. Montemagni, G. Venturi, A contrastive approach to multi-word
     extraction from domain-specific corpora, in: Proceedings of the Seventh International Con-
     ference on Language Resources and Evaluation (LREC’10), European Language Resources
     Association (ELRA), Valletta, Malta, 2010. URL: http://www.lrec-conf.org/proceedings/
     lrec2010/pdf/553_Paper.pdf.
 [3] D. Vellutino, L’italiano istituzionale per la comunicazione pubblica, il Mulino, 2018.
 [4] P. Drouin, M.-C. L’Homme, B. Robichaud, Lexical profiling of environmental corpora,
     in: Proceedings of the Eleventh International Conference on Language Resources and
     Evaluation (LREC 2018), 2018.
 [5] J. S. Justeson, S. M. Katz, Technical terminology: some linguistic properties and an
     algorithm for identification in text, Natural language engineering 1 (1995) 9–27.
 [6] D. A. Evans, R. G. Lefferts, CLARIT-TREC experiments, Information Processing & Management
     31 (1995) 385–395.
 [7] K. Church, W. Gale, Inverse document frequency (IDF): a measure of deviations from
     Poisson, in: Natural language processing using very large corpora, Springer, 1999, pp.
     283–295.
 [8] R. Navigli, P. Velardi, Learning domain ontologies from document warehouses and dedi-
     cated web sites, Computational Linguistics 30 (2004) 151–179.
 [9] K. Ahmad, L. Gillam, L. Tostevin, et al., University of Surrey participation in TREC8:
     Weirdness indexing for logical document extrapolation and retrieval (wilder), in: TREC,
     1999, pp. 1–8.
[10] Y. Park, R. J. Byrd, B. Boguraev, Automatic glossary extraction: Beyond terminology
     identification, in: COLING, volume 10, 2002, pp. 1072228–1072370.
[11] K. Meijer, F. Frasincar, F. Hogenboom, A semantic approach for extracting domain tax-
     onomies from text, Decision Support Systems 62 (2014). doi:10.1016/j.dss.2014.03.
     006.
[12] M. Kucza, J. Niehues, T. Zenkel, A. Waibel, S. Stüker, Term extraction via neural sequence
     labeling: a comparative evaluation of strategies using recurrent neural networks, volume
     2018-September, International Speech Communication Association, 2018, pp. 2072–2076.
     doi:10.21437/Interspeech.2018-2017.
[13] A. Hazem, M. Bouhandi, F. Boudin, B. Daille, TermEval 2020: TALN-LS2N system for auto-
     matic term extraction, in: 6th International Workshop on Computational Terminology
     (COMPUTERM 2020), 2020.
[14] S. H. Manjunath, J. P. McCrae, Encoder-attention-based automatic term recognition
     (ea-atr), in: 3rd Conference on Language, Data and Knowledge (LDK 2021), Schloss
     Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
[15] R. Basili, A. Moschitti, M. T. Pazienza, F. M. Zanzotto, A contrastive approach to term
     extraction, 2001. URL: https://www.researchgate.net/publication/283854408.
[16] N. Astrakhantsev, ATR4S: toolkit with state-of-the-art automatic terms recognition methods
     in Scala, Language Resources and Evaluation 52 (2018) 853–872.
[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations
     in vector space, 1st International Conference on Learning Representations, ICLR 2013,
     Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings (2013).
[18] S. R. El-Beltagy, A. Rafea, KP-Miner: participation in SemEval-2, Association for Computa-
     tional Linguistics, 2010, pp. 190–193. URL: https://aclanthology.org/S10-1041.
[19] A. Šajatović, M. Buljan, J. Šnajder, B. D. Bašić, Evaluating automatic term extraction
     methods on individual documents, Association for Computational Linguistics, 2019, pp.
     149–154. URL: https://aclanthology.org/W19-5118. doi:10.18653/v1/W19-5118.
[20] M. Khodak, N. Saunshi, Y. Liang, T. Ma, B. Stewart, S. Arora, A la carte embedding:
     Cheap but effective induction of semantic feature vectors, in: Proceedings of the 56th
     Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
     Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 12–22. URL:
     https://aclanthology.org/P18-1002. doi:10.18653/v1/P18-1002.
[21] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell’Orletta, H. Dittmann,
     A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, Association for Computational
     Linguistics, 2014, pp. 36–43.
[22] D. Vellutino, R. Maslias, F. Rossi, Verso l’interoperabilità semantica di IATE. Studio prelim-
     inare per il dominio “gestione dei rifiuti urbani”, Terminologie specialistiche e diffusione
     dei saperi (2016) 1–240.
[23] A. Rigouts Terryn, V. Hoste, E. Lefever, In no uncertain terms: a dataset for monolingual
     and multilingual automatic term extraction from comparable corpora, Language Resources
     and Evaluation 54 (2020) 385–418.