<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lisbon, Portugal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Key Environmental Lexicon Extraction Using Generative Transformer (Short Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomara Gotkova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Shvets</string-name>
          <xref ref-type="aff" rid="aff0">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, ATILF, Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff0">
          <label>2</label>
          <institution>Pompeu Fabra University, NLP Group</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper presents a study of the core environmental lexicon at the intersection of the fields of terminology and natural language processing. The goal was to find a way of automating the expansion of a preselected keyword list and, in particular, to evaluate the ability of generative transformers to extract keywords unseen during the training phase. As a starting point, we collected keywords pertinent to environmental discourse. Additionally, we compiled a corpus of texts on current and emerging environmental issues. These materials were used to train deep generative models of two types: a T5 transformer and, as a baseline, a pointer-generator network pretrained for concept extraction. We show that T5 significantly outperforms the baseline in detecting unseen keywords. We further provide a qualitative analysis of the output of the resulting model applied to weakly annotated texts and confirm that the model helps to discover more keywords pertinent to the environmental topic.</p>
      </abstract>
      <kwd-group>
        <kwd>environmental terminology</kwd>
        <kwd>deep generative models</kwd>
        <kwd>keyword extraction</kwd>
        <kwd>specialized corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our primary objective is rooted in terminology: we aim to identify the core environmental
terminology which we see as a set of central terms that shape the modern environmental
discourse. As a first step towards this objective, we opt for a supervised machine learning
approach that consists in training deep generative models with preselected lexical material
and a specialized corpus of environmental texts. In the following sections, we comment on
the theoretical framework that underlies our terminological tasks, describe the dataset, the
selection of generative models, preliminary extraction results and points for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical framework</title>
      <sec id="sec-2-1">
        <title>2.1. The notion of “environmental coreness”</title>
        <p>
          Environmental terminology is a patchwork of terms which belong to different disciplines
(anthropology, chemistry, biology, ecology, physics) and topics (renewable energy, ocean pollution,
biodiversity). Due to such heterogeneity, environmental terminology defies clear-cut
segmentation when it comes to certain tasks. While it is relatively easy to discern terms specific to a
given environmental subtopic or subdiscipline, detecting terms which are relevant for most of
the subtopics or subdisciplines at once remains a challenging (but feasible) task. For instance,
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] proposes a method of identifying general environmental lexicon which “cuts across the entire
field of the environment”, e.g., biologist, ecosystem, green.
        </p>
        <p>
          Previous research explored the notion of “coreness” as applied to both general and specialized
lexicon. Depending on the purpose of a given core wordlist, core words can be defined by
such properties as frequency, commonness, universality, semantic primitiveness, etc. [
          <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
          ].
We focus on the notion of environmental coreness in the specialized texts on the current
and emerging environmental issues, e.g., air pollution, loss of biodiversity, waste management,
etc. A given environmental term is considered core if it meets the following criteria: (i) it
refers to the most essential environmental concept (sustainable), (ii) it is pertinent to several
environmental subtopics at once (ecosystem), (iii) it exhibits strong semantic connections with
other environment-related terms, (iv) it is not specific to specialized environmental discourse
only as it is diffused in general language discourse as well (mass media texts, general public
communication, etc.).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Term vs. keyword</title>
        <p>
          We advocate for the lexico-semantic approach to terminology which treats terms as lexical
units [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. According to the integral component of the Meaning-Text theory – the Explanatory
Combinatorial Lexicology – a lexical unit is a word which corresponds to one specific sense
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Hence, the term carbon implies a pointer to a specific sense, e.g., ‘chemical element C’. As
regards machine learning tasks, however, we deliberately refrain from using the notion of
“term” to demonstrate that we stay at the level of abstract units with no clear terminological
status. Instead, we use the notion of keyword which is a wordform devoid of clear semantic
features as there is no direct reference to a specific sense. For instance, the keyword carbon1
per se does not refer to any specific sense but it can acquire semantic features in context. It
should be noted that our notion of keyword is different from uses which refer to concepts
rather than semantically ambiguous wordforms, e.g., keywords of a scientific paper.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>Keyword list. As a result of continuous sampling of environmental lexical material2, we
compiled an initial list of 268 unique environment-related keywords. These keywords were further
divided into two categories. We selected 104 keywords which we see as core-candidates, i.e.,
keywords which may potentially be validated as core environmental terms (carbon, climate,
global warming, greenhouse gas). The supervised models are expected to expand this list.
1Terms are written in italics; keywords are written in teletype.</p>
        <p>
          2The process included both manual and automatic selection and was partly done in collaboration with an expert
in green chemistry and an expert in lexicology [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>The remaining 164 keywords were categorized as supplementary3. We considered the
following keywords as supplementary: complex keywords built from Ckeywords (air pollution,
anthropogenic carbon dioxide, atmospheric warming) and keywords which would
not satisfy the criteria of coreness but are nevertheless important for environmental discourse
(ice, Earth).</p>
        <p>Specialized corpus. Our specialized corpus is a monolingual English domain-specific corpus
composed of 44 reports issued by international environmental organizations such as the European
Environment Agency, the Intergovernmental Panel on Climate Change, the United Nations
Environment Programme and the World Meteorological Organization. These reports give a comprehensive
overview of current and emerging environmental issues.</p>
        <p>We converted the documents to plain text, excluding figures and tables, and manually cleaned
the artifacts that remained after the conversion. Consider an example of a sentence with Ckeywords
given in bold and Skeywords underlined:</p>
        <p>Moreover, the degradation of wetlands releases stored carbon, fuelling climate change.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Automatic and gold-standard annotation</title>
        <p>We designed a simple procedure to annotate the entire corpus of about 30K sentences in order to have
enough data samples for training a neural network. The first step is to parse the corpus with
UDPipe4; the second is to consider all sequences of tokens of lengths from one to six
(i.e., up to the maximum number of words in the keywords in our lists), normalize the
lexical items using their lemmas, and look them up (with conditions on part-of-speech tags)
in the lists of keywords, which we had automatically expanded with alternatives beforehand (e.g.,
for biodiversity conservation we added conservation of the biodiversity, for
bio-based – biobased, etc.). Finally, each sentence with the corresponding found items made
a single data sample. The obtained samples cover 103 out of 104 Ckeywords (carbon-free
was not found in this corpus) and, in total, 255 out of 268 keywords. The search procedure
took into account many possible occurrences including the cases of overlapping and
discontinuous keywords such as soil pollution and air pollution in a phrase “soil and air
pollution”.</p>
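        <p>For illustration, the lookup over token sequences can be sketched as follows (a minimal Python sketch, not the authors' code; it assumes lemmas have already been produced by UDPipe, and omits the part-of-speech conditions):</p>

```python
# Toy sketch of the n-gram lemma lookup described above. Assumes lemmatization
# was already done with UDPipe; POS-tag conditions are omitted for brevity.
MAX_LEN = 6  # maximum number of words in a keyword in our lists


def find_keywords(lemmas, keyword_lemmas):
    """Return (start, end, keyword) spans whose lemma sequence matches a
    keyword; overlapping matches are kept."""
    found = []
    for start in range(len(lemmas)):
        for end in range(start + 1, min(start + MAX_LEN, len(lemmas)) + 1):
            candidate = " ".join(lemmas[start:end])
            if candidate in keyword_lemmas:
                found.append((start, end, candidate))
    return found


# Lemmatized version of the example sentence from Section 3.1:
lemmas = "the degradation of wetland release store carbon , fuel climate change".split()
matches = find_keywords(lemmas, {"climate change", "carbon", "wetland"})
# matches → [(3, 4, 'wetland'), (6, 7, 'carbon'), (9, 11, 'climate change')]
```

        <p>Because every span of each length is tried independently, overlapping and discontinuous keyword variants can be matched in the same sentence.</p>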
        <p>Resulting samples were shuffled and split into the training, development (dev), and test
subsets in the proportion 80/10/10. We performed shuffling several times until the examples
were distributed among the subsets in such a way that only 80% of the keywords are used for
training (they also appear in two other subsets), while other 10% and 10% are used exclusively
in the dev and test subsets without intersections5. We preserve these 20% of keywords to assess
the ability of the model to extract “new” keywords unseen during the training. We leverage
the dev set to select the most prominent intermediate states of the model obtained during the
training, and the test set – for the final evaluation. Sentences without keywords were also added
proportionally to the subsets to guide the model when it should not extract anything.</p>
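        <p>The exclusiveness condition can be sketched roughly as follows (illustrative Python, not the authors' actual procedure: instead of repeated shuffling, this simplified version partitions the keyword list once and routes or drops samples accordingly; all names are ours):</p>

```python
import random


def exclusive_split(samples, keywords, seed=0):
    """Partition keywords 80/10/10 and route (sentence, keyword-set) samples
    so that dev-only and test-only keywords never occur in training.
    Samples mixing dev-only and test-only keywords are dropped, mirroring
    the removal of samples that violate exclusiveness."""
    rng = random.Random(seed)
    kws = sorted(keywords)
    rng.shuffle(kws)
    tenth = len(kws) // 10
    dev_kw, test_kw = set(kws[:tenth]), set(kws[tenth:2 * tenth])
    train, dev, test = [], [], []
    for sentence, kw_set in samples:
        hits_dev, hits_test = kw_set & dev_kw, kw_set & test_kw
        if hits_dev and hits_test:
            continue  # violates exclusiveness: remove from the dataset
        elif hits_dev:
            dev.append((sentence, kw_set))
        elif hits_test:
            test.append((sentence, kw_set))
        else:
            train.append((sentence, kw_set))
    return train, dev, test
```

        <p>In the paper's setup the training keywords may also appear in dev and test; only the two held-out 10% portions are exclusive, which is the invariant this sketch enforces.</p>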
        <p>3Further in the text, core-candidate keywords and supplementary keywords are called Ckeyword and Skeyword
respectively.</p>
        <p>4https://ufal.mff.cuni.cz/udpipe
5A couple of thousand samples were removed from the dataset to meet the condition of exclusiveness.</p>
        <p>In addition to our simple automatic annotation, we manually selected and examined 200
sentences from the corpus (excluded from the subsets) and created fully annotated samples
(with some keywords beyond the existing lists) that we refer to as a gold standard. The size of
the overall dataset is shown in Table 1.</p>
        <p>[Table 1: only the column headers Ckey+Skey, Ckey, Ckey new, # pos, # neg and the row label Training survived conversion; the counts are not recoverable.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Generative extraction models</title>
        <p>
          The overlapping and discontinuous keywords in environmental texts pose a problem for
traditional sequence-labelling-based extractors. Instead, in this work, we opt for deep
neural generative models that are capable of translating a sentence into an arbitrary sequence
of words (not necessarily coherently connected) like T5 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] as we would like a model to output
keywords, in the form in which they appear in the sentence, one after another, separated by a reserved symbol6.
        </p>
        <p>In our experiments, we worked with two versions of the pretrained transformer T5, T5-small
and T5-large7. We also chose a pretrained pointer-generator-based concept extraction model
(CE-PGN) [10] as an alternative that we successfully applied for public discourse analysis in
the domain of interior and urban design [11]. Originally, this model was designed to extract
concepts mainly in the form of nominal phrases, which is not the only form of the keywords
considered in this work. Still, we assumed that tuning it on our data could change its behaviour.</p>
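        <p>Given this output format, recovering the keyword list from a generated string is a simple post-processing step; a sketch in Python (the choice of "*" as the reserved separator is our assumption, based on footnote 6):</p>

```python
SEP = "*"  # reserved separator symbol (assumed, after footnote 6)


def parse_generated(output: str) -> list:
    """Split a model output string on the reserved separator, dropping empty
    chunks and duplicates while preserving order of first occurrence."""
    seen, keywords = set(), []
    for chunk in output.split(SEP):
        kw = chunk.strip()
        if kw and kw not in seen:
            seen.add(kw)
            keywords.append(kw)
    return keywords


parse_generated("Sustainable * Sustainable forest management * forest management")
# → ['Sustainable', 'Sustainable forest management', 'forest management']
```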
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We report the precision P = TP_t / (TP + FP) and recall R = TP_t / GT_t scores for different
types of keywords (Skeywords, Ckeywords, and Ckeywords new – Ckeywords unseen during
the training) in Table 2, where TP_t is the number of correctly extracted mentions of the scored
type, TP + FP is the total number of extracted mentions (which does not
depend on the type under scoring), and GT_t is the number of ground-truth mentions of the
scored type.</p>
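      <p>In code, these scores can be computed as follows (an illustrative Python sketch; the (mention, type) pair representation is our assumption, not the paper's actual data format):</p>

```python
def precision_recall(extracted, ground_truth, scored_type):
    """P = TP_t / (TP + FP) and R = TP_t / GT_t, where TP_t counts correctly
    extracted mentions of the scored type, TP + FP counts all extracted
    mentions (independent of the scored type), and GT_t counts ground-truth
    mentions of the scored type. Mentions are (text, type) pairs."""
    tp_t = sum(1 for m in extracted if m in ground_truth and m[1] == scored_type)
    extracted_total = len(extracted)  # TP + FP: independent of the type
    gt_t = sum(1 for m in ground_truth if m[1] == scored_type)
    precision = tp_t / extracted_total if extracted_total else 0.0
    recall = tp_t / gt_t if gt_t else 0.0
    return precision, recall
```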
      <p>
        As expected, the original CE-PGN model extracts a small number of keywords with very
low precision, as it tends to find all the concepts independently of the domain. The fine-tuning
re-oriented it towards the environmental domain – both scores improved significantly.
However, it performed poorly on extracting unseen keywords.
      </p>
      <p>
        6E.g., Sustainable forest management can maintain... → Sustainable * Sustainable forest management * forest
7For languages other than English, mT5 shall be used as it allows for cross-lingual transfer learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
8The model was tuned on the same training set but annotated only with Ckeywords.
      </p>
      <p>T5-large performed better than the other models, except that the small version achieved a
slightly higher recall on the unseen Ckeywords of the test set. Interestingly, the annotation
with Skeywords helped to detect Ckeywords better. The model that was trained only to extract
Ckeywords (T5-large c-tuned) generalized more poorly and missed many more unseen Ckeywords.</p>
      <p>For a quality check of the extraction results, we manually examined 171 non-annotated
keywords extracted from the dev set using T5-large. As a result, 70 novel keywords (41%) were
obtained, another 32 (19%) corresponded to existing keywords missed by the automatic annotation
due to parser mistakes, and only the remaining 69 (40%) were false positives, i.e., not keywords.
45 keywords out of the 70 novel were combinations of already existing keywords in our lists
(ecological drought, biomass contaminant), the remaining 25 keywords were new to
us (smog, renewable electricity, biomethane). This result is linguistically valuable for
us: all 25 new keywords are pertinent to the environmental topic. Although some keywords
are too specific (cryosphere), all of them are considered an important addition to our list.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>Results of our experiments provided several valuable insights for both linguistics and
information extraction. First, the preselected keywords proved pertinent to the
environmental topic and, in particular, to the vocabulary of the current and emerging environmental
issues. More specifically, only 13 keywords out of 268 were not present in our corpus (5%).
Second, tests performed with T5-large demonstrated that supplementary lexical material
(Skeywords) enhanced the model’s ability to detect Ckeywords. Therefore, as the list of Ckeywords
used to train the model grows, it is necessary to add to the list of Skeywords as well. Third, we
consider it now important to increase the number of manually annotated samples to improve
the gold standard dataset and this will allow us to train the model on annotated data of high
quality in addition to automatically annotated sets. Fourth, the T5-large model proved efficient for
extracting unseen keywords: it detected 50-70% of them in a set (62% across all the evaluation
sets). Finally, we extracted 70 novel keywords pertinent to the topic of current and emerging
environmental issues which were not present in our preselected keyword list.</p>
      <p>The ultimate goal of our research, which goes beyond this study, is manifold. The finalized
keyword list will be used to scrape data from social networks, namely Twitter and Reddit,
to monitor the general public’s perception of core environmental terms. As a parallel task,
both preselected and extracted keywords will be subject to a lexicographic analysis in order to
convert them into meaningful lexical units and describe them in a lexicographic resource. In
some cases, a given complex keyword may be decomposed into several terms. For example, the
keyword climate pollutant should be converted to and lexicographically described as two
separate terms climate and pollutant, for phraseological reasons. Additionally, the obtained list
of terms will be analyzed according to the criteria of environmental coreness discussed in 2.1.
If a given term satisfies all four criteria, it can be validated as a core environmental term. We
would also like to explore the differences in terminology in multilingual material with mT5 and
study the transferability of the obtained extractive models to other domains.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was funded by the EC-funded research and innovation programme Horizon
Europe under the grant agreement number 101070278 and by the French PIA project “Lorraine
Université d’Excellence”, reference ANR-15-IDEX-04-LUE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Drouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>L'Homme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Robichaud</surname>
          </string-name>
          ,
          <article-title>Lexical profiling of environmental corpora</article-title>
          , in: N. Calzolari (Conference chair), K. Choukri,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hasida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          , T. Tokunaga (Eds.),
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ),
          <source>European Language Resources Association (ELRA)</source>
          , Paris, France,
          <year>2018</year>
          , pp.
          <fpage>3419</fpage>
          -
          <lpage>3425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Carter</surname>
          </string-name>
          ,
          <article-title>Is there a Core Vocabulary? Some Implications for Language Teaching*</article-title>
          ,
          <source>Applied Linguistics</source>
          <volume>8</volume>
          (
          <year>1987</year>
          )
          <fpage>178</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Speelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Geeraerts</surname>
          </string-name>
          ,
          <article-title>Core vocabulary, borrowability and entrenchment: A usage-based onomasiological approach</article-title>
          ,
          <source>Diachronica</source>
          <volume>31</volume>
          (
          <year>2014</year>
          )
          <fpage>74</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Brezina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gablasova</surname>
          </string-name>
          ,
          <article-title>Is There a Core General Vocabulary? Introducing the New General Service List</article-title>
          ,
          <source>Applied Linguistics</source>
          <volume>36</volume>
          (
          <year>2013</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>L'Homme</surname>
          </string-name>
          ,
          <article-title>Lexical semantics for terminology : an introduction, Terminology and lexicography research and practice (TLRP)</article-title>
          , John Benjamins Publishing Company, Amsterdam Philadelphia,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Mel’čuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Clas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polguère</surname>
          </string-name>
          ,
          <article-title>Introduction à la lexicologie explicative et combinatoire, Universités francophones</article-title>
          , Duculot,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gotkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chepurnykh</surname>
          </string-name>
          ,
          <article-title>Public perception and usage of the term : Linguistic analysis in an environmental social media corpus</article-title>
          ,
          <source>Psychology of Language and Communication</source>
          <volume>26</volume>
          (
          <year>2022</year>
          )
          <fpage>297</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          , C. Raffel,
          <article-title>mT5: A massively multilingual pre-trained text-to-text transformer</article-title>
          , in:
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shvets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanner</surname>
          </string-name>
          ,
          <article-title>Concept extraction using pointer-generator networks and distant supervision for data augmentation</article-title>
          , in:
          <source>International Conference on Knowledge Engineering and Knowledge Management</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>120</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Stathopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shvets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Diplaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>Social media and web sensing on interior and urban design</article-title>
          , in:
          <source>2022 IEEE Symposium on Computers and Communications (ISCC)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>