Overview of the CLEF 2022 SimpleText Task 2:
Complexity Spotting in Scientific Abstracts
Liana Ermakova1, Irina Ovchinnikov2, Jaap Kamps3, Diana Nurbakova4,
Sílvia Araújo5 and Radia Hannachi6
1 Université de Bretagne Occidentale, HCTI, Brest, France
2 ManPower Language Solution, Israel
3 University of Amsterdam, Amsterdam, The Netherlands
4 University of Lyon, INSA Lyon, CNRS, LIRIS, UMR5205, F-69621 Villeurbanne, France
5 Universidade do Minho, CEHUM, 4710-057 Braga, Portugal
6 Université de Bretagne Sud, HCTI, 56321 Lorient, France


Abstract
This paper provides an overview of Task 2 (What is unclear?) of the Automatic Simplification of
Scientific Texts (SimpleText) lab, run as part of CLEF 2022. The main aim of the SimpleText lab is to
promote more open access to scientific information via automatic text simplification. Task 2 focuses on
complexity spotting within scientific texts (passages). The goal is to detect the terms/concepts that
require specific background knowledge to understand the passage and to assess their complexity
for non-experts. Overall, four runs from four different teams were submitted to this task. In this
paper, we describe the data collection, the task setup, and the evaluation procedure. We also give a brief
overview of the participating approaches.

Keywords
automatic text simplification, terminology, background knowledge, scientific article, science popularization,
contextualization, term difficulty


1. Introduction
Digitalisation has made scientific literature more available to every citizen. However, an
important barrier still prevents citizens from accessing objective scientific knowledge from the
original sources. One of the key issues is the high complexity of scientific texts for non-experts,
who lack the required background knowledge, including knowledge of the terminology. Even
native speakers find it hard to understand terminology outside their area of expertise.
Nevertheless, the basic set of terms that the general public acquires through secondary and
college education allows them to comprehend popular science publications. Comprehending a
term presupposes grasping the concept it refers to without any definition. To understand a
concept, we need to integrate it into the structured system of our semantic memory, which can
require more knowledge than we have acquired.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Email: liana.ermakova@univ-brest.fr (L. Ermakova)
URL: https://simpletext-project.com/ (L. Ermakova)
ORCID: 0000-0002-7598-7474 (L. Ermakova); 0000-0003-1726-3360 (I. Ovchinnikov); 0000-0002-6614-0087 (J. Kamps);
0000-0002-6620-7771 (D. Nurbakova); 0000-0003-4321-4511 (S. Araújo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   To help readers stay up to date with scientific advances, text simplification can be used. To
facilitate reading, traditional methods try to eliminate complex concepts and constructions [1].
However, this is not always possible, especially in the case of scientific literature. Readers
of a popular science publication therefore rely on their experience of processing new information
and recognize when they need a definition or clarification of an unfamiliar term whose concept
they do not understand.
   To alleviate the lack of background knowledge that can prevent proper comprehension [2],
we argue that a simplification method should provide the information essential to understanding
complex scientific concepts. This is one of the objectives of the CLEF 2022 SimpleText lab. Despite
recent efforts in automatic text simplification (e.g. [3]), automatically improving the
comprehensibility of scientific texts and adapting them to different audiences
remains an open challenge.
   The CLEF 2022 SimpleText track1 is an open forum for researchers and practitioners working
on the automatic generation of simplified summaries of scientific texts. It is a new evaluation
lab that follows up on the CLEF 2021 SimpleText Workshop [4]. The track provides data and
benchmarks for discussing the challenges of automatic text simplification and proposes the
following interconnected tasks:
Task 1: What is in (or out)? Select passages to include in a simplified summary, given a
     query.
Task 2: What is unclear? Given a passage and a query, rank terms/concepts that are required
     to be explained for understanding this passage (definitions, context, applications, etc.).
Task 3: Rewrite this! Given a query, simplify passages from scientific abstracts.
This paper focuses on the second task, complexity spotting. For details of the other tasks, we
refer to the overview papers of Task 1 [5] and Task 3 [6], or to the Track overview paper [7].
   In the CLEF 2022 edition of SimpleText, a total of 62 teams registered for the track.
A total of 40 users downloaded data from the server. A total of 9 distinct teams submitted 24
runs, of which 10 runs were updated. Detailed statistics on the runs submitted to the shared tasks
are presented in Table 1. As can be seen, four teams participated in Task 2.
   The rest of this paper is structured as follows. Section 2 presents a brief overview
of related work, including other evaluation initiatives, related tasks and related approaches.
Section 3 provides a detailed description of the complexity spotting task itself, the submitted
runs, and the evaluation protocol. In Section 4, we discuss the results of the official submissions.
We end with Section 5, which discusses the results and findings, and lessons for the future.


2. Related work
According to the Cambridge Dictionary [16], a term is “a word or expression used in relation
to a particular subject, often to describe something official or technical”. Almost the same
definition of terms is given by Kageura and Marshman [17], who describe them as “lexical items
that represent concepts of a domain”. Thus, terms form the core vocabulary of a specific and
specialised domain.

    1 https://simpletext-project.com

Table 1
CLEF 2022 SimpleText official run submission statistics
 Team                        Task 1          Task 2          Task 3           Total runs
 aaac                        -               1 (1 updated)   -                1
 CLARA-HD [8]                -               -               1                1
 CYUT Team2 [9]              1               -               1                2
 HULAT-UC3M [10]             -               -               10 (4 updated)   10
 LEA_T5 [11]                 -               1               1                2
 NLP@IISERB [12]             3 (3 updated)   -               -                3
 PortLinguE [13]             -               -               1 (1 updated)    1
 SimpleScientificText [14]   -               1 (1 updated)   -                1
 UAms [15]                   2               1               -                3
 Total runs                  6               4               14               24


2.1. Term Complexity
Term perception can be rather ambiguous and subjective [18], especially when it comes to
assessing term complexity. Indeed, the discrepancy between the basic competence of a reader and
the professional competence of the author of a scientific article gives rise to the subjective
complexity of terminology. The objective complexity of terminology stems from the particular
characteristics of terminological systems. In this section, we clarify the objective complexity
of terminology caused by the complexity of research areas, research traditions and socio-cultural diversity.
   Terminology belongs to professional and scientific discourse, where so-called
languages for special purposes exist. As part of a language for special purposes, terminological
systems do not share the peculiarities of the general lexicon [19]. A terminological system tends to
avoid synonymy and polysemy, but has to provide a term for each concept within the system of
concepts of the domain. According to the General Theory of Terminology, which is based on
the work of Eugen Wüster (see the description in [20]), terminological systems support univocity
(an unambiguous match of the term to its concept). This general approach is still relevant in
technical communication, where professionals (technical writers, translators, etc.) use term
banks, e.g. Eurodicautom2, Termium3, LEXIS4 [21], Normaterm5 [22], and the Grand dictionnaire
terminologique6 (formerly the Banque de terminologie du Québec). In academia, this approach
is mostly applied to terminological systems in Science and Computer Science; however, it is not
relevant for Cognitive Science (e.g., Neuroscience) and Humanities.

    2 A database for terminology and translations created and used by the European Commission, replaced in 2007
      by Interactive Terminology for Europe (IATE), https://iate.europa.eu/
    3 A linguistic and terminology database owned by the Translation Bureau of Public Services and Procurement
      Canada, https://www.btb.termiumplus.gc.ca/
    4 A German term bank used by technical translators.
    5 A French term bank covering science and technology fields and developed by AFNOR.
    6 A term bank created by the Quebec Board of the French Language, https://gdt.oqlf.gouv.qc.ca/

   The complexity of a terminological system derives from scientific complexity. The
complexity of a scientific area depends on particular attributes and conditions [23]. The most
basic ones are the numerosity of entities and of their interactions: a research area is complex
when many entities interact in diverse, disordered ways. To refer
to the entities, their interactions and degrees of disorder, the research area needs a complex
terminology. Ladyman et al. [24] proposed determining the complexity of a research area according
to five qualitative conditions: numerosity of elements, numerosity of interactions, disorder,
openness, and feedback. For terminological systems, the numerosity of elements and of
interactions in a complex research area requires a rich and clearly structured system of terms,
preferably a taxonomy. Transparency of the structure of the terminological system facilitates the
research, analysis and description of disordered systems and of their non-equilibrium
states. The effect of the numerosity of elements and their interactions on the complexity of the
terminological system of a research area becomes evident when comparing two areas
that attract a wide readership: Neuroscience and Computer Science [25].
   The complexity of terminology is also associated with the formal representation (signifier)
of a term. Putting aside borrowings, we would like to mention symbols and abbreviations
(acronyms, backronyms, syllabic abbreviations, clippings, etc.). Symbols and abbreviations are
among the peculiarities of a language for special purposes. The symbolic language of science uses
symbols and abbreviations to optimize the transfer of content and to standardize the naming
of numerous elements, of the frequent interactions among them, and of standard data
processing procedures. Languages for special purposes in the Natural and Mathematical Sciences
(including Computer Science) contain complicated systems of symbols. Symbols
and abbreviations are nevertheless in use in all research areas, regardless of their complexity,
and readers of popularized publications expect explanations of them.
   Another cause of terminological complexity is research tradition. Neuroscience and
computer science are both new research areas. Nevertheless, humans became curious about
the brain and how to treat its damage thousands of years ago; the brain has attracted researchers’
attention since the very first steps in practical medicine. Neuroscientific terminology reflects
the rich traditions of brain study in the history of science: Latin (e.g. cerebellum ‘little brain’)
and Greek (e.g. diencephalon ‘interbrain’) borrowings, eponyms (e.g. Broca’s area), metaphors (e.g.
hemispheres), etc. The diversity of traditions provides neuroscience with parallel terms that
refer to the same concept (e.g., names of diseases [26]). Understanding neuroscientific
terminology requires knowledge of how the science developed.
   Computer science began to develop its traditions mostly in the middle of the twentieth century;
therefore, it lacks Latin and Greek terminology as well as numerous eponyms. Compared to
neuroscience, the terminology of computer science seems less complicated and more transparent
to non-professionals; moreover, an average reader of popularized science understands many
terms because they use computers in their everyday routine. The readership of popular science
publications is probably familiar with the basic terminology of this area, while neuroscientific
terminology requires definitions and clarifications.
   The complexity of terminology is often caused by the socio-cultural diversity of the readership of
popular science publications. This diversity shows in the comprehension of the basic terminology
of Science and Humanities, which is shaped by programs of secondary and college education.
These programs provide people with the grounding needed to comprehend current news in
popular science. Since the content of the programs varies across institutions and countries,
readers differ in their background and terminological lexicon, especially in Humanities.
   While popularizing science, journalists substitute complex terms with basic ones or clarify
the underlying concept denoted by the complex term. While enhancing the readability of a popular
science text, popularization may damage its comprehensibility. Both ways of
avoiding complex terminology may lead to misinformation or distortion of the content. Term
substitution may distort the content because semantic relations in terminological systems
are not similar to those in the general lexicon of the language. It is presupposed that the network
of connections within a terminological system does not support synonyms and maintains a
transparent one-to-one relationship between the term and the concept it refers to. A list of
potential substitutions usually includes a widespread name of the concept, if one exists in the
general lexicon (e.g. sea cow instead of manatee), a hypernym (e.g. herbivorous marine mammal for
manatee) and co-hyponyms of the complex term with additional explanation, since co-hyponyms
denote a different object (quality, action, etc.) within the same category. Moreover,
common-sense concepts are not equal to scientific concepts in complex research areas; therefore,
appealing to common sense requires clarifications. Thus, term substitutions do not enhance
the structure of a popular scientific text. Probably, the best way to clarify a term is to illustrate
its concept [27].
   For automatic systems that generate popular reviews of scientific publications,
we need to choose a method for term recognition and extraction. In order to substitute or clarify
an unfamiliar term, we need to recognize it in scientific discourse and then provide readers
with references, definitions or illustrations.
   To summarize our consideration of the complexity of terminology, the choice of a
way to facilitate the perception of terms in popular scientific publications depends on the complexity
of the research area, the richness of its research tradition, and cultural diversity.

2.2. Automatic Terminology Extraction
Automatic Term Extraction (ATE), or Automatic Terminology Extraction, is an automated process
of detecting terms in a corpus of specialised texts. It has been a relevant NLP task since the 1980s
and remains challenging from several perspectives, such as data collection (creation of manually
annotated domain-specific corpora), extraction algorithms (definition of term length, minimum
term frequency, term POS-pattern), and evaluation (usually limited to precision, as
the information about all terms in a text is often missing) [18].
   The ATE methods are traditionally classified in three groups:

    • Linguistic methods: these methods are based on linguistic properties such as POS-patterns
      or other morpho-syntactic patterns (e.g. [28, 29]).
    • Statistical methods: these methods are based on statistical properties (various weightings
      have been proposed, e.g. frequency, mutual information, log-likelihood ratio, etc.) and
      usually analyse 𝑛-grams measuring termhood or unithood [30].
    • Hybrid methods: these methods are combinations of the previous two (e.g. [31]). Usually,
      the initial selection is performed based on linguistic properties which is followed by the
      ranking procedure on the basis of statistical measures [18]. Hybrid approaches have been
      shown to outperform linguistic or statistical methods [32].

    As stated in [18], one of the difficulties is defining an appropriate cut-off threshold for
term candidates.
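
    The hybrid scheme described above (a linguistic filter followed by statistical ranking and a cut-off) can be sketched as follows. This is only an illustrative toy, not a method from [18] or from any participant: the stop-word filter is a crude stand-in for a real POS-pattern filter, and the list `STOPWORDS`, the weighting (frequency times length in words) and the parameter `min_freq` are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use
# POS patterns (e.g. adjective/noun sequences) instead.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "for",
             "is", "are", "with", "by", "that", "this", "we"}


def candidates(text, max_len=3):
    """Linguistic step: yield n-grams (n <= max_len) that contain no stop word."""
    for sentence in re.split(r"[.!?;]", text.lower()):
        tokens = re.findall(r"[a-z][a-z\-]*", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    yield " ".join(gram)


def extract_terms(text, min_freq=2, max_len=3):
    """Statistical step: rank candidates and apply a frequency cut-off."""
    counts = Counter(candidates(text, max_len))
    # Assumed weighting: frequency multiplied by the number of words.
    scored = {t: c * len(t.split()) for t, c in counts.items() if c >= min_freq}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


text = ("Genetic algorithm design is hard. A genetic algorithm evolves solutions. "
        "The genetic algorithm converges.")
ranked = extract_terms(text)
```

    Raising or lowering `min_freq` changes which candidates survive, which is exactly the threshold choice that [18] reports as difficult.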
    Recent advances in Machine Learning techniques, including Deep Learning models, have
made the taxonomy of ATE methodology more complex and diverse [33]. Numerous methods
have been proposed (e.g. [34, 35]).
    Lately, large transformer models such as Jurassic-1 [36], Google’s T5 [37], BERT [38], or
GPT-3 [39] have been shown to be successful on several NLP tasks, outperforming other
state-of-the-art models. They make use of subword tokenizers, such as Byte-Pair Encoding (BPE) [40]
and WordPiece [41]. For instance, BPE, which segments words into subword
units, is used in GPT-2 [42] and RoBERTa [43]. A similar subword tokenization algorithm,
WordPiece, is used in BERT [38], DistilBERT [44], and ELECTRA [45]. Despite the comparative
shallowness of these models, they have been shown to be quite effective for the related use case
of languages with large vocabularies and many rare words [46, 40]. Therefore, their use might
be promising for terminology extraction.
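
    To make the idea of subword segmentation concrete, the following toy sketch learns BPE merges from a word list and applies them to a new word. It follows the general idea of repeatedly merging the most frequent adjacent symbol pair; it is not the tokenizer of GPT-2, RoBERTa or any other model cited above, and the function names, the end-of-word marker `</w>` and the tie-breaking rule are choices made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe(["qubit", "qubit", "qubit"], num_merges=10)
seen = segment("qubit", merges)     # a word seen in training
unseen = segment("qutrit", merges)  # a rare, unseen word
```

    A word seen often enough collapses into a single unit, whereas an unseen word such as "qutrit" stays split into smaller pieces, which is why subword models can still represent rare scientific terms.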
    In the context of term extraction from scientific texts with the final goal of text simplification,
it is also important to consider named entities. Named entities are objects, abstract or physical,
such as a person, location, organization, product, etc., that can be denoted with a proper name.
They can also designate certain natural terms, such as biological species or substances [47]. For a
recent survey of existing deep learning techniques for the Named Entity Recognition (NER) task,
refer to [48].

2.3. Related Evaluation Initiatives
This section presents a brief overview of related evaluation initiatives, related tasks and related
approaches.
   The CLEF SimpleText track was first accepted in 2020 (see [49] for an overview of the first edition
of the CLEF SimpleText workshop). However, other initiatives at NLP conferences have addressed
related topics in scholarly document processing.
   The lack of background knowledge can become a barrier to reading comprehension and there
is a knowledge threshold allowing reading comprehension [2]. Scientific text simplification
presupposes the facilitation of readers’ understanding of complex content by establishing
links to basic lexicon while traditional methods of text simplification try to eliminate complex
concepts and constructions [1]. SimpleText is not limited to a “Split and Rephrase” task [50]
but also aims to provide sufficient context for a scientific text. Entity linking could mitigate
the background knowledge problem by providing definitions, illustrations, examples, and
related entities, but the existing entity linking datasets are focused on people, places, and
organisations [51], while a non-expert reader of a scientific article needs assistance with new
concepts and methods. The INEX/CLEF’11-14 Tweet Contextualization [52] and CLEF’16-17 Cultural
Microblog Contextualization [53] tracks aim to provide the background knowledge lacking in a
tweet. Besides the completely different nature of tweets and popular science, this use case differs
from text simplification because the lack of background knowledge is due to the tweet length.
In contrast to the Background Linking task at the TREC’20 News Track [54], SimpleText focuses
on (1) scientific text; (2) selection of notions to be explained; (3) helpfulness of the provided
information rather than its relevance.
   Probably the closest evaluation campaign to SimpleText’s Task 2 is TermEval 2020: Shared
Task on Automatic Term Extraction Using Annotated Corpora for Term Extraction Research (ACTER)
Dataset [18]. One of the stated challenges of term extraction methodology is
defining the degree of specialisation or domain-specificity required for a lexical item to
be considered a term. This aspect, which is difficult to quantify, is partially tackled under the “term
difficulty” goal of Task 2 of the CLEF SimpleText lab. TermEval was set up as a binary task:
term or not. In contrast, SimpleText aims at detecting a term and identifying its difficulty
level.

Datasets. Simple Wikipedia-based datasets could be useful to train AI models, but (1) they are
not scientific publications; (2) there is no direct correspondence between Wikipedia and Simple
Wikipedia articles [55]. Another dataset was introduced at the TAC 2014 Biomedical Summarization
Track [56] with the goal of retrieving important aspects of a paper from the perspective of the
community. In the TermEval task [18], the organisers proposed ACTER, a manually annotated
domain-specific corpus covering three languages (English, French, and Dutch) and four domains
(corruption, dressage (equitation), heart failure, and wind energy). The annotators labelled
around 50k tokens for each language and domain. The tokens were judged according to their
degree of domain-specificity and lexicon-specificity. Three term labels were used: Specific Terms
(i.e. domain- and lexicon-specific), Common Terms (domain-specific, not lexicon-specific), and
Out-of-Domain (OOD) Terms (not domain-specific, lexicon-specific). In SimpleText, we focus
on term difficulty, which is in line with the lexicon-specificity of the TermEval task (in particular, when
using the 3-point scale), without assessing domain-specificity.
    In contrast, we evaluate simplification in terms of lexical and syntactic complexity
combined with error analysis. As we demonstrated previously, scientific information is often
distorted accidentally due to misunderstanding of terminology, omission of essential details,
insertion of erroneous background, etc. [55]. Information distortion analysis is close to scientific
claim verification [57, 58], but fact checking is limited to searching for relevant evidence and deciding
whether it supports the claim. Another closely related work is [59], where the TF-IDF cosine similarity
between documents is computed on (1) a collection of abstracts of scientific papers from the
Citation Network Dataset V1 AMINER [60] and (2) a set of articles from the Huffington Post.
However, this approach is not robust to lexical changes, which are crucial for text simplification.
To the best of our knowledge, no other automatic or semi-automatic method for information
distortion analysis exists.
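
    The sensitivity of TF-IDF cosine similarity to lexical changes can be illustrated with a small, self-contained sketch. It is not the implementation used in [59]; the tokenizer, the smoothed IDF formula and the three toy documents are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "quantum computing uses qubits",       # original wording
    "quantum computers use quantum bits",  # paraphrase with lexical changes
    "manatees are marine mammals",         # unrelated text
]
v_orig, v_para, v_other = tfidf_vectors(docs)
sim_para = cosine(v_orig, v_para)
sim_other = cosine(v_orig, v_other)
```

    In this toy case, the paraphrase shares only the token "quantum" with the original, so its similarity is well below 1 although the meaning is close; this is the kind of lexical brittleness noted above.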


3. CLEF 2022 SimpleText Task 2 Test Collection
In this section, we discuss the second task, complexity spotting in a sentence extracted
from a scientific abstract. The task is defined as follows:
      Given a passage and a query, rank terms/concepts that are required to be explained
      for understanding this passage (definitions, context, applications etc.).
   The goal of this task is to decide which terms (up to 5) require explanation and contextualization
to help a reader understand a complex scientific text, for example, which terms, with regard to a
query, need to be contextualized (with a definition, example and/or use case). For
each passage, participants should provide a ranked list of difficult terms with corresponding
scores on a scale of 1-3 (3 for the most difficult terms, 1 for terms whose meaning
can be derived or guessed) and on a scale of 1-5 (5 for the most difficult terms). Passages
(sentences) are considered independent, i.e. the same difficult term may be reported for several passages.

3.1. Train Data
For this task, the data cover two domains, Medicine and Computer Science, as these are the
most popular domains on forums like ELI5 [25, 61]. As in 2021, for Computer Science, we use scientific
abstracts from the Citation Network Dataset: DBLP+Citation, ACM Citation network (12th
version)7 [49]. A master’s student in Technical Writing and Translation manually annotated each
sentence by extracting difficult terms and assigning difficulty scores on a scale of 1-3 (3 for
the most difficult terms, 1 for terms whose meaning can be derived or guessed) and on
a scale of 1-5 (5 for the most difficult terms).
   In 2022, we introduced new data based on Google Scholar and PubMed articles on muscle
hypertrophy and health, annotated by a master’s student in Technical Writing and Translation
specializing in these domains. The selected abstracts included the objectives of the study, the
results and sometimes the methodology. Abstracts that stated only the topic of the study
were excluded because of the lack of information. To avoid the curse of knowledge, another
master’s student in Technical Writing and Translation, not familiar with the domain, was asked
to perform complexity spotting.
   We provided 453 annotated examples in total.

3.2. Test Data
To construct the test data, we retrieved 116,763 sentences from the DBLP abstracts according
to the queries from Task 1. We then manually evaluated 592 distinct sentences for 11 queries.
For the query Digital assistant, we took the first 1,000 sentences retrieved by ElasticSearch. We
pooled the terms submitted by all participants for all these queries, yielding 4,167
distinct sentence-term pairs in total. We ensured that for each evaluated source sentence the
pool contained the results of all participants. Statistics on the number of evaluated sentences
per query for Task 2 are given in Table 2.
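
The pooling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the organisers' actual script: the record fields `snt_id` and `term` follow the submission format of this task, whereas the function names, the `runs` mapping and the whitespace/case normalisation of terms are hypothetical choices made for this example.

```python
def normalise(term):
    """Assumed normalisation: lowercase and collapse whitespace."""
    return " ".join(term.lower().split())


def build_pool(runs):
    """Union of distinct (sentence id, term) pairs over all submitted runs.

    `runs` maps a run identifier to a list of records with 'snt_id' and 'term'.
    """
    pool = set()
    for records in runs.values():
        for rec in records:
            pool.add((rec["snt_id"], normalise(rec["term"])))
    return pool


def fully_covered_sentences(runs):
    """Sentence ids for which every run submitted at least one term."""
    per_run = [{rec["snt_id"] for rec in records} for records in runs.values()]
    return set.intersection(*per_run) if per_run else set()


# Toy runs with made-up identifiers.
runs = {
    "teamA_task_2_run1": [
        {"snt_id": "s1", "term": "Sybil attack"},
        {"snt_id": "s2", "term": "qubit"},
    ],
    "teamB_task_2_run1": [
        {"snt_id": "s1", "term": "sybil  attack"},
        {"snt_id": "s1", "term": "wormhole attack"},
    ],
}
pool = build_pool(runs)
covered = fully_covered_sentences(runs)
```

Restricting manual evaluation to the sentences returned by `fully_covered_sentences` is one way to ensure that the pool contains the results of all participants for each evaluated sentence.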

3.3. Input and Output Formats
The input for the train and the test data was provided in JSON and CSV formats with the
following fields:

snt_id a unique passage (sentence) identifier.

source_snt passage text.

doc_id a unique source document identifier.

query_id a query ID.

query_text difficult terms should be extracted from sentences with regard to this query.

   7 https://www.aminer.org/citation

Table 2
SimpleText Task 2: Statistics of the number of evaluated sentences per query
  #   Query                    # Sentences   # Sentence-term pairs
  1   guessing attack                   60                     389
  2   end to end encryption             55                     390
  3   imbalanced data                   55                     381
  4   distributed attack                54                     385
  5   genetic algorithm                 51                     374
  6   quantum computing                 51                     385
  7   qbit                              50                     363
  8   side-channel attack               49                     340
  9   traffic optimization              47                     344
 10   quantum applications              42                     320
 11   cyber-security                    35                     244
 12   conspiracy theories               23                     180
 13   crowdsourcing                     15                     104
 14   digital assistant                  5                      32

Input example (JSON format):
{"snt_id": "G06.2_2548923997_3",
 "source_snt": "These communication systems render self-driving vehicles vulnerable to many types of malicious attacks, such as Sybil attacks, Denial of Service (DoS), black hole, grey hole and wormhole attacks.",
 "doc_id": 2548923997,
 "query_id": "G06.2",
 "query_text": "self driving"}
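
A record in this format can be loaded with a few lines of code. The sketch below is only an illustration: the `Passage` class and `parse_passage` function are hypothetical helpers, and it assumes one JSON object per record, which the task description does not specify.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Passage:
    """One input record; field names follow the input format described above."""
    snt_id: str
    source_snt: str
    doc_id: int
    query_id: str
    query_text: str


def parse_passage(raw_json):
    """Parse one JSON object into a Passage; raises KeyError if a field is missing."""
    raw = json.loads(raw_json)
    return Passage(**{f.name: raw[f.name] for f in fields(Passage)})


example = (
    '{"snt_id":"G06.2_2548923997_3", "source_snt":"These communication systems render '
    'self-driving vehicles vulnerable to many types of malicious attacks, such as '
    'Sybil attacks, Denial of Service (DoS), black hole, grey hole and wormhole '
    'attacks.", "doc_id":2548923997, "query_id":"G06.2", "query_text":"self driving"}'
)
passage = parse_passage(example)
```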

   Participants had to submit a list of terms to be contextualized in JSON format or as a
tab-separated (TSV) file (for manual runs) with the following fields:

run_id Run ID starting with (team_id)_(task_id)_(name).

manual Whether the run is manual {0, 1}.

snt_id a unique passage (sentence) identifier from the input file.

term Term or other phrase to be explained.

term_rank_snt term difficulty rank within the given sentence.

score_5 term difficulty score on the scale from 1 to 5 (5 to be the most difficult terms).

score_3 term difficulty score on the scale from 1 to 3 (3 to be the most difficult terms).

Output example (JSON format):
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"black hole attack", "term_rank_snt":1, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"grey hole attack", "term_rank_snt":2, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"Sybil attack", "term_rank_snt":3, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"wormhole attack", "term_rank_snt":4, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"Denial of service attack", "term_rank_snt":5, "score_5":4, "score_3":3}
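
Before submitting, a run in this format could be checked with a small validator such as the sketch below. It is not an official checker: the field names, the score ranges and the limit of five terms per sentence come from the task description, whereas the function `validate_run` and the requirement that ranks within a sentence be exactly 1..k are assumptions made for this example.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("run_id", "manual", "snt_id", "term",
                   "term_rank_snt", "score_5", "score_3")
MAX_TERMS_PER_SENTENCE = 5  # the task asks for up to 5 terms per passage


def in_range(value, low, high):
    """True if value is an integer within [low, high]."""
    return isinstance(value, int) and low <= value <= high


def validate_run(records):
    """Return a list of human-readable problems; an empty list means the run looks valid."""
    problems = []
    ranks_by_sentence = defaultdict(list)
    for i, rec in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in rec]
        if missing:
            problems.append(f"record {i}: missing fields {missing}")
            continue
        if rec["manual"] not in (0, 1):
            problems.append(f"record {i}: 'manual' must be 0 or 1")
        if not in_range(rec["score_5"], 1, 5):
            problems.append(f"record {i}: 'score_5' must be an integer in 1..5")
        if not in_range(rec["score_3"], 1, 3):
            problems.append(f"record {i}: 'score_3' must be an integer in 1..3")
        ranks_by_sentence[rec["snt_id"]].append(rec["term_rank_snt"])
    for snt_id, ranks in ranks_by_sentence.items():
        if len(ranks) > MAX_TERMS_PER_SENTENCE:
            problems.append(f"{snt_id}: more than {MAX_TERMS_PER_SENTENCE} terms")
        # Assumption: ranks within a sentence are exactly 1..k without gaps or ties.
        if sorted(ranks, key=str) != list(range(1, len(ranks) + 1)):
            problems.append(f"{snt_id}: ranks are not 1..{len(ranks)}")
    return problems


good = [
    {"run_id": "NP_task_2_run1", "manual": 1, "snt_id": "G06.2_2548923997_3",
     "term": "black hole attack", "term_rank_snt": 1, "score_5": 5, "score_3": 3},
    {"run_id": "NP_task_2_run1", "manual": 1, "snt_id": "G06.2_2548923997_3",
     "term": "Denial of service attack", "term_rank_snt": 2, "score_5": 4, "score_3": 3},
]
bad = [dict(good[0], score_3=4)]
```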


3.4. Evaluation metrics
We evaluated terms according to:

    • correctness of term limits;
    • term difficulty score on the scale 1-3;
    • term difficulty score on the scale 1-5.

   For both scales of term difficulty, we annotated terms on a 1-7 scale that we subsequently
converted to the target scale. The 1-7 scale was chosen following the psycholinguistic research
on the perception and evaluation of lexical meanings performed by Osgood and his colleagues [62],
in contrast to the psychometric Likert scale (1-5: Strongly disagree / Disagree / Neither agree
nor disagree / Agree / Strongly agree) commonly used in questionnaire-based research [63]. In the
classical version of the semantic differential technique, the scale captures the variety of human
perception of semantic nuances from negative (-3) to positive (+3) polarity, where 0 marks the
“norm” [62]. The 1-7 scale matches Osgood’s scale and seems more suitable for evaluating concepts
and features, since it avoids associations with negative or positive assessment. Since the 1970s,
the scale has been employed in various studies as an evaluation tool for qualitative features.
   Tables 3 and 4 provide examples of the term difficulty scale used. We separate the examples of
abbreviations from those of non-abbreviated phrases and words.
   We added 0 for terms that should not be explained at all and we converted the original scale
1-7 as presented in Table 5.
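
The conversion can be sketched as two lookup tables. The groupings below are read off the column spans of Table 5 and should be taken as our interpretation of that table rather than an official specification; the names TO_5_POINT, TO_3_POINT, and convert are illustrative.

```python
# Mapping from the 0-7 annotation scale to the 5- and 3-point scales.
# Assumption: the groupings follow the column spans of Table 5
# (5-point: 2-3 -> 2, 5-6 -> 4; 3-point: 1-2 -> 1, 3-5 -> 2, 6-7 -> 3).
TO_5_POINT = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5}
TO_3_POINT = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3}


def convert(grade):
    """Convert a 0-7 difficulty grade to a (score_5, score_3) pair."""
    if grade not in TO_5_POINT:
        raise ValueError(f"grade must be an integer from 0 to 7, got {grade!r}")
    return TO_5_POINT[grade], TO_3_POINT[grade]


for grade in range(8):
    print(grade, convert(grade))
```

Under this reading, a term annotated with grade 6 is scored 4 on the 5-point scale and 3 on the 3-point scale.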
   Table 6 provides some examples of the annotation for Task 2. Term refers to the terms
retrieved by participants; Limits OK is a binary category showing whether the retrieved term is
correctly delimited; Corrected gives the corrected term limits where a correction was needed; and
Difficulty is the term difficulty score on the 1-7 scale (0 for terms that need no explanation).


4. SimpleText Task 2 Results
In this section, we discuss the results for the official submissions to Task 2.

4.1. Participant Approaches
A total of 4 teams submitted runs, of which 2 runs were updated.

Table 3
Examples of the term difficulty scale used for evaluation: grades 4-7. Difficult terms are
highlighted in green.

Grade 7
   Non-abbreviated (ordinary) term:
    • “The qubit—qutrit pair acts as a closed system and one external qubit serve as the
      environment for the pair.”
   Abbreviation:
    • XCSFHP in “We compared XCSFHP to XCSF on several problems.”
    • rECGA in “The effect of alphabet cardinality and the selection pressure on the scalability
      of the real-coded ECGA (rECGA) method is investigated.”
    • QFI in “We here study the protection of quantum Fisher information (QFI) of the phase
      parameter in entangled-atom states within the framework of independently dissipative
      environments and driven individually by classical fields.”

Grade 6
   Non-abbreviated (ordinary) term:
    • “This paper bring forward based on immune genetic algorithm to solve man on board automated
      storage and retrieval system optimized problem, immune genetic algorithm remains the
      characteristic which is not ...”
    • “Tile coding is a well-known function approximator that has been successfully applied to
      many reinforcement learning tasks.”
    • “Quantum circuits of many qubits are challenging to implement making designs with low qubit
      cost desirable.”
   Abbreviation:
    • “XCS with computed prediction, namely XCSF, extends XCS by replacing the classifier
      prediction with a parametrized prediction function.”
    • SCA in “Side-channel attack (SCA) is a very efficient cryptanalysis technology to attack
      cryptographic devices.”

Grade 5
   Non-abbreviated (ordinary) term:
    • “Experiment simulation result express: the result of immune genetic algorithm is better than
      traditional genetic algorithm in the circumstance of the same clusters and the same
      evolution generation.”
    • “The results show that the population size required by rECGA-to successfully solve a class
      of additively-separable problems-scales sub-quadratically with problem size and the number
      of function evaluations scales sub-cubically with problem size.”
   Abbreviation:
    • XECGA in “This paper presents a simple real-coded estimation of distribution algorithm (EDA)
      design using x-ary extended compact genetic algorithm (XECGA) and discretization methods.”

Grade 4
   Non-abbreviated (ordinary) term:
    • “Specifically, the real-valued decision variables are mapped to discrete symbols of
      user-specified cardinality using discretization methods.”
    • “Immune genetic algorithm can shorten storage or retrieval distance in application, and
      enhance storage or retrieval efficiency.”
    • “The effect of alphabet cardinality and the selection pressure on the scalability of the
      real-coded ECGA (rECGA) method is investigated.”
    • “Deep learning has become increasingly popular in both academic and industrial areas in the
      past years.”
   Abbreviation:
    • EDA in “This paper presents a simple real-coded estimation of distribution algorithm (EDA)
      design using x-ary extended compact genetic algorithm (XECGA) and discretization methods.”

Table 4
Examples of the term difficulty scale used for evaluation: grades 0-3. Difficult terms are
highlighted in green.

Grade 3
   Non-abbreviated (ordinary) term:
    • “The XECGA is then used to build the probabilistic model and to sample a new population
      based on the probabilistic model.”
    • scale sub-quadratically in “The results show that the population size required by rECGA-to
      successfully solve a class of additively-separable problems-scales sub-quadratically with
      problem size and the number of function evaluations scales sub-cubically with problem size.”
    • “Molecular transistors can play a very important role in the design and fabrication of
      complex logic functions inside chips.”
   Abbreviation:
    • “We evaluate each measure’s performance by AUC which is usually used for evaluation of
      imbalanced data classification.”
    • “This theoretical analysis is confirmed by the experimental results: using several sampling
      methods to rebalance the imbalanced data sets, it is found that the performances of LDA on
      balanced data sets are superior to those of LDA on imbalanced data sets.”

Grade 2
   Non-abbreviated (ordinary) term:
    • “Experiment simulation result express: the result of immune genetic algorithm is better than
      traditional genetic algorithm in the circumstance of the same clusters and the same
      evolution generation.”
    • “Specifically, the real-valued decision variables are mapped to discrete symbols of
      user-specified cardinality using discretization methods.”
   Abbreviation:
    • NIST (The National Institute of Standards and Technology) in “Recently NIST has published
      the second draft document of recommendation for the entropy sources used for random bit
      generation.”

Grade 1
   Non-abbreviated (ordinary) term:
    • “video labeling game is a crowsourcing tool to collect user-generated metadata for video
      clips.”
    • “On the other hand, a 3dimensional (3D) map, which is one of major themes in machine vision
      research, has been utilized as a simulation tool in city and landscape planning, and other
      engineering fields.”
   Abbreviation:
    • 2D (2-dimensional), 3D (3-dimensional) maps as in “The 3D maps will give more intuitive
      information compared to conventional 2-dimensional (2D) ones.”

Grade 0
   Non-abbreviated (ordinary) term:
    • “This device has two work modes: ‘native’ and ‘remote’.”
    • “Immune genetic algorithm can shorten storage or retrieval distance in application, and
      enhance storage or retrieval efficiency.”
    • “The proposed rECGA is simple, making it amenable for further empirical and theoretical
      analysis.”
   Abbreviation:
    • et al. (from Latin “et alii” meaning “and others”) in “However, Nam et al. pointed out...”


  Team UAms from the University of Amsterdam [15] performed experiments using IDF-based term
weighting to locate the rarest terms. The resulting rarity measure was then balanced against the
relevance or centrality of the terms to the given passage.
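
To make the idea of IDF-based rarity concrete, here is a minimal sketch of inverse document frequency computed over a toy collection. It is not Team UAms’ implementation: the tokenizer, the smoothed IDF formula, the toy corpus, and all function names are assumptions chosen for illustration, and the balancing against relevance or centrality is omitted.

```python
import math
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def inverse_document_frequency(documents):
    """Map each token to a smoothed IDF value: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for token in set(tokenize(doc)):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def rank_by_rarity(sentence, idf):
    """Rank the distinct tokens of a sentence from rarest to most common in the collection."""
    tokens = set(tokenize(sentence))
    # Tokens unseen in the collection fall back to the highest observed IDF.
    default = max(idf.values(), default=0.0)
    return sorted(tokens, key=lambda t: (-idf.get(t, default), t))


corpus = [
    "the attack on the network",
    "the wormhole attack is studied",
    "the network is secure",
]
idf = inverse_document_frequency(corpus)
print(rank_by_rarity("the wormhole attack", idf))  # prints ['wormhole', 'attack', 'the']
```

In this toy collection, "wormhole" occurs in one document, "attack" in two, and "the" in all three, so the ranking goes from the rarest to the most common token.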
Table 5
SimpleText Task 2: Scale conversion rules
Scale                Term difficulty score
7-point scale        0    1    2    3    4    5    6    7
⇒ 5-point scale      0    1    2    2    3    4    4    5
⇒ 3-point scale      0    1    1    2    2    2    3    3


Table 6
SimpleText Task 2: Examples of the annotation
Sentence                                                  Term             Limits OK   Corrected    Difficulty
This device has two work modes: ‘native’ and ‘remote’.    remote           YES                      1
This device has two work modes: ‘native’ and ‘remote’.    work modes       YES                      0
This device has two work modes: ‘native’ and ‘remote’.    modes native     NO          work modes   0
This device has two work modes: ‘native’ and ‘remote’.    device work      NO          device       0
This device has two work modes: ‘native’ and ‘remote’.    native remote    NO          native       1


  Team SimpleScientificText from Wuhan University [14] used a pipeline of term recognition
and complexity spotting, formulating the latter as a classification task. Term recognition was
performed in two main steps: term extraction using KeyBERT⁸, followed by filtering based on the
similarity between the extracted terms and the query, calculated with PhraseSimilarity⁹. The
complexity evaluation model is built upon three groups of features (lexical, syntactic, and
semantic) and combines various state-of-the-art classification models using a soft voting
strategy.
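
Soft voting averages the class probabilities predicted by several classifiers and selects the class with the highest mean. The sketch below shows the mechanism only: the probabilities are made-up toy values, and the function name soft_vote and the equal default weights are assumptions, not details of the team's system.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    probabilities: list of dicts, one per model, mapping class label -> probability.
    weights: optional list of model weights (defaults to equal weights).
    Returns (winning_label, averaged_probabilities).
    """
    if not probabilities:
        raise ValueError("at least one model prediction is required")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    labels = sorted({label for model in probabilities for label in model})
    averaged = {
        label: sum(w * model.get(label, 0.0) for w, model in zip(weights, probabilities)) / total
        for label in labels
    }
    # Ties are broken in favour of the label that sorts first.
    winner = max(labels, key=lambda label: averaged[label])
    return winner, averaged


# Toy predictions from three hypothetical classifiers for one term (difficulty 1-3).
model_outputs = [
    {1: 0.05, 2: 0.90, 3: 0.05},
    {1: 0.05, 2: 0.40, 3: 0.55},
    {1: 0.05, 2: 0.40, 3: 0.55},
]
label, averaged = soft_vote(model_outputs)
print(label)  # prints 2
```

In this toy case, a hard majority vote would pick class 3 (two of three models prefer it), whereas soft voting picks class 2 because the first model is far more confident.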
   Team LEA_T5 [11] from the University of Western Brittany (UBO) used the T5¹⁰ model [64] via
the SimpleT5 library¹¹ as the core of their approach. The Google T5 (Text-To-Text Transfer
Transformer) model is based on transfer learning with a unified text-to-text transformer [64].
   Team aaac has not provided any details about their run.

4.2. Results
The results are given in Tables 7 and 8. In both tables, we report the number of terms with
correctly attributed scores regardless of the correctness of term limits (Score_3 and Score_5)
and the number of correctly limited terms with correctly attributed scores (+Limits). Table 7
provides the results on all the sentences we evaluated. However, to obtain comparable results for
partial runs, we also report scores on a subset of 167 common sentences in Table 8; we had to
exclude the run lea_t5 from this comparison due to its very low number of evaluated sentences.


    ⁸ https://github.com/MaartenGr/KeyBERT
    ⁹ https://github.com/franplk/PhraseSimilarity
    ¹⁰ https://github.com/google-research/text-to-text-transfer-transformer
    ¹¹ https://github.com/Shivanandroy/simpleT5
Table 7
SimpleText Task 2: Results for the official runs
Run                       Total     Evaluated             Score_3             Score_5
                                      All   +Limits       All   +Limits       All   +Limits
aaac                    581,285     2,951     1,388       702       318       415       175
SimpleScientificText     63,027       298       262        48        44        47        42
UAms                    263,022     1,315     1,175       105        69        60        49
lea_t5                   23,331         5         4         0         0         0         0


Table 8
SimpleText Task 2: Results on a subset of 167 common sentences
Run                       Total     Evaluated             Score_3             Score_5
                                      All   +Limits       All   +Limits       All   +Limits
aaac                    581,285       833       414       200       104       127        67
UAms                    263,022       574       514        46        28        25        21
SimpleScientificText     63,027       208       188        33        32        32        29
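
Because the runs differ greatly in size, proportions can be easier to compare than raw counts. The snippet below derives, from the counts in Table 7, the share of evaluated terms with correct limits and with correct scores. These ratios are our own convenience calculation and are not official metrics of the task; the names TABLE_7 and share are illustrative.

```python
# Counts copied from Table 7:
# (evaluated, evaluated with correct limits, correct Score_3, correct Score_5).
TABLE_7 = {
    "aaac": (2951, 1388, 702, 415),
    "SimpleScientificText": (298, 262, 48, 47),
    "UAms": (1315, 1175, 105, 60),
    "lea_t5": (5, 4, 0, 0),
}


def share(part, whole):
    """Return part / whole as a percentage rounded to one decimal, or None if whole is 0."""
    return None if whole == 0 else round(100.0 * part / whole, 1)


for run, (evaluated, limits_ok, score_3_ok, score_5_ok) in TABLE_7.items():
    print(
        f"{run:22s} limits: {share(limits_ok, evaluated):5.1f}%  "
        f"score_3: {share(score_3_ok, evaluated):5.1f}%  "
        f"score_5: {share(score_5_ok, evaluated):5.1f}%"
    )
```

For example, about 47.0% of the evaluated aaac terms have correct limits, against 87.9% for SimpleScientificText and 89.4% for UAms.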


5. Conclusion and future work
This paper gave an overview of Task 2 of the CLEF 2022 SimpleText track, which aims at
identifying and ranking difficult terms within scientific texts. We evaluated term difficulty
with regard to the queries from Task 1. For Task 2, we created a corpus of sentences extracted
from the abstracts of scientific publications, with manual annotations of term complexity.
   Next year, we will extend Task 2 by asking participants to provide context for difficult
terms, and we will work on automatic metrics based on the insights we obtained this year. This
context should provide a definition and take into account ordinary readers’ need to relate their
particular problems to the opportunities that science offers for solving them [25]. This year,
the HULAT-UC3M team [10] submitted runs that combine Tasks 2 and 3, which demonstrates the strong
interconnection of the tasks: often the terminology can be neither removed nor simplified and
instead needs to be explained to the reader.
   Further details about the lab can be found at the SimpleText website:
http://simpletext-project.com. Please join us and help to make scientific results understandable!

Acknowledgments
We would like to acknowledge the support of the Lab Chairs of CLEF 2022, Allan Hanbury and Martin
Potthast, and to thank them for their help and patience. Special thanks to the University Translation Office
of the Université de Bretagne Occidentale, to Nicolas Poinsu and Ludivine Grégoire for their major
contribution to the construction of the training data, and to Léa Talec-Bernard and Julien Boccou for their
help in the evaluation of participants’ runs. We thank Josiane Mothe for reviewing papers. We also thank
Alain Kerhervé and the MaDICS (https://www.madics.fr/ateliers/simpletext/) research group.

References
 [1] M. Maddela, W. Xu, A Word-Complexity Lexicon and A Neural Readability Ranking Model for
     Lexical Simplification, in: Proc. of EMNLP 2018, ACL, Brussels, Belgium, 2018, pp. 3749–3760. URL:
     https://www.aclweb.org/anthology/D18-1410.
 [2] T. O’Reilly, Z. Wang, J. Sabatini, How Much Knowledge Is Too Little? When a Lack of Knowledge
     Becomes a Barrier to Comprehension, Psychological Science (2019). URL:
     https://journals.sagepub.com/doi/10.1177/0956797619862276.
 [3] M. Maddela, F. Alva-Manchego, W. Xu, Controllable Text Simplification with Explicit Paraphrasing
     (2021). URL: http://arxiv.org/abs/2010.11004.
 [4] L. Ermakova, P. Bellot, P. Braslavski, J. Kamps, J. Mothe, D. Nurbakova, I. Ovchinnikova, E. Sanjuan,
     Text Simplification for Scientific Information Access: CLEF 2021 SimpleText Workshop, in: Advances
     in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Lucca, Italy, March
     28 – April 1, 2021, Proc., Lucca, Italy, 2021.
 [5] E. SanJuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2022 SimpleText Task 1: Passage
     selection for a simplified summary, in: [65], 2022.
 [6] L. Ermakova, I. Ovchinnikova, J. Kamps, D. Nurbakova, S. Araújo, R. Hannachi, Overview of the
     CLEF 2022 SimpleText Task 3: Query biased simplification of scientific texts, in: [65], 2022.
 [7] L. Ermakova, E. SanJuan, J. Kamps, S. Huet, I. Ovchinnikova, D. Nurbakova, S. Araújo, R. Hannachi,
     É. Mathurin, P. Bellot, Overview of the CLEF 2022 SimpleText Lab: Automatic simplification of
     scientific texts, in: A. Barrón-Cedeño, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald,
     G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), CLEF’22: Proceedings of the Thirteenth
     International Conference of the CLEF Association, Lecture Notes in Computer Science, Springer,
     2022.
 [8] A. Menta, A. Garcia-Serrano, Controllable Sentence Simplification Using Transfer Learning, in:
     Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum,
     Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings, CEUR-WS.org, Bologna,
     Italy, 2022.
 [9] S.-H. Wu, H.-Y. Huang, CYUT Team2 SimpleText Shared Task Report in CLEF-2022, in: Proceedings
     of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy,
     September 5th - to - 8th, 2022, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[10] A. Rubio, P. Martínez, HULAT-UC3M at SimpleText@CLEF-2022: Scientific text simplification
     using BART, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
     Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings,
     CEUR-WS.org, Bologna, Italy, 2022.
[11] L. Talec-Bernard, Is Using an AI to Simplify a Scientific Text Really Worth It?, in: Proceedings of
     the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy,
     September 5th - to - 8th, 2022, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[12] S. Saha, D. Roy, B. Y. Goud, C. S. Reddy, T. Basu, NLP-IISERB@Simpletext2022: To Explore the
     Performance of BM25 and Transformer Based Frameworks for Automatic Simplification of Scientific
     Texts, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation
     Forum, Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings, CEUR-WS.org,
     Bologna, Italy, 2022.
[13] J. Monteiro, M. Aguiar, S. Araújo, Using a Pre-trained SimpleT5 Model for Text Simplification in a
     limited Corpus, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
     Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings,
     CEUR-WS.org, Bologna, Italy, 2022.
[14] H. Jianfei, M. Jin, Assembly Models for SimpleText Task 2: Results from Wuhan University Research
     Group, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation
     Forum, Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings, CEUR-WS.org,
     Bologna, Italy, 2022.
[15] F. Mostert, A. Sampatsing, M. Spronk, J. Kamps, University of Amsterdam at the CLEF 2022
     SimpleText Track, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of
     the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, CEUR Workshop Proceedings,
     CEUR-WS.org, Bologna, Italy, 2022.
[16] term, Cambridge Dictionary. URL: https://dictionary.cambridge.org/dictionary/english/term.
[17] K. Kageura, E. Marshman, Terminology extraction and management, in: M. O’Hagan (Ed.), The
     Routledge Handbook of Translation and Technology, 1 ed., Routledge, Abingdon, Oxon; New York,
     NY, 2019, pp. 61–77. URL:
     https://www.taylorfrancis.com/books/9781315311241/chapters/10.4324/9781315311258-4.
     doi:10.4324/9781315311258-4.
[18] A. Rigouts Terryn, V. Hoste, P. Drouin, E. Lefever, TermEval 2020: Shared Task on Automatic
     Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset,
     in: Proceedings of the 6th International Workshop on Computational Terminology, European
     Language Resources Association, Marseille, France, 2020, pp. 85–94. URL: https://aclanthology.org/
     2020.computerm-1.12.
[19] B.-L. Gunnarsson, Language for Special Purposes, in: G. R. Tucker, D. Corson (Eds.), Encyclopedia
     of Language and Education, Springer Netherlands, Dordrecht, 1997, pp. 105–117. URL: http://link.
     springer.com/10.1007/978-94-011-4419-3_11. doi:10.1007/978-94-011-4419-3_11.
[20] M. Trojar, Wüster’s View of Terminology, Slovenski jezik / Slovene Linguistic Studies 11 (2017).
     URL: https://ojs.zrc-sazu.si/sjsls/article/view/7344.
[21] E. Hoffmann, The LEXIS termbank, in: Proceedings of Translating and the Computer 9: Potential
     and practice, Aslib, London, UK, 1987. URL: https://aclanthology.org/1987.tc-1.14.
[22] C. Hermetet-Filez, Des activités de normalisation... à l’élaboration d’un dictionnaire, Cahiers de
     l’APLIUT 9 (1990) 36–39. URL: https://www.persee.fr/doc/apliu_0248-9430_1990_num_9_3_2106.
     doi:10.3406/apliu.1990.2106.
[23] K. Wiesner, J. Ladyman, Measuring complexity (2019). URL: https://arxiv.org/abs/1909.13243.
     doi:10.48550/ARXIV.1909.13243.
[24] J. Ladyman, K. Wiesner, What is a complex system?, Yale University Press, 2020. URL: https:
     //yalebooks.yale.edu/book/9780300251104/what-complex-system/.
[25] I. Ovchinnikova, D. Nurbakova, L. Ermakova, What science-related topics need to be popularized?
     A comparative study, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proc. of the
     Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania,
     September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, 2021, pp. 2242–2255.
     URL: http://ceur-ws.org/Vol-2936/paper-203.pdf.
[26] B. J. Good, M.-J. Vecchio Good, The Semantics of Medical Discourse, in: R. D. Whitley,
     E. Mendelsohn, Y. Elkana (Eds.), Sciences and Cultures, volume 5, Springer Netherlands, Dor-
     drecht, 1981, pp. 177–212. URL: http://link.springer.com/10.1007/978-94-009-8429-5_6. doi:10.
     1007/978-94-009-8429-5_6.
[27] A. M. Silletti, The Role of Illustrations in Popularizing Medical Discourse, Linguæ & - Rivista di
     lingue e culture moderne (2015) 65–81. URL: http://www.ledonline.it/index.php/linguae/article/
     view/839. doi:10.7358/ling-2015-002-sill.
[28] D. Bourigault, Surface grammatical analysis for the extraction of terminological noun phrases,
     in: Proceedings of the 14th conference on Computational linguistics, volume 3, Association for
     Computational Linguistics, Nantes, France, 1992, p. 977. URL: http://portal.acm.org/citation.cfm?
     doid=992383.992415. doi:10.3115/992383.992415.
[29] D. A. Evans, R. G. Lefferts, CLARIT-TREC experiments, in: Proceedings of the second conference
     on Text retrieval conference, TREC-2, Pergamon Press, Inc., USA, 1995, pp. 385–395.
[30] K. Kageura, B. Umino, Methods of automatic term recognition: A review, Terminology. International
     Journal of Theoretical and Applied Issues in Specialized Communication 3 (1996) 259–289. URL:
     http://www.jbe-platform.com/content/journals/10.1075/term.3.2.03kag. doi:10.1075/term.3.2.
     03kag.
[31] V. Kosa, D. Chaves-Fraga, H. Dobrovolskyi, V. Ermolayev, Optimized Term Extraction Method
     Based on Computing Merged Partial C-Values, in: V. Ermolayev, F. Mallet, V. Yakovyna, H. C. Mayr,
     A. Spivakovsky (Eds.), Information and Communication Technologies in Education, Research, and
     Industrial Applications, volume 1175, Springer International Publishing, Cham, 2020, pp. 24–49. URL:
     http://link.springer.com/10.1007/978-3-030-39459-2_2. doi:10.1007/978-3-030-39459-2_2.
[32] J. Valaski, S. Reinehr, A. Malucelli, Approaches and Strategies to Extract Relevant Terms: How Are
     They Being Applied?, in: Proceedings of The 2015 World Congress in Computer Science, Computer
     Engineering, and Applied Computing (WorldComp’15), Monte Carlo Resort, Las Vegas, USA, 2015,
     pp. 478–484. URL: http://worldcomp-proceedings.com/proc/p2015/ICA2668.pdf.
[33] Y. Gao, Y. Yuan, Feature-Less End-to-End Nested Term Extraction, in: J. Tang, M.-Y. Kan,
     D. Zhao, S. Li, H. Zan (Eds.), Natural Language Processing and Chinese Computing, volume
     11839, Springer International Publishing, Cham, 2019, pp. 607–616. URL: http://link.springer.com/
     10.1007/978-3-030-32236-6_55. doi:10.1007/978-3-030-32236-6_55.
[34] M. Kucza, J. Niehues, T. Zenkel, A. Waibel, S. Stüker, Term Extraction via Neural Sequence Labeling
     a Comparative Evaluation of Strategies Using Recurrent Neural Networks, in: Interspeech 2018,
     ISCA, 2018, pp. 2072–2076. URL: https://www.isca-speech.org/archive/interspeech_2018/kucza18_
     interspeech.html. doi:10.21437/Interspeech.2018-2017.
[35] S. Shah, S. S, S. Reddy, Similarity Driven Unsupervised Learning for Materials Science Terminology
     Extraction, Computación y Sistemas 23 (2019). URL: https://www.cys.cic.ipn.mx/ojs/index.php/
     CyS/article/view/3266. doi:10.13053/cys-23-3-3266.
[36] O. Lieber, O. Sharir, B. Lentz, Y. Shoham, Jurassic-1: Technical Details and Evaluation (2021) 9.
[37] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A
     massively multilingual pre-trained text-to-text transformer, in: Proc. of the 2021 Conference of
     the North American Chapter of the ACL: Human Language Technologies, ACL, Online, 2021, pp.
     483–498. URL: https://aclanthology.org/2021.naacl-main.41.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
     Attention is all you need, 2017. URL: http://arxiv.org/abs/1706.03762.
[39] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
     D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
     C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot
     Learners (2020). URL: http://arxiv.org/abs/2005.14165.
[40] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in:
     Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1,
     Association for Computational Linguistics, 2016, pp. 1715–1725. URL: http://aclweb.org/anthology/
     P16-1162. doi:10.18653/v1/P16-1162.
[41] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg,
     G. Corrado, et al., Google’s multilingual neural machine translation system: Enabling zero-shot
     translation, Transactions of the Association for Computational Linguistics 5 (2017) 339–351.
[42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsuper-
     vised multitask learners, OpenAI blog 1 (2019) 9.
[43] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
     Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[44] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
     cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[45] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators
     rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[46] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information,
     Transactions of the Association for Computational Linguistics 5 (2017) 135–146. URL:
     https://doi.org/10.1162/tacl_a_00051. doi:10.1162/tacl_a_00051.
[47] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae
     Investigationes 30 (2007) 3–26.
[48] J. Li, A. Sun, J. Han, C. Li, A Survey on Deep Learning for Named Entity Recognition, IEEE
     Transactions on Knowledge and Data Engineering 34 (2022) 50–70. URL: https://ieeexplore.ieee.
     org/document/9039685/. doi:10.1109/TKDE.2020.2981314.
[49] L. Ermakova, P. Bellot, P. Braslavski, J. Kamps, J. Mothe, D. Nurbakova, I. Ovchinnikova, E. SanJuan,
     Overview of SimpleText 2021 - CLEF Workshop on Text Simplification for Scientific Information
     Access, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi,
     G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction,
     Lecture Notes in Computer Science, Springer International Publishing, Cham, 2021, pp. 432–449.
[50] S. Narayan, C. Gardent, S. B. Cohen, A. Shimorina, Split and Rephrase, in: Proc. of EMNLP 2017,
     ACL, Copenhagen, Denmark, 2017, pp. 606–616. URL: https://www.aclweb.org/anthology/D17-1064.
[51] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater,
     G. Weikum, Robust disambiguation of named entities in text, in: Proc. of EMNLP 2011, 2011, pp.
     782–792.
[52] P. Bellot, V. Moriceau, J. Mothe, E. SanJuan, X. Tannier, INEX tweet contextualization task:
     Evaluation, results and lesson learned, Inf. Process. Manage. 52 (2016) 801–819. URL: https://doi.
     org/10.1016/j.ipm.2016.03.002.
[53] L. Ermakova, L. Goeuriot, J. Mothe, P. Mulhem, J.-Y. Nie, E. SanJuan, CLEF 2017 Microblog Cultural
     Contextualization Lab Overview, in: Experimental IR Meets Multilinguality, Multimodality, and
     Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland,
     September 11-14, 2017, Proc., 2017, pp. 304–314. URL: https://doi.org/10.1007/978-3-319-65813-1_27.
[54] A. Anand Deshmukh, U. Sethi, IR-BERT: Leveraging BERT for Semantic Search in Background
     Linking for News Articles (2020). URL: http://adsabs.harvard.edu/abs/2020arXiv200712603A.
[55] L. N. Ermakova, D. Nurbakova, I. Ovchinnikova, Covid or not Covid? Topic Shift in Information
     Cascades on Twitter, in: Association for Computational Linguistics (Ed.), 3rd International
     Workshop on Rumours and Deception in Social Media (RDSM) Collocated with COLING 2020, Proc. of
     the 3rd International Workshop on Rumours and Deception in Social Media (RDSM), Barcelona
     (online), Spain, 2020, pp. 32–37. URL: https://hal.archives-ouvertes.fr/hal-03066857.
[56] Text Analysis Conference (TAC) 2014 Biomedical Summarization Track, 2014. URL: https://tac.nist.
     gov/2014/BiomedSumm/.
[57] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or Fiction:
     Verifying Scientific Claims (2020). URL: http://arxiv.org/abs/2004.14974.
[58] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar,
     G. D. S. Martino, Automated Fact-Checking for Assisting Human Fact-Checkers (2021). URL:
     http://arxiv.org/abs/2103.07769.
[59] R. Pradeep, X. Ma, R. Nogueira, J. Lin, Scientific Claim Verification with VERT5ERINI (2020). URL:
     http://arxiv.org/abs/2010.11930.
[60] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic
     social networks, in: Proceeding of the 14th ACM SIGKDD international conference on Knowledge
     discovery and data mining - KDD 08, ACM Press, Las Vegas, Nevada, USA, 2008, p. 990. URL:
     http://dl.acm.org/citation.cfm?doid=1401890.1402008.
[61] L. Ermakova, P. Bellot, J. Kamps, D. Nurbakova, I. Ovchinnikova, E. SanJuan, E. Mathurin, S. Araújo,
     R. Hannachi, S. Huet, N. Poinsu, Automatic Simplification of Scientific Texts: SimpleText Lab at
     CLEF-2022, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.),
     Advances in Information Retrieval, volume 13186, Springer International Publishing, Cham, 2022,
     pp. 364–373.
[62] C. E. Osgood, Semantic Differential Technique in the Comparative Study of Cultures, American
     Anthropologist 66 (1964) 171–200. URL:
     https://onlinelibrary.wiley.com/doi/abs/10.1525/aa.1964.66.3.02a00880.
[63] R. Likert, A technique for the measurement of attitudes, Archives of Psychology 22 (140) (1932)
     1–55.
[64] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
     the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
     Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[65] G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proc. of the Working Notes of CLEF 2022:
     Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2022.