Overview of the CLEF 2022 SimpleText Task 2: Complexity Spotting in Scientific Abstracts

Liana Ermakova (liana.ermakova@univ-brest.fr), Université de Bretagne Occidentale, HCTI, Brest, France

Irina Ovchinnikov, ManPower Language Solution, Israel

Jaap Kamps, University of Amsterdam, Amsterdam, The Netherlands

Diana Nurbakova, University of Lyon, INSA Lyon, CNRS, LIRIS, UMR5205, F-69621 Villeurbanne, France

Sílvia Araújo, Universidade do Minho, CEHUM, 4710-057 Braga, Portugal

Radia Hannachi, Université de Bretagne Sud, HCTI, 56321 Lorient, France

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy

ISSN: 1613-0073

Keywords: automatic text simplification, terminology, background knowledge, scientific article, science popularization, contextualization, term difficulty

Project website: https://simpletext-project.com/ (L. Ermakova)

ORCID: 0000-0002-7598-7474 (L. Ermakova); 0000-0003-1726-3360 (I. Ovchinnikov); 0000-0002-6614-0087 (J. Kamps); 0000-0002-6620-7771 (D. Nurbakova); 0000-0003-4321-4511 (S. Araújo)

This paper provides an overview of Task 2, "What is unclear?", of the Automatic Simplification of Scientific Texts (SimpleText) lab, run as part of CLEF 2022. The main aim of the SimpleText lab is to promote more open access to scientific information via automatic text simplification. Task 2 focuses on complexity spotting within passages of scientific text: the goal is to detect the terms and concepts that require specific background knowledge to understand the passage, and to assess their complexity for non-experts. Overall, four runs from four different teams were submitted to this task. In this paper, we describe the data collection, the task setup, and the evaluation procedure. We also give a brief overview of the participating approaches.

Introduction

Nowadays, digitalisation has made scientific literature more available to every citizen. However, an important barrier still prevents citizens from accessing objective scientific knowledge from the original sources. One of the key issues is the high complexity of scientific texts for non-experts, who lack the required background knowledge, including an understanding of the terminology. Even for native speakers, it is hard to understand terminology beyond their area of expertise. Nevertheless, the basic set of terms that the general public acquires through secondary and college education allows them to comprehend popular science publications. Comprehending a term presupposes grasping the concept it refers to without any definition. To understand a concept, we need to place it within a structured system in our semantic memory, which can require more knowledge than we have learned.

To help readers stay up to date with scientific advances, text simplification can be used. To facilitate reading, traditional methods try to eliminate complex concepts and constructions [1]. However, this is not always possible, especially in the case of scientific literature. Thus, readers of a popular science publication lean on their experience of processing new information and recognize when they need a definition or clarification of an unfamiliar term whose concept they do not understand.

To alleviate the lack of background knowledge that can prevent proper comprehension [2], we argue that a simplification method should provide the information essential to understanding complex scientific concepts. This is one of the objectives of the CLEF 2022 SimpleText lab. Despite recent efforts in automatic text simplification (e.g. [3]), automatically improving the comprehensibility of scientific text and adapting it to different audiences remains an open challenge.

The CLEF 2022 SimpleText track is an open forum for researchers and practitioners working on the automatic generation of simplified summaries of scientific texts. It is a new evaluation lab that follows up on the CLEF 2021 SimpleText Workshop [4]. The track provides data and benchmarks for discussing the challenges of automatic text simplification and proposes the following interconnected tasks:

Task 1: What is in (or out)? Select passages to include in a simplified summary, given a query.

Task 2: What is unclear? Given a passage and a query, rank terms/concepts that are required to be explained for understanding this passage (definitions, context, applications, etc.).

Task 3: Rewrite this! Given a query, simplify passages from scientific abstracts.

This paper focuses on the second task, complexity spotting. For details of the other tasks, we refer to the overview papers of Task 1 [5] and Task 3 [6], or the Track overview paper [7].

In the CLEF 2022 edition, a total of 62 teams registered for the SimpleText track. A total of 40 users downloaded data from the server. A total of 9 distinct teams submitted 24 runs, of which 10 runs were updated. Detailed statistics on the runs submitted for the shared tasks are presented in Table 1. As can be seen, four teams participated in Task 2.

The rest of this paper is structured as follows. Section 2 presents a brief overview of related work, including other evaluation initiatives, related tasks and related approaches. Section 3 provides a detailed description of the complexity spotting task itself, the submitted runs, and the evaluation protocol. In Section 4, we discuss the results of the official submissions. We end with Section 5, which discusses the findings and lessons for the future.

Related work

According to the Cambridge Dictionary [16], a term is "a word or expression used in relation to a particular subject, often to describe something official or technical". Almost the same definition of terms is given by Kageura and Marshman [17], who describe them as "lexical items that represent concepts of a domain". Thus, terms form the core vocabulary of a specific and specialised domain.

Table 1 rows as extracted (team: submitted runs; total runs). Empty cells were lost in extraction, so the per-task column of each count is not shown:

aaac: 1 (1 updated); total 1
CLARA-HD [8]: 1; total 1
CYUT Team2 [9]: 1, 1; total 2
HULAT-UC3M [10]: 10 (4 updated); total 10
LEA_T5 [11]: 1, 1; total 2
NLP@IISERB [12]: 3 (3 updated); total 3
PortLinguE [13]: 1 (1 updated); total 1
SimpleScientificText [14]: 1 (1 updated)

Term Complexity

Term perception can be rather ambiguous and subjective [18], especially when it comes to assessing term complexity. Indeed, the discrepancy between the basic competence of a reader and the professional competence of the author of a scientific article gives rise to the subjective complexity of terminology. The objective complexity of terminology derives from the particular characteristics of terminological systems. In this section, we clarify the objective complexity of terminology caused by the complexity of research areas, research traditions and socio-cultural diversity. Terminology belongs to professional and scientific discourse, where so-called languages for special purposes exist. Because they belong to a language for special purposes, terminological systems do not share the peculiarities of the general lexicon [19]. A terminological system tends to avoid synonymy and polysemy, but has to provide a term for each concept within the system of concepts of the domain. According to the General Theory of Terminology, which is based on the work of Eugen Wüster (see the description in [20]), terminological systems support univocity (an unambiguous match of a term to its concept). This general approach is still relevant in technical communication, where professionals (technical writers, translators, etc.) use term banks, e.g. Eurodicautom, Termium, LEXIS [21], Normaterm [22], and the Grand dictionnaire terminologique (formerly the Banque de terminologie du Québec). In academia, this approach is mostly applied to terminological systems in Science and Computer Science; however, it is not relevant for Cognitive Science (e.g., Neuroscience) and the Humanities.

The complexity of a terminological system derives from scientific complexity. The complexity of a scientific area depends on particular attributes and conditions [23]. The most basic attributes are the numerosity of the entities involved and of their interactions: a high diversity of disordered interactions among multiple entities characterises a complex research area. To refer to the entities, their interactions and degrees of disorder, the research area needs complex terminology. Ladyman et al. [24] proposed determining the complexity of a research area according to five qualitative conditions: numerosity of elements, numerosity of interactions, disorder, openness, and feedback. For terminological systems, the numerosity of elements and of interactions in a complex research area requires a rich and clearly structured system of terms, preferably a taxonomy. Transparency of the structure of the terminological system facilitates the research, analysis and description of disordered systems and of their non-equilibrium states. The effect of the numerosity of elements and their interactions on the complexity of the terminological system becomes obvious when comparing areas that attract the interest of a wide readership, such as Neuroscience and Computer Science [25].

The complexity of terminology is also associated with the formal representation (signifier) of a term. Putting aside borrowings, we would like to mention symbols and abbreviations (acronyms, backronyms, syllabic abbreviations, clippings, etc.). Symbols and abbreviations are characteristic of a language for special purposes. The symbolic language of science uses them to optimize the transfer of content and to standardize the naming of numerous elements, of the frequent interactions among them, and of standard data processing procedures. Languages for special purposes in the Natural Sciences and Mathematical Sciences (including Computer Science) contain complicated systems of symbols. Meanwhile, symbols and abbreviations are used in all research areas regardless of their complexity. Nevertheless, readers of popularized publications expect symbols and abbreviations to be explained.

Another cause of terminological complexity is research tradition. Neuroscience and computer science are both new research areas. Nevertheless, humans became curious about the brain and how to treat its damage thousands of years ago; the brain has attracted researchers' attention since the very first steps in practical medicine. Neuroscientific terminology reflects the rich traditions of brain study in the history of science: Latin (e.g. cerebellum 'little brain') and Greek (e.g. diencephalon 'interbrain') borrowings, eponyms (e.g. Broca's area), metaphors (e.g. hemispheres), etc. The diversity of these traditions provides neuroscience with parallel terms that refer to the same concept (e.g., names of diseases [26]). Understanding neuroscientific terminology requires knowledge of the development of the science.

Computer science began to develop its traditions mostly in the middle of the 20th century; therefore, it lacks Latin and Greek terminology as well as numerous eponyms. Compared to neuroscience, the terminology of computer science seems less complicated and more transparent to non-professionals; moreover, an average reader of popularized science understands many terms because they use computers in their everyday routine. The readership of popular science publications is probably familiar with the basic terminology of this area, while neuroscientific terminology requires definitions and clarifications.

The complexity of terminology is often caused by the socio-cultural diversity of the readership of popular science publications. This diversity shows in how readers comprehend the basic terminology of Science and the Humanities, which is shaped by secondary and college education programs. These programs give people the grounding needed to comprehend current popular science news. Since the content of the programs varies across institutions and countries, readers differ in their background and terminological lexicon, especially in the Humanities.

When popularizing science, journalists either substitute complex terms with basic ones or clarify the underlying concept denoted by the complex term. While enhancing the readability of the popular science text, popularization may damage its comprehensibility. Both ways of avoiding complex terminology may lead to misinformation or distortion of the content. Term substitution may distort the content because semantic relations in terminological systems are not similar to those in the general lexicon of the language. The network of connections within a terminological system is presumed not to support synonyms and to maintain a transparent one-to-one relationship between a term and the concept it refers to. A list of potential substitutions usually includes a widespread name of the concept, if one exists in the general lexicon (e.g. sea cow instead of manatee), a hypernym (e.g. herbivorous marine mammal for manatee), and co-hyponyms of the complex term with additional explanation, since co-hyponyms denote a different object (quality, action, etc.) within the same category. Meanwhile, commonsense concepts are not equal to scientific concepts in complex research areas; therefore, appealing to common sense requires clarifications. Thus, term substitutions do not enhance the structure of the popular scientific text. Probably the best way to clarify a term is to illustrate its concept [27].

For automatic systems that generate a popular review of scientific publications, we need to choose a method of term recognition and extraction. In order to substitute or clarify an unfamiliar term, we need to recognize it in scientific discourse and then provide readers with references, definitions or illustrations.

To summarize our consideration of the complexity of terminology, the choice of how to facilitate the perception of terms in popular scientific publications depends on the complexity of the research area, the richness of its research tradition, and cultural diversity.

Automatic Terminology Extraction

Automatic Term Extraction (ATE), or Automatic Terminology Extraction, is the automated process of detecting terms in a corpus of specialised texts. It has been a relevant NLP task since the 1980s and remains challenging from several perspectives, such as data collection (creation of manually annotated domain-specific corpora), extraction algorithms (definition of term length, minimum term frequency, term POS pattern), and evaluation (usually limited to precision, as information about all terms in a text is often missing) [18].

The ATE methods are traditionally classified in three groups:

• Linguistic methods: these methods are based on linguistic properties such as POS patterns or other morpho-syntactic patterns (e.g. [28,29]).

• Statistical methods: these methods are based on statistical properties (various weightings have been proposed, e.g. frequency, mutual information, log-likelihood ratio, etc.) and usually analyse 𝑛-grams, measuring termhood or unithood [30].

• Hybrid methods: these methods are combinations of the previous two (e.g. [31]). Usually, an initial selection based on linguistic properties is followed by a ranking procedure based on statistical measures [18]. Hybrid approaches have been shown to outperform linguistic or statistical methods [32].
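The hybrid scheme described above (a linguistic selection step followed by statistical ranking) can be illustrated with a minimal sketch. It is not one of the cited systems: the stopword list, the rule that a candidate may not start or end with a stopword (a crude stand-in for POS patterns), the frequency-times-length score, and the `min_freq` cut-off are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a resource from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
             "for", "with", "by", "as", "that", "this", "be", "can"}

def candidate_terms(text, max_len=3):
    """Linguistic step (crude stand-in for POS patterns): yield n-grams
    that neither start nor end with a stopword."""
    tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)

def rank_terms(corpus, min_freq=2, max_len=3):
    """Statistical step: score candidates by frequency times length in words
    (a simple termhood proxy) and drop those below the frequency cut-off."""
    counts = Counter()
    for doc in corpus:
        counts.update(candidate_terms(doc, max_len))
    scored = {t: c * len(t.split()) for t, c in counts.items() if c >= min_freq}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

corpus = [
    "Immune genetic algorithm can shorten storage or retrieval distance.",
    "The immune genetic algorithm is better than the traditional genetic algorithm.",
]
ranking = rank_terms(corpus)
for term, score in ranking:
    print(f"{score:3d}  {term}")
```

The `min_freq` parameter makes the cut-off problem explicit: raising it removes noisy candidates but also rare, and possibly difficult, terms.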

As stated in [18], one of the difficulties is defining an appropriate cut-off threshold for term candidates.

Recent advances in Machine Learning techniques, including Deep Learning models, have made the taxonomy of ATE methodology more complex and diverse [33]. Numerous methods have been proposed (e.g. [34,35]).

Lately, large transformer models such as Jurassic-1 [36], Google's T5 [37], BERT [38], or GPT-3 [39] have been shown to be successful on several NLP tasks, outperforming other state-of-the-art models. They make use of subword tokenizers, such as Byte-Pair Encoding (BPE) [40] and WordPiece [41]. For instance, BPE, which segments words into subword units, is used in GPT-2 [42] and RoBERTa [43]. A similar subword tokenization algorithm, WordPiece, is used in BERT [38], DistilBERT [44], and ELECTRA [45]. Despite the comparative shallowness of these models, they have been shown to be quite effective for the related use case of languages with large vocabularies and many rare words [46,40]. Therefore, their use might be promising for terminology extraction.
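To make the idea of segmenting words into subword units concrete, the following is a minimal sketch of BPE merge learning on a toy word-frequency table. It follows the general procedure of repeatedly merging the most frequent adjacent symbol pair; the function name `learn_bpe`, the end-of-word marker `</w>`, the alphabetical tie-breaking, and the toy frequencies are assumptions for illustration and do not reproduce the tokenizer of any model cited above.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} table."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy frequencies (invented for the example).
merges, vocab = learn_bpe({"qubit": 5, "qutrit": 3, "bit": 4}, num_merges=3)
print(merges)
print(vocab)
```

After three merges the frequent word "bit" has become a single symbol, while the rarer "qutrit" is still split into smaller units, which is the behaviour that makes subword vocabularies attractive for rare technical terms.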

In the context of term extraction from scientific texts with the final goal of text simplification, it is also important to consider named entities. Named entities are objects, abstract or physical, such as a person, location, organization, product, etc., that can be denoted with a proper name. They can also designate certain natural terms such as biological species and substances [47]. For a recent survey of deep learning techniques for the Named Entity Recognition (NER) task, refer to [48].

Related Evaluation Initiatives

This section presents a brief overview of related evaluation initiatives, related tasks and related approaches.

The CLEF SimpleText track was first accepted in 2020 (see [49] for the overview of the first edition of the CLEF SimpleText workshop). However, there have been other initiatives addressing related topics in scholarly document processing at NLP conferences.

The lack of background knowledge can become a barrier to reading comprehension, and there is a knowledge threshold that allows reading comprehension [2]. Scientific text simplification presupposes facilitating readers' understanding of complex content by establishing links to the basic lexicon, while traditional methods of text simplification try to eliminate complex concepts and constructions [1]. SimpleText is not limited to a "Split and Rephrase" task [50] but also aims to provide sufficient context for a scientific text. Entity linking could mitigate the background knowledge problem by providing definitions, illustrations, examples, and related entities, but the existing entity linking datasets focus on people, places, and organisations [51], while a non-expert reader of a scientific article needs assistance with new concepts and methods. The INEX/CLEF'11-14 Tweet Contextualization [52] and CLEF'16-17 Cultural Microblog Contextualization [53] tracks aimed to provide the background knowledge lacking in a tweet. Besides the completely different nature of tweets and popular science, this use case differs from text simplification because the lack of background knowledge is due to the tweet length.

In contrast to the Background Linking task at TREC'20 News Track [54], SimpleText focuses on (1) scientific text; (2) selection of notions to be explained; (3) helpfulness of the provided information rather than its relevance.

Probably the closest evaluation campaign to SimpleText's Task 2 is TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset [18]. One of the stated challenges of term extraction methodology is defining the degree of specialisation or domain-specificity required for a lexical item to be considered a term. This aspect, which is difficult to quantify, is partially tackled under the "term difficulty" goal of Task 2 of the CLEF SimpleText lab. TermEval was set up as a binary task: term or not. In contrast, SimpleText aims at detecting a term and identifying its difficulty level.

Datasets. Simple Wikipedia-based datasets could be useful to train AI models, but (1) they are not scientific publications; (2) there is no direct correspondence between Wikipedia and Simple Wikipedia articles [55]. Another dataset was introduced at the TAC 2014 Biomedical Summarization Track [56] with the goal of retrieving important aspects of a paper from the perspective of the community. In the TermEval task [18], the organisers proposed ACTER, a collection of manually annotated domain-specific corpora covering three languages (English, French, and Dutch) and four domains (corruption, dressage (equitation), heart failure, and wind energy). The annotators labelled around 50k tokens for each language and domain. The tokens were judged according to their degree of domain-specificity and lexicon-specificity. Three term labels were used: Specific Terms (i.e. domain- and lexicon-specific), Common Terms (domain-specific, not lexicon-specific), and Out-of-Domain (OOD) Terms (not domain-specific, lexicon-specific). In SimpleText, we focus on term difficulty, which is in line with the lexicon-specificity of the TermEval task (in particular, when using the 3-point scale), without assessing domain-specificity.

In contrast, we evaluate simplification in terms of lexical and syntactic complexity combined with error analysis. As we demonstrated previously, scientific information is often distorted accidentally due to misunderstanding of terminology, omission of essential details, insertion of erroneous background, etc. [55]. Information distortion analysis is close to scientific claim verification [57,58], but fact checking is limited to searching for relevant evidence and deciding whether it supports the claim. Another closely related work is [59], where the TF-IDF cosine similarity between documents is computed on (1) a collection of abstracts of scientific papers from the Citation Network Dataset V1 AMINER [60] and (2) a set of articles from the Huffington Post. However, this approach is not robust to lexical changes, which are crucial for text simplification. To the best of our knowledge, no other automatic or semi-automatic method for information distortion analysis exists.

CLEF 2022 SimpleText Task 2 Test Collection

In this section, we discuss the second task about complexity spotting in an extracted sentence from a scientific abstract, addressing the task:

Given a passage and a query, rank terms/concepts that are required to be explained for understanding this passage (definitions, context, applications etc.).

The goal of this task is to decide which terms (up to 5) require explanation and contextualization to help a reader understand a complex scientific text, that is, which terms need to be contextualized (with a definition, example and/or use case) with regard to a query. For each passage, participants should provide a ranked list of difficult terms with corresponding scores on a scale of 1-3 (3 for the most difficult terms, while the meaning of terms scored 1 can be derived or guessed) and on a scale of 1-5 (5 for the most difficult terms). Passages (sentences) are considered independent, i.e. difficult terms may be repeated across passages.

Train Data

For this task, the data covers two domains, Medicine and Computer Science, as these are the most popular on forums like ELI5 [25,61]. As in 2021, for Computer Science we use scientific abstracts from the Citation Network Dataset: DBLP+Citation, ACM Citation network (12th version) [49]. A master's student in Technical Writing and Translation manually annotated each sentence by extracting difficult terms and attributing difficulty scores on a scale of 1-3 (3 for the most difficult terms, while the meaning of terms scored 1 can be derived or guessed) and on a scale of 1-5 (5 for the most difficult terms).

In 2022, we introduced new data based on Google Scholar and PubMed articles on muscle hypertrophy and health, annotated by a master's student in Technical Writing and Translation specializing in these domains. The selected abstracts included the objectives of the study, the results and sometimes the methodology. Abstracts stating only the topic of the study were excluded because of the lack of information. To avoid the curse of knowledge, another master's student in Technical Writing and Translation, not familiar with the domain, was asked to perform the complexity spotting.

We provided 453 annotated examples in total.

Test Data

To construct the test data, we retrieved 116,763 sentences from the DBLP abstracts according to the queries from Task 1. We then manually evaluated 592 distinct sentences for 11 queries.

For the query Digital assistant, we took the first 1,000 sentences retrieved by ElasticSearch. We pooled the terms submitted by all participants for all these queries, which yielded 4,167 distinct sentence-term pairs in total. We ensured that, for each evaluated source sentence, the pool contained the results of all participants. Statistics on the number of evaluated sentences per query for Task 2 are given in Table 2.
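The pooling step can be sketched as a set union over the submitted records. This is only an illustration of the idea, assuming each run is a list of records carrying the `snt_id` and `term` fields of the output format described in this paper; the helper names and the exact-string matching of terms (no case or whitespace normalisation) are assumptions.

```python
def pool_pairs(runs):
    """Return the set of distinct (snt_id, term) pairs submitted in any run."""
    return {(rec["snt_id"], rec["term"]) for run in runs for rec in run}

def terms_by_sentence(pool):
    """Group pooled terms by sentence so that an assessor sees them together."""
    grouped = {}
    for snt_id, term in sorted(pool):
        grouped.setdefault(snt_id, []).append(term)
    return grouped

# Two hypothetical runs that overlap on one pair.
run_a = [{"snt_id": "s1", "term": "Sybil attack"},
         {"snt_id": "s1", "term": "black hole attack"}]
run_b = [{"snt_id": "s1", "term": "Sybil attack"},
         {"snt_id": "s2", "term": "qubit"}]

pool = pool_pairs([run_a, run_b])
grouped = terms_by_sentence(pool)
print(len(pool), grouped)
```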

Input and Output Formats

The input for the train and the test data was provided in JSON and CSV formats with the following fields:

snt_id: a unique passage (sentence) identifier.

source_snt: passage text.

doc_id: a unique source document identifier.

query_id: a query ID.

query_text: difficult terms should be extracted from sentences with regard to this query.

Input example (JSON format):

{"snt_id":"G06.2_2548923997_3",
 "source_snt":"These communication systems render self-driving vehicles vulnerable to many types of malicious attacks, such as Sybil attacks, Denial of Service (DoS), black hole, grey hole and wormhole attacks.",
 "doc_id":2548923997,
 "query_id":"G06.2",
 "query_text":"self driving"}
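As a reading aid, the input record above can be loaded and checked with a few lines of Python. This is a minimal sketch, assuming the input file is a JSON array of such records (the paper only shows a single record); the function name `load_passages` is hypothetical.

```python
import json

INPUT_FIELDS = ("snt_id", "source_snt", "doc_id", "query_id", "query_text")

def load_passages(json_text):
    """Parse a JSON array of input records and check that all fields are present."""
    records = json.loads(json_text)
    for rec in records:
        missing = [f for f in INPUT_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"record {rec.get('snt_id')!r} lacks fields: {missing}")
    return records

# The example record from the task description, wrapped in an array (assumption).
sample = """[{"snt_id": "G06.2_2548923997_3",
  "source_snt": "These communication systems render self-driving vehicles vulnerable to many types of malicious attacks, such as Sybil attacks, Denial of Service (DoS), black hole, grey hole and wormhole attacks.",
  "doc_id": 2548923997,
  "query_id": "G06.2",
  "query_text": "self driving"}]"""

passages = load_passages(sample)
print(passages[0]["snt_id"], "->", passages[0]["query_text"])
```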

Participants had to submit a list of terms to be contextualized in JSON format or as a tab-separated (TSV) file (for manual runs) with the following fields:

run_id: Run ID starting with (team_id)_(task_id)_(name).

manual: whether the run is manual {0, 1}.

snt_id: a unique passage (sentence) identifier from the input file.

term: term or other phrase to be explained.

term_rank_snt: term difficulty rank within the given sentence.

score_5: term difficulty score on the scale from 1 to 5 (5 for the most difficult terms).

score_3: term difficulty score on the scale from 1 to 3 (3 for the most difficult terms).

Output example (JSON format):

{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"black hole attack", "term_rank_snt":1, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"grey hole attack", "term_rank_snt":2, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"Sybil attack", "term_rank_snt":3, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"wormhole attack", "term_rank_snt":4, "score_5":5, "score_3":3},
{"run_id":"NP_task_2_run1", "manual":1, "snt_id":"G06.2_2548923997_3", "term":"Denial of service attack", "term_rank_snt":5, "score_5":4, "score_3":3}
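Records in this output format can be produced and sanity-checked programmatically. The sketch below is an illustration, not an official validator: the helper `make_run_records` is hypothetical, and the checks simply encode the constraints stated in the task description (at most 5 terms per passage, `manual` in {0, 1}, `score_5` in 1-5, `score_3` in 1-3).

```python
import json

def make_run_records(run_id, manual, snt_id, scored_terms):
    """Build output records for one passage.

    scored_terms: list of (term, score_5, score_3), most difficult first;
    the list position becomes term_rank_snt.
    """
    if manual not in (0, 1):
        raise ValueError("manual must be 0 or 1")
    if len(scored_terms) > 5:
        raise ValueError("at most 5 terms per passage")
    records = []
    for rank, (term, score_5, score_3) in enumerate(scored_terms, start=1):
        if not 1 <= score_5 <= 5 or not 1 <= score_3 <= 3:
            raise ValueError(f"score out of range for {term!r}")
        records.append({
            "run_id": run_id, "manual": manual, "snt_id": snt_id,
            "term": term, "term_rank_snt": rank,
            "score_5": score_5, "score_3": score_3,
        })
    return records

records = make_run_records(
    "NP_task_2_run1", 1, "G06.2_2548923997_3",
    [("black hole attack", 5, 3), ("grey hole attack", 5, 3), ("Sybil attack", 5, 3)],
)
print(json.dumps(records, indent=1))
```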

Evaluation metrics

We evaluated terms according to:

• correctness of term limits;

• term difficulty score on the scale 1-3;

• term difficulty score on the scale 1-5.

For both scales of term difficulty, we used a converted scale 1-7. This scale was chosen following the psycho-linguistic research on the perception and evaluation of lexical meanings performed by Osgood and his colleagues [62], in contrast to the psychometric Likert scale (1-5, Strongly disagree/Disagree/Neither agree nor disagree/Agree/Strongly agree) commonly used in research that employs questionnaires [63]. In the classical version of the semantic differential technique, the scale shows the variety of human perception of semantic nuances from negative (-3) to positive (+3) polarity, where 0 marks the "norm" [62]. The scale 1-7 matches Osgood's scale and seems more suitable for evaluating concepts and features, as it avoids associations with negative or positive assessment. Since the 1970s, the scale has been employed in various studies as an evaluation tool for qualitative features.

Tables 3 and 4 provide examples of the term difficulty scale used. We separate the examples of abbreviations from non-abbreviated phrases and words.

We added 0 for terms that should not be explained at all and we converted the original scale 1-7 as presented in Table 5.

Table 6 provides some examples of the annotation for Task 2. TERM refers to the terms retrieved by participants, Correct limits is a binary category showing whether the retrieved term is correctly delimited, Corrected is a possible correction of the retrieved term limits, and Difficulty is the term difficulty score on the scale 1-7.

SimpleText Task 2 Results

In this section, we discuss the results of the official submissions to Task 2.

Participant Approaches

A total of 4 teams submitted runs, of which 2 runs were updated. Team UAms from the University of Amsterdam [15] performed experiments using IDF-based term weighting, which makes it possible to locate the rarest terms. The resulting rarity measure was then balanced against the relevance or centrality of the terms to the given passage.
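The general idea of IDF-based weighting balanced by within-passage prominence can be sketched as follows. This is not the UAms implementation, whose details are given in [15]; the smoothed IDF formula, the use of relative term frequency as a stand-in for centrality, the single-word candidates, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_table(corpus):
    """Smoothed inverse document frequency: rarer words get larger weights."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + d)) + 1.0 for w, d in df.items()}

def rank_rare_terms(passage, corpus, top_k=5):
    """Rank passage words by rarity (IDF) times relative frequency in the passage."""
    idf = idf_table(corpus)
    unseen = math.log(1 + len(corpus)) + 1.0  # weight for words absent from the corpus
    tf = Counter(tokenize(passage))
    total = sum(tf.values())
    scores = {w: (c / total) * idf.get(w, unseen) for w, c in tf.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

# Toy background corpus (invented for the example).
corpus = [
    "the system and the environment",
    "the pair of devices and the attack",
    "the qubit is the unit",
]
top = rank_rare_terms("the qubit and qutrit pair", corpus, top_k=3)
print(top)
```

In this toy setting, the word that never occurs in the background corpus receives the highest weight, while function words that occur in every document sink to the bottom.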

Team SimpleScientificText from Wuhan University [14] used a pipeline of term recognition and complexity spotting, formulating the latter as a classification task. Term recognition was performed in two main steps: term extraction using KeyBERT, followed by filtering based on the similarity of the extracted terms to the query, calculated with PhraseSimilarity. The complexity evaluation model is built on three groups of features (lexical, syntactic and semantic) and combines various state-of-the-art classification models using a soft voting strategy.

Team LEA_T5 [11] from the University of Western Brittany (UBO) used the T5 model [64] via the SimpleT5 library as the core of their approach. The Google T5 (Text-To-Text Transfer Transformer) model is based on transfer learning with a unified text-to-text transformer [64].

Team aaac has not provided any detail about their run.

Results

The results are given in Tables 7 and 8. In both tables, we present the number of correctly attributed scores regardless of the correctness of term limits (Score_3 and Score_5) and the number of correctly limited terms with correctly attributed scores (+ Limits). Table 7 provides the results on all sentences we evaluated. However, to obtain comparable results for partial runs, we also report scores on a subset of 167 common sentences in Table 8, although we had to exclude the run lea_t5 due to its very low number of evaluated sentences.
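The two counts reported in the tables can be sketched as a simple comparison of run records against assessor judgments. This is a minimal illustration, not the official evaluation script: the layout of the `judgments` dictionary, its field names, and the function name are assumptions, and it is assumed that assessor grades and run scores have already been brought to a common scale (the conversion of Table 5 is not reproduced here).

```python
def count_correct(run, judgments, score_field="score_3"):
    """Return (n_score, n_score_and_limits) for one run.

    judgments maps (snt_id, term) to a dict holding the assessor's score under
    score_field and a boolean "correct_limits" (hypothetical layout).
    """
    n_score = n_both = 0
    for rec in run:
        gold = judgments.get((rec["snt_id"], rec["term"]))
        if gold is None:
            continue  # pair was not judged
        if rec[score_field] == gold[score_field]:
            n_score += 1          # correct score, regardless of term limits
            if gold["correct_limits"]:
                n_both += 1       # correct score and correctly limited term
    return n_score, n_both

# Invented judgments and run records, for illustration only.
judgments = {
    ("s1", "Sybil attack"): {"score_3": 3, "correct_limits": True},
    ("s1", "black hole"): {"score_3": 3, "correct_limits": False},
    ("s2", "qubit"): {"score_3": 2, "correct_limits": True},
}
run = [
    {"snt_id": "s1", "term": "Sybil attack", "score_3": 3},
    {"snt_id": "s1", "term": "black hole", "score_3": 3},
    {"snt_id": "s2", "term": "qubit", "score_3": 3},
    {"snt_id": "s3", "term": "unjudged term", "score_3": 1},
]
n_score, n_limits = count_correct(run, judgments)
print(n_score, n_limits)
```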

Conclusion and future work

We overviewed Task 2 of the CLEF 2022 SimpleText track that aims at identifying and ranking difficult terms within scientific texts. We evaluated term difficulty with regard to the queries from Task 1. For Task 2, we created a corpus of sentences extracted from the abstracts of scientific publications, with manual annotations of term complexity.

Next year, we will extend Task 2: participants will be asked to provide context for difficult terms. This context should provide a definition and take into account ordinary readers' need to relate their particular problems to the opportunities that science offers for solving them [25]. We will also work on automatic metrics based on the insights we obtained this year. This year, the HULAT-UC3M team [10] submitted runs that combine Tasks 2 and 3, which demonstrates the strong interconnection of the tasks: often the terminology can be neither removed nor simplified and instead needs to be explained to the reader.

Further details about the lab can be found at the SimpleText website: http://simpletext-project.com. Please join us and help to make scientific results understandable!

Table 1: CLEF 2022 SimpleText official run submission statistics. Columns: Team, Task 1, Task 2, Task 3, Total runs.

Table 2: SimpleText Task 2: Statistics of the number of evaluated sentences per query.

Table 3: Examples of the term difficulty scale used for evaluation. Difficult terms are highlighted with the green color. Columns: Grade, Non-abbreviated (ordinary) term, Abbreviation.

Grade 7
Non-abbreviated (ordinary) term: "The qubit-qutrit pair acts as a closed system and one external qubit serve as the environment for the pair."
Abbreviation: XCSFHP in "We compared XCSFHP to XCSF on several problems."
Abbreviation: "The effect of alphabet cardinality and the selection pressure on the scalability of the real-coded ECGA (rECGA) method is investigated."
Abbreviation: "We here study the protection of quantum Fisher information (QFI) of the phase parameter in entangled-atom states within the framework of independently dissipative environments and driven individually by classical fields."

Grade 6
Non-abbreviated (ordinary) term: "This paper bring forward based on immune genetic algorithm to solve man on board automated storage and retrieval system optimized problem, immune genetic algorithm remains the characteristic which is not ..."
Non-abbreviated (ordinary) term: "Tile coding is a well-known function approximator that has been successfully applied to many reinforcement learning tasks."
Non-abbreviated (ordinary) term: "Quantum circuits of many qubits are challenging to implement making designs with low qubit cost desirable."
Abbreviation: "XCS with computed prediction, namely XCSF, extends XCS by replacing the classifier prediction with a parametrized prediction function."
Abbreviation: "Side-channel attack (SCA) is a very efficient cryptanalysis technology to attack cryptographic devices."

Grade 5
Non-abbreviated (ordinary) term: "Experiment simulation result express: the result of immune genetic algorithm is better than traditional genetic algorithm in the circumstance of the same clusters and the same evolution generation."
Non-abbreviated (ordinary) term: "The results show that the population size required by rECGA-to successfully solve a class of additively-separable problems-scales sub-quadratically with problem size and the number of function evaluations scales sub-cubically with problem size."
Abbreviation: "This paper presents a simple real-coded estimation of distribution algorithm (EDA) design using x-ary extended compact genetic algorithm (XECGA) and discretization methods."

Grade 4
Non-abbreviated (ordinary) term: "Specifically, the real-valued decision variables are mapped to discrete symbols of user-specified cardinality using discretization methods."
Non-abbreviated (ordinary) term: "Immune genetic algorithm can shorten storage or retrieval distance in application, and enhance storage or retrieval efficiency."
Non-abbreviated (ordinary) term: "The effect of alphabet cardinality and the selection pressure on the scalability of the real-coded ECGA (rECGA) method is investigated."
Non-abbreviated (ordinary) term: "Deep learning has become increasingly popular in both academic and industrial areas in the past years."
Abbreviation: "This paper presents a simple real-coded estimation of distribution algorithm (EDA) design using x-ary extended compact genetic algorithm (XECGA) and discretization methods."

Table 4: Examples of the term difficulty scale used for evaluation: grades 0-3. Difficult terms are highlighted with the green color.

Grade 3
Non-abbreviated (ordinary) term:
"The XECGA is then used to build the probabilistic model and to sample a new population based on the probabilistic model."
scale sub-quadratically in "The results show that the population size required by rECGA - to successfully solve a class of additively-separable problems - scales sub-quadratically with problem size and the number of function evaluations scales sub-cubically with problem size."
"Molecular transistors can play a very important role in the design and fabrication of complex logic inside chips."
Abbreviation:
"We evaluate each measure's performance by AUC which is usually used for evaluation of imbalanced data classification."
"This theoretical analysis is confirmed by the experimental results: using several sampling methods to rebalance the imbalanced data sets, it is found that the performances of LDA on balanced data sets are superior to those of LDA on imbalanced data sets."

Grade 2
Non-abbreviated (ordinary) term:
"Experiment simulation result express: the result of immune genetic algorithm is better than traditional genetic algorithm in the circumstance of the same clusters and the same evolution generation."
"Specifically, the real-valued decision variables are mapped to discrete symbols of user-specified cardinality using discretization methods."
Abbreviation:
NIST (The National Institute of Standards and Technology) in "Recently NIST has published the second draft document of recommendation for the entropy sources used for random bit generation."

Grade 1
Non-abbreviated (ordinary) term:
"video labeling game is a crowsourcing tool to collect user-generated metadata for video clips."
"On the other hand, a 3dimensional (3D) map, which is one of major themes in machine vision research, has been utilized as a simulation tool in city and landscape planning, and other engineering fields."
Abbreviation:
2D (2-dimensional), 3D (3-dimensional) maps as in "The 3D maps will give more intuitive information compared to conventional 2-dimensional (2D) ones."

Grade 0
Non-abbreviated (ordinary) term:
"This device has two work modes: "native" and "remote"."
"Immune genetic algorithm can shorten storage or retrieval distance in application, and enhance storage or retrieval efficiency."
"The proposed rECGA is simple, making it amenable for further empirical and theoretical analysis."
Abbreviation:
et al. (from latin "et alii" meaning "and others") in "However, Nam et al. pointed out. . ."

Table 5: SimpleText Task 2: Scale conversion rules
Term difficulty scale: 0 1 2 3 4 5 6 7
7 point scale: 0 1 2 3 4 5 6 7 ⇒ 5 point scale: 0 1 2 3 4 5
7 point scale: 0 1 2 3 4 5 6 7 ⇒ 3 point scale: 0 1 2 3

Table 6: SimpleText Task 2: Examples of the annotation. Columns: Sentence.

Table 7: SimpleText Task 2: Results for the official runs

Run                   Total    Evaluated  +Limits  Score_3  +Limits  Score_5  +Limits
aaac                  581,285  2,951      1,388    702      318      415      175
SimpleScientificText  63,027   298        262      48       44       47       42
UAms                  263,022  1,315      1,175    105      69       60       49
lea_t5                23,331   5          4        0        0        0        0

Table 8: SimpleText Task 2: Results on a subset of 167 common sentences

Run                   Total    Evaluated  +Limits  Score_3  +Limits  Score_5  +Limits
aaac                  581,285  833        414      200      104      127      67
UAms                  263,022  574        514      46       28       25       21
SimpleScientificText  63,027   208        188      33       32       32       29

https://simpletext-project.com
A database for terminology and translations created and used by the European Commission, replaced in 2007 by Interactive Terminology for Europe (IATE), https://iate.europa.eu/
A linguistic and terminology database owned by the Translation Bureau of Public Services and Procurement Canada, https://www.btb.termiumplus.gc.ca/
A German term bank used by technical translators.
A French term bank covering science and technology fields, developed by AFNOR.
A term bank created by the Quebec Board of the French Language, https://gdt.oqlf.gouv.qc.ca/
https://www.aminer.org/citation
https://github.com/MaartenGr/KeyBERT
https://github.com/franplk/PhraseSimilarity
https://github.com/google-research/text-to-text-transfer-transformer
https://github.com/Shivanandroy/simpleT5

Acknowledgments

We would like to acknowledge the support of the Lab Chairs of CLEF 2022, Allan Hanbury and Martin Potthast, for their help and patience. Special thanks go to the University Translation Office of the Université de Bretagne Occidentale, to Nicolas Poinsu and Ludivine Grégoire for their major contribution to the construction of the training data, and to Léa Talec-Bernard and Julien Boccou for their help in the evaluation of participants' runs. We thank Josiane Mothe for reviewing papers. We also thank Alain Kerhervé and the MaDICS (https://www.madics.fr/ateliers/simpletext/) research group.

M. Maddela, W. Xu, A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification, in: Proc. of EMNLP 2018, ACL, Brussels, Belgium, 2018.

T. O'Reilly, Z. Wang, J. Sabatini, How Much Knowledge Is Too Little? When a Lack of Knowledge Becomes a Barrier to Comprehension, Psychological Science, 2019. doi:10.1177/0956797619862276.

M. Maddela, F. Alva-Manchego, W. Xu, Controllable Text Simplification with Explicit Paraphrasing, 2021.

L. Ermakova, P. Bellot, P. Braslavski, J. Kamps, J. Mothe, D. Nurbakova, I. Ovchinnikova, E. Sanjuan, Text Simplification for Scientific Information Access: CLEF 2021 SimpleText Workshop, in: Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Lucca, Italy, March 28 - April 1, 2021, 2021.

E. Sanjuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2022 SimpleText Task 1: Passage selection for a simplified summary, 2022.

L. Ermakova, I. Ovchinnikova, J. Kamps, D. Nurbakova, S. Araújo, R. Hannachi, Overview of the CLEF 2022 SimpleText Task 3: Query biased simplification of scientific texts, 2022.

L. Ermakova, E. Sanjuan, J. Kamps, S. Huet, I. Ovchinnikova, D. Nurbakova, S. Araújo, R. Hannachi, É. Mathurin, P. Bellot, Overview of the CLEF 2022 SimpleText Lab: Automatic simplification of scientific texts, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), CLEF'22: Proceedings of the Thirteenth International Conference of the CLEF Association, Lecture Notes in Computer Science, Springer, 2022.

A. Menta, A. Garcia-Serrano, Controllable Sentence Simplification Using Transfer Learning, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings