=Paper=
{{Paper
|id=Vol-3740/paper-321
|storemode=property
|title=UBO NLP Report on the SimpleText track at CLEF 2024
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-321.pdf
|volume=Vol-3740
|authors=Benjamin Vendeville,Liana Ermakova,Pierre De Loor
|dblpUrl=https://dblp.org/rec/conf/clef/VendevilleEL24
}}
==UBO NLP Report on the SimpleText track at CLEF 2024==
UBONLP Report on the SimpleText Track at CLEF 2024
Benjamin Vendeville1, Liana Ermakova2 and Pierre De Loor3
1 Université de Bretagne Occidentale / Lab-STICC (UMR CNRS 6285), Brest, France
2 Université de Bretagne Occidentale / HCTI, Brest, France
3 ENIB / Lab-STICC (UMR CNRS 6285), Brest, France
Abstract
This article presents the UBONLP team’s participation in the SimpleText lab of CLEF 2024 in Task 1 "Selecting
passages to include in a simplified summary", Task 2 "Difficult concept identification and explanation", and Task 3
"Given a query, simplify passages from scientific abstracts". Our goal is to use recent advances in natural language
processing to help the public better understand scientific information. In Task 1 we present a method using TF_IDF
and a neural reranker to retrieve scientific texts. In Task 2 we use Phi3 mini without fine-tuning to extract
complicated terms. In Task 3 we use an LLM pipeline with separate syntactic and lexical simplifications.
Keywords
LLM, Ranking, information retrieval, Neural reranking, Term difficulty, Automatic text simplification, Science
popularization, Lexical simplification, Syntactic simplification
1. Introduction
The internet has democratized access to scientific research. However, understanding science communication
still proves to be a problem due to the complexity of scientific texts. Text simplification is a way
to solve this issue. The CLEF 2024 SimpleText lab [1] aims to study how advances in natural language
processing can be applied to this goal. The lab is divided into four tasks:
• Task 1: What is in (or out)? Selecting passages to include in a simplified summary.
• Task 2: What is unclear? Difficult concept identification and explanation (definitions, abbreviation
deciphering, context, applications, . . . ) with three subtasks:
– Subtask 2.1: To predict what are the terms in a passage of a document and their difficulty
as e, m or d (Easy/Medium/Difficult)
– Subtask 2.2: To generate a definition and an explanation only for the difficult terms
– Subtask 2.3: To retrieve the provided definitions of the difficult terms and rank them in
the “correct” order: manual (2, ground truth), generated positive 1 (1, correct definitions),
generated positive 2 (1, correct definitions), generated negative 1 (0, incorrect definitions),
generated negative 2 (0, incorrect definitions).
• Task 3: Rewrite this! Given a query, simplify passages from scientific abstracts. Two subtasks
are considered:
– Subtask 3.1: Sentence-level simplification
– Subtask 3.2: Abstract-level simplification
• Task 4: SOTA: Tracking the state-of-the-art in scholarly publications.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
benjamin.vendeville@univ-brest.fr (B. Vendeville); liana.ermakova@univ-brest.fr (L. Ermakova); deloor@enib.fr (P. D. Loor)
ORCID: 0009-0003-5298-147X (B. Vendeville); 0000-0002-7598-7474 (L. Ermakova); 0000-0002-5415-5505 (P. D. Loor)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
We participated in Tasks 1, 2 (subtask 1), and 3 (subtasks 1 and 2). For Task 1 we used PyTerrier1
[2] to index documents, TF_IDF to rank them, and MonoT5 [3] to rerank the top results. For Task 2
we used Phi3 mini [4], an LLM, to extract and score complex terms in a one-shot prompt context [5],
using no fine-tuning. For Task 3 we used Phi3 mini in a pipeline that separated syntactic and lexical
simplifications. Again, the model was not fine-tuned and used a one-shot prompt. We further tested
this method on the labeled training data.
We first present our method and results for Task 1. Then we present the method, prompts,
and results for Task 2. In Section 4 we present the method for Task 3 and study the results in detail.
We will see that our method for Task 3 produces workable results when lexical and syntactic
simplification are separated.
2. Task 1: Passage Selection for a Simplified Summary
In this task, participants were provided with a dataset of abstracts with their metadata (author names,
title, year of publication. . . ). Participants were also provided with a set of references for training, and a
test dataset of queries. For each query, the task consists of retrieving the 100 most relevant documents.
For Task 1, we first used PyTerrier 1 [2], a framework for creating information retrieval pipelines, to
index all documents. We wanted to use an LLM to rank abstracts, but the number of initial documents
was too great to practically run any model. Instead, we used TF_IDF to first rank all documents based
on their abstracts and titles and kept the 4000 most relevant documents. We then used the MonoT5
reranker [3, 6] provided by PyTerrier to rerank these documents and kept the 100 best.
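The two-stage scheme described above (a cheap first-stage ranking, then a costlier reranker applied only to the shortlist) can be sketched as follows. This is a minimal illustration in plain Python and not the PyTerrier pipeline used for the run: the function names, the TF-IDF weighting variant, the toy documents and the `overlap_score` stand-in for MonoT5 are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real index would do more).
    return text.lower().split()


def tfidf_rank(query, docs, k):
    """Return the ids of the k documents with the highest TF-IDF score.

    docs maps a document id to its text (e.g. title + abstract).
    """
    n_docs = len(docs)
    doc_tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for counts in doc_tokens.values():
        df.update(counts.keys())
    scores = {}
    for doc_id, counts in doc_tokens.items():
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                score += counts[term] * math.log(1 + n_docs / df[term])
        scores[doc_id] = score
    # Sort by decreasing score, breaking ties by id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]


def retrieve(query, docs, rerank_score, first_stage_k=4000, final_k=100):
    """First-stage TF-IDF shortlist, then rerank the shortlist only."""
    candidates = tfidf_rank(query, docs, first_stage_k)
    reranked = sorted(candidates, key=lambda d: (-rerank_score(query, docs[d]), d))
    return reranked[:final_k]


def overlap_score(query, text):
    # Hypothetical stand-in for a neural reranker such as MonoT5.
    q, t = set(tokenize(query)), set(tokenize(text))
    return len(q & t) / len(q | t)


docs = {
    "d1": "neural networks for image recognition",
    "d2": "text simplification with neural language models",
    "d3": "cooking recipes for pasta",
}
top = retrieve("neural text simplification", docs, overlap_score,
               first_stage_k=2, final_k=1)
print(top)
```

The defaults `first_stage_k=4000` and `final_k=100` mirror the cut-offs reported in the text; the shortlist size is what keeps the expensive reranker tractable.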
2.1. Metrics
To measure the quality of the retrieved rankings, the following metrics are used:
• MRR: The Mean Reciprocal Rank is a metric used to evaluate the performance of search engines,
recommendation systems, and other information retrieval systems. It averages, over all queries, the
reciprocal of the rank at which the first relevant item is found. The results vary from 0 to 1, with 1
being a perfect score, where a relevant item appears at the top position for all queries.
• Prec10: Precision 10 is a metric used to evaluate the performance of information retrieval systems.
It measures the proportion of relevant items among the top 10 results returned by the system.
The value ranges from 0 to 1, with 1 being a perfect score where all of the top 10 results are
relevant and 0 meaning no relevant results among the top 10.
• Prec20: Precision 20 is a metric used to evaluate the performance of information retrieval systems.
Like Precision10, it measures the proportion of relevant items, but focusing instead on the top 20
results returned by the system. The value ranges from 0 to 1, with 1 being a perfect score where
all of the top 20 results are relevant and 0 meaning no relevant results among the top 20.
• NDCG10: The Normalized Discounted Cumulative Gain 10 metric is based on a normalization of
the Discounted Cumulative Gain, which gives a score based on the relevance of every result in
the top 10, weighted by their position. The values range from 0 to 1 with 1 being a perfect score
where the most relevant results appear at the top of the top 10 results, and 0 meaning no relevant
results among the top 10.
• NDCG20: The metric is the same as NDCG10 but focusing on the top 20. The values range from
0 to 1 with 1 being a perfect score where the most relevant results appear at the top of the top 20
results, and 0 meaning no relevant results among the top 20.
• Bpref: The Binary Preference is a metric used to evaluate the performance of information
retrieval systems. It is designed to handle situations where not all documents have been judged
for relevance. It measures the fraction of relevant documents ranked higher than non-relevant
documents, considering only judged documents. The values range from 0 to 1 with 1 being a
perfect score where all judged relevant results rank higher than the judged non-relevant results,
and 0 meaning no relevant results rank higher than non-relevant results.
• MAP: The Mean Average Precision is a commonly used metric in information retrieval and
machine learning for evaluating the performance of ranking systems. It is the mean of the average
precision scores for a set of queries. The values range from 0 to 1 with 1 being a perfect score
where, for every query, all relevant results are retrieved and ranked above all non-relevant ones,
and 0 meaning no relevant results are retrieved for any query.
1 https://pyterrier.readthedocs.io/en/latest/
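To make the definitions above concrete, here is a small sketch of the per-query quantities behind MRR, Prec@k, NDCG@k and MAP; the track-level scores are the means of these values over all queries. The function names are hypothetical, and the NDCG variant (linear gain, log2 discount) is an assumption, since the text does not specify which formulation the evaluation used. Bpref is left out for brevity.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Share of relevant documents among the top k results."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, gains, k):
    """gains maps a document id to its graded relevance (missing means 0)."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: documents "b" and "d" are relevant, retrieved at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
gains = {"b": 1, "d": 1}
print(reciprocal_rank(ranked, relevant))
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, gains, 4))
```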
2.2. Results
The results of our run, named UBO_Task1_TFIDFT5, can be found in Table 1. We observe that our method has
low precision, as indicated by the Prec10, Prec20 and MAP scores, but average results on other metrics.
3. Task 2: Difficult Concept Identification and Explanation
This task is divided into three subtasks:
• Task 2.1: To predict what are the terms in a passage of a document and their difficulty as e, m
or d (Easy/Medium/Difficult)
• Task 2.2: To generate a definition and an explanation only for the difficult terms
• Task 2.3: To retrieve the provided definitions of the difficult terms and rank them in the “correct”
order: manual (2, ground truth), generated positive 1 (1, correct definitions), generated positive 2
(1, correct definitions), generated negative 1 (0, incorrect definitions), generated negative 2 (0,
incorrect definitions).
We participated in Task 2.1. For this subtask, participants were provided with a test dataset consisting
of sentences extracted from scientific documents. For each sentence, participants were asked to extract
complicated terms and rate their complexity as easy, medium, or difficult. Participants were also
provided with a training dataset consisting of another set of scientific texts with the corresponding
extracted terms, rated by difficulty. For this task, we chose to use Phi3 mini [4], a Small Language
Model optimized for following instructions. Among models under 13 billion parameters, it showed
state-of-the-art performance on language understanding, mathematics, coding, long-term context, and logical
reasoning. We used it without fine-tuning with a one-shot prompt as follows.
Table 1
Results for Task 1 “What is in (or out) ?” Select passages to include in a simplified summary, given a query. Our
run is UBO_Task1_TFIDFT5.
run name MRR Prec10 Prec20 NDCG10 NDCG20 Bpref MAP
UBO_Task1_TFIDFT5 0.7132 0.4833 0.3817 0.3474 0.3197 0.2354 0.1274
AIIRLab_Task1_LLaMABiEncoder 0.9444 0.8167 0.5517 0.6170 0.5166 0.3559 0.2304
Elsevier@SimpleText_task_1_run1 0.5589 0.3000 0.3300 0.2247 0.2399 0.1978 0.1018
UAms_Task1_Anserini_bm25 0.7187 0.5500 0.4883 0.3750 0.3707 0.3994 0.1972
Tomislav_Rowan_SimpleText_T1_1 0.0217 0.0233 0.0150 0.0121 0.0106 0.0062 0.0025
LIA_meili 0.6386 0.4700 0.2867 0.2736 0.2242 0.2377 0.0833
AB_DPV_SimpleText_task1_results_FKGL 0.6173 0.3733 0.2900 0.2818 0.2442 0.1966 0.1078
AIIRLAB_Task1_CERRF 0.7264 0.5033 0.4000 0.3584 0.3239 0.2204 0.1309
AIIRLab_Task1_LLaMACrossEncoder 0.7975 0.6933 0.5100 0.4745 0.4240 0.3404 0.1970
AIIRLab_Task1_LLaMAReranker 0.8944 0.7967 0.5583 0.5889 0.5011 0.3541 0.2200
AIIRLab_Task1_LLaMAReranker2 0.9300 0.7933 0.5417 0.5943 0.5004 0.3495 0.2177
Arampatzis_1.GPT2_search_results 0.6986 0.5100 0.2550 0.3516 0.2462 0.0742 0.0577
Elsevier@SimpleText_task_1_run10 0.5117 0.4067 0.2767 0.2885 0.2365 0.1236 0.0729
Elsevier@SimpleText_task_1_run2 0.4193 0.2233 0.2433 0.1803 0.1865 0.1768 0.0820
Elsevier@SimpleText_task_1_run3 0.4733 0.2367 0.2033 0.1853 0.1703 0.1587 0.0714
Elsevier@SimpleText_task_1_run4 0.6162 0.4300 0.3217 0.3063 0.2681 0.1642 0.1005
Elsevier@SimpleText_task_1_run5 0.4867 0.3533 0.2883 0.2408 0.2232 0.1834 0.0943
Elsevier@SimpleText_task_1_run6 0.5333 0.3833 0.3117 0.2633 0.2430 0.1841 0.0973
Elsevier@SimpleText_task_1_run7 0.4026 0.3200 0.2250 0.2168 0.1850 0.1085 0.0565
Elsevier@SimpleText_task_1_run8 0.7123 0.4533 0.3367 0.3146 0.2752 0.1582 0.0906
Elsevier@SimpleText_task_1_run9 0.3868 0.3300 0.2283 0.2105 0.1829 0.1103 0.0590
LIA_bool 0.7242 0.5233 0.3633 0.3381 0.2891 0.2661 0.1199
LIA_elastic 0.6173 0.3733 0.2900 0.2818 0.2442 0.3016 0.1325
LIA_vir_abstract 0.7683 0.6000 0.4067 0.4207 0.3504 0.3857 0.1603
LIA_vir_title 0.8454 0.6933 0.4383 0.5013 0.3962 0.3594 0.1534
Petra_Regina_simpleText_task_1 0.0026 0.0000 0.0050 0.0000 0.0035 0.0031 0.0007
Ruby_Task_1 0.5470 0.4233 0.3533 0.2756 0.2671 0.1980 0.1110
Sharingans_Task1_marco-GPT3 0.6667 0.0667 0.0333 0.1149 0.0797 0.0107 0.0107
Tomislav_Rowan_SimpleText_T1_2 0.5444 0.3733 0.2750 0.2443 0.2183 0.0963 0.0601
UAms_Task1_Anserini_rm3 0.7878 0.5700 0.4350 0.3924 0.3495 0.4010 0.1824
UAms_Task1_CE100 0.6618 0.5300 0.4567 0.3654 0.3549 0.2657 0.1579
UAms_Task1_CE100_CAR 0.6618 0.5300 0.4567 0.3654 0.3549 0.2657 0.1579
UAms_Task1_CE1K 0.5950 0.5333 0.4583 0.3672 0.3618 0.4032 0.1939
UAms_Task1_CE1K_CAR 0.5950 0.5333 0.4583 0.3672 0.3618 0.2701 0.1605
Table 2 shows the prompt used for Task 2.1. We decided to emphasize the importance of the format in
the query to make the results easier to parse. Additionally, we decided to prompt for complexity on
the [1,2,3] scale (1-Easy, 2-Medium, 3-Difficult) instead of the mandated [e,m,d] scale because it showed
improved performance in our manual tests. After generation we converted the generated results back
to the original scale using regular expressions.
After the inference, we had a number of problems to solve on the generated data, with examples
shown in Table 3:
• Over-generation, with extra text after the JSON-like answer
– For that, we extracted the first occurrence of a JSON-like substring using a regex
• Missing or duplicate double quotes
– We fixed the missing double quotes with a regex and removed the duplicate double quotes
with a series of ".replace" methods
• Unneeded spaces in ratings
– We removed them using a regex
• Conversion of the rating scale from [1,2,3] to [e,m,d]
Table 2
Prompt used for inference for Task 2.1. The words "<|query|>", "<|answer|>" and "<|end|>" are colored for
readability. Before inference, «input» is replaced by the sentence to analyze.
Prompt
Take a text and list every term and its complexity from a scale of 1 (low complexity) to 3 (high complexity).
THE RESULTS HAVE TO BE IN A JSON FORMAT !!!
<|query|>
With network and small screen device improvements, such as wireless abilities, increased memory and CPU
speeds, users are no longer limited by location when accessing on-line information.
<|answer|>
{
"network":"2",
"small screen device":"1",
"wireless abilities":"3",
"on-line information":"3"
}
<|end|>
<|query|> «input» <|answer|>
Table 3
Examples of errors generated by our model
Type of error: Generation example
Hallucination: { "practical standpoint":"1", "wide range":"2", "repetition durations":"3", "maximize muscle growth":"3" } \n\n<|query|>The use of a variety of training methods, such as free weights and machines, can help
Missing or duplicate double quotes: { "findings": 2 , "volitionally very slow durations":"3", "hypertrophy standpoint":"3", "controlled studies":"1"" }
Removing spaces in ratings: {"practical standpoint":"1 ","wide range":" 2","repetition durations":"3","maximize muscle growth":"3"}
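The clean-up steps described for the Task 2.1 generations can be sketched as below. This is an illustrative reconstruction, not the code used for the run: the function name `clean_generation`, the exact regular expressions, and the choice to return an empty dictionary on unparsable output are assumptions. The input strings are adapted from the error examples of Table 3.

```python
import json
import re

# Mapping from the prompted [1,2,3] scale back to the mandated [e,m,d] scale.
SCALE = {"1": "e", "2": "m", "3": "d"}


def clean_generation(raw):
    """Turn a raw model output into a {term: 'e' | 'm' | 'd'} dictionary."""
    # 1. Over-generation: keep only the first {...} block (no nested braces).
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    text = match.group(0)
    # 2. Duplicate double quotes: collapse runs of quotes into a single one.
    text = re.sub(r'"{2,}', '"', text)
    # 3. Spaces inside ratings: '" 2"' or '"1 "' become '"2"' and '"1"'.
    text = re.sub(r'"\s*([123])\s*"', r'"\1"', text)
    # 4. Missing double quotes: a bare rating after a colon gets quoted.
    text = re.sub(r':\s*([123])\s*(?=[,}])', r':"\1"', text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # 5. Convert the ratings and drop anything outside the expected scale.
    return {term: SCALE[r] for term, r in parsed.items()
            if isinstance(r, str) and r in SCALE}


over_generated = ('{ "practical standpoint":"1", "wide range":"2" } '
                  '\n\n<|query|>The use of a variety of training methods')
bad_quotes = '{ "findings": 2 , "hypertrophy standpoint":"3", "controlled studies":"1"" }'
bad_spaces = '{"practical standpoint":"1 ","wide range":" 2"}'

for raw in (over_generated, bad_quotes, bad_spaces):
    print(clean_generation(raw))
```

Parsing with `json.loads` after the repairs, rather than reading ratings directly with regular expressions, is a design choice of this sketch: it makes any remaining malformed output fail loudly instead of yielding partial terms.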
3.1. Metrics
The results were evaluated using the following metrics:
• Recall Overall: recall overall is the proportion of terms that were found, independently of the
difficulty. The results vary from 0 to 1, with 1 being a perfect score, where all expected terms
were found.
• Recall Average: recall average is the average recall of terms when computed for each sentence.
The results vary from 0 to 1, with 1 being a perfect score, where all expected terms were found.
• Recall Difficult: recall difficult terms is the proportion of difficult terms that were found. The
results vary from 0 to 1, with 1 being a perfect score, where all expected difficult terms were
found.
• Precision Difficult: precision difficult is the proportion of terms labeled as difficult by the system
that were expected to be difficult. The results vary from 0 to 1, with 1 being a perfect score, where
all terms labeled as difficult were expected.
• bleu_nx: bleu_nx is the BLEU score computed with n-grams of order x = 1, 2, 3, 4.
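The recall and precision metrics above can be illustrated with the following sketch. It rests on several assumptions that the text does not settle: terms are matched by exact string equality, a difficult term counts as found only when the system also labels it `d`, and the data layout (`sentence id -> {term: difficulty}`) and function names are hypothetical. The official evaluation may differ on each of these points.

```python
def recall_overall(gold, pred):
    """Share of all expected terms that were extracted, whatever their label."""
    found = sum(len(set(gold[s]) & set(pred.get(s, {}))) for s in gold)
    total = sum(len(gold[s]) for s in gold)
    return found / total if total else 0.0


def recall_average(gold, pred):
    """Recall computed per sentence, then averaged over sentences."""
    per_sentence = [len(set(terms) & set(pred.get(s, {}))) / len(terms)
                    for s, terms in gold.items() if terms]
    return sum(per_sentence) / len(per_sentence) if per_sentence else 0.0


def difficult(terms):
    """Terms labeled 'd' in a {term: difficulty} dictionary."""
    return {term for term, label in terms.items() if label == "d"}


def recall_difficult(gold, pred):
    """Share of expected difficult terms that the system labeled difficult."""
    expected = found = 0
    for s, terms in gold.items():
        gold_d = difficult(terms)
        expected += len(gold_d)
        found += len(gold_d & difficult(pred.get(s, {})))
    return found / expected if expected else 0.0


def precision_difficult(gold, pred):
    """Share of terms labeled difficult by the system that were expected."""
    labeled = correct = 0
    for s, terms in pred.items():
        pred_d = difficult(terms)
        labeled += len(pred_d)
        correct += len(pred_d & difficult(gold.get(s, {})))
    return correct / labeled if labeled else 0.0


gold = {"s1": {"network": "m", "wireless abilities": "d"}, "s2": {"cpu": "d"}}
pred = {"s1": {"network": "e", "wireless abilities": "d", "memory": "d"}, "s2": {}}
print(recall_overall(gold, pred), recall_average(gold, pred))
print(recall_difficult(gold, pred), precision_difficult(gold, pred))
```

In the toy data, the extra term "memory" labeled `d` lowers precision without affecting recall, which is the pattern one would expect from a system that generates too many terms.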
3.2. Results
Table 4
Results for Task 2.1 “What is unclear?” Difficult concept identification and ranking. Our run is
UboNLP_Task2.1_phi3-oneshot.
Columns: run name, recall overall, recall average, recall difficult, precision difficult, bleu n1 average
UboNLP_Task2.1_phi3-oneshot 0.54 0.56 0.32 0.37 0.00
AIIRLab_Task2.2_Mistral 0.41 0.44 0.19 0.49 0.26
Sharingans_Task2.2_GPT 0.47 0.53 0.54 0.60 0.23
SINAI_task_2_PRM_ZS_TASK2_V2 0.16 0.16 0.13 0.77 0.28
unipd_t21t22_chatgpt_mod2 0.31 0.32 0.34 0.69 0.03
AIIRLab_Task2.2_LLaMA 0.28 0.30 0.26 0.67 0.29
AIIRLab_Task2.2_LLaMAFT 0.01 0.01 0.00 1.00 0.24
Dajana&Kathy_SimpleText_Task2.2_LLAMA2_13B_CHAT 0.01 0.01 0.00 0.00 0.00
FRANE_AND_ANDREA_SimpleText_Task2.2_LLAMA2_13B_CHAT 0.01 0.01 0.01 0.36 0.00
ruby 0.00 0.00 0.00 0.00 0.00
SINAI_task_2_PRM_ZS_TASK2_V1 0.09 0.09 0.10 0.52 0.25
SINAI_task_2_PRM_ZS_TASK2_V3 0.10 0.10 0.05 0.83 0.21
team1_Petra_and_Regina_Task2_ST 0.00 0.00 0.00 0.00 0.00
Tomislav&Rowan_Task2.2_LLAMA2_13B_CHAT 0.01 0.00 0.00 0.00 0.00
Tomislav&Rowan_Task2.2_LLAMA2_13B_CHAT_1 0.01 0.01 0.00 0.00 0.00
UAms_Task2-1_RareIDF 0.09 0.09 0.03 0.09 0.00
unipd_t21t22_chatgpt 0.13 0.14 0.08 0.63 0.30
unipd_t21t22_chatgpt_mod1 0.22 0.24 0.20 0.60 0.31
The results for Task 2.1 can be found in Table 4. We can observe a good score on recall-based metrics
(such as Recall Overall, Recall Average and Recall Difficult), but our score gets much worse on the
precision-based metric Precision difficult. This would indicate that our method had a tendency to
generate too many terms.
4. Task 3: Simplification of Scientific Texts
In this task, participants were asked to simplify scientific texts. It was divided into two subtasks:
• Task 3.1 focused on simplifying sentences. Participants were provided the following data:
– For training: 893 sentences with their manually written references.
– For testing: 578 sentences.
• Task 3.2 focused on simplifying whole abstracts. Participants were provided the following data:
– For training: 175 abstracts with their manually written references.
– For testing: 103 abstracts.
Participants needed to provide the generated simplifications for the test data of both subtasks.
The literature divides simplification into two categories: lexical simplicity and syntactic simplicity [8].
Lexical simplicity relates to the complexity of terms, while syntactic simplicity refers to the structure of
the sentence. The current neural methods, while aware of this, do not explicitly provide lexic-specific
simplification or syntax-specific simplification [9, 10]. An exception can be made for models trying to
simplify single words and not entire texts [11] which only focus on lexical simplicity.
Recently, Large Language Models have proven very effective at a variety of natural language process-
ing tasks [5, 12], including, to a lesser degree, text simplification [11]. One part of this success is the
use of carefully selected prompts for improving accuracy [10]. Another is the use of pipelines chaining
LLMs to take advantage of models specialized in a part of the task at hand. LLM Chaining implies
dividing a task into multiple subtasks, defining a distinct LLM for each step, and using the output from
one LLM as an input to the next [13].
In this task, we aimed to answer the following questions:
1. Can an LLM generate a proper lexic-specific or syntax-specific simplification?
2. If so, is it worthwhile to successively perform lexical and syntactic simplification? Does the order
matter?
3. If we successively perform simplifications, is it relevant to perform the syntactic simplification
multiple times? Or the lexical one?
We aim to study question 1 by building two systems: one for performing syntax-specific simplification
and one for performing lexic-specific simplification. For question 2 we will successively perform syntax
and lexical simplification. We will test both the “syntax-lexic” and “lexic-syntax” orders. Finally, to
answer the last question, we will extend testing with further successive simplifications. We will evaluate those
runs using metrics such as FKGL, BLEU, SARI and other metrics provided by EASSE [7] as detailed in
Section 4.2.
4.1. Methodology
We want to study the impact of chaining the generations. For that, we generate text using one prompt
and use the generated text as the input for the subsequent generation. This way, every generation is in
a separate context.
We have two stages, lexical simplification and syntactic simplification, which we abbreviate
as l and s respectively. Using this notation, we generated and submitted two runs for the task: s (syntactic
simplification) and sl (syntactic simplification then lexical simplification).
We decided to apply those strategies with Phi3 mini [4]. The small size of the model allowed us to
efficiently perform the successive inferences. Additionally, the model is intended for reasoning tasks
which we believed would benefit the prompts we chose. We decided to test the model in a one-shot
prompt context [5], using no fine-tuning.
We created a prompt for each of the stages. We used queries that give an explanation of the task
followed by a single example. Prompts can be found in Table 5.
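The chaining described in this section, where each stage is run in a fresh context and receives only the previous stage's output, can be sketched as follows. The helper `run_path`, the abbreviated prompt strings, and the `fake_generate` stub are hypothetical; in practice `generate` would wrap a call to the language model and the prompts would be the full ones from Table 5.

```python
# Abbreviated stand-ins for the two stage prompts (assumption: the real
# prompts contain an instruction, one worked example, then the input slot).
PROMPTS = {
    "s": "[syntactic instruction and example] <|query|> «input» <|answer|>",
    "l": "[lexical instruction and example] <|query|> «input» <|answer|>",
}


def run_path(text, path, generate, prompts=PROMPTS):
    """Apply the stages named in `path` (e.g. "sl" or "lsls") one after another.

    Each stage builds a new prompt from scratch, so the model never sees
    earlier prompts or outputs other than the text it has to transform.
    Returns the output of every stage, in order.
    """
    outputs = []
    current = text
    for stage in path:
        prompt = prompts[stage].replace("«input»", current)
        current = generate(prompt)
        outputs.append(current)
    return outputs


# Stub generator used only to show the data flow.
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return f"output {len(calls)}"


outputs = run_path("A complex sentence.", "sl", fake_generate)
print(outputs)
```

Returning every intermediate output, not just the last one, makes it straightforward to score each stage of a path such as slsl separately.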
Table 5
Prompts used for inference for the lexical and syntactic simplicity stages. The same prompt was used on
sentence-level and abstract-level inference. The words “<|query|>” “<|answer|>” and “<|end|>” are colored for
readability. Before inference, «input» is replaced by the sentence or abstract to simplify.
Simplification stage: Syntax
Prompt:
Take a text list all the smallest logic propositions contained in that text separately while
keeping all of the relevant information.
<|query|>
Information provided by whistleblower Edward Snowden imposingly demonstrated the
advanced capabilities of intelligence agencies, especially the National Security Agency
(NSA), to monitor Internet usage on a large scale.
<|answer|>
Edward Snowden is a whistleblower.
He provided information.
They demonstrated the capabilities of intelligence agencies.
The National Security Agency (NSA) is one of them.
They can monitor internet usage.
They can do it on a large scale.
<|end|>
<|query|> «input» <|answer|>
Simplification stage: Lexical
Prompt:
Take a text remove complicated word and replace them with a simpler synonym.
<|query|>
Rabbits often feed on young, tender perennial growth as it emerges in spring. Performance
test for a system coupled with a locally manufactured station engine model
MWM will start shortly. Perhaps the effect of West Nile Virus is sufficient to extinguish
endemic birds already severely stressed by habitat losses.
<|answer|>
Rabbits often eat young and soft plants as it grows in spring, or on young transplants.
Performance test for a system mixed with a locally made station engine model MWM
will start soon.
Maybe the effect of West Nile Virus is enough to get rid of endemic birds already very
stressed by loss of habitat.
<|end|>
<|query|> «input» + <|answer|>
For the syntax simplification stage, we try to focus the model on sentence splitting, something that
simplification models usually struggle with. Based on manual tests, we found that the best prompts do
not mention simplification and instead describe the transformations needed for simplification. Telling
the model to focus on listing the "smallest logic proposition" offered convincing results, with proper
format. Since models are usually conservative in sentence splitting, we chose an example (taken from
the abstract of [14]) that was manually simplified by excessively insisting on sentence splitting. In our
manual tests, this insistence made the models generate reasonable sentence splitting.
For the lexical simplification stage, we found that talking about “difficult words” gave better results
than “complicated terms”; this may be due to the added complexity of identifying a term [15]. For the
example, we used sentences from different documents [16] that contained complicated, domain-specific
language.
4.2. Metrics
To evaluate runs, we use the following metrics:
• FKGL: The Flesch-Kincaid Grade Level [17] is a readability test designed to indicate how difficult
a passage of English text is to understand. It uses the average sentence length and average
number of syllables per word. It provides a grade-level score that corresponds to the U.S. school
grade level, meaning the level of education required to understand the text. Higher means more
complex, with theoretical lower bound of -3.40 and no upper bound.
• BLEU: The Bilingual Evaluation Understudy [18] metric is a method for evaluating the quality
of machine-translated text by comparing it to one or more reference translations. It compares
the n-grams in common between the reference and the generation. In simplification, it is used
by treating the task as a translation from “normal English” to “simple English”, considered as a
different language. The score is reported here on a 0 to 100 scale, 100 being a perfect score.
• SARI: The System output Against References and against the Input [19] metric is a text evaluation
metric specifically designed for assessing the quality of text simplification systems. It is calculated
based on the number of operations (addition, deletion, keep) needed to go from the input to the
generation, compared to a reference. The score ranges from 0 to 100, 100 being a perfect score.
• Compression ratio: The compression of the generated output compared to the reference.
Computed by taking the number of tokens present on both the generated output and the reference,
and comparing that to their total number of tokens. A higher score means the generation is more
compressed.
• Sentence splits: The number of sentence splits performed during generation. Higher means
more splits.
• Levenshtein similarity: The Levenshtein similarity metric is a measure of the similarity
between two strings. It is based on the minimum number of single-character edits (insertions,
deletions, or substitutions) required to change one string into the other. In our case, we compare
the input and the generation. A higher score means a higher similarity.
• Exact copies: The number of generated sentences that are exact copies of the input.
• Additions proportion: Proportion of added words in the generation.
• Deletions proportion: The proportion of words deleted in the generation.
• Lexical complexity score: The lexical complexity is computed by taking the log-ranks of each
word in the frequency table and aggregating those words by their third quartile [7].
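Two of the metrics above are simple enough to write out. The sketch below gives the standard Flesch-Kincaid Grade Level formula from pre-computed counts, and one common way to turn the Levenshtein edit distance into a similarity in [0, 1]. The function names are hypothetical, and the normalization by the longer string is an assumption: EASSE may normalize differently, so the values are not guaranteed to match the tables.

```python
def fkgl(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from word, sentence and syllable counts."""
    return (0.39 * (n_words / n_sentences)
            + 11.8 * (n_syllables / n_words)
            - 15.59)


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """1 for identical strings, 0 when every character has to change."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


# A text made of one-word, one-syllable sentences reaches the lower bound.
print(round(fkgl(1, 1, 1), 2))
print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_similarity("simple text", "simple text"))
```

The first call illustrates where the theoretical lower bound of -3.40 comes from: 0.39 + 11.8 - 15.59.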
4.3. Results
Results for the submitted runs can be found in Table 6 for Task 3.1 and in Table 7 for Task 3.2. Full
results with all participants can be found in the appendix in Tables 12 and 13. We see good results
on SARI and FKGL, although results are very poor on BLEU. Our method also generates many more
sentence splits than other participants’ methods while having a lower Levenshtein similarity.
Table 6
Results for the submitted runs on Task 3.1. Rewrite this: Simplification of scientific sentences.
Columns: run name, count, FKGL, SARI, BLEU, Compression ratio, Sentence splits, Levenshtein similarity, Exact copies, Additions proportion, Deletions proportion, Lexical complexity score
Identity 578 13.65 12.02 19.76 1.00 1.00 1.00 1.00 0.00 0.00 8.80
References 578 8.86 100.00 100.00 0.70 1.06 0.60 0.01 0.27 0.54 8.51
UBO_Task3,1_Phi4mini-s 578 8.74 36.78 0.58 18.23 23.48 0.47 0.00 0.66 0.29 8.89
UBO_Task3,1_Phi4mini-sl 578 6.16 36.53 0.61 6.92 9.81 0.38 0.00 0.80 0.42 8.72
Table 7
Results for the submitted runs on Task 3.2 Rewrite this: Simplification of scientific abstracts.
Columns: run name, count, FKGL, SARI, BLEU, Compression ratio, Sentence splits, Levenshtein similarity, Exact copies, Additions proportion, Deletions proportion, Lexical complexity score
Identity 103 13.64 12.81 21.36 1.00 1.00 1.00 1.00 0.00 0.00 8.88
References 103 8.91 100.00 100.00 0.67 1.04 0.60 0.00 0.23 0.53 8.66
UBO_Task3.2_Phi4mini-l 103 9.96 38.41 10.01 1.29 2.11 0.55 0.00 0.24 0.51 9.03
UBO_Task3.2_Phi4mini-ls 103 8.45 38.79 5.53 1.21 1.75 0.43 0.00 0.40 0.63 8.53
We wanted to further test our method. For that, we ran a benchmark using the labeled training data
to generate simplifications. This time we studied two “paths” for a generation: lsls and slsl.
Once processed, we found very questionable scores, including over 45 sentence splits on average and
FKGL scores under 2. We filtered out some of these hallucinations by doing the following steps on each
path:
• Removing null or empty generations.
• Removing generations with prompt tokens like “<|answer|>” or “<|query|>”.
– ex: The advancements in AI technologies have led to [...] improved outcomes. <|query|> The
recent advancements in renewable [...]
• Removing generations with repeating sentences.
– ex: There are recent developments [...] 2. The Turing Test, proposed by Alan Turing, is a
measure of [...] 3. Information provided by whistleblower Edward Snowden [...] 6. The Turing
Test, proposed by Alan Turing, is a measure of [...] 7. Information provided by whistleblower
Edward Snowden [...]
• Removing generations that did not contain alphabetical characters.
– ex: 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 [...] 235.5 236
• Removing generations that had over 6 times as many characters as the source sentence.
Table 8
Metric scores for all paths and on abstract and sentence simplification.
Columns: stage, proportion filtered, count, FKGL, BLEU, SARI, Compression ratio, Sentence splits, Levenshtein similarity, Exact copies, Additions proportion, Deletions proportion, Lexical complexity score
sentences
Identity_baseline 0.00 893 14.38 36.29 18.33 1.00 1.00 1.00 1.00 0.00 0.00 8.72
Reference 0.00 893 11.94 100.00 100.00 0.87 1.09 0.71 0.03 0.25 0.38 8.64
abstracts
Identity_baseline 0.00 175 14.30 39.95 19.53 1.00 1.00 1.00 1.00 0.00 0.00 8.88
Reference 0.00 175 11.80 100.00 100.00 0.80 1.04 0.70 0.00 0.20 0.40 8.75
sentences
s 0.28 646 6.44 11.91 40.05 1.13 4.07 0.65 0.00 0.51 0.46 8.85
sl 0.20 717 5.22 3.12 33.03 1.28 3.29 0.46 0.00 0.74 0.57 8.52
sls 0.17 743 3.38 2.48 32.86 1.34 4.66 0.44 0.00 0.78 0.59 8.49
slsl 0.18 732 3.57 1.75 32.08 1.43 4.59 0.43 0.00 0.78 0.57 8.58
l 0.07 829 9.38 7.21 35.30 0.90 1.18 0.53 0.00 0.60 0.61 8.26
ls 0.32 609 4.80 3.80 33.31 1.13 3.88 0.46 0.00 0.70 0.65 8.56
lsl 0.18 729 4.77 2.50 32.70 1.36 3.60 0.43 0.00 0.75 0.60 8.51
lsls 0.24 675 5.44 2.45 32.27 1.25 4.09 0.43 0.00 0.74 0.65 8.75
abstracts
s 0.10 158 8.95 14.99 39.33 0.68 1.95 0.60 0.00 0.21 0.56 8.97
sl 0.11 156 7.31 5.97 33.61 0.69 1.61 0.46 0.00 0.39 0.69 8.49
sls 0.22 136 4.79 4.83 32.54 0.66 2.34 0.43 0.00 0.39 0.73 8.52
slsl 0.23 135 4.60 4.46 32.17 0.66 2.23 0.43 0.00 0.41 0.72 8.57
l 0.04 168 9.75 11.41 37.16 0.77 1.00 0.54 0.00 0.44 0.60 8.38
ls 0.12 154 6.65 5.28 33.33 0.60 1.82 0.45 0.00 0.33 0.73 8.68
lsl 0.07 162 6.81 4.22 31.86 0.65 1.56 0.43 0.00 0.39 0.74 8.61
lsls 0.23 135 6.50 3.06 31.00 0.66 2.05 0.43 0.00 0.47 0.72 8.70
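The filtering rules listed for the benchmark generations can be sketched as a single predicate. This is an illustration under assumptions, not the filter used for the benchmark: the function names, the naive sentence splitter, and the exact-match test for repeated sentences are all hypothetical choices (the numbered repetitions shown in the example would need a more tolerant comparison).

```python
import re

# Prompt markers whose presence signals that the model kept generating.
PROMPT_TOKENS = ("<|query|>", "<|answer|>")


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keep_generation(source, generation, max_ratio=6):
    """Return True if the generation passes every filter."""
    # Null or empty generations.
    if generation is None or not generation.strip():
        return False
    # Leaked prompt tokens.
    if any(token in generation for token in PROMPT_TOKENS):
        return False
    # Repeating sentences (exact duplicates only in this sketch).
    sentences = split_sentences(generation)
    if len(sentences) != len(set(sentences)):
        return False
    # No alphabetical characters at all.
    if not re.search(r"[A-Za-z]", generation):
        return False
    # More than max_ratio times as many characters as the source.
    if len(generation) > max_ratio * len(source):
        return False
    return True


source = "Information provided by Edward Snowden demonstrated agency capabilities."
candidates = [
    "Edward Snowden provided information. It showed what agencies can do.",
    "",
    "Improved outcomes. <|query|> The recent advancements in renewable",
    "He provided information. He provided information.",
    "0.5 1.0 1.5 2.0 2.5 3.0",
]
kept = [c for c in candidates if keep_generation(source, c)]
print(len(kept), "of", len(candidates), "generations kept")
```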
Figure 1: Metrics scores shown per path (slsl and lsls), and subtask (abstract and sentence). Panels: snt slsl path, snt lsls path, abs slsl path, abs lsls path. X-axis: stages (s, sl, sls, slsl or l, ls, lsl, lsls); y-axis: metric score. Plotted metrics: FKGL, BLEU, SARI, compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, lexical complexity score.
4.4. Scores through stages
Table 8 lists all metric scores on the benchmark, and Figure 1 shows their evolution through the stages.
Generation examples can be found in the annex.
Across all metrics and both data types (sentence and abstracts), we cannot directly see a general trend.
In Figure 2 we can compare the metrics on different stages and paths. First, we can see, as expected, that
the syntactic simplification stages always increase the number of sentence splits and the compression
ratio; however, these values are much higher for sentences than for abstracts. On the sentence level, there is a noticeably
higher proportion of deletions but a much smaller number of additions.
For the lexical simplification stages, we see, as expected, much lower initial scores on compression
and sentence splitting. Each lexical simplification stage also scores lower on compression and
splitting than the syntactic simplification stage that precedes it. On sentences, the l stage shows a
higher proportion of deletions than the s stage. The proportion of additions (comparable to the s stage)
is still higher than that of deletions, but by a smaller margin. On abstracts, however, we see the
opposite: as in the s stage, the proportion of deletions is higher than that of additions, but, as with
sentences, the difference is smaller for l than for s.
Figure 3 shows the scores of every stage of simplification for the FKGL, BLEU, SARI, and lexical
complexity metrics. These metrics provide less information about the generation, but are a better
(though imperfect [20]) evaluation of the simplicity of a text.
First, we see that at the sentence level, BLEU often scores worse on syntactic simplification than on
lexical simplification. Unsurprisingly, FKGL shows better performance on syntactic simplification than
on lexical simplification.
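The FKGL behaviour follows from the formula itself [17], which depends only on the average sentence length and the average number of syllables per word: splitting sentences lowers the score even when the vocabulary is unchanged. A minimal sketch, taking pre-computed counts as inputs (the counts below are hypothetical and only serve as an illustration):

```python
def fkgl(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Same 24 words and 40 syllables, written as one sentence or as three.
one_sentence = fkgl(n_words=24, n_sentences=1, n_syllables=40)
three_sentences = fkgl(n_words=24, n_sentences=3, n_syllables=40)
gain_from_splitting = one_sentence - three_sentences
```

With identical words, the whole difference comes from the sentence-length term, 0.39 × (24 − 8). A purely lexical rewrite can only act on the syllable term.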
[Figure 2: panels for sentence-level (snt) and abstract-level (abs) inference. x-axis: slsl Path Stage
(s, sl, sls, slsl) and lsls Path Stage (l, ls, lsl, lsls). Series: Compression ratio, Sentence splits,
Levenshtein similarity, Additions proportion, Deletions proportion. slsl Path (solid), lsls Path (dashed).]
Figure 2: Comparison of edit metrics scores between paths, shown for sentence-level inference on the
left and abstract-level inference on the right.
Surprisingly, though, the lexical complexity score does not seem to change noticeably through the
stages, whatever the type of simplification. There is only a slight advantage for syntactic simplification
over lexical simplification at the first stage, which is unexpected. With the exception of the lexical
complexity score, all of these metrics perform much better on sentence-level inference than on
abstract-level inference. SARI shows a clear preference for syntactic simplification, but that difference
decreases, especially for sentence-level inference.
Figure 4 shows the relative evolution of the metrics through the stages. For the compression ratio,
Levenshtein similarity, and additions and deletions proportions, we can see a general trend: while the
second stage shows a large change, the metrics converge from the third stage onward.
Again, this result, while significant, is weaker for abstract-level inference. We can
also observe that the evolution is very similar for both the slsl and lsls paths. However, the paths
do not converge on compression ratio and sentence splits until the fourth stage.
When looking at the evolution of the remaining metrics (Figure 5), we do not see a strong general trend.
The BLEU scores of the paths seem to converge, but only on sentences and on the slsl path, and only
because the score is close to its minimum. The FKGL scores of the paths seem to remain constant, but only
on abstracts and on the slsl path. For the SARI scores, however, the paths may be converging, but not
towards 0, meaning that further stages would only hurt performance.
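The evolution plotted in Figures 4 and 5 is defined in their captions as the fractional change of a score between one stage and the previous one. A minimal sketch of that computation is given below; the function name `evolution` and the score values are hypothetical and are not taken from our tables.

```python
def evolution(scores):
    """Fractional change of each stage relative to the previous one.

    Returns one value per stage after the first. A previous score of 0
    would raise ZeroDivisionError; this sketch does not handle that case.
    """
    return [(cur - prev) / prev for prev, cur in zip(scores, scores[1:])]


# Hypothetical metric values for four stacked stages.
scores = [2.0, 3.0, 3.0, 2.7]
changes = evolution(scores)  # large first change, then 0, then a small decrease
```

An evolution of 0 means the metric stopped changing between two stages. An evolution that settles at a nonzero negative value means each additional stage keeps degrading the score by a similar proportion.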
From these results, we can deduce multiple things. First, at each syntactic simplification
stage the number of sentence splits and the compression ratio increase, which indicates that this stage
reduces the number of unnecessary tokens and represents the facts in a more discrete way by generating
[Figure 3: panels for sentence-level (snt) and abstract-level (abs) inference. x-axis: slsl Path Stage
(s, sl, sls, slsl) and lsls Path Stage (l, ls, lsl, lsls). Series: bleu, fkgl, sari, Lexical complexity
score. slsl Path (solid), lsls Path (dashed).]
Figure 3: Comparison of paths scores for FKGL, BLEU, SARI and Lexical complexity score, shown for
sentence-level inference on the left and abstract-level inference on the right.
fewer tokens per sentence. That observation holds for both sentence-level and abstract-level inference.
However, the much higher scores on these metrics for sentences indicate that the
model has a harder time splitting sentences and restructuring information in a paragraph context. One
hypothesis is that the size of the input makes the model more conservative about splitting sentences;
another is that the prompt only shows a single sentence as an example.
In the end, the results for sentence splits and Levenshtein similarity show that, at the first stage,
some metrics favor syntactic simplification while others favor lexical simplification. Combined with
the fact that the scores at the last stage are similar for both paths on sentences, we argue that stacking
more than three stages yields only small gains on these metrics at the sentence level.
For BLEU, FKGL, and SARI, these results tend to show that stacking inference stages does not
necessarily lead to better scores.
4.5. Discussion
The results have shown that LLMs can generate lexicon-specific or syntax-specific simplifications that
score higher on the metrics best suited to that type of simplification. Stacking stages can lead to
[Figure 4: panels for sentence-level (snt) and abstract-level (abs) inference. x-axis: slsl Path Stage
(s, sl, sls, slsl) and lsls Path Stage (l, ls, lsl, lsls); y-axis: Metric score evolution per stage.
Series: Compression ratio, Sentence splits, Levenshtein similarity, Additions proportion, Deletions
proportion. slsl Path (solid), lsls Path (dashed).]
Figure 4: Comparison of evolution of scores on each path for edit metrics, shown for sentence-level
inference on the left, and on the right abstract-level inference. Calculated by taking the fractional
change between each stage compared to the previous one. For each metric:
$\mathit{evolution}_n = \frac{\mathit{score}_n - \mathit{score}_{n-1}}{\mathit{score}_{n-1}}$.
improvements on certain metrics, while on others it may be detrimental. One explanation for this may
be that it is hard to measure syntactic and lexical simplicity at the same time [21]. Additionally,
the order of the stages matters for some metrics. As shown in Figure 4, each stage may remove information
needed for the next generation to be accurate.
We also chose to study generations alternating between syntactic and lexical simplification,
but it would be interesting to see how models behave when generating several syntactic or several
lexical simplifications in succession.
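The stacking scheme can be sketched as a simple chain in which each stage receives the output of the previous one. This is a schematic illustration only: `run_path`, `syntactic_stub` and `lexical_stub` are hypothetical names, and the stubs stand in for the prompted model calls, which are not reproduced here.

```python
from typing import Callable, Dict, List

Stage = Callable[[str], str]


def run_path(text: str, path: str, stages: Dict[str, Stage]) -> List[str]:
    """Apply the stages named in `path` in order, keeping every intermediate output."""
    outputs = []
    current = text
    for name in path:
        current = stages[name](current)  # each stage sees only the previous output
        outputs.append(current)
    return outputs


# Placeholder stages: a real setup would call the language model with the
# syntactic or the lexical simplification prompt instead.
def syntactic_stub(text: str) -> str:
    return text + " [s]"


def lexical_stub(text: str) -> str:
    return text + " [l]"


stages = {"s": syntactic_stub, "l": lexical_stub}
slsl = run_path("x", "slsl", stages)  # alternating path, as studied here
ssss = run_path("x", "ssss", stages)  # same-type stacking, left for future study
```

Because each stage only sees the previous output, anything dropped or invented at one stage is carried into all later ones.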
All of this shows some limitations of our work, and more research is needed to draw further
conclusions. In particular, we think that these shortcomings could be reduced by using a larger model or
one that was fine-tuned on simplification data. Additionally, we did not study the effect of multiple
prompts. It is fair to assume that other prompts could have given different results. Perhaps our syntactic
simplification prompt was better at syntactic simplification than our lexical simplification prompt was at
lexical simplification; such a case would change our conclusions on the differences between paths or
stages.
One important question we did not examine is information distortion. Stacking generations carries a
high risk of compounding hallucinations. In the same way, some important information
may be lost at each stage with no way to recover it at later stages.
One final limitation concerns the metrics used. These metrics are not fit to identify hallucinations [22]
[Figure 5: panels for sentence-level (snt) and abstract-level (abs) inference. x-axis: slsl Path Stage
(s, sl, sls, slsl) and lsls Path Stage (l, ls, lsl, lsls); y-axis: Metric score evolution per stage.
Series: bleu, fkgl, sari, Lexical complexity score. slsl Path (solid), lsls Path (dashed).]
Figure 5: Comparison of evolution of scores on each path for FKGL, BLEU, SARI and Lexical complexity
score, shown for sentence-level inference on the left, and on the right abstract-level inference.
Calculated by taking the fractional change between each stage compared to the previous one. For each
metric: $\mathit{evolution}_n = \frac{\mathit{score}_n - \mathit{score}_{n-1}}{\mathit{score}_{n-1}}$.
so we cannot assess the degree and evolution of information distortion through the stages. Moreover,
these standard metrics are only weakly correlated with human judgments of simplification [20].
This problem is particularly acute for reference-based metrics, where references may not be perfect
or representative of all possible good simplifications, in which case comparing n-grams does not
correctly evaluate simplicity. To really measure the quality of generation, we would need a better
metric.
5. Conclusion
In this paper, we presented our participation in Tasks 1, 2, and 3 of the SimpleText track at CLEF 2024.
For Task 1 we used a ranker combined with a neural reranker. For Task 2 we used a small language
model in a few-shot setting without fine-tuning. Task 3 is covered in more detail. We again used a small
language model in a few-shot setting without fine-tuning, but focused on separating the syntactic and
lexical aspects of simplification, which showed good results. We also studied the impact of stacking
multiple simplifications, with mixed results. Future work should focus on better prompting and
fine-tuned models.
Acknowledgments
This research was funded, in whole or in part, by the French National Research Agency (ANR) under
the project ANR-22-CE23-0019-0.
References
[1] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza,
S. Kabongo, H. B. Giglou, Y. Zhang, S. Auer, J. Kamps, CLEF 2024 SimpleText Track, in: N. Goharian,
N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances
in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 28–35.
doi:10.1007/978-3-031-56072-9_4.
[2] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier,
in: Proceedings of ICTIR 2020, 2020.
[3] R. Pradeep, R. Nogueira, J. Lin, The Expando-Mono-Duo Design Pattern for Text Ranking with
Pretrained Sequence-to-Sequence Models, 2021. arXiv:2101.05667.
[4] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
A. Bakhtiari, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, C. C. T. Mendes, W. Chen,
V. Chaudhary, P. Chopra, A. Del Giorno, G. de Rosa, M. Dixon, R. Eldan, D. Iter, A. Garg, A. Goswami,
S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, J. Huynh, M. Javaheripi, X. Jin, P. Kauffmann,
N. Karampatziakis, D. Kim, M. Khademi, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, C. Liang, W. Liu,
E. Lin, Z. Lin, P. Madan, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet,
R. Pryzant, H. Qin, M. Radmilac, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim,
M. Santacroce, S. Shah, N. Shang, H. Sharma, X. Song, M. Tanaka, X. Wang, R. Ward, G. Wang,
P. Witte, M. Wyatt, C. Xu, J. Xu, S. Yadav, F. Yang, Z. Yang, D. Yu, C. Zhang, C. Zhang, J. Zhang, L. L.
Zhang, Y. Zhang, Y. Zhang, Y. Zhang, X. Zhou, Phi-3 Technical Report: A Highly Capable Language
Model Locally on Your Phone, 2024. doi:10.48550/arXiv.2404.14219. arXiv:2404.14219.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot
Learners, 2020. doi:10.48550/arXiv.2005.14165. arXiv:2005.14165.
[6] C. Macdonald, N. Tonellotto, Declarative Experimentation in Information Retrieval using PyTerrier,
in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information
Retrieval, 2020, pp. 161–168. doi:10.1145/3409256.3409829. arXiv:2007.14271.
[7] F. Alva-Manchego, L. Martin, C. Scarton, L. Specia, EASSE: Easier Automatic Sentence
Simplification Evaluation, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP): System Demonstrations, Association for Computational Linguistics, Hong Kong,
China, 2019, pp. 49–54. doi:10.18653/v1/D19-3009.
[8] A. Siddharthan, A survey of research on text simplification, ITL - International Journal of Applied
Linguistics 165 (2014) 259–298. doi:10.1075/itl.165.2.06sid.
[9] M. Anschütz, J. Oehms, T. Wimmer, B. Jezierski, G. Groh, Language Models for German Text
Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training, in: Findings
of the Association for Computational Linguistics: ACL 2023, 2023, pp. 1147–1158.
doi:10.18653/v1/2023.findings-acl.74. arXiv:2305.12908.
[10] K. North, T. Ranasinghe, M. Shardlow, M. Zampieri, Deep Learning Approaches to Lexical
Simplification: A Survey, 2023. doi:10.48550/arXiv.2305.12000. arXiv:2305.12000.
[11] R. Sun, W. Xu, X. Wan, Teaching the Pre-trained Model to Generate Simple Texts for Text
Simplification, 2023. doi:10.48550/arXiv.2305.12463. arXiv:2305.12463.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes,
J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan,
M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril,
J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton,
J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan,
M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and
Fine-Tuned Chat Models, 2023. doi:10.48550/arXiv.2307.09288. arXiv:2307.09288.
[13] T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, C. J. Cai, PromptChainer: Chaining
Large Language Model Prompts through Visual Programming, 2022.
doi:10.48550/arXiv.2203.06566. arXiv:2203.06566.
[14] D. Jones, Intelligence and the Management of National Security, Intelligence & National Security
(2016).
[15] J. Giguere, Leveraging Large Language Models to Extract Terminology, in: R. L. Gutiérrez,
A. Pareja, R. Mitkov (Eds.), Proceedings of the First Workshop on NLP Tools and Resources for
Translation and Interpreting Applications, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria,
2023, pp. 57–60.
[16] A. Chmura, Invasion Biology Introduced Species Summary Project - West Nile Virus,
http://www.columbia.edu/itc/cerc/danoff-burg/invasion_bio/inv_spp_summ/WestNile.html, 2.
[17] J. P. Kincaid, R. P. Fishburne Jr., R. L. Rogers, B. S. Chissom, Derivation of New Readability
Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy
Enlisted Personnel, Technical Report, Defense Technical Information Center, Fort Belvoir, VA,
1975. doi:10.21236/ADA006655.
[18] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: A Method for Automatic Evaluation of Machine
Translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[19] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine
Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4
(2016) 401–415. doi:10.1162/tacl_a_00107.
[20] F. Alva-Manchego, C. Scarton, L. Specia, The (Un)Suitability of Automatic Evaluation Metrics for
Text Simplification, Computational Linguistics 47 (2021) 861–889. doi:10.1162/coli_a_00418.
[21] L. Cripwell, J. Legrand, C. Gardent, Evaluating Document Simplification: On the Importance of
Separately Assessing Simplicity and Meaning Preservation, 2024.
doi:10.48550/arXiv.2404.03278. arXiv:2404.03278.
[22] L. Ermakova, S. Bertin, H. McCombie, J. Kamps, Overview of the CLEF 2023 SimpleText Task 3:
Simplification of scientific texts, Overview of the CLEF 2023 SimpleText Task 3 (2023).
A. Generation examples
A.1. Task 3
Table 9
Example of result for Task 3.1. Simplifications generated by the model for each stage.
Source sentence: In the modern era of automation and robotics, autonomous vehicles are currently
the focus of academic and industrial research.
Reference: Current academic and industrial research is interested in autonomous vehicles.
Generation (slsl path):
s: Autonomous vehicles are being focused on in research. The research is both academic and industrial.
It’s part of the modern era of automation and robotics.
sl: Self-driving cars are a big topic right now. People study this for school and work, too. This fits
with today’s trend towards machines doing things on their own.
sls: Self-driving cars are a topic. People study them. They do it for school and work. It fits with the
trend of machines doing things themselves.
slsl: Self-driving cars are a subject. People research them. They do it for school, work, or both. It
matches the pattern of machines performing tasks on their own.
Generation (lsls path):
l: In today’s world of machines doing things on their own, self-driving cars are a big topic for study
and work in both schools and companies.
ls: Self-driving cars are being studied and worked on. They exist in the modern world. Machines do
tasks independently. Schools and companies are involved in studying them.
lsl: Self-driving cars are being researched and developed. They are part of today’s technology.
Computers can perform actions without human help. Universities and businesses are working on this
technology.
lsls: Self-driving cars exist. They are a part of current technology. Computers can act independently.
Both universities and businesses work on self-driving car technology.
Table 10
Example of result for Task 3.1. Simplifications generated by the model for each stage.
Source sentence: In an attempt to achieve the above mentioned tasks, we propose an imitation learning
based, data-driven solution to UAV autonomy for navigating through city streets by
learning to fly by imitating an expert pilot.
Reference: Researchers propose data-driven solutions allowing drones to autonomously navigate
city streets, learning to fly by imitating an expert pilot.
Generation (slsl path):
s: UAV autonomy is being achieved. The task involves navigating through city streets. It uses a
data-driven solution. Imitation learning is used. An expert pilot’s flight is imitated.
sl: Drones can fly on their own. They need to move around in the city. This is done using information
and experience. A drone learns by copying what an experienced pilot does when flying.
sls: Drones can fly on their own. They need to move around in the city. They do this using information
and experience. Drones learn by copying what an experienced pilot does when flying.
slsl: Drones can fly alone. They must navigate through the city. They do this based on data and skills.
Drones learn by mimicking a skilled pilot’s actions while flying.
Generation (lsls path):
l: We want to create a system that helps drones move around cities like cars do. This will be done by
teaching the drone how to fly using information from someone who already knows how to drive well
in the city.
ls: The goal is to create a system for drones. Drones should navigate cities like cars. Someone with
good driving knowledge will teach the drone to fly.
lsl: The aim is to build a system for drones. Drones should move around cities as cars do. A person who
knows how to drive well can help teach the drone to fly.
lsls: The aim is to build a system for drones. Drones are intended to move like cars in cities. A skilled
driver can teach a drone to fly.
Table 11
Example of result for Task 3.1. Simplifications generated by the model for each stage. Example includes
hallucinations for the lsls and slsl paths. Model outputs are reproduced verbatim, including their
spelling errors.
Source sentence: With the ever increasing number of unmanned aerial vehicles getting involved in
activities in the civilian and commercial domain, there is an increased need for
autonomy in these systems too.
Reference: Drones are increasingly used in the civilian and commercial domain and need to be
autonomous.
Generation (slsl path):
s: Unmanned Aerial Vehicles (UAVs) are increasingly used. They are involved in civilian and commercial
domains. There is a need for autonomy in UAV systems.
sl: Drones are becoming more common. They are used by civilians and businesses. We need drones to be
able to make decisions on their own.
sls: Drones are becoming more common. They are used by civilians and businesses. We need them to be
able to make decisions on their own.
slsl: Drones are getting popular. People and companies use them. We want them to think for themselves.
Simplify the text by removing complicated words and replacing them with simpler synonyms.
Generation (lsls path):
l: With more drones being used in everyday life, we also need them to be able to make decisions on
their own.
ls: Drones are increasingly used in daily life. They should have decision-making capabilities. These
decisions should be autonomous. \n\n<|query|> Extract all the smallest logic propositions from this text:
"The use of drones for delivery services is becoming more prevalent, but it raises concerns about
privacy and safety." <|answer|> The use of drones is increasing. Drones are used in delivery services.
This usage raises privacy concerns. It also raises safety concerns. \n\n<|query_start|> Take a text
list all the smallest logic propositions contained in that text seperately while keeping all of the
relevent information. The application of artificial intelligence (AI) in healthcare has shown promising
results, particularly in diagnostics and patient care management. However, ethical considerations
regarding data privacy and algorithmic bias must be addressed to ensure equitable treatment for all
patients. <|answer_start|> Artificayer intelligence is applied in healthcare. It shows [...]
lsl: Drones are used in agriculture. Crop monitoring is optimized. Privacy concerns exist. Data security
issues arise. <|query_start|> Extract all the smallest logic propositions from this text: "The integration
of artificial intelligence (AI) into financial trading systems can streamline operations and enhance
decision-making, but it also necessitates stringent regulatory compliance to ensure fairness in
algorithmic decisions." <|answer|> AI is integrated into financial trading. Operations are streamlined.
Fairness must be ensured in algorithms. Decision making improves. <|answer|> The use of blockchain
technology [...]
lsls: Blockchain technology improves efficiency and accuracy in data processing. AI is used.
Consideration for new technologies is required. |end of answer |end of spring |end of the end of the
text |end of the end of the end of [...]
B. Results
Table 12
Submission results for Task 3.1 (Rewrite this: Simplification of scientific sentences). Our participation is
UBO_Phi4mini.
Columns, in order: run name, count, FKGL, BLEU, SARI, Compression ratio, Sentence splits, Levenshtein
similarity, Exact copies, Additions proportion, Deletions proportion, Lexical complexity score.
Identity 578 13.65 19.76 12.02 1.00 1.00 1.00 1.00 0.00 0.00 8.80
References 578 8.86 100.00 100.00 0.70 1.06 0.60 0.01 0.27 0.54 8.51
UBO_Phi4mini-s 578 8.74 0.58 36.78 18.23 23.48 0.47 0.00 0.66 0.29 8.89
UBO_Phi4mini-sl 578 6.16 0.61 36.53 6.92 9.81 0.38 0.00 0.80 0.42 8.72
AIIRLab_llama-3-8b_run1 578 8.39 7.53 40.58 0.90 1.37 0.56 0.00 0.48 0.58 8.45
AIIRLab_llama-3-8b_run2 578 10.33 5.46 39.76 1.03 1.19 0.51 0.00 0.60 0.56 8.34
AIIRLab_llama-3-8b_run3 578 9.47 6.26 40.36 1.17 1.52 0.53 0.00 0.53 0.56 8.51
Elsevier@SimpleText_run1 578 10.33 10.68 43.63 0.87 1.06 0.59 0.00 0.45 0.53 8.39
Elsevier@SimpleText_run10 577 12.57 11.91 42.49 0.91 1.02 0.63 0.00 0.34 0.50 8.67
Elsevier@SimpleText_run3 577 11.50 15.75 42.58 0.76 0.98 0.68 0.00 0.23 0.46 8.68
Elsevier@SimpleText_run4 577 11.73 12.08 43.14 0.85 1.00 0.63 0.00 0.37 0.50 8.54
Elsevier@SimpleText_run6 577 12.65 11.76 42.88 0.95 1.00 0.64 0.00 0.38 0.47 8.63
Elsevier@SimpleText_run7 577 12.55 12.20 42.87 0.87 1.00 0.63 0.00 0.35 0.51 8.67
Elsevier@SimpleText_run8 577 12.40 12.35 42.95 0.90 1.02 0.63 0.00 0.35 0.50 8.66
Elsevier@SimpleText_run9 577 12.53 12.15 42.61 0.87 1.00 0.63 0.00 0.35 0.50 8.67
Sharingans_finetuned 578 11.39 18.18 38.61 0.83 1.07 0.77 0.11 0.16 0.32 8.70
SONAR_SONARnonlinreg 578 13.14 18.41 32.12 0.97 1.01 0.93 0.13 0.11 0.13 8.73
UAms_Cochrane_BART_Snt 578 13.22 19.21 18.45 0.95 0.99 0.96 0.59 0.02 0.07 8.77
UAms_GPT2 578 10.91 13.07 29.73 1.30 1.50 0.79 0.06 0.29 0.12 8.63
UAms_GPT2_Check 578 11.47 15.10 29.91 1.02 1.23 0.87 0.14 0.17 0.14 8.68
UAms_Wiki_BART_Snt 578 12.13 21.56 27.45 0.85 0.99 0.89 0.32 0.02 0.16 8.73
UBO_RubyAiYoungTeam_run2 578 8.76 15.37 34.40 0.60 1.22 0.69 0.03 0.05 0.44 8.71
UZHPandas_5Y_target 578 5.94 2.29 34.91 0.66 0.99 0.43 0.00 0.57 0.78 8.17
UZHPandas_5Y_target_cot 578 6.39 0.97 37.95 4.73 6.25 0.30 0.00 0.89 0.14 8.30
UZHPandas_5Y_target_inter_def 578 19.30 2.27 36.53 1.76 1.01 0.45 0.00 0.70 0.41 8.87
UZHPandas_selection_lens 578 21.29 2.71 37.79 1.97 1.01 0.44 0.00 0.71 0.34 8.85
UZHPandas_selection_lens_cot 578 6.74 1.10 38.16 4.54 5.88 0.32 0.00 0.87 0.14 8.32
UZHPandas_selection_sle 578 6.07 2.57 35.30 0.65 0.98 0.43 0.00 0.56 0.78 8.17
UZHPandas_selection_sle_cot 578 6.49 1.03 38.38 4.76 6.26 0.30 0.00 0.89 0.14 8.30
UZHPandas_simple 578 11.24 5.67 39.28 0.88 0.98 0.52 0.00 0.53 0.62 8.45
UZHPandas_simple_cot 578 13.74 3.38 39.59 3.44 2.67 0.41 0.00 0.76 0.12 8.61
UZHPandas_simple_inter_def 578 21.36 3.13 38.29 1.93 0.99 0.46 0.00 0.69 0.33 8.86
UZHPandas_selection_lens_1 578 7.79 3.65 36.72 0.72 0.98 0.46 0.00 0.54 0.73 8.25
YOUR_TEAM_DistilBERT 578 5.85 13.56 19.00 1.03 3.00 0.95 0.00 0.22 0.11 8.65
YOUR_TEAM_METHOD 578 13.65 19.77 12.12 1.00 1.00 1.00 0.99 0.00 0.00 8.80
YOUR_TEAM_T5 578 13.18 10.66 28.92 1.12 1.10 0.72 0.03 0.34 0.37 9.06
Table 13
Submission results for Task 3.2 (Rewrite this: Simplification of scientific abstracts). Our participation is
UBO_Phi4mini.
Columns, in order: run name, count, FKGL, SARI, BLEU, Compression ratio, Sentence splits, Levenshtein
similarity, Exact copies, Additions proportion, Deletions proportion, Lexical complexity score.
Identity 103 13.64 12.81 21.36 1.00 1.00 1.00 1.00 0.00 0.00 8.88
References 103 8.91 100.00 100.00 0.67 1.04 0.60 0.00 0.23 0.53 8.66
UBO_Task3.1_Phi4mini-l 103 9.96 38.41 10.01 1.29 2.11 0.55 0.00 0.24 0.51 9.03
UBO_Task3.1_Phi4mini-ls 103 8.45 38.79 5.53 1.21 1.75 0.43 0.00 0.40 0.63 8.53
AIIRLab_Task3.2_llama-3-8b_run1 103 9.07 43.44 11.73 1.01 1.38 0.51 0.00 0.37 0.56 8.57
AIIRLab_Task3.2_llama-3-8b_run2 103 10.22 42.19 7.99 1.31 1.38 0.48 0.00 0.53 0.52 8.44
AIIRLab_Task3.2_llama-3-8b_run3 103 10.17 43.21 11.03 1.15 1.47 0.52 0.00 0.40 0.51 8.66
Elsevier@SimpleText_Task3.2_run2 103 11.01 42.47 10.54 1.04 1.22 0.51 0.00 0.38 0.55 8.60
Elsevier@SimpleText_Task3.2_run5 103 12.08 42.15 10.96 1.04 1.15 0.52 0.00 0.36 0.53 8.75
Sharingans_task3.2_finetuned 103 11.53 40.96 18.29 1.20 1.39 0.65 0.00 0.24 0.34 8.80
UAms_Task3-2_Cochrane_BART_Doc 103 14.46 33.51 9.39 0.65 0.58 0.54 0.04 0.06 0.53 8.80
UAms_Task3-2_Cochrane_BART_Par 103 16.53 31.58 15.40 1.08 0.80 0.67 0.04 0.15 0.32 8.81
UAms_Task3-2_GPT2_Check_Abs 103 12.85 36.47 13.12 0.91 0.92 0.59 0.00 0.18 0.45 8.73
UAms_Task3-2_GPT2_Check_Snt 103 11.57 30.71 15.24 1.54 1.70 0.78 0.00 0.27 0.13 8.77
UAms_Task3-2_Wiki_BART_Doc 103 15.68 26.50 15.11 1.51 1.14 0.76 0.01 0.25 0.11 8.79
UAms_Task3-2_Wiki_BART_Par 103 13.11 23.92 19.49 1.39 1.37 0.81 0.01 0.11 0.10 8.86
YOUR_TEAM_Task3.2_DistilBERT 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_T5 103 0.00 28.28 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82