Enhancing Scientific Document Simplification through Adaptive Retrieval and Generative Models
Artemis Capari, Hosein Azarbonyad, Zubair Afzal and Georgios Tsatsaronis
Elsevier, Amsterdam
CLEF 2024: Conference and Labs of the Evaluation Forum, September 9-12, 2024, Grenoble, France

Abstract
The CLEF SimpleText Lab focuses on identifying pertinent sections from a vast collection of scientific papers in response to general queries, recognizing and explaining complex terminology in those sections, and ultimately, making the sections easier to understand. The first task is akin to the ad-hoc retrieval task, where the objective is to find relevant sections based on a query/topic, but it also requires ranking models to evaluate documents according to their readability and complexity, alongside relevance. The third task is centered around simplifying sentences from scientific abstracts. In this paper, we outline our strategy for creating a ranking model to address the first task and our methods for employing GPT-3.5 in a zero-shot manner for the third task. To create the ranking model, we initially assess the performance of several models on a proprietary test collection built using scientific papers from various science fields. Subsequently, we fine-tune the top-performing model on a large set of unlabelled documents using the Generative Pseudo Labeling approach. We further experiment with generating new search queries from the provided queries, topics, and abstracts. Our primary contribution and findings indicate that a bi-encoder model, trained on the MS-Marco dataset and fine-tuned further on a vast collection of unlabelled scientific sections, yields the best results on the proprietary dataset, specifically designed for the scientific passage retrieval task. For the third task, we aim to test the limits of a zero-shot Large Language Model (LLM), namely GPT-3.5, by experimenting with various zero-shot and few-shot prompts at both the sentence and abstract level. We find that few-shot prompting results in higher BLEU and SARI scores, but leads to a higher FKGL, as the simplified sentences in the provided test set have a higher FKGL as well. Conversely, a lower FKGL can be obtained with zero-shot prompting, but this results in lower BLEU and SARI scores.

Keywords
Information Retrieval, Scientific Documents, Domain Adaptation, Scholarly Document Processing

1. Introduction
The scientific community often utilizes specialized and complex terminology, making scientific texts difficult to comprehend for the general audience [1]. With continuous developments in many disciplines, even researchers and scientists find it increasingly difficult to stay up to date with novel content and technical concepts. Studies have shown that the readability of scientific literature is declining over time [2]. This trend presents both challenges and opportunities for researchers and publishers to improve the readability of complex scientific information for a broader audience. The SimpleText Lab [3] is dedicated to addressing these challenges by making scientific content more accessible.
The lab's primary objectives include identifying relevant passages in response to user queries [4], explaining complex terminology within these passages [5], and ultimately simplifying the text to improve readability [6]. The initial step in this process is a passage retrieval task known as "What is in (or out)?", where the goal is to retrieve all passages pertinent to a given query or topic, which can then be used to create a simplified summary. In addition to relevance, ranking models must also consider the complexity of passages, prioritizing those that are easier to understand.
Current state-of-the-art ranking models are based on semantic matching using either cross-encoder or bi-encoder architectures, or a combination of both [7]. These models are typically trained on publicly available datasets like MS-Marco [8], which do not include scientific documents. Since the SimpleText Lab's retrieval task and associated training/evaluation sets are centered around scientific literature, existing ranking models may underperform in this context due to the complexity and specialized terminology of scientific documents.
In this paper, we use our findings from our SimpleText participation of the previous year [9, 10] to further expand our experiments. In our previous participation, we fine-tuned pre-trained state-of-the-art ranking models [11, 12] on a set of unlabelled scientific documents using a domain adaptation technique known as Generative Pseudo Labeling (GPL) [13] to retrieve relevant documents for the SimpleText task. This method proved to be successful as our submissions dominated the top of the scoreboard in the 2023 SimpleText Task 1 [9, 10]. Therefore, we consider the models to be sufficiently fitted to the data, and we rather aim to improve the input given to the ranking model by generating new search queries using the provided queries, topics, and abstracts.
In addition to the first task, we also participated in the third task, "Rewrite this!", where the objective is to simplify passages from scientific abstracts given a query [6]. We aim to test the limits of prompt engineering on the text simplification task with GPT-3.5, developing prompts with instructions at varying levels of detail, comparing zero-shot versus few-shot prompting, and providing additional context for the sentence/abstract to be simplified within the prompt.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 details the technical aspects of our system, Sections 4 and 5 present our empirical evaluations, and Section 6 discusses the limitations of our current approach and suggests directions for future research.

2. Related Work
In this section, we review the related work on the passage retrieval (SimpleText Task 1 [4]) and text simplification (SimpleText Task 3 [6]) tasks.

2.1. Passage Retrieval
The field of information retrieval (IR) has seen significant advancements with the introduction of dense retrieval models. These models utilize fixed-length dense vector representations to depict both queries and documents [14]. This approach enables efficient and precise extraction of pertinent information from large text corpora, achieved by calculating the similarity score between vectors representing queries and documents. Compared to conventional sparse retrieval models like BM25 [15], dense retrieval models have exhibited superior performance in diverse tasks, including document ranking and open-domain question answering.
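To make this scoring scheme concrete, the snippet below sketches bi-encoder-style dense retrieval with the sentence-transformers library; the model name, query, and passages are illustrative placeholders rather than the exact configuration used later in this paper.

```python
# Minimal sketch of dense retrieval scoring (illustrative model and texts,
# not the exact configuration used in this work).
from sentence_transformers import SentenceTransformer, util

# A pre-trained bi-encoder maps queries and documents into the same vector space.
model = SentenceTransformer("msmarco-distilbert-base-tas-b")

query = "carbon capture technologies"
passages = [
    "Carbon capture and storage removes CO2 from industrial emissions.",
    "Transformer architectures rely on self-attention mechanisms.",
]

query_vec = model.encode(query, convert_to_tensor=True)
passage_vecs = model.encode(passages, convert_to_tensor=True)

# Relevance is estimated by a similarity function between the fixed-length vectors.
scores = util.dot_score(query_vec, passage_vecs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {passage}")
```

A cross-encoder, in contrast, scores each query-document pair jointly in a single forward pass, which is why it is usually reserved for re-ranking a short candidate list, as discussed next.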
Bi-encoders and cross-encoders are two variants of dense retrieval models. Despite sharing the common objective of capturing the semantic meaning of queries and documents into dense vector representations, these two models differ in their neural network architecture. Bi-encoders operate by independently encoding the query and document with two separate encoders into dense vectors. These vectors are then compared using a similarity function, resulting in a relevance score. A prominent example of bi-encoders is the Dense Passage Retrieval (DPR) model [14]. DPR employs a two-stage retrieval process. Initially, a broad set of passages is retrieved using sparse techniques. Subsequently, each passage is represented as a dense vector using a pre-trained language model like BERT [16]. The query is mapped to a dense vector representation as well. The final ranking of the passages is determined by the cosine similarity between vectors representing the query and passage.
On the other hand, cross-encoders use a single encoder that takes the concatenated query-document pair as input and encodes it jointly, directly producing a relevance score for that pair. The documents are then ranked by this score. Cross-encoders can capture more intricate interactions between the query and the document. However, they are computationally more demanding, since they require a separate forward pass for every query-document pair. In contrast, bi-encoders encode queries and documents separately, so document representations can be computed once for the whole corpus and reused across queries [11]. Consequently, cross-encoders are typically used only as re-rankers [17, 18, 19, 20, 21, 22].

2.2. Text Simplification
When simplifying a text, multiple aspects are to be considered. One aspect is Lexical Simplification (LS), where complex terminology is replaced with simpler synonyms or explanations. However, a sentence could still be grammatically complex, and therefore Syntactic Simplification (SS) should also be considered [23]. The first attempts at automatic LS were rule-based approaches where texts were analyzed and complex terms were identified, after which they were replaced with their most frequently used synonyms [24]. Later rule-based LS approaches aimed to be more context-aware, using methods that better capture semantic meaning such as context vectors [25, 26] or employing a BERT model to generate and rank substitutions for complex words [27]. Data-driven LS uses machine learning techniques to learn LS rules from large datasets, such as English Wikipedia and Simple English Wikipedia [28, 23, 29].
Syntactic simplification also consists of both rule-based and data-driven approaches. Early rule-based methods used handcrafted rules to split long sentences and simplify them, but often failed due to complexities such as crossed dependencies and ambiguities [30, 23]. Improvements were made by integrating a parser, the Lightweight Dependency Analyzer (LDA), to learn simplification rules from a corpus of sentences and their simplified versions [30, 23]. Subsequent work focused on preserving text cohesion, syntactic dependencies, and multilingual applications, but struggled to generalize across different sentence structures and languages [31, 32, 33, 34]. Data-driven methods, using large corpora and statistical models, enhanced the flexibility and robustness of text simplification [35, 36]. Deep learning techniques have advanced the field of text simplification in recent years.
For instance, Sequence-to-sequence (Seq2Seq) models with attention mechanisms [37, 38] have been adapted for text simplification tasks. Nisioi et al. [39] demonstrated the effectiveness of neural models in generating simplified text by training on large-scale datasets. These models can capture complex linguistic patterns and produce more fluent and coherent simplified sentences compared to traditional methods. Advancements in pre-trained language models [40] have further improved automatic text simplification. These models, pre-trained on vast amounts of text data, can be fine-tuned for specific tasks, including text simplification [41].

3. Methodology
In this section, we outline the specific methodologies employed in this paper for the passage retrieval and text simplification tasks in the context of the SimpleText lab. In the first task [4], our aim is to adapt a passage retrieval model to the scientific domain and improve its performance. For this purpose, we construct a validation dataset using a selection of scientific texts. We also experiment with the Generative Pseudo Labeling (GPL) approach for unsupervised domain adaptation. Finally, we focus on creating effective search queries using GPT-3.5 to improve the performance of the retrieval model. For the text simplification task, we explore various prompt-engineering techniques on GPT-3.5 to simplify a given text. The following sections provide an in-depth description of these tasks and the rationale behind our chosen methods.

3.1. Task 1
In order to fine-tune and test our models, we initially construct a validation dataset utilizing a selection of scientific texts annotated by subject matter experts. Subsequently, we use part of this dataset, along with a large collection of scientific documents, to fine-tune a dense-retrieval model. This serves to make the model more suitable for scientific passage retrieval. We also experiment with using LLM-generated search queries using abstracts as context.

3.1.1. Test Collection
The test collection [42, 43] is created using 100 queries spread across 20 distinct scientific disciplines.¹ Each query is specifically chosen to represent a recognized scientific concept, thus enabling the collection of credible and pertinent passages. Following the selection of queries, we employ the well-established pooling technique to retrieve candidate documents for annotation for each query. Five distinct models (comprising two lexical matching models, two bi-encoders, and one cross-encoder) are selected for pool construction. These models are chosen based on their performance on a small subset, or to ensure a variety of models, which in turn guarantees a diversity of documents within the pool. We select 50 documents per query using the pooling approach. These documents are then classified by domain experts as "relevant", "partially relevant", or "non-relevant".

¹ Including Genetics and Molecular Biology, Computer Science, Economics, Agricultural and Biological Sciences, Biochemistry, Econometrics and Finance, Toxicology and Pharmaceutical Science, Chemical Engineering, Veterinary Science and Veterinary Medicine, Chemistry, Materials Science, Earth and Planetary Sciences, Engineering, Food Science, Immunology and Microbiology, Mathematics, Nursing and Health Professions, Medicine and Dentistry, Neuroscience, Pharmacology, Psychology, Physics and Astronomy, Social Science.
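As a rough illustration of this pooling step, the sketch below interleaves the result lists of several rankers into a fixed-size annotation pool per query; the ranker interface and the round-robin selection strategy are assumptions made for the example, not the exact procedure used to build the collection.

```python
# Illustrative sketch of pooling candidate documents for annotation. The ranker
# interface and the round-robin selection are assumptions for the example.
from itertools import zip_longest
from typing import Callable, Dict, List

Ranker = Callable[[str, int], List[str]]  # (query, k) -> ranked list of document ids

def build_annotation_pool(queries: List[str], rankers: List[Ranker],
                          pool_size: int = 50, depth: int = 50) -> Dict[str, List[str]]:
    """Interleave the rankers' top results until `pool_size` unique documents are pooled."""
    pools: Dict[str, List[str]] = {}
    for query in queries:
        runs = [ranker(query, depth) for ranker in rankers]
        pool: List[str] = []
        # Round-robin over the runs so every model contributes to the pool.
        for rank_slice in zip_longest(*runs):
            for doc_id in rank_slice:
                if doc_id is not None and doc_id not in pool:
                    pool.append(doc_id)
                if len(pool) == pool_size:
                    break
            if len(pool) == pool_size:
                break
        pools[query] = pool  # these documents are labelled by domain experts
    return pools
```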
This curated dataset serves as the benchmark for assessing the performance of various ranking models (the benchmark set can be found at https://github.com/acapari/KAPR).

3.1.2. GPL
The Generative Pseudo Labeling (GPL) approach, originally introduced in [13], is an unsupervised domain adaptation technique. This framework harnesses the architecture of a pre-trained generative model to generate pseudo labels for unlabeled data within the target domain, creating a training set suitable for supervised learning. This method has shown superior performance compared to other unsupervised domain adaptation methods across various benchmark datasets, and it has achieved state-of-the-art performance in the unsupervised domain adaptation of dense retrieval.
The importance of large data sets in training dense retrieval methods has been frequently emphasized in previous research [16, 14, 7]. Given our manually annotated dataset, comprised of only 5,000 snippets stemming from a set of 100 queries, we face a potential limitation. However, we possess a vast reservoir of unlabeled scientific documents, including research articles, that could provide an abundance of snippets and potential queries. These could be labeled through GPL, based on their relevance, to fine-tune and adapt the extant ranking models to the task of scientific document retrieval.
We adapt the GPL framework to suit our specific needs by first eliminating the query generation component (see Figure 1). Instead, we select a known set of scientific concepts per domain, and subsequently identify all passages that refer to each concept within the documents. This approach is predicated on the idea that the explicit mention of a scientific concept within a document is a strong indicator of the document's relevance to the concept. In this context, each document that mentions a specific concept is considered a positive example. A bi-encoder is then employed to determine negative examples for each query. The GPL framework employs a cross-encoder as a 'teacher model' to fine-tune the underlying bi-encoder model using the collated positive and negative documents. This process enables the adaptation of the bi-encoder model for our specific application: the ranking of scientific documents.

Figure 1: Generative Pseudo Labeling (GPL) for training a domain-adapted dense retriever [13]

Table 1: Details on fine-tuning of various models
Model Name | Bi-Encoder | Queries | Documents | Batch Size | Training Steps | Epochs
MS-DB-v4-GPL-CS | msmarco-distilbert-base-v4 | 218 (10 golden) | 23670 | 16 | 15000 | 1
MS-DB-tas-b-GPL-CS | msmarco-distilbert-base-tas-b | 218 (10 golden) | 23670 | 16 | 15000 | 1
MS-DB-v4-GPL-all | msmarco-distilbert-base-v4 | 4637 (80 golden) | 893110 | 32 | 280000 | 1
MS-DB-tas-b-GPL-all | msmarco-distilbert-base-tas-b | 4637 (80 golden) | 893110 | 32 | 280000 | 1

For our use-case, we have fine-tuned two different bi-encoders, msmarco-distilbert-base-v4 [11] (MS-DB-v4) and msmarco-distilbert-base-tas-b [12] (MS-DB-tas-b), using a subset of our benchmark set, spanning 20 different scientific domains, consisting of 4 queries each. We augmented the training set with a large set of unlabeled passages. When testing performance with the remainder of our benchmark set, we found that msmarco-distilbert-base-tas-b was most suitable for tasks that require understanding of a wide range of domains. However, as the SimpleText task aims at finding references in Computer Science, we have also fine-tuned the aforementioned models on queries and articles from just the Computer Science and Mathematics domains. Naturally, these models were fine-tuned on far less data (see Table 1).
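As a minimal sketch of this distillation step, assuming a sentence-transformers-style setup, the code below lets a cross-encoder teacher score mined (query, positive, negative) triples and trains the bi-encoder student on the resulting score margins; the triples and output path are placeholders, and the loop is simplified relative to our actual training runs (see Table 1 and the hyperparameters below).

```python
# Illustrative sketch of GPL-style distillation: a cross-encoder "teacher" scores
# mined triples, and the bi-encoder "student" learns to reproduce the score margins.
# The triples and paths are placeholders, not our real corpus.
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses
from torch.utils.data import DataLoader

student = SentenceTransformer("msmarco-distilbert-base-tas-b")
teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Each triple: a concept used as query, a passage mentioning it (positive),
# and a hard negative mined with the bi-encoder itself.
triples = [
    ("generative pseudo labeling", "GPL adapts dense retrievers to new domains ...",
     "BM25 is a classical lexical ranking function ..."),
    # ... many more mined triples
]

train_examples = []
for query, pos, neg in triples:
    pos_score, neg_score = teacher.predict([(query, pos), (query, neg)])
    margin = float(pos_score - neg_score)  # pseudo label distilled from the teacher
    train_examples.append(InputExample(texts=[query, pos, neg], label=margin))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MarginMSELoss(student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    optimizer_params={"lr": 2e-5},
)
student.save("msmarco-distilbert-tas-b-gpl-scientific")
```

Using the score margin rather than a hard label lets the student inherit the teacher's finer-grained notion of relevance, which is the core idea behind GPL.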
Each of the models was fitted on pseudo labels created with ms-marco-MiniLM-L-6-v2, using the Adam optimizer [44] with a learning rate of 2e-5 and 1000 warm-up steps.

3.1.3. Generated Search Queries
Several aspects are of importance when aiming to retrieve passages that are relevant for creating a simplified summary around a given topic. One aspect, as described above, is the model used to retrieve the passages. However, even with a high-performing model, the way the model is asked to retrieve those passages matters. We therefore employ GPT-3.5 to generate search queries in two different setups. We first generate new topics using the provided abstracts only (see Figure 2). We also generate new queries using the provided queries and abstracts, with the objective of finding a better search query to highlight a certain aspect of the article (see Figure 3). The generated search queries (available at https://github.com/acapari/SimpleText_24_T1/) are then used to retrieve a corpus of top-k ElasticSearch documents, after which the same queries are used to re-rank the corpus with our fine-tuned models.

Goal: I have a task to retrieve passages that help understand a given article.
Request: Your task is to help me write the best possible search query to retrieve articles that would help understand the provided article. This query should be concise and focus on the provided topic. Only provide ONE search query.
Article: "{abstract}"
Search Query:
Figure 2: Topic prompt

Goal: I have a task to retrieve passages that help understand a given article. We dissect the content of the article into key-topics, and retrieve passages for those topics.
Request: I need your help to create the best possible search query for a given topic in the context of the provided article. This query should be concise and focus on the provided topic. Only provide ONE search query.
Topic: {query_text}
Article: "{abstract}"
Search Query:
Figure 3: Query prompt

3.2. Task 3
For the text simplification task, we explore the advantages and limitations of various prompt-engineering techniques on GPT-3.5. We first design several prompts where we simply ask the model to simplify a given sentence/abstract (see Figure 4, 6, and 7). Subsequently, we augment these prompts with more detailed instructions on how to simplify a given input (see Figure 5, 9, and 11). We also apply few-shot prompting, a frequently used technique that enables in-context learning without the need to update model parameters [45, 46], by adding examples of the desired input and output to the prompt. We use sentences/abstracts and their simplified versions from the provided test set as sample input-output pairs in our few-shot prompts. As models are biased by the order of the in-context examples [47], we randomly take n samples from the test set and ensure that the selected samples are not from the same abstract as the input sentence (see Figure 5, 8, 9, and 10).
Finally, we explore two different methods for adding background information that can potentially be used to help identify essential information that should be included in the simplified text. The first method provides additional context to sentence-level simplification prompts by simply including the abstract that the sentence to be simplified is extracted from. This can potentially aid in avoiding overly simplified sentences as it shows the role of the sentence in a bigger context (see Figure 8, 9, 10, and 11).
Simplifying a text often involves breaking down and explaining complex concepts. Our second method for adding background information therefore involves a two-step process, where we first design a prompt whose task is to identify key concepts in a given abstract and provide their definitions or more generally known synonyms. We expect this method to aid with lexical simplification in particular [48] (see Figure 8).
We thus explore and combine the following methods:
• Simple zero-shot prompting (Prompts 1, 3, and 4)
• Zero-shot prompting with detailed instructions (Prompts 2, 6, and 8)
• Few-shot prompting (Prompts 2, 5, 6, and 7)
• Adding background information to the prompt
  – Adding the abstract for sentence-level simplification (Prompts 5, 6, 7, and 8)
  – Generating and providing definitions/synonyms of key concepts (Prompt 5)

4. Experiments
In this section, we describe the details of the runs and the specific models used to produce the results per run for Task 1 and Task 3.

4.1. Task 1
We have applied our models in several settings before selecting the final 10 submitted runs. The selection was made based on the performance on the provided qrels and the successes of submissions from the previous year [10]. As shown in Table 2, the rankings were retrieved by taking the top-k documents found for each query from the 2024 SimpleText Task 1 Train Qrels by the ElasticSearch API (top-100 for runs 1, 4, 8, and 10; top-500 for runs 2, 5, 7, and 9; and top-1000 for runs 3 and 6). These were then re-ranked using our fine-tuned models. Runs 2 and 7 were obtained with MS-DB-v4-GPL-CS, a msmarco-distilbert-base-v4 model that was only fine-tuned on Computer Science and Mathematics data, while we used MS-DB-tas-b-GPL-all, a msmarco-distilbert-base-tas-b model fine-tuned on all ScienceDirect domains, for the remaining runs. For runs 1, 3, and 7, the top-k documents were retrieved by searching for "query", and then re-ranked using one of our fine-tuned models, again using "query" as the query input, while run 9 simply uses "topic" as the query input. A combination of the two inputs, namely "query, topic", was used as the query input for ranking and re-ranking runs 2, 4, 5, and 6. Finally, we included two runs with generated query inputs, with run 8 at query-level and run 10 at topic-level, both using a top-100 ElasticSearch corpus and the MS-DB-tas-b-GPL-all model.

4.2. Task 3
Experiments conducted for Task 3 revolve around testing the limits of prompt engineering, by comparing performance between simple and very detailed or multi-step prompts, zero-shot and few-shot prompting, and sentence-level and abstract-level simplification.
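To illustrate how the Task 3 runs are produced, the sketch below assembles a zero- or few-shot simplification prompt and sends it to GPT-3.5 via the openai Python client; the abbreviated prompt and helper names are ours, so this is a schematic of the setup rather than the exact scripts behind the official submissions.

```python
# Schematic of producing a simplification run with GPT-3.5 (abbreviated prompt,
# illustrative helper names; not the exact scripts behind the official submissions).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def build_prompt(sentence: str, examples: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot when `examples` is None, few-shot otherwise."""
    parts = [
        "### TASK ###",
        "Simplify a given sentence extracted from a scientific article to a sentence "
        "that is understandable to the general audience.",
    ]
    if examples:  # few-shot: prepend sample input-output pairs
        parts.append("### EXAMPLES ###")
        for original, simplified in examples:
            parts += [f"- Original Sentence: {original}", f"- Simplified Sentence: {simplified}"]
    parts += ["### REQUEST ###", f"- Original Sentence: {sentence}", "- Simplified sentence:"]
    return "\n".join(parts)

def simplify(sentence: str, examples=None) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(sentence, examples)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```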
Table 2: Configurations of official submissions for Task 1
Run | Query Input | Corpus | Model
1 | query | ES Top-100 | MS-DB-tas-b-GPL-all
2 | query, topic | ES Top-500 | MS-DB-v4-GPL-CS
3 | query | ES Top-1000 | MS-DB-tas-b-GPL-all
4 | query, topic | ES Top-100 | MS-DB-tas-b-GPL-all
5 | query, topic | ES Top-500 | MS-DB-tas-b-GPL-all
6 | query, topic | ES Top-1000 | MS-DB-tas-b-GPL-all
7 | query | ES Top-500 | MS-DB-v4-GPL-CS
8 | gen query | ES Top-100 | MS-DB-tas-b-GPL-all
9 | topic | ES Top-500 | MS-DB-tas-b-GPL-all
10 | gen topic | ES Top-100 | MS-DB-tas-b-GPL-all

Table 3: Configurations of official submissions for Task 3
Run | Prompt | Few-Shot | Level | Two-Step | Uses Abstract
1 | 1 | False | Sentence | False | False
2 | 2 | False | Abstract | False | -
3 | 3 | False | Sentence | False | False
4 | 4 | False | Sentence | False | False
5 | 2 | True | Abstract | False | -
6 | 5 | False | Sentence | True | True
7 | 6 | False | Sentence | False | True
8 | 7 | True | Sentence | False | True
9 | 8 | True | Sentence | False | True
10 | 6 | True | Sentence | False | True
11 | 8 | False | Sentence | False | True
12 | 5 | True | Sentence | True | True

Table 3 presents the configurations of the submitted runs. We will also discuss the results of two additional runs for further comparison between zero-shot and few-shot prompting. Each prompt can be found in Appendix A. The submission consists of two abstract-level runs, namely runs 2 and 5, which use the same prompt but compare zero-shot and few-shot prompting. The remaining eight runs are at sentence level, using prompts with varying levels of complexity, and using few-shot prompting for runs 8, 9, and 10. For sentence-level runs 6-10, the entire abstract is also provided as additional context. We include run 6, which involves a two-step process where we first request GPT-3.5 to identify and explain complex terms found in the abstract. Subsequently, we provide the completion of the first step as additional context in the second prompt, where the task is to simplify a given sentence.

5. Results
In this section, we describe the details of the different runs and the results for the passage retrieval (Task 1) and text simplification (Task 3) tasks.

Table 4: Performance of Official Runs on the 2024 SimpleText Task 1 Train Qrels
Run | P@10 | R@10 | RR@10 | nDCG@5 | nDCG@10 | nDCG@50 | nDCG@100
1 | 0.612 | 0.103 | 0.799 | 0.584 | 0.555 | 0.399 | 0.407
2 | 0.584 | 0.088 | 0.727 | 0.566 | 0.550 | 0.401 | 0.364
3 | 0.552 | 0.091 | 0.761 | 0.547 | 0.511 | 0.369 | 0.352
4 | 0.500 | 0.076 | 0.666 | 0.487 | 0.468 | 0.356 | 0.330
5 | 0.508 | 0.079 | 0.657 | 0.500 | 0.461 | 0.353 | 0.335
6 | 0.472 | 0.072 | 0.697 | 0.471 | 0.439 | 0.337 | 0.327
7 | 0.344 | 0.044 | 0.470 | 0.373 | 0.340 | 0.227 | 0.210
8 | 0.340 | 0.042 | 0.502 | 0.328 | 0.321 | 0.236 | 0.227
9 | 0.312 | 0.040 | 0.451 | 0.324 | 0.298 | 0.205 | 0.191
10 | 0.244 | 0.026 | 0.309 | 0.253 | 0.234 | 0.160 | 0.138

5.1. Task 1
5.1.1. Submitted Runs
For Task 1, we have partially selected the runs based on the performance of our submissions of the previous year, where most of our submitted runs were produced using the MS-DB-v4-GPL-CS model as it showed higher performance on the provided qrels [10]. However, official results indicated otherwise, as our top-performing runs were obtained using the MS-DB-tas-b-GPL-all model [49]. We also found that "query, topic" was the best query input. However, the evaluation on the provided qrels indicates otherwise. Following Table 4, run 1, which uses "query" as input and a corpus of top-100 ES documents, ranks highest across most metrics. Last year's best-performing configuration, run 2, ranks second on the provided qrels.
Increasing the number of ES documents from 100 to 500 has a negative impact on RR@10, nDCG@50, and nDCG@100, while increasing the corpus to 1000 ES documents worsens performance across all metrics. Finally, the results suggest that using "query" as the query input performs best, followed by "query, topic", while the generated inputs are among the worst, ranking 8th and 10th. Topic-level query inputs are the lowest performing, ranking 9th and 10th.

5.1.2. Official Results
As per Table 5, where the results are sorted on the primary measure, nDCG@10, we see that the rankings do not correspond with those of the Train qrels presented in Table 4. While the generated query inputs, runs 8 and 10, were ranked among the lowest there, they perform among the best on the Test qrels. We further observe that run 4 performs slightly better than run 10, both using the ES Top-100 corpus and MS-DB-tas-b-GPL-all, but using "query, topic" and "gen topic" as query inputs respectively. While the generated topic still obtains better results than most of our other submissions, we thus conclude that the "query", whether generated (run 8) or original (run 4), provides an additional level of detail required when retrieving relevant documents. "gen query" results in the highest performance, followed by "query, topic" and "gen topic", while "query" and "topic" produce the lowest-performing results.

Table 5: Results for CLEF 2024 SimpleText Task 1 on the Test qrels (G01.C1-G10.C1 and T06-T11)
runid | MRR | P@10 | P@20 | NDCG@10 | NDCG@20 | Bpref | MAP
AIIRLab_Task1_LLaMABiEncoder | 0.9444 | 0.8167 | 0.5517 | 0.6170 | 0.5166 | 0.3559 | 0.2304
AIIRLab_Task1_LLaMAReranker2 | 0.9300 | 0.7933 | 0.5417 | 0.5943 | 0.5004 | 0.3495 | 0.2177
AIIRLab_Task1_LLaMAReranker | 0.8944 | 0.7967 | 0.5583 | 0.5889 | 0.5011 | 0.3541 | 0.2200
LIA_vir_title | 0.8454 | 0.6933 | 0.4383 | 0.5013 | 0.3962 | 0.3594 | 0.1534
AIIRLab_Task1_LLaMACrossEncoder | 0.7975 | 0.6933 | 0.5100 | 0.4745 | 0.4240 | 0.3404 | 0.1970
LIA_vir_abstract | 0.7683 | 0.6000 | 0.4067 | 0.4207 | 0.3504 | 0.3857 | 0.1603
UAms_Task1_Anserini_rm3 | 0.7878 | 0.5700 | 0.4350 | 0.3924 | 0.3495 | 0.4010 | 0.1824
UAms_Task1_Anserini_bm25 | 0.7187 | 0.5500 | 0.4883 | 0.3750 | 0.3707 | 0.3994 | 0.1972
UAms_Task1_CE1K | 0.5950 | 0.5333 | 0.4583 | 0.3672 | 0.3618 | 0.4032 | 0.1939
UAms_Task1_CE1K_CAR | 0.5950 | 0.5333 | 0.4583 | 0.3672 | 0.3618 | 0.2701 | 0.1605
UAms_Task1_CE100 | 0.6618 | 0.5300 | 0.4567 | 0.3654 | 0.3549 | 0.2657 | 0.1579
UAms_Task1_CE100_CAR | 0.6618 | 0.5300 | 0.4567 | 0.3654 | 0.3549 | 0.2657 | 0.1579
AIIRLAB_Task1_CERRF | 0.7264 | 0.5033 | 0.4000 | 0.3584 | 0.3239 | 0.2204 | 0.1309
Arampatzis_1.GPT2_search_results | 0.6986 | 0.5100 | 0.2550 | 0.3516 | 0.2462 | 0.0742 | 0.0577
UBO_Task1_TFIDFT5 | 0.7132 | 0.4833 | 0.3817 | 0.3474 | 0.3197 | 0.2354 | 0.1274
LIA_bool | 0.7242 | 0.5233 | 0.3633 | 0.3381 | 0.2891 | 0.2661 | 0.1199
Elsevier@SimpleText_task_1_run8 | 0.7123 | 0.4533 | 0.3367 | 0.3146 | 0.2752 | 0.1582 | 0.0906
Elsevier@SimpleText_task_1_run4 | 0.6162 | 0.4300 | 0.3217 | 0.3063 | 0.2681 | 0.1642 | 0.1005
Elsevier@SimpleText_task_1_run10 | 0.5117 | 0.4067 | 0.2767 | 0.2885 | 0.2365 | 0.1236 | 0.0729
AB_DPV_SimpleText_task1_results_FKGL | 0.6173 | 0.3733 | 0.2900 | 0.2818 | 0.2442 | 0.1966 | 0.1078
LIA_elastic | 0.6173 | 0.3733 | 0.2900 | 0.2818 | 0.2442 | 0.3016 | 0.1325
Ruby_Task_1 | 0.5470 | 0.4233 | 0.3533 | 0.2756 | 0.2671 | 0.1980 | 0.1110
LIA_meili | 0.6386 | 0.4700 | 0.2867 | 0.2736 | 0.2242 | 0.2377 | 0.0833
Elsevier@SimpleText_task_1_run6 | 0.5333 | 0.3833 | 0.3117 | 0.2633 | 0.2430 | 0.1841 | 0.0973
Tomislav_Rowan_SimpleText_T1_2 | 0.5444 | 0.3733 | 0.2750 | 0.2443 | 0.2183 | 0.0963 | 0.0601
Elsevier@SimpleText_task_1_run5 | 0.4867 | 0.3533 | 0.2883 | 0.2408 | 0.2232 | 0.1834 | 0.0943
Elsevier@SimpleText_task_1_run1 | 0.5589 | 0.3000 | 0.3300 | 0.2247 | 0.2399 | 0.1978 | 0.1018
Elsevier@SimpleText_task_1_run7 | 0.4026 | 0.3200 | 0.2250 | 0.2168 | 0.1850 | 0.1085 | 0.0565
Elsevier@SimpleText_task_1_run9 | 0.3868 | 0.3300 | 0.2283 | 0.2105 | 0.1829 | 0.1103 | 0.0590
Elsevier@SimpleText_task_1_run3 | 0.4733 | 0.2367 | 0.2033 | 0.1853 | 0.1703 | 0.1587 | 0.0714
Elsevier@SimpleText_task_1_run2 | 0.4193 | 0.2233 | 0.2433 | 0.1803 | 0.1865 | 0.1768 | 0.0820
Sharingans_Task1_marco-GPT3 | 0.6667 | 0.0667 | 0.0333 | 0.1149 | 0.0797 | 0.0107 | 0.0107
Tomislav_Rowan_SimpleText_T1_1 | 0.0217 | 0.0233 | 0.0150 | 0.0121 | 0.0106 | 0.0062 | 0.0025
Petra_Regina_simpleText_task_1 | 0.0026 | 0.0000 | 0.0050 | 0.0000 | 0.0035 | 0.0031 | 0.0007

Furthermore, MS-DB-tas-b-GPL-all always outperforms MS-DB-v4-GPL-CS, and the optimal corpus appears to be ES Top-100.

5.2. Task 3
5.2.1. Submitted Runs
When comparing the results in Table 6, we see a correlation between BLEU and SARI, but a trade-off with FKGL. This can be attributed to the fact that BLEU and SARI are metrics that reflect similarity between the predicted output and the reference, while FKGL reflects the education level required to understand the text [50]. As the FKGL score of the provided simplified sentences and abstracts is relatively high, i.e. 13.62 at sentence level and 13.38 at abstract level, a lower FKGL score would naturally lead to lower BLEU and SARI performance.

Table 6: Performance of Official Runs on the 2024 SimpleText Task 3 Test Set
Run | Prompt | Few-Shot | Level | Two-Step | Uses Abstract | FKGL | BLEU | SARI
1 | 1 | False | Sentence | False | False | 11.54 | 0.15 | 36.63
2 | 2 | False | Abstract | False | - | 12.12 | 0.12 | 34.92
3 | 3 | False | Sentence | False | False | 13.09 | 0.25 | 42.57
4 | 4 | False | Sentence | False | False | 12.85 | 0.20 | 39.00
5 | 2 | True | Abstract | False | - | 13.26 | 0.14 | 36.39
6 | 5 | False | Sentence | True | True | 13.70 | 0.21 | 39.95
7 | 6 | False | Sentence | False | True | 13.80 | 0.20 | 39.31
8 | 7 | True | Sentence | False | True | 13.74 | 0.20 | 39.16
9 | 8 | True | Sentence | False | True | 13.68 | 0.21 | 39.12
10 | 6 | True | Sentence | False | True | 13.82 | 0.20 | 39.05
11 | 8 | False | Sentence | False | True | 13.70 | 0.20 | 38.92
12 | 5 | True | Sentence | True | True | 13.97 | 0.19 | 38.54

This is further highlighted by the runs using few-shot prompting, where the provided samples were used. The examples used in the prompt were of a higher education level, and therefore the output sentences/abstracts are of a similar level. High FKGL scores were also obtained with certain zero-shot prompts, namely runs 6 and 11. Run 6 includes a two-step process, and run 11 uses a very detailed prompt (prompt 8), while both include the entire abstract as additional context. We suspect that these were factors that contributed to the generation of relatively complex sentences instead of real simplifications. Shorter prompts containing fewer details and less additional context, on the other hand, lead to simpler generated sentences. Runs 1, 3, and 4, which were generated with the simplest zero-shot prompts, have some of the lowest FKGL scores. Run 3 in particular has a relatively low FKGL with the highest BLEU and SARI scores. If we compare the performance of zero-shot prompts with their few-shot counterparts, i.e. run 6 vs. run 12, run 7 vs. run 10, and run 8 vs. run 11, we see that adding examples positively impacts all metrics. When comparing runs 2 and 5, however, we see that adding examples to the prompt negatively impacts FKGL at abstract level. This might indicate that the LLM is not able to generalize the simplification task at abstract level when multiple abstracts from potentially different articles are given as examples. Overall, our results indicate that the off-the-shelf LLM performs well in simplifying scientific text, particularly at the sentence level.
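For reference, the snippet below shows one way the three reported metrics can be computed, assuming the textstat, sacrebleu, and Hugging Face evaluate packages; it mirrors the standard metric definitions rather than the official evaluation scripts.

```python
# Sketch of computing FKGL, BLEU, and SARI for a set of simplifications, assuming
# the textstat, sacrebleu, and evaluate packages (not the official evaluation scripts).
import textstat
import evaluate
from sacrebleu.metrics import BLEU

sources     = ["The co-ingestion of NSAIDs was associated with adverse outcomes."]
predictions = ["Taking these painkillers together was linked to worse outcomes."]
references  = [["Taking anti-inflammatory drugs together was linked to bad outcomes."]]

# FKGL: readability of the system output alone (lower = easier to read).
fkgl = sum(textstat.flesch_kincaid_grade(p) for p in predictions) / len(predictions)

# BLEU: n-gram overlap between the output and the reference simplifications.
bleu = BLEU().corpus_score(predictions, [[r[0] for r in references]]).score

# SARI: rewards correct additions, deletions, and kept words relative to source and reference.
sari = evaluate.load("sari").compute(
    sources=sources, predictions=predictions, references=references
)["sari"]

print(f"FKGL={fkgl:.2f}  BLEU={bleu:.2f}  SARI={sari:.2f}")
```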
Interestingly, providing additional context around the sentences does not enhance the LLM's performance in this task. This could be because the additional context diverts the LLM's focus from the individual sentence. When more context is provided, the LLM may attempt to integrate information from multiple sentences, which can negatively affect the quality of the simplified sentences and lead to more complex sentences.

Table 7: Results for CLEF 2024 SimpleText Task 3.1 sentence-level text simplification (task number removed from the run_id) on the test set
run_id | count | FKGL | SARI | BLEU | Compression ratio | Sentence splits | Levenshtein similarity | Exact copies | Additions proportion | Deletions proportion | Lexical complexity score
References | 578 | 8.86 | 100 | 100 | 0.7 | 1.06 | 0.6 | 0.01 | 0.27 | 0.54 | 8.51
Identity | 578 | 13.65 | 12.02 | 19.76 | 1 | 1 | 1 | 1 | 0 | 0 | 8.8
Elsevier@SimpleText_Task3.1_run1 | 578 | 10.33 | 43.63 | 10.68 | 0.87 | 1.06 | 0.59 | 0.00 | 0.45 | 0.53 | 8.39
Elsevier@SimpleText_Task3.1_run4 | 577 | 11.73 | 43.14 | 12.08 | 0.85 | 1.00 | 0.63 | 0.00 | 0.37 | 0.50 | 8.54
Elsevier@SimpleText_Task3.1_run8 | 577 | 12.40 | 42.95 | 12.35 | 0.90 | 1.02 | 0.63 | 0.00 | 0.35 | 0.50 | 8.66
Elsevier@SimpleText_Task3.1_run6 | 577 | 12.65 | 42.88 | 11.76 | 0.95 | 1.00 | 0.64 | 0.00 | 0.38 | 0.47 | 8.63
Elsevier@SimpleText_Task3.1_run7 | 577 | 12.55 | 42.87 | 12.20 | 0.87 | 1.00 | 0.63 | 0.00 | 0.35 | 0.51 | 8.67
Elsevier@SimpleText_Task3.1_run9 | 577 | 12.53 | 42.61 | 12.15 | 0.87 | 1.00 | 0.63 | 0.00 | 0.35 | 0.50 | 8.67
Elsevier@SimpleText_Task3.1_run3 | 577 | 11.50 | 42.58 | 15.75 | 0.76 | 0.98 | 0.68 | 0.00 | 0.23 | 0.46 | 8.68
Elsevier@SimpleText_Task3.1_run10 | 577 | 12.57 | 42.49 | 11.91 | 0.91 | 1.02 | 0.63 | 0.00 | 0.34 | 0.50 | 8.67
AIIRLab_Task3.1_llama-3-8b_run1 | 578 | 8.39 | 40.58 | 7.53 | 0.90 | 1.37 | 0.56 | 0.00 | 0.48 | 0.58 | 8.45
AIIRLab_Task3.1_llama-3-8b_run3 | 578 | 9.47 | 40.36 | 6.26 | 1.17 | 1.52 | 0.53 | 0.00 | 0.53 | 0.56 | 8.51
AIIRLab_Task3.1_llama-3-8b_run2 | 578 | 10.33 | 39.76 | 5.46 | 1.03 | 1.19 | 0.51 | 0.00 | 0.60 | 0.56 | 8.34
UZH_Pandas_Task3.1_simple_with_cot | 578 | 13.74 | 39.59 | 3.38 | 3.44 | 2.67 | 0.41 | 0.00 | 0.76 | 0.12 | 8.61
UZH_Pandas_Task3.1_simple | 578 | 11.24 | 39.28 | 5.67 | 0.88 | 0.98 | 0.52 | 0.00 | 0.53 | 0.62 | 8.45
Sharingans_task3.1_finetuned | 578 | 11.39 | 38.61 | 18.18 | 0.83 | 1.07 | 0.77 | 0.11 | 0.16 | 0.32 | 8.70
UZH_Pandas_Task3.1_selection_with_sle_cot | 578 | 6.49 | 38.38 | 1.03 | 4.76 | 6.26 | 0.30 | 0.00 | 0.89 | 0.14 | 8.30
UZH_Pandas_Task3.1_simple_with_intermediate_definitions | 578 | 21.36 | 38.29 | 3.13 | 1.93 | 0.99 | 0.46 | 0.00 | 0.69 | 0.33 | 8.86
UZH_Pandas_Task3.1_selection_with_lens_cot | 578 | 6.74 | 38.16 | 1.10 | 4.54 | 5.88 | 0.32 | 0.00 | 0.87 | 0.14 | 8.32
UZH_Pandas_Task3.1_5Y_target_with_cot | 578 | 6.39 | 37.95 | 0.97 | 4.73 | 6.25 | 0.30 | 0.00 | 0.89 | 0.14 | 8.30
UZH_Pandas_Task3.1_selection_with_lens | 578 | 21.29 | 37.79 | 2.71 | 1.97 | 1.01 | 0.44 | 0.00 | 0.71 | 0.34 | 8.85
UBO_Task3.1_Phi4mini-s | 578 | 8.74 | 36.78 | 0.58 | 18.23 | 23.48 | 0.47 | 0.00 | 0.66 | 0.29 | 8.89
UZH_Pandas_Task3.1_selection_with_lens_1 | 578 | 7.79 | 36.72 | 3.65 | 0.72 | 0.98 | 0.46 | 0.00 | 0.54 | 0.73 | 8.25
UBO_Task3.1_Phi4mini-sl | 578 | 6.16 | 36.53 | 0.61 | 6.92 | 9.81 | 0.38 | 0.00 | 0.80 | 0.42 | 8.72
UZH_Pandas_Task3.1_5Y_target_with_intermediate_definitions | 578 | 19.30 | 36.53 | 2.27 | 1.76 | 1.01 | 0.45 | 0.00 | 0.70 | 0.41 | 8.87
UZH_Pandas_Task3.1_selection_with_sle | 578 | 6.07 | 35.30 | 2.57 | 0.65 | 0.98 | 0.43 | 0.00 | 0.56 | 0.78 | 8.17
UZH_Pandas_Task3.1_5Y_target | 578 | 5.94 | 34.91 | 2.29 | 0.66 | 0.99 | 0.43 | 0.00 | 0.57 | 0.78 | 8.17
UBO_RubyAiYoungTeam_Task3.2 | 578 | 8.76 | 34.40 | 15.37 | 0.60 | 1.22 | 0.69 | 0.03 | 0.05 | 0.44 | 8.71
SONAR_Task3.1_SONARnonlinreg | 578 | 13.14 | 32.12 | 18.41 | 0.97 | 1.01 | 0.93 | 0.13 | 0.11 | 0.13 | 8.73
UAms_Task3-1_GPT2_Check | 578 | 11.47 | 29.91 | 15.10 | 1.02 | 1.23 | 0.87 | 0.14 | 0.17 | 0.14 | 8.68
UAms_Task3-1_GPT2 | 578 | 10.91 | 29.73 | 13.07 | 1.30 | 1.50 | 0.79 | 0.06 | 0.29 | 0.12 | 8.63
YOUR_TEAM_Task3.1_T5 | 578 | 13.18 | 28.92 | 10.66 | 1.12 | 1.10 | 0.72 | 0.03 | 0.34 | 0.37 | 9.06
UAms_Task3-1_Wiki_BART_Snt | 578 | 12.13 | 27.45 | 21.56 | 0.85 | 0.99 | 0.89 | 0.32 | 0.02 | 0.16 | 8.73
YOUR_TEAM_Task3.1_DistilBERT | 578 | 5.85 | 19.00 | 13.56 | 1.03 | 3.00 | 0.95 | 0.00 | 0.22 | 0.11 | 8.65
UAms_Task3-1_Cochrane_BART_Snt | 578 | 13.22 | 18.45 | 19.21 | 0.95 | 0.99 | 0.96 | 0.59 | 0.02 | 0.07 | 8.77
YOUR_TEAM_Task3.1_METHOD | 578 | 13.65 | 12.12 | 19.77 | 1.00 | 1.00 | 1.00 | 0.99 | 0.00 | 0.00 | 8.80

5.2.2. Official Results
As per Table 7, where the results are sorted on SARI, we see that our submitted runs (prefixed with Elsevier) dominate the top of the scoreboard on sentence-level simplification. We observe that the results differ from our evaluation on the provided test set presented in Table 6, as run 1 there ranked relatively low on SARI, while it ranks highest in the official results. However, it still obtains the lowest FKGL score. The rankings show a correlation between the simplicity of the prompt and the simplicity of the generated sentences: the runs with the simplest prompts, i.e. runs 1, 3, and 4, obtain the lowest FKGL scores. Run 3 obtains the highest BLEU score at 15.75, indicating that prompt 3 (Figure 6) produces the sentences most similar to the test set. When comparing the performance of runs 7 and 10, where the zero-shot and few-shot versions of prompt 6 were used respectively, we see that the zero-shot version of the prompt performs better. This indicates that the references used in the test set were of higher simplicity than the provided examples.

Table 8: Results for CLEF 2024 SimpleText Task 3.2 abstract-level text simplification (task number removed from the run_id) on the test set
run_id | count | FKGL | SARI | BLEU | Compression ratio | Sentence splits | Levenshtein similarity | Exact copies | Additions proportion | Deletions proportion | Lexical complexity score
References | 103 | 8.91 | 100.00 | 100.00 | 0.67 | 1.04 | 0.60 | 0.00 | 0.23 | 0.53 | 8.66
Identity | 103 | 13.64 | 12.81 | 21.36 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 8.88
AIIRLab_Task3.2_llama-3-8b_run1 | 103 | 9.07 | 43.44 | 11.73 | 1.01 | 1.38 | 0.51 | 0.00 | 0.37 | 0.56 | 8.57
AIIRLab_Task3.2_llama-3-8b_run2 | 103 | 10.22 | 42.19 | 7.99 | 1.31 | 1.38 | 0.48 | 0.00 | 0.53 | 0.52 | 8.44
AIIRLab_Task3.2_llama-3-8b_run3 | 103 | 10.17 | 43.21 | 11.03 | 1.15 | 1.47 | 0.52 | 0.00 | 0.40 | 0.51 | 8.66
Elsevier@SimpleText_Task3.2_run2 | 103 | 11.01 | 42.47 | 10.54 | 1.04 | 1.22 | 0.51 | 0.00 | 0.38 | 0.55 | 8.60
Elsevier@SimpleText_Task3.2_run5 | 103 | 12.08 | 42.15 | 10.96 | 1.04 | 1.15 | 0.52 | 0.00 | 0.36 | 0.53 | 8.75
Sharingans_task3.2_finetuned | 103 | 11.53 | 40.96 | 18.29 | 1.20 | 1.39 | 0.65 | 0.00 | 0.24 | 0.34 | 8.80
UAms_Task3-2_Cochrane_BART_Doc | 103 | 14.46 | 33.51 | 9.39 | 0.65 | 0.58 | 0.54 | 0.04 | 0.06 | 0.53 | 8.80
UAms_Task3-2_Cochrane_BART_Par | 103 | 16.53 | 31.58 | 15.40 | 1.08 | 0.80 | 0.67 | 0.04 | 0.15 | 0.32 | 8.81
UAms_Task3-2_GPT2_Check_Abs | 103 | 12.85 | 36.47 | 13.12 | 0.91 | 0.92 | 0.59 | 0.00 | 0.18 | 0.45 | 8.73
UAms_Task3-2_GPT2_Check_Snt | 103 | 11.57 | 30.71 | 15.24 | 1.54 | 1.70 | 0.78 | 0.00 | 0.27 | 0.13 | 8.77
UAms_Task3-2_Wiki_BART_Doc | 103 | 15.68 | 26.50 | 15.11 | 1.51 | 1.14 | 0.76 | 0.01 | 0.25 | 0.11 | 8.79
UAms_Task3-2_Wiki_BART_Par | 103 | 13.11 | 23.92 | 19.49 | 1.39 | 1.37 | 0.81 | 0.01 | 0.11 | 0.10 | 8.86
UBO_Task3.1_Phi4mini-l | 103 | 9.96 | 38.41 | 10.01 | 1.29 | 2.11 | 0.55 | 0.00 | 0.24 | 0.51 | 9.03
UBO_Task3.1_Phi4mini-ls | 103 | 8.45 | 38.79 | 5.53 | 1.21 | 1.75 | 0.43 | 0.00 | 0.40 | 0.63 | 8.53
YOUR_TEAM_Task3.2_DistilBERT | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
YOUR_TEAM_Task3.2_METHOD | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
YOUR_TEAM_Task3.2_METHOD | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
YOUR_TEAM_Task3.2_METHOD | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
YOUR_TEAM_Task3.2_METHOD | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
YOUR_TEAM_Task3.2_T5 | 103 | 0.00 | 28.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 10.82
The same observation can be made at abstract level in Table 8, where run 2 and run 5 are the zero-shot and few-shot versions of prompt 2, respectively. While few-shot prompting is typically used to boost performance, we can infer why our submissions using zero-shot prompting obtain higher performance: Table 7 shows that the FKGL of the official reference set is 8.86, while the FKGL of the references in the 2024 SimpleText Task 3 Test Set is 13.62. The examples used in our few-shot prompts were thus more complex than the references used in the official evaluation, which is in turn reflected in the performance of the sentences generated with these few-shot prompts.

6. Conclusion
Building on the success of our participation in the 2023 SimpleText Task 1 [9, 10], where we fine-tuned ranking models on a large collection of unlabeled scientific documents using Generative Pseudo-Labeling (GPL) [13], our objective for the 2024 SimpleText Task 1 [4] was to enhance the search queries provided as input to these ranking models. We generated these search queries with GPT-3.5 at both query and topic level, using article abstracts as context. While our submissions using this method achieved high rankings on the scoreboard in the 2023 SimpleText Task 1 [9, 10], our models did not outperform those of other teams in the 2024 SimpleText Task 1 [4]. We hypothesize that this is due to the fact that the pool of rankings used to create the reference set differs from the previous year, which consisted of rankings from many lexical search methods, while there was an increased use of semantic search models and generative methods in the current year. Furthermore, we observed that our submissions using generated search queries outperformed those utilizing traditional search queries. Specifically, the msmarco-distilbert-base-tas-b model, fine-tuned via GPL on a vast collection of scientific documents and employed to re-rank the top 100 documents retrieved by an ElasticSearch system, demonstrated superior performance when combined with query-level generated search queries.
Furthermore, we employed various prompt-engineering techniques for the SimpleText simplification task [6], resulting in the highest-ranking performances on the sentence-level simplification task. While the success of our submission can be largely attributed to the inherent capabilities of the GPT-3.5 model, it is nonetheless important to explore which methods best exploit GPT-3.5 for text simplification tasks. Our findings indicate that the simplest prompts, wherein we requested sentence simplification without additional instructions or examples, yielded the best FKGL and BLEU performances. Conversely, runs generated with few-shot prompts did not perform as well, particularly on FKGL. This can be attributed to the complexity of sentences in the provided test set, compared to the significantly simpler sentences in the reference set of the official evaluation, which had a lower FKGL. Consequently, using the provided set as few-shot examples led to the generation of more complex sentences.

References
[1] Y. Jin, M.-Y. Kan, J. P. Ng, X. He, Mining scientific terms and their definitions: A study of the ACL anthology, in: EMNLP, 2013, pp. 780–790.
[2] P. Plavén-Sigray, G. J. Matheson, B. C. Schiffler, W. H. Thompson, The readability of scientific texts is decreasing over time, eLife 6 (2017) e27725.
[3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D'souza, S. Kabongo, H. B. Giglou, Y. Zhang, et al., Overview of CLEF 2024 SimpleText track on improving access to scientific texts, in: CLEF, 2024.
[4] E. SanJuan, et al., Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, CEUR Workshop Proceedings, 2024.
[5] G. M. D. Nunzio, et al., Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, CEUR Workshop Proceedings, 2024.
[6] L. Ermakova, et al., Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text, CEUR Workshop Proceedings, 2024.
[7] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models, arXiv preprint arXiv:2104.08663 (2021).
[8] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, Ms marco: A human generated machine reading comprehension dataset, choice 2640 (2016) 660.
[9] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, O. Augereau, J. Kamps, Overview of the CLEF 2023 SimpleText Lab: Automatic simplification of scientific texts, in: CLEF, Springer, 2023, pp. 482–506.
[10] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, Elsevier at SimpleText: Passage retrieval by fine-tuning GPL on scientific documents (2023).
[11] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese BERT-networks, in: EMNLP, 2019.
[12] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: SIGIR, 2021, pp. 113–122.
[13] K. Wang, N. Thakur, N. Reimers, I. Gurevych, Gpl: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval, arXiv preprint arXiv:2112.07577 (2021).
[14] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, arXiv preprint arXiv:2004.04906 (2020).
[15] X. Wang, C. Macdonald, I. Ounis, Improving zero-shot retrieval using dense external expansion, Information Processing & Management 59 (2022) 103026.
[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[17] R. Nogueira, K. Cho, Passage re-ranking with bert, arXiv preprint arXiv:1901.04085 (2019).
[18] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, arXiv preprint arXiv:1904.08375 (2019).
[19] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence model, arXiv preprint arXiv:2003.06713 (2020).
[20] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized embeddings for document ranking, in: SIGIR, 2019, pp. 1101–1104.
[21] S. MacAvaney, F. M. Nardini, R. Perego, N. Tonellotto, N. Goharian, O. Frieder, Efficient document re-ranking for transformers by precomputing term representations, in: SIGIR, 2020, pp. 49–58.
[22] C. Li, A. Yates, S. MacAvaney, B. He, Y. Sun, Parade: Passage representation aggregation for document reranking, arXiv preprint arXiv:2008.09093 (2020).
[23] S. S. Al-Thanyyan, A. M. Azmi, Automated text simplification: a survey, ACM Computing Surveys (CSUR) 54 (2021) 1–36.
[24] J. Carroll, G. Minnen, Y. Canning, S. Devlin, J. Tait, Practical simplification of english newspaper text to assist aphasic readers, in: AAAI-98 workshop on integrating artificial intelligence and assistive technology, Madison, WI, 1998, pp. 7–10.
[25] S. Bott, L. Rello, B. Drndarević, H. Saggion, Can spanish be simpler? lexsis: Lexical simplification for spanish, in: COLING, 2012, pp. 357–374.
[26] O. Biran, S. Brody, N. Elhadad, Putting it simply: a context-aware approach to lexical simplification, in: ACL, 2011, pp. 496–501.
[27] J. Qiang, Y. Li, Y. Zhu, Y. Yuan, X. Wu, Lexical simplification with pretrained encoders, in: AAAI, volume 34, 2020, pp. 8649–8656.
[28] C. Scarton, Horacio Saggion, Automatic Text Simplification, Synthesis Lectures on Human Language Technologies, April 2017, 137 pages, ISBN: 1627058680, 9781627058681, Natural Language Engineering 26 (2020) 489–492.
[29] W. Coster, D. Kauchak, Simple english wikipedia: a new text simplification task, in: ACL, 2011, pp. 665–669.
[30] R. Chandrasekar, C. Doran, S. Bangalore, Motivations and methods for text simplification, in: COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics, 1996.
[31] A. Siddharthan, Syntactic simplification and text cohesion, Research on Language and Computation 4 (2006) 77–109.
[32] A. Siddharthan, Text simplification using typed dependencies: A comparison of the robustness of different generation strategies, in: The 13th European Workshop on Natural Language Generation, 2011, pp. 2–11.
[33] D. Ferrés, M. Marimon, H. Saggion, A. AbuRa'ed, Yats: yet another text simplifier, in: NLDB, Springer, 2016, pp. 335–342.
[34] C. Scarton, A. P. Aprosio, S. Tonelli, T. M. Wanton, L. Specia, Musst: A multilingual syntactic simplification tool, in: IJCNLP, 2017, pp. 25–28.
[35] Z. Zhu, D. Bernhard, I. Gurevych, A monolingual tree-based translation model for sentence simplification, in: COLING, 2010, pp. 1353–1361.
[36] K. Woodsend, M. Lapata, Learning to simplify sentences with quasi-synchronous grammar and integer programming, in: EMNLP, 2011, pp. 409–420.
[37] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, Advances in neural information processing systems 27 (2014).
[38] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[39] S. Nisioi, S. Štajner, S. P. Ponzetto, L. P. Dinu, Exploring neural text simplification models, in: ACL, 2017, pp. 85–91.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[41] L. Martin, B. Sagot, E. de la Clergerie, A. Bordes, Controllable sentence simplification, arXiv preprint arXiv:1910.02677 (2019).
[42] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, J. Dunham, Sciencedirect topic pages: A knowledge base of scientific concepts across various science domains, in: SIGIR, 2024.
[43] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, J. Dunham, J. Kamps, Knowledge acquisition passage retrieval: Corpus, ranking models, and evaluation resources, in: CLEF, 2024.
[44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[45] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[46] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[47] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, in: International conference on machine learning, PMLR, 2021, pp. 12697–12706.
[48] P. Lal, S. Ruger, Extract-based summarization with simplification, in: ACL, London, 2002.
[49] E. SanJuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2022 SimpleText task 1: Passage selection for a simplified summary, CEUR Workshop Proceedings, 2022, pp. 2762–2772.
[50] R. Flesch, A new readability yardstick, Journal of applied psychology 32 (1948) 221.

Appendix A. Task 3 Prompts

### TASK ###
Simplify the language used in this sentence from a scientific article so that it can be understood by the general audience. Focus on simplifying the sentence structure and replacing scientific jargon with everyday language.
### REQUEST ###
- Sentence: {row.source_snt}
- Simplified Sentence:
Figure 4: Prompt 1

### TASK ####
You are going to simplify a given abstract intended for an academic audience to a text that is understandable to the general audience.
### INSTRUCTIONS ####
1. Identify Key Concepts: First, I would identify the main points or key concepts that the article is trying to convey. This could include its purpose, its methods, its findings or any other relevant information.
2. Simplify Language: Scientific articles often use complex terminology that is specific to the field of study. I would replace these terms with simpler, more common words that a general audience would understand.
3. Break Down Complex Ideas: If the article contains complex ideas or processes, I would break these down into smaller parts and explain them one at a time.
4. Avoid Jargon: I would avoid using jargon, unless it's necessary for understanding the concept. If it is, I would provide a clear and simple definition.
{
### EXAMPLE 1 ###
- Abstract: {ex1.abs_source}
- Simplified Abstract: {ex1.simplified_abs}
### EXAMPLE 2 ###
- Abstract: {ex2.abs_source}
- Simplified Abstract: {ex2.simplified_abs}
}
### REQUEST ###
Remember: the goal is not to oversimplify or distort the scientific abstract, but to make it accessible and understandable to more people.
- Abstract: {row.abs_source}
- Simplified abstract:
Figure 5: Prompt 2

Your task is to simplify a given sentence.
### OUTPUT ###
Sentence: {row.source_snt}
Simplified sentence:
Figure 6: Prompt 3

### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.
### REQUEST ###
Remember, the goal is to retain the original meaning of the sentence while making it easier for a general audience to understand.
- Original Sentence: {row.source_snt}
- Simplified sentence:
Figure 7: Prompt 4

Step 1
### TASK ###
Identify complex and technical terms from a given scientific abstract that require simplification or explanation in order to be understood by a general audience. Provide the complex terms along with their simpler synonym or definition in list format.
Abstract: {row.abs_source}
Complex Terms:

Step 2
### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.
### INSTRUCTIONS ###
1. **Identify Technical Terms:** Look for scientific or technical terms that may not be commonly understood by a general audience. Replace these with simpler, more universally understood terms. For example, simplify "co-ingestion" to "consumption"; "nonsteroidal anti-inflammatory drug (NSAID)" to "nonsteroidal anti-inflammatory drugs".
2. **Simplify Complex Phrases:** Replace complex phrases with simpler ones. For example, simplify "carried on experiments" to "conducted experiments".
3. **Eliminate Unnecessary Details:** Remove any details or information that is not essential to the main point of the sentence.
4. **Clarify Statistics and Measurements:** If a sentence includes statistical data or measurements, explain it in a way that makes it easier to understand. For example, simplify "(0.23 vs 0.45 [F = 4.24, p < 0.05])" to "(0.23 vs 0.45, with statistical significance)".
5. **Make the Subject Clearer:** Make sure the subject of the sentence is clear. For example, simplify "intervention participants" to "those who received the CDSS suite".
6. **Use Active Voice:** Try to use active voice instead of passive voice as it is easier to understand.
7. **Break Down Long Sentences:** If the sentence is too long, try to break it down into smaller sentences.
8. **Use Everyday Language:** Instead of scientific jargon, use everyday language whenever possible.
9. **Use context:** Simplify the given sentence, but use the provided 'Source Abstract' if additional context is needed.
10. **Explain Complex Terms:** Replace complex terms with simpler equivalents where possible or provide a definition for concepts that are essential, but not commonly understood. These terms are provided in 'Complex Terms'.
11. **Simplify Sentence Structure**.
{
### EXAMPLES ###
## EXAMPLE 1 ##
- Source Abstract: {ex1.abs_source}
- Original Sentence: {ex1.source_snt}
- Complex Terms: {ex1.complex_terms}
- Simplified Sentence: {ex1.simplified_snt}
## EXAMPLE 2 ##
- Source Abstract: {ex2.abs_source} ...
- Original Sentence: {ex2.source_snt}
- Complex Terms: {ex2.complex_terms}
- Simplified Sentence: {ex2.simplified_snt}
## EXAMPLE 3 ##
- Source Abstract: {ex3.abs_source}
- Original Sentence: {ex3.source_snt}
- Complex Terms: {ex3.complex_terms}
- Simplified Sentence: {ex3.simplified_snt}
}
### REQUEST ###
Remember, the goal is to retain the original meaning of the sentence while making it easier for a general audience to understand. Focus on replacing scientific jargon with everyday language and explaining complex, essential terms.
- Source Abstract: {row.abs_source}
- Original Sentence: {row.source_snt}
- Complex Terms: {row.complex_terms}
- Simplified sentence:
Figure 8: Prompt 5

### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.
{
### EXAMPLES ###
## EXAMPLE 1 ##
- Source Abstract: {ex1.abs_source}
- Original Sentence: {ex1.source_snt}
- Simplified Sentence: {ex1.simplified_snt}
## EXAMPLE 2 ##
- Source Abstract: {ex2.abs_source}
- Original Sentence: {ex2.source_snt}
- Simplified Sentence: {ex2.simplified_snt}
## EXAMPLE 3 ##
- Source Abstract: {ex3.abs_source}
- Original Sentence: {ex3.source_snt} ...
- Simplified Sentence: {ex3.simplified_snt}
## EXAMPLE 4 ##
- Source Abstract: {ex4.abs_source}
- Original Sentence: {ex4.source_snt}
- Simplified Sentence: {ex4.simplified_snt}
## EXAMPLE 5 ##
- Source Abstract: {ex5.abs_source}
- Original Sentence: {ex5.source_snt}
- Simplified Sentence: {ex5.simplified_snt}
}
### INSTRUCTIONS ###
### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.

{
### EXAMPLES ###
## EXAMPLE 1 ##
- Source Abstract: {ex1.abs_source}
- Original Sentence: {ex1.source_snt}
- Simplified Sentence: {ex1.simplified_snt}

## EXAMPLE 2 ##
- Source Abstract: {ex2.abs_source}
- Original Sentence: {ex2.source_snt}
- Simplified Sentence: {ex2.simplified_snt}

## EXAMPLE 3 ##
- Source Abstract: {ex3.abs_source}
- Original Sentence: {ex3.source_snt} ...
- Simplified Sentence: {ex3.simplified_snt}

## EXAMPLE 4 ##
- Source Abstract: {ex4.abs_source}
- Original Sentence: {ex4.source_snt}
- Simplified Sentence: {ex4.simplified_snt}

## EXAMPLE 5 ##
- Source Abstract: {ex5.abs_source}
- Original Sentence: {ex5.source_snt}
- Simplified Sentence: {ex5.simplified_snt}
}

### INSTRUCTIONS ###
1. **Identify Technical Terms:** Look for scientific or technical terms that may not be commonly understood by a general audience. Replace these with simpler, more universally understood terms. For example, "co-ingestion" was simplified to "consumption"; "nonsteroidal anti-inflammatory drug (NSAID)" was simplified to "nonsteroidal anti-inflammatory drugs".
2. **Simplify Complex Phrases:** Replace complex phrases with simpler ones. For example, "carried on experiments" was simplified to "conducted experiments".
3. **Eliminate Unnecessary Details:** Remove any details or information that is not essential to the main point of the sentence. For example, "based on user and tweets characteristics" was removed as it was not essential to understand the main point.
4. **Clarify Statistics and Measurements:** If a sentence includes statistical data or measurements, explain it in a way that makes it easier to understand. For example, "(0.23 vs 0.45 [F = 4.24, p < 0.05])" was simplified to "(0.23 vs 0.45, with statistical significance)".
5. **Make the Subject Clearer:** Make sure the subject of the sentence is clear. For example, "intervention participants" was clarified to "those who received the CDSS suite".
6. **Use Active Voice:** Try to use active voice instead of passive voice as it is easier to understand.
7. **Break Down Long Sentences:** If the sentence is too long, try to break it down into smaller sentences.
8. **Use Everyday Language:** Instead of scientific jargon, use everyday language whenever possible.
9. **Use context:** Simplify the given sentence, but use the provided 'Source Abstract' if additional context is needed.

### REQUEST ###
Remember, the goal is to retain the original meaning of the sentence while making it easier for a general audience to understand.
- Source Abstract: {row.abs_source}
- Original Sentence: {row.source_snt}
- Simplified sentence:

Figure 9: Prompt 6
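In the few-shot variants (Prompts 5–8), the {exN.*} placeholders are filled with worked examples, and the curly braces in the figures mark the block that the zero-shot runs omit. Purely as an illustration (the data class and helper below are not from the paper), such a block could be assembled as follows:

```python
from dataclasses import dataclass

@dataclass
class FewShotExample:
    abs_source: str       # source abstract of the example
    source_snt: str       # original (complex) sentence
    simplified_snt: str   # reference simplification

def build_examples_block(examples: list[FewShotExample]) -> str:
    """Render the '### EXAMPLES ###' section used by the few-shot prompts."""
    parts = ["### EXAMPLES ###"]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"## EXAMPLE {i} ##\n"
            f"- Source Abstract: {ex.abs_source}\n"
            f"- Original Sentence: {ex.source_snt}\n"
            f"- Simplified Sentence: {ex.simplified_snt}"
        )
    return "\n".join(parts)
```

Prompt 5 additionally carries a "- Complex Terms:" line per example, and Prompt 6 uses five examples rather than three.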
### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.

### INSTRUCTIONS ###
1. **Identify Technical Terms:** Look for scientific or technical terms that may not be commonly understood by a general audience. Replace these with simpler, more universally understood terms. For example, "co-ingestion" was simplified to "consumption"; "nonsteroidal anti-inflammatory drug (NSAID)" was simplified to "nonsteroidal anti-inflammatory drugs".
2. **Simplify Complex Phrases:** Replace complex phrases with simpler ones. For example, "carried on experiments" was simplified to "conducted experiments".
3. **Eliminate Unnecessary Details:** Remove any details or information that is not essential to the main point of the sentence. For example, "based on user and tweets characteristics" was removed as it was not essential to understand the main point.
4. **Clarify Statistics and Measurements:** If a sentence includes statistical data or measurements, explain it in a way that makes it easier to understand. For example, "(0.23 vs 0.45 [F = 4.24, p < 0.05])" was simplified to "(0.23 vs 0.45, with statistical significance)".
5. **Make the Subject Clearer:** Make sure the subject of the sentence is clear. For example, "intervention participants" was clarified to "those who received the CDSS suite".
6. **Use Active Voice:** Try to use active voice instead of passive voice as it is easier to understand.
7. **Break Down Long Sentences:** If the sentence is too long, try to break it down into smaller sentences.
8. **Use Everyday Language:** Instead of scientific jargon, use everyday language whenever possible.
9. **Use context:** Simplify the given sentence, but use the provided 'Source Abstract' if additional context is needed.

{
### EXAMPLES ###
## EXAMPLE 1 ##
- Source Abstract: {ex1.abs_source}
- Original Sentence: {ex1.source_snt}
- Simplified Sentence: {ex1.simplified_snt}

## EXAMPLE 2 ##
- Source Abstract: {ex2.abs_source}
- Original Sentence: {ex2.source_snt}
- Simplified Sentence: {ex2.simplified_snt}

## EXAMPLE 3 ##
- Source Abstract: {ex3.abs_source}
- Original Sentence: {ex3.source_snt}
- Simplified Sentence: {ex3.simplified_snt}
}

### REQUEST ###
Remember, the goal is to retain the original meaning of the sentence while making it easier for a general audience to understand.
- Source Abstract: {row.abs_source}
- Original Sentence: {row.source_snt}
- Simplified sentence:

Figure 10: Prompt 7

### TASK ###
Simplify a given sentence extracted from a scientific article to a sentence that is understandable to the general audience.

{
### EXAMPLES ###
## EXAMPLE 1 ##
- Source Abstract: {ex1.abs_source}
- Original Sentence: {ex1.source_snt}
- Simplified Sentence: {ex1.simplified_snt}

## EXAMPLE 2 ##
- Source Abstract: {ex2.abs_source}
- Original Sentence: {ex2.source_snt}
- Simplified Sentence: {ex2.simplified_snt}

## EXAMPLE 3 ##
- Source Abstract: {ex3.abs_source}
- Original Sentence: {ex3.source_snt}
- Simplified Sentence: {ex3.simplified_snt}
}

### INSTRUCTIONS ###
1. **Identify Technical Terms:** Look for scientific or technical terms that may not be commonly understood by a general audience. Replace these with simpler, more universally understood terms. For example, "co-ingestion" was simplified to "consumption"; "nonsteroidal anti-inflammatory drug (NSAID)" was simplified to "nonsteroidal anti-inflammatory drugs".
2. **Simplify Complex Phrases:** Replace complex phrases with simpler ones. For example, "carried on experiments" was simplified to "conducted experiments".
3. **Eliminate Unnecessary Details:** Remove any details or information that is not essential to the main point of the sentence. For example, "based on user and tweets characteristics" was removed as it was not essential to understand the main point.
4. **Clarify Statistics and Measurements:** If a sentence includes statistical data or measurements, explain it in a way that makes it easier to understand. For example, "(0.23 vs 0.45 [F = 4.24, p < 0.05])" was simplified to "(0.23 vs 0.45, with statistical significance)".
5. **Make the Subject Clearer:** Make sure the subject of the sentence is clear. For example, "intervention participants" was clarified to "those who received the CDSS suite".
6. **Use Active Voice:** Try to use active voice instead of passive voice as it is easier to understand.
7. **Break Down Long Sentences:** If the sentence is too long, try to break it down into smaller sentences.
8. **Use Everyday Language:** Instead of scientific jargon, use everyday language whenever possible.

### REQUEST ###
Remember, the goal is to retain the original meaning of the sentence while making it easier for a general audience to understand.
- Source Abstract: {row.abs_source}
- Original Sentence: {row.source_snt}
- Simplified sentence:

Figure 11: Prompt 8