<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AIIR Lab Systems for CLEF 2024 SimpleText: Large Language Models for Text Simplification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicholas Largey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reihaneh Maarefdoust</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shea Durgin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Behrooz Mansouri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Southern Maine</institution>
          ,
          <addr-line>Portland, Maine</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper presents the participation of the Artificial Intelligence and Information Retrieval (AIIR) Lab from the University of Southern Maine in the CLEF 2024 SimpleText Lab. SimpleText has three main Tasks. Five systems are proposed for the first Task, which involves retrieving passages to include in a simplified summary. These systems select candidates using TF-IDF with queries expanded via LLaMA3. Re-ranking is performed using a bi-encoder, a cross-encoder, and LLaMA3. For Task 2, which involves identifying and explaining difficult concepts, three models utilizing LLaMA3 and Mistral are employed. Finally, for Task 3, which focuses on simplifying scientific text, four systems are introduced. As in Task 2, LLaMA3 and Mistral are used with different prompting and fine-tuning approaches. The experimental results show that the proposed systems for Task 1 are the most effective, while those for Tasks 2 and 3 are comparable with the other systems proposed in the SimpleText lab.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Text Simplification</kwd>
        <kwd>Definition Extraction</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>and corresponding scientific paper abstracts, all divided into individual sentences. For this Task, we also used a fine-tuned LLaMA3 model and Mistral as our proposed approaches.</p>
      <p>The reported results show that our proposed systems for all three Tasks are highly effective. For Task 1 and Subtask 3.2, our proposed models were the most effective ones, while for Task 2 and Subtask 3.1, they are comparable to the leading systems. In the next sections, we describe our systems for each Task, followed by evaluation results and analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Retrieving Passages to Include in a Simplified Summary</title>
      <p>
        This section first describes the data for Task 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Then we describe our five proposed systems. Finally,
we will provide the results and analysis.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Topic and Collection</title>
        <p>As described by the organizers, the topics for this Task are drawn from two resources: 1) the tech section of The Guardian newspaper (https://www.theguardian.com/uk/technology; topics G01 to G20), and 2) the Tech Xplore website (https://techxplore.com/; topics T01 to T20). Each topic represents a query selected from one of these resources. For instance, for topic ‘G13.1’, the query is “digital marketing”, with its context being an article titled “Baffled by digital marketing? Find your way out of the maze”, from The Guardian. Participants have access to the whole article, its title, and the query.</p>
        <p>
          The main corpus consists of a large set of scientific abstracts and associated metadata in the field
of computer science and engineering. The 12th version of the Citation Network Dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], released
in 2020, provides this data extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other
sources. It contains 4,894,083 bibliographic references published before 2020, 4,232,520 English abstracts,
3,058,315 authors with affiliations, and 45,565,790 ACM citations.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Proposed Models</title>
        <p>
          AIIR Lab submitted five runs, of which three participated in the pooling and assessment process. Here, we explain each of our proposed approaches:
• Query Expansion with LLaMA3, Search with Bi-Encoder / Cross-Encoder (LLaMABiEncoder/CrossEncoder): For Task 1, input queries are short keyword terms (e.g., “drones”, “advertising”, “gene editing”) selected from technical articles. To contextualize and potentially expand these queries, we consider their related articles and leverage LLaMA3 (the Meta-LLaMA3-8B-Instruct model from HuggingFace) for query reformulation/expansion. Following the approach proposed by Anand et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we provide the query
and the article to the model, and use the system prompt for query rewriting/expansion shown in
Table 9 in the Appendix.
        </p>
        <p>
          Using our system prompt, we then pass the query, the related article title, and context to LLaMA3
and expand the initial query. Table 1 shows examples of expanded queries. After this step, we use
TF-IDF from PyTerrier [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] with default parameters to get the top-5000 results for each expanded
query.
        </p>
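        <p>As an illustration, the candidate-selection step above can be sketched in pure Python; this is a simplified stand-in for PyTerrier’s TF_IDF weighting model, not the exact implementation used:</p>

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Rank documents against a query with a simple TF-IDF weighting.

    docs: dict mapping doc_id to a list of tokens. This is an illustrative
    stand-in for the PyTerrier TF_IDF model used in the paper (which returns
    the top-5000 candidates per expanded query)."""
    n_docs = len(docs)
    # document frequency per term
    df = Counter()
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term]:
                idf = math.log(n_docs / df[term])
                score += tf[term] * idf
            # terms absent from the collection contribute nothing
        scores[doc_id] = score
    # candidates in descending score order
    return sorted(scores, key=scores.get, reverse=True)
```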
        <p>
          We then re-rank the candidates using two architectures of SentenceBERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: bi- and cross-encoder. For the bi-encoder, we use the ‘all-mpnet-base-v2’ model due to its demonstrated effectiveness in capturing semantic similarity between queries and documents across various information retrieval Tasks. This model is used without further fine-tuning. The input query for the bi-encoder combines the initial query, related article title, and LLaMA-expanded query. We consider the title and abstract of each passage as the document for comparison with the query.
        </p>
        <p>
          For our second run, based on observations from previous lab participation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we fine-tune a cross-encoder model, ‘ms-marco-MiniLM-L-6-v2’. For fine-tuning, we use the data from previous years of the SimpleText lab, split into 90% training and 10% validation sets. We fine-tune the model for 25 epochs, choosing the hyperparameters with the highest MRR@10 (Mean Reciprocal Rank) on the validation set. The input queries were fed to this model as
        </p>
        <p>Initial Query + [TOP] + Article’s Title + [CON] + Expanded Query
where the initial query is the query specified by the organizers, the Article’s Title corresponds to
the topic text, and the Expanded Query is the context generated by LLaMA3. For example, the
input query for topic G11.1 would be:
drones + [TOP] + UK wants new drones in wake of Azerbaijan military success +
[CON] + UK military drones Nagorno-Karabakh conflict Azerbaijan Armenia.</p>
        <p>
          Documents in the collection are represented as ‘title + [ABS] + abstract’. In our fine-tuning process, three special tokens {TOP, CON, ABS} are included to separate the different text types. After fine-tuning the cross-encoder model, we re-rank the top-100 results retrieved by the bi-encoder model.
• Re-ranking with LLaMA (LLaMA Re-Ranker): While we used LLaMA3 for query expansion in our first two runs, for our next two runs we used it as a pairwise re-ranker. Following the approach proposed by Qin et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we used the system prompt for pair-wise re-ranking shown in Table 9 (Appendix).
        </p>
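        <p>The query and document formats described above can be assembled as follows (the exact whitespace around the separator tokens is our assumption):</p>

```python
def build_query_text(initial_query, article_title, expanded_query):
    # Query format for the fine-tuned cross-encoder:
    # Initial Query + [TOP] + Article's Title + [CON] + Expanded Query
    return f"{initial_query} [TOP] {article_title} [CON] {expanded_query}"

def build_doc_text(title, abstract):
    # Documents are represented as 'title + [ABS] + abstract'
    return f"{title} [ABS] {abstract}"
```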
        <p>Two variations of this architecture were implemented, differing in the user message provided to LLaMA3. In one version, the user message included the query, related article title, and the context generated from the previous runs (i.e., the expanded query from LLaMA3). The other version omitted the context.</p>
        <p>Essentially, LLaMA3 was tasked with determining which of the two documents was more relevant to the query based on the provided information. We re-ranked the top-100 candidates retrieved by the bi-encoder model. Since LLaMA3’s outputs in this context might not be suitable for direct confidence scores, we assigned a simple ranking based on enumeration. The highest-ranked document received a score of 100, with scores decreasing by 1 for lower ranks.
• Fine-Tuned Cross-Encoder combined with Elasticsearch (CERRF): Our last run leverages Elasticsearch, provided by the organizers. We first retrieve the top-100 results for each topic using a combination of the query and topic text. Subsequently, we re-rank these results using a fine-tuned cross-encoder, ‘ms-marco-MiniLM-L-6-v2’. For fine-tuning, the training data from previous labs was used. We represented each input query as “&lt;query&gt; [QSP] &lt;topic text&gt;”, while the papers were represented as “&lt;title&gt; [TSP] &lt;abstract&gt;”. Here, [QSP] and [TSP] are special tokens separating the query text from the topic text and the paper title from its abstract, respectively. To select optimal hyperparameters, topics G10 and G11 were chosen for validation. The 2023 test set was used for the final evaluation. After hyperparameter selection, the model was fine-tuned on all available training topics.</p>
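        <p>The pairwise re-ranking and enumeration scoring can be sketched as below; prefer() is a hypothetical stand-in for the LLaMA3 yes/no relevance prompt, and the selection-style pass is one simple way to turn pairwise judgments into an ordering:</p>

```python
def pairwise_rerank(query, docs, prefer):
    """Re-rank docs with repeated pairwise comparisons.

    prefer(query, a, b) stands in for the LLaMA3 pairwise prompt: it returns
    True when document a is more relevant than b (the model answers yes/no
    in the paper)."""
    ranked = list(docs)
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if prefer(query, ranked[j], ranked[i]):
                ranked[i], ranked[j] = ranked[j], ranked[i]
    return ranked

def enumeration_scores(ranked_doc_ids):
    # Rank 1 scores 100; each subsequent rank scores one point less.
    return {doc: 100 - i for i, doc in enumerate(ranked_doc_ids)}
```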
        <p>
          In addition to the cross-encoder approach, we also perform a separate retrieval using Elasticsearch with only the query (without the topic text). The results from both methods are then combined using the modified Reciprocal Rank Fusion (MRRF) technique [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] as in Eq. 1, where d is the document, and s and r are the model’s similarity score and rank, respectively. The underlying principle of MRRF is that documents ranked highly by both retrieval methods are likely more relevant than those ranked highly by only one method.
        </p>
        <p>MRRF(d ∈ D) = Σ_{M ∈ R} s_M(d) / (60 + r_M(d))  (1)</p>
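        <p>Assuming each run contributes its similarity score divided by (60 + rank), as in Eq. 1, the MRRF combination can be sketched as:</p>

```python
def mrrf_fuse(runs, k=60):
    """Modified Reciprocal Rank Fusion over several runs.

    runs maps a run name to a list of (doc_id, similarity_score) pairs in
    rank order. Each run contributes score / (k + rank) for a document,
    rather than plain RRF's 1 / (k + rank); k=60 is the usual RRF constant.
    This is our reading of Eq. 1, not the authors' exact code."""
    fused = {}
    for results in runs.values():
        for rank, (doc_id, score) in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + score / (k + rank)
    # highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```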
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Experimental Results and Analysis</title>
        <p>Looking at the LLaMABiEncoder results, for only 10% of topics is the MRR value not 1. The lowest MRR is for the topic ‘G02.C1’, at 0.33 (P@10 of 0.7). For this topic, the query text defined by the organizers is “concerns related to the handling of sensitive information by voice assistants”. The LLaMA3-expanded query, “voice assistants handling sensitive information concerns Apple Siri recordings”, does not seem to add any new useful terms to the original query. The top retrieved result for this topic is an article titled “Poster: A First Look at the Privacy Risks of Voice Assistant Apps.”, assessed as non-relevant. For topics like ‘T11.1’, the original query “character relationship” is expanded to “character relationship network map The Witcher”, helping find more relevant results and leading to an MRR and P@10 of 1.</p>
        <p>Comparing our LLaMA3 re-ranking system, LLaMAReranker2, against LLaMABiEncoder, there is no significant difference between the two systems under a two-sided paired Student’s t-test (p-value=0.05). Interestingly, both models fail to achieve an MRR of 1 on the same topics. For topic ‘G02.C1’, the MRR drops to 0.2 with LLaMAReranker2 (P@10 of 0.3). Investigating the results for this topic, LLaMA3 gave higher ranks to articles that have only titles (abstract missing), such as the article titled “Examining the Use of Voice Assistants: A Value-Focused Thinking Approach”. With the article’s abstract missing, these articles are assessed as non-relevant. Overall, using LLaMA3 for either re-ranking or query expansion showed similar effectiveness, while re-ranking with a bi-encoder proved more efficient.</p>
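        <p>For reference, the per-topic measures quoted above (reciprocal rank and P@10) are computed as in this minimal sketch:</p>

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    # 1/rank of the first relevant document; 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_10(ranked_ids, relevant_ids):
    # Fraction of the top-10 results that are relevant.
    top = ranked_ids[:10]
    return sum(1 for d in top if d in relevant_ids) / 10.0
```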
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Identifying and Explaining Difficult Concepts</title>
      <p>
        This section describes the data for Task 2 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], our proposed models, and evaluation results. We rely on
LLaMA3 and Mistral [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] language models and propose three systems for Subtasks 1 and 2.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Training and Test Data</title>
        <p>
          For Task 2, 576 sentences from 115 documents are provided for training. For these sentences, 2,590 annotated difficult terms are available. Subtask 2.2 leverages a dataset of 501 sentences across 55 documents, containing 2,006 explanations and 1,521 definitions. These documents are selected from abstracts ranked highly for the Task 1 requests. Participants are asked to detect difficult terms, along with their difficulty level, for Subtask 2.1, and to provide definitions and explanations of the detected difficult terms for Subtask 2.2 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed Models</title>
        <p>Our team participated in Task 2 with three proposed systems, based on LLaMA3 and Mistral. Here we describe our models:
• LLaMA: Our first model uses LLaMA3-8B-Instruct, with a system prompt instructing the model to act as a knowledgeable high school student (details in Table 10). This prompt achieved the best performance among those studied on the training data. We process each sentence from the test set using the following user message:</p>
        <p>For the sentence: SENTENCE, what are difficult terms (one to five consecutive terms)? What is the difficulty level? Your output is term or terms: difficulty level (e, m, or d). Do not provide explanation, just give the answer.</p>
        <p>where SENTENCE represents the actual sentence. We specify the output format, as LLaMA can add unnecessary information. After identifying difficult terms, we again utilize LLaMA to generate definitions and explanations. As shown in Table 10, we instruct LLaMA to act as a technician with knowledge of technical terms and request definitions and explanations. The following user message is used for this step:</p>
        <p>You have identified term “TERM” in the sentence: “SENTENCE” as an unclear term. Provide its definition and explain what it is. The output should be like:</p>
        <p>Definition: Give definition here, Explanation: Give explanation here
where TERM represents the term identified earlier and SENTENCE is the sentence it originated from.
• LLaMA Fine-tuned (LLaMAFT): Our second approach is based on prompt engineering and reinforcement learning with human feedback to improve the quality of the outputs generated by the LLaMA model. We designed several models to enhance the feedback loop, ultimately aiming for better results. Our exploration resulted in three distinct models, shown in Figure 1. The models differ in how the user and system messages are sent to LLaMA; Table 11 shows the order of prompts used for each model. Each model mainly follows a two-step process:
– Step 1: After instructing the model with the prompts in Table 11, the user message is built from the sentence and, in some cases, incorporates human-annotated data (output) from the training data. This output represents the desired outcome for the Task, including identified difficult terms and their corresponding definitions and explanations.
– Step 2: Using the result generated by LLaMA in Step 1, and a new user prompt, a second round of results is produced.</p>
        <p>Each model was studied with different combinations of training data and prompts. Through our experiments, Model M3 outperformed the other approaches and was used as our second run.</p>
        <p>• Mistral: Similar to our LLaMA3-based model, our approach with Mistral-7B leverages a system prompt (details in Table 10). This prompt instructs Mistral to identify difficult terms. To achieve this, we process training examples through a series of prompts and responses with Mistral. Figure 2 illustrates the process, in which we present several ground truths to the model. The examples used in the figure come from the training data, and SENTENCE represents the test sentence being analyzed. After detecting the difficult terms, we use a system prompt similar to our first model’s (shown in Table 10) to generate definitions and explanations of the difficult terms with Mistral.</p>
        <p>[Figure 2: an example few-shot exchange with Mistral — system message; “Okay, pass the sentence.”; “CRISPR-Cas is a tool that is widely used for gene editing.” → “CRISPR-Cas; difficulty: d”; “Here is another one: This technique was implemented inside a Personal Digital Assistant (PDA) portable device.” → “technique: e, personal digital assistant: d, portable device: m”; “Now do the same for this sentence: SENTENCE …”]</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Results and Analysis</title>
        <p>The results of our proposed systems on the test set are summarized in Table 4. For each run, the organizers reported:
• Recall of all the terms, independently of the level of difficulty
• Precision of all the terms, independently of the level of difficulty
• Recall of the difficult terms
• Precision of the difficult terms
• BLEU score computed for bigrams</p>
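        <p>A minimal sketch of the term-level precision and recall above (set-based matching; the official scorer may normalize or match terms differently):</p>

```python
def term_precision_recall(predicted_terms, gold_terms):
    """Set-based precision and recall over extracted difficult terms.

    Case-insensitive exact matching is our simplifying assumption."""
    predicted = {t.lower() for t in predicted_terms}
    gold = {t.lower() for t in gold_terms}
    hits = len(predicted.intersection(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```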
        <p>Our proposed Mistral approach provided better results than LLaMA3. As an example, for the sentence “Cryptocurrency was built initially as a possible implementation of digital currency, then various derivatives were created in a variety of fields such as financial transactions, capital management, and even nonmonetary applications.” (sentence ID: G08.1_2972302621_1), Table 5 shows the ground truth and the results generated by Mistral and LLaMA for Subtask 2.1. As can be seen, LLaMA tends to extract fewer terms per sentence, leading to lower recall; however, it identifies the difficulty level more precisely.</p>
        <p>Another interesting aspect of Task 2 is duplicate sentences. The organizers provided repeated sentences to study whether LLMs produce the same results. Our results show that while Mistral mostly produces the same responses, LLaMA3’s responses seem to differ each time. For the short sentence “This is especially true for self-driving vehicles deployed in public transport services.”, LLaMA3 once extracts the terms ‘self-driving’, ‘vehicles’, and ‘public transport’, and the next time extracts ‘self-driving’ and ‘deployed’. Mistral’s extracted terms, however, remained the same.</p>
        <p>Note on LLaMAFT Run: We identified a mistake while submitting this run. Our studies for the different models (M0 to M3) used a two-stage process of first identifying the difficult terms and then generating the definitions. In our submitted model for the test data, we mistakenly used a single prompt for all the Subtasks. Upon correction, including previous related documents and human answers improved the results (Precision: 0.28, Recall: 0.41).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task 3: Simplify Scientific Text</title>
      <p>
        This section describes the data, proposed models, and evaluation results for Task 3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
LLaMA3-8B-Instruct and Mistral were utilized for both Subtasks 3.1 and 3.2.
      </p>
      <p>[Table 6 example — sentence G06.2_855132903_1. Source: “In this paper we present queuing-theoretical methods for the modeling, analysis, and control of autonomous mobility-on-demand (MOD) systems wherein robotic, self-driving vehicles transport customers within an urban environment and rebalance themselves to ensure acceptable quality of service throughout the network.” Simplified: “Queuing models are used for autonomous mobility-on-demand (MOD) systems. A queuing model is constructed so that queue lengths and waiting time can be predicted. In MOD systems, robotic, self-driving vehicles transport customers within an urban environment and rebalance themselves to ensure quality of service.”]</p>
      <sec id="sec-4-1">
        <title>4.1. Topic and Collection</title>
        <p>The training data consists of a collection of parallel text passages (source and simplified versions). These
simplified sentences are directly created from original scientific abstracts in the DBLP Citation Network
Dataset for Computer Science, Google Scholar, and PubMed articles on Health and Medicine (all from
2023). The dataset includes 648 sentences for training and 245 sentences for testing. The simplification
process involved either master’s students in Technical Writing and Translation or a team of a computer
scientist and a professional translator (native English speaker). An example of this source (original) and
target (simplified) sentence pair is provided in Table 6.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Proposed Models</title>
        <p>
          AIIR Lab submitted a total of four runs across Subtasks 3.1 (sentence-level) and 3.2 (abstract-level). Three of the runs utilized a fine-tuned LLaMA3-8B model and one used Mistral with prompt engineering. Our proposed approaches are as follows:
• Prompt Engineering with Instruction-tuned LLaMA3-8B: Our first three runs for this Task utilized LLaMA3-8B, instruction-tuned with the provided training data at both the sentence and abstract levels. We used a 90:10 split for training and validation. For instruction tuning with LLaMA, we used Quantized Low-Rank Adaptation (QLoRA). QLoRA, as shown in Figure 3, is a method that reduces the memory requirements and computational cost of fine-tuning [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The model’s weights are first converted from 16-bit floating-point numbers to 4-bit NormalFloat. These reduced-size weight matrices are then approximated by low-rank matrices, reducing the number of parameters, speeding up computation, and shrinking the data footprint. The 4-bit embeddings then utilize NVIDIA’s unified memory feature, which allows for automatic paging optimization before updating the weights. This paging optimization lets the GPU access CPU RAM directly for page-to-page transfers, preventing the GPU from running out of memory as long as sufficient system memory is available.
        </p>
        <p>During the training process, the data was first run through QLoRA so that the token embeddings could be resized. The hyperparameters were set as follows: an alpha of 32, a dropout of 0.1, a task type of “CAUSAL_LM”, and an R-value of 8. The output data was then fed to LLaMA3-8B with a learning rate of e-4, a paged_adam_32 optimization function, 20 epochs, and a batch size of 8.</p>
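        <p>The reported hyperparameters could be expressed with the HuggingFace peft and transformers APIs roughly as follows; the argument names, the 1e-4 reading of “e-4”, the “paged_adamw_32bit” reading of “paged_adam_32”, and the output path are our assumptions, not the authors’ code:</p>

```python
# Hypothetical reconstruction of the QLoRA setup from the reported
# hyperparameters; a sketch, not the authors' implementation.
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,          # 16-bit weights quantized to 4 bits
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat
)

lora_config = LoraConfig(
    r=8,               # reported R-value
    lora_alpha=32,     # reported alpha
    lora_dropout=0.1,  # reported dropout
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="llama3-simpletext",  # hypothetical path
    learning_rate=1e-4,              # "e-4" in the paper, assumed 1e-4
    num_train_epochs=20,
    per_device_train_batch_size=8,
    optim="paged_adamw_32bit",       # presumed reading of "paged_adam_32"
)
```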
        <p>As shown in Table 6, each entry in the data is paired into source and target values. We passed the training data for LLaMA3 instruction-tuning as:
"Instruction:" + [P] + "Input: " + [S] + "Response: " + [T]</p>
        <p>where prompt (P), for all training samples, was the one used for Run 1 (Table 12). The source (S) and target (T) values would be the output token embeddings from QLoRA. We believe this gives LLaMA3 a better understanding of the linguistic styles in the desired target simplifications. For prompt engineering, we focused on the average FKGL (Flesch-Kincaid Grade Level) score of the provided test sentences and abstracts. The data was passed into our instruction-tuned model and the FKGL score was averaged at the end of each run.
• Mistral (RUN 4): Using Mistral 7B, we used the system prompt as shown in Table 12. We then
used three sample sentences from training data, along with their simplified versions, to provide
examples for Mistral. As our final user message, we passed the test sentence/abstract to Mistral
with the prompt:</p>
        <p>Now do the same for this text, simplify by explaining technical terms or replacing
them with easier words without removing context: TEXT
where TEXT is the input sentence/abstract.</p>
        <p>Note: While submitting this run, we mistakenly evaluated the model only on the training data. Therefore, this run was excluded from the evaluation.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results and Analysis</title>
        <p>
          Task 3 results are evaluated with several metrics, with the SARI [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] score against the human reference simplifications as the main measure. Table 7 shows our results for both Subtasks 3.1 (sentence-level) and 3.2 (abstract-level). While for Subtask 3.1 our team’s runs are ranked second in terms of SARI score, we achieved the highest SARI score for Subtask 3.2 among the participating teams. For the level of text complexity, the FKGL readability measure is used. Compared to the references, our models have high compression ratios and sentence splits, as LLaMA’s outputs are lengthier. An example of this is shown in Table 8, where our simplified version of the original input text is compared against the ground truth for Subtask 3.1.
        </p>
        <p>For Subtask 3.1, all of LLaMA3’s SARI scores fell within ±0.82 of one another. The SARI scores for Subtask 3.2 were similar to Subtask 3.1 in that they varied by a relatively narrow margin of ±1.25. The original sentences have an FKGL of 13-14, corresponding to university-level text, with the reference scores being 8.86 for Subtask 3.1 and 8.91 for Subtask 3.2. Our FKGL results for all runs in both Subtasks fell within the 8.39 to 10.33 range, with our run 1 scores being 0.47 points below the reference FKGL for Subtask 3.1 and 0.16 points above for Subtask 3.2.</p>
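        <p>The FKGL scores above follow the standard Flesch-Kincaid Grade Level formula, 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59, sketched here with a crude syllable heuristic:</p>

```python
def count_syllables(word):
    """Crude vowel-group syllable counter (a heuristic; production FKGL
    tools use better syllabification)."""
    vowels = "aeiouy"
    word = word.lower()
    count = 0
    previous_was_vowel = False
    for ch in word:
        is_vowel = ch in vowels
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    return max(count, 1)

def fkgl(words, sentence_count):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    word_count = len(words)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (word_count / sentence_count)
            + 11.8 * (syllables / word_count) - 15.59)
```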
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>[Table 8 example — Original: “The goal of the MOST project is to develop a novel, inexpensive, easy-to-use digital talking device for blind and visually impaired users based on off-the-shelf handheld computers (Personal Digital Assistant).” Simplified versions: “The goal of the MOST project is to create a new talking device for blind people.”; “The MOST project aims to create a simple, affordable, and easy-to-use digital talking device for blind and visually impaired people using ordinary handheld computers.”; “The goal of the MOST project is to create a simple, affordable, and easy-to-use digital device that can talk to blind and visually impaired people using handheld computers.”; “The MOST project aims to create a simple, affordable, and user-friendly digital talking device for blind and visually impaired people using common handheld computers.”]</p>
      <p>The AIIR lab participated in SimpleText CLEF 2024 lab Tasks 1 to 3, relying on large language models, namely LLaMA3 and Mistral. In Task 1, we submitted five runs leveraging LLaMA for query expansion, TF-IDF for candidate selection, and both bi-encoder and fine-tuned cross-encoder models for re-ranking. We also explored LLaMA for re-ranking within this Task. Our bi-encoder and LLaMA re-ranker models were the most effective systems among the participating teams. For Task 2, we had three runs, using LLaMA and Mistral. Our Mistral-based model provided better effectiveness than LLaMA, with higher recall and precision in detecting difficult terms; the LLaMA model, however, was better at detecting difficulty levels. Finally, for Task 3, we participated in both Subtasks, submitting four runs that employed LLaMA and Mistral. Our LLaMA models achieved high SARI scores for Subtasks 3.1 and 3.2. For future work, we aim to explore large language models further for these Tasks, incorporating techniques such as chain-of-thought prompting to study their effectiveness on the related Tasks.</p>
      <p>[Table 9 system prompts — Query rewriting/expansion: “Being a ranking model your first Task is to do query expansion. For an information need, you will add more context to it. Contextualize the query as best as you can in one or two short sentences, for a given information need and context.” Pair-wise re-ranking: “You are a ranking model for information retrieval. Given a query and two documents, you will say which one is more relevant. If Document 1 is more relevant say yes, otherwise say no.”]</p>
    </sec>
    <sec id="sec-6">
      <title>A. Prompts</title>
      <p>This section shows the prompts used in the SimpleText lab for the Tasks we participated in. For query rewriting/expansion and re-ranking, we used the system prompts shown in Table 9 with LLaMA3. For Task 2, Table 10 shows the system prompts that we used for Subtasks 1 and 2. Table 11 shows our prompts for fine-tuning LLaMA for Task 2. Finally, Table 12 shows our prompts for Task 3.</p>
      <p>Prompts
Instruction: Extract complex words from sentence, generate only one definition for each complex
word.</p>
      <p>System: Human answer to find complex term is {_}, Human definition
{_ }, Human positive definition {_}, Human negative definition
{_}.</p>
      <p>User: {} {instruction}
User: {} {instruction}
Instruction: Extract complex words from sentence, Do not generate long text.</p>
      <p>User: {}
System: Human answer to find complex term is {_}, Human definition
{_ }.</p>
      <p>User: {} {instruction}
Instruction: Extract difficult words from sentence, Do not generate long text.</p>
      <p>User: { }
System: Human answer to find complex term is {_} with difficulty {  } .
User: {} {instruction}
Instruction: Extract complex words from sentence, Do not generate long text.</p>
      <p>System: Human answer to find complex term is {_} .</p>
      <p>User: {} {instruction}
User: {} {instruction}
Instruction: Extract complex words from sentence, and label difficulty of word with one of ’e’ means
easy, ’m’ means medium, ’d’ means difficult, and then generate a definition for each complex word
based on sentence, generate an explanation for each complex word.</p>
      <p>System: Human answer to find complex term is {_} , and difficulty
{  _} , Human definition {_ } , Human good definition {_} ,
Human wrong definition {_} .</p>
      <p>User: {} {instruction}</p>
      <p>User: {Test Sentence} {instruction}</p>
      <p>Prompt
Simplify this text for English speaking science students in college. Maximize the use
of simple words and short sentences, but include keywords from the original text.</p>
      <p>Optimize the output ROUGE, SARI, and BLEU scores
You are a skilled editor, known for your ability to simplify complex text while
preserving its meaning. You have a strong understanding of readability principles and
how to apply them to improve text comprehension.</p>
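<p>The prompt above asks the model to optimize ROUGE, SARI, and BLEU. As a rough illustration of what such surface-overlap metrics measure (this is not the lab's official evaluation code), a minimal unigram-recall score in the spirit of ROUGE-1 can be sketched as:</p>

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall on whitespace tokens: the fraction of reference
    unigrams (with clipped counts) that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

<p>Full ROUGE, SARI, and BLEU add n-gram matching, precision terms, and (for SARI) explicit credit for keep/add/delete edit operations against both the source and the reference, but the overlap idea is the same.</p>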
      <p>Simplify the following scientific text for an average American citizen. Keep, but
define, any keywords and subjects with less complex words and phrases.</p>
      <p>You are a skilled editor, known for your ability to simplify complex text while
preserving it. You explain the technical terms, defining what they are (e.g., terms like
Blockchain, Cryptojacking, all abbreviations), without removing sentences or
summarizing them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of CLEF 2024 SimpleText track on improving access to scientific texts</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          , arXiv preprint arXiv:2310.06825 (
          <year>2023</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>ArnetMiner: Extraction and Mining of Academic Social Networks</article-title>
          ,
          <source>in: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          , et al.,
          <article-title>Context Aware Query Rewriting for Text Rankers using LLM</article-title>
          , arXiv preprint arXiv:2308.16753 (
          <year>2023</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative Experimentation in Information Retrieval using PyTerrier</article-title>
          ,
          <source>in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Durgin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fletcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>AIIR and LIAAD Labs Systems for CLEF 2023 SimpleText</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jagerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: NAACL 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>DPRL Systems in the CLEF 2022 ARQMath Lab: Introducing MathAMR for Math-Aware Search</article-title>
          ,
          ,
          <source>Proc. CLEF 2022 (CEUR Working Notes)</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>QLoRA: Efficient Finetuning of Quantized LLMs</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Optimizing Statistical Machine Translation for Text Simplification</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          , volume
          <volume>4</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>