<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UBONLP Report on the SimpleText Track at CLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Vendeville</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre De Loor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <institution>Université de Bretagne Occidentale / HCTI</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>ENIB / Lab-STICC (UMR CNRS 6285)</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article presents the UBONLP team's participation in the SimpleText lab of CLEF 2024 in Task 1, "Selecting passages to include in a simplified summary", Task 2, "Difficult concept identification and explanation", and Task 3, "Given a query, simplify passages from scientific abstracts". Our goal is to use recent advances in natural language processing to help the public better understand scientific information. In Task 1 we show a method using TF_IDF and a neural reranker to retrieve scientific texts. In Task 2 we use a non-fine-tuned Phi3 mini to extract complicated terms. In Task 3 we use an LLM pipeline with separate syntactic and lexical simplifications.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM</kwd>
        <kwd>Ranking</kwd>
        <kwd>information retrieval</kwd>
        <kwd>Neural reranking</kwd>
        <kwd>Term difficulty</kwd>
        <kwd>Automatic text simplification</kwd>
        <kwd>Science popularization</kwd>
        <kwd>Lexical simplification</kwd>
        <kwd>Syntactic simplification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The SimpleText track at CLEF 2024 proposed the following tasks:
• Task 1: Selecting passages to include in a simplified summary
• Task 2: Difficult concept identification and explanation
• Task 3: Simplification of scientific texts
– Subtask 3.1: Sentence-level simplification
– Subtask 3.2: Abstract-level simplification
• Task 4: SOTA: Tracking the state-of-the-art in scholarly publications.</p>
      <p>
        We participated in Tasks 1, 2 (subtask 1), and 3 (subtasks 1 and 2). For Task 1 we use PyTerrier
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to index documents, TF_IDF to rank them, and MonoT5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to rerank the top results. For Task 2
we used Phi3 mini [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an LLM, to extract and score complex terms in a one-shot prompt context [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
using no fine-tuning. For Task 3 we used Phi3 mini in a pipeline that separated syntactic and lexical
simplifications. Again, the model was not fine-tuned and used a one-shot prompt. We further tested
this method on the labeled training data.
      </p>
      <p>We will first present our method and results for Task 1. Then we will present the method, prompts,
and results for Task 2. In Section 4 we will present the method for Task 3 and study the results in detail.
We will see that our method for Task 3 can produce some results when separating lexical and syntactic
simplification.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Passage Selection for a Simplified Summary</title>
      <p>In this task, participants were provided with a dataset of abstracts with their metadata (author names,
title, year of publication, etc.). Participants were also provided with a set of references for training, and a
test dataset of queries. Task 1 consists of retrieving, for each query, the 100 most relevant documents.</p>
      <p>
        For Task 1, we first used PyTerrier (https://pyterrier.readthedocs.io/en/latest/) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a framework for creating information retrieval pipelines, to
index all documents. We wanted to use an LLM to rank abstracts, but the number of initial documents
was too great to practically run any model. Instead, we used TF_IDF to first rank all documents based
on their abstracts and titles and kept the 4000 most relevant documents. Then we used the MonoT5
reranker [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ] provided by PyTerrier to rerank all extracted documents and kept the 100 best.
      </p>
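      <p>As a rough illustration of this retrieve-then-rerank pattern, the following self-contained sketch first ranks documents with a simple TF-IDF score and then reranks the top candidates with a pluggable scoring function. The function names are illustrative, and the placeholder rerank_fn stands in for the neural MonoT5 reranker; the actual run used PyTerrier's TF_IDF weighting model rather than this hand-rolled scorer:
```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency of each token
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(sum(
            tf[t] * math.log(n / df[t])
            for t in query.lower().split() if t in tf
        ))
    return scores

def retrieve_then_rerank(query, docs, rerank_fn, k_first=4000, k_final=100):
    """First-stage TF-IDF ranking, then rerank the top k_first candidates
    with a second (e.g. neural) scorer and keep the k_final best."""
    scores = tfidf_scores(query, docs)
    first = sorted(range(len(docs)), key=lambda i: -scores[i])[:k_first]
    reranked = sorted(first, key=lambda i: -rerank_fn(query, docs[i]))
    return reranked[:k_final]
```
The two-stage design keeps the expensive reranker off the full collection: only the k_first TF-IDF survivors are rescored.</p>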
      <sec id="sec-2-1">
        <title>2.1. Metrics</title>
        <p>
          To measure the quality of the retrieved rankings, we use the following standard information retrieval metrics:
• MRR: The Mean Reciprocal Rank is a metric used to evaluate the performance of search engines,
recommendation systems, and other information retrieval systems. It measures the average rank
at which the first relevant item is found in the search results. The results vary from 0 to 1, with 1
being a perfect score, where relevant items appear at the top position for all queries.
• Prec10: Precision 10 measures the proportion of relevant items among the top 10 results returned
by the system. The value ranges from 0 to 1, with 1 being a perfect score where all of the top 10
results are relevant and 0 meaning no relevant results among the top 10.
• Prec20: Precision 20 is analogous to Prec10, but focuses instead on the top 20 results returned by
the system. The value ranges from 0 to 1, with 1 being a perfect score where all of the top 20
results are relevant and 0 meaning no relevant results among the top 20.
• NDCG10: The Normalized Discounted Cumulative Gain 10 metric is based on a normalization of
the Discounted Cumulative Gain, which gives a score based on the relevance of every result in
the top 10, weighted by their position. The values range from 0 to 1, with 1 being a perfect score
where the most relevant results appear at the top of the top 10 results, and 0 meaning no relevant
results among the top 10.
• NDCG20: The metric is the same as NDCG10 but focuses on the top 20. The values range from
0 to 1, with 1 being a perfect score where the most relevant results appear at the top of the top 20
results, and 0 meaning no relevant results among the top 20.
• Bpref: The Binary Preference metric is designed to handle situations where not all documents
have been judged for relevance. It measures the fraction of relevant documents ranked higher
than non-relevant documents, considering only judged documents. The values range from 0 to 1,
with 1 being a perfect score where all relevant documents rank higher than non-relevant results,
and 0 meaning no relevant documents rank higher than non-relevant results.
• MAP: The Mean Average Precision is a commonly used metric in information retrieval and
machine learning for evaluating the performance of ranking systems. It is the mean of the average
precision scores over a set of queries. The values range from 0 to 1, with 1 being a perfect score
where all relevant results are retrieved for each query, and 0 meaning no relevant results are
retrieved for any query.
        </p>
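        <p>As an illustration, MRR can be computed as follows. This is a generic sketch, not the official evaluation script; rankings maps each query id to its ranked document list, and relevant maps each query id to its set of judged-relevant documents:
```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries: average of 1/rank of the first relevant document
    (queries with no relevant document retrieved contribute 0)."""
    total = 0.0
    for qid, ranked_docs in rankings.items():
        for rank, doc in enumerate(ranked_docs, 1):
            if doc in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)
```
</p>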
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Results</title>
        <p>The run results, named UBO_Task1_TFIDFT5, can be found in Table 1. We observe that our method has low
precision, as indicated by the Prec10, Prec20 and MAP scores, but average results on the other metrics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Difficult Concept Identification and Explanation</title>
      <p>This Task is divided into three subtasks:
• Task 2.1: To identify the terms in a passage of a document and rate their difficulty as e, m,
or d (Easy/Medium/Difficult)
• Task 2.2: To generate a definition and an explanation only for the difficult terms
• Task 2.3: To retrieve the provided definitions of the difficult terms and rank them in the “correct”
order: manual (2, ground truth), generated positive 1 (1, correct definitions), generated positive 2
(1, correct definitions), generated negative 1 (0, incorrect definitions), generated negative 2 (0,
incorrect definitions).</p>
      <p>
        We participated in Task 2.1. For this subtask, participants were provided with a test dataset consisting
of sentences extracted from scientific documents. Participants were asked to, for each sentence, extract
complicated terms and rate their complexity as easy, medium, or difficult. Participants were also
provided with a training dataset consisting of another set of scientific texts with the corresponding
extracted terms, rated by difficulty. For this Task, we chose to use Phi3 mini [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a Small Language
Model optimized for following instructions. For models under 13 billion parameters, it showed
state-of-the-art performance on language understanding, mathematics, coding, long-term context, and logical
reasoning. We used it without fine-tuning, with a one-shot prompt, as follows.
      </p>
      <p>
        Table 2 shows the prompt used for Task 2.1. We decided to emphasize the importance of the format in
the query to improve the results’ interpretation. Additionally, we decided to prompt for complexity on
the [1, 2, 3] scale (1-Easy, 2-Medium, 3-Difficult) instead of the mandated [e, m, d] scale because it showed
improved performance in our manual tests. After generation we converted the generated results back
to the original scale using regular expressions.
      </p>
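      <p>The conversion back to the original scale can be sketched as follows. The exact regular expressions we used are not reproduced here, so this is an illustrative minimal version that also tolerates stray spaces and quotes around the model's answer:
```python
import re

SCALE = {"1": "e", "2": "m", "3": "d"}

def normalize_rating(raw):
    """Map a generated numeric difficulty (1/2/3) back to the mandated
    e/m/d scale, ignoring surrounding spaces, quotes, or other noise."""
    m = re.search(r"[123]", raw)
    return SCALE[m.group()] if m else None
```
</p>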
      <p>After the inference, we had a number of problems to solve on the generated data, with examples
shown in Table 3:
• Over-generations, with extra text after the JSON-like answer, such as:
}
&lt;|end|&gt;
&lt;|query|&gt; «input» &lt;|answer|&gt;
• Malformed double quotes in the JSON-like answer
– We fixed the missing double quotes with a regex and removed the duplicate double quotes
with a series of ".replace" methods
• Unneeded spaces in ratings
– We fixed this using a regex
• Converting the rating scale from [1, 2, 3] to [e, m, d]</p>
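      <p>The cleanup steps above can be sketched as a single post-processing function. The patterns below are illustrative assumptions, not the exact expressions used in the run: we truncate over-generation at the end of the JSON-like answer, collapse duplicated double quotes, and strip spaces around numeric ratings:
```python
import re

def clean_generation(raw):
    """Illustrative post-processing of a model answer: truncate anything
    after the closing brace, deduplicate double quotes, and remove
    spaces around 1/2/3 ratings."""
    # keep only the text up to the first closing brace of the JSON-like answer
    raw = raw.split("}", 1)[0] + "}"
    # collapse duplicated double quotes
    raw = raw.replace('""', '"')
    # remove spaces around ratings, e.g. ': 2 ' becomes ':2'
    raw = re.sub(r':\s*([123])\s*', r':\1', raw)
    return raw
```
</p>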
      <sec id="sec-3-1">
        <title>3.1. Metrics</title>
        <p>The results were evaluated using the following metrics:
• Recall Overall: the proportion of terms that were found, independently of the difficulty. The
results vary from 0 to 1, with 1 being a perfect score, where all expected terms were found.
• Recall Average: the average recall of terms when computed for each sentence. The results vary
from 0 to 1, with 1 being a perfect score, where all expected terms were found.
• Recall Difficult: the proportion of difficult terms that were found. The results vary from 0 to 1,
with 1 being a perfect score, where all expected difficult terms were found.
• Precision Difficult: the proportion of terms labeled as difficult that were indeed expected. The
results vary from 0 to 1, with 1 being a perfect score, where all terms labeled as difficult were
expected.
• bleu_nx: the BLEU score computed with n-grams n = 1, 2, 3, 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results</title>
        <p>The results for Task 2.1 can be found in Table 4. We can observe good scores on the recall-based metrics
(Recall Overall, Recall Average, and Recall Difficult), but our score gets much worse on the
precision-based metric Precision Difficult. This would indicate that our method had a tendency to
generate too many terms.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task 3: Simplification of Scientific Texts</title>
      <p>In this Task, participants were asked to simplify scientific texts. It was divided into two subtasks:
• Task 3.1 focused on simplifying sentences. Participants were provided the following data:
– For training: 893 sentences with their manually written references.
– For testing: 578 sentences.
• Task 3.2 focused on whole abstracts. Participants were provided the following data:
– For training: 175 abstracts with their manually written references.
– For testing: 103 abstracts.</p>
      <sec id="sec-4-1">
        <title>Background</title>
        <p>Participants needed to provide the generated simplifications for both test subtasks.</p>
        <p>
          The literature divides simplification into two categories: lexical simplicity and syntactic simplicity [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
Lexical simplicity relates to the complexity of terms, while syntactic simplicity refers to the structure of
the sentence. Current neural methods, while aware of this distinction, do not explicitly provide lexicon-specific
simplification or syntax-specific simplification [
          <xref ref-type="bibr" rid="ref9 ref10">9, 10</xref>
          ]. An exception can be made for models trying to
simplify single words rather than entire texts [11], which only focus on lexical simplicity.
        </p>
        <p>
          Recently, Large Language Models have proven very effective at a variety of natural language
processing tasks [
          <xref ref-type="bibr" rid="ref5 ref12">5, 12</xref>
          ], including, to a lesser degree, text simplification [11]. One part of this success is the
use of carefully selected prompts to improve accuracy [10]. Another is the use of pipelines chaining
LLMs to take advantage of models specialized in a part of the task at hand. LLM chaining implies
dividing a task into multiple subtasks, defining a distinct LLM for each step, and using the output from
one LLM as an input to the next [13].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Research Questions</title>
        <p>In this task, we aimed to answer the following questions:
1. Can an LLM generate a proper lexicon-specific or syntax-specific simplification?
2. If so, is it interesting to successively perform lexical and syntactic simplification? Does the order
matter?
3. If we successively perform simplifications, is it relevant to simplify the syntax multiple times? Or
the lexicon?</p>
      </sec>
      <sec id="sec-4-3">
        <title>Approach</title>
        <p>
          We aim to study question 1 by building two systems: one for performing syntax-specific simplification
and one for performing lexicon-specific simplification. For question 2 we will successively perform syntactic
and lexical simplification. We will test both the “syntax-lexical” and “lexical-syntax” orders. Finally, to
answer the last question, we will extend testing with more successive simplifications. We will test those
runs using metrics such as FKGL, BLEU, SARI and other metrics provided by EASSE [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as detailed in
the next section.
        </p>
        <sec id="sec-4-3-1">
          <title>4.1. Methodology</title>
          <p>We want to study the impact of chaining the generations. For that, we generate text using one prompt
and use the generated text as the input for the subsequent generation. This way, every generation is in
a separate context.</p>
          <p>We have two stages, lexical simplification and syntactic simplification, which we abbreviate
as l and s respectively. We generated and submitted two runs for the task: s (syntactic
simplification) and sl (syntactic simplification then lexical simplification).</p>
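          <p>The chaining described above can be sketched as follows, with each stage run in a fresh context and its output fed to the next. The generate function is a placeholder for a Phi3 mini call with the stage-specific prompt:
```python
def chain_stages(text, stages, generate):
    """Run simplification stages in sequence, e.g. stages=["s", "l"],
    feeding each stage's output to the next generation call."""
    for stage in stages:
        text = generate(stage, text)  # fresh context per stage
    return text
```
A run such as sl is then just chain_stages(text, ["s", "l"], generate), and the later slsl and lsls benchmark paths extend the stage list.</p>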
          <p>
            We decided to apply those strategies with Phi3 mini [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. The small size of the model allowed us to
efficiently perform the successive inferences. Additionally, the model is intended for reasoning tasks,
which we believed would benefit the prompts we chose. We decided to test the model in a one-shot
prompt context [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], using no fine-tuning.
          </p>
          <p>We created a prompt for each one of the stages. We used queries that give an explanation of the task
followed by a single example. The prompts can be found in Table 5.</p>
          <p>For the syntax simplification stage, we try to focus the model on sentence splitting, something that
simplification models usually struggle with. Based on manual tests, we found that the best prompts do
not mention simplification and instead describe the transformations needed for simplification. Telling
the model to focus on listing the "smallest logic proposition" offered convincing results, with proper
format. Since models are usually conservative in sentence splitting, we chose an example (taken from
the abstract of [14]) that was manually simplified with excessive insistence on sentence splitting. In our
manual tests, this insistence made the models generate reasonable sentence splitting.</p>
          <p>For the lexical simplification stage, we found that talking about “difficult words” gave better results
than “complicated terms”; this may be due to the added complexity of identifying a term [15]. For the
example, we used sentences from different documents [16] that contained complicated, domain-specific
language.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.2. Metrics</title>
            <p>To evaluate runs, we use the following metrics:
• FKGL: The Flesch-Kincaid Grade Level [17] is a readability test designed to indicate how difficult
a passage of English text is to understand. It uses the average sentence length and the average
number of syllables per word. It provides a grade-level score that corresponds to the U.S. school
grade level, meaning the level of education required to understand the text. Higher means more
complex, with a theoretical lower bound of -3.40 and no upper bound.
• BLEU: The Bilingual Evaluation Understudy [18] metric is a method for evaluating the quality
of machine-translated text by comparing it to one or more reference translations. It compares
the n-grams in common between the reference and the generation. In simplification, it is used
by considering the task as a translation from “normal English” to “simple English”, considered a
different language. The score ranges from 0 to 1, 1 being a perfect score.
• SARI: The System output Against References and against the Input [19] metric is a text evaluation
metric specifically designed for assessing the quality of text simplification systems. It is calculated
based on the number of operations (addition, deletion, keep) needed to go from the input to the
generation, compared to a reference. The score ranges from 0 to 100, 100 being a perfect score.
• Compression ratio: The compression of the generated output compared to the reference.
Computed by taking the number of tokens present in both the generated output and the reference,
and comparing that to their total number of tokens. A higher score means the generation is more
compressed.
• Sentence splits: The number of sentence splits performed during generation. Higher means
more splits.
• Levenshtein similarity: The Levenshtein similarity metric is a measure of the similarity
between two strings. It quantifies the minimum number of single-character edits (insertions,
deletions, or substitutions) required to change one string into the other. In our case, we compare
the input and the generation. A higher score means a higher similarity.
• Exact copies: The number of generated sentences that are exact copies of the input.
• Additions proportion: The proportion of added words in the generation.
• Deletions proportion: The proportion of words deleted in the generation.
• Lexical complexity score: The lexical complexity is computed by taking the log-ranks of each
word in the frequency table and aggregating those words by their third quartile [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].</p>
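            <p>As an illustration of the Levenshtein similarity, a common normalization (which may differ in detail from the one EASSE implements) divides the edit distance by the length of the longer string:
```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity between two strings:
    1 - edit_distance(a, b) / max(len(a), len(b))."""
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    if not a and not b:
        return 1.0
    return 1 - prev[-1] / max(len(a), len(b))
```
</p>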
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3. Results</title>
          <p>Results for the submitted runs can be found in Table 6 for Task 3.1 and in Table 7 for Task 3.2. Full
results with all participants can be found in the appendix in Tables 12 and 13. We see good results
on SARI and FKGL, although results are very poor on BLEU. Our method also generates many more
sentence splits than other participants’ while having a smaller Levenshtein similarity.</p>
          <p>We wanted to further test our method. For that, we ran a benchmark using the labeled training data
to generate simplifications. This time we studied two “paths” for a generation: lsls and slsl.</p>
          <p>Once processed, we found very questionable scores, including over 45 sentence splits on average and
FKGL scores under 2. We filtered out some of these hallucinations by doing the following steps on each
path:
• Removing null or empty generations.
• Removing generations with prompt tokens like “&lt;|answer|&gt;” or “&lt;|query|&gt;”.
• Removing generations with repeating sentences.</p>
          <p>– ex: The advancements in AI technologies have led to [...] improved outcomes. &lt;|query|&gt; The
recent advancements in renewable [...]
– ex: There are recent developments [...] 2. The Turing Test, proposed by Alan Turing, is a
measure of [...] 3. Information provided by whistleblower Edward Snowden [...] 6. The Turing
Test, proposed by Alan Turing, is a measure of [...] 7. Information provided by whistleblower
Edward Snowden [...]
• Removing generations that did not contain alphabetical characters.</p>
          <p>– ex: 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 [...] 235.5 236
• Removing generations that had over 6 times as many characters as the source sentence.</p>
          <p>[Figure 1: Metric scores through the stages, with one panel per data type and path (snt lsls, snt slsl,
abs lsls, abs slsl). Metrics plotted: FKGL, BLEU, SARI, compression ratio, sentence splits, Levenshtein
similarity, exact copies, additions proportion, deletions proportion, and lexical complexity score.]</p>
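          <p>The filtering steps above can be sketched as a single predicate. The thresholds come from the text; the sentence-splitting and token-matching details are illustrative assumptions (the paper's prompt tokens are written with angle brackets, e.g. &lt;|answer|&gt;, and here we match on their inner part):
```python
import re

def keep_generation(src, gen):
    """Return True if a generated simplification of src passes the
    hallucination filters: non-empty, no leaked prompt tokens, no
    repeated sentences, contains letters, and under 6x source length."""
    if not gen or not gen.strip():
        return False
    if "|answer|" in gen or "|query|" in gen:   # leaked prompt tokens
        return False
    sentences = [s.strip(" .!?") for s in re.split(r"[.!?]\s+", gen) if s.strip(" .!?")]
    if len(sentences) != len(set(sentences)):   # repeating sentences
        return False
    if not re.search(r"[A-Za-z]", gen):         # no alphabetical characters
        return False
    if len(gen) > 6 * len(src):                 # runaway over-generation
        return False
    return True
```
</p>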
        </sec>
        <sec id="sec-4-3-4">
          <title>4.4. Scores through stages</title>
          <p>Table 8 lists all metric scores on the benchmark, and Figure 1 shows their evolution through the stages.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Analysis</title>
        <p>Generation examples can be found in the annex.</p>
        <p>Across all metrics and both data types (sentences and abstracts), we cannot directly see a general trend.
In Figure 2 we can compare the metrics on different stages and paths. First, we can see, as expected, that
the syntactic simplification stages always increase the number of sentence splits and the compression
ratio; however, we can see much higher results for sentences. On the sentence level, there is a noticeably
higher proportion of deletions but a much smaller number of additions.</p>
        <p>For the lexical simplification stages, we can see, as expected, a much lower initial score on compression
and sentence splitting. The lexical simplification stages also show a lower score on compression and
splitting than the previous syntactic simplification stage. On sentences, the l stage shows a higher
proportion of deletions than the s stage. The proportion of additions (comparable to the s stage) is still
higher than deletions, but by a smaller margin. On abstracts however, we see the opposite: like the s
stage, we see a higher proportion of deletions over additions, but, like sentences, the difference is smaller
for l than s.</p>
        <p>Figure 3 shows the scores of every stage of simplification for the FKGL, BLEU, SARI, and lexical
complexity metrics. These metrics provide less information about the generation, but are a better
(though imperfect [20]) evaluation of the simplicity of a text.</p>
        <p>First, we see that for sentence-level inference, BLEU often performs worse on syntactic simplification than on SL.
Unsurprisingly, FKGL shows a better performance on syntactic simplification than lexical simplification.</p>
        <p>Surprisingly, though, the lexical complexity score does not seem to change noticeably through the
stages, no matter the type of simplification. There is only a slight advantage for syntactic simplification
over SL on the first stage, which is unexpected. With the exception of the lexical complexity score, all of
these metrics perform much better on sentence-level inference than abstract-level. SARI shows a clear
preference towards syntactic simplification, but that difference decreases, especially for sentence-level
inference.</p>
        <p>Figure 4 shows the relative evolution of the metrics through the stages. For the compression ratio,
Levenshtein similarity, and additions and deletions proportions, we can see a general trend. While the
second stage sees large deltas, starting from the third stage we can see a convergence of the metrics.
Again, this result, while significant, is less strong when looking at abstract-level inference. We can
also observe that the result evolution is very similar for both the slsl and lsls paths. However, the paths
do not show a convergence on compression ratio and sentence splits until the fourth stage.</p>
        <p>When looking at the evolution (Figure 5) we do not see a strong general trend. The BLEU scores of
the paths seem to converge, but only on sentences and the slsl path, and the reason is that the score is close to its
minimum. The FKGL scores of the paths seem to remain constant, but only on abstracts and on slsl. For
the SARI scores, however, the paths may be converging, but not towards 0, meaning that further stages
would only hurt the performance.</p>
        <p>From these results, we can deduce multiple things. First, the fact that at each syntactic simplification
stage the number of sentence splits and the compression ratio increase indicates that this stage
reduces the number of unnecessary tokens and represents the facts in a more discrete way by generating
fewer tokens per sentence. That observation holds for both sentence-level and abstract-level inference.
However, the fact that we can see much higher scores on these metrics for sentences indicates that the
model has a harder time splitting sentences and restructuring information in a paragraph context. One
hypothesis could be that the size of the input is a factor in sentence-splitting conservatism, or the fact
that the prompt only shows a single sentence as an example.</p>
        <p>In the end, for sentence splits and Levenshtein similarity, those results show that, for the first stage,
some metrics favor syntactic simplification while others favor lexical simplification. Combined with
the fact that the scores at the last stage are similar for both paths on sentences, we argue that stacking
more than three stages yields only small gains on these metrics at the sentence level.</p>
        <p>Overall, for BLEU, FKGL, and SARI, these results tend to show that stacking inferences does not
necessarily lead to better scores.</p>
        <sec id="sec-4-4-1">
          <title>4.5. Discussion</title>
          <p>The results have shown that LLMs can generate lexicon-specific or syntax-specific simplifications that
score higher on metrics fitted more for that specific type of simplification. Stacking stages can lead to
improvements on certain metrics, while on others it may be detrimental. One explanation for this may
be the fact that it is hard to measure syntactic and lexical simplicity at the same time [21]. Additionally,
the order does matter for some metrics. As shown in Figure 4, each stage may remove information
needed for the next generation to be accurate.</p>
          <p>We also made the choice to study generations alternating between syntactic and lexical simplification,
but it would be interesting to study how models behave when successively generating only syntactic or
only lexical simplifications.</p>
          <p>All of this shows some limitations in our work; further research would be needed to draw stronger
conclusions. In particular, we think that these shortcomings could be addressed by a larger model or
one that was fine-tuned on simplification data. Additionally, we did not study the effect of multiple
prompts. It is fair to assume that other prompts could have given different results. Perhaps our syntactic
simplification prompt was better at syntactic simplification than our lexical simplification prompt was at
lexical simplification; such a case would change our conclusions on the differences between paths or
stages.</p>
          <p>One important question we did not look at was information distortion. Stacking generations carries a
high risk of compounding hallucinations. In the same way, some important information
may be lost at each stage without any way to recover it at later stages.</p>
          <p>One final limitation would be the metrics used. These metrics are not fit to identify hallucinations [22],
so we cannot assess the degree and evolution of information distortion through the stages. Moreover,
these standard metrics are not much correlated with human judgments of simplification [20].
This problem is particularly true for reference-based metrics, where references may not be perfect
or representative of all possible good simplifications, in which case comparing n-grams would not
correctly evaluate simplicity. To really measure the quality of generation, we would need a better
metric.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented our participation in Tasks 1, 2, and 3 of the SimpleText track at CLEF 2024.
For Task 1 we used a ranker combined with a neural reranker. For Task 2 we used a small language
model in a few-shot, non-fine-tuned setting. Task 3 is covered in more detail: we again used a small
language model in a few-shot, non-fine-tuned setting, but focused on separating the syntactic and lexical
aspects of simplification, which showed good results. We also studied the impact of stacking multiple
simplifications, with mixed results. Future work should focus on better prompting and fine-tuned
models.</p>
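      <p>A minimal sketch of how such stage paths compose (the prompts and the `call_llm` helper are illustrative placeholders standing in for our few-shot Phi-3 calls, not the exact prompts we used):</p>

```python
SYNTACTIC_PROMPT = "Split this text into the smallest logic propositions: "
LEXICAL_PROMPT = "Replace complicated words with simpler synonyms: "

def call_llm(prompt: str) -> str:
    # Placeholder for the few-shot small-LM call; here it just echoes the input text.
    return prompt.split(": ", 1)[1]

def simplify(text: str, path: str) -> str:
    """Apply the stages named by `path`, e.g. "sls" = syntactic, lexical, syntactic."""
    for stage in path:
        prompt = SYNTACTIC_PROMPT if stage == "s" else LEXICAL_PROMPT
        text = call_llm(prompt + text)
    return text

# The eight stage paths shown in the appendix examples:
paths = ["s", "sl", "sls", "slsl", "l", "ls", "lsl", "lsls"]
outputs = {p: simplify("Autonomous vehicles are the focus of research.", p) for p in paths}
```

Chaining the calls this way keeps each prompt focused on one aspect of simplification, at the cost of one model call per stage.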
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was funded, in whole or in part, by the French National Research Agency (ANR) under
the project ANR-22-CE23-0019-0.</p>
      <p>of the Association for Computational Linguistics: ACL 2023, 2023, pp. 1147–1158. doi:10.18653/
v1/2023.findings-acl.74. arXiv:2305.12908.
[10] K. North, T. Ranasinghe, M. Shardlow, M. Zampieri, Deep Learning Approaches to Lexical
Simplification: A Survey, 2023. doi:10.48550/arXiv.2305.12000. arXiv:2305.12000.
[11] R. Sun, W. Xu, X. Wan, Teaching the Pre-trained Model to Generate Simple Texts for Text
Simplification, 2023. doi:10.48550/arXiv.2305.12463. arXiv:2305.12463.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes,
J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan,
M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril,
J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton,
J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M.
Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and
Fine-Tuned Chat Models, 2023. doi:10.48550/arXiv.2307.09288. arXiv:2307.09288.
[13] T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, C. J. Cai, PromptChainer: Chaining
Large Language Model Prompts through Visual Programming, 2022. doi:10.48550/arXiv.2203.
06566. arXiv:2203.06566.
[14] D. Jones, Intelligence and the Management of National Security, Intelligence &amp; National Security
(2016).
[15] J. Giguere, Leveraging Large Language Models to Extract Terminology, in: R. L. Gutiérrez,
A. Pareja, R. Mitkov (Eds.), Proceedings of the First Workshop on NLP Tools and Resources for
Translation and Interpreting Applications, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria,
2023, pp. 57–60.
[16] A. Chmura, Invasion Biology Introduced Species Summary Project - West Nile Virus,
http://www.columbia.edu/itc/cerc/danoff-burg/invasion_bio/inv_spp_summ/WestNile.html, 2.
[17] J. P. Kincaid, R. P. Fishburne Jr., R. L. Rogers, B. S. Chissom, Derivation of New Readability
Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy
Enlisted Personnel, Technical Report, Defense Technical Information Center, Fort Belvoir, VA,
1975. doi:10.21236/ADA006655.
[18] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: A Method for Automatic Evaluation of Machine
Translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[19] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine
Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4
(2016) 401–415. doi:10.1162/tacl_a_00107.
[20] F. Alva-Manchego, C. Scarton, L. Specia, The (Un)Suitability of Automatic Evaluation Metrics for
Text Simplification, Computational Linguistics 47 (2021) 861–889. doi:10.1162/coli_a_00418.
[21] L. Cripwell, J. Legrand, C. Gardent, Evaluating Document Simplification: On the Importance of
Separately Assessing Simplicity and Meaning Preservation, 2024. doi:10.48550/arXiv.2404.03278.
arXiv:2404.03278.
[22] L. Ermakova, S. Bertin, H. McCombie, J. Kamps, Overview of the CLEF 2023 SimpleText Task 3:
Simplification of Scientific Texts (2023).</p>
      <sec id="sec-6-1">
        <title>Generation example</title>
        <p>Source sentence: In the modern era of automation and robotics, autonomous vehicles are currently
the focus of academic and industrial research.</p>
        <p>Reference: Current academic and industrial research is interested in autonomous vehicles.</p>
        <p>Generations for stage paths s, sl, sls, slsl, l, ls, lsl, lsls:</p>
        <p>Autonomous vehicles are being focused on in research. The research is both academic and
industrial. It’s part of the modern era of automation and robotics.</p>
        <p>Self-driving cars are a big topic right now. People study this for school and work, too. This fits
with today’s trend towards machines doing things on their own.</p>
        <p>Self-driving cars are a topic. People study them. They do it for school and work. It fits with the
trend of machines doing things themselves.</p>
        <p>Self-driving cars are a subject. People research them. They do it for school, work, or both. It
matches the pattern of machines performing tasks on their own.</p>
        <p>In today’s world of machines doing things on their own, self-driving cars are a big topic for study
and work in both schools and companies.</p>
        <p>Self-driving cars are being studied and worked on. They exist in the modern world. Machines do
tasks independently. Schools and companies are involved in studying them.</p>
        <p>Self-driving cars are being researched and developed. They are part of today’s technology.
Computers can perform actions without human help. Universities and businesses are working on this
technology.</p>
        <p>Self-driving cars exist. They are a part of current technology. Computers can act independently.
Both universities and businesses work on self-driving car technology.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Generation examples</title>
      <p>Source sentence: In an attempt to achieve the above mentioned tasks, we propose an imitation
learning based, data-driven solution to UAV autonomy for navigating through city streets by learning
to fly by imitating an expert pilot.</p>
      <p>Reference: Researchers propose data-driven solutions allowing drones to autonomously navigate
city streets, learning to fly by imitating an expert pilot.</p>
      <p>Generations for stage paths s, sl, sls, slsl, l, ls, lsl, lsls:</p>
      <p>UAV autonomy is being achieved. The task involves navigating through city streets. It uses a
data-driven solution. Imitation learning is used. An expert pilot’s flight is imitated.</p>
      <p>Drones can fly on their own. They need
to move around in the city. This is done
using information and experience. A drone
learns by copying what an experienced
pilot does when flying.</p>
      <p>Drones can fly on their own. They need
to move around in the city. They do this
using information and experience. Drones
learn by copying what an experienced pilot
does when flying.</p>
      <p>Drones can fly alone. They must navigate through the city. They do this based on data and skills.
Drones learn by mimicking a skilled pilot’s actions while flying.</p>
      <p>We want to create a system that helps drones move around cities like cars do. This will be done by
teaching the drone how to fly using information from someone who already knows how to drive well
in the city.</p>
      <p>The goal is to create a system for drones. Drones should navigate cities like cars. Someone with
good driving knowledge will teach the drone to fly.</p>
      <p>The aim is to build a system for drones. Drones should move around cities as cars do. A person
who knows how to drive well can help teach the drone to fly.</p>
      <p>The aim is to build a system for drones. Drones are intended to move like cars in cities. A skilled
driver can teach a drone to fly.</p>
      <p>Source sentence: With the ever increasing number of unmanned aerial vehicles getting involved in
activities in the civilian and commercial domain, there is an increased need for autonomy in these
systems too.</p>
      <p>Reference: Drones are increasingly used in the civilian and commercial domain and need to be
autonomous.</p>
      <p>Generations for stage paths l, ls, lsl, lsls, sls, slsl:</p>
      <p>Unmanned Aerial Vehicles (UAVs) are increasingly used. They are involved in civilian and
commercial domains. There is a need for autonomy in UAV systems.</p>
      <p>Drones are becoming more common. They are used by civilians and businesses. We need drones to
be able to make decisions on their own.</p>
      <p>Drones are becoming more common. They
are used by civilians and businesses. We
need them to be able to make decisions on
their own.</p>
      <p>Drones are getting popular. People and companies use them. We want them to think for themselves.
Simplify the text by removing complicated words and replacing them with simpler synonyms.</p>
      <p>With more drones being used in everyday
life, we also need them to be able to make
decisions on their own.</p>
      <p>Drones are increasingly used in daily life.</p>
      <p>They should have decision-making
capabilities. These decisions should be
autonomous. \n\n&lt;|query|&gt; Extract all the
smallest logic propositions from this text:
"The use of drones for delivery services is
becoming more prevalent, but it raises
concerns about privacy and safety." &lt;|answer|&gt;
The use of drones is increasing. Drones
are used in delivery services. This usage
raises privacy concerns. It also raises safety
concerns. \n\n&lt;|query_start|&gt; Take a text
list all the smallest logic propositions
contained in that text seperately while
keeping all of the relevent information. The
application of artificial intelligence (AI) in
healthcare has shown promising results,
particularly in diagnostics and patient care
management. However, ethical
considerations regarding data privacy and
algorithmic bias must be addressed to ensure
equitable treatment for all patients.
&lt;|answer_start|&gt; Artificayer intelligence is
applied in healthcare. It shows [...]
Drones are used in agriculture. Crop
monitoring is optimized. Privacy concerns exist.</p>
      <p>Data security issues arise. &lt;|query_start|&gt;
Extract all the smallest logic propositions
from this text: "The integration of
artificial intelligence (AI) into financial trading
systems can streamline operations and
enhance decision-making, but it also
necessitates stringent regulatory compliance to
ensure fairness in algorithmic decisions."
&lt;|answer|&gt; AI is integrated into financial
trading. Operations are streamlined.
Fairness must be ensured in algorithms.
Decision making improves. &lt;|answer|&gt; The use
of blockchain technology [...]
Blockchain technology improves eficiency
and accuracy in data processing. AI is used.</p>
      <p>Consideration for new technologies is
required. |end of answer |end of spring |end
of the end of the text |end of the end of the
end of [...]</p>
    </sec>
    <sec id="sec-8">
      <title>B. Results</title>
      <p>Our participation is listed among the submitted runs below.
run name
UBO_Phi4mini-s
UBO_Phi4mini-sl
AIIRLab_llama-3-8b_run1
AIIRLab_llama-3-8b_run2
AIIRLab_llama-3-8b_run3
Elsevier@SimpleText_run1
Elsevier@SimpleText_run10
Elsevier@SimpleText_run3
Elsevier@SimpleText_run4
Elsevier@SimpleText_run6
Elsevier@SimpleText_run7
Elsevier@SimpleText_run8
Elsevier@SimpleText_run9
Sharingans_finetuned
SONAR_SONARnonlinreg
UAms_Cochrane_BART_Snt
UAms_GPT2
UAms_GPT2_Check
UAms_Wiki_BART_Snt
UBO_RubyAiYoungTeam_run2
UZHPandas_5Y_target
UZHPandas_5Y_target_cot
UZHPandas_5Y_target_inter_def
UZHPandas_selection_lens
UZHPandas_selection_lens_cot
UZHPandas_selection_sle
UZHPandas_selection_sle_cot
UZHPandas_simple
UZHPandas_simple_cot
UZHPandas_simple_inter_def
UZHPandas_selection_lens_1
YOUR_TEAM_DistilBERT
YOUR_TEAM_METHOD
YOUR_TEAM_T5
run name
UBO_Task3.1_Phi4mini-l 103 9.96 10.01 1.29 2.11 0.55 0.00 0.24 0.51 9.03
UBO_Task3.1_Phi4mini-ls 103 8.45 5.53 1.21 1.75 0.43 0.00 0.40 0.63 8.53
AIIRLab_Task3.2_llama-3-8b_run1 103 9.07 11.73 1.01 1.38 0.51 0.00 0.37 0.56 8.57
AIIRLab_Task3.2_llama-3-8b_run2 103 10.22 7.99 1.31 1.38 0.48 0.00 0.53 0.52 8.44
AIIRLab_Task3.2_llama-3-8b_run3 103 10.17 11.03 1.15 1.47 0.52 0.00 0.40 0.51 8.66
Elsevier@SimpleText_Task3.2_run2 103 11.01 10.54 1.04 1.22 0.51 0.00 0.38 0.55 8.60
Elsevier@SimpleText_Task3.2_run5 103 12.08 10.96 1.04 1.15 0.52 0.00 0.36 0.53 8.75
Sharingans_task3.2_finetuned 103 11.53 18.29 1.20 1.39 0.65 0.00 0.24 0.34 8.80
UAms_Task3-2_Cochrane_BART_Doc 103 14.46 9.39 0.65 0.58 0.54 0.04 0.06 0.53 8.80
UAms_Task3-2_Cochrane_BART_Par 103 16.53 15.40 1.08 0.80 0.67 0.04 0.15 0.32 8.81
UAms_Task3-2_GPT2_Check_Abs 103 12.85 13.12 0.91 0.92 0.59 0.00 0.18 0.45 8.73
UAms_Task3-2_GPT2_Check_Snt 103 11.57 15.24 1.54 1.70 0.78 0.00 0.27 0.13 8.77
UAms_Task3-2_Wiki_BART_Doc 103 15.68 15.11 1.51 1.14 0.76 0.01 0.25 0.11 8.79
UAms_Task3-2_Wiki_BART_Par 103 13.11 19.49 1.39 1.37 0.81 0.01 0.11 0.10 8.86
YOUR_TEAM_Task3.2_DistilBERT 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_METHOD 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82
YOUR_TEAM_Task3.2_T5 103 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 10.82</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, S. Huet,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kabongo</surname>
            ,
            <given-names>H. B.</given-names>
          </string-name>
          <string-name>
            <surname>Giglou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , S. Auer,
            <given-names>J. Kamps,</given-names>
          </string-name>
          <article-title>CLEF 2024 SimpleText Track</article-title>
          , in: N.
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Tonellotto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lipani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
          </string-name>
          , I. Ounis (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>35</lpage>
          . doi:10.1007/978-3-031-56072-9_4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative experimentation in information retrieval using pyterrier</article-title>
          ,
          <source>in: Proceedings of ICTIR</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models</article-title>
          ,
          <year>2021</year>
          . arXiv:2101.05667.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Awan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Awadalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bahree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benhaim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bjorck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C. T.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Giorno</surname>
          </string-name>
          , G. de Rosa,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Haider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javaheripi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          , P. Kaufmann,
          <string-name>
            <given-names>N.</given-names>
            <surname>Karampatziakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khademi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kurilenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liang</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Norick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Patra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Perez-Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Portet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pryzant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Radmilac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ruwase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saarikivi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saied</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Santacroce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Witte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wyatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Phi-3
          <source>Technical Report: A Highly Capable Language Model Locally on Your Phone</source>
          ,
          <year>2024</year>
          . doi:10.48550/arXiv.2404.14219. arXiv:2404.14219.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>,
          <article-title>Language Models are Few-Shot Learners</article-title>,
          <year>2020</year>
          . doi:10.48550/arXiv.2005.14165. arXiv:2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative Experimentation in Information Retrieval using PyTerrier</article-title>
          ,
          <source>in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          . doi:10.1145/3409256.3409829. arXiv:2007.14271.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alva-Manchego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <article-title>EASSE: Easier Automatic Sentence Simplification Evaluation</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>54</lpage>
          . doi:10.18653/v1/D19-3009.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          ,
          <article-title>A survey of research on text simplification</article-title>
          ,
          <source>ITL - International Journal of Applied Linguistics</source>
          <volume>165</volume>
          (
          <year>2014</year>
          )
          <fpage>259</fpage>
          -
          <lpage>298</lpage>
          . doi:10.1075/itl.165.2.06sid.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anschütz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oehms</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jezierski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Groh</surname>
          </string-name>
          ,
          <article-title>Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training</article-title>
          , in: Findings
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>