UZH_Pandas at SimpleText2024: Multi-Prompt Minimum Bayes Risk with Diverse Prompts

Notebook for the SimpleText Lab at CLEF 2024

Andrianos Michail1,*,†, Pascal Severin Andermatt1,† and Tobias Fankhauser1

1 University of Zurich, Zurich, Switzerland

Abstract

This paper summarizes further experiments building on the paper "SimpleText Best of Labs in CLEF-2023: Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding" [1], adapted to the SimpleText2024 Shared Task 3.1 dataset. We observe how candidate simplifications generated by the off-the-shelf Llama 3 model perform differently depending on the prompt, and whether Minimum Bayes Risk (MBR) re-ranking is still beneficial when some candidates underperform. Finally, on a small sample, we investigate the agreement between the simplification candidates selected by MBR and those selected by a human annotator.

Keywords

Scientific Text Simplification, Generative Language Models, Minimum Bayes Risk Decoding, Multi Prompt Ensembling, Prompt Engineering, Large Language Models, SimpleText@CLEF-2024

1. Introduction

Automatic simplification of complex text and, more specifically, of scientific abstracts remains challenging. While large language models (LLMs) have been shown to be adequate for text simplification, there appears to be a large variation in performance across different domains and prompting strategies [2].

We present extended results from further evaluations of the approach in [1] on the SimpleText2024 shared task [3]. Our main contribution in this summary is to report the results of different prompting strategies on the test set and to examine the agreement between the candidates chosen by Minimum Bayes Risk re-ranking and the candidates selected by a human.

2. Methodology

We perform the simplifications with the off-the-shelf Llama 3 8B model [4], using the prompts in Table 1. In addition to the plain prompts, we also experiment with variants in which we provide the simplification model with intermediate definitions of complex terms during inference.
These definitions are generated by the same LLM in a separate session. We refer to simplifications generated with this approach as being generated through Intermediate Definitions (ID). We ablate whether selecting the best candidate using Minimum Bayes Risk [5, 6, 7] with LENS [8] as the utility function results in better performance. The complete schematic is illustrated in Figure 1.

Table 1
The plain prompt templates used to generate the simplifications.

Target        Prompt
P1: General   Simplify the following scientific sentence to make it more understandable for a general audience:
P2: 5Y        Simplify the following scientific sentence. Explain it as if you were talking to a 5-year-old, using simple words and concepts:

[Figure 1 (image not reproduced). Stage "Multi-Prompt Generation": Default Dataset (Task 3), Prompt General, Prompt 5Y, Intermediate Definitions, Llama 3, Candidates. Stage "Candidate Selection": Minimum Bayes Risk (LENS), Best Result.]
Figure 1: Complete schematic of the simplification pipeline. For extended details, refer to [1].

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
andrianos.michail@cl.uzh.ch (A. Michail); pandermatt@ifi.uzh.ch (P. S. Andermatt); tobias.fankhauser@outlook.de (T. Fankhauser)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

3. Results

In Table 2 we show the evaluation of the simplifications produced by each individual prompt, together with the evaluation of the simplifications selected by Minimum Bayes Risk. The evaluation metrics generally agree on the ranking of the systems. The clear exception is that simplifications receive exceptionally high FKGL [9] scores when the model is prompted with Intermediate Definitions (ID), because the definitions are included within the sentence.
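For reference, the sensitivity of FKGL to sentence length can be read off its commonly used definition. The formula below is the standard Flesch-Kincaid Grade Level formula, stated here as background; the symbols W (number of words), S (number of sentences) and Y (number of syllables) are notation introduced only for this illustration.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{W}{S} + 11.8 \cdot \frac{Y}{W} - 15.59
```

Since the first term grows with the number of words per sentence, adding definitions inside a sentence without splitting it raises the score. This is consistent with the compression ratios of 1.93 and 1.76 and the roughly unchanged number of sentence splits reported for the two ID runs in Table 2.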
However, the extremely low FKGL score of the 5Y prompt suggests that the model is over-simplifying the text, probably omitting some important details of the source text. The limitation of these prompts is also reflected in their SARI scores [10], which supports the appropriateness of SARI as an evaluation metric.

Contrary to previous results [1], the simplifications selected by Minimum Bayes Risk received worse ratings than those of the two best performing prompts. We hypothesize that this is due to the over-simplified candidates generated by the 5Y prompt, which negatively affect the utility used to select the best candidate. This demonstrates the dependency of the approach on the source distribution of candidates.

Table 2
Results of the evaluation of the SimpleText2024 Shared Task, Task 3.1, presented in descending order according to the SARI score. Other participants are omitted for brevity.

run_id                      Sample Size  FKGL↓  SARI↑   BLEU↑   Comp. ratio  Sent. splits  Lev. sim.  Ex. copies  Lex. comp.
Reference Texts             578          08.86  100.00  100.00  0.70         1.06          0.60       0.01        8.51
Best Run (Elsevier)         578          10.33  43.63   10.68   0.87         1.06          0.59       0.00        8.39
General                     578          11.24  39.28   05.67   0.88         0.98          0.52       0.00        8.45
General through ID          578          21.36  38.29   03.13   1.93         0.99          0.46       0.00        8.86
Minimum Bayes Risk (LENS)   578          07.79  36.72   03.65   0.72         0.98          0.46       0.00        8.25
5Y through ID               578          19.30  36.53   02.27   1.76         1.01          0.45       0.00        8.87
5Y                          578          05.94  34.91   02.29   0.66         0.99          0.43       0.00        8.17
Source Texts                578          13.65  12.02   19.76   1.00         1.00          1.00       1.00        8.80

Figure 2: Selection rate for simplification candidates selected by a human (left) and by Minimum Bayes Risk (right).

3.1. Human Preference Selection

We investigate the selection process of Minimum Bayes Risk (LENS) by comparing it to how a human would select the best simplification candidate. For 50 human-annotated selections, we visualize the percentage of examples selected from each source prompt in Figure 2.
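The selection step under investigation can be illustrated with a minimal sketch, under stated assumptions. Minimum Bayes Risk re-ranking scores each candidate by its average utility against the other candidates, which serve as pseudo-references, and returns the highest-scoring one. LENS is a learned metric [8]; the toy token-overlap F1 below is only a stand-in for it, and the names `token_f1` and `mbr_select` are hypothetical and not taken from [1].

```python
from collections import Counter
from typing import Callable, Sequence


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: token-level F1 overlap (an assumed stand-in for LENS)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # size of the multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> int:
    """Return the index of the candidate with the highest mean utility
    against all other candidates, which act as pseudo-references."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return 0
    scores = []
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        scores.append(sum(utility(hyp, ref) for ref in others) / len(others))
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    # Placeholder strings, not real model outputs: one longer candidate
    # and two short candidates that resemble each other.
    toy_candidates = ["a b c d", "a b", "a b e"]
    print(mbr_select(toy_candidates))  # prints 1
```

In this toy input the two short, mutually similar candidates reinforce each other, so the shortest one is selected. This mirrors the hypothesis above that a group of over-simplified candidates can pull the selection toward itself.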
We see that in about 38% of cases the human selected a simplification candidate generated through Intermediate Definitions, with the qualitative impression that the definitions improve the clarity of complex terms, making the text easier to read. In contrast, Minimum Bayes Risk (LENS) mainly selected candidates from the 5Y prompt (58%). This prompt was the one least selected by the human, with a selection rate of only 10%, due to the qualitative impression that its candidates lacked important details from the source. Overall, the agreement between Minimum Bayes Risk and human selection is quite low, with Cohen's κ = 0.14.

4. Limitations

In our study, we only examine the behavior of Minimum Bayes Risk within a limited set of different prompts. In practice, Minimum Bayes Risk using LENS may be limited by the pipelines that generate the source candidates or by the utility function itself. Importantly, our human selection annotation study is subjective and was performed on a small sample of simplifications.

5. Conclusions

This study extended previous work on scientific text simplification using Multi-Prompt Minimum Bayes Risk re-ranking, applied to the SimpleText2024 Shared Task 3.1 dataset. Our results showed substantial differences in performance between prompts, with one prompt leading to oversimplification. Finally, we measured the agreement between Minimum Bayes Risk and human selection and reported qualitative observations.

Acknowledgments

We express our deepest gratitude and sincere appreciation to Simon Clematide and the Department of Computational Linguistics for their unwavering support, computational resources and constructive guidance during the creation of this work. Andrianos Michail acknowledges funding by the SNSF (213585) under the "impresso 2" project.

References

[1] A. Michail, P. S. Andermatt, T. Fankhauser, Simpletext best of labs in CLEF-2023: Scientific text simplification using multi-prompt minimum bayes risk decoding, in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L. Soulier, G. M.
D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.
[2] T. Kew, A. Chi, L. Vásquez-Rodríguez, S. Agrawal, D. Aumiller, F. Alva-Manchego, M. Shardlow, BLESS: Benchmarking large language models on sentence simplification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13291–13309. URL: https://aclanthology.org/2023.emnlp-main.821. doi:10.18653/v1/2023.emnlp-main.821.
[3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza, J. Kamps, Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone, in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.
[4] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[5] S. Kumar, W. Byrne, Minimum bayes-risk word alignments of bilingual texts, in: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Association for Computational Linguistics, 2002, pp. 140–147. URL: https://aclanthology.org/W02-1019. doi:10.3115/1118693.1118712.
[6] S. Kumar, W.
Byrne, Minimum bayes-risk decoding for statistical machine translation, in: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Association for Computational Linguistics, Boston, Massachusetts, USA, 2004, pp. 169–176. URL: https://aclanthology.org/N04-1022.
[7] M. Müller, R. Sennrich, Understanding the properties of minimum bayes risk decoding in neural machine translation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 259–272. URL: https://aclanthology.org/2021.acl-long.22. doi:10.18653/v1/2021.acl-long.22.
[8] M. Maddela, Y. Dou, D. Heineman, W. Xu, LENS: A learnable evaluation metric for text simplification, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 16383–16408. URL: https://aclanthology.org/2023.acl-long.905. doi:10.18653/v1/2023.acl-long.905.
[9] R. Flesch, Marks of readable style; a study in adult education., Teachers College Contributions to Education (1943).
[10] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine translation for text simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415.