UZH_Pandas at SimpleText2024: Multi-Prompt Minimum
                         Bayes Risk with Diverse Prompts
                         Notebook for the SimpleText Lab at CLEF 2024

                         Andrianos Michail¹,*,†, Pascal Severin Andermatt¹,† and Tobias Fankhauser¹
                         ¹ University of Zurich, Zurich, Switzerland


                                         Abstract
                                         This paper summarizes further experiments building on "SimpleText Best of Labs in CLEF-2023:
                                         Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding" [1], adapted to the
                                         SimpleText2024 Shared Task 3.1 dataset. We observe how candidate simplifications generated by the
                                         off-the-shelf Llama 3 model perform differently depending on the prompt, and whether Minimum Bayes
                                         Risk (MBR) re-ranking is beneficial when some candidates underperform. Finally, on a small sample,
                                         we investigate the agreement between the simplification candidates selected by MBR and those
                                         selected by a human annotator.

                                         Keywords
                                         Scientific Text Simplification, Generative Language Models, Minimum Bayes Risk Decoding, Multi Prompt
                                         Ensembling, Prompt Engineering, Large Language Models, SimpleText@CLEF-2024


                         1. Introduction
                         Automatic simplification of complex text, and of scientific abstracts in particular, remains
                         challenging. While LLMs have been shown to be adequate for text simplification, performance appears
                         to vary widely across domains and prompting strategies [2]. We present extended results from
                         further evaluations of the approach in [1] on the SimpleText2024 shared task [3].
                         Our main contribution in this summary is to report the results of different prompting strategies on
                         the test set and to examine the agreement between the Minimum Bayes Risk re-ranking choices and the
                         candidates selected by a human.


                         2. Methodology
                         We perform the simplifications with the off-the-shelf Llama 3 [4] 8B model, using the prompts in Table 1.
                         In addition to the plain prompts, we also experiment with variations of the prompts in which we provide
                         the simplification model with intermediate definitions of complex terms during inference.

                         Table 1
                         The plain prompt templates used to generate the simplifications.
                                  Target                 Prompt
                                  P1: General            Simplify the following scientific sentence to make it more understandable for a
                                                         general audience:
                                  P2: 5Y                 Simplify the following scientific sentence. Explain it as if you were talking to a
                                                         5-year-old, using simple words and concepts:


                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         * Corresponding author.
                         † These authors contributed equally.
                         Email: andrianos.michail@cl.uzh.ch (A. Michail); pandermatt@ifi.uzh.ch (P. S. Andermatt);
                         tobias.fankhauser@outlook.de (T. Fankhauser)
                                      © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                         CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
          [Figure 1 (pipeline diagram): the Task 3 dataset is passed to Llama 3 with the General and 5Y prompts,
          each in a Default and an Intermediate Definitions variant (Candidates Generation); Minimum Bayes Risk
          (LENS) then selects the best result (Multi-Prompt Candidate Selection).]

          Figure 1: Complete schematic of the simplification pipeline. For extended details, refer to [1].


   These definitions are generated by the same LLM in a separate session. We refer to simplifications
produced with this approach as generated through Intermediate Definitions (ID).
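
The four candidate sources (General, General through ID, 5Y, 5Y through ID) can be sketched as prompt variants built from the two templates in Table 1. The snippet below is a minimal illustration only: the function names and, in particular, the way definitions are placed in the prompt are assumptions made for this sketch, since the exact wording of the ID prompts is not given here.

```python
# Plain prompt templates, copied from Table 1.
PROMPTS = {
    "General": (
        "Simplify the following scientific sentence to make it more "
        "understandable for a general audience:"
    ),
    "5Y": (
        "Simplify the following scientific sentence. Explain it as if you were "
        "talking to a 5-year-old, using simple words and concepts:"
    ),
}


def build_prompt(target, sentence, definitions=None):
    """Return the prompt text for one candidate.

    `definitions` maps a complex term to its definition. How definitions are
    injected into the prompt is an assumption made for this sketch.
    """
    parts = []
    if definitions:
        # Hypothetical layout: list the definitions before the instruction.
        parts.append("Definitions of complex terms:")
        parts.extend(f"- {term}: {text}" for term, text in definitions.items())
    parts.append(PROMPTS[target])
    parts.append(sentence)
    return "\n".join(parts)


def build_all_prompts(sentence, definitions):
    """Build the four variants: each template, plain and through ID."""
    variants = {}
    for target in PROMPTS:
        variants[target] = build_prompt(target, sentence)
        variants[f"{target} through ID"] = build_prompt(target, sentence, definitions)
    return variants
```

Each of the four prompt strings would then be sent to the simplification model to obtain one candidate per variant.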
   We also examine whether selecting the best candidate using Minimum Bayes Risk [5, 6, 7], with LENS [8]
as the utility function, results in better performance. The complete schematic is illustrated in Figure 1.
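
To make the selection step concrete, the following is a minimal sketch of Minimum Bayes Risk selection over a set of candidates: each candidate is scored against all other candidates, which act as pseudo-references, and the candidate with the highest mean utility is returned. The utility used here is a toy token-overlap F1 that stands in for LENS; it is not LENS, it ignores the source sentence, and all function names are hypothetical.

```python
from collections import Counter


def overlap_f1(hypothesis, reference):
    """Toy utility: token-level F1 overlap. A stand-in for LENS, not LENS itself."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return the index of the candidate with the highest mean utility
    when every other candidate is used as a pseudo-reference."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return 0
    best_index, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:  # ties keep the earliest candidate
            best_index, best_score = i, score
    return best_index


candidates = ["the cat", "the cat sat", "the cat sat down"]
chosen = mbr_select(candidates)
```

Because every candidate is judged only against the other candidates, the choice depends on the composition of the candidate pool: a pool dominated by one style of output pulls the selection toward that style.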


3. Results
In Table 2 we show the evaluation of the simplifications from each individual prompt, together with the
evaluation of the simplifications selected by Minimum Bayes Risk. The evaluation metrics generally agree
on the ranking of the systems. The clear exception is FKGL [9]: simplifications receive exceptionally
high scores when the model is prompted with Intermediate Definitions (ID), because the definitions are
embedded within the sentence. Conversely, the extremely low FKGL score of the 5Y prompt indicates
that the model is over-simplifying the text, probably omitting some important details of the source text.
The limitations of these prompts are also reflected in SARI [10], which supports its appropriateness as
an evaluation metric.
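
The FKGL behaviour can be illustrated with the standard Flesch-Kincaid Grade Level formula, which grows with the average number of words per sentence. In Table 2, General through ID has a compression ratio of 1.93 with about one sentence split, so the text roughly doubles in length without gaining sentences. The sketch below uses hypothetical word and syllable counts, chosen only to show the effect of doubling the sentence length at a constant syllables-per-word rate.

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: one sentence before and after an in-sentence definition
# doubles its length while keeping the same syllables-per-word rate.
plain = fkgl(words=20, sentences=1, syllables=34)
with_definition = fkgl(words=40, sentences=1, syllables=68)
```

With these illustrative counts the grade level rises from about 12.3 to about 20.1, purely because the sentence became longer.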
   Contrary to previous results [1], simplifications selected by Minimum Bayes Risk received worse
ratings than the two best performing prompts. We hypothesize that this is due to the overly aggressive
simplifications generated by the 5Y prompt, which skew the utility computation used to select the best
candidate. This demonstrates the dependency of the approach on the source distribution of candidates.


Table 2
Results of the evaluation of the SimpleText2024 Shared Task, Task 3.1, presented in descending order according
to the SARI score. Other participants are omitted for brevity.
 run_id                      Sample Size   FKGL↓      SARI↑     BLEU↑    Comp. ratio    Sent. splits   Lev. sim.   Ex. copies   Lex. comp.
 Reference Texts                 578       08.86      100.00    100.00      0.70            1.06         0.60         0.01         8.51
 Best Run (Elsevier)             578       10.33      43.63     10.68       0.87            1.06         0.59         0.00         8.39
 General                         578       11.24      39.28     05.67       0.88            0.98         0.52         0.00         8.45
 General through ID              578       21.36      38.29     03.13       1.93            0.99         0.46         0.00         8.86
 Minimum Bayes Risk (LENS)       578       07.79      36.72     03.65       0.72            0.98         0.46         0.00         8.25
 5Y through ID                   578       19.30      36.53     02.27       1.76            1.01         0.45         0.00         8.87
 5Y                              578       05.94      34.91     02.29       0.66            0.99         0.43         0.00         8.17
 Source Texts                    578       13.65      12.02     19.76       1.00            1.00         1.00         1.00         8.80

Figure 2: Selection rate for simplification candidates selected by a human (left) and by Minimum Bayes Risk
(right).


3.1. Human Preference Selection
We investigate the selection process of Minimum Bayes Risk (LENS) by comparing it to how a human
would select the best candidate for simplification.
   Out of 50 human-annotated selections, we visualize the percentage of examples selected from each
source prompt in Figure 2. About 38% of the human's selections were candidates generated through
intermediate definitions, with the qualitative impression that they improve the clarity of complex
terms, making the text easier to read. In contrast, Minimum Bayes Risk (LENS) mainly selected candidates
from the 5Y prompt (58%), which was the prompt least selected by the human, with a selection rate
of only 10%, due to the qualitative impression that its candidates lacked important details from the
source. Overall, the agreement between Minimum Bayes Risk and human selection
is quite low, with a Cohen’s 𝜅 = 0.14.
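
For reference, Cohen's κ corrects the observed agreement for the agreement expected by chance given each selector's label distribution. The sketch below assumes that each example is labelled with the prompt whose candidate was chosen, once by the human and once by Minimum Bayes Risk; this labelling scheme and the toy labels are assumptions for illustration and do not reproduce the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which both selectors agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both selectors always pick the same single label
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): which prompt's candidate each selector chose.
human = ["ID", "ID", "5Y", "General"]
mbr = ["ID", "5Y", "5Y", "5Y"]
kappa = cohen_kappa(human, mbr)
```

A value near 0 means the two selectors agree about as often as chance would predict, while 1 means perfect agreement.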


4. Limitations
In our study, we only examine the behavior of Minimum Bayes Risk within a limited set of different
prompts. More generally, the effectiveness of Minimum Bayes Risk may be limited by the pipelines that
produce the candidates or by the utility function itself, LENS. Importantly, our human selection
annotation study is subjective and performed on a small sample of simplifications.


5. Conclusions
This study extended previous work on scientific text simplification using Multi-Prompt Minimum Bayes
Risk re-ranking, applied to the SimpleText2024 Shared Task 3.1 dataset. Our results showed substantial
differences in performance between prompts, with one prompt leading to oversimplification. We also
measured the agreement between Minimum Bayes Risk and human selection, and reported qualitative
observations.


Acknowledgments
We express our deepest gratitude and sincere appreciation to Simon Clematide and the Department of
Computational Linguistics for their unwavering support, computational resources and constructive
guidance during the creation of this work. Andrianos Michail acknowledges funding by the SNSF
(213585) under the "impresso 2" project.


References
 [1] A. Michail, P. S. Andermatt, T. Fankhauser, Simpletext best of labs in CLEF-2023: Scientific
     text simplification using multi-prompt minimum bayes risk decoding, in: L. Goeuriot, P. Mulhem,
     G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. G. S. de Herrera,
     G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.
     Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture
     Notes in Computer Science, Springer, 2024.
 [2] T. Kew, A. Chi, L. Vásquez-Rodríguez, S. Agrawal, D. Aumiller, F. Alva-Manchego, M. Shardlow,
     BLESS: Benchmarking large language models on sentence simplification, in: H. Bouamor, J. Pino,
     K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language
     Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13291–13309. URL:
     https://aclanthology.org/2023.emnlp-main.821. doi:10.18653/v1/2023.emnlp-main.821.
 [3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza, J. Kamps,
     Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone, in:
     L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. G. S.
     de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and
     Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF
     2024), Lecture Notes in Computer Science, Springer, 2024.
 [4] AI@Meta, Llama 3 model card (2024). URL:
     https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
 [5] S. Kumar, W. Byrne, Minimum bayes-risk word alignments of bilingual texts, in: Proceedings of the
     2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Association
     for Computational Linguistics, 2002, pp. 140–147. URL: https://aclanthology.org/W02-1019.
     doi:10.3115/1118693.1118712.
 [6] S. Kumar, W. Byrne, Minimum bayes-risk decoding for statistical machine translation, in:
     Proceedings of the Human Language Technology Conference of the North American Chapter of the
     Association for Computational Linguistics: HLT-NAACL 2004, Association for Computational
     Linguistics, Boston, Massachusetts, USA, 2004, pp. 169–176. URL: https://aclanthology.org/N04-1022.
 [7] M. Müller, R. Sennrich, Understanding the properties of minimum bayes risk decoding in neural
     machine translation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th
     Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
     Conference on Natural Language Processing (Volume 1: Long Papers), Association for
     Computational Linguistics, Online, 2021, pp. 259–272. URL: https://aclanthology.org/2021.acl-long.22.
     doi:10.18653/v1/2021.acl-long.22.
 [8] M. Maddela, Y. Dou, D. Heineman, W. Xu, LENS: A learnable evaluation metric for text
     simplification, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st
     Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
     Association for Computational Linguistics, Toronto, Canada, 2023, pp. 16383–16408. URL:
     https://aclanthology.org/2023.acl-long.905. doi:10.18653/v1/2023.acl-long.905.
 [9] R. Flesch, Marks of readable style; a study in adult education., Teachers College Contributions to
     Education (1943).
[10] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine
     translation for text simplification, Transactions of the Association for Computational Linguistics
     4 (2016) 401–415.