1. Introduction

UZH_Pandas at SimpleText2024: Multi-Prompt Minimum Bayes Risk with Diverse Prompts

Andrianos Michail

Pascal Severin Andermatt

Tobias Fankhauser

University of Zurich, Zurich, Switzerland

This paper summarizes further experiments building on the paper "SimpleText Best of Labs in CLEF-2023: Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding" [1], adapted to the SimpleText2024 Shared Task 3.1 dataset. We examine how candidate simplifications generated by the off-the-shelf Llama3 perform differently depending on the prompt, and whether Minimum Bayes Risk (MBR) re-ranking remains beneficial when some candidates underperform. Finally, on a small sample, we investigate the agreement of simplification candidate re-rankings between MBR and a human annotator.

Keywords: Scientific Text Simplification, Generative Language Models, Minimum Bayes Risk Decoding, Multi Prompt Ensembling, Prompt Engineering, Large Language Models, SimpleText@CLEF-2024

2. Methodology

We perform the simplifications with the off-the-shelf Llama3 [4] 8B model, using the prompts in Table 1. In addition to the plain prompts, we also experiment with prompt variations in which we provide the simplification model with intermediate definitions of complex terms during inference.

Table 1: Prompts used to generate the simplification candidates.

Target: P1
Prompt: Simplify the following scientific sentence to make it more understandable for a general audience:

Target: P2: 5Y
Prompt: Simplify the following scientific sentence. Explain it as if you were talking to a 5-year-old, using simple words and concepts:

Figure 1: Multi-prompt candidate selection. Sentences from the dataset (Task 3) are simplified by Llama 3 with the default prompts and with Intermediate Definitions (candidate generation); Minimum Bayes Risk (LENS) then selects the best candidate as the result.

These definitions are generated by the same LLM in a separate session. We refer to the simplifications generated with this approach as being generated through Intermediate Definitions (ID).
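As a rough illustration of how the plain and ID prompt variants might be assembled, the following sketch builds the model input for a single sentence. The two instruction strings are those of Table 1; the dictionary keys, the function name `build_prompt`, the layout of the definitions block, and the example sentence and definition are illustrative assumptions rather than the exact format used in the experiments.

```python
# Instruction strings as listed in Table 1; the keys are hypothetical labels.
PROMPTS = {
    "general": "Simplify the following scientific sentence to make it more "
               "understandable for a general audience:",
    "5y": "Simplify the following scientific sentence. Explain it as if you "
          "were talking to a 5-year-old, using simple words and concepts:",
}


def build_prompt(prompt_key, sentence, definitions=None):
    """Assemble the model input for one sentence.

    `definitions` maps complex terms to short glosses (the ID variant).
    The layout of the definitions block is an assumed format.
    """
    parts = []
    if definitions:
        # ID variant: prepend definitions produced in a separate session.
        parts.append("Definitions of complex terms:")
        parts.extend(f"- {term}: {gloss}" for term, gloss in definitions.items())
    parts.append(PROMPTS[prompt_key])
    parts.append(sentence)
    return "\n".join(parts)


plain = build_prompt("5y", "Qubits decohere.")
with_id = build_prompt(
    "general",
    "Qubits decohere.",
    {"qubit": "a unit of quantum information"},
)
```

Each source sentence then yields one candidate per prompt variant, and these candidates form the pool from which the final simplification is selected.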

We ablate whether selecting the best candidate using Minimum Bayes Risk [5, 6, 7] with LENS [8] as the utility function results in better performance. The complete schematic is illustrated in Figure 1.
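To make the selection step concrete, the sketch below shows Minimum Bayes Risk re-ranking over a small candidate pool: each candidate is scored by its mean utility against all other candidates, which act as pseudo-references, and the highest-scoring candidate is returned. The utility here is a toy token-overlap F1 that merely stands in for LENS; the function names `token_f1` and `mbr_select` and the example candidates are assumptions for illustration.

```python
import re
from collections import Counter


def token_f1(candidate, reference):
    """Toy utility: token-overlap F1. A stand-in for LENS, not LENS itself."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return (index, score) of the candidate with the highest mean utility
    against all other candidates, which serve as pseudo-references."""
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two candidates")
    scores = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(utility(cand, ref) for ref in others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical candidates for one source sentence, one per prompt variant.
candidates = [
    "The cell makes energy from sugar.",
    "The cell makes energy from sugar molecules.",
    "The cell eats candy to get strong.",
]
best_index, best_score = mbr_select(candidates)
```

Because every candidate is judged only against the other candidates, the outcome depends on the composition of the pool: a candidate that resembles the majority is favored regardless of whether that majority is faithful to the source. A learned metric such as LENS would replace `token_f1`; if the metric also conditions on the source sentence, the utility signature would need to be extended accordingly.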

3. Results

In Table 2 we show the simplification evaluations of each individual prompt, together with the evaluations of simplifications selected by Minimum Bayes Risk. The evaluation metrics generally agree on the ranking of the systems. The clear exception is that the simplifications receive exceptionally high FKGL [9] scores when the model is prompted with Intermediate Definitions (ID), because the definitions of the terms are embedded within the sentence. Conversely, the extremely low FKGL score of the 5Y prompt indicates that the model is over-simplifying the text, probably omitting some important details of the source text. The limitation of these prompts is also reflected in SARI [10], demonstrating its appropriateness as an evaluation metric.
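The sensitivity of FKGL to sentence length can be seen from its standard formula, FKGL = 0.39 (words / sentences) + 11.8 (syllables / words) - 15.59. The sketch below computes it with a crude vowel-group syllable heuristic; the heuristic, the function names, and the two example texts are assumptions for illustration, and established readability tools count syllables more carefully.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Flesch-Kincaid Grade Level using heuristic sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical outputs: short sentences versus one sentence with inline definitions.
short_text = "Cells make energy. They use sugar."
with_definitions = (
    "Cells, the basic structural units of organisms, make energy through "
    "respiration, a process that converts sugar into usable fuel."
)
short_score = fkgl(short_text)
definitions_score = fkgl(with_definitions)
```

In this toy example the text with inline definitions scores a markedly higher grade level than the short sentences, even though the definitions are meant to aid understanding, which is why a high FKGL alone does not imply a poor simplification.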

Contrary to previous results [1], simplifications selected by Minimum Bayes Risk received worse ratings than the two best performing prompts. We hypothesize that this is due to the overshooting simplifications generated by the 5Y prompt, which distort the utility estimates used to select the best candidate. This demonstrates the dependency of the approach on the source distribution of candidates.

3.1. Human Preference Selection

We investigate the selection process of Minimum Bayes Risk (LENS) by comparing it to how a human would select the best candidate for simplification.

For 50 human-annotated selections, we visualize the percentage of examples selected from each source prompt in Figure 2. About 38% of the candidates chosen by the human annotator were generated with intermediate definitions, with the qualitative impression that they improve the clarity of complex terms, making them easier to read. In contrast, Minimum Bayes Risk (LENS) selected mainly (58%) samples from the 5Y prompt, which was the least selected by the human with a selection rate of only 10%, due to the qualitative impression that the candidates lacked important details from the source. In general, the agreement between Minimum Bayes Risk and human selection is quite low, with a Cohen's κ = 0.14.
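For reference, Cohen's κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance from their label frequencies: κ = (p_o - p_e) / (1 - p_e). The following sketch computes it for two lists of selected source prompts; the function name and the toy labels are assumptions for illustration and are not the annotation data of this study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies of each annotator.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy selections (source prompt of the chosen candidate), not the study data.
human = ["ID", "ID", "5Y", "general"]
mbr = ["ID", "5Y", "5Y", "5Y"]
kappa = cohens_kappa(human, mbr)
```

A value near 0 means the two selectors agree about as often as chance would predict, while 1 indicates perfect agreement.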

4. Limitations

In our study, we only examine the behavior of Minimum Bayes Risk within a limited set of different prompts. In practice, Minimum Bayes Risk may be limited either by the pipelines that produce the source candidates or by the utility function itself, LENS. Importantly, our human selection annotation study is subjective and performed on a small sample of simplifications.

5. Conclusions

This study extended previous work on scientific text simplification using Multi-Prompt Minimum Bayes Risk re-ranking applied to the SimpleText2024 Shared Task 3 dataset. Our results showed significant differences in performance between prompts, with one prompt leading to oversimplification. Finally, we measured the agreement between Minimum Bayes Risk and human selection, including qualitative observations.

Acknowledgments

We express our deepest gratitude and sincere appreciation to Simon Clematide and the Department of Computational Linguistics for their unwavering support, computational resources and constructive guidance during the creation of this work. Andrianos Michail acknowledges funding by the SNSF (213585) under the "impresso 2" project.

[1] A. Michail, P. S. Andermatt, T. Fankhauser, Simpletext best of labs in CLEF-2023: Scientific text simplification using multi-prompt minimum bayes risk decoding, in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.

[2] T. Kew, A. Chi, L. Vásquez-Rodríguez, S. Agrawal, D. Aumiller, F. Alva-Manchego, M. Shardlow, BLESS: Benchmarking large language models on sentence simplification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13291-13309. URL: https://aclanthology.org/2023.emnlp-main.821. doi:10.18653/v1/2023.emnlp-main.821.

[3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D'Souza, J. Kamps, Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone, in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L.

[4] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

[5] S. Kumar, W. Byrne, Minimum bayes-risk word alignments of bilingual texts, in: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Association for Computational Linguistics, 2002, pp. 140-147. URL: https://aclanthology.org/W02-1019. doi:10.3115/1118693.1118712.

[6] S. Kumar, W. Byrne, Minimum bayes-risk decoding for statistical machine translation, in: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Association for Computational Linguistics, Boston, Massachusetts, USA, 2004, pp. 169-176. URL: https://aclanthology.org/N04-1022.

[7] M. Müller, R. Sennrich, Understanding the properties of minimum bayes risk decoding in neural machine translation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 259-272. URL: https://aclanthology.org/2021.acl-long.22. doi:10.18653/v1/2021.acl-long.22.

[8] M. Maddela, Y. Dou, D. Heineman, W. Xu, LENS: A learnable evaluation metric for text simplification, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 16383-16408. URL: https://aclanthology.org/2023.acl-long.905. doi:10.18653/v1/2023.acl-long.905.

[9] R. Flesch, Marks of readable style; a study in adult education, Teachers College Contributions to Education (1943).

[10] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine translation for text simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401-415.