<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UZH_Pandas at SimpleText2024: Multi-Prompt Minimum Bayes Risk with Diverse Prompts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrianos Michail</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Severin Andermatt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Fankhauser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes further experiments building on the paper "SimpleText Best of Labs in CLEF-2023: Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding" [1], adapted to the SimpleText2024 Shared Task 3.1 dataset. We observe how candidate simplifications generated by the off-the-shelf Llama3 perform differently depending on the prompt, and whether Minimum Bayes Risk (MBR) re-ranking is beneficial with underperforming candidates. Finally, on a small sample, we investigate the agreement of simplification candidate re-rankings between MBR and a human annotator.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Text Simplification</kwd>
        <kwd>Generative Language Models</kwd>
        <kwd>Minimum Bayes Risk Decoding</kwd>
        <kwd>Multi Prompt Ensembling</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>SimpleText@CLEF-2024</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        We perform the simplifications with the off-the-shelf Llama3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 8B model, using the prompts in Table 1.
In addition to the plain prompts, we also experiment with variations of the prompts in which we provide the
simplification model with intermediate definitions of complex terms during inference.
      </p>
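      <p>A minimal sketch of how a plain prompt and its variant with intermediate definitions might be assembled. The two instruction strings are the ones listed in Table 1; the key names, the function build_prompt, the layout of the definitions block, and the example sentence are assumptions made for illustration, since the exact format used in the experiments is not specified here.</p>
      <preformat>
```python
# The two instruction strings are quoted from Table 1. Everything else
# (key names, function name, the "Definitions:" layout, the example inputs)
# is a hypothetical illustration, not the format used in the experiments.
PROMPTS = {
    "general": (
        "Simplify the following scientific sentence to make it more "
        "understandable for a general audience:"
    ),
    "5Y": (
        "Simplify the following scientific sentence. Explain it as if you "
        "were talking to a 5-year-old, using simple words and concepts:"
    ),
}


def build_prompt(sentence, prompt_key, definitions=None):
    """Return the text sent to the model for one candidate simplification."""
    parts = [PROMPTS[prompt_key]]
    if definitions:
        # Assumed layout: one "term: definition" line per complex term.
        parts.append("Definitions:")
        parts.extend(f"- {term}: {text}" for term, text in definitions.items())
    parts.append(sentence)
    return "\n".join(parts)


if __name__ == "__main__":
    source = "Transformers rely on self-attention."
    # Plain prompt.
    print(build_prompt(source, "5Y"))
    # Same source with a (made-up) intermediate definition added.
    print(build_prompt(source, "general", {"self-attention": "a way to compare words in a sentence"}))
```
      </preformat>
      <p>Keeping the definitions as an optional argument means each plain prompt and its definition-augmented variant share one code path, so every source sentence yields one candidate per prompt variant.</p>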
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Target</th>
              <th>Prompt</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td />
              <td>Simplify the following scientific sentence to make it more understandable for a general audience:</td>
            </tr>
            <tr>
              <td>P2: 5Y</td>
              <td>Simplify the following scientific sentence. Explain it as if you were talking to a 5-year-old, using simple words and concepts:</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>Schematic of the approach. Labels: Dataset (Task 3); Llama 3; Default; Intermediate Definitions; Candidates Generation; Minimum Bayes Risk (LENS); Best; Multi-Prompt Candidate Selection; Result.</p>
        </caption>
      </fig>
      <p>These definitions are generated by the same LLM in a separate session. We refer to the simplifications
generated with this approach as being generated through Intermediate Definitions (ID).</p>
      <p>
        We ablate whether selecting the best candidate using Minimum Bayes Risk [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] with LENS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as the
utility function results in better performance. The complete schematic is illustrated in Figure 1.
      </p>
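      <p>The selection step can be sketched as follows: each candidate is scored by its average utility against all other candidates, which act as pseudo-references, and the highest-scoring candidate is kept. This is an illustrative sketch only. The learned LENS metric is replaced here by a toy token-overlap F1 so that the example runs without model weights, and the names overlap_f1 and mbr_select are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def overlap_f1(hypothesis, reference):
    """Toy utility: token-level F1 between two strings (stand-in for LENS)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    common = sum(min(count, ref[token]) for token, count in hyp.items())
    if common == 0:
        return 0.0
    precision = common / sum(hyp.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return (index, score) of the candidate with the highest mean utility
    against all other candidates, which serve as pseudo-references."""
    if len(candidates) == 1:
        return 0, 0.0
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(utility(hyp, ref) for ref in others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Made-up candidates; the third overlaps most with the rest of the pool.
    pool = ["a b c", "a b d", "a b c d", "x y z"]
    index, score = mbr_select(pool)
    print(index, round(score, 3))
```
      </preformat>
      <p>Because the score of a candidate depends only on the other candidates in the pool, the selection is sensitive to how the pool is composed: a group of mutually similar candidates tends to reinforce itself.</p>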
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        In Table 2 we show the simplification evaluations of each individual prompt, together with the
evaluations of simplifications selected by Minimum Bayes Risk. The evaluation metrics generally agree
on the ranking of the systems. The clear exception is that the simplifications receive exceptionally
high FKGL [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] scores when the model is prompted with Intermediate Definitions (ID), because the definitions are
embedded within the sentence. Conversely, the extremely low FKGL score of the 5Y prompt indicates
that the model is over-simplifying the text, probably omitting some important details of the source text.
The limitations of these prompts are also reflected in SARI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], supporting its appropriateness as
an evaluation metric.
      </p>
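      <p>To make the FKGL behavior concrete, the sketch below evaluates the standard Flesch-Kincaid Grade Level formula from raw counts. The function name fkgl and the word, sentence, and syllable counts in the example are hypothetical and are not taken from the experiments.</p>
      <preformat>
```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    if total_words == 0 or total_sentences == 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Made-up counts: one short sentence versus the same sentence
    # with a definition embedded in it.
    print(round(fkgl(12, 1, 18), 2))
    print(round(fkgl(30, 1, 48), 2))
```
      </preformat>
      <p>Since the first term grows with the number of words per sentence, embedding a definition inside a sentence raises the score even when the added words are simple, while very short sentences with short words push the score down.</p>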
      <p>
        Contrary to previous results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], simplifications selected by Minimum Bayes Risk received worse
ratings than the two best-performing prompts. We hypothesize that this is due to the overshooting of
simplifications generated by the 5Y prompt, which have a negative effect on the utility used to select the best candidate,
demonstrating the dependency of the approach on the source distribution of candidates.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Human Preference Selection</title>
        <p>We investigate the selection process of Minimum Bayes Risk (LENS) by comparing it to how a human
would select the best candidate for simplification.</p>
        <p>Out of 50 human-annotated selections, we visualize the percentage of examples selected from each
source prompt in Figure 2. About 38% of the human's selections were simplification candidates
generated with intermediate definitions, with the qualitative impression that they improve the clarity of
complex terms, making them easier to read. In contrast, Minimum Bayes Risk (LENS) selected mainly
(58%) samples from the 5Y prompt, which was the least selected by the human, with a selection rate
of only 10%, due to the qualitative impression that the candidates lacked important details from the
source. Overall, the agreement between Minimum Bayes Risk and human selection
is low, with Cohen's κ = 0.14.</p>
      </sec>
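      <p>As an illustration of the agreement measure, the sketch below computes Cohen's kappa for two selectors that each pick one source prompt per example. It assumes that the label for an example is the prompt whose candidate was selected; the function name cohens_kappa and the toy labels are hypothetical and do not reproduce the annotation data.</p>
      <preformat>
```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies of each selector.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both selectors always use the same single label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up selections for four examples.
    human = ["ID", "ID", "5Y", "5Y"]
    mbr = ["ID", "5Y", "5Y", "5Y"]
    print(cohens_kappa(human, mbr))
```
      </preformat>
      <p>Kappa corrects the raw agreement rate for the agreement expected by chance, so a value close to zero means the two selectors agree little more often than their label frequencies alone would produce.</p>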
    </sec>
    <sec id="sec-4">
      <title>4. Limitations</title>
      <p>In our study, we only examine the behavior of Minimum Bayes Risk within a limited set of different
prompts. In practice, Minimum Bayes Risk using LENS may be limited by the source candidate pipelines
or by the utility function itself, LENS. Importantly, our human selection annotation study is subjective and
was performed on a small sample of simplifications.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This study extended previous work on scientific text simplification using Multi-Prompt Minimum Bayes
Risk re-ranking, applied to the SimpleText2024 Shared Task 3 dataset. Our results showed significant
differences in performance between prompts, with one prompt leading to oversimplification. Finally,
we measured the agreement between Minimum Bayes Risk and human selection, including qualitative
observations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We express our deepest gratitude and sincere appreciation to Simon Clematide and the Department of
Computational Linguistics for their unwavering support, computational resources and constructive
guidance during the creation of this work. Andrianos Michail acknowledges funding by the SNSF
(213585) under the "impresso 2" project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>A.</given-names> <surname>Michail</surname></string-name>,
          <string-name><given-names>P. S.</given-names> <surname>Andermatt</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Fankhauser</surname></string-name>,
          <article-title>Simpletext best of labs in CLEF-2023: Scientific text simplification using multi-prompt minimum bayes risk decoding</article-title>
          , in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>T.</given-names> <surname>Kew</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Chi</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Vásquez-Rodríguez</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agrawal</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Aumiller</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Alva-Manchego</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Shardlow</surname></string-name>,
          <article-title>BLESS: Benchmarking large language models on sentence simplification</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>13291</fpage>
          -
          <lpage>13309</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.821. doi:10.18653/v1/2023.emnlp-main.821.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>L.</given-names> <surname>Ermakova</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>SanJuan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Huet</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Azarbonyad</surname></string-name>,
          <string-name><given-names>G. M.</given-names> <surname>Di Nunzio</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Vezzani</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>D'Souza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>,
          <article-title>Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone</article-title>
          , in: L. Goeuriot, G. Q. Philippe Mulhem, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] AI@Meta,
          <article-title>Llama 3 model card</article-title>
          (
          <year>2024</year>
          ). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>S.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Byrne</surname></string-name>,
          <article-title>Minimum bayes-risk word alignments of bilingual texts</article-title>
          , in:
          <source>Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)</source>
          , Association for Computational Linguistics,
          <year>2002</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>147</lpage>
          . URL: https://aclanthology.org/W02-1019. doi:10.3115/1118693.1118712.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>S.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Byrne</surname></string-name>,
          <article-title>Minimum bayes-risk decoding for statistical machine translation</article-title>
          , in:
          <source>Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004</source>
          , Association for Computational Linguistics, Boston, Massachusetts, USA,
          <year>2004</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>176</lpage>
          . URL: https://aclanthology.org/N04-1022.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>M.</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Sennrich</surname></string-name>,
          <article-title>Understanding the properties of minimum bayes risk decoding in neural machine translation</article-title>
          , in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>272</lpage>
          . URL: https://aclanthology.org/2021.acl-long.22. doi:10.18653/v1/2021.acl-long.22.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>M.</given-names> <surname>Maddela</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dou</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Heineman</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Xu</surname></string-name>,
          <article-title>LENS: A learnable evaluation metric for text simplification</article-title>
          , in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>16383</fpage>
          -
          <lpage>16408</lpage>
          . URL: https://aclanthology.org/2023.acl-long.905. doi:10.18653/v1/2023.acl-long.905.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Flesch</surname>
          </string-name>
          ,
          <article-title>Marks of readable style; a study in adult education</article-title>
          , Teachers College Contributions to Education (
          <year>1943</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Optimizing statistical machine translation for text simplification</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>401</fpage>
          -
          <lpage>415</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>