<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jussi Karlgren</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aarne Talman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Silo AI</institution>
          ,
          <addr-line>Helsinki</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Helsinki</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and to work in a multilingual setting. One of the tasks for the first year of ELOQUENT was the Topical quiz, in which language models are probed for topical competence. This first year of experimentation has shown - as expected - that self-assessment, with models judging models, is feasible but not entirely straightforward, and that a judicious comparison with human assessment and application context is necessary before self-assessed quality judgments can be trusted.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Generative language models (“LLMs”) as a foundational component in an information system are able
to handle a broad variety of input data robustly and elegantly, and can provide appropriately
creative generated output to fit a broad range of application situations and the preferences of a diverse
user population. An information service built on a generative language model can provide
a flexible, low-threshold conversational interface for its users: there is considerable interest in putting
generative language models to use in productive practical applications, across domains, sectors of
society, languages, and cultural areas.</p>
      <p>The ELOQUENT lab is intended to probe the quality of a generative language model, specifically by
addressing quality issues that arise at deployment time, when a model is included in a system for
productive downstream tasks. The lab also intends to explore the reliability of system self-assessment
of model quality, using other models or even the same model, and thereby to reduce the dependence on
human-assessed gold-standard data sets.</p>
      <p>
        A generative language model in practical application will in most envisioned use cases be expected to
stay within given task-appropriate topical boundaries, to generate material restricted to the domain it is
employed to work within, and to have competence in the terminology and conventions of that domain.
Examples of relevant topical domains could be business domains, such as finance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or healthcare [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
or even recreational activities such as sailing or basketball, extending to differences in how a topic is
treated across linguistic and cultural areas or in specific demographic groups.
      </p>
      <p>The topical quiz task addresses the need to verify a model’s understanding of an
application domain of interest. The task is defined for a system to generate a topical quiz for some given
topic; to respond to such quizzes, including the one it has generated itself; and to score responses to
quizzes numerically from 1 to 10. Every participating team was given a list of topics, shared as a JSON
structure, and asked to use their system or systems to generate a set of questions for each topic. The
dataset includes a suggested prompt string, but participants were free to reformulate the string to fit
their model or system. The generated questions were submitted in a prescribed JSON structure by the
participants through a submission form. These question structures were shared back to the participants
for them to use their systems to generate responses to the questions. The generated responses were
then again submitted in a prescribed JSON structure by the participants using a submission form. These
responses were then scored 1-10 by four systems: Reindeer-Poro, Reindeer-Mistral, GPT-SW3, and
GPT-4o. An example topic with responses and scores is shown in Figure 1.</p>
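      <p>The exchange described above can be sketched in code. The field names below are illustrative only: the lab's actual JSON schemas are not reproduced here, so the structures are an assumption about their general shape, not the prescribed format.</p>

```python
import json

# Hypothetical shapes for the task's JSON exchange; the actual ELOQUENT
# schemas may differ (all field names here are illustrative only).
topics = {
    "topics": [
        {"id": "t01", "topic": "Wine and terroir",
         "prompt": "Generate a quiz of questions about the topic: {topic}"}
    ]
}

# A participant system fills in generated questions per topic ...
questions = {
    "team": "example-team",
    "system": "example-model",
    "questions": [
        {"topic_id": "t01",
         "question": "What factors contribute to the unique terroir of a region?"}
    ]
}

# ... and later submits responses, which scoring systems then rate 1-10.
responses = {
    "team": "example-team",
    "system": "example-model",
    "responses": [
        {"topic_id": "t01", "response": "Soil composition, climate, ..."}
    ]
}

print(json.dumps(questions, indent=2))
```

      <p>Each stage round-trips through the same kind of structure, which is what makes the quiz, response, and scoring steps composable across participant systems.</p>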
      <sec id="sec-1-1">
        <title>Topic: Wine and terroir</title>
      </sec>
      <sec id="sec-1-2">
        <title>Question (Reindeer-Poro):</title>
        <p>What factors contribute to the unique terroir of a vineyard or region?</p>
      </sec>
      <sec id="sec-1-4">
        <title>Response (Reindeer-Mistral):</title>
        <p>The factors that contribute to the unique terroir of a vineyard or region
include the soil composition, drainage patterns, exposure to sunlight, temperature and rainfall patterns, and
the presence of natural features such as mountains, rivers, and bodies of water. These factors can vary
greatly from one vineyard to another, even within the same region, and can have a significant impact on the
taste and character of the wine produced.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Response (Reindeer-Poro):</title>
        <p>Factors that contribute to the unique terroir of a vineyard or region include
the soil type and composition, the climate, the topography, and the local microclimate. These factors can
interact to influence the growth and development of the vines, as well as the flavors and characteristics of
the wine.</p>
      </sec>
      <sec id="sec-1-6">
        <title>Response (GPT-SW3):</title>
        <p>A unique terroir is the combination of natural conditions like soil composition,
elevation, climate, and exposure to sunlight, as well as human factors such as farming methods, viticultural
practices, and winemaking techniques that all contribute to creating a distinct quality in wine. The specific
characteristics of each terroir are what make it distinctive, allowing for wines from one region to taste
different than wines from another region with similar conditions.</p>
      </sec>
      <sec id="sec-1-7">
        <title>Scores</title>
        <p>Scores assigned to each response (rows) by each scoring system (columns):</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th/><th>Reindeer-Mistral</th><th>Reindeer-Poro</th><th>GPT-SW3</th><th>GPT-4o</th></tr>
            </thead>
            <tbody>
              <tr><td>Response (Reindeer-Mistral)</td><td>9</td><td>8</td><td>8</td><td>8</td></tr>
              <tr><td>Response (Reindeer-Poro)</td><td>7</td><td>9</td><td>9</td><td>7.5</td></tr>
              <tr><td>Response (GPT-SW3)</td><td>8</td><td>10</td><td>10</td><td>8</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-1-9">
        <title>Results</title>
        <p>
          The task had 27 registered participant teams. Three teams submitted quizzes, with two teams
submitting responses. The teams used Poro [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], Mistral [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (for team "Reindeer" [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]), GPT-SW3 [6], and
a modular RAG approach [7] (for team Verbanex). This rather limited set of submitted data does not
support the originally planned, fairly elaborate scoring scheme, which would have compared the
correlation of scores across systems and how reliably a system's self-assigned scores track the scores
given by other systems.
        </p>
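        <p>The originally planned cross-system comparison can be sketched as follows: given each scoring system's scores over the same set of responses, compute pairwise correlations of their scoring behaviour. The score values and the helper below are illustrative, not the lab's actual analysis code.</p>

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative 1-10 scores: each scoring system rates the same five responses.
scores = {
    "Reindeer-Mistral": [9, 7, 8, 6, 9],
    "Reindeer-Poro":    [8, 9, 10, 7, 8],
    "GPT-SW3":          [8, 9, 10, 6, 9],
    "GPT-4o":           [8, 7.5, 8, 5, 9],
}

# Pairwise correlation of scoring behaviour across systems.
for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: {pearson(scores[a], scores[b]):+.2f}")
```

        <p>With enough submissions, low correlation between one system and all others would flag that system's scores as unreliable, which is the kind of signal the planned scheme was meant to surface.</p>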
        <p>The quizzes generated by the systems are of varying quality, but mostly quite acceptable. The
responses are somewhat more varied, since in some cases systems failed to generate
any response at all and instead repeated the questions, descended into a generative loop, or gave an
overall summary of the question set rather than responding to it. Scoring the responses clearly was the
most demanding task, since even instruction-tuned models struggle with processing lists of items
and assessing them in turn. A sample question for one of the topics is given in Figure 1 together with
responses from three systems and scores for those responses by the systems themselves and GPT-4o. The
scores are high and vary in fair consensus across the scoring systems for this specific question, with no
clear preference exhibited for own answers. This is not as clearly the case for questions where there is
a larger discrepancy in topical competence. Another example is given in Figure 2, where the responses
to the question "What are the major rules and regulations changes in the NBA this season?" are quite
different from each other and in many cases factually erroneous, in that they disregard the "this season"
specification and bring up rule changes from previous seasons. The self-assessed scoring in this case is
unreliable, limited by the actual topical competence of the system doing the scoring, and appears to be
based more on the appearance of competence and clarity of expression than on factual correctness.</p>
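        <p>Whether a system favours its own answers can be checked with a simple sketch: compare each system's score for its own response against the mean score the other systems give that same response. The score matrix below is illustrative and the helper function is hypothetical, not part of the lab's tooling.</p>

```python
# Illustrative score matrix: scores[scorer][responder], on a 1-10 scale.
scores = {
    "Reindeer-Mistral": {"Reindeer-Mistral": 9, "Reindeer-Poro": 7, "GPT-SW3": 8},
    "Reindeer-Poro":    {"Reindeer-Mistral": 8, "Reindeer-Poro": 9, "GPT-SW3": 10},
    "GPT-SW3":          {"Reindeer-Mistral": 8, "Reindeer-Poro": 9, "GPT-SW3": 10},
}

def self_preference(scores, system):
    """Own score minus the mean score other systems give the same response.

    A consistently positive value across many questions would suggest a
    self-preference bias in the scoring system.
    """
    own = scores[system][system]
    others = [scores[scorer][system] for scorer in scores if scorer != system]
    return own - sum(others) / len(others)

for system in scores:
    print(f"{system}: {self_preference(scores, system):+.2f}")
```

        <p>On a single question such a difference is noise; aggregated over a full quiz set it would separate genuine consensus from self-flattery.</p>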
        <p>A more comprehensive exposition of responses and their scoring is given in the task-specific overview
paper.</p>
      </sec>
      <sec id="sec-1-10">
        <title>Topic: Basketball News</title>
      </sec>
      <sec id="sec-1-11">
        <title>Question (Reindeer-Mistral):</title>
        <p>What are the major rules and regulations changes in the NBA this season?</p>
      </sec>
      <sec id="sec-1-13">
        <title>Response (Reindeer-Mistral):</title>
        <p>The major rules and regulations changes in the NBA this season include the
implementation of a bubble for the playoffs and the elimination of the All-Star Game.</p>
      </sec>
      <sec id="sec-1-14">
        <title>Response (Reindeer-Poro):</title>
        <p>The major rules and regulations changes in the NBA this season include a new
emphasis on protecting the basketball, limiting the number of offensive rebounds, and reducing the number
of timeouts. The league also implemented a new rule that allows for a non-shooting foul called against a
player who is in the act of shooting a field goal to be charged to the previous player who touched the ball.</p>
      </sec>
      <sec id="sec-1-15">
        <title>Response (GPT-SW3):</title>
        <p>This is a complex question that involves many factors. Here’s a summary of some
of the most significant changes: [lengthy exposition of pandemic-related quarantine rules and salary cap regulations]</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Conclusion</title>
      <p>The goal of the Topical quiz task of the ELOQUENT lab was to evaluate the quality of LLMs by how well
they can generate, respond to, and score in-domain questions. We find that system performance
varies considerably across specific tasks, which does not yet allow for any systematic observations. The
cross-model evaluation set-up proved to be challenging without the use of human annotations. This is
something we will be working on in coming editions of ELOQUENT, together with exploring new
automatic ways of evaluating LLM-generated outputs.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This lab has been supported by the European Commission through the DeployAI project (grant number
101146490).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irsoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dabravolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kambadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          , G. Mann,
          <article-title>Bloomberggpt: A large language model for finance</article-title>
          ,
          <source>arXiv preprint arXiv:2303.17564</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tanwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cole-Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pfohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Babiker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schärli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mansfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          , B. Agüera y Arcas,
          <string-name>
            <given-names>D.</given-names>
            <surname>Webster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gottweis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomasev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajkomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Semturs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karthikesalingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luukkonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Burdge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Komulainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hatanpää</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sarlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <article-title>Poro 34b and the blessing of multilinguality</article-title>
          ,
          <source>arXiv preprint arXiv:2404.01856</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Neralla</surname>
          </string-name>
          , S. Bijl de Vroe,
          <article-title>Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1: LLM System Description for ELOQUENT at CLEF 2024</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          , P. Galuščáková, A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          , CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] A. Ekgren, A. C. Gyllensten, E. Gogoulou, A. Heiman, S. Verlinden, J. Öhman, F. Carlsson,
          M. Sahlgren,
          <article-title>Lessons learned from GPT-SW3: Building the first large-scale generative language model for Swedish</article-title>
          , in:
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.10997</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>