<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ELOQUENT 2024 - Topical Quiz Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jussi</forename><surname>Karlgren</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Silo AI</orgName>
								<address>
									<settlement>Helsinki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aarne</forename><surname>Talman</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">University of Helsinki</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">ELOQUENT 2024 - Topical Quiz Task</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6C8E63AD20675F6DD51DAA2EF4BB604F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the Topical Quiz, in which language models are probed for topical competence. This first year of experimentation has shown, as expected, that self-assessment with models judging models is feasible but not entirely straightforward, and that a judicious comparison with human assessment and the application context is necessary before self-assessed quality judgments can be trusted.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Generative language models ("LLMs") as a foundational component in an information system can handle a broad variety of input data robustly and elegantly, and can provide appropriately creative generated output to fit a broad range of application situations and the preferences of a diverse user population. An information service built on a generative language model can provide a flexible, low-threshold conversational interface for its users, and there is considerable interest in putting generative language models to use in productive practical applications, across domains, sectors of society, languages, and cultural areas.</p><p>The ELOQUENT lab is intended to probe the quality of a generative language model, and to do so by addressing specifically those quality issues that arise at deployment time, when a model is included in a system for productive downstream tasks. The lab also intends to explore the reliability of system self-assessment of model quality, using other models or even the same model, and to reduce the dependence on human-assessed gold-standard data sets. A generative language model in practical application will in most envisioned use cases be expected to stay within given task-appropriate topical boundaries, to generate material restricted to the domain it is employed to work within, and to have competence in the terminology and conventions of that domain. Examples of relevant topical domains could be business domains, such as finance <ref type="bibr" target="#b0">[1]</ref> or healthcare <ref type="bibr" target="#b1">[2]</ref>, or even recreational activities such as sailing or basketball, extending to differences in how a topic is treated across linguistic and cultural areas or in specific demographic groups.</p><p>The topical quiz task addresses the need to verify a model's understanding of an application domain of interest.
The task is defined for a system to generate a topical quiz for some given topic; to respond to such quizzes, including the one it has generated itself; and to score responses to quizzes numerically from 1 to 10. Every participating team was given a list of topics, shared as a JSON structure, and asked to use their system or systems to generate a set of questions for each topic. The dataset includes a suggested prompt string, but participants were free to reformulate the string to fit their model or system. The generated questions were submitted by the participants in a prescribed JSON structure through a submission form. These question structures were shared back to the participants for them to use their systems to generate responses to the questions. The generated responses were then again submitted in a prescribed JSON structure using a submission form. These responses were then scored from 1 to 10 by four systems: Reindeer-Poro, Reindeer-Mistral, GPT-SW3, and GPT-4o. An example topic with responses and scores is shown in Figure <ref type="figure">2</ref>.</p><p>Topic: Wine and terroir. Question (Reindeer-Poro): What factors contribute to the unique terroir of a vineyard or region? Response (Reindeer-Mistral): The factors that contribute to the unique terroir of a vineyard or region include the soil composition, drainage patterns, exposure to sunlight, temperature and rainfall patterns, and the presence of natural features such as mountains, rivers, and bodies of water. These factors can vary greatly from one vineyard to another, even within the same region, and can have a significant impact on the taste and character of the wine produced. Response (Reindeer-Poro): Factors that contribute to the unique terroir of a vineyard or region include the soil type and composition, the climate, the topography, and the local microclimate. These factors can interact to influence the growth and development of the vines, as well as the flavors and characteristics of the wine. Response (GPT-SW3): A unique terroir is the combination of natural conditions like soil composition, elevation, climate, and exposure to sunlight, as well as human factors such as farming methods, viticultural practices, and winemaking techniques that all contribute to creating a distinct quality in wine. The specific characteristics of each terroir are what make it distinctive, allowing for wines from one region to taste different than wines from another region with similar conditions.</p><p>The task had 27 registered participant teams. Three teams submitted quizzes, and two of those submitted responses. The teams used Poro <ref type="bibr" target="#b2">[3]</ref>, Mistral <ref type="bibr" target="#b3">[4]</ref> (team "Reindeer" <ref type="bibr" target="#b4">[5]</ref>), GPT-SW3 <ref type="bibr" target="#b5">[6]</ref>, and a modular RAG approach <ref type="bibr" target="#b6">[7]</ref> (team Verbanex). This rather limited set of submitted data does not support the originally planned, fairly elaborate scoring scheme, which would have compared the correlation of scoring across systems and the reliability of a system's self-scoring against the scores given by others.</p></div>
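The three-phase protocol described above (generate a quiz per topic, respond to the shared quizzes, score responses from 1 to 10, all exchanged as JSON structures) can be sketched as follows. This is a minimal illustration: the field names, prompt strings, and stand-in "models" are assumptions for the sketch, not the lab's actual JSON schema or systems.

```python
import json

def generate_quiz(topic, ask):
    """Phase 1: a system generates quiz questions for a given topic."""
    return {"topic": topic, "questions": ask(f"Write quiz questions on: {topic}")}

def respond(quiz, ask):
    """Phase 2: systems (including the quiz author) answer each question."""
    return {"topic": quiz["topic"],
            "responses": [{"question": q, "answer": ask(q)}
                          for q in quiz["questions"]]}

def score(submission, judge):
    """Phase 3: scoring systems rate each response on a 1-10 scale."""
    return [{"question": r["question"], "score": judge(r["question"], r["answer"])}
            for r in submission["responses"]]

# Stand-in callables so the pipeline can be exercised end to end.
quiz = generate_quiz("Wine and terroir",
                     lambda p: ["What factors shape a vineyard's terroir?"])
submission = respond(quiz, lambda q: "Soil, climate, and topography.")
scores = score(submission, lambda q, a: 7)
print(json.dumps(scores))
```

In the actual task each phase was a separate submission round, with the JSON structures passed between participants through submission forms rather than function calls.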
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quizzes generated by the systems are of varying quality, but mostly quite acceptable. The responses are somewhat more varied, since in some cases a system failed to generate any response at all and instead repeated the questions, descended into a generative loop, or gave an overall summary of the question set rather than responding to it. Scoring the responses was clearly the most demanding task, since even instruction-tuned models struggle with processing lists of items and assessing them in turn. A sample question for one of the topics is given in Figure <ref type="figure">1</ref>, together with responses from three systems and scores for those responses by the systems themselves and by GPT-4o. The scores are high and agree fairly well across the scoring systems for this specific question, with no clear preference exhibited for a system's own answers. This is not as clearly the case for questions where there is a larger discrepancy in topical competence. Another example is given in Figure <ref type="figure">2</ref>, where the responses to the question "What are the major rules and regulations changes in the NBA this season?" differ considerably from each other and are in many cases factually erroneous, in that they disregard the "this season" specification and take up rule changes from previous seasons. The self-assessed scoring in this case is unreliable, limited by the actual topical competence of the system doing the scoring, and appears to be based more on the appearance of competence and clarity of expression than on factual correctness.</p><p>A more comprehensive exposition of responses and their scoring is given in the task-specific overview paper.</p><p>Topic: Basketball News. Question (Reindeer-Mistral): What are the major rules and regulations changes in the NBA this season? Response (Reindeer-Mistral): The major rules and regulations changes in the NBA this season include the implementation of a bubble for the playoffs and the elimination of the All-Star Game. Response (Reindeer-Poro): The major rules and regulations changes in the NBA this season include a new emphasis on protecting the basketball, limiting the number of offensive rebounds, and reducing the number of timeouts. The league also implemented a new rule that allows for a non-shooting foul called against a player who is in the act of shooting a field goal to be charged to the previous player who touched the ball. Response (GPT-SW3): This is a complex question that involves many factors. Here's a summary of some of the most significant changes: [a lengthy exposition of pandemic-related quarantine rules and salary cap regulations]</p></div>
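The question of whether a system prefers its own answers can be made concrete as a simple self-preference check over a score matrix. The sketch below uses the scores from Figure 2 and assumes the score columns are ordered Reindeer-Mistral, Reindeer-Poro, GPT-SW3, GPT-4o; that column order is an assumption reconstructed from the figure's header, not stated in the running text.

```python
scorers = ["Reindeer-Mistral", "Reindeer-Poro", "GPT-SW3", "GPT-4o"]
# Responder -> scores given by each scorer, in the column order above
# (values taken from the Figure 2 score table).
scores = {
    "Reindeer-Mistral": [7, 6, 8, 4],
    "Reindeer-Poro":    [6, 6, 8, 3],
    "GPT-SW3":          [9, 10, 8, 6],
}

# Self-preference bias: a system's score for its own answer minus the
# mean of the scores the other systems gave that same answer.
bias = {}
for responder, row in scores.items():
    self_score = row[scorers.index(responder)]
    others = [s for i, s in enumerate(row) if scorers[i] != responder]
    bias[responder] = self_score - sum(others) / len(others)
    print(f"{responder}: self {self_score}, bias {bias[responder]:+.2f}")
```

A positive bias would indicate a system rating its own answer above the consensus of the other judges; on this single question the biases are small, consistent with the scoring being driven more by surface competence than by authorship.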
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Conclusion</head><p>The goal of the Topical Quiz task of the ELOQUENT lab was to evaluate the quality of LLMs by how well they can generate, respond to, and score in-domain questions. We find that system performance varies greatly across specific tasks, which does not yet allow for any systematic observations. The cross-model evaluation set-up proved to be challenging without the use of human annotations. This is something we will be working on in coming editions of ELOQUENT, together with exploring new automatic ways of evaluating LLM-generated outputs.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Figure 2: A second, more topically demanding sample topic for the Topical Quiz task and a sample quiz question by Reindeer-Mistral; responses by Reindeer-Poro, Reindeer-Mistral, and GPT-SW3; and scores for the responses as given by Reindeer-Poro, Reindeer-Mistral, GPT-SW3, and GPT-4o</figDesc><table><row><cell></cell><cell>Reindeer-Mistral</cell><cell>Reindeer-Poro</cell><cell>GPT-SW3</cell><cell>GPT-4o</cell></row><row><cell>Reindeer-Mistral</cell><cell>7</cell><cell>6</cell><cell>8</cell><cell>4</cell></row><row><cell>Reindeer-Poro</cell><cell>6</cell><cell>6</cell><cell>8</cell><cell>3</cell></row><row><cell>GPT-SW3</cell><cell>9</cell><cell>10</cell><cell>8</cell><cell>6</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This lab has been supported by the European Commission through the DeployAI project (grant number 101146490).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Irsoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dabravolski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rosenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mann</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.17564</idno>
		<title level="m">BloombergGPT: A large language model for finance</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Large language models encode clinical knowledge</title>
		<author>
			<persName><forename type="first">K</forename><surname>Singhal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Azizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Mahdavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Scales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tanwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cole-Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pfohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Payne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seneviratne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gamble</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Babiker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schärli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mansfield</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Agüera Y Arcas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Webster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gottweis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tomasev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajkomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Barral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Semturs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karthikesalingam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Natarajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">620</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Luukkonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Burdge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Talman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Komulainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hatanpää</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sarlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pyysalo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.01856</idno>
		<title level="m">Poro 34B and the blessing of multilinguality</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
		<title level="m">Mistral 7B</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1: LLM System Description for ELOQUENT at CLEF</title>
		<author>
			<persName><forename type="first">V</forename><surname>Neralla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bijl De Vroe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Lessons learned from GPT-SW3: Building the first large-scale generative language model for Swedish</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ekgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Gyllensten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gogoulou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Heiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Verlinden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Öhman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Carlsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahlgren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC)</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference (LREC)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
