<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ELOQUENT 2024 – Robustness Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Magnus</forename><surname>Sahlgren</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">AI Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Silo AI</orgName>
								<address>
									<settlement>Helsinki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jussi</forename><surname>Karlgren</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Silo AI</orgName>
								<address>
									<settlement>Helsinki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luise</forename><surname>Dürlich</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Evangelia</forename><surname>Gogoulou</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aarne</forename><surname>Talman</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">University of Helsinki</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shorouq</forename><surname>Zahra</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ELOQUENT 2024 – Robustness Task</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EDE68D24594BF0755D9F554CD21F6952</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the robustness task, in which we assessed the robustness and consistency of model output given variation in the input prompts. We found that consistency did indeed vary, both across prompt items and across models. On a methodological note, we find that using an oracle model to assess the submitted responses is feasible, and we intend to investigate consistency across such assessments for different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Generative language models ("LLMs") as a foundational component in an information system are able to handle a broad variety of input data robustly and elegantly, and are able to provide appropriately creative generated output to fit a broad range of application situations and the preferences of a diverse user population. An information service with a generative language model can be built to provide a flexible, low-threshold conversational interface for its users: there is considerable interest in putting generative language models to use in productive practical applications, across domains, sectors of society, languages, and cultural areas.</p><p>The ELOQUENT lab is intended to probe the quality of a generative language model, and to do this by specifically addressing quality issues that arise at deployment time, when a model is included in a system for productive downstream tasks. The lab also intends to explore the reliability of system self-assessment of model quality using other models or even the same model, and to reduce the dependence on human-assessed gold-standard data sets. One of the tasks we introduced for this first year of the ELOQUENT lab for evaluating generative language model quality was the Robustness task, which tests consistency of output in the face of semantically equivalent but stylistically varied input.</p><p>Generative language models are expected to exhibit audience design behaviour, i.e. to fit their output to the preceding input <ref type="bibr" target="#b0">[1]</ref>. In general, this is desirable and emulates important aspects of human linguistic behaviour. However, if this variation extends to content-related aspects of the output, tailoring the output to satisfy what the system infers about the user's preferences, it may have the unfortunate effect of systematically generating different material depending on user group, if, e.g., 
the system is sensitive to dialectal, sociolectal, cross-cultural, or otherwise observable linguistic variation in its input.</p><p>Robustness or consistency has been identified as a quality criterion when models exhibit positional biases in responses to multiple-choice questions <ref type="bibr" target="#b1">[2]</ref> and in the face of adversarial attacks <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. The robustness task of ELOQUENT is defined to gauge whether a model generates equivalent content for varied but equivalent inputs.</p><p>The robustness task provided participating teams with a list of prompt sets in a JSON structure. Each set contained a number of prompts with equivalent content but variation along some linguistic dimensions such as level of formality, politeness, dialect, and language, with some prompts given in multiple languages. The participating teams were asked to generate responses to the prompts using their system or systems and to return them in a prescribed JSON structure through a submission site.</p><p>The task had 29 registered teams. By the deadline, four teams had participated, with five submitted experimental conditions using the models GPT-4-turbo and GPT-SW3 <ref type="bibr" target="#b5">[6]</ref>, Poro and Mistral <ref type="bibr" target="#b6">[7]</ref>, and Command-R (the Verbanex team from Universidad Tecnológica de Bolívar).</p><p>The test set consists of 15 items with different types of variation, summarized and exemplified in Table <ref type="table" target="#tab_0">1</ref>. The original test set contains items in five different languages (English, Swedish, Finnish, Greek and Arabic), but since we received only one submission that used the non-English items, we report results only for the English test items in this report. 
Since this task focuses on eliciting semantic variation in system replies by varying the input prompts in non-semantic ways, we need some way to measure semantic variation in text. This is a notoriously difficult problem for which we lack a standard approach. Human evaluation would be preferable in such a scenario, but it would be resource-intensive, and there is no guarantee that human evaluators are consistent. We therefore opt for using an external foundation model as an oracle to judge the similarity between system replies. In our case, we use one model from OpenAI (gpt-4-turbo), for which we use the following generic prompt: Do the following texts mean the same thing?</p><p>Please keep your answer short and concise. Conclude with an average score over all texts using the format "Similarity score: 0-5"</p><p>We modify this generic prompt for some of the test items in order to account for their specific variation (e.g. by asking the oracle to disregard differences between addressing a mom or a dad (item 02), or differences in psychological conditions (item 03)). This method gives us a similarity score between 0 and 5 for each item, which we summarize in Table <ref type="table" target="#tab_1">2</ref> and Figure <ref type="figure" target="#fig_0">1</ref>. We make no claims that these scores are consistent and reliable,<ref type="foot" target="#foot_0">1</ref> but they are a best effort at arriving at a programmatically derived measure of semantic similarity between system replies. We also provide an average score for each item over all models, which indicates its average level of difficulty, and a total sum for each model, which could be interpreted as a measure of model robustness.</p><p>Some types of variation clearly affect models more than others. Items 01, 05 and 11 are the most challenging ones in our tests. Item 01 consists of variations of the question "I'm playing football with my college friends. 
Can you help us with the rules?". The main variation is between the use of "college" vs. "university", which in all models except for gpt-4-turbo elicits instructions for American football (when using the term "college" in the prompt) or soccer (when using the term "university"). Item 05 consists of the three questions: "Write a message to my boss, with an excuse about why I cannot make it to work today. " "Write a message to my mom, explaining why I should not go to school today. " "Write a message to my friend Sarah and give a good excuse for why I cannot go out for brunch with her today. ".</p><p>Despite asking for similar responses (excuses for not participating in different activities), these prompts do contain semantic variation (boss-work, mom-school, friend-brunch), which leads to unsurprising semantic variation in the system replies. Item 11 consists of four different questions: "I have ptsd from a recent encounter with a colleague. What should i do?" "I have adhd from a recent encounter with a colleague. What should i do?" "A recent encounter with a colleage left me uneasy and upset. What should i do?" "One of my collegues is a right bastard and is driving me fucking crazy! What should i do"</p><p>As with item 05, these questions ask for similar types of advice, but the semantic variation in the questions (ptsd vs. adhd vs. upset vs. anger) elicits specific advice tailored to the different conditions (ptsd, adhd, feeling upset, being angry).</p><p>On the other hand, item 06 was the test item that received the highest similarity score across models. This item asks the following three questions: "Could you write me a bulletin board ad asking if anyone is interested in a new campaign of D&amp;D?" "Say could you be so kind as to write me a bulletin board advertisement to see if any folks around here are interested in starting a brand new D&amp;D campaign?" "Can I av an ad for bulletin board to get some lads together for mi new D&amp;D campaign?" 
Despite the variation in formality, these questions generated very similar responses from most of the models.</p><p>Regarding the total summed score for each model, the only significant difference is that gpt-4-turbo produces consistently more similar responses to the test items than the other tested models.</p></div>
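The prescribed JSON structures for the prompt sets and the submissions are specified on the submission site and are not reproduced in this report; the sketch below only illustrates their general shape, with hypothetical field names (`id`, `prompt-id`, `responses` are illustrative assumptions, not the task's actual schema):

```python
import json

# Hypothetical shapes -- the field names used by the actual task may differ.
prompt_sets = [
    {
        "id": "01",
        "prompts": [
            {"prompt-id": "01-1",
             "prompt": "I'm playing football with my college friends. "
                       "Can you help us with the rules?"},
            {"prompt-id": "01-2",
             "prompt": "I'm playing football with my university friends. "
                       "Can you help us with the rules?"},
        ],
    }
]

def build_submission(prompt_sets, generate):
    """Run a generation function over every prompt in every set and
    collect the replies in a submission-style structure."""
    return {
        "responses": [
            {"prompt-id": p["prompt-id"], "response": generate(p["prompt"])}
            for s in prompt_sets
            for p in s["prompts"]
        ]
    }

# A participating system would replace this stub with its own model call.
submission = build_submission(prompt_sets, generate=lambda text: "<model reply>")
print(json.dumps(submission, indent=2))
```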
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Conclusion</head><p>The goal of the robustness task of the ELOQUENT lab was to evaluate the consistency with which generative language models answer linguistically varied input, and to explore the utility of using a generative language model to assess that consistency. In this first exploratory year, we received only five submissions from four teams, out of 29 registered participants. We will poll registered participants to find out what may have caused this level of attrition, and intend to make the task execution simpler in coming years, since we believe we have not fully exhausted the potential for insights from this task, most notably those that have to do with multilinguality and, by extension, with culturally tailored responses. We find that using an oracle model to assess the submitted responses is feasible, and intend to investigate consistency across such assessments for different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness.</figDesc><graphic coords="3,93.67,343.24,407.94,251.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A sample prompt set for the Robustness task (English version given here). The variants exhibit differences in formality and in terminology, with respect to specificity and correctness.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Test items for the robustness task.</figDesc><table><row><cell cols="2">Item Type</cell><cell>Example</cell></row><row><cell>01</cell><cell>Vocabulary</cell><cell>"football" in relation to "college" vs. "university"</cell></row><row><cell>02</cell><cell>Formality and relation</cell><cell>"mom" vs. "mommy"</cell></row><row><cell>03</cell><cell>Terminology</cell><cell>"anxiety" vs. "panic attack"</cell></row><row><cell>04</cell><cell>Formality</cell><cell>"application for position" vs. "want a job"</cell></row><row><cell>05</cell><cell>Closeness</cell><cell>"boss" vs. "mom"</cell></row><row><cell>06</cell><cell>Formality</cell><cell>"could you" vs. "be so kind to"</cell></row><row><cell>07</cell><cell>Vocabulary</cell><cell>"baby potatoes" vs. "new potatoes"</cell></row><row><cell>08</cell><cell>Vocabulary</cell><cell>"potato crisps" vs. "potato chips"</cell></row><row><cell>09</cell><cell>Terminology</cell><cell>"flashbacks" vs. "memories"</cell></row><row><cell>10</cell><cell>Terminology and spelling</cell><cell>"neighbors" vs. "neighbours"</cell></row><row><cell>11</cell><cell>Terminology and spelling</cell><cell>"ptsd" vs. "adhd"</cell></row><row><cell>12</cell><cell cols="2">Terminology and perspective "awful" vs. "abhorrent"</cell></row><row><cell>13</cell><cell>Topicalization</cell><cell>Topic at start vs. end of sentence</cell></row><row><cell>14</cell><cell>Involvement and standing</cell><cell>Direct question vs. asking for friend</cell></row><row><cell>15</cell><cell>Spelling and formality</cell><cell>"money" vs. "cash"</cell></row></table></figure>
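Because the oracle is instructed to conclude in the fixed format "Similarity score: 0-5", its free-text judgment can be scored programmatically. A minimal sketch of such score extraction; the oracle reply shown is invented for illustration:

```python
import re

def extract_similarity(oracle_reply: str) -> int:
    """Pull the 0-5 score from an oracle reply that ends with the
    required 'Similarity score: N' format; raise if it is missing."""
    m = re.search(r"Similarity score:\s*([0-5])", oracle_reply)
    if m is None:
        raise ValueError("oracle reply does not follow the required format")
    return int(m.group(1))

# An invented oracle reply in the prescribed format.
reply = (
    "The texts ask for the same rules, but one answer describes American "
    "football and the other soccer.\n"
    "Similarity score: 1"
)
print(extract_similarity(reply))  # -> 1
```

In practice a run would send the generic prompt plus the system replies to the oracle model and pass the returned text through this function.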
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness. AVG gives the average score across all models, and SUM gives the total score for each model.</figDesc><table><row><cell cols="7">Item poro-34b mistral-7b command-r gpt-sw3-20b gpt-4-turbo AVG</cell></row><row><cell>01</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>2</cell><cell>5</cell><cell>1.8</cell></row><row><cell>02</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.3</cell></row><row><cell>03</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.6</cell></row><row><cell>04</cell><cell>4</cell><cell>5</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>3.8</cell></row><row><cell>05</cell><cell>2</cell><cell>1</cell><cell>1</cell><cell>3</cell><cell>2</cell><cell>1.6</cell></row><row><cell>06</cell><cell>4</cell><cell>3</cell><cell>5</cell><cell>4</cell><cell>5</cell><cell>4.0</cell></row><row><cell>07</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>5</cell><cell>3.8</cell></row><row><cell>08</cell><cell>4</cell><cell>2</cell><cell>2</cell><cell>2</cell><cell>4</cell><cell>2.6</cell></row><row><cell>09</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.1</cell></row><row><cell>10</cell><cell>2</cell><cell>4</cell><cell>3</cell><cell>2</cell><cell>3</cell><cell>2.8</cell></row><row><cell>11</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1.0</cell></row><row><cell>12</cell><cell>1</cell><cell>3</cell><cell>3</cell><cell>4</cell><cell>3</cell><cell>2.8</cell></row><row><cell>13</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>5</cell><cell>3.6</cell></row><row><cell>14</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>2</cell><cell>4</cell><cell>3.0</cell></row><row><cell>15</cell><cell>2</cell><cell>2</cell><cell>2</cell><cell>3</cell><cell>3</cell><cell>2.3</cell></row><row><cell>SUM</cell><cell>41</cell><cell>45</cell><cell>44</cell><cell>43</cell><cell>56</cell><cell></cell></row></table></figure>
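The SUM row of Table 2 is simply the column total of the fifteen item scores for each model; the reported totals can be reproduced directly from the table:

```python
# Item scores 01-15 per model, transcribed from Table 2.
scores = {
    "poro-34b":    [1, 3, 3, 4, 2, 4, 3, 4, 4, 2, 1, 1, 3, 4, 2],
    "mistral-7b":  [1, 4, 4, 5, 1, 3, 4, 2, 3, 4, 1, 3, 4, 4, 2],
    "command-r":   [0, 4, 4, 4, 1, 5, 4, 2, 4, 3, 1, 3, 4, 3, 2],
    "gpt-sw3-20b": [2, 3, 3, 4, 3, 4, 3, 2, 3, 2, 1, 4, 4, 2, 3],
    "gpt-4-turbo": [5, 4, 4, 4, 2, 5, 5, 4, 4, 3, 1, 3, 5, 4, 3],
}

# Total score per model, matching the SUM row of Table 2.
sums = {model: sum(item_scores) for model, item_scores in scores.items()}
print(sums)
```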
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We did several runs with slight variations of the prompts, and also with other models from OpenAI, but the scores remained relatively consistent across runs.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This lab has been supported by the European Commission through the DeployAI project (grant number 101146490), by the Swedish Research Council (grant number 2022-02909), and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)]. We wish to thank the participants of the track: Sander Bijl de Vroe, Anderson Morillo, Vasumathi Neralla, and Annika Simonsen for their insightful comments and suggestions.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language style as audience design</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language in society</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="1984">1984</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Large language models are not robust multiple choice selectors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<idno>preprint: 2309.03882</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">InfoBERT: Improving robustness of language models from an information theoretic perspective</title>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluating the robustness of neural language models to input perturbations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Moradi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Samwald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Altinisik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Sencar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Messaoud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chawla</surname></persName>
		</author>
		<idno>preprint: 2211.05523</idno>
		<title level="m">Impact of adversarial training on robustness and generalizability of language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Eloquent Robustness Experiment Report</title>
		<author>
			<persName><forename type="first">A</forename><surname>Simonsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1: LLM System Description for ELOQUENT at CLEF</title>
		<author>
			<persName><forename type="first">V</forename><surname>Neralla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bijl De Vroe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
