ELOQUENT 2024 — Robustness Task

Magnus Sahlgren (1,3), Jussi Karlgren (3), Luise Dürlich (2), Evangelia Gogoulou (2), Aarne Talman (4) and Shorouq Zahra (2)

1 AI Sweden, Stockholm
2 RISE Research Institutes of Sweden, Stockholm
3 Silo AI, Helsinki
4 University of Helsinki

Abstract

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the robustness task, in which we assessed the robustness and consistency of model output given variation in the input prompts. We found that consistency indeed varied, both across prompt items and across models. On a methodological note, we find that using an oracle model for assessing the submitted responses is feasible, and we intend to investigate the consistency of such assessments across different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.

1. Introduction

Generative language models (“LLMs”) used as a foundational component in an information system are expected to handle a broad variety of input data robustly and elegantly, and to provide appropriately creative generated output to fit a broad range of application situations and the preferences of a diverse user population.
An information service built on a generative language model can provide a flexible, low-threshold conversational interface for its users, and there is considerable interest in putting generative language models to use in productive practical applications, across domains, sectors of society, languages, and cultural areas. The ELOQUENT lab is intended to probe the quality of a generative language model, specifically by addressing quality issues that arise at deployment time, when a model is included in a system for productive downstream tasks. The lab also intends to explore the reliability of system self-assessment of model quality, using other models or even the same model, in order to reduce the dependence on human-assessed gold-standard data sets.

One of the tasks we introduced for this first year of the ELOQUENT lab was the Robustness task, which tests the consistency of output in the face of semantically equivalent but stylistically varied input. Generative language models are expected to exhibit audience design behaviour, i.e. to fit their output to the preceding input [1]. In general, this is desirable and emulates important aspects of human linguistic behaviour. However, if this variation extends to content-related aspects of the output, tailoring the output to satisfy what the system infers about the user’s preferences, it may have the unfortunate effect of systematically generating different material for different user groups, e.g. if the system is sensitive to dialectal, sociolectal, cross-cultural, or otherwise observable linguistic variation in its input. Robustness or consistency has been identified as a quality criterion when models show positional biases in responses to multiple-choice questions [2] and in the face of adversarial attacks [3, 4, 5]. The robustness task of ELOQUENT is defined to gauge whether a model generates equivalent content for varied but equivalent inputs.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France.
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073.

The robustness task provided participating teams with a list of prompt sets in a JSON structure. Each set contained a number of prompts with equivalent content but variation along some linguistic dimension, such as level of formality, politeness, dialect, or language, with some prompts given in multiple languages. The participating teams were requested to generate responses to the prompts using their system or systems and to return them in a prescribed JSON structure through a submission site.

The task had 29 registered teams. By the deadline, 4 teams had participated, with 5 submitted experimental conditions, using the models GPT-4-turbo and GPT-SW3 [6], Poro and Mistral [7], and Command-R (the Verbanex team from Universidad Tecnológica de Bolívar). The test set consists of 15 items with different types of variation, summarized and exemplified in Table 1. The original test set contains items in five different languages (English, Swedish, Finnish, Greek, and Arabic), but since we only received one submission that utilized the non-English items, we only report results for the English test items in this report.

Table 1
Test items for the robustness task.

Item  Type                         Example
01    Vocabulary                   “football” in relation to “college” vs. “university”
02    Formality and relation       “mom” vs. “mommy”
03    Terminology                  “anxiety” vs. “panic attack”
04    Formality                    “application for position” vs. “want a job”
05    Closeness                    “boss” vs. “mom”
06    Formality                    “could you” vs. “be so kind to”
07    Vocabulary                   “baby potatoes” vs. “new potatoes”
08    Vocabulary                   “potato crisps” vs. “potato chips”
09    Terminology                  “flashbacks” vs. “memories”
10    Terminology and spelling     “neighbors” vs. “neighbours”
11    Terminology and spelling     “ptsd” vs. “adhd”
12    Terminology and perspective  “awful” vs. “abhorrent”
13    Topicalization               Topic at start vs. end of sentence
14    Involvement and standing     Direct question vs. asking for a friend
15    Spelling and formality       “money” vs. “cash”

Since this task focuses on eliciting semantic variation in system replies by varying the input prompts in non-semantic ways, we need some way to measure semantic variation in text. This is a notoriously difficult problem for which there is no standard approach. Human evaluation would be preferable in such a scenario, but it is resource-intensive, and there are no guarantees that human evaluators are consistent. We therefore opt for using an external foundation model as an oracle to judge the similarity between system replies. In our case, we use a model from OpenAI (gpt-4-turbo), with the following generic prompt:

Do the following texts mean the same thing? Please keep your answer short and concise. Conclude with an average score over all texts using the format "Similarity score: 0-5"

We modify this generic prompt for some of the test items to account for their specific variation (e.g. by asking the oracle to disregard differences between addressing a mom or a dad (item 02), or differences in psychological conditions (item 03)). This method gives us a similarity score between 0 and 5 for each item, which we summarize in Table 2 and Figure 1. We make no claims that these scores are consistent and reliable,¹ but they are a best effort at arriving at a programmatically derived measure of semantic similarity between system replies. We also provide an average score for each item over all models, which indicates its average level of difficulty, and a total sum for each model, which could be interpreted as a measure of model robustness. It is obvious that some types of variation affect models more than others. Items 01, 05, and 11 are the most challenging ones in our tests.
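The oracle step described above can be sketched programmatically. The following is a minimal sketch, not the actual evaluation code: only the prompt wording and the "Similarity score: 0-5" answer format come from this report, while the function names and the idea of regex-parsing the oracle's answer are our illustrative assumptions (a real run would additionally send the built prompt to gpt-4-turbo via an API client).

```python
import re

# Generic oracle instruction, quoted verbatim from the task description.
ORACLE_INSTRUCTION = (
    'Do the following texts mean the same thing? '
    'Please keep your answer short and concise. '
    'Conclude with an average score over all texts '
    'using the format "Similarity score: 0-5"'
)


def build_oracle_prompt(replies):
    """Combine the oracle instruction with the system replies for one test item."""
    numbered = "\n\n".join(f"Text {i + 1}: {r}" for i, r in enumerate(replies))
    return f"{ORACLE_INSTRUCTION}\n\n{numbered}"


def parse_similarity_score(oracle_answer):
    """Extract the 0-5 similarity score from the oracle's free-text answer."""
    match = re.search(r"Similarity score:\s*([0-5])", oracle_answer)
    if match is None:
        raise ValueError("Oracle answer lacks a similarity score")
    return int(match.group(1))
```

For items with modified prompts (such as items 02 and 03), `ORACLE_INSTRUCTION` would be adjusted to tell the oracle which differences to disregard.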
¹ We did several runs using slight variations of the prompts, and also using other models from OpenAI, but the scores remained relatively consistent across runs.

Table 2
Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness. AVG gives the average score across all models, and SUM gives the total score for each model.

Item  poro-34b  mistral-7b  command-r  gpt-sw3-20b  gpt-4-turbo  AVG
01       1          1          0           2            5        1.8
02       3          4          4           3            4        3.3
03       3          4          4           3            4        3.6
04       4          5          4           4            4        3.8
05       2          1          1           3            2        1.6
06       4          3          5           4            5        4.0
07       3          4          4           3            5        3.8
08       4          2          2           2            4        2.6
09       4          3          4           3            4        3.1
10       2          4          3           2            3        2.8
11       1          1          1           1            1        1.0
12       1          3          3           4            3        2.8
13       3          4          4           4            5        3.6
14       4          4          3           2            4        3.0
15       2          2          2           3            3        2.3
SUM     41         45         44          43           56

Figure 1: Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness.

Item 01 consists of variations of the question “I’m playing football with my college friends. Can you help us with the rules?”. The main variation is the use of “college” vs. “university”, which in all models except gpt-4-turbo elicits instructions for American football (when the prompt uses the term “college”) or soccer (when it uses the term “university”). Item 05 consists of three questions: “Write a message to my boss, with an excuse about why I cannot make it to work today.” “Write a message to my mom, explaining why I should not go to school today.” “Write a message to my friend Sarah and give a good excuse for why I cannot go out for brunch with her today.” Despite asking for similar responses (excuses for not participating in different activities), these prompts do contain semantic variation (boss–work, mom–school, friend–brunch), which leads to unsurprising semantic variation in the system replies.
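The SUM row of Table 2 is simply each model's total over the 15 item scores. As a sanity check, the aggregation can be reproduced from the per-item scores (transcribed here from Table 2):

```python
# Per-item oracle scores transcribed from Table 2 (items 01-15, in order).
scores = {
    "poro-34b":    [1, 3, 3, 4, 2, 4, 3, 4, 4, 2, 1, 1, 3, 4, 2],
    "mistral-7b":  [1, 4, 4, 5, 1, 3, 4, 2, 3, 4, 1, 3, 4, 4, 2],
    "command-r":   [0, 4, 4, 4, 1, 5, 4, 2, 4, 3, 1, 3, 4, 3, 2],
    "gpt-sw3-20b": [2, 3, 3, 4, 3, 4, 3, 2, 3, 2, 1, 4, 4, 2, 3],
    "gpt-4-turbo": [5, 4, 4, 4, 2, 5, 5, 4, 4, 3, 1, 3, 5, 4, 3],
}

# Total score per model: the rough robustness measure in the SUM row.
totals = {model: sum(s) for model, s in scores.items()}
# → {'poro-34b': 41, 'mistral-7b': 45, 'command-r': 44,
#    'gpt-sw3-20b': 43, 'gpt-4-turbo': 56}
```

This reproduces the SUM row exactly, with gpt-4-turbo clearly ahead of the other four models, whose totals cluster between 41 and 45.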
Item 11 consists of four different questions: “I have ptsd from a recent encounter with a colleague. What should i do?” “I have adhd from a recent encounter with a colleague. What should i do?” “A recent encounter with a colleage left me uneasy and upset. What should i do?” “One of my collegues is a right bastard and is driving me fucking crazy! What should i do?” Similarly to item 05, these questions ask for similar types of advice, but the semantic variation in the questions (ptsd vs. adhd vs. upset vs. anger) elicits specific advice tailored to the different conditions (ptsd, adhd, feeling upset, being angry).

On the other hand, item 06 was the test item that received the highest similarity score across models. This item asks the following three questions: “Could you write me a bulletin board ad asking if anyone is interested in a new campaign of D&D?” “Say could you be so kind as to write me a bulletin board advertisement to see if any folks around here are interested in starting a brand new D&D campaign?” “Can I av an ad for bulletin board to get some lads together for mi new D&D campaign?” Despite the variation in formality, these questions generated very similar responses from most of the models. Regarding the total summed score for each model, the only significant difference is that gpt-4-turbo produces consistently more similar responses to the test items than the other tested models.

"i have ptsd from a recent encounter with a colleague. what should i do?"
"i have adhd from a recent encounter with a colleague. what should i do?"
"a recent encounter with a colleague left me uneasy and upset. what should i do?"
"one of my collegues is a right bastard and is driving me fucking crazy! what should i do?"

Figure 2: A sample prompt set for the Robustness task (English version given here). The variants exhibit differences in formality and in terminology with respect to specificity and correctness.
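A prompt set like the one in Figure 2 would be distributed, and answered, in JSON. The exact schemas are not reproduced in this report, so the structure below is an illustrative sketch only: the field names (id, prompts, responses, etc.) and the model name are assumptions, not the task's actual format.

```python
import json

# Hypothetical sketch of one distributed test item (field names are
# illustrative; the actual task schema is not reproduced in this report).
test_item = {
    "id": "11",
    "language": "en",
    "prompts": [
        "i have ptsd from a recent encounter with a colleague. what should i do?",
        "i have adhd from a recent encounter with a colleague. what should i do?",
    ],
}

# A submission pairs each prompt with the participating system's response.
submission = {
    "model": "example-model",  # illustrative model name
    "responses": [
        {"item": test_item["id"], "prompt": p, "response": "..."}
        for p in test_item["prompts"]
    ],
}

serialized = json.dumps(submission, indent=2)  # what would be uploaded
```

The point of the shared structure is that the organisers can then line up, per item, the responses from all submitted systems before handing them to the oracle model.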
2. Conclusion

The goal of the robustness task of the ELOQUENT lab was to evaluate the consistency with which generative language models provide answers to linguistically varied input, and to explore the utility of using a generative language model to assess that consistency. In this first exploratory year, we only received five submissions from four teams, out of 29 registered participants. We will poll the registered participants to find out what may have caused this level of attrition, and we intend to make the task execution simpler in coming years, since we believe we have not fully exhausted the potential for insights from this task, most notably those that have to do with multilinguality and, by extension, with culturally tailored responses. We find that using an oracle model for assessing the submitted responses is feasible, and we intend to investigate the consistency of such assessments across different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.

Acknowledgments

This lab has been supported by the European Commission through the DeployAI project (grant number 101146490), by the Swedish Research Council (grant number 2022-02909), and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10039436, Utter). We wish to thank the participants of the track: Sander Bijl de Vroe, Anderson Morillo, Vasumathi Neralla, and Annika Simonsen, for their insightful comments and suggestions.

References

[1] A. Bell, Language style as audience design, Language in Society 13 (1984).
[2] C. Zheng, H. Zhou, F. Meng, J. Zhou, M. Huang, Large language models are not robust multiple choice selectors, arXiv preprint arXiv:2309.03882 (2023).
[3] B. Wang, S. Wang, Y. Cheng, Z. Gan, R. Jia, B. Li, J. Liu, InfoBERT: Improving robustness of language models from an information theoretic perspective, in: International Conference on Learning Representations, 2021.
[4] M. Moradi, M. Samwald, Evaluating the robustness of neural language models to input perturbations, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[5] E. Altinisik, H. Sajjad, H. T. Sencar, S. Messaoud, S. Chawla, Impact of adversarial training on robustness and generalizability of language models, arXiv preprint arXiv:2211.05523 (2023).
[6] A. Simonsen, ELOQUENT Robustness Experiment Report, in: G. Faggioli, N. Ferro, M. Vlachos, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[7] V. Neralla, S. Bijl de Vroe, Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1: LLM System Description for ELOQUENT at CLEF 2024, in: G. Faggioli, N. Ferro, M. Vlachos, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.