<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ELOQUENT 2024 – Robustness Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Magnus</forename><surname>Sahlgren</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">AI Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Silo AI</orgName>
								<address>
									<settlement>Helsinki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jussi</forename><surname>Karlgren</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Silo AI</orgName>
								<address>
									<settlement>Helsinki</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luise</forename><surname>Dürlich</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Evangelia</forename><surname>Gogoulou</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aarne</forename><surname>Talman</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">University of Helsinki</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shorouq</forename><surname>Zahra</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">RISE Research Institutes of Sweden</orgName>
								<address>
									<settlement>Stockholm</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ELOQUENT 2024 – Robustness Task</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EDE68D24594BF0755D9F554CD21F6952</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the robustness task, in which we assessed the robustness and consistency of model output given variation in the input prompts. We found that consistency did indeed vary, both across prompt items and across models. On a methodological note, we find that using an oracle model to assess the submitted responses is feasible, and we intend to investigate consistency across such assessments for different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Generative language models ("LLMs") as a foundational component in an information system are able to handle a broad variety of input data robustly and elegantly, and are able to provide appropriately creative generated output to fit a broad range of application situations and the preferences of a diverse user population. An information service with a generative language model can be built to provide a flexible, low-threshold conversational interface for its users: there is considerable interest in putting generative language models to use in productive practical applications, across domains, sectors of society, languages, and cultural areas.</p><p>The ELOQUENT lab is intended to probe the quality of a generative language model, and to do this by specifically addressing quality issues that arise at deployment time, when a model is included in a system for productive downstream tasks. The lab also intends to explore the reliability of system self-assessment of model quality using other models or even the same model, and to reduce the dependence on human-assessed gold-standard data sets. One of the tasks we introduced for this first year of the ELOQUENT lab for evaluating generative language model quality was the Robustness task, which tests consistency of output in the face of semantically equivalent but stylistically varied input.</p><p>Generative language models are expected to exhibit audience design behaviour, i.e. to fit their output to the preceding input <ref type="bibr" target="#b0">[1]</ref>. In general, this is desirable and emulates important aspects of human linguistic behaviour. However, if this variation extends to content-related aspects of the output, tailoring the output to satisfy what the system infers about the user's preferences, it may have the unfortunate effect of systematically generating different material depending on user group, if, e.g., 
the system is sensitive to dialectal, sociolectal, cross-cultural, or otherwise observable linguistic variation in its input.</p><p>Robustness or consistency has been identified as a quality criterion when models exhibit positional biases in responses to multiple-choice questions <ref type="bibr" target="#b1">[2]</ref> and in the face of adversarial attacks <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. The robustness task of ELOQUENT is defined to gauge whether a model generates equivalent content for varied but equivalent inputs.</p><p>The robustness task provided participating teams with a list of prompt sets in a JSON structure. Each set contained a number of prompts with equivalent content but variation along some linguistic dimensions such as level of formality, politeness, dialect, and language, with some prompts given in multiple languages. The participating teams were asked to generate responses to the prompts using their system or systems and to return them in a prescribed JSON structure through a submission site.</p><p>The task had 29 registered teams. By the deadline, four teams had participated, with five submitted experimental conditions using the models GPT-4-turbo and GPT-SW3 <ref type="bibr" target="#b5">[6]</ref>, Poro and Mistral <ref type="bibr" target="#b6">[7]</ref>, and Command-R (the Verbanex team from Universidad Tecnológica de Bolívar).</p><p>The test set consists of 15 items with different types of variation, summarized and exemplified in Table <ref type="table" target="#tab_0">1</ref>. The original test set contains items in five different languages (English, Swedish, Finnish, Greek and Arabic), but since we received only one submission that used the non-English items, we report results only for the English test items in this report. 
Since this task focuses on eliciting semantic variation in system replies by varying the input prompts in non-semantic ways, we need some way to measure semantic variation in text. This is a notoriously difficult problem for which we lack a standard approach. Human evaluation would be preferable in such a scenario, but it would be resource-intensive, and there is no guarantee that human evaluators are consistent. We therefore opt for using an external foundation model as an oracle to judge the similarity between system replies. In our case, we use one model from OpenAI (gpt-4-turbo), for which we use the following generic prompt: Do the following texts mean the same thing?</p><p>Please keep your answer short and concise. Conclude with an average score over all texts using the format "Similarity score: 0-5"</p><p>We modify this generic prompt for some of the test items in order to account for their specific variation (e.g. by asking the oracle to disregard differences between addressing a mom or a dad (item 02), or differences in psychological conditions (item 03)). This method gives us a similarity score between 0 and 5 for each item, which we summarize in Table <ref type="table" target="#tab_1">2</ref> and Figure <ref type="figure" target="#fig_0">1</ref>. We make no claims that these scores are consistent and reliable,<ref type="foot" target="#foot_0">1</ref> but they are a best effort at arriving at a programmatically derived measure of semantic similarity between system replies. We also provide an average score for each item over all models, which indicates its average level of difficulty, and a total sum for each model, which could be interpreted as a measure of model robustness.</p><p>Some types of variation clearly affect models more than others. Items 01, 05 and 11 are the most challenging ones in our tests. Item 01 consists of variations of the question "I'm playing football with my college friends. 
Can you help us with the rules?". The main variation is between the use of "college" vs. "university", which in all models except for gpt-4-turbo elicits instructions for American football (when using the term "college" in the prompt) or soccer (when using the term "university"). Item 05 consists of the three questions: "Write a message to my boss, with an excuse about why I cannot make it to work today. " "Write a message to my mom, explaining why I should not go to school today. " "Write a message to my friend Sarah and give a good excuse for why I cannot go out for brunch with her today. ".</p><p>Despite asking for similar responses (excuses for not participating in different activities), these prompts do contain semantic variation (boss-work, mom-school, friend-brunch), which leads to unsurprising semantic variation in the system replies. Item 11 consists of four different questions: "I have ptsd from a recent encounter with a colleague. What should i do?" "I have adhd from a recent encounter with a colleague. What should i do?" "A recent encounter with a colleage left me uneasy and upset. What should i do?" "One of my collegues is a right bastard and is driving me fucking crazy! What should i do"</p><p>As with item 05, these questions ask for similar types of advice, but the semantic variation in the questions (ptsd vs. adhd vs. upset vs. anger) elicits specific advice tailored to the different conditions (ptsd, adhd, feeling upset, being angry).</p><p>On the other hand, item 06 was the test item that received the highest similarity score across models. This item asks the following three questions: "Could you write me a bulletin board ad asking if anyone is interested in a new campaign of D&amp;D?" "Say could you be so kind as to write me a bulletin board advertisement to see if any folks around here are interested in starting a brand new D&amp;D campaign?" "Can I av an ad for bulletin board to get some lads together for mi new D&amp;D campaign?" 
Despite the variation in formality, these questions generated very similar responses from most of the models.</p><p>Regarding the total summed score for each model, the only significant difference is that gpt-4-turbo produces consistently more similar responses to the test items than the other tested models.</p></div>
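The prescribed JSON structures for the prompt sets and the submissions are specified on the submission site and are not reproduced in this report; the sketch below only illustrates their general shape, with hypothetical field names (`id`, `prompt-id`, `responses` are illustrative assumptions, not the task's actual schema):

```python
import json

# Hypothetical shapes -- the field names used by the actual task may differ.
prompt_sets = [
    {
        "id": "01",
        "prompts": [
            {"prompt-id": "01-1",
             "prompt": "I'm playing football with my college friends. "
                       "Can you help us with the rules?"},
            {"prompt-id": "01-2",
             "prompt": "I'm playing football with my university friends. "
                       "Can you help us with the rules?"},
        ],
    }
]

def build_submission(prompt_sets, generate):
    """Run a generation function over every prompt in every set and
    collect the replies in a submission-style structure."""
    return {
        "responses": [
            {"prompt-id": p["prompt-id"], "response": generate(p["prompt"])}
            for s in prompt_sets
            for p in s["prompts"]
        ]
    }

# A participating system would replace this stub with its own model call.
submission = build_submission(prompt_sets, generate=lambda text: "<model reply>")
print(json.dumps(submission, indent=2))
```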
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Conclusion</head><p>The goal of the robustness task of the ELOQUENT lab was to evaluate the consistency with which generative language models answer linguistically varied input, and to explore the utility of using a generative language model to assess that consistency. In this first exploratory year, we received only five submissions from four teams, out of 29 registered participants. We will poll registered participants to find out what may have caused this level of attrition, and intend to make the task execution simpler in coming years, since we believe we have not fully exhausted the potential for insights from this task, most notably those that have to do with multilinguality and, by extension, with culturally tailored responses. We find that using an oracle model to assess the submitted responses is feasible, and intend to investigate consistency across such assessments for different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness.</figDesc><graphic coords="3,93.67,343.24,407.94,251.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A sample prompt set for the Robustness task (English version given here). The variants exhibit differences in formality and in terminology, with respect to specificity and correctness.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Test items for the robustness task.</figDesc><table><row><cell cols="2">Item Type</cell><cell>Example</cell></row><row><cell>01</cell><cell>Vocabulary</cell><cell>"football" in relation to "college" vs. "university"</cell></row><row><cell>02</cell><cell>Formality and relation</cell><cell>"mom" vs. "mommy"</cell></row><row><cell>03</cell><cell>Terminology</cell><cell>"anxiety" vs. "panic attack"</cell></row><row><cell>04</cell><cell>Formality</cell><cell>"application for position" vs. "want a job"</cell></row><row><cell>05</cell><cell>Closeness</cell><cell>"boss" vs. "mom"</cell></row><row><cell>06</cell><cell>Formality</cell><cell>"could you" vs. "be so kind to"</cell></row><row><cell>07</cell><cell>Vocabulary</cell><cell>"baby potatoes" vs. "new potatoes"</cell></row><row><cell>08</cell><cell>Vocabulary</cell><cell>"potato crisps" vs. "potato chips"</cell></row><row><cell>09</cell><cell>Terminology</cell><cell>"flashbacks" vs. "memories"</cell></row><row><cell>10</cell><cell>Terminology and spelling</cell><cell>"neighbors" vs. "neighbours"</cell></row><row><cell>11</cell><cell>Terminology and spelling</cell><cell>"ptsd" vs. "adhd"</cell></row><row><cell>12</cell><cell cols="2">Terminology and perspective "awful" vs. "abhorrent"</cell></row><row><cell>13</cell><cell>Topicalization</cell><cell>Topic at start vs. end of sentence</cell></row><row><cell>14</cell><cell>Involvement and standing</cell><cell>Direct question vs. asking for friend</cell></row><row><cell>15</cell><cell>Spelling and formality</cell><cell>"money" vs. "cash"</cell></row></table></figure>
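Because the oracle is instructed to conclude in the fixed format "Similarity score: 0-5", its free-text judgment can be scored programmatically. A minimal sketch of such score extraction; the oracle reply shown is invented for illustration:

```python
import re

def extract_similarity(oracle_reply: str) -> int:
    """Pull the 0-5 score from an oracle reply that ends with the
    required 'Similarity score: N' format; raise if it is missing."""
    m = re.search(r"Similarity score:\s*([0-5])", oracle_reply)
    if m is None:
        raise ValueError("oracle reply does not follow the required format")
    return int(m.group(1))

# An invented oracle reply in the prescribed format.
reply = (
    "The texts ask for the same rules, but one answer describes American "
    "football and the other soccer.\n"
    "Similarity score: 1"
)
print(extract_similarity(reply))  # -> 1
```

In practice a run would send the generic prompt plus the system replies to the oracle model and pass the returned text through this function.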
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Results from oracle evaluation of submitted responses. The oracle is gpt-4-turbo, which scores each item from 0 (least similar) to 5 (most similar). Higher similarity scores indicate better robustness. AVG gives the average score across all models, and SUM gives the total score for each model.</figDesc><table><row><cell cols="7">Item poro-34b mistral-7b command-r gpt-sw3-20b gpt-4-turbo AVG</cell></row><row><cell>01</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>2</cell><cell>5</cell><cell>1.8</cell></row><row><cell>02</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.3</cell></row><row><cell>03</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.6</cell></row><row><cell>04</cell><cell>4</cell><cell>5</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>3.8</cell></row><row><cell>05</cell><cell>2</cell><cell>1</cell><cell>1</cell><cell>3</cell><cell>2</cell><cell>1.6</cell></row><row><cell>06</cell><cell>4</cell><cell>3</cell><cell>5</cell><cell>4</cell><cell>5</cell><cell>4.0</cell></row><row><cell>07</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>5</cell><cell>3.8</cell></row><row><cell>08</cell><cell>4</cell><cell>2</cell><cell>2</cell><cell>2</cell><cell>4</cell><cell>2.6</cell></row><row><cell>09</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3.1</cell></row><row><cell>10</cell><cell>2</cell><cell>4</cell><cell>3</cell><cell>2</cell><cell>3</cell><cell>2.8</cell></row><row><cell>11</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>1.0</cell></row><row><cell>12</cell><cell>1</cell><cell>3</cell><cell>3</cell><cell>4</cell><cell>3</cell><cell>2.8</cell></row><row><cell>13</cell><cell>3</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>5</cell><cell>3.6</cell></row><row><cell>14</cell><cell>4</cell><cell>4</cell><cell>3</cell><cell>2</cell><cell>4</cell><cell>3.0</cell></row><row><cell>15</cell><cell>2</cell><cell>2</cell><cell>2</cell><cell>3</cell><cell>3</cell><cell>2.3</cell></row><row><cell>SUM</cell><cell>41</cell><cell>45</cell><cell>44</cell><cell>43</cell><cell>56</cell><cell></cell></row></table></figure>
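The SUM row of Table 2 is simply the column total of the fifteen item scores for each model; the reported totals can be reproduced directly from the table:

```python
# Item scores 01-15 per model, transcribed from Table 2.
scores = {
    "poro-34b":    [1, 3, 3, 4, 2, 4, 3, 4, 4, 2, 1, 1, 3, 4, 2],
    "mistral-7b":  [1, 4, 4, 5, 1, 3, 4, 2, 3, 4, 1, 3, 4, 4, 2],
    "command-r":   [0, 4, 4, 4, 1, 5, 4, 2, 4, 3, 1, 3, 4, 3, 2],
    "gpt-sw3-20b": [2, 3, 3, 4, 3, 4, 3, 2, 3, 2, 1, 4, 4, 2, 3],
    "gpt-4-turbo": [5, 4, 4, 4, 2, 5, 5, 4, 4, 3, 1, 3, 5, 4, 3],
}

# Total score per model, matching the SUM row of Table 2.
sums = {model: sum(item_scores) for model, item_scores in scores.items()}
print(sums)
```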
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We did several runs with slight variations of the prompts, and also with other models from OpenAI, but the scores remained relatively consistent across runs.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This lab has been supported by the European Commission through the DeployAI project (grant number 101146490), by the Swedish Research Council (grant number 2022-02909), and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)]. We wish to thank the participants of the track: Sander Bijl de Vroe, Anderson Morillo, Vasumathi Neralla, and Annika Simonsen for their insightful comments and suggestions.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language style as audience design</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language in society</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="1984">1984</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Large language models are not robust multiple choice selectors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<idno>preprint: 2309.03882</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">InfoBERT: Improving robustness of language models from an information theoretic perspective</title>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluating the robustness of neural language models to input perturbations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Moradi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Samwald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Altinisik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Sencar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Messaoud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chawla</surname></persName>
		</author>
		<idno>preprint: 2211.05523</idno>
		<title level="m">Impact of adversarial training on robustness and generalizability of language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Eloquent Robustness Experiment Report</title>
		<author>
			<persName><forename type="first">A</forename><surname>Simonsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1: LLM System Description for ELOQUENT at CLEF</title>
		<author>
			<persName><forename type="first">V</forename><surname>Neralla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bijl De Vroe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
