                         Experimental Report on Robustness Task - ELOQUENT
                         Lab @ CLEF 2024
                         Notebook for the ELOQUENT Lab at CLEF 2024

                         Annika Simonsen
                         The University of Iceland, Sæmundargata 2, 102 Reykjavík, Iceland


                                      Abstract
                                      The ELOQUENT Lab’s Robustness Task for CLEF 2024 evaluated the robustness of language models
                                      against input variations such as dialectal, sociolectal, and cross-cultural differences. In particular, the
                                      task assessed how well language models maintain consistent, functionally equivalent output despite
                                      surface variation in the input prompts.

                                      Keywords
                                      Language models, Robustness, CLEF 2024, Dialects, Cross-cultural variations




                         1. Introduction
                         The ELOQUENT Lab’s Robustness Task for CLEF 2024 evaluated the robustness of language
                         models against input variations such as dialectal, sociolectal, and cross-cultural differences. The
                         task assessed how well language models maintain consistent, functionally equivalent output despite
                         surface variation in the input prompts.


                         2. Objective
                         The primary goal was to measure the robustness of language models to input variation by analyzing
                         the consistency of their responses across different dialectal and cultural prompts. This involves
                         evaluating the output with measures of language variation, such as n-gram overlap and embedding
                         similarity, to check that the responses preserve their topical and semantic content.
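                            The official evaluation had not been released at the time of writing; the following is a minimal
                         sketch of such measures, assuming word-bigram Jaccard overlap as the n-gram measure and a
                         sentence-transformers model (all-MiniLM-L6-v2, chosen here purely for illustration) for embedding
                         similarity. High values across prompt variants would indicate robust, consistent output.

# Illustrative consistency measures for a pair of responses; the metric
# choices (bigram Jaccard, all-MiniLM-L6-v2 embeddings) are assumptions,
# not the official ELOQUENT evaluation.
from sentence_transformers import SentenceTransformer, util

def ngram_overlap(a, b, n=2):
    # Jaccard overlap of word n-grams between two texts
    def grams(t):
        words = t.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(a, b):
    # Cosine similarity between the sentence embeddings of two texts
    ea, eb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(ea, eb).item()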


                         3. Method
                         3.1. Models used
                         I employed the following models for this task:

                                • gpt-4-turbo-2024-04-09 [1]
                                • gpt-sw3-20b-instruct [2]
                                • gpt-sw3-20b [2]
                                • gpt-sw3-40b [2]

                            The models were chosen for their capabilities and availability. GPT-4-turbo was the state-of-the-art
                         model at the time of the experiment, while the GPT-SW3 series from AI Sweden is less well known
                         and consequently less studied. The GPT-SW3 models are a series of Nordic models trained primarily
                         on Swedish text, with additional Norwegian, Danish, Icelandic, and English data.


 CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
 ans72@hi.is (A. Simonsen)
 https://github.com/AnnikaSimonsen (A. Simonsen)
 © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


3.2. Prompts and variations
The prompts were provided by the organizers and were written in multiple languages, including
Swedish, English, Arabic, and Finnish, to test the models’ ability to handle linguistic diversity. The
variations within these prompts included differences in dialect, sociolect, idiolect, and level of formality.
I kept the provided prompts unchanged and did not add any system prompt.

3.3. Submission details
Two submissions were made:

   1. Initial submission: included results from gpt-4-turbo-2024-04-09, gpt-sw3-20b-instruct, and
      gpt-sw3-20b. This submission faced issues where the GPT-SW3 models tended to repeat the
      prompts themselves.
   2. Second submission: included results from the gpt-sw3-40b base model. This submission encountered
      similar issues, and the prompts were split into three batches due to computational constraints
      (a sketch of such a split follows below).
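
How the batches were produced is not part of the submitted scripts; the following is a minimal, hypothetical sketch of splitting the prompt items into three input files (the chunk helper, batch count, and file names are illustrative assumptions):

# Hypothetical sketch: split the prompt items into three batch files.
# The chunk helper, batch count, and file names are illustrative only.
import json

def chunk(items, n_batches=3):
    size = -(-len(items) // n_batches)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

with open('task3.test1.json', 'r', encoding='utf-8') as f:
    items = json.load(f)['eloquent-robustness-test']['items']

for i, batch in enumerate(chunk(items), start=1):
    with open(f'task3.batch{i}.json', 'w', encoding='utf-8') as out:
        json.dump({'eloquent-robustness-test': {'items': batch}}, out,
                  ensure_ascii=False, indent=2)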

3.4. Technical Setup
The scripts are included in the appendix.

    • Temperature and sampling settings: For both gpt-4-turbo and the GPT-SW3 models, the
      temperature was set to 0.7; for the GPT-SW3 models, top-p sampling was additionally set to 1.
      These settings aim for a balance between creativity and determinism.
    • Hardware: The models were run on a single GPU (NVIDIA A100) on the university cluster. Due
      to computational limitations, prompts were processed in batches.


4. Results and Discussion
At the time of writing, the evaluation of the outputs had not been released, so the following are
observations made during the experiment itself. The GPT-SW3 models struggled with prompt
repetition, which could be due to a coding error or to model-specific behavior. If I were to run the
experiments again, I would reformulate the prompts to get GPT-SW3 to follow the instructions better.
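  One possible cause, which I have not verified, is that model.generate for decoder-only models
returns the prompt tokens followed by the continuation, so decoding the full output sequence echoes
the prompt back. A minimal sketch of decoding only the newly generated tokens, reusing the variable
names from the script in Appendix A.2:

# Sketch: slice off the prompt prefix before decoding, so that only the
# newly generated tokens are kept; input_ids, model, and tokenizer are
# assumed to be set up as in Appendix A.2.
generated = model.generate(input_ids=input_ids, max_new_tokens=100,
                           do_sample=True, temperature=0.7, top_p=1.0)
new_tokens = generated[:, input_ids.shape[1]:]
responses = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)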
  I also found that the generated output was not in UTF-8 encoding, even though the dataset was
loaded with encoding='utf-8'. I therefore had to post-process the generated responses and then map
them back to their items, which were saved in the required structure with encoding='utf-8'. See the
full scripts in the appendix.
  I did not add any prompts to the prompt collection, although it would have been interesting to add
prompts written in Faroese, a low-resource Nordic language that is my native language. Ultimately, this
was not done due to time constraints.

4.1. Conclusion
The Robustness Task was run on four different models: one GPT-4 model and three GPT-SW3 models.
Future work could focus on improving prompt handling in models like GPT-SW3 and on exploring
additional languages and dialects to further test model robustness.


References
[1] OpenAI, GPT-4 Model, 2023. URL: https://www.openai.com/research/gpt-4.
[2] AI Sweden, GPT-SW3 Models, 2024. URL: https://www.ai.se/en/gpt-sw3.
A. Scripts
A.1. Script for gpt-4-turbo
import os
import json
from openai import OpenAI

# Replace 'MY-API-KEY-HERE' with your actual OpenAI API key
os.environ["OPENAI_API_KEY"] = "MY-API-KEY-HERE"

client = OpenAI()


def generate_response(prompt):
    try:
        # Generate a response using the chat completions endpoint
        completion = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            temperature=0.7,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        # Extract the response text
        response_text = completion.choices[0].message.content
        return response_text.strip()
    except Exception as e:
        print(f"Error generating response: {e}")
        return ""


# Load the dataset from a JSON file with UTF-8 encoding
with open('task3.test1.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    prompts = data['eloquent-robustness-test']['items']

# Prepare the results dictionary
results = {"eloquent-robustness-results": {"source": "Annika-UOI", "year": "2024", "items": []}}

# Process each prompt in the dataset
for item in prompts:
    item_id = item['id']
    responses = []
    for variant in item['variants']:
        prompt = variant['prompt']
        response_text = generate_response(prompt)
        responses.append({"response": response_text})

    # Add the responses for each item to the results
    results['eloquent-robustness-results']['items'].append({"id": item_id, "responses": responses})

# Save the results with UTF-8 encoding
with open('submission_fo.json', 'w', encoding='utf-8') as out_file:
    json.dump(results, out_file, ensure_ascii=False, indent=2)

# Read back the saved data for verification
with open('submission_fo.json', 'r', encoding='utf-8') as in_file:
    loaded_data = json.load(in_file)
    print(loaded_data)  # Check how the data reads back



A.2. Script for gpt-sw3 models
import os
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = "max_split_size_mb:128"

gpu_id = os.environ.get("CUDA_VISIBLE_DEVICES")

if gpu_id is not None:
    print(f"running gpu {gpu_id}")
else:
    print("No gpu found")

# Setup for HuggingFace and Torch
model_name = "AI-Sweden-Models/gpt-sw3-40b"
# model_name = "AI-Sweden-Models/gpt-sw3-20b"
# model_name = "AI-Sweden-Models/gpt-sw3-20b-instruct"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)

# Load one batch of the dataset
with open('task3.test1-5.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    prompts = data['eloquent-robustness-test']['items']

# Prepare the results dictionary and batch inputs
results_sw3 = {"eloquent-robustness-results-sw3": {"source": "Annika-UOI", "year": "2024", "items": []}}
batch_prompts = []
batch_info = []

# Accumulate prompts for batching
for item in prompts:
    item_id = item['id']
    for variant in item['variants']:
        prompt = variant['prompt']
        batch_prompts.append(prompt)
        batch_info.append({"item_id": item_id, "prompt": prompt})

# Tokenize the batch
input_ids = tokenizer(batch_prompts, return_tensors="pt", padding=True, truncation=True,
                      max_length=512)["input_ids"].to(device)

# Generate responses in a batch
generated_token_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=1
)

# Decode the generated sequences; note that these include the prompt
# tokens, which likely contributes to the prompt repetition discussed
# in Section 4
responses = [tokenizer.decode(ids, skip_special_tokens=True) for ids in generated_token_ids]

# Map responses back to items and prepare the result structure
for response, info in zip(responses, batch_info):
    item_id = info["item_id"]
    if not any(item["id"] == item_id for item in results_sw3["eloquent-robustness-results-sw3"]["items"]):
        results_sw3["eloquent-robustness-results-sw3"]["items"].append({"id": item_id, "responses": []})
    item = next(item for item in results_sw3["eloquent-robustness-results-sw3"]["items"]
                if item["id"] == item_id)
    item["responses"].append({"prompt": info["prompt"], "response": response})

# Save the results to a JSON file
with open('submission_sw3_base40-1-5.json', 'w', encoding='utf-8') as out_file:
    json.dump(results_sw3, out_file, ensure_ascii=False, indent=2)