Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen, Detection and Generation of Hallucinations

Luise Dürlich¹, Evangelia Gogoulou¹, Liane Guillou², Joakim Nivre¹ and Shorouq Zahra¹
¹ RISE Research Institutes of Sweden, Stockholm
² School of Informatics, University of Edinburgh
The authors are listed in alphabetical order.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract

In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination. Specifically, we investigate whether LLMs can be used to both generate and detect hallucinated content. In the cross-model evaluation setting we take this a step further and explore the viability of using an LLM to evaluate output produced by another LLM. We include generation, detection, and cross-model evaluation steps for two scenarios: paraphrase and machine translation. Overall, we find that the performance of the baselines and submitted systems is highly variable; however, initial results are promising, and lessons learned from this year's task will provide a solid foundation for future iterations of the task. In particular, we highlight that human validation of the generated output is ultimately needed to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.

Keywords
Generative language models, Evaluation, Hallucinations

1. Introduction

Detecting hallucinations in LLM output may be difficult for humans in certain settings. For example, in the question answering scenario, an individual who asks an LLM a question about a domain with which they are unfamiliar might not be able to detect the presence of hallucinated content in the answer output by the model. In the cross-lingual setting the problem may become even more severe. For example, if the LLM is used to translate from or into a language that the human user does not comprehend well, they may be completely unable to identify hallucinations in the translation output. Models that humans will interact with should therefore be rigorously tested with respect to hallucination prior to deployment.

In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination: that is, can they be used to both generate and detect hallucinated content? Taking this a step further, we also explore the viability of using LLMs in a cross-evaluation setting, where one LLM is used to evaluate the output of another [1, 2].

The first year of HalluciGen focused on developing models that are able to evaluate hallucination. Our task investigates the hallucination phenomenon in two downstream scenarios: (i) Paraphrase Generation (PG): given a source sentence, the model is instructed to produce an accurate paraphrase. For this scenario we include two languages: English and Swedish (en/sv); and (ii) Machine Translation (MT): given a sentence in a source language, the model is instructed to translate it into the target language. For this scenario we include two language pairs: English-German (en⇔de) and English-French (en⇔fr), for both translation directions.

For each of the scenarios there are two steps:

• Generation: Given a source sentence, the model should generate two hypotheses, one that is a correct paraphrase/translation of the source (hyp+) and one that is a hallucinated paraphrase/translation of the source (hyp−).
• Detection: Given a source sentence and two paraphrase/translation hypotheses (hyp1 and hyp2), the model should detect which of the two contains a hallucination.

As an additional challenge, we also perform the detection step in a cross-model setting, where the participant models perform the detection step on the model outputs from the generation step.

2. Datasets

For each of the two scenarios, i.e. paraphrase generation or machine translation, we construct a dataset with the following fields: a source sentence, a correct hypothesis of the source, a hallucinated hypothesis of the source, and the type of hallucination demonstrated in the hallucinated hypothesis. Our datasets include hallucinations of the following categories: addition, named-entity, number, conversion, date, gender, pronoun, antonym, tense, negation, and natural hallucinations. With the exception of tense and negation, the hallucination types are identical to the translation error types identified in ACES [3]. All our datasets are available on Huggingface (https://huggingface.co/datasets/Eloquent/HalluciGen-PG and https://huggingface.co/datasets/Eloquent/HalluciGen-Translation). The process of dataset creation for each scenario is described below.

2.1. Machine Translation

For the translation scenario we leveraged ACES [3], a challenge set for evaluating the performance of Machine Translation (MT) metrics on a range of translation accuracy errors. Each example in ACES already follows the structure that we use in the HalluciGen task, and ACES already contains errors for the en⇔fr and en⇔de language pairs for all but two of the phenomena we are interested in. For the tense and negation categories, which do not exist in ACES, we constructed examples from the PAWS-X dataset [4] of adversarial paraphrases.

For tense examples we filter the PAWS-X dataset to select English examples labelled as paraphrases, then we select each instance of sentence1 and use spaCy (https://spacy.io) to tokenise and part-of-speech tag the sentence. We then identify a verb and its tense using the Penn Treebank style tags output by the spaCy pipeline, and inflect it for a different tense using the pyinflect (https://pypi.org/project/pyinflect/) Python library. We change tense between past, present, and future (by injecting the token “will”). The original sentence1 forms the good translation, the perturbed version is the incorrect translation, and we pair the English sentence with the corresponding French/German translation in PAWS-X (which forms the source sentence). A sketch of this perturbation procedure is given at the end of this section.

Negation examples are created by automatically extracting English paraphrase examples in PAWS-X that contain a negation and manually editing sentence1 to construct an incorrect translation, e.g. by inserting an (extra) negation, or modifying the polarity of a sentence that already contains a negation. We consider lexical negation (e.g. the affixes “un” and “dis”) and negation tokens (e.g. not, n’t, never). Again, we pair the English target sentences with a corresponding French/German source sentence from PAWS-X.

From the combined set of ACES and the negation and tense examples, we selected 100 examples for each language direction for the test set and 10 examples for the trial set. Examples for the test set were selected in order to provide as close to a uniform selection across categories as possible. Note that due to the unbalanced coverage of examples in ACES, some categories are underrepresented or absent for some language directions.
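As an illustration of the tense perturbation described above, the following is a minimal sketch. It assumes spaCy's en_core_web_sm model and the pyinflect extension; the verb-selection heuristics and the future-tense case (injecting “will”) are simplified relative to the exact procedure used to build the dataset.

```python
# Minimal sketch of the tense perturbation described in Section 2.1.
# Assumptions: spaCy's en_core_web_sm model and pyinflect are installed.
# The future-tense case (injecting "will") and other selection details
# are simplified relative to the actual dataset construction.
import spacy
import pyinflect  # noqa: F401  (registers the token._.inflect extension)

nlp = spacy.load("en_core_web_sm")

def perturb_tense(sentence):
    """Return the sentence with one verb shifted to a different tense,
    or None if no suitable verb was found."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.tag_ == "VBD":             # past -> present (3rd person sg.)
            new_form = tok._.inflect("VBZ")
        elif tok.tag_ in ("VBZ", "VBP"):  # present -> past
            new_form = tok._.inflect("VBD")
        else:
            continue
        if new_form:
            words = [t.text_with_ws for t in doc]
            words[tok.i] = new_form + tok.whitespace_
            return "".join(words)
    return None

print(perturb_tense("The committee approved the proposal."))
# e.g. "The committee approves the proposal."
```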
2.2. Paraphrase

For the English paraphrase scenario, we sampled 138 examples from the SHROOM training data for the paraphrase generation subtask [5]. Each example consists of a source sentence accompanied by a machine-generated paraphrase hypothesis. The latter were generated by the SHROOM organisers using the PEGASUS model (https://huggingface.co/tuner007/pegasus_paraphrase). In order to increase the chance of hallucination, we prioritised examples with long contexts (minimum 140 tokens) that also include numbers.

In the Swedish paraphrase scenario, we used a subset of the SweParaphrase test data [6] and the Swedish part of the Finnish paraphrase corpus [7]. Each example consists of two sentences, together with a label reflecting the degree of their semantic similarity. After filtering only the sentence pairs with the highest degree of semantic similarity (that is, label 5 in the SweParaphrase dataset and label 4 in the Finnish paraphrase corpus), we sampled 139 examples and used two LLMs, Mixtral 7B [8] and GPT-SW3 6.7B [9], to generate a paraphrase hypothesis for the first sentence of each example. For these Swedish paraphrases, we observed cases where the generated paraphrase was in the wrong language, typically English, or a mix of languages when using Mixtral 7B. To obtain a large enough sample of reasonably good quality for annotation, we therefore chose to (1) translate output in English to Swedish using GPT-3.5 and (2) generate multiple hypotheses, one by each of the LLMs, for some of these examples. In total, 46 of the sources had multiple annotations, while 56 sources only occur once in the entire dataset.

All datasets used for the paraphrase scenario are annotated in two steps. The first step is to decide whether the generated hypothesis is a hallucination of the source, given the definition of the hallucination phenomenon in our task. If yes, then we mark the hypothesis as a hallucination (H) and choose a suitable hallucination type from the list of eleven hallucination categories in the HalluciGen dataset (addition, named-entity, etc.). If the hypothesis is marked as not a hallucination (NH), then we construct a hallucination manually, based on one of the hallucination categories above. In the Swedish data, the cases with hypotheses in the wrong language or a mix of languages were considered too costly to correct manually and were discarded from the final dataset.

Each of the resulting datasets per scenario and language was split into a trial and a test set. For English, 119 examples were selected for the test set and 16 examples for the trial set. The Swedish test set amounted to 119 examples in total (117 from SweParaphrase and 2 from the Finnish paraphrase corpus) and the trial set to 20 examples (19 from SweParaphrase and 1 from the Finnish corpus). The distribution of each dataset over the different hallucination types is presented in Table 1.
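For concreteness, a single example in either scenario can be represented as follows. The field names and the example sentences are illustrative only; the released datasets may use different column names.

```python
# Illustrative representation of one HalluciGen example, following the fields
# described at the start of Section 2. Field names and example sentences are
# ours; the released datasets may use different column names.
from dataclasses import dataclass

@dataclass
class HalluciGenExample:
    source: str              # source sentence
    hyp_plus: str            # correct paraphrase / translation (hyp+)
    hyp_minus: str           # hallucinated paraphrase / translation (hyp-)
    hallucination_type: str  # one of the eleven categories (addition, date, ...)

example = HalluciGenExample(
    source="The meeting was postponed until Friday.",
    hyp_plus="The meeting was moved to Friday.",
    hyp_minus="The meeting was postponed until Monday.",
    hallucination_type="date",
)
```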
Table 1
Frequency of hallucination categories in the data

Scenario, Data Split, Language | Named Entity | Conversion | Antonym | Negation | Addition | Pronoun | Number | Natural | Gender | Tense | Date
PG test, en     | 11 | 16 | 5  | 3  | 9  | 14 | 9  | 11 | 4  | 3  | 33
PG test, sv     | 42 | 11 | –  | 3  | 15 | 12 | 9  | 1  | 5  | 1  | 20
PG trial, en    | 2  | 2  | 1  | 1  | 2  | 1  | 1  | 2  | –  | 1  | 3
PG trial, sv    | 5  | 1  | 1  | 1  | 3  | 2  | 3  | 1  | 1  | –  | 2
MT test, en-fr  | 10 | –  | 24 | –  | 33 | –  | 33 | –  | –  | –  | –
MT test, fr-en  | 9  | 13 | 4  | 12 | 12 | 12 | 13 | –  | 12 | 13 | –
MT test, en-de  | 10 | 16 | 14 | –  | 15 | –  | 13 | 16 | –  | –  | 16
MT test, de-en  | 10 | 10 | 7  | 11 | 10 | 10 | 10 | –  | 10 | 11 | 11
MT trial, en-fr | 1  | –  | 3  | –  | 3  | –  | 3  | –  | –  | –  | –
MT trial, fr-en | 1  | 1  | 1  | 1  | 1  | 2  | 1  | –  | 1  | 1  | –
MT trial, en-de | 1  | 1  | 2  | –  | 1  | –  | 3  | 1  | –  | –  | 1
MT trial, de-en | 1  | 1  | 1  | 1  | 1  | 1  | 1  | –  | 1  | 1  | 1

3. Baseline Models

3.1. Paraphrase scenario

For the paraphrase scenario we use different models for the generation and detection steps. For generation, we use Mixtral-8x7B-Instruct-v0.1, the instructed variant of the Mixtral LLM [8] (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF), to generate hyp+/hyp− hypothesis pairs for the English and Swedish test sets, and gpt-sw3-6.7b-v2 [9] (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2) as an additional baseline for Swedish.

For the detection step we use several models. The first is the Llama-2-7b-chat-hf [10] model from HuggingFace (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). This model, although English-centric, has been trained on smaller amounts of data for other languages, including Swedish. We use three prompts aimed at detecting which hypothesis a) is an incorrect paraphrase of the source, b) has a different meaning to the source, or c) is not supported by the source (see Table 9 in Appendix A). The second and third models are multilingual zero-shot Natural Language Inference (NLI) models: bge-m3-zeroshot-v2.0 [11] for both English and Swedish hallucination detection, and scandi-nli-large [12] (https://huggingface.co/alexandrainst/scandi-nli-large) as an additional baseline for Swedish. These models classify a text into a number of custom-defined classes; in our case, we choose the default classes “not_entailment” and “entailment” and infer the output label from the predicted scores for both classes. To determine which of the two hypotheses (hyp1/hyp2) contains a hallucination, we predicted “entailment” and “not_entailment” class scores between the source sentence and each one of the hypotheses. We follow these conditions to infer the final label:

1. If one hypothesis has a higher entailment score whereas the other hypothesis has a higher non-entailment score, we choose the one with the higher non-entailment score.
2. If both hyp1 and hyp2 have a higher entailment score, we choose the one with the lower entailment score.
3. If both hyp1 and hyp2 have a higher non-entailment score, we choose the one with the higher non-entailment score.

For both NLI models, the default configurations were used and each pair (source + hyp1, source + hyp2) was scored independently. The models were used out of the box, as available on HuggingFace, and without any additional fine-tuning.
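The decision rule above can be made concrete with the following sketch. It assumes the MoritzLaurer/bge-m3-zeroshot-v2.0 checkpoint on the Hugging Face Hub, whose two output classes are documented as “entailment” and “not_entailment”; the helper is an illustrative reconstruction, not the exact baseline implementation.

```python
# Sketch of the NLI-based detection rule described above. This is an
# illustrative reconstruction, not the exact baseline code. It assumes the
# MoritzLaurer/bge-m3-zeroshot-v2.0 checkpoint, whose two output classes are
# "entailment" and "not_entailment".
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "MoritzLaurer/bge-m3-zeroshot-v2.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_scores(premise, hypothesis):
    """Return (entailment, not_entailment) probabilities for one pair."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0], dim=-1)
    ent = probs[model.config.label2id["entailment"]].item()
    return ent, 1.0 - ent

def detect_hallucination(source, hyp1, hyp2):
    """Apply the three conditions above to pick the hallucinated hypothesis."""
    (e1, n1), (e2, n2) = nli_scores(source, hyp1), nli_scores(source, hyp2)
    ent1, ent2 = e1 > n1, e2 > n2         # is each hypothesis predicted as entailed?
    if ent1 != ent2:                      # condition 1: exactly one is entailed
        return "hyp2" if ent1 else "hyp1"
    if ent1 and ent2:                     # condition 2: both entailed -> lower entailment
        return "hyp1" if e1 < e2 else "hyp2"
    return "hyp1" if n1 > n2 else "hyp2"  # condition 3: both not entailed -> higher non-entailment
```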
3.2. Machine translation scenario

For the translation scenario we again use the Llama2 [10] 7B-chat model from HuggingFace. We use this model as the baseline for the generation and detection steps. As stated in the previous section, while Llama2 is an English-centric model, it has been trained on (relatively) small amounts of data from other languages (including French and German) and is therefore able to perform cross-lingual tasks such as translation. Crucially, in addition to producing accurate translations, it can also be prompted to produce incorrect translations in a zero-shot setting, something that we could not get MT-specific LLMs such as Tower [13] to do, perhaps because they have been optimised to output accurate translations. We note that there are many stronger LLMs, and our aim is not to provide an unbeatable baseline in the first year of HalluciGen.

For the generation step we split the problem into two parts, using separate prompts to produce the good and incorrect translations. For the good translation we simply prompt the model to translate from the source language to the target language. For the incorrect translation we use two different strategies: we prompt the model to a) produce an incorrect translation, and b) produce an incorrect translation and provide a list of possible phenomena that the incorrect translation could target. For the detection step, similar to the paraphrase detection step, we have three different prompts aimed at detecting which hypothesis a) is an incorrect translation of the source, b) has a different meaning to the source, or c) is not supported by the source. We use the same prompts from the detection step in the cross-model evaluation step. For details of the exact prompts used, see Table 10 in Appendix A. Note that we experimented with explicitly including the term “hallucination” as part of the prompt instructions, but this was unsuccessful.

We used the default Llama2-7B-chat model parameters, unless otherwise stated. For generation (translation only) we want to encourage creativity, so we set temperature to 0.9, top_k=10, num_return_sequences=1, and max_length=200. For detection (for both translation and paraphrase) we want to encourage deterministic behaviour, so we set temperature to 0.1 and top_k=1; as our prompts are longer than for the generation step, we set max_length=400 (to allow for longer inputs). A sketch of this generation set-up is given at the end of this section.

In addition to Llama2, we again employed bge-m3-zeroshot-v2.0 [11] to create detection step baselines for all language pairs and directions. This uses the same model and process detailed in Section 3.1. Although the model performs better on English input, it is still suitable for multilingual tasks. While it was recommended to translate input sentences into English rather than having them in multiple languages (as a way to improve performance), no additional translation was performed on either the source sentences or the hypothesis pairs (hyp1/hyp2); this means that the NLI model receives two sentences in two different languages as input (one in English, and one in either French or German) in both directions.
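The sketch below illustrates the generation set-up described above, using the Hugging Face text-generation pipeline with the decoding parameters reported in this section. The prompt strings follow Table 10; the code is an illustration under these assumptions, not the original baseline script.

```python
# Hedged sketch of the baseline generation set-up: Llama-2-7b-chat prompted
# for a correct and an incorrect translation with the decoding parameters
# reported in Section 3.2 (temperature 0.9, top_k 10, num_return_sequences 1,
# max_length 200). Prompt wording follows Table 10; this is an illustration,
# not the authors' original script.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def translate(src_sentence: str, tgt_lang: str, incorrect: bool = False) -> str:
    instruction = (f"Translate the following text incorrectly into {tgt_lang}"
                   if incorrect
                   else f"Translate the following text into {tgt_lang}")
    prompt = f"{instruction}\nText: {src_sentence}\n"
    out = generator(prompt, do_sample=True, temperature=0.9, top_k=10,
                    num_return_sequences=1, max_length=200,
                    return_full_text=False)
    return out[0]["generated_text"].strip()

# hyp+ and hyp- for one source sentence:
hyp_plus = translate("The meeting was postponed until Friday.", "French")
hyp_minus = translate("The meeting was postponed until Friday.", "French", incorrect=True)
```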
4. Participant Submissions

In total, we received outputs from 10 systems submitted by 3 different groups, which included varying numbers of participants. Table 2 provides an overview of the submitted systems.

Participant group 1 (Bui et al.) [14] submitted systems for all steps and all languages for both the paraphrase and translation scenarios. They applied zero-shot prompting for a range of pre-trained LLMs, and ensembled combinations of these models to produce majority voting systems.

Participant group 3 (Siino & Tinnirello) [15] submitted systems for the detection step of the paraphrase scenario only. They used Mistral-7B-Instruct-v0.2 with few-shot prompting, providing the complete set of examples (either English or Swedish depending on the language in focus) from the trial data set as part of the prompt.

Participant group 2 (Abburi) submitted systems for the detection step for both the paraphrase and translation scenarios. Unfortunately, as they did not submit a paper to CLEF 2024, we know little about their system other than that it uses majority voting across multiple fine-tuned LLMs.

Table 2
Participant systems by task and scenario (PG and MT), including the languages or language pairs for which output was submitted. The double-direction arrow “⇔” indicates participant submissions for a language pair in both directions.

Participant Group 1 (Bui et al.)
google/gemma-7b-it: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: PG (en/sv), MT (en⇔de), MT (en⇔fr); Cross-model evaluation: PG (en/sv), MT (en⇔de), MT (en⇔fr)
gpt-3.5-turbo: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: PG (en/sv), MT (en⇔de), MT (en⇔fr); Cross-model evaluation: PG (en/sv), MT (en⇔de), MT (en⇔fr)
gpt-4: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: -; Cross-model evaluation: MT (en⇔de), MT (en⇔fr)
gpt-4-turbo: Detection: PG (en/sv); Generation: -; Cross-model evaluation: PG (en/sv)
meta-llama/Meta-Llama-3-8B-Instruct: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: PG (en), MT (en⇔de), MT (en⇔fr); Cross-model evaluation: PG (en/sv), MT (en⇔de), MT (en⇔fr)
meta-llama/Meta-Llama-3-8B: Detection: PG (en/sv); Generation: -; Cross-model evaluation: -
Majority vote (A) on google/gemma-7b-it, meta-llama/Meta-Llama-3-8B-Instruct, gpt-3.5-turbo, gpt-4-turbo: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: -; Cross-model evaluation: PG (en/sv)
Majority vote (B) on google/gemma-7b-it, meta-llama/Meta-Llama-3-8B-Instruct, gpt-3.5-turbo, gpt-4: Detection: -; Generation: -; Cross-model evaluation: PG (en/sv), MT (en⇔de), MT (en⇔fr)

Participant Group 2 (Abburi)
Majority voting of finetuned LLMs: Detection: PG (en/sv), MT (en⇔de), MT (en⇔fr); Generation: -; Cross-model evaluation: -

Participant Group 3 (Siino & Tinnirello)
TheBloke/Mistral-7B-Instruct-v0.2-GGUF: Detection: PG (en/sv); Generation: -; Cross-model evaluation: -

5. Evaluation Methodology

5.1. Detection Step

For the detection step, the submitted systems are evaluated with respect to the human-annotated labels, using the following metrics: accuracy, precision, recall and F1 score. We use F1 as the primary metric for comparison between different systems. Examples were classified as incorrect in cases where the evaluated system produced no label or a label outside the allowed categories (hyp1/hyp2).

5.2. Generation Step

We use the NLI task as a proxy for evaluating the quality of the correct and hallucinated hypotheses hyp+ and hyp− generated by the participant models. More specifically, the NLI model bge-m3-zeroshot-v2.0 [11], which also serves as a baseline for the detection step, is now used to predict “entailment” vs “not_entailment” scores. The rationale behind this is as follows: one way to determine whether or not a system is able to create appropriate pairs of hypotheses is to measure the textual entailment between each pair and the source sentence. We assume that a successful paraphrase of a sentence textually entails the source sentence, whereas a hallucination does not. If hyp+ is predicted as having higher “entailment”, it is assigned a score of 1, otherwise 0, and if hyp− is predicted as having higher “not_entailment”, it is assigned a score of 1, otherwise 0. A sketch of this scoring, together with the detection-step metrics, is given at the end of this section.

To validate the use of the NLI model for evaluating the model outputs for the generation step, we test the NLI model bge-m3-zeroshot-v2.0 as a baseline for the detection step in both scenarios. These are the scores highlighted in grey in Tables 3 and 7. We observe that the NLI model competes with (or even surpasses) the participant models on the detection task. This allows us to use it for evaluating the model outputs for the generation step.
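A minimal sketch of both scoring procedures follows, assuming scikit-learn for the detection metrics; details such as the F1 averaging mode are assumptions and may differ from the official evaluation scripts.

```python
# Hedged sketch of the automatic scoring described in Sections 5.1 and 5.2.
# Assumes scikit-learn; details such as the F1 averaging mode are assumptions.
from typing import Callable, Dict, List, Tuple
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

ALLOWED = {"hyp1", "hyp2"}

def detection_metrics(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Detection step: missing labels or labels outside hyp1/hyp2 count as incorrect."""
    preds = [p if p in ALLOWED else "invalid" for p in predicted]
    acc = accuracy_score(gold, preds)
    p, r, f1, _ = precision_recall_fscore_support(
        gold, preds, labels=sorted(ALLOWED), average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}

def generation_scores(source: str, hyp_plus: str, hyp_minus: str,
                      entail_prob: Callable[[str, str], float]) -> Tuple[int, int]:
    """Generation step: hyp+ scores 1 if predicted as entailed, hyp- scores 1 if
    predicted as not entailed (binary NLI, so p(not_entailment) = 1 - p(entailment))."""
    plus_ok = entail_prob(source, hyp_plus) > 0.5
    minus_ok = entail_prob(source, hyp_minus) <= 0.5
    return int(plus_ok), int(minus_ok)
```

Under these assumptions, the hyp+ and hyp− columns in Tables 4 and 6 can be read as the mean of such per-example scores over the test set, with entail_prob instantiated by an NLI model such as the one sketched in Section 3.1.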
5.3. Cross-model evaluation

For the cross-model evaluation, the system performance is measured with respect to the output of the generator model, using the same metrics as in the detection step. In addition, Matthews correlation coefficient (MCC) and Cohen's kappa are used to measure the agreement between the different evaluators.

6. Results

6.1. Paraphrase Scenario

Tables 3-5 present the results of the participant models and the baselines for the three steps of the paraphrase scenario.

Table 3
Detection step results for the paraphrase scenario. Results for the NLI model bge-m3-zeroshot-v2.0 (highlighted in grey) are included for the purpose of validating the NLI model as an evaluation method for the generation step.

Detection: Paraphrase (F1)
English:
gemma-7b-it 0.49
gemma-7b-it v1 0.71
gpt-3.5-turbo 0.68
gpt-3.5-turbo v1 0.73
gpt-4-turbo 0.91
Meta-Llama-3-8B-Instruct 0.80
Meta-Llama-3-8B 0.69
Majority vote A (Bui et al.) 0.85
Majority vote (Abburi) 0.90
Mistral-7B-Instruct-v0.2 0.72
Baselines:
baseline-bge-m3-zeroshot-v2.0 0.90
baseline-llama2-meaning-detection 0.44
baseline-llama2-not-supported-detection 0.35
baseline-llama2-paraphrase-detection 0.35

Swedish:
gemma-7b-it 0.11
gemma-7b-it v1 0.52
gpt-3.5-turbo 0.60
gpt-3.5-turbo v1 0.70
gpt-4-turbo 0.81
Meta-Llama-3-8B-Instruct 0.59
Meta-Llama-3-8B 0.48
Majority vote (Abburi) 0.79
Majority vote A (Bui et al.) 0.66
Mistral-7B-Instruct-v0.2 0.75
Baselines:
baseline-bge-m3-zeroshot-v2.0 0.92
baseline-llama2-meaning-detection 0.60
baseline-llama2-not-supported-detection 0.56
baseline-llama2-paraphrase-detection 0.59
baseline-sv_scandi-nli-large 0.92

Table 4
Generation results for the paraphrase scenario. hyp+ and hyp− refer to the accuracy of the NLI model on predicting that hyp+ is entailed and hyp− is not entailed, respectively.

Generation: Paraphrase (hyp+ / hyp−)
English:
gemma-7b-it v1 0.82 / 0.89
gemma-7b-it v2 0.85 / 0.90
gpt-3.5-turbo 0.98 / 0.80
Meta-Llama-3-8B-Instruct 0.88 / 0.98
Baselines:
baseline-mixtral-8x7b-instruct 0.92 / 0.74

Swedish:
gemma-7b-it v1 0.35 / 0.93
gemma-7b-it v2 0.61 / 0.69
gpt-3.5-turbo 0.90 / 0.93
Baselines:
baseline-gpt-sw3-6.7b-v2 0.64 / 0.50
baseline-mixtral-8x7b-instruct 0.84 / 0.35

Table 5
Cross-model step results for the paraphrase scenario.

Cross-model evaluation: Paraphrase (F1 / Avg Kappa)
English:
gemma-7b-it v1 0.77 / 0.61
gpt-3.5-turbo v2 0.88 / 0.77
Meta-Llama-3-8B-Instruct 0.92 / 0.74
Majority vote A (Bui et al.) 0.92 / 0.81
gpt-4-turbo v2 0.93 / 0.75

Swedish:
gemma-7b-it v1 0.48 / 0.19
gpt-3.5-turbo v2 0.68 / 0.48
Meta-Llama-3-8B-Instruct 0.70 / 0.50
Majority vote A (Bui et al.) 0.76 / 0.59
gpt-4-turbo v2 0.74 / 0.41

Starting from the detection step, we observe that the NLI baseline baseline-bge-m3-zeroshot-v2.0 exhibits very strong performance. The difference with the participant models is even more noticeable for the Swedish dataset, where the best performing participant model, gpt-4-turbo, lies more than 10 points behind the NLI baseline in terms of F1 score. This is to be expected, since none of the participant models has been (intentionally) trained on Swedish data. For the English paraphrase, gpt-4-turbo and the Majority vote (Abburi) models perform on the same level as the baseline on the task of hallucination detection.

For the generation step, gpt-3.5-turbo produces overall the best quality positive and negative hypotheses in both English and Swedish, according to the NLI model.
A notably larger difference between the hyp+ and hyp− scores of that model is observed in English, in comparison with Swedish. In addition, gemma-7b-it v1 stands out for generating hyp− hypotheses with considerably better quality than hyp+ hypotheses, according to the NLI model.

From the results of the cross-model evaluation in Table 5, we observe that Majority vote A (Bui et al.) exhibits the best overall performance in detecting hallucinations in machine-generated hypotheses in English and Swedish, with respect to both the generator output and the other evaluator models.

6.2. Machine Translation Scenario

Tables 6-8 contain the results for the translation scenario.

Table 6
Generation step results for the translation scenario.

Generation: Translation (hyp+ / hyp− per language pair: en-fr, fr-en, en-de, de-en)
Meta-Llama-3-8B-Instruct prompt1 (Bui et al.): 0.77/0.81, 0.82/0.88, 0.84/0.84, 0.84/0.85
Meta-Llama-3-8B-Instruct prompt2 (Bui et al.): 0.81/0.86, 0.81/0.96, 0.84/0.68, 0.85/0.62
gemma-7b-it (Bui et al.): 0.80/0.49, 0.73/0.57, 0.85/0.42, 0.70/0.54
gpt-3.5-turbo (Bui et al.): 0.88/0.91, 0.86/0.90, 0.81/0.84, 0.86/0.95
Llama-2-7b-chat-hf general-prompt: 0.93/0.08, 1.00/0.03, 0.85/0.19, 0.98/0.03
Llama-2-7b-chat-hf phenomena-mentions-prompt: 0.92/0.23, 0.97/0.08, 0.85/0.33, 0.98/0.06

Table 7
Detection step results for the translation scenario. Results for the NLI model bge-m3-zeroshot-v2.0 (highlighted in grey) are included for the purpose of validating the NLI model as an evaluation method for the generation step.

Detection: Translation (F1: en-fr, fr-en, en-de, de-en)
Meta-Llama-3-8B-Instruct final (Bui et al.): 0.51, 0.63, 0.47, 0.67
Meta-Llama-3-8B-Instruct new-prompt-final (Bui et al.): 0.65, 0.60, 0.69, 0.70
gemma-7b-it (Bui et al.): 0.66, 0.61, 0.59, 0.58
gemma-7b-it final (Bui et al.): 0.60, 0.46, 0.54, 0.53
gpt-3.5-turbo prompt1 (Bui et al.): 0.74, 0.75, 0.67, 0.80
gpt-3.5-turbo prompt2 (Bui et al.): 0.76, 0.82, 0.80, 0.83
gpt-4 prompt1 (Bui et al.): 0.90, 0.87, 0.86, 0.93
gpt-4 prompt2 (Bui et al.): 0.79, 0.89, 0.79, 0.83
Majority vote A (Bui et al.): 0.83, 0.84, 0.81, 0.85
Majority vote (Abburi): 0.85, 0.87, 0.85, 0.89
bge-m3-zeroshot-v2.0: 0.82, 0.88, 0.73, 0.78
Llama-2-7b-chat-hf general-prompt: 0.47, 0.50, 0.48, 0.50
Llama-2-7b-chat-hf meaning-prompt: 0.50, 0.44, 0.36, 0.36
Llama-2-7b-chat-hf supported-prompt: 0.24, 0.35, 0.41, 0.50

Table 8
Cross-model evaluation step results for the translation scenario.

Cross-model Evaluation: Translation (F1 wrt. generator output and Kappa wrt. other evaluators: en-fr, fr-en, en-de, de-en)
Meta-Llama-3-8B-Instruct final (Bui et al.): F1 0.65, 0.68, 0.52, 0.51; K 0.43, 0.45, 0.27, 0.33
gemma-7b-it final (Bui et al.): F1 0.57, 0.53, 0.53, 0.55; K 0.23, 0.13, 0.15, 0.17
gpt-3.5-turbo (Bui et al.): F1 0.77, 0.75, 0.75, 0.71; K 0.57, 0.55, 0.50, 0.54
gpt-4 (Bui et al.): F1 0.76, 0.75, 0.73, 0.71; K 0.59, 0.58, 0.52, 0.53
Majority vote B (Bui et al.): F1 0.78, 0.79, 0.74, 0.73; K 0.65, 0.62, 0.58, 0.59

For the generation step (Table 6) we observe that the performance of the Llama-3-8B-Instruct and gpt-3.5-turbo participant systems is generally good: the average “entailment” scores for hyp+ and “not_entailment” scores for hyp− suggest that the models are generally consistent in their ability to generate hypotheses that are entailed by the reference (hyp+) and that contradict the reference (hyp−). The two Llama-2-7b-chat baselines and, to a lesser degree, the gemma-7b-it participant system exhibit stronger performance for the generation of hyp+ examples than hyp− examples.
In particular, the Llama-2-7b-chat baselines outperform the participant systems on the task of generating hyp+ examples. We conjecture that this may be a result of using separate prompts to generate hyp+ and hyp−: by focusing the prompt for generating hyp+ examples on generating a “good” translation of the source, we may focus the model on the translation task, for which it was likely fine-tuned. Conversely, the baseline performance for generating hyp− examples is very low, but confidence in the ability of LLMs to perform this task is buoyed by the performance of the participant systems. Note that these results are based on automatic metrics; for a complete evaluation we propose that the generated output be verified by human annotators, which we leave to future work.

For the detection step, all participant systems outperformed the Llama-2-7b-chat baselines (one model; three different prompts). The stronger bge-m3-zeroshot-v2.0 baseline is outperformed by a number of participant systems for all language pairs. Overall, gpt-4 prompt1 is the strongest-performing participant system, with the highest F1 score for three out of four language pairs. The majority voting strategies of Bui et al. [14] and Abburi also perform strongly.

For the cross-model evaluation step, from which we exclude the baselines, we find that the majority voting strategy of Bui et al. [14] works well, with strong F1 performance on detection based on the examples generated by the models in the generation step, and also has the highest agreement (measured using Cohen's kappa) with the other evaluator models.

7. Conclusion and Future Work

In the HalluciGen task we explored the use of LLMs in generating and detecting hallucinations in paraphrase and translation tasks. We find that the performance of the participant and baseline systems is highly variable, but results from this year's lab are promising and will provide a solid foundation for future iterations of the task. We highlight that all three steps (generation, detection, and cross-model evaluation) have been evaluated automatically, and therefore caution the reader against drawing any conclusions regarding which models, prompts, or methods may be “best” based solely on the results in this paper. In the case of the generation step in particular, human validation of the generated output is ultimately needed to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.

Acknowledgments

This lab has been partially supported by the Swedish Research Council (grant number 2022-02909) and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)].

References

[1] P. Manakul, A. Liusie, M. J. F. Gales, SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).
[2] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, J. Leike, Self-critiquing models for assisting human evaluators, arXiv preprint arXiv:2206.05802 (2022).
[3] C. Amrhein, N. Moghe, L. Guillou, ACES: Translation accuracy challenge sets for evaluating machine translation metrics, in: Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 479–513. URL: https://aclanthology.org/2022.wmt-1.44.
[4] Y. Yang, Y. Zhang, C. Tar, J. Baldridge, PAWS-X: A cross-lingual adversarial dataset for paraphrase identification, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3687–3692. URL: https://aclanthology.org/D19-1382. doi:10.18653/v1/D19-1382.
[5] T. Mickus, E. Zosa, R. Vázquez, T. Vahtola, J. Tiedemann, V. Segonne, A. Raganato, M. Apidianaki, SemEval-2024 shared task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes, arXiv preprint arXiv:2403.07726 (2024).
[6] A. Berdicevskis, G. Bouma, R. Kurtz, F. Morger, J. Öhman, Y. Adesam, L. Borin, D. Dannélls, M. Forsberg, T. Isbister, A. Lindahl, M. Malmsten, F. Rekathati, M. Sahlgren, E. Volodina, L. Börjeson, S. Hengchen, N. Tahmasebi, Superlim: A Swedish language understanding evaluation benchmark, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 8137–8153.
[7] J. Kanerva, F. Ginter, L.-H. Chang, I. Rastas, V. Skantsi, J. Kilpeläinen, H.-M. Kupari, J. Saarni, M. Sevón, O. Tarkka, Finnish paraphrase corpus, in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online), 2021, pp. 288–298.
[8] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024).
[9] A. Ekgren, A. Cuba Gyllensten, F. Stollenwerk, J. Öhman, T. Isbister, E. Gogoulou, F. Carlsson, J. Casademont, M. Sahlgren, GPT-SW3: An autoregressive language model for the Scandinavian languages, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 2024.
[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[11] M. Laurer, W. van Atteveldt, A. Casas, K. Welbers, Building efficient universal classifiers with natural language inference, 2023. URL: http://arxiv.org/abs/2312.17543. doi:10.48550/arXiv.2312.17543. arXiv:2312.17543 [cs].
[12] D. S. Nielsen, ScandiNLI: Natural language inference for the Scandinavian languages, https://github.com/alexandrainst/ScandiNLI, 2022.
[13] D. M. Alves, J. Pombal, N. M. Guerreiro, P. H. Martins, J. Alves, A. Farajian, B. Peters, R. Rei, P. Fernandes, S. Agrawal, P. Colombo, J. G. C. de Souza, A. F. T. Martins, Tower: An open multilingual large language model for translation-related tasks, 2024. arXiv:2402.17733.
[14] A. T. Bui, S. F. Brech, N. Hußfeldt, T. Jennert, M. Ullrich, T. Breuer, N. Nikzad, P. Schaer, The two sides of the coin: Hallucination generation and detection with evaluators for LLMs, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[15] M. Siino, I. Tinnirello, GPT hallucination detection through prompt engineering, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.

A. Task 2 - Baseline System Prompts

The prompts used for the paraphrase and translation baseline LLM systems are provided in Tables 9 and 10, respectively.

Table 9
Prompts for the paraphrase baseline models. In the generation step, the model is instructed to generate a pair of hypotheses (sometimes explicitly named "hyp+" or "hyp-") where one is supported by the source sentence and the other is not. In the detection step, the model is instructed to identify which of the two hypotheses, hypothesis1 (hyp1) or hypothesis2 (hyp2), contains the hallucinated content, given the source sentence.

Paraphrase: Generation Step

gpt-sw3-6.7b-v2:
Generera en parafras hyp+ som stöds av src och en andra parafras hyp- som inte stöds av src.
(English: Generate a paraphrase hyp+ that is supported by src and a second paraphrase hyp- that is not supported by src.)

mixtral-8x7b-instruct:
Prompt for English: Given the src below, generate a paraphrase hypothesis hyp+ that is supported by src and a paraphrase hypothesis hyp- that is not supported by src.
Prompt for Swedish: Generera en parafras hyp+ som stöds av src och en andra parafras hyp- som inte stöds av src.

Paraphrase: Detection Step

Llama2-7B-general-prompt:
Which hypothesis is an incorrect paraphrase of the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:

Llama2-7B-meaning-prompt:
Given the source which hypothesis contains content which is not present in the source, or has a different meaning to the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:

Llama2-7B-support-prompt:
Which hypothesis is not supported by the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:

Table 10
Prompts for the translation baseline models. In the generation step the model is instructed to produce translations of src_sentence, a source language (src_lang) text, into the target language (tgt_lang). In the detection step the model is instructed to identify which of the two hypotheses, hypothesis1 (hyp1) or hypothesis2 (hyp2), contains the hallucinated content, given the source sentence.

Translation: Generation Step

Llama2-7B (good translation):
Translate the following text into [tgt_lang]
Text: [src_sentence]

Llama2-7B-general-prompt (incorrect translation):
Translate the following text incorrectly into [tgt_lang]
Text: [src_sentence]

Llama2-7B-mentions-prompt (incorrect translation):
Translate the following text incorrectly into [tgt_lang] and change its meaning, for example by inserting a word, changing the tense of the text, negating the text, or replacing a date, number, named entity, or pronoun.
Text: [src_sentence]

Translation: Detection Step

Llama2-7B-general-prompt:
Which hypothesis is an incorrect translation of the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:

Llama2-7B-meaning-prompt:
Given the source which hypothesis contains content which is not present in the source, or has a different meaning to the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:

Llama2-7B-support-prompt:
Which hypothesis is not supported by the source: hypothesis1 or hypothesis2?
source:
hypothesis1:
hypothesis2:
Acceptable answers: 'hypothesis1', 'hypothesis2'
Answer:
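To illustrate how these detection prompts are used, the sketch below fills the general detection prompt from Table 9 and queries Llama-2-7b-chat through the Hugging Face text-generation pipeline with the detection parameters reported in Section 3.2 (temperature 0.1, top_k 1, max_length 400). The answer-parsing heuristic is our own assumption; the baseline scripts may handle model output differently.

```python
# Hedged sketch: the general detection prompt from Table 9 sent to
# Llama-2-7b-chat with the detection parameters from Section 3.2.
# The answer parsing is a simple heuristic of our own, not the baseline code.
from transformers import pipeline

detector = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

PROMPT = ("Which hypothesis is an incorrect paraphrase of the source: "
          "hypothesis1 or hypothesis2?\n"
          "source: {source}\n"
          "hypothesis1: {hyp1}\n"
          "hypothesis2: {hyp2}\n"
          "Acceptable answers: 'hypothesis1', 'hypothesis2'\n"
          "Answer:")

def detect(source: str, hyp1: str, hyp2: str):
    out = detector(PROMPT.format(source=source, hyp1=hyp1, hyp2=hyp2),
                   do_sample=True, temperature=0.1, top_k=1,
                   max_length=400, return_full_text=False)
    answer = out[0]["generated_text"].lower()
    if "hypothesis1" in answer:
        return "hyp1"
    if "hypothesis2" in answer:
        return "hyp2"
    return None  # counted as incorrect in the evaluation (Section 5.1)
```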