
Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen

Luise Dürlich

Evangelia Gogoulou

Liane Guillou

Joakim Nivre

Shorouq Zahra

RISE Research Institutes of Sweden, Stockholm; School of Informatics, University of Edinburgh

In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination. Specifically, we investigate whether LLMs can be used to both generate and detect hallucinated content. In the cross-model evaluation setting we take this a step further and explore the viability of using an LLM to evaluate output produced by another LLM. We include generation, detection, and cross-model evaluation steps for two scenarios: paraphrase and machine translation. Overall, we find that the performance of the baselines and submitted systems is highly variable; however, initial results are promising, and lessons learned from this year's task will provide a solid foundation for future iterations. In particular, we highlight that human validation of the generated output is ideally necessary to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.

Keywords: Generative language models, Evaluation, Hallucinations

1. Introduction

• Detection: Given a source sentence and two paraphrase/translation hypotheses (ℎ1 and ℎ2), the model should detect which of the two contains a hallucination.

As an additional challenge, we also perform the detection step in a cross-model setting, where the participant models perform the detection step on the model outputs from the generation step.

2. Datasets

For each of the two scenarios, i.e. paraphrase generation and machine translation, we construct a dataset with the following fields: a source sentence, a correct hypothesis of the source, a hallucinated hypothesis of the source, and the type of hallucination demonstrated in the hallucinated hypothesis. Our datasets include hallucinations of the following categories: addition, named-entity, number, conversion, date, gender, pronoun, antonym, tense, negation, and natural hallucinations. With the exception of tense and negation, the hallucination types are identical to the types of translation error identified in ACES [3]. All our datasets are available on Hugging Face.¹ The process of dataset creation for each scenario is described below.

2.1. Machine Translation

For the translation scenario we leveraged ACES [3], a challenge set for evaluating the performance of Machine Translation (MT) metrics on a range of translation accuracy errors. Each example in ACES already follows the structure that we use in the HalluciGen task, and ACES already contains errors for the en⇔fr and en⇔de language pairs for all but two of the phenomena we are interested in. For the tense and negation categories, which do not exist in ACES, we constructed examples from the PAWS-X dataset [4] of adversarial paraphrases.

For tense examples we filter the PAWS-X dataset to select English examples labelled as paraphrases, then take sentence1 of each instance and use spaCy² to tokenise and part-of-speech tag it. We then identify a verb and its tense using the Penn Treebank-style tags output by the spaCy pipeline, and inflect it for a different tense using the pyinflect³ Python library. We change tense between past, present, and future (the latter by injecting the token “will”). The original sentence1 forms the good translation, the perturbed version is the incorrect translation, and we pair the English sentence with the corresponding French/German translation in PAWS-X (which forms the source sentence). Negation examples are created by automatically extracting English paraphrase examples in PAWS-X that contain a negation and manually editing sentence1 to construct an incorrect translation, e.g. by inserting an (extra) negation or by modifying the polarity of a sentence that already contains a negation. We consider lexical negation (e.g. the affixes “un” and “dis”) and negation tokens (e.g. not, n’t, never). Again, we pair the English target sentences with a corresponding French/German source sentence from PAWS-X.
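The tense perturbation described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes the tokens have already been tagged with Penn Treebank tags (as a spaCy pipeline would provide), and it uses a tiny hand-written table, `TOY_INFLECTIONS`, as a hypothetical stand-in for an inflection library such as pyinflect. The function name `perturb_tense` and the simplification that "present" always maps to the third-person singular form are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for an inflection library:
# (lemma, target Penn Treebank tag) -> inflected form.
TOY_INFLECTIONS = {
    ("play", "VB"): "play",
    ("play", "VBD"): "played",
    ("play", "VBZ"): "plays",
}

# Finite verb tags and the tense they signal.
FINITE_TAGS = {"VBD": "past", "VBZ": "present", "VBP": "present"}


def perturb_tense(
    tagged: List[Tuple[str, str, str]], target: str
) -> Optional[List[str]]:
    """Change the tense of the first finite verb that can be re-inflected.

    tagged: (token, lemma, tag) triples for one sentence.
    target: "past", "present" or "future".
    Returns the perturbed token list, or None if no verb could be changed.
    """
    tokens = [token for token, _, _ in tagged]
    for i, (_, lemma, tag) in enumerate(tagged):
        # Skip non-verbs and verbs that already carry the target tense.
        if tag not in FINITE_TAGS or FINITE_TAGS[tag] == target:
            continue
        if target == "future":
            # Future is formed by injecting "will" before the base form.
            base = TOY_INFLECTIONS.get((lemma, "VB"))
            replacement = ["will", base] if base else None
        else:
            wanted = "VBD" if target == "past" else "VBZ"
            form = TOY_INFLECTIONS.get((lemma, wanted))
            replacement = [form] if form else None
        if replacement is not None:
            return tokens[:i] + replacement + tokens[i + 1:]
    return None
```

For example, with the tagged sentence "She plays chess", the sketch yields "She played chess" for `target="past"` and "She will play chess" for `target="future"`; the unperturbed sentence would serve as the good hypothesis.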

From the combined set of ACES and the negation and tense examples, we selected 100 examples for each language direction for the test set and 10 examples for the trial set. Examples for the test set were selected in order to provide as close to a uniform selection across categories as possible. Note that due to the unbalanced coverage of examples in ACES, some categories are underrepresented or absent for some language directions.
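The text does not state how the near-uniform selection was carried out. One plausible way to approximate it is a round-robin draw over categories, sketched below under stated assumptions: the field name `type`, the function name `select_balanced`, and the fixed seed are all hypothetical. Categories with few examples are simply exhausted early, which mirrors the under-representation noted above.

```python
import random
from collections import defaultdict
from typing import Dict, List


def select_balanced(examples: List[dict], n: int, seed: int = 0) -> List[dict]:
    """Pick up to n examples, cycling over categories for near-uniform coverage."""
    rng = random.Random(seed)
    by_category: Dict[str, List[dict]] = defaultdict(list)
    for example in examples:
        by_category[example["type"]].append(example)
    for pool in by_category.values():
        rng.shuffle(pool)

    selected: List[dict] = []
    # Take one example per category per round until n is reached
    # or every category pool is empty.
    while len(selected) < n and any(by_category.values()):
        for category in sorted(by_category):
            if by_category[category] and len(selected) < n:
                selected.append(by_category[category].pop())
    return selected
```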

2.2. Paraphrase

For the English paraphrase scenario, we sampled 138 examples from the SHROOM training data for the paraphrase generation subtask [5]. Each example consists of a source sentence accompanied by a machine-generated paraphrase hypothesis. The latter were generated by the SHROOM organisers using the PEGASUS model.⁴ In order to increase the chance of hallucination, we prioritised examples with long contexts (minimum 140 tokens) that also include numbers.

¹ https://huggingface.co/datasets/Eloquent/HalluciGen-PG https://huggingface.co/datasets/Eloquent/HalluciGen-Translation
² https://spacy.io
³ https://pypi.org/project/pyinflect/

In the Swedish paraphrase scenario, we used a subset of the SweParaphrase test data [6] and the Swedish part of the Finnish paraphrase corpus [7]. Each example consists of two sentences, together with a label reflecting the degree of their semantic similarity. After keeping only the sentence pairs with the highest degree of semantic similarity (that is, label 5 in the SweParaphrase dataset and label 4 in the Finnish paraphrase dataset), we sampled 139 examples and used two LLMs, Mixtral 7B [8] and GPT-SW3 6.7B [9], to generate a paraphrase hypothesis for the first sentence of each example. For these Swedish paraphrases, we observed cases where the generated paraphrase was in the wrong language, typically English, or in a mix of languages when using Mixtral 7B. To obtain a large enough sample of reasonably good quality for annotation, we therefore chose to (1) translate output in English to Swedish using GPT-3.5 and (2) generate multiple hypotheses, one by each of the LLMs, for some of these examples. In total, 46 of the sources had multiple annotations, while 56 sources only occur once in the entire dataset.

All datasets used for the paraphrase scenario are annotated in two steps. The first step is to decide whether the generated hypothesis is a hallucination of the source, given the definition of the hallucination phenomenon in our task. If it is, we mark the hypothesis as a hallucination (H) and then choose a suitable hallucination type from the list of eleven hallucination categories in the HalluciGen dataset (addition, named-entity, etc.). If the hypothesis is marked as not a hallucination (NH), we construct a hallucination manually, based on one of the hallucination categories above. In the Swedish data, the cases with hypotheses in the wrong language or in a mix of languages were considered to require too much effort to correct manually and were discarded from the final dataset.

Each of the resulting datasets per scenario and language was split into a trial and a test set. For English, 119 examples were selected for the test set and 16 examples for the trial set. The Swedish test set amounted to 119 examples in total (117 from SweParaphrase and 2 from the Finnish paraphrase corpus) and the trial set to 20 examples (19 from SweParaphrase and 1 from the Finnish corpus). The distribution of each dataset over the different hallucination types is presented in Table 1.

3. Baseline Models

3.1. Paraphrase scenario

For the paraphrase scenario we use different models for the generation and detection steps. For generation, we use Mixtral-8x7B-Instruct-v0.1,⁵ the instructed variant of the Mixtral LLM [8], to generate ℎ+/ℎ− hypothesis pairs for the English and Swedish test sets, and gpt-sw3-6.7b-v2 [9]⁶ as an additional baseline for Swedish.

For the detection step we use several models. The first is the Llama-2-7b-chat-hf [10] model from Hugging Face.⁷ This model, although English-centric, has been trained on smaller amounts of data for other languages, including Swedish. We use three prompts aimed at detecting which hypothesis a) is an incorrect paraphrase of the source, b) has a different meaning to the source, or c) is not supported by the source (see Table 9 in Appendix A). The second and third models are multilingual zero-shot Natural Language Inference (NLI) models: bge-m3-zeroshot-v2.0 [11] for both English and Swedish hallucination detection, and scandi-nli-large [12]⁸ as an additional baseline for Swedish. These models classify a text into a number of custom-defined classes; in our case, we choose the default classes “not_entailment” and “entailment” and infer the output label from the predicted scores for both classes. To determine which of the two hypotheses (ℎ1/ℎ2) contains a hallucination, we predict “entailment” and “not_entailment” class scores between the source sentence and each of the hypotheses. We apply the following conditions to infer the final label:
1. If one hypothesis has a higher entailment score whereas the other has a higher non-entailment score, we choose the one with the higher non-entailment score.
2. If both ℎ1 and ℎ2 have a higher entailment score, we choose the one with the lowest entailment score.
3. If both ℎ1 and ℎ2 have a higher non-entailment score, we choose the one with the highest non-entailment score.
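The three conditions can be written as a small decision function. The sketch below is an illustration rather than the authors' implementation: the function name `pick_hallucination`, the label strings "h1"/"h2", and the tie-breaking behaviour (ties fall through to ℎ2, and a hypothesis whose two scores are equal is treated as not entailed) are assumptions, since the text does not specify them.

```python
from typing import Dict


def pick_hallucination(
    scores_h1: Dict[str, float], scores_h2: Dict[str, float]
) -> str:
    """Return "h1" or "h2", whichever is judged to contain the hallucination.

    Each argument holds the NLI scores of one hypothesis against the source,
    under the keys "entailment" and "not_entailment".
    """
    e1, n1 = scores_h1["entailment"], scores_h1["not_entailment"]
    e2, n2 = scores_h2["entailment"], scores_h2["not_entailment"]
    h1_entailed = e1 > n1
    h2_entailed = e2 > n2

    if h1_entailed != h2_entailed:
        # Condition 1: exactly one hypothesis is predicted as not entailed.
        return "h2" if h1_entailed else "h1"
    if h1_entailed:
        # Condition 2: both entailed, so pick the lowest entailment score.
        return "h1" if e1 < e2 else "h2"
    # Condition 3: neither entailed, so pick the highest non-entailment score.
    return "h1" if n1 > n2 else "h2"
```

Because every example contains exactly one hallucinated hypothesis, the function always returns a label, even when the NLI model considers both hypotheses entailed.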

For both NLI models, the default configurations were used for each pair (source + ℎ1 / source + ℎ2). The models were used out of the box, as available on Hugging Face, and without any additional fine-tuning.

3.2. Machine translation scenario

For the translation scenario we again use the Llama2 [10] 7B-chat model from Hugging Face. We use this model as the baseline for the generation and detection steps. As stated in the previous section, while Llama2 is an English-centric model, it has been trained on (relatively) small amounts of data from other languages (including French and German) and is therefore able to perform cross-lingual tasks such as translation. Crucially, in addition to producing accurate translations, it can also be prompted to produce incorrect translations in a zero-shot setting, something that we could not get MT-specific LLMs such as Tower [13] to do, perhaps because they have been optimised to output accurate translations. We note that there are many stronger LLMs, and our aim is not to provide an unbeatable baseline in the first year of HalluciGen.

For the generation step we split the problem into two parts, using separate prompts to produce the good and incorrect translations. For the good translation we simply prompt the model to translate from the source language to the target language. For the incorrect translation we use two different strategies: we prompt the model to a) produce an incorrect translation, and b) produce an incorrect translation, given a list of possible phenomena that the incorrect translation could target. For the detection step, similar to the paraphrase detection step, we have three different prompts aimed at detecting which hypothesis a) is an incorrect translation of the source, b) has a different meaning to the source, or c) is not supported by the source. We use the same prompts from the detection step in the cross-model evaluation step. For details of the exact prompts used, see Table 10 in Appendix A. Note that we experimented with explicitly including the term “hallucination” as part of the prompt instructions, but this was unsuccessful.

⁵ https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
⁶ https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2
⁷ https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
⁸ https://huggingface.co/alexandrainst/scandi-nli-large

We used the default Llama2-7B-chat model parameters, unless otherwise stated. For generation (translation only) we want to encourage creativity, so we set temperature=0.9, top_k=10, num_return_sequences=1, and max_length=200. For detection (for both translation and paraphrase) we want to encourage deterministic behaviour, so we set temperature=0.1 and top_k=1; as our prompts are longer than for the generation step, we set max_length=400 (to allow for longer inputs).
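For reference, the reported decoding settings can be collected in one place. The sketch below only restates the values given above; the dictionary and function names are illustrative. Note that in the Hugging Face transformers library, temperature and top_k only influence decoding when sampling is enabled (do_sample=True), and the text does not say whether this flag was set, so it is left out here. With top_k=1, sampling always selects the most probable token, which is what makes the detection setting effectively deterministic.

```python
# Decoding settings as reported in the text; names are illustrative only.
GENERATION_KWARGS = {
    "temperature": 0.9,        # encourage varied translations
    "top_k": 10,
    "num_return_sequences": 1,
    "max_length": 200,
}

DETECTION_KWARGS = {
    "temperature": 0.1,        # encourage deterministic answers
    "top_k": 1,                # only the most probable token can be picked
    "max_length": 400,         # detection prompts are longer
}


def decoding_kwargs(step: str) -> dict:
    """Return a copy of the reported settings for 'generation' or 'detection'."""
    settings = {"generation": GENERATION_KWARGS, "detection": DETECTION_KWARGS}
    if step not in settings:
        raise ValueError(f"unknown step: {step!r}")
    return dict(settings[step])
```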

In addition to Llama2, we again employed bge-m3-zeroshot-v2.0 [11] to create detection step baselines for all language pairs and directions. This uses the same model and process detailed in Section 3.1 (Paraphrase scenario). Although the model performs better on English input, it is still suitable for multilingual tasks. While it was recommended to translate input sentences into English rather than having them in multiple languages (as a way to improve performance), no additional translation was performed on either the source sentences or the hypothesis pair (ℎ1/ℎ2); this means that the NLI model receives two sentences in two different languages as input (one in English, and one in either French or German) in both directions.

4. Participant Submissions

In total, we received outputs from 10 systems submitted by 3 different groups, which included varying numbers of participants. Table 2 provides an overview of the submitted systems. Participant group 1 (Bui et al.) [14] submitted systems for all steps and all languages for both the paraphrase and translation scenarios. They applied zero-shot prompting for a range of pre-trained LLMs, and ensembled combinations of these models to produce majority voting systems. Participant group 3 (Siino & Tinnirello) [15] submitted systems for the detection step of the paraphrase scenario only. They used Mistral-7B-Instruct-v0.2 with few-shot prompting, providing the complete set of examples (either English or Swedish depending on the language in focus) from the trial data set as part of the prompt. Participant group 2 (Abburi) submitted systems for the detection step for both the paraphrase and translation scenarios. Unfortunately, as they did not submit a paper to CLEF 2024, we know little about their system other than that it uses majority voting across multiple fine-tuned LLMs.

5. Evaluation Methodology

5.1. Detection Step

For the detection step, the submitted systems are evaluated with respect to the human-annotated labels, using the following metrics: accuracy, precision, recall and F1 score. We use F1 as the primary metric for comparison between different systems. Examples were classified as incorrect in cases where the evaluated system produced no label or a label outside the allowed categories (ℎ1/ℎ2).
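A minimal sketch of this scoring is given below. The text does not state how precision, recall and F1 are averaged over the two labels, so macro-averaging over ℎ1 and ℎ2 is an assumption here, as are the label strings "h1"/"h2" and the function name. Missing or invalid predictions never match the gold label, so they count as errors, in line with the rule above.

```python
from typing import Dict, List, Optional

LABELS = ("h1", "h2")


def detection_metrics(
    gold: List[str], pred: List[Optional[str]]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1 over h1/h2.

    A prediction of None, or any string outside LABELS, is an error.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    per_label = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_label.append((precision, recall, f1))
    k = len(LABELS)
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0,
        "precision": sum(x[0] for x in per_label) / k,
        "recall": sum(x[1] for x in per_label) / k,
        "f1": sum(x[2] for x in per_label) / k,
    }
```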

5.1.1. Generation Step

We use the NLI task as a proxy for evaluating the quality of the correct and hallucinated hypotheses ℎ+, ℎ− generated by the participant models. More specifically, the NLI model bge-m3-zeroshot-v2.0 [11], which also serves as a baseline for the detection step, is now used to predict “entailment” vs “not_entailment” scores. The rationale behind this is as follows: one way to determine whether or not a system is able to create appropriate pairs of hypotheses is to measure the textual entailment between each hypothesis in the pair and the source sentence. We assume that a successful paraphrase of a sentence textually entails the source sentence, whereas a hallucination does not. If ℎ+ is predicted as having higher “entailment”, it is assigned a score of 1, otherwise 0, and if ℎ− is predicted as having higher “not_entailment”, it is assigned a score of 1, otherwise 0. To validate the use of the NLI model for evaluating the model outputs for the generation step, we test the NLI model bge-m3-zeroshot-v2.0 as a baseline for the detection step in both scenarios. These are the scores highlighted in grey in Tables 3 and 7. We observe that the NLI model competes with (or even surpasses) the participant models on the detection task. This allows us to use it for evaluating the model outputs for the generation step.
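The 0/1 scoring rule can be sketched as follows. This is an illustrative reading of the description above, not the organisers' evaluation script; the function name and the dictionary keys are assumptions, and the NLI scores are taken as given rather than computed.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # keys: "entailment", "not_entailment"


def generation_accuracy(pairs: List[Tuple[Scores, Scores]]) -> Tuple[float, float]:
    """Return (h+ accuracy, h- accuracy) over a list of examples.

    Each element of pairs holds the NLI scores of h+ and of h-,
    both measured against the source sentence.
    """
    if not pairs:
        return 0.0, 0.0
    # h+ scores 1 when the NLI model predicts higher "entailment".
    plus_hits = sum(p["entailment"] > p["not_entailment"] for p, _ in pairs)
    # h- scores 1 when the NLI model predicts higher "not_entailment".
    minus_hits = sum(m["not_entailment"] > m["entailment"] for _, m in pairs)
    return plus_hits / len(pairs), minus_hits / len(pairs)
```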

5.1.2. Cross-model evaluation

For the cross-model evaluation, the system performance is measured with respect to the output of the generator model, using the same metrics as in the detection step. In addition, the Matthews correlation coefficient (MCC) and Cohen’s kappa are used to measure the agreement between the different evaluators.
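Both agreement measures can be computed directly from two evaluators' label sequences. The sketch below uses the standard definitions for the binary case; the function names, the label strings "h1"/"h2", and the choice to return 0.0 when a measure is undefined (zero denominator) are assumptions for illustration.

```python
import math
from typing import List


def matthews_corrcoef(a: List[str], b: List[str], positive: str = "h1") -> float:
    """Matthews correlation coefficient between two binary label sequences."""
    tp = sum(x == positive and y == positive for x, y in zip(a, b))
    tn = sum(x != positive and y != positive for x, y in zip(a, b))
    fp = sum(x != positive and y == positive for x, y in zip(a, b))
    fn = sum(x == positive and y != positive for x, y in zip(a, b))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Undefined when any marginal is zero; 0.0 is returned by convention here.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum(
        (a.count(label) / n) * (b.count(label) / n) for label in set(a) | set(b)
    )
    # Undefined when chance agreement is 1; 0.0 is returned by convention here.
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0
```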

6. Results

6.1. Paraphrase Scenario

Tables 3-5 present the results of the participant models and the baselines for the three steps of the paraphrase scenario. Starting from the detection step, we observe that the NLI baseline baseline-bge-m3-zeroshot-v2.0 exhibits very strong performance. The difference with the participant models is even more noticeable for the Swedish dataset, where the best-performing participant model, gpt-4-turbo, lies over 10 points behind the NLI baseline in terms of F1 score. This is almost expected, since none of the participant models has been (intentionally) trained on Swedish data. For the English paraphrase, gpt-4-turbo and the Majority vote (Abburi) models perform on the same level as the baseline on the task of hallucination detection.

Table 3: Detection step results for the paraphrase scenario. Results for the NLI model bge-m3-zeroshot-v2.0 (highlighted in grey) are included for the purpose of validating the NLI model as an evaluation method for the generation step. LLM systems: gemma-7b-it, gemma-7b-it v1, gpt-3.5-turbo, gpt-3.5-turbo v1, gpt-4-turbo, Meta-Llama-3-8B-Instruct, Meta-Llama-3-8B, Majority vote A (Bui et al.), Majority vote (Abburi), Mistral-7B-Instruct-v0.2.

For the generation step, gpt-3.5-turbo produces overall the best-quality positive and negative hypotheses in both English and Swedish, according to the NLI model. A notably larger difference between the scores of that model is observed in English than in Swedish. In addition, gemma-7b-it v1 stands out for generating ℎ− hypotheses of considerably better quality than its ℎ+ hypotheses, according to the NLI model.

Table 4: Generation results for the paraphrase scenario. ℎ+ and ℎ− refer to the accuracy of the NLI model on predicting that ℎ+ is entailed and ℎ− is not entailed, respectively. LLM systems: gemma-7b-it v1, gemma-7b-it v2, gpt-3.5-turbo, Meta-Llama-3-8B-Instruct, baseline-mixtral-8x7b-instruct.

From the results of the cross-model evaluation in Table 5 we observe that the Majority vote A (Bui et al.) exhibits the best overall performance in detecting hallucinations in machine-generated hypotheses in English and Swedish, with respect to both the generator output and the other evaluator models.


6.2. Machine translation Scenario

For the generation step, we observe that the performance of the Llama-3-8B-Instruct and gpt-3.5-turbo participant systems is generally good: the average “entailment” scores for ℎ+ and “not_entailment” scores for ℎ− suggest that the models are generally consistent in their ability to generate hypotheses that are entailed by the reference (ℎ+) and that contradict the reference (ℎ−). The two Llama-2-7b-chat baselines and, to a lesser degree, the gemma-7b-it participant system exhibit stronger performance for the generation of ℎ+ examples than ℎ− examples. In particular, the Llama-2-7b-chat baselines outperform the participant systems for the task of generating ℎ+ examples. We conjecture that this may be a result of using separate prompts to generate ℎ+ and ℎ−; by focusing the prompt for generating ℎ+ examples on generating a “good” translation of the source we may focus the model on the translation task, for which it was likely fine-tuned. Conversely, the baseline performance for generating ℎ− examples is very low, but confidence in the ability of LLMs to perform this task is buoyed by the performance of the participant systems. Note that these results are based on automatic metrics; for a complete evaluation we propose that the generated output be verified by human annotators, which we leave to future work.

For the detection step, all participant systems outperformed the Llama-2-7b-chat baselines (one model; three different prompts). The stronger bge-m3-zeroshot-v2.0 baseline is outperformed by a number of participant systems for all language pairs. Overall, gpt-4 prompt1 is the strongest-performing participant system, with the highest F1 score for three out of four language pairs. The majority voting strategies of Bui et al. [14] and Abburi also perform strongly.

For the cross-model evaluation step, from which we exclude the baselines, we find that the majority voting strategy of Bui et al. [ 14 ] works well, with strong F1 performance on detection based on the examples generated by the models in the generation step, and also has the highest agreement (measured using Cohen’s Kappa) with the other evaluator models.

7. Conclusion and Future Work

In the HalluciGen task we explored the use of LLMs in generating and detecting hallucinations in paraphrase and translation tasks. We find that performance of the participant and baseline systems is highly variable, but results from this year’s lab are promising and will provide a solid foundation for future iterations of the task. We highlight that all three steps (generation, detection, and cross-model evaluation) have been evaluated automatically, and therefore caution the reader against drawing any conclusions regarding which models, prompts, or methods may be “best” based solely on the results in this paper. In the case of the generation step in particular, human validation of the generated output is ideally necessary to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.

Acknowledgments

This lab has been partially supported by the Swedish Research Council (grant number 2022-02909) and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10039436 (Utter)].

A. Task 2 - Baseline System Prompts

The prompts used for the paraphrase and translation baseline LLM systems are provided in Tables 9 and 10 respectively.

References

[1] Manakul, Liusie, M. J. F. Gales, SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint: 2303.08896 (2023).

[2] Saunders, Yeh, Wu, Bills, Ouyang, Ward, Leike, Self-critiquing models for assisting human evaluators, arXiv preprint: 2206.05802 (2022).

[3] Amrhein, Moghe, L. Guillou, ACES: Translation accuracy challenge sets for evaluating machine translation metrics, in: Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, 2022, pp. 479-513. URL: https://aclanthology.org/2022.wmt-1.44.

[4] Yang, Zhang, Tar, J. Baldridge, PAWS-X: A cross-lingual adversarial dataset for paraphrase identification, in: K. Inui, Jiang, Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3687-3692. URL: https://aclanthology.org/D19-1382. doi:10.18653/v1/D19-1382.

[5] Mickus, E. Zosa, Vázquez, Vahtola, Tiedemann, Segonne, Raganato, Apidianaki, SemEval-2024 shared task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes, arXiv preprint arXiv:2403.07726 (2024).

[6] Berdicevskis, G. Bouma, Kurtz, Morger, Öhman, Adesam, Borin, Dannélls, Forsberg, Isbister, Lindahl, Malmsten, Rekathati, Sahlgren, Volodina, Börjeson, Hengchen, Tahmasebi, Superlim: A Swedish language understanding evaluation benchmark, Association for Computational Linguistics, Singapore, 2023, pp. 8137-8153.

[7] Kanerva, Ginter, L.-H. Chang, Rastas, Skantsi, Kilpeläinen, H.-M. Kupari, J. Saarni, M. Sevón, O. Tarkka, Finnish paraphrase corpus, in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online), 2021, pp. 288-298.

[8] A. Q. Jiang, Sablayrolles, Roux, Mensch, Savary, Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, Bressand, et al., Mixtral of experts, arXiv preprint: 2401.04088 (2024).

[9] Ekgren, A. Cuba Gyllensten, Stollenwerk, Öhman, Isbister, Gogoulou, Carlsson, Casademont, Sahlgren, GPT-SW3: An autoregressive language model for the Scandinavian languages, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 2024.

[10] Touvron, Martin, Stone, Albert, Almahairi, Babaei, Bashlykov, Batra, Bhargava, Bhosale, Bikel, Blecher, C. C. Ferrer, Chen, Cucurull, Esiobu, Fernandes, Fu, Fuller, Gao, Goswami, Goyal, Hartshorn, Hosseini, Hou, Inan, Kardas, Kerkez, Khabsa, I. Kloumann, Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.

[11] Laurer, W. van Atteveldt, Casas, Welbers, Building Efficient Universal Classifiers with Natural Language Inference, 2023. URL: http://arxiv.org/abs/2312.17543. doi:10.48550/arXiv.2312.17543. arXiv:2312.17543 [cs].

[12] D. S. Nielsen, ScandiNLI: Natural language inference for the Scandinavian languages, https://github.com/alexandrainst/ScandiNLI, 2022.

[13] D. M. Alves, J. Pombal, N. M. Guerreiro, P. H. Martins, J. Alves, A. Farajian, B. Peters, R. Rei, P. Fernandes, S. Agrawal, P. Colombo, J. G. C. de Souza, A. F. T. Martins, Tower: An open multilingual large language model for translation-related tasks, 2024. arXiv:2402.17733.

[14] A. T. Bui, S. F. Brech, Hußfeldt, Jennert, Ullrich, Breuer, Nikzad, Schaer, The two sides of the coin: Hallucination generation and detection with evaluators for LLMs, in: G. Faggioli, Ferro, Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.

[15] Siino, I. Tinnirello, GPT hallucination detection through prompt engineering, in: G. Faggioli, Ferro, Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.