1. Introduction

The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Anh Thu Maria Bui

Saskia Felizitas Brech

Natalie Hußfeldt

Tobias Jennert

Melanie Ullrich

Timo Breuer

Narjes Nikzad Khasmakhi

Philipp Schaer

TH Köln - University of Applied Sciences, Cologne, Germany

Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.

Keywords: Hallucination Generation, Hallucination Detection, LLMs as Evaluators, Llama 3, Gemma, GPT-4, GPT-3.5 Turbo, Ensemble majority voting

2. Methodology

This section is divided into two parts: the hallucination generation task and the hallucination detection task. Before delving into the details of our methodology, it is important to note that, prior to receiving the dataset from the organizers, we familiarized ourselves with the overall task by applying three models, Falcon [ 3 ], MPT [ 4 ], and Llama 2 [ 5 ], to the hallucination detection task on the SHROOM dataset [ 6 ]. Since the results from these three models were unsatisfactory, we excluded them from our implementation for the ELOQUENT lab.

The models we applied to the ELOQUENT dataset in the generation and detection tasks were:
• Meta-Llama/Meta-Llama-3-8B-Instruct [ 7, 8 ]
• GPT-3.5 Turbo [ 9 ]
• GPT-4 [ 10 ]
• Google/GEMMA-7B-IT [ 11, 12 ]

We leveraged a combination of open-source and closed-source models. This allowed us to compare the quality of outputs across different models. Additionally, using open-source models helped us keep costs down: we first experimented with various prompts for the tasks using the open-source LLMs to identify the most effective ones, and then applied these optimized prompts to the closed-source GPT models. We also tried to improve the effectiveness of our prompting by using the guidance framework [ 13 ].

2.1. Hallucination generation task

The task of hallucination generation is divided into two scenarios: machine translation and paraphrasing. The goal of the generation step is to take a source sentence and generate two LLM hypotheses: one that is a correct translation/paraphrase of the source and one that is a hallucinated translation/paraphrase of the source.

Figure 2 gives an overview of our approach for the generation task. For this task, we used ‘GPT-3.5 Turbo’, ‘GEMMA-7B-IT’, and ‘Llama 3’.

2.2. Hallucination detection task

The hallucination detection task is to present the LLM with a source sentence and two hypotheses (hyp1 and hyp2) and to determine which hypothesis is a hallucination and which is factually accurate. Our approach involved using four different LLMs, ‘GPT-3.5 Turbo’, ‘Google/GEMMA-7B-IT’, ‘Llama 3’, and ‘GPT-4’, as classifiers. Additionally, we employed a voting approach as a simple ensemble learning technique [ 14 ] to combine the outputs of these four models.

Furthermore, we experimented with four distinct prompting techniques to provide better guidance to the LLMs and enhance their ability to discriminate between factual and hallucinated information.
• Type1: Simple Prompt: Using labels ‘hallucinated’ or ‘not hallucinated’.
• Type2: Complex Prompt with 0/1 Labels: Specifying the task with labels 0 or 1.
• Type3: Prompt with Definition and Examples: Including a definition of hallucination alongside examples labeled 0 and 1.
• Type4: Prompt with Full Task Description: Describing the entire translation task (for instance) and hallucination detection goal.
Combined Prompt: Combining all the above elements.

3. Implementation

This part primarily focuses on how we prompted LLMs, along with the challenges and observations we encountered during the task. We divide this section into three parts: generation, detection, and cross-evaluation tasks.

3.1. Generation task

The generation task includes test sets for both paraphrasing and translation tasks.

3.1.1. Paraphrasing Generation Task

The paraphrasing generation task involved datasets in English and Swedish, comprising 118 samples for English and 76 samples for Swedish.

The performance of different models, including ‘Gemma’, ‘GPT-3.5 Turbo’, and ‘Llama 3’, was evaluated based on their ability to generate paraphrases for the English and Swedish datasets. In the appendix in Section A, Figures 4 to 7 show comprehensive lists of all prompts used for the different models. Our observations regarding the implementation of the generation task are as follows:
• The performance of the ‘Gemma’ model varied significantly with the complexity of the prompts used. Simpler prompts yielded better results, which highlights the importance of prompt design. Despite this, the model struggled to understand specific instructions, such as ‘generate hallucination’. Additionally, the generation speed was notably slow.
• For ‘GPT-3.5 Turbo’, one prompt for English and one prompt for Swedish were employed. The generation speed of ‘GPT-3.5 Turbo’ was significantly faster than that of the other models.
• For ‘Llama 3’, a single prompt was used for both English and Swedish datasets. The model was exceedingly slow in generating Swedish responses: after seven hours, it had produced only five outputs.

3.1.2. Translation Generation Tasks

In the appendix in Section A, Figures 8 to 10 give a comprehensive list of all prompts used for the different models. The details of our implementation and observations for the translation generation task are as follows:

In our experimentation with ‘Llama 3’, we opted not to use the ‘guidance’ framework because it performed poorly. ‘Llama 3’ showed promising results for each language pair. We experimented with two different prompts, as shown in Figure 10, and observed instances where ‘Llama 3’ successfully generated hypotheses in the desired target language but struggled with the source language. Examples illustrating this phenomenon can be found in Table 19.

For ‘GPT-3.5 Turbo’, various prompts were tested, and the one that was chosen, as shown in Figure 9, produced the largest number of translations automatically. However, ‘GPT-3.5 Turbo’ still struggled to directly create translations (hyp- and hyp+) for all sources. The main issue was the variation in quotation marks, which caused problems during the extraction process. As a result, we had to prompt some sentences individually (instead of looping over them as a group) so that GPT produced the expected structure again. The following list gives the number of samples that had to be handled individually for each language pair.

• German to English: 12 sources, of which 3 needed to be translated manually.
• English to German: 10 sources, of which 2 needed to be translated manually.
• French to English: 19 sources, of which 3 needed to be translated manually.
• English to French: 64 sources, of which 0 needed to be translated manually.
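The extraction problem caused by varying quotation marks can be illustrated with a minimal sketch. It assumes the model answers in the `"hyp+": "", "hyp-": ""` format requested by the prompts in the appendix; the function name `extract_hypotheses`, the set of quote variants in `QUOTE_MAP`, and the regular expression are hypothetical and not taken from our pipeline.

```python
import re
from typing import Dict, Optional

# Map typographic quote variants to plain ASCII quotes.
# Which variants the models actually produced is an assumption here.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"',  # curly and low double quotes
    "\u00ab": '"', "\u00bb": '"',                 # guillemets
    "\u2018": "'", "\u2019": "'",                 # curly single quotes
})

# After normalization, match  "hyp+": "..."  and  "hyp-": "..."  pairs.
PAIR_RE = re.compile(r"[\"']?(hyp[+-])[\"']?\s*:\s*\"(.*?)\"\s*(?=,|\}|$)", re.S)


def extract_hypotheses(raw: str) -> Optional[Dict[str, str]]:
    """Return {'hyp+': ..., 'hyp-': ...} or None if the structure is not found."""
    text = raw.translate(QUOTE_MAP)
    found = {key: value.strip() for key, value in PAIR_RE.findall(text)}
    if {"hyp+", "hyp-"} <= found.keys():
        return found
    return None  # caller can fall back to prompting this source individually
```

A `None` result corresponds to the cases above where a sentence had to be prompted on its own rather than in the loop.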

Translating from English was a smoother process for the Gemma model compared to translating to English.

3.2. Detection task

The detection task involves trial and test sets for both scenarios.

3.2.1. Paraphrasing Detection Task

Table 1 shows the number of samples for each trial and test set for the paraphrasing detection task. The trial dataset for the paraphrasing detection task is structured as follows:
• id: Unique identifier of the example.
• source: Original model input for paraphrase generation.
• hyp1: First alternative paraphrase of the source.
• hyp2: Second alternative paraphrase of the source.
• label: hyp1 or hyp2, based on which of those has been annotated as hallucination.
• type: Hallucination category assigned. Possible values include:
– addition
– named-entity
– number
– conversion
– date
– tense
– negation
– gender
– pronoun
– antonym
– natural

We compared the performance of different models on the trial dataset using distinct prompts. Some prompts used for the paraphrasing detection task on the trial dataset are presented in Figure 11. Additionally, Tables 20 to 32 report the performance of various prompts on the trial dataset for both English and Swedish.

A challenge with Gemma was its tendency to generate code within responses. We therefore requested a specific ‘JSON’ format to ensure that the output could be retrieved reliably. Figure 12 shows an example of output generated by Gemma. Figures 13 to 17 display the prompts employed in the paraphrasing detection task across the various models for the test set.
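As a rough illustration of why a fixed JSON answer format makes the output retrievable, the sketch below pulls the label out of a response that may contain extra text or code. It assumes the answer format `{"label": ""}` shown in the appendix listings; the helper name `parse_label` and the tolerance for single-quoted dictionaries are assumptions, not a description of our actual parsing code.

```python
import json
import re
from typing import Optional

# Any brace-delimited chunk without nested braces is a candidate answer.
CANDIDATE_RE = re.compile(r"\{[^{}]*\}")


def parse_label(response: str) -> Optional[str]:
    """Return 'hyp1' or 'hyp2' if a {"label": ...} object is found, else None."""
    for candidate in CANDIDATE_RE.findall(response):
        try:
            # Accept Python-style dicts such as {'label': 'hyp1'} as well.
            obj = json.loads(candidate.replace("'", '"'))
        except json.JSONDecodeError:
            continue
        label = str(obj.get("label", "")).strip().lower()
        if label in ("hyp1", "hyp2"):
            return label
    return None  # no usable prediction in this response
```

Responses for which this returns `None` would count as samples without a prediction.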

3.2.2. Translation Detection Task

The following details are provided about the translation detection dataset.

• Both trial and test datasets include data for four language pairs as follows:
– de-en: Source language: German, Target language: English
– en-de: Source language: English, Target language: German
– fr-en: Source language: French, Target language: English
– en-fr: Source language: English, Target language: French
• The trial dataset included 10 data entries, with 5 entries featuring hallucination as hyp1 and the other 5 as hyp2. The structure of the trial dataset is illustrated below:
– id: Unique identifier of the example.
– langpair: Language of source and hypotheses pair.
– source: Source text.
– hyp1: First alternative translation of the source.
– hyp2: Second alternative translation of the source.
– type: Hallucination category assigned. Possible values include:
∗ addition
∗ named-entity
∗ number
∗ conversion
∗ date
∗ tense
∗ negation
∗ gender
∗ pronoun
∗ antonym
∗ natural
– label: hyp1 or hyp2, based on which of those has been annotated as hallucination.
• In the test collection, there are 100 data samples for each language pair.

The structure of the test dataset is presented as follows:
– id: Unique identifier of the example.
– langpair: Language of source and hypotheses pair.
– source: Source text.
– hyp1: First alternative translation of the source.
– hyp2: Second alternative translation of the source.
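The record layout described above can be sketched as a small data class. The field names follow the dataset description; the Python types (for example `id` as a string), the class name `DetectionExample`, and the helper `label_balance` are assumptions made for illustration. In the test set, `type` and `label` are absent, so they are optional here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DetectionExample:
    id: str                      # unique identifier (type assumed)
    langpair: str                # e.g. "de-en", "en-de", "fr-en", "en-fr"
    source: str                  # source text
    hyp1: str                    # first alternative translation
    hyp2: str                    # second alternative translation
    type: Optional[str] = None   # hallucination category, trial set only
    label: Optional[str] = None  # "hyp1" or "hyp2", trial set only


def label_balance(examples: Iterable[DetectionExample]) -> Counter:
    """Count how often each hypothesis is annotated as the hallucination."""
    return Counter(e.label for e in examples if e.label is not None)
```

Applied to the trial set described above, `label_balance` would be expected to report five examples for each label.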

Our implementation and observations of the translation detection task are delineated below, categorized according to each model.

Observations for Llama 3

In total, we experimented with 15 different prompts for the ‘Llama 3’ model. Among these, the prompt shown in Figure 18(a) yielded the most favorable results. Table 33 reports the results achieved with this prompt on the trial dataset. We therefore chose it for the final detection task.

The main observations for Llama 3 are:
• ‘Llama 3’ was not able to assign a label to every data entry (the support is only 4 each for hyp1 and hyp2). Figure 18 shows the prompts used with ‘Llama 3’ on the test set.
• When detecting the hallucination, ‘Llama 3’ gives explanations, such as: ‘I chose hyp1 as the hallucination because it contains a date (December 5) that is not present in the source text. The source text only mentions the date August 5, but hyp1 provides a different date.’ The first row in Table 34 shows this issue.
• As both examples in Table 34 indicate, ‘Llama 3’ exhibits gender bias. In the first example, it failed to recognize that the feminine noun ‘Wirtschaftsprüferin’ denotes a female auditor and labeled it as gender-neutral. It made a gender assumption in hyp1, which assumes a male auditor. Similarly, in the second example, ‘Llama 3’ struggled to understand the clear indication of a female secretary given by the word ‘Sekretärin’.
• ‘Llama 3’ struggles with understanding and converting measurements, and it could not recognize when values in different units are essentially the same. For instance, it sees ‘kilometers’ and thinks it is different from ‘metres’, which leads to mistakenly identifying that text as a hallucination. Additionally, ‘Llama 3’ assumes that hyp2 is the hallucination because it contains ‘kilometers’ instead of ‘km’, and it fails to consider that hyp1 also uses ‘metres’ instead of ‘km’. Table 35 highlights this issue.
• ‘Llama 3’ struggles to recognize the different ways dates can be written. As shown in Table 36, it could not understand that ‘21. Januar’ and ‘Jan. 21st’ refer to the same date.
• In the end, we noticed that the prompt strongly influences the output of ‘Llama 3’. Depending on the prompt, ‘Llama 3’ either was or was not able to detect the gender, the conversion, or the correct date. For example, Table 37 shows that, with the prompt shown in Figure 18(b), ‘Llama 3’ correctly explains that ‘Wirtschaftsprüferin’ refers to a female auditor in the first example, but then it mistakenly swaps hyp1 and hyp2. Additionally, as shown in the second row of this table, the new prompt allows ‘Llama 3’ to detect the correct gender indicated in the source text. However, ‘Llama 3’ still fails to assign the correct label. In the third row, we can see that ‘Llama 3’ correctly converts 65 km to 65,000 meters and identifies the hallucination in hyp2. Additionally, ‘Llama 3’ correctly identifies the wrong date in the last example. The primary issue with this prompt is that ‘Llama 3’ frequently fails to identify any hallucination in certain data samples.

Observations for GPT-3.5 Turbo and GPT-4

The approach used for ‘GPT-3.5 Turbo’ was replicated for ‘GPT-4’ to directly assess comprehension. Various prompts were tested, and two were selected based on the best results from previous trials.

The main observations for GPT-3.5 Turbo and GPT-4 are:
• There were some samples where no hallucinations were detected. Table 38 displays the count of failed examples for ‘GPT-4’ and ‘GPT-3.5 Turbo’ in the translation detection task. Additionally, Table 39 lists some samples for which ‘GPT-4’ failed to assign labels, with our explanation for each one.
• With the prompts used, both GPT models seem to have trouble with misinterpretations or slightly inaccurate translations. Additionally, both struggle to identify incorrect pronouns.
• Initially, during the phase with incorrect trial datasets, we observed that ‘GPT-3.5 Turbo’ had difficulty recognizing hallucinations when names were slightly misspelled or had an extra letter appended.

Observations for Gemma

We tried various prompts; Gemma performed better (80% Accuracy) at detecting the correct label when it was first asked to translate the hypothesis into the language of the source and then to detect hallucinations. Figure 19 shows the prompts used with Gemma on the test set.

The main observations for Gemma are:
• The performance was significantly worse when the prompts were too scientific or contained too many technical terms.
• Tricky samples for Gemma in the detection task include detecting the gender in comparison to the source (female/male), and identifying when numbers are incorrect, such as missing zeros.

Observations for ensemble voting approach

We opted for a straightforward voting approach to ensemble model predictions due to the limitations imposed by the small sample size of the trial set. This method ensured all models contributed equally.

Since we compared an even number of models, there were instances where two models voted for hyp1 and the other two voted for hyp2. In these cases, we randomly selected the label.
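The voting rule with random tie-breaking can be sketched as follows. This is a minimal illustration, not our exact implementation: the function name `majority_vote` is hypothetical, and ignoring missing predictions (`None`) before counting is an assumption about how unlabeled samples are treated.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[Optional[str]], rng: random.Random) -> Optional[str]:
    """Combine per-model labels ('hyp1'/'hyp2'); ties are broken at random."""
    counts = Counter(v for v in votes if v in ("hyp1", "hyp2"))
    if not counts:
        return None  # no model produced a usable label
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    # With four voters a 2:2 split is possible; pick one label at random.
    return tied[0] if len(tied) == 1 else rng.choice(tied)


# Example: three of four models agree on hyp1.
print(majority_vote(["hyp1", "hyp1", "hyp2", "hyp1"], random.Random(0)))
```

Passing a seeded `random.Random` keeps the tie-breaking reproducible across runs.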

3.3. Cross-evaluation task

The following provides detailed information regarding the cross-evaluation task. Table 2 presents information regarding the samples included in the paraphrasing task. The prompt shown in Figure 20 was used for the English paraphrasing task.

In the translation task, sometimes none of the models detected any hallucinations in either hypothesis, which resulted in some blank spaces in the CSV file due to the lack of predictions. There were instances where no hallucinations were present because both hypotheses, hyp1 and hyp2, were the same.

4. Results

This part presents the results in detail for each task and scenario. It is worth noting that prior to showing the results from LLMs, Logistic Regression and Random Forest classifiers were used for an initial evaluation to establish a baseline performance for comparison with LLMs. Both LR and RF classifiers achieved similar performance with an F1-score of 0.5.

For the evaluation of the generation task, the lab employed a zero-shot text classification natural language inference (NLI) model (‘bge-m3-zeroshot-v2.0’) to predict whether ‘hyp+’ is entailed by the source sentence and whether ‘hyp-’ contradicts the source sentence. They used only two labels: ‘entailment’ and ‘not_entailment’. This approach helps assess whether the systems can produce coherent hyp+/hyp- pairs. It is important to note that the performance of the classification model is not perfect, but it demonstrated reasonable performance on the detection test set across various languages and language pairs [ 2 ].

For evaluating both detection and cross-model tasks, the lab reported key metrics such as Accuracy, F1-score, Precision, and Recall for each model. Additionally, several baseline models were evaluated by the lab. For cross-model assessment, the lab also employed two metrics: Matthews Correlation Coefficient (MCC) and Cohen’s Kappa.

The average MCC ($\overline{\mathrm{MCC}}$) measures the quality of binary classifications by considering true and false positives and negatives, while the standard deviation of MCC ($\sigma_{\mathrm{MCC}}$) provides insight into the consistency of the model’s performance. Similarly, the average Kappa ($\bar{\kappa}$) measures inter-rater reliability for categorical items, and the standard deviation of Kappa ($\sigma_{\kappa}$) indicates the variability or consistency of the Kappa metric [ 15 ].
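For readers who want to see how these quantities are computed, the following is a small self-contained sketch of MCC and Cohen’s Kappa for the two-label case, together with their mean and standard deviation over several comparisons. It uses the standard textbook definitions; how the lab paired the label sequences, and whether it used the population or the sample standard deviation, is not specified here, so `pstdev` and the function names are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def mcc(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "hyp1") -> float:
    """Matthews Correlation Coefficient for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


def summarize(scores: Sequence[float]) -> tuple:
    """Mean and (population) standard deviation of a list of scores."""
    return mean(scores), pstdev(scores)
```

For example, comparing `["hyp1", "hyp2", "hyp1", "hyp2"]` with `["hyp1", "hyp2", "hyp2", "hyp2"]` gives an MCC of about 0.577 and a Kappa of 0.5.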

Tables 3 to 5 demonstrate the evaluation of detection, generation, and cross-model evaluation for English paraphrasing tasks.

The performance of detection across various models on the English paraphrasing task is presented in Table 3. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v2’ achieved the highest performance with Accuracy, F1-score, Precision, and Recall scores of 0.91.

Table 4 presents the results for the generation step. The model ‘GPT-3.5 Turbo’ with prompt ‘En_Para_Gen_GPT3.5’ achieved the highest performance in hyp+ entailment mean (0.964) and hyp+ correct label mean (0.983). Furthermore, the model ‘Llama 3’ with prompt ‘En_Para_Gen_Llama3’ showed strong performance in hyp- not entailment mean (0.978) and hyp- correct label mean (0.983).

Table 5 presents the results for the cross model. The model ‘GPT-4’ with prompt ‘final_gpt4_en_v2_cross_model_detection’ showed Accuracy, F1-score, Precision, and Recall scores of 0.93. In the next stage, the majority model with prompt ‘majority_vote_cross_model_result_en’ demonstrated impressive performance.

Table 6 shows that the model with prompt ‘majority_vote_cross_model_result_en’ achieved the highest performance with an average MCC of 0.83 and average Kappa of 0.81.

Table 7 presents the performance metrics of various models in the detection step for the Swedish paraphrasing task. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT4)’ achieved an Accuracy of 0.81, with consistent scores across all metrics (F1 = 0.81, Precision = 0.81, Recall = 0.81). Additionally, the baseline ‘baseline-bge-m3-zeroshot-v2.0/sv_bge-m3-zeroshot-v2.0’ shows the highest Accuracy across all models, 0.92.

Table 8 summarizes the results of models in the generation step for Swedish paraphrasing, where the focus is on metrics related to hypothesis entailment and not_entailment. The model ‘GPT-3.5 Turbo’ with prompt ‘Se_Para_Gen_GPT3.5’ demonstrated strong performance, with a hyp+ entailment mean of 0.88, a hyp+ correct label mean of 0.90, a hyp- contradiction mean of 0.91, and a hyp- correct label mean of 0.93.

Tables 9 and 10 present cross-model evaluation results for the Swedish paraphrasing task, highlighting model performance across different evaluation criteria. The majority voting model with prompt ‘majority_vote_cross_model_result_se’ showed competitive performance. Table 10 provides statistical measures for the models excluding baselines, which indicate that the majority voting technique is consistent and reliable.

Tables 11 to 14 report the performance of English-French translation detection, generation, and cross-model evaluation.

Table 11 highlights several key points regarding the performance of the detection task. The model ‘GPT-4’ with prompt ‘results_gpt4_en_fr’ achieved the highest performance, with Accuracy, F1-score, and Recall of 0.90, and Precision of 0.91. Additionally, the majority voting model with prompt ‘majority_vote_result_en_fr’ also performed well, with Accuracy, F1-score, and Recall of 0.83 and Precision of 0.86.

One conclusion that can be drawn from Table 12 is that the baseline model ‘baseline-generalprompt/en-fr.gen’ performed better on the hyp+ side, with a hyp+ entailment mean of 0.90 and a hyp+ correct label mean of 0.93, while it performed poorly on the hyp- side, with a hyp- contradiction mean of 0.10 and a hyp- correct label mean of 0.08. In contrast, the model ‘GPT-3.5 Turbo’ with prompt ‘results_gpt_en_fr’ demonstrated high performance, with a hyp- contradiction mean of 0.88 and a hyp- correct label mean of 0.91.

Tables 13 and 14 show that the majority voting approach with prompt ‘majority_vote_result_en_fr’ reached an Accuracy of 0.79, an F1-score of 0.78, a Precision of 0.80, and a Recall of 0.79. This combination exhibited the highest average MCC (0.66) and average Kappa (0.65).

Tables 15 to 18 report the results of the English-German translation detection, generation, and cross-model evaluation.

The important observation from Table 15 is that the model ‘GPT-4’ with prompt ‘results_gpt4_en_de’ showed the highest performance, with Accuracy, F1-score, and Recall all at 0.86 and Precision at 0.89.

From Table 16 we can see that the model ‘GPT-3.5 Turbo’ with prompt ‘results_gpt_en_de’ exhibited better performance, with a hyp- contradiction mean of 0.83 and a hyp- correct label mean of 0.84. ‘Gemma’ with prompt ‘En_De_Trans_Gen_gamma’ showed the best hyp+ correct label mean of 0.85. Additionally, ‘baseline-phenomena-mentions-prompt/en-de.gen’ provides a better hyp+ entailment mean of 0.84.

Tables 17 and 18 show that the model ‘GPT-3.5 Turbo’ with ‘results_gpt_en_de’ had the highest Accuracy of 0.76, F1-score of 0.75, Precision of 0.77, and Recall of 0.76. The prompt ‘majority_vote_result_en_de’ for majority voting had the highest average MCC of 0.60 and average Kappa of 0.58, which indicates strong inter-model agreement and consistency.

5. Conclusion

In conclusion, this study leveraged several LLMs to investigate both the generation and detection of hallucinations by LLMs themselves. The four distinct models employed presented their own unique evaluation challenges. We explored various prompting techniques, including few-shot learning and chain of thought, by using the guidance framework. Additionally, for the detection task, we tested an ensemble voting approach to combine the results from different LLMs. Although we achieved better results than the baseline models in this study, our findings indicate that while some issues can be addressed through effective prompting, others remain difficult to mitigate solely by prompt engineering. Moreover, identifying the optimal prompt itself poses a significant challenge.

[Figures: MCC for the cross-model runs. English paraphrasing: final_gemma_en_v1_cross_model, final_gpt35_en_v2_cross_model_detection, final_gpt4_en_v2_cross_model_detection, final_lama3_cross_model_en_v1, majority_vote_cross_model_result_en. Swedish paraphrasing: final_gemma_se_v1_cross_model, final_gpt35_se_v2_cross_model_detection, final_gpt4_se_v2_cross_model_detection, final_lama3_cross_model_se_v1, majority_vote_cross_model_result_se. English-French: majority_vote_result_en_fr, results_gemma_en_fr_final, results_gpt4_en_fr, results_gpt_en_fr, results_llama3_en_fr_final. English-German: majority_vote_result_en_de, results_gemma_en_de_final, results_gpt4_en_de, results_gpt_en_de, results_llama3_en_de_final.]

A. Appendix

source: All nouns, alongside the word Sie for you, always begin with a capital letter, even in the middle of a sentence.
hyp+: All nouns, alongside the word Sie for you, always begin with a capital letter, even in the middle of a sentence, except for those that are part of a title or a proper noun.
hyp-: All nouns, alongside the word sie for you, always begin with a capital letter, even in the middle of a sentence.

source: The final line of the third verse was changed during the reign of Alexander I of Yugoslavia in “Kralja Aleksandra, Bože hrani”.
hyp+: The final line of the third verse was modified during the reign of Alexander I of Yugoslavia in ’Kralja Aleksandra, Bože hrani’.
hyp-: The final line of the third verse was rewritten during the reign of Alexander the Great in ’Kralja Aleksandra, Bože hrani’.


Listing 7: Se_Para_Gen_Gemma_v2

user_prompt = f'''
You are a text generator and your task is to generate two translation hypothesis given the 'src' below.
The first translation labelled as 'hyp+' should be supported by 'src' and the second translation labelled as 'hyp-' should not be supported by 'src'.
Provide the result in the following format: "hyp+": "", "hyp-": "". Target language: "English"


Listing 11: En_De_Trans_Gen_llama3_v2, other language pairs contain an example in the associated language

Listing 12: Swedish Prompt 1.

answer_format = {"label": ""}
user_prompt = f'''
<start_of_turn>user
You are a researcher investigating a new phenomenon. You have gathered data ({source}) and formulated two competing hypotheses (hyp1: {hyp1}, and hyp2: {hyp2}) to explain it.
Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {answer_format}.

Listing 13: English Prompt 2.

answer_format = {"label": ""}
user_prompt = f'''
<start_of_turn>user
Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.
Provide the result in the following format: {answer_format}.

Src: {source}
hyp1: {hyp1}
hyp2: {hyp2}

Listing 14: English Prompt 1 and Swedish Prompt 2.


result = {'label': 'hyp1'}
src = "The population has declined in some 210 of the 280 municipalities in Sweden, mainly in inland central and northern Sweden."
"In the majority of Sweden’s 280 municipalities, the population has gone up."
"In the majority of Sweden’s 280 municipalities, the population has gone down."

Listing 15: En_Para_Gen_Gemma_v1

answer_format = {"label": ""}
user_prompt = f'''
<start_of_turn>user
Givet en ”src” och två hypoteser ”hyp1” och ”hyp2” är din uppgift att upptäcka vilken av de två hypoteserna (”label”) som inte stöds av källan.
Ge resultatet i följande format: {answer_format}.

Listing 16: En_Para_Det_Gemma_v1

Listing 17: Se_Para_Det_Gemma_v1

Listing 18: En_Se_Para_Det_Gemma_v2

user_prompt = f'''
Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.
Provide the result in the following format: {answer_format}.

Listing 19: En_Se_Para_Det_GPT3.5_GPT4_v1

Listing 20: En_Se_Para_Det_GPT3.5_GPT4_v2

'''Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.
Provide the result in the following format: {answer_format}.

'''Givet en ”src” och två hypoteser ”hyp1” och ”hyp2” är din uppgift att upptäcka vilken av de två hypoteserna (”label”) som inte stöds av källan.
Ge resultatet i följande format: {answer_format}.
Resultat: '''

Listing 21: En_Para_Det_Llama3_v1

Listing 22: Se_Para_Det_Llama3_v1

''' You are a researcher investigating a new phenomenon.
You have gathered data (source) and formulated two competing hypotheses (hyp1 and hyp2) to explain it.

Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {answer_format}.

source: {source}
hyp1: {hyp1}
hyp2: {hyp2}

Result:'''

Listing 23: En_Se_Para_Det_Llama3_v2

Listing 24: En_De_Trans_Det_llama3_v1, other language pairs contain an example in the associated language

user_prompt = f'''
You are a researcher investigating a new phenomenon. You have gathered data (source) and formulated two competing hypotheses (Hyp1 and Hyp2) to explain it. Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {"label": ""}.

Src: {source}
hyp1 : {hyp1}
hyp2 : {hyp2}

Result:'''

Listing 25: En_De_Trans_Det_llama3_v2, En_Fr_Trans_Det_llama3_v2

The lawyer told the male auditor to provide documentation.

hyp1: The center of the earthquake was about 20 km northeast of Dillon and about 65000 metres south of Butte.
hyp2: The center of the earthquake was about 20 km northeast of Dillon and about 65000 kilometres south of Butte.
label: hyp2
Llama 3’s explanation: I chose hyp2 as the hallucination because it adds the unit "kilometres" to the distance, which is not present in the original text. The original text uses "km" and "Meilen", which are equivalent units, but hyp2 introduces a new unit "kilometres". This is a type of hallucination where a new information is added to the original text, specifically a unit conversion.’

source: "Perry erklärte, er werde „nach Texas zurückkehren, um das Ergebnis der Vorwahl von heute Abend auszuwerten und zu entscheiden, ob es in diesem Rennen einen weiterführenden Weg für mich gibt"", aber sagte später, er werde im Rennen bleiben und bei den Vorwahlen in South Carolina am 21. Januar antreten."
hyp1: """Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on March 21."
hyp2: """Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on Jan. 21."
Llama3’s explanation: I chose hyp2 as the hallucination because it contains a specific date ("Jan. 21"), which is not present in the original source text. The original text only mentions the date "21. Januar", which is a more general and less specific date. This suggests that hyp2 contains a hallucinated detail. The type of hallucination is time, as it involves the introduction of a specific and precise time element.

Listing 26: En_De_Trans_Det_gemma_v1, En_Fr_Trans_Det_gemma_v1

Result: <start_of_turn>model '''

Listing 27: De_En_Trans_Det_gemma_v2, En_De_Trans_Det_gemma_v2, En_Fr_Trans_Det_gemma_v2, Fr_En_Trans_Det_gemma_v2

src: Der Anwalt forderte die Wirtschaftsprüferin auf, Unterlagen vorzulegen.

hyp1: The lawyer told the male auditor to provide documentation.

hyp2: The lawyer told the female auditor to provide documentation.

Llama3’s explanation: The source states "Der Anwalt forderte die Wirtschaftsprüferin auf, Unterlagen vorzulegen.", which translates to "The lawyer demanded the female auditor to provide documentation." This implies that the auditor is a woman, contradicting Hyp2 which states that the auditor is a man. result_text 'label': 'hyp2'

src: Der Arbeiter grub ein Loch für die Sekretärin.

hyp1: The laborer dug a hole for the female secretary.

hyp2: The laborer dug a hole for the male secretary.

Llama3’s explanation: The source states "Der Arbeiter grub ein Loch für die Sekretärin", which means "The laborer dug a hole for the secretary". This clearly indicates that the secretary is a female, as "Sekretärin" is the German word for "female secretary". Therefore, Hyp2, which suggests that the laborer dug a hole for the male secretary, contradicts the information provided in the source. result_text 'label': 'hyp2'

src: Das Zentrum des Erdbebens befand sich etwa 20 km (15 Meilen) nordnordöstlich von Dillon und etwa 65 km (40 Meilen) südlich von Butte.

hyp1: The center of the earthquake was about 20 km northeast of Dillon and about 65000 metres south of Butte.

hyp2: The center of the earthquake was about 20 km northeast of Dillon and about 65000 kilometres south of Butte.

Llama3’s explanation: Explanation: The source states that the distance from the earthquake’s center to Butte is approximately 40 miles (65 km), not 65,000 kilometres. Hyp2 contradicts this information by stating a much larger distance, while Hyp1 correctly translates the original distance from miles to meters. result_text 'label': 'hyp2'

src: "Perry erklärte, er werde „nach Texas zurückkehren, um das Ergebnis der Vorwahl von heute Abend auszuwerten und zu entscheiden, ob es in diesem Rennen einen weiterführenden Weg für mich gibt"", aber sagte später, er werde im Rennen bleiben und bei den Vorwahlen in South Carolina am 21. Januar antreten."

hyp1: """Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on March 21."

hyp2: """Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on Jan. 21."

Llama3’s explanation: Explanation: The source states that the event in question is on "heute Abend" (tonight), which corresponds to January 21. However, Hyp1 states that the event is on "March 21", which contradicts the information provided in the source. result_text 'label': 'hyp1'

de-en en-de en-fr fr-en 2 8 2 7

GPT-4 prompt1 prompt2

GPT-3.5 prompt1 prompt2 1 15 7 3 0 1 0 1 0 1 1 2

'''Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.

Provide the result in the following format: {answer_format}.
hyp1: {hyp1}
hyp2: {hyp2}'''

Listing 28: majority_vote_cross_model_result_en
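The ensemble step combines the labels returned by the individual models through majority voting. The following is a minimal sketch of such a vote, not the implementation used in our experiments. It assumes that each model returns "hyp1", "hyp2", or no label at all, and that ties are left unresolved unless a fallback is supplied. The function name `majority_vote`, the `tie_break` parameter, and the example votes are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

# Labels a detector may return; anything else counts as "no label".
VALID_LABELS = ("hyp1", "hyp2")


def majority_vote(
    labels: Sequence[Optional[str]],
    tie_break: Optional[str] = None,
) -> Optional[str]:
    """Return the most frequent valid label.

    Falls back to `tie_break` when no model produced a valid label
    or when the two labels received the same number of votes.
    (Tie handling is an assumption made for this sketch.)
    """
    counts = Counter(label for label in labels if label in VALID_LABELS)
    if not counts:
        return tie_break
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Hypothetical per-model outputs for one sample; None marks a missing label.
votes = {"llama3": "hyp2", "gemma": "hyp2", "gpt-3.5": "hyp1", "gpt-4": None}
print(majority_vote(list(votes.values())))
```

With four voters a 2:2 split is possible, so any practical use of this sketch needs an explicit tie rule, for example falling back to the label of one designated model.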

Die Mittel könnte man für hochwassersichere Häuser, eine bessere Wasserverwaltung und Nutzpflanzendiversifizierung verwenden.

Es zeigt 362 verschiedene alte Holzarten, Büsche und 236 verschiedene Obstbaumarten.

It shows 362 different old species of wood, bushes and 236 different species of fruit trees.

de-en de-en en-de en-de en-de en-de en-de en-de

The world has over 5,000 different languages, more than twenty with 50 million or more speakers.

1i Productions is an American board game publisher. It was founded in 2004 by Colin Byrne, William and Jenna.

Mats Wilander defeats Anders Järryd, 6 – 4, 3 – 6, 7 - 5.

They have feet with scales and claws, they lay eggs, and they walk on their two back legs like a T-Rex.

The NSA has its own internal data format that tracks both ends of a communication, and if it says, this communication came from America, they can tell Congress how many of those communications they have today, right now.

Different interpretations of flood-proof

wrong tense

filler

[1] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, J. Gao, Large language models: A survey, CoRR abs/2402.06196 (2024). URL: https://doi.org/10.48550/arXiv.2402.06196. doi:10.48550/ARXIV.2402.06196. arXiv:2402.06196.

[2] J. Karlgren, L. Dürlich, E. Gogoulou, L. Guillou, J. Nivre, M. Sahlgren, A. Talman, Eloquent clef shared tasks for evaluation of generative language model quality, in: Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part, Springer-Verlag, Berlin, Heidelberg, 2024, p. 459-465. URL: https://doi.org/10.1007/978-3-031-56069-9_63. doi:10.1007/978-3-031-56069-9_63.

[3] tiiuae, Falcon-11B, https://huggingface.co/tiiuae/falcon-11B, Accessed on 2024-05-23.

[4] M. N. Team, Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL: www.mosaicml.com/blog/mpt-7b, accessed: 2023-05-05.

[5] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).

[6] T. Mickus, E. Zosa, R. Vázquez, T. Vahtola, J. Tiedemann, V. Segonne, A. Raganato, M. Apidianaki, Semeval-2024 shared task 6: Shroom, a shared-task on hallucinations and related observable overgeneration mistakes, arXiv preprint arXiv:2403.07726 (2024).

[7] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

[8] Hugging Face, Meta-Llama-3-8B-Instruct, https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct, Accessed on 2024-05-23.

[9] OpenAI, Gpt-3.5 turbo, n.d. URL: https://platform.openai.com/docs/models/gpt-3-5-turbo.

[10] OpenAI, Model endpoint compatibility, n.d. URL: https://platform.openai.com/docs/models/model-endpoint-compatibility.

[11] T. M. Gemma Team, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, et al., Gemma (2024). URL: https://www.kaggle.com/m/3301. doi:10.34740/KAGGLE/M/3301.

[12] Google, gemma-7b, https://huggingface.co/google/gemma-7b, Accessed on 2024-05-23.

[13] Microsoft, guidance-ai/guidance: A guidance language for controlling generative models, https://github.com/guidance-ai/guidance, 2023.

[14] T. G. Dietterich, et al., Ensemble learning, The handbook of brain theory and neural networks 2 (2002) 110-125.

[15] D. Chicco, M. J. Warrens, G. Jurman, The Matthews correlation coefficient (MCC) is more informative than Cohen's kappa and Brier score in binary classification assessment, IEEE Access 9 (2021) 78368-78381.

Table 39: Samples for which GPT-4 failed to assign labels in the translation detection task.