=Paper= {{Paper |id=Vol-3740/paper-71 |storemode=property |title=The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs |pdfUrl=https://ceur-ws.org/Vol-3740/paper-71.pdf |volume=Vol-3740 |authors=Anh Thu Bui,Saskia Felizitas Brech,Natalie Hußfeldt,Tobias Jennert,Melanie Ullrich,Timo Breuer,Narjes Nikzad Khasmakhi,Philipp Schaer |dblpUrl=https://dblp.org/rec/conf/clef/BuiBHJU0NS24 }} ==The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs== https://ceur-ws.org/Vol-3740/paper-71.pdf
                         The Two Sides of the Coin: Hallucination Generation and
                         Detection with LLMs as Evaluators for LLMs
                         Notebook for the ELOQUENT Lab at CLEF 2024

                         Anh Thu Maria Bui1,† , Saskia Felizitas Brech1,† , Natalie Hußfeldt1,† , Tobias Jennert1,† ,
                         Melanie Ullrich1,† , Timo Breuer1,† , Narjes Nikzad Khasmakhi1,*,† and Philipp Schaer1,†
                         1
                             TH Köln – University of Applied Sciences, Cologne, Germany


                                        Abstract
                                        Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work
                                        presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators
                                        for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3,
                                        Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate
                                        all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of
                                        these LLMs in handling hallucination generation and detection tasks.

                                        Keywords
                                        Hallucination Generation, Hallucination Detection, LLMs as Evaluators, Llama 3, Gemma, GPT-4, GPT-3.5 Turbo,
                                        Ensemble majority voting




                         1. Introduction
                         The prevalence of large language models advancements and groundbreaking results for many NLP
                         research problems [1], tremendously changed how we generally approach everyday tasks but also how
                         we approach larger more complex problems. The LLMs’ ability to combine vast amounts of knowledge
                         from different sources is unparalleled and for some tasks exceeds what humans are able to achieve.
                            However, blind faith in the generated outputs of these models is critical as they may produce incorrect
                         facts, also known as hallucinations. These false facts can be misleading and are one of the main barriers
                         to using LLMs reliably and in a trustworthy way.
                            To this end, this work is part of our participation in the ELOQUENT Lab 2024 at CLEF. More
                         specifically, we participate in the HalluciGen task that evaluates if the LLMs themselves are able to
                         correctly detect hallucinations in both human- and machine-generated contexts [2]. The HalluciGen
                         task is divided into two phases over two years. The first phase is dedicated to the builder task, while
                         the second phase will focus on the breaker task. This study specifically targets the first phase where the
                         goal is to create multilingual and monolingual hallucination-aware models. These models are designed
                         to generate and detect ‘hallucinated content’ in two scenarios: machine translation and paraphrase
                         generation. Figure 1 illustrates an overview of hallucination generation and detection tasks as described
                         by the lab.
                            The remainder of this work is structured as follows. Section 2 describes our methodology in more
                         detail. Section 3 details the implementation. Section 4 describes our results. Finally, Section 5 concludes
                         our contributions.



                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         *
                           Corresponding author.
                         †
                           These authors contributed equally.
                         $ anh_thu_maria.bui@smail.th-koeln.de (A. T. M. Bui); saskia_felizitas.brech@smail.th-koeln.de (S. F. Brech);
                         natalie.hussfeldt@smail.th-koeln.de (N. Hußfeldt); tobias.jennert1@smail.th-koeln.de (T. Jennert);
                         melanie.ullrich@smail.th-koeln.de (M. Ullrich); timo.breuer@th-koeln.de (T. Breuer); narjes.nikzad_khasmakhi@th-koeln.de
                         (N. Nikzad Khasmakhi); philipp.schaer@th-koeln.de (P. Schaer)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
Figure 1: An overview of hallucination generation and detection tasks.


2. Methodology
This section is divided into two parts: generation and hallucination detection tasks. Before delving
into the details of our methodology, it is important to note that prior to receiving the dataset from
the organizers, we began familiarizing ourselves with the overall task by applying the three models
Falcon [3], MPT [4], and Llama 2 [5] to the hallucination detection task on the SHROOM dataset [6]. Since
the results from these three techniques were unsatisfactory, we excluded them from our implementation
for Eloquent Lab.
   The models we applied to the Eloquent’s dataset in generation and detection tasks were:

    • Meta-Llama/Meta-Llama-3-8B-Instruct [7, 8]
    • GPT-3.5 Turbo [9]
    • GPT-4 [10]
    • Google/GEMMA-7B-IT [11, 12]

   We leveraged a combination of open-source and closed-source models. This allows us to evaluate the
quality of outputs across different models. Additionally, utilizing open-source models helped us optimize
costs. Therefore, we initially experimented with various prompts for the tasks using open-source LLMs
to identify the most effective ones. Then, we applied these optimized prompts to closed-source GPT
models. Additionally, we did our best to enhance our prompting effectiveness using the guidance
framework [13].

2.1. Hallucination generation task
The task of hallucination generation is divided into two scenarios: machine translation and paraphrasing.
The goal of the generation step is to take a source sentence and generate two LLM hypotheses: one that
is a correct translation/paraphrase of the source and one that is a hallucinated translation/paraphrase
of the source.
   Figure 2 indicates the overview of our approach for the generation task. To conduct this task, we
took advantage of ‘GPT-3.5 Turbo’, ‘GEMMA-7B-IT’, and ‘Llama 3’.

2.2. Hallucination detection task
The hallucination detection task is to present the LLM with a source sentence and two hypotheses
(hyp1 and hyp2) and to determine which hypothesis is a hallucination and which is factually accurate.
Our approach involved using four different LLMs, ‘GPT-3.5 Turbo’, ‘Google/GEMMA-7B-IT’, ‘Llama
3’, and ‘GPT-4’, as classifiers. Additionally, we employed a voting approach as a simple technique of
ensemble learning [14] to combine the outputs of these four models.
Figure 2: An overview of our approach for the generation task.




             (a) LLMs as a classifier.                           (b) Using simple voting approach.

Figure 3: An overview of our approach for the detection task.


   Furthermore, we experimented with four distinct prompting techniques to provide better guidance
to the LLMs and enhance their ability to discriminate between factual and hallucinated information.

    • Type1: Simple Prompt: Using labels ‘hallucinated’ or ‘not hallucinated’.
    • Type2: Complex Prompt with 0/1 Labels: Specifying the task with labels 0 or 1.
    • Type3: Prompt with Definition and Examples: Including a definition of hallucination alongside
      examples labeled 0 and 1.
    • Type4: Prompt with Full Task Description: Describing the entire translation task (for instance) and
      hallucination detection goal. Combined Prompt: Combining all the above elements.



3. Implementation
This part primarily focuses on how we prompted LLMs, along with the challenges and observations
we encountered during the task. We divide this section into three parts: generation, detection, and
cross-evaluation tasks.

3.1. Generation task
The generation task includes test sets for both paraphrasing and translation tasks.
3.1.1. Paraphrasing Generation Task
The paraphrasing generation task involved datasets in English and Swedish, comprising 118 samples
for English and 76 samples for Swedish.
  The performance of different models, including ‘Gemma’, ‘GPT-3.5 Turbo’, and ‘Llama 3’, was
evaluated based on their ability to generate paraphrases for English and Swedish datasets. In the
appendix in Section A, Figures 4 to 7 show comprehensive lists of all prompts used for the different
models. The following demonstrates some of our observations regarding the implementation of the
generation task:

    • The performance of the ‘Gemma’ model varied significantly based on the complexity of the
      prompts used. Simpler prompts yielded better results that highlight the importance of prompt
      design. Despite this, the model struggled with understanding specific instructions, such as
      ‘generate hallucination’. Additionally, the generation speed was notably slow.
    • For ‘GPT-3.5 Turbo’, one prompt for English and one prompt for Swedish were employed. The
      generation speed of ‘GPT-3.5 Turbo’ was significantly faster compared to other models.
    • For ‘Llama 3’, a single prompt was used for both English and Swedish datasets. The speed of
      the model in generating Swedish responses was exceedingly slow. After seven hours, it only
      produced five outputs.

3.1.2. Translation Generation Tasks
In the appendix in Section A, in Figures 8 to 10, you will find a comprehensive list of all prompts used for
the different models. The details of our implementation and observation of the translation generation
task are as follows:
   In our experimentation with ‘Llama 3’, we opted not to use the ‘guidance’ framework because of its
ineffective performance. ‘Llama 3’ showed promising results for each language pair. We experimented
with two different prompts, as shown in Figure 10 and observed instances where ‘Llama 3’ successfully
generated hypotheses in the desired target language but struggled with the source language. Examples
illustrating this phenomenon can be found in the Table 19.
   Various prompts were tested, and the one that was chosen, as shown in Figure 9, showed effectiveness
in generating the most automatic translations. However, ‘GPT-3.5 Turbo’ still struggled to instantly
create translations (hyp- and hyp+) for all sources. The main issue was the variations in quotation marks
which caused problems during the extraction process. As a result, we had to prompt some sentences
individually (instead of being able to loop them as a group) so that the structure was recognized by
GPT again. List 3.1.2 shows the number of samples had been done individually.

    • German to English: with 12 sources where 3 sources needed to be translated manually.
    • English to German: with 10 sources where 2 sources needed to be translated manually.
    • French to English: with 19 sources where 3 sources needed to be translated manually.
    • English to French: with 64 sources where 0 sources needed to be translated manually.

  Translating from English was a smoother process for the Gemma model compared to translating to
English.

3.2. Detection task
The detection task involves trial and test sets for both scenarios.

3.2.1. Paraphrasing Detection Task
Table 1 shows the number of samples for each trial and test set for the paraphrasing detection task. The
trial dataset for the paraphrasing detection task is structured as follows:
Table 1
Paraphrasing detection dataset details.
                        Dataset                  Type                      Count
                        Eloquent/HalluciGen-PG   trial_detection_english   15
                        Eloquent/HalluciGen-PG   trial_detection_swedish   19
                        Eloquent/HalluciGen-PG   test_detection_english    118
                        Eloquent/HalluciGen-PG   test_detection_swedish    118


    • id: Unique identifier of the example.
    • source: Original model input for paraphrase generation.
    • hyp1: First alternative paraphrase of the source.
    • hyp2: Second alternative paraphrase of the source.
    • label: hyp1 or hyp2, based on which of those has been annotated as hallucination.
    • type: Hallucination category assigned. Possible values include:
         – addition
         – named-entity
         – number
         – conversion
         – date
         – tense
         – negation
         – gender
         – pronoun
         – antonym
         – natural

   We compared the performance of different models on the trial dataset using distinct prompts. Some
prompts used for the paraphrasing detection task on the trial dataset are presented in Figure 11.
Additionally, Tables 20 to 32 illustrate the performance of various prompts on the trial dataset for both
English and Swedish.
   A challenge with Gemma was its tendency to generate code within responses. We implemented a
specific ‘JSON’ format to ensure retrievable output. Figure 12 indicates the example of generated output
from Gemma. Figures 13 to 17 display the prompts employed in the paraphrasing detection task across
various models for the test set.

3.2.2. Translation Detection Task
The following details are provided about the translation detection dataset.
    • Both trial and test datasets include data for four language pairs as follows:
         – de-en: Source language: German, Target language: English
         – en-de: Source language: English, Target language: German
         – fr-en: Source language: French, Target language: English
         – en-fr: Source language: English, Target language: French
    • The trial dataset included 10 data entries, with 5 entries featuring hallucination as hyp1 and the
      other 5 as hyp2. The structure of the trial dataset is illustrated below:
         – id: Unique identifier of the example.
         – langpair: Language of source and hypotheses pair
         – source: Source Text
         – hyp1: First alternative translation of the source.
         – hyp2 Second alternative translation of the source.
         – type: Hallucination category assigned. Possible values include:
             ∗ addition
             ∗ named-entity
             ∗ number
             ∗ conversion
             ∗ date
             ∗ tense
             ∗ negation
             ∗ gender
             ∗ pronoun
             ∗ antonym
             ∗ natural
         – label hyp1 or hyp2, based on which of those has been annotated as hallucination
    • In the test collection, there are 100 data samples for each language pair.
      The structure of the test dataset is presented as follows:
         – id: Unique identifier of the example.
         – langpair: Language of source and hypotheses pair
         – source: Source Text
         – hyp1: First alternative translation of the source.
         – hyp2 Second alternative translation of the source.

   Our implementation and observations of the translation detection task are delineated below, catego-
rized according to each model.

Observations for Llama 3 Ultimately, we experimented with 15 different prompts for the ‘Llama 3’
model. Among these, the prompt, as shown in Figure 18(a) yielded the most favorable results. Table 33
demonstrates the achieved results by using this prompt on the trial dataset. So, we opted for it for the
final detection task.
   The main observations for Llama 3 are:

    • ‘Llama 3’ is not able to detect a label for every data entry (support is only 4 for each, hyp1 and
      hyp2). Figure 18 demonstrates the prompts used by ‘Llama 3’ on the test set.
    • When detecting the hallucination, ‘Llama 3’ gives explanations, such as: ‘I chose hyp1 as the
      hallucination because it contains a date (December 5) that is not present in the source text. The
      source text only mentions the date August 5, but hyp1 provides a different date.’ The first row in
      Table 34 shows this issue.
    • As both examples in Table 34 indicate ‘Llama 3’ exhibits gender bias. In the first example, it failed
      to recognize the feminine noun ‘Wirtschaftsprüferin’ shows a female auditor and labeled it as
      gender-neutral. It made a gender assumption in hyp1 which assumes a male auditor. Similarly, in
      the second example, ‘Llama 3’ struggled to understand the clear indication of a female secretary
      with the word ‘Sekretärin.’
    • ‘Llama 3’ struggles with understanding and converting measurements and it could not recognize
      when different units are essentially the same. For instance, it sees ‘kilometers’ and thinks it
      is different from ‘metres’ which leads to mistakenly identifying that text as a hallucination.
      Additionally, ‘Llama 3’ makes the assumption that hyp2 is the hallucination because it contains
      ‘kilometers’ instead of ‘km’ and it fails to consider the fact that hyp1 also uses ‘metres’ instead of
      ‘km.’ Table 35 highlights this issue.
    • ‘Llama 3’ struggles to recognize the different ways dates can be written. As shown in Table 36, it
      could not understand that ‘21. Januar’ and ‘Jan. 21st’ refer to the same date.
    • In the end, we noticed that the prompt immensely influences the outcome of ‘Llama 3’. When
      using different prompts, ‘Llama 3’ was either able to detect the gender, conversion, or the correct
      date, or it was not. For example, 𝑟𝑜𝑤1 in Table 37 shows that using the prompt shown in Figure
      18(b), ‘Llama 3’ correctly explains that ‘Wirtschaftsprüferin’ refers to a female auditor in the
      first example, but then it mistakenly swaps hyp1 and hyp2. Additionally, as shown in the second
      row of this table, the new prompt allows ‘Llama 3’ to detect the correct gender indicated in the
      source text. However, ‘Llama 3’ still fails to assign the correct label. In the third row, we can see
      that ‘Llama 3’ correctly converts 65 km to 65,000 meters and identifies the hallucination in hyp2.
      Additionally, ‘Llama 3’ correctly identifies the wrong date in the last example. The primary issue
      with this prompt is that ‘Llama 3’ frequently fails to identify any hallucinations in certain data
      samples.

Observations for GPT-3.5 Turbo and GPT-4 The approach used for ‘GPT-3.5 Turbo’ was replicated
for ‘GPT-4’ to directly assess comprehension. Various prompts were tested, and two were selected based
on the best results from previous trials.
   The main observations for GPT-3.5 Turbo and GPT-4 are:
    • There were some samples where no hallucinations were detected. Table 38 displays the count of
      failed examples for ‘GPT-4’ and ‘GPT-3.5 Turbo’ in the translation detection task. Additionally,
      Table 39 lists some samples for which ‘GPT-4’ failed to assign labels with our explanation for
      each one.
    • Regarding the prompts for GPT models, both seem to encounter issues with misinterpretations
      or slightly inaccurate translations. Additionally, both struggle to identify incorrect pronouns.
    • Initially, during the phase with incorrect trial datasets, it was observed that ‘GPT-3.5 Turbo’ had
      difficulty recognizing hallucinations when names were slightly misspelled or had an extra letter
      appended.

Observations for Gemma We tried various prompts, but Gemma showed better (80% Accuracy) in
detecting the correct label when it was first asked to translate the hypothesis into the language of the
source and then detect hallucinations. Figure 19 indicates the prompts used by Gemma on the test set.
  The main observations for Gemma are:
    • The performance was significantly worse when the prompts were too scientific or contained too
      many technical terms.
    • Tricky samples for Gemma in the detection task include detecting the gender in comparison to
      the source (female/male), and identifying when numbers are incorrect, such as missing zeros.

Observations for ensemble voting approach We opted for a straightforward voting approach
to ensemble model predictions due to the limitations imposed by the small sample size of the trial set.
This method ensured all models contributed equally.
   Since we compared an even number of models, there were instances where two models voted for
hyp1 and the other two voted for hyp2. In these cases, we randomly selected the label.

3.3. Cross-evaluation task
The following provides detailed information regarding the cross-evaluation task. Table 2 presents
information regarding the samples included in the paraphrasing task. The prompt showns in Figure 20
has been used for the english paraphrasing task.
   In the translation task, sometimes none of the models detected any hallucinations in either hypothesis,
which resulted in some blank spaces in the CSV file due to the lack of predictions. There were instances
where no hallucinations were present because both hypotheses, hyp1 and hyp2, were the same.
Table 2
Test Dataset for Paraphrasing Detection Task Cross Model Evaluation.
                   Dataset                    Type                                Count
                   Eloquent/HalluciGen-PG     cross_model_evaluation_english      594
                   Eloquent/HalluciGen-PG     cross_model_evaluation_swedish      380


4. Results
This part presents the results in detail for each task and scenario. It is worth noting that prior to
showing the results from LLMs, Logistic Regression and Random Forest classifiers were used for an
initial evaluation to establish a baseline performance for comparison with LLMs. Both LR and RF
classifiers achieved similar performance with an F1-score of 0.5.
   For the evaluation of the generation task, the lab employed a zero-shot text classification Natural
language inference (NLI) model (‘𝑀 𝑜𝑟𝑖𝑡𝑧𝐿𝑎𝑢𝑟𝑒𝑟/𝑏𝑔𝑒 − 𝑚3 − 𝑧𝑒𝑟𝑜𝑠ℎ𝑜𝑡 − 𝑣2.0’) to predict whether
‘hyp+’ is entailed within the source sentence and whether ‘hyp-’ contradicts the source sentence. They
used only two labels: ‘entailment’ and ‘not_entailment’. This approach helps us assess whether the
systems can produce coherent hyp+/hyp- pairs. It is important to note that the performance of the
classification model is not perfect, but it demonstrated reasonable performance on the detection test set
across various languages and language pairs [2].
   For evaluating both detection and cross-model tasks, the lab reported key metrics such as Accuracy,
F1-score, Precision, and Recall for each model. Additionally, several baseline models were evaluated
by the lab. For cross-model assessment, the lab also employed two metrics: Matthews Correlation
Coefficient (MCC) and Cohen’s Kappa.
   The Average MCC (MCC) measures the quality of binary classifications by considering true and
false positives and negatives, while the Standard Deviation of MCC (𝜎𝑀 𝐶𝐶 ) provides insight into the
consistency of the model’s performance. Similarly, the Average Kappa (𝜅  ¯ ) measures inter-rater reliability
for categorical items, and the Standard Deviation of Kappa (𝜎𝜅 ) indicates the variability or consistency
of the Kappa metric [15].
   Tables 3 to 5 demonstrate the evaluation of detection, generation, and cross-model evaluation for
English paraphrasing tasks.
   The performance of detection across various models on the English paraphrasing task is presented in
Table 3. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v2’ achieved the highest
performance with Accuracy, F1-score, Precision, and Recall scores of 0.91.
   Table 4 presents the results for the generation step. The model ‘GPT-3.5 Turbo’ with prompt
‘En_Para_Gen_GPT3.5’ achieved the highest performance in hyp+ entailment mean (0.964) and hyp+
correct label mean (0.983). Furthermore, The model ‘Llama 3’ with prompt ‘En_Para_Gen_Llama3’
showed strong performance in hyp- not entailment mean(0.978) and hyp- correct label mean (0.983).
   Table 5 presents the results for the cross model. The model ‘GPT-4’ with prompt ‘fi-
nal_gpt4_en_v2_cross_model_detection’ showed Accuracy, F1-score, Precision, and Recall scores of 0.93.
In the next stage, the majority model with prompt ‘majority_vote_cross_model_result_en’ demonstrated
impressive performance.
   Table 6 shows that the model with prompt ‘majority_vote_cross_model_result_en’ achieved the
highest performance with an average MCC of 0.83 and average Kappa of 0.81.
   Table 7 presents the performance metrics of various models in the detection step for the Swedish para-
phrasing task. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT4)’ achieved
an Accuracy score of 0.81 and with consistent scores across all metrics (F1 = 0.81, Precision = 0.81, Recall
= 0.81). Additionally, the baseline ‘baseline-bge-m3-zeroshot-v2.0/sv_bge-m3-zeroshot-v2.0’ shows the
highest Accuracy of 0.92 across all models.
   Table 8 summarizes the results of models in the generation step for Swedish paraphrasing where
the focus is on metrics related to hypothesis entailment and not_entailment. The model ‘GPT-3.5
Turbo’ with prompt ‘Se_Para_Gen_GPT3.5’ demonstrated strong performance with high scores in hyp+
entailment mean of 0.88, hyp+ correct label mean of 0.90, hyp- contradiction mean of 0.91, and hyp-
correct label mean of 0.93.
   Tables 9 and 10 present cross-model evaluation results for the Swedish paraphrasing task that
highlights the model performance across different evaluation criteria. The model majority voting with
prompt ‘majority_vote_cross_model_result_se’ showed competitive performance. Table 10 provides
statistical measures for models excluding baselines that indicate the noted majority voting technique
has consistency and reliability.
   Tables 11 to 14 report the performance of English-French translation detection, generation, and
cross-model evaluation.
   Table 11 highlights several key points regarding the performance of the detection task. Model ‘GPT-
4’ with prompt ‘results_gpt4_en_fr’ achieved the highest performance with Accuracy, F1-score, and
Recall of 0.90, and Precision of 0.91. Additionally, we can observe that the majority voting model with
prompt ‘majority_vote_result_en_fr’ also performed well with Accuracy, F1-score, and Recall of 0.83
and Precision of 0.86.
   One of the conclusions can be drawn from Table 12 is that the baseline model ‘baseline-general-
prompt/en-fr.gen’ showed a better performance with hyp+ entailment mean 0.90 and hyp+ correct label
mean of 0.93, while it has a lower performance in hyp- contradiction mean of 0.10 and hyp- correct label
mean of 0.08. Also, it is clear that model ‘GPT-3.5 Turbo’ with prompt ‘results_gpt_en_fr’ demonstrated
a high performance in hyp- contradiction mean of 0.88, and hyp- correct label mean of 0.91.
   From tables 13 and 14 we have the finding that the majority voting approach with prompt ‘ma-
jority_vote_result_en_fr’ reached Accuracy 0.79, F1 score 0.78, Precision 0.80, and Recall 0.79. This
combination exhibited the highest average MCC 0.66 and average Kappa 0.65.
   Tables 15 to 18 report the results for the evaluation of the English-German translation detection,
generation, and cross-model.
   The important observation from the Table 15 is that the model ‘GPT-4’ along with prompt ‘re-
sults_gpt4_en_de’ showed the highest performance with an Accuracy, F1 score, and Recall all at 0.86
and Precision 0.89.
   From Table 16 we can see that the model ‘GPT-3.5 Turbo’ with the mixture of prompt ‘re-
sults_gpt_en_de’ exhibited better performance in hyp- contradiction mean of 0.83, and hyp- correct
label mean of 0.84. ‘Gemma’ with prompt ‘En_De_Trans_Gen_gamma’ showed the best hyp+ correct
label mean of 0.85. Additionally, ‘baseline-phenomena-mentions-prompt/en-de.gen’ provides a better
hyp+ entailment mean of 0.84.
   Tables 17 and 18 provide the insight that the model ‘GPT-3.5 Turbo’ with ‘results_gpt_en_de’ had
the highest Accuracy of 0.76, F1 score of 0.75, Precision of 0.77, and Recall of 0.76. The prompt
‘majority_vote_result_en_de’ for majority voting had the highest average MCC of 0.60 and average
Kappa of 0.58 which indicates strong inter-model agreement and consistency.


5. Conclusion
In conclusion, this study leveraged several LLMs to investigate both the generation and detection of
hallucinations by LLMs themselves. The four distinct models employed presented their own unique
evaluation challenges. We explored various prompt techniques including few-shot learning and chain of
thought by using the guidance framework. Additionally, for the detection task, we tested an ensemble
voting approach to combine the results from different LLMs. Although in this study we could achieve
better results in comparison to the baseline models, our findings indicate that while some issues can be
addressed through effective prompting, others remain difficult to mitigate solely by prompt engineering.
Moreover, identifying the optimal prompt itself poses a significant challenge.
Table 3
Results of the English Detection Step Paraphrasing Task
 Model                                                       Accuracy     F1            Precision      Recall
 En_Se_Para_Det_Llama3_v2                                    0.69         0.69          0.81           0.69
 En_Se_Para_Det_GPT3.5_GPT4_v2                               0.91         0.91          0.91           0.91
 En_Para_Det_Llama3_v1                                       0.80         0.80          0.81           0.80
 En_Para_Det_Gemma_v1                                        0.71         0.71          0.77           0.71
 En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT4)                        0.73         0.73          0.83           0.73
 En_Se_Para_Det_Gemma_v2                                     0.54         0.49          0.73           0.54
 majority_vote_result_en_prompt (Prompts of Version 2)       0.85         0.85          0.86           0.85
 En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT3.5)                      0.68         0.68          0.75           0.68
 Baseline Models
 baseline-bge-m3-zeroshot-v2.0/en_bge-m3-zeroshot-v2.0       0.90         0.90          0.90           0.90
 baseline-llama2-meaning-detection/en.det                    0.45         0.44          0.44           0.45
 baseline-llama2-not-supported-detection/en.det              0.34         0.35          0.39           0.34
 baseline-llama2-paraphrase-detection/en.det                 0.34         0.35          0.37           0.34


Table 4
Results of the English Generation Step for Paraphrasing Task
 Model                               hyp+ entail-         hyp+ correct     hyp- contra-        hyp- correct
                                     ment mean            label mean       diction mean        label mean
 En_Para_Gen_Gemma_v2                0.828                0.857            0.894               0.908
 En_Para_Gen_Gemma_v1                0.782                0.824            0.894               0.899
 En_Para_Gen_GPT3.5                  0.964                0.983            0.797               0.807
 En_Para_Gen_Llama3                  0.843                0.882            0.978               0.983
 Baseline Models
 baseline-mixtral-8x7b-instruct-     0.920                0.924            0.738               0.748
 hallucination-detection/en.gen


Table 5
Results of the English Cross-Model Evaluation for Paraphrasing Task
 Model                                       Accuracy             F1             Precision      Recall
 final_gpt35_en_v2_cross_model_detection     0.88                 0.88           0.89           0.88
 majority_vote_cross_model_result_en         0.92                 0.92           0.93           0.92
 final_lama3_cross_model_en_v1               0.87                 0.87           0.88           0.87
 final_gpt4_en_v2_cross_model_detection      0.93                 0.93           0.93           0.93
 final_gemma_en_v1_cross_model               0.78                 0.77           0.83           0.78
 Baseline Models
 baseline-bge-m3-zeroshot-v2.0/en.det.csv    0.95                 0.95           0.95           0.95


Table 6
Results of the English Cross-Model Evaluation (excluding the baseline models) for Paraphrasing Task

 Model                                       MCC                  𝜎𝑀 𝐶𝐶          𝜅
                                                                                 ¯              𝜎𝜅
 final_gemma_en_v1_cross_model               0.65                 0.03           0.61           0.04
 final_gpt35_en_v2_cross_model_detection     0.77                 0.08           0.77           0.09
 final_gpt4_en_v2_cross_model_detection      0.76                 0.10           0.75           0.12
 final_lama3_cross_model_en_v1               0.75                 0.09           0.74           0.10
 majority_vote_cross_model_result_en         0.83                 0.09           0.81           0.10
Table 7
Results of the Swedish Detection Step for Paraphrasing Task
 Model                                                        Accuracy     F1            Precision      Recall
 En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT4)                         0.81         0.81          0.81           0.81
 Se_Para_Det_Gemma_v1                                         0.59         0.52          0.71           0.59
 majority_vote_result_se (Prompts from Version 1)             0.67         0.66          0.72           0.67
 En_Se_Para_Det_Gemma_v2                                      0.07         0.11          0.47           0.07
 En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT 3.5)                      0.71         0.70          0.76           0.71
 En_Se_Para_Det_GPT3.5_GPT4_v2 (GPT 3.5)                      0.61         0.60          0.65           0.61
 En_Se_Para_Det_Llama3_v2                                     0.57         0.48          0.77           0.57
 Se_Para_Det_Llama3_v1                                        0.60         0.59          0.60           0.60
 Baseline Models
 baseline-bge-m3-zeroshot-v2.0/sv_bge-m3-zeroshot-v2.0        0.92         0.92          0.92           0.92
 baseline-llama2-meaning-detection/sv.det                     0.60         0.60          0.62           0.60
 baseline-llama2-not-supported-detection/sv.det               0.57         0.56          0.70           0.57
 baseline-llama2-paraphrase-detection/sv.det                  0.61         0.59          0.68           0.61
 sv_scandi-nli-large                                          0.92         0.92          0.92           0.92


Table 8
Results of the Swedish Generation Step for Paraphrasing Task
 Model                                  hyp+ entail-       hyp+ correct      hyp- contra-       hyp- correct
                                        ment mean          label mean        diction mean       label mean
 Se_Para_Gen_Gemma_v1                   0.346              0.355             0.931              0.934
 Se_Para_Gen_Gemma_v2                   0.588              0.618             0.710              0.697
 Se_Para_Gen_GPT3.5                     0.881              0.908             0.918              0.934
 Baseline Models
 baseline-gpt-sw3-6.7b-v2-              0.637              0.645             0.502              0.500
 hallucination-detection/sv.gen
 baseline-mixtral-8x7b-instruct-        0.809              0.842             0.386              0.355
 hallucination-detection/sv.gen


Table 9
Results of the Swedish Cross-Model Evaluation for Paraphrasing Task
 Model                                          Accuracy           F1             Precision      Recall
 final_lama3_cross_model_se_v1                  0.71               0.70           0.71           0.71
 final_gpt35_se_v2_cross_model_detection        0.68               0.68           0.69           0.68
 majority_vote_cross_model_result_se            0.76               0.76           0.77           0.76
 final_gpt4_se_v2_cross_model_detection         0.72               0.74           0.77           0.72
 final_gemma_se_v1_cross_model                  0.56               0.48           0.63           0.56
 Baseline Models
 baseline-bge-m3-zeroshot-v2.0/sv.det           0.75               0.75           0.75           0.75


Table 10
Results of the Swedish Cross-Model Evaluation (excluding the baseline models)

 Model                                          MCC                𝜎𝑀 𝐶𝐶          𝜅
                                                                                  ¯              𝜎𝜅
 final_gemma_se_v1_cross_model                  0.26               0.04           0.19           0.04
 final_gpt35_se_v2_cross_model_detection        0.51               0.18           0.48           0.20
 final_gpt4_se_v2_cross_model_detection         0.46               0.16           0.41           0.19
 final_lama3_cross_model_se_v1                  0.52               0.22           0.50           0.24
 majority_vote_cross_model_result_se            0.62               0.19           0.59           0.23
Table 11
Results of the English-French Translation Detection Step
 Model                                           Accuracy      F1              Precision        Recall
 results_gpt4_en_fr_prompt2                      0.79          0.79            0.79             0.79
 En_Fr_Trans_Det_llama3_v2                       0.66          0.65            0.74             0.66
 En_Fr_Trans_Det_gemma_v1                        0.66          0.66            0.67             0.66
 results_gpt_en_fr                               0.74          0.74            0.81             0.74
 En_Fr_Trans_Det_gemma_v2                        0.63          0.60            0.81             0.63
 En_Fr_Trans_Det_llama3_v1                       0.56          0.51            0.79             0.56
 results_gpt_en_fr_prompt2                       0.76          0.76            0.83             0.76
 majority_vote_result_en_fr                      0.83          0.83            0.86             0.83
 results_gpt4_en_fr                              0.90          0.90            0.91             0.90
 Baseline Models
 en_fr_bge-m3-zeroshot-v2.0                      0.82          0.82            0.82             0.82
 baseline-general-detection-prompt_en-fr.det     0.47          0.47            0.52             0.47
 baseline-meaning-detection-prompt_en-fr.det     0.49          0.50            0.50             0.49
 baseline-supported-detection-prompt_en-fr.det   0.40          0.24            0.17             0.40


Table 12
Results of the English-French Translation Generation Step
 Model                               hyp+ entail-       hyp+ correct      hyp- contra-      hyp- correct
                                     ment mean          label mean        diction mean      label mean
 En_Fr_Trans_Gen_llama3_v2           0.80277            0.81              0.82427           0.86
 En_Fr_Trans_Gen_gemma               0.80958            0.8               0.50219           0.49
 results_gpt_en_fr                   0.85611            0.88              0.88556           0.91
 En_Fr_Trans_Gen_llama3_v1           0.75679            0.77              0.78892           0.81
 Baseline Models
 baseline-phenomena-mentions-        0.89575            0.92              0.26324           0.23
 prompt/en-fr.gen
 baseline-general-prompt/en-fr.gen   0.90503            0.93              0.10912           0.08


Table 13
Results of the English-French Translation Cross-Model Evaluation
 Model                                           Accuracy      F1               Precision       Recall
 results_gpt_en_fr                               0.77          0.77             0.79            0.77
 results_gemma_en_fr_final                       0.57          0.57             0.57            0.57
 results_llama3_en_fr_final                      0.68          0.65             0.78            0.68
 majority_vote_result_en_fr                      0.79          0.78             0.80            0.79
 results_gpt4_en_fr                              0.77          0.76             0.79            0.77
 Baseline Models
 baseline-general-detection-prompt/en-fr.cme     0.44          0.45             0.46            0.44
 baseline-meaning-detection-prompt/en-fr.cme     0.47          0.47             0.48            0.47
 baseline-supported-detection-prompt/en-fr.cme   0.48          0.32             0.24            0.48


Table 14
Results of the English-French Translation Cross-Model Evaluation (excluding the baseline models)

 Model                        MCC                𝜎𝑀 𝐶𝐶                𝜅
                                                                      ¯                  𝜎𝜅
 majority_vote_result_en_fr   0.66               0.23                 0.65               0.24
 results_gemma_en_fr_final    0.25               0.04                 0.23               0.05
 results_gpt4_en_fr           0.60               0.26                 0.59               0.27
 results_gpt_en_fr            0.59               0.26                 0.57               0.27
 results_llama3_en_fr_final   0.48               0.16                 0.43               0.16
Table 15
Results of the English-German Translation Detection Step
 Model                                           Accuracy      F1                Precision       Recall
 majority_vote_result_en_de                      0.81          0.81              0.87            0.81
 En_De_Trans_Det_gemma_v1                        0.59          0.59              0.60            0.59
 results_gpt4_en_de_prompt2                      0.79          0.79              0.81            0.79
 En_De_Trans_Det_llama3_v1                       0.54          0.47              0.78            0.54
 En_De_Trans_Det_llama3_v2                       0.70          0.69              0.79            0.70
 En_De_Trans_Det_gemma_v2                        0.58          0.54              0.73            0.58
 results_gpt_en_de_prompt2                       0.80          0.80              0.85            0.80
 results_gpt4_en_de                              0.86          0.86              0.89            0.86
 results_gpt_en_de_prompt2 (duplicate?)          0.68          0.67              0.77            0.68
 Baseline Models
 en_de_bge-m3-zeroshot-v2.0                      0.73          0.73              0.74            0.73
 baseline-general-detection-prompt/en-de.det     0.48          0.48              0.51            0.48
 baseline-meaning-detection-prompt/en-de.det     0.36          0.36              0.36            0.36
 baseline-supported-detection-prompt/en-de.det   0.41          0.25              0.18            0.41


Table 16
Results of the English-German Translation Generation Step
 Model                               hyp+ entail-       hyp+ correct       hyp- contra-      hyp- correct
                                     ment mean          label mean         diction mean      label mean
 results_gpt_en_de                   0.75706            0.81               0.83454           0.84
 En_De_Trans_Gen_gamma               0.82415            0.85               0.44845           0.42
 En_De_Trans_Gen_llama3_v2           0.81966            0.84               0.64852           0.68
 En_De_Trans_Gen_llama3_v1           0.78468            0.84               0.81466           0.84
 Baseline Models
 baseline-phenomena-mentions-        0.8455             0.85               0.35101           0.33
 prompt/en-de.gen
 baseline-general-prompt/en-         0.83701            0.85               0.20566           0.19
 de.gen


Table 17
Results of the English-German Translation Cross-Model Evaluation
 Model                                           Accuracy       F1               Precision       Recall
 majority_vote_result_en_de                      0.75           0.74             0.76            0.75
 results_llama3_en_de_final                      0.58           0.52             0.69            0.58
 results_gemma_en_de_final                       0.53           0.53             0.53            0.53
 results_gpt4_en_de                              0.73           0.73             0.76            0.73
 results_gpt_en_de                               0.76           0.75             0.77            0.76
 Baseline Models
 baseline-general-detection-prompt/en-de.cme     0.42           0.43             0.45            0.42
 baseline-meaning-detection-prompt/en-de.cme     0.41           0.42             0.43            0.41
 baseline-supported-detection-prompt/en-de.cme   0.49           0.33             0.58            0.49


Table 18
Results of the English-German Translation Cross-Model Evaluation (excluding the baseline models)

 Model                        MCC                𝜎𝑀 𝐶𝐶                 𝜅
                                                                       ¯                  𝜎𝜅
 majority_vote_result_en_de   0.60               0.28                  0.58               0.30
 results_gemma_en_de_final    0.16               0.05                  0.15               0.05
 results_gpt4_en_de           0.53               0.31                  0.52               0.33
 results_gpt_en_de            0.52               0.32                  0.50               0.33
 results_llama3_en_de_final   0.34               0.11                  0.27               0.10
References
 [1] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, J. Gao, Large language
     models: A survey, CoRR abs/2402.06196 (2024). URL: https://doi.org/10.48550/arXiv.2402.06196.
     doi:10.48550/ARXIV.2402.06196. arXiv:2402.06196.
 [2] J. Karlgren, L. Dürlich, E. Gogoulou, L. Guillou, J. Nivre, M. Sahlgren, A. Talman, Eloquent clef
     shared tasks for evaluation of generative language model quality, in: Advances in
     Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow,
     UK, March 24–28, 2024, Proceedings, Part V, Springer-Verlag, Berlin, Heidelberg, 2024, p. 459–465.
     URL: https://doi.org/10.1007/978-3-031-56069-9_63. doi:10.1007/978-3-031-56069-9_63.
 [3] tiiuae, Falcon-11B, https://huggingface.co/tiiuae/falcon-11B, Accessed on 2024-05-23.
 [4] M. N. Team, Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023.
     URL: www.mosaicml.com/blog/mpt-7b, accessed: 2023-05-05.
 [5] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhar-
     gava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
     arXiv:2307.09288 (2023).
 [6] T. Mickus, E. Zosa, R. Vázquez, T. Vahtola, J. Tiedemann, V. Segonne, A. Raganato, M. Apidianaki,
     Semeval-2024 shared task 6: Shroom, a shared-task on hallucinations and related observable
     overgeneration mistakes, arXiv preprint arXiv:2403.07726 (2024).
 [7] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/
     MODEL_CARD.md.
 [8] Hugging         Face,         Meta-Llama-3-8B-Instruct,        https://huggingface.co/meta-llama/
     Meta-Llama-3-8B-Instruct, Accessed on 2024-05-23.
 [9] OpenAI, Gpt-3.5 turbo, n.d.. URL: https://platform.openai.com/docs/models/gpt-3-5-turbo.
[10] OpenAI, Model endpoint compatibility, n.d.. URL: https://platform.openai.com/docs/models/
     model-endpoint-compatibility.
[11] T. M. Gemma Team, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love,
     P. Tafti, L. Hussenot, et al., Gemma (2024). URL: https://www.kaggle.com/m/3301. doi:10.34740/
     KAGGLE/M/3301.
[12] Google, gemma-7b, https://huggingface.co/google/gemma-7b, Accessed on 2024-05-23.
[13] Microsoft, guidance-ai/guidance: A guidance language for controlling generative models, https:
     //github.com/guidance-ai/guidance, 2023.
[14] T. G. Dietterich, et al., Ensemble learning, The handbook of brain theory and neural networks 2
     (2002) 110–125.
[15] D. Chicco, M. J. Warrens, G. Jurman, The matthews correlation coefficient (mcc) is more informative
     than cohen’s kappa and brier score in binary classification assessment, Ieee Access 9 (2021) 78368–
     78381.
     A. Appendix

1 answer_format = {"hyp+": "", "hyp-": ""}
                                                             1 answer_format = {"hyp+": "", "hyp-": ""}
2 user_prompt = f’’’
                                                             2 user_prompt = f’’’
3 user
                                                             3 
4 Given the src below, generate a
                                                             4 Med tanke på källan nedan, generera en
      paraphrase hypothesis hyp+ that is
                                                                   parafras-hypotes hyp+ som stöds av kä
      supported by src and a second
                                                                   llan och en andra parafras hyp- som
      paraphrase hyp- that is not supported
                                                                   inte stöds av källan.
       by src.
                                                             5 Ge resultatet i följande format: {
5 Provide the result in the following
                                                                   answer_format}
      format: {answer_format}
                                                             6 Källa: {source}
6 Src: {source}
                                                             7 
7 
                                                             8 Resultat:
8 Result:
                                                             9 modell’’’
9 model’’’
                                                            10
10

                                                                 Listing 2: Se_Para_Gen_Gemma_v1
     Listing 1: En_Para_Gen _Gemma_v1

     Figure 4: First Prompts used by ‘Gemma’ for Paraphrasing Generation Task.



                                                             1           answer_format = {"hyp+": " ", "
                                                                     hyp-": " "}
                                                             2       user_prompt = f’’’
                                                             3       Med tanke på källan nedan, generera
                                                                     en parafras-hypotes hyp+ som stöds av
                                                                      källan och en andra parafras hyp-
1            ’’’Given the src below, generate
                                                                     som inte stöds av källan.
         a paraphrase hypothesis hyp+ that is
                                                             4       Ge resultatet i följande format: {
          supported by src and a second
                                                                     answer_format}
         paraphrase hyp- that is not supported
                                                             5       Källa: {source}
          by src.
                                                             6       Resultat:
2            Provide the result in the
                                                             7       ’’’
         following format: {"hyp+": "", "hyp-":
                                                             8       with system():
          ""}
                                                             9           lm = gpt + "Du är en
3            Src: {source}
                                                                     textgenerator. Du är specialiserad på
4            Result:’’’
                                                                      att parafrasera texter"
5
                                                            10       with user():
     Listing 3: En_Para_Gen _GPT3.5                         11           lm += user_prompt
                                                            12       with assistant():
                                                            13           lm += gen("answer")
                                                            14       result = lm["answer"]
                                                            15


                                                                 Listing 4: Se_Para_Gen_GPT3.5

     Figure 5: Prompts used by ‘GPT-3.5 Turbo’ for Paraphrasing Generation Task.
   1                 ’’’Given the src below, generate a paraphrase hypothesis hyp+ that is
              supported by src and a second paraphrase hyp- that is not supported by src.
   2
   3          Provide the result in the following format: {"hyp+": "", "hyp-": ""}
   4          Src: {source}
   5          Result: }
   6              Result:’’’
   7


       Listing 5: En_Para_Gen _Llama3

Figure 6: Prompt used by ‘Llama 3’ for Paraphrasing Generation Task.




Table 19
LLama3 generates hyp+ and hyp- in the source language instead of the desired target language.
 Lang-pair        Source                            hyp+                                   hyp-
 en-de            All nouns, alongside the          All nouns, alongside the word          All nouns, alongside the word
                  word Sie for you, always be-      Sie for you, always begin with a       sie for you, always begin with a
                  gin with a capital letter, even   capital letter, even in the middle     capital letter, even in the middle
                  in the middle of a sentence.      of a sentence, except for those        of a sentence.
                                                    that are part of a title or a proper
                                                    noun.
 en-fr            The final line of the third The final line of the third verse            The final line of the third verse
                  verse was changed during was modified during the reign                   was rewritten during the reign
                  the reign of Alexander I of of Alexander I of Yugoslavia in              of Alexander the Great in ’Kralja
                  Yugoslavia in “Kralja Alek- ’Kralja Aleksandra, Bože hrani’.             Aleksandra, Bože hrani’.
                  sandra, Bože hrani ”.




Table 20
Comparison of English Prompt 1 Results on Trial Dataset from Gemma, Llama3, and GPT-3.5 for the paraphrasing
detection task.
         id         type         label    prediction gemma          prediction gpt 3.5        prediction llama3
         0        antonym        hyp1               hyp1                     hyp1                     hyp1
         1        negation       hyp1               hyp1                     hyp1                     hyp1
         2        antonym        hyp1               hyp1                     hyp1                     hyp1
         3      named entity     hyp1               hyp1                     hyp1                     hyp1
         4         natural       hyp2               hyp2                     hyp2                     hyp2
         5        addition       hyp2               hyp2                     hyp2                     hyp2
         6         gender        hyp2               hyp2                     hyp2                     hyp2
         7         natural       hyp1               hyp2                     hyp1                     hyp2
         8         number        hyp2               hyp2                     hyp2                     hyp2
         9        pronoun        hyp1               hyp2                     hyp2                     hyp2
         10       pronoun        hyp1               hyp1                     hyp2                     hyp2
         11       addition       hyp2               hyp2                     hyp2                     hyp2
         12      conversion      hyp1               hyp2                     hyp1                     hyp1
         13        natural       hyp2               hyp2                     hyp1                     hyp2
         14     named entity     hyp2               hyp2                     hyp2                     hyp2
         15         date         hyp1               hyp1                     hyp1                     hyp1
1  answer_format = {"hyp+": "", "hyp-": ""}
2  user_prompt = f’’’
 3 user
 4 As an AI model, your task is to generate two paraphrases based on the given source text.
       The first paraphrase, labeled as ’hyp+’, should be supported by the source text. The
       second paraphrase, labeled as ’hyp-’, should not be supported by the source text.
 5 Here’s an example to illustrate this:
 6 Source: The population has declined in some 210 of the 280 municipalities in Sweden,
       mainly in inland central and northern Sweden.
 7 hyp+: In the majority of Sweden’s 280 municipalities, the population has gone down.
 8 This is a paraphrase that is supported by the source. It’s saying essentially the same
       thing as the source: the population has decreased in most municipalities. The wording
       is different, but the meaning is the same. Hence, it’s labeled as ’hyp+’.
 9 hyp-: In the majority of Sweden’s 280 municipalities, the population has gone up.
10 This is a paraphrase that is not supported by the source. It’s saying the opposite of what
       the source says: the population has increased in most municipalities. This contradicts
       the information in the source. Hence, it’s labeled as ’hyp-’.
11 Now, given the source text below, generate ’hyp+’ and ’hyp-’ paraphrases and provide the
       result in the following format: {answer_format}
12 Source: {source}
13 
14 Result:
15 model’’’
16


      Listing 6: En_Para_Gen _Gemma_v2

1  answer_format = {"hyp+": "", "hyp-": ""}
2  user_prompt = f’’’
 3 användare
 4 Som en AI-modell är din uppgift att generera två parafraser baserade på den angivna
       källtexten. Den första parafrasen, märkt som ’hyp+’, ska stödjas av källtexten. Den
       andra parafrasen, märkt som ’hyp-’, ska inte stödjas av källtexten.
 5 Här är ett exempel för att illustrera detta:
 6 Källa: Intäkterna från mjukvarulicenser, ett mått som finansanalytiker följer noga,
       minskade med 21 procent till 107,6 miljoner dollar.
 7 hyp+: Intäkter från programvarulicenser, en metrik som noggrant övervakas av finansiella
       analytiker, minskade med 21 procent till ett belopp av 107,6 miljoner dollar.
 8 Detta är en parafras som stöds av källan. Den säger i princip samma sak som källan:
       intäkterna har minskat i de flesta kommunerna. Formuleringen är annorlunda, men
       betydelsen är densamma. Därför märks den som ’hyp+’.
 9 Detta är en parafras som stöds av källan. Den säger i princip samma sak som källan:
       intäkterna har minskat i de flesta kommunerna. Formuleringen är annorlunda, men
       betydelsen är densamma. Därför märks den som ’hyp+’.
10 hyp-: Intäkter från programvarulicenser, en metrik som noggrant övervakas av finansiella
       analytiker, minskade med 42 procent till ett belopp av 107,6 miljoner dollar.
11 Detta är en parafras som inte stöds av källan. Den säger motsatsen till vad källan säger:
       intäkterna har ökat i de flesta kommunerna. Detta motsäger informationen i källan.
       Därför märks den som ’hyp-’.
12 Nu, med den angivna källtexten nedan, generera ’hyp+’ och ’hyp-’ parafraser och ge
       resultatet i följande format: {answer_format}
13 Källa: {source}
14 
15 Resultat:
16 modell’’’
17


      Listing 7: Se_Para_Gen_Gemma_v2

     Figure 7: Second Prompts used by ‘Gemma’ for Paraphrasing Generation Task.
1 user_prompt = f’’’
2 You are a text generator and your task is to generate two translation hypothesis given the ’
      src’ below.
3 The first translation labelled as ’hyp+’ should be supported by ’src’ and the second
      translation labelled as ’hyp-’ should not be supported by ’src’.
4 Provide the result in the following format: "hyp+": "", "hyp-": "". Target language: "
      English"
5
6    Src: {source}
7
8    Result:’’’
9


     Listing 8: En_De_Trans_Gen_gamma, En_Fr_Trans_Gen_gemma


     Figure 8: Prompts used by ‘Gemma’ for Translation Generation Task.


1            answer_format = {"hyp+": "", "hyp-": ""}
2            system_msg = "You are a text generator for translation"
3            user_prompt = f’’’
4            Your task is to generate two translation hypothesis given the ’src’ below. The first
          translation labelled as ’hyp’ should be supported by ’src’ and the second translation
         labelled as ’hyp-’ should not be supported by "src". Provide the result in the following
          format: {answer_format}. Target language: {target_language}
5
6             Src: {source}
7
8             Result:
9             ’’’
10


     Listing 9: En_De_Trans_Gen_gamma, En_Fr_Trans_Gen_gemma


     Figure 9: Prompts used by ‘GPT-3.5 Turbo’ and ‘GPT-4’ models for Translation Generation Task.


     Table 21
     Classification Report Gemma of English Prompt 1 Results on Trial Dataset
                                                Precision   Recall   F1-Score      Support
                              hyp1                 1.00      0.78       0.88          9
                              hyp2                 0.78      1.00       0.88          7
                              Accuracy                                  0.88         16
                              Macro avg            0.89      0.89       0.88         16
                              Weighted avg         0.90      0.88       0.88         16


     Table 22
     Classification Report GPT-3.5 of English Prompt 1 Results on Trial Dataset.
                                                Precision   Recall   F1-Score      Support
                              hyp1                 1.00      0.67       0.80          9
                              hyp2                 0.70      1.00       0.82          7
                              Accuracy                                  0.81         16
                              Macro avg            0.85      0.83       0.81         16
                              Weighted avg         0.87      0.81       0.81         16
1              answer_format = {"hyp+": "", "hyp-": ""}
2              user_prompt = f’’’
3              Given the ’src’ below, generate a translation hypothesis ’hyp+’ that is supported
           by ’src’ and a second translation ’hyp-’ that is not supported by ’src’.
4              Provide the result in the following format: {answer_format}.
5              Target language: {target_language}
6              Src: {source}
7              Result:’’’
8

      Listing 10: De_En_Trans_Gen_llama3_v1, En_De_Trans_Gen_llama3_v1, Fr_En_Trans_Gen_llama3_v1,
      En_Fr_Trans_Gen_llama3_v1

1              answer_format = {"hyp+": "", "hyp-": ""}
2              user_prompt = f’’’
3              As an AI model, your task is to generate two translation hypothesis given the ’src’
            below. The first translation labelled as ’hyp+’ should be supported by ’src’ and the
           second translation labelled as ’hyp-’ should not be supported by ’src’.
4              Provide the result in the following format: {answer_format}.
5              Target language: {target_language}
6              Here is an example to illustrate this:
7              src: The days in the summer can lead to problems getting sufficient sleep and
           associated health issues.
8              hyp+: Die Tage im Sommer können zu Problemen führen, genügend Schlaf zu bekommen
           und damit verbundene Gesundheitsprobleme.
9              This is a translation that is supported by the source. It is the exact translation
           of ’src’.
10             hyp-: Die sehr langen Tage im Sommer können zu Problemen führen, genügend Schlaf zu
            bekommen und damit verbundene Gesundheitsprobleme.
11             This is a translation, which is not supported by the source. It includes an
           addition, which is "sehr langen" in this case. In ’src’, there is no mention about ’the
            very long’ days in the summer. Hence, it is labelled as ’hyp-’.
12             Src: {source}
13             Result: ’’’
14


      Listing 11: En_De_Trans_Gen_llama3_v2, other language pairs contain an example in the associated language

     Figure 10: Prompt used by ‘Llama 3’ for Translation Generation Task.




     Table 23
     Classification Report Llama3 of English Prompt 1 Results on Trial Dataset.
                                                Precision   Recall   F1-Score     Support
                              hyp1                 1.00      0.78      0.88          9
                              hyp2                 0.78      1.00      0.88          7
                              Accuracy                                 0.88         16
                              Macro avg            0.89      0.89      0.88         16
                              Weighted avg         0.90      0.88      0.88         16
        1                answer_format = {"label": ""}
        2                user_prompt = f’’’
        3                user
        4                Givet en "src" och två hypoteser "hyp1" och "hyp2" är din uppgift att
                 upptäcka vilken av de två hypoteserna ("label") som inte stöds av källan.
        5                Ge resultatet i följande format: {answer_format}.
        6
        7                 Src: {source}
        8                 hyp1 : {hyp1}
        9                 hyp2 : {hyp2}
        10
        11                
        12
        13                Resultat:
        14                model
        15                ’’’
        16
        17


             Listing 12: Swedish Prompt 1.

        1                answer_format = {"label": ""}
        2                user_prompt = f’’’
        3                user
        4                You are a researcher investigating a new phenomenon. You have gathered
                 data ({source}) and formulated two competing hypotheses (hyp1: {hyp1}, and hyp2
                 : {hyp2}) to explain it.
        5                Identify the hypothesis that contradicts the information provided in
                 the given source.
        6                Provide the result in the following format: {answer_format}.
        7
        8                 
        9
        10                Result:
        11                model
        12                ’’’
        13


             Listing 13: English Prompt 2.

        1                answer_format = {"label": ""}
        2                user_prompt = f’’’
        3                user
        4                Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to
                 detect which of the two hypotheses ("label") is not supported by the source.
        5                Provide the result in the following format:
        6                {answer_format}.
        7
        8                 Src: {source}
        9                 hyp1: {hyp1}
        10                hyp2: {hyp2}
        11
        12                
        13
        14                Result:
        15                model
        16                ’’’
        17


             Listing 14: English Prompt 1 and Swedish Prompt 2.

Figure 11: Prompts used for English and Swedish Paraphrasing Detection Task on Trial Dataset.
Table 24
Comparison of English Prompt 2 Results on Trial Dataset from Gemma, GPT-3.5, Llama3 and GPT-4 for the
paraphrasing detection task. Cases highlighted in yellow indicate discrepancies between the model’s predictions
and the correct labels. Analysis of the table shows consistent challenges for the model in accurately predicting
labels, particularly for hallucinations involving pronouns, named entities, natural transformations and conversions.
Gemma had the most difficulty across different types of hallucinations. GPT-4 showed the best performance
specifically in detecting false pronouns, as the other models failed to recognize this type of hallucination.
 id        type         label    prediction gemma        prediction gpt 3.5      prediction llama3      prediction gpt4
 0       antonym        hyp1            hyp1                       hyp1                 hyp1                  hyp1
 1       negation       hyp1            hyp1                       hyp1                 hyp1                  hyp1
 2       antonym        hyp1            hyp1                       hyp1                 hyp1                  hyp1
 3     named entity     hyp1            hyp2                       hyp1                 hyp1                  hyp1
 4        natural       hyp2            hyp2                       hyp2                 hyp2                  hyp2
 5       addition       hyp2            hyp2                       hyp2                 hyp2                  hyp2
 6        gender        hyp2            hyp2                       hyp2                 hyp2                  hyp2
 7        natural       hyp1            hyp2                       hyp1                 hyp2                  hyp1
 8        number        hyp2            hyp2                       hyp2                 hyp2                  hyp2
 9       pronoun        hyp1            hyp2                       hyp2                 hyp2                  hyp1
 10      pronoun        hyp1            hyp2                       hyp2                 hyp2                  hyp1
 11      addition       hyp2            hyp2                       hyp2                 hyp2                  hyp2
 12     conversion      hyp1            hyp2                       hyp1                 hyp1                  hyp1
 13       natural       hyp2            hyp2                       hyp1                 hyp2                  hyp2
 14    named entity     hyp2            hyp2                       hyp2                 hyp2                  hyp2
 15        date         hyp1            hyp1                       hyp1                 hyp1                  hyp1




Table 25
Classification Report Gemma of English Prompt 2 Results on Trial Dataset.
                                             Precision    Recall     F1-Score    Support
                          hyp1                  1.00       0.44           0.62      9
                          hyp2                  0.58       1.00           0.74      7
                          Accuracy                                        0.69     16
                          Macro avg             0.79       0.72           0.68     16
                          Weighted avg          0.82       0.69           0.67     16




Table 26
Classification Report GPT-3.5 of English Prompt 2 Results on Trial Dataset.
                                             Precision    Recall     F1-Score    Support
                          hyp1                  0.88       0.78           0.82      9
                          hyp2                  0.75       0.86           0.80      7
                          Accuracy                                        0.81     16
                          Macro avg             0.81       0.82           0.81     16
                          Weighted avg          0.82       0.81           0.81     16
Table 27
Classification Report Llama3 of English Prompt 2 Results on Trial Dataset.
                                           Precision   Recall   F1-Score     Support
                         hyp1                   1.00    0.67      0.80          9
                         hyp2                   0.70    1.00      0.82          7
                         Accuracy                                 0.81         16
                         Macro avg              0.85    0.83      0.81         16
                         Weighted avg           0.87    0.81      0.81         16




Table 28
Classification Report GPT-4 of English Prompt 2 Results on Trial Dataset.
                                           Precision   Recall   F1-Score     Support
                         hyp1                   1.00    1.00      1.00          9
                         hyp2                   1.00    1.00      1.00          7
                         Accuracy                                 1.00         16
                         Macro avg              1.00    1.00      1.00         16
                         Weighted avg           1.00    1.00      1.00         16




Table 29
Comparison of Swedish Prompt 1 Results on Trial Dataset from Gemma and GPT-3.5 for the paraphrasing
detection task.
                  id        type        label     prediction gemma       prediction gpt 3.5
                  0        number       hyp2            hyp2                   hyp2
                  1        natural      hyp1            hyp2                   hyp2
                  2     named entity    hyp2            hyp2                   hyp2
                  3       addition      hyp2            hyp2                   hyp1
                  4       negation      hyp1            hyp1                   hyp1
                  5        gender       hyp2            hyp2                   hyp2
                  6       antonym       hyp2            hyp2                   hyp2
                  7       negation      hyp1            hyp2                   hyp2
                  8       addition      hyp2            hyp2                   hyp2
                  9        number       hyp2            hyp2                   hyp2
                  10       natural      hyp1            hyp2                   hyp1
                  11      addition      hyp2            hyp2                   hyp2
                  12      addition      hyp2            hyp2                   hyp2
                  13    named entity    hyp1            hyp2                   hyp2
                  14    named entity    hyp1            hyp1                   hyp2
                  15       number       hyp2            hyp2                   hyp2
                  16      addition      hyp1            hyp2                   hyp1
                  17        tense       hyp2            hyp2                   hyp2
                  18      pronoun       hyp1            hyp2                   hyp2
                  19         date       hyp1            hyp1                   hyp1
Table 30
Classification Report Gemma for Swedish Prompt 1 Results on Trial Dataset.
                                           Precision   Recall     F1-Score        Support
                         hyp1                  1.00     0.33           0.50         9
                         hyp2                  0.65     1.00           0.79         11
                         Accuracy                               0.70
                         Macro avg             0.82     0.67           0.64         20
                         Weighted avg          0.81     0.70           0.66         20




Table 31
Classification Report GPT-3.5 for Swedish Prompt 1 Results on Trial Dataset.
                                           Precision   Recall     F1-Score        Support
                         hyp1                  0.80     0.44           0.57         9
                         hyp2                  0.67     0.91           0.77         11
                         Accuracy                               0.70
                         Macro avg             0.73     0.68           0.67         20
                         Weighted avg          0.73     0.70           0.68         20




Table 32
Comparison of Swedish Prompt 2 Results on Trial Dataset from Gemma and GPT-3.5 for the paraphrasing
detection task.
                  id       type        label     prediction gemma             prediction gpt 3.5
                  0       number       hyp2            hyp2                         hyp2
                  1       natural      hyp1            hyp2                         hyp2
                  2    named entity    hyp2            hyp2                         hyp2
                  3      addition      hyp2            hyp2                         hyp1
                  4      negation      hyp1            hyp1                         hyp1
                  5       gender       hyp2            hyp2                         hyp2
                  6      antonym       hyp2            hyp2                         hyp2
                  7      negation      hyp1            hyp1                         hyp1
                  8      addition      hyp2            hyp2                         hyp1
                  9       number       hyp2            hyp2                         hyp2
                  10      natural      hyp1            hyp2                         hyp1
                  11     addition      hyp2            hyp2                         hyp1
                  12     addition      hyp2            hyp2                         hyp2
                  13   named entity    hyp1            hyp2                         hyp2
                  14   named entity    hyp1            hyp2                         hyp2
                  15      number       hyp2            hyp2                         hyp2
                  16     addition      hyp1            hyp2                         hyp2
                  17       tense       hyp2            hyp2                         hyp2
                  18     pronoun       hyp1            hyp2                         hyp2
                  19        date       hyp1            hyp2                         hyp2
        1             ’’’
        2             result = {’label’: ’hyp1’}
        3
        4           src = "The population has declined in some 210 of the 280 municipalities in
                 Sweden, mainly in inland central and northern Sweden."
        5
        6             "In the majority of Sweden’s 280 municipalities, the population has gone up
                ."
        7           "In the majority of Sweden’s 280 municipalities, the population has gone
                down."
        8
        9             hyp1
       10             hyp2
       11
       12             if ’declined’ in src or ’down’ in src:
       13                 result[’label’] = hyp2
       14
       15             elif ’up’ in src:
       16                 result[’label’] = hyp1
       17       ’’’
       18


            Listing 15: En_Para_Gen _Gemma_v1

     Figure 12: Example of generated output from Gemma.




                                                              1   answer_format = {"label": ""}
1 answer_format = {"label": ""}
                                                              2
2    user_prompt = f’’’
                                                              3      user_prompt = f’’’
3    user
                                                              4      user
4    Given a "src" and two hypotheses "hyp1
                                                              5      Givet en ”src” och två hypoteser ”hyp1
      " and "hyp2" your task is to detect
                                                                      ” och ”hyp2” är din uppgift att upptä
      which of the two hypotheses ("label")
                                                                      cka vilken av de två hypoteserna (”
       is not supported by the source.
                                                                      label”) som inte stöds av källan.
5    Provide the result in the following
                                                              6      Ge resultatet i följande format: {
      format: {answer_format}.
                                                                      answer_format}.
6
                                                              7
7       Src: {source}
                                                              8      Src: {source}
8       hyp1 : {hyp1}
                                                              9      hyp1 : {hyp1}
9       hyp2 : {hyp2}
                                                             10      hyp2 : {hyp2}
10
                                                             11
11      
                                                             12      
12
                                                             13
13      Result:
                                                             14      Resultat:
14      model
                                                             15      model
15      ’’’
                                                             16      ’’’
16
                                                             17

     Listing 16: En_Para_Det_Gemma_v1
                                                                  Listing 17: Se_Para_Det_Gemma_v1

     Figure 13: First Prompts used by ‘Gemma’ for Paraphrasing Detection Task on the test set.
        1   ’’’
        2   answer_format = {"hyp+": "", "hyp-": ""}
        3
        4 user_prompt = user
        5    You are a researcher investigating a new phenomenon. You have gathered data
              ({source}) and formulated two competing hypotheses (hyp1: {hyp1}, and hyp2:
              {hyp2}) to explain it.
        6    Identify the hypothesis that contradicts the information provided in the given
              source.
        7    Provide the result in the following format: {answer_format}.
        8    
        9    Result:
       10    model’’’
       11


            Listing 18: En_Se_Para_Det_Gemma_v2

     Figure 14: Second Prompts used by ‘Gemma’ for Paraphrasing Detection Task on the test set.




1                answer_format = {"label": ""}
                                                                1               answer_format = {"label": ""}
2
                                                                2
3           user_prompt = f’’’
                                                                3        user_prompt = f’’’
4           Given a "src" and two hypotheses "
                                                                4        You have gathered data ({source}) and
            hyp1" and "hyp2" your task is to
                                                                          formulated two competing hypotheses
            detect which of the two hypotheses ("
                                                                         to explain it.
            label") is not supported by the
                                                                5        hyp1: {hyp1}
            source.
                                                                6        hyp2: {hyp2})
5           Provide the result in the following
                                                                7        Identify the hypothesis that
            format: {answer_format}.
                                                                         contradicts the information provided
6
                                                                         in the given source.
7           Src: {source}
                                                                8
8           hyp1 : {hyp1}
                                                                9        Provide the result in the following
9           hyp2 : {hyp2}
                                                                         format: {answer_format}.
10
                                                               10
11          Result:
                                                               11        Result:
12          ’’’
                                                               12        ’’’
13
                                                               13
14

                                                                    Listing 20: En_Se_Para_Det_GPT3.5_GPT4_v2
     Listing 19: En_Se_Para_Det_GPT3.5_GPT4_v1

     Figure 15: Prompt used by ‘GPT-3.5 Turbo’ and ‘GPT-4’ for Paraphrasing Detection Task on the test set.




     Table 33
     Classification report for Llama 3 on Trial Dataset.
                                                 Precision   Recall     F1-Score     Support
                               hyp1                 1.00      0.50           0.67       4
                               hyp2                 0.67      1.00           0.80       4
                               Accuracy                               0.70
                               Macro avg            0.83      0.75           0.73       8
                               Weighted avg         0.83      0.75           0.73       8
1              ’’’Given a "src" and two                        1           ’’’Givet en ”src” och två
         hypotheses "hyp1" and "hyp2" your                             hypoteser ”hyp1” och ”hyp2” är din
         task is to detect which of the two                            uppgift att upptäcka vilken av de två
         hypotheses ("label") is not supported                          hypoteserna (”label”) som inte stöds
          by the source.                                                av källan.
2      Provide the result in the following                     2       Ge resultatet i följande format: {
         format: {answer_format}.                                      answer_format}.
3                                                              3
4      Src: {source}                                           4       Src: {source}
5      hyp1 : {hyp1}                                           5       hyp1 : {hyp1}
6      hyp2 : {hyp2}                                           6       hyp2 : {hyp2}
7                                                              7
8      Result:                                                 8       Resultat:
9         ’’’                                                  9       ’’’
10                                                            10


     Listing 21: En_Para_Det _Llama3_v1                            Listing 22: Se_Para_Det_Llama3_v1

     Figure 16: First Prompts used by ‘Llama 3’ for Paraphrasing Detection Task on the test set.




        1   ’’’ You are a researcher investigating a new phenomenon.
        2   You have gathered data (source) and formulated two competing hypotheses (hyp1 and
                hyp2) to explain it.
        3
        4 Identify the hypothesis that contradicts the information provided in the given
              source.
        5 Provide the result in the following format: {answer_format}.
        6
        7       source: {source}
        8       hyp1: {hyp1}
        9       hyp2: {hyp2}
       10
       11       Result:’’’
       12


            Listing 23: En_Se_Para_Det _Llama3_v2

     Figure 17: Second Prompt used by ‘Llama 3’ for Paraphrasing Detection Task on the test set.
1 user_prompt = f’’’
2 Given a "src" and two hypotheses "hyp1" and "hyp2", which are translated into German. Your
      task is to detect which of the two hypotheses is hallucinated ("label") and determine
      what type of hallucination ("type") it is.
3 Please explain your choice.
4 Provide the result in the following format: {"type": "", "label": ""}.
5 Here is an example to illustrate this:
6
7         Src: The days in the summer can lead to problems getting sufficient sleep and
          associated health issues.
8         hyp1: Die Tage im Sommer können zu Problemen führen, genügend Schlaf zu bekommen und
          damit verbundene Gesundheitsprobleme.
9         hyp2: Die sehr langen Tage im Sommer können zu Problemen führen, genügend Schlaf zu
          bekommen und damit verbundene Gesundheitsprobleme.
10        label: hyp2
11        type: addition
12
13    Detect which of the following hypotheses is the hallucination.
14
15     Src: {source}
16     hyp1 : {hyp1}
17     hyp2 : {hyp2}
18 Result:’’’
19


      Listing 24: En_De_Trans_Det_llama3_v1, other language pairs contain an example in the associated language

1 user_prompt = f’’’
2 You are a researcher investigating a new phenomenon. You have gathered data (source) and
      formulated two competing hypotheses (Hyp1 and Hyp2) to explain it. Identify the
      hypothesis that contradicts the information provided in the given source.
3 Provide the result in the following format: {"label": ""}.
4
5 Src: {source}
6 hyp1 : {hyp1}
7 hyp2 : {hyp2}
8
9     Result:’’’
10


      Listing 25: En_De_Trans_Det_llama3_v2, En_Fr_Trans_Det_llama3_v2

     Figure 18: Prompts Llama3 Translation Detection Task.
Table 34
Llama3 exhibits gender bias in the translation detection task.
 Source        hyp1                           hyp2                              Llama3’s explanation
 Der An- The lawyer told the male au-         The lawyer told the female au-    I chose hyp2 as the hallucina-
 walt        ditor to provide documenta-      ditor to provide documentation.   tion because it introduces a new
 forderte    tion.                                                              detail that is not present in the
 die                                                                            source text. The source text only
 Wirtschaft-                                                                    mentions "Wirtschaftsprüferin",
 sprüferin                                                                      which is a gender-neutral term
 auf, Un-                                                                       for an auditor. The addition of
 terlagen                                                                       "female" is an assumption made
 vorzule-                                                                       by the translator, which is not
 gen.                                                                           supported by the original text.
 Der    Ar-    The laborer dug a hole for     The laborer dug a hole for the    My choice is based on the fact
 beiter        the female secretary.          male secretary.                   that the original sentence does
 grub ein                                                                       not mention the gender of the
 Loch                                                                           secretary. The source sentence
 für    die                                                                     only mentions "die Sekretärin"
 Sekretärin.                                                                    which is a feminine noun in Ger-
                                                                                man, but it does not imply that
                                                                                the laborer dug a hole specifi-
                                                                                cally for a female secretary. The
                                                                                addition of "male" in hyp2 is an
                                                                                unnecessary detail that is not
                                                                                supported by the original sen-
                                                                                tence. Therefore, I label hyp2 as
                                                                                the hallucination and identify
                                                                                the type of hallucination as ad-
                                                                                dition.




Table 35
Llama3 struggles with converting measurements in the translation detection task.
 Source        hyp1                           hyp2                              Llama 3’s explanation
 Das Zen-      The center of the earth-       The center of the earthquake      I chose hyp2 as the hallucina-
 trum          quake was about 20 km          was about 20 km northeast of      tion because it adds the unit
 des Erd-      northeast of Dillon and        Dillon and about 65000 kilome-    "kilometres" to the distance,
 bebens        about 65000 metres south of    tres south of Butte.              which is not present in the orig-
 befand        Butte.                                                           inal text. The original text uses
 sich etwa                                                                      "km" and "Meilen", which are
 20 km (15                                                                      equivalent units, but hyp2 in-
 Meilen)                                                                        troduces a new unit "kilome-
 nord-                                                                          tres". This is a type of hallu-
 nordöstlich                                                                    cination where a new informa-
 von Dillon                                                                     tion is added to the original text,
 und etwa                                                                       specifically a unit conversion.’
 65 km (40
 Meilen)
 südlich
 von Butte.
     Table 36
     Llama3 struggles with recognizing the different ways dates in the translation detection task.
      Source                  hyp1               hyp2                   Llama3’s explanation
      "Perry erklärte, er     """Perry stated    """Perry stated        I chose hyp2 as the hallucination be-
      werde „nach Texas       that he will       that he will           cause it contains a specific date ("Jan.
      zurückkehren, um        return to Texas    return to Texas        21"), which is not present in the origi-
      das Ergebnis der        to     evaluate    to evaluate the        nal source text. The original text only
      Vorwahl von heute       the result of      result of tonight’s    mentions the date "21. Januar", which
      Abend auszuwerten       tonight’s pre-     pre-election and       is a more general and less specific date.
      und zu entscheiden,     election    and    decide whether         This suggests that hyp2 contains a hal-
      ob es in diesem         decide whether     there is a path        lucinated detail. The type of halluci-
      Rennen einen weit-      there is a path    forward for me         nation is time, as it involves the intro-
      erführenden      Weg    forward for me     in this race,""        duction of a specific and precise time
      für mich gibt"", aber   in this race,""    but later said he      element.
      sagte später,      er   but later said     would stay in the
      werde im Rennen         he would stay      race and run for
      bleiben und bei den     in the race and    the pre-election
      Vorwahlen in South      run for the        in South Carolina
      Carolina am 21.         pre-election in    on Jan. 21."
      Januar antreten."       South Carolina
                              on March 21."




                                                                 1 user_prompt = f’’’
                                                                 2 user
1 user_prompt = f’’’                                             3 You are a researcher investigating a new

2 Given a "src" and two hypotheses "hyp1"                              phenomenon. You have gathered data (
      and "hyp2",                                                      source) and formulated two competing
3 your task is to first translate "hyp1"                               hypotheses (hyp1 and hyp2) to explain
      and "hyp2"                                                        it. Identify the hypothesis that
4 back to the language of "src". Then,                                 contradicts the information provided
      compare these                                                    in the given source. Provide the
5 translations with "src" to detect which                              result in the following format: {
      of the two                                                       answer_format}. ONLY ANSWER IN THAT
6 hypotheses is hallucinated ("label") and                             FORMAT!
      what type of                                               4

7 hallucination ("type") it is. Provide the                      5       source: {source}
       result in the                                             6       hyp1 : {hyp1}
8 following format: {answer_format}.                             7       hyp2 : {hyp2}
9                                                                8
10 Src: {source}                                                 9       
11 hyp1 : {hyp1}                                                10
12 hyp2 : {hyp2}                                                11       Result:
13 Result:                                                      12       model ’’’
14 ’’’                                                          13
15                                                              14

     Listing 26: En_De_Trans_Det_gemma_v1,                           Listing 27: De_En_Trans_Det_gemma_v2,
     En_Fr_Trans_Det_gemma_v1                                        En_De_Trans_Det_gemma_v2,
                                                                     En_Fr_Trans_Det_gemma_v2,
                                                                     Fr_En_Trans_Det_gemma_v2

     Figure 19: Prompts Gemma Translation Detection Task.
Table 37
The prompt immensely influences the outcome of Llama3 for the translation detection task.
 Source                  hyp1                     hyp2                     Llama3’s explanation
 Der Anwalt forderte     The lawyer told the      The lawyer told the      The source states "Der Anwalt
 die       Wirtschaft-   male auditor to pro-     female auditor to pro-   forderte die Wirtschaftsprüferin
 sprüferin        auf,   vide documentation.      vide documentation.      auf,    Unterlagen vorzulegen.",
 Unterlagen vorzule-                                                       which translates to "The lawyer
 gen.                                                                      demanded the female auditor
                                                                           to provide documentation." This
                                                                           implies that the auditor is a
                                                                           woman, contradicting Hyp2 which
                                                                           states that the auditor is a man.
                                                                           result_text ’label’: ’hyp2’
 Der Arbeiter grub       The laborer dug a        The laborer dug a        The source states "Der Arbeiter
 ein Loch für die        hole for the female      hole for the male sec-   grub ein Loch für die Sekretärin",
 Sekretärin.             secretary.               retary.                  which means "The laborer dug a
                                                                           hole for the secretary". This clearly
                                                                           indicates that the secretary is a fe-
                                                                           male, as "Sekretärin" is the German
                                                                           word for "female secretary". There-
                                                                           fore, Hyp2, which suggests that the
                                                                           laborer dug a hole for the male sec-
                                                                           retary, contradicts the information
                                                                           provided in the source. result_text
                                                                           ’label’: ’hyp2’
 Das Zentrum des         The center of the        The center of the        Explanation: The source states that
 Erdbebens befand        earthquake      was      earthquake       was     the distance from the earthquake’s
 sich etwa 20 km         about 20 km north-       about 20 km north-       center to Butte is approximately 40
 (15 Meilen) nord-       east of Dillon and       east of Dillon and       miles (65 km), not 65,000 kilometres.
 nordöstlich      von    about 65000 metres       about 65000 kilome-      Hyp2 contradicts this information
 Dillon und etwa         south of Butte.          tres south of Butte.     by stating a much larger distance,
 65 km (40 Meilen)                                                         while Hyp1 correctly translates the
 südlich von Butte.                                                        original distance from miles to me-
                                                                           ters. result_text ’label’: ’hyp2’
 "Perry erklärte, er     """Perry stated that     """Perry stated that     Explanation: The source states
 werde „nach Texas       he will return to        he will return to        that the event in question is on
 zurückkehren, um        Texas to evaluate the    Texas to evaluate the    "heute Abend" (tonight), which cor-
 das Ergebnis der        result of tonight’s      result of tonight’s      responds to January 21. However,
 Vorwahl von heute       pre-election      and    pre-election      and    Hyp1 states that the event is on
 Abend auszuwerten       decide whether there     decide whether there     "March 21", which contradicts the
 und zu entscheiden,     is a path forward        is a path forward        information provided in the source.
 ob es in diesem         for me in this race,""   for me in this race,""   result_text ’label’: ’hyp1’
 Rennen einen weit-      but later said he        but later said he
 erführenden      Weg    would stay in the        would stay in the
 für mich gibt"", aber   race and run for the     race and run for the
 sagte später,      er   pre-election in South    pre-election in South
 werde im Rennen         Carolina on March        Carolina on Jan. 21."
 bleiben und bei den     21."
 Vorwahlen in South
 Carolina am 21.
 Januar antreten."
Table 38
The count of failed examples for GPT-3.5 Turb and GPT-4 in the translation detection task.


                           Language         GPT-4                  GPT-3.5
                                       prompt1 prompt2        prompt1 prompt2
                           de-en       2           1          0           0
                           en-de       8           15         1           1
                           en-fr       2           7          0           1
                           fr-en       7           3          1           2




   1              ’’’Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect
            which of the two hypotheses ("label") is not supported by the source.
   2               Provide the result in the following format:
   3               {answer_format}.
   4
   5                Src: {source}
   6
   7                hyp1: {hyp1}
   8
   9                hyp2: {hyp2}’’’
  10


       Listing 28: majority_vote_cross_model_result_en

Figure 20: The prompts used for paraphrasing task in cross-model.
Table 39
Samples for which GPT-4 failed to assign labels in the translation detection task.
 Lang Source                        hyp1                       hyp2                       Our explanation
 de-    Die Mittel könnte man       The funds could be         The funds could be         Different inter-
 en     für hochwassersichere       used for more water-       used for more flood-       pretations   of
        Häuser, eine bessere        proof houses, better       proof houses, better       flood-proof
        Wasserverwaltung und        water management           water management
        Nutzpflanzendiversi-        and crop diversifica-      and crop diversifica-
        fizierung verwenden.        tion.                      tion.
 de-    Es zeigt 362 ver-           It shows 362 different     It showed 362 differ-      wrong tense
 en     schiedene        alte       old species of wood,       ent old species of wood,
        Holzarten,   Büsche         bushes and 236 dif-        bushes and 236 dif-
        und 236 verschiedene        ferent species of fruit    ferent species of fruit
        Obstbaumarten.              trees.                     trees.
 en-    The world has over          Die Welt hat mehr          Die Welt hat mehr          missing       filler
 de     5,000 different lan-        als 5.000 verschiedene     als 5.000 verschiedene     word
        guages, more than           Sprachen, darunter         Sprachen, mehr als
        twenty with 50 million      mehr als zwanzig mit       zwanzig mit 50 Mil-
        or more speakers.           50 Millionen oder          lionen oder mehr
                                    mehr Sprechern.            Sprechern.
 en-    1i Productions is an        Es wurde 2004 von          1i Productions ist         Name of the pub-
 de     American board game         Colin Byrne, William       ein    amerikanischer      lisher missing
        publisher.   It was         und Jenna gegründet        Brettspieleverlag. Er
        founded in 2004 by          und ist ein amerikanis-    wurde von Colin Byrne,
        Colin Byrne, William        cher Brettspielverlag.     Wiliam and Jenna im
        and Jenna.                                             Jahr 2004 gegründet.
 en-    Mats Wilander defeats       Mat Wilander besiegt       Mats Wilander schlägt      different    (but
 de     Anders Järryd, 6 – 4, 3     Anders Järryd, 6:4, 3:6,   Anders Järryd 6:4, 3:6,    still     correct)
        – 6, 7 - 5.                 7:5.                       7:5.                       translations of
                                                                                          defeat
 en-    They have feet with         Sie haben Füße mit         Sie haben Füße mit         Hind legs was
 de     scales and claws, they      Schalen und Nägeln,        Schalen und Nägeln,        translated incor-
        lay eggs, and they walk     sie legen Eier und sie     sie legen Eier und sie     rectly from back
        on their two back legs      gehen auf ihren beiden     gehen auf ihren beiden     (body part)
        like a T-Rex.               Rückenbeinen wie ein       Hinterbeinen wie ein T-
                                    T-Rex.                     Rex.
 en-    The NSA has its own         Die NSA hat ein            Die NSA hat ein            Wrong pronoun
 de     internal data format        eigenes       internes     eigenes       internes     "it" instead of
        that tracks both ends       Datenformat,        das    Datenformat,        das    "they"
        of a communication,         beide Enden einer          beide Enden einer
        and if it says, this com-   Kommunikation ver-         Kommunikation ver-
        munication came from        folgt, und wenn sie        folgt, und wenn sie
        America, they can tell      sagt, diese Mitteilung     sagt, diese Mitteilung
        Congress how many of        kam aus Amerika, kön-      kam aus Amerika, kön-
        those communications        nen sie dem Kongress       nen sie dem Kongress
        they have today, right      sagen, wie viele dieser    sagen, wie viele dieser
        now.                        Mitteilungen es heute      Mitteilungen sie heute
                                    haben, gerade jetzt.       haben, gerade jetzt.
 en-    In 2014 the site            20124        brachten      Im Jahr 2014 startete      The   year       is
 de     launched iOS and            iOS und Android            die Website iOS und        messed up
        Android applications        Applikationen      zur     Android - Anwendun-
        for product search;         Produktsuche      her-     gen für die Produkt-
        product        features     aus; Produktfeatures       suche. Zu den Produkt-
        include     interactive     beinhalten interaktive     funktionen gehören in-
        video product reviews       Video-Produktreviews       teraktive Videoproduk-
        with live question-and-     mit live Frage- und        tbewertungen mit live
        answer sessions.            Antwort-Sessions.          Fragen und Antworten.