<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anh Thu Maria Bui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saskia Felizitas Brech</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalie Hußfeldt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Jennert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melanie Ullrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timo Breuer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narjes Nikzad Khasmakhi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TH Köln - University of Applied Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Hallucination Generation</kwd>
        <kwd>Hallucination Detection</kwd>
        <kwd>LLMs as Evaluators</kwd>
        <kwd>Llama 3</kwd>
        <kwd>Gemma</kwd>
        <kwd>GPT-4</kwd>
        <kwd>GPT-3.5 Turbo</kwd>
        <kwd>Ensemble majority voting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        This section is divided into two parts: the hallucination generation task and the hallucination
detection task. Before delving into the details of our methodology, it is important to note that, prior to
receiving the dataset from the organizers, we familiarized ourselves with the overall task by applying three models,
Falcon [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], MPT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Llama 2 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to the hallucination detection task on the SHROOM dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Since
the results from these three models were unsatisfactory, we excluded them from our implementation
for the ELOQUENT lab.
      </p>
      <p>
        The models we applied to the ELOQUENT dataset in the generation and detection tasks were:
• Meta-Llama/Meta-Llama-3-8B-Instruct [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]
• GPT-3.5 Turbo [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
• GPT-4 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
• Google/GEMMA-7B-IT [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]
      </p>
      <p>
        We leveraged a combination of open-source and closed-source models, which allowed us to compare the
quality of outputs across different models. Using open-source models also helped us keep costs down:
we first experimented with various prompts on the open-source LLMs
to identify the most effective ones and then applied these optimized prompts to the closed-source GPT
models. In addition, we tried to improve the effectiveness of our prompts with the guidance
framework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Hallucination generation task</title>
        <p>The task of hallucination generation is divided into two scenarios: machine translation and paraphrasing.
The goal of the generation step is to take a source sentence and generate two LLM hypotheses: one that
is a correct translation/paraphrase of the source and one that is a hallucinated translation/paraphrase
of the source.</p>
        <p>Figure 2 gives an overview of our approach to the generation task. For this task, we
used ‘GPT-3.5 Turbo’, ‘GEMMA-7B-IT’, and ‘Llama 3’.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Hallucination detection task</title>
        <p>
          The hallucination detection task is to present the LLM with a source sentence and two hypotheses
(hyp1 and hyp2) and to determine which hypothesis is a hallucination and which is factually accurate.
Our approach used four different LLMs, ‘GPT-3.5 Turbo’, ‘Google/GEMMA-7B-IT’, ‘Llama
3’, and ‘GPT-4’, as classifiers. Additionally, we employed voting, a simple
ensemble learning technique [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to combine the outputs of these four models.
        </p>
        <p>Furthermore, we experimented with four distinct prompting techniques to give the LLMs better guidance
and improve their ability to discriminate between factual and hallucinated information.
• Type1: Simple Prompt: Using labels ‘hallucinated’ or ‘not hallucinated’.
• Type2: Complex Prompt with 0/1 Labels: Specifying the task with labels 0 or 1.
• Type3: Prompt with Definition and Examples: Including a definition of hallucination alongside
examples labeled 0 and 1.
• Type4: Prompt with Full Task Description: Describing the entire task (for instance, translation) and
the hallucination detection goal.
• Combined Prompt: Combining all the above elements.</p>
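        <p>As an illustration of how such prompt variants can be assembled from shared building blocks, the following minimal sketch composes a detection prompt from a task description, an optional definition, and optional labeled examples. The task wording follows the detection prompt reproduced in the appendix listings; the function name, its parameters, and the layout of the optional parts are hypothetical and are not claimed to match the exact prompts used in the experiments.</p>
        <preformat>
```python
# Hypothetical helper: names and structure are illustrative assumptions.
ANSWER_FORMAT = {"label": ""}

# Task wording taken from the detection prompt shown in the appendix listings.
TASK = (
    'Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to '
    'detect which of the two hypotheses ("label") is not supported by the source.'
)


def build_detection_prompt(source, hyp1, hyp2, definition=None, examples=()):
    """Assemble a detection prompt; optional parts mimic the richer prompt types."""
    parts = [TASK]
    if definition:
        # Type3-style addition: an explicit definition of hallucination.
        parts.append(f"Definition: {definition}")
    for ex in examples:
        # Type3-style addition: labeled examples, each a dict with
        # the keys src, hyp1, hyp2, and label.
        parts.append(
            f"Example\nSrc: {ex['src']}\nhyp1: {ex['hyp1']}\nhyp2: {ex['hyp2']}\n"
            f"label: {ex['label']}"
        )
    parts.append(f"Provide the result in the following format: {ANSWER_FORMAT}.")
    parts.append(f"Src: {source}\nhyp1: {hyp1}\nhyp2: {hyp2}")
    return "\n\n".join(parts)
```
        </preformat>
        <p>Calling the helper with only a source and two hypotheses yields the plain task prompt; passing a definition or examples adds the corresponding blocks without changing the rest of the prompt.</p>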
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <p>This part primarily focuses on how we prompted LLMs, along with the challenges and observations
we encountered during the task. We divide this section into three parts: generation, detection, and
cross-evaluation tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Generation task</title>
        <p>The generation task includes test sets for both paraphrasing and translation tasks.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Paraphrasing Generation Task</title>
          <p>The paraphrasing generation task involved datasets in English and Swedish, comprising 118 samples
for English and 76 samples for Swedish.</p>
          <p>We evaluated how well the different models, ‘Gemma’, ‘GPT-3.5 Turbo’, and ‘Llama 3’,
generate paraphrases for the English and Swedish datasets. In the
appendix in Section A, Figures 4 to 7 list all prompts used for the different
models. Our main observations from implementing the
generation task are:
• The performance of the ‘Gemma’ model varied significantly with the complexity of the
prompts. Simpler prompts yielded better results, which highlights the importance of prompt
design. Even so, the model struggled to understand specific instructions, such as
‘generate hallucination’. Additionally, the generation speed was notably slow.
• For ‘GPT-3.5 Turbo’, one prompt for English and one prompt for Swedish were employed. The
generation speed of ‘GPT-3.5 Turbo’ was significantly faster than that of the other models.
• For ‘Llama 3’, a single prompt was used for both the English and Swedish datasets. The
model was exceedingly slow at generating Swedish responses: after seven hours, it had
produced only five outputs.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Translation Generation Tasks</title>
          <p>In the appendix in Section A, Figures 8 to 10 list all prompts used for
the different models. The details of our implementation and our observations on the translation generation
task are as follows:</p>
          <p>In our experiments with ‘Llama 3’, we opted not to use the ‘guidance’ framework because it
performed poorly. ‘Llama 3’ showed promising results for each language pair. We experimented
with two different prompts, as shown in Figure 10, and observed instances where ‘Llama 3’ successfully
generated hypotheses in the desired target language but struggled with the source language. Examples
illustrating this phenomenon can be found in Table 19.</p>
          <p>For ‘GPT-3.5 Turbo’, various prompts were tested, and the one that was chosen, shown in Figure 9, yielded
the largest number of automatically generated translations. However, ‘GPT-3.5 Turbo’ still could not
produce translations (hyp- and hyp+) for all sources in a single pass. The main issue was the variation in quotation marks,
which caused problems during the extraction process. As a result, we had to prompt some sentences
individually (instead of looping over them as a group) so that the structure was recognized by
GPT again. The following list shows the number of sources that had to be handled individually.</p>
          <p>• German to English: 12 sources, of which 3 needed to be translated manually.
• English to German: 10 sources, of which 2 needed to be translated manually.
• French to English: 19 sources, of which 3 needed to be translated manually.
• English to French: 64 sources, none of which needed to be translated manually.</p>
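          <p>The quotation-mark problem described above can be reduced by normalizing typographic quotes before extraction. The following minimal sketch assumes the model answers in the format "hyp+": "", "hyp-": "" requested in our generation prompts; the function name, the quote mapping, and the regular expression are illustrative assumptions rather than the extraction code used in the experiments. Hypotheses that themselves contain double quotes would still need manual handling.</p>
          <preformat>
```python
import re

# Map typographic quote characters to plain ASCII quotes (illustrative subset).
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u00ab": '"', "\u00bb": '"',
    "\u2018": "'", "\u2019": "'",
})

# Matches "hyp+": "..." or "hyp-": "..." once quotes have been normalized.
PAIR_RE = re.compile(r'"(hyp[+-])"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_hypotheses(raw):
    """Return a dict with the keys hyp+ and hyp- parsed from a raw model answer."""
    normalized = raw.translate(QUOTE_MAP)
    found = {key: value for key, value in PAIR_RE.findall(normalized)}
    missing = {"hyp+", "hyp-"} - found.keys()
    if missing:
        # Signal that this source has to be prompted or handled individually.
        raise ValueError(f"missing fields: {sorted(missing)}")
    return found
```
          </preformat>
          <p>Answers for which the function raises an error correspond to the sources that would have to be prompted individually.</p>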
          <p>Translating from English was a smoother process for the Gemma model compared to translating to
English.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Detection task</title>
        <p>The detection task involves trial and test sets for both scenarios.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Paraphrasing Detection Task</title>
          <p>Table 1 shows the number of samples for each trial and test set for the paraphrasing detection task. The
trial dataset for the paraphrasing detection task is structured as follows:
• id: Unique identifier of the example.
• source: Original model input for paraphrase generation.
• hyp1: First alternative paraphrase of the source.
• hyp2: Second alternative paraphrase of the source.
• label: hyp1 or hyp2, based on which of those has been annotated as hallucination.
• type: Hallucination category assigned. Possible values include:
– addition
– named-entity
– number
– conversion
– date
– tense
– negation
– gender
– pronoun
– antonym
– natural</p>
          <p>We compared the performance of different models on the trial dataset using distinct prompts. Some
prompts used for the paraphrasing detection task on the trial dataset are presented in Figure 11.
Additionally, Tables 20 to 32 report the performance of various prompts on the trial dataset for both
English and Swedish.</p>
          <p>A challenge with Gemma was its tendency to generate code within responses. We therefore required a
specific ‘JSON’ format to ensure that the output could be retrieved. Figure 12 shows an example of output
generated by Gemma. Figures 13 to 17 display the prompts employed in the paraphrasing detection task across
various models for the test set.</p>
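          <p>To illustrate how a label can be retrieved from a response that mixes code and JSON, the following minimal sketch scans the response for flat JSON objects and returns the first valid label. It assumes the answer format {"label": ""} used in our detection prompts; the function name and the fallback value None are illustrative assumptions.</p>
          <preformat>
```python
import json
import re

VALID_LABELS = {"hyp1", "hyp2"}


def extract_label(response):
    """Return 'hyp1' or 'hyp2' from the first parsable JSON object, else None."""
    # Only flat {...} objects are considered; nested objects are out of scope here.
    for candidate in re.findall(r"\{[^{}]*\}", response):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            # Braces that belong to generated code rather than JSON are skipped.
            continue
        label = parsed.get("label")
        if isinstance(label, str) and label in VALID_LABELS:
            return label
    return None
```
          </preformat>
          <p>A return value of None marks a response for which no label could be recovered, so that it can be inspected or re-prompted.</p>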
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Translation Detection Task</title>
          <p>The following details are provided about the translation detection dataset.</p>
          <p>• Both trial and test datasets include data for four language pairs as follows:
– de-en: Source language: German, Target language: English
– en-de: Source language: English, Target language: German
– fr-en: Source language: French, Target language: English
– en-fr: Source language: English, Target language: French
• The trial dataset included 10 data entries, with 5 entries featuring hallucination as hyp1 and the
other 5 as hyp2. The structure of the trial dataset is illustrated below:
– id: Unique identifier of the example.
– langpair: Language of source and hypotheses pair
– source: Source Text
– hyp1: First alternative translation of the source.
– hyp2: Second alternative translation of the source.
– type: Hallucination category assigned. Possible values include:
∗ addition
∗ named-entity
∗ number
∗ conversion
∗ date
∗ tense
∗ negation
∗ gender
∗ pronoun
∗ antonym
∗ natural
– label: hyp1 or hyp2, based on which of those has been annotated as hallucination
• In the test collection, there are 100 data samples for each language pair.</p>
          <p>The structure of the test dataset is presented as follows:
– id: Unique identifier of the example.
– langpair: Language of source and hypotheses pair
– source: Source Text
– hyp1: First alternative translation of the source.
– hyp2: Second alternative translation of the source.</p>
          <p>Our implementation and observations of the translation detection task are delineated below,
categorized according to each model.</p>
          <p>Observations for Llama 3. We experimented with 15 different prompts for the ‘Llama 3’
model. Among these, the prompt shown in Figure 18(a) yielded the best results. Table 33
reports the results achieved with this prompt on the trial dataset. We therefore chose it for the
final detection task.</p>
          <p>The main observations for Llama 3 are:
• ‘Llama 3’ is not able to detect a label for every data entry (support is only 4 for each of hyp1 and
hyp2). Figure 18 shows the prompts used with ‘Llama 3’ on the test set.
• When detecting the hallucination, ‘Llama 3’ gives explanations, such as: ‘I chose hyp1 as the
hallucination because it contains a date (December 5) that is not present in the source text. The
source text only mentions the date August 5, but hyp1 provides a different date.’ The first row in
Table 34 shows this issue.
• As both examples in Table 34 indicate, ‘Llama 3’ exhibits gender bias. In the first example, it failed
to recognize that the feminine noun ‘Wirtschaftsprüferin’ denotes a female auditor and labeled it as
gender-neutral. It made a gender assumption for hyp1, assuming a male auditor. Similarly, in
the second example, ‘Llama 3’ struggled to understand the clear indication of a female secretary
given by the word ‘Sekretärin.’
• ‘Llama 3’ struggles with understanding and converting measurements and could not recognize
when quantities expressed in different units are equivalent. For instance, it sees ‘kilometers’ and treats it
as different from ‘metres’, which leads it to mistakenly identify that text as a hallucination.
Additionally, ‘Llama 3’ assumes that hyp2 is the hallucination because it contains
‘kilometers’ instead of ‘km’, and it fails to consider that hyp1 also uses ‘metres’ instead of
‘km.’ Table 35 highlights this issue.
• ‘Llama 3’ struggles to recognize the different ways dates can be written. As shown in Table 36, it
could not understand that ‘21. Januar’ and ‘Jan. 21st’ refer to the same date.
• Finally, we noticed that the prompt strongly influences the output of ‘Llama 3’. Depending on the
prompt, ‘Llama 3’ either was or was not able to detect the gender, the conversion, or the correct
date. For example, the first example in Table 37 shows that, with the prompt shown in Figure
18(b), ‘Llama 3’ correctly explains that ‘Wirtschaftsprüferin’ refers to a female auditor
but then mistakenly swaps hyp1 and hyp2. Additionally, as shown in the second
row of this table, the new prompt allows ‘Llama 3’ to detect the correct gender indicated in the
source text. However, ‘Llama 3’ still fails to assign the correct label. In the third row, we can see
that ‘Llama 3’ correctly converts 65 km to 65,000 meters and identifies the hallucination in hyp2.
Additionally, ‘Llama 3’ correctly identifies the wrong date in the last example. The primary issue
with this prompt is that ‘Llama 3’ frequently fails to identify any hallucinations in certain data
samples.</p>
          <p>Observations for GPT-3.5 Turbo and GPT-4. The approach used for ‘GPT-3.5 Turbo’ was replicated
for ‘GPT-4’ so that their comprehension could be compared directly. Various prompts were tested, and two were selected based
on the best results from previous trials.</p>
          <p>The main observations for GPT-3.5 Turbo and GPT-4 are:
• There were some samples where no hallucinations were detected. Table 38 displays the count of
failed examples for ‘GPT-4’ and ‘GPT-3.5 Turbo’ in the translation detection task. Additionally,
Table 39 lists some samples for which ‘GPT-4’ failed to assign labels with our explanation for
each one.
• Regarding the prompts for GPT models, both seem to encounter issues with misinterpretations
or slightly inaccurate translations. Additionally, both struggle to identify incorrect pronouns.
• Initially, during the phase with incorrect trial datasets, ‘GPT-3.5 Turbo’ had
difficulty recognizing hallucinations when names were slightly misspelled or had an extra letter
appended.</p>
          <p>Observations for Gemma. We tried various prompts; Gemma performed better (80% accuracy) at
detecting the correct label when it was first asked to translate the hypothesis into the language of the
source and then to detect hallucinations. Figure 19 shows the prompts used with Gemma on the test set.</p>
          <p>The main observations for Gemma are:
• The performance was significantly worse when the prompts were too scientific or contained too
many technical terms.
• Tricky samples for Gemma in the detection task include those that require detecting the gender relative to
the source (female/male) and identifying incorrect numbers, such as numbers with missing zeros.</p>
          <p>Observations for the ensemble voting approach. We opted for a straightforward voting approach
to combine the model predictions because of the limitations imposed by the small sample size of the trial set.
This method ensured that all models contributed equally.</p>
          <p>Since we compared an even number of models, there were instances where two models voted for
hyp1 and the other two voted for hyp2. In these cases, we randomly selected the label.</p>
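          <p>The voting rule with random tie-breaking can be sketched as follows. The sketch assumes each model's prediction is available as the string ‘hyp1’ or ‘hyp2’; the function name, the optional random generator, and the handling of missing predictions (passed as None and skipped) are illustrative assumptions rather than a description of the exact code used in the experiments.</p>
          <preformat>
```python
import random
from collections import Counter


def majority_vote(predictions, rng=None):
    """Return the most frequent label; ties are broken uniformly at random."""
    if rng is None:
        rng = random.Random()
    # Assumption: models that returned no label are passed as None and skipped.
    votes = [p for p in predictions if p is not None]
    if not votes:
        return None
    counts = Counter(votes)
    top = max(counts.values())
    # Sort the tied labels so that a seeded generator gives reproducible results.
    winners = sorted(label for label, n in counts.items() if n == top)
    if len(winners) == 1:
        return winners[0]
    return rng.choice(winners)
```
          </preformat>
          <p>Passing a seeded generator, for example random.Random(0), makes the tie-breaking reproducible across runs.</p>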
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Cross-evaluation task</title>
        <p>The following provides detailed information regarding the cross-evaluation task. Table 2 presents
information regarding the samples included in the paraphrasing task. The prompt shown in Figure 20
was used for the English paraphrasing task.</p>
        <p>In the translation task, sometimes none of the models detected any hallucinations in either hypothesis,
which resulted in some blank spaces in the CSV file due to the lack of predictions. There were instances
where no hallucinations were present because both hypotheses, hyp1 and hyp2, were the same.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This part presents the results in detail for each task and scenario. It is worth noting that prior to
showing the results from LLMs, Logistic Regression and Random Forest classifiers were used for an
initial evaluation to establish a baseline performance for comparison with LLMs. Both LR and RF
classifiers achieved similar performance with an F1-score of 0.5.</p>
      <p>
        For the evaluation of the generation task, the lab employed a zero-shot text classification natural
language inference (NLI) model (‘bge-m3-zeroshot-v2.0’) to predict whether
‘hyp+’ is entailed by the source sentence and whether ‘hyp-’ contradicts the source sentence. They
used only two labels: ‘entailment’ and ‘not_entailment’. This approach helps us assess whether the
systems can produce coherent hyp+/hyp- pairs. It is important to note that the performance of the
classification model is not perfect, but it demonstrated reasonable performance on the detection test set
across various languages and language pairs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>For evaluating both detection and cross-model tasks, the lab reported key metrics such as Accuracy,
F1-score, Precision, and Recall for each model. Additionally, several baseline models were evaluated
by the lab. For cross-model assessment, the lab also employed two metrics: Matthews Correlation
Coefficient (MCC) and Cohen’s Kappa.</p>
      <p>
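        The average MCC (<inline-formula><tex-math>\overline{\mathrm{MCC}}</tex-math></inline-formula>) measures the quality of binary classifications by considering true and
false positives and negatives, while the standard deviation of MCC (<inline-formula><tex-math>\sigma_{\mathrm{MCC}}</tex-math></inline-formula>) provides insight into the
consistency of the model’s performance. Similarly, the average Kappa (<inline-formula><tex-math>\bar{\kappa}</tex-math></inline-formula>) measures inter-rater reliability
for categorical items, and the standard deviation of Kappa (<inline-formula><tex-math>\sigma_{\kappa}</tex-math></inline-formula>) indicates the variability or consistency
of the Kappa metric [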
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Tables 3 to 5 demonstrate the evaluation of detection, generation, and cross-model evaluation for
English paraphrasing tasks.</p>
      <p>The performance of detection across various models on the English paraphrasing task is presented in
Table 3. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v2’ achieved the highest
performance with Accuracy, F1-score, Precision, and Recall scores of 0.91.</p>
      <p>Table 4 presents the results for the generation step. The model ‘GPT-3.5 Turbo’ with prompt
‘En_Para_Gen_GPT3.5’ achieved the highest performance in hyp+ entailment mean (0.964) and hyp+
correct label mean (0.983). Furthermore, the model ‘Llama 3’ with prompt ‘En_Para_Gen_Llama3’
showed strong performance in hyp- not entailment mean (0.978) and hyp- correct label mean (0.983).</p>
      <p>Table 5 presents the results for the cross-model evaluation. The model ‘GPT-4’ with prompt
‘final_gpt4_en_v2_cross_model_detection’ showed Accuracy, F1-score, Precision, and Recall scores of 0.93.
In the next stage, the majority model with prompt ‘majority_vote_cross_model_result_en’ demonstrated
impressive performance.</p>
      <p>Table 6 shows that the model with prompt ‘majority_vote_cross_model_result_en’ achieved the
highest performance with an average MCC of 0.83 and average Kappa of 0.81.</p>
      <p>Table 7 presents the performance metrics of various models in the detection step for the Swedish
paraphrasing task. The model ‘GPT-4’ with prompt ‘En_Se_Para_Det_GPT3.5_GPT4_v1 (GPT4)’ achieved
an Accuracy score of 0.81, with consistent scores across all metrics (F1 = 0.81, Precision = 0.81, Recall
= 0.81). However, the baseline ‘baseline-bge-m3-zeroshot-v2.0/sv_bge-m3-zeroshot-v2.0’ shows the
highest Accuracy of all models, at 0.92.</p>
      <p>Table 8 summarizes the results of models in the generation step for Swedish paraphrasing where
the focus is on metrics related to hypothesis entailment and not_entailment. The model ‘GPT-3.5
Turbo’ with prompt ‘Se_Para_Gen_GPT3.5’ demonstrated strong performance with high scores in hyp+
entailment mean of 0.88, hyp+ correct label mean of 0.90, hyp- contradiction mean of 0.91, and
hyp- correct label mean of 0.93.</p>
      <p>Tables 9 and 10 present cross-model evaluation results for the Swedish paraphrasing task and
highlight model performance across different evaluation criteria. The majority voting model with
prompt ‘majority_vote_cross_model_result_se’ showed competitive performance. Table 10 provides
statistical measures for the models, excluding baselines, which indicate that the majority voting technique
is consistent and reliable.</p>
      <p>Tables 11 to 14 report the performance of English-French translation detection, generation, and
cross-model evaluation.</p>
      <p>Table 11 highlights several key points regarding the performance of the detection task. The model
‘GPT-4’ with prompt ‘results_gpt4_en_fr’ achieved the highest performance, with Accuracy, F1-score, and
Recall of 0.90 and Precision of 0.91. The majority voting model with
prompt ‘majority_vote_result_en_fr’ also performed well, with Accuracy, F1-score, and Recall of 0.83
and Precision of 0.86.</p>
      <p>One conclusion that can be drawn from Table 12 is that the baseline model
‘baseline-generalprompt/en-fr.gen’ performed better on the hyp+ metrics, with a hyp+ entailment mean of 0.90 and a hyp+ correct label
mean of 0.93, but worse on the hyp- metrics, with a hyp- contradiction mean of 0.10 and a hyp- correct label
mean of 0.08. In contrast, the model ‘GPT-3.5 Turbo’ with prompt ‘results_gpt_en_fr’ demonstrated
high performance, with a hyp- contradiction mean of 0.88 and a hyp- correct label mean of 0.91.</p>
      <p>Tables 13 and 14 show that the majority voting approach with prompt
‘majority_vote_result_en_fr’ reached an Accuracy of 0.79, an F1-score of 0.78, a Precision of 0.80, and a Recall of 0.79. It
also exhibited the highest average MCC (0.66) and average Kappa (0.65).</p>
      <p>Tables 15 to 18 report the results for the evaluation of the English-German translation detection,
generation, and cross-model.</p>
      <p>The important observation from Table 15 is that the model ‘GPT-4’ with prompt
‘results_gpt4_en_de’ showed the highest performance, with Accuracy, F1-score, and Recall all at 0.86
and Precision at 0.89.</p>
      <p>From Table 16 we can see that the model ‘GPT-3.5 Turbo’ with prompt
‘results_gpt_en_de’ performed better on the hyp- metrics, with a hyp- contradiction mean of 0.83 and a hyp- correct
label mean of 0.84. ‘Gemma’ with prompt ‘En_De_Trans_Gen_gamma’ showed the best hyp+ correct
label mean of 0.85. Additionally, ‘baseline-phenomena-mentions-prompt/en-de.gen’ provides a better
hyp+ entailment mean of 0.84.</p>
      <p>Tables 17 and 18 provide the insight that the model ‘GPT-3.5 Turbo’ with ‘results_gpt_en_de’ had
the highest Accuracy of 0.76, F1 score of 0.75, Precision of 0.77, and Recall of 0.76. The prompt
‘majority_vote_result_en_de’ for majority voting had the highest average MCC of 0.60 and average
Kappa of 0.58 which indicates strong inter-model agreement and consistency.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, this study leveraged several LLMs to investigate both the generation and detection of
hallucinations by LLMs themselves. Each of the four models employed presented its own
evaluation challenges. We explored various prompting techniques, including few-shot learning and chain of
thought, using the guidance framework. Additionally, for the detection task, we tested an ensemble
voting approach to combine the results from different LLMs. Although we achieved
better results than the baseline models in this study, our findings indicate that while some issues can be
addressed through effective prompting, others remain difficult to mitigate by prompt engineering alone.
Moreover, identifying the optimal prompt itself poses a significant challenge.
</p>
      <p>A. Appendix</p>
      <p>All nouns, alongside the word Sie for you, always begin with a capital letter, even in the middle of a sentence.
All nouns, alongside the word Sie for you, always begin with a capital letter, even in the middle
of a sentence, except for those that are part of a title or a proper noun.
hyp-: All nouns, alongside the word sie for you, always begin with a capital letter, even in the middle
of a sentence.</p>
      <p>The final line of the third verse was changed during the reign of Alexander I of
Yugoslavia in “Kralja Aleksandra, Bože hrani”.
The final line of the third verse was modified during the reign of Alexander I of Yugoslavia in
’Kralja Aleksandra, Bože hrani’.
hyp-: The final line of the third verse was rewritten during the reign of Alexander the Great in
’Kralja Aleksandra, Bože hrani’.</p>
      <p>Listing 7: Se_Para_Gen_Gemma_v2</p>
      <p>user_prompt = f'''
You are a text generator and your task is to generate two translation hypothesis given the 'src' below.
The first translation labelled as 'hyp+' should be supported by 'src' and the second translation labelled as 'hyp-' should not be supported by 'src'.
Provide the result in the following format: "hyp+": "", "hyp-": "". Target language: "English"</p>
      <sec id="sec-5-2">
        <title>Weighted avg hyp1 hyp2 Accuracy</title>
        <p>Listing 11: En_De_Trans_Gen_llama3_v2, other language pairs contain an example in the associated language
Listing 12: Swedish Prompt 1.</p>
        <p>answer_format = {"label": ""}
user_prompt = f'''
&lt;start_of_turn&gt;user
You are a researcher investigating a new phenomenon. You have gathered data ({source}) and formulated two competing hypotheses (hyp1: {hyp1}, and hyp2: {hyp2}) to explain it.
Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {answer_format}.</p>
        <p>Listing 13: English Prompt 2.</p>
        <p>answer_format = {"label": ""}
user_prompt = f'''
&lt;start_of_turn&gt;user
Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.
Provide the result in the following format: {answer_format}.
Src: {source}
hyp1: {hyp1}
hyp2: {hyp2}</p>
        <p>Listing 14: English Prompt 1 and Swedish Prompt 2.</p>
      </sec>
      <sec id="sec-5-8">
        <title>Macro avg Weighted avg</title>
        <preformat>result = {'label': 'hyp1'}
src = "The population has declined in some 210 of the 280 municipalities in Sweden, mainly in inland central and northern Sweden."
"In the majority of Sweden's 280 municipalities, the population has gone up."
"In the majority of Sweden's 280 municipalities, the population has gone down."</preformat>
        <p>Listing 15: En_Para_Gen_Gemma_v1</p>
        <preformat>answer_format = {"label": ""}
user_prompt = f'''
&lt;start_of_turn&gt;user
Givet en ”src” och två hypoteser ”hyp1” och ”hyp2” är din uppgift att upptäcka vilken av de två hypoteserna (”label”) som inte stöds av källan.

Ge resultatet i följande format: {answer_format}.</preformat>
        <p>Listing 16: En_Para_Det_Gemma_v1</p>
        <p>Listing 17: Se_Para_Det_Gemma_v1</p>
        <p>Listing 18: En_Se_Para_Det_Gemma_v2</p>
        <preformat>user_prompt = f'''
Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.

Provide the result in the following format: {answer_format}.</preformat>
        <p>Listing 19: En_Se_Para_Det_GPT3.5_GPT4_v1</p>
        <p>Listing 20: En_Se_Para_Det_GPT3.5_GPT4_v2</p>
        <preformat>'''Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.

Provide the result in the following format: {answer_format}.</preformat>
        <preformat>'''Givet en ”src” och två hypoteser ”hyp1” och ”hyp2” är din uppgift att upptäcka vilken av de två hypoteserna (”label”) som inte stöds av källan.

Ge resultatet i följande format: {answer_format}.
Resultat:
'''</preformat>
        <p>Listing 21: En_Para_Det_Llama3_v1</p>
        <p>Listing 22: Se_Para_Det_Llama3_v1</p>
        <preformat>'''You are a researcher investigating a new phenomenon.
You have gathered data (source) and formulated two competing hypotheses (hyp1 and hyp2) to explain it.

Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {answer_format}.

source: {source}
hyp1: {hyp1}
hyp2: {hyp2}

Result:'''</preformat>
        <p>Listing 23: En_Se_Para_Det_Llama3_v2</p>
        <p>Listing 24: En_De_Trans_Det_llama3_v1; the prompts for the other language pairs contain an example in the associated language.</p>
        <preformat>user_prompt = f'''
You are a researcher investigating a new phenomenon. You have gathered data (source) and formulated two competing hypotheses (Hyp1 and Hyp2) to explain it. Identify the hypothesis that contradicts the information provided in the given source.
Provide the result in the following format: {{"label": ""}}.

Src: {source}
hyp1 : {hyp1}
hyp2 : {hyp2}

Result:'''</preformat>
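        <p>A note on the literal answer format in such prompts: inside a Python f-string, single braces delimit replacement fields, so a literal JSON template has to be written with doubled braces. The short sketch below illustrates this with placeholder values for source, hyp1 and hyp2; those values are invented for the illustration and are not taken from the dataset.</p>
        <preformat>
```python
# Placeholder inputs, invented for this illustration.
source = "Quelltext"
hyp1 = "first hypothesis"
hyp2 = "second hypothesis"

# Doubled braces produce literal braces; single braces insert variables.
user_prompt = f'''
Provide the result in the following format: {{"label": ""}}.

Src: {source}
hyp1 : {hyp1}
hyp2 : {hyp2}

Result:'''

print(user_prompt)
```
        </preformat>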
        <p>Listing 25: En_De_Trans_Det_llama3_v2, En_Fr_Trans_Det_llama3_v2</p>
        <p>The lawyer told the male auditor to provide documentation.</p>
        <p>hyp2</p>
        <p>The center of the earthquake was about 20 km northeast of Dillon and about 65000 metres south of Butte.</p>
        <p>hyp2</p>
        <p>The center of the earthquake was about 20 km northeast of Dillon and about 65000 kilometres south of Butte.</p>
        <p>Llama 3’s explanation: I chose hyp2 as the hallucination because it adds the unit "kilometres" to the distance, which is not present in the original text. The original text uses "km" and "Meilen", which are equivalent units, but hyp2 introduces a new unit "kilometres". This is a type of hallucination where a new information is added to the original text, specifically a unit conversion.</p>
        <p>hyp2</p>
        <p>"Perry erklärte, er werde „nach Texas zurückkehren, um das Ergebnis der Vorwahl von heute Abend auszuwerten und zu entscheiden, ob es in diesem Rennen einen weiterführenden Weg für mich gibt"", aber sagte später, er werde im Rennen bleiben und bei den Vorwahlen in South Carolina am 21. Januar antreten."</p>
        <p>"""Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on March 21."</p>
        <p>"""Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on Jan. 21."</p>
        <p>Llama3’s explanation: I chose hyp2 as the hallucination because it contains a specific date ("Jan. 21"), which is not present in the original source text. The original text only mentions the date "21. Januar", which is a more general and less specific date. This suggests that hyp2 contains a hallucinated detail. The type of hallucination is time, as it involves the introduction of a specific and precise time element.</p>
        <p>Listing 26: En_De_Trans_Det_gemma_v1, En_Fr_Trans_Det_gemma_v1</p>
        <preformat>Result:
&lt;start_of_turn&gt;model '''</preformat>
        <p>Listing 27: De_En_Trans_Det_gemma_v2, En_De_Trans_Det_gemma_v2, En_Fr_Trans_Det_gemma_v2, Fr_En_Trans_Det_gemma_v2</p>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>src</th>
                <th>hyp1</th>
                <th>hyp2</th>
                <th>Llama3’s explanation</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Der Anwalt forderte die Wirtschaftsprüferin auf, Unterlagen vorzulegen.</td>
                <td>The lawyer told the male auditor to provide documentation.</td>
                <td>The lawyer told the female auditor to provide documentation.</td>
                <td>The source states "Der Anwalt forderte die Wirtschaftsprüferin auf, Unterlagen vorzulegen.", which translates to "The lawyer demanded the female auditor to provide documentation." This implies that the auditor is a woman, contradicting Hyp2 which states that the auditor is a man. result_text ’label’: ’hyp2’</td>
              </tr>
              <tr>
                <td>Der Arbeiter grub ein Loch für die Sekretärin.</td>
                <td>The laborer dug a hole for the female secretary.</td>
                <td>The laborer dug a hole for the male secretary.</td>
                <td>The source states "Der Arbeiter grub ein Loch für die Sekretärin", which means "The laborer dug a hole for the secretary". This clearly indicates that the secretary is a female, as "Sekretärin" is the German word for "female secretary". Therefore, Hyp2, which suggests that the laborer dug a hole for the male secretary, contradicts the information provided in the source. result_text ’label’: ’hyp2’</td>
              </tr>
              <tr>
                <td>Das Zentrum des Erdbebens befand sich etwa 20 km (15 Meilen) nordnordöstlich von Dillon und etwa 65 km (40 Meilen) südlich von Butte.</td>
                <td>The center of the earthquake was about 20 km northeast of Dillon and about 65000 metres south of Butte.</td>
                <td>The center of the earthquake was about 20 km northeast of Dillon and about 65000 kilometres south of Butte.</td>
                <td>Explanation: The source states that the distance from the earthquake’s center to Butte is approximately 40 miles (65 km), not 65,000 kilometres. Hyp2 contradicts this information by stating a much larger distance, while Hyp1 correctly translates the original distance from miles to meters. result_text ’label’: ’hyp2’</td>
              </tr>
              <tr>
                <td>"Perry erklärte, er werde „nach Texas zurückkehren, um das Ergebnis der Vorwahl von heute Abend auszuwerten und zu entscheiden, ob es in diesem Rennen einen weiterführenden Weg für mich gibt"", aber sagte später, er werde im Rennen bleiben und bei den Vorwahlen in South Carolina am 21. Januar antreten."</td>
                <td>"""Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on March 21."</td>
                <td>"""Perry stated that he will return to Texas to evaluate the result of tonight’s pre-election and decide whether there is a path forward for me in this race,"" but later said he would stay in the race and run for the pre-election in South Carolina on Jan. 21."</td>
                <td>Explanation: The source states that the event in question is on "heute Abend" (tonight), which corresponds to January 21. However, Hyp1 states that the event is on "March 21", which contradicts the information provided in the source. result_text ’label’: ’hyp1’</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>de-en, en-de, en-fr, fr-en</p>
        <p>2, 8, 2, 7</p>
        <p>GPT-4: prompt1, prompt2</p>
        <p>GPT-3.5: prompt1, prompt2</p>
        <p>1, 15, 7, 3, 0, 1, 0, 1, 0, 1, 1, 2</p>
        <preformat>'''Given a "src" and two hypotheses "hyp1" and "hyp2" your task is to detect which of the two hypotheses ("label") is not supported by the source.

Provide the result in the following format:
{answer_format}.

hyp1: {hyp1}
hyp2: {hyp2}'''</preformat>
        <p>Listing 28: majority_vote_cross_model_result_en</p>
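        <p>The listing name suggests that the labels predicted by several models are combined by majority vote. A minimal sketch of such a vote is given below. The function name majority_vote is hypothetical, and both the handling of ties and of missing predictions are assumptions, since neither is specified in this section.</p>
        <preformat>
```python
from collections import Counter


def majority_vote(predictions, tie_break=None):
    """Return the most frequent valid label, or tie_break on a tie or no votes."""
    # Ignore missing or malformed predictions (assumed behaviour).
    votes = Counter(p for p in predictions if p in ("hyp1", "hyp2"))
    if not votes:
        return tie_break
    ranked = votes.most_common()
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


if __name__ == "__main__":
    print(majority_vote(["hyp1", "hyp2", "hyp2"]))
```
        </preformat>
        <p>With an odd number of models that all return a valid label, a tie cannot occur; the tie_break argument only matters when a model fails to assign a label or the number of voters is even.</p>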
        <p>Die Mittel könnte man für hochwassersichere Häuser, eine bessere Wasserverwaltung und Nutzpflanzendiversifizierung verwenden.</p>
        <p>Es zeigt 362 verschiedene alte Holzarten, Büsche und 236 verschiedene Obstbaumarten.</p>
        <p>It shows 362 different old species of wood, bushes and 236 different species of fruit trees.</p>
        <p>de-en, de-en, en-de, en-de, en-de, en-de, en-de, en-de</p>
        <p>The world has over 5,000 different languages, more than twenty with 50 million or more speakers.</p>
        <p>1i Productions is an American board game publisher. It was founded in 2004 by Colin Byrne, William and Jenna.</p>
        <p>Mats Wilander defeats Anders Järryd, 6 – 4, 3 – 6, 7 - 5.</p>
        <p>They have feet with scales and claws, they lay eggs, and they walk on their two back legs like a T-Rex.</p>
        <p>The NSA has its own internal data format that tracks both ends of a communication, and if it says, this communication came from America, they can tell Congress how many of those communications they have today, right now.</p>
        <p>Different interpretations of flood-proof</p>
        <p>wrong tense</p>
        <p>filler</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chenaghlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Large language models: A survey</article-title>
          ,
          <source>CoRR</source>
          abs/2402.06196 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2402.06196. doi:10.48550/ARXIV.2402.06196. arXiv:2402.06196.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <article-title>ELOQUENT CLEF shared tasks for evaluation of generative language model quality</article-title>
          ,
          <source>in: Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part V</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2024</year>
          , p.
          <fpage>459</fpage>
          -
          <lpage>465</lpage>
          . URL: https://doi.org/10.1007/978-3-031-56069-9_63. doi:10.1007/978-3-031-56069-9_63.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] tiiuae, Falcon-11B, https://huggingface.co/tiiuae/falcon-11B, Accessed on 2024-05-23.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Introducing mpt-7b: A new standard for open-source, commercially usable llms</article-title>
          ,
          <year>2023</year>
          . URL: www.mosaicml.com/blog/mpt-7b, accessed: 2023-05-05.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mickus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vahtola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Segonne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apidianaki</surname>
          </string-name>
          ,
          <article-title>SemEval-2024 shared task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes</article-title>
          ,
          <source>arXiv preprint arXiv:2403.07726</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] AI@Meta,
          <article-title>Llama 3 model card</article-title>
          (
          <year>2024</year>
          ). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Hugging Face, Meta-Llama-3-8B-Instruct, https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct, Accessed on 2024-05-23.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] OpenAI, GPT-3.5 Turbo, n.d. URL: https://platform.openai.com/docs/models/gpt-3-5-turbo.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Model endpoint compatibility, n.d. URL: https://platform.openai.com/docs/models/model-endpoint-compatibility.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Gemma Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dadashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhupatiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Love</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tafti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hussenot</surname>
          </string-name>
          , et al.,
          <article-title>Gemma</article-title>
          (
          <year>2024</year>
          ). URL: https://www.kaggle.com/m/3301. doi:10.34740/KAGGLE/M/3301.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] Google, gemma-7b, https://huggingface.co/google/gemma-7b, Accessed on 2024-05-23.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          Microsoft,
          <article-title>guidance-ai/guidance: A guidance language for controlling generative models</article-title>
          , https://github.com/guidance-ai/guidance,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          , et al.,
          <article-title>Ensemble learning</article-title>
          ,
          <source>The handbook of brain theory and neural networks</source>
          <volume>2</volume>
          (
          <year>2002</year>
          )
          <fpage>110</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chicco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Warrens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jurman</surname>
          </string-name>
          ,
          <article-title>The Matthews correlation coefficient (MCC) is more informative than Cohen's kappa and Brier score in binary classification assessment</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>78368</fpage>
          -
          <lpage>78381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>Table 39 Samples for which GPT-4 failed to assign labels in the translation detection task</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>