<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luise Dürlich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evangelia Gogoulou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liane Guillou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joakim Nivre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shorouq Zahra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RISE Research Institutes of Sweden</institution>
          ,
          <addr-line>Stockholm</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination. Specifically, we investigate whether LLMs can be used to both generate and detect hallucinated content. In the cross-model evaluation setting we take this a step further and explore the viability of using an LLM to evaluate output produced by another LLM. We include generation, detection, and cross-model evaluation steps for two scenarios: paraphrase and machine translation. Overall, we find that the performance of the baselines and submitted systems is highly variable. However, initial results are promising, and lessons learned from this year's task will provide a solid foundation for future iterations of the task. In particular, we highlight that human validation of the generated output would ideally be required to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative language models</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Hallucinations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• Detection: Given a source sentence and two paraphrase/translation hypotheses (ℎ1 and ℎ2),
the model should detect which of the two contains a hallucination.</p>
      <p>As an additional challenge, we also perform the detection step in a cross-model setting, where the
participant models perform the detection step on the model outputs from the generation step.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>
        For each of the two scenarios, i.e. paraphrase generation or machine translation, we construct a dataset
with the following fields: a source sentence, a correct hypothesis of the source, a hallucinated hypothesis
of the source, and the type of hallucination demonstrated in the hallucinated hypothesis. Our datasets
include hallucinations of the following categories: addition, named-entity, number, conversion, date,
gender, pronoun, antonym, tense, negation, and natural hallucinations. With the exception of tense
and negation, the hallucination types are identical to the types of translation errors
identified in ACES [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. All our datasets are available on Hugging Face.<sup>1</sup> The process of dataset creation
for each scenario is described below.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Machine Translation</title>
        <p>
          For the translation scenario we leveraged ACES [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a challenge set for evaluating the performance of
Machine Translation (MT) metrics on a range of translation accuracy errors. Each example in ACES
already follows the structure that we use in the HalluciGen task and ACES already contains errors for
the en⇔fr and en⇔de language pairs for all but two of the phenomena we are interested in. For the
tense and negation categories, which do not exist in ACES, we constructed examples from the PAWS-X
dataset [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] of adversarial paraphrases.
        </p>
        <p>For tense examples we filter the PAWS-X dataset to select English examples labelled as paraphrases,
then take sentence1 of each example and use spaCy<sup>2</sup> to tokenise and part-of-speech tag the
sentence. We then identify a verb and its tense using the Penn Treebank-style tags output by the
spaCy pipeline, and inflect it for a different tense using the pyinflect<sup>3</sup> Python library. We change tense
between past, present, and future (the latter by injecting the token “will”). The original sentence1 forms the good
translation, the perturbed version is the incorrect translation, and we pair the English sentence with
the corresponding French/German translation in PAWS-X (which forms the source sentence). Negation
examples are created by automatically extracting English paraphrase examples in PAWS-X that contain
a negation and manually editing sentence1 to construct an incorrect translation, e.g. by inserting an
(extra) negation or by modifying the polarity of a sentence that already contains a negation. We consider
lexical negation (e.g. the affixes “un” and “dis”) and negation tokens (e.g. not, n’t, never). Again, we pair
the English target sentences with a corresponding French/German source sentence from PAWS-X.</p>
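        <p>The tense perturbation described above can be illustrated with a minimal sketch. It is not the code used to build the dataset: the tokens are assumed to arrive already lemmatised and tagged with Penn Treebank tags (as a spaCy pipeline would provide), the names perturb_tense and toy_inflect are hypothetical, and a small lookup table stands in for the inflection library. Mapping every present-tense target to the VBZ tag is a simplification made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List, Optional, Tuple

# One token as (surface form, lemma, Penn Treebank tag); in practice a tagger supplies these.
Token = Tuple[str, str, str]

# Hypothetical stand-in for an inflection library: (lemma, target tag) -> inflected form.
TOY_INFLECTIONS = {
    ("play", "VB"): "play",
    ("play", "VBD"): "played",
    ("play", "VBZ"): "plays",
}


def toy_inflect(lemma: str, tag: str) -> Optional[str]:
    """Look up an inflected form; returns None when the toy table has no entry."""
    return TOY_INFLECTIONS.get((lemma, tag))


# Tense signalled by the tag of a finite verb.
SOURCE_TENSE = {"VBD": "past", "VBZ": "present", "VBP": "present"}
# Simplification: present tense is always realised as third person singular (VBZ).
TARGET_TAG = {"past": "VBD", "present": "VBZ", "future": "VB"}


def perturb_tense(
    tokens: List[Token],
    target: str,
    inflect: Callable[[str, str], Optional[str]] = toy_inflect,
) -> Optional[List[str]]:
    """Move the first finite verb to the `target` tense; None if nothing was changed."""
    output: List[str] = []
    changed = False
    for form, lemma, tag in tokens:
        source = SOURCE_TENSE.get(tag)
        if not changed and source is not None and source != target:
            new_form = inflect(lemma, TARGET_TAG[target])
            if new_form is not None:
                # Future tense is formed by injecting "will" before the base form.
                output.extend(["will", new_form] if target == "future" else [new_form])
                changed = True
                continue
        output.append(form)
    return output if changed else None


sentence = [("She", "she", "PRP"), ("played", "play", "VBD"), ("chess", "chess", "NN")]
print(perturb_tense(sentence, "present"))
print(perturb_tense(sentence, "future"))
```
]]></preformat>
        <p>In this sketch the unchanged token list plays the role of the good translation and the returned list the incorrect one; a sentence whose verb is already in the target tense yields no perturbed version.</p>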
        <p>From the combined set of ACES and the negation and tense examples, we selected 100 examples for
each language direction for the test set and 10 examples for the trial set. Examples for the test set were
selected in order to provide as close to a uniform selection across categories as possible. Note that due
to the unbalanced coverage of examples in ACES, some categories are underrepresented or absent for
some language directions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Paraphrase</title>
        <p>
          For the English paraphrase scenario, we sampled 138 examples from the SHROOM training data for
the paraphrase generation subtask [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Each example consists of a source sentence accompanied by
a machine-generated paraphrase hypothesis. The latter were generated by the SHROOM organisers
using the PEGASUS model<sup>4</sup>. In order to increase the chance of hallucination, we prioritised examples
with long contexts (minimum 140 tokens) that also include numbers.
          <fn id="fn1"><label>1</label><p>https://huggingface.co/datasets/Eloquent/HalluciGen-PG, https://huggingface.co/datasets/Eloquent/HalluciGen-Translation</p></fn>
          <fn id="fn2"><label>2</label><p>https://spacy.io</p></fn>
          <fn id="fn3"><label>3</label><p>https://pypi.org/project/pyinflect/</p></fn>
        </p>
        <p>
          In the Swedish paraphrase scenario, we used a subset of the SweParaphrase test data [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and the
Swedish part of the Finnish paraphrase corpus [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Each example consists of two sentences, together
with a label reflecting the degree of their semantic similarity. After retaining only the sentence pairs
with the highest degree of semantic similarity (that is, label 5 in the SweParaphrase dataset and
label 4 in the Finnish paraphrase dataset), we sampled 139 examples and used two LLMs, Mixtral 7B
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and GPT-SW3 6.7B [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], to generate a paraphrase hypothesis for the first sentence of each example.
For these Swedish paraphrases, we observed cases where the generated paraphrase was in the wrong
language, typically English, or in a mix of languages when using Mixtral 7B. To obtain a large enough
sample of reasonably good quality for annotation, we therefore chose to (1) translate English output
into Swedish using GPT-3.5 and (2) generate multiple hypotheses, one by each of the LLMs, for some of
these examples. In total, 46 of the sources have multiple annotations, while 56 sources occur only once
in the entire dataset.
        </p>
        <p>All datasets used for the paraphrase scenario are annotated in two steps. The first step is to decide
whether the generated hypothesis is a hallucination of the source, given the definition of the hallucination
phenomenon in our task. If so, we mark the hypothesis as a hallucination (H) and then choose a
suitable hallucination type from the list of eleven hallucination categories in the HalluciGen dataset
(addition, named-entity, etc.). If the hypothesis is marked as not a hallucination (NH), we construct a
hallucination manually, based on one of the hallucination categories above. In the Swedish data, the
cases with hypotheses in the wrong language or a mix of languages were considered to require too much effort to
correct manually and were discarded from the final dataset.</p>
        <p>Each of the resulting datasets per scenario and language was split into a trial and a test set. For
English, 119 examples were selected for the test set and 16 examples for the trial set. The Swedish test
set amounted to 119 examples in total (117 from SweParaphrase and 2 from the Finnish paraphrase
corpus) and the trial set to 20 examples (19 from SweParaphrase and 1 from the Finnish corpus). The
distribution of each dataset over the different hallucination types is presented in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Baseline Models</title>
      <sec id="sec-3-1">
        <title>3.1. Paraphrase scenario</title>
        <p>
          For the paraphrase scenario we use different models for the generation and detection steps. For
generation, we use Mixtral-8x7B-Instruct-v0.1, the instructed variant of the Mixtral LLM [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]<sup>5</sup> to
generate ℎ+/ℎ− hypothesis pairs for the English and Swedish test sets, and gpt-sw3-6.7b-v2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]<sup>6</sup>
as an additional baseline for Swedish.
        </p>
        <p>
          For the detection step we use several models. The first is the Llama-2-7b-chat-hf [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] model from
HuggingFace<sup>7</sup>. This model, although English-centric, has been trained on smaller amounts of data for
other languages, including Swedish. We use three prompts aimed at detecting which hypothesis a) is
an incorrect paraphrase of the source, b) has a different meaning to the source, or c) is not supported
by the source (see Table 9 in Appendix A). The second and third models are multilingual zero-shot
Natural Language Inference (NLI) models, bge-m3-zeroshot-v2.0 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for both English and Swedish
hallucination detection, and scandi-nli-large [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]<sup>8</sup> as an additional baseline for Swedish. These
models classify a text into a number of custom-defined classes; in our case, we choose the default
classes “not_entailment” and “entailment” and infer the output label from the predicted scores for
both classes. To determine which of the two hypotheses (ℎ1/ℎ2) contains a hallucination, we
predict “entailment” and “not_entailment” class scores between the source sentence and each of
the hypotheses. We apply the following conditions to infer the final label:
          <list list-type="order">
            <list-item><p>If one hypothesis has a higher entailment score whereas the other has a higher non-entailment score, we choose the one with the higher non-entailment score.</p></list-item>
            <list-item><p>If both ℎ1 and ℎ2 have a higher entailment score, we choose the one with the lower entailment score.</p></list-item>
            <list-item><p>If both ℎ1 and ℎ2 have a higher non-entailment score, we choose the one with the higher non-entailment score.</p></list-item>
          </list>
        </p>
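        <p>The three conditions above can be sketched as a small decision function. This is an illustrative sketch rather than the baseline implementation: the function name pick_hallucination, the label strings "hyp1" and "hyp2", and the dictionary layout of the scores are assumptions, and the handling of exact ties (resolved in favour of the second hypothesis) is not specified in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict

# Class scores for one (source, hypothesis) pair, keyed by the two NLI classes.
Scores = Dict[str, float]


def pick_hallucination(h1: Scores, h2: Scores) -> str:
    """Return the label of the hypothesis judged to contain the hallucination."""
    h1_entailed = h1["entailment"] > h1["not_entailment"]
    h2_entailed = h2["entailment"] > h2["not_entailment"]

    if h1_entailed != h2_entailed:
        # Condition 1: exactly one hypothesis leans towards non-entailment; pick that one.
        return "hyp2" if h1_entailed else "hyp1"
    if h1_entailed:
        # Condition 2: both lean towards entailment; pick the lower entailment score.
        # Assumption: ties go to the second hypothesis.
        return "hyp1" if h1["entailment"] < h2["entailment"] else "hyp2"
    # Condition 3: both lean towards non-entailment; pick the higher non-entailment score.
    # Assumption: ties go to the second hypothesis.
    return "hyp1" if h1["not_entailment"] > h2["not_entailment"] else "hyp2"


print(pick_hallucination(
    {"entailment": 0.9, "not_entailment": 0.1},
    {"entailment": 0.2, "not_entailment": 0.8},
))
```
]]></preformat>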
        <p>For both NLI models, the default configurations were used, and each source-hypothesis pair (source + ℎ1 /
source + ℎ2) was scored separately. The models were used out of the box, as available on HuggingFace, and without
any additional fine-tuning.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Machine translation scenario</title>
        <p>
          For the translation scenario we again use the Llama2 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] 7B-chat model from HuggingFace. We use this
model as the baseline for the generation and detection steps. As stated in the previous section, while
Llama2 is an English-centric model, it has been trained on (relatively) small amounts of data from other
languages (including French and German) and is therefore able to perform cross-lingual tasks such as
translation. Crucially, in addition to producing accurate translations it can also be prompted to produce
incorrect translations in a zero-shot setting – something that we could not get MT-specific LLMs such
as Tower [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to do, perhaps because they have been optimised to output accurate translations. We
note that there are many stronger LLMs, and our aim is not to provide an unbeatable baseline in the
first year of HalluciGen.
        </p>
        <p>For the generation step we split the problem into two parts, using separate prompts to produce the
good and incorrect translations. For the good translation we simply prompt the model to translate
from the source language to the target language. For the incorrect translation we use two different
strategies: we prompt the model to a) produce an incorrect translation, and b) produce an incorrect
translation, with the prompt additionally providing a list of possible phenomena that the incorrect translation could target. For
the detection step, similar to the paraphrase detection step, we have three different prompts aimed at
detecting which hypothesis a) is an incorrect translation of the source, b) has a different meaning to
the source, or c) is not supported by the source. We use the same prompts from the detection step in
the cross-model evaluation step. For details of the exact prompts used, see Table 10 in Appendix A.
Note that we experimented with explicitly including the term “hallucination” as part of the prompt
instructions, but this was unsuccessful.
          <fn id="fn5"><label>5</label><p>https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF</p></fn>
          <fn id="fn6"><label>6</label><p>https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2</p></fn>
          <fn id="fn7"><label>7</label><p>https://huggingface.co/meta-llama/Llama-2-7b-chat-hf</p></fn>
          <fn id="fn8"><label>8</label><p>https://huggingface.co/alexandrainst/scandi-nli-large</p></fn></p>
        <p>We used the default Llama2-7B-chat model parameters, unless otherwise stated. For generation
(translation only) we want to encourage creativity, so we set temperature=0.9, top_k=10,
num_return_sequences=1, and max_length=200. For detection (for both translation and paraphrase) we
want to encourage deterministic behaviour, so we set temperature=0.1 and top_k=1; as our prompts
are longer than for the generation step, we set max_length=400 (to allow for longer inputs).</p>
        <p>
          In addition to Llama2, we again employed bge-m3-zeroshot-v2.0 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to create detection step
baselines for all language pairs and directions. This uses the same model and process detailed in
the Paraphrase scenario section. Although the model performs better on English input, it is still suitable for
multilingual tasks. While it was recommended to translate input sentences into English rather than
having them in multiple languages (as a way to improve performance), no additional translation was
performed on either the source sentences or the hypothesis pair (ℎ1/ℎ2); this means that the
NLI model receives two sentences in two different languages as input (one in English, and one in either
French or German) in both directions.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participant Submissions</title>
      <p>
        In total, we received outputs from 10 systems submitted by 3 different groups, which included varying
numbers of participants. Table 2 provides an overview of the submitted systems. Participant group
1 (Bui et al.) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] submitted systems for all steps and all languages for both the paraphrase and
translation scenarios. They applied zero-shot prompting for a range of pre-trained LLMs, and ensembled
combinations of these models to produce majority voting systems. Participant group 3 (Siino &amp;
Tinnirello) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] submitted systems for the detection step of the paraphrase scenario only. They used
Mistral-7B-Instruct-v0.2 with few-shot prompting, providing the complete set of examples (either
English or Swedish depending on the language in focus) from the trial data set as part of the prompt.
Participant group 2 (Abburi) submitted systems for the detection step for both the paraphrase and
translation scenarios. Unfortunately, as they did not submit a paper to CLEF 2024, we know little about
their system other than that it uses majority voting across multiple fine-tuned LLMs.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Detection Step</title>
        <p>For the detection step, the submitted systems are evaluated with respect to the human-annotated labels,
using the following metrics: accuracy, precision, recall and F1 score. We use F1 as the primary metric
for comparison between different systems. Examples were classified as incorrect in cases when the
evaluated system produced no label or a label outside the allowed categories (ℎ1/ℎ2).</p>
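        <p>A minimal sketch of how such scores could be computed is given below. It is illustrative only: the function name detection_scores and the label strings "hyp1" and "hyp2" are assumptions, and macro-averaging over the two labels is one possible reading, since the averaging scheme is not stated here. A missing or out-of-vocabulary prediction never matches the gold label, so it counts as an error.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Optional, Sequence, Tuple


def detection_scores(
    gold: Sequence[str],
    pred: Sequence[Optional[str]],
    labels: Tuple[str, ...] = ("hyp1", "hyp2"),
) -> Dict[str, object]:
    """Accuracy plus per-label precision/recall/F1 and their macro-averaged F1."""
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    per_label = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        # A missing or invalid prediction for a gold `label` item is a false negative.
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}

    # Assumption: F1 is macro-averaged over the two labels.
    macro_f1 = sum(scores["f1"] for scores in per_label.values()) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_label": per_label}


gold = ["hyp1", "hyp2", "hyp1", "hyp2"]
pred = ["hyp1", "hyp2", None, "hyp1"]  # third output is missing, fourth is wrong
print(detection_scores(gold, pred))
```
]]></preformat>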
        <sec id="sec-5-1-1">
          <title>5.1.1. Generation Step</title>
          <p>
            We use the NLI task as a proxy for evaluating the quality of the correct and hallucinated hypotheses
ℎ+, ℎ− generated by the participant models. More specifically, the NLI model
bge-m3-zeroshot-v2.0 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which also serves as a baseline for the detection step, is now used to predict “entailment” vs
“not_entailment” scores. The rationale behind this is as follows: one way to determine whether or
not a system is able to create appropriate pairs of hypotheses is to measure the textual entailment
between each hypothesis in the pair and the source sentence. We assume that a successful paraphrase of a sentence
textually entails the source sentence, whereas a hallucination does not. If ℎ+ is predicted as having
higher “entailment”, it is assigned a score of 1, otherwise 0, and if ℎ− is predicted as having
higher “not_entailment”, it is assigned a score of 1, otherwise 0. To validate the use of the NLI model
for evaluating the model outputs for the generation step, we test the NLI model bge-m3-zeroshot-v2.0
as a baseline for the detection step in both scenarios. These are the scores highlighted in grey in Tables
3 and 7. We observe that the NLI model competes with (or even surpasses) the participant models on
the detection task. This allows us to use it for evaluating the model outputs for the generation step.
          </p>
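          <p>The scoring rule above can be sketched as follows. The sketch assumes that the NLI class scores for each generated hypothesis are already available as dictionaries; the function name generation_accuracy and the data layout are hypothetical, and averaging the 0/1 scores over the test set is an assumption consistent with reporting them as accuracies.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Sequence, Tuple

# Class scores for one (source, hypothesis) pair, keyed by the two NLI classes.
Scores = Dict[str, float]


def generation_accuracy(pairs: Sequence[Tuple[Scores, Scores]]) -> Tuple[float, float]:
    """Each item holds the scores for (correct hypothesis, hallucinated hypothesis).

    Returns the share of correct hypotheses predicted as entailed and the share of
    hallucinated hypotheses predicted as not entailed.
    """
    # Score 1 when the correct hypothesis has the higher "entailment" score.
    plus_hits = sum(1 for plus, _ in pairs if plus["entailment"] > plus["not_entailment"])
    # Score 1 when the hallucinated hypothesis has the higher "not_entailment" score.
    minus_hits = sum(1 for _, minus in pairs if minus["not_entailment"] > minus["entailment"])
    return plus_hits / len(pairs), minus_hits / len(pairs)


example = [
    ({"entailment": 0.9, "not_entailment": 0.1}, {"entailment": 0.2, "not_entailment": 0.8}),
    ({"entailment": 0.4, "not_entailment": 0.6}, {"entailment": 0.7, "not_entailment": 0.3}),
]
print(generation_accuracy(example))
```
]]></preformat>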
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Cross-model evaluation</title>
          <p>For the cross-model evaluation, the system performance is measured with respect to the output of the
generator model, using the same metrics as in the detection step. In addition, the Matthews correlation
coefficient (MCC) and Cohen’s kappa are used to measure the agreement between the different evaluators.</p>
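          <p>For reference, the two agreement measures can be computed from two evaluators' label sequences as in the sketch below. It uses the standard textbook definitions in plain Python; the function names, the label strings "hyp1" and "hyp2", the choice of "hyp1" as the positive class for MCC, and the fallback values for degenerate inputs are assumptions made for illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from math import sqrt
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each evaluator's own label distribution.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both evaluators always emit the same single label
    return (observed - expected) / (1.0 - expected)


def matthews_corrcoef(a: Sequence[str], b: Sequence[str], positive: str = "hyp1") -> float:
    """Matthews correlation coefficient for two binary label sequences."""
    pairs = list(zip(a, b))
    tp = sum(x == positive and y == positive for x, y in pairs)
    tn = sum(x != positive and y != positive for x, y in pairs)
    fp = sum(x != positive and y == positive for x, y in pairs)
    fn = sum(x == positive and y != positive for x, y in pairs)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0.0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


evaluator_a = ["hyp1", "hyp1", "hyp2", "hyp2"]
evaluator_b = ["hyp1", "hyp2", "hyp2", "hyp2"]
print(cohen_kappa(evaluator_a, evaluator_b))
print(matthews_corrcoef(evaluator_a, evaluator_b))
```
]]></preformat>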
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Paraphrase Scenario</title>
        <p>Tables 3 to 5 present the results of the participant models and the baselines for the three steps of the
paraphrase scenario. Starting from the detection step, we observe that the NLI baseline
baseline-bge-m3-zeroshot-v2.0 exhibits very strong performance. The difference with the participant models is even
more noticeable for the Swedish dataset, where the best performing participant model, gpt-4-turbo,
lies over 10 points behind the NLI baseline in terms of F1 score. This is almost expected since none of
the participant models has been (intentionally) trained on Swedish data. For the English paraphrase,
gpt-4-turbo and the Majority vote (Abburi) models perform on the same level as the baseline on the
task of hallucination detection.</p>
        <p>[Table: Detection step results for the paraphrase scenario. Results for the NLI model bge-m3-zeroshot-v2.0 (highlighted
in grey) are included for the purpose of validating the NLI model as an evaluation method for the generation step.
Systems listed: gemma-7b-it, gemma-7b-it v1, gpt-3.5-turbo, gpt-3.5-turbo v1, gpt-4-turbo, Meta-Llama-3-8B-Instruct,
Meta-Llama-3-8B, Majority vote A (Bui et al.), Majority vote (Abburi), Mistral-7B-Instruct-v0.2.]</p>
        <p>[Table: Generation results for the paraphrase scenario. ℎ+, ℎ− refer to the accuracy of the NLI model on predicting
that ℎ+ is entailed and ℎ− is not entailed, respectively. Systems listed: gemma-7b-it v1, gemma-7b-it v2,
gpt-3.5-turbo, Meta-Llama-3-8B-Instruct, baseline-mixtral-8x7b-instruct.]</p>
        <p>For the generation step, gpt-3.5-turbo produces overall the best quality positive and negative
hypotheses in both English and Swedish, according to the NLI model. A notably larger difference between
the scores of that model is observed in English, in comparison with Swedish. In
addition, gemma-7b-it v1 stands out for generating ℎ−
hypotheses with considerably better quality
than ℎ+ hypotheses, according to the NLI model.</p>
        <p>From the results of the cross-model evaluation in Table 5 we observe that the Majority vote A
(Bui et al.) exhibits the best overall performance in detecting hallucinations in machine-generated
hypotheses in English and Swedish, with respect to both the generator output and the other evaluator
models.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Machine translation Scenario</title>
        <p>For the generation step, we observe that the performance of the Llama-3-8B-Instruct and gpt-3.5-turbo participant systems is generally
good: the average “entailment” scores for ℎ+ and “not_entailment” scores for ℎ− suggest that the
models are generally consistent in their ability to generate hypotheses that are entailed by the reference
(ℎ+) and that contradict the reference (ℎ− ). The two Llama-2-7b-chat baselines and, to a lesser
degree, the gemma-7b-it participant system exhibit stronger performance for the generation of ℎ+
examples than ℎ− examples. In particular, the Llama-2-7b-chat baselines outperform the participant
systems for the task of generating ℎ+ examples. We conjecture that this may be a result of using
separate prompts to generate ℎ+ and ℎ− ; by focusing the prompt for generating ℎ+ examples
on generating a “good” translation of the source we may focus the model on the translation task, for
which it was likely fine-tuned. Conversely, the baseline performance for generating ℎ− examples is
very low, but confidence in the ability of LLMs to perform this task is buoyed by the performance of the
participant systems. Note that these results are based on automatic metrics; for a complete evaluation
we propose that the generated output be verified by human annotators, which we leave to future work.</p>
        <p>
          For the detection step, all participant systems outperformed the Llama-2-7b-chat baselines (one
model; three different prompts). The stronger bge-m3-zeroshot-v2.0 baseline is outperformed by a
number of participant systems for all language pairs. Overall, gpt-4 prompt1 is the strongest-performing
participant system, with the highest F1 score for three out of four language pairs. The majority voting
strategies of Bui et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and Abburi also perform strongly.
        </p>
        <p>
          For the cross-model evaluation step, from which we exclude the baselines, we find that the majority
voting strategy of Bui et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] works well, with strong F1 performance on detection based on the
examples generated by the models in the generation step, and also has the highest agreement (measured
using Cohen’s Kappa) with the other evaluator models.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In the HalluciGen task we explored the use of LLMs in generating and detecting hallucinations in
paraphrase and translation tasks. We find that performance of the participant and baseline systems is
highly variable, but results from this year’s lab are promising and will provide a solid foundation for
future iterations of the task. We highlight that all three steps (generation, detection, and cross-model
evaluation) have been evaluated automatically, and therefore caution the reader against drawing any
conclusions regarding which models, prompts, or methods may be “best” based solely on the results in
this paper. In the case of the generation step in particular, human validation of the generated output is
ideally necessary to ensure the robustness of the cross-model evaluation results. We aim to address this
challenge in future iterations of HalluciGen.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This lab has been partially supported by the Swedish Research Council (grant number 2022-02909) and
by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee
[grant number 10039436 (Utter)].</p>
    </sec>
    <sec id="sec-9">
      <title>A. Task 2 - Baseline System Prompts</title>
      <p>The prompts used for the paraphrase and translation baseline LLM systems are provided in Tables 9
and 10 respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          , SelfCheckGPT:
          <article-title>Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
          <source>arXiv preprint: 2303.08896</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <article-title>Self-critiquing models for assisting human evaluators</article-title>
          ,
          <source>arXiv preprint: 2206.05802</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Amrhein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Moghe</surname>
          </string-name>
          , L. Guillou, ACES:
          <article-title>Translation accuracy challenge sets for evaluating machine translation metrics</article-title>
          ,
          <source>in: Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>513</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.44.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baldridge</surname>
          </string-name>
          , PAWS-X:
          <article-title>A cross-lingual adversarial dataset for paraphrase identification</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3687</fpage>
          -
          <lpage>3692</lpage>
          . URL: https://aclanthology.org/D19-1382. doi:10.18653/v1/D19-1382.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mickus</surname>
          </string-name>
          , E. Zosa,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vahtola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Segonne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apidianaki</surname>
          </string-name>
          ,
          <article-title>SemEval-2024 shared task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes</article-title>
          ,
          <source>arXiv preprint arXiv:2403.07726</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berdicevskis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bouma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Öhman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adesam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Borin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dannélls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Isbister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lindahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malmsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rekathati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Volodina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Börjeson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tahmasebi</surname>
          </string-name>
          ,
          <article-title>Superlim: A Swedish language understanding evaluation benchmark</article-title>
          , Association for Computational Linguistics
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>8137</fpage>
          -
          <lpage>8153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rastas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Skantsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kilpeläinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-M.</given-names>
            <surname>Kupari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sevón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tarkka</surname>
          </string-name>
          ,
          <article-title>Finnish paraphrase corpus</article-title>
          , in:
          <source>Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)</source>
          , Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online),
          <year>2021</year>
          , pp.
          <fpage>288</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Savary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. d. l.</given-names>
            <surname>Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Hanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , et al.,
          <article-title>Mixtral of experts</article-title>
          ,
          <source>arXiv preprint arXiv:2401.04088</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuba Gyllensten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stollenwerk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Öhman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Isbister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Carlsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Casademont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <article-title>GPT-SW3: An autoregressive language model for the Scandinavian languages</article-title>
          , in:
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , Torino, Italia,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esiobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kerkez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kloumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korenev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Koura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liskovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mihaylov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Molybog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poulton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reizenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rungta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saladi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Kuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kambadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stojnic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Laurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>van Atteveldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Welbers</surname>
          </string-name>
          ,
          <source>Building Efficient Universal Classifiers with Natural Language Inference</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2312.17543. doi:10.48550/arXiv.2312.17543. arXiv:2312.17543 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>ScandiNLI: Natural language inference for the Scandinavian languages</article-title>
          , https://github.com/alexandrainst/ScandiNLI,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pombal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farajian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G. C.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>Tower: An open multilingual large language model for translation-related tasks</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.17733.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Brech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hußfeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jennert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ullrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Breuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>The two sides of the coin: Hallucination generation and detection with evaluators for LLMs</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Siino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tinnirello</surname>
          </string-name>
          ,
          <article-title>GPT hallucination detection through prompt engineering</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>