<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Effects of Hallucinations in Synthetic Training Data for Relation Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steven Rogulsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Popovic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Färber</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Dresden &amp; ScaDS.AI</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model's performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either 'hallucinated' or 'clean,' achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Relation extraction is an important step in extracting structured information from text
documents, such as news articles, publications, patents, and websites, building the basis for
knowledge graph construction. High-quality datasets play a crucial role in this process [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], as
they form the basis for training, fine-tuning, and evaluating relation extraction models.
Additionally, the amount of data they contain has a significant impact on the achieved results [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
However, creating large datasets with high quality typically requires human annotation, which
is expensive and slow. Although heuristic methods such as distant supervision can produce
larger datasets, they often lack quality [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An alternative is Generative Data Augmentation
(GDA), a technique for synthetically expanding datasets by generating new data samples (here:
texts and extracted triples). It can generate datasets that are much larger, more diverse, and
less expensive than traditional human annotations without directly collecting new data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In
the context of relation extraction, GDA has been widely used in combination with pre-trained
language models such as BERT and GPT [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
        ].
      </p>
      <p>[Figure 1: A GDA model receives the triple (’Ted’, ’lives in’, ’New York’). Text without hallucinations: "Ted lives in the city of New York." Text with hallucinations: "Ted lives in the city of New York, which has a population of 8.4 million inhabitants."]</p>
      <p>
        Despite its advantages, GDA often leads to hallucinations in the text, where the content
deviates from the information in the input, for instance because additional facts are generated (see
Figure 1). This issue commonly occurs in generative language models [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. If a language
model is trained on a dataset with incorrect annotations due to hallucinations, the effectiveness
of the relation extraction method may be compromised to an unknown degree,
potentially reducing the accuracy of extracted triples as the model might not learn to capture
all necessary information. Although the phenomenon of hallucinations is well recognized and
the use of LLMs to generate training datasets is increasing, the specific effects of hallucinations
on relation extraction have not been thoroughly investigated.
      </p>
      <p>In this paper, we examine the impact of hallucinations in synthetic training data on relation
extraction, considering several hallucination types on the document and sentence level. Our
research focuses on two primary questions: RQ1: Can we detect a significant influence
of hallucinations on the relation extraction model’s performance? We address this
question by evaluating the performance of models trained on datasets with varying levels of
hallucinations, aiming to understand the presence and impact of different hallucination types.
RQ2: Can hallucinations be reliably detected? To address this question, we develop and
evaluate approaches for hallucination detection.</p>
      <p>Our findings reveal substantial declines in dataset quality and model performance due to
hallucinations, with recall decreases ranging from 19.1% to 39.2%. This indicates that hallucinations
notably compromise the ability of models to extract relations from texts. In this context, it is
crucial to differentiate between relevant and irrelevant hallucinations. The former significantly
affect performance, while the latter have a minimal impact. Furthermore, we develop two
methods for identifying and eliminating hallucinations, achieving F1-scores of 83.8% and 92.2%.
These methods not only remove hallucinations but also assist in estimating their prevalence.</p>
      <p>Overall, our contributions in this paper are as follows:<sup>1</sup></p>
      <list list-type="bullet">
        <list-item><p>Analyzing the Impact of Hallucinations on Model Performance: We determine the effect of hallucinations on relation extraction models by training them on datasets with different levels of hallucinations and analyzing the performance discrepancies observed.</p></list-item>
        <list-item><p>Classifying Hallucinations: We categorize hallucinations into relevant and irrelevant types, and examine their impacts on datasets.</p></list-item>
        <list-item><p>Detecting Hallucinations: We evaluate language model-based methods for automatically detecting hallucinations.</p></list-item>
      </list>
      <p><sup>1</sup> Our source code is available at https://github.com/BigPanda042/Relation-Extraction-Hallucination-Study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we first look at related work on creating synthetic training data. In the second
part, we look at noisy data and hallucinations and how to recognize them.</p>
      <p>
        Generating Synthetic Data. Several data augmentation approaches have been proposed.
Feng et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] differentiate between three main types: (1) Rule-based approaches use
algorithms to modify existing real-world datasets. Techniques such as synonym replacement,
random insertion, swapping and deletion are used to significantly increase the volume of
training data [
        <xref ref-type="bibr" rid="ref16 ref17 ref3">16, 3, 17</xref>
        ]. (2) Sample interpolation, also known as Mixed Sample Data
Augmentation, [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] interpolates data points to create more diverse and robust datasets for
training language models [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23">19, 20, 21, 22, 23</xref>
        ]. Both approaches are limited by the fact that they
are based on existing datasets. As a result, they are not able to introduce completely new
features or vary the data types significantly, such as the relation types for relation extraction
tasks. This can lead to the persistence of existing biases in the original datasets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. (3)
Model-based approaches, referred to as Generative Data Augmentation (GDA), overcome these
limitations. They are able to generate completely new and specific data points, independent
of existing datasets. For example, the Control Prefixes model [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is characterized by the
generation of text data from structured knowledge graphs using the WebNLG dataset [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ].
Other notable implementations include the use of pretrained language models (PLMs), such
as GPT-3.5, which have been successfully used to improve performance on relation extraction
tasks [
        <xref ref-type="bibr" rid="ref2 ref27 ref28 ref6">27, 2, 28, 6</xref>
        ].
      </p>
      <p>
        Josifoski et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed a large synthetic dataset named Wiki-cIE for closed information
extraction, utilizing GPT-3.5 with prompt engineering. This dataset, containing 1.8 million data
points, serves as a robust alternative to both distantly supervised and directly supervised datasets
in terms of size and quality. It is positioned closely in scale to the largest distantly supervised
dataset, REBEL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Importantly, the Wiki-cIE dataset offers enhanced quality, especially in the
distribution of relation types and the accuracy of text annotations. Josifoski et al. demonstrate
that relation extraction models trained on Wiki-cIE significantly outperform those trained on
REBEL, attributing this advantage to the superior quality of their synthetic dataset. However,
they do not specify which particular attributes of the datasets contribute to these performance
differences. A notable quality difference is in the accuracy of the text annotations, suggesting
that this aspect may be a critical factor in the observed improvements in model performance.
      </p>
      <p>
        Detecting and Assessing Noisy Data. Corrupted or noisy data, characterized by issues
such as incorrect labels, affects language model training [
        <xref ref-type="bibr" rid="ref29 ref30 ref31 ref32 ref33 ref34">29, 30, 31, 32, 33, 34</xref>
        ]. Several strategies
have been developed to address noisy data in datasets. Techniques include resampling [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], loss
reweighting [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], and label correction [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Additionally, some approaches advocate training
models using noise-robust loss functions [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ], with a notable recent development being a
noise-robust re-weighting framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. While these methods effectively mitigate the impact of
noisy data or reduce its presence, they do not specifically explore the influence of hallucinations
within synthetic training data on the performance of relation extraction models.
      </p>
      <p>
        Analyzing Hallucinations. Ji et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] provide an overview on hallucinations, including
relevant data-to-text use cases. The authors distinguish between two types of hallucinations:
intrinsic and extrinsic. Intrinsic hallucinations are false information in texts that contradict
the annotations, while extrinsic hallucinations consist of additional information in the texts
that is not supported by the annotations. There exist several approaches to detect both types
of errors. Typical textual similarity metrics such as BLEU or ROUGE are unsuitable for the
detection of hallucinations [
        <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
        ]. Other approaches can be divided into statistical and
model-based methods. Statistical approaches [
        <xref ref-type="bibr" rid="ref39 ref40 ref41">39, 40, 41</xref>
        ] focus primarily on lexical information, i.e.,
the specific words used, and therefore cannot adequately take syntactic or semantic variations
into account. Thus, the more relevant alternatives are model-based approaches. Liu et al. [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]
use named entity recognition to extract the entities from a text and compare them with those
in the annotated table. The number of hallucinations is then based on the difference between
annotated and found entities. Dušek and Kasner [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] have developed an approach based on
natural language inference. It compares the input data and the output text in both
directions and can thus detect omissions or hallucinations. Finally, there are
the language model-based approaches by Filippova [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and Tian et al. [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. However, these
methods provide results that either focus on table-to-text generation or are not precise enough
for our needs. While the presented methods contribute to the task of detecting hallucinations,
none of them examines the exact influence of hallucinations on training performance or attempts
to differentiate between different types of hallucinations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        The concept of hallucinations lacks a universally accepted definition [
        <xref ref-type="bibr" rid="ref12 ref13 ref32 ref40 ref45">13, 32, 45, 12, 40</xref>
        ]. Figure
1 provides an example of a hallucination. In this scenario, a GDA model, tasked with generating
text from the input triple (’Ted’, ’lives in’, ’New York’), should ideally produce ’Ted lives in the
city of New York.’ Instead, the model might extend this to ’Ted lives in the city of New York,
which has a population of 8.4 million inhabitants.’ This addition introduces an unsupported
triple (’New York’, ’has’, ’8.4 million inhabitants’), which is a hallucination.
      </p>
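      <p>Formally, we define hallucinations <inline-formula><tex-math>H</tex-math></inline-formula> as the set difference <inline-formula><tex-math>H = T_S \setminus T</tex-math></inline-formula> between the set of annotated triples <inline-formula><tex-math>T</tex-math></inline-formula> and the triples <inline-formula><tex-math>T_S</tex-math></inline-formula> that are actually generated in the text <inline-formula><tex-math>S</tex-math></inline-formula>.</p>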
      <p>
        We differentiate between relevant and irrelevant hallucinations [
        <xref ref-type="bibr" rid="ref24 ref46">24, 46</xref>
        ] in relation
extraction models, as illustrated in Figure 2. Relevant hallucinations occur when the text
expresses triples with relation types that are relevant (i.e., included in the schema) but absent
from the annotations. For example, if a model is trained exclusively to detect birth dates in texts,
only triples related to birth dates are considered relevant. Conversely, irrelevant hallucinations
involve relations that the model is designed to ignore, as they do not pertain to its trained focus.
      </p>
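      <p>The distinction can be illustrated with a minimal sketch. It assumes that triples are represented as (subject, relation, object) tuples and that the schema is a set of relation names; the function name <monospace>split_hallucinations</monospace> is hypothetical and not taken from our repository. The example data follow Figure 2.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def split_hallucinations(
    text_triples: Set[Triple], annotated: Set[Triple], schema: Set[str]
) -> Tuple[Set[Triple], Set[Triple]]:
    """Split hallucinated triples into relevant and irrelevant ones."""
    # Hallucinations are triples expressed in the text but absent from the annotation.
    hallucinated = text_triples - annotated
    # A hallucination is relevant if its relation type is part of the schema.
    relevant = {t for t in hallucinated if t[1] in schema}
    irrelevant = hallucinated - relevant
    return relevant, irrelevant


schema = {"birthDate"}
annotated = {("Alan Bean", "birthDate", "March 15, 1932")}
text_triples = annotated | {
    ("Alan Bean", "occupation", "Astronaut"),
    ("Nikola Tesla", "birthDate", "July 10, 1856"),
}
relevant, irrelevant = split_hallucinations(text_triples, annotated, schema)
print("relevant:", relevant)
print("irrelevant:", irrelevant)
```
]]></preformat>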
      <p>In the following, we first analyze the influence of hallucinations on synthetic training datasets
for relation extraction. We then consider the automatic detection of hallucinations in relation
extraction datasets.</p>
      <sec id="sec-3-1">
        <title>3.1. Evaluating the Effects of Hallucinations</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Influence of Relevant Hallucinations on Document Level</title>
          <p>In this subsection, we focus on evaluating the impact of relevant hallucinations on document-level
relation extraction.</p>
          <p>[Figure 2, illustrating relevant and irrelevant hallucinations. Relevant relations: {’birthDate’}. Triple: (’Alan Bean’, ’birthDate’, ’March 15, 1932’). Correct text: "Alan Bean was born on March 15, 1932." Text with an irrelevant hallucination: "Alan Bean was born on March 15, 1932 and was an astronaut", which adds (’Alan Bean’, ’occupation’, ’Astronaut’). Text with a relevant hallucination: "Alan Bean was born on March 15, 1932, and Nikola Tesla was born on July 10, 1856", which adds (’Nikola Tesla’, ’birthDate’, ’July 10, 1856’).]</p>
          <p>Datasets. As presented in Figure 3, we select two datasets:</p>
          <list list-type="order">
            <list-item>
              <p>
                Dataset A is characterized by fewer hallucinations. Specifically, we employ the
Re-DocRED dataset [
                <xref ref-type="bibr" rid="ref33">33</xref>
                ], as it is known for its extensive use and minimal irrelevant content.
              </p>
            </list-item>
            <list-item>
              <p>
                Dataset B contains a significant presence of relevant hallucinations. We use the DocRED
dataset [
                <xref ref-type="bibr" rid="ref47">47</xref>
                ], an earlier version of the Re-DocRED dataset known for its incomplete
annotations and the consequent prevalence of relevant hallucinations.
              </p>
            </list-item>
          </list>
          <p>The differences between dataset A (Re-DocRED) and dataset B (DocRED) are outlined in Table 1.</p>
          <p>
            Relation Extraction Model. We select the DREEAM model [
            <xref ref-type="bibr" rid="ref48">48</xref>
            ], which has achieved top
performance on the DocRED and Re-DocRED datasets [
            <xref ref-type="bibr" rid="ref49 ref50">49, 50</xref>
            ]. This model is optimized for
compatibility with both datasets, thereby obviating the need for further modifications.
          </p>
          <p>
            The model is initially trained on Dataset A and on Dataset B to produce two tailored versions: DREEAM<sub>A</sub>
and DREEAM<sub>B</sub>. These models are then evaluated on the respective test portions of the datasets
and benchmarked against the findings of Ma et al. [
            <xref ref-type="bibr" rid="ref48">48</xref>
            ]. Although the standard practice for
DocRED involves using a development dataset for parameter tuning and testing, we adopt it
as our test dataset. The final comparison of the performance of DREEAM<sub>A</sub> and DREEAM<sub>B</sub> is
conducted using the same test dataset A, which is free of hallucinations [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ].
          </p>
          <p>Evaluation Results and Discussion: Table 2 shows the evaluation results, revealing
a significant discrepancy between the two model configurations. Notably, the recall differs
strongly, which is also reflected in the F1-score. In the case of relation extraction, the recall
measures the ratio of correctly extracted triples to all relevant triples that should
have been extracted. Since DREEAM<sub>B</sub> was trained on data where the triples in the annotation
do not accurately reflect the text’s triples, it learned that not all triples must be extracted to
obtain a correct solution. This results in a lower recall, as expected.</p>
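          <p>To make the metric definitions concrete, the following sketch computes precision, recall, and F1-score over sets of triples. It is an illustration under simplifying assumptions (exact matching of triples, placeholder toy triples), not the evaluation code of DREEAM. The toy example shows the pattern discussed above: a model that extracts only part of the expressed triples can reach high precision while recall drops.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 using exact triple matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: four gold triples, of which the model extracts two.
gold = {("e1", "r1", "e2"), ("e1", "r2", "e3"), ("e4", "r1", "e5"), ("e4", "r2", "e6")}
predicted = {("e1", "r1", "e2"), ("e4", "r1", "e5")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, round(f1, 3))
```
]]></preformat>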
          <p>The precision, however, surprisingly increases for DREEAM<sub>B</sub> when evaluated on A,
compared to the evaluation on B, to an even higher value than for DREEAM<sub>A</sub>. This difference
is most likely due to the flawed test dataset B. The model most likely extracted correct triples, but
since the annotations of that test dataset are incomplete, those correct triples were not present in the test annotation
and were counted as false positives. With the corrected annotations of
A, these false positives became true positives, and the precision increased. Nevertheless, this cannot explain why the precision increased
beyond the precision of DREEAM<sub>A</sub>. One potential reason is that DREEAM<sub>B</sub> tends to
extract fewer triples than DREEAM<sub>A</sub>. DREEAM<sub>B</sub> was trained on a dataset with generally fewer
triples in T but the same texts and thus learned to extract fewer triples. Another possible explanation
is the relation type distribution. In A, the number of underrepresented relation types
may have increased, or new relation types that are more difficult to extract correctly may have been introduced.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Influence of Relevant Hallucinations on Sentence Level</title>
          <p>
            Datasets. We now require datasets on the sentence level. We use the WebNLG [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] dataset for
A, a widely used knowledge graph-to-text dataset [
            <xref ref-type="bibr" rid="ref24 ref25 ref51 ref8">25, 24, 8, 51</xref>
            ]. Based on our own analysis and
to the best of our knowledge, the dataset is free of relevant hallucinations [
            <xref ref-type="bibr" rid="ref52">52</xref>
            ].
          </p>
          <p>The first variant of B is created to ensure direct comparability to the document-level
datasets used above and to prevent biases by controlling the creation process. To accomplish
this, we delete one triple from each of A’s data points, provided that at least two triples are present.
We randomly select which triple to delete to avoid any bias regarding the position of missing
information. This ensures that the same texts are kept in both A and this first variant of B but with different
annotations. In total, we delete 28.1% of all triples, corresponding to a 39.0% hallucination rate
(calculated by dividing the total number of triples in all the texts by the total number of triples
in all the annotations, which yields roughly 1.39).</p>
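          <p>The construction can be sketched as follows. This is a minimal illustration, not the script from our repository: the function names, the (text, triples) representation, and the fixed random seed are assumptions. The hallucination rate is computed here as the number of deleted triples divided by the number of remaining annotated triples, which is one reading consistent with the reported figures (0.281 / 0.719 is roughly 0.39).</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]
DataPoint = Tuple[str, List[Triple]]  # (text, annotated triples)


def delete_one_triple(dataset: List[DataPoint], seed: int = 0) -> List[DataPoint]:
    """Remove one randomly chosen triple from every data point with at least two triples."""
    rng = random.Random(seed)
    variant = []
    for text, triples in dataset:
        triples = list(triples)  # copy so the original annotation stays untouched
        if len(triples) >= 2:
            triples.pop(rng.randrange(len(triples)))
        variant.append((text, triples))
    return variant


def deletion_and_hallucination_rate(
    original: List[DataPoint], variant: List[DataPoint]
) -> Tuple[float, float]:
    """Return (deleted / original triples, deleted / remaining annotated triples)."""
    total = sum(len(triples) for _, triples in original)
    kept = sum(len(triples) for _, triples in variant)
    if total == 0 or kept == 0:
        raise ValueError("dataset must contain annotated triples")
    deleted = total - kept
    return deleted / total, deleted / kept


# Hypothetical toy dataset with one, two, and three triples per data point.
original = [
    ("text 1", [("a", "r", "b")]),
    ("text 2", [("c", "r", "d"), ("c", "q", "e")]),
    ("text 3", [("f", "r", "g"), ("f", "q", "h"), ("f", "p", "i")]),
]
variant = delete_one_triple(original)
deletion_rate, hallucination_rate = deletion_and_hallucination_rate(original, variant)
print(deletion_rate, hallucination_rate)
```
]]></preformat>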
          <p>Additionally, we create a second variant of B to ensure that measured differences cannot solely be attributed
to the first variant just having fewer triples in the annotation. To each data point of the first variant, we append the text of an unrelated data point
whose annotation contains no identical triples. This way, we can
include relevant information in the text without altering the annotations.</p>
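          <p>A possible implementation of this step is sketched below. It assumes that unrelated data points are drawn at random from the same dataset and that "no identical triples" is checked against the annotations passed in; both are assumptions of this sketch, and the function name <monospace>append_unrelated_text</monospace> is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]
DataPoint = Tuple[str, List[Triple]]  # (text, annotated triples)


def append_unrelated_text(dataset: List[DataPoint], seed: int = 0) -> List[DataPoint]:
    """Append the text of a data point that shares no annotated triple; keep annotations."""
    rng = random.Random(seed)
    variant = []
    for i, (text, triples) in enumerate(dataset):
        own = set(triples)
        # Candidate texts come from other data points with disjoint annotations.
        candidates = [
            other_text
            for j, (other_text, other_triples) in enumerate(dataset)
            if j != i and own.isdisjoint(other_triples)
        ]
        if candidates:
            text = text + " " + rng.choice(candidates)
        variant.append((text, list(triples)))
    return variant


# Hypothetical toy data: the first two data points share one triple.
shared = ("Alan Bean", "birthDate", "March 15, 1932")
dataset = [
    ("Text zero.", [shared]),
    ("Text one.", [shared, ("Alan Bean", "occupation", "Astronaut")]),
    ("Text two.", [("Akron", "country", "United States")]),
]
longer = append_unrelated_text(dataset)
for text, triples in longer:
    print(text, len(triples))
```
]]></preformat>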
          <p>
            Relation Extraction Model. We use the state-of-the-art PFN model [
            <xref ref-type="bibr" rid="ref46 ref53">46, 53</xref>
            ]. Our initial step
involved a preliminary experiment similar to the one described in Section 3.1.1. We adapted PFN
to dataset A, resulting in PFN<sub>A</sub>, and assessed its performance on the A dataset. The F1-score
differences were minor, averaging less than 0.6% in variation, which we deemed acceptable
given the unknown variance in Yan et al.’s results. Subsequently, we fine-tuned PFN on the
two variants of B, producing one model per variant. Both models were then evaluated
against the original test dataset.
          </p>
          <p>Evaluation Results and Discussion. Table 3 reveals performance differences between
PFN<sub>A</sub> and the model trained on the first variant of B. Specifically, recall diminishes by an average of 19.1%, while precision
declines by 2.29%, a decrease deemed statistically significant through a paired t-test at a 95%
confidence level. For the model trained on the second variant of B, recall is similarly reduced by 19.98%. Conversely, there is
a marginal increase in precision of 0.06%. Consequently, the F1-scores for the two models
decrease by 11.32% and 11.62%, respectively.</p>
          <p>[Figure: example of a text before and after processing by Llama 2. Before: "Akron, Ohio is 306 m above sea level, has a total area of 161.54 sq km and a population density of 1239.3 people per sq km." After: "Akron, Ohio lies 306.0 m above sea level and has the area codes of 234 and 330. It has a total area of 161.54 sq km of which 0.88 sq km is water, and a population density of 1239.3 inhabitants per sq km."]</p>
          <p>These findings underline that the persistence of triples in longer texts within the second variant of B does
not counterbalance the reduced training data volume. As discussed in Section 3.1.1, only the
variation in hallucination rates across datasets explains the altered recall rates.</p>
          <p>Differences in recall remain substantial between document-level and sentence-level extraction,
as summarized in Table 4. Document-level recall decreases nearly twice as much in absolute
terms and three times in relative terms compared to sentence-level, primarily due to differing
hallucination rates. Table 1 shows that Re-DocRED contains three times more triples per
annotation than DocRED, a stronger contrast than observed between A and the sentence-level variants of B. Yet, without
control over document-level dataset creation, a definitive causality cannot be verified here.</p>
          <p>Contrary to expectations, precision varies significantly across the experiments. Notably, a
5% difference at the document level, as indicated in Table 2, diverges from the sentence-level
findings for the PFN models. This discrepancy suggests potential document-specific
effects or dataset variances not previously accounted for. Given the controlled modifications in
the first variant of B, the sentence-level results are considered more reliable, highlighting a distinct decrease in precision
between PFN<sub>A</sub> and the model trained on that variant.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Differences Between Relevant and Irrelevant Hallucinations</title>
          <p>In this subsection, we evaluate the influence of relevant and irrelevant hallucinations on the
sentence level.</p>
          <p>
            Dataset. We keep A the same, as it can serve as the dataset with fewer hallucinations. In contrast,
B needs to contain irrelevant hallucinations instead of relevant ones. We also
create another test dataset for testing whether the newly trained models only extract the first
part of a text and ignore the rest. To create B, we use the chat version of Llama 2 [
            <xref ref-type="bibr" rid="ref54">54</xref>
            ] to add
irrelevant hallucinations to each of A’s data points. The LLM takes text as input and returns
the same text but with additional information. To create additional dataset variants, we adjust
the prompt by adding or removing specific instructions. This allows us to partly control the
amount and type of information added. In total, we produce five modified WebNLG datasets,
with the only difference being the prompt we use for the creation process.
          </p>
          <p>New Test Dataset. We modify the test dataset to assess if text length affects model
performance. Each data point in A is altered by fusing two data points (i.e., concatenating S
and merging T), creating a test set with longer texts and more triples per data point without
adding new hallucinations.</p>
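          <p>The fusion step can be sketched as follows. How the two data points are paired is not specified above, so this sketch assumes that consecutive data points are paired and that a leftover data point is dropped; the function name <monospace>fuse_pairs</monospace> is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
DataPoint = Tuple[str, List[Triple]]  # (text S, annotated triples T)


def fuse_pairs(dataset: List[DataPoint]) -> List[DataPoint]:
    """Fuse consecutive data points: concatenate the texts and merge the triples."""
    fused = []
    for i in range(0, len(dataset) - 1, 2):
        (text_a, triples_a), (text_b, triples_b) = dataset[i], dataset[i + 1]
        # Merge annotations without duplicating triples that occur in both data points.
        merged = list(triples_a) + [t for t in triples_b if t not in triples_a]
        fused.append((text_a + " " + text_b, merged))
    return fused


# Hypothetical toy test set with five data points; the last one has no partner.
test_set = [
    ("S1.", [("a", "r", "b")]),
    ("S2.", [("c", "r", "d")]),
    ("S3.", [("e", "r", "f"), ("e", "q", "g")]),
    ("S4.", [("e", "r", "f")]),
    ("S5.", [("h", "r", "i")]),
]
new_test_set = fuse_pairs(test_set)
for text, triples in new_test_set:
    print(text, triples)
```
]]></preformat>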
          <p>Language Model. We fine-tune PFN on each of the five modified WebNLG datasets.</p>
          <p>Evaluation Results on Original Test Dataset. The results are presented in Table 5. Since
the results for all modified WebNLG datasets are similar to each other, we report for
B the average over all five (all results are in our repository). The evaluation on the original
A dataset shows that the recall drops by an average of 1.98% and the F1-score by 1.26%
(statistically significant differences according to a paired t-test at a 95% confidence level).</p>
          <p>Evaluation Results on New Test Dataset. The results are presented in Table 6. They
indicate a similar difference in performance between PFN<sub>A</sub> and the models trained on B for the evaluation on
the altered and on the original test dataset. The recall difference is not statistically significant, while
the differences in F1-score and precision are significant (using a paired t-test at a 95% confidence level).</p>
          <p>Discussion. The presented findings indicate that whether we extensively increase the
information content, keep it a bit shorter, or create more similar information, the different
prompts and irrelevant hallucinations have only a minor impact on the trained relation extraction
models. Through the evaluation on the new test dataset, we observed that the small differences
between PFN<sub>A</sub> and the models trained on B cannot be attributed to the latter having learned to ignore
the back part of the natural language texts (which contains the newly added hallucinations) and
to extract triples only from the front part. This is evident in the statistically insignificant difference
in recall between the two when evaluated on the new test dataset. The significant results
in precision and F1-score are not considered further here.</p>
          <p>Overall, there is a minor impact of irrelevant hallucinations in relation extraction models
because these models are trained to prioritize and extract only those relations classified under
relevant relation types, effectively disregarding all others categorized as irrelevant relation
types. Thus, irrelevant hallucinations are systematically ignored during the training process.</p>
          <p>The observed differences in model performance, despite expectations, may stem from two
potential factors that require further investigation. The first possibility is that Llama 2
occasionally introduces relevant relations in what is mostly uncontrolled information. The second
possibility is an increase in errors by the relation extraction model due to processing a larger
volume of text, regardless of its relevance. Further experiments are needed in this regard.</p>
          <p>Based on these results and explanations, we can confirm the assumption that relevant
hallucinations in (synthetic) training data have a much stronger impact on the performance of
relation extraction models trained on them than irrelevant hallucinations. Therefore, irrelevant
hallucinations can mostly be neglected regarding the influence on training performance of
relation extraction models. This also means that when creating datasets or improving annotation
quality, removing relevant hallucinations should be the priority.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluating Hallucination Detection</title>
        <p>We consider two approaches to hallucination detection, as outlined in the following.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Named Entity Recognition-Based Hallucination Detection</title>
          <p>
            A first approach for hallucination detection was suggested by Liu et al. [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ] and involves named
entity recognition (NER). This approach extracts entities from a text and compares them to the
entities in the corresponding triples. Entities found in the text but absent from the triples are
identified as hallucinations.
          </p>
          <p>
            Dataset: The sentence-level WebNLG dataset version v3.0 [
            <xref ref-type="bibr" rid="ref55">55</xref>
            ] serves as the basis for this
work. The dataset includes annotations with one to seven triples.
          </p>
          <p>
            Model: We use the widely adopted spaCy NER model because of its solid
performance. Preliminary tests confirm that the entities extracted
by the model from S are often correct but not identical to those in the annotation T. For
example, ’Alan Bean’ may appear in T while only ’A. Bean’ is extracted from S,
although both refer to the same entity. To solve this problem, we use the sentence similarity
model all-mpnet-base-v2 [
            <xref ref-type="bibr" rid="ref56">56</xref>
            ] to compare the extracted and annotated entities.
          </p>
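          <p>The matching step can be illustrated with a minimal sketch. It is an illustration under stated assumptions rather than the exact implementation: the function names are hypothetical, entity extraction is assumed to have happened already, and a character-based ratio from Python's difflib stands in for the similarity score of all-mpnet-base-v2 so that the example runs without downloading a model. The default threshold of 0.55 is only a placeholder and is not tuned for this stand-in.</p>
          <preformat>
```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


def char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption); the paper uses a sentence similarity model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def unmatched_entities(
    extracted: Iterable[str],
    annotated: Iterable[str],
    threshold: float = 0.55,
    similarity: Callable[[str, str], float] = char_similarity,
) -> List[str]:
    """Return extracted entities that match no annotated entity at or above the threshold."""
    gold_entities = list(annotated)
    return [
        entity
        for entity in extracted
        if not any(similarity(entity, gold) >= threshold for gold in gold_entities)
    ]


def is_clean(
    extracted: Iterable[str],
    annotated: Iterable[str],
    threshold: float = 0.55,
    similarity: Callable[[str, str], float] = char_similarity,
) -> bool:
    """Accept a text as hallucination-free if every extracted entity is matched."""
    return not unmatched_entities(extracted, annotated, threshold, similarity)
```
          </preformat>
          <p>With this stand-in, 'A. Bean' is matched to 'Alan Bean', whereas an entity such as 'NASA' that has no counterpart in the annotation is reported as unmatched. Any function that maps two strings to a score can be passed as the similarity argument.</p>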
          <p>Evaluation Setup: We report precision, recall, and F1-score. The positive class is
’hallucination-free’: a clean text that is correctly accepted counts as a true positive.</p>
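          <p>To make the choice of the positive class concrete, the following sketch computes the three metrics with 'clean' as the positive label. The function name and the boolean encoding of the labels are assumptions made for illustration.</p>
          <preformat>
```python
from typing import Sequence, Tuple


def clean_class_metrics(
    gold_clean: Sequence[bool], pred_clean: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'clean' (hallucination-free) as the positive class."""
    pairs = list(zip(gold_clean, pred_clean))
    tp = sum(1 for gold, pred in pairs if gold and pred)      # clean text accepted
    fp = sum(1 for gold, pred in pairs if not gold and pred)  # hallucinated text accepted
    fn = sum(1 for gold, pred in pairs if gold and not pred)  # clean text rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
          </preformat>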
          <p>For our experiments, we use 3,000 data points sampled from D. For each data point, we
randomly select one correct text and one hallucinated text, so that both classes are balanced.
The hallucinated text can contain one to six hallucinated triples. Additionally, we sweep
the acceptance threshold of the sentence similarity model from 0.05 to 0.95 (inclusive)
in increments of 0.05 to find the best-performing value.</p>
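          <p>The threshold sweep can be sketched as follows. The sketch assumes a hypothetical callable classify(example, threshold) that returns True when a text is accepted as clean; all names are illustrative. The grid is built from integer steps so that exactly the 19 values from 0.05 to 0.95 are produced despite floating-point rounding.</p>
          <preformat>
```python
from typing import Any, Callable, List, Sequence, Tuple


def threshold_grid(start: float = 0.05, stop: float = 0.95, step: float = 0.05) -> List[float]:
    """Inclusive grid of thresholds, built from integer steps to avoid drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def f1_for_clean(gold_clean: Sequence[bool], pred_clean: Sequence[bool]) -> float:
    """F1-score with 'clean' as the positive class."""
    pairs = list(zip(gold_clean, pred_clean))
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(
    examples: Sequence[Any],
    gold_clean: Sequence[bool],
    classify: Callable[[Any, float], bool],
) -> Tuple[float, float]:
    """Return (threshold, F1) with the highest F1; ties go to the lowest threshold."""
    scored = [
        (f1_for_clean(gold_clean, [classify(example, t) for example in examples]), t)
        for t in threshold_grid()
    ]
    best_f1, best_t = max(scored, key=lambda item: item[0])
    return best_t, best_f1
```
          </preformat>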
          <p>Evaluation Results: Figure 5 shows that precision rises and recall falls as the threshold
increases. As a result, the F1-score increases up to a threshold of 0.55 and declines
thereafter. At this peak, precision, recall, and F1-score are 85.34%, 82.25%, and 83.76%,
respectively. In other words, 85.34% of the texts classified as ’clean’ were indeed clean,
and 82.25% of all clean texts were classified as ’clean.’ With this performance, the
approach can be used to detect hallucinations and to provide an approximate estimate of
the amount of hallucinations in datasets.</p>
          <p>[Figure 5: Precision, recall, and F1-score plotted against the acceptance threshold.]</p>
          <p>The presented approach has several limitations. First, the equivalence between
extracted and annotated entities depends on the sentence similarity model, making it unclear
how many entities were incorrectly accepted or rejected. Second, a fixed threshold is needed
to define equivalence; the best results were obtained at 0.55, which indicates substantial differences
between extracted entities and annotations. This complicates model evaluation, and the issue
can vary across datasets due to differing annotation formats. A potential solution is to use
textual entailment instead of sentence similarity to assess entity matches.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Textual Entailment Approach for Hallucination Detection</title>
          <p>
            Another approach, inspired by Dušek and Kasner [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ], uses an entailment model M to check
if a sentence S contains the same information as a set of triples T. The triples t ∈ T are
combined into a single sentence using conjunctions. If M classifies S as not entailed, it indicates
hallucinations; if classified as entailed, S is considered hallucination-free.
          </p>
          <p>Dataset: Since we evaluate a new approach on the same task as in Section
3.2.1, we do not need to adjust the dataset and can continue to use D.</p>
          <p>
            Model: For this task, we focus on the roberta-large-mnli [
            <xref ref-type="bibr" rid="ref57">57</xref>
            ] and deberta-v2-xlarge-mnli
[
            <xref ref-type="bibr" rid="ref58">58</xref>
            ] models. Both models perform well on SQuAD 1.1/2.0 and various GLUE benchmarks.
          </p>
          <p>Classifier: An entailment model can be used to test whether sentence S2 is contained in
sentence S1, that is, whether the content of S1 implies the content of S2. We use the model to
test, for each data point of a dataset, whether S contains hallucinations with respect to T.</p>
          <p>The initial step involves pre-processing the triples, typically formatted as ’Entity_1 | relation
| Entity_2’ or as a three-element list. Here, we receive the input in the former format and
replace all occurrences of ’_’ and ’ | ’ with spaces. Next, the goal is to create a sentence, ST, that
encompasses all triples from T and accurately conveys T’s informative value. This is done by
combining the pre-processed triples t ∈ T into a single sentence, linked by the conjunction
’and.’ Finally, M verifies whether S is entailed by ST. If the result is ’entailed,’ S is deemed correct;
otherwise, a ’neutral’ or ’contradictory’ result signals the presence of hallucinations.</p>
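          <p>The pre-processing and decision steps can be sketched as follows. This is a minimal illustration under assumptions, not a definitive implementation: the function names are hypothetical, ST is assumed to serve as the premise and S as the hypothesis, and nli_label is a placeholder for any callable that wraps an MNLI model and returns one of 'entailment', 'neutral', or 'contradiction'. The example triples are illustrative only.</p>
          <preformat>
```python
from typing import Callable, Iterable


def triple_to_text(triple: str) -> str:
    """Turn 'Entity_1 | relation | Entity_2' into plain words."""
    text = triple.replace(" | ", " ").replace("_", " ")
    return " ".join(text.split())  # collapse any repeated whitespace


def build_premise(triples: Iterable[str]) -> str:
    """Join all pre-processed triples of T into one sentence ST using 'and'."""
    return " and ".join(triple_to_text(triple) for triple in triples)


def is_hallucination_free(
    sentence: str,
    triples: Iterable[str],
    nli_label: Callable[[str, str], str],
) -> bool:
    """Accept S only if the pair (premise ST, hypothesis S) is labelled as entailment."""
    label = nli_label(build_premise(triples), sentence)
    return label.lower() == "entailment"  # neutral or contradiction signals hallucination


if __name__ == "__main__":
    example = ["Alan_Bean | birthPlace | Wheeler,_Texas", "Alan_Bean | occupation | Test_pilot"]
    print(build_premise(example))
```
          </preformat>
          <p>Keeping the model behind a callable makes it possible to swap roberta-large-mnli for deberta-v2-xlarge-mnli without touching the pre-processing.</p>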
          <p>Evaluation Setup: We test M on 4,000 data points sampled from D. Each sample comprises
an annotation T and two texts: S1 from the hallucinated subset Dh ⊂ D and S2 from the clean
subset of D. Two tests are therefore conducted for each data point, one per text, which
results in 8,000 classifications.
The evaluation is otherwise similar to the NER approach from the previous section.</p>
          <p>
            Evaluation Results: The results are presented in Table 7. Both tested models outperform
the SpaCy model by 7.03 to 8.39 percentage points in F1-score using a straightforward sentence
creation procedure for ST. The deberta-v2-xlarge-mnli model outperforms the roberta-large-mnli
model by 1.36 percentage points in F1-score, with the most significant difference in recall, while
precision increases only slightly. Compared to the method used by
Dušek and Kasner [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ], our best hallucination detection approach achieves an F1-score that is 13.75 percentage
points higher. However, this comparison should be considered with caution, as their study used an
older version of the WebNLG dataset with many incorrect annotations.
          </p>
          <p>An F1-score of 92.15% demonstrates that it is possible to reliably classify each data point in a
dataset as either containing hallucinations or being hallucination-free. With high recall and
precision, most clean texts are correctly identified as such, and the error rate for texts wrongly
identified as hallucination-free is under 10%. This performance allows for the effective detection
(and removal) of hallucinations in datasets, thereby significantly improving annotation quality.</p>
          <p>
            As noted above, the comparison with Dušek and Kasner [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ], who also relied on natural
language inference, must be treated with caution because they used an
outdated WebNLG version whose incorrect annotations affected their outcomes. Furthermore,
this approach is limited to the detection of hallucinations and to sentence-level datasets, similar
to the constraints discussed for the NER metric.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we analyzed the impact of hallucinations in synthetic training data on relation
extraction tasks. Our evaluation revealed significant performance declines, with recall reductions
between 19.1% and 39.2%. This indicates that hallucinations notably compromise the ability of
models to accurately extract relations from texts. We identified a distinction between relevant
and irrelevant hallucinations, noting that the former significantly impair performance, while
the latter have a minimal impact. Additionally, we developed methods for the detection (and
thus mitigation) of hallucinations to improve data quality and, in turn, model performance. Our
approaches successfully classified texts as either ’hallucinated’ or ’clean,’ with F1-scores
of 83.8% and 92.2%. In the future, we will analyze the impact of hallucinations in datasets for
other NLP tasks, such as entity and event extraction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hernández-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <article-title>Data augmentation instead of explicit regularization</article-title>
          , CoRR abs/1806.03852 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1806.03852. arXiv:1806.03852.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <article-title>Self-guided noise-free data generation for efficient zero-shot learning</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Conference on Learning Representations, ICLR'23</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing</source>
          , EMNLP-IJCNLP'19, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>6382</fpage>
          -
          <lpage>6388</lpage>
          . URL: https://aclanthology.org/D19-1670. doi:10.18653/v1/D19-1670.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Anaby-Tavor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carmeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Goldbraich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kantor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shlomov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zwerdling</surname>
          </string-name>
          ,
          <article-title>Do not have enough data? deep learning to the rescue!</article-title>
          ,
          <source>in: Proceedings of the 34th AAAI Conference on Artificial Intelligence</source>
          , AAAI'20, AAAI Press,
          <year>2020</year>
          , pp.
          <fpage>7383</fpage>
          -
          <lpage>7390</lpage>
          . URL: https://doi.org/10.1609/aaai.v34i05.6233. doi:10.1609/AAAI.V34I05.6233.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>REBEL: relation extraction by end-to-end language generation</article-title>
          ,
          <source>in: Findings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP'21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2370</fpage>
          -
          <lpage>2381</lpage>
          . URL: https://doi.org/10.18653/v1/2021.findings-emnlp.204.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Josifoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sakota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <article-title>Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'23, Singapore,
          <year>2023</year>
          , pp.
          <fpage>1555</fpage>
          -
          <lpage>1574</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.96. doi:10.18653/v1/2023.emnlp-main.96.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. F. R.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Investigating Pretrained Language Models for Graph-to-Text Generation</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI</source>
          , NLP4ConvAI@ACL'21, Online,
          <year>2021</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>227</lpage>
          . URL: https://aclanthology.org/2021.nlp4convai-1.20. doi:10.18653/v1/2021.nlp4convai-1.20.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yavuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <article-title>Stage-wise fine-tuning for graph-to-text generation</article-title>
          ,
          <source>in: Proceedings of the ACL-IJCNLP 2021 Student Research Workshop</source>
          , ACL 2021, Online, July 5-10, 2021, Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>22</lpage>
          . URL: https://doi.org/10.18653/v1/2021.acl-srw.2. doi:10.18653/V1/2021.ACL-SRW.2.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'20, Online,
          <year>2020</year>
          , pp.
          <fpage>2147</fpage>
          -
          <lpage>2157</lpage>
          . URL: https://aclanthology.org/2020.acl-main.194. doi:10.18653/v1/2020.acl-main.194.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Daxenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks</article-title>
          , in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>310</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.28. doi:10.18653/v1/2021.naacl-main.28.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation</article-title>
          ,
          <source>in: Findings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'21, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2225</fpage>
          -
          <lpage>2239</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.192. doi:10.18653/v1/2021.findings-emnlp.192.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          , T. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Hua, W. Jia,
          <source>Cognitive Mirage: A Review of Hallucinations in Large Language Models</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2309.06794. doi:10.48550/arXiv.2309.06794.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of Hallucination in Natural Language Generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . URL: http://arxiv.org/abs/2202.03629. doi:10.1145/3571730. arXiv:2202.03629.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation</article-title>
          ,
          <source>CoRR</source>
          abs/2307.03987 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.03987. doi:10.48550/ARXIV.2307.03987.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A Survey of Data Augmentation Approaches for NLP</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics</source>
          , ACL-IJCNLP'21, Virtual Event,
          <year>2021</year>
          , pp.
          <fpage>968</fpage>
          -
          <lpage>988</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Robust Training under Linguistic Adversity</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , EACL'17, Valencia, Spain,
          <year>2017</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
          . URL: https://aclanthology.org/E17-2004.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <article-title>Text Augmentation in a Multi-Task View</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , EACL'21, Virtual Event,
          <year>2021</year>
          , pp.
          <fpage>2888</fpage>
          -
          <lpage>2894</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.252. doi:10.18653/v1/2021.eacl-main.252.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Paz</surname>
          </string-name>
          ,
          <source>mixup: Beyond Empirical Risk Minimization</source>
          ,
          <year>2018</year>
          . URL: http://arxiv.org/abs/1710.09412. doi:10.48550/arXiv.1710.09412, arXiv:1710.09412.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
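          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,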
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choe</surname>
          </string-name>
          ,
          <article-title>CutMix: Regularization strategy to train strong classifiers with localizable features</article-title>
          ,
          <source>in: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision</source>
          , ICCV'19, IEEE,
          <year>2019</year>
          , pp.
          <fpage>6022</fpage>
          -
          <lpage>6031</lpage>
          . URL: https://doi.org/10.1109/ICCV.2019.00612. doi:10.1109/ICCV.2019.00612.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beckham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Najafi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mitliagkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Paz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Manifold Mixup: Better Representations by Interpolating Hidden States</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning, ICML'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6438</fpage>
          -
          <lpage>6447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>4044</fpage>
          -
          <lpage>4051</lpage>
          . doi:10.1609/aaai.v34i04.5822.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Beckham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Honari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ghadiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Hjelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>On Adversarial Mixup Resynthesis</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>32</volume>
          of NeurIPS'19,
          <year>2019</year>
          . URL: https://papers.nips.cc/paper/2019/hash/f708f064faaf32a43e4d3c784e6af9ea-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Sequence-Level Mixed Sample Data Augmentation</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'20, Virtual Event,
          <year>2020</year>
          , pp.
          <fpage>5547</fpage>
          -
          <lpage>5552</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.447. doi:10.18653/v1/2020.emnlp-main.447.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Clive</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <article-title>Control Prefixes for Parameter-Efficient Text Generation</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2110.08329. doi:10.48550/arXiv.2110.08329, arXiv:2110.08329.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          WebNLG,
          <source>Papers with Code - WebNLG Dataset</source>
          ,
          <year>2024</year>
          . URL: https://paperswithcode.com/dataset/webnlg.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gardent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimorina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Perez-Beltrachini</surname>
          </string-name>
          ,
          <article-title>The WebNLG Challenge: Generating Text from RDF Data</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Natural Language Generation</source>
          , INLG'17, Santiago de Compostela, Spain,
          <year>2017</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>133</lpage>
          . URL: https://aclanthology.org/W17-3518. doi:10.18653/v1/W17-3518.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>How to Unleash the Power of Large Language Models for Few-shot Relation Extraction?</article-title>
          ,
          <source>in: Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing</source>
          ,
          <source>SustaiNLP'23</source>
          , Toronto, Canada (Hybrid),
          <year>2023</year>
          , pp.
          <fpage>190</fpage>
          -
          <lpage>200</lpage>
          . URL: https://aclanthology.org/2023.sustainlp-1.13. doi:10.18653/v1/2023.sustainlp-1.13.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <article-title>ZeroGen: Efficient zero-shot learning via dataset generation</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'22, Association for Computational Linguistics,
          <year>2022</year>
          , pp.
          <fpage>11653</fpage>
          -
          <lpage>11669</lpage>
          . URL: https://doi.org/10.18653/v1/2022.emnlp-main.801. doi:10.18653/v1/2022.emnlp-main.801.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thulasidasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chennupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohd-Yusof</surname>
          </string-name>
          ,
          <article-title>Combating label noise in deep learning using abstention</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          of ICML'19, PMLR,
          <year>2019</year>
          , pp.
          <fpage>6234</fpage>
          -
          <lpage>6243</lpage>
          . URL: http://proceedings.mlr.press/v97/thulasidasan19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erfani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Normalized loss functions for deep learning with noisy labels</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of ICML'20
          ,
          <year>2020</year>
          , pp.
          <fpage>6543</fpage>
          -
          <lpage>6553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Peer loss functions: Learning from noisy labels without knowing noise rates</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of ICML'20
          ,
          <year>2020</year>
          , pp.
          <fpage>6226</fpage>
          -
          <lpage>6236</lpage>
          . URL: http://proceedings.mlr.press/v119/liu20e.html.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>K.</given-names>
            <surname>Filippova</surname>
          </string-name>
          ,
          <article-title>Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data</article-title>
          ,
          <source>in: Findings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'20, Online,
          <year>2020</year>
          , pp.
          <fpage>864</fpage>
          -
          <lpage>870</lpage>
          . URL: https://aclanthology.org/2020.findings-emnlp.76. doi:10.18653/v1/2020.findings-emnlp.76.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Aljunied</surname>
          </string-name>
          ,
          <article-title>Revisiting DocRED - Addressing the False Negative Problem in Relation Extraction</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'22, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>8472</fpage>
          -
          <lpage>8487</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.580. doi:10.18653/v1/2022.emnlp-main.580.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Platanios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poczos</surname>
          </string-name>
          ,
          <article-title>Re-TACRED: Addressing Shortcomings of the TACRED Dataset</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>13843</fpage>
          -
          <lpage>13850</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/17631. doi:10.1609/aaai.v35i15.17631.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
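          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tsang</surname>
          </string-name>
          ,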
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <article-title>How does Disagreement Help Generalization against Label Corruption?</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning, ICML'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>7164</fpage>
          -
          <lpage>7173</lpage>
          . URL: https://proceedings.mlr.press/v97/yu19b.html.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Houle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Erfani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N. R.</given-names>
            <surname>Wijewickrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Dimensionality-driven learning with noisy labels</article-title>
          ,
          <source>in: Proceedings of the 35th International Conference on Machine Learning</source>
          , volume
          <volume>80</volume>
          of ICML'18
          ,
          <year>2018</year>
          , pp.
          <fpage>3361</fpage>
          -
          <lpage>3370</lpage>
          . URL: http://proceedings.mlr.press/v80/ma18d.html.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <article-title>A Structured Review of the Validity of BLEU</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>44</volume>
          (
          <year>2018</year>
          )
          <fpage>393</fpage>
          -
          <lpage>401</lpage>
          . URL: https://doi.org/10.1162/coli_a_00322. doi:10.1162/coli_a_00322.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.</given-names>
            <surname>Falke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F. R.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Utama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'19, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>2214</fpage>
          -
          <lpage>2220</lpage>
          . URL: https://aclanthology.org/P19-1213. doi:10.18653/v1/P19-1213.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'20
          ,
          <year>2020</year>
          , pp.
          <fpage>1072</fpage>
          -
          <lpage>1086</lpage>
          . URL: http://arxiv.org/abs/2005.00969. doi:10.18653/v1/2020.acl-main.101, arXiv:2005.00969.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Poff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Retrieval Augmentation Reduces Hallucination in Conversation</article-title>
          ,
          <source>in: Findings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'21, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3784</fpage>
          -
          <lpage>3803</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.320. doi:10.18653/v1/2021.findings-emnlp.320.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martindale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McNamee</surname>
          </string-name>
          ,
          <article-title>Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation</article-title>
          ,
          <source>in: Proceedings of Machine Translation Summit XVII</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>243</lpage>
          . URL: https://aclanthology.org/W19-6623.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sui</surname>
          </string-name>
          ,
          <article-title>Towards Faithfulness in Open Domain Table-to-text Generation from an Entity-centric View</article-title>
          , volume
          <volume>35</volume>
          of AAAI'21
          ,
          <year>2021</year>
          , pp.
          <fpage>13415</fpage>
          -
          <lpage>13423</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/17583. doi:10.1609/aaai.v35i15.17583.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>O.</given-names>
            <surname>Dušek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kasner</surname>
          </string-name>
          ,
          <article-title>Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Natural Language Generation</source>
          , ICNLG'20, Dublin, Ireland,
          <year>2020</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>137</lpage>
          . URL: https://aclanthology.org/2020.inlg-1.19. doi:10.18653/v1/2020.inlg-1.19.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation</article-title>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/1910.08684. doi:10.48550/arXiv.1910.08684, arXiv:1910.08684 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'19, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>2673</fpage>
          -
          <lpage>2679</lpage>
          . URL: https://aclanthology.org/P19-1256. doi:10.18653/v1/P19-1256.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>A Partition Filter Network for Joint Entity and Relation Extraction</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , EMNLP'21, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>197</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.17. doi:10.18653/v1/2021.emnlp-main.17.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>DocRED: A Large-Scale Document-Level Relation Extraction Dataset</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'19, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>764</fpage>
          -
          <lpage>777</lpage>
          . URL: https://aclanthology.org/P19-1074. doi:10.18653/v1/P19-1074.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          ,
          <article-title>DREEAM: Guiding Attention with Evidence for Improving Document-Level Relation Extraction</article-title>
          (
          <year>2023</year>
          ). URL: http://arxiv.org/abs/2302.08675. doi:10.48550/arXiv.2302.08675, arXiv:2302.08675.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <article-title>Papers with Code - DocRED Benchmark (Relation Extraction)</article-title>
          ,
          <year>2024</year>
          . URL: https://paperswithcode.com/sota/relation-extraction-on-docred.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <article-title>Papers with Code - ReDocRED Benchmark (Relation Extraction)</article-title>
          ,
          <year>2024</year>
          . URL: https://paperswithcode.com/sota/relation-extraction-on-redocred.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aghajanyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Okhonko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>HTLM: hyper-text pre-training and prompting of language models</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Learning Representations, ICLR'22</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL'18, Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>514</lpage>
          . URL: https://aclanthology.org/P18-1047. doi:10.18653/v1/P18-1047.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53] WebNLG,
          <article-title>Papers with Code - WebNLG Benchmark (Relation Extraction)</article-title>
          ,
          <year>2024</year>
          . URL: https://paperswithcode.com/sota/relation-extraction-on-webnlg.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>CoRR abs/2307.09288</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.09288. doi:10.48550/ARXIV.2307.09288. arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gardent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimorina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Perez-Beltrachini</surname>
          </string-name>
          ,
          <article-title>Creating Training Corpora for NLG Micro-Planners</article-title>
          , in: R. Barzilay, M.-Y. Kan (Eds.),
          <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>188</lpage>
          . URL: https://aclanthology.org/P17-1017. doi:10.18653/v1/P17-1017.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <article-title>sentence-transformers/all-mpnet-base-v2 · Hugging Face</article-title>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.11692. doi:10.48550/arXiv.1907.11692, arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>DeBERTa: decoding-enhanced BERT with disentangled attention</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Learning Representations, ICLR'21</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=XPZIaotutsD.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>