<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Different Hallucinations calls for Different Solutions - A Categorisation of LLM Transcription Mistakes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nemi Pelgrom</string-name>
          <email>nemi.pelgrom@lnu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Håkan Grahn</string-name>
          <email>hakan.grahn@bth.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Media Technology, Linnaeus University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Blekinge Institute of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a contribution to better interpretation of the results we get from GenAI models; more specifically, better interpretation of the mistakes that they make. We analysed 644 (from GPT-4o) + 4858 (from ARIA) mistakes made by two models on a key-value extraction task, and found that they may be categorised into three mutually exclusive groups: i (problems identifying the requested information), p (problems presenting the correct information), and s (skewed training data). These categories could be used to indicate which action a user could take to reduce the number of mistakes. Further, we found a strong correlation between the suggested categories and the Ratcliff/Obershelp pattern recognition score between the generated result and the expected result: all faulty results containing minor mistakes are more than 60% similar to the expected result, and only mistakes based on failing to identify what was requested had less than 60% similarity to the expected result.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative AI</kwd>
        <kwd>LLM</kwd>
        <kwd>Verification</kwd>
        <kwd>Document analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        While there are many papers detailing the accuracy of large language models’ (LLMs’) knowledge of particular topics or forms of reasoning, we look more closely at the different ways in which LLMs make mistakes, often called “hallucination”. While hallucinations are widely discussed in the media and in AI research, the focus is mainly on avoiding them. One-dimensional accuracy measurements give some indication of how well models present results for particular tasks, but the lack of discussion of what form the mistakes take leaves us with little insight into how one might reach higher accuracy. For example, should we improve the prompting, use better models, or change the line of questioning? How might one avoid the wrong results [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], and what might be the edge cases that were not easy to classify as right or wrong [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]?
      </p>
      <p>This experiment had the aim of identifying patterns in the mistakes made by GenAI models with vision capabilities, to better interpret whether a model is the right fit for a certain task. This is relevant to the currently emerging paradigm, where there are GenAI models of varying formats and aims, and the main question is no longer which model structure is the best one, but rather “What model is the best for my particular task?”.</p>
      <p>https://lnu.se/personal/nemi.pelgrom/ (N. Pelgrom); https://grahn.cse.bth.se/ (H. Grahn)</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        We have conducted previous experiments on transcription tasks which indicated that the mistakes made can be systematically categorised [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. While “hallucination” is currently used to refer to a wide range of mistakes made by all GenAI models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we find this terminology unhelpful, and will mainly use the term mistake to refer to responses that are faulty. This paper is a continuation of that work, towards systematically identifying the different mistakes that these models make, so that we may better identify the best path forward when we get lower-than-expected results from a GenAI model. It is not currently possible to automatically identify whether the model, the prompt, or something else in the requested task is responsible for unsatisfactory results when such occur. We contribute towards making such an identification possible by reporting our methods and results for identifying categories of mistakes that are likely to have different sources.
      </p>
      <p>The experiment uses the two multimodal generative models GPT-4o and ARIA to complete a transcription task on a dataset of 3000 images. This provides us with an environment where it is immediately clear whether the model understands the requested task, and where it is easy to automatically separate the correct answers from the incorrect ones, as opposed to pure text questions, which often require some qualitative interpretation. Further, we request transcriptions of numbers and of natural language strings separately, allowing us to identify possible differences in how they are interpreted by the models.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background and Related work</title>
      <p>
        The development of vision-augmented LLMs has been rapid. The first ones were made easily available just a year ago [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Despite this, interest in using them is so wide that there have been several significant developments since then. There are now many options for vision-LLMs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], including several open-source ones [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Open-source models are especially interesting for tasks which require careful data handling. For example, the medical sciences are one of the main driving forces in AI-enhanced image interpretation tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and many of the images gathered for medical studies contain personally identifiable information, making them difficult to handle without breaking confidentiality laws. When it is possible to run all AI tasks locally, such issues disappear. This progress also builds on decades of research in optical character recognition (OCR), which has long aimed to automate transcription tasks and other kinds of information extraction from images. While traditional OCR methods had limitations in dealing with complex layouts or degraded text, vision-capable models have significantly improved the accuracy and versatility of these systems, expanding their usefulness in domains that demand both precision and context understanding [
        <xref ref-type="bibr" rid="ref14 ref15 ref7">7, 14, 15</xref>
        ].
      </p>
      <p>
        Vision models generally have a three-part structure: one part for interpreting the image, one for interpreting text, and one that combines the end result in a fitting way [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Other designs have been suggested as well [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Many of these models can be fine-tuned to be better at a desired task, and there are several different ways this may be done: regular fine-tuning, where the whole model is re-trained with the added dataset; LoRA (Low-Rank Adaptation) [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ] fine-tuning, where the new training data is added in between layers rather than making any changes to the pre-trained parts of the model; and there are also good results from adding the new data to the beginning of the model [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        There are also other ways to add additional training data to a pre-trained model. These include Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], larger context windows [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and creating pipelines which add the relevant information to the model at the right time [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        A recent addition to the set of GenAI architectures is mixture-of-experts solutions. These aim to minimise the effort used for any particular task by separating the models into several parts, where some parts may be ignored when they are deemed irrelevant for the intended task [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. For our experiment, we are using a model based on this kind of architecture: ARIA [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], the best GenAI model with vision-to-text capabilities available to run locally at the beginning of this project. To contrast it, we are using OpenAI’s model GPT-4o, the best available model with vision-to-text capabilities at the time.
      </p>
      <p>
        Despite this great progress, there are still shortcomings [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ]. “Hallucination” has become the standard word for mistakes made by generative AI models [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ], and there are both many ways for the models to make mistakes [
        <xref ref-type="bibr" rid="ref32 ref33 ref34">32, 33, 34</xref>
        ] and many ways for us researchers to judge or estimate what should be counted as a mistake [
        <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
        ]. This is the area in which our paper contributes. There have been some in-depth qualitative studies of the hallucinations of these models [
        <xref ref-type="bibr" rid="ref37 ref8">37, 8</xref>
        ], discussing how the content and factuality of source text relates to generated text [
        <xref ref-type="bibr" rid="ref28 ref38 ref39">28, 38, 39</xref>
        ], and of course many papers that establish the presence of mistakes simply by presenting the accuracy results of some experiment [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ]. There has not been much exploration of what can be learned quantitatively from the mistakes themselves. While simple accuracy estimations allow for fast comparison between several ways to solve the same task, they give little insight into how any of these ways may be improved. The emergence of prompt engineering as a role in itself [
        <xref ref-type="bibr" rid="ref41 ref42 ref43">41, 42, 43</xref>
        ] shows that it is possible to reach significantly different accuracy results depending not only on what information is provided to the model about the task it should complete, but also on how that information is presented to the model. It would be valuable to be able to identify whether lower accuracy results from using a model are due to the limitations of the model or to the limitations of the prompt used.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section contains all the details of the conducted experiment; the results are presented in the next section.</p>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>The dataset consists of 3000 images of real receipts collected from a wide range of purchases in Sweden. The images are scans or photos containing a full view of each receipt and, in most cases, some additional background such as hands, tables, and knees. Most of the receipts have wrinkles and are not fully flat, which makes them harder to read and keeps them representative of the receipt scans that are uploaded to receipt-reading services.</p>
        <p>
          These images provide a complex task of identifying the correct information requested in the prompt, extracting it correctly, and then presenting it in the way specified by the prompt. If any one of these steps goes wrong, the end result will be faulty. This makes it very impressive that some of these models are able to reach high accuracy results [
          <xref ref-type="bibr" rid="ref19 ref7">7, 19</xref>
          ].
        </p>
        <p>
          From this raw dataset, we created the data used in our experiment: we ran each image through each of the two models and created separate JSON files containing the response of each reading, for each model. The same prompt was used for both models. GPT-4o was accessed through an API call, and ARIA was run locally on an Nvidia A100 GPU. These models were chosen for being the best available, and the best available to run locally, respectively, with regard to OCR tasks [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] at the time of our experiments.
        </p>
        <p>This gave us three datasets for our comparisons: 3000 JSON files with GPT-4o transcriptions,
3000 JSON files with ARIA transcriptions, and 3000 JSON files containing the key for each
image.</p>
        <p>We wrote Python code that compared a selection of keys from each file: date, total amount to pay, VAT amount, company name, and organisation number. We chose these keys because they are present in most of the images, making them more representative of a standard receipt than, for example, “tips”, which is only present on a small minority of the images. When we originally included more keys in the comparison, it required much more data cleaning. We decided that a smaller dataset is preferable to a more complex cleaning step, to ensure the replicability of our results.</p>
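        <p>The per-file comparison step can be sketched as follows. This is our own minimal illustration, not the exact code used in the experiment; the JSON field names are assumptions.</p>

```python
# Sketch of the per-file comparison. The five field names below mirror the
# keys described in the text; the exact names in the experiment's JSON
# files are an assumption.
FIELDS = ["Date", "TotalAmount", "VatAmount", "MerchantName", "OrganisationNumber"]

def compare_transcription(transcription: dict, key: dict) -> list[dict]:
    """Return one comparison row per selected field."""
    rows = []
    for field in FIELDS:
        got = str(transcription.get(field, "")).strip()
        expected = str(key.get(field, "")).strip()
        rows.append({"field": field, "got": got,
                     "expected": expected, "match": got == expected})
    return rows
```

        <p>Running this over 3000 transcription/key file pairs yields the 15,000 comparison rows described in Section 3.3.</p>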
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Prompt</title>
        <p>The prompt that was used was developed by Fortnox AB, with support from Microsoft. It extracts most of the information that could be of book-keeping interest from each receipt. While this means that a large number of keys were extracted from each image, we chose to do our analysis on only five keys, so that we only looked at what is available on the majority of the images. Including keys that exist on fewer of the images would require more time spent on data cleaning, without increasing the number of useful mistakes in a proportional way.</p>
        <p>
          We included TotalAmount as a representation of a free-format number. We included Date and VAT as representations of a number to be extracted in a specific format, OrganisationNumber as a representation of longer number strings (which are known to be difficult for GenAI models [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), and finally MerchantName to represent text. We included this variety of keys, since it is known that there are patterns of different mistakes for different kinds of information.
        </p>
        <p>Your final output must satisfy the following typescript schema:
Valid VAT rates
type VatRate = "0%" | "12%" | "25%" | "6%";
Valid payment methods
type PaymentMethod = "" | "card" | "gift card" | "mobile" | "swish";
Valid currency types
type Currency = "SEK" | "DKK" | "NOK" | "EUR" | "USD" | "GBP" | "JPY" | "AUD" | "CHF" | "CAD" | "CNY" | "SGD" | "KRW" | "PLN" | "INR" | "HUF" | "other";
Return the response in a JSON-format that satisfies the above typescript type for Receipt.</p>
        <p>Only use the types from above nothing else.</p>
        <p>Only output pure json.</p>
        <p>Do your best, think step by step.</p>
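        <p>A model reply can be checked mechanically against the enumerated types in the prompt. The sketch below is our own illustration (not Fortnox’s validation code, and the field names are assumptions); it simply tests membership in the three enumerations.</p>

```python
# Allowed values copied from the typescript schema in the prompt.
VAT_RATES = {"0%", "12%", "25%", "6%"}
PAYMENT_METHODS = {"", "card", "gift card", "mobile", "swish"}
CURRENCIES = {"SEK", "DKK", "NOK", "EUR", "USD", "GBP", "JPY", "AUD", "CHF",
              "CAD", "CNY", "SGD", "KRW", "PLN", "INR", "HUF", "other"}

def uses_valid_types(receipt: dict) -> bool:
    """Check that the enumerated fields of a reply stay inside the schema."""
    return (receipt.get("VatRate") in VAT_RATES
            and receipt.get("PaymentMethod") in PAYMENT_METHODS
            and receipt.get("Currency") in CURRENCIES)
```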
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data Cleaning</title>
        <p>We have two datasets of 15,000 comparisons each (the number of files multiplied by the number of keys we chose for the comparison), in the format of Excel sheets.</p>
        <p>For each of them we conduct the following steps:</p>
          <p>1. We remove all rows where our transcriptions match the keys.
2. We remove all rows where the key has no entry.</p>
          <p>This left us with 644 (GPT-4o) and 4858 (ARIA) results that did not match their respective keys.</p>
          <p>An additional 488 rows were removed during the annotation process. These were comparisons that were not identifiable earlier in the process as irrelevant to our study, but which could have skewed our results if they had been left in the final dataset. All of these rows were ones where we deemed it inaccurate to count the transcription as wrong, despite it not matching the key. Here are some examples of these instances:
• The key stating “Company Name”, and the model’s result stating “Company Name AB”. Both answers are present on the receipt and may therefore be seen as correct.
• The key stating “179.90” and the transcription stating “180”, where the image states that the total amount is 180 but that the card was charged 179.90. Both answers are therefore present on the receipt as a total amount, and may be seen as correct.</p>
          <p>Only these false positives were removed.</p>
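          <p>The two removal steps can be sketched as a filter over comparison rows (a minimal sketch; the row field names are our own hypothetical illustration):</p>

```python
# Sketch of the cleaning filter: one step drops rows where the key has no
# entry, the other drops rows where the transcription matches the key.
# Row field names are hypothetical.
def clean(rows: list[dict]) -> list[dict]:
    kept = []
    for row in rows:
        if row["expected"] == "":          # key has no entry
            continue
        if row["got"] == row["expected"]:  # transcription matches the key
            continue
        kept.append(row)                   # a genuine mismatch to annotate
    return kept
```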
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Annotation</title>
        <p>Once the set of transcriptions was ready, we annotated each row with one of the following:
• i: the mistake is not correctly identifying what information is requested
• s: the mistake is skewed data, presenting a word or number that is clearly not a misreading of what is present on the image, but instead a word or number from the training data that the model interprets as equivalent
• p: the mistake is a presentation problem, where the correct information is identified and mostly correctly presented. Wrong spellings are included here, as is reading i as 1 or 8 as 3.</p>
        <p>
          Category s could be understood as relating to the category ”confabulations” as introduced
in [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. They are not identical since that concept is focused on the meaning content of text,
and we are focusing on the characters, but both concepts aim to find wrong answers that may
appear right when there is no key available. No row was left unannotated, and the categories
are mutually exclusive so that no row could be annotated into more than one category.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>3.5. Ratcliff/Obershelp pattern, Jaccard similarity and Levenshtein distance</title>
        <p>
          Once all rows were annotated, we noticed a pattern: mistakes in the p category were close to the correct replies, with a difference of only one or two symbols (e.g. misspellings, or reading a 7 as a 1). This is partly true by definition, but it indicated a possibility of automatically identifying which category a particular mistake belongs to. Based on this, we ran the document through Python code that added a Ratcliff/Obershelp pattern calculation [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ], a Jaccard similarity percentage [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ], and a Levenshtein distance [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ] for each row. We found a strong correlation between the Ratcliff/Obershelp similarity of a response to its key and the mistake category it had been annotated as. While it is not surprising that we found correlations, since part of the definition of category p is that the result is similar to the correct one, it was unexpected that the amount of spelling mistakes was so consistent, rather than evenly distributed.
        </p>
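        <p>The Ratcliff/Obershelp score is what Python’s standard library difflib.SequenceMatcher computes as ratio(), so the per-row calculation can be sketched as:</p>

```python
import difflib

def gestalt_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp (Gestalt) similarity: 2*M / (len(a) + len(b)),
    where M is the total length of the matching blocks."""
    return difflib.SequenceMatcher(None, a, b).ratio()
```

        <p>For example, gestalt_similarity("abc", "abd") is 2/3 (two of three characters match, in order), while two strings with no common characters score 0.</p>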
        <p>Levenshtein distance, however, did not show any useful correlation with the categories. While it correctly identifies long strings as being close to the correct answer when there are only spelling mistakes, it does not account for the length of the compared strings, which makes fully wrong replies score as well as slightly wrong ones when the compared strings are short.</p>
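        <p>The length problem can be seen with a minimal Levenshtein implementation (our own sketch; the standard library has none): a fully wrong one-character reply and a long reply with a single typo both get distance 1.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# A fully wrong short reply and a slightly wrong long reply score the same:
assert levenshtein("7", "1") == 1
assert levenshtein("Company Nme", "Company Name") == 1
```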
        <p>Jaccard similarity had some predictive value: lower similarity correlates with category i, and higher with category p. However, there is no cut-off point between them, and the distribution is broad. Further, Jaccard similarity is based only on the characters that are present, which means that it calculates two strings as equal if they contain the same characters, even when the characters are not in the same order. This means that it equates 660 with 600, which is not ideal.</p>
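        <p>This order-insensitivity is easy to demonstrate with a character-level Jaccard sketch:</p>

```python
# Character-level Jaccard similarity (a sketch). It compares character
# sets only, so order and repetition are ignored.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# "660" and "600" use the same character set {"6", "0"}, so Jaccard
# considers them identical even though the amounts differ:
assert jaccard("660", "600") == 1.0
```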
        <p>
          Ratcliff/Obershelp pattern matching, also called Gestalt pattern matching, accounts for the characters present, the lengths of the strings, and the order of the characters in the strings. This makes it a useful algorithm for identifying what kind of mistake a generative model has made, simply by calculating the pattern-matching score. Such an estimate may be performed automatically, making it possible to identify the character of the mistakes a model makes for a task without any further manual or otherwise qualitative interpretation of faulty responses. When we sorted the rows according to their similarity score, we found a very clear cut-off point, with no instances of i mistakes above it and no instances of p mistakes below it. This correlation will be shown in detail in the results section. While this is shown to be true in the context of our experiment, we have not yet tested the possibilities of generalising it to other application areas. It is possible that this correlation holds true for other tasks where it is possible to systematically identify one unique correct answer in relation to a prompt. The category s (hallucinations according to the most common usage of the term in literature on chatbots: “fictional or erroneous information” [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) had no cut-off point, but the majority of the mistakes in this category had a high Ratcliff/Obershelp similarity score.
        </p>
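        <p>This suggests a simple automatic classifier (a sketch of the heuristic described above; the 0.60 threshold is the cut-off observed in this experiment and may not transfer to other tasks):</p>

```python
import difflib

def classify_mistake(got: str, expected: str, threshold: float = 0.60) -> str:
    """Heuristic from the observed cut-off: low Ratcliff/Obershelp
    similarity indicates an identification (i) mistake; high similarity a
    presentation (p) or skewed-data (s) mistake."""
    ratio = difflib.SequenceMatcher(None, got, expected).ratio()
    return "i" if ratio < threshold else "p_or_s"
```

        <p>An empty or unrelated reply scores near 0 and is flagged i, while a near-correct spelling scores close to 1 and is flagged p or s.</p>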
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results and Discussion</title>
      <p>We present two primary findings from this experiment. First, we identified a consistent and practical categorisation scheme for all transcription mistakes made by GenAI models with vision-to-text capabilities, in scenarios where a structured key is available for comparison. These categories, i (identification), p (presentation), and s (skewed training data), are broadly applicable across all types of errors observed in our experiments, which involved thousands of receipt images.</p>
      <p>
        As shown in Table 1, both ARIA and GPT-4o produced mistakes in all three categories, and the distributions of these mistakes are not uniform. ARIA made significantly more total mistakes (4,858 vs. 644), and the majority of its mistakes (4,124) were classified as identification mistakes. GPT-4o, by contrast, exhibited a more balanced distribution, with 403 presentation and 138 identification mistakes. This uneven distribution suggests that these errors are not random and may reflect inherent tendencies in how each model handles uncertainty or incomplete information. While it is well known that GPT-4o avoids giving blank answers, to the degree of producing factual contradictions [
        <xref ref-type="bibr" rid="ref24 ref48 ref49 ref50">48, 49, 50, 24</xref>
        ], such tendencies are not necessarily shared by all GenAI models, which might explain this difference in this case.
      </p>
      <p>This disparity highlights that these errors are not random; rather, they reflect consistent patterns tied to model behaviour. Table 2 quantifies the difference, using GPT-4o as a baseline: ARIA made 2,888% more identification mistakes, 50% more presentation mistakes, and 25% more skewed data mistakes. The striking increase in i mistakes from ARIA suggests a conservative extraction approach, opting to leave fields blank when uncertain.</p>
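      <p>The 2,888% figure follows from the identification counts given above (4,124 for ARIA versus 138 for GPT-4o); a quick arithmetic check:</p>

```python
# Percentage increase relative to the GPT-4o baseline. The identification
# counts (4124 vs 138) are the ones stated in the text.
def pct_increase(new: int, base: int) -> float:
    return (new - base) / base * 100

assert round(pct_increase(4124, 138)) == 2888  # identification mistakes
```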
      <p>
        Field-level analysis provides more insight. Table 3 breaks down the number and type of mistakes by data key. ARIA’s highest concentrations of identification errors occurred in the Merchant Name and Organisation Number fields, with 1,619 and 1,847 respectively. For these, the model frequently failed to return any value, even when the information was clearly visible in the image. Table 4 supports this: ARIA’s empty response rate was 91% for Merchant Name and 98% for Organisation Number, versus 40% and 9% for GPT-4o, respectively. These high omission rates indicate ARIA’s tendency to skip uncertain fields entirely, possibly due to a narrower confidence threshold or the model being overwhelmed by the amount of information requested of it [
        <xref ref-type="bibr" rid="ref51 ref52">51, 52</xref>
        ].
      </p>
      </p>
      <p>The significant difference in identifying Organisation Numbers may also be explained by the structural properties of receipts and the training data of the models. For example, Swedish Organisation Numbers are often present but not explicitly labelled as such on the receipts in our dataset. Models not specifically trained to recognise these patterns, such as ARIA, are more prone to fail on identification, especially if they rely on keyword cues. GPT-4o, by contrast, appears more likely to attempt a response even when uncertain, which explains its higher relative rate of presentation mistakes, where the correct value is approximated but not matched exactly to the key.</p>
      <p>Additionally, the Merchant Name field revealed another nuance. According to Table 5, this field was the most affected by cleaning in both model outputs. GPT-4o had a 21% retention rate post-cleaning, while ARIA retained 86%. This difference is partly explained by ARIA’s tendency to give blank answers. But it brings our attention to another issue with automatic accuracy estimations: instances where there are multiple correct answers. All of the results that were removed in this data cleaning had multiple accurate results. They could have been included in our experiment if the keys were of a format that allowed several answers to be understood as correct.</p>
      <p>The second major finding is the strong correlation between Ratcliff/Obershelp similarity scores and error category. Table 6 shows that all identification mistakes i had similarity scores below 60%, while presentation mistakes p had scores of 60% or higher. Skewed data mistakes s showed a mixed pattern: most were above 60%, but a small percentage (10% for GPT-4o, 5% for ARIA) fell between 35–60%. This suggests that Ratcliff/Obershelp similarity can be used as a heuristic for error classification: high similarity likely indicates p or s errors, while low similarity indicates i errors. The exact threshold may vary depending on the text length and structure of the extracted information, but the trend holds across both models and all keys that we tested.</p>
      <p>Further analysis of skewed data errors reveals model-specific behaviour. ARIA exhibits a high rate of s errors in the Date field (28%), suggesting a tendency to hallucinate plausible but incorrect values when uncertain. In contrast, GPT-4o shows more skewed mistakes in the VAT field (5%), but overall keeps skewed data rates low across most keys (Table 4). These patterns reinforce the idea that ARIA is conservative, risking omissions, while GPT-4o aims for completeness, sometimes at the expense of accuracy.</p>
      <p>By combining our categorisation with Ratcliff/Obershelp similarity metrics, we provide a replicable framework for analysing transcription performance.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>We conducted an experiment focused on identifying categories of mistakes made by GenAI models, with the intention of finding patterns that could help clarify which models may or may not be useful for particular tasks, beyond what a simple accuracy estimate does. We found two useful patterns: there are three systematically identifiable categories of mistakes that recur across different models, and there is a simple way to automatically sort mistakes into at least two of these categories. This paper reports the details of how the experiment was conducted, and we suggest that further research be done on the possibilities of generalising these findings, and on identifying the reasons why these mistakes occur.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We want to thank Fortnox AB for providing the dataset, prompt, and financial support for this
project, and we also thank the reviewers for their helpful contributions.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o for drafting the structure of
the paper. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonmoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rawte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of hallucination mitigation techniques in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.01313 6</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>Automated classification of model errors on imagenet</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>36826</fpage>
          -
          <lpage>36885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          ,
          <article-title>Evaluating factual accuracy in complex data-to-text</article-title>
          ,
          <source>Comput. Speech Lang.</source>
          <volume>80</volume>
          (
          <year>2023</year>
          )
          <fpage>101482</fpage>
          . URL: https://doi.org/10.1016/j.csl.2023.101482. doi:10.1016/j.csl.2023.101482.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kwatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramjee</surname>
          </string-name>
          ,
          <article-title>Accuracy is not all you need</article-title>
          ,
          <source>arXiv preprint arXiv:2407.09141</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goodrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <article-title>Assessing the factual accuracy of generated text</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD '19</source>
          ,
          ACM,
          <year>2019</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>175</lpage>
          . URL: http://dx.doi.org/10.1145/3292500.3330955. doi:10.1145/3292500.3330955.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelgrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hangelbäck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ericsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nordqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grahn</surname>
          </string-name>
          ,
          <article-title>Hallucinations and training-data bias: Results from two number transcription experiments using gpt models</article-title>
          ,
          <source>in: International Conference on Computational Science and Computational Intelligence</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelgrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ericsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nordqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hagelbäck</surname>
          </string-name>
          ,
          <article-title>Chatgpt as a combined ocr and key-value extractor</article-title>
          ,
          <source>in: 2025 IEEE 10th International Conference on Big Data Analytics (ICBDA)</source>
          ,
          <year>2025</year>
          . Accepted, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maleki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <article-title>Ai hallucinations: a misnomer worth clarifying</article-title>
          ,
          <source>in: 2024 IEEE conference on artificial intelligence (CAI)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>OpenAI</collab>
          ,
          <article-title>Gpt-4v(ision) system card</article-title>
          , https://openai.com/index/gpt-4v-system-card/,
          <year>2024</year>
          . Accessed: 2024-10-14.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , et al.,
          <article-title>Visionllm: Large language model is also an open-ended decoder for vision-centric tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Kosmos-2: Grounding multimodal large language models to the world</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.14824. arXiv:2306.14824.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metaxas</surname>
          </string-name>
          ,
          <article-title>On the challenges and perspectives of foundation models for medical image analysis</article-title>
          ,
          <source>Medical image analysis</source>
          <volume>91</volume>
          (
          <year>2024</year>
          )
          <fpage>102996</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baudru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ryckbosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ginis</surname>
          </string-name>
          ,
          <article-title>Early evidence of how llms outperform traditional systems on ocr/htr tasks for historical records</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.11623. arXiv:2501.11623.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Ocean-ocr: Towards general ocr application via a vision-language model</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.15558. arXiv:2501.15558.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Visualbert: A simple and performant baseline for vision and language</article-title>
          ,
          <source>arXiv preprint arXiv:1908.03557</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Masry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Noël</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Madhusudhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pedersoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chapados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hoque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Laradji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Taslakian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajeswar</surname>
          </string-name>
          ,
          <article-title>Alignvlm: Bridging vision and language latent spaces for multimodal understanding</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.01341. arXiv:2502.01341.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Efficient memory management for large language model serving with pagedattention</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2309.06180. arXiv:2309.06180.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Faysse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sibille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Omrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Viaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hudelot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          ,
          <article-title>Colpali: Efficient document retrieval with vision language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.01449. arXiv:2407.01449.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Lora: Low-rank adaptation of large language models</article-title>
          ,
          <source>ICLR</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>A survey on lora of large language models</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>19</volume>
          (
          <year>2025</year>
          )
          <fpage>197605</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Llamaadapter: Efficient fine-tuning of language models with zero-init attention</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.16199. arXiv:2303.16199.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Extending context window of large language models via positional interpolation</article-title>
          ,
          <source>arXiv preprint arXiv:2306.15595</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Scius-Bertrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fakhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vögtlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Cabral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <article-title>Are layout analysis and ocr still useful for document information extraction using foundation models?</article-title>
          ,
          <source>in: International Conference on Document Analysis and Recognition</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Parameter-efficient fine-tuning of large-scale pre-trained language models</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>220</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Aria: An open multimodal native mixture-of-experts model</article-title>
          ,
          <source>arXiv preprint arXiv:2410.05993</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <article-title>Foundations &amp; trends in multimodal machine learning: Principles, challenges, and open questions</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>56</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maleki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <article-title>AI hallucinations: A misnomer worth clarifying</article-title>
          ,
          <source>in: 2024 IEEE Conference on Artificial Intelligence (CAI)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>138</lpage>
          . doi:10.1109/CAI59869.2024.00033.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>43</volume>
          (
          <year>2025</year>
          ). URL: https://doi.org/10.1145/3703155. doi:10.1145/3703155.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Knowles</surname>
          </string-name>
          ,
          <article-title>Six challenges for neural machine translation</article-title>
          ,
          <source>ACL 2017</source>
          (
          <year>2017</year>
          )
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Athaluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Manthena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K. M.</given-names>
            <surname>Kesapragada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yarlagadda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T. S.</given-names>
            <surname>Duddumpudi</surname>
          </string-name>
          ,
          <article-title>Exploring the boundaries of reality: investigating the phenomenon of artificial intelligence hallucination in scientific writing through ChatGPT references</article-title>
          ,
          <source>Cureus</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gravel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>D'Amours-Gravel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Osmanlliu</surname>
          </string-name>
          ,
          <article-title>Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions</article-title>
          ,
          <source>Mayo Clinic Proceedings: Digital Health</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>226</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>LLMs will always hallucinate, and we need to live with this</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2024</year>
          ) arXiv:2409
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <article-title>I think, therefore I hallucinate: Minds, machines, and the art of being wrong</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2025</year>
          ) arXiv:2503
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>The dawn after the dark: An empirical study on factuality hallucination in large language models</article-title>
          ,
          <source>in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>10879</fpage>
          -
          <lpage>10899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cancedda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>HalluLens: LLM hallucination benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2504.17550</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Farquhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>Detecting hallucinations in large language models using semantic entropy</article-title>
          ,
          <source>Nature</source>
          <volume>630</volume>
          (
          <year>2024</year>
          )
          <fpage>625</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Exploring and evaluating hallucinations in LLM-powered code generation</article-title>
          ,
          <source>arXiv preprint arXiv:2404.00971</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>L.</given-names>
            <surname>Giray</surname>
          </string-name>
          ,
          <article-title>Prompt engineering with ChatGPT: a guide for academic writers</article-title>
          ,
          <source>Annals of Biomedical Engineering</source>
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>2629</fpage>
          -
          <lpage>2633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnashar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spencer-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>A prompt pattern catalog to enhance prompt engineering with ChatGPT</article-title>
          ,
          <source>arXiv preprint arXiv:2302.11382</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A systematic survey of prompt engineering in large language models: Techniques and applications</article-title>
          ,
          <source>arXiv preprint arXiv:2402.07927</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>H.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>VLMEvalKit: An open-source toolkit for evaluating large multi-modality models</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Multimedia</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>11198</fpage>
          -
          <lpage>11201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Ratcliff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Metzener</surname>
          </string-name>
          , et al.,
          <article-title>Pattern matching: The gestalt approach</article-title>
          ,
          <source>Dr. Dobb's Journal</source>
          <volume>13</volume>
          (
          <year>1988</year>
          )
          <fpage>46</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jaccard</surname>
          </string-name>
          ,
          <article-title>Nouvelles recherches sur la distribution florale</article-title>
          ,
          <source>Bull. Soc. Vaud. Sci. Nat</source>
          .
          <volume>44</volume>
          (
          <year>1908</year>
          )
          <fpage>223</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          , et al.,
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          ,
          <source>in: Soviet physics doklady</source>
          , volume
          <volume>10</volume>
          ,
          Soviet Union
          ,
          <year>1966</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>A.</given-names>
            <surname>Payandeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pluth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hosier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Gurbani</surname>
          </string-name>
          ,
          <article-title>How susceptible are LLMs to logical fallacies?</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2308.09853. arXiv:2308.09853
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Utilize the flow before stepping into the same river twice: Certainty represented knowledge flow for refusal-aware instruction tuning</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.06913. arXiv:2410.06913
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Medico: Towards hallucination detection and correction with multi-source evidence fusion</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.10408. arXiv:2410.10408
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>French</surname>
          </string-name>
          ,
          <article-title>Catastrophic forgetting in connectionist networks</article-title>
          ,
          <source>Trends in Cognitive Sciences</source>
          <volume>3</volume>
          (
          <year>1999</year>
          )
          <fpage>128</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>An empirical study of catastrophic forgetting in large language models during continual fine-tuning</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2308.08747. arXiv:2308.08747
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>