<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Instruction-tuned Quantized Small Language Models (SLMs): A Study on Hallucination Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elijah Soba</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harika Abburi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nirmala Pudota</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jain Aayush</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balaji Veeramani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Bowen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanmitra Bhattacharya</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deloitte</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Touche LLP</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deloitte &amp; Touche Assurance and Enterprise Risk Services India Private Limited</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) have greatly advanced the field of Natural Language Generation (NLG). Despite their remarkable capabilities, their tendency to generate hallucinations, a phenomenon where models produce inaccurate or misleading information, continues to be a significant challenge to their broader adoption across various domains. In this paper, we investigate the impact of instruction-tuned quantized Small Language Models (SLMs) (defined as models with fewer than 15 billion active parameters), specifically trained on a subset of the Shared-task on Hallucinations and Related Observable Overgeneration Mistakes (SHROOM) dataset for hallucination detection. We focus on SLMs to achieve a balance between computational efficiency and performance in detecting hallucinations. The instruction-tuned quantized models are compared against the Generative Pre-trained Transformer 4 (GPT-4) and traditional textual entailment-based methods across various datasets. Our findings demonstrate that the optimized SLMs achieve performance comparable to LLMs like GPT-4 and outperform traditional textual entailment-based methods in hallucination detection. This research highlights the potential of smaller, instruction-tuned language models as practical and efficient solutions for improving the reliability of language models, especially in resource-constrained environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Hallucination Detection</kwd>
        <kwd>Small Language Models</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Instruction Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The domain of Natural Language Generation (NLG) is witnessing a remarkable transformation with the
emergence of Large Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. LLMs have been shown to outperform traditional
Natural Language Processing (NLP) approaches across a wide range of applications [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Despite
the rapid advancements in LLMs, a concerning trend has emerged where these models generate
hallucinations [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], resulting in content that appears plausible but is factually unsupported. This
issue is particularly critical in sensitive domains such as healthcare, finance, and legal services, where
the accuracy of generated content is paramount. Hence, the automatic detection of hallucinated
content has become an active area of research, aiming to enhance the reliability and trustworthiness of
LLM-generated content [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        Diverse modeling strategies, ranging from Black-Box and White-Box to evidence-based approaches
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], have been investigated to develop solutions for detecting hallucinated content. Black-Box
methods analyze the consistency of an LLM’s outputs through follow-up questions with other LLMs [9] or
by prompting the LLM for self-evaluation [10]. [11] proposed semantic-aware cross-check consistency
(SAC3), a sampling-based approach that builds upon self-consistency checks by incorporating
semantically equivalent question perturbations and cross-model response consistency verification techniques.
Similarly, [12] introduced SelfCheckGPT, which detects inconsistencies by evaluating the stability of
generated responses. These methods assume that inconsistencies arise when an LLM is uncertain about a
concept. However, both approaches require multiple response generations from LLMs, making them
computationally expensive for practical applications.
      </p>
      <p>Disinformation, Misinformation and Learning in the Age of Generative AI: Joint Proceedings of the 1st International Workshop
on Disinformation and Misinformation in the Age of Generative AI (DISMISS-FAKE’25) and the 4th International Workshop on
Investigating Learning during Web Search (IWILDS’25) co-located with 18th International ACM WSDM Conference on Web Search
and Data Mining (WSDM 2025)
      </p>
      <p>The White-Box approaches explore the internal workings of LLMs to analyze factual recall. [13]
analyzed how LLMs encode factual statements with a specific structure. They proposed that facts are
stored in the multi-layer perceptron layers and transferred through attention layers that focus on subject tokens.
Similarly, [14] leveraged the activations of hidden layers as inputs to a classifier designed to assess
the truthfulness of statements. [15] proposed the constraint SATisfaction (SAT) Probe, a method that probes
attention patterns to predict factual errors and allow early error identification. While these approaches
are promising for hallucination detection, their implementation remains challenging as access to the
inner workings of LLMs is not always feasible.</p>
      <p>Recently, evidence-based fact-checking has gained significant attention as an essential tool for combating
misinformation. Factual precision in Atomicity Score (FACTSCORE) by [16] evaluates the correctness
of individual facts within the generated text by referencing a knowledge source. [17] introduced a
real-world claim and evidence dataset specifically designed to enhance textual entailment models by
reducing the complexity of claims through a decomposition process. By breaking down claims into
simpler components, this approach aims to facilitate more effective entailment evaluation and thereby
improve overall model performance. [18] presented an automated pipeline for fact-checking real-world
claims by retrieving raw evidence from the web. This method retrieves a fixed number of documents for
each claim, but this predetermined approach may not always provide sufficient evidence, potentially
resulting in incomplete or biased fact-checking. To address this limitation, [19] proposed a framework
that leverages statistical decision theory and Bayesian sequential analysis, which eliminates the need
for a predetermined number of observations. The analysis proceeds sequentially, enabling a quick
decision-making process through a stop-or-continue strategy. While these evidence-based approaches
benefit from real-world knowledge, they may introduce additional sources of error and are often limited
to addressing only the fact-checking form of hallucinations.</p>
      <p>This paper examines a specific scenario of hallucination detection, where the objective is to predict
which hypothesis is a hallucination given a triplet consisting of a source input and two hypotheses.
The contribution of this study is twofold.</p>
      <p>• We explore the impact of instruction-tuned, quantized SLMs and compare their performance
against both textual entailment models and GPT-4.
• Our results demonstrate that instruction-tuned, quantized SLMs achieve performance comparable
to GPT-4 while offering significant advantages in terms of computational efficiency.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>This section describes the datasets used for instruction-tuning and evaluating our hallucination detection
models. The numbers of training and testing samples are shown in Table 1.</p>
      <sec id="sec-2-1">
        <title>2.1. SHROOM</title>
        <p>The SHROOM dataset was released as part of the SemEval-2024 shared task on hallucination detection. It
contains data from three distinct NLG tasks: Definition Modeling (DM), Machine Translation (MT), and Paraphrase Generation (PG).</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Samples from the SHROOM training set.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Input</th>
                <th>Label</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>source: I didn’t give you enough credit.<break/>hypothesis 1: I didn’t give you enough credit.<break/>hypothesis 2: I gave you enough credit.</td>
                <td>hypothesis 2</td>
              </tr>
              <tr>
                <td>source: Tokyo ekozala engumba moko pamba ya Asie oyo eyambi masano ya Oympique ya eleko ya mibale, eyambaki ya liboso na 1964.<break/>hypothesis 1: Tokyo will be the only Asian city to have hosted two summer Olympics, having hosted the games in 1964.<break/>hypothesis 2: Tokyo will be the only Asian city to host the second Olympic Games, the first being in 1964.</td>
                <td>hypothesis 2</td>
              </tr>
              <tr>
                <td>source: Medas de sas traditziones a inghìriu de sa festa sunt istadas adotadas fintzas dae sos chi non creent in sos paisos cristianos e dae sos non cristianos in totu su mundu.<break/>hypothesis 1: Many of the traditions surrounding the festival have been adopted by non-Christian people in their Christian countries and by non-Christian people around the world.<break/>hypothesis 2: Many of the traditions surrounding the holiday have been adopted also by nonbelievers in Christian countries and non-Christians around the world.</td>
                <td>hypothesis 1</td>
              </tr>
              <tr>
                <td>source: James, we shouldn’t be here.<break/>hypothesis 1: James, we’re supposed to be out of here.<break/>hypothesis 2: We shouldn’t be in this situation.</td>
                <td>hypothesis 1</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>More details about the dataset can be found in the SemEval-2024 shared task 6 overview paper [20]. For
this work, we consider data from the MT and PG tasks with source, target, hypothesis, and label details. To
enable the model to simultaneously learn the characteristics of hallucinations while also identifying the
patterns that differentiate them from non-hallucinations, we transform the data into triplets. Each triplet
consists of an original input sentence (source) paired with two hypotheses (hypothesis 1, hypothesis 2):
one representing the correct output (target) and the other a hallucinated output (hypothesis labeled as
a hallucination in the original data). The order of the hypotheses is randomized to prevent bias. This
transformation resulted in a training set of 538 samples and a testing set of 115 samples. Table 2 shows
a few samples from the training set. This is the only data we used to instruction-tune SLMs in our approach.</p>
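        <p>The triplet construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name make_triplet, the dictionary keys, and the fixed random seed are assumptions introduced here, and the example sentences are taken from Table 2.</p>
        <preformat><![CDATA[
```python
import random


def make_triplet(source, target, hallucinated, rng):
    """Pair a source with its correct and hallucinated outputs in random order."""
    hypotheses = [("target", target), ("hallucination", hallucinated)]
    # Shuffle so that the hallucination is not always in the same slot.
    rng.shuffle(hypotheses)
    position = next(
        i for i, (kind, _) in enumerate(hypotheses, start=1) if kind == "hallucination"
    )
    return {
        "source": source,
        "hypothesis 1": hypotheses[0][1],
        "hypothesis 2": hypotheses[1][1],
        "label": f"hypothesis {position}",
    }


rng = random.Random(0)  # assumed seed, only to make the sketch reproducible
triplet = make_triplet(
    "I didn't give you enough credit.",
    "I didn't give you enough credit.",
    "I gave you enough credit.",
    rng,
)
```
]]></preformat>
        <p>Whichever slot the shuffle selects, the label always names the slot holding the hallucinated output, so a model cannot learn a positional shortcut.</p>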
      </sec>
      <sec id="sec-2-2">
        <title>2.2. HaluEval</title>
        <p>HaluEval [21] is a large-scale hallucination evaluation benchmark that offers a collection of generated
and human-annotated hallucinated samples to evaluate the performance of LLMs in detecting
hallucinations. It includes data from three NLP tasks: question answering, knowledge-grounded dialogue, and
text summarization.</p>
        <p>To test our approach, we focused exclusively on data from the text summarization task, as it is in line
with the PG data used in the SHROOM training set. This dataset comprises the columns
document, right summary, and hallucinated summary. As the dataset contains more than 10k samples,
we randomly sampled 1,000 examples for our experiments. To create triplets, we used the document as
the source, and included the right summary and hallucinated summary as the hypotheses.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>The choice of SLMs in this study is motivated by the necessity for resource efficiency. Smaller models
provide significant benefits in terms of reduced computational cost, lower memory requirements, and
faster inference speed. These advantages make them more feasible for practical applications, particularly
in resource-constrained environments, while maintaining competitive performance.</p>
      <p>We explored several SLMs and finally selected Mixtral 8x7B [22] and SOLAR 10.7B [23] as the base
models in our approach, as illustrated in Figure 1. These models were chosen due to their strong
performance on the SHROOM test set. Mixtral 8x7B uses a Mixture of Experts (MoE) architecture.
This design allows the model to dynamically select different subsets of parameters for different inputs,
enhancing its ability to handle diverse linguistic tasks efficiently. Additionally, the model has been
trained on a multilingual dataset, enhancing its ability to capture language nuances and understand
semantic relationships across languages. SOLAR 10.7B, on the other hand, utilizes Depth Up-Scaling
(DUS), which increases model depth by stacking layers from copies of a base model into a single deeper
network that is then further pretrained. This approach enhances
the model’s capacity for complex language analysis, making it well suited to detecting
hallucinations and other intricate language phenomena.</p>
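      <p>The idea of selecting a subset of parameters per input can be illustrated with a toy sparse MoE layer. The sketch below assumes top-2 routing over eight experts, as described in the Mixtral paper [22]; the names top_k_gate and moe_layer, the toy experts, and the hand-written router logits are assumptions made for illustration. In a real model the logits come from a learned router applied to each token.</p>
      <preformat><![CDATA[
```python
import math


def top_k_gate(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts; the others are never run."""
    gates = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Eight toy experts: expert s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # hypothetical router output
gates = top_k_gate(logits)
output = moe_layer(1.0, experts, logits)
```
]]></preformat>
      <p>Only two of the eight experts contribute to each output, which is why the number of active parameters per token is much smaller than the total parameter count.</p>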
      <p>We performed instruction-tuning on the quantized versions of both Mixtral and SOLAR to further
optimize their computational efficiency. Both models were quantized to 4 bits, significantly lowering the
computational requirements, and subsequently instruction-tuned using the Quantized Weight-Decomposed
Low-Rank Adaptation (QDoRA) technique [24]. We selected QDoRA due to the greater efficiency it
offers in terms of speed, robustness to rank selection, and faster learning. It accelerates the fine-tuning
process, allowing for quicker adaptation to specific tasks, and is less sensitive to the choice of rank
during the decomposition process, ensuring stable performance across different configurations. Each
model was instruction-tuned with the prompt shown in Table 3.</p>
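      <p>To make the weight decomposition behind DoRA [24] concrete, the following sketch recomputes a merged weight as a trainable per-column magnitude times the normalized direction of the frozen weight plus a low-rank update. It is a toy, pure-Python illustration under stated assumptions: the 2x2 matrices, the rank of 1, and the helper names are hypothetical, and the 4-bit storage of the frozen weight used in QDoRA is not modeled (the frozen weight is assumed to be already dequantized).</p>
      <preformat><![CDATA[
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """Euclidean norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, magnitude, lora_b, lora_a):
    """W' = magnitude * (W0 + B A) / ||W0 + B A||_c, applied column by column."""
    delta = matmul(lora_b, lora_a)
    directed = [
        [w0[i][j] + delta[i][j] for j in range(len(w0[0]))] for i in range(len(w0))
    ]
    norms = column_norms(directed)
    return [
        [magnitude[j] * directed[i][j] / norms[j] for j in range(len(norms))]
        for i in range(len(directed))
    ]


w0 = [[3.0, 0.0], [4.0, 2.0]]  # frozen pretrained weight (toy values)
magnitude = column_norms(w0)   # trainable, initialized to the column norms of w0
lora_b = [[0.0], [0.0]]        # low-rank factor B, zero-initialized
lora_a = [[0.5, -0.5]]         # low-rank factor A
merged = dora_weight(w0, magnitude, lora_b, lora_a)
```
]]></preformat>
      <p>Because B starts at zero, the merged weight equals the pretrained weight before training begins; during fine-tuning only the magnitude vector and the two low-rank factors are updated.</p>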
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section details the experimental evaluation of our approach. To assess the effectiveness of our
method, we employed established classification metrics such as accuracy, macro F1 score,
precision, and recall. Additionally, we compared our model’s performance against GPT-4
and two baseline entailment models on all test sets: i) SelfCheckGPT-NLI [12], which is a
sample-based detection method that relies on the consistency of generated responses, and ii) the Hughes Hallucination
Evaluation Model (HHEM) [25], which examines the structure, logic, and factual grounding within the
text to identify instances where the LLM might have generated incorrect or unsupported claims. We
specifically chose entailment models because their training objective aligns closely with the type of
hallucination we targeted in this work. To adapt these models to our triplet setting, we calculated the
entailment score between the source sentence and each hypothesis. The hypothesis with the lowest
entailment score was then classified as the hallucination.</p>
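      <p>The adaptation of an entailment model to the triplet setting amounts to an arg-min over two scores. The sketch below shows that decision rule only. The scorer toy_entailment_score is a placeholder based on token overlap, introduced here so the example runs on its own; it is not SelfCheckGPT-NLI or HHEM, and any real entailment model would be passed in its place.</p>
      <preformat><![CDATA[
```python
def pick_hallucination(source, hypothesis_1, hypothesis_2, entailment_score):
    """Label the hypothesis that the source entails least as the hallucination."""
    scores = {
        "hypothesis 1": entailment_score(source, hypothesis_1),
        "hypothesis 2": entailment_score(source, hypothesis_2),
    }
    # min() returns the first key on ties, so equal scores fall back to hypothesis 1.
    return min(scores, key=scores.get)


def toy_entailment_score(premise, hypothesis):
    """Placeholder scorer: fraction of hypothesis tokens that occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    hits = sum(token in premise_tokens for token in hypothesis_tokens)
    return hits / len(hypothesis_tokens)


prediction = pick_hallucination(
    "I didn't give you enough credit.",
    "I didn't give you enough credit.",
    "I gave you enough credit.",
    toy_entailment_score,
)
```
]]></preformat>
      <p>On this example from Table 2 the rule selects hypothesis 2, which matches the gold label. How ties were handled in the experiments is not stated, so the tie-breaking here is an assumption.</p>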
      <p>
        To justify the emphasis on smaller language models, it is essential to evaluate their resource efficiency
in comparison to larger models like GPT-4. With an estimated 1.8 trillion parameters, GPT-4 requires
substantial computational resources for training and inference [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In contrast, the smaller language
models examined in this study, Mixtral 8x7B and SOLAR 10.7B, contain far fewer parameters (fewer than
15 billion active parameters). This significant reduction in model size results in lower computational
requirements, making these smaller models more practical for deployment in resource-constrained
settings.
      </p>
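      <p>A rough sense of the memory savings from 4-bit quantization can be obtained with simple arithmetic. The sketch below estimates weight storage only, assuming a 16-bit baseline and ignoring quantization constants, activations, and optimizer state; the function name and the choice of 1 GB = 10^9 bytes are assumptions made for this illustration.</p>
      <preformat><![CDATA[
```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# SOLAR 10.7B at an assumed 16-bit baseline versus 4-bit quantization.
solar_fp16_gb = weight_memory_gb(10.7e9, 16)
solar_4bit_gb = weight_memory_gb(10.7e9, 4)
reduction_factor = solar_fp16_gb / solar_4bit_gb
```
]]></preformat>
      <p>Under these assumptions the weights of a 10.7 billion parameter model take about 21.4 GB at 16 bits and about 5.35 GB at 4 bits, a fourfold reduction.</p>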
      <p>We compared the performance of Mixtral 8x7B and SOLAR 10.7B across three configurations: Base
(B), Quantized (Q), and Quantized Instruction-Tuned (QIT), as shown in Table 4. From the results,
it is observed that the scores of the quantized models are lower than those of their base models.
However, after performing instruction-tuning on the quantized models, we observed a significant
improvement, with scores of 0.88 and 0.87 for Mixtral 8x7B + QIT (Mix-QIT) and SOLAR 10.7B + QIT (S-QIT),
respectively. These scores represent an increase of 20% to 50% compared to the base models’
scores, highlighting the effectiveness of instruction-tuning in enhancing the ability of quantized LLMs
to detect hallucinations.</p>
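      <p>For reference, the classification metrics used in this section can be computed from gold and predicted labels as sketched below. The function name, the toy label lists, and the choice to macro-average precision and recall alongside F1 are assumptions made here; the exact averaging used for the reported precision and recall is not stated.</p>
      <preformat><![CDATA[
```python
def classification_metrics(gold, predicted):
    """Accuracy plus macro-averaged precision, recall, and F1 over all labels."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy labels for illustration only; these are not results from the paper.
gold = ["hypothesis 2", "hypothesis 2", "hypothesis 1", "hypothesis 1"]
predicted = ["hypothesis 2", "hypothesis 1", "hypothesis 1", "hypothesis 1"]
metrics = classification_metrics(gold, predicted)
```
]]></preformat>
      <p>Macro averaging gives each label equal weight, which suits the triplet task because the hallucination is placed in either slot at random.</p>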
      <p>
        To benchmark our approach against other established methods, we compared its performance with
two entailment baselines, as shown in Table 5. The results demonstrate that our instruction-tuned SLMs
consistently outperformed both the SelfCheckGPT-NLI and HHEM baselines across the datasets. This
highlights the effectiveness of instruction-tuning for hallucination detection across different domains.
Further, to evaluate our approach and highlight the efficiency of SLMs, we compared the results with
the standard, non-fine-tuned GPT-4 model rather than a fine-tuned version of GPT-4. Fine-tuning larger
models like GPT-4 is a highly resource-intensive process, often requiring several days of computation on
high-end hardware due to their larger parameter size [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. On the other hand, fine-tuning smaller models
like Mixtral 8x7B and SOLAR 10.7B is more efficient, both in terms of time and resource consumption.
Because they have fewer parameters (fewer than 15 billion active parameters), they are quicker to train, with a lower
memory footprint and reduced energy usage.
      </p>
      <p>We also note that the results are not consistent across the datasets when we compare instruction-tuned
SLMs with GPT-4. On the SHROOM dataset, Mix-QIT and S-QIT achieved scores
of 0.88 and 0.87, respectively, exceeding GPT-4 by 8%. These results show that, in order to detect hallucinations,
instruction-tuning the smaller models can achieve performance comparable to a larger model like
GPT-4. However, the performance was not consistent on the HaluEval dataset, where the Mix-QIT and
S-QIT scores (0.66 and 0.65) fell short of GPT-4 by around 10%. While GPT-4 offers superior
performance on HaluEval, likely owing to its size, the trade-off in computational efficiency makes smaller language models a
viable alternative for many use cases.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we explored the effectiveness of instruction-tuning on the quantized versions of SLMs
for hallucination detection. We compared these instruction-tuned models against established methods,
including GPT-4 and entailment models, and found consistent improvement across various datasets.
While our instruction-tuned models achieved performance comparable to GPT-4 on the SHROOM dataset,
a discrepancy emerged on the HaluEval dataset. This highlights the need for further research to enhance
the robustness and generalizability of instruction tuning for hallucination detection. Smaller language
models, defined as those with fewer than 15 billion active parameters, offer significant advantages in
terms of computational cost, memory usage, and inference speed, making them more accessible for
practical applications, especially in resource-constrained environments.</p>
      <p>As future work, we plan to investigate methods not only to detect hallucinations but also to understand
the underlying reasoning behind them, potentially leading to effective correction strategies.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[9] R. Cohen, M. Hamri, M. Geva, A. Globerson, LM vs LM: Detecting factual errors via cross examination, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 12621–12640.</p>
      <p>[10] M. Zhang, O. Press, W. Merrill, A. Liu, N. A. Smith, How language model hallucinations can snowball, arXiv e-prints (2023) arXiv–2305.</p>
      <p>[11] J. Zhang, Z. Li, K. Das, B. Malin, S. Kumar, SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 15445–15458.</p>
      <p>[12] P. Manakul, A. Liusie, M. J. Gales, SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).</p>
      <p>[13] M. Geva, J. Bastings, K. Filippova, A. Globerson, Dissecting recall of factual associations in autoregressive language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 12216–12235.</p>
      <p>[14] A. Azaria, T. Mitchell, The internal state of an LLM knows when it’s lying, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 967–976.</p>
      <p>[15] M. Yuksekgonul, V. Chandrasekaran, E. Jones, S. Gunasekar, R. Naik, H. Palangi, E. Kamar, B. Nushi, Attention satisfies: A constraint-satisfaction lens on factual errors of language models, in: The Twelfth International Conference on Learning Representations, 2023.</p>
      <p>[16] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, FActScore: Fine-grained atomic evaluation of factual precision in long form text generation, arXiv preprint arXiv:2305.14251 (2023).</p>
      <p>[17] R. Kamoi, T. Goyal, J. D. Rodriguez, G. Durrett, WiCE: Real-world entailment for claims in Wikipedia, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 7561–7583.</p>
      <p>[18] J. Chen, G. Kim, A. Sriram, G. Durrett, E. Choi, Complex claim verification with evidence retrieved in the wild, arXiv preprint arXiv:2305.11859 (2023).</p>
      <p>[19] X. Wang, Y. Yan, L. Huang, X. Zheng, X.-J. Huang, Hallucination detection for generative large language models by Bayesian sequential estimation, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 15361–15371.</p>
      <p>[20] T. Mickus, E. Zosa, R. Vázquez, T. Vahtola, J. Tiedemann, V. Segonne, A. Raganato, M. Apidianaki, SemEval-2024 shared task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes, 2024. arXiv:2403.07726.</p>
      <p>[21] J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, J.-R. Wen, HaluEval: A large-scale hallucination evaluation benchmark for large language models, 2023. URL: https://arxiv.org/abs/2305.11747.</p>
      <p>[22] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of experts, 2024. arXiv:2401.04088.</p>
      <p>[23] D. Kim, C. Park, S. Kim, W. Lee, W. Song, Y. Kim, H. Kim, Y. Kim, H. Lee, J. Kim, C. Ahn, S. Yang, S. Lee, H. Park, G. Gim, M. Cha, H. Lee, S. Kim, SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling, 2024. arXiv:2312.15166.</p>
      <p>[24] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, M.-H. Chen, DoRA: Weight-decomposed low-rank adaptation, arXiv preprint arXiv:2402.09353 (2024).</p>
      <p>[25] S. Hughes, Cut the bull.... Detecting hallucinations in large language models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI, Gpt-4
          <source>technical report</source>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Manyika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <article-title>An overview of bard: an early experiment with generative ai</article-title>
          ,
          <source>AI. Google Static Documents</source>
          <volume>2</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Medenilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sillos</surname>
          </string-name>
          , L. De Leon,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elepaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Madriaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aggabao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diaz-Candido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maningo</surname>
          </string-name>
          , et al.,
          <article-title>Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models</article-title>
          ,
          <source>PLoS digital health 2</source>
          (
          <year>2023</year>
          )
          <article-title>e0000198</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Caldarella</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Response generation in longitudinal dialogues: Which knowledge representation helps</article-title>
          ?,
          <year>2023</year>
          . arXiv:
          <volume>2305</volume>
          .
          <fpage>15908</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          , et al.,
          <article-title>A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity</article-title>
          ,
          <source>arXiv preprint arXiv:2302.04023</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Siren's song in the ai ocean: A survey on hallucination in large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>01219</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z</surname>
          </string-name>
          . Zhang,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <article-title>Hallucination of multimodal large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2404.18930</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>