Evaluating Large Language Models on Qualitative Reasoning Tasks: A Case Study using OpenAI's GPT Models

Najwa AlGhamdi∗, Kwabena Nuamah and Alan Bundy
Artificial Intelligence and its Applications Institute, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, Scotland

∗ Corresponding author.
Email: n.alghamdi@sms.ed.ac.uk (N. AlGhamdi); k.nuamah@ed.ac.uk (K. Nuamah); a.bundy@ed.ac.uk (A. Bundy)
ORCID: 0000-0002-9058-4155 (N. AlGhamdi); 0000-0002-6868-9858 (K. Nuamah); 0000-0000-0000-0000 (A. Bundy)
ABERDEEN'24: Reasoning, Explanations and Applications of Large Language Models, October 17, 2024, Aberdeen, Scotland

Abstract
This study evaluates the performance of Large Language Models (LLMs) on qualitative reasoning tasks, focusing on identifying inconsistencies in their reasoning processes. Specifically, we examine two versions of OpenAI's Generative Pre-trained Transformer (GPT) models: GPT-3.5 and GPT-4. We hypothesize that LLMs may produce inconsistencies when handling qualitative information; in particular, they can generate explanations of their answers that are not faithful to the question. To test this hypothesis, we prompt the models with different scenarios to answer questions from the QuaRTz dataset [1], requiring the models to select the correct answer from options A or B and to provide a corresponding explanation of their reasoning process. The evaluation focuses on three key dimensions: the consistency of responses, the faithfulness to the context, and the effect of different prompts on the generated explanations. The results reveal several inconsistencies in the models' explanations of their answers, thereby highlighting challenges in the consistency of their qualitative reasoning.

Keywords: Inconsistency, Faithfulness, Explanation, LLM

1. Introduction

In recent years, the field of natural language understanding (NLU) in reasoning has seen significant advances through the development of large language models (LLMs). LLMs are artificial intelligence models that leverage "large-scale, pre-trained, statistical language models based on neural networks" [2]. Among these models, GPT models have gained substantial attention for their ability to generate human-like text. However, a crucial and challenging aspect of LLMs is explainability, or interpretability, which refers to the ability to understand and describe how a model makes its decisions or arrives at its reasoning [3]. In this study, we use the term explanation to refer to the natural-language text generated by the model to justify its answers.

Generating explanations can sometimes lead to hallucinations, where the explanation seems reasonable and logical but is actually meaningless or unfaithful to the context [4, 5]. This issue is related to the faithfulness of the explanation, which concerns how accurately the explanation reflects the model's actual reasoning process [6, 7]. The main challenge is ensuring that the explanations provided by LLMs are not only plausible but also truly faithful to the model's reasoning. Often, generated explanations do not align with how the model arrived at its conclusions, leading to inconsistencies. For example, consider the following question from the QuaRTz dataset [1]:

"Mark studies rocks for a living. He is curious if rocks deeper in the Earth will be warmer or cooler. If Mark were to take the temperature of a rock that is half a mile into the Earth's surface, and then take the temperature of a rock 1 mile into the Earth's surface, he would find that the rock half a mile into the Earth's temperature is... (A) lower (B) higher"
The model's answer was "(B) higher", and it generated the explanation "The rock at 1 mile into the Earth's surface will have a higher temperature compared to the rock at half a mile into the Earth as the temperature increases with depth inside the Earth due to geothermal gradient". In this case, the model correctly explains that temperature increases with depth inside the Earth, but then incorrectly answers the question by selecting option B, which states that the shallower rock (half a mile down) is the warmer one, contradicting its own explanation. This example illustrates the importance of clearly understanding the model's reasoning process in order to identify any inconsistencies in the explanations it generates when justifying its answers.

To investigate these issues, we evaluated OpenAI's GPT-3.5 [8] and GPT-4 [9], focusing on their performance in qualitative reasoning tasks using qualitative questions from QuaRTz [1]. In evaluating explanations generated by LLMs, two key dimensions are essential: consistency and faithfulness. Consistency refers to the model's ability to provide explanations that do not contradict each other, maintaining consistent answers across different questions [10]. Faithfulness, in contrast, measures the degree to which the explanation accurately represents the model's actual reasoning process; it focuses on the transparency and accuracy of the model's reasoning rather than on how convincing the explanation appears to be [7]. These dimensions provide a framework for assessing the quality and reliability of explanations produced by LLMs under different prompting scenarios.

The paper is organized as follows. Section 2 gives an overview of the models used in the experiment and the evaluation approach, together with the dataset used in the evaluation. Section 3 discusses the findings of the experiment. Section 4 briefly discusses some similar work in this area. Finally, in Section 5, we conclude the paper with a brief summary and discuss possible future work.

2. Experimental Setup

2.1. Models and Prompting Strategies

This study investigates the performance of OpenAI's GPT models, specifically GPT-3.5-Turbo [8] and GPT-4-Turbo [9], on qualitative reasoning tasks. The primary focus is on identifying inconsistencies between the generated answers and the generated explanations (reasoning processes). To explore this, we employed zero-shot prompting, where the models were provided only with task descriptions and no examples, and we tried different prompts, discussed further in Section 2.3.

2.2. Dataset

The QuaRTz dataset (available at https://www.kaggle.com/datasets/thedevastator/quartz-a-dataset-of-open-domain-qualitative-rela), used in this study, contains open-domain qualitative relationship questions, each of which presents two possible answers, labelled A and B. In addition to the answer choices, the dataset includes a correct answer key as ground truth; this key is not available to the LLM and is used for assessing the LLM's answers, but the dataset does not provide explanations for the correct answers. Each question in the dataset contains context annotations (para_anno and question_anno) and a knowledge statement (para), which support the correct answer without explicitly revealing it to the LLM. This supporting information is used in evaluating the models' understanding of the questions and will be referred to as the annotation.
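To make this structure concrete, the following minimal Python sketch shows one way the fields described above could be read from the dataset's JSONL files. The file name is a placeholder and the helper is illustrative only; the field names follow the example entry shown below rather than any script used in this study.

import json

def load_quartz(path):
    """Yield the fields of each QuaRTz entry from a JSONL file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            yield {
                "id": entry["id"],
                "stem": entry["question"]["stem"],
                "choices": {c["label"]: c["text"] for c in entry["question"]["choices"]},
                "answer_key": entry["answerKey"],          # ground truth, withheld from the LLM
                "para": entry["para"],                     # knowledge statement
                "para_anno": entry.get("para_anno", {}),   # context annotation
                "question_anno": entry.get("question_anno", {}),
            }

if __name__ == "__main__":
    # "quartz_test.jsonl" is a placeholder name for the downloaded split.
    for q in load_quartz("quartz_test.jsonl"):
        print(q["stem"], q["choices"], q["answer_key"])
        break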
Using this dataset requires the model to apply multi-hop reasoning and to draw on its background knowledge to handle different types of question. An example of such a question, along with the annotations or context from which the model constructs its answers, is shown below:

{
  "answerKey": "A",
  "para_id": "QRSent-10162",
  "id": "QRQA-10162-3-flip",
  "question": {
    "stem": "We are designing a submarine to study fish that live far below the surface of the ocean. Before we can send a human researcher down in the submarine, we have to be sure it can tolerate the pressure of the water without cracking. The easier test will be to send our submarine down to",
    "choices": [
      {"label": "A", "text": "500 feet or"},
      {"label": "B", "text": "1500 feet?"}
    ]
  },
  "para": "A fluid exerts pressure in all directions, but the pressure is greater at greater depth.",
  "para_anno": {
    "effect_dir_sign": "MORE",
    "cause_dir_sign": "MORE",
    "effect_prop": "pressure",
    "cause_prop": "depth",
    "cause_dir_str": "greater",
    "effect_dir_str": "greater"
  },
  "question_anno": {
    "less_cause_dir": "500 feet",
    "more_cause_dir": "1500 feet",
    "less_effect_prop": "test",
    "less_effect_dir": "easier"
  }
}

2.3. Evaluation Metrics

The evaluation of the models centred on three dimensions: consistency, faithfulness of the generated explanations, and prompting. First, a manual review was conducted to assess whether the explanations provided by the models were consistent and correctly justified the answers, noting any inconsistencies and any instances in which the models failed to generate explanations. Second, each LLM's accuracy was measured by comparing the number of correctly generated answers to the total number of questions, under two instruction sequences in the prompts: one which asked the model to choose then explain, and another which asked it to explain then choose. This approach allowed us to identify challenges in the consistency of the models' qualitative reasoning and to understand how the order of instructions affects the models' responses.

Additionally, we used zero-shot prompting, a technique that instructs LLMs, in this case GPT-3.5 and GPT-4, to use their text generalization capabilities to perform qualitative reasoning tasks without prior training or examples. The models respond to prompts based on their pre-trained knowledge and reasoning skills. The order of instructions in the prompt, and the context when it is included in the prompt, can vary and possibly lead to different responses. In this study, we explored two different instruction sequences in our prompts, as well as a scenario in which the annotation was added to the prompt. The zero-shot prompts and the annotation scenario are given below, followed by a sketch of how such prompts might be issued in practice:

First, Choose then Explain:
Prompt: "answer the question by choosing A or B then justify the right choice. return A or B in JSON form with fields 'FinalAnswer' and a justification field 'Explanation' and do not include markdown and write the results in results.jsonl file in JSON form"

Second, Explain then Choose (reverse prompt):
Prompt: "justify the right choice then answer the question by choosing A or B. return A or B in JSON form with fields 'FinalAnswer' and a justification field 'Explanation' and do not include markdown and write the results in reverseresults.jsonl file in JSON form"

Third, Annotation in prompt:
Prompt: the context and question annotations were added beside the question and choices.
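As an illustration of how these prompts could be issued, the following minimal sketch uses the openai Python client (v1-style interface) to send a zero-shot prompt and parse the JSON reply. The model identifier, temperature setting, helper name, and the abridged instruction strings are assumptions for illustration; they are not a record of the exact scripts used in this study.

import json
from openai import OpenAI  # assumes the openai Python package (v1-style client)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Instruction strings abridged from the two prompt variants above.
CHOOSE_THEN_EXPLAIN = (
    "answer the question by choosing A or B then justify the right choice. "
    "return A or B in JSON form with fields 'FinalAnswer' and a justification "
    "field 'Explanation' and do not include markdown"
)
EXPLAIN_THEN_CHOOSE = (
    "justify the right choice then answer the question by choosing A or B. "
    "return A or B in JSON form with fields 'FinalAnswer' and a justification "
    "field 'Explanation' and do not include markdown"
)

def ask(question_text, instruction, annotation=None, model="gpt-4-turbo"):
    """Send one zero-shot prompt and parse the model's JSON reply."""
    content = f"{instruction}\n\nQuestion: {question_text}"
    if annotation is not None:
        # Third scenario: the annotations are added beside the question and choices.
        content += f"\n\nAnnotations: {json.dumps(annotation)}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,  # assumed decoding setting
    )
    raw = response.choices[0].message.content
    try:
        parsed = json.loads(raw)
        return parsed.get("FinalAnswer"), parsed.get("Explanation")
    except json.JSONDecodeError:
        return None, raw  # unparseable reply, kept for manual review

In this sketch, a reply that cannot be parsed as JSON is kept for manual inspection, analogous to the "no explanation" and "no answers" counts reported in Table 1 below.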
3. Findings

3.1. General Observation

Based on a random sample of 800 questions from the results, the models struggled to maintain consistency, especially GPT-3.5, leading to frequent disagreements between the reasoning and the final answer/explanation. Both models could generate plausible explanations, but these did not always imply the correct conclusions, due to faulty reasoning or misunderstood concepts. Faithfulness, or the alignment of the explanation with the problem's context, was more reliably maintained by GPT-4. However, the models sometimes generated irrelevant details that misled the reasoning process (supplementary materials are available at https://github.com/NajwaAG/EvaluatingLLM.git). Despite improvements in different scenarios, such as adding the annotation or changing the instruction order, both models still struggled with multi-step reasoning, subtle characteristics, and complex causal relationships. The observed inconsistencies confirm ongoing challenges in achieving human-like reasoning, especially in scenarios requiring detailed contextual and scientific understanding.

3.2. Challenges in Consistency of Explanation in Qualitative Reasoning Based on Manual Evaluation

A key focus of this study was to identify inconsistencies in the models' reasoning processes; these can be found in the generated answers and explanations. Table 1 reports how often each model produced an answer and an explanation.

Table 1
The number of answers and explanations that GPT-3.5 and GPT-4 generated across both scenarios (with and without annotation)

Model (both scenarios)       | Have explanation | No explanation | Have answers | No answers
GPT-3.5                      | 1783             | 911            | 2694         | 0
GPT-4                        | 2694             | 0              | 1525         | 169
Reverse prompting (GPT-3.5)  | 2621             | 73             | 2694         | 0

Inaccurate responses were analyzed to determine whether the inconsistencies arose from errors in the reasoning process or from misinterpretation of the concepts. Across both models and both scenarios, certain challenges were clear, particularly in handling complex multi-step reasoning or scenarios requiring deep contextual or conceptual understanding. Inconsistent explanations often fall into the following categories, with approximate counts for each model reported in Table 3:

• Misinterpreting causal relationships: the model understood the intent of the question but incorrectly identified or applied causal relationships. For example, for the question "If greenhouse gases in the atmosphere were to be doubled, Earth will become... (A) cooler (B) warmer", the model incorrectly provided the explanation: "Increasing the amount of greenhouse gases should theoretically block more heat from reaching the surface, leading to a cooler planet."; the response incorrectly applied the causal relationship between greenhouse gases and Earth's temperature.

• Adding irrelevant details: explanations contain details unrelated to the answer rather than addressing the core question. For instance, for the question "Causing a series to lose resistors will result in a/an .... in the total resistance of the circuit. (A) increase (B) decrease", the explanation given was: "Adding more resistors in a series circuit increases the total resistance. Conversely, losing resistors, as stated in the question, means reducing the number of resistors, which implies that the total resistance of the circuit will decrease. However, the annotations and the logical relationship between resistors in a series and the total resistance (where more resistors result in an increase in resistance) indicate that the correct context is the effect of having more resistors.
Since the question refers to the effect of losing resistors, which is the opposite action, the resulting effect on the total resistance should logically be a decrease. However, the annotations suggest interpreting the increase as the correct outcome, possibly due to misunderstanding the question's phrasing or an annotations mismatch."

• Struggling with precise scientific concepts: questions involving fine differences in logic or subtle distinctions often led to inaccurate reasoning. For example, the question "When there are fewer carbon atoms in something there are... (A) many options (B) few options" led to the explanation: "When there are fewer carbon atoms in something, there are many options for different compounds and structures to form due to the versatility of carbon bonding". The model answered "(A) many options", whereas the correct answer should be "(B) few options", which reflects that fewer carbon atoms constrain the diversity of compounds rather than expand it.

• Demonstrating the correct concepts but generating the wrong conclusion: the model understood the underlying principles but misapplied them, leading to an incorrect conclusion. For example, for the question "We are designing a submarine to study fish that live far below the surface of the ocean. Before we can send a human researcher down in the submarine, we have to be sure it can tolerate the pressure of the water without cracking. The easier test will be to send our submarine down to... (A) 500 feet (B) 1500 feet", the generated explanation was: "The deeper the submarine goes, the higher the water pressure. Sending the submarine to 1500 feet will provide a more rigorous test of its ability to tolerate pressure compared to sending it to only 500 feet." The model explained the concept correctly but failed to deduce the correct conclusion, generating the answer "(B) 1500 feet", which is wrong. It misapplied its correctly understood concept of pressure increasing with depth to conclude that the greater depth of 1500 feet is the easier test, whereas the correct answer should be (A) 500 feet.

3.3. Accuracy and the Impact of Annotations

When evaluating the models using zero-shot prompting, we found that including annotations had a small impact on accuracy for both GPT-3.5 and GPT-4, as shown in Table 2.

Table 2
Accuracy (in percent) and the total number of correct answers out of 2694 questions for GPT-3.5 and GPT-4, with and without annotation

Model    | Accuracy (without anno.) | Correct answers (without anno.) | Accuracy (with anno.) | Correct answers (with anno.)
GPT-3.5  | 73.8%                    | 1993                            | 75.5%                 | 2004
GPT-4    | 83.3%                    | 2245                            | 83.6%                 | 2252

For GPT-3.5, without annotations, the model correctly answered 1,993 out of 2,694 questions, giving an accuracy of 73.8%. With annotations, the accuracy slightly increased to 74.5%, with 2,004 correct answers. This small improvement suggests that while annotations help, they do not make a large difference to how well the model performs. GPT-4 performed better overall. Without annotations, it had an accuracy of 83.3%, correctly answering 2,245 out of 2,694 questions. With annotations, the accuracy rose slightly to 83.6%, with 2,252 correct answers. Although GPT-4 was more accurate than GPT-3.5, the benefit of adding annotations was still quite small. These results show that annotations can help improve the performance of large language models on qualitative reasoning tasks, but the improvement is minor. GPT-4 handled the reasoning tasks better and performed similarly regardless of annotations.
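For reference, the accuracy figures above amount to the fraction of logged 'FinalAnswer' letters that match the dataset's answer keys. The sketch below shows this aggregation under assumptions about the result-log format (one JSON record per question with 'id', 'FinalAnswer', and 'Explanation' fields); it is illustrative rather than the exact evaluation script.

import json

def accuracy_from_results(results_path, gold):
    """Compare logged 'FinalAnswer' letters against the QuaRTz answer keys.

    gold maps question id -> answerKey ("A" or "B"); results_path points to a
    JSONL log with one {"id", "FinalAnswer", "Explanation"} record per question
    (an assumed format for illustration).
    """
    correct = total = 0
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            answer = (record.get("FinalAnswer") or "").strip().upper()
            if answer == gold.get(record["id"]):
                correct += 1
    return correct, total, (100.0 * correct / total if total else 0.0)

# Example: 2245 correct answers out of 2694 questions gives roughly 83.3%,
# matching the GPT-4 figure without annotation in Table 2.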
Table 3 summarizes a manual evaluation of the results of the GPT-3.5 and GPT-4 models, with and without annotations, which highlights notable differences in how the models handle the types of inconsistency described in the previous subsection.

Table 3
Number of each type of inconsistency found in the explanations that GPT-3.5 and GPT-4 generated for both scenarios (with and without annotation)

Model                  | Irrelevant details | Scientific struggles | Contradictions | Misinterpreted relationships
GPT-3.5 without anno.  | 27                 | 15                   | 43             | 24
GPT-3.5 with anno.     | 87                 | 34                   | 71             | 21
GPT-4 without anno.    | 30                 | 1                    | 26             | 2
GPT-4 with anno.       | 95                 | 79                   | 65             | 5

The table shows that GPT-4 with annotations tends to include too many irrelevant details in its explanations (95 instances) and often misunderstands scientific concepts (79 questions). In contrast, GPT-4 without annotations has fewer issues with scientific concepts, which suggests that although annotations are intended to enhance the model's understanding and responses, they can make them worse. Based on the results, the annotations probably complicate the reasoning process and contribute to the model being unable to answer all questions, even when it generates an explanation. Meanwhile, GPT-3.5 with annotations shows a significant number of instances (71 questions) in which the model correctly understands the concepts but is led to wrong conclusions and generates incorrect answers; 21 of its explanations misinterpret the relationships; and irrelevant details appear in 87 explanations, compared with 27 instances for the same model without annotation. Additionally, GPT-3.5 without annotation shows 15 instances where it struggles with precise scientific concepts, fewer than for the same model with annotations, along with 43 occurrences where it understands the correct concept but concludes incorrectly, and 24 cases of misinterpreting relationships.

This experiment is limited by the use of a closed model, and the manual evaluation may be subject to subjective biases. These limitations could be addressed by using open LLMs such as LLaMA [11] and by involving a larger number of participants in the manual assessment.

4. Related work

Large Language Models (LLMs), such as GPT-3.5 and GPT-4, have significantly impacted the field of natural language understanding (NLU) across various reasoning tasks, including translation and question answering [12, 13]. These models have shown notable ability in generating human-like text and dealing with reasoning tasks [12]. However, one of the ongoing challenges is the explainability of these models, i.e., how they generate responses and whether these responses are based on logical reasoning. Inconsistencies and hallucinations found in explanations provided by models can sometimes seem reasonable and logical but are meaningless or unfaithful to the context [4, 5, 14]. As pointed out by [15] and [16], LLMs often raise challenges regarding the transparency of their decision-making and reasoning processes, which can be clearly seen in the explanations of their answers.
Many studies evaluate the models by focusing only on the annotations or context added to the prompt and on how these are reflected in the explanation generated for the question [17, 18, 19]. Additionally, [20] identifies two primary sources of hallucination in three language models (LLaMA, GPT-3.5, and PaLM): first, the models may assert a logical connection based on the presence of similar statements in their training data, even if these are irrelevant to the current context; second, the models may make incorrect assumptions based on similar frequency patterns rather than logical reasoning. The study shows that while LLMs can generate plausible responses, these are often not grounded in a true understanding of the context but rather rely on memorized data. The study addresses these issues by isolating the effects of these hallucinations on model performance. The paper [21] evaluates how LLMs such as Flan-T5, Alpaca, GPT-3.5, and Llama-2 handle prompting, focusing on identifying two types of inconsistency: the inclusion of irrelevant details and the generation of contradictory or factually incorrect responses. That study uses human assessments to evaluate the models' correctness and faithfulness to the provided information, aiming to enhance the models' capabilities in producing accurate and contextually appropriate responses. Compared to prior work, our study evaluates the explanations generated under different prompting scenarios and how the two affect each other, and then identifies the inconsistencies that arise on qualitative questions and how often each type appears, which is not covered in previous works.

5. Conclusion and Future work

In this study, we evaluated the performance of OpenAI's GPT-3.5 and GPT-4 models on qualitative reasoning tasks, with a focus on understanding how well these models generate explanations that are consistent and faithful to their reasoning processes. Our findings indicate that while GPT-4 performs better than GPT-3.5 in terms of overall accuracy and consistency, both models still face challenges, particularly in handling complex reasoning tasks that require a deep understanding of context and scientific concepts. Issues such as the misinterpretation of causal relationships and the overemphasis on irrelevant details highlight the limitations of these models in producing reliable, human-like reasoning.

Future work may focus on improving how models deal with information overload, which seems to hurt their reasoning abilities. We should explore ways to help models handle multi-step reasoning better. Additionally, we will look at using few-shot prompting, where we give an example in the prompt, to see how it affects model performance. It is also crucial to study how information overload relates to reasoning accuracy. Further research is needed to ensure that the explanations from large language models are consistent and trustworthy. This will help ensure that their answers are not just convincing but also truly reflect how they arrive at their conclusions.

References

[1] O. Tafjord, M. Gardner, K. Lin, P. Clark, QuaRTz: An open-domain dataset of qualitative relationship questions, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5941–5946. URL: https://aclanthology.org/D19-1608. doi:10.18653/v1/D19-1608.
[2] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, J. Gao, Large language models: A survey, arXiv preprint arXiv:2402.06196 (2024).
[3] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, M. Du, Explainability for large language models: A survey, ACM Transactions on Intelligent Systems and Technology 15 (2024) 1–38.
[4] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren's song in the AI ocean: A survey on hallucination in large language models, arXiv preprint arXiv:2309.01219 (2023).
[5] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, A. Mian, A comprehensive overview of large language models, arXiv preprint arXiv:2307.06435 (2023).
[6] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, arXiv preprint arXiv:2004.03685 (2020).
[7] S. Wiegreffe, A. Marasović, Teach me to explain: A review of datasets for explainable natural language processing, arXiv preprint arXiv:2102.12060 (2021).
[8] OpenAI, gpt-3-5-turbo, 2023. URL: https://platform.openai.com/docs/models/gpt-3-5-turbo.
[9] OpenAI, gpt-4-turbo-and-gpt-4, 2023. URL: https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4, accessed on: September 15, 2023.
[10] P. Hase, M. Bansal, Evaluating explainable AI: Which algorithmic explanations help users predict model behavior?, arXiv preprint arXiv:2005.01831 (2020).
[11] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[12] T. B. Brown, Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[13] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[14] T. Xue, Z. Wang, Z. Wang, C. Han, P. Yu, H. Ji, RCoT: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought, arXiv preprint arXiv:2305.11499 (2023).
[15] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 610–623. URL: https://doi.org/10.1145/3442188.3445922. doi:10.1145/3442188.3445922.
[16] K. McGrath, Unveiling the power and limitations of large language models, 2024. URL: https://www.6clicks.com/resources/blog/unveiling-the-power-of-large-language-models.
[17] A. K. Lampinen, I. Dasgupta, S. C. Chan, K. Matthewson, M. H. Tessler, A. Creswell, J. L. McClelland, J. X. Wang, F. Hill, Can language models learn from explanations in context?, arXiv preprint arXiv:2204.02329 (2022).
[18] S. Teso, Ö. Alkan, W. Stammer, E. Daly, Leveraging explanations in interactive machine learning: An overview, Frontiers in Artificial Intelligence 6 (2023) 1066049.
[19] J. Kunz, M. Kuhlmann, Properties and challenges of LLM-generated explanations, arXiv preprint arXiv:2402.10532 (2024).
[20] N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, M. Steedman, Sources of hallucination by large language models on inference tasks, arXiv preprint arXiv:2305.14552 (2023).
[21] V. Adlakha, P. BehnamGhader, X. H. Lu, N. Meade, S. Reddy, Evaluating correctness and faithfulness of instruction-following models for question answering, Transactions of the Association for Computational Linguistics 12 (2024) 681–699. URL: https://doi.org/10.1162/tacl_a_00667. doi:10.1162/tacl_a_00667.