Leveraging Large Language Models (LLMs) as Domain Experts in a Validation Process

Carlos Badenes-Olmedo1,†, Esteban García-Cuesta2,*,†, Alejandro Sánchez-González2 and Oscar Corcho2

1 Ontology Engineering Group, Departamento de Sistemas Informáticos, Universidad Politécnica de Madrid
2 Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing, Valladolid, Spain, 24-27 September 2024.
* Corresponding author. † These authors contributed equally.
carlos.badenes@upmm.es (C. Badenes-Olmedo); esteban.garcia@upm.es (E. García-Cuesta); alejandro.sanchezg@alumnos.upm.es (A. Sánchez-González); oscar.corcho@upm.es (O. Corcho)
ORCID: 0000-0002-2753-9917 (C. Badenes-Olmedo); 0000-0002-1215-3333 (E. García-Cuesta); 0000-0002-9260-0753 (O. Corcho)

Abstract
The explosion of information requires robust methods to validate knowledge claims. At the same time, there is an increasing interest in understanding and creating methods that help with the interpretation of machine learning models. Both approaches converge on the need for a validation step that clarifies, or helps end-users to better understand, whether the decision or information provided by the model is what is needed or whether there is a mismatch between what the artificial intelligence system suggests and reality. Large Language Models (LLMs), with their ability to process and synthesize vast amounts of text data, have emerged as potential tools for this purpose. This study explores the utility of LLMs in hypothesis validation in two different scenarios. The first relies on hypotheses generated from explanations obtained by XAI methods or by inherently explainable models; we propose a method to transform the inferences provided by a machine learning model into explanations in natural language, hence linking the symbolic and sub-symbolic areas. The second relies on hypotheses generated with techniques that automatically extract answers from text. The results show that LLMs can complement other XAI techniques and that, although all the LLMs analyzed are able to provide truthfulness-related answers, not all are equally successful.

Keywords
LLMs, knowledge validation, explainable artificial intelligence

1. Introduction

In recent years, the field of artificial intelligence has witnessed a remarkable evolution in natural language processing capabilities, largely driven by the advent of Large Language Models (LLMs). The essence of utilizing these models for Artificial Intelligence (AI) tasks such as knowledge and hypothesis validation lies in their ability to understand, generate, and manipulate human language. This ability is crucial for tasks that require a deep understanding of the context and nuances of human communication.

Applying LLMs to real-world scenarios inevitably leads to language generation that deviates from known facts (so-called "factual hallucination" [1]) due to multiple causes; for example, a model may over-estimate because of overfitting on biased prompts (the framing effect). Several studies have tried to measure these effects, but it is still difficult to generalize them outside the specific context in which the studies were performed. In [2] the authors study prompt framing and in-context interference effects, showing that large language models are subject to the influence of various hallucination-inducing causes. This holds for Word Prediction (WP), Question Answering (QA), and Fact-Checking (FC). Other studies [3] compare the results obtained by machine learning models with those produced by LLMs for diagnostic decision support systems. They propose a processing pipeline for interacting with language models and conclude that LLMs are often ambiguous and provide incorrect diagnoses, with prompt engineering being a critical step in the process. Thus, claim verification has emerged as a key point in discerning between misinformation and real facts. Most of these works [4][5] rely on human-annotated datasets to verify the explanations or decisions, but such information is not accessible in hypothesis-testing scenarios.
In this paper, we explore the innovative application of Large Language Models (LLMs) as validation tools in fact-checking and hypothesis fact-checking scenarios. Traditionally, the validation of conclusions drawn from data and models in specialized domains has been a task reserved for human experts, largely due to the complexity and domain-specific nature of the required knowledge. However, with the evolution of LLMs, the question arises whether these powerful natural language processing tools can assume a role similar to that of human experts in the validation of domain-specific knowledge.

The primary goal of this work is to analyze and assess the ability of LLMs to serve as domain experts in a clinical scenario, specifically in validating explanations (and thereby validating the models) derived from classical machine learning decision models. These explanations, presented in the form of affirmative statements such as "hypertension increases the risk of death from COVID-19", are transformed into questions (for example, "Does hypertension mean an increased risk of death from COVID-19?") to be presented to the LLMs. This approach allows us to directly evaluate the LLMs as a knowledge base for validating specific claims within the domain, offering a unique perspective on their applicability as validation tools in scientific and clinical contexts.

This work aims to address the following research questions: RQ1) What effect does the variation in the number of options within a fact-checking question have on the responses provided by Large Language Models (LLMs)? RQ2) How consistent are the boolean answers (i.e. yes or no) provided by LLMs? RQ3) What is the impact of integrating machine learning inferences with LLMs on enriching and validating the explanations?
Through this analysis, we seek not only to understand the level of knowledge and accuracy of LLMs in specialized domains but also to investigate their potential to complement or, in some cases, replace the need for human peer review in the validation stage of scientific conclusions.

Our contributions are:
1. a novel assessment method that integrates machine learning inferences with Large Language Models (LLMs) to generate fact-checking (FC) type questions;
2. a study of the variability and consistency of responses provided by LLMs in multiple-choice questions and in scenarios with established ground truths;
3. an investigation into the variability of explanations provided by LLMs in scenarios involving fact-checking (including questions with multiple factual options) and fact recovery, offering a comprehensive understanding of LLMs' explanatory capabilities and their potential for enhancing AI interpretability.

2. Related works

Prompt framing effect. The study of the prompt framing effect reveals that the performance of Large Language Models (LLMs) is highly dependent on the construction of the prompts, with a significant focus on the consistency of LLMs' responses to similar prompts. This concept, discussed in [6], [7], and [8], examines LLMs' ability to provide consistent outputs for semantically similar prompts and their sensitivity to hallucination-inducing inputs. The examination of LLMs under different conditions, such as varying the context and structure of the prompts, sheds light on their performance variability and on strategies for optimizing accuracy.

Building on this foundation, the interplay between context, choice structure, and decision-making, as explored in [9], [10], [11], and [12], directly relates to the challenges LLMs face. This parallel between human and computational decision-making processes emphasizes the importance of carefully designed prompts and the strategic manipulation of choice options to improve LLM reliability and decision accuracy. Through the decision-making strategies and prompt engineering techniques proposed in [13], [14], [15], [16], and [17], a nuanced approach to prompt framing becomes critical for enhancing LLM interactions and understanding. This body of work collectively illustrates a key insight: adjusting the number of options and the framing of prompts can profoundly influence the effectiveness of LLMs in verifying statements and making decisions, bridging the gap between consistency in output and the complexity of input conditions.

Explainable AI and LLMs. Interpretability and explainability in Machine Learning (ML) refer to the ability to make an ML model's workings understandable. This is particularly vital in high-risk applications and desirable in most cases. The burgeoning field of research that aims to foster this ability is known as eXplainable Artificial Intelligence (XAI). A variety of XAI methods have been developed in recent years. They may be related to intrinsically interpretable models or to "black box" models, but all pursue coherent and meaningful explanations for the audience. As an example, SHAP (SHapley Additive exPlanations) is one of the most widely used model-agnostic XAI techniques. It is based on concepts from game theory that allow computing which features contribute the most to the outcomes of the black-box model by trying different feature-set permutations [18]. LIME (Local Interpretable Model-agnostic Explanations) is another well-known example that builds a simple linear surrogate model to explain each prediction of the learned black-box model [19]. There are also some interpretable ML models, such as logistic regression, Generalised Linear Models (GLMs), or Generalised Additive Models (GAMs). There are also attempts to facilitate the comprehension of XAI methods by providing new tools to end-users. In [20] a new GPT, x-[plAIn], is proposed to transform the output explanations provided by those methods (e.g. SHAP or LIME) into natural language that contains the technical descriptions of the results. Despite the improvements in end-user satisfaction, this work does not include any enrichment or additional information that could contextualize not only the explanations themselves but also their meaning and their validation within the application domain. In [21] the authors propose to use LLMs to facilitate decision-making processes by end users, providing concise summaries of various XAI methods tailored for different audiences. This can be viewed as an LLM-enhanced XAI explainer that tries to bridge the gap between complex AI technologies and their practical applications.
Veracity and truth extraction. The exploration of truth within the realm of big data and its verification through LLMs embodies a complex interaction between technological advancements and the multifaceted nature of truth. The assembly method proposed by [22] marks a significant step in addressing the challenge of data veracity by combining individual truth discovery methods to mitigate the effects of limited labeled ground truth availability. This approach lays the groundwork for further research on the role of technology in differentiating between truth and falsehood. Furthermore, research on linguistic indicators of truth and deception, such as that of [23], reveals the potential of linguistic complexity and immediacy to act as markers that distinguish between truthful and deceptive narratives, enriching the conversation about truth verification in digital communications.

Recent advances in artificial intelligence, notably the conceptualization of models such as InstructGPT as "Truth Machines" by [24], highlight ongoing efforts to define and operationalize truth through sophisticated data analysis and model architectures. Currently, innovative methodologies such as the DoLa decoding strategy by [25] and the development of truthfulness personas by [26] aim to enhance the factuality and reliability of LLM outputs. These strategies not only address the challenge of hallucinations in model responses but also open up new pathways for embedding truthfulness within AI systems, underscoring the dynamic nature of research focused on achieving reliable knowledge verification and decision-making processes in the digital era.

3. Approach and Problem Setup

Our proposal involves using LLMs as knowledge bases to evaluate the outcomes of machine learning models by answering Boolean questions derived from the models' inferences. This approach aims to harness the comprehensive knowledge and understanding capabilities of LLMs to verify the accuracy and reliability of inferences made by machine learning models, thereby providing a novel method for validating AI-generated insights through direct, yes-or-no questioning.

3.1. Large Language Models

A range of LLMs have been developed in recent years. GPT-4, developed by OpenAI, is a state-of-the-art LLM known for its deep learning architecture. As part of the Generative Pre-trained Transformer series, it includes a large network of multi-layer transformers, capable of processing sequential data and preserving long-term textual dependencies. This version marks a significant advancement over its predecessors by scaling up the number of parameters and broadening the diversity of its training data, thus enhancing its ability to generate coherent and contextually relevant text based on the input it receives [27].

Moreover, Google DeepMind's Gemini project is a key competitor to GPT-4. Gemini is a family of models built on top of transformer decoders that employ attention mechanisms, analogous to GPT-4. Gemini Pro, the second model in the family in terms of size, has been optimized for both cost and latency, offering considerable performance improvements across numerous tasks; it is designed to understand, reason, and generate outputs across various types of data, including text [28].

Similarly, Llama 2 constitutes a collection of pretrained and fine-tuned LLMs that is distinctive from the models mentioned above due to its open-source nature [29]. This group of models developed by Meta includes two models (Llama 2 and Llama 2-Chat) with different versions that adjust the number of parameters: 7B, 13B and 70B.

Mistral represents another significant collection of LLMs, characterized by advanced reasoning capabilities and robust performance. Their largest model, Mistral Large, demonstrates state-of-the-art results across a variety of benchmarks, including areas such as common sense, reasoning, and knowledge-based tasks [30]. The Mistral family also includes open-source models that surpass certain versions of Llama 2 in several benchmarks, as documented in [31].
3.2. Datasets

Covid19 explanations. The questions included in Table 1 are created from a clinical study [32]. In that study, one thousand three hundred and thirty-one COVID-19 patients (average age 66.9 years; males n=841, average length of hospital stay 8 days, non-survivors n=233) were analyzed. The questions are constructed from the hypotheses raised in the study. Questions Q2, Q3, Q4, Q5, Q6, Q7, and Q8 were identified as significant using a Cox regression model, and Q1, Q9, and Q10 were identified as significant by univariate analysis. Q1 was also identified as one of the most important variables using SHAP explanations over an LSTM model learned on the same Covid19 dataset. Based on domain knowledge and on the model explanations, we set Q1, Q2, Q3, Q4, Q5, Q6, and Q8 as questions with a positive ("yes") ground-truth answer. We did not include Q7 as a positive response (but as controversy), despite it being obtained from the Cox model explanations, because there was controversy about the use of hydroxychloroquine during the pandemic: although it was initially considered a drug that could reduce the risk of mortality, it was later contradicted by other studies and was not recommended by the World Health Organization. Likewise, the variables that were obtained only by the univariate analysis (Q9 and Q10) are labeled as controversy answers.

It is important to highlight that all the questions adhere to a consistent structure to optimize the performance of the LLM. Specifically, each question is framed as "Does #hypothesis# mean an increased risk of death from COVID-19?". This uniformity ensures that the LLM's responses are directly comparable and minimizes variability that could arise from differing question formats. It also allows us to test hypotheses obtained from the explainability models.

Table 1: Consistency questions dataset
Q1: Does hypertension mean an increased risk of death from COVID-19?
Q2: Does a low platelet count mean an increased risk of death from COVID-19?
Q3: Does a high leukocyte count at emergency mean an increased risk of death from COVID-19?
Q4: Does older age mean an increased risk of death from COVID-19?
Q5: Does male gender mean an increased risk of death from COVID-19?
Q6: Does previous chronic therapy with steroids mean an increased risk of death from COVID-19?
Q7: Does not treating with hydroxychloroquine mean an increased risk of death from COVID-19?
Q8: Does oxygen saturation at emergency mean an increased risk of death from COVID-19?
Q9: Does no early prescription of lopinavir/ritonavir mean an increased risk of death from COVID-19?
Q10: Does no treatment with steroid bolus mean an increased risk of death from COVID-19?
Veracity dataset. The Stanford Question Answering Dataset (SQuAD) [33] has been extensively used in the scientific literature for the development of Question Answering (QA) language models, serving as a benchmark to assess the abilities of these models in understanding and processing natural language queries. As a rich compilation of questions and answers based on Wikipedia articles, SQuAD challenges models to provide accurate answers by comprehending the context provided in the passages. In our work, we retrieved a subset of questions from the SQuAD dataset to specifically validate the knowledge conveyed by LLMs. This targeted evaluation was designed to determine the precision of the LLM answers compared to the gold-standard answers of the dataset. This method of validation not only tests the LLMs' understanding of complex texts, but also assesses their reliability in providing information that matches human-curated answers.

3.3. Use Cases

Three use cases (UC) have been designed to address the previous research questions, focusing on the practical applications and implications of using LLMs to validate machine learning inferences. The first investigates the influence of varying the number of options in fact-checking questions on LLM responses, aiming to understand how choice diversity impacts LLM accuracy. The second focuses on assessing the consistency of the boolean (yes or no) answers provided by LLMs, evaluating their reliability in delivering steady responses. Lastly, we explore the effects of combining machine learning inferences with LLMs to both enrich and validate the explanations of these models. This last use case uses the Covid19 dataset to create an ML model and the SHAP technique to obtain a set of important features that are later enriched with LLMs.

The models used in this study include "gpt-4" from OpenAI, "mistral-large-2402" from Mistral AI, "gemini-1.0-pro-001" from Google, and "llama-2-70b-chat" from Meta AI. In addition, the temperature parameter was set to the lowest possible value to ensure the most deterministic behavior of the LLMs. Temperature controls the randomness of the generated output, with a lower value leading to more deterministic outputs by favoring the most likely predictions. Therefore, in most models, the temperature value was set to 0 to minimize randomness. However, it is important to note that for the Llama 2 model the minimum supported temperature value is 0.01. Despite this slight deviation from 0, the aim remains the same: to achieve the lowest possible level of randomness in the output.
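As an illustration of this setup, the following sketch shows how a single question could be posed to one of the models with a context prompt and the temperature fixed at its minimum. It uses the OpenAI Python client for the GPT-4 case; the helper name ask_llm and the surrounding wiring are ours rather than the exact code used in the study, and the other providers' SDKs expose analogous context and temperature parameters.

# Minimal sketch (not the authors' code) of how a fact-checking question is sent
# to one of the LLMs with a system context and temperature set to its minimum.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

CONTEXT_1 = ("You are an expert on COVID-19 and your duty is to answer questions "
             "related to the topic only with yes or no followed by the explanation "
             "that validates the answer in a maximum of 2 sentences.")

def ask_llm(question: str, context: str, model: str = "gpt-4") -> str:
    """Send one question with a system context, using the most deterministic setting."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # lowest supported value; Llama 2 would use 0.01 instead
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_llm("Does hypertension mean an increased risk of death from COVID-19?",
                  CONTEXT_1))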
UC1: Fact Density Impact Analysis. This use case examines the performance of LLMs when delivering binary responses ("yes" or "no") versus incorporating a third option ("controversy") that introduces an element of uncertainty. The evaluation aims to measure the models' performance in terms of veracity, exploring how the structure of the response options affects the LLMs' ability to provide accurate and reliable answers in fact-checking scenarios.

Table 2 presents the prompts used in the three scenarios designed to evaluate veracity: allowing the model to use binary responses or multiple options, and requesting the model to act as an expert in the clinical domain providing precise and concise responses. The use of the max-tokens parameter inadvertently caused responses to be abruptly cut, leading to nonsensical outcomes. Consequently, we directed the model within the context to be precise and concise, with the aim of minimizing this issue and enhancing the clarity and relevance of its answers. This additional evaluation context was designed to gauge the model's capacity to offer accurate and reliable answers when positioned as a domain-specific authority, further enriching our understanding of its performance in delivering veracious responses within specialized scenarios. This distinction allows for a detailed examination of how the inclusion of a "controversy" option alongside traditional "yes" or "no" answers influences the model's response behavior in our Use Case 1 analysis.

Table 2: Use Case 1 contexts
Context 1: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes or no followed by the explanation that validates the answer in a maximum of 2 sentences."
Context 2: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes, no or controversy followed by the explanation that validates the answer in a maximum of 2 sentences."
Context 3: "You are a medical expert and your duty is to answer medical questions in a single sentence in a precise and brief manner."

UC2: Consistency and Veracity Evaluation. Use Case 2 distinguishes between two methods of evaluating LLM consistency based on the availability of ground truth. In the first approach, where the true answer is not available, consistency is assessed by comparing the LLM's responses against each other. This method focuses on the internal consistency of the model's answers. In the second approach, where a known true answer exists, the LLM's responses are evaluated against this ground truth to measure the model's accuracy and reliability in providing consistent and correct answers, a quality referred to as veracity.

On the one hand, the first approach, or consistency evaluation, aims to assess the stability of responses from LLMs through repeated inquiries. Using Algorithm 1 to systematically evaluate consistency on the Covid19 dataset, we probe each question in the dataset multiple times using the question and Context 1 as the prompt. This allows us to gauge the LLMs' consistency using the metrics described in Section 3.4. Similarly, the same algorithm is used with Context 2. The algorithm was therefore deployed twice for each LLM, once for each of the two contexts, and the temperature parameter was minimized to enhance response determinism. This methodology provides a nuanced understanding of the models' consistency by ensuring controlled conditions and leveraging the lowest possible temperature setting to maximize the determinism of the models' responses.

Algorithm 1: Evaluate the consistency of a single LLM
1: for each question q_i in dataset1 do
2:     Initialize Responses to an empty list
3:     for i ← 1 to 10 do
4:         response r ← AskLLM(q_i, context1)
5:         Append r to Responses
6:     end for
7:     SemanticSimilarity ← CalculateSemanticSimilarity(Responses)
8:     Overlap ← CalculateOverlap(Responses)
9:     ROUGE ← CalculateROUGE(Responses)
10:    BLEU ← CalculateBLEU(Responses)
11:    Store metrics for further analysis
12: end for
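A Python rendering of Algorithm 1 could look as follows. It assumes an ask_llm(question, context) callable such as the one sketched above and computes only a Jaccard-style token overlap between responses; the semantic similarity, ROUGE and BLEU scores of Section 3.4 would be plugged into pairwise_scores in the same way. This is an illustrative sketch, not the authors' implementation.

# Sketch of Algorithm 1: query each question 10 times with the same context and
# score the agreement between the collected responses.
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List

def token_overlap(a: str, b: str) -> float:
    """Share of common tokens between two sentences (Jaccard over word sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def pairwise_scores(responses: List[str]) -> float:
    """Average a symmetric metric over all unordered pairs of responses."""
    return mean(token_overlap(a, b) for a, b in combinations(responses, 2))

def evaluate_consistency(ask_llm: Callable[[str, str], str],
                         questions: List[str],
                         context: str,
                         n_runs: int = 10) -> Dict[str, float]:
    """For each question, collect n_runs responses and store their mean agreement."""
    results = {}
    for q in questions:
        responses = [ask_llm(q, context) for _ in range(n_runs)]
        results[q] = pairwise_scores(responses)
    return results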
On the other hand, the veracity evaluation involves the use of ground truth. Therefore, akin to the previous method, we employ a different algorithm (see Algorithm 2) designed to assess the veracity of each response from each model. The key difference in this approach is that, for each question, both the responses obtained with their context (Context 3) and the ground truth for that question ("q_i answer") are used, enabling a direct comparison between the LLM's responses and the known accurate answers.

Algorithm 2: Evaluate the veracity of a single LLM
1: for each question q_i in dataset2 do
2:     Initialize Responses to an empty list
3:     for i ← 1 to 10 do
4:         response r_i ← AskLLM(q_i, context3)
5:         Append r_i to Responses
6:     end for
7:     SemanticSimilarity ← CalculateSemanticSimilarity(Responses, q_i answer)
8:     Overlap ← CalculateOverlap(Responses, q_i answer)
9:     ROUGE ← CalculateROUGE(Responses, q_i answer)
10:    BLEU ← CalculateBLEU(Responses, q_i answer)
11:    Store metrics for further analysis
12: end for

UC3: XAI Enhancement and Validation. Use Case 3 involves leveraging machine learning inferences and LLMs to enrich and validate explanations. We propose to utilize important features identified by XAI techniques, such as SHAP, to augment information and validate explanations. This involves transforming explanations into binary questions that LLMs can answer, with prompts that contain both the question and the relevant context. By constructing queries that directly link significant features with real-world outcomes (e.g. "Does #hypothesis# mean an increased risk of death from COVID-19?"), we bridge the gap between XAI insights and practical applications. Additionally, by instructing LLMs to respond with "yes" or "no" and to provide validating explanations, we achieve the dual objectives of validating and enriching responses, prompting the LLMs to elaborate on the pertinent features.
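The following sketch illustrates the UC3 step that turns model explanations into fact-checking questions: features are ranked by mean absolute SHAP value and the top-ranked ones are inserted into the question template used throughout the paper. The random-forest regressor and the synthetic data are illustrative stand-ins for the clinical model and the Covid19 cohort used in the study; the feature names follow the Table 1 hypotheses.

# Sketch (under illustrative assumptions) of turning SHAP importances into
# fact-checking questions for the LLMs.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

FEATURE_NAMES = ["hypertension", "a low platelet count",
                 "a high leukocyte count at emergency", "older age", "male gender"]
TEMPLATE = "Does {hypothesis} mean an increased risk of death from COVID-19?"

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURE_NAMES)))              # illustrative data only
risk = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, risk)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)                      # shape: (n_samples, n_features)
importance = np.abs(shap_values).mean(axis=0)               # mean |SHAP| per feature

top_features = np.argsort(importance)[::-1][:3]             # three most important features
questions = [TEMPLATE.format(hypothesis=FEATURE_NAMES[i]) for i in top_features]
for q in questions:
    print(q)  # each question is then sent to the LLMs with a Table 2 context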
3.4. Metrics

A suite of metrics has been implemented to evaluate the consistency and veracity of the LLMs. This suite includes semantic similarity, token overlap, and the ROUGE and BLEU metrics.

Semantic similarity is a measure of the degree to which two concepts (such as words, phrases, or sentences) are related in terms of their meanings within a given semantic space. In formal terms, semantic similarity can be quantified based on the distance or closeness of the concepts in a multi-dimensional space, where each dimension represents a feature of the concept's meaning. The closer two concepts are in this space, the more semantically similar they are. Diverse methods for calculating semantic similarity are analysed in [34], encompassing a range of approaches. This research specifically uses cosine similarity in conjunction with sentence embeddings. We use Sentence-BERT, a variation of BERT (Devlin et al., 2018) optimized for sentence-level embeddings, due to its proven efficiency [35]. In particular, we use the "all-MiniLM-L6-v2" model for its remarkable balance between performance and speed: despite being one of the smallest models in terms of size, it stands out for its rapid processing capabilities.

Overlap as a metric refers to quantifying similarity based on the common tokens (words or other meaningful elements) that appear in two sentences. This metric is used to assess how much shared content exists between both sentences, indicating their consistency or similarity in terms of the information they convey.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a collection of metrics designed for the formal evaluation of text generation models, such as summarization or machine translation systems. In the evaluation of responses generated by an LLM, the use of ROUGE is justified by its ability to quantitatively measure the lexical overlap across the different responses generated by the LLM itself. This is accomplished using the ROUGE-L variant, which employs the Longest Common Subsequence (LCS) between two sentences as the basis for computing recall, precision, and the F1 score derived from both [36].

BLEU stands for Bilingual Evaluation Understudy. BLEU is a metric initially conceived for evaluating the quality of text translated by machine translation systems by comparing it with one or more reference translations [37]. Unlike ROUGE, which is recall-oriented, BLEU emphasizes precision. It assesses how many words or phrases in the machine-generated text appear in the reference texts. The metric calculates n-gram (contiguous sequences of n items from a given sample of text) precision for different lengths and combines them through a weighted geometric mean, incorporating a brevity penalty to discourage overly short translations [37]. This precision-oriented approach is particularly valuable when the objective is to ensure that certain key information is consistently represented in the LLM's outputs. We computed the BLEU metric by treating each LLM response as a "translation" and comparing it to the other responses. BLEU can highlight the extent to which the LLM is capable of producing responses that contain expected and relevant content. This offers a complementary perspective to the recall-focused ROUGE metric, providing a balanced assessment of the LLM's performance.
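The metric suite could be implemented with common Python packages as sketched below. The paper fixes cosine similarity over "all-MiniLM-L6-v2" sentence embeddings and the ROUGE-L variant; the sentence-transformers, rouge-score and NLTK packages used here, as well as the exact normalization of the token-overlap score, are our assumptions rather than the authors' stated tooling.

# Sketch of the four scores of Section 3.4 for one pair of texts.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between Sentence-BERT embeddings of the two texts."""
    emb = _embedder.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def overlap(a: str, b: str) -> float:
    """Token overlap: shared words divided by the size of the smaller word set
    (the paper does not state the normalization, so this choice is illustrative)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / min(len(ta), len(tb)) if ta and tb else 0.0

def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1, based on the longest common subsequence."""
    return _rouge.score(reference, candidate)["rougeL"].fmeasure

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU with smoothing to avoid zero scores on short texts."""
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)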
4. Evaluation and Results

In this section we evaluate the different use cases.

4.1. UC1 Results

The first use case focuses on evaluating how the structure of the response options presented to the LLM influences the models' accuracy and reliability. This evaluation was addressed by using two different contexts: Context 1, which allows only a binary response ("yes" or "no"), and Context 2, which introduces a third option associated with uncertainty, characterized as "controversy".

Table 3 shows the models' responses for each question. It is important to clarify that although multiple responses are generated for each question (specifically 10), the table presents only a single value in each cell. This reduction is justified because the answers ("yes", "no" or "controversy") do not vary across iterations; what varies is the model's explanation of the response, not the answer itself.

Table 3: Context 1 vs Context 2 responses for all LLMs (C1 = Context 1, C2 = Context 2)
      Expected     | GPT C1 / C2       | Mistral C1 / C2   | Gemini C1 / C2     | Llama2 C1 / C2
Q1    yes          | yes / yes         | yes / yes         | yes / controversy  | yes / yes
Q2    yes          | yes / yes         | yes / yes         | yes / controversy  | no / controversy
Q3    yes          | yes / yes         | yes / yes         | yes / controversy  | yes / controversy
Q4    yes          | yes / yes         | yes / yes         | yes / yes          | yes / yes
Q5    yes          | yes / yes         | yes / yes         | yes / yes          | no / controversy
Q6    yes          | yes / controversy | yes / controversy | no / controversy   | yes / controversy
Q7    controversy  | no / controversy  | no / controversy  | no / controversy   | no / controversy
Q8    yes          | yes / yes         | yes / yes         | yes / yes          | yes / controversy
Q9    controversy  | no / controversy  | yes / controversy | no / controversy   | yes / controversy
Q10   controversy  | yes / controversy | yes / controversy | no / controversy   | yes / controversy

However, some differences can be noted both in the responses generated by an LLM under different contexts for the same question and in the performance across the various language models (e.g. Q1 in Context 2 is answered as "yes" by GPT but as "controversy" by Gemini). These variations reveal that while some differences can be attributed to the introduction of "controversy" in the response options (e.g. GPT Q7), others do not have such a clear justification (e.g. Gemini Q2).

Optimally, each LLM should make three justified variations (Q7, Q9, and Q10) when the uncertainty option is introduced with the second context, due to the limitation of Context 1 to binary "yes" or "no" answers. For GPT, an analysis of the responses between contexts reveals a mixed outcome: 3 of the variations presented are deemed correct (Q7, Q9, and Q10), indicating that the model accurately handled both contexts. Conversely, the model's response to question Q6 is classified as a wrong variation, suggesting inaccuracies in dealing with different contexts.

Similarly, the performance of the other models is as follows:
• Mistral accurately handles 3 variations (Q7, Q9, and Q10) but had an error in Q6.
• Gemini stands out by correctly handling 3 variations (Q7, Q9, and Q10) but falls short by producing 4 unjustified incorrect variations (Q1, Q2, Q3, Q6), including a notable discrepancy in question Q6, where the expected answer was "yes" but the output was "no".
• Llama2 demonstrates accuracy in the variations of Q7, Q9, and Q10. However, it produces unjustified variations in Q2, Q3, Q5, Q6, and Q8. Furthermore, it provides incorrect answers for Q2 and Q5, where "yes" was expected but "no" was output.

Our findings suggest that introducing the option of "controversy" as a potential response significantly influences the behavior of the analyzed LLMs, leading to a noticeable shift in their response patterns. Across the models (GPT and Mistral changed their response in 4 out of 10 instances, Gemini in 7 out of 10, and Llama2 in 8 out of 10), there is a marked preference for selecting "controversy" over a definitive "yes" or "no". This tendency persists irrespective of the model in question and appears to reflect a broader pattern: when presented with the "controversy" option, models consistently avoid negative responses, opting instead to categorize statements as controversial. This behavior suggests a higher level of confidence in asserting conclusions than in denying them. While for GPT and Mistral 75% of these shifts towards "controversy" can be considered justified, enhancing the quality of the output, the justification for this change drops to 43% for Gemini and 37% for Llama2, indicating variability in how these adjustments align with the underlying data uncertainty.
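These shift counts and percentages can be reproduced directly from Table 3, as the following sketch shows: a change between contexts is counted as justified when the Context 2 answer lands on the expected "controversy" label, and as unjustified otherwise. The printed figures match, up to rounding, the values quoted above; the encoding of the table is ours.

# Reproduce the context-shift counts of Section 4.1 from the Table 3 data.
EXPECTED = ["yes"] * 6 + ["controversy", "yes", "controversy", "controversy"]  # Q1..Q10

# (Context 1, Context 2) answer per question Q1..Q10, transcribed from Table 3.
TABLE3 = {
    "GPT":     [("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"),
                ("yes", "controversy"), ("no", "controversy"), ("yes", "yes"),
                ("no", "controversy"), ("yes", "controversy")],
    "Mistral": [("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"),
                ("yes", "controversy"), ("no", "controversy"), ("yes", "yes"),
                ("yes", "controversy"), ("yes", "controversy")],
    "Gemini":  [("yes", "controversy"), ("yes", "controversy"), ("yes", "controversy"),
                ("yes", "yes"), ("yes", "yes"), ("no", "controversy"), ("no", "controversy"),
                ("yes", "yes"), ("no", "controversy"), ("no", "controversy")],
    "Llama2":  [("yes", "yes"), ("no", "controversy"), ("yes", "controversy"), ("yes", "yes"),
                ("no", "controversy"), ("yes", "controversy"), ("no", "controversy"),
                ("yes", "controversy"), ("yes", "controversy"), ("yes", "controversy")],
}

for model, answers in TABLE3.items():
    changed = [(expected, c2) for (c1, c2), expected in zip(answers, EXPECTED) if c1 != c2]
    justified = sum(1 for expected, c2 in changed if c2 == expected)
    print(f"{model}: {len(changed)}/10 answers change, "
          f"{100 * justified / len(changed):.1f}% of the changes are justified")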
4.2. UC2 Results

In this section, we present the results of the second use case, which are detailed in Tables 4 and 5. These tables show the average performance metrics for consistency and veracity, namely semantic similarity, overlap, ROUGE and BLEU scores, for each model across the datasets. These metrics were computed for each question within the datasets, with averages provided to give a view of each model's performance under two different contexts (i.e. Context 1 and Context 2) for the consistency evaluation (Table 4; per-question consistency results are provided in Appendix Tables A1 and A2) and a third context (i.e. Context 3) for the veracity evaluation (Table 5; per-question veracity results are provided in Appendix Table A3).

Table 4: Average consistency evaluation
                       Semantic similarity   Overlap   ROUGE   BLEU
GPT       Context 1    0.983                 0.888     0.844   0.770
          Context 2    0.980                 0.897     0.853   0.763
Mistral   Context 1    1.000                 1.000     1.000   1.000
          Context 2    0.999                 0.995     0.992   0.988
Gemini    Context 1    1.000                 0.996     0.996   0.992
          Context 2    0.999                 0.996     0.993   0.986
Llama2    Context 1    1.000                 0.998     0.997   0.996
          Context 2    0.999                 0.991     0.989   0.983

Table 5: Average veracity evaluation
          Semantic similarity   Overlap   ROUGE   BLEU
GPT       0.740                 0.464     0.301   0.408
Mistral   0.727                 0.403     0.222   0.380
Gemini    0.676                 0.415     0.273   0.328
Llama2    0.734                 0.466     0.239   0.393

Our analysis reveals no significant difference in performance between the two contexts evaluated for consistency, where all LLMs demonstrated high levels of consistency. Mistral achieved perfect consistency scores, while Gemini and Llama2 were nearly perfect. However, GPT showed the lowest consistency (for all metrics, including semantic similarity), even with the temperature parameter set to the lowest level, indicating potential variability in its response generation process.

When comparing the models' performance to the ground-truth data for veracity (see Section 3.2), GPT stands out by achieving the best results across all metrics, indicating that its responses, on average, align more closely with the ground truth than those of the other models. Llama2 follows closely behind as the second-best performer, with Gemini and Mistral trailing and their positions varying depending on the metric applied. These findings suggest that while GPT may struggle with consistency relative to its peers, it excels at generating responses that are more closely aligned with verifiable facts, highlighting a nuanced trade-off between consistency and veracity across different LLMs.
4.3. UC3 Results

This use case examines the use of prompts that transform explanations into binary questions containing both the question and the relevant context for the fact-checking under analysis (e.g. Context 1: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes or no followed by the explanation that validates the answer in a maximum of 2 sentences."). Table 7 shows examples of GPT-4 responses for Q1, Q2, and Q3. Q1 enriches the fact-checking response by adding information about the consequences of having hypertension and how they relate to a higher risk of death. Q2 enriches the response by adding reasons why the important feature (i.e. platelet count) plays a crucial role that may lead to a high risk of death. Finally, the Q3 response is enriched by indicating that a high leukocyte count can be a symptom of severe Covid19.

Table 7: Examples of responses for GPT-4
Q1: "Yes, hypertension has been identified as a risk factor for severe outcomes in patients with COVID-19. Studies have shown that patients with hypertension are more likely to experience severe symptoms or complications, including death, from the virus."
Q2: "Yes, studies have shown that a low platelet count, or thrombocytopenia, can be associated with a higher risk of severe disease and mortality in patients with COVID-19. This is because platelets play a crucial role in the body's immune response, and a low count can impair the body's ability to fight off infections."
Q3: "Yes, a high leukocyte count, or leukocytosis, can indicate a severe infection or inflammation in the body, including severe COVID-19. Studies have shown that patients with severe COVID-19 often have leukocytosis, which is associated with a higher risk of mortality."

In Table 8 we analyze, at a syntactic level, the number of words contained in each explanation and the average number of words per sentence. Llama2 and Mistral produce longer explanations that are also syntactically slightly more complex (Llama2 reaches approximately 28 words per sentence for Context 2). Gemini provides the shortest explanations and the lowest syntactic complexity (36.72 words per text on average and 19.91 words per sentence).

Table 8: Average number of words of explanation per text and per sentence
                       Avg words per text   Avg words per sentence
GPT       Context 1    39.38                19.69
          Context 2    37.46                22.65
          Average      38.42                21.17
Mistral   Context 1    53.30                21.62
          Context 2    45.70                20.20
          Average      49.50                20.91
Gemini    Context 1    37.70                18.85
          Context 2    35.74                20.98
          Average      36.72                19.915
Llama2    Context 1    57.70                23.20
          Context 2    50.76                27.69
          Average      54.23                25.445

Similarly to the previous use cases, we analyzed the differences between the Context 1 and Context 2 explanations (the second including controversy as an option) to measure how different the explanations are. According to all metrics, the results show that the LLM that changes the most is Llama2 (i.e. ROUGE 0.411), followed by Gemini (i.e. ROUGE 0.442), GPT-4 (i.e. ROUGE 0.541) and Mistral (i.e. ROUGE 0.570); see Table 6 for the other metrics.

Table 6: Average consistency of explanations between Context 1 and Context 2
          Semantic similarity   Overlap   ROUGE   BLEU
GPT       0.916                 0.659     0.541   0.375
Mistral   0.931                 0.673     0.570   0.389
Gemini    0.915                 0.600     0.442   0.247
Llama2    0.905                 0.552     0.411   0.219
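The two descriptive analyses of this subsection, the explanation lengths of Table 8 and the Context 1 versus Context 2 agreement of Table 6, can be sketched as follows. The sentence splitter is a simple heuristic and ROUGE-L stands in for the full metric suite, since the paper does not state the exact tooling used for these statistics.

# Sketch of the Table 8 length statistics and the Table 6 context agreement.
import re
from statistics import mean
from typing import Dict, Tuple
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def length_stats(explanations: Dict[str, str]) -> Tuple[float, float]:
    """Average words per explanation and average words per sentence."""
    texts = list(explanations.values())
    per_text = mean(len(t.split()) for t in texts)
    sentence_lengths = [len(s.split())
                        for t in texts
                        for s in re.split(r"(?<=[.!?])\s+", t.strip()) if s]
    return per_text, mean(sentence_lengths)

def context_agreement(ctx1: Dict[str, str], ctx2: Dict[str, str]) -> float:
    """Mean ROUGE-L F1 between the Context 1 and Context 2 explanation of each question."""
    return mean(_rouge.score(ctx1[q], ctx2[q])["rougeL"].fmeasure for q in ctx1)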
5. Conclusions

In this paper we studied the effect of varying the number of options within fact-checking questions, the consistency and truthfulness of the answers, and the capability to enrich fact-checking with explanations. We also proposed to link explanations from machine learning models to LLMs by using those explanations to create fact-checking type input questions. We measured coherence and veracity using state-of-the-art metrics such as semantic similarity, overlap, ROUGE and BLEU, and the results show that Mistral is the most coherent LLM. Notably, Gemini and Llama2 obtained similar results, and GPT was slightly behind. Furthermore, we conclude that fact-checking consistency does not depend on the number of options, but the explanations' consistency does. This is relevant because it means that a different number of options not only may change the fact response but will also justify it differently. Further research should be done to analyze in depth to what extent these differences might even imply contradictory responses. As for the truthfulness analysis, we observed that GPT obtained the best results on average and can be considered quite accurate.

Acknowledgments

This work has been funded by the project "Inteligencia Artificial eXplicable" (IAX), a grant of the Young Researchers 2022/2024 initiative of the Community of Madrid.

References

[1] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al., A survey on evaluation of large language models, ACM Transactions on Intelligent Systems and Technology 15 (2024) 1–45.
[2] W. Wang, B. Haddow, A. Birch, W. Peng, Assessing the reliability of large language model knowledge, arXiv:2310.09820 (2023). URL: https://doi.org/10.48550/arXiv.2310.09820.
[3] L. Caruccio, et al., Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot, Expert Systems with Applications 235 (2024) 121186. URL: https://www.sciencedirect.com/science/article/pii/S0957417423016883. doi:10.1016/j.eswa.2023.121186.
[4] Y. Jin, X. Wang, R. Yang, Y. Sun, W. Wang, H. Liao, X. Xie, Towards fine-grained reasoning for fake news detection, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 5746–5754. URL: https://ojs.aaai.org/index.php/AAAI/article/view/20517. doi:10.1609/aaai.v36i5.20517.
[5] D. Wadden, K. Lo, L. L. Wang, A. Cohan, I. Beltagy, H. Hajishirzi, MultiVerS: Improving scientific claim verification with weak supervision and full-document context, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 61–76. URL: https://aclanthology.org/2022.findings-naacl.6. doi:10.18653/v1/2022.findings-naacl.6.
[6] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze, Y. Goldberg, Measuring and improving consistency in pretrained language models, Transactions of the Association for Computational Linguistics 9 (2021) 1012–1031. URL: https://aclanthology.org/2021.tacl-1.60. doi:10.1162/tacl_a_00410.
[7] H. Raj, V. Gupta, D. Rosati, S. Majumdar, Semantic consistency for assuring reliability of large language models, arXiv preprint arXiv:2308.09138 (2023).
[8] Q. Dong, J. Xu, L. Kong, Z. Sui, L. Li, Statistical knowledge assessment for large language models, Advances in Neural Information Processing Systems 36 (2024).
[9] E. A. Maylor, M. A. Roberts, Similarity and attraction effects in episodic memory judgments, Cognition 105 (2007) 715–723. URL: https://www.sciencedirect.com/science/article/pii/S0010027706002587. doi:10.1016/j.cognition.2006.12.002.
[10] K. V. Morgan, T. A. Hurly, M. Bateson, L. Asher, S. D. Healy, Context-dependent decisions among options varying in a single dimension, Behavioural Processes 89 (2012) 115–120. URL: https://www.sciencedirect.com/science/article/pii/S0376635711001719. doi:10.1016/j.beproc.2011.08.017.
[11] P. Pezeshkpour, E. Hruschka, Large language models sensitivity to the order of options in multiple-choice questions, ArXiv abs/2308.11483 (2023). doi:10.48550/arXiv.2308.11483.
[12] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2021) 1–35. doi:10.1145/3560815.
[13] J. Zhan, H. Jiang, Y. Yao, Three-way multiattribute decision-making based on outranking relations, IEEE Transactions on Fuzzy Systems 29 (2021) 2844–2858. doi:10.1109/tfuzz.2020.3007423.
[14] T. Haladyna, S. Downing, How many options is enough for a multiple-choice test item?, Educational and Psychological Measurement 53 (1993) 999–1010. doi:10.1177/0013164493053004013.
[15] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, ArXiv abs/2302.11382 (2023). doi:10.48550/arXiv.2302.11382.
[16] S. Oymak, A. Rawat, M. Soltanolkotabi, C. Thrampoulidis, On the role of attention in prompt-tuning (2023) 26724–26768. doi:10.48550/arXiv.2306.03435.
[17] A. Bhargava, C. Witkowski, M. Shah, M. W. Thomson, What's the magic word? A control theory of LLM prompting, ArXiv abs/2310.04444 (2023). doi:10.48550/arXiv.2310.04444.
[18] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[20] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, XAI for all: Can large language models simplify explainable AI?, 2024. arXiv:2401.13110.
[21] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, XAI for all: Can large language models simplify explainable AI?, ArXiv abs/2401.13110 (2024). URL: https://api.semanticscholar.org/CorpusID:267199844.
[22] L. Berti-Équille, Data veracity estimation with ensembling truth discovery methods, in: 2015 IEEE International Conference on Big Data (Big Data), 2015, pp. 2628–2636. doi:10.1109/BigData.2015.7364062.
[23] J. Burgoon, L. Hamel, T. Qin, Predicting veracity from linguistic indicators, Journal of Language and Social Psychology 37 (2018) 603–631. doi:10.1177/0261927X18784119.
[24] L. Munn, L. Magee, V. Arora, Truth machines: Synthesizing veracity in AI language models, AI & SOCIETY (2023). doi:10.1007/s00146-023-01756-4.
[25] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, P. He, DoLa: Decoding by contrasting layers improves factuality in large language models, in: International Conference on Learning Representations (ICLR), 2024. doi:10.48550/arXiv.2309.03883.
[26] N. Joshi, J. Rando, A. Saparov, N. Kim, H. He, Personas as a way to model truthfulness in language models, ArXiv abs/2310.18168 (2023). doi:10.48550/arXiv.2310.18168.
[27] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 technical report, ArXiv abs/2303.08774 (2023). URL: https://api.semanticscholar.org/CorpusID:257532815.
[28] Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, et al., Gemini: A family of highly capable multimodal models, ArXiv abs/2312.11805 (2023). URL: https://api.semanticscholar.org/CorpusID:266361876.
[29] H. Touvron, L. Martin, K. R. Stone, P. Albert, et al., Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023). URL: https://api.semanticscholar.org/CorpusID:259950998.
[30] Mistral AI, Mistral Large, our new flagship model, https://mistral.ai/news/mistral-large/, 2024.
[31] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, et al., Mistral 7B, ArXiv abs/2310.06825 (2023). URL: https://api.semanticscholar.org/CorpusID:263830494.
[32] P. Cardinal-Fernandez, E. Garcia-Cuesta, J. Barberan, J. F. Varona, A. Estirado, A. Moreno, J. Villanueva, M. Villareal, O. Baez-Pravia, J. Menendez, et al., Clinical characteristics and outcomes of 1,331 patients with COVID-19: HM Spanish cohort, Revista Española de Quimioterapia 34 (2021) 342.
[33] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392. doi:10.18653/v1/D16-1264.
[34] D. Chandrasekaran, V. Mago, Evolution of semantic similarity: a survey, ACM Computing Surveys 54 (2021) 1–37. URL: http://dx.doi.org/10.1145/3440755. doi:10.1145/3440755.
[35] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[36] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.

A. Appendix. Detailed results of the consistency and veracity metrics for the three contexts.
Table A1: Consistency evaluation per question with Context 1. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.949 0.760 0.666 0.462   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q2        0.989 0.941 0.935 0.880   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 0.999 0.983 0.983 0.977
Q3        0.975 0.868 0.789 0.731   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q4        0.984 0.823 0.757 0.653   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 0.992 0.990 0.985
Q5        0.981 0.895 0.841 0.738   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q6        0.996 0.972 0.948 0.925   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q7        0.980 0.864 0.833 0.800   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q8        0.993 0.899 0.874 0.797   | 1.000 1.000 1.000 1.000   | 0.998 0.964 0.961 0.922   | 1.000 1.000 1.000 1.000
Q9        1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q10       0.987 0.858 0.800 0.710   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Average   0.983 0.888 0.844 0.770   | 1.000 1.000 1.000 1.000   | 1.000 0.996 0.996 0.992   | 1.000 0.998 0.997 0.996

Table A2: Consistency evaluation per question with Context 2. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.976 0.969 0.924 0.825   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q2        0.997 0.990 0.988 0.976   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q3        0.972 0.869 0.832 0.663   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q4        0.998 0.926 0.920 0.875   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q5        0.997 1.000 0.963 0.904   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q6        0.939 0.706 0.556 0.383   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 0.991 0.969 0.957 0.925
Q7        0.949 0.736 0.649 0.475   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q8        0.996 0.932 0.896 0.867   | 0.991 0.954 0.927 0.903   | 0.995 0.960 0.957 0.925   | 0.995 0.944 0.932 0.903
Q9        0.990 0.927 0.906 0.862   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q10       0.982 0.918 0.895 0.796   | 1.000 0.991 0.989 0.977   | 0.998 1.000 0.975 0.938   | 1.000 1.000 1.000 1.000
Average   0.980 0.897 0.853 0.763   | 0.999 0.995 0.992 0.988   | 0.999 0.996 0.993 0.986   | 0.999 0.991 0.989 0.983
Table A3: Veracity evaluation per question with Context 3. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.777 0.750 0.466 0.612   | 0.770 0.750 0.424 0.537   | 0.412 0.250 0.181 0.368   | 0.777 0.750 0.466 0.612
Q2        0.647 0.554 0.331 0.347   | 0.677 0.550 0.244 0.298   | 0.596 0.599 0.461 0.496   | 0.668 0.700 0.363 0.294
Q3        0.687 0.285 0.178 0.289   | 0.681 0.380 0.173 0.349   | 0.698 0.500 0.266 0.358   | 0.680 0.380 0.210 0.420
Q4        0.764 0.448 0.207 0.409   | 0.805 0.466 0.239 0.507   | 0.714 0.157 0.093 0.218   | 0.648 0.290 0.163 0.304
Q5        0.665 0.428 0.255 0.184   | 0.619 0.312 0.222 0.259   | 0.622 0.350 0.260 0.137   | 0.634 0.297 0.101 0.271
Q6        0.778 0.458 0.368 0.479   | 0.737 0.333 0.170 0.384   | 0.774 0.533 0.294 0.242   | 0.787 0.500 0.307 0.436
Q7        0.758 0.350 0.191 0.373   | 0.733 0.256 0.161 0.361   | 0.631 0.413 0.254 0.306   | 0.821 0.435 0.242 0.454
Q8        0.889 0.679 0.543 0.769   | 0.791 0.411 0.235 0.440   | 0.858 0.529 0.461 0.673   | 0.926 0.647 0.156 0.405
Q9        0.705 0.269 0.202 0.196   | 0.725 0.242 0.107 0.259   | 0.746 0.230 0.163 0.220   | 0.691 0.222 0.175 0.300
Q10       0.731 0.420 0.264 0.423   | 0.728 0.333 0.244 0.402   | 0.708 0.588 0.299 0.266   | 0.706 0.441 0.210 0.437
Average   0.740 0.464 0.301 0.408   | 0.727 0.403 0.222 0.380   | 0.676 0.415 0.273 0.328   | 0.734 0.466 0.239 0.393