<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1177/0013164493053004013</article-id>
      <title-group>
        <article-title>Leveraging Large Language Models (LLMs) as Domain Experts in a Validation Process</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlos Badenes-Olmedo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esteban García-Cuesta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Sánchez-González</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Corcho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ontology Engineering Group, Departamento de Sistemas Informáticos, Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>9</volume>
      <issue>2021</issue>
      <fpage>61</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>The explosion of information requires robust methods to validate knowledge claims. At the same time, there is increasing interest in understanding and creating methods that help with the interpretation of machine learning models. Both approaches converge on the necessity of a validation step that clarifies, or helps end-users to better understand, whether the decision or information provided by the model is what is needed, or whether there is some mismatch between what the artificial intelligence system is suggesting and reality. Large Language Models (LLMs), with their ability to process and synthesize vast amounts of text data, have emerged as potential tools for this purpose. This study explores the utility of LLMs in hypothesis validation in two different scenarios. The first relies on hypotheses generated from explanations obtained by XAI methods or by inherently explainable models. We propose a method to transform the inferences provided by a machine learning model into explanations in natural language, hence linking the symbolic and sub-symbolic areas. The second relies on hypotheses generated with techniques that automatically extract answers from text. The results show that LLMs can complement other XAI techniques and, although all LLMs analyzed are able to provide truthfulness-related answers, not all are equally successful.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>knowledge validation</kwd>
        <kwd>explainable artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>(and thereof validating the models) derived from classi- inputs. The examination of LLMs under diferent
condical machine learning decision models. These explana- tions, such as varying the context and structure of the
tions, presented in the form of afirmative statements prompts, sheds light on their performance variability and
such as "the hypertension increase the risk of death from the strategies for optimizing accuracy.
COVID19", are transformed into questions (for example, Building on this foundation, the interplay between
"Does hypertension mean an increased risk of death from context, choice structure, and decision-making, as
exCOVID-19?") to be presented to the LLMs. This approach plored in [9], [10], [11], and [12], directly relates to the
allows us to directly evaluate the LLM models’ as a knowl- challenges LLMs face. This parallel between human and
edge base to validate specific claims within the domain, computational decision-making processes emphasizes
ofering a unique perspective on their applicability as the importance of carefully designed prompts and the
validation tools in scientific and clinical contexts. strategic manipulation of choice options to improve LLM</p>
      <p>This work aims to address the following research ques- reliability and decision accuracy. Through innovative
tions: RQ1) What efect does the variation in the number decision-making strategies and prompt engineering
techof options within a fact-checking question have on the niques proposed in [13], [14], [15], [16], and [17], the
responses provided by Large Language Models (LLMs)? nuanced approach to prompt framing is critical for
enRQ2) How consistent are the boolean answers (i.e. yes, hancing LLM interactions and understanding. This body
or no) provided by Large Language Models (LLMs)? RQ3) of work collectively illustrates a key insight: adjusting
What is the impact of integrating machine learning infer- the number of options and the framing of prompts can
ences with Large Language Models (LLMs) on enriching profoundly influence the efectiveness of LLMs in
verifyand validating the explanations? ing statements and making decisions, bridging the gap</p>
      <p>Through this analysis, we seek not only to understand between consistency in output and the complexity of
the level of knowledge and accuracy of LLMs in spe- input conditions.
cialized domains but also to investigate their potential
to complement or, in some cases, replace the need for
human peer review in the validation stage of scientific
conclusions.</p>
      <sec id="sec-1-1">
        <title>Explainable AI and LLMs Interpretability and ex</title>
        <p>plainability in Machine Learning (ML) refers to the ability
to make understandable an ML model’s workings. This
is particularly vital in high-risk applications and
desirOur contributions are: able in most cases. The burgeoning field of research that
1. a novel assessment method that integrates ma- addresses to foster this ability is known as eXplainable
Archine learning inferences with Large Language tificial Intelligence (XAI). A variety of XAI methods have
Models (LLMs) to generate fact-checking (FC) been developed in recent years. They may be related to
type questions. intrinsically interpretable models or to "black box"
mod2. a study on the variability and consistency of re- els, but all pursue coherent and meaningful explanations
sponses provided by LLMs in multiple-choice for the audience. As an example, SHAP (SHapley
Addiquestions and scenarios with established ground tive exPlanations) is one of the most widely used XAI
truths.. model agnostic techniques. It is based on concepts from
game theory that allow the computing, which are the
3. an investigation into the variability of explana- features that contribute the most to the outcomes of the
tions provided by LLMs in scenarios involving black-box model, by trying diferent feature set
permufact-checking (including questions with multiple tations [18]. LIME (Local Interpretable Model-agnostic
factual options) and fact recovery, ofering a com- Explanations) is another well known example that builds
prehensive understanding of LLMs’ explanatory a simple linear surrogate model to explain each of the
capabilities and their potential for enhancing AI predictions of the learned black-box model [19]. There
interpretability. are also some interpretable ML models such as logistic
regression, Generalised Linear Models (GLMs), or
Gener2. Related works alised Additive Models (GAMs).There are some attempts
to facilitate the comprehension of some XAI methods
Prompt framing efect The study of the prompt fram- providing new tools to end-users. At [20] a new GPT
ing efect reveals that the performance of Large Language x-[plAIn] is proposed to transform the output
explanaModels (LLMs) is highly dependent on the construction of tions provided by those methods (e.g. SHAP or LIME) to
the prompts, with a significant focus on the consistency natural language that contains the technical descriptions
of LLMs’ responses to similar prompts. This concept, of the results. Despite the improvements in end-user
satdiscussed in [6], [7], and [8], examines LLMs’ ability isfaction, this work does not include any enrichment or
to provide consistent outputs for semantically similar additional information that could contextualize not only
prompts and their sensitivity to hallucination-inducing the explanations themselves, but also the meaning and
validation of the application domain. In [21] the authors
propose to use LLMs to facilitate decision-making
processs by the end users providing concise summaries of
varios XAI methods tailored for diferent audiences. This
can be viewed as LLM enhanced XAI explainer trying
to bridge the gap between complex AI technologies and
their practical applications.</p>
        <p>Veracity and truth extraction. The exploration of truth within the realm of big data and its verification through LLMs embodies a complex interaction between technological advancements and the multifaceted nature of truth. The assembly method, as proposed by [22], marks a significant step in addressing the challenge of data veracity by combining individual truth discovery methods to mitigate the effects of limited labeled ground truth availability. This approach lays the groundwork for further research on the role of technology in differentiating between truth and falsehood. Furthermore, research on linguistic indicators of truth and deception, such as that of [23], reveals the potential of linguistic complexities and immediacy to act as markers to distinguish between truthful and deceptive narratives, enriching the conversation about truth verification in digital communications.</p>
        <p>Recent advances in artificial intelligence, notably the
conceptualization of models such as InstructGPT as
"Truth Machines" by [24], highlight ongoing eforts to
define and operationalize truth through sophisticated
data analysis and model architectures. Currently,
innovative methodologies such as the DoLa decoding strategy
by [25] and the development of truthfulness personas
by [26] aim to enhance the factuality and reliability of
LLM outputs. These strategies not only address the
challenge of hallucinations in model responses but also open
up new pathways for embedding truthfulness within AI
systems, underscoring the dynamic nature of research
focused on achieving reliable knowledge verification and
decision-making processes in the digital era.</p>
      </sec>
      <sec id="sec-1-2">
        <title>A range of LLMs have been developed in the last years.</title>
        <p>GPT-4, developed by OpenAI, is a state-of-the-art LLM
known for its deep learning architecture. As part of the
Generative Pre-trained Transformer series, it includes
a large network of multi-layer transformers, capable of
processing sequential data and preserving textual
dependencies in the long term. This version marks a
significant advancement over its predecessors by scaling up
the number of parameters and broadening the diversity
of its training data, thus enhancing its ability to
generate coherent and contextually relevant text based on the
input it receives [27].</p>
          <p>Moreover, Google DeepMind’s Gemini project is a key competitor to GPT-4. Gemini is a family of models built on top of transformer decoders that employ attention mechanisms, analogous to GPT-4. Gemini Pro, the second model in the family in terms of size, has been optimized for both cost and latency, offering considerable performance improvements across numerous tasks; it is designed to understand, reason, and generate outputs across various types of data, including text [28].</p>
        <p>Similarly, Llama 2 constitutes a collection of pretrained
and fine-tuned LLMs that is distinctive from the models
mentioned due to its open-source nature [29]. This group
of models developed by Meta includes two models (Llama
2 and Llama 2-Chat) with different versions that adjust
the number of parameters: 7B, 13B and 70B.</p>
        <p>Mistral represents another significant collection of
LLMs, characterized by their advanced reasoning
capabilities and robust performance. Their largest model,
Mistral Large, demonstrates state-of-the-art results across a
variety of benchmarks, including areas such as common
sense, reasoning, and knowledge-based tasks [30]. The
Mistral family also includes open-source models that
surpass certain versions of Llama 2 in several benchmarks,
as documented by [31].</p>
        <sec id="sec-1-2-1">
          <title>3.2. Datasets</title>
          <p>3. Approach and Problem Setup Covid19 explanations The questions included in
Table 1 are created from a clinical study [32]. In that study
Our proposal involves using LLMs as knowledge bases one thousand and three hundred thirty-one COVID-19
pato evaluate the outcomes of machine learning models tients (medium age 66.9 years old; males n= 841, medium
by answering Boolean questions derived from the mod- length of hospital stayed 8 days, non-survivors n=233)
els’ inferences. This approach aims to harness the com- were analyzed. Based on the hypotheses raised in the
prehensive knowledge and understanding capabilities study, the questions are constructed. Questions Q2, Q3,
of LLMs to verify the accuracy and reliability of infer- Q4, Q5, Q6, Q7, and Q8 were identified as significant
ences made by machine learning models, thereby provid- using a regression Cox model and Q1, Q9, Q10 were
ing a novel method for validating AI-generated insights identified as significant by univariate analysis. Q1 was
through direct, yes-or-no questioning. also identified as 1 of the most important variables using
SHAP explanations over LSTM learned model using the
same Covid19 dataset. By domain knowledge and based
on model explanations we can set Q1, Q2, Q3, Q4, Q5, Q6,
and Q8 as positive truth answers. We did not include Q7
as a positive response (but controversy), despite being
obtained by the Cox model explanations, because there
was controversy about the use of hydroxychloroquine
during the pandemic and although it was initially
considered as a drug to reduce the risk of mortality, it was later
contradicted by other studies and was not recommended
by the World Health Organization. Therefore, the
variables that were obtained only by the univariate analysis
(Q9 and Q10) are proposed as controversy answers.</p>
          <p>It is important to highlight that all the questions adhere to a consistent structure to optimize the performance of the LLM. Specifically, each question is framed as “Does #hypothesis# mean an increased risk of death from COVID-19?”. This uniformity ensures that the LLM’s responses are directly comparable and minimizes the variability that could arise from differing question formats. It also allows us to test the hypotheses obtained by the explainability models.</p>
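          <p>As an illustration, the following minimal Python sketch (ours, not part of the original study; the hypothesis strings are abbreviated examples) shows how an affirmative explanation can be rewritten into the uniform question template:</p>
          <preformat># Minimal sketch: turn affirmative hypotheses (e.g. SHAP-derived
# features) into the fixed fact-checking question template.
TEMPLATE = "Does {hypothesis} mean an increased risk of death from COVID-19?"

# Illustrative hypotheses only, not the study's full Q1-Q10 set.
hypotheses = ["hypertension", "a high leukocyte count"]

def to_question(hypothesis: str) -> str:
    """Instantiate the uniform question template for one hypothesis."""
    return TEMPLATE.format(hypothesis=hypothesis)

for h in hypotheses:
    print(to_question(h))
</preformat>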
          <p>Veracity dataset. The Stanford Question Answering Dataset (SQuAD) [33] has been extensively used in the
scientific literature for the development of Question
Answering (QA) language models, serving as a benchmark
to assess the abilities of these models in understanding
and processing natural language queries. As a rich
compilation of questions and answers based on Wikipedia
articles, SQuAD challenges models to provide accurate
answers by comprehending the context provided in the
passages.</p>
        <p>In our work, we retrieved a subset of questions from
the SQuAD dataset to specifically validate the
knowledge conveyed by LLMs. This targeted evaluation was
designed to determine the precision of the LLM answers
compared to the gold standard answers of the data set.
This method of validation not only tests the LLMs’
understanding of complex texts, but also assesses their
reliability in providing information that matches human-curated
answers.</p>
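          <p>A subset like the one used here can be retrieved as in the following sketch, which assumes the Hugging Face datasets package; the paper does not state how its subset was sampled, so the split size is illustrative:</p>
          <preformat>from datasets import load_dataset

# Illustrative subset of SQuAD; the actual sampling used in the paper
# is not specified, so the split size here is an assumption.
squad = load_dataset("squad", split="validation[:100]")

for example in squad.select(range(3)):
    question = example["question"]
    gold_answers = example["answers"]["text"]  # human-curated gold answers
    print(question, "->", gold_answers)
</preformat>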
        <sec id="sec-1-3-1">
          <title>3.3. Use Cases</title>
        <p>Three use cases (UC) have been designed to address the previous research questions, focusing on the practical applications and implications of using LLMs to validate machine learning inferences. The first investigates the influence of varying the number of options in fact-checking questions on LLM responses, aiming to understand how choice diversity impacts LLM accuracy. The second focuses on assessing the consistency of the boolean (yes or no) answers provided by LLMs, evaluating their reliability in delivering steady responses. Lastly, we explore the effects of combining machine learning inferences with LLMs to both enrich and validate the explanations of these models. This last use case uses the Covid19 dataset to create a ML model and the SHAP technique to obtain a set of important features that are later enriched with LLMs.</p>
        <p>The models used in this study include “gpt-4” from OpenAI, “mistral-large-2402” from Mistral AI, “gemini-1.0-pro-001” from Google, and “llama-2-70b-chat” from Meta AI. In addition, the temperature parameter was set to the lowest possible value to ensure the most deterministic behavior in the LLMs. Temperature controls the randomness of the generated output, with a lower value leading to more deterministic outputs by favoring the most likely predictions. Therefore, in most models, the temperature value was set to 0 to minimize randomness. However, it is important to note that for the Llama 2 model, the minimum supported temperature value is 0.01. Despite this slight deviation from 0, the aim remains the same: to achieve the lowest possible level of randomness in the output.</p>
        <p>UC1: Fact Density Impact Analysis. This use case examines the performance of LLMs in delivering binary responses (“yes” or “no”) versus incorporating a third option (“controversy”) to introduce an element of uncertainty. This evaluation aims to measure the models’ performance in terms of veracity, exploring how the structure of the response options affects the LLMs’ ability to provide accurate and reliable answers in fact-checking scenarios.</p>
        <p>Table 2 presents the prompts used in three scenarios to evaluate veracity, allowing the model to use binary responses or multiple options, and requesting the model to act as an expert in the clinical domain, providing precise and concise responses. The use of the max_tokens parameter inadvertently caused responses to be abruptly cut, leading to nonsensical outcomes. Consequently, we directed the model within the context to be precise and concise, with the aim of minimizing this issue and enhancing the clarity and relevance of its answers. This additional evaluation context was designed to gauge the model’s capacity to offer accurate and reliable answers when positioned as a domain-specific authority, further enriching our understanding of its performance in delivering veracious responses within specialized scenarios. This distinction allows for a detailed examination of how the inclusion of a “controversy” option alongside the traditional “yes” or “no” answers influences the model’s response behavior in our Use Case 1 analysis.</p>
        <p>UC2: Consistency and Veracity Evaluation. Use Case 2 distinguishes between two methods of evaluating LLM consistency based on the availability of ground truth. In the first approach, where the true answer is not available, consistency is assessed by comparing the LLM’s responses against each other. This method focuses on the internal consistency of the model’s answers. In the second approach, where a known true answer exists, the LLM’s responses are evaluated against this ground truth to measure the model’s accuracy and reliability in providing consistent and correct answers, a quality referred to as veracity.</p>
        <p>On the one hand, the first approach, or consistency evaluation, aims to assess the stability of responses from LLMs through repeated inquiries. By introducing Algorithm 1 to systematically evaluate consistency within the Covid19 dataset, we probe each question in the dataset multiple times using the question and Context 1 as the prompt. This method allows us to gauge the LLMs’ consistency using the metrics described in Section 3.4. Similarly, the same algorithm is used with Context 2.</p>
          <p>The following algorithm was deployed twice for each
LLM, once for each of the two contexts, and the
temperature parameter was minimized to enhance response
determinism. This methodology provides a nuanced
understanding of the models’ consistency by ensuring
controlled conditions and leveraging the lowest possible
temperature setting to maximize the determinism of the
models’ responses.</p>
        <preformat>Algorithm 1 Evaluate the consistency of a single LLM
 1: for each question q in dataset1 do
 2:   Initialize Responses to an empty list
 3:   for i ← 1 to 10 do
 4:     r ← AskLLM(q, context1)
 5:     Append r to Responses
 6:   end for
 7:   SemanticSimilarity ← CalculateSemanticSimilarity(Responses)
 8:   Overlap ← CalculateOverlap(Responses)
 9:   ROUGE ← CalculateROUGE(Responses)
10:   BLEU ← CalculateBLEU(Responses)
11:   Store metrics for further analysis
12: end for</preformat>
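        <p>A direct Python transcription of Algorithm 1 might look as follows; this is our sketch, reusing the ask_llm helper from above, with the metric functions standing in for the implementations described in Section 3.4:</p>
        <preformat>def evaluate_consistency(dataset1, context1, metrics):
    """Collect 10 responses per question and score their mutual agreement."""
    results = {}
    for question in dataset1:
        responses = [ask_llm(question, context1) for _ in range(10)]
        # Each metric compares the 10 responses against each other;
        # no ground truth is used in the consistency setting.
        results[question] = {name: fn(responses)
                             for name, fn in metrics.items()}
    return results
</preformat>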
        <p>On the other hand, the veracity evaluation involves the use of ground truth. Therefore, akin to the previous method, we employ a different algorithm (see Algorithm 2) designed to assess the veracity of each response from each model. The key difference in this approach is that when invoking the LLM, both the question along with its context (Context 3) and the ground truth for each question (“answer”) are provided. This enables a direct comparison between the LLM’s responses and the known accurate answers.</p>
        <p>UC3: XAI Enhancement and Validation. Use Case 3 involves leveraging machine learning inferences and LLMs to enrich and validate explanations. We propose to utilize the important features identified by XAI techniques, such as SHAP, to augment information and validate explanations. This involves transforming explanations into binary questions that LLMs can answer, with prompts that contain both the question and relevant contexts. By constructing queries that directly link significant features with real-world results (e.g. “Does #hypothesis# mean an increased risk of death from COVID-19?”), we bridge the gap between XAI insights and practical applications. Additionally, by instructing LLMs to respond with “yes” or “no” and provide validating explanations, we achieve the dual objectives of validating and enriching responses, prompting the LLMs to elaborate on pertinent features.</p>
        <preformat>Algorithm 2 Evaluate the veracity of a single LLM
 1: for each question q in dataset2 do
 2:   Initialize Responses to an empty list
 3:   for i ← 1 to 10 do
 4:     r ← AskLLM(q, context3)
 5:     Append r to Responses
 6:   end for
 7:   SemanticSimilarity ← CalculateSemanticSimilarity(Responses, answer)
 8:   Overlap ← CalculateOverlap(Responses, answer)
 9:   ROUGE ← CalculateROUGE(Responses, answer)
10:   BLEU ← CalculateBLEU(Responses, answer)
11:   Store metrics for further analysis
12: end for</preformat>
      </sec>
      <sec id="sec-1-5">
        <title>ROUGE stands for Recall-Oriented Understudy for</title>
        <p>Gisting Evaluation. ROUGE includes a collection of
metrics designed for the formal evaluation of text generation
models such us summarization or machine translation.
In the evaluation of responses generated by a LLM, the
use of the ROUGE metric can be justified by its ability
to quantitatively measure the lexical overlap across
different responses generated by the LLM itself. This is
accomplished utilizing the ROUGE-L variant, which
employs the Longest Common Subsequence (LCS) between
two sentences as a basis for computing recall, precision,
and the 1 score derived from both [36].
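        <p>A sketch of this computation with the sentence-transformers package is shown below; the pairwise averaging is our reading of the consistency setting:</p>
        <preformat>from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_similarity(responses):
    """Average cosine similarity over all pairs of LLM responses."""
    embeddings = model.encode(responses, convert_to_tensor=True)
    scores = [float(util.cos_sim(embeddings[i], embeddings[j]))
              for i, j in combinations(range(len(responses)), 2)]
    return sum(scores) / len(scores)
</preformat>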
        <p>Overlap as a metric refers to the method of quantifying similarity based on the common tokens (words or other meaningful elements) that appear in two sentences. This metric is used to assess how much shared content exists between both sentences, indicating their consistency or similarity in terms of the information they convey.</p>
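        <p>The paper does not give the exact overlap formula; a standard Jaccard-style token overlap, as sketched below, is one common instantiation:</p>
        <preformat>def token_overlap(sentence_a: str, sentence_b: str) -> float:
    """Share of distinct tokens common to both sentences (Jaccard index)."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a.intersection(tokens_b)) / len(tokens_a.union(tokens_b))
</preformat>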
        <p>ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE includes a collection of metrics designed for the formal evaluation of text generation models for tasks such as summarization or machine translation. In the evaluation of responses generated by an LLM, the use of the ROUGE metric can be justified by its ability to quantitatively measure the lexical overlap across different responses generated by the LLM itself. This is accomplished using the ROUGE-L variant, which employs the Longest Common Subsequence (LCS) between two sentences as a basis for computing recall, precision, and the F1 score derived from both [36].</p>
        <p>BLEU stands for Bilingual Evaluation Understudy. BLEU is a metric initially conceived for evaluating the quality of text translated by machine translation systems by comparing it with one or more reference translations [37]. Unlike ROUGE, which is recall-oriented, BLEU emphasizes precision. It assesses how many words or phrases in the machine-generated text appear in the reference texts. This metric calculates n-gram (contiguous sequences of n items from a given sample of text) precision for different lengths and combines them through a weighted geometric mean, incorporating a brevity penalty to discourage overly short translations [37]. This precision-oriented approach is particularly valuable when the objective is to ensure that certain key information is consistently represented in the LLM’s outputs. We computed the BLEU metric by treating each LLM response as a “translation” and comparing it to the other responses. BLEU can highlight the extent to which the LLM is capable of producing responses that contain expected and relevant content. This method offers a complementary perspective to the recall-focused ROUGE metric, providing a balanced assessment of the LLM’s performance.</p>
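        <p>Both ROUGE-L and BLEU can be computed as in the following sketch, assuming the rouge-score and nltk packages; the exact configuration used in the paper (e.g. smoothing) is not specified:</p>
        <preformat>from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 based on the Longest Common Subsequence."""
    return scorer.score(reference, candidate)["rougeL"].fmeasure

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU, treating one response as a 'translation'."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)
</preformat>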
      </sec>
      <sec id="sec-1-6">
        <title>BLEU stands for Bilingual Evaluation Understudy.</title>
        <p>BLEU is a metric initially conceived for evaluating the
quality of text translated by machine translation
systems by comparing it with one or more reference
translations [37]. Unlike ROUGE, which is recall-oriented, BLEU
emphasizes precision. It assesses how many words or
phrases in the machine-generated text appear in the
ref3.4. Metrics erence texts. This metric calculates n-gram (contiguous
sequences of n items from a given sample of text)
preA suite of metrics has been implemented to evaluate the cision for diferent lengths and combines them through
consistency and veracity of the LLMs. This suite includes a weighted geometric mean, incorporating a brevity
semantic similarity, token overlap, and the ROUGE and penalty to discourage overly short translations [37].
BLEU metrics. This precision-oriented approach is particularly
valuable when the objective is to ensure that certain key
inforSemantic similarity is a measure of the degree to mation is consistently represented in the LLM’s outputs.
which two concepts (such as words, phrases, or sen- We computed the BLEU metric by treating each LLM
tences) are related in terms of their meanings within response as a “translation” and comparing it to other
rea given semantic space. In formal terms, semantic simi- sponses. BLEU can highlight the extent to which the LLM
larity can be quantified based on the distance or closeness is capable of producing responses that contain expected
of the concepts in a multi-dimensional space, where each and relevant content. This method ofers a
complemendimension represents a feature of the concept’s mean- tary perspective to the recall-focused metric ROUGE,
ing. The closer two concepts are in this space, the more providing a balanced assessment of the LLM’s
perforsemantically similar they are. mance.</p>
        <p>Diverse methods for calculating semantic similarity are
analysed in [34], encompassing a range of approaches.</p>
        <p>However, this research will specifically utilize cosine 4. Evaluation and Results
similarity in conjunction with sentence embeddings. We
will use Sentence-BERT, a variation of BERT(Devlin et In this section we evaluate the diferent use cases.
al., 2018) optimized for sentence-level embeddings, due
to their proven eficiency [ 35]. In particular, this research 4.1. UC1 Results
utilizes the “all-MiniLM-L6-v2” model for its remarkable
balance between high performance and speed. Despite
being one of the smallest models in terms of size, it stands
out for its rapid processing capabilities.</p>
      </sec>
      <sec id="sec-1-7">
        <title>The first use case focuses on evaluating how the structure</title>
        <p>of response options presented to the LLM influences the
performance of the models’ accuracy and reliability. This
evaluation was addressed by using diferent contexts:
Context 1 which employs a binary response (such as “yes”</p>
      </sec>
      <sec id="sec-1-8">
        <title>Overlap as a metric refers to the method of quantifying similarity based on the common tokens (words or other</title>
        <p>or “no”), and Context 2, which introduces a third element • Llama2 demonstrates accuracy in variations of
associated to uncertainty characterized as “controversy”. Q7, Q9, and Q10. However, it produces
unjusti</p>
        <p>Table 3 shows the results of the models’ responses for ifed variations in Q2, Q3, Q5, Q6, and Q8.
Fureach question. It is important to clarify that although mul- thermore, it provides incorrect answers for Q2
tiple responses are generated for each question (specifi- and Q5, where “yes” was expected, but “no” was
cally 10), the table presents only a single value in each output.
cell. This reduction is justified because the answers (“yes”,
“no” or “controversy”) do not vary across iterations. What Our findings suggest that introducing the option of
varies is the model’s explanations of the responses, not “controversy” as a potential response significantly
influthe answer itself. ences the behavior of the analyzed LLMs, leading to a</p>
        <p>However, some diferences can be noted both in the noticeable shift in their response patterns. Across various
responses generated by a LLM with diferent contexts for models, including GPT and Mistral, where the response
the same question and in the performance across various changed in 4 out of 10 instances, Gemini with a change
language models (e.g. Q1 in Context 2 is answered as “yes” in 7 out of 10 instances, and Llama2 showing a change in
by GPT but “controversy” by Gemini). These variations 8 out of 10 instances, there is a marked preference for
sereveal that while some diferences can be attributed to lecting “controversy” over a definitive “yes” or “no”. This
the introduction of ’controversy’ in response options (e.g. tendency persists irrespective of the model in question
GPT Q7), others may not have such a clear justification and appears to reflect a broader pattern: when presented
(e.g. Gemini Q2). with the “controversy” option, models consistently avoid</p>
        <p>Optimally, each LLM should make three justified varia- negative responses, opting instead to categorize
statetions (Q7, Q9, and Q10) when introducing the uncertainty ments as controversial. This behavior suggests a higher
option with the second context, due to the limitation level of confidence in asserting conclusions rather than
of Context 1 to binary “yes” or “no” answers. For GPT, denying them. While for GPT and Mistral, 75% of these
an analysis of the responses between contexts reveals a shifts towards “controversy” can be considered justified,
mixed outcome: 3 of the variations presented are deemed enhancing the quality of the output, the justification for
correct (Q7, Q9, and Q10), indicating that the model ac- this change drops to 43% for Gemini and 37% for Llama2,
curately handled both contexts. Conversely, the model’s indicating variability in how these adjustments align with
responses to the question Q6 is classified as wrong varia- the underlying data uncertainty.
tions, suggesting inaccuracies in dealing with diferent
contexts. 4.2. UC2 Results</p>
        <p>Similarly, the performance of the other models is as
follows:
In this section, we present the results from the second
use case, which are detailed in Tables 4 and 5. These
• Mistral accurately handles 3 variations (Q7, Q9, tables show the average performance metrics for
consisand Q10) but had an error in Q6. tency and veracity -namely, semantic similarity, overlap,
• Gemini stands out by correctly handling 3 varia- ROUGE and BLEU scores - for each model across various
tions (Q7, Q9, and Q10) but falls short by produc- datasets. These metrics were computed for each
quesing 4 unjustified incorrect variations (Q1, Q2, Q3, tion within the datasets, with averages provided to give
Q6), including a notable discrepancy in Question a view of each model’s performance under two diferent
6 where the expected answer was “yes”, but the contexts (i.e Context 1 and Context 2 ) for consistency
evaloutput was “no”. uation (Table 4; the consistency results per questions are
provided at Appendix TableA1) and a third context (i.e. der analysis (e.g. Context 1 ’You are an expert on
COVIDContext 3) for veracity evaluation (Table 5; the veracity 19 and your duty is to answer questions related to the
results per questions are provided at Appendix TableA3). topic only with yes or no followed by the explanation</p>
        <p>Our analysis reveals no significant diference in per- that validates the answer in a maximum of 2 sentences.’).
formance between the first two contexts evaluated for Table 7 shows example of responses for Q1, Q2, Q3 from
consistency, where all LLMs demonstrated high levels of GPT-4. Q1 enriches the fact-checking response adding
consistency. Mistral achieved perfect consistency scores, information related with the consequences of having
hywhile Gemini and Llama2 were nearly perfect. However, pertension and how they are related to higher death risk.
GPT showed the lowest consistency (for all metrics in- Q2 enriches the response adding reasons why the
imporcluding semantic similarity), even with the temperature tant feature (i.e. platelet) plays a crucial role that may
parameter set to the lowest level, indicating potential lead to high risk of death. Last, Q3 response enriches the
variability in its response generation process. response indicating that a high leukocyte can be a
symp</p>
        <p>When comparing the models’ performance to the tom of severe Covid19. At table 8 we studied syntactically
ground truth data for veracity (see Section 3.2), GPT the number of words that contain the explanation and
stands out by achieving the best results across all met- also the average number of words per sentence. Llama2
rics, indicating that its responses, on average, align more and Mistral have larger explanations and also
syntacticlosely with the ground truth than those of the other cally are slighly more comples (Llama2 has ≈ 28 words
models. Llama2 follows closely behind as the second- per sentence for context2). Gemini provides the shortest
best performer, with Gemini and Mistral trailings and explanations and also the lowest syntactic complexity
their positions varying depending on the metric applied. (36.72 number of words average and 19.91 words per
senThese findings suggest that while GPT may struggle with tence). Similarly to previous use cases we analyzed the
consistency relative to its peers, it excels in generating diferences between Context 1 and Context 2 explanations
responses that are more closely aligned with verifiable (including the controversy as an option in the second) to
facts, highlighting a nuanced trade-of between consis- measure how diferent are the explanations. According to
tency and veracity across diferent LLMs. all metrics the results show that the LLM that change the
most is Llama2 (i.e. ROUGE 0.411), followed by Gemini
4.3. UC3 Results (i.e. ROUGE 0.442), GPT-4 (i.e. ROUGE 0.541) and Mistral
(i.e. ROUGE 0.570) (see Table 6 for other metrics).</p>
      </sec>
      <sec id="sec-1-9">
        <title>It examines the use of prompts that transform explanations into binary questions that contain both the question and relevant contexts related with the fact-checking un</title>
        <p>Table 8 might even imply contradictory responses. As for the
Average number of words of explanation per text and per truthfulness analysis, we observed that GPT obtained
sentence the best results on average and can be considered quite
accurate.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusions</title>
      <p>In this paper we studied the effect of the variation in the number of options within fact-checking questions, the consistency and truthfulness of the answers, and the capabilities to enrich fact-checking with explanations. We also proposed to link explanations from machine learning models to LLMs by using those explanations to create a fact-checking type input question. We measured coherence and veracity using state-of-the-art metrics such as semantic similarity, overlap, ROUGE and BLEU, and the results show that Mistral is the most coherent LLM. Notably, Gemini and Llama2 obtained similar results and GPT was slightly behind. Furthermore, we conclude that fact-checking consistency does not depend on the number of options but the explanations’ consistency does. This is relevant because it means that a different number of options not only may change the fact response but will also justify it differently. Further research should be done to analyze in depth to what extent these differences might even imply contradictory responses. As for the truthfulness analysis, we observed that GPT obtained the best results on average and can be considered quite accurate.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work has been funded by the project “Inteligencia Artificial eXplicable” IAX grant of the Young Researchers 2022/2024 initiative of the Community of Madrid.</title>
    </sec>
    <sec id="sec-4">
      <title>A. Appendix</title>
      <p>Detailed results of the consistency and veracity metrics (Overlap, ROUGE and BLEU) for the three contexts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Assessing the reliability of large language model knowledge</article-title>
          ,
          <source>arXiv:2310.09820</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2310.09820.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caruccio</surname>
          </string-name>
          , et al.,
          <article-title>Can chatgpt provide intelligent diagnoses? a comparative study between predictive models and chatgpt to define a new medical diagnostic bot</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>235</volume>
          (
          <year>2024</year>
          )
          <fpage>121186</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0957417423016883. doi:10.1016/j.eswa.2023.121186.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Towards fine-grained reasoning for fake news detection</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>36</volume>
          (
          <year>2022</year>
          )
          <fpage>5746</fpage>
          -
          <lpage>5754</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/20517. doi:10.1609/aaai.v36i5.20517.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wadden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          , I. Beltagy, H. Hajishirzi, MultiVerS: Improving scientific claim verification with weak supervision and full-document context, in: Findings of the Association for Computational Linguistics: NAACL 2022, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 technical report, ArXiv abs/2303.08774 (2023). URL: https://api.semanticscholar.org/CorpusID:257532815.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, et al., Gemini: A family of highly capable multimodal models, ArXiv abs/2312.11805 (2023). URL: https://api.semanticscholar.org/CorpusID:266361876.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Touvron, L. Martin, K. R. Stone, P. Albert, et al., Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023). URL: https://api.semanticscholar.org/CorpusID:259950998.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] Mistral AI, Mistral Large, our new flagship model, https://mistral.ai/news/mistral-large/, 2024.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, et al., Mistral 7B, ArXiv abs/2310.06825 (2023). URL: https://api.semanticscholar.org/CorpusID:263830494.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] P. Cardinal-Fernandez, E. Garcia-Cuesta, J. Barberan, J. F. Varona, A. Estirado, A. Moreno, J. Villanueva, M. Villareal, O. Baez-Pravia, J. Menendez, et al., Clinical characteristics and outcomes of 1,331 patients with covid-19: HM Spanish cohort, Revista Española de Quimioterapia 34 (2021) 342.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383-2392. doi:10.18653/v1/D16-1264.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—a survey, ACM Computing Surveys 54 (2021) 1-37. URL: http://dx.doi.org/10.1145/3440755. doi:10.1145/3440755.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>