1. Introduction

A. Bhargava, C. Witkowski, M. Shah, M. W. Thom- tion for Computational Linguistics

10.1177/0013164493053004013

Leveraging Large Language Models (LLMs) as Domain Experts in a Validation Process

Carlos Badenes-Olmedo

Esteban García-Cuesta

Alejandro Sánchez-González

Oscar Corcho

0 0 Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid 1 Ontology Engineering Group, Departamento de Sistemas Informáticos, Universidad Politécnica de Madrid

2015

9 2021 61 76

The explosion of information requires robust methods to validate knowledge claims. On the other hand, there is also an increase interest on understanding and creating methods that helps on the interpretation of machine learning models. Both approaches converge on the necessity of a validation step that clarifies or helps end-users to better understand if the decision or information provided by the model is what is needed or if there is some mismatch between what the artificial intelligent system is suggesting and reality. Large Language Models (LLMs), with their ability to process and synthesize vast amounts of text data, have emerged as potential tools for this purpose. This study explores the utility of LLMs in hypothesis validation in two diferent scenarios. The first relies on hypothesis generated from explanations obtained by XAI methods or by inherently explainable models. We propose a method to transform the inferences provided by a machine learning model into explanations in natural language, hence linking the symbolic and sub-symbolic areas. The second relies on hypothesis generated with techniques that automatically extract answers from text. The results show that LLMs can complement other XAI techniques and although all LLMs analyzed are able to provide truthfulness-related answers, not all are equally successful.

eol>LLMs knowledge validation explainable artificial intelligence

1. Introduction

(and thereof validating the models) derived from classi- inputs. The examination of LLMs under diferent condical machine learning decision models. These explana- tions, such as varying the context and structure of the tions, presented in the form of afirmative statements prompts, sheds light on their performance variability and such as "the hypertension increase the risk of death from the strategies for optimizing accuracy. COVID19", are transformed into questions (for example, Building on this foundation, the interplay between "Does hypertension mean an increased risk of death from context, choice structure, and decision-making, as exCOVID-19?") to be presented to the LLMs. This approach plored in [9], [10], [11], and [12], directly relates to the allows us to directly evaluate the LLM models’ as a knowl- challenges LLMs face. This parallel between human and edge base to validate specific claims within the domain, computational decision-making processes emphasizes ofering a unique perspective on their applicability as the importance of carefully designed prompts and the validation tools in scientific and clinical contexts. strategic manipulation of choice options to improve LLM

This work aims to address the following research ques- reliability and decision accuracy. Through innovative tions: RQ1) What efect does the variation in the number decision-making strategies and prompt engineering techof options within a fact-checking question have on the niques proposed in [13], [14], [15], [16], and [17], the responses provided by Large Language Models (LLMs)? nuanced approach to prompt framing is critical for enRQ2) How consistent are the boolean answers (i.e. yes, hancing LLM interactions and understanding. This body or no) provided by Large Language Models (LLMs)? RQ3) of work collectively illustrates a key insight: adjusting What is the impact of integrating machine learning infer- the number of options and the framing of prompts can ences with Large Language Models (LLMs) on enriching profoundly influence the efectiveness of LLMs in verifyand validating the explanations? ing statements and making decisions, bridging the gap

Through this analysis, we seek not only to understand between consistency in output and the complexity of the level of knowledge and accuracy of LLMs in spe- input conditions. cialized domains but also to investigate their potential to complement or, in some cases, replace the need for human peer review in the validation stage of scientific conclusions.

Explainable AI and LLMs Interpretability and ex

plainability in Machine Learning (ML) refers to the ability to make understandable an ML model’s workings. This is particularly vital in high-risk applications and desirOur contributions are: able in most cases. The burgeoning field of research that 1. a novel assessment method that integrates ma- addresses to foster this ability is known as eXplainable Archine learning inferences with Large Language tificial Intelligence (XAI). A variety of XAI methods have Models (LLMs) to generate fact-checking (FC) been developed in recent years. They may be related to type questions. intrinsically interpretable models or to "black box" mod2. a study on the variability and consistency of re- els, but all pursue coherent and meaningful explanations sponses provided by LLMs in multiple-choice for the audience. As an example, SHAP (SHapley Addiquestions and scenarios with established ground tive exPlanations) is one of the most widely used XAI truths.. model agnostic techniques. It is based on concepts from game theory that allow the computing, which are the 3. an investigation into the variability of explana- features that contribute the most to the outcomes of the tions provided by LLMs in scenarios involving black-box model, by trying diferent feature set permufact-checking (including questions with multiple tations [18]. LIME (Local Interpretable Model-agnostic factual options) and fact recovery, ofering a com- Explanations) is another well known example that builds prehensive understanding of LLMs’ explanatory a simple linear surrogate model to explain each of the capabilities and their potential for enhancing AI predictions of the learned black-box model [19]. There interpretability. are also some interpretable ML models such as logistic regression, Generalised Linear Models (GLMs), or Gener2. Related works alised Additive Models (GAMs).There are some attempts to facilitate the comprehension of some XAI methods Prompt framing efect The study of the prompt fram- providing new tools to end-users. At [20] a new GPT ing efect reveals that the performance of Large Language x-[plAIn] is proposed to transform the output explanaModels (LLMs) is highly dependent on the construction of tions provided by those methods (e.g. SHAP or LIME) to the prompts, with a significant focus on the consistency natural language that contains the technical descriptions of LLMs’ responses to similar prompts. This concept, of the results. Despite the improvements in end-user satdiscussed in [6], [7], and [8], examines LLMs’ ability isfaction, this work does not include any enrichment or to provide consistent outputs for semantically similar additional information that could contextualize not only prompts and their sensitivity to hallucination-inducing the explanations themselves, but also the meaning and validation of the application domain. In [21] the authors propose to use LLMs to facilitate decision-making processs by the end users providing concise summaries of varios XAI methods tailored for diferent audiences. This can be viewed as LLM enhanced XAI explainer trying to bridge the gap between complex AI technologies and their practical applications.

Veracity and truth extraction The exploration of truth within the realm of big data and its verification through LLMs embodies a complex interaction between technological advancements and the multifaceted nature of truth. The assembly method, as proposed by [22], marks a significant step in addressing the challenge of data veracity by combining individual truth discovery methods to mitigate the efects of limited labeled ground truth availability. This approach lays the groundwork for further research on the role of technology in diferentiating between truth and falsehood. Furthermore, research on linguistic indicators of truth and deception, such as that of [23], reveals the potential of linguistic complexities and immediacy to act as markers to distinguish between truthful and deceptive narratives, enriching the conversation about truth verification in digital communications.

Recent advances in artificial intelligence, notably the conceptualization of models such as InstructGPT as "Truth Machines" by [24], highlight ongoing eforts to define and operationalize truth through sophisticated data analysis and model architectures. Currently, innovative methodologies such as the DoLa decoding strategy by [25] and the development of truthfulness personas by [26] aim to enhance the factuality and reliability of LLM outputs. These strategies not only address the challenge of hallucinations in model responses but also open up new pathways for embedding truthfulness within AI systems, underscoring the dynamic nature of research focused on achieving reliable knowledge verification and decision-making processes in the digital era.

A range of LLMs have been developed in the last years.

GPT-4, developed by OpenAI, is a state-of-the-art LLM known for its deep learning architecture. As part of the Generative Pre-trained Transformer series, it includes a large network of multi-layer transformers, capable of processing sequential data and preserving textual dependencies in the long term. This version marks a significant advancement over its predecessors by scaling up the number of parameters and broadening the diversity of its training data, thus enhancing its ability to generate coherent and contextually relevant text based on the input it receives [27].

Moreover, Google’s DeepMind project Gemini, is a key competitor to GPT-4. Gemini is a family of models built on top of transformer decoders that employ attention mechanisms, analogous to GPT-4. Gemini Pro, the second model in the family in terms of size, has been optimized for both cost and latency, ofering considerable performance improvements across numerous tasks; it is designed to understand, reason, and generate outputs across various types of data, including text [28].

Similarly, Llama 2 constitutes a collection of pretrained and fine-tuned LLMs that is distinctive from the models mentioned due to its open-source nature [29]. This group of models developed by Meta includes two models (Llama 2 and Llama 2-Chat) with diferent versions that adjust the number of parameters: 7B, 13B and 70B.

Mistral represents another significant collection of LLMs, characterized by their advanced reasoning capabilities and a robust performance. Their largest model, Mistral Large, demonstrates state-of-the-art results across a variety of benchmarks, including areas such as common sense, reasoning, and knowledge-based tasks [30]. The Mistral family also includes open-source models that surpass certain versions of Llama 2 in several benchmarks, as documented by [31].

3.2. Datasets

3. Approach and Problem Setup Covid19 explanations The questions included in Table 1 are created from a clinical study [32]. In that study Our proposal involves using LLMs as knowledge bases one thousand and three hundred thirty-one COVID-19 pato evaluate the outcomes of machine learning models tients (medium age 66.9 years old; males n= 841, medium by answering Boolean questions derived from the mod- length of hospital stayed 8 days, non-survivors n=233) els’ inferences. This approach aims to harness the com- were analyzed. Based on the hypotheses raised in the prehensive knowledge and understanding capabilities study, the questions are constructed. Questions Q2, Q3, of LLMs to verify the accuracy and reliability of infer- Q4, Q5, Q6, Q7, and Q8 were identified as significant ences made by machine learning models, thereby provid- using a regression Cox model and Q1, Q9, Q10 were ing a novel method for validating AI-generated insights identified as significant by univariate analysis. Q1 was through direct, yes-or-no questioning. also identified as 1 of the most important variables using SHAP explanations over LSTM learned model using the same Covid19 dataset. By domain knowledge and based on model explanations we can set Q1, Q2, Q3, Q4, Q5, Q6, and Q8 as positive truth answers. We did not include Q7 as a positive response (but controversy), despite being obtained by the Cox model explanations, because there was controversy about the use of hydroxychloroquine during the pandemic and although it was initially considered as a drug to reduce the risk of mortality, it was later contradicted by other studies and was not recommended by the World Health Organization. Therefore, the variables that were obtained only by the univariate analysis (Q9 and Q10) are proposed as controversy answers.

It is important to highlight that all the questions adhere to a consistent structure to optimize the performance of the LLM. Specifically, each question is framed as “ Does #hypothesis# mean an increased risk of death from COVID19?”. This uniformity ensures that the LLM’s responses are directly comparable and minimizes variability that could arise from difering question formats. It also allows to test hypothesis obtained by the explainability models.

Veracity dataset The Stanford Question Answering

Dataset (SQuAD) [33] has been extensively used in the scientific literature for the development of Question Answering (QA) language models, serving as a benchmark to assess the abilities of these models in understanding and processing natural language queries. As a rich compilation of questions and answers based on Wikipedia articles, SQuAD challenges models to provide accurate answers by comprehending the context provided in the passages.

In our work, we retrieved a subset of questions from the SQuAD dataset to specifically validate the knowledge conveyed by LLMs. This targeted evaluation was designed to determine the precision of the LLM answers compared to the gold standard answers of the data set. This method of validation not only tests the LLMs’ understanding of complex texts, but also assesses their reliability in providing information that matches human-curated answers.

3.3. Use Cases

Three use cases (UC) have been designed to address previous research questions, focusing on the practical applications and implications of using LLMs to validate machine learning inferences. The first area investigates the influence of varying the number of options in factcheck questions on LLM responses, aiming to understand how choice diversity impacts LLM accuracy. The second focuses on assessing the consistency of boolean (yes or no) answers provided by LLMs, evaluating their reliability in delivering steady responses. Lastly, we explore the efects of combine machine learning inferences with LLMs to both enrich and validate the explanations of these models. This last use case uses the Covid19 dataset to create a ML model and the SHAP technique to obtain a set of important features that later are enriched with LLMs.

The models used in this study include “gpt-4” from OpenAI, “mistral-large-2402” from Mistral AI, “gemini1.0-pro-001” from Google, and “llama-2-70b-chat” from Meta AI. In addition, the temperature parameter was set to the lowest possible value to ensure the most deterministic behavior in the LLMs. Temperature controls the randomness of the generated output, with a lower value leading to more deterministic outputs by favoring the most likely predictions. Therefore, in most models, the temperature value was set to 0 to minimize randomness. However, it is important to note that for the Llama 2 model, the minimum supported temperature value is 0.01. Despite this slight deviation from 0, the aim remains the same: to achieve the lowest possible level of randomness in the output.

UC1: Fact Density Impact Analysis It examines the performance of LLMs in delivering binary responses (“yes” or “no” ) versus incorporating a third option (“controversy” ) to introduce an element of uncertainty. This evaluation aims to measure the models’ performance in terms of veracity, exploring how the structure of the response options afects the LLMs’ ability to provide accurate and reliable answers in fact-checking scenarios.

Table 2 presents the prompts used in three scenarios to evaluate veracity, allowing the model to use binary responses or multiple options, and requesting the model to act as an expert in the clinical domain, providing precise and concise responses. The use of the parameter maxtokens inadvertently caused responses to be abruptly cut, leading to nonsensical outcomes. Consequently, we directed the model within the context to be precise and concise, with the aim of minimizing this issue and enhancing the clarity and relevance of its answers. This additional context of evaluation was designed to gauge the model’s capacity to ofer accurate and reliable answers when positioned as a domain-specific authority, further enriching our understanding of its performance in delivering veracious responses within specialized scenarios. This distinction allows for a detailed examination of how the inclusion of an “controversy” option alongside traditional “yes” or “no” answers influences the model’s response behavior in our Use Case 1 analysis. UC2: Consistency and Veracity Evaluation Use Case 2 distinguishes between two methods of evaluating LLM consistency based on the availability of ground truth. In the first approach, where the true answer is not available, consistency is assessed by comparing the LLM’s responses against each other. This method focuses on the internal consistency of the model’s answers. In the second approach, where a known true answer exists, the LLM’s responses are evaluated against this ground truth to measure the model’s accuracy and reliability in providing consistent and correct answers, a quality referred to as veracity.

On the one hand, the first approach or consistency evaluation aims to assess the stability of responses from LLMs through repeated inquiries. By introducing an algorithm 1 to systematically evaluate consistency within the Covid19 dataset, we probe each question in the dataset multiple times using the question and Context 1 as the prompt. This method allows us to gauge the LLMs’ consistency using the metrics described in Section 3.4. Similarly, the same algorithm is used with Context 2.

The following algorithm was deployed twice for each LLM, once for each of the two contexts, and the temperature parameter was minimized to enhance response determinism. This methodology provides a nuanced understanding of the models’ consistency by ensuring controlled conditions and leveraging the lowest possible temperature setting to maximize the determinism of the models’ responses.

Algorithm 1 Evaluate the consistency of a single LLM 1: for each question in dataset1 do 2: Initialize Responses to an empty list 3: for ← 1 to 10 do 4: response r ← AskLLM(, context1) 5: Append to Responses 6: end for 7: SemanticSimilarity ←

ity(Responses) 8: Overlap ← CalculateOverlap(Responses) 9: ROUGE ← CalculateROUGE(Responses) 10: BLEU ← CalculateBLEU(Responses) 11: Store metrics for further analysis 12: end for

CalculateSemanticSimilar

On the other hand, the veracity evaluation involves the use of ground truth. Therefore, akin to the previous method, we employ a diferent algorithm (see Algorithm 2) designed to assess the veracity of each response from each model. The key diference in this approach is that when invoking the LLM, both the response along with its context (Context 3) and the ground truth for each response (“ answer”) are provided. This enables a direct comparison between the LLM’s responses and the known accurate answers.

UC3: XAI Enhancement and Validation Use Case 3 involves leveraging machine learning inferences and LLMs to enrich and validate explanations. We propose to utilize important features identified by XAI techniques, such as SHAP, to augment information and validate explanations. This involves transforming explanations into binary questions that LLMs can answer, with prompts Algorithm 2 Evaluate the veracity of a single LLM 1: for each question in dataset2 do 2: Initialize Responses to an empty list 3: for ← 1 to 10 do 4: response ← AskLLM(, context3) 5: Append to Responses 6: end for 7: SemanticSimilarity ←

ity(Responses, answer) 8: Overlap ← CalculateOverlap(Responses, answer) 9: ROUGE ← CalculateROUGE(Responses, answer) 10: BLEU ← CalculateBLEU(Responses, answer) 11: Store metrics for further analysis 12: end for CalculateSemanticSimilarmeaningful elements) that appear in two sentences. This metric is used to assess how much shared content exists between both sentences, indicating their consistency or similarity in terms of the information they convey.

ROUGE stands for Recall-Oriented Understudy for

Gisting Evaluation. ROUGE includes a collection of metrics designed for the formal evaluation of text generation models such us summarization or machine translation. In the evaluation of responses generated by a LLM, the use of the ROUGE metric can be justified by its ability to quantitatively measure the lexical overlap across different responses generated by the LLM itself. This is accomplished utilizing the ROUGE-L variant, which employs the Longest Common Subsequence (LCS) between two sentences as a basis for computing recall, precision, and the 1 score derived from both [36]. that contain both the question and relevant contexts. By constructing queries to directly link significant features with real-world results (e.g. ’Does #hypothesis# mean an increased risk of death from COVID-19?), we bridge the gap between XAI insights and practical applications. Additionally, by instructing LLMs to respond with “yes” or “no” and provide validating explanations, we achieve dual objectives of validating and enriching responses, prompting LLMs to elaborate on pertinent features.

BLEU stands for Bilingual Evaluation Understudy.

BLEU is a metric initially conceived for evaluating the quality of text translated by machine translation systems by comparing it with one or more reference translations [37]. Unlike ROUGE, which is recall-oriented, BLEU emphasizes precision. It assesses how many words or phrases in the machine-generated text appear in the ref3.4. Metrics erence texts. This metric calculates n-gram (contiguous sequences of n items from a given sample of text) preA suite of metrics has been implemented to evaluate the cision for diferent lengths and combines them through consistency and veracity of the LLMs. This suite includes a weighted geometric mean, incorporating a brevity semantic similarity, token overlap, and the ROUGE and penalty to discourage overly short translations [37]. BLEU metrics. This precision-oriented approach is particularly valuable when the objective is to ensure that certain key inforSemantic similarity is a measure of the degree to mation is consistently represented in the LLM’s outputs. which two concepts (such as words, phrases, or sen- We computed the BLEU metric by treating each LLM tences) are related in terms of their meanings within response as a “translation” and comparing it to other rea given semantic space. In formal terms, semantic simi- sponses. BLEU can highlight the extent to which the LLM larity can be quantified based on the distance or closeness is capable of producing responses that contain expected of the concepts in a multi-dimensional space, where each and relevant content. This method ofers a complemendimension represents a feature of the concept’s mean- tary perspective to the recall-focused metric ROUGE, ing. The closer two concepts are in this space, the more providing a balanced assessment of the LLM’s perforsemantically similar they are. mance.

Diverse methods for calculating semantic similarity are analysed in [34], encompassing a range of approaches.

However, this research will specifically utilize cosine 4. Evaluation and Results similarity in conjunction with sentence embeddings. We will use Sentence-BERT, a variation of BERT(Devlin et In this section we evaluate the diferent use cases. al., 2018) optimized for sentence-level embeddings, due to their proven eficiency [ 35]. In particular, this research 4.1. UC1 Results utilizes the “all-MiniLM-L6-v2” model for its remarkable balance between high performance and speed. Despite being one of the smallest models in terms of size, it stands out for its rapid processing capabilities.

The first use case focuses on evaluating how the structure

of response options presented to the LLM influences the performance of the models’ accuracy and reliability. This evaluation was addressed by using diferent contexts: Context 1 which employs a binary response (such as “yes”

Overlap as a metric refers to the method of quantifying similarity based on the common tokens (words or other

or “no”), and Context 2, which introduces a third element • Llama2 demonstrates accuracy in variations of associated to uncertainty characterized as “controversy”. Q7, Q9, and Q10. However, it produces unjusti

Table 3 shows the results of the models’ responses for ifed variations in Q2, Q3, Q5, Q6, and Q8. Fureach question. It is important to clarify that although mul- thermore, it provides incorrect answers for Q2 tiple responses are generated for each question (specifi- and Q5, where “yes” was expected, but “no” was cally 10), the table presents only a single value in each output. cell. This reduction is justified because the answers (“yes”, “no” or “controversy”) do not vary across iterations. What Our findings suggest that introducing the option of varies is the model’s explanations of the responses, not “controversy” as a potential response significantly influthe answer itself. ences the behavior of the analyzed LLMs, leading to a

However, some diferences can be noted both in the noticeable shift in their response patterns. Across various responses generated by a LLM with diferent contexts for models, including GPT and Mistral, where the response the same question and in the performance across various changed in 4 out of 10 instances, Gemini with a change language models (e.g. Q1 in Context 2 is answered as “yes” in 7 out of 10 instances, and Llama2 showing a change in by GPT but “controversy” by Gemini). These variations 8 out of 10 instances, there is a marked preference for sereveal that while some diferences can be attributed to lecting “controversy” over a definitive “yes” or “no”. This the introduction of ’controversy’ in response options (e.g. tendency persists irrespective of the model in question GPT Q7), others may not have such a clear justification and appears to reflect a broader pattern: when presented (e.g. Gemini Q2). with the “controversy” option, models consistently avoid

Optimally, each LLM should make three justified varia- negative responses, opting instead to categorize statetions (Q7, Q9, and Q10) when introducing the uncertainty ments as controversial. This behavior suggests a higher option with the second context, due to the limitation level of confidence in asserting conclusions rather than of Context 1 to binary “yes” or “no” answers. For GPT, denying them. While for GPT and Mistral, 75% of these an analysis of the responses between contexts reveals a shifts towards “controversy” can be considered justified, mixed outcome: 3 of the variations presented are deemed enhancing the quality of the output, the justification for correct (Q7, Q9, and Q10), indicating that the model ac- this change drops to 43% for Gemini and 37% for Llama2, curately handled both contexts. Conversely, the model’s indicating variability in how these adjustments align with responses to the question Q6 is classified as wrong varia- the underlying data uncertainty. tions, suggesting inaccuracies in dealing with diferent contexts. 4.2. UC2 Results

Similarly, the performance of the other models is as follows: In this section, we present the results from the second use case, which are detailed in Tables 4 and 5. These • Mistral accurately handles 3 variations (Q7, Q9, tables show the average performance metrics for consisand Q10) but had an error in Q6. tency and veracity -namely, semantic similarity, overlap, • Gemini stands out by correctly handling 3 varia- ROUGE and BLEU scores - for each model across various tions (Q7, Q9, and Q10) but falls short by produc- datasets. These metrics were computed for each quesing 4 unjustified incorrect variations (Q1, Q2, Q3, tion within the datasets, with averages provided to give Q6), including a notable discrepancy in Question a view of each model’s performance under two diferent 6 where the expected answer was “yes”, but the contexts (i.e Context 1 and Context 2 ) for consistency evaloutput was “no”. uation (Table 4; the consistency results per questions are provided at Appendix TableA1) and a third context (i.e. der analysis (e.g. Context 1 ’You are an expert on COVIDContext 3) for veracity evaluation (Table 5; the veracity 19 and your duty is to answer questions related to the results per questions are provided at Appendix TableA3). topic only with yes or no followed by the explanation

Our analysis reveals no significant diference in per- that validates the answer in a maximum of 2 sentences.’). formance between the first two contexts evaluated for Table 7 shows example of responses for Q1, Q2, Q3 from consistency, where all LLMs demonstrated high levels of GPT-4. Q1 enriches the fact-checking response adding consistency. Mistral achieved perfect consistency scores, information related with the consequences of having hywhile Gemini and Llama2 were nearly perfect. However, pertension and how they are related to higher death risk. GPT showed the lowest consistency (for all metrics in- Q2 enriches the response adding reasons why the imporcluding semantic similarity), even with the temperature tant feature (i.e. platelet) plays a crucial role that may parameter set to the lowest level, indicating potential lead to high risk of death. Last, Q3 response enriches the variability in its response generation process. response indicating that a high leukocyte can be a symp

When comparing the models’ performance to the tom of severe Covid19. At table 8 we studied syntactically ground truth data for veracity (see Section 3.2), GPT the number of words that contain the explanation and stands out by achieving the best results across all met- also the average number of words per sentence. Llama2 rics, indicating that its responses, on average, align more and Mistral have larger explanations and also syntacticlosely with the ground truth than those of the other cally are slighly more comples (Llama2 has ≈ 28 words models. Llama2 follows closely behind as the second- per sentence for context2). Gemini provides the shortest best performer, with Gemini and Mistral trailings and explanations and also the lowest syntactic complexity their positions varying depending on the metric applied. (36.72 number of words average and 19.91 words per senThese findings suggest that while GPT may struggle with tence). Similarly to previous use cases we analyzed the consistency relative to its peers, it excels in generating diferences between Context 1 and Context 2 explanations responses that are more closely aligned with verifiable (including the controversy as an option in the second) to facts, highlighting a nuanced trade-of between consis- measure how diferent are the explanations. According to tency and veracity across diferent LLMs. all metrics the results show that the LLM that change the most is Llama2 (i.e. ROUGE 0.411), followed by Gemini 4.3. UC3 Results (i.e. ROUGE 0.442), GPT-4 (i.e. ROUGE 0.541) and Mistral (i.e. ROUGE 0.570) (see Table 6 for other metrics).

It examines the use of prompts that transform explanations into binary questions that contain both the question and relevant contexts related with the fact-checking un

Table 8 might even imply contradictory responses. As for the Average number of words of explanation per text and per truthfulness analysis, we observed that GPT obtained sentence the best results on average and can be considered quite accurate.

5. Conclusions

In this paper we studied the efect of variation in the number of options within fact-checking questions, the consistency and truthfulness of the answers, and the capabilities to enrich fact-checking with explanations. We also proposed to link explanations from machine learning models to LLMs by using those explanations to create a fact-checking type input question. We measured coherence and veracity using state-of-the-art metrics such as semantic similarity, overlap, ROUGE and BLEU, and the results show that Mistral is the most coherent LLM. Notably, Gemini and Llama2 obtained similar results and GPT was slightly behind. Furthermore, we conclude that fact-cheking consistency does not depend on the number of options but explanations’ consistency does. This is relevant because it means that a diferent number of options not only may change the fact response but will also be able to justify it diferently. Further research should be done to analyze in depth to what extend these diferences

Acknowledgments This work has been funded by the project “Inteligencia Artificial eXplicable” IAX grant of the Young Researchers 2022/2024 initiative of the Community of Madrid.

[27] OpenAI, J. Achiam, S. Adler, S. Agarwal, Gpt-4 technical report, ArXiv abs/2303.08774 (2023).

URL: https://api.semanticscholar.org/CorpusID: 257532815. [28] G. T. G. R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac,

Gemini: A family of highly capable multimodal models, ArXiv abs/2312.11805 (2023). URL: https: //api.semanticscholar.org/CorpusID:266361876. [29] H. Touvron, L. Martin, K. R. Stone, P. Albert,

Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023). URL: https: //api.semanticscholar.org/CorpusID:259950998. [30] Mistral, Mistral large, our new flagship model, URL

https://mistral.ai/news/mistral-large/, 2024. [31] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford,

Mistral 7b, ArXiv abs/2310.06825 (2023). URL: https: //api.semanticscholar.org/CorpusID:263830494. [32] P. Cardinal-Fernandez, E. Garcia-Cuesta, J. Barberan, J. F. Varona, A. Estirado, A. Moreno, J. Villanueva, M. Villareal, O. Baez-Pravia, J. Menendez, et al., Clinical characteristics and outcomes of 1,331 patients with covid-19: Hm spanish cohort, Revista

Española de Quimioterapia 34 (2021) 342. [33] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang,

Squad: 100,000+ questions for machine comprehension of text (2016) 2383–2392. doi:10.18653/ v1/D16-1264. [34] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—a survey, ACM Computing Surveys 54 (2021) 1–37. URL: http://dx.doi.org/10.1145/3440755.

doi:10.1145/3440755. [35] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https: //arxiv.org/abs/1908.10084. [36] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013. [37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/ 1073083.1073135. 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 A. Appendix. Detailed results of the consistency and veracity metrics for the three contexts.

Overlap 1,000 1,000 1,000 1,000 1,000 1,000 1,000 0,964 1,000 1,000 0,996 1,000 0,983 1,000 0,992 1,000 1,000 1,000 1,000 1,000 1,000 0,998

ROUGE

BLEU

ROUGE

BLEU

ROUGE

BLEU

Overlap

Llama2

[1]

Chang ,

Wang ,

Wu ,

Yang ,

Zhu ,

Chen ,

Yi ,

Wang ,

Wang , et al., A survey on evaluation of large language models , ACM Transactions on Intelligent Systems and Technology 15 ( 2024 ) 1 - 45 .

[2]

Wang ,

Haddow ,

Birch ,

Peng , Assessing the reliability of large language model knowledge , arXiv:2310.09820 ( 2023 ). URL: https://doi.org/ 10.48550/arXiv.2310.09820].

[3]

Caruccio , et al., Can chatgpt provide intelligent diagnoses? a comparative study between predictive models and chatgpt to define a new medical diagnostic bot , Expert Systems with Applications 235 ( 2024 ) 121186 . URL: https://www.sciencedirect.com/ science/article/pii/S0957417423016883. doi:https: //doi.org/10.1016/j.eswa. 2023 . 121186 .

[4]

Jin ,

Wang ,

Yang ,

Sun ,

Wang ,

Liao ,

Xie , Towards fine-grained reasoning for fake news detection , Proceedings of the AAAI Conference on Artificial Intelligence 36 ( 2022 ) 5746 - 5754 . URL: https://ojs.aaai.org/index.php/AAAI/article/ view/20517. doi: 10 .1609/aaai.v36i5. 20517 .

[5]

Wadden ,

Lo ,

L. L.

Wang ,

Cohan , I. Beltagy, H. Hajishirzi, MultiVerS: Improving scientific Llama2