Leveraging Large Language Models (LLMs) as Domain Experts in a Validation Process

Carlos Badenes-Olmedo1,†, Esteban García-Cuesta2,*,†, Alejandro Sánchez-González2 and Oscar Corcho2

1 Ontology Engineering Group, Departamento de Sistemas Informáticos, Universidad Politécnica de Madrid
2 Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing, Valladolid, Spain, 24-27 September 2024.
* Corresponding author. † These authors contributed equally.
carlos.badenes@upmm.es (C. Badenes-Olmedo); esteban.garcia@upm.es (E. García-Cuesta); alejandro.sanchezg@alumnos.upm.es (A. Sánchez-González); oscar.corcho@upm.es (O. Corcho)
ORCID: 0000-0002-2753-9917 (C. Badenes-Olmedo); 0000-0002-1215-3333 (E. García-Cuesta); 0000-0002-9260-0753 (O. Corcho)

Abstract
The explosion of information requires robust methods to validate knowledge claims. At the same time, there is an increasing interest in understanding and creating methods that help with the interpretation of machine learning models. Both approaches converge on the need for a validation step that clarifies, or helps end-users to better understand, whether the decision or information provided by the model is what is needed or whether there is a mismatch between what the artificial intelligence system suggests and reality. Large Language Models (LLMs), with their ability to process and synthesize vast amounts of text data, have emerged as potential tools for this purpose. This study explores the utility of LLMs in hypothesis validation in two different scenarios. The first relies on hypotheses generated from explanations obtained by XAI methods or by inherently explainable models; we propose a method to transform the inferences provided by a machine learning model into explanations in natural language, hence linking the symbolic and sub-symbolic areas. The second relies on hypotheses generated with techniques that automatically extract answers from text. The results show that LLMs can complement other XAI techniques and that, although all the LLMs analyzed are able to provide truthfulness-related answers, not all are equally successful.

Keywords
LLMs, knowledge validation, explainable artificial intelligence

1. Introduction

In recent years, the field of artificial intelligence has witnessed a remarkable evolution in natural language processing capabilities, largely driven by the advent of Large Language Models (LLMs). The essence of utilizing these models for Artificial Intelligence (AI) tasks such as knowledge and hypothesis validation lies in their ability to understand, generate, and manipulate human language. This ability is crucial for tasks that require a deep understanding of the context and nuances of human communication.

Applying LLMs to real-world scenarios inevitably leads to language generation that deviates from known facts (so-called "factual hallucination" [1]) due to multiple causes; for example, a model may over-estimate because of overfitting on biased prompts (the framing effect). Several studies have tried to measure these effects, but it is still difficult to generalize them outside the specific context in which the studies were performed. In [2] the authors study prompt framing and in-context interference effects, showing that large language models are subject to the influence of various hallucination-inducing causes. This holds for Word Prediction (WP), Question Answering (QA), and Fact-Checking (FC). Other studies [3] compare the results obtained by machine learning models with those produced by LLMs for diagnostic decision support systems. They propose a processing pipeline for interacting with language models and conclude that LLMs are often ambiguous and provide incorrect diagnoses, with prompt engineering being a critical step in the process. Thus, claim verification has emerged as a key point in discerning between misinformation and real facts. Most of these works [4][5] rely on human-annotated datasets to verify the explanations or decisions, but such information is not accessible in hypothesis-testing scenarios.
In this paper, we explore the innovative application of Large Language Models (LLMs) as validation tools in fact-checking and hypothesis fact-checking scenarios. Traditionally, the validation of conclusions drawn from data and models in specialized domains has been a task reserved for human experts, largely due to the complexity and domain-specific nature of the required knowledge. However, with the evolution of LLMs, the question arises whether these powerful natural language processing tools can assume a role similar to that of human experts in the validation of domain-specific knowledge.

The primary goal of this work is to analyze and assess the ability of LLMs to serve as domain experts in a clinical scenario, specifically in validating explanations (and thereby validating the models) derived from classical machine learning decision models. These explanations, presented in the form of affirmative statements such as "hypertension increases the risk of death from COVID-19", are transformed into questions (for example, "Does hypertension mean an increased risk of death from COVID-19?") to be presented to the LLMs. This approach allows us to directly evaluate the LLMs as a knowledge base for validating specific claims within the domain, offering a unique perspective on their applicability as validation tools in scientific and clinical contexts.

This work aims to address the following research questions: RQ1) What effect does the variation in the number of options within a fact-checking question have on the responses provided by Large Language Models (LLMs)? RQ2) How consistent are the boolean answers (i.e. yes or no) provided by LLMs? RQ3) What is the impact of integrating machine learning inferences with LLMs on enriching and validating the explanations?
Through this analysis, we seek not only to understand the level of knowledge and accuracy of LLMs in specialized domains but also to investigate their potential to complement or, in some cases, replace the need for human peer review in the validation stage of scientific conclusions.

Our contributions are:
1. a novel assessment method that integrates machine learning inferences with Large Language Models (LLMs) to generate fact-checking (FC) type questions;
2. a study of the variability and consistency of responses provided by LLMs in multiple-choice questions and in scenarios with established ground truths;
3. an investigation into the variability of explanations provided by LLMs in scenarios involving fact-checking (including questions with multiple factual options) and fact recovery, offering a comprehensive understanding of LLMs' explanatory capabilities and their potential for enhancing AI interpretability.

2. Related works

Prompt framing effect. The study of the prompt framing effect reveals that the performance of Large Language Models (LLMs) is highly dependent on the construction of the prompts, with a significant focus on the consistency of LLMs' responses to similar prompts. This concept, discussed in [6], [7], and [8], examines LLMs' ability to provide consistent outputs for semantically similar prompts and their sensitivity to hallucination-inducing inputs. The examination of LLMs under different conditions, such as varying the context and structure of the prompts, sheds light on their performance variability and on strategies for optimizing accuracy.

Building on this foundation, the interplay between context, choice structure, and decision-making, as explored in [9], [10], [11], and [12], directly relates to the challenges LLMs face. This parallel between human and computational decision-making processes emphasizes the importance of carefully designed prompts and the strategic manipulation of choice options to improve LLM reliability and decision accuracy. Through the decision-making strategies and prompt engineering techniques proposed in [13], [14], [15], [16], and [17], a nuanced approach to prompt framing becomes critical for enhancing LLM interactions and understanding. This body of work collectively illustrates a key insight: adjusting the number of options and the framing of prompts can profoundly influence the effectiveness of LLMs in verifying statements and making decisions, bridging the gap between consistency in output and the complexity of input conditions.

Explainable AI and LLMs. Interpretability and explainability in Machine Learning (ML) refer to the ability to make an ML model's workings understandable. This is particularly vital in high-risk applications and desirable in most cases. The burgeoning field of research that aims to foster this ability is known as eXplainable Artificial Intelligence (XAI). A variety of XAI methods have been developed in recent years. They may be related to intrinsically interpretable models or to "black box" models, but all pursue coherent and meaningful explanations for the audience. As an example, SHAP (SHapley Additive exPlanations) is one of the most widely used model-agnostic XAI techniques. It is based on concepts from game theory that allow computing which features contribute the most to the outcomes of the black-box model by trying different feature-set permutations [18]. LIME (Local Interpretable Model-agnostic Explanations) is another well-known example that builds a simple linear surrogate model to explain each prediction of the learned black-box model [19]. There are also some interpretable ML models, such as logistic regression, Generalised Linear Models (GLMs), or Generalised Additive Models (GAMs). There are also attempts to facilitate the comprehension of XAI methods by providing new tools to end-users. In [20] a new GPT, x-[plAIn], is proposed to transform the output explanations provided by those methods (e.g. SHAP or LIME) into natural language that contains the technical descriptions of the results. Despite the improvements in end-user satisfaction, this work does not include any enrichment or additional information that could contextualize not only the explanations themselves but also their meaning and their validation within the application domain. In [21] the authors propose to use LLMs to facilitate decision-making processes by end users, providing concise summaries of various XAI methods tailored for different audiences. This can be viewed as an LLM-enhanced XAI explainer that tries to bridge the gap between complex AI technologies and their practical applications.
Veracity and truth extraction. The exploration of truth within the realm of big data and its verification through LLMs embodies a complex interaction between technological advancements and the multifaceted nature of truth. The assembly method proposed by [22] marks a significant step in addressing the challenge of data veracity by combining individual truth discovery methods to mitigate the effects of limited labeled ground truth availability. This approach lays the groundwork for further research on the role of technology in differentiating between truth and falsehood. Furthermore, research on linguistic indicators of truth and deception, such as that of [23], reveals the potential of linguistic complexity and immediacy to act as markers that distinguish between truthful and deceptive narratives, enriching the conversation about truth verification in digital communications.

Recent advances in artificial intelligence, notably the conceptualization of models such as InstructGPT as "Truth Machines" by [24], highlight ongoing efforts to define and operationalize truth through sophisticated data analysis and model architectures. Currently, innovative methodologies such as the DoLa decoding strategy by [25] and the development of truthfulness personas by [26] aim to enhance the factuality and reliability of LLM outputs. These strategies not only address the challenge of hallucinations in model responses but also open up new pathways for embedding truthfulness within AI systems, underscoring the dynamic nature of research focused on achieving reliable knowledge verification and decision-making processes in the digital era.

3. Approach and Problem Setup

Our proposal involves using LLMs as knowledge bases to evaluate the outcomes of machine learning models by answering Boolean questions derived from the models' inferences. This approach aims to harness the comprehensive knowledge and understanding capabilities of LLMs to verify the accuracy and reliability of inferences made by machine learning models, thereby providing a novel method for validating AI-generated insights through direct, yes-or-no questioning.

3.1. Large Language Models

A range of LLMs have been developed in recent years. GPT-4, developed by OpenAI, is a state-of-the-art LLM known for its deep learning architecture. As part of the Generative Pre-trained Transformer series, it includes a large network of multi-layer transformers, capable of processing sequential data and preserving long-term textual dependencies. This version marks a significant advancement over its predecessors by scaling up the number of parameters and broadening the diversity of its training data, thus enhancing its ability to generate coherent and contextually relevant text based on the input it receives [27].

Moreover, Google DeepMind's Gemini project is a key competitor to GPT-4. Gemini is a family of models built on top of transformer decoders that employ attention mechanisms, analogous to GPT-4. Gemini Pro, the second model in the family in terms of size, has been optimized for both cost and latency, offering considerable performance improvements across numerous tasks; it is designed to understand, reason, and generate outputs across various types of data, including text [28].

Similarly, Llama 2 constitutes a collection of pretrained and fine-tuned LLMs that is distinctive from the models mentioned above due to its open-source nature [29]. This group of models developed by Meta includes two models (Llama 2 and Llama 2-Chat) with different versions that adjust the number of parameters: 7B, 13B and 70B.

Mistral represents another significant collection of LLMs, characterized by advanced reasoning capabilities and robust performance. Their largest model, Mistral Large, demonstrates state-of-the-art results across a variety of benchmarks, including areas such as common sense, reasoning, and knowledge-based tasks [30]. The Mistral family also includes open-source models that surpass certain versions of Llama 2 in several benchmarks, as documented in [31].
3.2. Datasets

Covid19 explanations. The questions included in Table 1 are created from a clinical study [32]. In that study, one thousand three hundred and thirty-one COVID-19 patients (average age 66.9 years; males n=841, average length of hospital stay 8 days, non-survivors n=233) were analyzed. The questions are constructed from the hypotheses raised in the study. Questions Q2, Q3, Q4, Q5, Q6, Q7, and Q8 were identified as significant using a Cox regression model, and Q1, Q9, and Q10 were identified as significant by univariate analysis. Q1 was also identified as one of the most important variables using SHAP explanations over an LSTM model learned on the same Covid19 dataset. Based on domain knowledge and on the model explanations, we set Q1, Q2, Q3, Q4, Q5, Q6, and Q8 as questions with a positive ("yes") ground-truth answer. We did not include Q7 as a positive response (but as controversy), despite it being obtained from the Cox model explanations, because there was controversy about the use of hydroxychloroquine during the pandemic: although it was initially considered a drug that could reduce the risk of mortality, it was later contradicted by other studies and was not recommended by the World Health Organization. Likewise, the variables that were obtained only by the univariate analysis (Q9 and Q10) are labeled as controversy answers.

It is important to highlight that all the questions adhere to a consistent structure to optimize the performance of the LLM. Specifically, each question is framed as "Does #hypothesis# mean an increased risk of death from COVID-19?". This uniformity ensures that the LLM's responses are directly comparable and minimizes variability that could arise from differing question formats. It also allows us to test hypotheses obtained from the explainability models.

Table 1: Consistency questions dataset
Q1: Does hypertension mean an increased risk of death from COVID-19?
Q2: Does a low platelet count mean an increased risk of death from COVID-19?
Q3: Does a high leukocyte count at emergency mean an increased risk of death from COVID-19?
Q4: Does older age mean an increased risk of death from COVID-19?
Q5: Does male gender mean an increased risk of death from COVID-19?
Q6: Does previous chronic therapy with steroids mean an increased risk of death from COVID-19?
Q7: Does not treating with hydroxychloroquine mean an increased risk of death from COVID-19?
Q8: Does oxygen saturation at emergency mean an increased risk of death from COVID-19?
Q9: Does no early prescription of lopinavir/ritonavir mean an increased risk of death from COVID-19?
Q10: Does no treatment with steroid bolus mean an increased risk of death from COVID-19?
Veracity dataset. The Stanford Question Answering Dataset (SQuAD) [33] has been extensively used in the scientific literature for the development of Question Answering (QA) language models, serving as a benchmark to assess the abilities of these models in understanding and processing natural language queries. As a rich compilation of questions and answers based on Wikipedia articles, SQuAD challenges models to provide accurate answers by comprehending the context provided in the passages. In our work, we retrieved a subset of questions from the SQuAD dataset to specifically validate the knowledge conveyed by LLMs. This targeted evaluation was designed to determine the precision of the LLM answers compared to the gold-standard answers of the dataset. This method of validation not only tests the LLMs' understanding of complex texts, but also assesses their reliability in providing information that matches human-curated answers.

3.3. Use Cases

Three use cases (UC) have been designed to address the previous research questions, focusing on the practical applications and implications of using LLMs to validate machine learning inferences. The first investigates the influence of varying the number of options in fact-checking questions on LLM responses, aiming to understand how choice diversity impacts LLM accuracy. The second focuses on assessing the consistency of the boolean (yes or no) answers provided by LLMs, evaluating their reliability in delivering steady responses. Lastly, we explore the effects of combining machine learning inferences with LLMs to both enrich and validate the explanations of these models. This last use case uses the Covid19 dataset to create an ML model and the SHAP technique to obtain a set of important features that are later enriched with LLMs.

The models used in this study include "gpt-4" from OpenAI, "mistral-large-2402" from Mistral AI, "gemini-1.0-pro-001" from Google, and "llama-2-70b-chat" from Meta AI. In addition, the temperature parameter was set to the lowest possible value to ensure the most deterministic behavior of the LLMs. Temperature controls the randomness of the generated output, with a lower value leading to more deterministic outputs by favoring the most likely predictions. Therefore, in most models, the temperature value was set to 0 to minimize randomness. However, it is important to note that for the Llama 2 model the minimum supported temperature value is 0.01. Despite this slight deviation from 0, the aim remains the same: to achieve the lowest possible level of randomness in the output.
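As an illustration of this setup, the following sketch shows how a single question could be posed to one of the models with a context prompt and the temperature fixed at its minimum. It uses the OpenAI Python client for the GPT-4 case; the helper name ask_llm and the surrounding wiring are ours rather than the exact code used in the study, and the other providers' SDKs expose analogous context and temperature parameters.

# Minimal sketch (not the authors' code) of how a fact-checking question is sent
# to one of the LLMs with a system context and temperature set to its minimum.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

CONTEXT_1 = ("You are an expert on COVID-19 and your duty is to answer questions "
             "related to the topic only with yes or no followed by the explanation "
             "that validates the answer in a maximum of 2 sentences.")

def ask_llm(question: str, context: str, model: str = "gpt-4") -> str:
    """Send one question with a system context, using the most deterministic setting."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # lowest supported value; Llama 2 would use 0.01 instead
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_llm("Does hypertension mean an increased risk of death from COVID-19?",
                  CONTEXT_1))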
UC1: Fact Density Impact Analysis. This use case examines the performance of LLMs when delivering binary responses ("yes" or "no") versus incorporating a third option ("controversy") that introduces an element of uncertainty. The evaluation aims to measure the models' performance in terms of veracity, exploring how the structure of the response options affects the LLMs' ability to provide accurate and reliable answers in fact-checking scenarios.

Table 2 presents the prompts used in the three scenarios designed to evaluate veracity: allowing the model to use binary responses or multiple options, and requesting the model to act as an expert in the clinical domain providing precise and concise responses. The use of the max-tokens parameter inadvertently caused responses to be abruptly cut, leading to nonsensical outcomes. Consequently, we directed the model within the context to be precise and concise, with the aim of minimizing this issue and enhancing the clarity and relevance of its answers. This additional evaluation context was designed to gauge the model's capacity to offer accurate and reliable answers when positioned as a domain-specific authority, further enriching our understanding of its performance in delivering veracious responses within specialized scenarios. This distinction allows for a detailed examination of how the inclusion of a "controversy" option alongside traditional "yes" or "no" answers influences the model's response behavior in our Use Case 1 analysis.

Table 2: Use Case 1 contexts
Context 1: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes or no followed by the explanation that validates the answer in a maximum of 2 sentences."
Context 2: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes, no or controversy followed by the explanation that validates the answer in a maximum of 2 sentences."
Context 3: "You are a medical expert and your duty is to answer medical questions in a single sentence in a precise and brief manner."

UC2: Consistency and Veracity Evaluation. Use Case 2 distinguishes between two methods of evaluating LLM consistency based on the availability of ground truth. In the first approach, where the true answer is not available, consistency is assessed by comparing the LLM's responses against each other. This method focuses on the internal consistency of the model's answers. In the second approach, where a known true answer exists, the LLM's responses are evaluated against this ground truth to measure the model's accuracy and reliability in providing consistent and correct answers, a quality referred to as veracity.

On the one hand, the first approach, or consistency evaluation, aims to assess the stability of responses from LLMs through repeated inquiries. Using Algorithm 1 to systematically evaluate consistency on the Covid19 dataset, we probe each question in the dataset multiple times using the question and Context 1 as the prompt. This allows us to gauge the LLMs' consistency using the metrics described in Section 3.4. Similarly, the same algorithm is used with Context 2. The algorithm was therefore deployed twice for each LLM, once for each of the two contexts, and the temperature parameter was minimized to enhance response determinism. This methodology provides a nuanced understanding of the models' consistency by ensuring controlled conditions and leveraging the lowest possible temperature setting to maximize the determinism of the models' responses.

Algorithm 1: Evaluate the consistency of a single LLM
1: for each question q_i in dataset1 do
2:     Initialize Responses to an empty list
3:     for i ← 1 to 10 do
4:         response r ← AskLLM(q_i, context1)
5:         Append r to Responses
6:     end for
7:     SemanticSimilarity ← CalculateSemanticSimilarity(Responses)
8:     Overlap ← CalculateOverlap(Responses)
9:     ROUGE ← CalculateROUGE(Responses)
10:    BLEU ← CalculateBLEU(Responses)
11:    Store metrics for further analysis
12: end for
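A Python rendering of Algorithm 1 could look as follows. It assumes an ask_llm(question, context) callable such as the one sketched above and computes only a Jaccard-style token overlap between responses; the semantic similarity, ROUGE and BLEU scores of Section 3.4 would be plugged into pairwise_scores in the same way. This is an illustrative sketch, not the authors' implementation.

# Sketch of Algorithm 1: query each question 10 times with the same context and
# score the agreement between the collected responses.
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List

def token_overlap(a: str, b: str) -> float:
    """Share of common tokens between two sentences (Jaccard over word sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def pairwise_scores(responses: List[str]) -> float:
    """Average a symmetric metric over all unordered pairs of responses."""
    return mean(token_overlap(a, b) for a, b in combinations(responses, 2))

def evaluate_consistency(ask_llm: Callable[[str, str], str],
                         questions: List[str],
                         context: str,
                         n_runs: int = 10) -> Dict[str, float]:
    """For each question, collect n_runs responses and store their mean agreement."""
    results = {}
    for q in questions:
        responses = [ask_llm(q, context) for _ in range(n_runs)]
        results[q] = pairwise_scores(responses)
    return results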
On the other hand, the veracity evaluation involves the use of ground truth. Therefore, akin to the previous method, we employ a different algorithm (see Algorithm 2) designed to assess the veracity of each response from each model. The key difference in this approach is that, for each question, both the responses obtained with their context (Context 3) and the ground truth for that question ("q_i answer") are used, enabling a direct comparison between the LLM's responses and the known accurate answers.

Algorithm 2: Evaluate the veracity of a single LLM
1: for each question q_i in dataset2 do
2:     Initialize Responses to an empty list
3:     for i ← 1 to 10 do
4:         response r_i ← AskLLM(q_i, context3)
5:         Append r_i to Responses
6:     end for
7:     SemanticSimilarity ← CalculateSemanticSimilarity(Responses, q_i answer)
8:     Overlap ← CalculateOverlap(Responses, q_i answer)
9:     ROUGE ← CalculateROUGE(Responses, q_i answer)
10:    BLEU ← CalculateBLEU(Responses, q_i answer)
11:    Store metrics for further analysis
12: end for

UC3: XAI Enhancement and Validation. Use Case 3 involves leveraging machine learning inferences and LLMs to enrich and validate explanations. We propose to utilize important features identified by XAI techniques, such as SHAP, to augment information and validate explanations. This involves transforming explanations into binary questions that LLMs can answer, with prompts that contain both the question and the relevant context. By constructing queries that directly link significant features with real-world outcomes (e.g. "Does #hypothesis# mean an increased risk of death from COVID-19?"), we bridge the gap between XAI insights and practical applications. Additionally, by instructing LLMs to respond with "yes" or "no" and to provide validating explanations, we achieve the dual objectives of validating and enriching responses, prompting the LLMs to elaborate on the pertinent features.
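The following sketch illustrates the UC3 step that turns model explanations into fact-checking questions: features are ranked by mean absolute SHAP value and the top-ranked ones are inserted into the question template used throughout the paper. The random-forest regressor and the synthetic data are illustrative stand-ins for the clinical model and the Covid19 cohort used in the study; the feature names follow the Table 1 hypotheses.

# Sketch (under illustrative assumptions) of turning SHAP importances into
# fact-checking questions for the LLMs.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

FEATURE_NAMES = ["hypertension", "a low platelet count",
                 "a high leukocyte count at emergency", "older age", "male gender"]
TEMPLATE = "Does {hypothesis} mean an increased risk of death from COVID-19?"

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURE_NAMES)))              # illustrative data only
risk = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, risk)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)                      # shape: (n_samples, n_features)
importance = np.abs(shap_values).mean(axis=0)               # mean |SHAP| per feature

top_features = np.argsort(importance)[::-1][:3]             # three most important features
questions = [TEMPLATE.format(hypothesis=FEATURE_NAMES[i]) for i in top_features]
for q in questions:
    print(q)  # each question is then sent to the LLMs with a Table 2 context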
3.4. Metrics

A suite of metrics has been implemented to evaluate the consistency and veracity of the LLMs. This suite includes semantic similarity, token overlap, and the ROUGE and BLEU metrics.

Semantic similarity is a measure of the degree to which two concepts (such as words, phrases, or sentences) are related in terms of their meanings within a given semantic space. In formal terms, semantic similarity can be quantified based on the distance or closeness of the concepts in a multi-dimensional space, where each dimension represents a feature of the concept's meaning. The closer two concepts are in this space, the more semantically similar they are. Diverse methods for calculating semantic similarity are analysed in [34], encompassing a range of approaches. This research specifically uses cosine similarity in conjunction with sentence embeddings. We use Sentence-BERT, a variation of BERT (Devlin et al., 2018) optimized for sentence-level embeddings, due to its proven efficiency [35]. In particular, we use the "all-MiniLM-L6-v2" model for its remarkable balance between performance and speed: despite being one of the smallest models in terms of size, it stands out for its rapid processing capabilities.

Overlap as a metric refers to quantifying similarity based on the common tokens (words or other meaningful elements) that appear in two sentences. This metric is used to assess how much shared content exists between both sentences, indicating their consistency or similarity in terms of the information they convey.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a collection of metrics designed for the formal evaluation of text generation models, such as summarization or machine translation systems. In the evaluation of responses generated by an LLM, the use of ROUGE is justified by its ability to quantitatively measure the lexical overlap across the different responses generated by the LLM itself. This is accomplished using the ROUGE-L variant, which employs the Longest Common Subsequence (LCS) between two sentences as the basis for computing recall, precision, and the F1 score derived from both [36].

BLEU stands for Bilingual Evaluation Understudy. BLEU is a metric initially conceived for evaluating the quality of text translated by machine translation systems by comparing it with one or more reference translations [37]. Unlike ROUGE, which is recall-oriented, BLEU emphasizes precision. It assesses how many words or phrases in the machine-generated text appear in the reference texts. The metric calculates n-gram (contiguous sequences of n items from a given sample of text) precision for different lengths and combines them through a weighted geometric mean, incorporating a brevity penalty to discourage overly short translations [37]. This precision-oriented approach is particularly valuable when the objective is to ensure that certain key information is consistently represented in the LLM's outputs. We computed the BLEU metric by treating each LLM response as a "translation" and comparing it to the other responses. BLEU can highlight the extent to which the LLM is capable of producing responses that contain expected and relevant content. This offers a complementary perspective to the recall-focused ROUGE metric, providing a balanced assessment of the LLM's performance.
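The metric suite could be implemented with common Python packages as sketched below. The paper fixes cosine similarity over "all-MiniLM-L6-v2" sentence embeddings and the ROUGE-L variant; the sentence-transformers, rouge-score and NLTK packages used here, as well as the exact normalization of the token-overlap score, are our assumptions rather than the authors' stated tooling.

# Sketch of the four scores of Section 3.4 for one pair of texts.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between Sentence-BERT embeddings of the two texts."""
    emb = _embedder.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def overlap(a: str, b: str) -> float:
    """Token overlap: shared words divided by the size of the smaller word set
    (the paper does not state the normalization, so this choice is illustrative)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / min(len(ta), len(tb)) if ta and tb else 0.0

def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1, based on the longest common subsequence."""
    return _rouge.score(reference, candidate)["rougeL"].fmeasure

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU with smoothing to avoid zero scores on short texts."""
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)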
4. Evaluation and Results

In this section we evaluate the different use cases.

4.1. UC1 Results

The first use case focuses on evaluating how the structure of the response options presented to the LLM influences the models' accuracy and reliability. This evaluation was addressed by using two different contexts: Context 1, which allows only a binary response ("yes" or "no"), and Context 2, which introduces a third option associated with uncertainty, characterized as "controversy".

Table 3 shows the models' responses for each question. It is important to clarify that although multiple responses are generated for each question (specifically 10), the table presents only a single value in each cell. This reduction is justified because the answers ("yes", "no" or "controversy") do not vary across iterations; what varies is the model's explanation of the response, not the answer itself.

Table 3: Context 1 vs Context 2 responses for all LLMs (C1 = Context 1, C2 = Context 2)
      Expected     | GPT C1 / C2       | Mistral C1 / C2   | Gemini C1 / C2     | Llama2 C1 / C2
Q1    yes          | yes / yes         | yes / yes         | yes / controversy  | yes / yes
Q2    yes          | yes / yes         | yes / yes         | yes / controversy  | no / controversy
Q3    yes          | yes / yes         | yes / yes         | yes / controversy  | yes / controversy
Q4    yes          | yes / yes         | yes / yes         | yes / yes          | yes / yes
Q5    yes          | yes / yes         | yes / yes         | yes / yes          | no / controversy
Q6    yes          | yes / controversy | yes / controversy | no / controversy   | yes / controversy
Q7    controversy  | no / controversy  | no / controversy  | no / controversy   | no / controversy
Q8    yes          | yes / yes         | yes / yes         | yes / yes          | yes / controversy
Q9    controversy  | no / controversy  | yes / controversy | no / controversy   | yes / controversy
Q10   controversy  | yes / controversy | yes / controversy | no / controversy   | yes / controversy

However, some differences can be noted both in the responses generated by an LLM under different contexts for the same question and in the performance across the various language models (e.g. Q1 in Context 2 is answered as "yes" by GPT but as "controversy" by Gemini). These variations reveal that while some differences can be attributed to the introduction of "controversy" in the response options (e.g. GPT Q7), others do not have such a clear justification (e.g. Gemini Q2).

Optimally, each LLM should make three justified variations (Q7, Q9, and Q10) when the uncertainty option is introduced with the second context, due to the limitation of Context 1 to binary "yes" or "no" answers. For GPT, an analysis of the responses between contexts reveals a mixed outcome: 3 of the variations presented are deemed correct (Q7, Q9, and Q10), indicating that the model accurately handled both contexts. Conversely, the model's response to question Q6 is classified as a wrong variation, suggesting inaccuracies in dealing with different contexts.

Similarly, the performance of the other models is as follows:
• Mistral accurately handles 3 variations (Q7, Q9, and Q10) but had an error in Q6.
• Gemini stands out by correctly handling 3 variations (Q7, Q9, and Q10) but falls short by producing 4 unjustified incorrect variations (Q1, Q2, Q3, Q6), including a notable discrepancy in question Q6, where the expected answer was "yes" but the output was "no".
• Llama2 demonstrates accuracy in the variations of Q7, Q9, and Q10. However, it produces unjustified variations in Q2, Q3, Q5, Q6, and Q8. Furthermore, it provides incorrect answers for Q2 and Q5, where "yes" was expected but "no" was output.

Our findings suggest that introducing the option of "controversy" as a potential response significantly influences the behavior of the analyzed LLMs, leading to a noticeable shift in their response patterns. Across the models (GPT and Mistral changed their response in 4 out of 10 instances, Gemini in 7 out of 10, and Llama2 in 8 out of 10), there is a marked preference for selecting "controversy" over a definitive "yes" or "no". This tendency persists irrespective of the model in question and appears to reflect a broader pattern: when presented with the "controversy" option, models consistently avoid negative responses, opting instead to categorize statements as controversial. This behavior suggests a higher level of confidence in asserting conclusions than in denying them. While for GPT and Mistral 75% of these shifts towards "controversy" can be considered justified, enhancing the quality of the output, the justification for this change drops to 43% for Gemini and 37% for Llama2, indicating variability in how these adjustments align with the underlying data uncertainty.
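These shift counts and percentages can be reproduced directly from Table 3, as the following sketch shows: a change between contexts is counted as justified when the Context 2 answer lands on the expected "controversy" label, and as unjustified otherwise. The printed figures match, up to rounding, the values quoted above; the encoding of the table is ours.

# Reproduce the context-shift counts of Section 4.1 from the Table 3 data.
EXPECTED = ["yes"] * 6 + ["controversy", "yes", "controversy", "controversy"]  # Q1..Q10

# (Context 1, Context 2) answer per question Q1..Q10, transcribed from Table 3.
TABLE3 = {
    "GPT":     [("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"),
                ("yes", "controversy"), ("no", "controversy"), ("yes", "yes"),
                ("no", "controversy"), ("yes", "controversy")],
    "Mistral": [("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"), ("yes", "yes"),
                ("yes", "controversy"), ("no", "controversy"), ("yes", "yes"),
                ("yes", "controversy"), ("yes", "controversy")],
    "Gemini":  [("yes", "controversy"), ("yes", "controversy"), ("yes", "controversy"),
                ("yes", "yes"), ("yes", "yes"), ("no", "controversy"), ("no", "controversy"),
                ("yes", "yes"), ("no", "controversy"), ("no", "controversy")],
    "Llama2":  [("yes", "yes"), ("no", "controversy"), ("yes", "controversy"), ("yes", "yes"),
                ("no", "controversy"), ("yes", "controversy"), ("no", "controversy"),
                ("yes", "controversy"), ("yes", "controversy"), ("yes", "controversy")],
}

for model, answers in TABLE3.items():
    changed = [(expected, c2) for (c1, c2), expected in zip(answers, EXPECTED) if c1 != c2]
    justified = sum(1 for expected, c2 in changed if c2 == expected)
    print(f"{model}: {len(changed)}/10 answers change, "
          f"{100 * justified / len(changed):.1f}% of the changes are justified")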
4.2. UC2 Results

In this section, we present the results of the second use case, which are detailed in Tables 4 and 5. These tables show the average performance metrics for consistency and veracity, namely semantic similarity, overlap, ROUGE and BLEU scores, for each model across the datasets. These metrics were computed for each question within the datasets, with averages provided to give a view of each model's performance under two different contexts (i.e. Context 1 and Context 2) for the consistency evaluation (Table 4; per-question consistency results are provided in Appendix Tables A1 and A2) and a third context (i.e. Context 3) for the veracity evaluation (Table 5; per-question veracity results are provided in Appendix Table A3).

Table 4: Average consistency evaluation
                       Semantic similarity   Overlap   ROUGE   BLEU
GPT       Context 1    0.983                 0.888     0.844   0.770
          Context 2    0.980                 0.897     0.853   0.763
Mistral   Context 1    1.000                 1.000     1.000   1.000
          Context 2    0.999                 0.995     0.992   0.988
Gemini    Context 1    1.000                 0.996     0.996   0.992
          Context 2    0.999                 0.996     0.993   0.986
Llama2    Context 1    1.000                 0.998     0.997   0.996
          Context 2    0.999                 0.991     0.989   0.983

Table 5: Average veracity evaluation
          Semantic similarity   Overlap   ROUGE   BLEU
GPT       0.740                 0.464     0.301   0.408
Mistral   0.727                 0.403     0.222   0.380
Gemini    0.676                 0.415     0.273   0.328
Llama2    0.734                 0.466     0.239   0.393

Our analysis reveals no significant difference in performance between the two contexts evaluated for consistency, where all LLMs demonstrated high levels of consistency. Mistral achieved perfect consistency scores, while Gemini and Llama2 were nearly perfect. However, GPT showed the lowest consistency (for all metrics, including semantic similarity), even with the temperature parameter set to the lowest level, indicating potential variability in its response generation process.

When comparing the models' performance to the ground-truth data for veracity (see Section 3.2), GPT stands out by achieving the best results across all metrics, indicating that its responses, on average, align more closely with the ground truth than those of the other models. Llama2 follows closely behind as the second-best performer, with Gemini and Mistral trailing and their positions varying depending on the metric applied. These findings suggest that while GPT may struggle with consistency relative to its peers, it excels at generating responses that are more closely aligned with verifiable facts, highlighting a nuanced trade-off between consistency and veracity across different LLMs.
4.3. UC3 Results

This use case examines the use of prompts that transform explanations into binary questions containing both the question and the relevant context for the fact-checking under analysis (e.g. Context 1: "You are an expert on COVID-19 and your duty is to answer questions related to the topic only with yes or no followed by the explanation that validates the answer in a maximum of 2 sentences."). Table 7 shows examples of GPT-4 responses for Q1, Q2, and Q3. Q1 enriches the fact-checking response by adding information about the consequences of having hypertension and how they relate to a higher risk of death. Q2 enriches the response by adding reasons why the important feature (i.e. platelet count) plays a crucial role that may lead to a high risk of death. Finally, the Q3 response is enriched by indicating that a high leukocyte count can be a symptom of severe Covid19.

Table 7: Examples of responses for GPT-4
Q1: "Yes, hypertension has been identified as a risk factor for severe outcomes in patients with COVID-19. Studies have shown that patients with hypertension are more likely to experience severe symptoms or complications, including death, from the virus."
Q2: "Yes, studies have shown that a low platelet count, or thrombocytopenia, can be associated with a higher risk of severe disease and mortality in patients with COVID-19. This is because platelets play a crucial role in the body's immune response, and a low count can impair the body's ability to fight off infections."
Q3: "Yes, a high leukocyte count, or leukocytosis, can indicate a severe infection or inflammation in the body, including severe COVID-19. Studies have shown that patients with severe COVID-19 often have leukocytosis, which is associated with a higher risk of mortality."

In Table 8 we analyze, at a syntactic level, the number of words contained in each explanation and the average number of words per sentence. Llama2 and Mistral produce longer explanations that are also syntactically slightly more complex (Llama2 reaches approximately 28 words per sentence for Context 2). Gemini provides the shortest explanations and the lowest syntactic complexity (36.72 words per text on average and 19.91 words per sentence).

Table 8: Average number of words of explanation per text and per sentence
                       Avg words per text   Avg words per sentence
GPT       Context 1    39.38                19.69
          Context 2    37.46                22.65
          Average      38.42                21.17
Mistral   Context 1    53.30                21.62
          Context 2    45.70                20.20
          Average      49.50                20.91
Gemini    Context 1    37.70                18.85
          Context 2    35.74                20.98
          Average      36.72                19.915
Llama2    Context 1    57.70                23.20
          Context 2    50.76                27.69
          Average      54.23                25.445

Similarly to the previous use cases, we analyzed the differences between the Context 1 and Context 2 explanations (the second including controversy as an option) to measure how different the explanations are. According to all metrics, the results show that the LLM that changes the most is Llama2 (i.e. ROUGE 0.411), followed by Gemini (i.e. ROUGE 0.442), GPT-4 (i.e. ROUGE 0.541) and Mistral (i.e. ROUGE 0.570); see Table 6 for the other metrics.

Table 6: Average consistency of explanations between Context 1 and Context 2
          Semantic similarity   Overlap   ROUGE   BLEU
GPT       0.916                 0.659     0.541   0.375
Mistral   0.931                 0.673     0.570   0.389
Gemini    0.915                 0.600     0.442   0.247
Llama2    0.905                 0.552     0.411   0.219
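The two descriptive analyses of this subsection, the explanation lengths of Table 8 and the Context 1 versus Context 2 agreement of Table 6, can be sketched as follows. The sentence splitter is a simple heuristic and ROUGE-L stands in for the full metric suite, since the paper does not state the exact tooling used for these statistics.

# Sketch of the Table 8 length statistics and the Table 6 context agreement.
import re
from statistics import mean
from typing import Dict, Tuple
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def length_stats(explanations: Dict[str, str]) -> Tuple[float, float]:
    """Average words per explanation and average words per sentence."""
    texts = list(explanations.values())
    per_text = mean(len(t.split()) for t in texts)
    sentence_lengths = [len(s.split())
                        for t in texts
                        for s in re.split(r"(?<=[.!?])\s+", t.strip()) if s]
    return per_text, mean(sentence_lengths)

def context_agreement(ctx1: Dict[str, str], ctx2: Dict[str, str]) -> float:
    """Mean ROUGE-L F1 between the Context 1 and Context 2 explanation of each question."""
    return mean(_rouge.score(ctx1[q], ctx2[q])["rougeL"].fmeasure for q in ctx1)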
5. Conclusions

In this paper we studied the effect of varying the number of options within fact-checking questions, the consistency and truthfulness of the answers, and the capability to enrich fact-checking with explanations. We also proposed to link explanations from machine learning models to LLMs by using those explanations to create fact-checking type input questions. We measured coherence and veracity using state-of-the-art metrics such as semantic similarity, overlap, ROUGE and BLEU, and the results show that Mistral is the most coherent LLM. Notably, Gemini and Llama2 obtained similar results, and GPT was slightly behind. Furthermore, we conclude that fact-checking consistency does not depend on the number of options, but the explanations' consistency does. This is relevant because it means that a different number of options not only may change the fact response but will also justify it differently. Further research should be done to analyze in depth to what extent these differences might even imply contradictory responses. As for the truthfulness analysis, we observed that GPT obtained the best results on average and can be considered quite accurate.

Acknowledgments

This work has been funded by the project "Inteligencia Artificial eXplicable" (IAX), a grant of the Young Researchers 2022/2024 initiative of the Community of Madrid.

References

[1] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al., A survey on evaluation of large language models, ACM Transactions on Intelligent Systems and Technology 15 (2024) 1–45.
[2] W. Wang, B. Haddow, A. Birch, W. Peng, Assessing the reliability of large language model knowledge, arXiv:2310.09820 (2023). URL: https://doi.org/10.48550/arXiv.2310.09820.
[3] L. Caruccio, et al., Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot, Expert Systems with Applications 235 (2024) 121186. URL: https://www.sciencedirect.com/science/article/pii/S0957417423016883. doi:10.1016/j.eswa.2023.121186.
[4] Y. Jin, X. Wang, R. Yang, Y. Sun, W. Wang, H. Liao, X. Xie, Towards fine-grained reasoning for fake news detection, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 5746–5754. URL: https://ojs.aaai.org/index.php/AAAI/article/view/20517. doi:10.1609/aaai.v36i5.20517.
[5] D. Wadden, K. Lo, L. L. Wang, A. Cohan, I. Beltagy, H. Hajishirzi, MultiVerS: Improving scientific claim verification with weak supervision and full-document context, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 61–76. URL: https://aclanthology.org/2022.findings-naacl.6. doi:10.18653/v1/2022.findings-naacl.6.
[6] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze, Y. Goldberg, Measuring and improving consistency in pretrained language models, Transactions of the Association for Computational Linguistics 9 (2021) 1012–1031. URL: https://aclanthology.org/2021.tacl-1.60. doi:10.1162/tacl_a_00410.
[7] H. Raj, V. Gupta, D. Rosati, S. Majumdar, Semantic consistency for assuring reliability of large language models, arXiv preprint arXiv:2308.09138 (2023).
[8] Q. Dong, J. Xu, L. Kong, Z. Sui, L. Li, Statistical knowledge assessment for large language models, Advances in Neural Information Processing Systems 36 (2024).
[9] E. A. Maylor, M. A. Roberts, Similarity and attraction effects in episodic memory judgments, Cognition 105 (2007) 715–723. URL: https://www.sciencedirect.com/science/article/pii/S0010027706002587. doi:10.1016/j.cognition.2006.12.002.
[10] K. V. Morgan, T. A. Hurly, M. Bateson, L. Asher, S. D. Healy, Context-dependent decisions among options varying in a single dimension, Behavioural Processes 89 (2012) 115–120. URL: https://www.sciencedirect.com/science/article/pii/S0376635711001719. doi:10.1016/j.beproc.2011.08.017.
[11] P. Pezeshkpour, E. Hruschka, Large language models sensitivity to the order of options in multiple-choice questions, ArXiv abs/2308.11483 (2023). doi:10.48550/arXiv.2308.11483.
[12] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2021) 1–35. doi:10.1145/3560815.
[13] J. Zhan, H. Jiang, Y. Yao, Three-way multiattribute decision-making based on outranking relations, IEEE Transactions on Fuzzy Systems 29 (2021) 2844–2858. doi:10.1109/tfuzz.2020.3007423.
[14] T. Haladyna, S. Downing, How many options is enough for a multiple-choice test item?, Educational and Psychological Measurement 53 (1993) 999–1010. doi:10.1177/0013164493053004013.
[15] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, ArXiv abs/2302.11382 (2023). doi:10.48550/arXiv.2302.11382.
[16] S. Oymak, A. Rawat, M. Soltanolkotabi, C. Thrampoulidis, On the role of attention in prompt-tuning (2023) 26724–26768. doi:10.48550/arXiv.2306.03435.
[17] A. Bhargava, C. Witkowski, M. Shah, M. W. Thomson, What's the magic word? A control theory of LLM prompting, ArXiv abs/2310.04444 (2023). doi:10.48550/arXiv.2310.04444.
[18] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[20] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, XAI for all: Can large language models simplify explainable AI?, 2024. arXiv:2401.13110.
[21] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, XAI for all: Can large language models simplify explainable AI?, ArXiv abs/2401.13110 (2024). URL: https://api.semanticscholar.org/CorpusID:267199844.
[22] L. Berti-Équille, Data veracity estimation with ensembling truth discovery methods, in: 2015 IEEE International Conference on Big Data (Big Data), 2015, pp. 2628–2636. doi:10.1109/BigData.2015.7364062.
[23] J. Burgoon, L. Hamel, T. Qin, Predicting veracity from linguistic indicators, Journal of Language and Social Psychology 37 (2018) 603–631. doi:10.1177/0261927X18784119.
[24] L. Munn, L. Magee, V. Arora, Truth machines: Synthesizing veracity in AI language models, AI & SOCIETY (2023). doi:10.1007/s00146-023-01756-4.
[25] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, P. He, DoLa: Decoding by contrasting layers improves factuality in large language models, in: International Conference on Learning Representations (ICLR), 2024. doi:10.48550/arXiv.2309.03883.
[26] N. Joshi, J. Rando, A. Saparov, N. Kim, H. He, Personas as a way to model truthfulness in language models, ArXiv abs/2310.18168 (2023). doi:10.48550/arXiv.2310.18168.
[27] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 technical report, ArXiv abs/2303.08774 (2023). URL: https://api.semanticscholar.org/CorpusID:257532815.
[28] Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, et al., Gemini: A family of highly capable multimodal models, ArXiv abs/2312.11805 (2023). URL: https://api.semanticscholar.org/CorpusID:266361876.
[29] H. Touvron, L. Martin, K. R. Stone, P. Albert, et al., Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023). URL: https://api.semanticscholar.org/CorpusID:259950998.
[30] Mistral AI, Mistral Large, our new flagship model, https://mistral.ai/news/mistral-large/, 2024.
[31] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, et al., Mistral 7B, ArXiv abs/2310.06825 (2023). URL: https://api.semanticscholar.org/CorpusID:263830494.
[32] P. Cardinal-Fernandez, E. Garcia-Cuesta, J. Barberan, J. F. Varona, A. Estirado, A. Moreno, J. Villanueva, M. Villareal, O. Baez-Pravia, J. Menendez, et al., Clinical characteristics and outcomes of 1,331 patients with COVID-19: HM Spanish cohort, Revista Española de Quimioterapia 34 (2021) 342.
[33] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392. doi:10.18653/v1/D16-1264.
[34] D. Chandrasekaran, V. Mago, Evolution of semantic similarity: a survey, ACM Computing Surveys 54 (2021) 1–37. URL: http://dx.doi.org/10.1145/3440755. doi:10.1145/3440755.
[35] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[36] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.

A. Appendix. Detailed results of the consistency and veracity metrics for the three contexts.
Table A1: Consistency evaluation per question with Context 1. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.949 0.760 0.666 0.462   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q2        0.989 0.941 0.935 0.880   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 0.999 0.983 0.983 0.977
Q3        0.975 0.868 0.789 0.731   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q4        0.984 0.823 0.757 0.653   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 0.992 0.990 0.985
Q5        0.981 0.895 0.841 0.738   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q6        0.996 0.972 0.948 0.925   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q7        0.980 0.864 0.833 0.800   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q8        0.993 0.899 0.874 0.797   | 1.000 1.000 1.000 1.000   | 0.998 0.964 0.961 0.922   | 1.000 1.000 1.000 1.000
Q9        1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q10       0.987 0.858 0.800 0.710   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Average   0.983 0.888 0.844 0.770   | 1.000 1.000 1.000 1.000   | 1.000 0.996 0.996 0.992   | 1.000 0.998 0.997 0.996

Table A2: Consistency evaluation per question with Context 2. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.976 0.969 0.924 0.825   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q2        0.997 0.990 0.988 0.976   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q3        0.972 0.869 0.832 0.663   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q4        0.998 0.926 0.920 0.875   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q5        0.997 1.000 0.963 0.904   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q6        0.939 0.706 0.556 0.383   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 0.991 0.969 0.957 0.925
Q7        0.949 0.736 0.649 0.475   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q8        0.996 0.932 0.896 0.867   | 0.991 0.954 0.927 0.903   | 0.995 0.960 0.957 0.925   | 0.995 0.944 0.932 0.903
Q9        0.990 0.927 0.906 0.862   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000   | 1.000 1.000 1.000 1.000
Q10       0.982 0.918 0.895 0.796   | 1.000 0.991 0.989 0.977   | 0.998 1.000 0.975 0.938   | 1.000 1.000 1.000 1.000
Average   0.980 0.897 0.853 0.763   | 0.999 0.995 0.992 0.988   | 0.999 0.996 0.993 0.986   | 0.999 0.991 0.989 0.983
Table A3: Veracity evaluation per question with Context 3. For each model the columns are semantic similarity, overlap, ROUGE and BLEU.
          GPT                       | Mistral                   | Gemini                    | Llama2
Q1        0.777 0.750 0.466 0.612   | 0.770 0.750 0.424 0.537   | 0.412 0.250 0.181 0.368   | 0.777 0.750 0.466 0.612
Q2        0.647 0.554 0.331 0.347   | 0.677 0.550 0.244 0.298   | 0.596 0.599 0.461 0.496   | 0.668 0.700 0.363 0.294
Q3        0.687 0.285 0.178 0.289   | 0.681 0.380 0.173 0.349   | 0.698 0.500 0.266 0.358   | 0.680 0.380 0.210 0.420
Q4        0.764 0.448 0.207 0.409   | 0.805 0.466 0.239 0.507   | 0.714 0.157 0.093 0.218   | 0.648 0.290 0.163 0.304
Q5        0.665 0.428 0.255 0.184   | 0.619 0.312 0.222 0.259   | 0.622 0.350 0.260 0.137   | 0.634 0.297 0.101 0.271
Q6        0.778 0.458 0.368 0.479   | 0.737 0.333 0.170 0.384   | 0.774 0.533 0.294 0.242   | 0.787 0.500 0.307 0.436
Q7        0.758 0.350 0.191 0.373   | 0.733 0.256 0.161 0.361   | 0.631 0.413 0.254 0.306   | 0.821 0.435 0.242 0.454
Q8        0.889 0.679 0.543 0.769   | 0.791 0.411 0.235 0.440   | 0.858 0.529 0.461 0.673   | 0.926 0.647 0.156 0.405
Q9        0.705 0.269 0.202 0.196   | 0.725 0.242 0.107 0.259   | 0.746 0.230 0.163 0.220   | 0.691 0.222 0.175 0.300
Q10       0.731 0.420 0.264 0.423   | 0.728 0.333 0.244 0.402   | 0.708 0.588 0.299 0.266   | 0.706 0.441 0.210 0.437
Average   0.740 0.464 0.301 0.408   | 0.727 0.403 0.222 0.380   | 0.676 0.415 0.273 0.328   | 0.734 0.466 0.239 0.393