<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Automated Fact-checking based on Large Language Models: An application for the press</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Bogdan</forename><forename type="middle">Andrei</forename><surname>Baltes</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yudith</forename><surname>Cardinale</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benjamín</forename><surname>Arroquia-Cuadros</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Automated Fact-checking based on Large Language Models: An application for the press</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">1A0A4DC137D788D87E22B9A3E7741069</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Automated Fact-Checking</term>
					<term>Large Language Models</term>
					<term>Artificial Intelligence</term>
					<term>Ethics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The current proliferation of digital media for the dissemination of news offers advantages given the ease of access, but also poses challenges, as the different sources are not necessarily reliable or fully consistent with each other. Existing solutions for contrasting information include knowledge bases of previously verified claims, which often lack up-to-date information or insightful details. In this context, we propose a framework for enhancing information retrieval from the press, making information more digestible with the ultimate goal of reducing misinformation. The proposed framework, at the intersection of automated fact-checking, AI-based reasoning, and ethics, consists of a tool that combines information from several sources and allows users to verify a claim against that information as a knowledge base. The work explores the reasoning capabilities of Large Language Models (LLM) as a new way of automating fact-checking, creating a flexible and dynamic solution. The framework returns a verdict on the claim, as well as a justification and references, building trust with users. The performance is rigorously evaluated, with the top-performing models achieving around 70% accuracy in claim classification and justification production. Equally important, the work studies the ethical challenges of building a framework that changes the way society consumes information from the press. The underlying ethics of the project are discussed from the perspectives of both end users and publishing companies, offering guidance for large-scale implementation of the framework. This research also poses open challenges, mainly regarding the capabilities of current and future LLM and the dynamics of commercial partnerships with publishing companies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Disinformation and fake news have been combatted through the manual work of journalists at traditional media and fact-checking outlets <ref type="bibr" target="#b0">[1]</ref>. Tasks in fact-checking procedures include contacting the original source by phone or e-mail, consulting alternative sources, and writing, rating, and publishing the claim <ref type="bibr" target="#b1">[2]</ref>. While this workflow is complete and consistent, the workforce is often insufficient to monitor every piece of published information, so it frequently falls to users to verify whether something they read or heard is true or false <ref type="bibr" target="#b2">[3]</ref>.</p><p>Whether or not false information is spread deliberately, there is often a variety of sources that are not fully consistent with each other. It is generally not feasible for a person to read the same news in many different media to find complementary or contradictory information and get the full picture. This is why fact-checking is needed. In addition to the manual efforts of journalists, automated fact-checking (AFC) techniques are being developed, mostly by nonprofit fact-checking entities <ref type="bibr" target="#b3">[4]</ref>. AFC has traditionally been limited by its sensitivity to context, which impedes the full automation of fact-checking systems and requires human supervision. Another direction AFC has taken is that of identifying claims and constructing a database of verified claims <ref type="bibr" target="#b4">[5]</ref>, which is useful for assisting fact-checkers, although within a static context.</p><p>To make information more digestible, in this work we present a framework that processes information from different sources, answers the user's original question, and indicates where the information comes from, while taking into account the context given to the system. 
An important aspect is that the user of the system is fully aware of the contents of its context and can check it if necessary, which adds trustworthiness. The aim is to help users get a broader perspective on the news from different sources in order to fight misinformation. This is pursued through the creation of a system that compiles information from the press to verify claims against that knowledge base, providing a reasoned answer and referencing the sources employed, supported by AI-based reasoning and ethics.</p><p>In this work, we explore a new approach to automating fact-checking: through the reasoning capabilities of Large Language Models (LLM). We carry out a complete implementation of the framework, from a knowledge base built from Spanish news media to the interface where end users interact with the system. The functioning of the proposed system is tested from a technical and functional perspective, through a rigorous evaluation achieving around 70% accuracy in classification and justification production, but also from an ethical standpoint, studying the underlying ethics of the change people would undergo if the framework were implemented at a large scale and the way media is consumed were modified.</p><p>This research revolves around solving problems derived from misinformation and disinformation. The former is defined as "false or inaccurate information", while the latter adds the notion of "false or misleading information peddled deliberately to deceive, often in pursuit of an objective" <ref type="bibr" target="#b5">[6]</ref>. In particular, this system is intended for journalists as primary end users. Journalists at fact-checking agencies continuously track claims made by politicians, evaluate their veracity, and publish the results for the general public.</p><p>The rest of the paper is organized as follows. 
Section 2 describes recent studies on automated fact-checking and reasoning capabilities of LLM. The proposed Assisted Fact-Checking Framework is presented in Section 3 and the obtained results are discussed in Section 4. The ethics of the proposed work are studied in Section 5 and we present the conclusions in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, studies related to fact-checking and the reasoning capabilities of LLM are described.</p><p>Fact-checking, in its simplest form, is the practice of verifying whether a claim is true or not. It has, of course, more complex definitions, given that a claim can be technically true but written in a misleading way, or only partially true. The most common workflow in fact-checking is to search through multiple sources that can be used to verify the veracity of the claim, assess their reliability, and make a decision on the original claim based on the evidence found in the sources <ref type="bibr" target="#b6">[7]</ref>.</p><p>Traditionally, it consists of manual work carried out by journalists, whether to fact-check claims published in other agencies' work or to assess the correctness of pieces before they are published <ref type="bibr" target="#b1">[2]</ref>.</p><p>The need to automate fact-checking processes arises from the inability of journalists to verify everything they publish, since this manual work can take up to several days <ref type="bibr" target="#b1">[2]</ref>. Sources are not always accessible in the literature or on the Internet. While there are official databases and reports, such as those of the National Institute of Statistics (INE) in Spain, there is also manual work to be done when information must be verified by directly calling an institution, such as the Government or the Police.</p><p>In recent years, the AI community has dedicated efforts to discussing AFC. 
The most widely accepted structure for this automation was proposed by Vlachos and Riedel <ref type="bibr">[5] [8]</ref>: a sequential process that starts by identifying the claims that need to be checked, looks through sources for the evidence needed to support or refute them, and makes a decision considering the given evidence.</p><p>There are, however, two issues with the knowledge commonly available to most approaches found in the literature <ref type="bibr" target="#b3">[4]</ref>: not all available information is trustworthy, and not all needed information is available.</p><p>To overcome these problems, researchers commonly assume that the information included in the employed sources is correct and that the evidence is the information that can be retrieved from them. As the evidence is assumed to be correct, veracity is defined as the coherence between the claim and the evidence.</p><p>This common structure for automated fact-checking can, and should, be adapted to the needs of its end users (mostly journalists). In the research of these systems, there has sometimes been a lack of collaboration between researchers and journalists <ref type="bibr" target="#b8">[9]</ref>. Better collaboration could solve some of the issues that AFC systems present, although not all of them can be solved technically. Furthermore, advances in the field of Generative Artificial Intelligence, and specifically LLM, are contributing to the transition from simpler natural language processing (NLP) techniques to the usage of the reasoning abilities of more complex models. There have been attempts to integrate LLM in the whole automated fact-checking framework, using them to detect claims, retrieve evidence, and finally predict a verdict and build a conclusion <ref type="bibr" target="#b9">[10]</ref>. 
However, the results obtained are inferior to those of state-of-the-art models on datasets like FEVER <ref type="bibr" target="#b10">[11]</ref> and WiCE <ref type="bibr" target="#b11">[12]</ref>, and further research is encouraged.</p><p>To better understand the purpose of research on automated fact-checking, a study has identified eight main intended use cases of automated fact-checking <ref type="bibr" target="#b12">[13]</ref>. The study analysed 100 highly cited papers, with publication dates ranging from 1998 to 2023, most of them from the 2010s. These use cases are listed below, with the percentage of the 100 papers in which each is pursued: Automated external fact-checking (22%), Assisted external fact-checking (18%), Assisted media consumption (8%), Scientific curiosity (8%), Assisted knowledge curation (7%), Assisted internal fact-checking (4%), Automated content moderation (4%), Truth-telling for law enforcement (1%).</p><p>On the other hand, LLM are AI systems that can process and generate text to solve a variety of tasks, such as summarisation, translation, and question answering <ref type="bibr" target="#b13">[14]</ref>.</p><p>These systems have gained significant popularity over recent years. One of the main reasons for this rise was the introduction of the Transformer architecture <ref type="bibr" target="#b14">[15]</ref>. This technical breakthrough, along with ever-growing data collection and generation for training, as well as larger computational capabilities, triggered a large wave of more capable language models. The paradigm of their creation shifted from task-specific to task-agnostic training, allowing models to perform a wider range of tasks <ref type="bibr" target="#b15">[16]</ref>.</p><p>One of the desired capabilities of LLM is reasoning, a cognitive process defined as thinking about something in order to reach a decision. 
At the intersection of psychology, philosophy, and computer science, it is a process that helps individuals solve problems and make decisions <ref type="bibr" target="#b16">[17]</ref>.</p><p>Language models perform well on specific reasoning tasks, although there is no general agreement on whether or not they have the ability to reason <ref type="bibr" target="#b16">[17]</ref>. It has been demonstrated, however, that these models' ability to reason improves considerably with their parameter count. Accordingly, recently released LLM with over 100 billion parameters are better at reasoning <ref type="bibr" target="#b17">[18]</ref>.</p><p>Performance on reasoning tasks, however, is not only a matter of parameter count. It can be heavily improved through multiple methods, which are commonly classified as <ref type="bibr" target="#b18">[19]</ref>:</p><p>• Strategy Enhanced Reasoning. As LLM usually contain implicit knowledge for reasoning from their pretraining <ref type="bibr" target="#b19">[20]</ref>, the focus of these methods is how to take advantage of this knowledge.</p><p>The main research area is prompt engineering, which defines how to construct the questions that are fed to the models. It can be single-stage or multi-stage, the latter emulating human reasoning by decomposing a complex problem and reasoning stage by stage. Both cases are also improved by the Chain-of-Thought (CoT) method <ref type="bibr" target="#b20">[21]</ref>, which generates a series of intermediate reasoning steps by providing demonstrations of the thought process inside the prompt. Other efforts towards Strategy Enhanced Reasoning include Process Optimization <ref type="bibr" target="#b21">[22]</ref> and External Engines <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24]</ref>. • Knowledge Enhanced Reasoning. These methods focus on how to use both implicit and explicit knowledge to assist the model in reasoning. 
Regarding implicit knowledge, there has been work to take advantage of the knowledge contained in LLM to generate more knowledge and refine results <ref type="bibr" target="#b24">[25]</ref>. As for explicit knowledge, efforts have been directed towards reducing hallucinations (the invention of incorrect facts) <ref type="bibr" target="#b25">[26]</ref> and improving information retrieval from external files <ref type="bibr" target="#b26">[27]</ref>.</p><p>It is worth noting that better reasoning does not necessarily come from more parameters. Recent research also focuses on smaller models, which are easier to use in production environments, using explanations from bigger LLM to become better reasoners <ref type="bibr">[28] [29]</ref>.</p></div>
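As a minimal sketch of the single-stage Chain-of-Thought prompting idea discussed above, the following Python snippet assembles a prompt containing one worked demonstration. The prompt wording, claim, and demonstration are hypothetical illustrations, not the exact prompts used in this work:

```python
def build_cot_prompt(claim: str, context: str) -> str:
    """Assemble a single-stage prompt with one Chain-of-Thought
    demonstration, so the model imitates the step-by-step reasoning."""
    # Hypothetical worked example showing the desired reasoning format.
    demonstration = (
        "Claim: The unemployment rate fell in 2023.\n"
        "Reasoning: Source A reports a rate of 11.8% in December 2023, "
        "down from 12.9% in December 2022; the claim matches the evidence.\n"
        "Verdict: Supported\n"
    )
    return (
        "You are a fact-checker. Use ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        "Example of the expected reasoning:\n"
        f"{demonstration}\n"
        f"Claim: {claim}\nReasoning:"
    )

prompt = build_cot_prompt("Employment grew in 2023.", "Source A: ...")
```

In a multi-stage variant, the reasoning produced by one such call would itself be fed into a follow-up prompt, decomposing the problem stage by stage.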
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Assisted Fact-Checking Framework</head><p>The proposed framework aims to improve the retrieval and consumption of information from the press, in an attempt to enhance fact-checking processes and reduce misinformation through machine learning techniques. Hence, the selected narrative of this work is that of assisted media consumption or assisted fact-checking. As seen in the literature review, these use cases account for around 30% of the intended uses of automated fact-checking tools <ref type="bibr" target="#b12">[13]</ref>. It is important to keep human interaction in the process rather than fully automating it, since some steps in fact-checking sometimes need to be done manually (e.g., calling an official source at a Ministry to verify a fact, which cannot be done online).</p><p>The proposed framework, illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, starts by building a knowledge base of digital media according to the interests of users: the sources and categories of articles of their choice. This flexibility makes the framework versatile, as it can be used for any type of fact-checking, with information from the press, official documents, or any private document base. For the implementation of this work, several pieces on unemployment from Spanish digital media were used as a knowledge base. We recommend that, whenever this framework is used for political fact-checking, the knowledge base consist of a choice of media agencies with different types of audiences <ref type="bibr" target="#b29">[30]</ref>, to make sure the contents are diverse and can complement or contrast each other.</p><p>A machine learning model, specifically an LLM in this case, is used by a data actor (mostly journalists) to verify an input, with the knowledge base as context. 
The output of the proposed framework consists of a classification of the given input, as well as a justification with citations to the sources supporting or refuting it. Prompt engineering and retrieval techniques are used to control the behaviour of the language model, in an effort to restrict its answers to the given knowledge base, preventing it from hallucinating and giving false information to the user <ref type="bibr" target="#b30">[31]</ref>.</p><p>As for the evaluation of the framework's performance, traditional benchmarks are not useful, since the accuracy of the responses does not reflect commonsense reasoning capabilities but depends on specific information from the context. Moreover, human evaluation has shown reproducibility limitations and instability on NLP tasks <ref type="bibr" target="#b31">[32]</ref>. Hence, the approach shifts from traditional benchmarks to an evaluation inspired by the LLM-as-a-judge method <ref type="bibr" target="#b32">[33]</ref>. In this case, given a fixed knowledge base, an LLM creates a series of potential claims from its context, together with their classification (supported by the context, refuted by it, or with no information) and their justification. This serves as an evaluation dataset that is manually revised and then used to extract performance metrics from the behaviour of the framework, with the same context and several combinations of prompts.</p><p>The implementation of the framework was done through LangChain, an open-source framework for developing applications with LLM. Through components of this framework, LLM from OpenAI (gpt-3.5-turbo-1106, gpt-4-0125-preview<ref type="foot" target="#foot_0">1</ref> ) and Cohere (Command<ref type="foot" target="#foot_1">2</ref> ) were integrated. 
The embeddings used for this work were also from OpenAI (text-embedding-ada-002), and the LLM were used through API calls to providers offering them free of charge or at limited cost. For the evaluation, the data consist of several pieces on unemployment from Spanish digital media, in the Spanish language, from the following sites: El Plural, ABC, El Mundo, and Okdiario. These data were stored in an open-source vector database, Faiss. Lastly, the results of the framework were shown through a Gradio interface.</p><p>Several techniques were combined to improve the prompt composition <ref type="bibr" target="#b33">[34]</ref>:</p><p>• Specifying the role: "You are a fact-checker."</p><p>• Explicitly asking to only use knowledge from the provided context. • Explaining the format of the desired output: verdict, justification, and passages of each of the sources either supporting or refuting the given claim. • Chain-of-Thought <ref type="bibr" target="#b20">[21]</ref>: providing an example of how a claim can be verified.</p><p>The workflow of the implementation is depicted in Figure <ref type="figure">2</ref>, through the diverse parts of the described process, leading to the different parts of the output being generated.</p><p>The model is invoked twice. In the first call, the prompt only contains instructions about the justification. Specifically, the model is told that it is a fact-checker that can only base its answers on the provided context. Then, each of the categories that the system needs to classify the claim into is described; however, the model is only asked to provide the justification. It is also given an example of how it should work. Finally, the Markdown format to return is specified, and the model is once again reminded not to produce anything that is not in the given context.</p><p>Next, the model is invoked a second time. 
In this call, it is only asked to produce a classification based on the justification from the first call's response. The prompt once again explains the different values that the categories can take, and the output format (in Markdown) is specified.</p><p>As there is no single ground truth for the use case of this work, there is no standard machine learning evaluation method. However, besides the qualitative evaluation, which serves only to check whether some specific examples functioned correctly, an additional evaluation is needed to assess the general behaviour of the framework.</p><p>GPT-4 was used to generate an evaluation dataset. Iteratively, each of the media pieces was passed to the model with a prompt asking it to generate 40 claims from each news piece. The prompt also specified the categories a claim can have (Supported, Refuted, Half-supported, and No information) and a description of each of them, with the instruction to classify each generated claim as well. Additionally, it was specified that there should be an equal number of claims in each category, to avoid unbalanced classes. The output format was required to be a Python dictionary, in order to create a dataframe afterwards with three columns: claim, classification, and source (to keep traceability of where each claim was generated from).</p><p>After the generation of the evaluation dataset, each model (gpt-3.5-turbo-1106, gpt-4-0125-preview, Cohere Command) was invoked with the same prompt used in the final implementation, feeding it each generated claim of the evaluation dataset, one at a time. A loop was thus created, iterating over the 160 claims generated for evaluation in total. The method then saved the results of each invocation, appending to each row the verdict (or classification), the justification, and the generated references. 
The results were afterwards opened in a Google Sheets file, where the automatic classification grading was done. A Google Sheets function was implemented to compare the classification column created as part of the evaluation dataset with the one extracted from the output. This results, for each row, in a score of 0 or 1, the latter being an exact match.</p><p>Moving to the manual stage, each of the 480 rows generated in total (160 per model) was manually revised to find incorrect classifications in the evaluation dataset, changing the ground truth to an adjusted, correct version, and to find incorrect classifications by each model that could be accepted upon revision if the justification supported them. The accuracy metric is computed by summing the adjusted classification column and dividing by the number of rows evaluated (initially, 160). This returns a result between 0 and 1, later presented as a percentage.</p><p>The final part of the evaluation is the justification grading. Each justification was graded with an answer correctness metric, assigned a score from 0 to 5. Score 0 corresponds to an output where none of the justification is correct, or where the claim is classified as "No information" even though relevant information exists in the context to provide a classification, while Score 5 is a justification that is entirely correct.</p><p>Justification grading was done manually following the criteria above. To avoid inconsistencies, two rounds of grading were conducted on different days, shuffling the order of the claims in the evaluation dataset. Afterwards, the scores of both rounds were compared and, in cases where they varied, a final decision was made. 
This methodology added rigour to an otherwise potentially subjective evaluation process.</p><p>After the grading, each score is divided by 5 so that it lies between 0 and 1. The normalised justification grades are then summed and divided by the number of rows being evaluated (once again, initially, 160), so the resulting answer correctness metric is also a value between 0 and 1, later presented as a percentage. It was deemed important to use both metrics, since each evaluates an important function of the framework that might be used independently.</p></div>
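The two-metric scoring described above can be sketched as follows. The labels and grades are toy illustrations (in the actual work the grading was compiled in Google Sheets over 160 rows per model):

```python
def evaluation_metrics(expected, predicted, justification_grades):
    """Compute the two metrics described above:
    - classification accuracy: fraction of exact label matches (0/1 per row)
    - answer correctness: mean of the 0-5 justification grades, normalised by 5
    """
    assert len(expected) == len(predicted) == len(justification_grades)
    n = len(expected)
    # Score 1 for an exact match, 0 otherwise, then average.
    accuracy = sum(1 for e, p in zip(expected, predicted) if e == p) / n
    # Normalise each manual grade to [0, 1], then average.
    correctness = sum(g / 5 for g in justification_grades) / n
    return accuracy, correctness

acc, corr = evaluation_metrics(
    ["Supported", "Refuted", "No information", "Half-supported"],
    ["Supported", "Refuted", "Supported", "Half-supported"],
    [5, 4, 0, 5],
)
# acc = 0.75, corr = 0.7
```

Both results are then presented as percentages, as in Section 4.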
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>The final results are shown in Table <ref type="table" target="#tab_1">1</ref>. The model with the best average performance across the classification and justification metrics is GPT-3.5, scoring over 70% in both. Perhaps counter-intuitively, GPT-4 achieves significantly lower performance in claim classification, although it achieves the top result in justification. Lastly, the justification score of Cohere Command could not be calculated, since it had issues with the justification language and format.</p><p>The evaluation dataset initially consisted of 160 claims and underwent a manual revision of the quality of the generated claims and their classification by the judge model. As described in Section 3, this dataset was then used to extract performance metrics from the behaviour of the framework.</p><p>As seen from the evaluation results, GPT-3.5 is the best-performing model among the three supported. Its strong points are the best scores in classification and a close second place in justification production. Moreover, out of the 114 claims that were correctly classified, 84 (73.68%) also got the best score for their justification.</p><p>Furthermore, there has been exactly one case of a claim that was incorrectly classified but got the maximum score for its justification. It is, in fact, similar to one of the cases mentioned among the incorrect classifications from the evaluation dataset, and it has to do with double negations and antonyms. More precisely, the claim was about the unemployment rate improving with respect to 2022. It is probable that the justification was correctly created, but the model was confused by the concept of improvement for terms like the unemployment rate: the rate improved for society as it declined, but the model seems to have understood an increase in the rate as an improvement. 
This example showcases the importance of considering both the classification and the justification when using the proposed framework.</p><p>Additionally, 17 claims were rated with a score of 4 in justification production instead of the maximum of 5. Most of these claims lost the final point due to inexact quantities or approximations. The output of the framework is almost correct in terms of justification, but it has been observed that, in some cases, the model compares several mentioned quantities as if they were completely different values rather than mere approximations of each other. Prompting the model to handle approximations in a specific manner was tried in earlier implementations during the experimental phase, although it did not have the expected result. If the behaviour of the framework when dealing with approximations improved and the results rated with a 4 were given a score of 5 instead, the results produced by GPT-3.5 would improve by an additional 2.14% in justification production, achieving a score of 72.96%. It is worth noting, however, that there are cases where approximations are correctly handled. For instance, a claim stating that there were 20 million employed people by the end of 2023 was correctly classified as "refuted" by the framework, since the correct number is 21.24 million and 20 million is not a valid approximation. Figure <ref type="figure" target="#fig_1">3</ref> shows an example of the functioning of the framework powered by GPT-3.5. The user input claims that the number of unemployed people increased in Spain in 2023. The answer given by the system classifies the claim as "Refuted", which is correct, as three of the four sources in the knowledge base support the contrary, whereas the fourth source contains no relevant information. The justification is an accurate summary of the reasons why the system refutes the claim. 
Moreover, the references created by the system are also correct.</p><p>GPT-4 achieved just short of 60% in the classification score, and more than 73% in justification production, surpassing GPT-3.5. These classification results are lower than expected, as both models are from OpenAI but GPT-4 is a newer, bigger model with better results than GPT-3.5 on most benchmarks.</p><p>The contrast between its two scores highlights its lower classification capability in the current implementation. There are six claims classified as "No information" whose justifications received the highest possible score, with a faithful and complete reasoning on the claim. It is unknown why this behaviour occurs, since the methodology was changed in the final implementation: instead of allowing the model to decide the classification of a claim directly in the first invocation, it had to do so in a second call, based only on the justification it had previously created. These cases should therefore be reduced with this implementation. Additionally, a recurring result arose for 45 claims that prevented obtaining both a correct classification and a justification. After invoking the prompts, the output was "Understood, I am ready to start. Please, provide a claim and a justification" on 31 occasions, with another 14 cases returning "Verdict: No information. Justification: [Input given as justification]". After the first run of the evaluation, this had looked like an execution error, so the claims that led to this result were executed again. However, the same result was returned, and no explanation was found for this behaviour, which indicates an instability in the responses of GPT-4 with these prompts.</p><p>It is observed from the evaluation that the different models have varied performance levels. The best-performing model, GPT-3.5, achieves a score higher than 70% in both of the evaluated metrics. 
In this configuration, the framework can be considered reliable when used as a tool for assisted fact-checking or media consumption in a setting where a human checks the outputs, rather than in a fully automated environment.</p><p>One of the requirements for the framework was explainability, which is achieved mainly through the creation of references for each output. This is ensured through the similarity search that matches the input claim against the pieces of news in the knowledge base. As the context is created through this more controlled procedure, rather than left to the LLM to reason about, it is considered more reliable.</p><p>The creation of references is one of the strong points towards the trustworthiness of the proposed framework. However, although there are procedures to avoid hallucinations or the invention of unrelated information in LLM output, these are not always completely effective. It is therefore important to disclose to final users that the framework, when used for fact-checking or assisted media consumption, can occasionally produce such outputs.</p><p>Studies suggest that LLM trained in English at a very large scale can achieve almost as good results in other languages, although there is still room for improvement <ref type="bibr" target="#b34">[35]</ref>. This might also be the case for this research work: performance may have been lower because the prompts and input data were designed to be used in Spanish. For instance, the confusion at reasoning might have been avoided in English. 
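The similarity search used to assemble the context and references can be sketched as follows. This is an illustrative outline, not the implementation used in this work; in practice the vectors would come from a sentence-embedding model rather than being supplied directly:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_context(claim_vec, kb, top_k=3):
    """kb: list of (source_name, passage_text, passage_vec) tuples.
    Returns the top_k passages most similar to the claim, keeping each
    passage's source so the final answer can cite its references."""
    ranked = sorted(kb, key=lambda item: cosine(claim_vec, item[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:top_k]]
```

Because each retrieved passage carries its source, the references in the output can be produced deterministically from the retrieval step instead of being generated by the LLM.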
However, the capabilities of language models are increasing at considerable speed, so language should not be an impediment to the adoption of the proposed framework as a solution for assisted fact-checking or media consumption.</p><p>Furthermore, given that the different models at times produce contradictory verdicts and reasoning, reliability could be increased through majority voting, or through weighted majority voting based on the evaluated performance of each model. This technique is well established in traditional machine learning, for example in Random Forests, where each tree has a vote and the final decision is the answer with the most votes.</p><p>All in all, the proposed framework for fact-checking with information from the press provides a reliable solution for automating work that was traditionally manual for journalists, and opens new possibilities for non-professionals to consume more contrasted information from the news.</p></div>
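The weighted majority voting proposed above could be sketched as follows; the model names and weights here are hypothetical, with the weights standing in for each model's evaluated classification score:

```python
from collections import defaultdict

def weighted_vote(verdicts, weights):
    """verdicts: {model_name: verdict}; weights: {model_name: weight}.
    Returns the verdict whose supporting models have the highest total
    weight; with equal weights this reduces to plain majority voting."""
    totals = defaultdict(float)
    for model, verdict in verdicts.items():
        totals[verdict] += weights.get(model, 1.0)
    return max(totals, key=totals.get)
```

A deployment could, for instance, weight each model by its classification score from Table 1, so that agreement between weaker models can still outvote a single stronger model.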
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Ethical Considerations</head><p>The proposed framework aims to enhance the way information is consumed from the press, either for assisted fact-checking or for a new form of media consumption. A new form of consuming information can have a considerable impact on society if it becomes established. Therefore, it is important to assess the ethical considerations of the proposed framework beyond its implementation.</p><p>The Ethics Guidelines for Trustworthy AI, developed by the European Commission in 2019, offer a systematic manner to assess the ethical considerations of the proposed framework through a set of requirements that any trustworthy AI system should meet <ref type="bibr" target="#b35">[36]</ref>. The system proposed in this work can be systematically assessed in terms of ethics considering the following requirements from the guidelines:</p><p>• Human agency and oversight: In terms of human agency and autonomy, it is vital to stress that the proposed framework relies on information from the press, which can never fully guarantee that this information is factual. Therefore, over-reliance shall be avoided. As for oversight, the framework is not considered an autonomous system, as it needs to be overseen by a Human-in-the-Loop. • Technical robustness and safety: A low level of accuracy could create undesired results from the system. However, as there would be human supervision and reference checking, the consequences should not be damaging. The accuracy levels achieved by the final implementation would need to be properly communicated to end-users so they can acknowledge the behaviour and limitations of the system and align their expectations. 
Finally, regarding the training data and assumptions the LLM were trained on, no adversarial effects were observed during the experiments, mostly due to the explicit prompting to follow instructions and rely only on the data given as context. • Privacy and data governance: The framework does not use any personal data and relies only on publicly available data. • Transparency: Three main elements constitute transparency as a requirement.</p><p>First, traceability is important to track; for this system, this covers the versions of the models used, the prompts with which the models are invoked, and the data that make up the knowledge base. Next, explainability is vital for building trust in the AI system, and this is achieved in this framework through the creation of a justification and the references used for that purpose. The last element of transparency is communication. It is clearly communicated that the framework is an AI system and not a human, as well as its benefits, limitations, potential risks, level of accuracy, and error rates. For the sake of transparency, it is also recommended to use open-source models in an industrial implementation of this framework. This research project has only considered closed-source LLM, as there was no budget allocated for hardware or API usage. However, transparency would be improved through the usage of open-source models that present performance comparable to GPT-3.5 and GPT-4, such as the Llama 2 models from Meta<ref type="foot" target="#foot_2">3</ref> and Mixtral from Mistral AI<ref type="foot" target="#foot_3">4</ref>. • Diversity, non-discrimination, and fairness: The intention when implementing this work is always to avoid unfair bias. This has been done by carefully crafting the prompts to leave out expressions that could leave room for subjectivity. However, it is necessary to include a disclaimer about biases that could already exist. 
These biases can exist either in the data from the press -as the system needs to be faithful to that information -or in the training data of the LLM, although the latter is less common, as the instructions explicitly forbid using data from the training. Moreover, as it is advisable to consult stakeholders affected by the AI system throughout the whole life cycle, experts from a Spanish fact-checking start-up were consulted for functional feedback on the framework. This was done to ensure that the system's design and development took into account the actual needs of the professionals who could benefit from it. • Societal and environmental well-being: The implementation of the proposed framework, if used at a large scale, could have an impact on human work and society at large. It has the potential to change some aspects of journalism, specifically fact-checking, as evidence retrieval would be faster and journalists and fact-checkers could invest the time saved in other tasks. On the other hand, the usage of this framework for assisted fact-checking could favourably impact society at large by reducing misinformation and disinformation. However, it would also pose a challenge: learning a new way to digest information, as facts and claims would reach people with a justification already created, and critical thinking could decrease. • Accountability: The functioning of the framework is documented and it can be externally audited.</p><p>Moreover, the limitations and data sources of the system are well communicated to end-users, since the framework ultimately verifies whether a claim is supported by the provided context, not whether it is factually true or false. 
Therefore, the responsibility for the accuracy of the information falls on the data sources.</p><p>Through this assessment of requirements, it can be concluded that the proposed framework can be considered a responsible application of AI.</p><p>The other ethical aspect that needs to be studied prior to any deployment of the proposed framework in a production environment is where the data come from: What should the process to collect news pieces look like? How should the digital outlets be picked?</p><p>The New York Times, one of the longest-running newspapers in the United States, sued OpenAI in December 2023 over content created by ChatGPT <ref type="bibr" target="#b36">[37]</ref>. The lawsuit described several issues with this content that the newspaper claimed were hurting the brand and functioning of the New York Times. The two most pressing issues were the regurgitation of full New York Times articles that ChatGPT would perform if prompted in a certain way, and its hallucinations, where false or inaccurate information was created and then attributed to the New York Times. Both of these problems can affect the image and the finances of the publisher.</p><p>The proposed framework pursues a new way of consuming information, joining several sources to provide a complete picture of the context of a claim users wish to consult. This is always done crediting the original authors and displaying each source's position on the claim to be checked. Although credit is important, it might not be enough. It is important to know where the information comes from and to provide a way to consult the source data directly in case it is needed. 
However, if the framework is adopted at a large scale, it could reduce the traffic on the information sources' websites.</p><p>For all of this, it is considered that the framework, rather than relying on direct web-scraping, should be proposed as a collaboration with diverse digital newspapers or media outlets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>In this work, we have studied the interconnection between AI, journalism, and ethics in an attempt to create a framework whose ultimate goal is to reduce misinformation in society. The proposed framework is powered by the reasoning capabilities of LLM, allowing users to contrast a given claim with a previously built knowledge base based on sources of interest. The claim is classified according to its alignment with the knowledge base, and a justification and references are also returned.</p><p>The functioning of the framework is evaluated with a mix of automated and manual techniques, ultimately returning a percentage of classification and justification accuracy. The best-performing model of the several that have been studied -GPT-3.5 -scored over 70% in both metrics.</p><p>It is important to consider all parts of the output -verdict, justification, and references -when using the system, since there are cases where only some of the parts are correct. However, even with this limitation, the tool can serve as a companion for professionals and non-professionals when consuming information. The proposed framework can reliably be used as a tool for assisted fact-checking or assisted media consumption.</p><p>Overall, the broad implication of the present research work is that it is possible to use an AI-based framework to enhance the retrieval of information from the press in a responsible manner, showing that AI may be considered a promising companion tool for journalists and non-professionals wanting to contrast information.</p><p>We are currently working on testing new state-of-the-art Large Language Models, since they have the potential to improve the current performance. 
New models are getting rolled out at a very fast pace.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Framework for Assisted Fact-Checking from the Press.</figDesc><graphic coords="5,143.14,65.60,309.00,405.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Output from implementation.</figDesc><graphic coords="9,86.03,65.60,423.23,287.89" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Quantitative results from evaluation.</figDesc><table><row><cell>Model</cell><cell>Classification Score (%)</cell><cell>Justification Score (%)</cell></row><row><cell>GPT-3.5</cell><cell>71.70</cell><cell>70.82</cell></row><row><cell>GPT-4</cell><cell>63.52</cell><cell>73.58</cell></row><row><cell>Cohere Command</cell><cell>44.65</cell><cell>-</cell></row></table><note>generated by GPT-4, creating balanced categories of classifications (Supported, Refuted, Half-supported and No information) from each of the four sources (El Plural, ABC, El Mundo, and Okdiario).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://platform.openai.com/docs/models/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://docs.cohere.com/docs/models</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://llama.meta.com/llama2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://mistral.ai/technology/#models</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Benchmark scores are improving as well. We are focusing on open-source releases: there are models already surpassing GPT-3.5 thanks to the support of the open-source community, so they would be worth testing within the scope of the framework. Only minimal code modifications are needed, since the framework is designed to support any LLM.</p><p>We also aim to modify the current methods to increase accuracy and faithfulness. Several methods might increase the performance of the current framework without requiring more powerful LLM. One of them could be combining the answers of several models, either through weighted majority voting based on their evaluated performance, or through a third call that gives the outputs of the models as a new context and lets a model decide based on that information.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation was carried out with a mixed approach: automated and manual. The automated stages were the evaluation dataset creation and the initial classification grading. On the other hand, the classification adjustment and the justification grading were done manually.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Why do fact-checking organizations go beyond fact-checking? a leap toward media and information literacy education</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Çömlekçi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Communication</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page">21</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bringing journalism back to its roots: examining fact-checking practices, methods, and challenges in the Mediterranean context</title>
		<author>
			<persName><forename type="first">V</forename><surname>Moreno-Gil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ramon-Vegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mauri-Ríos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Profesional de la información</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Debunking false information: investigating journalists&apos; factchecking skills</title>
		<author>
			<persName><forename type="first">M</forename><surname>Himma-Kadakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ojamets</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital journalism</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="866" to="887" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Graves</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:13750196" />
		<title level="m">Understanding the Promise and Limits of Automated Fact-Checking</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>Reuters Institute for the Study of Journalism, University of Oxford</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Survey on Automated Fact-Checking</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="178" to="206" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://www.thefire.org/research-learn/misinformation-versus-disinformation-explained" />
		<title level="m">Misinformation versus disinformation, explained | The Foundation for Individual Rights and Expression</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Borel</surname></persName>
		</author>
		<title level="m">The Chicago Guide to Fact-Checking</title>
				<imprint>
			<publisher>University of Chicago Press</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fact Checking: Task definition and dataset construction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Language Technologies and Computational Social Science</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="18" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automated Fact-Checking to Support Professional Practices: Systematic Literature Review and Meta-Analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dierickx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-G</forename><surname>Lindén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Opdahl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Communication</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page">21</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14623</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">FEVER: a large-scale dataset for fact extraction and VERification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="809" to="819" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">WiCE: Real-world entailment for claims in Wikipedia</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kamoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="7561" to="7583" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The Intended Uses of Automated Fact-Checking Artefacts: Why, How and Who</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ousidhoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="8618" to="8642" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">U</forename><surname>Hadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Qureshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Irfan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zafar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Akhtar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mirjalili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">A survey on large language models: Applications, challenges, limitations, and practical usage</title>
		<title level="s">Authorea Preprints</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">U</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Towards reasoning in large language models: A survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C.-C</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1049" to="1065" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2206.07682</idno>
		<title level="m">Emergent abilities of large language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Reasoning with Language Model Prompting: A Survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.09597</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Prompting contrastive explanations for commonsense reasoning tasks</title>
		<author>
			<persName><forename type="first">B</forename><surname>Paranjape</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4179" to="4192" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.11171</idno>
		<title level="m">Self-consistency improves chain of thought reasoning in language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Mind&apos;s Eye: Grounded Language Model Reasoning through Simulation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vosoughi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Language Models of Code are Few-Shot Commonsense Learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1384" to="1403" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Generated knowledge prompting for commonsense reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>West</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Le Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3154" to="3169" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Object Hallucination in Image Captioning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Hendricks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4035" to="4045" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive NLP tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">34th International Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.06726</idno>
		<title level="m">Explanations from large language models make small reasoners better</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jawahar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Palangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Awadallah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.02707</idno>
		<title level="m">Orca: Progressive learning from complex explanation traces of gpt-4</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">The ideology of media: Measuring the political leaning of Spanish news media through Twitter users&apos; interactions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Guerrero-Solé</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comunicación y sociedad = Communication &amp; Society</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="29" to="43" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.07567</idno>
		<title level="m">Retrieval augmentation reduces hallucination in conversation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">C.-H</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Lee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.01937</idno>
		<title level="m">Can large language models be an alternative to human evaluations?</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sandborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gilbert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elnashar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Spencer-Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Schmidt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.11382</idno>
		<title level="m">A prompt pattern catalog to enhance prompt engineering with chatgpt</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Armengol-Estapé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>De Gibert Bonet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Melero</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.13349</idno>
		<title level="m">On the multilingual capabilities of very large-scale english language models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<idno type="DOI">10.2759/346720</idno>
		<title level="m">Ethics guidelines for trustworthy AI</title>
		<imprint>
			<publisher>Publications Office of the European Union</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
		<respStmt>
			<orgName>Directorate-General for Communications Networks, Content and Technology (European Commission) and the High-Level Expert Group on Artificial Intelligence</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Helmore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Paul</surname></persName>
		</author>
		<ptr target="https://www.theguardian.com/media/2023/dec/27/new-york-times-openai-microsoft-lawsuit" />
		<title level="m">New York Times sues OpenAI and Microsoft for copyright infringement</title>
				<imprint>
			<publisher>The Guardian</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
