<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Automated Fact-checking based on Large Language Models: An application for the press</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Bogdan</forename><forename type="middle">Andrei</forename><surname>Baltes</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yudith</forename><surname>Cardinale</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benjamín</forename><surname>Arroquia-Cuadros</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Centro de Estudios en Ciencia de Datos e Inteligencia Artificial (ESenCIA)</orgName>
								<orgName type="institution">Valencian International University</orgName>
								<address>
									<addrLine>C/Pintor Sorolla 21</addrLine>
									<postCode>46002</postCode>
									<settlement>València</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Automated Fact-checking based on Large Language Models: An application for the press</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">1A0A4DC137D788D87E22B9A3E7741069</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Automated Fact-Checking</term>
					<term>Large Language Models</term>
					<term>Artificial Intelligence</term>
					<term>Ethics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The current proliferation of digital media for the dissemination of news offers advantages given the ease of access, but also poses challenges, as the different sources are not necessarily reliable or fully consistent with each other. Existing solutions for contrasting information include knowledge bases of previously verified claims, which often lack up-to-date information or insightful details. In this context, we propose a framework for enhancing information retrieval from the press, making information more digestible with the ultimate goal of reducing misinformation. The proposed framework, at the intersection of automated fact-checking, AI-based reasoning, and ethics, consists of a tool that combines information from several sources and allows users to verify a claim against that information as a knowledge base. The work explores the reasoning capabilities of Large Language Models (LLM) as a new way of automating fact-checking, creating a flexible and dynamic solution. The framework returns a verdict on the claim, as well as a justification and references, building trust with users. The performance is rigorously evaluated, with the top-performing models achieving around 70% accuracy in claim classification and justification production. Equally important, the work studies the ethical challenges of building a framework that changes the way society consumes information from the press. The underlying ethics of the project are discussed from the perspectives of both end users and publishing companies, offering guidance for large-scale implementation of the framework. This research also poses open challenges, mainly regarding the capabilities of current and future LLM and the dynamics of commercial partnerships with publishing companies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Disinformation and fake news have been combatted through the manual work of journalists at traditional media and fact-checking outlets <ref type="bibr" target="#b0">[1]</ref>. Tasks in fact-checking procedures include contacting the original source by phone or e-mail, consulting alternative sources, and writing, rating, and publishing the claim <ref type="bibr" target="#b1">[2]</ref>. While this workflow is complete and consistent, the workforce is often insufficient to monitor every piece of published information, so it frequently falls to users to verify whether something they read or heard is true or false <ref type="bibr" target="#b2">[3]</ref>.</p><p>Whether or not false information is spread deliberately, there is often a variety of sources that are not fully consistent with each other. It is generally not feasible for a person to read the same news in many different media to find complementary or contradictory information and get the full picture. This is why fact-checking is needed. In addition to the manual efforts of journalists, automated fact-checking (AFC) techniques are being developed, mostly by nonprofit fact-checking entities <ref type="bibr" target="#b3">[4]</ref>. AFC has traditionally been limited by its sensitivity to context, which impedes the full automation of fact-checking systems and requires human supervision. Another direction AFC has taken is that of identifying claims and constructing a database of verified claims <ref type="bibr" target="#b4">[5]</ref>, which is useful for assisting fact-checkers, although within a static context.</p><p>To make information more digestible, in this work we present a framework that processes information from different sources, answers the user's original question, and indicates where the information comes from, while taking into account the context given to the system. 
An important aspect is that the user of the system is fully aware of the contents of its context and can check it if necessary, which adds trustworthiness. The aim is to help users get a broader perspective on the news from different sources in order to fight misinformation. This is pursued through the creation of a system that compiles information from the press to verify claims against that knowledge base, providing a reasoned answer and referencing the sources employed, supported by AI-based reasoning and ethics.</p><p>In this work, we explore a new approach to automating fact-checking: through the reasoning capabilities of Large Language Models (LLM). We carry out a complete implementation of the framework, from a knowledge base built from Spanish news media to the interface where end users interact with the system. The functioning of the proposed system is tested from a technical and functional perspective, through a rigorous evaluation achieving around 70% accuracy in classification and justification production, but also from an ethical standpoint, studying the underlying ethics of the change people would undergo if the framework were implemented at a large scale and the way media is consumed were modified.</p><p>This research revolves around solving problems derived from misinformation and disinformation. The former is defined as "false or inaccurate information", while the latter adds the notion of "false or misleading information peddled deliberately to deceive, often in pursuit of an objective" <ref type="bibr" target="#b5">[6]</ref>. In particular, this system is intended for journalists as primary end users. Journalists at fact-checking agencies continuously track claims made by politicians, evaluate their veracity, and publish the results for the general public.</p><p>The rest of the paper is organized as follows. 
Section 2 describes recent studies on automated fact-checking and reasoning capabilities of LLM. The proposed Assisted Fact-Checking Framework is presented in Section 3 and the obtained results are discussed in Section 4. The ethics of the proposed work are studied in Section 5 and we present the conclusions in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, studies related to fact-checking and the reasoning capabilities of LLM are described.</p><p>Fact-checking, in its simplest form, is the practice of verifying whether a claim is true or not. It has, of course, more complex definitions, given that a claim can be technically true but written in a misleading way, or only partially true. The most common workflow in fact-checking is to search through multiple sources that can be used to verify the veracity of the claim, assess their reliability, and make a decision on the original claim based on the evidence found in the sources <ref type="bibr" target="#b6">[7]</ref>.</p><p>Traditionally, it consists of manual work carried out by journalists, whether to fact-check claims published in other agencies' work or to assess the correctness of pieces before they are published <ref type="bibr" target="#b1">[2]</ref>.</p><p>The need to automate fact-checking processes arises from the inability of journalists to verify everything they publish, since this manual work can take up to several days <ref type="bibr" target="#b1">[2]</ref>. Sources are not always accessible in the literature or on the Internet. While there are official databases and reports, such as those of the National Institute of Statistics (INE) in Spain, there is also manual work to be done when information must be verified by directly calling an institution, such as the Government or the Police.</p><p>In recent years, the AI community has dedicated efforts to discussing AFC. 
The most widely accepted structure for this automation was proposed by Vlachos and Riedel <ref type="bibr">[5] [8]</ref>: a sequential process that starts by identifying the claims that need to be checked, looks through sources for the evidence needed to support or refute them, and makes a decision considering the given evidence.</p><p>There are, however, two issues with the knowledge commonly available to most approaches found in the literature <ref type="bibr" target="#b3">[4]</ref>: not all available information is trustworthy, and not all needed information is available.</p><p>To overcome these problems, researchers commonly assume that the information included in the employed sources is correct and that the evidence is the information that can be retrieved from them. As the evidence is assumed to be correct, veracity is defined as the coherence between the claim and the evidence.</p><p>This common structure for automated fact-checking can, and should, be adapted to the needs of its end users (mostly journalists). In the research of these systems, there has sometimes been a lack of collaboration between researchers and journalists <ref type="bibr" target="#b8">[9]</ref>. Better collaboration could solve some of the issues that AFC systems present, although not all of them can be solved technically. Furthermore, advances in the field of Generative Artificial Intelligence, and specifically LLM, are contributing to the transition from simpler natural language processing (NLP) techniques to the usage of the reasoning abilities of more complex models. There have been attempts to integrate LLM in the whole automated fact-checking framework, using them to detect claims, retrieve evidence, and finally predict a verdict and build a conclusion <ref type="bibr" target="#b9">[10]</ref>. 
However, the results obtained are inferior to those of state-of-the-art models on datasets like FEVER <ref type="bibr" target="#b10">[11]</ref> and WiCE <ref type="bibr" target="#b11">[12]</ref>, and further research is encouraged.</p><p>To better understand the purpose of research on automated fact-checking, a study has identified eight main intended use cases of automated fact-checking <ref type="bibr" target="#b12">[13]</ref>. The study analysed 100 highly cited papers, with publication dates ranging from 1998 to 2023, most of them from the 2010s. These use cases are listed below, with the percentage of the 100 papers in which each is pursued: Automated external fact-checking (22%), Assisted external fact-checking (18%), Assisted media consumption (8%), Scientific curiosity (8%), Assisted knowledge curation (7%), Assisted internal fact-checking (4%), Automated content moderation (4%), Truth-telling for law enforcement (1%).</p><p>On the other hand, LLM are AI systems that can process and generate text to solve a variety of tasks, such as summarisation, translation, and question answering <ref type="bibr" target="#b13">[14]</ref>.</p><p>These systems have gained significant popularity over recent years. One of the main reasons for this rise was the introduction of the Transformer architecture <ref type="bibr" target="#b14">[15]</ref>. This technical breakthrough, along with ever-growing data collection and generation for training, as well as larger computational capabilities, triggered a large wave of more capable language models. The paradigm of their creation shifted from task-specific to task-agnostic training, allowing models to perform a wider range of tasks <ref type="bibr" target="#b15">[16]</ref>.</p><p>One of the desired capabilities of LLM is reasoning, a cognitive process defined as thinking about something in order to reach a decision. 
At the intersection of psychology, philosophy, and computer science, it is a process that helps individuals solve problems and make decisions <ref type="bibr" target="#b16">[17]</ref>.</p><p>Language models perform well on specific reasoning tasks, although there is no general agreement on whether or not they have the ability to reason <ref type="bibr" target="#b16">[17]</ref>. It has been demonstrated, however, that these models' ability to reason improves considerably with their parameter count. Accordingly, recently released LLM with over 100 billion parameters are better at reasoning <ref type="bibr" target="#b17">[18]</ref>.</p><p>Performance on reasoning tasks, however, is not only a matter of parameter count. It can be heavily improved through multiple methods, which are commonly classified as <ref type="bibr" target="#b18">[19]</ref>:</p><p>• Strategy Enhanced Reasoning. As LLM usually contain implicit knowledge for reasoning from their pretraining <ref type="bibr" target="#b19">[20]</ref>, the focus of these methods is how to take advantage of this knowledge.</p><p>The main research area is prompt engineering, which defines how to construct the questions that are fed to the models. It can be single-stage or multi-stage, the latter emulating human reasoning by decomposing a complex problem and reasoning stage by stage. Both cases are also improved by the Chain-of-Thought (CoT) method <ref type="bibr" target="#b20">[21]</ref>, which generates a series of intermediate reasoning steps by providing demonstrations of the thought process inside the prompt. Other efforts towards Strategy Enhanced Reasoning include Process Optimization <ref type="bibr" target="#b21">[22]</ref> and External Engines <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24]</ref>. • Knowledge Enhanced Reasoning. These methods focus on how to use both implicit and explicit knowledge to assist the model in reasoning. 
Regarding implicit knowledge, there has been work to take advantage of the knowledge contained in LLM to generate more knowledge and refine results <ref type="bibr" target="#b24">[25]</ref>. As for explicit knowledge, efforts have been directed towards reducing hallucinations (the invention of incorrect facts) <ref type="bibr" target="#b25">[26]</ref> and improving information retrieval from external files <ref type="bibr" target="#b26">[27]</ref>.</p><p>It is worth noting that better reasoning does not necessarily come from more parameters. Recent research also focuses on smaller models, which are easier to use in production environments, using explanations from bigger LLM to become better reasoners <ref type="bibr">[28] [29]</ref>.</p></div>
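As a minimal sketch of the single-stage Chain-of-Thought prompting idea discussed above, the following Python snippet assembles a prompt containing one worked demonstration. The prompt wording, claim, and demonstration are hypothetical illustrations, not the exact prompts used in this work:

```python
def build_cot_prompt(claim: str, context: str) -> str:
    """Assemble a single-stage prompt with one Chain-of-Thought
    demonstration, so the model imitates the step-by-step reasoning."""
    # Hypothetical worked example showing the desired reasoning format.
    demonstration = (
        "Claim: The unemployment rate fell in 2023.\n"
        "Reasoning: Source A reports a rate of 11.8% in December 2023, "
        "down from 12.9% in December 2022; the claim matches the evidence.\n"
        "Verdict: Supported\n"
    )
    return (
        "You are a fact-checker. Use ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        "Example of the expected reasoning:\n"
        f"{demonstration}\n"
        f"Claim: {claim}\nReasoning:"
    )

prompt = build_cot_prompt("Employment grew in 2023.", "Source A: ...")
```

In a multi-stage variant, the reasoning produced by one such call would itself be fed into a follow-up prompt, decomposing the problem stage by stage.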
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Assisted Fact-Checking Framework</head><p>The proposed framework aims to improve the retrieval and consumption of information from the press, in an attempt to enhance fact-checking processes and reduce misinformation through machine learning techniques. Hence, the selected narrative of this work is that of assisted media consumption or assisted fact-checking. As seen in the literature review, these use cases account for around 30% of the intended uses of automated fact-checking tools <ref type="bibr" target="#b12">[13]</ref>. It is important to keep human interaction in the process rather than fully automating it, since some steps in fact-checking sometimes need to be done manually (e.g., calling an official source at a Ministry to verify a fact, which cannot be done online).</p><p>The proposed framework, illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, starts by building a knowledge base of digital media according to the interests of users: the sources and categories of articles of their choice. This flexibility makes the framework versatile, as it can be used for any type of fact-checking, with information from the press, official documents, or any private document base. For the implementation of this work, several pieces on unemployment from Spanish digital media were used as a knowledge base. We recommend that, whenever this framework is used for political fact-checking, the knowledge base consist of a choice of media agencies with different types of audiences <ref type="bibr" target="#b29">[30]</ref>, to make sure the contents are diverse and can complement or contrast each other.</p><p>A machine learning model, specifically an LLM in this case, is used by a data actor (mostly journalists) to verify an input, with the knowledge base as context. 
The output of the proposed framework consists of a classification of the given input, as well as a justification with citations to the sources supporting or refuting it. Prompt engineering and retrieval techniques are used to control the behaviour of the language model, in an effort to restrict its answers to the given knowledge base, preventing it from hallucinating and giving false information to the user <ref type="bibr" target="#b30">[31]</ref>.</p><p>As for the evaluation of the framework's performance, traditional benchmarks are not useful, since the accuracy of the responses does not reflect commonsense reasoning capabilities but depends on specific information from the context. Moreover, human evaluation has shown reproducibility limitations and instability on NLP tasks <ref type="bibr" target="#b31">[32]</ref>. Hence, the approach shifts from traditional benchmarks to an evaluation inspired by the LLM-as-a-judge method <ref type="bibr" target="#b32">[33]</ref>. In this case, given a fixed knowledge base, an LLM creates a series of potential claims from its context, together with their classification (supported by the context, refuted by it, or with no information) and their justification. This serves as an evaluation dataset that is manually revised and then used to extract performance metrics from the behaviour of the framework, with the same context and several combinations of prompts.</p><p>The implementation of the framework was done through LangChain, an open-source framework for developing applications with LLM. Through components of this framework, LLM from OpenAI (gpt-3.5-turbo-1106, gpt-4-0125-preview<ref type="foot" target="#foot_0">1</ref> ) and Cohere (Command<ref type="foot" target="#foot_1">2</ref> ) were integrated. 
The embeddings used for this work were also from OpenAI (text-embedding-ada-002), and the LLM were used through API calls to providers offering them free of charge or at limited cost. For the evaluation, the data consist of several pieces on unemployment from Spanish digital media, in the Spanish language, from the following sites: El Plural, ABC, El Mundo, and Okdiario. These data were stored in an open-source vector database, Faiss. Lastly, the results of the framework were shown through a Gradio interface.</p><p>Several techniques were combined to improve the prompt composition <ref type="bibr" target="#b33">[34]</ref>:</p><p>• Specifying the role: "You are a fact-checker."</p><p>• Explicitly asking to only use knowledge from the provided context. • Explaining the format of the desired output: verdict, justification, and passages of each of the sources either supporting or refuting the given claim. • Chain-of-Thought <ref type="bibr" target="#b20">[21]</ref>: providing an example of how a claim can be verified.</p><p>The workflow of the implementation is depicted in Figure <ref type="figure">2</ref>, through the diverse parts of the described process, leading to the different parts of the output being generated.</p><p>The model is invoked twice. In the first call, the prompt only contains instructions about the justification. Specifically, the model is told that it is a fact-checker that can only base its answers on the provided context. Then, each of the categories that the system needs to classify the claim into is described; however, the model is only asked to provide the justification. It is also given an example of how it should work. Finally, the Markdown format to return is specified, and the model is once again reminded not to produce anything that is not in the given context.</p><p>Next, the model is invoked a second time. 
In this call, it is only asked to produce a classification based on the justification from the first call's response. The prompt once again explains the different values that the categories can take, and the output format (in Markdown) is specified.</p><p>As there is no single ground truth for the use case of this work, there is no standard machine learning evaluation method. However, besides the qualitative evaluation, which serves only to check whether some specific examples functioned correctly, an additional evaluation is needed to assess the general behaviour of the framework.</p><p>GPT-4 was used to generate an evaluation dataset. Iteratively, each of the media pieces was passed to the model with a prompt asking it to generate 40 claims from each news piece. The prompt also specified the categories a claim can have (Supported, Refuted, Half-supported, and No information) and a description of each of them, with the instruction to classify each generated claim as well. Additionally, it was specified that there should be an equal number of claims in each category, to avoid unbalanced classes. The output format was required to be a Python dictionary, in order to create a dataframe afterwards with three columns: claim, classification, and source (to keep traceability of where each claim was generated from).</p><p>After the generation of the evaluation dataset, each model (gpt-3.5-turbo-1106, gpt-4-0125-preview, Cohere Command) was invoked with the same prompt used in the final implementation, feeding it each generated claim of the evaluation dataset, one at a time. A loop was thus created, iterating over the 160 claims generated for evaluation in total. The method then saved the results of each invocation, appending to each row the verdict (or classification), the justification, and the generated references. 
The results were afterwards opened in a Google Sheets file, where the automatic classification grading was done. A Google Sheets function was implemented to compare the classification column created as part of the evaluation dataset with the one extracted from the output. This results, for each row, in a score of 0 or 1, the latter being an exact match.</p><p>Moving to the manual stage, each of the 480 rows generated in total (160 per model) was manually revised to find incorrect classifications in the evaluation dataset, changing the ground truth to an adjusted, correct version, and to find incorrect classifications by each model that could be accepted upon revision if the justification supported them. The accuracy metric is computed by summing the adjusted classification column and dividing by the number of rows evaluated (initially, 160). This returns a result between 0 and 1, later presented as a percentage.</p><p>The final part of the evaluation is the justification grading. Each justification was graded with an answer correctness metric, assigned a score from 0 to 5. Score 0 corresponds to an output where none of the justification is correct, or where the claim is classified as "No information" even though relevant information exists in the context to provide a classification, while Score 5 is a justification that is entirely correct.</p><p>Justification grading was done manually following the criteria above. To avoid inconsistencies, two rounds of grading were conducted on different days, shuffling the order of the claims in the evaluation dataset. Afterwards, the scores of both rounds were compared and, in cases where they varied, a final decision was made. 
This methodology added rigour to an otherwise potentially subjective evaluation process.</p><p>After the grading, each score is divided by 5 so that it lies between 0 and 1. The normalised justification grades are then summed and divided by the number of rows being evaluated (once again, initially, 160), so the resulting answer correctness metric is also a value between 0 and 1, later presented as a percentage. It was deemed important to use both metrics, since each evaluates an important function of the framework that might be used independently.</p></div>
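The two-metric scoring described above can be sketched as follows. The labels and grades are toy illustrations (in the actual work the grading was compiled in Google Sheets over 160 rows per model):

```python
def evaluation_metrics(expected, predicted, justification_grades):
    """Compute the two metrics described above:
    - classification accuracy: fraction of exact label matches (0/1 per row)
    - answer correctness: mean of the 0-5 justification grades, normalised by 5
    """
    assert len(expected) == len(predicted) == len(justification_grades)
    n = len(expected)
    # Score 1 for an exact match, 0 otherwise, then average.
    accuracy = sum(1 for e, p in zip(expected, predicted) if e == p) / n
    # Normalise each manual grade to [0, 1], then average.
    correctness = sum(g / 5 for g in justification_grades) / n
    return accuracy, correctness

acc, corr = evaluation_metrics(
    ["Supported", "Refuted", "No information", "Half-supported"],
    ["Supported", "Refuted", "Supported", "Half-supported"],
    [5, 4, 0, 5],
)
# acc = 0.75, corr = 0.7
```

Both results are then presented as percentages, as in Section 4.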
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>The final results are shown in Table <ref type="table" target="#tab_1">1</ref>. The model with the best average performance across the classification and justification metrics is GPT-3.5, scoring over 70% in both. Perhaps counter-intuitively, GPT-4 achieves significantly lower performance in claim classification, although it achieves the top result in justification. Lastly, the justification score of Cohere Command could not be calculated, since it had issues with the justification language and format.</p><p>The evaluation dataset initially consisted of 160 claims and underwent a manual revision of the quality of the generated claims and their classification by the judge model. As described in Section 3, this dataset was then used to extract performance metrics from the behaviour of the framework.</p><p>As seen from the evaluation results, GPT-3.5 is the best-performing model among the three supported. Its strong points are the best scores in classification and a close second place in justification production. Moreover, out of the 114 claims that were correctly classified, 84 (73.68%) also got the best score for their justification.</p><p>Furthermore, there has been exactly one case of a claim that was incorrectly classified but got the maximum score for its justification. It is, in fact, similar to one of the cases mentioned among the incorrect classifications from the evaluation dataset, and it has to do with double negations and antonyms. More precisely, the claim was about the unemployment rate improving with respect to 2022. It is probable that the justification was correctly created, but the model was confused by the concept of improvement for terms like the unemployment rate: the rate improved for society as it declined, but the model seems to have understood an increase in the rate as an improvement. 
This example showcases the importance of considering both the classification and the justification when using the proposed framework.</p><p>Additionally, 17 claims were rated with a score of 4 in justification production instead of the maximum of 5. Most of these claims lost the final point due to inexact quantities or approximations. The output of the framework is almost correct in terms of justification, but it has been observed that, in some cases, the model compares several mentioned quantities as if they were completely different values rather than mere approximations of each other. Prompting the model to handle approximations in a specific manner was tried in earlier implementations during the experimental phase, although it did not have the expected result. If the behaviour of the framework when dealing with approximations improved and the results rated with a 4 were given a score of 5 instead, the results produced by GPT-3.5 would improve by an additional 2.14% in justification production, achieving a score of 72.96%. It is worth noting, however, that there are cases where approximations are correctly handled. For instance, a claim stating that there were 20 million employed people by the end of 2023 was correctly classified as "refuted" by the framework, since the correct number is 21.24 million and 20 million is not a valid approximation. Figure <ref type="figure" target="#fig_1">3</ref> shows an example of the functioning of the framework powered by GPT-3.5. The user input claims that the number of unemployed people increased in Spain in 2023. The answer given by the system classifies the claim as "Refuted", which is correct, as three of the four sources in the knowledge base support the contrary, whereas the fourth source contains no relevant information. The justification is an accurate summary of the reasons why the system refutes the claim. 
Moreover, the references created by the system are also correct.</p><p>GPT-4 achieved just short of 60% in the classification score, and more than 73% in justification production, surpassing GPT-3.5. These classification results are lower than expected, as both models are from OpenAI but GPT-4 is a newer, bigger model with better results than GPT-3.5 on most benchmarks.</p><p>The contrast between its two scores highlights its lower classification capability in the current implementation. There are six claims classified as "No information" whose justifications received the highest possible score, with a faithful and complete reasoning on the claim. It is unknown why this behaviour occurs, since the methodology was changed in the final implementation: instead of allowing the model to decide the classification of a claim directly in the first invocation, it had to do so in a second call, based only on the justification it had previously created. These cases should therefore be reduced with this implementation. Additionally, a recurring result arose for 45 claims that prevented obtaining both a correct classification and a justification. After invoking the prompts, the output was "Understood, I am ready to start. Please, provide a claim and a justification" on 31 occasions, with another 14 cases returning "Verdict: No information. Justification: [Input given as justification]". After the first run of the evaluation, this had looked like an execution error, so the claims that led to this result were executed again. However, the same result was returned, and no explanation was found for this behaviour, which indicates an instability in the responses of GPT-4 with these prompts.</p><p>It is observed from the evaluation that the different models have varied performance levels. The best-performing model, GPT-3.5, achieves a score higher than 70% in both of the evaluated metrics. 
In this configuration, the framework can be considered reliable when used as a tool for assisted fact-checking or media consumption in a setting where a human checks the outputs, rather than in a fully automated environment.</p><p>One of the requirements for the framework was explainability, which is achieved mainly through the creation of references for each output. This is ensured through the similarity search that matches the input claim against the pieces of news in the knowledge base. As the context is created through this more controlled procedure, rather than left to the LLM to reason about, it is considered more reliable.</p><p>The creation of references is one of the strong points towards the trustworthiness of the proposed framework. However, although there are procedures to avoid hallucinations or the invention of unrelated information in LLM output, these are not always completely effective. It is therefore important to disclose to final users that the framework, when used for fact-checking or assisted media consumption, can occasionally produce such outputs.</p><p>Studies suggest that LLM trained in English at a very large scale can achieve almost as good results in other languages, although there is still room for improvement <ref type="bibr" target="#b34">[35]</ref>. This might also be the case for this research work: performance may have been lower because the prompts and input data were designed to be used in Spanish. For instance, the confusion at reasoning might have been avoided in English. 
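The similarity search used to assemble the context and references can be sketched as follows. This is an illustrative outline, not the implementation used in this work; in practice the vectors would come from a sentence-embedding model rather than being supplied directly:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_context(claim_vec, kb, top_k=3):
    """kb: list of (source_name, passage_text, passage_vec) tuples.
    Returns the top_k passages most similar to the claim, keeping each
    passage's source so the final answer can cite its references."""
    ranked = sorted(kb, key=lambda item: cosine(claim_vec, item[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:top_k]]
```

Because each retrieved passage carries its source, the references in the output can be produced deterministically from the retrieval step instead of being generated by the LLM.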
However, the capabilities of language models are increasing at considerable speed, so language should not be an impediment to the adoption of the proposed framework as a solution for assisted fact-checking or media consumption.</p><p>Furthermore, given that the different models at times produce contradictory verdicts and reasoning, reliability could be increased through majority voting, or through weighted majority voting based on the evaluated performance of each model. This technique is well established in traditional machine learning, for example in Random Forests, where each tree has a vote and the final decision is the answer with the most votes.</p><p>All in all, the proposed framework for fact-checking with information from the press provides a reliable solution for automating work that was traditionally manual for journalists, and opens new possibilities for non-professionals to consume more contrasted information from the news.</p></div>
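The weighted majority voting proposed above could be sketched as follows; the model names and weights here are hypothetical, with the weights standing in for each model's evaluated classification score:

```python
from collections import defaultdict

def weighted_vote(verdicts, weights):
    """verdicts: {model_name: verdict}; weights: {model_name: weight}.
    Returns the verdict whose supporting models have the highest total
    weight; with equal weights this reduces to plain majority voting."""
    totals = defaultdict(float)
    for model, verdict in verdicts.items():
        totals[verdict] += weights.get(model, 1.0)
    return max(totals, key=totals.get)
```

A deployment could, for instance, weight each model by its classification score from Table 1, so that agreement between weaker models can still outvote a single stronger model.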
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Ethical Considerations</head><p>The proposed framework aims to enhance the way information is consumed from the press, either for assisted fact-checking or for a new form of media consumption. A new form of consuming information can have a considerable impact on society if it becomes established. Therefore, it is important to assess the ethical considerations of the proposed framework beyond its implementation.</p><p>The Ethics Guidelines for Trustworthy AI, developed by the European Commission in 2019, offer a systematic manner to assess the ethical considerations of the proposed framework through a set of requirements that any trustworthy AI system should meet <ref type="bibr" target="#b35">[36]</ref>. The system proposed in this work can be systematically assessed in terms of ethics considering the following requirements from the guidelines:</p><p>• Human agency and oversight: In terms of human agency and autonomy, it is vital to stress that the proposed framework relies on information from the press, which can never fully guarantee that this information is factual. Therefore, over-reliance shall be avoided. As for oversight, the framework is not considered an autonomous system, as it needs to be overseen by a Human-in-the-Loop. • Technical robustness and safety: A low level of accuracy could create undesired results from the system. However, as there would be human supervision and reference checking, the consequences should not be damaging. The accuracy levels achieved by the final implementation would need to be properly communicated to end-users so they can acknowledge the behaviour and limitations of the system and align their expectations. 
Finally, regarding the training data and assumptions the LLM were trained on, no adversarial effects were observed during the experiments, mostly due to the explicit prompting to follow instructions and rely only on the data given as context. • Privacy and data governance: The framework does not use any personal data and relies only on publicly available data. • Transparency: Three main elements constitute transparency as a requirement.</p><p>First, traceability is important to track; for this system, this covers the versions of the models used, the prompts with which the models are invoked, and the data that make up the knowledge base. Next, explainability is vital for building trust in the AI system, and this is achieved in this framework through the creation of a justification and the references used for that purpose. The last element of transparency is communication. It is clearly communicated that the framework is an AI system and not a human, as well as its benefits, limitations, potential risks, level of accuracy, and error rates. For the sake of transparency, it is also recommended to use open-source models in an industrial implementation of this framework. This research project has only considered closed-source LLM, as there was no budget allocated for hardware or API usage. However, transparency would be improved through the usage of open-source models that present performance comparable to GPT-3.5 and GPT-4, such as the Llama 2 models from Meta<ref type="foot" target="#foot_2">3</ref> and Mixtral from Mistral AI<ref type="foot" target="#foot_3">4</ref>. • Diversity, non-discrimination, and fairness: The intention when implementing this work is always to avoid unfair bias. This has been done by carefully crafting the prompts to leave out expressions that could leave room for subjectivity. However, it is necessary to include a disclaimer about biases that could already exist. 
These biases can exist either in the data from the press -as the system needs to be faithful to that information -or in the training data of the LLM, although the latter is less common, as the instructions explicitly forbid using data from the training. Moreover, as it is advisable to consult stakeholders affected by the AI system throughout the whole life cycle, experts from a Spanish fact-checking start-up were consulted for functional feedback on the framework. This was done to ensure that the system's design and development took into account the actual needs of the professionals who could benefit from it. • Societal and environmental well-being: The implementation of the proposed framework, if used at a large scale, could have an impact on human work and society at large. It has the potential to change some aspects of journalism, specifically fact-checking, as evidence retrieval would be faster and journalists and fact-checkers could invest the time saved in other tasks. On the other hand, the usage of this framework for assisted fact-checking could favourably impact society at large by reducing misinformation and disinformation. However, it would also pose a challenge: learning a new way to digest information, as facts and claims would reach people with a justification already created, and critical thinking could decrease. • Accountability: The functioning of the framework is documented and it can be externally audited.</p><p>Moreover, the limitations and data sources of the system are well communicated to end-users, since the framework ultimately verifies whether a claim is supported by the provided context, not whether it is factually true or false. 
Therefore, the responsibility for the accuracy of the information falls on the data sources.</p><p>Through this assessment of requirements, it can be concluded that the proposed framework can be considered a responsible application of AI.</p><p>The other ethical aspect that needs to be studied prior to any deployment of the proposed framework in a production environment is where the data come from: What should the process to collect news pieces look like? How should the digital outlets be picked?</p><p>The New York Times, one of the longest-running newspapers in the United States, sued OpenAI in December 2023 over content created by ChatGPT <ref type="bibr" target="#b36">[37]</ref>. The lawsuit described several issues with this content that the newspaper claimed were hurting the brand and functioning of the New York Times. The two most pressing issues were the regurgitation of full New York Times articles that ChatGPT would perform if prompted in a certain way, and its hallucinations, where false or inaccurate information was created and then attributed to the New York Times. Both of these problems can affect the image and the finances of the publisher.</p><p>The proposed framework pursues a new way of consuming information, joining several sources to provide a complete picture of the context of a claim users wish to consult. This is always done crediting the original authors and displaying each source's position on the claim to be checked. Although credit is important, it might not be enough. It is important to know where the information comes from and to provide a way to consult the source data directly in case it is needed. 
However, if the framework is adopted at a large scale, it could reduce the traffic on the information sources' websites.</p><p>For all of this, it is considered that the framework, rather than relying on direct web-scraping, should be proposed as a collaboration with diverse digital newspapers or media outlets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>In this work, we have studied the interconnection between AI, journalism, and ethics in an attempt to create a framework whose ultimate goal is to reduce misinformation in society. The proposed framework is powered by the reasoning capabilities of LLM, allowing users to contrast a given claim with a previously built knowledge base based on sources of interest. The claim is classified according to its alignment with the knowledge base, and a justification and references are also returned.</p><p>The functioning of the framework is evaluated with a mix of automated and manual techniques, ultimately returning a percentage of classification and justification accuracy. The best-performing model of the several that have been studied -GPT-3.5 -scored over 70% in both metrics.</p><p>It is important to consider all parts of the output -verdict, justification, and references -when using the system, since there are cases where only some of the parts are correct. However, even with this limitation, the tool can serve as a companion for professionals and non-professionals when consuming information. The proposed framework can reliably be used as a tool for assisted fact-checking or assisted media consumption.</p><p>Overall, the broad implication of the present research work is that it is possible to use an AI-based framework to enhance the retrieval of information from the press in a responsible manner, showing that AI may be considered a promising companion tool for journalists and non-professionals wanting to contrast information.</p><p>We are currently working on testing new state-of-the-art Large Language Models, since they have the potential to improve the current performance. 
New models are getting rolled out at a very fast pace.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Framework for Assisted Fact-Checking from the Press.</figDesc><graphic coords="5,143.14,65.60,309.00,405.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Output from implementation.</figDesc><graphic coords="9,86.03,65.60,423.23,287.89" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Quantitative results from evaluation.</figDesc><table><row><cell>Model</cell><cell>Classification Score (%)</cell><cell>Justification Score (%)</cell></row><row><cell>GPT-3.5</cell><cell>71.70</cell><cell>70.82</cell></row><row><cell>GPT-4</cell><cell>63.52</cell><cell>73.58</cell></row><row><cell>Cohere Command</cell><cell>44.65</cell><cell>-</cell></row></table><note>generated by GPT-4, creating balanced categories of classifications (Supported, Refuted, Half-supported and No information) from each of the four sources (El Plural, ABC, El Mundo, and Okdiario).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://platform.openai.com/docs/models/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://docs.cohere.com/docs/models</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://llama.meta.com/llama2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://mistral.ai/technology/#models</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Benchmark scores are improving as well. We are focusing on open-source releases: there are models already surpassing GPT-3.5 thanks to the support of the open-source community, so they would be worth testing within the scope of the framework. Only minimal code modifications are needed, since the framework is designed to support any LLM.</p><p>We also aim to modify the current methods to increase accuracy and faithfulness. Several methods might increase the performance of the current framework without requiring more powerful LLM. One of them could be combining the answers of several models, either through weighted majority voting based on their evaluated performance, or through a third call that gives the outputs of the models as a new context and lets a model decide based on that information.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation was carried out with a mixed approach: automated and manual. The automated stages were the evaluation dataset creation and the initial classification grading. On the other hand, the classification adjustment and the justification grading were done manually.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Why do fact-checking organizations go beyond fact-checking? a leap toward media and information literacy education</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Çömlekçi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Communication</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page">21</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bringing journalism back to its roots: examining fact-checking practices, methods, and challenges in the Mediterranean context</title>
		<author>
			<persName><forename type="first">V</forename><surname>Moreno-Gil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ramon-Vegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mauri-Ríos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Profesional de la información</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Debunking false information: investigating journalists&apos; factchecking skills</title>
		<author>
			<persName><forename type="first">M</forename><surname>Himma-Kadakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ojamets</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital journalism</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="866" to="887" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Graves</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:13750196" />
		<title level="m">Understanding the Promise and Limits of Automated Fact-Checking</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>Reuters Institute for the Study of Journalism, University of Oxford</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Survey on Automated Fact-Checking</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="178" to="206" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://www.thefire.org/research-learn/misinformation-versus-disinformation-explained" />
		<title level="m">Misinformation versus disinformation, explained | The Foundation for Individual Rights and Expression</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Borel</surname></persName>
		</author>
		<title level="m">The Chicago Guide to Fact-Checking</title>
				<imprint>
			<publisher>University of Chicago Press</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fact Checking: Task definition and dataset construction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Language Technologies and Computational Social Science</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="18" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automated Fact-Checking to Support Professional Practices: Systematic Literature Review and Meta-Analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dierickx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-G</forename><surname>Lindén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Opdahl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Communication</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page">21</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14623</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">FEVER: a large-scale dataset for fact extraction and VERification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="809" to="819" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">WiCE: Real-world entailment for claims in Wikipedia</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kamoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="7561" to="7583" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The Intended Uses of Automated Fact-Checking Artefacts: Why, How and Who</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ousidhoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="8618" to="8642" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">U</forename><surname>Hadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Qureshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Irfan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zafar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Akhtar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mirjalili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">A survey on large language models: Applications, challenges, limitations, and practical usage</title>
		<title level="s">Authorea Preprints</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">U</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Towards reasoning in large language models: A survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C.-C</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1049" to="1065" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2206.07682</idno>
		<title level="m">Emergent abilities of large language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Reasoning with Language Model Prompting: A Survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.09597</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Prompting contrastive explanations for commonsense reasoning tasks</title>
		<author>
			<persName><forename type="first">B</forename><surname>Paranjape</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4179" to="4192" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.11171</idno>
		<title level="m">Self-consistency improves chain of thought reasoning in language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Mind&apos;s Eye: Grounded Language Model Reasoning through Simulation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vosoughi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Language Models of Code are Few-Shot Commonsense Learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1384" to="1403" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Generated knowledge prompting for commonsense reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>West</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Le Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3154" to="3169" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Object Hallucination in Image Captioning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Hendricks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4035" to="4045" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive NLP tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">34th International Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.06726</idno>
		<title level="m">Explanations from large language models make small reasoners better</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jawahar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Palangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Awadallah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.02707</idno>
		<title level="m">Orca: Progressive learning from complex explanation traces of gpt-4</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">The ideology of media: Measuring the political leaning of Spanish news media through Twitter users&apos; interactions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Guerrero-Solé</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comunicación y sociedad = Communication &amp; Society</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="29" to="43" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.07567</idno>
		<title level="m">Retrieval augmentation reduces hallucination in conversation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">C.-H</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Lee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.01937</idno>
		<title level="m">Can large language models be an alternative to human evaluations?</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sandborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gilbert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elnashar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Spencer-Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Schmidt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.11382</idno>
		<title level="m">A prompt pattern catalog to enhance prompt engineering with chatgpt</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Armengol-Estapé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>De Gibert Bonet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Melero</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.13349</idno>
		<title level="m">On the multilingual capabilities of very large-scale english language models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<idno type="DOI">10.2759/346720</idno>
		<title level="m">Ethics guidelines for trustworthy AI</title>
		<imprint>
			<publisher>Publications Office of the European Union</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
		<respStmt>
			<orgName>Directorate-General for Communications Networks, Content and Technology (European Commission) and the High-Level Expert Group on Artificial Intelligence</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Helmore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Paul</surname></persName>
		</author>
		<ptr target="https://www.theguardian.com/media/2023/dec/27/new-york-times-openai-microsoft-lawsuit" />
		<title level="m">New York Times sues OpenAI and Microsoft for copyright infringement</title>
				<imprint>
			<publisher>The Guardian</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
