<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1002/aaai.12188</article-id>
      <title-group>
        <article-title>VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLCG, University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>389</fpage>
      <lpage>402</lpage>
      <abstract>
<p>Achieving factual accuracy is a known open issue for language models. Their design, centered on the interactive component of user interaction and on the extensive use of “spontaneous” training data, has made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true-or-false question. The topics of the statements vary, but most fall in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.</p>
      </abstract>
      <kwd-group>
<kwd>fact checking</kwd>
        <kwd>benchmark</kwd>
        <kwd>factual knowledge</kwd>
        <kwd>Italian</kwd>
        <kwd>fake news</kwd>
        <kwd>CALAMITA</kwd>
        <kwd>CheckIT!</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>The pollution of the information ecosystem by means of misleading or false information has reached unprecedented levels at a global scale. This has been possible thanks to a combination of multiple factors, among which the collapse of (local and national) journalism; an increasing sense of distrust in science and evidence-based facts; and the presence of computational amplification tools such as bots [1, 2]. In this sense, the rise of Large Language Models (LLMs), with the constant increase in their performance, has introduced both opportunities and challenges in the fight against misinformation: while LLMs possess the capability to generate coherent and contextually relevant text, they also pose risks by potentially producing deceptive misinformation at scale [3, 4].</p>
      <p>Testing factual and common-sense knowledge in LLMs has been a common although not easy task, mostly involving multiple-choice question answering, a method easy to automate and not prone to ambiguity, and spanning wide ranges of academic and professional domains like mathematics, medicine, history, law, general knowledge and many others [5, 6, 7, 8, 9, 10, 11, 12]. Developing benchmarks to test the ability of LLMs to accurately evaluate factual knowledge is more relevant than ever, considering the ease of access of these tools to non-experts for any purpose (entertainment, education, professional settings) and the increasing integration of these technologies into everyday activities.</p>
      <p>Notably, most of these tasks and the corresponding benchmarks are in English, with other languages being represented through machine-translated data or no data at all. This is true for Italian too. For instance, SQuAD-IT [13] is a machine-translated version of the SQuAD dataset [14] and it is the reference for evaluating models on QA tasks. While machine translation has been constantly improving, it can easily introduce artefacts in the output text, impairing naturalness and correctness; moreover, translated data can be subject to a loss of nuance and context, as translations may not capture cultural nuances or contextual meanings, leading to misunderstandings or misinterpretations in the target language: certain phrases or idioms may not have direct equivalents in other languages, and linguistic constructions typical of the source language may be encouraged excessively [15].</p>
      <p>By using data from a professional fact-checking agency1, we can test the knowledge memorization of LMs and to what extent intra-memory conflicts, resulting in “hallucinations”, arise. Furthermore, doing so using Italian data centered around the Italian and European contexts ensures testing LMs’ functionalities directly in Italian.</p>
      <p>This task is based on CheckIT! [16], a resource of expert fact-checked claims designed to fill a gap in the development of AI-assisted fact-checking pipelines for Italian.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. † These authors contributed equally. ORCID: 0009-0007-1343-3760 (J. Gili); 0000-0001-5991-370X (V. Patti); 0000-0003-4934-5344 (L. Passaro); 0000-0003-2936-0256 (T. Caselli). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>1: Data have been obtained from Pagella Politica.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Challenge: Description</title>
      <p>The challenge is a binary classification task in a zero-shot setting: for each atomic statement, the LM is asked to determine its factuality with respect to the time it was uttered, answering only with one of the two labels, “Vero” (true) or “Falso” (false). A third label for half-true statements could easily have been kept, as it was already part of the dataset from which the data is sourced, but in this first stage we opted for the binary setting so as to limit task complexity.</p>
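      <p>As an illustration, the binary protocol reduces to mapping a model's free-text reply onto the two admissible labels. The normalization helper below is a hypothetical sketch, not part of the official evaluation code; any reply that cannot be mapped is treated as unparsable.</p>
```python
# Hypothetical sketch: normalize a model's free-text answer to the two
# admissible labels of the task, "Vero" (true) or "Falso" (false).
# Anything that cannot be mapped is treated as an invalid answer (None).

def normalize_answer(raw_answer):
    """Map a model's raw output onto 'Vero', 'Falso', or None."""
    text = raw_answer.strip().strip('".').lower()
    if text.startswith("vero") or text.startswith("true"):
        return "Vero"
    if text.startswith("falso") or text.startswith("false"):
        return "Falso"
    return None  # unparsable output counts as a wrong prediction

print(normalize_answer(' "Vero". '))  # a verbose but parsable answer
print(normalize_answer("FALSO"))
print(normalize_answer("Non saprei"))
```
      <p>In practice a stricter or looser mapping could be chosen; treating unparsable output as an error also indirectly scores instruction following.</p>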
      <p>Some cases in the dataset exhibit complexities due to the combination of multiple pieces of information within a single claim, which can affect the final determination of veracity. For instance, consider the following scenario:</p>
      <sec id="sec-3-1">
        <title>Original claim and translation</title>
        <p>Original claim: «Se è vero che oltre l’82% dei morti da Covid hanno più di 70 anni, non si capisce perché meno della metà degli over 80 sia stato vaccinato finora». Translation: «If it is true that over 82% of Covid deaths are over 70 years old, it is not clear why less than half of those over 80 have been vaccinated so far».</p>
        <p>CheckIT!, the resource the data is sourced from, also recognizes an intermediate “Ni” [Half true] label; as a result, all claims with the “half-true” verdict were discarded.</p>
        <p>Furthermore, we considered it pertinent to the task to also provide a smaller subset of claims, “VeryfIT_small”, balanced on the political orientation of the speaking politician: misinformation can occur on all topics, but in political misinformation each side of the political spectrum has its own more widespread topics and recurrent formulations.</p>
        <p>Additionally, an annotation task was carried out on the VeryfIT_small subset, aimed at clarifying statements presenting a level of ambiguity that would have proven detrimental to the task: around 12% of the statements have an alternative version available, “enriched” with information vital to the task. We will refer to them as “enriched statements” (subsection 3.2).</p>
        <p>In conclusion, two versions of the dataset are available: VeryfIT (2,021 claims) and VeryfIT_small (352 claims, 43 of which have an enriched version).</p>
        <sec id="sec-3-1-1">
          <title>3.1. Creation of VeryfIT_small</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <p>The information concerning this statement is:</p>
      <p>1. Out of all the deceased due to the Covid-19 pandemic, 82% are people over 70 years old.
2. Less than half of the citizens over 80 years old had been administered at least one dose of vaccine against Covid-19.</p>
      <p>This example also highlights the importance of incorporating the appropriate temporal context in the verification process. Factual information, especially involving statistics or reports about the state of the world, evolves over time, and failing to account for this can invalidate the conclusions drawn by experts. Although more complex statements require a broader knowledge base, language models have by now shown understanding abilities well above this level and should not be overwhelmed by it.</p>
      <p>The first step to achieve this goal was to exclude around 400 out of the 2,021 claims of VeryfIT for which information about the political orientation of the speaker was not available.</p>
      <p>We then mapped, using Wikipedia as a source, the political orientation of the parties (and thus of the authors of the claims at the moment of the remark) into eight fine-grained, commonly recognized political categories: far-left, left, center-left, center, center-right, right, far-right. The list of all the parties and their corresponding political orientation is reported in Table 2. An additional label, ‘transverse’, was added to indicate a non-precise placement in the political spectrum. This label includes one party (“Movimento 5 Stelle”), members of the Italian institutions above political parties (e.g. the President of the Republic), and experts not affiliated with any political party or coalition, such as members of a technical government2.</p>
      <p>At first glance, the Italian political spectrum may appear only slightly unbalanced. Despite the absence of far-left representation, the distribution of parties across the spectrum is relatively symmetrical: out of the 23 political parties in the data, six are from the left, two from the center-left, six from the center, three from the center-right, two from the right, and three from the far-right. However, the distribution of claims is not as well balanced, with a larger number of claims from the right and far-right parties than from the rest, as reported in Table 3.</p>
      <p>To ensure the balance of our benchmark, we decided to reduce the label granularity from eight to four by collapsing the labels far-left, left, and center-left into ‘left’ [SX], and far-right, right, and center-right into ‘right’ [DX]; the labels center [C] and transverse [T] remained untouched. The re-aggregated coarse-grained labels are reported in Table 4.</p>
      <p>The VeryfIT dataset consists of 2,021 claims taken from CheckIT! [16]. Not all claims were included, due to the binary format of the task: VeryfIT classifies claims as either “Vero” [True] or “Falso” [False], whereas CheckIT! also recognizes the intermediate “Ni” [Half true] verdict.</p>
      <p>2: https://en.wikipedia.org/wiki/Technocratic_government_(Italy)</p>
      <sec id="sec-4-1">
        <title>Table 2: Political parties and their orientation labels</title>
        <p>Alleanza Verdi e Sinistra: left
Alternativa Popolare: center-right
Articolo Uno: center-left
Azione: center
Coraggio Italia: center-right
Europa Verde: left
Forza Italia: right
Fratelli d’Italia: far-right
Impegno Civico: center
Indipendente: transverse
Italexit: far-right
Italia Viva: center
Lega Nord: far-right
Liberi e uguali: left
Movimento 5 Stelle: transverse
Nuovo Centro Destra: center-right
Partito Democratico: center-left
Più Europa: center
Popolo della Libertà: right
Possibile: left
Radicali Italiani: center
Scelta Civica: center
Sinistra Ecologia Libertà: left
Sinistra italiana: left
Tecnico: transverse</p>
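        <p>The re-aggregation of orientation labels amounts to a simple lookup from fine-grained to coarse-grained categories; the sketch below is illustrative, not code from the benchmark itself.</p>
```python
# Illustrative sketch of the label collapse described in the text:
# far-left / left / center-left map to 'SX', far-right / right /
# center-right map to 'DX'; center ('C') and transverse ('T') keep
# their own coarse labels.
COARSE_LABEL = {
    "far-left": "SX", "left": "SX", "center-left": "SX",
    "far-right": "DX", "right": "DX", "center-right": "DX",
    "center": "C",
    "transverse": "T",
}

def collapse(orientation):
    """Map a fine-grained orientation onto its coarse-grained label."""
    return COARSE_LABEL[orientation]

print(collapse("center-left"))  # SX
print(collapse("far-right"))    # DX
```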
        <p>The distribution is still unbalanced between the political sides: the “Vero” counts per side are 138 for Left [SX], 82 for Center [C], 327 for Right [DX], and 146 for Transverse [T].</p>
        <sec id="sec-4-1-1">
          <title>3.2. Enriched statements</title>
          <p>Given the specificity of the statements, many of which
require detailed knowledge of topics related to Italian
institutions and policies, and the occasional ambiguity
arising from their oral nature, the task has been further
divided into two sub-tasks with slight data modifications,
aimed at adding vital context to statements that were
excessively reliant on information external to the
statements themselves. The altered statements account for
around 12% of the VeryfIT_small dataset, as excessive
human intervention would undermine the core principle
of testing on natural data, aligned with what language
models might be asked to handle in real-life scenarios.
In most cases, minimal adjustments were made, such as
retaining the original claim but adding the name of the
politician speaking or clarifying specific references.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Table 6: Original and enriched statements</title>
        <p>Original: «Abbiamo 490 grandi elettori» [We have 490 great electors]. Enriched: «Gli elettori dell’area di centrosinistra che voteranno per l’elezione del Presidente della Repubblica saranno 490» [The center-left electors who will vote in the election of the President of the Republic will be 490].</p>
        <p>Original: «Oggi in Italia sono 796 quelli che pagano più di 1 milione di euro» [Today in Italy, 796 people pay more than 1 million euros]. Enriched: «Oggi in Italia sono 796 quelli che dichiarano un reddito superiore ad 1 milione di euro» [Today in Italy, 796 people declare an income above 1 million euros].</p>
        <p>Original: «[Alle europee] io ho battuto Salvini in molti capoluoghi di provincia» [[At the European elections] I beat Salvini in many provincial capitals]. Enriched: «[Alle europee] io [Carlo Calenda] ho battuto Salvini in molti capoluoghi di provincia» [[At the European elections] I [Carlo Calenda] beat Salvini in many provincial capitals].</p>
        <p>Original: «In parlamento stiamo facendo un lavoro che risponde a una prerogativa costituzionale. Certamente si sarebbero tutti auspicati, me compresa, tempi più brevi ma non stiamo perdendo tempo. Stiamo svolgendo un ruolo che ci compete e che la Costituzione dà al parlamento» [In parliament we are doing a job that answers to a constitutional prerogative. Certainly everyone, myself included, would have hoped for shorter times, but we are not wasting time. We are playing a role that falls to us and that the Constitution gives to parliament]. Enriched: «L’elezione dei membri della Corte Costituzionale e del Consiglio Superiore della Magistratura (Csm) è un dovere che la costituzione italiana dà al parlamento» [The election of the members of the Constitutional Court and of the High Council of the Judiciary (Csm) is a duty that the Italian constitution gives to parliament].</p>
        <sec id="sec-4-3-1">
          <title>3.3. Annotation details</title>
          <p>During the making of the VeryfIT datasets, it was noticed that not all the statements were actual claims: in articles with multiple claims to check, the ‘statement’ field was filled with a short title summarizing them all, often in the format “[name of the politician] on [topic]”. Regular expressions were used to flag statements not starting with ‘“’ or ‘«’, the two symbols used to denote a dialogue or part of a speech, and a manual check led to the exclusion of around 170 statements. Moreover, around 30 statements with formats resembling “[name of the politician] is [right/wrong] on [topic]: [statement]” were reformulated as claims by removing hints about the factuality verdict and the author of the statement. A couple of examples are reported in Table 7.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Table 7: Original and reworded statements</title>
        <p>Original: “Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati” [Giulia Grillo is wrong: Italian doctors and nurses are not the least paid]. Reworded: “i medici e gli infermieri italiani sono i meno pagati” [Italian doctors and nurses are the least paid].</p>
        <p>Original: “Secondo Di Maio il governo investe nelle centrali a carbone, ma è il contrario” [According to Di Maio the government invests in coal power plants, but the opposite is true]. Reworded: “Il governo investe nelle centrali a carbone” [The government invests in coal power plants].</p>
        <p>Original: “No, per la Corte dei Conti non ci saranno 17 miliardi di nuove tasse” [No, according to the Court of Auditors there will not be 17 billion in new taxes]. Reworded: “Per la Corte dei Conti ci saranno 17 miliardi di nuove tasse” [According to the Court of Auditors there will be 17 billion in new taxes].</p>
        <p>The goal of partially or entirely removing this initial layer of complexity, by simplifying the extraction of the relevant information from the statement for verification, is to highlight a stronger correlation between the benchmark results and the language model’s actual factual knowledge: when working with natural data, the model’s responses may stem from its difficulty in comprehending the specific information it is being asked to verify, whereas, with altered data, its responses are more directly influenced by gaps in its knowledge.</p>
        <p>Another important annotation step has been producing the enriched statements. A human annotator5 reviewed the VeryfIT_small dataset, identifying statements that could benefit from additional context, and produced enriched variations of those statements. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the speaking politician or clarifying anaphoric references. Examples of enriched statements are reported in Table 6.</p>
        <p>The reasons for enriching the statements in Table 6 all revolve around the lack of pivotal information needed to determine factuality. The first statement is completely missing its context and contains an unclear term, “grandi elettori” [big voters], relatively well known in the political context, but one that could be mistaken for a physical feature or for a consideration regarding the age of voters; the second statement has an unclear formulation, as “pagare” [to pay] does not refer unambiguously to taxes; the third statement is missing its subject; the fourth and last statement is missing part of its context, as “stiamo facendo un lavoro” [we are doing a job] and “stiamo svolgendo un ruolo” [we are playing a role] both refer to a very specific duty of the parliament that is never mentioned directly.</p>
        <p>Preliminary results obtained through the chat interfaces of Claude 3.5 Sonnet3 and GPT-4o4 show that, respectively, two out of the four statements (Claude) and one out of the four statements (GPT-4o) reported in Table 6 are wrongly classified when presented in the original version, while providing the models with the enriched versions brings the correct classifications up to four out of four for both models. These results, however, can only partially prove the effectiveness of the enriched statements, as different models presented with a partial context could provide different verdicts, even guessing the right one.</p>
        <p>3: https://claude.ai/chat
4: https://chatgpt.com/
5: All the annotations noted in this report were done by the first author of the paper, a master’s student in Computer Science with a background in Natural Language Processing.</p>
        <p>The decision to apply this annotation step to the VeryfIT_small subset, instead of the full dataset, is related to the amount of manual work it would have required.</p>
        <p>Additionally, another annotation step involved completing the “macro_area” [topic] field for all 352 entries of VeryfIT_small. Although this field was included in the original dataset, it was missing a value in approximately 15% of the entries. This was done manually, classifying statements into the pre-existing topic labels, which are: ‘questioni sociali’ [social matters], ‘economia’ [economy], ‘esteri’ [foreign affairs], ‘giustizia’ [justice], ‘istituzioni’ [institutions], ‘ambiente’ [environment], ‘altro’ [other]. The new labels were chosen by comparing unlabelled statements with statements that already had a label and by inspecting the contents of the articles from which they were extracted, sometimes only needing to look at the ‘tags’ field to find all the information needed.</p>
        <p>To avoid even the smallest imprecision that would have impaired the original label system created by the journalists, uncertain labels were put in the ‘altro’ category.</p>
        <p>Statistics about the distribution of these labels can be
found in section 3.6.</p>
        <sec id="sec-4-5-1">
          <title>3.4. Data format</title>
          <p>Brief explanation of the data fields:</p>
          <p>• annotato: if True, the statement has a revised version.
• id: ID of the corresponding article in CheckIT!.
• statement_date: date of the statement’s diffusion.
• statement: the statement.
• verdict: factuality verdict.
• orientamento: orientation of the political party of the politician who authored the statement.
• macro_area: topic of the statement.
• tags: list of tags.
• statement_revised: revised version of the statement, if present.</p>
          <preformat>
{
  "annotato": False,
  "id": 991,
  "statement_date": 2019-07-12,
  "statement": "[Il salario minimo n.d.r.] Manca solo a noi e ai Paesi dell'Est Europa",
  "verdict": "Falso",
  "orientamento": 'C',
  "macro_area": "questioni sociali",
  "tags": "['questioni sociali', 'panzana pazzesca', 'italia', 'eu', 'salario minimo']",
  "statement_revised": ""
},
{
  "annotato": True,
  "id": 123,
  "statement_date": 2023-02-14,
  "statement": "Il canone in bolletta fu una mia scelta. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro",
  "verdict": "Vero",
  "orientamento": 'C',
  "macro_area": "altro",
  "tags": "['canone', 'rai', 'bolletta', 'costo']",
  "statement_revised": "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro"
}
          </preformat>
          <p>The models are expected to be evaluated on this task in a zero-shot setting, thereby also better resembling the conditions of a real use-case scenario. The prompt we suggest using for the evaluation is basic, and urges the model to limit its answer to just the label corresponding to the answer. The original prompt in Italian, together with its English translation, is reported in Box 1.</p>
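          <p>As a concrete illustration of the field layout, a record can be checked against the schema above. This validation helper is a hypothetical sketch; the example record is abridged from the first one shown above.</p>
```python
# Hypothetical sketch: minimal schema check for a VeryfIT record,
# using the field names listed above.
REQUIRED_FIELDS = [
    "annotato", "id", "statement_date", "statement", "verdict",
    "orientamento", "macro_area", "tags", "statement_revised",
]

def is_valid_record(record):
    """True when every field is present and the verdict is binary."""
    has_all = all(field in record for field in REQUIRED_FIELDS)
    return has_all and record["verdict"] in ("Vero", "Falso")

example = {
    "annotato": False,
    "id": 991,
    "statement_date": "2019-07-12",
    "statement": "[Il salario minimo n.d.r.] Manca solo a noi "
                 "e ai Paesi dell'Est Europa",
    "verdict": "Falso",
    "orientamento": "C",
    "macro_area": "questioni sociali",
    "tags": ["questioni sociali", "panzana pazzesca", "italia",
             "eu", "salario minimo"],
    "statement_revised": "",
}
print(is_valid_record(example))  # True
```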
        </sec>
        <sec id="sec-4-5-2">
          <title>3.5. Example of prompt used for the zero-shot setting</title>
          <p>Fields such as ‘macro_area’ and ‘tags’ serve as indicators of the topic, the former providing a general categorization and the latter offering more specific details. This information was included with future tasks in mind, tasks that could reveal differences in factual knowledge across different subjects.</p>
          <p>Prompt
Il seguente statement, nella data indicata, è vero
o falso? Rispondi solo con "Vero" o "Falso".</p>
          <p>Is the following statement, on the date indicated, true or false? Answer only with "True" or "False".</p>
          <p>Box 1: Zero-shot prompt</p>
          <p>The prompt does not contain any information about the subject of the question or any other informative cues apart from the time reference needed to anchor the claim in a temporal context. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language different from English.</p>
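          <p>A sketch of how the full query could be assembled from the Box 1 instruction, the statement, and its date; the exact concatenation layout is an assumption, since the text only specifies the instruction and the need for a temporal anchor.</p>
```python
# Hypothetical sketch: assemble the zero-shot query from the Italian
# instruction in Box 1 plus the statement and its date. The layout of
# the date and statement after the instruction is an assumption.
INSTRUCTION = (
    "Il seguente statement, nella data indicata, è vero o falso? "
    'Rispondi solo con "Vero" o "Falso".'
)

def build_prompt(statement, statement_date):
    """Prefix the fixed instruction to the dated statement."""
    return f"{INSTRUCTION}\nData: {statement_date}\nStatement: {statement}"

print(build_prompt("Abbiamo 490 grandi elettori", "2022-01-20"))
```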
        </sec>
        <sec id="sec-4-5-3">
          <title>3.6. Detailed data statistics</title>
          <p>The full VeryfIT dataset is composed of 2,021 entries in the Italian language. Out of these claims, 352 form the VeryfIT_small dataset, in which the entries are equally split across the three main sides of a simplification of the classical political spectrum (left, right, center) and a fourth label, ‘transverse’, used to address non-precise placement in the political spectrum or the complete absence of affiliation with any political party or coalition.</p>
          <p>Of the 352 claims in the VeryfIT_small dataset, 43 have an enriched variation of the statement available, providing additional context alongside the original statement.</p>
          <p>The distribution of claims and factuality labels across
topics is presented in Table 8, Table 9, Table 10, Table 11.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>Accuracy serves as the evaluation metric of the task due
to its intuitive interpretation and broad applicability.
Accuracy provides a clear measure of a classifier’s overall
performance by calculating the proportion of correct
predictions among total cases examined.</p>
      <p>No other metrics were chosen for the task.</p>
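      <p>Accuracy over the binary labels reduces to the fraction of exact matches between predictions and gold verdicts, as in this minimal sketch:</p>
```python
# Minimal sketch: accuracy as the proportion of predictions that
# exactly match the gold "Vero"/"Falso" verdicts.
def accuracy(predictions, gold):
    """Fraction of positions where the prediction equals the gold label."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["Vero", "Falso", "Falso", "Vero"]
truth = ["Vero", "Falso", "Vero", "Vero"]
print(accuracy(preds, truth))  # 0.75
```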
      <p>[Table residue; per-topic counts: questioni sociali 19, economia 11, istituzioni 10, esteri 4, giustizia 3, altro 1, ambiente 1, un-noted 23.]</p>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations</title>
      <p>The totality of the data comes from an expert, reliable source. For this reason, the quality of the verdicts is assured to be high. One possible limitation is due to the time-relatedness of said verdicts: claims can be true or false at different times depending on the temporal context.</p>
      <p>arXiv preprint arXiv:2406.17789 (2024).</p>
      <p>[16] J. Gili, L. Passaro, T. Caselli, CheckIT!: A corpus of expert fact-checked claims for Italian, in: F. Boschetti, G. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proceedings, CEUR-WS.org, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>