<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1002/aaai.12188</article-id>
      <title-group>
        <article-title>VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLCG, University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>389</fpage>
      <lpage>402</lpage>
      <abstract>
<p>Achieving factual accuracy is a known open issue for language models. Their design, centered on the interactive component of user interaction and on the extensive use of “spontaneous” training data, has made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true-or-false question. The topics of the statements vary, but most fall in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.</p>
      </abstract>
      <kwd-group>
<kwd>fact checking</kwd>
        <kwd>benchmark</kwd>
        <kwd>factual knowledge</kwd>
        <kwd>Italian</kwd>
        <kwd>fake news</kwd>
        <kwd>CALAMITA</kwd>
        <kwd>CheckIT!</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>The pollution of the information ecosystem by means of misleading or false information has reached unprecedented levels at a global scale. This has been possible thanks to a combination of multiple factors, among which the collapse of (local and national) journalism; an increasing sense of distrust in science and evidence-based facts; and the presence of computational amplification tools such as bots [1, 2]. In this sense, the rise of Large Language Models (LLMs), with the constant increase in their performance, has introduced both opportunities and challenges in the fight against misinformation: while LLMs possess the capability to generate coherent and contextually relevant text, they also pose risks by potentially producing deceptive misinformation at scale [3, 4].</p>
      <p>Testing factual and common-sense knowledge in LLMs has been a common although not easy task, mostly involving multiple-choice question answering, a method easy to automate and not prone to ambiguity, and spanning wide ranges of academic and professional domains like mathematics, medicine, history, law, general knowledge and many others [5, 6, 7, 8, 9, 10, 11, 12]. Developing benchmarks to test the ability of LLMs to accurately evaluate factual knowledge is more relevant than ever, considering the ease of access of these tools to non-experts for any purpose (entertainment, education, professional settings) and the increasing integration of these technologies into everyday activities.</p>
      <p>Notably, most of these tasks and the corresponding benchmarks are in English, with other languages being represented through machine-translated data or no data at all. This is true for Italian too. For instance, SQuAD-IT [13] is a machine-translated version of the SQuAD dataset [14] and it is the reference for evaluating models on QA tasks. While machine translation has been constantly improving, it can easily introduce artefacts in the output text, impairing naturalness and correctness; moreover, translated data can be subject to a loss of nuance and context, as translations may not capture cultural nuances or contextual meanings, leading to misunderstandings or misinterpretations in the target language: certain phrases or idioms may not have direct equivalents in other languages, and linguistic constructions typical of the source language may be encouraged excessively [15].</p>
      <p>By using data from a professional fact-checking agency1, we can test the knowledge memorization of LMs and to what extent intra-memory conflicts, resulting in “hallucinations”, arise. Furthermore, doing so using Italian data centered around the Italian and European contexts ensures testing LMs’ functionalities directly in Italian.</p>
      <p>This task is based on CheckIT! [16], a resource of expert fact-checked claims designed to fill a gap in the development of AI-assisted fact-checking pipelines for Italian.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. † These authors contributed equally. ORCID: 0009-0007-1343-3760 (J. Gili); 0000-0001-5991-370X (V. Patti); 0000-0003-4934-5344 (L. Passaro); 0000-0003-2936-0256 (T. Caselli). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>1: Data have been obtained from Pagella Politica.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Challenge: Description</title>
      <p>The challenge is a binary classification task in a zero-shot setting: for each atomic statement, the LM is asked to determine its factuality with respect to the time it was uttered, answering only with one of the two labels, “Vero” (true) or “Falso” (false). A third label for half-true statements could easily have been kept, as it was already part of the dataset from which the data is sourced, but in this first stage we opted for the binary setting so as to limit task complexity.</p>
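      <p>As an illustration, the binary protocol reduces to mapping a model's free-text reply onto the two admissible labels. The normalization helper below is a hypothetical sketch, not part of the official evaluation code; any reply that cannot be mapped is treated as unparsable.</p>
```python
# Hypothetical sketch: normalize a model's free-text answer to the two
# admissible labels of the task, "Vero" (true) or "Falso" (false).
# Anything that cannot be mapped is treated as an invalid answer (None).

def normalize_answer(raw_answer):
    """Map a model's raw output onto 'Vero', 'Falso', or None."""
    text = raw_answer.strip().strip('".').lower()
    if text.startswith("vero") or text.startswith("true"):
        return "Vero"
    if text.startswith("falso") or text.startswith("false"):
        return "Falso"
    return None  # unparsable output counts as a wrong prediction

print(normalize_answer(' "Vero". '))  # a verbose but parsable answer
print(normalize_answer("FALSO"))
print(normalize_answer("Non saprei"))
```
      <p>In practice a stricter or looser mapping could be chosen; treating unparsable output as an error also indirectly scores instruction following.</p>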
      <p>Some cases in the dataset exhibit complexities due to the combination of multiple pieces of information within a single claim, which can affect the final determination of veracity. For instance, consider the following scenario:</p>
      <sec id="sec-3-1">
        <title>Original claim and translation</title>
        <p>Original claim: «Se è vero che oltre l’82% dei morti da Covid hanno più di 70 anni, non si capisce perché meno della metà degli over 80 sia stato vaccinato finora». Translation: «If it is true that over 82% of Covid deaths are over 70 years old, it is not clear why less than half of those over 80 have been vaccinated so far».</p>
        <p>CheckIT!, the resource the data is sourced from, also recognizes an intermediate “Ni” [Half true] label; as a result, all claims with the “half-true” verdict were discarded.</p>
        <p>Furthermore, we considered it pertinent to the task to also provide a smaller subset of claims, “VeryfIT_small”, balanced on the political orientation of the speaking politician: misinformation can occur on all topics, but in political misinformation each side of the political spectrum has its own more widespread topics and recurrent formulations.</p>
        <p>Additionally, an annotation task was carried out on the VeryfIT_small subset, aimed at clarifying statements presenting a level of ambiguity that would have proven detrimental to the task: around 12% of the statements have an alternative version available, “enriched” with information vital to the task. We will refer to them as “enriched statements” (subsection 3.2).</p>
        <p>In conclusion, two versions of the dataset are available: VeryfIT (2,021 claims) and VeryfIT_small (352 claims, 43 of which have an enriched version).</p>
        <sec id="sec-3-1-1">
          <title>3.1. Creation of VeryfIT_small</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <p>The information concerning this statement is:</p>
      <p>1. Out of all the deceased due to the Covid-19 pandemic, 82% are people over 70 years old.
2. Less than half of the citizens over 80 years old had been administered at least one dose of vaccine against Covid-19.</p>
      <p>This example also highlights the importance of incorporating the appropriate temporal context in the verification process. Factual information, especially involving statistics or reports about the state of the world, evolves over time, and failing to account for this can invalidate the conclusions drawn by experts. Although more complex statements require a broader knowledge base, language models have by now shown understanding abilities well above this level and should not be overwhelmed by it.</p>
      <p>The first step to achieve this goal was to exclude around 400 out of the 2,021 claims of VeryfIT for which information about the political orientation of the speaker was not available.</p>
      <p>We then mapped, using Wikipedia as a source, the political orientation of the parties (and thus of the authors of the claims at the moment of the remark) into eight fine-grained, commonly recognized political categories: far-left, left, center-left, center, center-right, right, far-right. The list of all the parties and their corresponding political orientation is reported in Table 2. An additional label, ‘transverse’, was added to indicate a non-precise placement in the political spectrum. This label includes one party (“Movimento 5 Stelle”), members of the Italian institutions above political parties (e.g. the President of the Republic), and experts not affiliated with any political party or coalition, such as members of a technical government2.</p>
      <p>At first glance, the Italian political spectrum may appear only slightly unbalanced. Despite the absence of far-left representation, the distribution of parties across the spectrum is relatively symmetrical: out of the 23 political parties in the data, six are from the left, two from the center-left, six from the center, three from the center-right, two from the right, and three from the far-right. However, the distribution of claims is not as well balanced, with a larger number of claims from the right and far-right parties than from the rest, as reported in Table 3.</p>
      <p>To ensure the balance of our benchmark, we decided to reduce the label granularity from eight to four by collapsing the labels far-left, left, and center-left into ‘left’ [SX], and far-right, right, and center-right into ‘right’ [DX]; the labels center [C] and transverse [T] remained untouched. The re-aggregated coarse-grained labels are reported in Table 4.</p>
      <p>The VeryfIT dataset consists of 2,021 claims taken from CheckIT! [16]. Not all claims were included, due to the binary format of the task: VeryfIT classifies claims as either “Vero” [True] or “Falso” [False], whereas CheckIT! also recognizes the intermediate “Ni” [Half true] verdict.</p>
      <p>2: https://en.wikipedia.org/wiki/Technocratic_government_(Italy)</p>
      <sec id="sec-4-1">
        <title>Table 2: Political parties and their orientation labels</title>
        <p>Alleanza Verdi e Sinistra: left
Alternativa Popolare: center-right
Articolo Uno: center-left
Azione: center
Coraggio Italia: center-right
Europa Verde: left
Forza Italia: right
Fratelli d’Italia: far-right
Impegno Civico: center
Indipendente: transverse
Italexit: far-right
Italia Viva: center
Lega Nord: far-right
Liberi e uguali: left
Movimento 5 Stelle: transverse
Nuovo Centro Destra: center-right
Partito Democratico: center-left
Più Europa: center
Popolo della Libertà: right
Possibile: left
Radicali Italiani: center
Scelta Civica: center
Sinistra Ecologia Libertà: left
Sinistra italiana: left
Tecnico: transverse</p>
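        <p>The re-aggregation of orientation labels amounts to a simple lookup from fine-grained to coarse-grained categories; the sketch below is illustrative, not code from the benchmark itself.</p>
```python
# Illustrative sketch of the label collapse described in the text:
# far-left / left / center-left map to 'SX', far-right / right /
# center-right map to 'DX'; center ('C') and transverse ('T') keep
# their own coarse labels.
COARSE_LABEL = {
    "far-left": "SX", "left": "SX", "center-left": "SX",
    "far-right": "DX", "right": "DX", "center-right": "DX",
    "center": "C",
    "transverse": "T",
}

def collapse(orientation):
    """Map a fine-grained orientation onto its coarse-grained label."""
    return COARSE_LABEL[orientation]

print(collapse("center-left"))  # SX
print(collapse("far-right"))    # DX
```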
        <p>The distribution is still unbalanced between the political sides: the “Vero” counts per side are 138 for Left [SX], 82 for Center [C], 327 for Right [DX], and 146 for Transverse [T].</p>
        <sec id="sec-4-1-1">
          <title>3.2. Enriched statements</title>
          <p>Given the specificity of the statements, many of which
require detailed knowledge of topics related to Italian
institutions and policies, and the occasional ambiguity
arising from their oral nature, the task has been further
divided into two sub-tasks with slight data modifications,
aimed at adding vital context to statements that were
excessively reliant on information external to the
statements themselves. The altered statements account for
around 12% of the VeryfIT_small dataset, as excessive
human intervention would undermine the core principle
of testing on natural data, aligned with what language
models might be asked to handle in real-life scenarios.
In most cases, minimal adjustments were made, such as
retaining the original claim but adding the name of the
politician speaking or clarifying specific references.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Table 6: Original and enriched statements</title>
        <p>Original: «Abbiamo 490 grandi elettori» [We have 490 great electors]. Enriched: «Gli elettori dell’area di centrosinistra che voteranno per l’elezione del Presidente della Repubblica saranno 490» [The center-left electors who will vote in the election of the President of the Republic will be 490].</p>
        <p>Original: «Oggi in Italia sono 796 quelli che pagano più di 1 milione di euro» [Today in Italy, 796 people pay more than 1 million euros]. Enriched: «Oggi in Italia sono 796 quelli che dichiarano un reddito superiore ad 1 milione di euro» [Today in Italy, 796 people declare an income above 1 million euros].</p>
        <p>Original: «[Alle europee] io ho battuto Salvini in molti capoluoghi di provincia» [[At the European elections] I beat Salvini in many provincial capitals]. Enriched: «[Alle europee] io [Carlo Calenda] ho battuto Salvini in molti capoluoghi di provincia» [[At the European elections] I [Carlo Calenda] beat Salvini in many provincial capitals].</p>
        <p>Original: «In parlamento stiamo facendo un lavoro che risponde a una prerogativa costituzionale. Certamente si sarebbero tutti auspicati, me compresa, tempi più brevi ma non stiamo perdendo tempo. Stiamo svolgendo un ruolo che ci compete e che la Costituzione dà al parlamento» [In parliament we are doing a job that answers to a constitutional prerogative. Certainly everyone, myself included, would have hoped for shorter times, but we are not wasting time. We are playing a role that falls to us and that the Constitution gives to parliament]. Enriched: «L’elezione dei membri della Corte Costituzionale e del Consiglio Superiore della Magistratura (Csm) è un dovere che la costituzione italiana dà al parlamento» [The election of the members of the Constitutional Court and of the High Council of the Judiciary (Csm) is a duty that the Italian constitution gives to parliament].</p>
        <sec id="sec-4-3-1">
          <title>3.3. Annotation details</title>
          <p>During the making of the VeryfIT datasets, it was noticed that not all the statements were actual claims: in articles with multiple claims to check, the ‘statement’ field was filled with a short title summarizing them all, often in the format “[name of the politician] on [topic]”. Regular expressions were used to flag statements not starting with ‘“’ or ‘«’, the two symbols used to denote a dialogue or part of a speech, and a manual check led to the exclusion of around 170 statements. Moreover, around 30 statements with formats resembling “[name of the politician] is [right/wrong] on [topic]: [statement]” were reformulated as claims by removing hints about the factuality verdict and the author of the statement. A couple of examples are reported in Table 7.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Table 7: Original and reworded statements</title>
        <p>Original: “Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati” [Giulia Grillo is wrong: Italian doctors and nurses are not the least paid]. Reworded: “i medici e gli infermieri italiani sono i meno pagati” [Italian doctors and nurses are the least paid].</p>
        <p>Original: “Secondo Di Maio il governo investe nelle centrali a carbone, ma è il contrario” [According to Di Maio the government invests in coal power plants, but the opposite is true]. Reworded: “Il governo investe nelle centrali a carbone” [The government invests in coal power plants].</p>
        <p>Original: “No, per la Corte dei Conti non ci saranno 17 miliardi di nuove tasse” [No, according to the Court of Auditors there will not be 17 billion in new taxes]. Reworded: “Per la Corte dei Conti ci saranno 17 miliardi di nuove tasse” [According to the Court of Auditors there will be 17 billion in new taxes].</p>
        <p>The goal of partially or entirely removing this initial layer of complexity, by simplifying the extraction of the relevant information from the statement for verification, is to highlight a stronger correlation between the benchmark results and the language model’s actual factual knowledge: when working with natural data, the model’s responses may stem from its difficulty in comprehending the specific information it is being asked to verify, whereas, with altered data, its responses are more directly influenced by gaps in its knowledge.</p>
        <p>Another important annotation step has been producing the enriched statements. A human annotator5 reviewed the VeryfIT_small dataset, identifying statements that could benefit from additional context, and produced enriched variations of those statements. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the speaking politician or clarifying anaphoric references. Examples of enriched statements are reported in Table 6.</p>
        <p>The reasons for enriching the statements in Table 6 all revolve around the lack of pivotal information needed to determine factuality. The first statement is completely missing its context and contains an unclear term, “grandi elettori” [big voters], relatively well known in the political context, but one that could be mistaken for a physical feature or for a consideration regarding the age of voters; the second statement has an unclear formulation, as “pagare” [to pay] does not refer unambiguously to taxes; the third statement is missing its subject; the fourth and last statement is missing part of its context, as “stiamo facendo un lavoro” [we are doing a job] and “stiamo svolgendo un ruolo” [we are playing a role] both refer to a very specific duty of the parliament that is never mentioned directly.</p>
        <p>Preliminary results obtained through the chat interfaces of Claude 3.5 Sonnet3 and GPT-4o4 show that, respectively, two out of the four statements (Claude) and one out of the four statements (GPT-4o) reported in Table 6 are wrongly classified when presented in the original version, while providing the models with the enriched versions brings the correct classifications up to four out of four for both models. These results, however, can only partially prove the effectiveness of the enriched statements, as different models presented with a partial context could provide different verdicts, even guessing the right one.</p>
        <p>3: https://claude.ai/chat
4: https://chatgpt.com/
5: All the annotations noted in this report were done by the first author of the paper, a master’s student in Computer Science with a background in Natural Language Processing.</p>
        <p>The decision to apply this annotation step to the VeryfIT_small subset, instead of the full dataset, is related to the amount of manual work it would have required.</p>
        <p>Additionally, another annotation step involved completing the “macro_area” [topic] field for all 352 entries of VeryfIT_small. Although this field was included in the original dataset, it was missing a value in approximately 15% of the entries. This was done manually, classifying statements into the pre-existing topic labels, which are: ‘questioni sociali’ [social matters], ‘economia’ [economy], ‘esteri’ [foreign affairs], ‘giustizia’ [justice], ‘istituzioni’ [institutions], ‘ambiente’ [environment], ‘altro’ [other]. The new labels were chosen by comparing unlabelled statements with statements that already had a label and by inspecting the contents of the articles from which they were extracted, sometimes only needing to look at the ‘tags’ field to find all the information needed.</p>
        <p>To avoid even the smallest imprecision that would have impaired the original label system created by the journalists, uncertain labels were put in the ‘altro’ category.</p>
        <p>Statistics about the distribution of these labels can be
found in section 3.6.</p>
        <sec id="sec-4-5-1">
          <title>3.4. Data format</title>
          <p>Brief explanation of the data fields:</p>
          <p>• annotato: if True, the statement has a revised version.
• id: ID of the corresponding article in CheckIT!.
• statement_date: date of the statement’s diffusion.
• statement: the statement.
• verdict: factuality verdict.
• orientamento: orientation of the political party of the politician who authored the statement.
• macro_area: topic of the statement.
• tags: list of tags.
• statement_revised: revised version of the statement, if present.</p>
          <preformat>
{
  "annotato": False,
  "id": 991,
  "statement_date": 2019-07-12,
  "statement": "[Il salario minimo n.d.r.] Manca solo a noi e ai Paesi dell'Est Europa",
  "verdict": "Falso",
  "orientamento": 'C',
  "macro_area": "questioni sociali",
  "tags": "['questioni sociali', 'panzana pazzesca', 'italia', 'eu', 'salario minimo']",
  "statement_revised": ""
},
{
  "annotato": True,
  "id": 123,
  "statement_date": 2023-02-14,
  "statement": "Il canone in bolletta fu una mia scelta. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro",
  "verdict": "Vero",
  "orientamento": 'C',
  "macro_area": "altro",
  "tags": "['canone', 'rai', 'bolletta', 'costo']",
  "statement_revised": "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro"
}
          </preformat>
          <p>The models are expected to be evaluated on this task in a zero-shot setting, thereby also better resembling the conditions of a real use-case scenario. The prompt we suggest using for the evaluation is basic, and urges the model to limit its answer to just the label corresponding to the answer. The original prompt in Italian, together with its English translation, is reported in Box 1.</p>
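          <p>As a concrete illustration of the field layout, a record can be checked against the schema above. This validation helper is a hypothetical sketch; the example record is abridged from the first one shown above.</p>
```python
# Hypothetical sketch: minimal schema check for a VeryfIT record,
# using the field names listed above.
REQUIRED_FIELDS = [
    "annotato", "id", "statement_date", "statement", "verdict",
    "orientamento", "macro_area", "tags", "statement_revised",
]

def is_valid_record(record):
    """True when every field is present and the verdict is binary."""
    has_all = all(field in record for field in REQUIRED_FIELDS)
    return has_all and record["verdict"] in ("Vero", "Falso")

example = {
    "annotato": False,
    "id": 991,
    "statement_date": "2019-07-12",
    "statement": "[Il salario minimo n.d.r.] Manca solo a noi "
                 "e ai Paesi dell'Est Europa",
    "verdict": "Falso",
    "orientamento": "C",
    "macro_area": "questioni sociali",
    "tags": ["questioni sociali", "panzana pazzesca", "italia",
             "eu", "salario minimo"],
    "statement_revised": "",
}
print(is_valid_record(example))  # True
```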
        </sec>
        <sec id="sec-4-5-2">
          <title>3.5. Example of prompt used for the zero-shot setting</title>
          <p>Fields such as ‘macro_area’ and ‘tags’ serve as indicators of the topic, the former providing a general categorization and the latter offering more specific details. This information was included with future tasks in mind, tasks that could reveal differences in factual knowledge across different subjects.</p>
          <p>Prompt
Il seguente statement, nella data indicata, è vero
o falso? Rispondi solo con "Vero" o "Falso".</p>
          <p>Is the following statement, on the date indicated, true or false? Answer only with "True" or "False".</p>
          <p>Box 1: Zero-shot prompt</p>
          <p>The prompt does not contain any information about the subject of the question or any other informative cues apart from the time reference needed to anchor the claim in a temporal context. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language different from English.</p>
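          <p>A sketch of how the full query could be assembled from the Box 1 instruction, the statement, and its date; the exact concatenation layout is an assumption, since the text only specifies the instruction and the need for a temporal anchor.</p>
```python
# Hypothetical sketch: assemble the zero-shot query from the Italian
# instruction in Box 1 plus the statement and its date. The layout of
# the date and statement after the instruction is an assumption.
INSTRUCTION = (
    "Il seguente statement, nella data indicata, è vero o falso? "
    'Rispondi solo con "Vero" o "Falso".'
)

def build_prompt(statement, statement_date):
    """Prefix the fixed instruction to the dated statement."""
    return f"{INSTRUCTION}\nData: {statement_date}\nStatement: {statement}"

print(build_prompt("Abbiamo 490 grandi elettori", "2022-01-20"))
```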
        </sec>
        <sec id="sec-4-5-3">
          <title>3.6. Detailed data statistics</title>
          <p>The full VeryfIT dataset is composed of 2,021 entries in the Italian language. Out of these claims, 352 form the VeryfIT_small dataset, in which the entries are equally split across the three main sides of a simplification of the classical political spectrum (left, right, center) and a fourth label, ‘transverse’, used to address non-precise placement in the political spectrum or the complete absence of affiliation with any political party or coalition.</p>
          <p>Of the 352 claims in the VeryfIT_small dataset, 43 have an enriched variation of the statement available, providing additional context alongside the original statement.</p>
          <p>The distribution of claims and factuality labels across
topics is presented in Table 8, Table 9, Table 10, Table 11.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>Accuracy serves as the evaluation metric of the task due
to its intuitive interpretation and broad applicability.
Accuracy provides a clear measure of a classifier’s overall
performance by calculating the proportion of correct
predictions among total cases examined.</p>
      <p>No other metrics were chosen for the task.</p>
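      <p>Accuracy over the binary labels reduces to the fraction of exact matches between predictions and gold verdicts, as in this minimal sketch:</p>
```python
# Minimal sketch: accuracy as the proportion of predictions that
# exactly match the gold "Vero"/"Falso" verdicts.
def accuracy(predictions, gold):
    """Fraction of positions where the prediction equals the gold label."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["Vero", "Falso", "Falso", "Vero"]
truth = ["Vero", "Falso", "Vero", "Vero"]
print(accuracy(preds, truth))  # 0.75
```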
      <p>[Table residue; per-topic counts: questioni sociali 19, economia 11, istituzioni 10, esteri 4, giustizia 3, altro 1, ambiente 1, un-noted 23.]</p>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations</title>
      <p>The totality of the data comes from an expert, reliable source. For this reason, the quality of the verdicts is assured to be high. One possible limitation is due to the time-relatedness of said verdicts: claims can be true or false at different times depending on the temporal context.</p>
      <p>arXiv preprint arXiv:2406.17789 (2024).</p>
      <p>[16] J. Gili, L. Passaro, T. Caselli, CheckIT!: A corpus of expert fact-checked claims for Italian, in: F. Boschetti, G. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proceedings, CEUR-WS.org, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>