VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge
Jacopo Gili1, Viviana Patti1,†, Lucia Passaro2,† and Tommaso Caselli3,†
1 Department of Computer Science, University of Turin, Italy
2 Department of Computer Science, University of Pisa, Italy
3 CLCG, University of Groningen, The Netherlands


                                               Abstract
Achieving factual accuracy is a known open issue for language models. Their design, centered around user interaction, and the extensive use of “spontaneous” training data have made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true or false question. Topics of the statements vary, but most are in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.

                                               Keywords
                                               fact checking, benchmark, factual knowledge, Italian, fake news, CALAMITA, CheckIT!


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
† These authors contributed equally.
Email: jacopo.gili584@edu.unito.it (J. Gili); viviana.patti@unito.it (V. Patti); lucia.passaro@unipi.it (L. Passaro); t.caselli@rug.nl (T. Caselli)
ORCID: 0009-0007-1343-3760 (J. Gili); 0000-0001-5991-370X (V. Patti); 0000-0003-4934-5344 (L. Passaro); 0000-0003-2936-0256 (T. Caselli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Challenge: Introduction and Motivation

The pollution of the information ecosystem by means of misleading or false information has reached unprecedented levels at a global scale. This has been possible thanks to a combination of multiple factors, among which are the collapse of (local and national) journalism; an increasing sense of distrust in science and evidence-based facts; and the presence of computational amplification tools such as bots [1, 2]. In this sense, the rise of Large Language Models (LLMs), with the constant increase of their performance, has introduced both opportunities and challenges in the fight against misinformation: while LLMs possess the capability to generate coherent and contextually relevant text, they also pose risks by potentially producing deceptive misinformation at scale [3, 4].

Testing factual and common-sense knowledge in LLMs has been a common, although not easy, task. It mostly involves multiple-choice question answering, a method that is easy to automate and not prone to ambiguity, and spans a wide range of academic and professional domains such as mathematics, medicine, history, law, general knowledge and many others [5, 6, 7, 8, 9, 10, 11, 12].

Developing benchmarks to test the ability of LLMs to accurately evaluate factual knowledge is more relevant than ever, considering the ease of access of these tools to non-experts for any purpose (entertainment, education, professional settings) and the increasing integration of these technologies in everyday activities.

Notably, most of these tasks and corresponding benchmarks are in English, with other languages being represented through machine-translated data or no data at all. This is true for Italian too. For instance, SQUAD-IT [13] is a machine-translated version of the SQUAD dataset [14] and it is the reference for evaluating models on QA tasks.

While machine translation has been constantly improving, it can easily introduce artefacts in the output text, impairing naturalness and correctness. Moreover, translated data can suffer from a loss of nuance and context, as translations may not capture cultural nuances or contextual meanings, leading to misunderstandings or misinterpretations in the target language: certain phrases or idioms may not have direct equivalents in other languages, and linguistic constructions typical of the source language may be overrepresented [15].

By using data from a professional fact-checking agency1 we can test knowledge memorization of LMs and to what extent intra-memory conflicts, resulting in “hallucinations”, arise. Furthermore, doing so using Italian data centered around the Italian and European contexts ensures testing LMs’ functionalities directly in Italian.

1 Data have been obtained from Pagella Politica

This task is based on CheckIT! [16], a resource of expert fact-checked claims designed to fill a gap for the development of AI-assisted fact-checking pipelines for


Italian.

2. Challenge: Description

The challenge is a binary classification task in a zero-shot setting: for each atomic statement, any LM is asked to determine its factuality with respect to the time it was uttered by answering only with one of the two labels, “Vero” (true) or “Falso” (false). A third label for half-true statements could have been easily kept, as it was already part of the dataset from which the data is sourced, but in this first stage we opted for the binary setting so as to limit task complexity.

Some cases in the dataset exhibit complexities due to the combination of multiple pieces of information within a single claim, which can affect the final determination of veracity. For instance, consider the following scenario:

  Original claim:  «Se è vero che oltre l’82% dei morti da Covid hanno più di 70 anni, non si capisce perché meno della metà degli over 80 sia stato vaccinato finora»
  Translation:     «If it is true that over 82% of Covid deaths are over 70 years old, it is not clear why less than half of those over 80 have been vaccinated so far»

Table 1: Example of a claim

The pieces of information contained in this statement are:
   1. Out of all the deceased due to the Covid19 pandemic, 82% are people over 70 years old.
   2. Less than half of the citizens over 80 years old had received at least one dose of vaccine against Covid19.

This example also highlights the importance of incorporating the appropriate temporal context in the verification process. Factual information, especially involving statistics or reports about the state of the world, evolves over time, and failing to account for this can invalidate the conclusions drawn by experts. Although more complex statements require a broader knowledge base, by now language models have shown comprehension abilities well beyond this level and should not be hindered by it.

3. Data description

The VeryfIT dataset consists of 2,021 claims taken from CheckIT! [16]. Not all claims were included due to the binary format of the task, as VeryfIT classifies claims as either “Vero” [True] or “Falso” [False], whereas CheckIT! recognizes an intermediate “Ni” [Half true] label. As a result, all claims with the “half-true” verdict were discarded.

Furthermore, we considered it pertinent to the task to also provide a smaller subset of claims, “VeryfIT_small”, balanced on the political orientation of the politician speaking: misinformation can occur on all topics, but when it comes to political misinformation each side of the political spectrum has some more widespread topics and recurrent formulations.

Additionally, an annotation task was carried out on the VeryfIT_small subset, aimed at the clarification of statements presenting a level of ambiguity that would have proven detrimental to the task: around 12% of the statements have an alternative version available, “enriched” with information vital to the task. We will refer to them as “enriched statements” (subsection 3.2).

In conclusion, 2 versions of the dataset are available: VeryfIT (2,021 claims) and VeryfIT_small (352 claims, 43 of which have an enriched version).

3.1. Creation of VeryfIT_small

The first step to achieve this goal was to exclude around 400 out of the 2,021 claims of VeryfIT for which information about the political orientation of the speaker was not available.

We then mapped, using Wikipedia as a source, the political orientation of the parties (and thus of the authors of the claims at the moment of remark) into seven fine-grained, commonly recognized political categories: far-left, left, center-left, center, center-right, right, far-right. The list of all the parties and their corresponding political orientation is reported in Table 2. An additional label ‘transverse’ was added to indicate an imprecise placement in the political spectrum, for a total of eight labels. This label includes one party (“Movimento 5 Stelle”), members of the Italian institutions above political parties (e.g. the President of the Republic), and experts not affiliated to any political party or political coalition, like members of a technical government2.

2 https://en.wikipedia.org/wiki/Technocratic_government_(Italy)

  Political party              Orientation label
  Alleanza Verdi e Sinistra    left
  Alternativa Popolare         center-right
  Articolo Uno                 center-left
  Azione                       center
  Coraggio Italia              center-right
  Europa Verde                 left
  Forza Italia                 right
  Fratelli d’Italia            far-right
  Impegno Civico               center
  Indipendente                 transverse
  Italexit                     far-right
  Italia Viva                  center
  Lega Nord                    far-right
  Liberi e uguali              left
  Movimento 5 Stelle           transverse
  Nuovo Centro Destra          center-right
  Partito Democratico          center-left
  Più Europa                   center
  Popolo della Libertà         right
  Possibile                    left
  Radicali Italiani            center
  Scelta Civica                center
  Sinistra Ecologia Libertà    left
  Sinistra italiana            left
  Tecnico                      transverse

Table 2: VeryfIT data: Italian political parties and their orientation.

At first glance, the Italian political spectrum may appear only slightly unbalanced. Despite the absence of a far-left representation, the distribution of parties across the spectrum is relatively symmetrical. Out of the 23 political parties in the data, six are from the left, two from the center-left, six from the center, three from the center-right, two from the right, and three from the far-right; the remaining one (“Movimento 5 Stelle”) is labelled transverse. However, the distribution of claims is not as well balanced, with a larger number of claims from the right and far-right parties than from the rest, as reported in Table 3.

                          Claims
  Political side    True    False    Total
  Left                44       28       72
  Center-left        323      110      433
  Center             105       82      187
  Center-right         8        2       10
  Right               79       84      163
  Far-right          156      241      397
  Transverse         209      146      355
  total              924      693    1,617

Table 3: VeryfIT data after exclusion of claims where information about political orientation of the speaker was not available: Distribution of verdict labels in the political spectrum.

To ensure the balance of our benchmark we decided to reduce the label granularity from eight to four, by collapsing labels far-left, left and center-left into ‘left’ [SX], and far-right, right and center-right into ‘right’ [DX]. Labels center [C] and transverse [T] remained untouched. The re-aggregated coarse-grained labels are reported in Table 4.

                          Claims
  Political side    True    False    Total
  Left [SX]          367      138      505
  Center [C]         105       82      187
  Right [DX]         243      327      570
  Transverse [T]     209      146      355

Table 4: VeryfIT data after exclusion of claims where information about political orientation of the speaker was not available: Distribution of verdict labels in the political spectrum after label collapse.

Although the distribution is still unbalanced between the two end points (SX and DX), this setting, with the lowest cardinality being 187 (for C), easily allows us to generate a perfectly balanced dataset along the political orientations. For the first version of VeryfIT_small, each block contributes 88 claims, resulting in a total of 352 entries, with future work planned to expand it.

                               Claims
  Political side      True        False       Total
  Left [SX]          64 [13]     24 [2]      88 [15]
  Center [C]         46 [4]      42 [7]      88 [11]
  Right [DX]         40 [4]      48 [5]      88 [9]
  Transverse [T]     50 [2]      38 [6]      88 [8]
  total             200 [23]    152 [20]    352 [43]

Table 5: VeryfIT_small: Final distribution of verdict labels in the political spectrum. In square brackets, the number of enriched statements (explained in subsection 3.2).

3.2. Enriched statements

Given the specificity of the statements, many of which require detailed knowledge of topics related to Italian institutions and policies, and the occasional ambiguity arising from their oral nature, the task has been further divided into two sub-tasks with slight data modifications, aimed at adding vital context to statements that were excessively reliant on information external to the statements themselves. The altered statements account for only around 12% of the VeryfIT_small dataset, as excessive human intervention would undermine the core principle of testing on natural data, aligned with what language models might be asked to handle in real-life scenarios. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the politician speaking or clarifying specific references.
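As a rough illustration of how the two sub-tasks could share one data file, the sketch below picks the text shown to the model for a single entry. It assumes the field names of the data format in Figure 1 (`annotato`, `statement`, `statement_revised`); the function name `statement_text` and the `use_enriched` switch are hypothetical and not part of the released resource.

```python
def statement_text(entry: dict, use_enriched: bool) -> str:
    """Return the text to show the model for one VeryfIT_small entry.

    Hypothetical helper: falls back to the original statement whenever
    no enriched version is available for the entry.
    """
    if use_enriched and entry.get("annotato") and entry.get("statement_revised"):
        return entry["statement_revised"]
    return entry["statement"]


# Share of enriched statements reported in Section 3: 43 of 352 entries.
enriched_share = 43 / 352

# Toy entries (shortened, for illustration only).
plain = {"annotato": False, "statement": "A", "statement_revised": ""}
enriched = {"annotato": True, "statement": "B", "statement_revised": "B [context]"}
```

With this layout the original sub-task and the enriched sub-task differ only in the value of `use_enriched`, so the roughly 88% of entries without an enriched version are identical in both runs.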
The goal of partially or entirely removing the initial layer of complexity, by simplifying the extraction of the relevant information from the statement for verification, is to highlight a stronger correlation between the benchmark results and the language model’s actual factual knowledge: when working with natural data, the model’s errors may stem from its difficulty in comprehending the specific information it is being asked to verify. With altered data, however, its responses are more directly influenced by gaps in its knowledge.

Examples of enriched statements are reported in Table 6:

  1. Original statement: Abbiamo 490 grandi elettori
     Enriched statement: Gli elettori dell’area di centrosinistra che voteranno per l’elezione del Presidente della Repubblica saranno 490.
  2. Original statement: Oggi in Italia sono 796 quelli che pagano più di 1 milione di euro
     Enriched statement: Oggi in Italia sono 796 quelli che dichiarano un reddito superiore ad 1 milione di euro.
  3. Original statement: [Alle europee] io ho battuto Salvini in molti capoluoghi di provincia
     Enriched statement: [Alle europee] io [Carlo Calenda] ho battuto Salvini in molti capoluoghi di provincia.
  4. Original statement: In parlamento stiamo facendo un lavoro che risponde a una prerogativa costituzionale. Certamente si sarebbero tutti auspicati, me compresa, tempi più brevi ma non stiamo perdendo tempo. Stiamo svolgendo un ruolo che ci compete e che la Costituzione da’ al parlamento.
     Enriched statement: L’elezione dei membri della Corte Costituzionale e del Consiglio Superiore della Magistratura (Csm) è un dovere che la costituzione italiana dà al parlamento.

Table 6: Comparison of Original and Enriched Statements

The reasons for enriching the statements in Table 6 all revolve around the lack of pivotal information to determine factuality. The first statement is completely missing the context and presents an unclear term, “grandi elettori” [big voters], relatively known in the political context, but that could be mistaken for a physical feature or for a consideration regarding the age of voters. The second statement has an unclear formulation, as “pagare” [to pay] does not refer univocally to taxes. The third statement is missing the subject. The fourth and last statement is missing part of its context, as “stiamo facendo un lavoro” [we are doing a job] and “stiamo svolgendo un ruolo” [we are playing a role] both refer to a very specific duty of the parliament that does not get mentioned directly.

Preliminary results obtained through the chat function of Claude 3.5 Sonnet3 and GPT-4o4 show that, respectively, two out of the four statements (Claude) and one out of the four statements (GPT) reported in Table 6 get wrongly classified when presented in the original version, while providing the models with the enriched versions brings the correct classifications up to four out of four for both models. These results, however, can only partially prove the effectiveness of enriched statements, as different models, when presented with a partial context, could provide different verdicts, even guessing the right one.

3.3. Annotation details

During the making of the VeryfIT datasets, it was noticed that not all the statements were actual claims: in articles with multiple claims to check, the ‘statement’ field was filled with a short title summarizing them all, often in the format “[name of the politician] on [topic]”. Regular expressions were used to highlight statements not starting with ‘“’ or ‘«’, the two symbols used to denote a dialogue or part of a speech, and a manual check led to the exclusion of around 170 statements. Moreover, around 30 statements with formats resembling “[name of the politician] is [right/wrong] on [topic]: [statement]” were reformulated as claims by removing hints about the factuality verdict and the author of the statement. A few examples are reported in Table 7.

  1. Original statement: Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati
     Reworded statement: i medici e gli infermieri italiani sono i meno pagati
  2. Original statement: Secondo Di Maio il governo investe nelle centrali a carbone, ma è il contrario
     Reworded statement: Il governo investe nelle centrali a carbone
  3. Original statement: No, per la Corte dei Conti non ci saranno 17 miliardi di nuove tasse
     Reworded statement: Per la Corte dei Conti ci saranno 17 miliardi di nuove tasse

Table 7: Examples of reworded statements

Another important annotation step has been producing the enriched statements. A human annotator5 reviewed the VeryfIT_small dataset, identifying statements that could benefit from additional context, and produced enriched variations of those statements. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the politician speaking or clarifying anaphoric references.

3 https://claude.ai/chat
4 https://chatgpt.com/
5 All the annotations noted in the report were done by the first author of the paper, a master's student in Computer Science with a background in Natural Language Processing
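The regular-expression check described at the start of this subsection can be sketched as follows. This is only an illustration under stated assumptions: the exact expressions used for VeryfIT are not given, the tolerance for leading whitespace is an assumption, and the names `QUOTE_START` and `needs_manual_check` are hypothetical. The sample strings are taken from Table 1 (shortened) and Table 7.

```python
import re

# Opening quotation marks named in the text as markers of quoted speech.
# Allowing leading whitespace before them is an assumption of this sketch.
QUOTE_START = re.compile(r"^\s*[“«]")


def needs_manual_check(statement: str) -> bool:
    """Flag statements that do not open with one of the two quote symbols."""
    return QUOTE_START.match(statement) is None


samples = [
    "«Se è vero che oltre l’82% dei morti da Covid hanno più di 70 anni»",
    "Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati",
]

# Only flagged statements would be passed on to the manual review step.
flagged = [s for s in samples if needs_manual_check(s)]
```

The filter only highlights candidates: as in the procedure described above, the decision to exclude or reword a statement remains a manual one.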
   The decision of applying this annotation step to the             {
VeryfIT_small subset, instead of the full dataset, is related         "annotato": False,
to the amount of manual work it would have required.                  "id": 991,
   Additionally another annotation step involved com-                 "statement_date": 2019-07-12,
pleting the “macro_area” [topic] field for all the 352 en-            "statement": "[Il salario minimo n.d.r.]
                                                                           Manca solo a noi e ai Paesi dell’Est
tries of VeryfIT_small. Although this field was included                   Europa",
in the original dataset, it was missing a value in approx-            "verdict": "Falso",
imately 15% of the entries. This was done manually,                   "orientamento": ’C’,
classifying statements into the pre-existing topic labels             "macro_area": "questioni sociali",
which are: ‘questioni sociali’ [social matters], ‘economia’           "tags": "[’questioni sociali’, ’panzana
                                                                           pazzesca’, ’italia’, ’eu’, ’salario
[economy], ‘esteri’ [foreign affairs], ‘giustizia’ [justice],              minimo’]",
‘istituzioni’ [institutions], ‘ambiente’ [environment], ‘al-          "statement_revised": ""
tro’ [others]. The new labels were chosen by comparing unlabelled statements
with statements that already had a label and by inspecting the contents of the
articles from which they were extracted; in some cases the ‘tags’ field alone
provided all the information needed. To avoid introducing imprecision into the
original label system devised by the journalists, statements whose label was
uncertain were assigned to the ‘altro’ category.
   Statistics about the distribution of these labels can be found in
Section 3.6.

3.4. Data format

  Brief explanation of the data fields:

     • annotato: If True, the statement has a revised version.
     • id: ID of the corresponding article in CheckIT!.
     • statement_date: Date of the statement's diffusion.
     • statement: The statement.
     • verdict: Factuality verdict.
     • orientamento: Political orientation of the party of the politician
       who made the statement.
     • macro_area: Topic of the statement.
     • tags: List of tags.
     • statement_revised: Revised version of the statement, if present.

   Fields such as ‘macro_area’ and ‘tags’ serve as indicators of the topic,
the former providing a general categorization and the latter offering more
specific details. This information was included with future tasks in mind,
tasks that could reveal differences in factual knowledge across different
subjects.

    },
    {
      "annotato": True,
      "id": 123,
      "statement_date": "2023-02-14",
      "statement": "Il canone in bolletta fu una mia scelta. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro",
      "verdict": "Vero",
      "orientamento": "C",
      "macro_area": "altro",
      "tags": ["canone", "rai", "bolletta", "costo"],
      "statement_revised": "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro"
    }

Figure 1: Data format

3.5. Example of prompts used for zero shots

The models are expected to be evaluated on this task in a zero-shot setting,
which also better resembles the conditions of a real use-case scenario. The
prompt we suggest for the evaluation is basic, and urges the model to limit
its answer to just the label corresponding to the answer ("Vero" or "Falso").
The original prompt in Italian, together with its English translation, is
reported in Box 1.

   Prompt

   Il seguente statement, nella data indicata, è vero o falso? Rispondi solo
   con "Vero" o "Falso".

   Is the following statement, on the date indicated, true or false? Answer
   only with "True" or "False".

   Box 1: Zero-shot prompt
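As an illustration of the zero-shot setup, the sketch below shows one possible way to turn a record with the fields of Section 3.4 into a prompt and to map a free-text reply back to a verdict. Only the instruction text comes from Box 1. The helper names (`build_prompt`, `parse_answer`), the `use_revised` switch, the way the date and the statement are appended to the instruction, and the storage of the date as a string are all assumptions made for this example, not the layout used by the authors.

```python
import re

# Instruction text from Box 1 (Italian original).
PROMPT_IT = (
    'Il seguente statement, nella data indicata, è vero o falso? '
    'Rispondi solo con "Vero" o "Falso".'
)

# Example record with the fields of Section 3.4; values copied from Figure 1.
# Storing the date as a string is an assumption of this sketch.
record = {
    "annotato": True,
    "id": 123,
    "statement_date": "2023-02-14",
    "statement": (
        "Il canone in bolletta fu una mia scelta. Costava 113 euro. "
        "Averlo fatto pagare a tutti ha portato a un abbassamento del "
        "costo da 113 a 90 euro"
    ),
    "verdict": "Vero",
    "statement_revised": (
        "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. "
        "Costava 113 euro. Averlo fatto pagare a tutti ha portato a un "
        "abbassamento del costo da 113 a 90 euro"
    ),
}


def build_prompt(rec, use_revised=False):
    """Assemble a zero-shot prompt (hypothetical layout: instruction, date, statement)."""
    text = rec["statement"]
    # Fall back to the original statement when no revised version is available.
    if use_revised and rec.get("annotato") and rec.get("statement_revised"):
        text = rec["statement_revised"]
    return f'{PROMPT_IT}\nData: {rec["statement_date"]}\nStatement: {text}\nRisposta:'


def parse_answer(raw):
    """Return 'Vero' or 'Falso' for the first label found in the reply, else None."""
    match = re.search(r"\b(vero|falso)\b", raw.lower())
    return match.group(1).capitalize() if match else None


prompt = build_prompt(record)
prediction = parse_answer(' "Vero".')
is_correct = prediction == record["verdict"]
```

Taking the first label found in the reply is a deliberately simple choice: a model that does not follow the instruction and produces a longer answer mentioning both words would need a stricter rule, for instance counting any reply that is not exactly one of the two labels as wrong.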
   The prompt does not contain any information about the subject of the
question or any other informative cues apart from the time reference needed to
anchor the claim in a temporal context. In this way, our benchmark not only
tests the model on question answering, but also indirectly tests its
instruction-following abilities in a language other than English.

3.6. Detailed data statistics

The full VeryfIT dataset is composed of 2,021 entries in the Italian language.
Out of these claims, 352 form the VeryfIT_small dataset, in which the entries
are equally split across the three main sides of a simplification of the
classical political spectrum (left, right, center) and a fourth label,
‘transversal’, used for claims whose author has no precise placement in the
political spectrum or no affiliation to any political party or coalition.
   Of the 352 claims in the VeryfIT_small dataset, 43 come with an enriched
variation of the statement, which provides additional context alongside the
original statement.
   The distribution of claims and factuality labels across topics is presented
in Table 8, Table 9, Table 10, and Table 11.

                                  Claims
   Macro_area             True     False     Total
   questioni sociali       256       170       426
   economia                264       155       419
   istituzioni             243        77       320
   esteri                  105        53       158
   giustizia                60        26        86
   altro                    46        32        78
   ambiente                 42        18        60
   un-noted                180       294       474
   total                 1,196       825     2,021

Table 8
VeryfIT: Distribution of claims and factuality labels per topic, ordered by
total value.

                                    Orientation label
   Macro_area            SX    CSX      C    CDX     DX   E-DX      T
   questioni sociali     19    105     27      5     27    101     80
   economia              11    119     43      1     52     54     52
   istituzioni           10     81     10      3     38     33     71
   esteri                 4     32     17      0     11     41     33
   giustizia              3     11      1      0     13     11     24
   altro                  1     17      8      1      6      8     14
   ambiente               1     10      2      0      2      6     14
   un-noted              23     59     79      0     14    145     68

Table 9
VeryfIT data after exclusion of claims for which information about the
political orientation of the speaker was not available: Distribution of claims
per topic and positioning in the political spectrum.

                                     Claims
   Macro_area                True        False        Total
   questioni sociali       50 [2]       37 [4]       87 [6]
   economia                53 [4]       37 [4]       90 [8]
   istituzioni            46 [11]       17 [5]      63 [16]
   esteri                  26 [4]       19 [3]       45 [7]
   ambiente                 8 [1]           10       18 [1]
   giustizia                    7            8           15
   altro                   10 [1]       24 [4]       34 [5]
   total                 200 [23]     152 [20]     352 [43]

Table 10
VeryfIT_small: Distribution of claims and factuality labels per topic, ordered
by total value. In square brackets, highlighted in green, the number of labels
of enriched statements.

                                 Orientation label
   Macro_area                  SX          C         DX          T
   questioni sociali       22 [2]     15 [1]     25 [2]     25 [1]
   economia                28 [2]     30 [5]     19 [1]         13
   istituzioni             19 [8]      8 [1]     15 [3]     21 [4]
   esteri                   9 [2]     13 [1]     12 [2]     11 [2]
   ambiente                     2          7      3 [1]          6
   giustizia                    2          4          3          6
   altro                    6 [1]     11 [3]         11      6 [1]
   total                  88 [15]    88 [11]     88 [9]     88 [8]

Table 11
VeryfIT_small: Distribution of claims per topic and positioning in the
simplified political spectrum. In square brackets, highlighted in green, the
number of labels of enriched statements.

   Further statistics on the original CheckIT! dataset are available in
Figure A and Table A in Appendix A.
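The label counts in Table 8 can be used for a quick sanity check and to get a reference point for the accuracy metric used in the task. The sketch below recomputes the column totals and the accuracy that a constant predictor always answering "Vero" would obtain on the full dataset. The counts are copied from Table 8; the `accuracy` helper and the constant-predictor baseline are illustrative additions, not results reported by the authors.

```python
# Counts copied from Table 8: macro_area -> (true_claims, false_claims).
TABLE_8 = {
    "questioni sociali": (256, 170),
    "economia": (264, 155),
    "istituzioni": (243, 77),
    "esteri": (105, 53),
    "giustizia": (60, 26),
    "altro": (46, 32),
    "ambiente": (42, 18),
    "un-noted": (180, 294),
}


def accuracy(gold, predicted):
    """Proportion of predictions that match the gold verdicts."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Recompute the totals reported in the last row of Table 8.
total_true = sum(t for t, _ in TABLE_8.values())
total_false = sum(f for _, f in TABLE_8.values())
total = total_true + total_false

# Rebuild the gold labels implied by the totals and score a predictor
# that always answers "Vero" (hypothetical baseline for illustration).
gold = ["Vero"] * total_true + ["Falso"] * total_false
majority_baseline = accuracy(gold, ["Vero"] * total)

print(total_true, total_false, total)
print(round(majority_baseline, 3))
```

Under these counts the totals match the table (1,196 true, 825 false, 2,021 overall), and the constant predictor reaches an accuracy of about 0.59, so a model's score on the full dataset is best read against that figure rather than against 0.5.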
4. Metrics

Accuracy serves as the evaluation metric of the task due to its intuitive
interpretation and broad applicability. Accuracy provides a clear measure of a
classifier’s overall performance by calculating the proportion of correct
predictions among the total number of cases examined.
   No other metrics were chosen for the task.

5. Limitations

The totality of the data comes from an expert, reliable source. For this
reason, the quality of the verdicts is assured to be high. One possible
limitation is due to the time-relatedness of said verdicts: claims can be true
or false depending on the temporal context in which they are evaluated. LMs
could have a hard time discerning information pertaining to specific time
intervals, also because they may not have been trained on data related to
them.
   Another limitation could be the depth of the factual knowledge required to
understand and consequently answer the questions of the dataset. As previously
stated, VeryfIT data concerns the Italian/European context and touches on
details of various fields that most citizens would probably not know about.
   Remarkably, the risk of the data being present in training corpora for LMs
should be mitigated, as the CheckIT! dataset is not publicly released.
   Finally, fact-checking is a very complex task and statements could carry
different degrees of truthfulness, more than a binary setting can express. For
now, we chose to limit the task to a binary classification challenge so as not
to make it too complicated, but we do not exclude further development towards
a multi-label setting to better capture the nuances of the fact-checking
process.

6. Ethical issues

No ethical issue has arisen from the making of this task; all the data has
been sourced through agreements with the original authors.

7. Data license and copyright issues

The data cannot be publicly released due to a Data Sharing Agreement between
the University of Groningen and Pagella Politica. At the time of writing,
VeryfIT can be obtained by contacting Dr. Tommaso Caselli.

References

 [1] The Economist, Disinformation is on the rise. how does it work?, 2024.
     URL: https://www.economist.com/science-and-technology/2024/05/01/disinformation-is-on-the-rise-how-does-it-work.
 [2] C. Wardle, H. Derakhshan, Information disorder: Toward an
     interdisciplinary framework for research and policymaking, volume 27,
     Council of Europe Strasbourg, 2017.
 [3] OpenAI, Disrupting deceptive uses of ai by covert influence operations,
     2024.
 [4] C. Chen, K. Shu, Combating misinformation in the age of llms:
     Opportunities and challenges, AI Magazine (2024).
     URL: https://doi.org/10.1002/aaai.12188. doi:10.1002/aaai.12188.
 [5] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song,
     J. Steinhardt, Measuring massive multitask language understanding, 2021.
     URL: https://arxiv.org/abs/2009.03300. arXiv:2009.03300.
 [6] A. Srivastava, D. Kleyjo, Z. Wu, Beyond the imitation game: Quantifying
     and extrapolating the capabilities of language models, Transactions on
     Machine Learning Research (2023).
 [7] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You,
     Z. Guo, L. Zhu, et al., Benchmarking large language models on cmexam-a
     comprehensive chinese medical exam dataset, Advances in Neural
     Information Processing Systems 36 (2024).
 [8] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring how models mimic
     human falsehoods, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
     Proceedings of the 60th Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers), Association for
     Computational Linguistics, Dublin, Ireland, 2022, pp. 3214–3252.
     URL: https://aclanthology.org/2022.acl-long.229.
     doi:10.18653/v1/2022.acl-long.229.
 [9] P. Wang, A. Chan, F. Ilievski, M. Chen, X. Ren, Pinto: Faithful language
     reasoning using prompt-generated rationales, in: Workshop on Trustworthy
     and Socially Responsible Machine Learning, NeurIPS 2022, 2022.
[10] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, P. Szolovits, What
     disease does this patient have? a large-scale open domain question
     answering dataset from medical exams, 2020.
     URL: https://arxiv.org/abs/2009.13081. arXiv:2009.13081.
[11] A. Talmor, J. Herzig, N. Lourie, J. Berant, Commonsenseqa: A question
     answering challenge targeting commonsense knowledge, 2019.
     URL: https://arxiv.org/abs/1811.00937. arXiv:1811.00937.
[12] L. C. Passaro, A. Bondielli, P. Dell’Oglio, A. Lenci, F. Marcelloni,
     In-context annotation of topic-oriented datasets of fake news: A case
     study on the notre-dame fire event, Information Sciences 615 (2022)
     657–677.
     URL: https://www.sciencedirect.com/science/article/pii/S0020025522008167.
     doi:10.1016/j.ins.2022.07.128.
[13] D. Croce, A. Zelenanska, R. Basili, Neural learning for question
     answering in italian, in: C. Ghidini, B. Magnini, A. Passerini,
     P. Traverso (Eds.), AI*IA 2018 – Advances in Artificial Intelligence,
     Springer International Publishing, Cham, 2018, pp. 389–402.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions
     for machine comprehension of text, 2016.
     URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.
[15] I. Plaza, N. Melero, C. del Pozo, J. Conde, P. Reviriego,
     M. Mayor-Rocher, M. Grandury, Spanish and llm benchmarks: is mmlu lost
     in translation?, arXiv preprint arXiv:2406.17789 (2024).
[16] J. Gili, L. Passaro, T. Caselli, Checkit!: A corpus of expert
     fact-checked claims for italian, in: F. Boschetti, G. Lebani,
     B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian
     Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop
     Proceedings, CEUR-WS.org, 2023.
Appendix A


Figure A: Original data from subset d1 of CheckIT!: Distribution of claims across the political spectrum with respect to
factual veracity.


                                            Orientation label
             Macro_area             SX     CSX       C     CDX      DX    E-DX       T
             economia               21     243      74       6     119     145     142
             questioni sociali      30     215      62      12      50     203     174
             istituzioni            11     150      24       6      81      54     144
             esteri                  7      75      25       0      19      99      80
             ambiente                5      30       8       0       2       9      29
             giustizia               3      23       4       0      23      23      35
             altro                  33      96      97       2      23     171     107
             total                 110     832     294      26     317     704     711
Table A
Original data from subset d1 of CheckIT!: Distribution of claims per topic and positioning in the full political spectrum. The
far-left label is omitted because it does not occur in the dataset.