CEUR-WS Vol-3878, paper 97_main_long. PDF: https://ceur-ws.org/Vol-3878/97_main_long.pdf. DBLP: https://dblp.org/rec/conf/clic-it/ScaiellaCPCG24
                                Leveraging Large Language Models for Fact Verification in
                                Italian
                                Antonio Scaiella1,2 , Stefano Costanzo1 , Elisa Passone1 , Danilo Croce1,* and Giorgio Gambosi1
1 Department of Enterprise Engineering, University of Rome Tor Vergata, Italy
2 Reveal s.r.l.


                                                Abstract
                                                In recent years, Automatic Fact Checking has become a crucial tool for combating fake news by leveraging AI to verify
                                                the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in
                                                English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available
                                                in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises
                                                approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art
                                                LLM, namely LLaMA3, on both the original English and translated Italian datasets, demonstrating that fine-tuning significantly
                                                improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages,
                                                highlighting the value of the proposed resource.

                                                Keywords
                                                Automatic Fact Checking, Fact Checking in Italian, Resource in Italian, Large Language Model for Fact Verification



1. Introduction

In recent years, Automatic Fact Checking (AFC) has assumed a significant role as an instrument to identify fake news. AFC is a process that verifies the truthfulness and accuracy of information, claims, and data contained in a text or speech. The focus is on debunking disinformation and misinformation, intercepting errors, and verifying sources and facts.

Automated fact-checking uses AI tools to identify, verify, and respond to misleading claims, applying techniques from natural language processing, machine learning, knowledge representation, and databases to automatically predict the truthfulness of claims [1]. This is a complex process that involves searching, interpreting, and assessing information. As discussed in [1], an NLP framework for automated fact-checking consists of three stages: claim detection, to identify claims that require verification; evidence retrieval, to find sources supporting or refuting the claim; and claim verification, to assess the truthfulness of the claim based on the retrieved evidence.

Automating the fact-checking process was first discussed in the context of computational journalism, in works such as [2], and has since received significant attention in the computational linguistics and, more generally, the artificial intelligence communities, surveyed in [1] and more recently in [3] and [4]. In particular, the survey in [1] describes the early developments previously covered in [5], an exhaustive overview of the subject.

As with most machine learning paradigms [1], state-of-the-art methods require datasets and benchmarks. One of the most impactful campaigns for collecting a large-scale benchmark is FEVER (Fact Extraction and VERification) [6]. In this context, fact-checking involves verifying whether a claim is supported by one or more pieces of evidence. FEVER is a publicly available dataset designed for claim verification against textual sources. It comprises about 180K claims generated by altering sentences extracted from Wikipedia. The claims are classified into three categories: Supported (a piece of evidence exists and it supports the claim), Refutes (a piece of evidence exists and it contradicts the claim), or NotEnoughInfo (there is insufficient evidence to verify the claim). The challenge, therefore, is to retrieve the relevant evidence and verify the accuracy of the claims, categorizing them with the correct label.

Many works like FEVER have recently focused on building datasets for the task of Fact Verification, achieving very good results [7, 8, 9, 10, 11, 12]. However, all of these datasets are designed for the English language. Although multilingual models exist (e.g., [13, 14]), a model fine-tuned on a specific language, or pre-trained for a specific task and use case, can suffer a significant decline in quality when applied to another language. Few studies have trained models for languages other than English; an example is the work presented in [15], which focuses on developing automated claim detection for Dutch-language fact-checkers.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
scaiella@revealsrl.it (A. Scaiella); stefano.costanzo@students.uniroma2.eu (S. Costanzo); passone@ing.uniroma2.it (E. Passone); croce@info.uniroma2.it (D. Croce); giorgio.gambosi@uniroma2.it (G. Gambosi)
ORCID: 0000-0001-9111-1950 (D. Croce); 0000-0001-9979-6931 (G. Gambosi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
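As a schematic illustration of the three-stage framework described in the Introduction (claim detection, evidence retrieval, claim verification), the sketch below wires the stages together. The sentence-splitting and word-overlap heuristics and all function names are illustrative assumptions, not components of any system discussed in this paper.

```python
from typing import List, Tuple

LABELS = ("Supports", "Refutes", "NotEnoughInfo")

def claim_detection(text: str) -> List[str]:
    # Toy heuristic: treat every sentence as a check-worthy claim.
    return [s.strip() for s in text.split(".") if s.strip()]

def evidence_retrieval(claim: str, corpus: List[str]) -> List[str]:
    # Toy retrieval: keep passages sharing at least two words with the claim.
    claim_words = set(claim.lower().split())
    return [p for p in corpus
            if len(claim_words & set(p.lower().split())) >= 2]

def claim_verification(claim: str, evidence: List[str]) -> str:
    # Without evidence, nothing can be verified; a real system would
    # classify each (claim, evidence) pair with a trained model.
    if not evidence:
        return "NotEnoughInfo"
    return "Supports"  # placeholder verdict

def fact_check(text: str, corpus: List[str]) -> List[Tuple[str, str]]:
    results = []
    for claim in claim_detection(text):
        evidence = evidence_retrieval(claim, corpus)
        results.append((claim, claim_verification(claim, evidence)))
    return results
```

In a real pipeline each stub would be replaced by a retrieval engine and a verification model; the skeleton only fixes the data flow between the three stages.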
In this work, we propose the FEVER-IT dataset, in which the FEVER dataset has been translated into Italian to train models for the Italian language. Inspired by SQuAD-IT [16] and MSCOCO-IT [17], we worked to obtain quality data. Although the training set may be affected by translation errors, the test set is not, as it is composed of manually validated data. Furthermore, while the original FEVER dataset contained evidence only for Supports and Refutes, in this work we have also added and translated examples for the NotEnoughInfo category using the heuristics proposed in [18]. This work extends the experience described in [19], where translations were done using the Google API, by using publicly available models ([20]) and adding data for the NotEnoughInfo category.

The contribution of this work is twofold. Firstly, we release FEVER-IT, a corpus with 228K claims, each associated with at least one (possibly useful) piece of evidence, including a test set of about 2,000 manually validated claims. In addition, we fine-tuned and validated a state-of-the-art model, LLaMA3 [14], on both the original English dataset and the Italian dataset. While this provides a high-performance model ready for the task in both languages, the primary goal is to assess whether the quality of the Italian data is comparable to the English one. By training the model separately on each dataset, we can evaluate its stability: if the model performs similarly on the manually validated Italian test set and on the English test set, we can conclude that the quality of the Italian data is on par with the English data.

Additionally, we want to assess whether using an Italian training dataset, despite the noise from automatic translation, is truly beneficial. LLMs like LLaMA3 can already perform tasks in other languages through zero-shot or few-shot learning, without requiring fine-tuning on a specific dataset, especially if that dataset is noisy. Therefore, we compare the performance on the test set of a LLaMA3 model that has not been fine-tuned on the noisy Italian data with one that has, to determine whether fine-tuning actually improves results or whether the model performs on par or better without it.

The experimental results show that the model without fine-tuning achieves an average accuracy of only about 45%. Fine-tuning on the English dataset yields about 90% mean accuracy, while fine-tuning on the Italian dataset yields an accuracy quite similar to that of the fine-tuned English model and much greater than testing without fine-tuning1.

The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents FEVER-IT, Section 4 details the experimental evaluation, and Section 5 provides the conclusions.

1 The resource, fine-tuned models, and code will be released on a dedicated repository: https://github.com/crux82/FEVER-it

2. Related Work

One of the pioneering works in autonomous fact-checking was conducted by [21], which proposed creating publicly available datasets and developing automated systems using natural language processing technologies. Recent challenges such as CheckThat! at CLEF [10, 11, 12] and FEVER [7, 8, 9], from 2018 onwards, have advanced fact-checking tasks by leveraging advanced approaches and integrating Large Language Models (LLMs) like BERT and GPT. These models represent the current state of the art in many Natural Language Processing tasks, including fact-checking. Notable examples of such technology include FacTeR-Check [22], a multilingual architecture for semi-automated fact-checking and hoax propagation analysis using the XLM-RoBERTa Transformer [13], and FACT-GPT [23], a framework that automates the claim-matching phase of fact-checking using LLMs to identify social media content that supports or contradicts claims previously debunked by fact-checkers.

The success of these systems is largely due to the capabilities of LLMs, summarized in [3], which are neural models based on the Transformer architecture. Specifically, decoder-based architectures, such as GPT [24], GPT-3 [25], and LLaMA [14], generate output sequences in an auto-regressive manner. These models have demonstrated impressive capabilities following pre-training on large collections of documents. One notable outcome is few-shot learning, where models can adapt to new tasks with only a few examples [25], greatly enhancing their flexibility and applicability.

When new annotated data is available, fine-tuning further enhances a model's capabilities. This process involves taking the pre-trained base model and training it on a smaller, specialized dataset relevant to the desired task. Parameter-Efficient Fine-Tuning (PEFT) is an optimized approach that trains only a small portion of the weights, typically by adding new layers to the model. One widely used technique is LoRA [26], which adds an adapter consisting of two weight matrices that are relatively small compared to the original model. ExtremITA [27] is an example of a decoder-based model fine-tuned with LoRA in Italian for multi-task execution.

Several benchmark datasets have been developed to fine-tune and evaluate fact-checking systems, typically collected by organizations like Snopes, FullFact, and PolitiFact. The FEVER challenge has produced four major datasets: FEVER (2018) [6], FEVER 2.0 (2019) [8], FEVEROUS (2021) [9], and AVeriTeC (2024) [28]. These datasets range from labeled claim-evidence associations to verified claims with structured and unstructured evidence. Despite the wealth of resources available, there is a lack of large benchmark datasets in Italian. This work addresses this gap by providing a large-scale Italian resource.
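As a back-of-the-envelope illustration of why the LoRA adapters mentioned above are small relative to the original model, the sketch below counts the trainable parameters of a rank-r adapter against full fine-tuning of one weight matrix. The 4096-dimensional projection is an illustrative assumption about a LLaMA-sized layer; rank r = 8 matches the LoRA_R value reported later in this paper.

```python
def lora_params(d: int, k: int, r: int) -> tuple:
    """Trainable parameters: full fine-tuning vs. a LoRA adapter of rank r.

    LoRA freezes the original d x k weight matrix W and learns only the
    low-rank update Delta W = B @ A, with B of shape (d, r) and A of
    shape (r, k)."""
    full = d * k          # every entry of W is trainable
    lora = d * r + r * k  # only the two adapter matrices are trainable
    return full, lora

# A hypothetical 4096 x 4096 projection with rank r = 8:
full, lora = lora_params(4096, 4096, 8)
print(f"full: {full:,} | LoRA: {lora:,} | ratio: {lora / full:.4%}")
```

For this single matrix the adapter trains roughly 0.4% of the weights, which is what makes PEFT practical on a single GPU.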
3. Fact Verification in Italian

As in [6], the original FEVER dataset is composed of claims that can potentially be verified against an encyclopedic resource, in this case, Wikipedia. The claims are classified into three categories: Supported, Refutes and NotEnoughInfo. For the first two categories, each claim is associated with one or more passages from Wikipedia, each specifying the page from which it was extracted. For the NotEnoughInfo category, no passages are provided, because no information was found on Wikipedia to support or refute the claim. For instance, the sentence "Dan Brown is illiterate." is a claim associated with pieces of evidence such as: "Angels and Demons is a 2000 bestselling mystery-thriller novel written by American author Dan Brown and published by Pocket Books and then by Corgi Books.". This evidence proves that the claim is incorrect, so it can be classified with the label Refutes. In FEVER, a claim is thus a sentence that expresses information (true or mutated) about a target entity.

To generate the Italian dataset, we started from the dataset version2 proposed in [29], which consists of 260k claims. This version extends the original FEVER by adding evidence associated with claims labeled as NotEnoughInfo in FEVER, using the heuristics in [18]. The approach involved using a search engine to retrieve potential evidence and a textual entailment system based on GPT [24]. Claims not judged as Supports or Refutes were classified as NotEnoughInfo. This gives us examples of sentences that are closely related to the claim (according to the search engine) but neither support nor refute it. This makes it more straightforward and efficient to train and/or evaluate a classifier, even though some of the derived examples might be somewhat noisy, as they were generated through heuristics.

For the automatic translation process, we utilized MADLAD400 [20], a machine translation system based on the Transformer architecture3, trained on MADLAD, a manually audited, general-domain, 3T-token multilingual dataset based on CommonCrawl, spanning 419 languages. Since the Italian data are obtained through machine translation, and thus potentially incorrect as suggested in [16, 17], we needed validated test data to obtain a realistic benchmark. Our hypothesis is that an LLM is robust enough to generalize from the 228k examples and recognize the relationships involved in FEVER without inheriting translation errors. However, to prevent these errors from being inherited by the model, we manually corrected the translations of the test set.

Out of the approximately 16k available test examples, three annotators were involved in verifying and correcting 2,063 translations from the test set. The annotators focused on correcting mistakes related to proper sentence structure in Italian, the accurate meaning of specific English words that MADLAD had translated literally, any misunderstandings of the intended meaning in Italian, and a few grammatical errors.

In some cases, translation errors do not completely undermine the examples with respect to the task's purpose. For instance, the English sentence from a piece of evidence, "he was booked to win a third world championship at a WWE event on the night of his death", was translated into Italian as "era stato prenotato per vincere un terzo titolo mondiale in un evento della WWE la notte della sua morte". A more accurate translation would be "si pensava avrebbe vinto un terzo titolo mondiale in un evento della WWE la notte della sua morte", better capturing the verb's meaning. In other, more problematic cases, translation errors, loss of information, or the introduction of hallucinations could even change the classification in the fact verification task. For example, for the claim "The Thin Red Line (1998 film) has an all-British cast.", the automatic translation was "La sottile linea rossa (The Thin Red Line) è un film del 1998.", which is incorrect because it omits the information about the cast. This detail is crucial, as its absence could lead to incorrect labeling.

A quantitative analysis of the translation quality suggests that MADLAD performs well in translating simple assertive sentences such as claims. In fact, 91% of the claims were not altered by the validators, who considered them completely correct. This percentage is lower for the Wikipedia passages, dropping to 76%. This discrepancy may be due to the greater complexity of the evidence compared to the simpler sentence structures in the claims. Additionally, we report the results in terms of BLEU score [30] for the corrected translations compared to the originals, as shown in Table 1.

   Metric      BLEU-1    BLEU-2    BLEU-3    BLEU-4
   Claim       0.9776    0.9695    0.9623    0.9544
   Evidence    0.9529    0.9411    0.9309    0.9207

Table 1
BLEU scores of the manually validated (gold) Claims and Evidence with respect to their automatically translated (silver) versions.

               Train (S)    Dev (S)    Test (G)      Total
   Supports      114,801      4,638         654    120,095
   Refutes        47,096      4,887         643     52,626
   NEI            66,380      6,410         766     73,556
   Total         228,277     15,935       2,063    246,275

Table 2
Number of claims and evidence in the Italian dataset. (S) indicates silver data (automatically translated), and (G) indicates gold data (manually validated).

2 https://huggingface.co/datasets/copenlu/fever_gold_evidence
3 https://github.com/google-research/google-research/tree/master/madlad_400
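The BLEU-n figures in Table 1 can be reproduced with any standard BLEU implementation; as a reference for how such corpus-level scores are computed, here is a minimal sketch. It assumes whitespace tokenization, uniform n-gram weights, and a single reference per sentence (the setting of Table 1); the paper does not specify which toolkit was actually used.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, max_n):
    """Corpus-level BLEU-N with uniform weights, whitespace tokenization,
    one reference per candidate, and the standard brevity penalty."""
    cand_toks = [c.split() for c in candidates]
    ref_toks = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(cand_toks, ref_toks):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clipped (modified) n-gram precision counts.
            matched += sum(min(cnt, ref_counts[g])
                           for g, cnt in cand_counts.items())
            total += max(len(cand) - n + 1, 0)
        if matched == 0:
            return 0.0
        log_precisions.append(log(matched / total))
    c = sum(len(t) for t in cand_toks)
    r = sum(len(t) for t in ref_toks)
    bp = 1.0 if c > r else exp(1 - r / c)  # brevity penalty
    return bp * exp(sum(log_precisions) / max_n)
```

For instance, `bleu(silver_sentences, gold_sentences, 4)` would yield the BLEU-4 column of Table 1, with the silver translations scored against their gold corrections.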
It should be noted that measuring the translation quality after correcting the sentences introduces a strong bias in the measurements; however, it provides a more specific idea of the translation quality, especially in understanding the potential noisiness of the training and development sentences. In this case, scores of over 95% for BLEU-1 and over 92% for BLEU-4 suggest that very few terms were altered during validation, and even the grammatical patterns remained largely unchanged. At most, a few mistranslated terms needed updating, as indicated by the qualitative analysis.

Table 2 summarizes the number of examples created for the Italian dataset. In line with the original English material, the dataset is divided into training, development, and test sets, with claims categorized into Supports, Refutes, and NotEnoughInfo (NEI). The table also distinguishes between silver data (automatically translated) and gold data (manually validated). The training set consists of 228,277 claims, the development set contains 15,935 claims, and the test set has 2,063 claims. Each Italian claim or piece of evidence is aligned with its English counterpart, facilitating future research in cross-lingual fact verification.

Language Models for Fact Verification. To address the capabilities of Large Language Models in Fact Verification, they can be utilized through In-Context Learning techniques [31] or by directly fine-tuning the model for specific downstream tasks. In-context learning relies on the model's pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot learning, where no examples are given, and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot learning, where one example per class is added to provide a more precise context, helping the model better understand the task by offering a concrete reference point; iii) few-shot learning, where more than one example per class is provided to give the model additional contextual information during decision-making. When the model's pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model's performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries, with a focus on the specific task at hand. We thus evaluated a state-of-the-art LLM, namely LLaMA3 [32], by providing just the definition of the task (zero-shot), adding an example per class (one-shot), or performing fine-tuning, to demonstrate the necessity of a training dataset like the one constructed in this work, as discussed in the following section.

4. Experimental Evaluation

The goal of our experimentation is to assess the performance of a state-of-the-art LLM applied to Fact Verification. Specifically, we aim to determine whether a multilingual model maintains consistent quality when applied to both the English FEVER dataset and our Italian dataset. We utilize LLaMA3-Instruct4, an instruction-tuned generative text model from Meta with 8 billion parameters, released in April 2024. This model is trained to execute specific instructions or prompts across various tasks. To ensure alignment, we evaluate the systems on the manually validated Italian test set and on the same subset of 2,063 claims in the English counterpart. The model is evaluated in 0-shot and 1-shot settings to assess its capability without fine-tuning. The prompts used in English and Italian are provided in Appendix A. Additionally, we fine-tuned LLaMA3 on the English dataset from [29] and, separately, on the Italian dataset obtained via machine translation. Fine-tuning was conducted on an NVIDIA A100 using the LoRA technique5.

In FEVER, the title of the document associated with each claim often provides crucial context. For example, the claim "The University of Leicester discovered and identified the remains of a king." relies on the document titled "University of Leicester" to be correctly classified as Supports. To assess the model's generalization, we evaluate the impact of including document titles in the prompts. The metrics used to analyze the results are recall, precision, accuracy, and F1 score, calculated globally and for each label (Supports, Refutes, NotEnoughInfo).

The results are reported in Tables 3 and 4 for the English and Italian datasets, respectively. Each table shows whether the model underwent fine-tuning (column FT), whether a prompt without examples (0-shot) or with one example per class (1-shot) was used (column Prompt), and whether the document title was included (column Doc). Notably, if no fine-tuning was performed, the original LLaMA3-Instruct model was used. Given that the system's response can consist of multiple words, we search the output for a mention of one of the classes and associate the example with that class. If no class is identified, the result is classified as NotEnoughInfo. In general, the fine-tuned model is extremely stable, consistently outputting one of the three categories for every request. The non-fine-tuned model, on rare occasions (just a few dozen times out of 2,000), produces responses that do not correspond to any of the required classes. This highlights the inherent stability of LLaMA3 while also supporting

4 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
5 The following hyperparameters were used: a learning rate of 0.0001, two epochs, LoRA_R set to 8, LoRA_alpha set to 16, and LoRA_dropout at 0.05. The micro-batch size was 2, and gradient accumulation steps were set to 8.
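The mapping from a free-form model response to one of the three labels, described above, searches the output for a mention of a class and falls back to NotEnoughInfo when none is found. The exact matching rules are not specified in the paper; the sketch below is one plausible implementation, and the keyword stems are assumptions.

```python
def extract_label(output: str) -> str:
    """Map a model response to Supports / Refutes / NotEnoughInfo.

    The first class mentioned in the output wins; if no class is
    mentioned, fall back to NotEnoughInfo, as done in the evaluation.
    The keyword stems below are illustrative assumptions."""
    text = output.lower()
    stems = {
        "Supports": ("support",),
        "Refutes": ("refut",),
        "NotEnoughInfo": ("not enough info", "notenoughinfo"),
    }
    first_pos = {}
    for label, keys in stems.items():
        hits = [text.find(k) for k in keys if k in text]
        if hits:
            first_pos[label] = min(hits)
    if not first_pos:
        return "NotEnoughInfo"  # fallback when no class is identified
    return min(first_pos, key=first_pos.get)
```

A matcher of this kind deliberately ignores negation ("not supported" still matches the Supports stem), which is one reason stable, single-word outputs from the fine-tuned model are easier to score.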
                            Supports             Refutes           Not enough info       Macro Average
 FT    Prompt   Doc   Acc    P     R     F1      P     R     F1      P     R     F1      P     R     F1
 No    0-shot   No   0.449  0.784 0.161 0.267  0.647 0.236 0.346  0.395 0.873 0.544  0.609 0.423 0.386
 No    0-shot   Yes  0.374  0.343 0.976 0.507  0.763 0.160 0.265  0.477 0.041 0.075  0.528 0.392 0.282
 No    1-shot   No   0.591  0.555 0.864 0.675  0.699 0.415 0.521  0.586 0.507 0.543  0.613 0.595 0.580
 No    1-shot   Yes  0.383  0.929 0.020 0.039  0.867 0.020 0.040  0.376 0.999 0.546  0.724 0.346 0.208
 Yes   0-shot   No   0.917  0.932 0.947 0.939  0.924 0.888 0.906  0.899 0.916 0.908  0.918 0.917 0.918
 Yes   0-shot   Yes  0.922  0.938 0.953 0.945  0.929 0.896 0.912  0.902 0.918 0.910  0.923 0.922 0.923
 Yes   1-shot   No   0.914  0.928 0.948 0.938  0.927 0.883 0.905  0.893 0.911 0.902  0.916 0.914 0.915
 Yes   1-shot   Yes  0.921  0.931 0.956 0.943  0.927 0.891 0.909  0.907 0.916 0.912  0.922 0.921 0.921

Table 3
Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-EN dataset.

                            Supports             Refutes           Not enough info       Macro Average
 FT    Prompt   Doc   Acc    P     R     F1      P     R     F1      P     R     F1      P     R     F1
 No    0-shot   No   0.462  0.411 0.951 0.574  0.607 0.457 0.522  0.585 0.050 0.092  0.534 0.486 0.396
 No    0-shot   Yes  0.507  0.463 0.942 0.620  0.587 0.663 0.622  0.800 0.005 0.010  0.617 0.537 0.418
 No    1-shot   No   0.425  0.376 0.963 0.541  0.671 0.333 0.445  0.478 0.043 0.079  0.508 0.446 0.355
 No    1-shot   Yes  0.462  0.403 0.968 0.569  0.632 0.361 0.459  0.698 0.115 0.197  0.578 0.481 0.409
 Yes   0-shot   No   0.897  0.897 0.940 0.918  0.924 0.845 0.882  0.877 0.903 0.890  0.899 0.896 0.897
 Yes   0-shot   Yes  0.901  0.899 0.936 0.917  0.923 0.855 0.888  0.887 0.910 0.898  0.903 0.900 0.901
 Yes   1-shot   No   0.895  0.891 0.947 0.918  0.919 0.843 0.879  0.881 0.894 0.887  0.897 0.895 0.895
 Yes   1-shot   Yes  0.905  0.913 0.942 0.927  0.924 0.854 0.888  0.883 0.915 0.899  0.907 0.904 0.905

Table 4
Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-IT dataset.
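The per-label precision, recall, and F1 values in Tables 3 and 4, together with their macro averages and the accuracy column, follow the standard definitions; a minimal sketch of how such figures can be derived from gold and predicted labels (variable names are illustrative):

```python
def per_label_metrics(gold, pred, labels):
    """Per-label precision/recall/F1, macro averages, and accuracy."""
    report = {}
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[lab] = (prec, rec, f1)
    # Macro average: unweighted mean over the three labels.
    macro = tuple(sum(report[l][i] for l in labels) / len(labels)
                  for i in range(3))
    acc = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return report, macro, acc
```

Because the macro average weights the three classes equally, a model that collapses onto one label (as in some non-fine-tuned rows) scores poorly even when its accuracy looks moderate.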



the soundness of the results achieved.
   A key finding is that the multilingual model generally achieves similar, though modest, results on the English and Italian datasets without fine-tuning, with accuracy values around 0.40-0.50 and average F1 scores in the range of 0.35-0.55. This performance is relatively unstable, and the addition of an example in the prompt does not lead to significant improvements. In English there are some improvements, but in Italian there are fewer. We believe this is because, although LLaMA is multilingual, the percentage of Italian examples observed during training is less than 1%, making it less accurate and less stable in this language.
   However, when fine-tuning is applied, the results improve dramatically, with accuracy exceeding 90% in both languages. This demonstrates the utility of the translated dataset, even if it contains some noise. In this scenario, adding an example in the prompt leads to negligible but consistent improvements. Additionally, the inclusion of the document title, while sometimes causing inconsistencies in zero-shot learning, is better exploited by the fine-tuned model, leading to slight but not significant improvements. This is interesting because it suggests that a model that does not rely on document titles is more broadly applicable. Overall, the fine-tuned models perform significantly better, highlighting the importance of the translated dataset for achieving high accuracy in fact verification tasks in both English and Italian.
   The error analysis suggests that the model sometimes inherits the mathematical reasoning limitations of the LLM. For example, the claim "Il Castello di Praga attira oltre 18 milioni di visitatori ogni anno."⁶ was given the evidence "Il castello è tra le attrazioni turistiche più visitate di Praga che attira oltre 1,8 milioni di visitatori all'anno."⁷ The model's predicted label was Refutes, while the true label was Supports. Here, the true label should be Supports since 18 million is indeed greater than 1.8 million, but the model found the numbers inconsistent. In another case, the claim "Ned Stark è stato introdotto nel 1996 in Tempesta di spade."⁸ was paired with the evidence "Introdotto nel 1996 in Il Trono di Spade, Ned è l'onorevole signore di Winterfell, un'antica fortezza nel nord del continente immaginario di Westeros."⁹ The model predicted Refutes, although the true label was Supports. The confusion here is due to the difference in the book titles, which belong to the same series but are distinct works. The error analysis revealed that the model occasionally struggled with mathematical reasoning and contextual understanding, highlighting areas for future enhancement. Larger models and further fine-tuning could potentially address these issues, which remain open questions for future research.

⁶ In English: "The Prague Castle attracts over 18 million visitors every year."
⁷ In English: "The castle is among the most visited tourist attractions in Prague, attracting over 1.8 million visitors every year."
⁸ In English: "Ned Stark was introduced in 1996 in A Storm of Swords."
⁹ In English: "Introduced in 1996 in A Game of Thrones, Ned is the honorable lord of Winterfell, an ancient fortress in the north of the imaginary continent of Westeros."
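The Prague Castle example also involves a locale pitfall: in Italian text, "1,8" (with a decimal comma) denotes 1.8, so naively stripping punctuation conflates it with 18. A minimal, purely illustrative sketch (our own, not part of the evaluated system) of locale-aware normalization of such figures:

```python
def parse_it_number(token: str) -> float:
    """Parse a number written with Italian conventions:
    ',' is the decimal separator, '.' groups thousands."""
    return float(token.replace(".", "").replace(",", "."))

claim_value = parse_it_number("18")      # from "oltre 18 milioni"
evidence_value = parse_it_number("1,8")  # from "oltre 1,8 milioni"

# With correct parsing the two figures are clearly distinct.
print(claim_value, evidence_value)  # 18.0 1.8
```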
5. Conclusion

In this work, we have introduced FEVER-IT, an Italian version of the FEVER dataset, designed to improve the training and evaluation of models for fact verification in the Italian language. Using a machine translation system, we translated a large-scale dataset of 228,000 claim/evidence pairs and manually validated 2,000 test instances to ensure meaningful evaluations. This enabled us to fine-tune a state-of-the-art LLM, specifically LLaMA3, and assess its performance in both English and Italian.
   Our experiments demonstrated that the multilingual model, without fine-tuning, performed similarly on both the English and Italian datasets, though the accuracy and stability were limited. Fine-tuning significantly improved the model's performance, achieving over 90% accuracy in both languages. This underscores the importance and effectiveness of the translated dataset, even if it contains some noise.
   Future work will explore the performance of larger models and further refinement of the dataset to enhance accuracy and generalization capabilities, or explore more complex settings such as those described in [9].

Acknowledgments

The team would like to thank Monika Kakol for her invaluable support in the validation of the translations. This work was supported by Project ECS 0000024 Rome Technopole, CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.

References

[1] Z. Guo, M. S. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Trans. Assoc. Comput. Linguistics 10 (2022) 178-206.
[2] T. Flew, C. Spurgeon, A. Daniel, A. Swift, The promise of computational journalism, Journalism Practice 6 (2012) 157-171.
[3] C. Chen, K. Shu, Combating misinformation in the age of LLMs: Opportunities and challenges, 2023. URL: https://arxiv.org/abs/2311.05656. arXiv:2311.05656.
[4] M. Akhtar, M. Schlichtkrull, Z. Guo, O. Cocarascu, E. Simperl, A. Vlachos, Multimodal automated fact-checking: A survey, 2023. URL: https://arxiv.org/abs/2305.13507. arXiv:2305.13507.
[5] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3346-3359. URL: https://aclanthology.org/C18-1283.
[6] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and VERification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 809-819. URL: https://aclanthology.org/N18-1074. doi:10.18653/v1/N18-1074.
[7] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and VERification (FEVER) shared task, in: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1-9. URL: https://aclanthology.org/W18-5501. doi:10.18653/v1/W18-5501.
[8] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1-6. URL: https://aclanthology.org/D19-6601. doi:10.18653/v1/D19-6601.
[9] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Dominican Republic, 2021, pp. 1-13. URL: https://aclanthology.org/2021.fever-1.1. doi:10.18653/v1/2021.fever-1.1.
[10] P. Nakov, G. D. S. Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 43rd European Conference on Information Retrieval, ECIR '21, Lucca, Italy, 2021, pp. 639-649. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_75.
[11] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting
the COVID-19 infodemic and fake news detection, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416-428.
[12] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449-458.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[15] B. Berendt, P. Burger, R. Hautekiet, J. Jagers, A. Pleijter, P. Van Aelst, FactRank: Developing automated claim detection for Dutch-language fact-checkers, Online Social Networks and Media 22 (2021) 100113. doi:10.1016/j.osnem.2020.100113.
[16] D. Croce, A. Zelenanska, R. Basili, Enabling deep learning for large scale question answering in Italian, Intelligenza Artificiale 13 (2019) 49-61. URL: https://doi.org/10.3233/IA-190018. doi:10.3233/IA-190018.
[17] A. Scaiella, D. Croce, R. Basili, Large scale datasets for image and video captioning in Italian, Italian Journal of Computational Linguistics 2 (2019) 49-60. URL: http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf.
[18] C. Malon, Team Papelo: Transformer networks at FEVER, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 109-113. URL: https://aclanthology.org/W18-5517. doi:10.18653/v1/W18-5517.
[19] L. Canale, A. Messina, Experimenting AI technologies for disinformation combat: the IDMO project, 2023. URL: https://arxiv.org/abs/2310.11097. arXiv:2310.11097.
[20] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat, MADLAD-400: A multilingual and document-level large audited dataset, in: Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 67284-67296.
[21] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18-22. URL: https://aclanthology.org/W14-2508. doi:10.3115/v1/W14-2508.
[22] A. Martín, J. Huertas-Tato, Á. Huertas-García, G. Villar-Rodríguez, D. Camacho, FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference, Knowledge-Based Systems 251 (2022) 109265. doi:10.1016/j.knosys.2022.109265.
[23] E. C. Choi, E. Ferrara, Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1441-1449. URL: https://doi.org/10.1145/3589335.3651910. doi:10.1145/3589335.3651910.
[24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, CoRR abs/1801.06146 (2018). URL: http://arxiv.org/abs/1801.06146. arXiv:1801.06146.
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
[27] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR
Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[28] M. Schlichtkrull, Z. Guo, A. Vlachos, AVeriTeC: A dataset for real-world claim verification with evidence from the web, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 65128-65167.
[29] P. Atanasova, D. Wright, I. Augenstein, Generating label cohesive and well-formed adversarial claims, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 3168-3177. URL: https://aclanthology.org/2020.emnlp-main.256. doi:10.18653/v1/2020.emnlp-main.256.
[30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, USA, 2002, pp. 311-318. URL: https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
[31] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A survey on in-context learning, 2024. URL: https://arxiv.org/abs/2301.00234. arXiv:2301.00234.
[32] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.


A. Prompt Engineering

This appendix contains the prompts used in the experiments. The prompts are provided in both Italian and English, reflecting the task-specific nature of the experiments. Each prompt begins with an explanation of the task and the meaning of the classes. In the different variants, the 0-shot setting does not include any examples, unlike the 1-shot setting. Where necessary, the name of the document from which the evidence is taken is also specified.

A.1. Prompts in English

A.1.1. 0-shot Setting

The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]

A.1.2. 1-shot Setting

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity

### Examples
These examples demonstrate how to apply the evaluation criteria:
- Claim: The Germanic peoples are also called Gothic.
- Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
- Answer: SUPPORTS

- Claim: Tennis is not a sport.
- Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
- Answer: REFUTES

- Claim: Kick-Ass is a horror film.
- Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
- Answer: NOT ENOUGH INFO
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]

A.1.3. 0-shot Setting with Document Title

The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.
- Document: denotes the source document for the evidence.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
- Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]

A.1.4. 1-shot Setting with Document Title

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.
- Document: denotes the source document for the evidence.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity

### Examples
These examples demonstrate how to apply the evaluation criteria:
- Claim: The Germanic peoples are also called Gothic.
- Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
- Document: Germanic peoples
- Answer: SUPPORTS

- Claim: Tennis is not a sport.
- Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
- Document: Tennis
- Answer: REFUTES

- Claim: Kick-Ass is a horror film.
- Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
- Document: Kick-Ass (film)
- Answer: NOT ENOUGH INFO
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
- Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]

A.2. Prompts in Italian

A.2.1. 0-shot Setting

The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]

A.2.2. 1-shot Setting

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.

Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]

A.2.3. 0-shot Setting with Document Title

The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.
- Documento: indica la fonte da cui è stata estratta la prova.

Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
- Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
         s u f f i c i e n t i per determinare l a v a l i d i t à
         dell ’ affermazione .

# # # Esempi
                                                                             A.2.4. 1-shot Setting with Document Title
Q u e s t i e s e m p i d i m o s t r a n o come a p p l i c a r e i         The following prompt is used for 1-shot learning, where
        c r i t e r i di valutazione :
− A f f e r m a z i o n e : I p o p o l i g e r m a n i c i sono             the task and classes are explained, and one example per
        c h i a m a t i anche g o t i c i .                                  class is provided. Each input evidence is provided with
− P r o v a : I p o p o l i g e r m a n i c i ( anche c h i a m a t i        the title of its original document.
        Teutoni , Suebi o Goti n e l l a l e t t e r a t u r a
        p i ù a n t i c a ) sono un gruppo etno −                            ### I s t r u z i o n i
        l i n g u i s t i c o i n d o e u r o p e o d i o r i g i n e nord   Valuta se l ’ affermazione è supportata d a l l e
        europea .                                                                   p r o v e f o r n i t e . Le d e f i n i z i o n i d e i
− R i s p o s t a : SUPPORTS                                                        termini chiave u t i l i z z a t i in questo
                                                                                    c o m p i t o sono :
− A f f e r m a z i o n e : I l t e n n i s non è uno s p o r t .            − A f f e r m a z i o n e : Una d i c h i a r a z i o n e o
− P r o v a : I l t e n n i s è p r a t i c a t o da m i l i o n i d i              a s s e r z i o n e s o t t o esame .
       g i o c a t o r i a m a t o r i a l i ed è anche uno                  − P r o v a : I n f o r m a z i o n i che s u p p o r t a n o o
       s p o r t popolare a l i v e l l o mondiale .                                contraddicono l ’ affermazione .
− R i s p o s t a : REFUTES                                                  − Documento : i n d i c a l a f o n t e da c u i è s t a t a
                                                                                    e s t r a t t a l a prova .
− A f f e r m a z i o n e : Kick − Ass è un f i l m h o r r o r .
− P r o v a : Kick − Ass è un f i l m b r i t a n n i c o −                  R i s p o n d i con uno d e i s e g u e n t i g i u d i z i b a s a t i
       americano d e l 2010 b as ato s u l fumetto                                    s u l l e prove f o r n i t e :
       omonimo d i Mark M i l l a r e John Romita J r .                      − SUPPORTS : s e l e p r o v e co nf erm ano l ’
− R i s p o s t a : NOT ENOUGH INFO                                                   affermazione .
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.

### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
- Affermazione: I popoli germanici sono chiamati anche gotici.
- Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
- Documento: Popoli germanici
- Risposta: SUPPORTS

- Affermazione: Il tennis non è uno sport.
- Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
- Documento: Tennis
- Risposta: REFUTES

- Affermazione: Kick-Ass è un film horror.
- Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
- Documento: Kick-Ass (film)
- Risposta: NOT ENOUGH INFO
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
- Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
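Each listing above is a fixed template in which the bracketed placeholders are replaced with the claim, the evidence, and, where applicable, the source document title, after which the model's continuation is mapped back to one of the three verdicts. The sketch below shows one way this could be done; the helper names, the shortened instruction text, and the label-extraction heuristic are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch (not from the paper): fill a prompt template with
# document title and map the raw model continuation back to a verdict.
# The instruction text is abridged here; the full listings appear above.
TEMPLATE = (
    "### Istruzioni\n"
    "Valuta se l'affermazione è supportata dalle prove fornite.\n"
    "Rispondi con uno dei seguenti giudizi basati sulle prove fornite:\n"
    "- SUPPORTS\n- REFUTES\n- NOT ENOUGH INFO\n"
    "### Input\n"
    "- Affermazione: {claim}\n"
    "- Prova: {evidence}\n"
    "- Documento: {document}\n"
    "### Risposta: "
)

LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


def build_prompt(claim: str, evidence: str, document: str) -> str:
    """Instantiate the template with one claim/evidence/document triple."""
    return TEMPLATE.format(claim=claim, evidence=evidence, document=document)


def parse_label(generation: str) -> str:
    """Return the first verdict mentioned in the model's continuation."""
    for label in LABELS:
        if label in generation:
            return label
    return "NOT ENOUGH INFO"  # conservative fallback for malformed output


prompt = build_prompt(
    "Il tennis non è uno sport.",
    "Il tennis è praticato da milioni di giocatori amatoriali "
    "ed è anche uno sport popolare a livello mondiale.",
    "Tennis",
)
```

The template ends with "### Risposta: " so that the model is constrained to continue with the verdict itself, matching the [ANSWER HERE] slot used during fine-tuning.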