<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Large Language Models for Fact Verification in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Scaiella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Costanzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Passone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Gambosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Reveal s.r.l</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, Automatic Fact Checking has become a crucial tool for combating fake news by leveraging AI to verify the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art LLM, namely LLaMA3, on both the original English and translated Italian datasets, demonstrating that fine-tuning significantly improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages, highlighting the value of the proposed resource.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Fact Checking</kwd>
        <kwd>Fact Checking in Italian</kwd>
        <kwd>Resource in Italian</kwd>
        <kwd>Large Language Model for Fact Verification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, Automatic Fact Checking (AFC) has assumed a significant role as an instrument to identify fake news. AFC is a process that verifies the truthfulness and accuracy of information, claims, and data contained in a text or speech. The focus is on debunking disinformation and misinformation, intercepting errors, and verifying sources and facts.</p>
      <p>Automated fact-checking uses AI tools to identify, verify, and respond to misleading claims, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the truthfulness of claims [1]. This is a complex process that involves searching, interpreting, and assessing information. As discussed in [1], an NLP framework for automated fact-checking consists of three stages: claim detection to identify claims that require verification; evidence retrieval to find sources supporting or refuting the claim; and claim verification to assess the truthfulness of the claim based on the retrieved evidence.</p>
      <p>At first, automating the fact-checking process was discussed in the context of computational journalism in works like [2], and it has received significant attention in the computational linguistics and, more generally, the artificial intelligence communities, surveyed in [1] and more recently in [3] and [4]. In particular, in [1] the authors expose a survey on the topic, describing the early developments that were surveyed in [5], which is an exhaustive overview of the subject.</p>
      <p>As with most machine learning paradigms [1], state-of-the-art methods require datasets and benchmarks. One of the most impactful campaigns for collecting a large-scale benchmark is FEVER (Fact Extraction and VERification) [6]. In this context, fact-checking involves verifying whether a claim is supported by one or more pieces of evidence. FEVER is a publicly available dataset designed for claim verification against textual sources. It comprises about 180K claims generated by altering sentences extracted from Wikipedia. The claims are classified into three categories: Supported (a piece of evidence exists and it supports the claim), Refutes (a piece of evidence exists and it contradicts the claim), or NotEnoughInfo (there is insufficient evidence to verify the claim). The challenge, therefore, is to retrieve the relevant evidence and verify the accuracy of the claims, categorizing them with the correct label.</p>
      <p>Many works like FEVER have recently focused on building data and datasets for the task of Fact Verification, achieving very good results [7, 8, 9, 10, 11, 12]. However, all of these datasets are designed for the English language.</p>
      <p>Although multilingual models exist (e.g., in [13, 14]), fine-tuning a model on a specific language, pre-training it for a specific task and use case, could lead to a significant decline in quality if applied to another language. Few studies have worked on training models for languages other than English. An example is the work presented in [15], which focuses on developing automated claim detection for Dutch-language fact-checkers.</p>
      <p>In this work, we propose the FEVER-IT dataset, in which the FEVER dataset has been translated into Italian to train models for the Italian language. Inspired by SQuAD-IT [16] and MSCOCO-IT [17], we worked to obtain quality data. Although the training set may be affected by translation errors, the test set is not, as it is composed of manually validated data. Furthermore, while the original FEVER dataset contained evidence only for Supports and Refutes, in this work we have also added and translated examples for the NotEnoughInfo category using the heuristics proposed in [18]. This work extends the experience described in [19], where translations were done using the Google API, by using publicly available models ([20]) and adding data for the NotEnoughInfo category.</p>
      <p>The contribution of this work is twofold. Firstly, we release FEVER-IT, a corpus with 228K claims, each associated with at least one (possibly useful) piece of evidence, including a test set of 2,000 manually validated claims. In addition, we fine-tuned and validated a state-of-the-art model, LLaMA3 [14], on both the original English dataset and the Italian dataset. While this provides a high-performance model ready for the task in both languages, the primary goal is to assess whether the quality of the Italian data is comparable to the English one. By training the model separately on each dataset, we can evaluate its stability: if the model performs similarly on the manually validated Italian dataset and the English test set, we can conclude that the quality of the Italian data is on par with the English data.</p>
      <p>Additionally, we want to assess whether using an Italian training dataset, despite the noise from automatic translation, is truly beneficial. LLMs like LLaMA3 can already perform tasks in other languages through zero-shot or few-shot learning, without requiring fine-tuning on a specific dataset, especially if that dataset is noisy. Therefore, we aim to compare the performance on the test set between a LLaMA3 model that has not been fine-tuned on the noisy Italian data and one that has been fine-tuned, to determine whether fine-tuning actually improves results or if the model performs on par or better without it.</p>
      <p>The experimental results show that the model without fine-tuning achieves an average accuracy of only about 45%. Fine-tuning on the English dataset yields about 90% mean accuracy, while fine-tuning on the Italian dataset results in a percentage quite similar to the fine-tuned English model and much greater than testing without fine-tuning¹.</p>
      <p>The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents FEVER-IT, Section 4 details the experimental evaluation, and Section 5 provides the conclusions.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04 – 06, 2024, Pisa, Italy. *Corresponding author. scaiella@revealsrl.it (A. Scaiella); stefano.costanzo@students.uniroma2.eu (S. Costanzo); passone@ing.uniroma2.it (E. Passone); croce@info.uniroma2.it (D. Croce); giorgio.gambosi@uniroma2.it (G. Gambosi). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>¹The resource, fine-tuned models, and code will be released on a dedicated repository: https://github.com/crux82/FEVER-it</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>One of the pioneering works in autonomous fact-checking was conducted by [21], which proposed creating publicly available datasets and developing automated systems using natural language processing technologies. Recent challenges such as CheckThat! at CLEF [10, 11, 12] and FEVER [7, 8, 9] from 2018 have advanced fact-checking tasks by leveraging advanced approaches and integrating Large Language Models (LLMs) like BERT and GPT. These models represent the current state of the art in many Natural Language Processing tasks, including fact-checking. Notable examples of such technology include FacTeR-Check [22], a multilingual architecture for semi-automated fact-checking and hoax propagation analysis using the XLM-RoBERTa Transformer [13], and FACT-GPT [23], a framework that automates the claim-matching phase of fact-checking using LLMs to identify social media content that supports or contradicts claims previously debunked by fact-checkers.</p>
      <p>The success of these systems is largely due to the capabilities of LLMs, as summarized in [3], which are neural models based on the Transformer architecture. Specifically, decoder-based architectures, such as GPT [24], GPT-3 [25], and LLaMA [14], generate output sequences in an auto-regressive manner. These models have demonstrated impressive capabilities following pre-training on large collections of documents. One notable outcome is few-shot learning, where models can adapt to new tasks with only a few examples [25], greatly enhancing their flexibility and applicability.</p>
      <p>When new annotated data is available, fine-tuning further enhances a model's capabilities. This process involves taking the pre-trained base model and training it on a smaller, specialized dataset relevant to the desired task. Parameter-Efficient Fine-Tuning (PEFT) is an optimized technique that involves training only a small portion of the weights, typically by adding a new layer to the model. One widely used technique is LoRA [26], which adds an adapter consisting of two matrices of weights that are relatively small compared to the original model. ExtremITA [27] is an example of a decoder-based model fine-tuned with LoRA in Italian for multi-task execution.</p>
      <p>Several benchmark datasets have been developed to fine-tune and evaluate fact-checking systems, typically collected by organizations like Snopes, FullFact, and PolitiFact. The FEVER challenge has produced four major datasets: FEVER (2018) [6], FEVER 2.0 (2019) [8], FEVEROUS (2021) [9], and AVeriTeC (2024) [28]. These datasets range from labeled claim-evidence associations to verified claims with structured and unstructured evidence. Despite the wealth of resources available, there is a lack of large benchmark datasets in Italian. This work addresses this gap by providing a large-scale Italian resource.</p>
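      <p>The LoRA idea described above, two small low-rank matrices trained beside a frozen weight matrix, can be made concrete with a short sketch. The code and dimensions below are illustrative assumptions, not taken from the paper:</p>

```python
# Illustrative sketch of LoRA: the frozen d_out x d_in weight matrix W is
# left untouched, and only an update Delta-W = (alpha / r) * B @ A is
# learned, where B is d_out x r and A is r x d_in. The adapter therefore
# trains r * (d_out + d_in) parameters instead of d_out * d_in.

def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA adapter params) for one matrix."""
    return d_out * d_in, r * (d_out + d_in)

def lora_delta(B, A, alpha: float, r: int):
    """Compute (alpha / r) * B @ A for plain list-of-lists matrices (len(A) == r)."""
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(A[0]))] for i in range(len(B))]

# Example: a hypothetical 4096 x 4096 projection with rank r = 8.
full, adapter = lora_param_counts(4096, 4096, 8)
```

      <p>In this hypothetical setting the adapter holds 65,536 trainable weights against 16,777,216 for the full matrix, i.e. roughly 0.4%, which is why LoRA makes fine-tuning an 8B-parameter model tractable on a single GPU.</p>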
    </sec>
    <sec id="sec-3">
      <title>3. Fact Verification in Italian</title>
      <p>As in [6], the original FEVER dataset is composed of English claims that can potentially be verified against an encyclopedic resource, in this case, Wikipedia. The claims are classified into three categories: Supported, Refutes and NotEnoughInfo. For the first two categories, each claim is associated with one or more passages from Wikipedia, each specifying the page from which it was extracted. For the NotEnoughInfo category, no passages are provided because no information was found on Wikipedia to support or refute the claim. For instance, the sentence "Dan Brown is illiterate." is a claim associated with pieces of evidence such as: "Angels and Demons is a 2000 best-selling mystery-thriller novel written by American author Dan Brown and published by Pocket Books and then by Corgi Books.". These pieces of evidence prove that the claim is incorrect, so it can be classified with the label Refutes. In FEVER, a claim is thus a sentence that expresses information (true or mutated) about a target entity.</p>
      <p>To generate the Italian dataset, we started from the dataset version² proposed in [29], which consists of 260k claims. This version extends the original FEVER by adding evidence associated with claims justified as NotEnoughInfo in FEVER, using the heuristics in [18]. The approach involved using a search engine to retrieve potential evidence and a textual entailment system based on GPT [24]. Claims not judged as Supports or Refutes were classified as NotEnoughInfo.</p>
      <p>This gives us examples of sentences that are closely related to the claim (according to the search engine) but neither support nor refute it. This makes it more straightforward and efficient to train and/or evaluate a classifier, even though some of the derived examples might be somewhat noisy, as they were generated through heuristics.</p>
      <p>For the automatic translation process, we utilized MADLAD400 [20], a machine translation system based on the Transformer architecture³, trained on MADLAD, a manually audited, general-domain 3T-token multilingual dataset based on CommonCrawl, spanning 419 languages. Since the Italian data are obtained through machine translation, and thus potentially incorrect as suggested in [16, 17], we needed validated test data to obtain a realistic benchmark. Our hypothesis is that an LLM is robust enough to generalize from the 228k examples and recognize the relationships involved in FEVER without inheriting translation errors. However, to prevent these errors from being inherited by the model, we manually corrected the translations of the test set.</p>
      <p>Out of the approximately 16k available test examples, three annotators were involved in verifying and correcting 2,063 translations from the test set. The annotators focused on correcting mistakes related to the proper sentence structure in Italian, the accurate meaning of specific English words that MADLAD had translated literally, any misunderstandings of the intended meaning in Italian, and a few grammatical errors.</p>
      <p>In some cases, translation errors do not completely undermine the examples with respect to the task's purpose. For instance, the English sentence from a piece of evidence, "he was booked to win a third world championship at a WWE event on the night of his death", was translated into Italian as "era stato prenotato per vincere un terzo titolo mondiale in un evento della WWE la notte della sua morte". A more accurate translation would be "si pensava avrebbe vinto un terzo titolo mondiale in un evento della WWE la notte della sua morte", better capturing the verb's meaning. In other, more problematic cases, translation errors, loss of information, or the introduction of hallucinations could even change the classification in the fact verification task. For example, in the claim "The Thin Red Line (1998 film) has an all-British cast.", the automatic translation was "La sottile linea rossa (The Thin Red Line) è un film del 1998.", which is incorrect because it omits the information about the cast. This detail is crucial, as its absence could lead to incorrect labeling.</p>
      <p>A quantitative analysis of the translation quality suggests that MADLAD performs well in translating simple assertive sentences such as claims. In fact, 91% of the claims were not altered by the validators, who considered them completely correct. This percentage is lower for the Wikipedia passages, dropping to 76%. This discrepancy may be due to the greater complexity of the evidence compared to the simpler sentence structures in the claims. Additionally, we report the results in terms of BLEU score [30] for the corrected translations compared to the originals, as shown in Table 1. It should be noted that measuring the translation quality after correcting the sentences introduces a strong bias in the measurements; however, it provides a more specific idea of the translation quality, especially in understanding the potential noisiness of the training and development sentences. In this case, results of over 95% for BLEU-1 and over 92% for BLEU-4 suggest that very few terms were altered during validation, and even the grammatical patterns remained largely unchanged. At most, a few mistranslated terms needed updating, as indicated by the qualitative analysis.</p>
      <p>Table 1: BLEU scores of the manually validated (gold) Claims and Evidence with respect to the automatic translation (silver).
Metric    BLEU-1  BLEU-2  BLEU-3  BLEU-4
Claim     0.9776  0.9695  0.9623  0.9544
Evidence  0.9529  0.9411  0.9309  0.9207</p>
      <p>Table 2 summarizes the number of examples created for the Italian dataset. In line with the original English material, the dataset is divided into training, development, and test sets, with claims categorized into Supports, Refutes, and NotEnoughInfo (NEI). The table also distinguishes between silver data (automatically translated) and gold data (manually validated). The training set consists of 228,277 claims, the development set contains 15,935 claims, and the test set has 2,063 claims. Each Italian claim or piece of evidence is aligned with its English counterpart, facilitating future research in cross-lingual fact verification.</p>
      <p>Table 2: Number of claims and pieces of evidence in the Italian dataset. (S) indicates silver data (automatically translated), and (G) indicates gold data (manually validated).
          Train (S)  Dev (S)  Test (G)    Total
Supports    114,801    4,638       654  120,095
Refutes      47,096    4,887       643   52,626
NEI          66,380    6,410       766   73,556
Total       228,277   15,935     2,063  246,275</p>
      <p>Language Models for Fact Verification. For addressing the capabilities of Large Language Models in Fact Verification, they can be utilized through In-Context Learning techniques [31] or by directly fine-tuning the model for specific downstream tasks. In-context learning relies on the model's pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot learning, where no examples are given, and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot learning, where one example per class is added to provide a more precise context, helping the model better understand the task by offering a concrete reference point; iii) few-shot learning, where more than one example per class is provided to give the model additional contextual information during decision-making. When the model's pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model's performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries in the future, with a focus on the specific task at hand. We thus evaluated the application of a state-of-the-art LLM, namely LLaMA3 [32], by providing just the definition of the task (zero-shot), adding an example (one-shot), or performing fine-tuning, to demonstrate the necessity of a training dataset like the one constructed in this work, as discussed in the following section.</p>
      <p>²https://huggingface.co/datasets/copenlu/fever_gold_evidence</p>
      <p>³https://github.com/google-research/google-research/tree/master/madlad_400</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>The goal of our experimentation is to assess the performance of a state-of-the-art LLM applied to Fact Verification. Specifically, we aim to determine whether a multilingual model maintains consistent quality when applied to both the English FEVER dataset and our Italian dataset. We utilize LLaMA3-Instruct⁴, an instruction-tuned generative text model from Meta with 8 billion parameters, released in April 2024. This model is trained to execute specific instructions or prompts across various tasks. To ensure alignment, we evaluate the systems on the manually validated Italian test set and on the same subset of 2,063 claims in the English counterpart. The model is evaluated in 0-shot and 1-shot settings to assess its capability without fine-tuning. The prompts used in English and Italian are provided in Appendix A. Additionally, we fine-tuned LLaMA3 on the English dataset from [29] and, separately, on the Italian dataset obtained via machine translation. Fine-tuning was conducted on an NVIDIA A100 using the LoRA technique⁵.</p>
      <p>In FEVER, the title of the document associated with each claim often provides crucial context. For example, the claim "The University of Leicester discovered and identified the remains of a king." relies on the document titled "University of Leicester" to correctly classify the claim as Supports. To assess the model's generalization, we evaluate the impact of including document titles in the prompts. The metrics used to analyze the results are recall, precision, accuracy, and F1 score, calculated globally and for each label (Supports, Refutes, NotEnoughInfo).</p>
      <p>The results are reported in Tables 3 and 4 for the English and Italian datasets, respectively. Each table shows whether the model underwent fine-tuning (column FT), whether a prompt without examples (0-shot) or with one example per class (1-shot) was used (column Prompt), and whether the document title was included (column Doc). Notably, if no fine-tuning was performed, the original LLaMA3-Instruct model was used. Given that the system's response can consist of multiple words, we search the output for the mention of one of the classes and associate the example with that class. If no class is identified, the result is classified as NotEnoughInfo. In general, the fine-tuned model is extremely stable, consistently outputting one of the three categories for every request. The non-fine-tuned model, on rare occasions (just a few dozen times out of 2,000), produces responses that do not correspond to any of the required classes. This highlights the inherent stability of LLaMA3 while also supporting the soundness of the results achieved.</p>
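      <p>The output-mapping rule described in this section, scanning the generated text for a mention of one of the three classes and falling back to NotEnoughInfo, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code:</p>

```python
# Sketch of the normalization step: the model's free-text response is
# searched for one of the three class names; any response that mentions
# none of them is mapped to NotEnoughInfo, mirroring the fallback
# described for the evaluation.

LABELS = ("Supports", "Refutes", "NotEnoughInfo")

def extract_label(generated: str) -> str:
    """Map a multi-word model response to one of the three FEVER classes."""
    compact = generated.lower().replace(" ", "")
    for label in LABELS:
        if label.lower() in compact:
            return label
    return "NotEnoughInfo"  # no class mentioned: treat as NotEnoughInfo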
      <p>⁴https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</p>
      <p>⁵The following hyperparameters were used: a learning rate of 0.0001, two epochs, LoRA_R set to 8, LoRA_alpha set to 16, and LoRA_dropout at 0.05. The micro-batch size was 2, and gradient accumulation steps were set to 8.</p>
      <p>Table 3: Results on the English test set for each combination of FT (No/Yes), Prompt (0-shot/1-shot), and Doc (No/Yes); Table 4 reports the same metrics on the Italian test set. Only the NotEnoughInfo precision/recall/F1 columns are legible in this copy: 0.585/0.050/0.092; 0.800/0.005/0.010; 0.478/0.043/0.079; 0.698/0.115/0.197; 0.877/0.903/0.890; 0.887/0.910/0.898; 0.881/0.894/0.887; 0.883/0.915/0.899; together with a further precision column: 0.534, 0.617, 0.508, 0.578, 0.899, 0.903, 0.897, 0.907.</p>
      <p>A key finding is that the multilingual model generally achieves similar, though modest, results on the English and Italian datasets without fine-tuning, with accuracy values around 0.40-0.50 and average F1 scores in the range of 0.35-0.55. This performance is relatively unstable, and the addition of an example in the prompt does not lead to significant improvements. In English, there are some improvements, but in Italian, there are fewer. We believe this is because, although LLaMA is multilingual, the percentage of Italian examples observed during training is less than 1%, making it less performant and less stable in this language.</p>
      <p>However, when fine-tuning is applied, the results improve dramatically, with accuracy exceeding 90% in both languages. This demonstrates the utility of the translated dataset, even if it contains some noise. In this scenario, adding an example in the prompt leads to negligible but consistent improvements. Additionally, the inclusion of the document title, while sometimes causing inconsistencies in zero-shot learning, is better utilized by the fine-tuned model, leading to slight but not significant improvements. This is interesting because it suggests that a model not relying on document titles is more broadly applicable. Overall, the fine-tuned models perform significantly better, highlighting the importance of the translated dataset for achieving high accuracy in fact verification tasks in both English and Italian.</p>
      <p>The error analysis suggests that the model sometimes inherits the mathematical reasoning limitations of the LLM. For example, the claim "Il Castello di Praga attira oltre 18 milioni di visitatori ogni anno."⁶ was given the evidence "Il castello è tra le attrazioni turistiche più visitate di Praga che attira oltre 1,8 milioni di visitatori all'anno."⁷ The model's predicted label was Refutes, while the true label was Supports. Here, the true label should be Supports since 18 million is indeed greater than 1.8 million, but the model found the numbers inconsistent. In another case, the claim "Ned Stark è stato introdotto nel 1996 in Tempesta di spade."⁸ was paired with the evidence "Introdotto nel 1996 in Il Trono di Spade, Ned è l'onorevole signore di Winterfell, un'antica fortezza nel nord del continente immaginario di Westeros."⁹ The model predicted Refutes, although the true label was Supports. The confusion here is due to the difference in the book titles, which are from the same series but are distinct works. The error analysis revealed that the model occasionally struggled with mathematical reasoning and contextual understanding, highlighting areas for future enhancement. Larger models and further fine-tuning could potentially address these issues, which remain open questions for future research.</p>
      <p>⁶In English: "The Prague Castle attracts over 18 million visitors every year." ⁷In English: "The castle is among the most visited tourist attractions in Prague, attracting over 1.8 million visitors every year." ⁸In English: "Ned Stark was introduced in 1996 in A Storm of Swords." ⁹In English: "Introduced in 1996 in A Game of Thrones, Ned is the honorable lord of Winterfell, an ancient fortress in the north of the imaginary continent of Westeros."</p>
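      <p>The per-label metrics reported in Tables 3 and 4 (precision, recall, and F1 per class, plus global accuracy) can be computed as in the following sketch; it is illustrative and not the paper's evaluation code:</p>

```python
# Illustrative computation of accuracy plus per-label precision, recall,
# and F1 for the three-way FEVER classification, from aligned lists of
# gold and predicted labels.

LABELS = ("Supports", "Refutes", "NotEnoughInfo")

def evaluate(gold: list[str], pred: list[str]) -> dict:
    """Return global accuracy and per-label P/R/F1 for the three classes."""
    assert len(gold) == len(pred)
    report = {"accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold)}
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"P": precision, "R": recall, "F1": f1}
    return report
```

      <p>Averaging the three per-label F1 scores gives the macro-F1 figures discussed above, which weight the smaller NotEnoughInfo test partition equally with the other two classes.</p>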
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The team would like to thank Monika Kakol for her invaluable support in the validation of the translations. This work was supported by Project ECS 0000024 Rome Technopole - CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this work, we have introduced FEVER-IT, an Italian
version of the FEVER dataset, designed to improve the
training and evaluation of models for fact verification in
the Italian language. Using a machine translation system,
we translated a large-scale dataset of 228,000
claim/evidence pairs and manually validated 2,000
test instances to ensure meaningful evaluations. This
enabled us to fine-tune a state-of-the-art LLM, specifically
LLaMA3, and assess its performance in both English and
Italian.</p>
      <p>Our experiments demonstrated that the multilingual
model, without fine-tuning, performed similarly on both
English and Italian datasets, though the accuracy and
stability were limited. Fine-tuning significantly improved
the model’s performance, achieving over 90% accuracy
in both languages. This underscores the importance and
effectiveness of the translated dataset, even if it contains
some noise.</p>
      <p>Future work will explore the performance of larger
models and further refinement of the dataset to enhance
accuracy and generalization capabilities or explore more
complex settings such as those described in [9].</p>
      <p>References</p>
      <p>… the covid-19 infodemic and fake news detection, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[12] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[15] B. Berendt, P. Burger, R. Hautekiet, J. Jagers, A. Pleijter, P. Van Aelst, FactRank: Developing automated claim detection for Dutch-language fact-checkers, Online Social Networks and Media 22 (2021) 100113. doi:10.1016/j.osnem.2020.100113.
[16] D. Croce, A. Zelenanska, R. Basili, Enabling deep learning for large scale question answering in Italian, Intelligenza Artificiale 13 (2019) 49–61. URL: https://doi.org/10.3233/IA-190018. doi:10.3233/IA-190018.
[17] A. Scaiella, D. Croce, R. Basili, Large scale datasets for image and video captioning in Italian, Italian Journal of Computational Linguistics 2 (2019) 49–60. URL: http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf.
[18] C. Malon, Team Papelo: Transformer networks at FEVER, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 109–113. URL: https://aclanthology.org/W18-5517. doi:10.18653/v1/W18-5517.
[19] L. Canale, A. Messina, Experimenting ai tech-
… Associates, Inc., 2023, pp. 67284–67296.
[21] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. URL: https://aclanthology.org/W14-2508. doi:10.3115/v1/W14-2508.
[22] A. Martín, J. Huertas-Tato, Á. Huertas-García, G. Villar-Rodríguez, D. Camacho, FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference, Knowledge-Based Systems 251 (2022) 109265. doi:10.1016/j.knosys.2022.109265.
[23] E. C. Choi, E. Ferrara, Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1441–1449. URL: https://doi.org/10.1145/3589335.3651910. doi:10.1145/3589335.3651910.
[24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, CoRR abs/1801.06146 (2018). URL: http://arxiv.org/abs/1801.06146. arXiv:1801.06146.
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/
nologies for disinformation combat: the idmo 2106.09685. arXiv:2106.09685.
project, 2023. URL: https://arxiv.org/abs/2310.11097. [27] C. D. Hromei, D. Croce, V. Basile, R. Basili,
ExtremarXiv:2310.11097. ita at EVALITA 2023: Multi-task sustainable scaling
[20] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, to large language models at its extreme, in:
ProD. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat, ceedings of the Eighth Evaluation Campaign of
NatMadlad-400: A multilingual and document-level ural Language Processing and Speech Tools for
Itallarge audited dataset, in: Advances in Neural In- ian. Final Workshop (EVALITA 2023), Parma, Italy,
formation Processing Systems, volume 36, Curran September 7th-8th, 2023, volume 3473 of CEUR
</p>
      <p>A. Prompting Engineering</p>
      <p>This appendix contains the prompts used in the experiments. The prompts are provided in both Italian and English, reflecting the task-specific nature of the experiments. Each prompt begins with an explanation of the task and the meaning of the classes. In the different variants, the 0-shot setting does not include any examples, unlike the 1-shot setting. Where necessary, the name of the document from which the evidence is taken is also specified.</p>
      <p>A.1. Prompts in English</p>
      <p>A.1.1. 0-shot Setting</p>
      <p>The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.</p>
      <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]</p>
      <p>A.1.2. 1-shot Setting</p>
      <p>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.</p>
      <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
− Claim: The Germanic peoples are also called Gothic.
− Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
− Answer: SUPPORTS
− Claim: Tennis is not a sport.
− Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
− Answer: REFUTES
− Claim: Kick-Ass is a horror film.
− Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
− Answer: NOT ENOUGH INFO
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]</p>
      <p>
A.1.3. 0-shot Setting with Document Title</p>
      <sec id="sec-6-1">
        <title>The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.</title>
        <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
− Document: denotes the source document for the evidence.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
− Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]</p>
        <p>
A.1.4. 1-shot Setting with Document Title</p>
        <p>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.</p>
        <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
− Document: denotes the source document for the evidence.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
− Claim: The Germanic peoples are also called Gothic.
− Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
− Document: Germanic peoples
− Answer: SUPPORTS
− Claim: Tennis is not a sport.
− Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
− Document: Tennis
− Answer: REFUTES
− Claim: Kick-Ass is a horror film.
− Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
− Document: Kick-Ass (film)
− Answer: NOT ENOUGH INFO
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
− Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]</p>
        <p>A.2. Prompts in Italian</p>
        <p>A.2.1. 0-shot Setting</p>
        <p>The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.</p>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]
A.2.2. 1-shot Setting</p>
      </sec>
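<p>As a concrete illustration (not code from the paper), the English templates above can be filled programmatically before being passed to the model; the template excerpt mirrors the 0-shot prompt, while <monospace>build_prompt</monospace> is a hypothetical helper name introduced only for demonstration.</p>
<p>
```python
# Illustrative sketch: filling the English 0-shot template with a
# claim/evidence pair. The template text is abridged from the prompt
# above; build_prompt is a hypothetical helper, not from the paper.
TEMPLATE = """### Instruction
Evaluate if the claim is supported by the evidence provided.
### Input
- Claim: {claim}
- Evidence: {evidence}
### Answer:"""

def build_prompt(claim: str, evidence: str) -> str:
    # Substitute the two input fields into the fixed template.
    return TEMPLATE.format(claim=claim, evidence=evidence)

prompt = build_prompt("Tennis is not a sport.",
                      "Tennis is played by millions of recreational players.")
```
</p>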
      <sec id="sec-6-2">
        <title>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
− Affermazione: I popoli germanici sono chiamati anche gotici.
− Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
− Risposta: SUPPORTS
− Affermazione: Il tennis non è uno sport.
− Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
− Risposta: REFUTES
− Affermazione: Kick-Ass è un film horror.
− Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
− Risposta: NOT ENOUGH INFO
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]
A.2.3. 0-shot Setting with Document Title</p>
      </sec>
      <sec id="sec-6-3">
        <title>The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
− Documento: indica la fonte da cui è stata estratta la prova.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
− Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
A.2.4. 1-shot Setting with Document Title</p>
      </sec>
      <sec id="sec-6-4">
        <title>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
− Documento: indica la fonte da cui è stata estratta la prova.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
− Affermazione: I popoli germanici sono chiamati anche gotici.
− Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
− Documento: Popoli germanici
− Risposta: SUPPORTS
− Affermazione: Il tennis non è uno sport.
− Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
− Documento: Tennis
− Risposta: REFUTES
− Affermazione: Kick-Ass è un film horror.
− Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
− Documento: Kick-Ass (film)
− Risposta: NOT ENOUGH INFO
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
− Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]</p>
      </sec>
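<p>Since every prompt asks the model to answer with one of SUPPORTS, REFUTES, or NOT ENOUGH INFO, the free-text generation must be mapped back onto a label before evaluation. The snippet below is a minimal sketch of such post-processing, under the assumption that the label string appears verbatim somewhere in the generation; <monospace>parse_answer</monospace> is a hypothetical helper, not from the paper.</p>
<p>
```python
# Map a generated continuation onto one of the three FEVER-style labels.
# "NOT ENOUGH INFO" is checked first so that a generation containing it is
# never mistaken for another label; matching is case-insensitive since
# model outputs may vary in casing.
LABELS = ("NOT ENOUGH INFO", "SUPPORTS", "REFUTES")

def parse_answer(generation: str) -> str:
    text = generation.upper()
    for label in LABELS:
        if label in text:
            return label
    # Fall back to the "unverifiable" class when no label is produced.
    return "NOT ENOUGH INFO"
```
</p>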
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>