Overview of the CLEF-2021 CheckThat! Lab Task 2
on Detecting Previously Fact-Checked Claims in
Tweets and Political Debates
Shaden Shaar1 , Fatima Haouari2 , Watheq Mansour2 , Maram Hasanain2 ,
Nikolay Babulkov3 , Firoj Alam1 , Giovanni Da San Martino4 , Tamer Elsayed2 and
Preslav Nakov1
1 Qatar Computing Research Institute, HBKU, Doha, Qatar
2 Qatar University, Qatar
3 Sofia University, Bulgaria
4 University of Padova, Italy


                                         Abstract
We describe the fourth edition of the CheckThat! Lab, part of the 2021 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting three tasks related to factuality, and it covers Arabic, Bulgarian, English, Spanish, and Turkish. Here, we present Task 2, which asks to detect previously fact-checked claims (in two languages). A total of four teams participated in this task and submitted a total of sixteen runs, and most submissions managed to achieve sizable improvements over the baselines using transformer-based models such as BERT and RoBERTa. In this paper, we describe the process of data collection and the task setup, including the evaluation measures used, and we give a brief overview of the participating systems. Last but not least, we release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in detecting previously fact-checked claims.

                                         Keywords
                                         Check-Worthiness Estimation, Fact-Checking, Veracity, Verified Claims Retrieval, Detecting Previously
                                         Fact-Checked Claims, Social Media Verification, Computational Journalism, COVID-19




1. Introduction
There has been a growing concern about the spread of disinformation and misinformation in social media, and this has become an urgent social and political issue. Over time, several initiatives for manual
fact-checking have been launched, with over 200 fact-checking organizations actively working
worldwide.1 Unfortunately, these efforts do not scale and they are clearly insufficient, given
the scale of disinformation propagating in different communication channels, which, according
to the World Health Organization, has grown into the First Global Infodemic in the times of
COVID-19.
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" sshaar@hbku.edu.qa (S. Shaar); 200159617@qu.edu.qa (F. Haouari); 200159617@qu.edu.qa (W. Mansour);
maram.hasanain@qu.edu.qa (M. Hasanain); nbabulkov@gmail.com (N. Babulkov); fialam@hbku.edu.qa (F. Alam);
dasan@math.unipd.it (G. D. S. Martino); telsayed@qu.edu.qa (T. Elsayed); pnakov@hbku.edu.qa (P. Nakov)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1 http://tiny.cc/zd1fnz
There has been a surge in research to develop systems for automatic fact-checking. However, such systems suffer from credibility issues, and thus human fact-checkers remain in the loop. Hence, it is important to reduce their manual effort, e.g., by detecting when a claim has already been fact-checked. Work in this direction includes [1]
and [2]: the former developed a dataset for the task and proposed a ranking model, while the
latter proposed a neural ranking model using textual and visual modalities.
   To deal with this problem, we launched the CheckThat! Lab, which features a number of tasks
aiming to help automate the fact-checking process and to reduce the spread of disinformation
and misinformation. The CheckThat! lab2 was run for the fourth time in the framework
of CLEF 2021. The purpose of the 2021 edition of the lab was to foster the development of
technology that would enable finding check-worthy claims, finding claims that have been
previously fact-checked, and predicting the veracity of a news article and its topic. Thus, the
lab focuses on three types of content: (i) tweets, (ii) political debates and speeches, and (iii) news
articles.
In this paper, we describe in detail the second task of the CheckThat! lab: detecting previously fact-checked claims.3 Figure 1 shows the full CheckThat! identification and verification
pipeline, including the tasks on detecting check-worthy claims, detecting previously fact-
checked claims, and veracity and topic detection of news articles. The second task is defined
as follows: “given a check-worthy input claim and a set of verified claims, rank the previously
verified claims in order of usefulness to fact-check the input claim.” It consists of the following
two subtasks:

 Subtask 2A: Detecting previously fact-checked claims in tweets. Given a tweet, detect
     whether the claim it makes was previously fact-checked with respect to a collection of
     fact-checked claims. This is a ranking task, offered in Arabic and English, where the
     systems need to return a list of top-𝑛 candidates.

 Subtask 2B: Detecting previously fact-checked claims in political debates or speeches.
     Given a claim in a political debate or a speech, detect whether the claim has been previously
     fact-checked with respect to a collection of previously fact-checked claims. This is a
     ranking task, and it was offered in English.

For Subtask 2A, we focused on tweets, and the task was offered in Arabic and English. The
participants were free to work on any language(s) of their interest, and they could also use
multilingual approaches that make use of all datasets for training. Subtask 2A attracted four
teams, and the most successful approaches used transformers or a combination of embeddings,
manually engineered features, and neural networks. Section 3 offers more details.
   For Subtask 2B, we focused on political debates and speeches, and we used PolitiFact as
the main data source. The task attracted three teams, and a combination of transformers,
preprocessing, and augmentation approaches performed the best. Section 4 gives more details.
   As for the rest of the paper, Section 2 discusses some related work, and Section 5 concludes
with final remarks.


2 http://sites.google.com/view/clef2021-checkthat/
3 Refer to [3] for an overview of the full CheckThat! 2021 lab.
Figure 1: The full verification pipeline of the CheckThat! lab 2021. The verified claim retrieval subtask
2A targets tweets, and subtask 2B targets political debates. See [4, 5] for a discussion on tasks 1 and 3.
The grayed tasks were addressed in previous editions of the lab [6, 7].


2. Related Work
A large body of research focused on developing automatic systems for fact-checking [8, 9, 10,
11, 12]. This includes datasets [13, 14], and evaluation campaigns [15, 6, 16, 17, 18]. However,
there are credibility issues with automated systems [19], and thus a reasonable solution is to
build tools to facilitate human fact-checkers, e.g., by detecting previously fact-checked claims.
   This is an underexplored task and the only directly relevant work is [1, 20]; here, we use their
annotation setup and one of their datasets: PolitiFact. Previous work has mentioned the task as
an integral step of an end-to-end automated fact-checking pipeline, but there was very little
detail provided about this component and it was not evaluated [21].
   In an industrial setting, Google has developed the Fact Check Explorer,4 which allows users to
search a number of fact-checking websites. However, the tool cannot handle a complex claim, as
it uses the standard Google search functionality, which is not optimized for semantic matching
of long claims.
   Another related work is the ClaimsKG dataset and system [22], which includes 28K claims
from multiple sources, organized into a knowledge graph (KG). The system can perform data
exploration, e.g., it can find all claims that contain a certain named entity or keyphrase. In
contrast, we are interested in detecting whether a claim was previously fact-checked.
   Finally, the task is related to semantic relatedness tasks, e.g., from the GLUE benchmark
[23], such as natural language inference (NLI) [24], recognizing textual entailment (RTE) [25],
paraphrase detection [26], and semantic textual similarity (STS-B) [27]. However, it differs from
them in a number of aspects; see [1] for more detail and discussion.

4 http://toolbox.google.com/factcheck/explorer
3. Subtask 2A: Detecting Previously Fact-Checked Claims in
   Tweets
Given a tweet, the task asks to detect whether the claim the tweet makes was previously fact-
checked with respect to a collection of fact-checked claims. The task is offered in Arabic and
English. This is a ranking task, where the systems are asked to return a list of top-𝑛 candidates.

3.1. Dataset
Arabic To construct our verified claims collection, we selected 5,921 Arabic claims from
AraFacts [28], and 24,408 English claims from ClaimsKG [29], which we translated to Arabic
using the Google Translate API.5 To obtain our tweet–VerClaim pairs, we first selected a set of
1,274 Arabic verified claims from AraFacts such that each claim has at least one stated tweet
example in its corresponding fact-checking article. Second, we selected one tweet example for
each verified claim following the guidelines below:
   1. Select an Arabic tweet.

   2. Avoid tweets where the claim is stated in an image or a video.

   3. Try to choose the tweet example that does not exactly match the text of the claim.

   4. Avoid tweets that are relevant but do not contain the claim, or where it is not clear whether
      they are about the claim.

   5. Avoid tweets that have more than one claim.
   The two annotators who constructed the tweet–VerClaim pairs swapped their pairs to double-
check that the selected tweets were compliant with the guidelines. They further resolved any
disagreements by discussing the reasons behind their choices, and excluded the claims where the disagreement remained. We ended up with 858 tweet–VerClaim pairs.
   Since AraFacts contains verified claims from five different Arabic fact-checking platforms, and since a claim can be verified by multiple sources, we had to check whether the annotated tweets could be paired with more than one claim from our verified claims collection.
We first adopted Jaccard similarity to check whether each verified claim in our collection was
verified by multiple sources. For each given verified claim, we selected all the claims that had a
Jaccard similarity above 30%; then, we asked the annotators to double-check and to exclude
any non-similar claims. For the claims found to be similar to ones already in our qrels (query relevance judgments), we constructed new tweet–VerClaim pairs.
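The duplicate check can be illustrated with a minimal sketch, assuming a simple whitespace tokenizer and the 30% threshold mentioned above; the helper names are ours and do not reflect the actual annotation tooling:

```python
# A minimal sketch of the token-level Jaccard check used to spot verified claims
# that may have been fact-checked by more than one source.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())          # simple whitespace tokenization
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def candidate_duplicates(claims, threshold=0.30):
    """Return index pairs of verified claims whose Jaccard similarity exceeds the threshold."""
    pairs = []
    for i in range(len(claims)):
        for j in range(i + 1, len(claims)):
            if jaccard(claims[i], claims[j]) > threshold:
                pairs.append((i, j))
    return pairs                                      # pairs are then manually double-checked
```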
   To catch any similar claims that might still be missing, we used Pyserini [30] to index our verified
claims collection and we retrieved the top-25 potentially relevant verified claims for each tweet
in our dataset. One annotator then checked for missing tweet–VerClaim pairs in our previously
constructed qrels, and we expanded the qrels accordingly. Figure 2 presents some input tweet
examples from our dataset and the corresponding top-5 verified claims ranked based on their
relevance using a BM25 system.
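The candidate retrieval step can be sketched with Pyserini as follows; the index path is illustrative, and the searcher class is named LuceneSearcher in recent Pyserini releases (SimpleSearcher in older ones), so this is an assumption about the setup rather than the exact configuration used for the lab:

```python
# A minimal BM25 retrieval sketch with Pyserini, assuming the verified claims
# collection has already been indexed with Pyserini's indexing tool.
from pyserini.search.lucene import LuceneSearcher   # SimpleSearcher in older Pyserini versions

searcher = LuceneSearcher("indexes/verified-claims")  # hypothetical index path

def top_candidates(tweet_text, k=25):
    """Return the top-k candidate verified claims as (docid, BM25 score) pairs."""
    hits = searcher.search(tweet_text, k=k)
    return [(hit.docid, hit.score) for hit in hits]
```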
5 https://cloud.google.com/translate
Figure 2: Task 2A, Arabic: Examples of input tweets and the top-5 most similar verified claims from
our verified claims collection retrieved by a BM25 system. Correct matches to previously verified matching claims are marked with a ✓.

(Figure body: two Arabic input tweets, (a) about a bank robbery in the U.S. with the money handed out to passers-by, and (b) about calls in Berlin to bring refugees from Idlib to Germany; each tweet is shown with its top-5 BM25-retrieved verified claims, with the correct matches marked by ✓.)



English To construct the verified claims database, we used Snopes, a fact-checking website
that targets rumors spreading in social media, and we collected 13,835 verified claims. Their
fact-checking journalists often cite the tweet or the social media post that spreads the rumor
when writing an article about a claim. We went through all crawled fact-checking articles and collected 1,401 tweets.
   Table 1 shows examples of the input tweets and the results retrieved by the BM25 baseline.
Example (a) in the table shows that matching tweet–VerClaim pairs requires more than simple textual similarity, and example (b) shows that, in order to make a good decision about a pair, we need to understand the contextual meaning of the sentences.
   Table 2 shows statistics about the CT–VCR–21 corpus for Task 2, including both subtasks and
languages. CT–VCR–21 stands for CheckThat! verified claim retrieval 2021. Input–VerClaim
pairs represent input claims with their corresponding verified claims by a fact-checking source.
For Arabic, we randomly split the data into 512 training, 85 development, and 261 test examples.
In total, the Arabic dataset consists of 858 queries, 1,039 qrels, and a collection of 30,329 verified
claims. For English, we split the data into 70%, 15% and 15% for training, development, and test,
respectively.

3.2. Evaluation
For the ranking tasks, as in the two previous editions of the CheckThat! lab, we calculated Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision@k (P@k), and MAP@k for k ∈ {1, 3, 5, 10, 20, 30}. We used MAP@5 as the official evaluation measure.
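For concreteness, the measures can be computed from a system's ranked output and the qrels roughly as follows; normalizing AP@k by min(|relevant|, k) is one common convention, and the lab's official evaluation scripts may differ in such details:

```python
# A minimal sketch of MAP@k and MRR over a run ({query_id: ranked list of doc ids})
# and qrels ({query_id: set of relevant doc ids}).
def average_precision_at_k(ranked, relevant, k=5):
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def evaluate(run, qrels, k=5):
    ap = [average_precision_at_k(run[q], qrels.get(q, set()), k) for q in run]
    rr = [reciprocal_rank(run[q], qrels.get(q, set())) for q in run]
    return sum(ap) / len(ap), sum(rr) / len(rr)       # MAP@k, MRR
```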
Table 1
Task 2A, English: Examples of input tweets and the top-5 most similar verified claims from our verified
claims collection retrieved by a BM25 system. The correct previously verified matching claims to be
retrieved are marked with a ✓.
 Input tweet       (a)   Sen. Mitch McConnell: “As recently as October, now-President Biden
                         said you can’t legislate by executive action unless you are a dicta-
                         tor. Well, in one week, he signed more than 30 unilateral actions.”
                         pic.twitter.com/PYQKe9Geez — Forbes (@Forbes) January 28, 2021

 Verified claims   (1)   When he was still a candidate for the presidency in October 2020,           ✓
                         U.S. President Joe Biden said, “You can’t legislate by executive or-
                         der unless you’re a dictator.”
                   (2)   Photographs you post on Snapchat can now be used as evidence in legal       ✗
                         cases unless you opt out.
                   (3)   U.S. Sen. Mitch McConnell said he would not participate in 2020 election    ✗
                         debates that include female moderators.
                   (4)   U.S. Sen. Majority Leader Mitch McConnell said that U.S. President          ✗
                         Trump "provoked" the attack on the Capitol.
                   (5)   President Joe Biden signed an executive order in 2021 allowing the U.S.     ✗
                         to fund abortions abroad.
 Input tweet       (b)   A supporter of President Donald Trump carries a Confederate battle flag
                         on the second floor of the U.S. Capitol near the entrance to the Sen-
                         ate after breaching security defenses, in Washington, January 6, 2021.
                         Photo by Mike Theiler pic.twitter.com/pbhwfAVsUX — corinne_perkins
                         (@corinne_perkins) January 6, 2021

 Verified claims   (1)   In January 2021, Hillary Clinton suggested U.S. President Donald Trump      ✗
                         spoke by phone with Vladimir Putin on the day of an attack on the U.S.
                         Capitol, Jan. 6, 2021.
                   (2)   In January 2021, OnlyFans removed Donald Trump’s account in the af-         ✗
                         termath of the Jan. 6 attack on the U.S. Capitol.
                   (3)   A Confederate flag was spotted inside and outside the U.S. Capitol          ✓
                         as a pro-Trump mob stormed the building.
                   (4)   A pro-Trump mob chanted “Hang Mike Pence” as they stormed the U.S.          ✗
                         Capitol on Jan. 6, 2021.
                   (5)   Kevin Seefried, who carried a Confederate flag into the U.S. Capitol dur-   ✗
                         ing the attack on the building in January 2021, is registered as a Demo-
                         crat in Delaware.


3.3. Overview of the Systems
A total of four teams participated in this task, submitting sixteen runs. One team participated
in the Arabic task and three teams participated in the English task. Below, we discuss briefly
the approach of each team.
Team bigIR (2A:ar:1) fine-tuned AraBERT [31] by adding two neural network layers on top of
it to predict the relevance score for a given tweet–VerClaim pair. The fine-tuned model was
used to re-rank the candidate claims based on the predicted relevance scores.
Table 2
Task 2: Statistics about the CT–VCR–21 corpus, including the number of Input–VerClaim pairs and the
number of VerClaim claims to match a claim against.
                                               2A–Arabic    2A–English      2B–English
          Input claims                            858           1,401           669
            Training                              512             999           472
            Development                            85             200           119
            Test                                  261             202            78
          Input–VerClaim pairs                   1,039          1,401            804
            Training                               602            999            562
            Development                            102            200            139
            Test                                   335            202            103
          Verified claims (to match against)     30,329        13,835         19,250


Team Aschern [32] (2A:en:1) used TF.IDF, a fine-tuned pre-trained sentence-level BERT, and a LambdaMART re-ranking model. The system was evaluated on the English version of the dataset, collected from tweets.
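A minimal learning-to-rank sketch in the spirit of this pipeline, using LightGBM's lambdarank objective as a stand-in for LambdaMART; the two features (a lexical score and a semantic similarity) and the toy data are illustrative assumptions, not the team's actual feature set:

```python
# Re-rank candidate verified claims with a LambdaMART-style model (LightGBM lambdarank).
import numpy as np
import lightgbm as lgb

# Toy training data: two queries with three candidates each.
X = np.array([[12.3, 0.81],   # [TF.IDF/BM25 score, sentence-BERT cosine similarity]
              [ 8.7, 0.42],
              [ 5.1, 0.35],
              [10.2, 0.77],
              [ 9.9, 0.18],
              [ 3.3, 0.12]])
y = np.array([1, 0, 0, 1, 0, 0])   # 1 = correct verified claim
group = [3, 3]                     # number of candidates per query, in order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

# At inference time: score the candidates of a query and sort them in descending order.
scores = ranker.predict(X[:3])
print(np.argsort(-scores))
```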
Team NLytics (2A:en:2) used RoBERTa with a regression function in the final layer, approaching the problem as a ranking task.
Team DIPS [33] (2A:en:3) used Sentence-BERT embeddings for all claims and then computed
the cosine similarity for each pair of an input tweet and a verified claim. The prediction was
made by passing a sorted list of cosine similarities to a neural network.
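The embed-and-compare step behind this kind of system can be sketched with the sentence-transformers library; the model name and the toy texts below are illustrative assumptions, not the team's exact configuration:

```python
# Rank verified claims by cosine similarity to an input tweet using Sentence-BERT embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")       # hypothetical choice of encoder

vclaims = ["Joe Biden said you can't legislate by executive order unless you're a dictator.",
           "A Confederate flag was spotted inside the U.S. Capitol."]
tweet = "McConnell: Biden said you can't legislate by executive action unless you are a dictator."

vclaim_emb = model.encode(vclaims, convert_to_tensor=True, normalize_embeddings=True)
tweet_emb = model.encode(tweet, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(tweet_emb, vclaim_emb)[0]        # one cosine score per verified claim
for idx in scores.argsort(descending=True):
    print(round(float(scores[idx]), 3), vclaims[idx])
```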

3.4. Results
Table 3 shows the official evaluation results for subtask 2A for Arabic and for English. We can
see that all four participating teams managed to outperform the corresponding Elastic Search
(ES) baseline, which is actually a strong baseline.

Arabic A single system was submitted for this task by the bigIR team. They used AraBERT
to re-rank a list of candidates retrieved by a BM25 model. They first constructed a balanced
training dataset where the positive examples correspond to the query relevance (qrels) provided
by the organizers, while the negative examples were selected from the top retrieved candidates
by BM25 such that they are not already labeled as positive. Second, they fine-tuned AraBERT
to predict the relevance score for a given tweet–VerClaim pair. They added two neural network
layers on top of AraBERT to perform the classification. Finally, at inference time, they used BM25
to retrieve the top 20 candidate verified-claims. Then, they fed each tweet–VerClaim pair to the
fine-tuned model to obtain a relevance score and to re-rank the candidate claims accordingly.
Their system outperformed the Elastic Search baseline by a sizable margin, achieving a MAP@5 of 0.908 (compared to 0.794 for the Elastic Search baseline).
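The re-ranking step can be sketched as scoring each tweet–VerClaim pair with a fine-tuned encoder and sorting the BM25 candidates by the predicted relevance; the checkpoint below is the public AraBERT base model standing in for bigIR's fine-tuned model with its additional layers, so this is an illustration of the pipeline rather than their exact system:

```python
# Re-rank BM25 candidates by the relevance score of a (tweet, verified claim) pair classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "aubmindlab/bert-base-arabert"            # assumed base checkpoint (to be fine-tuned)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

def rerank(tweet, candidates):
    """Return candidates sorted by the predicted probability of being relevant."""
    enc = tokenizer([tweet] * len(candidates), candidates,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        relevance = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
    order = relevance.argsort(descending=True)
    return [(candidates[i], float(relevance[i])) for i in order]
```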
Table 3
Task 2A: Official evaluation results, in terms of MRR, MAP@𝑘, and Precision@𝑘. The teams are ranked
by the official evaluation measure: MAP@5. Here, ES baseline is the Elastic Search baseline, which
implements BM25.
      Team           MRR                     MAP                                  Precision
                             @1      @3      @5      @10     @20     @1      @3      @5       @10     @20
    Arabic
1     bigIR          0.924   0.787   0.905   0.908   0.910   0.912   0.908   0.391   0.237    0.120   0.061
2     ES-baseline    0.835   0.682   0.782   0.794   0.799   0.802   0.793   0.344   0.217    0.113   0.058
    English
1     Aschern [32]   0.884   0.861   0.880   0.883   0.884   0.884   0.861   0.300   0.182    0.092   0.046
2     NLytics [34]   0.807   0.738   0.792   0.799   0.804   0.806   0.738   0.289   0.179    0.093   0.048
3     DIPS [33]      0.795   0.728   0.778   0.787   0.791   0.794   0.728   0.282   0.177    0.092   0.048
      ES baseline    0.761   0.703   0.741   0.749   0.757   0.759   0.703   0.262   0.164    0.088   0.046


English Three teams participated in the English subtask, submitting a total of ten runs. All of them managed to improve over the Elastic Search (ES) baseline by a large margin. Team Aschern
performed best; they used TF.IDF, fine-tuned pre-trained sentence-BERT, and LambdaMART for
re-ranking, and scored 13.4 (MAP@5) points above the baseline. The second-best system was
submitted by the NLytics team, which fine-tuned RoBERTa, improving by 5 (MAP@5) points
absolute over the baseline.


4. Subtask 2B: Detecting Previously Fact-Checked Claims in
   Political Debates or Speeches
Given a claim in a political debate or a speech, the task asks to detect whether the claim has
been previously fact-checked with respect to a collection of previously fact-checked claims.
This is also a ranking task, and it was offered in English.

4.1. Dataset
We have 669 claims from political debates [1], matched against 804 verified claims (some input
claims match more than one verified claim) in a collection of 19,250 verified claims in PolitiFact.
We report some statistics about the dataset in the last column of Table 2.

4.2. Evaluation
Similarly to subtask-2A, we treat this as a ranking task, and we report the same evaluation
measures. Once again, MAP@5 is the official evaluation measure.
Table 4
Task 2B, English: Examples of input claims and the top-5 most similar verified claims from our verified
claims collection retrieved by a BM25 system. The correct previously verified matching claims to be
retrieved are marked with a ✓.
         Input tweet     (a) Richard Nixon released tax returns when he was under audit.

         Verified claims (1) Richard Nixon released tax returns when he was under audit.                  ✓
                         (2) Every Republican nominee since Richard Nixon, who at one time was            ✗
                             under an audit, has released their tax returns.
                         (3) Even Richard Nixon released his tax returns to the public when he was        ✗
                             running for president ...
                         (4) Richard Nixon was the last president to be impeached.                        ✗
                         (5) Says his campaign has released his past tax returns.                         ✗
         Input tweet     (b) He actually advocated for the actions we took in Libya and urged that Gad-
                             hafi be taken out, after actually doing some business with him one time.

         Verified claims (1) If you actually took the number of Muslims [sic] Americans, we’d be one ✗
                             of the largest Muslim countries in the world.
                         (2) There's only one of us who's actually cut government spending not two, ✗
                             there's one and you're looking at him.
                         (3) Says Roy Moore “has advocated getting the federal government out of ✗
                             health care altogether, which means doing away with Medicaid, which
                             means doing away with Medicare.”
                         (4) Says Donald Trump is “on record extensively supporting (the) in- ✓
                             tervention in Libya.”
                         (5) “When Moammar Gadhafi was set to visit the United Nations, and ✓
                             no one would let him stay in New York, Trump allowed Gadhafi
                             to set up an elaborate tent at his Westchester County (New York)
                             estate.”


Table 5
Task 2B (English): Official evaluation results, in terms of MRR, MAP@𝑘, and Precision@𝑘. The teams
are ranked by the official evaluation measure: MAP@5.
     Team           MRR                        MAP                                         Precision
                             @1       @3        @5      @10      @20       @1       @3       @5        @10     @20
     ES-baseline    0.350    0.304    0.339    0.346    0.351    0.353    0.304    0.143     0.091     0.052   0.027
 1   DIPS [33]      0.336    0.278    0.313    0.328    0.338    0.342    0.266    0.143     0.099     0.059   0.032
 2   Beasku [35]    0.320    0.266    0.308    0.327    0.332    0.332    0.253    0.139     0.101     0.056   0.028
 3   NLytics [34]   0.216    0.171    0.210    0.215    0.219    0.222    0.165    0.101     0.068     0.038   0.022


4.3. Overview of the Systems
Among the three participating teams, none could beat the official baseline. Below, we offer a short description of each system.
Team DIPS [33] (2B:en:2) was the top-ranked team. They used Sentence-BERT embeddings
for all claims (input and verified), then computed a cosine similarity for each pair of an input
claim and a verified claim. Finally, they made a prediction by passing a sorted list of cosine
similarities to a neural network.
Team BeaSku [35] (2B:en:3) used triplet loss training to fine-tune sentence BERT. Then, they
used the scores predicted by that model along with BM25 scores as features to train a rankSVM
re-ranker. They further studied the impact of applying online mining of triplets, and they
performed some experiments to augment the dataset automatically.
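Triplet-loss fine-tuning of Sentence-BERT can be sketched as follows; the base model and the single toy triplet (anchor, positive, negative) are illustrative, and the BM25 feature combination and rankSVM re-ranking steps are not shown:

```python
# Fine-tune a Sentence-BERT model with triplet loss on (input claim, matching verified claim,
# non-matching verified claim) triplets.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")        # hypothetical base model

train_examples = [
    InputExample(texts=[
        "Richard Nixon released tax returns when he was under audit.",   # anchor (input claim)
        "Richard Nixon released tax returns when he was under audit.",   # positive (verified claim)
        "Richard Nixon was the last president to be impeached.",         # negative (e.g., a hard BM25 hit)
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```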
Team NLytics (2B:en:4) fine-tuned RoBERTa with a regression function in the final layer,
treating the problem as a ranking task.

4.4. Results
Table 5 shows the official results for Task 2B. We can see that only three teams participated in
this subtask, submitting a total of five runs, and no team managed to outperform the Elastic
Search (ES) baseline, which is based on BM25.


5. Conclusion and Future Work
We have provided a detailed overview of the CLEF 2021 CheckThat! lab task 2, which focused
on detecting previously fact-checked claims in tweets (Subtask 2A), and in political debates or
speeches (Subtask 2B). In line with the general mission of CLEF, we promoted multilinguality
by offering the task in two different languages: Arabic and English. The participating systems
fine-tuned transformer models (such as BERT and RoBERTa) and some tried data augmentation.
For Subtask 2A, four systems (one for Arabic and three for English) participated, and all
outperformed a BM25 baseline. For Subtask 2B, none of the three participating teams could
beat the baseline.
   We plan a new iteration of the CLEF CheckThat! lab and of task 2, which will offer new
larger training datasets and additional languages.


Acknowledgments
The work of Tamer Elsayed and Maram Hasanain was made possible by NPRP grant #NPRP-
11S-1204-170060 from the Qatar National Research Fund (a member of Qatar Foundation). The
work of Fatima Haouari was supported by GSRA grant #GSRA6-1-0611-19074 from the Qatar
National Research Fund (a member of Qatar Foundation). The statements made herein are solely
the responsibility of the authors.
   This work is part of the Tanbih mega-project,6 developed at the Qatar Computing Research
Institute, HBKU, which aims to limit the impact of “fake news”, propaganda, and media bias
by making users aware of what they are reading, thus promoting media literacy and critical
thinking.




6 http://tanbih.qcri.org
References
 [1] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a known lie: Detecting
     previously fact-checked claims, in: Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics, ACL 2020, 2020, pp. 3607–3618.
 [2] N. Vo, K. Lee, Where are the facts? searching for fact-checked information to alleviate
     the spread of fake news, in: Proceedings of the 2020 Conference on Empirical Methods in
     Natural Language Processing, EMNLP 2020, 2020, pp. 7717–7731.
 [3] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
     F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K.
     Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat!
     lab on detecting check-worthy claims, previously fact-checked claims, and fake news,
     CLEF 2021, 2021.
 [4] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal,
     F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov, Overview
     of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and
     political debates, in: [36], 2021.
 [5] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab: Task 3 on
     fake news detection, in: [36], 2021.
 [6] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh,
     F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. Sheikh Ali, Overview of
     CheckThat! 2020: Automatic identification and verification of claims in social media,
     LNCS (12260), 2020.
 [7] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. Da San Martino,
     P. Atanasova, Overview of the CLEF-2019 CheckThat!: Automatic identification and
     verification of claims, in: Experimental IR Meets Multilinguality, Multimodality, and
     Interaction, LNCS, 2019, pp. 301–321.
 [8] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, J. Han, A survey on truth discovery,
     SIGKDD Explor. Newsl. 17 (2016) 1–16.
 [9] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data
     mining perspective, SIGKDD 19 (2017) 22–36.
[10] D. M. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger,
     B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A.
     Thorson, D. J. Watts, J. L. Zittrain, The science of fake news, Science 359 (2018) 1094–1096.
[11] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018)
     1146–1151.
[12] N. Vo, K. Lee, The rise of guardians: Fact-checking URL recommendation to combat fake
     news, in: Proceedings of the 41st International ACM SIGIR Conference on Research and
     Development in Information Retrieval, SIGIR 2018, 2018, pp. 275–284.
[13] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential
     debates, in: J. Bailey, A. Moffat, C. C. Aggarwal, M. de Rijke, R. Kumar, V. Murdock,
     T. K. Sellis, J. X. Yu (Eds.), Proceedings of the 24th ACM International Conference on
     Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October
     19 - 23, 2015, 2015, pp. 1835–1838.
[14] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen,
     MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims, in:
     Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
     and the 9th International Joint Conference on Natural Language Processing, EMNLP-
     IJCNLP 2019, 2019, pp. 4685–4697.
[15] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future
     directions, in: COLING, 2018, pp. 3346–3359.
[16] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain,
     R. Suwaileh, F. Haouari, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020
     English: Automatic identification and verification of claims in social media, in: [37], 2020.
[17] M. Hasanain, F. Haouari, R. Suwaileh, Z. Ali, B. Hamdan, T. Elsayed, A. Barrón-Cedeño,
     G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 Arabic: Automatic identifica-
     tion and verification of claims in social media, in: [37], 2020.
[18] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
     F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. Kishore Shahi, J. Maria Struß, T. Mandl,
     The CLEF-2021 CheckThat! Lab on detecting check-worthy claims, previously fact-checked
     claims, and fake news, in: ECIR, 2021, pp. 639–649.
[19] P. Arnold, The challenges of online fact checking, Technical Report, Full Fact, 2020.
[20] S. Shaar, F. Alam, G. D. S. Martino, P. Nakov, The role of context in detecting previously
     fact-checked claims, arXiv:2104.07423 (2021).
[21] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph,
     A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever
     end-to-end fact-checking system, Proceedings of VLDB Endow. 10 (2017) 1945–1948.
[22] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S. Dietze,
     K. Todorov, ClaimsKG: A knowledge graph of fact-checked claims, in: Proceedings of the
     18th International Semantic Web Conference, ISWC 2019, 2019, pp. 309–324.
[23] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task
     benchmark and analysis platform for natural language understanding, in: Proceedings of
     the 7th International Conference on Learning Representations, ICLR 2019, 2019.
[24] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence
     understanding through inference, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, NAACL-HLT 2018, 2018, pp. 1112–1122.
[25] L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, B. Magnini, The fifth PASCAL
     recognizing textual entailment challenge, in: Proceedings of the Text Analysis Conference,
     TAC ’09, 2009.
[26] W. B. Dolan, C. Brockett, Automatically constructing a corpus of sentential paraphrases,
     in: Proceedings of the Third International Workshop on Paraphrasing, 2005.
[27] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 task 1: Semantic
     textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of the
     11th International Workshop on Semantic Evaluation, SemEval 2017, 2017, pp. 1–14.
[28] Z. S. Ali, W. Mansour, T. Elsayed, A. Al-Ali, AraFacts: The first large Arabic dataset
     of naturally occurring claims, in: Proceedings of the Sixth Arabic Natural Language
     Processing Workshop, 2021, pp. 231–236.
[29] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S. Dietze,
     K. Todorov, ClaimsKG: A knowledge graph of fact-checked claims, in: Proceedings of the
     International Semantic Web Conference, ISWC 2019, Springer, 2019, pp. 309–324.
[30] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A python toolkit
     for reproducible information retrieval research with sparse and dense representations, in:
     Proceedings of the 44th Annual International ACM SIGIR Conference on Research and
     Development in Information Retrieval (SIGIR), 2021.
[31] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language
     understanding, in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and
     Processing Tools, OSAC ’20, Marseille, France, 2020, pp. 9–15.
[32] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Aschern at CLEF CheckThat! 2021: Lambda-
     calculus of fact-checked claims, in: [36], 2021.
[33] S. Mihaylova, I. Borisova, D. Chemishanov, P. Hadzhitsanev, M. Hardalov, P. Nakov, DIPS
     at CheckThat! 2021: Verified claim retrieval, in: [36], 2021.
[34] A. Pritzkau, NLytics at CheckThat! 2021: Multi-class fake news detection of news articles
     and domain identification with RoBERTa - a baseline model, in: [36], 2021.
[35] B. Skuczyńska, S. Shaar, J. Spenader, P. Nakov, BeaSku at CheckThat! 2021: Fine–Tuning
     Sentence BERT with Triplet Loss and Limited Data, in: [36], 2021.
[36] G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Working Notes. Working
     Notes of CLEF 2021–Conference and Labs of the Evaluation Forum, 2021.
[37] L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Working Notes, CEUR
     Workshop Proceedings, CEUR-WS.org, 2020.