Overview of the CLEF-2022 CheckThat! Lab Task 2 on Detecting Previously Fact-Checked Claims

Preslav Nakov1, Giovanni Da San Martino2, Firoj Alam1, Shaden Shaar4, Hamdy Mubarak3 and Nikolay Babulkov5
1 Mohamed bin Zayed University of Artificial Intelligence, UAE
2 University of Padova, Italy
3 Qatar Computing Research Institute, HBKU, Qatar
4 Cornell University, USA
5 Sofia University, Bulgaria

Abstract
We describe the fifth edition of the CheckThat! Lab, part of the 2022 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting three tasks related to factuality, and it covers seven languages: Arabic, Bulgarian, Dutch, English, German, Spanish, and Turkish. Here, we present Task 2, which asks to detect previously fact-checked claims and was offered in two languages. A total of six teams participated in this task and submitted 37 runs in total; most submissions achieved sizable improvements over the baselines using transformer-based models such as BERT and RoBERTa. In this paper, we describe the process of data collection and the task setup, including the evaluation measures, and we give a brief overview of the participating systems. Last but not least, we release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in detecting previously fact-checked claims.

Keywords
Check-Worthiness Estimation, Fact-Checking, Veracity, Verified Claims Retrieval, Detecting Previously Fact-Checked Claims, Social Media Verification, Computational Journalism, COVID-19

1. Introduction
There has been a surge in research on systems for automatic fact-checking. However, such systems suffer from credibility issues, and manual fact-checking remains the norm; hence, it is important to reduce the manual effort by detecting when a claim has already been fact-checked. Work in this direction includes [1] and [2]: the former developed a dataset for the task and proposed a ranking model, while the latter proposed a neural ranking model using textual and visual modalities.

To address this, the CheckThat! lab initiative features a number of tasks aiming to help automate the fact-checking process and to reduce the spread of disinformation and misinformation. The CheckThat! 2022 lab was held in the framework of CLEF 2022 [3].1 Figure 1 shows the full CheckThat! identification and verification pipeline, highlighting the three tasks targeted in this fifth edition of the lab: Task 1 on detecting relevant claims in tweets, Task 2 on retrieving relevant previously fact-checked claims (this paper) [4], and Task 3 on predicting the veracity of news [5].

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
preslav.nakov@mbzuai.ac.ae (P. Nakov); dasan@math.unipd.it (G. D. S. Martino); fialam@hbku.edu.qa (F. Alam); ss2753@cornell.edu (S. Shaar); hmubarak@hbku.edu.qa (H. Mubarak); nbabulkov@gmail.com (N. Babulkov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1 http://sites.google.com/view/clef2022-checkthat/

Figure 1: The full verification pipeline. The 2022 lab covers three tasks from that pipeline: (i) check-worthiness estimation, (ii) verified claim retrieval, and (iii) fake news detection. The gray tasks were addressed in previous editions of the lab [6, 7].
In this paper, we describe in detail the second task of the CheckThat! lab: detecting previously fact-checked claims.2 The task is defined as follows: "given a check-worthy input claim and a set of verified claims, rank the previously verified claims in order of usefulness to fact-check the input claim." It consists of the following two subtasks:

Subtask 2A: Detecting previously fact-checked claims in tweets. Given a tweet, detect whether the claim it makes was previously fact-checked with respect to a collection of fact-checked claims. This is a ranking task, offered in Arabic and English, where the systems need to return a list of top-n candidates.

Subtask 2B: Detecting previously fact-checked claims in political debates or speeches. Given a claim in a political debate or a speech, detect whether the claim has been previously fact-checked with respect to a collection of previously fact-checked claims. This is a ranking task, and it was offered in English.

For Subtask 2A, we focused on tweets, and the subtask was offered in Arabic and English. The participants were free to work on any language(s) of their interest, and they could also use multilingual approaches that make use of all datasets for training. Subtask 2A attracted six teams, and the most successful approaches used transformers or a combination of embeddings and manually engineered features. More details are discussed in Section 3. For Subtask 2B, we focused on political debates and speeches, and we used PolitiFact as the main data source. The task attracted one team, and a combination of transformers, preprocessing, and augmentation approaches performed the best.

2 Refer to [3] for an overview of the full CheckThat! 2022 lab.

The remainder of the paper is organized as follows: Section 2 discusses related work; Sections 3 and 4 describe the datasets, the evaluation results, and the participating systems for Subtasks 2A and 2B, respectively; and Section 5 concludes with final remarks.

2. Related Work
There has been significant research on developing automatic systems for fact-checking [8, 9, 10, 11, 12]. Studies include the development of datasets [13, 14] and evaluation campaigns [15, 6, 16, 17, 18]. However, such fully automatic systems suffer from credibility issues, e.g., in the eyes of journalists, and manual checking is still the norm. Thus, it is important to reduce that manual effort by detecting whether a claim has already been fact-checked [19]. Hence, a reasonable solution is to build tools that support human fact-checkers, e.g., by detecting previously fact-checked claims. Relevant work in this direction includes [1, 20, 21]. In this work, we use their annotation setup and one of their datasets: PolitiFact. Previous work has mentioned the task as an integral step of an end-to-end automated fact-checking pipeline, but very little detail was provided about this component, and it was not evaluated [22].

A number of tools have been developed, such as Fact Check Explorer,3 which allows users to search a number of fact-checking websites. However, the tool cannot handle complex claims, as it relies on standard Google search functionality, which is not optimized for semantic matching of long claims; a minimal sketch below illustrates this purely lexical kind of matching.

3 http://toolbox.google.com/factcheck/explorer
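To make this limitation concrete, the following is a minimal sketch of purely lexical claim retrieval over a toy collection of verified claims, of the kind a keyword-search backend performs. It assumes the third-party rank_bm25 package, the claims are invented for illustration, and it is not part of any participant's or tool's actual implementation.

```python
# Minimal lexical (BM25) retrieval over a toy collection of verified claims.
# Assumes: pip install rank_bm25 ; the data below is made up for illustration.
from rank_bm25 import BM25Okapi

verified_claims = [
    "5G towers spread the coronavirus.",
    "Drinking bleach cures COVID-19.",
    "The Eiffel Tower was sold for scrap in 1925.",
]
tokenized_corpus = [c.lower().split() for c in verified_claims]
bm25 = BM25Okapi(tokenized_corpus)

# A long, paraphrased input claim: its token overlap with the relevant
# verified claim is small, so purely lexical scoring can rank it poorly.
input_claim = "people say that the new mobile network masts are behind the pandemic"
scores = bm25.get_scores(input_claim.lower().split())
ranked = sorted(zip(verified_claims, scores), key=lambda x: -x[1])
for claim, score in ranked:
    print(f"{score:6.3f}  {claim}")
```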
A recent survey reports on what AI technology can offer to assist the work of professional fact-checkers [23], pointing out several research problems, such as identifying claims worth fact-checking, detecting relevant previously fact-checked claims, retrieving relevant evidence to fact-check a claim, and verifying the claim. Other recent work includes memory-enhanced transformers for matching (MTM) to rank fact-checked articles [24], topic-aware evidence reasoning and stance-aware aggregation [25], claim matching [26], sequence-to-sequence transformer models [27], and deep Q-learning networks [28].

3. Subtask 2A: Detecting Previously Fact-Checked Claims in Tweets
Given a tweet, the task asks to detect whether the claim the tweet makes was previously fact-checked with respect to a collection of fact-checked claims. The task is offered in Arabic and English. This is a ranking task, where the systems are asked to return a list of top-n candidates.

3.1. Dataset
Arabic For Arabic, we have 908 tweets, matching 1,089 verified claims (some tweets match more than one verified claim) in a collection of 30,379 previously fact-checked claims. The latter include 5,921 Arabic claims from AraFacts [29] and 24,408 English claims from ClaimsKG [30], translated to Arabic using the Google Translate API.4 The complete data collection process is discussed in [31].

English To develop the verified claims dataset, we used Snopes, a fact-checking website that targets rumors spreading in social media, and we collected 13,835 verified claims. Their fact-checking journalists often cite the tweet or the social media post that spreads the rumor when writing an article about a claim. We have 1,610 annotated tweets, each matching a single claim in the set of 13,835 verified claims from Snopes.

Data Statistics Table 1 shows statistics about the CT–VCR–225 corpus for Task 2, including both subtasks and languages. Input–VerClaim pairs represent input claims with their corresponding verified claims by a fact-checking source. The input for Subtask 2A (2B) is a tweet (a sentence from a political debate or a speech). More details about the corpus construction can be found in [31].

Table 1
Task 2: Statistics about the CT–VCR–22 corpus, including the number of Input–VerClaim pairs and the number of VerClaim claims to match the input claim against.

Partition                             2A-Arabic   2A-English   2B-English
Input claims                                908        1,610          752
  Training                                  512          999          472
  Development                                85          200          119
  Dev-Test                                  261          202           78
  Test                                       50          209           83
Input–VerClaim pairs                      1,089        1,610          869
  Training                                  602          999          562
  Development                               102          200          139
  Dev-Test                                  335          202          103
  Test                                       50          209           65
Verified claims (to match against)       30,379       13,835       20,771

Data Split For Arabic, we provide 512 training, 85 dev, 261 dev-test, and 50 test examples. In total, the Arabic dataset consists of 908 queries, 1,089 qrels, and a collection of 30,329 verified claims. For English, we provide 999 training, 200 dev, 202 dev-test, and 209 test examples. In total, the English dataset consists of 1,610 queries, 1,610 qrels, and a collection of 13,835 verified claims.

4 http://cloud.google.com/translate
5 CT–VCR–22 stands for CheckThat! verified claim retrieval 2022.

3.2. Evaluation
For the ranking tasks, as in the two previous editions of the CheckThat! lab, we calculated mean average precision (MAP), mean reciprocal rank (MRR), Precision@k (P@k), and MAP@k for k ∈ {1, 3, 5, 10, 20, 30}. We used MAP@5 as the official evaluation measure.
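For concreteness, here is a minimal illustrative re-implementation of these per-query measures (reciprocal rank, P@k, and average precision truncated at k); MAP@k and MRR are then obtained by averaging over all input claims. This is only a sketch, not the official evaluation script released with the lab.

```python
# Illustrative re-implementation of the ranking measures (not the official scorer).
# `ranked` is the system's ranked list of verified-claim IDs for one input claim;
# `relevant` is the set of gold verified-claim IDs for that input claim.
def reciprocal_rank(ranked, relevant):
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

# MAP@5 (the official measure) is the mean of average_precision_at_k(..., k=5)
# over all input claims; MRR is the mean of reciprocal_rank over all input claims.
```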
3.3. Overview of the Systems
A total of six teams participated in this task. One team participated in the Arabic task and six teams participated in the English task. Below, we briefly discuss the approach used by each team; Table 2 gives an overview.

Table 2
Overview of the approaches to Subtask 2A. For each team, the table indicates the language(s) addressed (Arabic, English), the transformers used (AraBERT, ColBERT, GPT-Neo, SBERT, ST5), and whether data augmentation and preprocessing were used (✓ = part of the official submission; Ë = considered in internal experiments). The participating teams, with their position on the English subtask, were RIET Lab [36] (1), AI Rational [32] (2), BigIR [33] (3), SimBa [35] (4), motlogelwan (5), and Fraunhofer SIT [38] (6).

Team AI Rational [32] (2A-en:2) experimented with an architecture that combines semantic, lexical, and re-ranking modules, and observed that, for the MAP@k and P@k measures, the re-ranking task is equivalent to a classification one. Therefore, re-rankers previously used in this kind of architecture, such as RankSVM and LambdaMART, can be reduced to a classifier such as a basic SVM. More specifically, the team used a pretrained SBERT model, Elasticsearch, and an SVM as the implementations of the semantic, lexical, and re-ranking modules, respectively.

Team BigIR [33] (2A-en:3) used the same system proposed in [33] without any further fine-tuning. In other words, the pre-trained model was fine-tuned on the CheckThat! 2021 dataset only [34], indicating that the proposed system performs well even though it was not fine-tuned on the 2022 dataset. BigIR's system involves three steps. First, preprocessing, in which the tweet is cleaned and expanded with helpful information extracted from URLs, images, and videos. The second step retrieves an initial list using a simple lexical retrieval model such as BM25. Finally, the initial list is re-ranked using a BERT-based model fine-tuned for this task. For the English subtask, BigIR used an MPNet model, and for Arabic, they used AraBERT. BigIR's system for Arabic did not perform better than the random baseline, whose score is 0.0, and therefore we do not report results for Arabic.

Team SimBa [35] (2A-en:4, 2B-en:1) preprocessed the input claims by removing URLs, @-symbols, and user information. They experimented with both unsupervised and supervised methods with blocking and balancing, but found their primary submission, an unsupervised approach, to be the most successful. For this, they generated sentence embeddings for all input claims and all verified claims using the sentence embedding models "Sentence-BERT" and "SimCSE", calculated the cosine similarity for all possible pairs of input and verified claims, and averaged the two similarity scores into one. Additionally, they computed the count of similar tokens without stop words and added it to the score. Finally, the five most similar verified claims for each input claim were selected based on this similarity score (a minimal sketch of this kind of embedding-plus-overlap ranking is given below).

Team RIET Lab [36] (2A-en:1) created a pipeline for claim matching that uses a sentence transformer (sentence-t5) for candidate selection and a generative model (gpt-neo) [37] for re-ranking. For fine-tuning the candidate selection model, they used a multiple negatives ranking (MNR) loss with hard negatives mined via BM25. For the generative re-ranking step, they fine-tuned an autoregressive language model using a new objective that heavily regularizes mutual information from both a likelihood and a posterior perspective. The model yields high precision and, due to its generative nature, can also give analysts a better idea of confidence, which is important for fact-checking.
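The following is a minimal sketch in the spirit of SimBa's unsupervised run: two sentence-embedding models score each (input claim, verified claim) pair, the cosine similarities are averaged, a content-token overlap count is added, and the top five verified claims are returned. It assumes the sentence-transformers package; the model names and the stop-word list are placeholders rather than the team's exact choices.

```python
# Sketch of an unsupervised embedding-plus-overlap ranker (illustrative only).
from sentence_transformers import SentenceTransformer, util

STOP = {"the", "a", "an", "is", "are", "of", "to", "in", "for", "and"}

def content_tokens(text):
    # Lowercased tokens with a small stop-word list removed (placeholder list).
    return {t for t in text.lower().split() if t not in STOP}

def rank_verified_claims(input_claim, verified_claims, top_n=5):
    # Stand-ins for the "Sentence-BERT" and "SimCSE" encoders used by the team.
    models = [SentenceTransformer("all-mpnet-base-v2"),
              SentenceTransformer("all-MiniLM-L6-v2")]
    sims = None
    for model in models:
        q = model.encode([input_claim], convert_to_tensor=True)
        d = model.encode(verified_claims, convert_to_tensor=True)
        s = util.cos_sim(q, d)[0]            # cosine similarity to every verified claim
        sims = s if sims is None else sims + s
    sims = sims / len(models)                # average the two similarity scores
    overlap = [len(content_tokens(input_claim) & content_tokens(v))
               for v in verified_claims]     # count of shared content tokens
    scores = [float(sims[i]) + overlap[i] for i in range(len(verified_claims))]
    ranked = sorted(range(len(verified_claims)), key=lambda i: -scores[i])
    return [(verified_claims[i], scores[i]) for i in ranked[:top_n]]
```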
Team Fraunhofer SIT [38] (2A-en:6) proposed an ensemble classification approach. It uses state-of-the-art sentence transformers to estimate the semantic similarity between a given tweet and the collection of previously fact-checked claims. Furthermore, it incorporates several preprocessing steps as well as back-translation as a data augmentation technique.

3.4. Results
Table 3 shows the official results for Subtask 2A English for all participating teams. We do not report results for Arabic, as the scores are zero for both the random baseline and the submitted system.

Arabic Team BigIR submitted a run for this subtask; however, they did not submit a working note. They used AraBERT to re-rank a list of candidates retrieved by a BM25 model. Their approach consists of three main steps: preprocessing, retrieving an initial list using BM25, and re-ranking this list using an AraBERT-based model. As with the random baseline, the system did not match any input claim to the corresponding verified claims, and thus its performance ended up being 0.0.

English Six teams participated, submitting a total of thirty-two runs. All teams improved over the random baseline. Team RIET Lab [36] submitted the top run, based on a sentence transformer (sentence-t5) for candidate selection and a generative model (gpt-neo [37]) for re-ranking. Team AI Rational [32] ranked second, using a pretrained SBERT, Elasticsearch, and an SVM.

Table 3
Tasks 2A and 2B: Official evaluation results, in terms of MRR, MAP@k, and Precision@k. The teams are ranked by the official evaluation measure: MAP@5.

Task 2A: English
Team                     MRR     MAP@1   MAP@3   MAP@5   MAP@10   P@3     P@5     P@10
1. RIET Lab [36]         0.957   0.943   0.955   0.956   0.956    0.322   0.194   0.098
2. AI Rational [32]      0.922   0.904   0.919   0.922   0.922    0.313   0.190   0.095
3. BigIR [33]            0.923   0.900   0.921   0.921   0.921    0.316   0.189   0.095
4. SimBa [35]            0.907   0.876   0.905   0.907   0.907    0.314   0.190   0.095
5. motlogelwan*          0.878   0.833   0.870   0.873   0.876    0.306   0.187   0.095
6. Fraunhofer SIT [38]   0.624   0.557   0.601   0.610   0.617    0.221   0.141   0.075

Task 2B: English
SimBa [35]               0.475   0.408   0.446   0.459   0.459    0.190   0.126   0.063

4. Subtask 2B: Detecting Previously Fact-Checked Claims in Political Debates or Speeches
Given a claim in a political debate or a speech, the task asks to detect whether the claim has been previously fact-checked with respect to a collection of previously fact-checked claims. This is also a ranking task, and it was offered in English.

4.1. Dataset
We have 752 claims from political debates [1], matched against 869 verified claims (some input claims match more than one verified claim) in a collection of 20,771 verified claims from PolitiFact. We report statistics about the dataset in the last column of Table 1.

4.2. Evaluation
Similarly to Subtask 2A, we treat this as a ranking task, and we report the same evaluation measures. Once again, MAP@5 is the official evaluation measure.

4.3. Overview of the Systems
Team SimBa [35] (2B-en:1) submitted a total of four runs. They computed different kinds of similarities between input and verified claims, including the cosine similarity of sentence embeddings and different lexical similarity measures. They made use of a blocking approach to filter out dissimilar pairs that can easily be excluded based on sentence-embedding similarity scores, training and applying their classifier only to distinguish between the harder cases. For this, they considered the union of the 50 most similar pairs of input and verified claims according to the similarity scores of four different sentence embedding methods ("Sentence-BERT", "SimCSE", "Universal Sentence Encoder", and "InferSent"). Their feature set consisted of the "SimCSE" similarity, Jaccard distance, the count and ratio of similar tokens, and "WordNet" synonyms. A linear support vector classifier was trained on the training data to predict whether a verified claim was relevant (a minimal sketch of such a feature-based pair classifier is given below).
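The following is a minimal sketch of this kind of feature-based pair classifier. It assumes scikit-learn and NLTK's WordNet interface, and the features are simplified stand-ins for the team's exact feature set; the toy training pairs are invented for illustration.

```python
# Sketch of a feature-based relevance classifier over (input claim, verified claim) pairs.
# Assumes scikit-learn and NLTK with the WordNet corpus: nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn
from sklearn.svm import LinearSVC

def tokens(text):
    return [t for t in text.lower().split() if t.isalpha()]

def synonyms(word):
    # All WordNet lemma names for the word (empty set if the word is unknown).
    return {lemma.name().lower() for syn in wn.synsets(word) for lemma in syn.lemmas()}

def pair_features(input_claim, verified_claim):
    a, b = set(tokens(input_claim)), set(tokens(verified_claim))
    union = a | b or {"_"}                               # avoid division by zero
    overlap = len(a & b)
    jaccard_distance = 1.0 - overlap / len(union)
    overlap_ratio = overlap / max(len(a), 1)
    syn_overlap = sum(1 for w in a if synonyms(w) & b)   # WordNet-synonym matches
    return [jaccard_distance, overlap, overlap_ratio, syn_overlap]

# Toy training data: (input claim, verified claim, is_relevant) triples.
pairs = [
    ("masks do not stop the virus", "Face masks are ineffective against COVID-19.", 1),
    ("masks do not stop the virus", "The Eiffel Tower was sold for scrap.", 0),
]
X = [pair_features(p, v) for p, v, _ in pairs]
y = [label for _, _, label in pairs]
clf = LinearSVC().fit(X, y)                              # predicts relevance of a pair
```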
4.4. Results
Table 3 shows the official results for Subtask 2B, which was offered in English only. The table does not report the random baseline results, as its scores are zero for all measures.

5. Conclusion and Future Work
We have provided a detailed overview of the CLEF-2022 CheckThat! lab Task 2, which focused on detecting previously fact-checked claims in tweets (Subtask 2A) and in political debates or speeches (Subtask 2B). In line with the general mission of CLEF, we promoted multilinguality by offering the task in two different languages: Arabic and English. The participating systems fine-tuned transformer models, such as Sentence-BERT, ST5, and GPT-Neo, and some used data augmentation. For Subtask 2A, six teams participated (one for Arabic and six for English), and all English submissions outperformed the random baseline. For Subtask 2B, the one participating team managed to beat the random baseline. In the future, we are considering targeting other tasks that could play a relevant role in the analysis of journalistic and social media content beyond the explicit factuality decision, such as coverage bias in the news and subjectivity, among others.

Acknowledgments
This work is also part of the Tanbih mega-project,6 developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news", propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking.

6 http://tanbih.qcri.org

References
[1] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a known lie: Detecting previously fact-checked claims, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, 2020, pp. 3607–3618.
[2] N. Vo, K. Lee, Where are the facts? Searching for fact-checked information to alleviate the spread of fake news, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, 2020, pp. 7717–7731.
[3] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF ’2022, Bologna, Italy, 2022.
[4] P. Nakov, G. Da San Martino, F. Alam, S. Shaar, H. Mubarak, N. Babulkov, Overview of the CLEF-2022 CheckThat! lab task 2 on detecting previously fact-checked claims, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[5] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, M. Schütz, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[6] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. Sheikh Ali, Overview of CheckThat! 2020: Automatic identification and verification of claims in social media, LNCS (12260), Springer, 2020, pp. 215–236.
[7] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. Da San Martino, P. Atanasova, Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, LNCS, 2019, pp. 301–321.
[8] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, J. Han, A survey on truth discovery, SIGKDD Explor. Newsl. 17 (2016) 1–16.
[9] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, SIGKDD Explor. Newsl. 19 (2017) 22–36.
[10] D. M. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, J. L. Zittrain, The science of fake news, Science 359 (2018) 1094–1096.
[11] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146–1151.
[12] N. Vo, K. Lee, The rise of guardians: Fact-checking URL recommendation to combat fake news, in: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018, 2018, pp. 275–284.
[13] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, 2015, pp. 1835–1838.
[14] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen, MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, 2019, pp. 4685–4697.
[15] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: COLING, Association for Computational Linguistics, 2018, pp. 3346–3359. URL: http://www.aclweb.org/anthology/C18-1283.
[16] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media, in: [39], 2020.
[17] M. Hasanain, F. Haouari, R. Suwaileh, Z. Ali, B. Hamdan, T. Elsayed, A. Barrón-Cedeño, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media, in: [39], 2020.
[18] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: ECIR, 2021, pp. 639–649.
[19] P. Arnold, The challenges of online fact checking, Technical Report, Full Fact, 2020.
[20] S. Shaar, F. Alam, G. Da San Martino, P. Nakov, The role of context in detecting previously fact-checked claims, arXiv:2104.07423 (2021).
[21] S. Shaar, F. Alam, G. Da San Martino, P. Nakov, Assisting the human fact-checkers: Detecting all previously fact-checked claims in a document, arXiv preprint arXiv:2109.07410 (2021).
[22] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever end-to-end fact-checking system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948.
[23] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. Da San Martino, Automated fact-checking for assisting human fact-checkers, in: Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI ’21, 2021, pp. 4551–4558.
[24] Q. Sheng, J. Cao, X. Zhang, X. Li, L. Zhong, Article reranking by memory-enhanced key sentence matching for detecting previously fact-checked claims, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 5468–5481. URL: https://aclanthology.org/2021.acl-long.425. doi:10.18653/v1/2021.acl-long.425.
[25] J. Si, D. Zhou, T. Li, X. Shi, Y. He, Topic-aware evidence reasoning and stance-aware aggregation for fact verification, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1612–1622. URL: https://aclanthology.org/2021.acl-long.128. doi:10.18653/v1/2021.acl-long.128.
[26] A. Kazemi, K. Garimella, D. Gaffney, S. Hale, Claim matching beyond English to scale global fact-checking, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4504–4517. URL: https://aclanthology.org/2021.acl-long.347. doi:10.18653/v1/2021.acl-long.347.
[27] K. Jiang, R. Pradeep, J. Lin, Exploring listwise evidence reasoning with T5 for fact verification, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 402–410. URL: https://aclanthology.org/2021.acl-short.51. doi:10.18653/v1/2021.acl-short.51.
[28] H. Wan, H. Chen, J. Du, W. Luo, R. Ye, A DQN-based approach to finding precise evidences for fact verification, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Online, 2021, pp. 1030–1039. URL: https://aclanthology.org/2021.acl-long.83. doi:10.18653/v1/2021.acl-long.83.
[29] Z. S. Ali, W. Mansour, T. Elsayed, A. Al-Ali, AraFacts: The first large Arabic dataset of naturally occurring claims, in: Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021, pp. 231–236.
[30] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S. Dietze, K. Todorov, ClaimsKG: A knowledge graph of fact-checked claims, in: International Semantic Web Conference, Springer, 2019, pp. 309–324.
[31] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. Da San Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF ’2021, 2021.
[32] V. Kostov, AI Rational at CheckThat! 2022: Reranking previously fact-checked claims on semantic and lexical similarity, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[33] W. Mansour, T. Elsayed, A. Al-Ali, Did I see it before? Detecting previously-checked claims over Twitter, in: European Conference on Information Retrieval, Springer, 2022, pp. 367–381.
[34] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, LNCS (12880), Springer, 2021.
[35] A. Hövelmeyer, K. Boland, S. Dietze, SimBa at CheckThat! 2022: Lexical and semantic similarity based detection of verified claims in an unsupervised and supervised way, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[36] M. Shliselberg, S. Dori-Hacohen, RIET Lab at CheckThat! 2022: Improving decoder based re-ranking for claim matching, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[37] S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, GPT-Neo: Large scale autoregressive language modeling with Mesh-TensorFlow, 2021. URL: https://doi.org/10.5281/zenodo.5297715. doi:10.5281/zenodo.5297715.
[38] R. A. Frick, I. Vogel, Fraunhofer SIT at CheckThat! 2022: Ensemble similarity estimation for finding previously fact-checked claims, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[39] L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Working Notes, CEUR Workshop Proceedings, 2020.