Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities
Notebook for the CheckThat! Lab at CLEF 2024

Fatima Haouari1, Tamer Elsayed1 and Reem Suwaileh2
1 Qatar University, Doha, Qatar
2 Hamad Bin Khalifa University, Doha, Qatar
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
Emails: 200159617@qu.edu.qa (F. Haouari); telsayed@qu.edu.qa (T. Elsayed); rsuwaileh@hbku.edu.qa (R. Suwaileh)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We present an overview of Task 5 of the seventh edition of the CheckThat! Lab, which is part of the 2024 Conference and Labs of the Evaluation Forum (CLEF). In the Rumor Verification using Evidence from Authorities task, given a rumor expressed in a tweet and a set of authority Twitter accounts for that rumor, participating systems should retrieve up to 5 evidence tweets posted by those authorities and determine the veracity of the rumor according to the retrieved evidence. A total of 3 and 5 teams submitted runs (5 and 11 runs) for Arabic and English, respectively, out of which 2 made submissions for both languages. In this paper, we present our data construction approach, evaluation setup, and an overview of the participating systems. We publicly release all the datasets and evaluation scripts to promote further research on this task.

Keywords: Fact Checking, Claims, Social Media

1. Introduction
The CheckThat! lab runs for the seventh time under the umbrella of CLEF 2024 [1, 2]. In this edition of the lab, six tasks were offered: Task 1 on check-worthiness estimation, Task 2 on subjectivity detection, Task 3 on persuasion techniques, Task 4 on detecting hero, villain, and victim from memes, Task 5 on rumor verification using evidence from authorities (this paper), and Task 6 on robustness of credibility assessment with adversarial examples (InCrediblAE). In this paper, we describe in detail Task 5 of this year's lab (refer to [2] for an overview of the full CheckThat! 2024 lab), Rumor Verification using Evidence from Authorities. Task 5 is defined as follows: "Given a rumor expressed in a tweet and a set of authorities (one or more authority Twitter accounts) for that rumor, represented by a list of tweets from their timelines during the period surrounding the rumor, the system should retrieve up to 5 evidence tweets from those timelines, and determine if the rumor is supported (true), refuted (false), or unverifiable (in case not enough evidence to verify it exists in the given tweets) according to the evidence."

The rest of this paper is organized as follows. We give an overview of related work in Section 2. We define our task in Section 3 and present the adopted evaluation measures in Section 4. We give a full overview of the Arabic and English shared tasks in Section 5 and Section 6, respectively, including our dataset construction approach, an overview of the participants' systems, and a discussion of the evaluation results. Finally, we conclude in Section 7.

2. Related Work
A large number of existing studies in the broader literature have studied rumor verification in social media [3, 4, 5, 6, 7, 8, 9]. Most early studies have incorporated the propagation networks, such as the structure of replies [6, 10, 7, 11, 8], the stance of replies [12, 13, 3, 4, 5], or retweeters' metadata [9], as a source of evidence. Some authors have also suggested that evidence from the Web [14, 15], or the stance of authority tweets towards rumors [16, 17], can further improve automatic rumor verification. Rumor verification in social media has been addressed in multiple languages, mainly English [12, 14, 4, 3] or Chinese [18, 6, 7]. However, Arabic rumor verification is still under-studied. Most existing studies relied solely on the rumor textual content for verification [19, 20, 21, 22]. Recently, Haouari et al. [11] exploited the replies structure, Althabiti et al. [23] proposed detecting sarcasm and hate speech in the replies, while Albalawi et al. [24] leveraged the images and videos embedded in the rumor tweet. Differently, we propose incorporating the evidence tweets retrieved from the authority timelines for Arabic and English rumor verification.

3. Task Definition
The Rumor Verification using Evidence from Authorities task consists of two subtasks defined as follows:
• Evidence Retrieval: Given a rumor expressed in a tweet and a set of authorities for that rumor, the system should retrieve evidence tweets posted by any of those authorities. An evidence tweet is a tweet that can be further used to detect the veracity of the rumor. The set of authorities comprises one or more authority Twitter accounts, each represented by a list of tweets from their timelines posted during the period surrounding the rumor.
• Rumor Verification: Based solely on the evidence tweets retrieved by the above subtask, determine if the rumor is supported (true), refuted (false), or unverifiable (in case not enough evidence to verify it exists).
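To make the expected inputs and outputs concrete, the following minimal Python sketch shows one hypothetical way a system could represent a task instance. The field names and label strings are illustrative only; the authoritative file format is the one released in the task repository.

```python
# Hypothetical in-memory representation of a Task 5 instance (not the released format).
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class AuthorityTweet:
    tweet_id: str
    authority_account: str   # handle of the authority that posted the tweet
    text: str
    timestamp: str           # posting time, used to bound the timeline window

@dataclass
class RumorInstance:
    rumor_id: str
    rumor_tweet: str                           # the rumor expressed in a tweet
    authority_timelines: List[AuthorityTweet]  # tweets from all authority accounts

@dataclass
class SystemOutput:
    rumor_id: str
    evidence: List[str]      # up to 5 tweet_ids, ranked by estimated relevance
    label: Literal["supported", "refuted", "unverifiable"]  # label strings may differ in the official data
```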
4. Evaluation Measures
Evidence retrieval. The official evaluation measure for evidence retrieval is Mean Average Precision (MAP). We also report Recall@5.
Rumor Verification. We use Macro-F1 to evaluate the classification of the rumors. Additionally, we report a Strict Macro-F1, where the predicted rumor label is considered correct only if at least one retrieved authority evidence tweet is correct.
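The following minimal Python sketch illustrates how these measures can be computed, assuming simple dictionaries keyed by rumor id; it is an illustration of the definitions above, not the official scorer released with the task, which may differ in details (e.g., how rumors with no gold evidence are handled).

```python
# A minimal sketch of MAP, Recall@5, Macro-F1, and Strict Macro-F1 (assumed data layout).
from sklearn.metrics import f1_score

def average_precision(ranked_ids, gold_ids):
    """AP over a ranked list of retrieved tweet ids for one rumor."""
    hits, score = 0, 0.0
    for rank, tid in enumerate(ranked_ids, start=1):
        if tid in gold_ids:
            hits += 1
            score += hits / rank
    # For rumors with no gold evidence this yields 0; the official scorer may treat them differently.
    return score / max(len(gold_ids), 1)

def recall_at_5(ranked_ids, gold_ids):
    return len(set(ranked_ids[:5]) & set(gold_ids)) / max(len(gold_ids), 1)

def evaluate(run, gold):
    """run/gold: {rumor_id: {"evidence": [tweet_ids], "label": str}}."""
    ap, r5, y_true, y_pred, y_strict = [], [], [], [], []
    for r in gold:
        retrieved = run[r]["evidence"]
        relevant = set(gold[r]["evidence"])
        ap.append(average_precision(retrieved, relevant))
        r5.append(recall_at_5(retrieved, relevant))
        y_true.append(gold[r]["label"])
        y_pred.append(run[r]["label"])
        # Strict variant: the predicted label counts only if at least one
        # retrieved evidence tweet is correct; otherwise it is treated as a miss.
        y_strict.append(run[r]["label"] if set(retrieved) & relevant else "MISSED_EVIDENCE")
    gold_labels = sorted(set(y_true))
    return {
        "MAP": sum(ap) / len(ap),
        "Recall@5": sum(r5) / len(r5),
        "Macro-F1": f1_score(y_true, y_pred, average="macro", labels=gold_labels),
        "Strict Macro-F1": f1_score(y_true, y_strict, average="macro", labels=gold_labels),
    }
```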
5. Arabic Shared Task
In this section, we give an overview of the Arabic shared task. We present our dataset construction approach in Section 5.1. The approaches adopted by the participating systems and their evaluation results are presented in Section 5.2 and Section 5.3, respectively.

5.1. Dataset
To construct our dataset (https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task5), we randomly selected 160 rumors from two existing datasets, namely AuFIN [25, 26] and AuSTR [17]. Specifically, we selected 99 rumors (61.9%) from AuFIN and 61 (38.1%) from AuSTR. We then annotated the dataset following two steps: 1) finding authorities that may tweet evidence that can help in verifying the rumor (Section 5.1.1), and 2) evidence extraction, including the authority timelines collection and annotation (Section 5.1.2). Our task dataset covers 160 rumors annotated with their corresponding 692 authority timelines, comprising about 34k annotated tweets in total. We randomly split the data into 96 training, 32 development, and 32 test examples.

5.1.1. Authority Finding
The authority finding task was proposed recently by Haouari et al. [25]. They define an authority for a specific rumor as an entity having the real knowledge or power to verify or deny that rumor. For example, if the rumor is about a sports event in Qatar, then the Ministry and Minister of Sports and Youth, and the managers of the event, are potential authorities. AuFIN rumors are already associated with their relevant authorities; however, AuSTR rumors are only associated with a single authority tweet that is either supporting, refuting, or irrelevant to the rumor. Therefore, for AuSTR rumors, in addition to considering the authority of the associated authority tweet, we collected more authorities for each rumor following the same approach proposed by Haouari et al. [25]. Two annotators, co-organizers of this task, performed the task independently, then met to discuss their annotations. Only potential authorities that both annotators agreed upon during their meeting were kept in our data.

5.1.2. Evidence Extraction
To collect the authority timelines, we used the Twitter Academic search API, which facilitates collecting users' historical timelines (https://developer.x.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all). We consider the rumor tweet as a pointer to the time span of the rumor propagation, where we assume that the rumor circulates for a few days before and/or after the rumor tweet posting time. Therefore, we limit the authority timelines to the tweets posted within 3 days before and after the rumor tweet posting time. To extract the evidence from the collected timelines, we performed two steps:
(1) Annotation: Following our annotation guidelines, one annotator labeled all tweets in all authority timelines as supporting, refuting, or carrying not enough info towards the corresponding rumor tweet. To measure the quality of our data and to obtain a double-annotated sample, a second annotator then labeled one authority timeline per rumor. At the end of this stage, we measured the quality of the double-annotated sample using Cohen's Kappa for inter-annotator agreement [27], obtaining 0.67, which indicates "substantial" agreement [28]. It is worth mentioning that any disagreement between the annotators was then resolved in the next stage.
(2) Resolving Disagreements: As a final step, both annotators met to discuss and resolve any disagreements in the double-annotated sample, and hence decide the final labels.
Refer to [29] for more details about our data construction process.
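As a simple illustration of two of the steps above, the sketch below bounds an authority timeline to the ±3-day window around the rumor tweet and computes inter-annotator agreement on a double-annotated sample. Data layout and field names are assumptions for illustration, not the released format.

```python
# A minimal sketch of the timeline window filter and the agreement check (assumed inputs).
from datetime import datetime, timedelta
from sklearn.metrics import cohen_kappa_score

WINDOW = timedelta(days=3)

def within_window(authority_tweet_time: datetime, rumor_time: datetime) -> bool:
    """Keep only authority tweets posted within 3 days before or after the rumor tweet."""
    return abs(authority_tweet_time - rumor_time) <= WINDOW

def agreement(labels_annotator1, labels_annotator2) -> float:
    """Cohen's Kappa over the double-annotated tweets (support / refute / not enough info)."""
    return cohen_kappa_score(labels_annotator1, labels_annotator2)
```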
5.2. Overview of the Participating Systems
For the Arabic shared task, 3 teams submitted a total of 5 runs. In the following, we present their proposed approaches.

bigIR. This team submitted 2 runs adopting two state-of-the-art (SOTA) models for fact checking, namely MLA [30] and KGAT [31]. For evidence retrieval, MLA is a BERT-based binary classifier fine-tuned to classify whether an authority tweet is an evidence or non-evidence tweet; the inputs to the model are (rumor_tweet, authority_tweet) pairs. At training time, they considered only a sample of non-evidence tweets for each rumor (they set the number of negative examples to 4 times the number of positive examples). At inference time, every (rumor_tweet, authority_tweet) pair of the test set is passed to the fine-tuned model, and the softmax scores are used to get the top N authority tweets. The KGAT model is also BERT-based; however, a margin ranking loss is adopted to maximize the distance between the positive and the negative (rumor_tweet, authority_tweet) pairs. At inference time, the score obtained for each pair is used to rank the authority tweets. The top 5 retrieved evidence tweets are then used to fine-tune their rumor verification models, adopting the MLA and KGAT claim verification models. KGAT is a reasoning model adopting a Kernel Graph Attention Network to construct a fully connected graph over the retrieved evidence. MLA, on the other hand, adopts multi-task learning, considering verification as the main task and evidence retrieval as an auxiliary task. For all their evidence retrieval and rumor verification models, they used MARBERTv2 [32] (https://huggingface.co/UBC-NLP/MARBERTv2), an Arabic BERT model pre-trained on 1 billion Arabic tweets. They fine-tuned all models for 5 epochs with a batch size of 8 and 4 different learning rates [2e-5, 3e-5, 4e-5, 5e-5]. They selected the best evidence retrieval and rumor verification models on the dev set based on Mean Average Precision (MAP) and Macro F1, respectively.
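For illustration, the following sketch shows the pairwise scoring idea behind the MLA-style evidence-retrieval run: a sequence-classification model scores each (rumor, authority tweet) pair, and the softmax probability of the evidence class ranks the timeline. It assumes a MARBERTv2 checkpoint already fine-tuned as described above, and an assumed label mapping (class 1 = evidence); it is an illustration, not the team's actual code.

```python
# A minimal inference-time sketch of pairwise evidence ranking (assumed fine-tuned checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "UBC-NLP/MARBERTv2"  # in practice, a variant fine-tuned for evidence vs. non-evidence
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def rank_evidence(rumor, timeline, top_n=5):
    """Score every (rumor, authority tweet) pair and return the top-N tweets by evidence probability."""
    scores = []
    with torch.no_grad():
        for tweet in timeline:
            inputs = tokenizer(rumor, tweet, truncation=True, return_tensors="pt")
            logits = model(**inputs).logits
            evidence_prob = torch.softmax(logits, dim=-1)[0, 1].item()  # assumed: class 1 = evidence
            scores.append((evidence_prob, tweet))
    return sorted(scores, reverse=True)[:top_n]
```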
IAI Group. This group submitted 2 runs, adopting a zero-shot setup for both evidence retrieval and rumor verification. For evidence retrieval, they experimented with two approaches, namely 1) ColBERT-XM [33], a multilingual pre-trained model for semantic search, and 2) cross-encoders. For rumor verification, in both runs, they leveraged xlm-roberta-nli, a RoBERTa model pre-trained with a combination of Natural Language Inference (NLI) data in multiple languages [34].

SCUoL. This team submitted 1 run, focusing solely on the rumor verification subtask. They leveraged an existing Arabic fact checking system [35], passing only the rumor tweet to the system to get the veracity label.

5.3. Evaluation Results
In this section, we present and discuss the results of the participating systems for both evidence retrieval and rumor verification against our baseline.

Baseline: We adopted KGAT [31], a SOTA model for fact-checking. We fine-tuned both its evidence retrieval and rumor verification models on the FEVER English fact-checking dataset [36], following the authors' setup but using multilingual BERT (mBERT) [37] (https://huggingface.co/bert-base-multilingual-uncased). We then tested on our Arabic test data.

Evidence Retrieval: As presented in Table 1, 4 out of 5 runs managed to outperform the baseline significantly. The bigIR team's primary run, fine-tuned on AuRED data, outperformed all models in terms of all evaluation measures. We can also observe that although the IAI Group adopted a zero-shot approach, it significantly outperformed the baseline. Moreover, the bigIR secondary run, which is the same model used as the baseline (i.e., KGAT) but fine-tuned on AuRED, shows a big improvement, which highlights the importance of in-domain data for the task.

Table 1
Evidence retrieval (Arabic) official evaluation results, in terms of MAP and Recall@5. The teams are ranked only based on their primary runs by the official evaluation measure MAP. Submissions with a + sign indicate submissions by task organisers.

Rank  Team (run ID)                          MAP    Recall@5
1     bigIR+ (bigIR-MLA-Ar)                  0.618  0.673
-     IAI Group (IAI-Arabic-Crossencoder)    0.586  0.601
2     IAI Group (IAI-Arabic-COLBERT)         0.564  0.581
-     bigIR+ (bigIR-KGAT-Ar)                 0.560  0.625
      Baseline                               0.345  0.423
3     SCUoL (SCUoL)                          -      -

Rumor Verification: As presented in Table 2, we observe that the IAI Group primary and secondary runs outperformed all other runs significantly, although adopting a zero-shot approach. The results highlight that even the two models fine-tuned on the task data, the bigIR models, could not achieve results comparable to the best performing model. We observe that one of the bigIR models outperforms the baseline on Macro F1 only, but could not beat it in terms of Strict Macro F1, while the second could not beat the baseline on any measure. This could be attributed to the small number of training examples (i.e., only 96 rumors). Finally, we observe that the run submitted by the SCUoL team performed better than the baseline, although it did not consider the authority evidence.

Table 2
Rumor verification (Arabic) official evaluation results, in terms of Macro F1 and Strict Macro F1. The teams are ranked only based on their primary runs by the official evaluation measure Macro F1. Submissions with a + sign indicate submissions by task organisers.

Rank  Team (run ID)                          Macro F1  Strict Macro F1
1     IAI Group (IAI-Arabic-COLBERT)         0.600     0.581
-     IAI Group (IAI-Arabic-Crossencoder)    0.460     0.433
2     bigIR+ (bigIR-MLA-Ar)                  0.368     0.300
3     SCUoL (SCUoL)                          0.355     -
      Baseline                               0.347     0.347
-     bigIR+ (bigIR-KGAT-Ar)                 0.258     0.258

6. English Shared Task
This year, a major extension of the task is that it runs over English data in addition to Arabic. As English is a globally dominant language, this attracted more researchers, developers, and participants, thereby increasing the task's visibility and impact. In this section, we present our dataset construction approach (Section 6.1), participating systems (Section 6.2), and evaluation results (Section 6.3) of the English shared task.

6.1. Dataset
To construct the English dataset, we translated the Arabic dataset (refer to Section 5.1). The rationale behind this approach is that topics and issues concerning the Arab region are frequently discussed in English, especially by Arab users who communicate in English on Twitter. Additionally, international journalists who do not speak Arabic are interested in ongoing discussions in the region. Therefore, translations help us capture representative rumors that can be discussed within English content on Twitter while also reducing the annotation effort. We followed a two-stage process of automatic translation and manual validation that we discuss in the following.

Automatic Translation. We automatically translated the entire Arabic dataset using the Googletrans library (https://py-googletrans.readthedocs.io/en/latest/). We translated all 160 rumor tweets and their associated authority tweets.
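A minimal sketch of this translation step with the Googletrans library is shown below; the exact library version and call options used to build the dataset are not specified here, so treat this as illustrative.

```python
# A minimal sketch of Arabic-to-English machine translation with googletrans (illustrative only).
from googletrans import Translator

translator = Translator()

def translate_tweets(arabic_tweets):
    """Translate a list of Arabic tweets into English."""
    return [translator.translate(text, src="ar", dest="en").text for text in arabic_tweets]
```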
Manual Validation. While automatic translation can expedite the development of monolingual and cross-lingual systems, it introduces several challenges that could affect the quality and reliability of the resulting data. To address this, we manually validated the translations of a random sample of tweets. Specifically, we reviewed the translations of all rumors and a sample of 2,138 tweets from the authority timelines. We edited 514 tweets (24%) to correct errors and inaccuracies, while 1,624 tweets (75.96%) remained unedited.

Challenges. Through the validation process, we observed issues and challenges that we addressed to maintain the quality and reliability of the dataset. We discuss a few in the following:
• Inaccuracies: Automatic translation tools can produce inaccurate translations, especially for complex sentences, idiomatic expressions, and context-dependent phrases. These inaccuracies can lead to errors in the dataset.
• Loss of Nuance and Context: Automatic translations may fail to capture the nuanced meanings and cultural context of the original text. This can result in a loss of important information.
• Inconsistencies: Automatic translations of the same text may differ across runs, leading to inconsistencies within the same dataset.
Despite these challenges, we opted for this approach to enable the development of English systems, and we will consider better approaches for constructing the dataset in the future.

6.2. Overview of the Participating Systems
For the English shared task, 5 teams submitted a total of 11 runs. In the following, we present their proposed approaches.

AuthEv-LKolb [38]. This team participated with 3 runs. They adopted the OpenAI GPT-4 assistant for rumor verification in all their runs, passing each single rumor-evidence pair and prompting GPT-4 to return a judgement and a confidence. The N judgements are then combined into a final label. For evidence retrieval, in two of their runs they adopted OpenAI embeddings and computed the cosine similarity between the embedding vectors of the rumor tweet and each authority tweet to get the closest top N. In their third run, they adopted a simple PyTerrier BatchRetrieve pipeline of BM25 and PL2 to retrieve the top evidence tweets. In one of their runs, the authors used external data: they collected the authorities' Twitter account information and augmented the input text with the authority name and bio for both evidence retrieval and rumor verification.
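The following sketch illustrates the embedding-and-cosine-similarity retrieval idea used in two of these runs; the OpenAI embedding model name is an assumption, and the team's exact configuration may differ.

```python
# A minimal sketch of embedding-based evidence retrieval with OpenAI embeddings (assumed model name).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts, model="text-embedding-3-small"):
    """Embed a list of texts and return a (n, d) array of vectors."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def top_n_evidence(rumor, timeline, n=5):
    """Rank authority tweets by cosine similarity to the rumor tweet."""
    vectors = embed([rumor] + timeline)
    rumor_vec, tweet_vecs = vectors[0], vectors[1:]
    sims = tweet_vecs @ rumor_vec / (
        np.linalg.norm(tweet_vecs, axis=1) * np.linalg.norm(rumor_vec))
    order = np.argsort(-sims)[:n]
    return [(float(sims[i]), timeline[i]) for i in order]
```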
Axolotl [39]. They submitted 3 runs, adopting BM25 lexical retrieval for evidence retrieval. To retrieve relevant authority tweets, they give extra importance to hashtags in the rumor tweet by boosting them relative to plain text. For two of their runs, they further reranked the top retrieved tweets using either sentence-t5-base (https://huggingface.co/sentence-transformers/sentence-t5-base) or Llama3 8B. For rumor verification, they adopted a stance-based approach using either Llama3 8B or all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

bigIR. The team participated with 2 runs adopting the same models and setup used for Arabic. However, they replaced MARBERTv2 with the English BERT base model [37] (https://huggingface.co/google-bert/bert-base-uncased).

DEFAULT [40]. They formulated the task as retrieval-augmented classification, jointly training the rumor verification classifier and the evidence retriever. They fine-tuned ColBERT [41] and used the MaxSim score as the similarity score.

IAI Group. They adopted a zero-shot setup for both their runs. For evidence retrieval, they adopted either ColBERT or cross-encoders. For rumor verification, they exploited a RoBERTa model pre-trained with NLI task data.

6.3. Evaluation Results
In this section, we present and discuss the results of the participating systems for both evidence retrieval and rumor verification against our baseline.

Baseline: We adopted the same model fine-tuned for the Arabic shared task baseline (refer to Section 5.3), but tested on our English test data.

Evidence Retrieval: As shown in Table 3, all the submitted runs outperformed our baseline. Looking at the primary runs, we observe that the models fine-tuned using the task data, the bigIR-MLA-En and DEFAULT-Colbert1 runs, got the 1st and 3rd place, respectively. The results also highlight that although the Axolotl team's run achieved the 2nd position, bigIR outperforms it by a big margin. Interestingly, the IAI Group secondary run, under a zero-shot setup, improved the retrieval in terms of MAP compared to the leading team, but could not improve the recall of evidence.

Table 3
Evidence retrieval (English) official evaluation results, in terms of MAP and Recall@5. The teams are ranked only based on their primary runs by the official evaluation measure MAP. Submissions with a + sign indicate submissions by task organisers.

Rank  Team (run ID)                                                      MAP    Recall@5
-     IAI Group (IAI-English-Crossencoder)                               0.628  0.676
1     bigIR+ (bigIR-MLA-En)                                              0.604  0.677
2     Axolotl (run_rr=llama_sp=llama_rewrite=3_boundary=0,4_hashtagW=1)  0.566  0.617
3     DEFAULT (DEFAULT-Colbert1)                                         0.559  0.634
4     IAI Group (IAI-English-COLBERT)                                    0.557  0.590
5     AuthEv-LKolb (AuthEv-LKolb-oai)                                    0.549  0.587
-     bigIR+ (bigIR-KGAT-En)                                             0.537  0.618
-     AuthEv-LKolb (AuthEv-LKolb-terrier-oai-preprocessing)              0.524  0.563
-     AuthEv-LKolb (AuthEv-LKolb-oai-extdata)                            0.510  0.619
-     Axolotl (run_rr=dl_sp=llama_rewrite=0_boundary=0,2_hashtagW=1)     0.489  0.545
-     Axolotl (run_rr=none_sp=dl_rewrite=0_boundary=0,1_hashtagW=1)      0.489  0.545
      Baseline                                                           0.335  0.445

Rumor Verification: As presented in Table 4, only 2 teams were able to outperform the baseline, AuthEv-LKolb and Axolotl, adopting LLMs, namely GPT-4 and Llama respectively. The results highlight that the models adopting the fine-tuning setup (the bigIR and DEFAULT models), or a zero-shot setup using pre-trained language models (the IAI Group models), could not outperform the baseline. We can conclude that adopting LLMs can perform well on the verification task, achieving a Macro F1 of 0.895. However, further investigation is required to compare their performance against models fine-tuned using the task data but with a larger number of rumors.

Table 4
Rumor verification (English) official evaluation results, in terms of Macro F1 and Strict Macro F1. The teams are ranked only based on their primary runs by the official evaluation measure Macro F1. Submissions with a + sign indicate submissions by task organisers.

Rank  Team (run ID)                                                      Macro F1  Strict Macro F1
-     AuthEv-LKolb (AuthEv-LKolb-oai-extdata)                            0.895     0.876
1     AuthEv-LKolb (AuthEv-LKolb-oai)                                    0.879     0.861
-     AuthEv-LKolb (AuthEv-LKolb-terrier-oai-preprocessing)              0.831     0.831
2     Axolotl (run_rr=llama_sp=llama_rewrite=3_boundary=0,4_hashtagW=1)  0.687     0.687
-     Axolotl (run_rr=dl_sp=llama_rewrite=0_boundary=0,2_hashtagW=1)     0.630     0.570
-     Axolotl (run_rr=none_sp=dl_rewrite=0_boundary=0,1_hashtagW=1)      0.574     0.492
      Baseline                                                           0.495     0.495
3     DEFAULT (DEFAULT-Colbert1)                                         0.482     0.454
-     IAI Group (IAI-English-Crossencoder)                               0.459     0.444
4     bigIR+ (bigIR-MLA-En)                                              0.458     0.428
5     IAI Group (IAI-English-COLBERT)                                    0.373     0.373
-     bigIR+ (bigIR-KGAT-En)                                             0.373     0.373

7. Conclusion
In this paper, we presented a detailed overview of the CLEF 2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities. For evidence retrieval, participants adopted either a zero-shot setup or a fine-tuning setup using the task data. For the zero-shot setup, they leveraged existing pre-trained language models, LLMs, traditional lexical retrieval such as BM25, or a combination of these. For rumor verification, only the models adopting LLMs managed to outperform the baseline.
As future work, we plan to enlarge the task dataset and incorporate more languages.

Acknowledgments
The work of Fatima Haouari was supported by GSRA grant #GSRA6-1-0611-19074 from the Qatar National Research Fund (a member of Qatar Foundation). The work of Tamer Elsayed was made possible by NPRP grant #NPRP-11S-1204-170060 from the Qatar National Research Fund. The work of Reem Suwaileh is partially supported by NPRP 14C-0916-210015 from the Qatar National Research Fund, which is a part of the Qatar Research Development and Innovation Council (QRDI). The statements made herein are solely the responsibility of the authors.

References
[1] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[2] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] S. Kumar, K. Carley, Tree LSTMs with convolution units to predict stance and rumor veracity in social media conversations, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 5047–5058. URL: https://aclanthology.org/P19-1498. doi:10.18653/v1/P19-1498.
[4] N. Bai, F. Meng, X. Rui, Z. Wang, A multi-task attention tree neural net for stance classification and rumor veracity detection, Applied Intelligence (2022) 1–11.
[5] S. Roy, M. Bhanu, S. Saxena, S. Dandapat, J. Chandra, gDART: Improving rumor verification in social media with Discrete Attention Representations, Information Processing & Management 59 (2022) 102927.
[6] J. Ma, W. Gao, K.-F. Wong, Rumor Detection on Twitter with Tree-structured Recursive Neural Networks, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1980–1989.
[7] J. Choi, T. Ko, Y. Choi, H. Byun, C.-k. Kim, Dynamic graph convolutional networks with attention mechanism for rumor detection on social media, PLOS ONE 16 (2021) e0256039.
[8] N. Bai, F. Meng, X. Rui, Z. Wang, Rumor detection based on a Source-Replies conversation Tree Convolutional Neural Net, Computing 104 (2022) 1155–1171.
[9] Y. Liu, Y.-F. Wu, Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[10] T. Bian, X. Xiao, T. Xu, P. Zhao, W. Huang, Y. Rong, J. Huang, Rumor Detection on Social Media with Bi-Directional Graph Convolutional Networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 549–556.
[11] F. Haouari, M. Hasanain, R. Suwaileh, T. Elsayed, ArCOV19-rumors: Arabic COVID-19 Twitter dataset for misinformation detection, in: N. Habash, H. Bouamor, H. Hajj, W. Magdy, W. Zaghouani, F. Bougares, N. Tomeh, I. Abu Farha, S. Touileb (Eds.), Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp. 72–81. URL: https://aclanthology.org/2021.wanlp-1.8.
[12] A. Zubiaga, M. Liakata, R. Procter, G. Wong Sak Hoi, P. Tolmie, Analysing how people orient to and spread rumours in social media by looking at conversational threads, PLOS ONE 11 (2016) e0150989.
[13] L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. Wong Sak Hoi, A. Zubiaga, SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 69–76.
[14] J. Dougrez-Lewis, E. Kochkina, M. Arana-Catania, M. Liakata, Y. He, PHEMEPlus: Enriching Social Media Rumour Verification with External Evidence, in: Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER), 2022, pp. 49–58.
[15] X. Hu, Z. Guo, J. Chen, L. Wen, P. S. Yu, MR2: A Benchmark for Multimodal Retrieval-Augmented Rumor Detection in Social Media, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 2901–2912. URL: https://doi.org/10.1145/3539618.3591896. doi:10.1145/3539618.3591896.
[16] F. Haouari, T. Elsayed, Detecting Stance of Authorities Towards Rumors in Arabic Tweets: A Preliminary Study, in: Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2023, pp. 430–438.
[17] F. Haouari, T. Elsayed, Are authorities denying or supporting? Detecting stance of authorities towards rumors in Twitter, Social Network Analysis and Mining 14 (2024) 34.
[18] J. Ma, W. Gao, P. Mitra, S. Kwon, B. J. Jansen, K.-F. Wong, M. Cha, Detecting rumors from microblogs with recurrent neural networks, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 3818–3824.
[19] M. K. Elhadad, K. F. Li, F. Gebali, COVID-19-FAKES: A Twitter (Arabic/English) Dataset for Detecting Misleading Information on COVID-19, in: International Conference on Intelligent Networking and Collaborative Systems, Springer, 2020, pp. 256–268.
[20] A. R. Mahlous, A. Al-Laith, Fake News Detection in Arabic Tweets during the COVID-19 Pandemic, International Journal of Advanced Computer Science and Applications 12 (2021).
[21] M. Al-Yahya, H. Al-Khalifa, H. Al-Baity, D. AlSaeed, A. Essam, Arabic Fake News Detection: Comparative Study of Neural Networks and Transformer-Based Approaches, Complexity 2021 (2021).
[22] A. Sawan, T. Thaher, N. Abu-el rub, Sentiment Analysis Model for Fake News Identification in Arabic Tweets, in: 2021 IEEE 15th International Conference on Application of Information and Communication Technologies (AICT), 2021, pp. 1–6.
[23] S. Althabiti, M. A. Alsalka, E. Atwell, Detecting Arabic Fake News on Social Media using Sarcasm and Hate Speech in Comments (2022).
[24] R. M. Albalawi, A. T. Jamal, A. O. Khadidos, A. M. Alhothali, Multimodal Arabic Rumors Detection, IEEE Access (2023).
[25] F. Haouari, T. Elsayed, W. Mansour, Who can verify this? Finding authorities for rumor verification in Twitter, Information Processing & Management 60 (2023) 103366.
[26] F. Haouari, Z. Sheikh Ali, T. Elsayed, Overview of the CLEF-2023 CheckThat! Lab Task 5 on Authority Finding in Twitter, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF '2023, Thessaloniki, Greece, 2023.
[27] J. Cohen, A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20 (1960) 37–46.
[28] J. R. Landis, G. G. Koch, The Measurement of Observer Agreement for Categorical Data, Biometrics (1977) 159–174.
[29] F. Haouari, T. Elsayed, R. Suwaileh, AuRED: Enabling Arabic Rumor Verification using Evidence from Authorities over Twitter, in: Proceedings of ArabicNLP 2024, 2024.
[30] C. Kruengkrai, J. Yamagishi, X. Wang, A multi-level attention model for evidence-based fact checking, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 2447–2460.
[31] Z. Liu, C. Xiong, M. Sun, Z. Liu, Fine-grained fact verification with kernel graph attention network, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7342–7351.
[32] M. Abdul-Mageed, A. Elmadany, et al., ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7088–7105.
[33] A. Louis, V. Saxena, G. van Dijck, G. Spanakis, ColBERT-XM: A modular multi-vector representation model for zero-shot multilingual information retrieval, arXiv preprint arXiv:2402.15059 (2024).
[34] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451.
[35] S. Althabiti, M. A. Alsalka, E. Atwell, Ta'keed: The first generative fact-checking system for Arabic claims, arXiv preprint arXiv:2401.14067 (2024).
[36] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and verification, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 809–819.
[37] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
[38] L. Kolb, A. Hanbury, AuthEv-LKolb at CheckThat! 2024: A Two-Stage Approach To Evidence-Based Social Media Claim Verification, in: [42], 2024.
[39] A. Pasin, N. Ferro, SEUPD@CLEF: Team Axolotl on Rumor Verification using Evidence from Authorities, in: [42], 2024.
[40] S. Adhikari, H. Sharma, R. Kumari, S. Satapara, M. Desarkar, DEFAULT at CheckThat! 2024: Retrieval Augmented Classification using Differentiable Top-K Operator for Rumor Verification based on Evidence from Authorities, in: [42], 2024.
[41] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[42] G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.