DEFAULT at CheckThat! 2024: Retrieval Augmented Classification using Differentiable Top-K Operator for Rumor Verification based on Evidence from Authorities

Sayanta Adhikari1,*,†, Himanshu Sharma1,†, Rupa Kumari1,†, Shrey Satapara1 and Maunendra Desarkar1
1 Indian Institute of Technology Hyderabad, Telangana, 502285, India

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
Email: ai22mtech12005@iith.ac.in (S. Adhikari); ai22mtech12008@iith.ac.in (H. Sharma); rupa06012000@gmail.com (R. Kumari); ai22mtech02003@iith.ac.in (S. Satapara); maunendra@cse.iith.ac.in (M. Desarkar)
ORCID: 0009-0008-3717-9223 (S. Adhikari); 0009-0000-6189-515X (H. Sharma); 0000-0001-6222-1288 (S. Satapara); 0000-0003-1963-7338 (M. Desarkar)
The code can be found at https://github.com/SAYANTA-ADHIKARI/RAC-SOFT-TopK
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This paper describes Team DEFAULT's submission to CheckThat! 2024 Task 5 on Rumor Verification based on Evidence from Authorities. We present an approach for rumor verification on Twitter that integrates evidence from authoritative accounts to determine the veracity of rumors. We formulate rumor verification using evidence from authorities as a Retrieval-Augmented Classification (RAC) task and propose an architecture and a training regime that ensure seamless gradient flow between the retrieval and classification modules. By re-parameterizing the Top-K operator and applying entropy-based smoothing, our method removes the discontinuity introduced after retrieval, enhancing the accuracy of rumor verification. Using this classification-aware retrieval, the retriever achieves a Recall@5 of 0.778, outperforming the baseline and placing team DEFAULT third on the test data leaderboard for retrieval. For classification, our approach performs on par with the baseline.

Keywords
Rumor Verification, Retrieval Augmented Classification, Differential TopK, Optimal Transport

1. Introduction

In the present era, social media has become one of the most widely used mediums for information sharing, as it enables fast dissemination at low cost. This has also made online social media a preferred channel for many individuals and organizations to spread propaganda-driven misinformation and influence public opinions and decisions [1]. The spread of rumors and misinformation through social media has become a significant concern, and verifying the veracity of rumors and combating the dissemination of misinformation is crucial for maintaining the integrity of online discourse. This paper proposes a novel approach, Retrieval-Augmented Classification (RAC), which combines document retrieval and classification techniques to address the problem of rumor verification [2].

The shared task "Rumor Verification using Evidence from Authorities" at the CheckThat! lab at CLEF 2024 [3, 4] adopts a two-step approach. The first step is evidence retrieval, wherein authoritative tweets related to a rumor are analyzed to identify the most relevant ones. These sources, including reputable organizations or subject matter experts, can provide valuable evidence supporting or refuting the rumor. The second step is classification, where the retrieved evidence is leveraged to determine the rumor's veracity, categorizing it as Supported (TRUE), Refuted (FALSE), or Unverifiable (NEUTRAL).

To illustrate the methodology, consider a scenario involving a rumor circulating on social media about a potential disease outbreak. The RAC method would first retrieve relevant documents using sophisticated algorithms. These documents would then be analyzed based on key features identified
by machine learning models. The rumor would be classified as true if these sources corroborate the outbreak with compelling evidence. Conversely, if the sources refute the claim or lack sufficient evidence, the rumor would be labelled as false or unverifiable, respectively.

In traditional retrieval systems, the relevance of a document is determined solely by its similarity to the query. However, for tasks like rumor verification, the evidence required to validate or refute a claim may not necessarily resemble the claim itself. This discrepancy between the query and the desired evidence can lead to suboptimal retrieval performance when using traditional similarity-based techniques. To address this challenge, we propose a classification-aware retrieval approach that aligns the retriever with the classifier, resulting in better retrieval. To jointly train the retriever and classifier, we remove the discontinuity associated with Top-K document selection by replacing it with Soft Top-K, which allows gradients to flow between the retrieval and classification modules and enables end-to-end training with a common loss function. Details about the proposed approach and its performance, along with analyses, are discussed in the subsequent sections of the paper.

2. Related Works

Rumor verification and fact-checking are well-known NLP tasks that have attracted many researchers, with considerable work on both dataset collection and training methods. Fact Checking [5] is one of the early works on claim verification, with claims collected from claim verification websites. Fact Extraction and VERification (FEVER) [6] is a well-known shared task on fact verification. Liu et al. (2020) [7] proposed an approach using a Kernel Graph Attention Network (KGAT). Bekoulis et al. (2021) [8] emphasized the importance of evidence-aware sentence selection, while Kruengkrai et al. (2021) [9] presented a multi-level attention model for integrating evidence. These studies provide valuable insights for developing effective RAC systems for rumor verification using evidence from authorities.

Recent work on retrieval-augmented verification [10] integrates retrieval and classification in a zero-shot approach by retrieving real-time web-scraped evidence and matching it against claim texts using pretrained language models. Their graph-structured representation gathers evidence automatically and highlights unverifiable parts of a claim. There has also been work on a comprehensive rumor debunking system that uses an LLM for retrieval, discrimination, and guided generation [11]. Various systems have been developed to enhance the extraction and application of clinical trial information. One such system is CliVER [12], an end-to-end system that uses retrieval-augmented techniques to automatically retrieve clinical trial abstracts, extract pertinent sentences, and apply the PICO framework to support or refute scientific claims.
This system represents a significant advancement in integrating artificial intelligence and clinical research methodologies, streamlining the process of evidence synthesis and decision-making in clinical settings.

3. Preliminary

3.1. Retrieval Augmented Classification

Verifying facts or rumors is challenging due to the subjective nature of the task; it requires access to contextual information about the domain from the current timeline. The verification task can be reduced to evidence retrieval followed by claim verification based on the retrieved evidence. This aligns closely with retrieval-augmented generation (RAG) [13], where the task is to generate an answer in the context of a retrieved document. Similarly, we pose rumor verification based on evidence from authoritative sources as a RAC task, where a class needs to be predicted based on the original claim and the retrieved evidence.

RAC can be approached in two ways: 1) training the retriever and classifier independently (Independent Training), and 2) training the retriever and classifier together (Joint Training). Independent training allows each component to be trained separately and then combined. However, a major drawback is the lack of alignment between the retrieval and classification processes despite them forming a pipeline: the classifier's performance is inherently linked to the retrieval quality, contradicting the notion of independence. This dependency between the retrieved evidence and the classification of the given rumor highlights the need for a joint training objective. Joint Training allows for alignment between the retriever and classifier components, but its major challenge is the discontinuity between these processes.

3.2. Differential Top-K

To address the issue of discontinuity (Figure 1(a)), we use an Optimal Transport (OT) trick for re-parameterizing the Top-K function with entropy regularization (to make it smooth) [14]. This technique first formulates the extraction of the Top-K elements of a vector as an Optimal Transport problem and then applies entropy regularization to facilitate smooth gradient flow. We use the SOFT (Scalable Optimal transport-based diFferenTiable) Top-K operator in place of the standard Top-K operator to obtain the Top-K elements.

1. Problem Formulation: Consider the score vector (containing relevance scores for each of the timeline tweets with respect to the rumor tweet) to be 𝑋 = {𝑥𝑖}𝑛𝑖=1, where 𝑛 is the total number of tweets provided in the timeline. The standard Top-K operator returns the indexes of the Top-K elements, which is equivalent to a vector 𝐴 = [𝐴1, ..., 𝐴𝑛] such that

A_i = \begin{cases} 1 & \text{if } x_i \text{ is one of the top-}k \text{ relevant tweets in } X \text{ with respect to the rumor tweet,} \\ 0 & \text{otherwise.} \end{cases}   (1)

Using 𝐴, we can extract the Top-K elements from 𝑋. In the case of sorted Top-K, 𝐴 is a matrix that, when multiplied with the input 𝑋, provides the Top-K elements in sorted order.

2. Re-parameterizing the Top-K Operator as an OT Problem: Let the probability distributions associated with the score vector 𝑋 = {𝑥𝑖}𝑛𝑖=1 and the output support space 𝐵 = {0, 1} (0 to map all the Top-K elements and 1 for the remaining, so 𝑚 = 2) be 𝜇 = (1/𝑛) 1𝑛 and 𝜈 = [𝑘/𝑛, (𝑛−𝑘)/𝑛]ᵀ respectively, where 𝑛 is the total number of timeline tweets and 𝑘 is the number of evidence tweets that need to be retrieved from the timeline.
The corresponding OT problem is

\Gamma^* = \underset{\Gamma \geq 0}{\arg\min} \ \langle C, \Gamma \rangle, \quad \text{s.t. } \Gamma \mathbf{1}_m = \mu, \ \Gamma^T \mathbf{1}_n = \nu, \qquad \Gamma, \Gamma^* \in \mathbb{R}^{n \times m}   (2)

Here, Γ𝑖,𝑗 represents the probability of mapping the input 𝑥𝑖 of 𝑋 to the output 𝑏𝑗 of 𝐵, and 𝑐𝑖,𝑗 of 𝐶 represents the cost incurred to move from 𝑥𝑖 to 𝑏𝑗. Thus, Γ represents the joint probability distribution over the support 𝑋 × 𝐵.

3. Solution: Under the above conditions, the optimal transport plan Γ* is given in closed form by

\Gamma^*_{\sigma_i, 1} = \begin{cases} \frac{1}{n}, & \text{if } i \leq k, \\ 0, & \text{if } k+1 \leq i \leq n, \end{cases} \qquad \Gamma^*_{\sigma_i, 2} = \begin{cases} 0, & \text{if } i \leq k, \\ \frac{1}{n}, & \text{if } k+1 \leq i \leq n, \end{cases}   (3)

where 𝜎 is the sorting permutation, i.e., 𝑥𝜎1 < 𝑥𝜎2 < · · · < 𝑥𝜎𝑛. Based on Γ*, we define 𝐴 = 𝑛Γ* · [1, 0]ᵀ. The matrix 𝐴 is the mapping matrix that provides the positions of the Top-K elements.

4. Smoothing by Entropy Regularization: Applying entropy regularization to the OT problem yields a smoothed approximation. The OT optimization problem becomes

\Gamma^*_\epsilon = \underset{\Gamma \geq 0}{\arg\min} \ \langle C, \Gamma \rangle + \epsilon H(\Gamma), \quad \text{s.t. } \Gamma \mathbf{1}_m = \mu, \ \Gamma^T \mathbf{1}_n = \nu, \quad \epsilon > 0,

where H(\Gamma) = \sum_{i,j} \Gamma_{i,j} \log \Gamma_{i,j} is the entropy regularizer. Based on Γ*𝜖, we define 𝐴𝜖 = 𝑛Γ*𝜖 · [1, 0]ᵀ as the smoothed counterpart of the standard Top-K operator output (𝐴 in Equation 1). Throughout our approach, we consider sorted Top-K. Using the Soft Top-K operator in place of the Top-K operator lets the model be trained end-to-end and thus helps align the retriever and the classifier.
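To make the Soft Top-K operator concrete, the following is a minimal PyTorch sketch of the entropy-regularized OT relaxation above, solved with a few Sinkhorn iterations. It is an illustration under our own conventions rather than the released implementation: the function name, the sigmoid squashing of the scores, and the squared-distance cost that makes highly relevant scores cheap to transport to the support point 0 are illustrative choices, and a log-domain solver is preferable for very small 𝜖 (our experiments use 𝜖 = 0.01).

```python
import torch

def soft_top_k(scores: torch.Tensor, k: int, eps: float = 0.1, n_iter: int = 200) -> torch.Tensor:
    """Differentiable relaxation of the top-k indicator A (Equation 1).

    scores: (n,) relevance scores S_D from the retriever.
    Returns A_eps in [0, 1]^n; entries approach 1 for the k highest scores as eps -> 0.
    """
    n = scores.shape[0]
    x = torch.sigmoid(scores)  # squash scores into (0, 1)
    # Cost of transporting x_i to support point 0 ("selected") or 1 ("not selected"):
    # highly relevant tweets are cheap to send to 0, matching the mapping in Section 3.2.
    cost = torch.stack([(x - 1.0) ** 2, x ** 2], dim=1)                      # (n, 2)
    mu = torch.full((n,), 1.0 / n, dtype=x.dtype, device=x.device)           # row marginal
    nu = torch.tensor([k / n, (n - k) / n], dtype=x.dtype, device=x.device)  # column marginal
    K = torch.exp(-cost / eps)                                               # Gibbs kernel
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(n_iter):  # Sinkhorn iterations for the entropic OT plan
        u = mu / (K @ v + 1e-9)
        v = nu / (K.t() @ u + 1e-9)
    gamma = u.unsqueeze(1) * K * v.unsqueeze(0)                              # Gamma_eps, shape (n, 2)
    return n * gamma[:, 0]                                                   # A_eps = n * Gamma_eps @ [1, 0]^T
```

Since every operation in the sketch is differentiable, gradients of the downstream classification loss with respect to the retriever scores flow through A_eps, which is exactly what enables the joint training described next.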
4. Methodology

Figure 1: (a) Illustration of the discontinuity arising between the Retriever and Classifier phases, primarily because of the involvement of indices in the Top-K relevant tweet selection process for a given query tweet (𝑥) among the timelines (𝐷). (b) The final architecture of our approach, where the Top-K operator is replaced with the Soft Top-K operator. The output 𝐴 of Soft Top-K is used along with the timeline to obtain the embeddings of the Top-K evidences (denoted by 𝑧1, 𝑧2, 𝑧3). These are passed through the classifier to get probabilities, which are then used to compute the Cross-Entropy Loss. Density Loss is also used to further guide the retriever. The orange dotted lines in the figure show the flow of gradients in our architecture.

To perform RAC for rumor verification based on evidence from authorities, we propose a novel architecture that can be trained end-to-end: a transformer-encoder-based retriever followed by a Top-K operator that extracts the relevant evidence, which is then used to help the classifier classify the query tweet (𝑥). As shown in Figure 1(a), this Top-K operator introduces a discontinuity in the pipeline. To remove it, we re-parameterize Top-K with a smoother version, Soft Top-K (details in subsection 3.2). Figure 1(b) shows the final architecture along with the different losses (defined in subsection 4.1) used for training.

If the classifier cannot classify correctly, it cannot guide the retriever regarding the relevance of the tweets, and vice versa. Thus, providing the modules with no information about the downstream task might lead to poor performance and sub-optimal convergence. To counter this effect, we propose a training regime for our architecture: we first train the classifier independently, then jointly train both the retriever and the classifier, and finally freeze the retriever and train the classifier again to further increase the classifier's performance.

We define a Retriever 𝑅, parameterized by 𝜓. It computes embeddings for each document in the timeline 𝐷 and for the query tweet 𝑥. The similarity scores between the embedding of 𝑥 and the embeddings of 𝐷 are used to extract the relevant tweets; we denote the score vector for 𝐷 as 𝑆𝐷. To extract the Top-K relevant tweets, we pass 𝑆𝐷 through the Soft Top-K function, which returns a matrix 𝐴 that encodes the indexes of the Top-K relevant documents. Multiplying 𝐴 with 𝐷 would extract the Top-K relevant documents; however, since we use Soft Top-K, directly multiplying 𝐴 with 𝐷 would alter the token ids of the words, so we instead multiply 𝐴 with the classifier embeddings corresponding to 𝐷 to obtain the Top-K document embeddings.

We define the classifier 𝐻 as a composition of two functions, 𝐸 parameterized by 𝜃 and 𝐶 parameterized by 𝜑, where 𝐸 is the initial embedding layer of the BERT model and 𝐶 is the classification head, i.e., 𝐻 = 𝐶 ∘ 𝐸. The classifier verifies 𝑥 in the context of the embeddings of the extracted evidences, 𝑍 = {𝑧𝑖}𝐾𝑖=1 = 𝐴 × 𝐸([𝑥, 𝐷]; 𝜃). Providing all of this evidence together with 𝑥 might overflow the model's context window. To deal with this problem, we obtain logits for each piece of evidence, 𝑃𝑖 = 𝐶(𝑧𝑖; 𝜑), 𝑖 = 1, 2, · · · , 𝐾, and then perform a weighted aggregation of the logits using the relevance scores 𝑆𝐾 = {𝑠𝑖}𝐾𝑖=1 = 𝐴 × 𝑆𝐷 provided by the retriever for each piece of evidence. The probability associated with the query tweet 𝑥 based on the evidence set 𝑍, denoted 𝑃𝑥, is

P_x = \frac{\sum_{i=1}^{K} s_i \cdot P_i}{\sum_{i=1}^{K} s_i}   (4)

4.1. Losses and Optimization Objective

To train our model, we use the cross-entropy loss ℒ𝐶𝐸 computed from 𝑃𝑥 and the ground-truth label 𝑦:

\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(P_{x_i c})   (5)

where 𝑁 denotes the total number of samples and 𝐶 denotes the total number of classes.

To provide better guidance to the Retriever, we introduce an additional loss term, the Density Loss ℒ𝐷𝐿, defined over the output of the Soft Top-K operator. The Soft Top-K operator returns a matrix 𝐴 indicating which tweets should be considered. While constructing the data, we already know these positions, so we can provide a ground-truth matrix 𝐴*. We compute the density loss as the mean cross-entropy between each row of 𝐴* and the corresponding row of 𝐴. Mathematically,

\mathcal{L}_{DL} = -\frac{1}{Nr} \sum_{i=1}^{N} \sum_{j=1}^{r} \sum_{c=1}^{C} A^*_i[j, c] \log(A_i[j, c])   (6)

where 𝐴𝑖 and 𝐴*𝑖 represent the predicted and ground-truth 𝐴 matrices for the query input 𝑥𝑖, and 𝑟 denotes the total number of rows in 𝐴𝑖. The final loss is an aggregation of these two losses; as both are on the same scale, we add them with equal weight. Based on the defined losses, our optimization problem is

\underset{\theta, \varphi, \psi}{\arg\min} \ \mathcal{L}_{CE} + \mathcal{L}_{DL}   (7)

In practice, we use the Adam optimizer to train this objective. For more details regarding the training process, refer to Algorithm 1. The datasets required for training as per Algorithm 1 are the independent classifier training dataset (𝒟𝐶) and the joint training dataset (𝒟𝐽); further details of these datasets are provided in subsection 5.2.
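For illustration, the following sketch shows how Equations 4–7 can be computed for a single query in PyTorch. The tensor names are ours, and the per-evidence logits are turned into probabilities with a softmax before the score-weighted aggregation, which is one reasonable reading of Equation 4 rather than the exact released implementation.

```python
import torch

def rac_objective(evidence_logits, topk_scores, A_pred, A_gt, label):
    """evidence_logits: (K, C) logits P_i from the classifier head C.
    topk_scores: (K,) retrieval scores s_i for the selected evidence.
    A_pred, A_gt: (r, n) Soft Top-K output A and ground-truth A*.
    label: gold class index y for the query tweet."""
    # Equation 4: relevance-weighted aggregation over the K pieces of evidence.
    weights = topk_scores / topk_scores.sum()
    p_x = (weights.unsqueeze(1) * evidence_logits.softmax(dim=-1)).sum(dim=0)  # (C,)
    # Equation 5: cross-entropy of the aggregated distribution against the gold label.
    l_ce = -torch.log(p_x[label] + 1e-9)
    # Equation 6: density loss, row-wise cross-entropy between A* and A.
    l_dl = -(A_gt * torch.log(A_pred + 1e-9)).sum(dim=-1).mean()
    # Equation 7: equally weighted sum, optimized with Adam over (theta, phi, psi).
    return l_ce + l_dl
```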
5. Experimental Setup

5.1. Dataset Description

We utilized the dataset from CLEF 2024 CheckThat! Lab Task 5 [4], which includes Twitter data curated for rumor verification in English and Arabic. Our experiments focused on the English dataset, comprising 96 training and 32 validation samples. Each sample contains an id (unique identifier), rumor (tweet text), timeline (tweets from authorities during the rumor's timeframe), label (veracity: SUPPORTS, REFUTES, or NOT ENOUGH INFO), and evidence (tweets from authorities aiding classification). We augmented the dataset to increase the sample size.

Table 1
Final count of samples obtained after augmenting the dataset.

Dataset stats                           Part                                 Train samples   Val samples
Provided data                           -                                    96              32
Total data created with augmentation    Independent Training - Retriever     233             48
(Section 5.2)                           Independent Training - Classifier    297             51
                                        Joint Training                       305             41

Algorithm 1 Training Regime
Input: Independent Classification Dataset 𝒟𝐶, Joint Training Dataset 𝒟𝐽, Epochs (𝑇𝐶1, 𝑇𝐶2, 𝑇𝐽)
Parameters: Number of Evidences (𝑘), epsilon (𝜖), Classifier Embedding function (𝐸) parameterized by 𝜃, Classification function (𝐶) parameterized by 𝜑, Retriever Function (𝑅) parameterized by 𝜓
1: Initialize 𝐻 = 𝐶 ∘ 𝐸 (refer to Section 4) ◁ Can use pretrained weights
2: for 𝑡 = 1, 2, . . . , 𝑇𝐶1 do ◁ Initial training of classifier
3:   for (𝑧, 𝑦) ∈ 𝒟𝐶 do ◁ Batched training is also possible
4:     Optimize 𝜃, 𝜑 using ℒ𝐶𝐸 (Equation 5)
5:   end for
6: end for
7: Initialize 𝑅 ◁ Can use pretrained weights
8: for 𝑡 = 1, 2, . . . , 𝑇𝐽 do ◁ Joint training
9:   for (𝑥, 𝐷, 𝐴*, 𝑦) ∈ 𝒟𝐽 do
10:    𝑆𝐷 ← 𝑅(𝐷, 𝑥; 𝜓) ◁ Get scores for all the documents
11:    𝐴 ← Soft_TopK(𝑆𝐷; 𝑘, 𝜖) ◁ Get Top-K indices
12:    𝑆𝑘 ← 𝐴 × 𝑆𝐷 ◁ Get Top-K scores (𝑆𝑘)
13:    𝑍𝑘 ← 𝐴 × 𝐸([𝑥, 𝐷]; 𝜃) ◁ Get Top-K document embeddings (𝑍𝑘)
14:    𝑃𝑘 ← 𝐶(𝑍𝑘; 𝜑) ◁ Get prediction probabilities (𝑃𝑘)
15:    Using 𝑃𝑘 and 𝑆𝑘, compute 𝑃𝑥 using Equation 4
16:    Optimize 𝜃, 𝜑, 𝜓 using Equation 7
17:  end for
18: end for
19: Freeze 𝜓
20: for 𝑡 = 1, 2, . . . , 𝑇𝐶2 do ◁ Further training of classifier for boosting performance
21:  for (𝑥, 𝐷, 𝐴, 𝑦) ∈ 𝒟𝐽 do
22:    Repeat Steps 10 to 15 to get 𝑃𝑥
23:    Optimize 𝜃, 𝜑 using ℒ𝐶𝐸(𝑃𝑥, 𝑦) (Equation 5)
24:  end for
25: end for
26: Return (𝜃, 𝜑, 𝜓)

5.2. Data Augmentation and Training

5.2.1. Independent Training

Retriever: For independent training of the Retriever, we use contrastive training [15]. Each rumor tweet is paired with an evidence tweet as a positive sample and 𝑙 (3 in our experiments) non-evidence tweets from the timeline as negative samples. These triplets train the model with contrastive loss functions [15, 16]. We create multiple samples with randomly chosen negative tweets for robustness and exclude samples without any evidence tweets. We considered multiple score functions for scoring the similarity between tweets: (a) Euclidean distance between the representation vectors, (b) cosine similarity between the two representation vectors, and (c) the MaxSim similarity proposed in the ColBERT paper [15]. We initialize the retriever with the colbert-ir/colbertv2.0 checkpoint weights from Hugging Face. To further fine-tune the model, we use a batch size of 1 (fixed), 5 epochs (using early stopping), a learning rate of 5e-5, the MaxSim similarity score, and the contrastive loss provided in [15].

Classifier: For independent training of the Classifier, we create tweet pairs. Each rumor tweet is paired with an evidence tweet and labelled according to the original data label ("SUPPORTS" or "REFUTES"), or paired with a non-evidence tweet and labelled "NOT ENOUGH INFO." This process ensures a balanced class distribution in the final training dataset. After initializing with pretrained weights, we used a batch size of 2 (fixed), 7 epochs (obtained using early stopping), and a learning rate of 1e-5 to fine-tune the classifier.
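As an illustration of the augmentation described above, the sketch below builds retriever triplets and classifier pairs from one provided sample. The field names (rumor, timeline, evidence, label) follow the dataset description in subsection 5.1, while the helper names and the exact balancing strategy are our own simplifications, not the released preprocessing code.

```python
import random

def build_retriever_triplets(sample, num_negatives=3):
    """(rumor, positive evidence, negatives) triplets for contrastive retriever training."""
    non_evidence = [t for t in sample["timeline"] if t not in sample["evidence"]]
    triplets = []
    for positive in sample["evidence"]:  # samples without evidence are skipped
        negatives = random.sample(non_evidence, min(num_negatives, len(non_evidence)))
        triplets.append({"query": sample["rumor"], "positive": positive, "negatives": negatives})
    return triplets

def build_classifier_pairs(sample):
    """(rumor, tweet, label) pairs for independent training of the classifier."""
    non_evidence = [t for t in sample["timeline"] if t not in sample["evidence"]]
    pairs = [(sample["rumor"], ev, sample["label"]) for ev in sample["evidence"]]
    # Pair with roughly as many non-evidence tweets, labelled NOT ENOUGH INFO,
    # to keep the class distribution balanced.
    n_neg = min(max(len(pairs), 1), len(non_evidence))
    for neg in random.sample(non_evidence, n_neg):
        pairs.append((sample["rumor"], neg, "NOT ENOUGH INFO"))
    return pairs
```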
5.2.2. Joint Training

For joint training, each rumor tweet is paired with a document set of size 𝑛 (64 in our experiments). The document set includes all, some, or none of the evidence tweets, filled to 𝑛 with non-evidence timeline tweets. Document sets containing evidence are labelled based on the original data point, while those without evidence are labelled "NOT ENOUGH INFO." We shuffle the document sets to avoid bias from tweet order and ensure a balanced class distribution in the final dataset. To train the model, we used a batch size of 1 (fixed), a learning rate of 1e-5, a K value of 5 (given), and a Soft Top-K epsilon value of 0.01 (fixed). We train the model for 5 epochs (using early stopping).

5.2.3. Our Approach

As joint training starts from pretrained weights, it is difficult for the classifier to guide the retriever and vice versa. In our approach, we therefore first train the classifier independently on the independent training dataset, with hyperparameters similar to those in subsubsection 5.2.1, for 5 epochs (obtained using early stopping). Then, we fine-tune the whole architecture (Retriever + Classifier) end-to-end on the dataset presented in subsubsection 5.2.2, using the hyperparameters provided there for joint training. After this, we fine-tune the classifier again with a frozen retriever to further boost the classifier's performance; for this stage, we used a batch size of 1 (fixed) and a learning rate of 1e-5 (fixed) for 5 epochs.

Statistics of the dataset obtained after augmentation are provided in Table 1, and these data were used to train our model. As our input data are tweets, we preprocessed each tweet by removing links and converting each emoji to its text translation using the 'emoji' Python package [17]. We used a single NVIDIA Tesla V100 32GB GPU to train our models; training the whole model on the dataset took around an hour.

5.3. Evaluation Metrics

The primary measure for evaluating evidence retrieval is Mean Average Precision (MAP). Under this metric, systems receive no credit for retrieving tweets related to unverifiable rumors. Another important evaluation metric is Recall@5 (R@5), which measures the proportion of relevant tweets retrieved among the top 5 retrieved tweets. For classification evaluation, we use the Macro-F1 (M-F1) score, the unweighted mean of the per-class F1 scores (each of which is the harmonic mean of precision and recall). Additionally, we consider a Strict Macro-F1 score, where a rumor label is counted as correct only if at least one retrieved authority evidence tweet is correct.
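To make the evaluation protocol explicit, the sketch below computes Recall@5 and a Strict Macro-F1 as we understand them. It is an approximation of the official scorer, not a reproduction of it; the sentinel label used to mark predictions without correct evidence is our own device.

```python
from sklearn.metrics import f1_score

def recall_at_k(gold_evidence, retrieved, k=5):
    """Fraction of gold evidence tweets that appear among the top-k retrieved tweets."""
    if not gold_evidence:
        return None  # unverifiable rumors receive no retrieval credit
    return len(set(gold_evidence) & set(retrieved[:k])) / len(gold_evidence)

def strict_macro_f1(gold_labels, pred_labels, gold_evidence, retrieved, k=5):
    """Macro-F1 where a label only counts as correct if at least one gold
    evidence tweet was retrieved in the top-k for that rumor."""
    strict_preds = []
    for pred, gold_ev, ret in zip(pred_labels, gold_evidence, retrieved):
        has_evidence = (not gold_ev) or bool(set(gold_ev) & set(ret[:k]))
        strict_preds.append(pred if has_evidence else "WRONG_EVIDENCE")
    return f1_score(gold_labels, strict_preds,
                    labels=sorted(set(gold_labels)), average="macro")
```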
6. Results

Table 2 provides the results of the different experiments we performed. From the results, we conclude that MaxSim similarity performs better than the other similarities we considered. We also observed that models initialized with ColBERT pretrained weights performed better than those initialized with BERT pretrained weights. This is expected, as ColBERT is specifically trained for information retrieval and matches individual tokens of the two texts (claim and candidate evidence tweet) instead of comparing overall pooled vectors; inspection at this finer granularity helps it identify matches better. We can also see that Joint Training performs better than Independent Training, and that our proposed training curriculum performs better than both purely Joint and purely Independent Training. We further observe that using different pretrained models for the retriever and the classifier reduces performance. Overall, ColBERT-B with our approach performs best among all our configurations. It beats KGAT's retriever performance by a huge margin, but its classification performance is lower than that of KGAT.

Table 2
Results of the different experiments we performed, comparing training techniques, similarity metrics, and pretrained backbone models. BERT-B [18] denotes initialization with the bert-base-uncased pretrained checkpoint, and ColBERT-B [19] denotes initialization with the colbert-ir/colbertv2.0 pretrained checkpoint. All results in this table are on the validation data provided in the task.

Method                        Pretrained Models           Similarity   Retriever Performance   Classifier Performance
                              (Retriever / Classifier)    Metric       MAP       R@5           M-F1      Strict M-F1
Baseline                      -                           -            0.561     0.636         0.508     0.508
Zero-shot                     BERT-B / -                  MaxSim       0.084     0.123         -         -
Zero-shot                     ColBERT-B / -               MaxSim       0.164     0.229         -         -
-                             - / BERT-B                  -            -         -             0.296     -
Independent Training          BERT-B                      Cosine       0.268     0.456         0.249     0.199
Independent Training          BERT-B                      L2-norm      0.245     0.35          0.21      0.16
Independent Training          BERT-B                      MaxSim       0.27      0.48          0.251     0.2
Independent Training          ColBERT-B                   MaxSim       0.466     0.524         0.269     0.233
Joint Training from Scratch   BERT-B                      Cosine       0.308     0.412         0.195     0.117
Joint Training from Scratch   ColBERT-B                   Cosine       0.581     0.646         0.364     0.309
Joint Training from Scratch   ColBERT-B                   MaxSim       0.606     0.662         0.362     0.321
Joint Training from Scratch   ColBERT-B / BERT-B          MaxSim       0.404     0.508         0.193     0.09
Our Approach                  BERT-B                      MaxSim       0.388     0.441         0.256     0.195
Our Approach                  ColBERT-B                   MaxSim       0.733     0.778         0.472     0.472

Table 3
Results of our approach and the baseline on the test dataset.

Method          Retriever Performance   Classifier Performance
                MAP       R@5           M-F1      Strict M-F1
Baseline        0.335     0.445         0.495     0.495
Our Approach    0.559     0.634         0.482     0.454

Table 3 presents the results obtained on the test data provided by the CheckThat! lab for this task. From the results, it is evident that classifier-guided retriever training outperforms the baseline by a huge margin, while the classifier's performance is on par with that of the baseline.

7. Conclusion

We present a joint training framework that optimizes an evidence retriever and a rumor classifier simultaneously in an end-to-end fashion. We show that our approach performs better than both independent and joint training used individually; merging the two leads to better performance. From the results, we conclude that our approach retrieves relevant tweets accurately and extracts at least one relevant tweet for every rumor claim, since the Macro-F1 and Strict Macro-F1 scores are the same for ColBERT-B with our approach.

The results also show the importance of joint training: using the Soft Top-K operation as a differentiable approximation of the standard Top-K operation not only removes the discontinuity but also enhances the model's performance. Further, we conclude that Soft Top-K-based re-parameterization, with independent training followed by joint training, leads to better performance. Moreover, we observe that the classifier-guided retriever boosts the retriever's performance such that it outperforms the baseline by a huge margin, whereas the classifier's performance is on par with the baseline.

References
[1] D. Varshney, D. K. Vishwakarma, A review on rumour prediction and veracity assessment in online social network, Expert Systems with Applications 168 (2021) 114208. URL: https://www.sciencedirect.com/science/article/pii/S0957417420309362. doi:10.1016/j.eswa.2020.114208.
[2] J. Yu, J. Jiang, L. M. S. Khoo, H. L. Chieu, R. Xia, Coupled hierarchical transformer for stance-aware rumor verification in social media conversations, Association for Computational Linguistics, 2020.
[3] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[4] F. Haouari, T. Elsayed, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.
[5] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 2014, pp. 18–22.
[6] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and VERification (FEVER) shared task, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1–9. URL: https://aclanthology.org/W18-5501. doi:10.18653/v1/W18-5501.
[7] Z. Liu, C. Xiong, M. Sun, Z. Liu, Fine-grained fact verification with kernel graph attention network, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7342–7351. URL: https://aclanthology.org/2020.acl-main.655. doi:10.18653/v1/2020.acl-main.655.
[8] G. Bekoulis, C. Papagiannopoulou, N. Deligiannis, Understanding the impact of evidence-aware sentence selection for fact checking, in: A. Feldman, G. Da San Martino, C. Leberknight, P. Nakov (Eds.), Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, Association for Computational Linguistics, Online, 2021, pp. 23–28. URL: https://aclanthology.org/2021.nlp4if-1.4. doi:10.18653/v1/2021.nlp4if-1.4.
[9] C. Kruengkrai, J. Yamagishi, X. Wang, A multi-level attention model for evidence-based fact checking, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 2447–2460. URL: https://aclanthology.org/2021.findings-acl.217. doi:10.18653/v1/2021.findings-acl.217.
[10] A. U. Dey, A. Llabrés, E. Valveny, D. Karatzas, Retrieval augmented verification: Unveiling disinformation with structured representations for zero-shot real-time evidence-guided fact-checking of multi-modal social media posts, arXiv preprint arXiv:2404.10702 (2024).
[11] J. Xu, L. Xian, Z. Liu, M. Chen, Q. Yin, F. Song, The future of combating rumors?
Retrieval, discrimination, and generation, 2024. arXiv:2403.20204.
[12] H. Liu, A. Soroush, J. G. Nestor, E. Park, B. Idnay, Y. Fang, J. Pan, S. Liao, M. Bernard, Y. Peng, C. Weng, Retrieval augmented scientific claim verification, JAMIA Open 7 (2024) ooae021. doi:10.1093/jamiaopen/ooae021.
[13] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[14] Y. Xie, H. Dai, M. Chen, B. Dai, T. Zhao, H. Zha, W. Wei, T. Pfister, Differentiable top-k with optimal transport, Advances in Neural Information Processing Systems 33 (2020) 20520–20531.
[15] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[16] I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, Y. Weill, N. Koenigstein, MetricBERT: Text representation learning via self-supervised triplet training, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 1–5. doi:10.1109/ICASSP43922.2022.9746018.
[17] emoji — pypi.org, https://pypi.org/project/emoji/, 2024. [Accessed 31-05-2024].
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[19] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.