<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DEFAULT at CheckThat! 2024: Retrieval Augmented Classification using Differentiable Top-K Operator for Rumor Verification based on Evidence from Authorities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sayanta Adhikari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Himanshu Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rupa Kumari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrey Satapara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maunendra Desarkar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Hyderabad</institution>
          ,
          <addr-line>Telangana, 502285</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The paper describes Team DEFAULT's submission to CheckThat! 2024 Task 5 on Rumor Verification based on Evidence from Authorities. We present an approach for rumor verification on Twitter that focuses on integrating evidence from authoritative accounts to determine the veracity of rumors. We formulate rumor verification using evidence from authorities as a Retrieval-Augmented Classification (RAC) task, and propose an architecture and training regime designed to ensure seamless gradient flow. By re-parameterizing the Top-K operator and applying entropy-based smoothing, our method addresses the discontinuity introduced by retrieval, enhancing the accuracy of rumor verification. Using this classification-aware retrieval, the retriever achieves a Recall@5 of 0.778, outperforming the baseline and placing team DEFAULT third on the test data leaderboard for retrieval. For classification, our approach performs on par with the baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Rumor Verification</kwd>
        <kwd>Retrieval Augmented Classification</kwd>
        <kwd>Differential Top-K</kwd>
        <kwd>Optimal Transport</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the present era, social media has become one of the most widely used mediums for information sharing
due to its ability to spread information quickly at a low cost. This has made online social media a
preferred choice for many individuals and organizations for propaganda-driven misinformation sharing
to influence public opinions and decisions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The spread of rumors and misinformation through
social media has become a significant concern. Verifying the veracity of rumors and combating the
dissemination of misinformation is crucial for maintaining the integrity of online discourse. This paper
proposes a novel approach, Retrieval-Augmented Classification (RAC), which combines document
retrieval and classification techniques to address the problem of rumor verification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The shared task “Rumor Verification using Evidence from Authorities” at the CheckThat! lab at
CLEF-2024 [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] follows a two-step approach. The first step involves
document retrieval, wherein authoritative tweets related to a rumor are analyzed to identify the most
relevant tweets. These sources, including reputable organizations or subject matter experts, can provide
valuable evidence supporting or refuting the rumor. The second step is classification, where the retrieved
evidence is leveraged to determine the rumor’s veracity, categorizing it as Supported (TRUE), Refuted
(FALSE), or Unverifiable (NEUTRAL).
      </p>
      <p>To illustrate the methodology, consider a scenario involving a rumor circulating on social media
about a potential disease outbreak. The RAC method would first retrieve relevant documents using
sophisticated algorithms. These documents would then be analyzed based on key features identified
by machine learning models. The rumor would be classified as true if these sources corroborate the
outbreak with compelling evidence. Conversely, if the sources refute the claim or lack sufficient evidence,
the rumor would be labelled as false or unverifiable, respectively.</p>
      <p>In traditional retrieval systems, the relevance of a document is determined solely by its similarity
to the query. However, for tasks like rumor verification, the evidence required to validate or refute
a claim may not necessarily resemble the claim itself. This discrepancy between the query and the
desired evidence can lead to suboptimal retrieval performance when using traditional similarity-based
techniques.</p>
      <p>To address this challenge, we proposed a classification-aware retrieval approach by providing an
alignment between the retriever and the classifier, resulting in better retrieval. To jointly train the
retriever and classifier, we removed discontinuity associated with the Top-K document selection for
retrieval by replacing it with Soft Top-K, which allows the gradients to flow between retrieval and
classification module, resulting in end-to-end training using a common loss function. Details about the
proposed approach and its performance, along with analyses, are discussed in the subsequent sections
of the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Rumor verification and fact-checking are well-known NLP tasks that have attracted many researchers,
with considerable work spanning dataset collection and training
methods. Fact Checking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is one of the early works on claim verification, with claims collected from fact-checking
websites. Fact Extraction and Verification (FEVER) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a well-known shared task
for fact verification.
      </p>
      <p>
        Liu et al. (2020) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed an approach using a Kernel Graph Attention Network (KGAT). Bekoulis
et al. (2021) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] emphasized the importance of evidence-aware sentence selection, while Kruengkrai et
al. (2021) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] presented a multi-level attention model for integrating evidence. These studies provide
valuable insights for developing effective RAC systems for rumor verification using evidence from
authorities.
      </p>
      <p>
        Recent rumor verification research on retrieval augmented verification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] integrates retrieval and
classification using a zero-shot approach by retrieving real-time web-scraped evidence and matching
claim texts using pretrained language models. Their graph-structured representation gathers evidence
automatically and highlights unverifiable claim parts. There has been some work on a comprehensive
rumor debunking system using an LLM (involving retrieval, discrimination, and guided generation)[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Various systems have been developed to enhance the extraction and application of clinical trial
information. One such system is CliVER [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], an end-to-end system that uses retrieval-augmented techniques to
automatically retrieve clinical trial abstracts, extract pertinent sentences, and apply the PICO framework
to support or refute scientific claims. This system represents a significant advancement in integrating
artificial intelligence and clinical research methodologies, streamlining the process of evidence synthesis
and decision-making in clinical settings.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary</title>
      <sec id="sec-3-1">
        <title>3.1. Retrieval Augmented Classification</title>
        <p>
          Verifying facts or rumors is challenging due to the subjective nature of the task. It requires access to
contextual information regarding the domain from the current timeline. The verification task can be
reduced to evidence retrieval and claim verification based on the retrieved evidence. This aligns closely
with the domain of retrieval-augmented generation (RAG) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], where the task is to generate an answer
in context with a retrieved document. Similarly, we posed rumor verification based on evidence from
authoritative sources as a RAC task where a class needs to be predicted based on the original claim and
retrieved evidence.
        </p>
        <p>RAC can be approached in two ways: 1) training the retriever and classifier independently (Independent
Training), and 2) training the retriever and classifier together (Joint Training). Independent
training allows each component to be trained separately and then combined. However, a major
drawback is the lack of alignment between the retrieval and classification processes, despite their forming a pipeline.
The classifier’s performance is inherently linked to the retrieval quality, contradicting the notion of
independence. The dependency between the retrieved relevant evidence and the classification of the
given rumor highlights the need for a joint training objective. Joint Training allows for alignment
between the retriever and classifier components, but the major challenge is the discontinuity between
these processes.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Differential Top-K</title>
        <p>
          To address the issue of discontinuity (Figure 1(a)), we referred to an Optimal Transport (OT) trick
for reparameterizing the Top-K function with entropy regularisation (to make it smooth) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. This
technique first formulates the extraction of the Top-K elements of a vector as an Optimal Transport
problem and then applies entropy regularisation to facilitate smooth gradient flow. We used the SOFT
(Scalable Optimal transport-based diFferenTiable) Top-K operator in place of the Top-K operator to get
the Top-K elements.
1. Problem Formulation: Consider the score vector s = {s_i}_{i=1}^N, containing the relevance score of each tweet
with respect to the rumor tweet, where N is the total number of tweets provided in the
timeline. The standard Top-K operator returns the indexes of the Top-K elements, which is equivalent to a
vector A = [A_1, ..., A_N], such that

A_i = 1 if s_i is one of the Top-K relevant scores with respect to the rumor tweet, and A_i = 0 otherwise.   (1)
        </p>
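        <p>As a concrete reference point, the hard Top-K indicator of Equation 1 can be computed directly. The following NumPy sketch (function and variable names are our own, purely illustrative) is the discontinuous operator that the rest of this section smooths:</p>

```python
import numpy as np

def topk_indicator(s, k):
    """Return the binary vector A with A[i] = 1 iff s[i] is among the
    k largest scores (Equation 1). Illustrative sketch only."""
    idx = np.argsort(-s)[:k]      # indices of the k largest scores
    a = np.zeros_like(s)
    a[idx] = 1.0
    return a

scores = np.array([0.1, 0.9, 0.4, 0.7])
print(topk_indicator(scores, 2))  # 1s at the positions of 0.9 and 0.7
```

        <p>Because this function is piecewise constant in the scores, its gradient is zero almost everywhere, which is precisely the discontinuity problem addressed below.</p>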
        <p>Using A, we can extract the Top-K elements of s. In the case of sorted Top-K, A is a matrix that,
when multiplied with the input s, provides the Top-K elements in sorted order.
2. Re-parameterizing the Top-K Operator as an OT Problem: Now, consider the probability distribution associated
with the score vector and the output support space Y = {0, 1} (0 to map all the Top-K
elements and 1 for the remaining, so m = 2): these are μ = (1/N) 1_N and ν = [K/N, (N − K)/N] respectively, where N is the
total number of timeline tweets and K represents the total number of evidence tweets that need to be
retrieved from the timeline.</p>
        <p>
          Γ* = argmin_{Γ ≥ 0} ⟨C, Γ⟩, s.t. Γ 1_m = μ, Γ^T 1_N = ν, with C, Γ* ∈ R^{N×m}.   (2)

Here, Γ_{i,j} represents the probability of mapping the input x_i of μ to the output y_j of ν, and C_{i,j}
represents the cost incurred to move from x_i to y_j. Thus, Γ represents a joint probability distribution
over the support X × Y.
3. Solution: Under the above conditions, the optimal transport plan Γ* is given (in closed form) by:

Γ*_{σ(i),1} = 1/N if i ≤ K, and 0 if K + 1 ≤ i ≤ N;   Γ*_{σ(i),2} = 0 if i ≤ K, and 1/N if K + 1 ≤ i ≤ N,   (3)

where σ is the sorting permutation, i.e., s_{σ(1)} &lt; s_{σ(2)} &lt; · · · &lt; s_{σ(N)}. Based on Γ*, we define A =
N Γ* [1, 0]^T. The matrix A is the mapping matrix that provides the positions of the Top-K elements.
4. Smoothing by Entropy Regularization: Employing entropy regularisation in the OT problem
yields a smoothed approximation. The OT optimization problem becomes:

Γ*_ε = argmin_{Γ ≥ 0} ⟨C, Γ⟩ + ε H(Γ), s.t. Γ 1_m = μ, Γ^T 1_N = ν, ε &gt; 0,

where H(Γ) = Σ_{i,j} Γ_{i,j} log Γ_{i,j} is the entropy regularizer. Based on this Γ*_ε, we define A_ε =
N Γ*_ε [1, 0]^T as the smoothed counterpart of the standard Top-K operator output (A in Equation 1).
Throughout our approach, we consider sorted Top-K. Using the Soft Top-K operator in place of the
Top-K operator lets us train the model end-to-end and thus helps align the retriever and the
classifier accordingly.
        </p>
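        <p>To make the construction concrete, the following NumPy sketch solves the entropy-regularized OT problem with Sinkhorn iterations and returns the smoothed indicator vector A_ε. The min-max normalization of the scores, the squared-distance cost to the anchors y = {1, 0}, the iteration count, and all names are our illustrative assumptions, not the paper's exact implementation.</p>

```python
import numpy as np

def soft_topk(s, k, eps=0.01, iters=200):
    """Soft Top-K via entropy-regularized optimal transport (Sinkhorn).

    Sketch of the SOFT operator described above: mass 1/N per score is
    transported to two anchors, k/N to the "top-k" anchor and (N-k)/N
    to the "rest" anchor. Cost and normalization are assumptions."""
    n = len(s)
    s = (s - s.min()) / (s.max() - s.min() + 1e-9)  # scores into [0, 1]
    y = np.array([1.0, 0.0])                        # anchors: top-k, rest
    C = (s[:, None] - y[None, :]) ** 2              # N x 2 cost matrix
    mu = np.full(n, 1.0 / n)                        # uniform input marginal
    nu = np.array([k / n, (n - k) / n])             # output marginal
    G = np.exp(-C / eps)                            # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):                          # Sinkhorn updates
        v = nu / (G.T @ u)
        u = mu / (G @ v)
    gamma = u[:, None] * G * v[None, :]             # transport plan Gamma*
    return n * gamma[:, 0]                          # smoothed indicator A

a = soft_topk(np.array([0.1, 0.9, 0.4, 0.7]), k=2)
print(np.round(a, 3))  # entries near 1 for the two largest scores
```

        <p>With a small ε the output approaches the hard indicator, while a larger ε gives a smoother relaxation through which gradients can flow end-to-end.</p>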
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>To perform RAC for rumor verification based on evidence from authorities, we propose a novel
architecture that can be trained end-to-end. It uses a transformer-encoder-based retriever
followed by a Top-K operator to extract relevant evidence, which then helps
the classifier classify the Query Tweet (x). As shown in Figure 1(a), this Top-K operator introduces a
discontinuity in the pipeline. To remove this discontinuity, we re-parameterized Top-K with a smoother
version, Soft Top-K (details provided in subsection 3.2). Figure 1(b) shows the final architecture along
with the different losses (defined in subsection 4.1) used for training.</p>
      <p>If the classifier cannot classify correctly, it cannot guide the retriever regarding the
relevance of the tweets, and vice versa. So, providing the models with no information about the downstream
task might lead to poor performance and sub-optimal convergence. To counter this effect, we propose
a training method for our architecture: we first independently train the classifier, then
jointly train both the retriever and classifier, and finally freeze the retriever and train the
classifier again to increase the classifier’s performance.</p>
      <p>We define a Retriever R, parameterized by θ_R. It computes embeddings for each document (the timeline
D) and the Query Tweet x. The similarity score between the embedding of a timeline tweet d_i and the embedding
of x is used to extract the relevant tweets; we denote the score for d_i as s_i. To extract the Top-K
relevant tweets, we pass the scores through the Soft Top-K function, which returns a matrix A that gives us
the indexes of the Top-K relevant documents. Multiplying the matrix A with D would extract the
Top-K relevant documents, but as we are using Soft Top-K, directly multiplying A with D leads to a change
in the token ids of the words; instead, we multiply A with the classifier embeddings corresponding to D to
get the Top-K document embeddings. We define the classifier C as a combination of two functions: f,
parameterized by θ_f, and g, parameterized by θ_g. Here, f represents the initial embedding layer of the
BERT model, and g represents the classification head. The classifier can be represented as a composition
of f and g, i.e., C = g ∘ f. The classifier verifies x in context with the embeddings of the extracted
evidences, E = {e_i}_{i=1}^K = A × f([x, D]; θ_f). Providing all this evidence together with x might overflow
the model’s context window. To deal with this problem, we compute logits for each piece of evidence,
l_i = g(x, e_i), i = 1, 2, · · · , K, and then perform a weighted aggregation of the logits using the relevance
scores w = {w_i}_{i=1}^K = A × s provided by the retriever for each piece of evidence. The probability
associated with the query tweet x based on the evidence set E, denoted p, is

p = (Σ_{i=1}^K w_i l_i) / (Σ_{i=1}^K w_i).   (4)</p>
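      <p>The weighted aggregation of the per-evidence logits can be sketched as follows; the shapes and the example numbers are illustrative assumptions:</p>

```python
import numpy as np

def aggregate_logits(logits, weights):
    """Relevance-weighted aggregation of per-evidence logits, in the
    spirit of Equation 4: p = sum_i w_i * l_i / sum_i w_i.
    logits has shape (K, num_classes); weights has shape (K,)."""
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * np.asarray(logits)).sum(axis=0) / w.sum()

# Two evidence tweets, three classes (SUPPORTS, REFUTES, NOT ENOUGH INFO):
logits = [[2.0, 0.5, 0.1],
          [0.0, 1.0, 0.5]]
out = aggregate_logits(logits, [0.9, 0.1])  # -> [1.8, 0.55, 0.14]
```

      <p>This keeps each evidence tweet in its own forward pass, so no single input has to fit all retrieved evidence into the model's context window.</p>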
      <sec id="sec-4-1">
        <title>4.1. Losses and Optimization Objective</title>
        <p>To train our model, we use a cross-entropy loss (L_CE) computed from the predicted probabilities p and the ground truth labels y:

L_CE = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})   (5)</p>
        <p>
where N denotes the total number of samples and C denotes the total number of classes. To provide
better guidance to the Retriever, we introduce a new loss term, called the Density Loss (L_D), over the
output of the Soft Top-K operator. The Soft Top-K operator returns a matrix A indicating the tweets
that must be considered. While forming the data, we already know those positions, so we
can provide a ground truth matrix A*. We compute the density loss as the mean cross-entropy of
each row of A against the corresponding row of A*. Mathematically,</p>
        <p>L_D = −(1/(N r)) Σ_{i=1}^{N} Σ_{j=1}^{r} Σ_{t} A*_i[j, t] log(A_i[j, t])   (6)

where A_i and A*_i represent the predicted and ground truth A matrices for query input i, and r
denotes the total number of rows in A. The final loss is an aggregation of these two losses. As both
losses are of the same scale, we add them with equal weight. Based on the defined losses, we define our
optimization problem as

argmin_{θ_R, θ_f, θ_g} L_CE + L_D.   (7)</p>
        <p>
In practice, we use the Adam optimizer to train this objective. For more details regarding the training
process, refer to Algorithm 1. The datasets required for training as per Algorithm 1 are the independent
classifier training dataset and the joint training dataset. Further details of these datasets are
provided in subsection 5.2.</p>
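        <p>The two losses and their equal-weight sum can be sketched as below; the small clamp inside the logarithm and the array shapes are our assumptions, and in practice the objective is minimized over the retriever and classifier parameters with Adam:</p>

```python
import numpy as np

def cross_entropy(y_true, p):
    """L_CE in the spirit of Equation 5: mean over samples of
    -sum_c y_c * log(p_c); shapes are (num_samples, num_classes)."""
    return -np.mean(np.sum(y_true * np.log(p + 1e-12), axis=1))

def density_loss(a_true, a_pred):
    """Density loss L_D (Equation 6): mean row-wise cross-entropy
    between ground truth and predicted Soft Top-K matrices. Treating
    each matrix as (rows, timeline_length) is an assumption."""
    return -np.mean(np.sum(a_true * np.log(a_pred + 1e-12), axis=1))

y = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels
p = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted probabilities
total = cross_entropy(y, p) + density_loss(y, p)  # equal-weight sum (Eq. 7)
```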
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset Description</title>
        <p>
          We utilized the dataset from CLEF 2024 CheckThat! Lab, Task 5 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which includes Twitter data curated for
rumor verification in English and Arabic. Our experiments focused on the English dataset, comprising
96 training and 32 validation samples. Each sample contains an id (unique identifier), rumor (tweet
text), timeline (tweets from authorities during the rumor’s timeframe), label (veracity: SUPPORTS,
REFUTES, or NOT ENOUGH INFO), and evidence (tweets from authorities aiding classification). We
augmented the dataset to increase the sample size.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Augmentation and Training</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Independent Training</title>
          <p>Retriever: For independent training of the Retriever, we use contrastive training [15]. Each rumor
tweet is paired with an evidence tweet as a positive sample and  (3 in our experiments) non-evidence
tweets from the timeline as negative samples. These triplets train the model with contrastive loss
functions [15, 16]. We create multiple samples with randomly chosen negative tweets for robustness
and exclude samples without any evidence tweets. We have considered multiple score functions for
scoring the similarity between the tweets: (a) Euclidean Distance between the representation vectors,
(b) Cosine similarity between the two representation vectors, and (c) MaxSim similarity proposed in the
paper of ColBERT [15]. We initialize the retriever with the colbert-ir/colbertv2.0 checkpoint weights from
Hugging Face. To further finetune the model, we use a batch size of 1 (fixed), 5 epochs (chosen using early
stopping), a learning rate of 5e−5 with MaxSim as the similarity score, and the contrastive loss provided in
[15].</p>
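          <p>A minimal sketch of MaxSim scoring in the spirit of ColBERT [15]: for each query token, take the maximum cosine similarity over document tokens and sum over query tokens. The token-embedding shapes and random inputs are stand-ins, not the actual model outputs:</p>

```python
import numpy as np

def maxsim(q_emb, d_emb):
    """ColBERT-style MaxSim over token embeddings of shape
    (num_tokens, dim): max cosine similarity per query token,
    summed over query tokens."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = q @ d.T                    # pairwise cosine similarities
    return sim.max(axis=1).sum()     # max over doc tokens, sum over query

rng = np.random.default_rng(0)
q, d = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
score = maxsim(q, d)
```

          <p>Unlike pooled-vector cosine similarity, this token-level matching rewards documents that cover each query token individually, which is the finer granularity discussed in the results.</p>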
          <p>Classifier: For independent training of the Classifier, we create tweet pairs. Each rumor tweet is paired
with an evidence tweet and labelled according to the original data label ("SUPPORTS" or "REFUTES")
or paired with a non-evidence tweet and labelled as "NOT ENOUGH INFO." This process ensures a
balanced class distribution in the final training dataset. After initializing with pretrained weights, we
used a batch size of 2 (fixed), 7 epochs (chosen using early stopping) and a learning rate of 1e−5 to
fine-tune the classifier.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Joint Training</title>
          <p>For joint training, each rumor tweet is paired with a document set of size  (64 in our experiments).
The document set includes all, some, or none of the evidence tweets, filled to  with non-evidence
timeline tweets. Document sets with evidence are labelled based on the original data point, while those
without evidence are labelled "NOT ENOUGH INFO." We shuffle the document sets to avoid bias from
tweet order and ensure a balanced class distribution in the final dataset. To train the model, we used a
batch size of 1 (fixed), with a learning rate of 1e−5, a K value of 5 (given), and an epsilon value for Soft
Top-K of 0.01 (fixed). We train the model for 5 epochs (using early stopping).</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Our Approach</title>
          <p>As joint training starts from pretrained weights, it is difficult for the classifier to guide the
retriever and vice-versa. In our approach, we first independently train the classifier on the independent
training dataset with hyperparameters similar to those provided in subsubsection 5.2.1 for 5 epochs (chosen
using early stopping). Then, we finetune the whole architecture (Retriever + Classifier) end-to-end on
the dataset presented in subsubsection 5.2.2, using the hyperparameters provided there. After this, we
again finetuned the classifier with a frozen retriever to further boost the classifier’s performance. For
this stage, we used a batch size of 1 (fixed) with a learning rate of 1e−5 (fixed) for 5 epochs.</p>
          <p>Table 1 provides statistics of the dataset obtained after augmentation; we used this data to train
our model. As our input data consists of tweets, we preprocessed each tweet by removing links. We
replaced each emoji in a tweet with its text translation using the ‘emoji’ Python package [17]. We used
a single NVIDIA Tesla 32GB V100 GPU to train our models. Training the whole model on the dataset
took around an hour.</p>
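          <p>The link-removal step can be sketched with the standard library as below; the regex pattern is our assumption. The emoji conversion step uses the ‘emoji’ package (its demojize function), which we omit here to keep the sketch dependency-free:</p>

```python
import re

def preprocess_tweet(text):
    """Strip URLs from a tweet, as in the preprocessing described
    above. The paper additionally converts emojis to text with the
    'emoji' package (emoji.demojize); omitted here."""
    return re.sub(r"https?://\S+", "", text).strip()

clean = preprocess_tweet("Outbreak reported https://t.co/abc123")
# -> "Outbreak reported"
```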
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation Metrics</title>
        <p>The primary measure for evaluating evidence retrieval is Mean Average Precision (MAP). Under
this metric, systems receive no credit for retrieving tweets related to unverifiable rumors. Another
important evaluation metric is Recall@5 (R@5), which measures the proportion of relevant tweets
retrieved among the top 5 retrieved tweets. We use the Macro-F1 (M-F1) score for classification
evaluation, which averages the F1 score (the harmonic mean of precision and recall) over all classes.
Additionally, we consider a Strict Macro-F1 score, where the correctness of a rumor label is contingent
upon at least one retrieved authority evidence being correct.</p>
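        <p>Recall@K as described above can be sketched as follows; the tweet ids are hypothetical:</p>

```python
def recall_at_k(retrieved, relevant, k=5):
    """Recall@K: fraction of the relevant tweets that appear in the
    top-K retrieved list. Illustrative sketch."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

r = recall_at_k(["t3", "t7", "t1", "t9", "t2"], {"t1", "t2", "t4"})
# two of the three relevant tweets retrieved -> 2/3
```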
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>Table 2 provides results for the different experiments we performed. We can conclude from
the results that MaxSim similarity performs better than the other similarities we considered. We also
observed that models initialized with ColBERT pretrained weights performed better than those initialized
with BERT pretrained weights. This is expected, as ColBERT is specifically trained for information retrieval
and matches individual tokens of the two texts (claim and candidate evidence tweet) instead of comparing
their overall pooled vectors; inspection at finer granularity helps it identify matches better. We can also
see that Joint Training performs better than Independent Training, and that our proposed training
curriculum performs better than both purely Joint and purely Independent Training. We also observe
that using different pretrained models for the retriever and the classifier reduces performance. Overall,
ColBERT-B with our Approach performs best among all our approaches. It beats KGAT’s retriever
performance by a huge margin, but its classification performance is below that of KGAT.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We present a joint training framework to simultaneously optimize an evidence retriever and a rumor
classifier in an end-to-end fashion. We show that our approach performs better than both independent
and joint training applied individually. From the results, we can conclude that our approach retrieves
relevant tweets accurately and extracts at least one relevant tweet for every rumor claim, as Macro-F1 and
Strict Macro-F1 are identical for ColBERT-B with our Approach.</p>
      <p>Also, the results show the importance of joint training. Using the Soft Top-K operation as a
differentiable approximation of the standard Top-K operation not only removes the discontinuity but also
enhances the model’s performance. Further, we conclude that Soft Top-K-based reparameterization with
independent training followed by joint training leads to better performance. Moreover, we observe
that the classifier-guided retriever boosts the retriever’s performance, such that it outperforms the
baseline by a huge margin, whereas the classifier’s performance is on par with the baseline.</p>
      <p>[15] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late
interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, 2020, pp. 39–48.
[16] I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, Y. Weill, N. Koenigstein, MetricBERT: Text
representation learning via self-supervised triplet training, in: ICASSP 2022 - 2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 1–5.
doi:10.1109/ICASSP43922.2022.9746018.
[17] emoji — pypi.org, https://pypi.org/project/emoji/, 2024. [Accessed 31-05-2024].
[18] J. D. M.-W. C. Kenton, L. K. Toutanova, BERT: Pre-training of deep bidirectional transformers for
language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vishwakarma</surname>
          </string-name>
          ,
          <article-title>A review on rumour prediction and veracity assessment in online social network</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>168</volume>
          (
          <year>2021</year>
          )
          <article-title>114208</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S0957417420309362. doi:10.1016/j.eswa.2020.114208.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. S.</given-names>
            <surname>Khoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Chieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Coupled hierarchical transformer for stance-aware rumor verification in social media conversations</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          The CLEF-2024 CheckThat! Lab:
          <article-title>Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          , CLEF 2024, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>Fact checking: Task definition and dataset construction</article-title>
          , in:
          <source>Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cocarascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>The fact extraction and VERification (FEVER) shared task</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cocarascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</source>
          ,
          Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://aclanthology.org/W18-5501. doi:10.18653/v1/W18-5501.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fine-grained fact verification with kernel graph attention network</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7342</fpage>
          -
          <lpage>7351</lpage>
          . URL: https://aclanthology.org/2020.acl-main.655. doi:10.18653/v1/2020.acl-main.655.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bekoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Papagiannopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deligiannis</surname>
          </string-name>
          ,
          <article-title>Understanding the impact of evidence-aware sentence selection for fact checking</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leberknight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          . URL: https://aclanthology.org/2021.nlp4if-1.4. doi:10.18653/v1/2021.nlp4if-1.4.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruengkrai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A multi-level attention model for evidence-based fact checking</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>2447</fpage>
          -
          <lpage>2460</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.217. doi:10.18653/v1/2021.findings-acl.217.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Llabrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Valveny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <article-title>Retrieval augmented verification: Unveiling disinformation with structured representations for zero-shot real-time evidence-guided factchecking of multi-modal social media posts</article-title>
          ,
          <source>arXiv preprint arXiv:2404.10702</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>The future of combating rumors? Retrieval, discrimination, and generation</article-title>
          ,
          <year>2024</year>
          . arXiv:2403.20204.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Nestor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Idnay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <article-title>Retrieval augmented scientific claim verification</article-title>
          ,
          <source>JAMIA Open</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          ooae021. doi:10.1093/jamiaopen/ooae021.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <article-title>Differentiable top-k with optimal transport</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>