<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Aschern at CheckThat! 2021: Lambda-Calculus of Fact-Checked Claims</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anton</forename><surname>Chernyavskiy</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">HSE University</orgName>
								<address>
									<settlement>Moscow</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dmitry</forename><surname>Ilvovsky</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">HSE University</orgName>
								<address>
									<settlement>Moscow</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Preslav</forename><surname>Nakov</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Qatar Computing Research Institute</orgName>
								<orgName type="institution">HBKU</orgName>
								<address>
									<settlement>Doha</settlement>
									<country key="QA">Qatar</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Aschern at CheckThat! 2021: Lambda-Calculus of Fact-Checked Claims</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">5F587EA01A7D214529BAA62729A55D8D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>fact-checking</term>
					<term>lexical similarity</term>
					<term>semantic similarity</term>
					<term>sentence-BERT</term>
					<term>TF.IDF</term>
					<term>LambdaMART</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe our system for the CLEF 2021 CheckThat! Lab Task 2 Subtask A on detecting previously fact-checked claims. We developed a pipeline using TF.IDF, sentence-BERT fine-tuned on the training data, and reranking with LambdaMART, using the predicted similarity scores and positions in the ranked list as features. We examined the quality of each model on the validation set and analyzed its contribution to the final result using the trained LambdaMART. The official evaluation ranked our system 1st by a wide margin over the other participants and the organizers' baseline.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Social media provide an easy way to share information online. However, this also causes problems, since some users may share false claims. Such claims are often sensational, which further contributes to their fast spread. One possible solution is to fact-check suspicious claims, but this is a difficult and time-consuming task when done manually. Even if the process is automated, it is impossible to fact-check every claim on the web. One could also ask: is it really necessary to fact-check everything? For example, if we aim to limit the spread of some false claim, then it is enough to fact-check only one post where it is present. Then, we can try to find other posts that repeat that claim.</p><p>The CLEF 2021 CheckThat! Lab Task 2 <ref type="bibr" target="#b0">[1]</ref> aims at solving that problem: given a tweet, it asks to match it against a database of previously fact-checked claims. The participating systems are asked to rank the list of previously fact-checked claims according to their relevance, so that more useful ones are ranked higher. The task features two datasets, with claims collected from tweets and from political debates, and it is offered in English and in Arabic. Below, we describe the system that we built for the English version of the dataset collected from tweets (Subtask 2A). At the core of our system is the sentence-BERT model <ref type="bibr" target="#b1">[2]</ref>, which was originally pre-trained on the Semantic Textual Similarity benchmark (STSb) data. We further fine-tuned it on the task data and then applied LambdaMART <ref type="bibr" target="#b2">[3]</ref> to rerank the top-20 results. As features, LambdaMART uses the relevance scores and ranks predicted by sentence-BERT and TF.IDF.</p><p>CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania. aschernyavskiy 1@edu.hse.ru (A. Chernyavskiy); dilvovsky@hse.ru (D. Ilvovsky); pnakov@hbku.edu.qa (P. Nakov)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>There are many studies that have addressed disinformation and misinformation <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>. However, there are only a few directly related to our task. In the ClaimBuster system <ref type="bibr" target="#b10">[11]</ref>, the problem was mentioned as part of the general fact-checking pipeline, but no evaluation of its solution was provided. The ClaimKG dataset was presented in <ref type="bibr" target="#b11">[12]</ref>, where claims from different fact-checking websites can be retrieved by some keywords using a knowledge graph. The original task formulation together with a dataset aimed to address the problem of detecting previously fact-checked claims were presented in <ref type="bibr" target="#b12">[13]</ref>, where the authors used data from Snopes and PolitiFact. They proposed a solution, which combined Elasticsearch, sentence-BERT, and reranking using RankSVM. Their dataset was then used within the framework of the CLEF 2020 CheckThat! Lab Task 2 <ref type="bibr" target="#b13">[14]</ref>. Then, an expanded and cleaned-up dataset consisting of tweets was reused at the CLEF CheckThat! Lab 2021 Task 2A <ref type="bibr" target="#b0">[1]</ref>.</p><p>The winning team of the CLEF 2020 CheckThat! Lab Task 2 was Buster.AI <ref type="bibr" target="#b14">[15]</ref>, who proposed a solution based on RoBERTa, adversarial hard negative examples, and additional training on external data from FEVER, SciFact, and the Liar datasets. Team UNIPI-NLE <ref type="bibr" target="#b15">[16]</ref> performed a cascade training of sentence-BERT models with the preliminary use of Elasticsearch to prune the list of possible candidates. 
Team UB_ET <ref type="bibr" target="#b16">[17]</ref> applied DPH and LambdaMART over querydependent features. Other participants also used Elasticsearch and sentence-BERT as well as Terrier, KD search, Universal Sentence Encoder (USE), TF.IDF, and BM25 to perform retrieval, and to compute similarity scores <ref type="bibr" target="#b13">[14]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>We use the data presented within the CLEF CheckThat! Lab 2020 Task 2 Subtask A for English. The verified claims (VerClaims) database contains 13,825 claims. There are 1,000 positively labeled &lt;Claim, VerClaim&gt; pairs in the training set, and 200 input claims in each of the validation and test sets. For each VerClaim, there is some additional information coming from the article that fact-checkers wrote about the claim: title, subtitle, author, and date of verification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation Measures</head><p>The official evaluation measure is MAP@𝑘 for 𝑘 = 5 (Mean Average Precision for the top-𝑘 VerClaims in the ranked list). Additional evaluation measures computed by the scoring script include MAP, MRR (Mean Reciprocal Rank), and P@𝑘 (precision for the top-𝑘) for 𝑘 ∈ {1, 3, 10, all}.</p></div>
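To make the measures concrete, here is a minimal sketch of MAP@𝑘 and MRR written from their standard definitions; the official scoring script may differ in details, and all function and variable names here are our own.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k for one query: mean of precision at each rank that holds a relevant item."""
    hits, score = 0, 0.0
    for i, doc_id in enumerate(ranked_ids[:k]):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / (i + 1)  # precision at rank i+1
    return score / min(len(relevant_ids), k) if relevant_ids else 0.0

def map_at_k(all_ranked, all_relevant, k):
    """Mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, rel, k)
               for r, rel in zip(all_ranked, all_relevant)) / len(all_ranked)

def mrr(all_ranked, all_relevant):
    """Mean Reciprocal Rank of the first relevant item per query."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for i, doc_id in enumerate(ranked):
            if doc_id in relevant:
                total += 1.0 / (i + 1)
                break
    return total / len(all_ranked)
```

In this task, each input Claim typically has a single relevant VerClaim, so AP@k reduces to the reciprocal rank of that VerClaim when it appears in the top-𝑘, and 0 otherwise.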
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Method</head><p>Our pipeline is similar to that in <ref type="bibr" target="#b12">[13]</ref>, but we changed and improved its components. It is presented schematically in Figure <ref type="figure" target="#fig_1">1</ref>. First, we independently calculate lexical and semantic similarity scores between the input Claim and each VerClaim using TF.IDF and sentence-BERT <ref type="bibr" target="#b1">[2]</ref>, respectively. We calculate each score for three possible input options: (i) &lt;Claim, VerClaim&gt;, (ii) &lt;Claim + Title, VerClaim&gt;, and (iii) &lt;Claim + Title + Subtitle, VerClaim&gt;. Here, "+" denotes concatenation using [SEP] as a separator. Thus, we obtain six independent models. After that, we use LambdaMART <ref type="bibr" target="#b2">[3]</ref> to re-rank the top-20 results selected by the sentence-BERT model trained on the input &lt;Claim + Title, VerClaim&gt;. Here, the features are the predicted relevance scores and reciprocal ranks from each of the six models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Lexical Similarity</head><p>To estimate the lexical similarity, we use TF.IDF as a base model. Since TF.IDF depends on the number of words in the document/corpus, we tried to apply some data-specific pre-processing, e.g., cleaning up the input text by removing URLs, but this did not improve the results. Thus, our final lexical similarity approach converts the input to lowercase and computes embeddings accounting for the frequency of terms on a logarithmic scale: tf′ = 1 + log(tf). Then, we calculate the similarity between the input Claim and each VerClaim as the cosine similarity between the corresponding embeddings. Finally, we use these scores in the re-ranker.</p></div>
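A from-scratch sketch of this lexical step (lowercasing, sublinear term frequency tf′ = 1 + log(tf), idf weighting, cosine similarity) is shown below; the system's actual tokenization and idf variant are assumptions here, as is the choice to fit the vocabulary over the whole corpus at once.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF.IDF vectors (dicts) with sublinear tf' = 1 + log(tf)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (1 + math.log(c)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In practice, one would fit the vectorizer on the VerClaims database and embed each incoming Claim with the same vocabulary; the sketch above vectorizes a single corpus for brevity.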
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Semantic Similarity</head><p>Our TF.IDF approach relies on word matching. However, there are positive examples in the dataset where such a word matching score would be very low, e.g., when comparing the Claim "More Fake News. This was photoshopped, obviously, but the wind was strong and the hair looks good? Anything to demean!" to the VerClaim "The White House posted and then deleted an unflattering photograph of President Trump that displayed marked facial coloration." Thus, we also use sentence-BERT as an additional semantic similarity model. This model is based on Siamese networks, where each component (BERT) independently computes embeddings for the Claim and for the VerClaim, and then the similarity between them is calculated using the cosine. Since our task is an instance of the general task of determining the semantic similarity of two pieces of text, we fine-tune the model from the checkpoint that was trained on the STSb (Semantic Textual Similarity benchmark).</p><p>Note that using the sentence-BERT model to obtain sentence embeddings without any task-specific fine-tuning leads to bad results for this task <ref type="bibr" target="#b12">[13]</ref>. However, training with the MSE loss function is difficult due to the large class imbalance. Here, MSE = ∑_i (y_i − cos(f(c_i), f(vc_i)))², where f is the sentence-BERT encoder and y_i is the relevance score, which equals 1 for positive &lt;Claim (c), VerClaim (vc)&gt; pairs and 0 for negative ones. Note that there are many more negative pairs than positive ones. At the same time, if triplets are composed of these pairs, then the problem of hard negative mining arises (the search for challenging negative examples). Therefore, we apply the Multiple Negatives Ranking (MNR) loss <ref type="bibr" target="#b17">[18]</ref>, which uses only positively labeled pairs during training. To this end, it uses a softmax to contrast the similarity between the input Claim and the relevant VerClaim vs. the similarities between that Claim and all other VerClaims in the batch (Figure <ref type="figure">2</ref>). This allows us to simultaneously maximize the relevance score for the input positive pair and to minimize the scores for all other possible pairs in the batch.</p><p>It has been shown that the MNR loss function selects hard negatives by itself through the temperature parameter in the softmax <ref type="bibr" target="#b18">[19]</ref>. However, the model requires large batch sizes, since a hard negative example can only be exploited if it is present in the batch. To overcome this limitation, we manually form the input training sequence at each epoch, using the current model, as follows. We choose an arbitrary &lt;Claim, VerClaim&gt; anchor pair from the training set (which contains only positive pairs). Then, we select the top-𝑘 closest Claims from the unused ones (𝑘 is a hyperparameter) and we add them, paired with their relevant VerClaims, to the resulting sequence along with the anchor pair. The process ends when there are no unused Claims left.</p><p>We additionally make the MNR loss symmetric, so as to contrast to the positive pair all possible negative pairs: (Claim, VerClaim 𝑖 ) and (Claim 𝑖 , VerClaim).</p></div>
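The (symmetric) MNR loss described above can be sketched on pre-computed embeddings as follows. This is a toy illustration, not the system's actual training code: a real implementation backpropagates through the sentence-BERT encoder, which is omitted here, and the temperature value mirrors the 0.05 reported in Section 6.2.

```python
import numpy as np

def mnr_loss(claim_emb, verclaim_emb, temperature=0.05):
    """Cross-entropy over in-batch similarities: pair i is positive, the rest are negatives."""
    c = claim_emb / np.linalg.norm(claim_emb, axis=1, keepdims=True)
    v = verclaim_emb / np.linalg.norm(verclaim_emb, axis=1, keepdims=True)
    sims = c @ v.T / temperature                       # scaled cosine similarity matrix
    logprob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))  # row-wise log-softmax
    return -np.mean(np.diag(logprob))                  # positives sit on the diagonal

def symmetric_mnr_loss(claim_emb, verclaim_emb, temperature=0.05):
    """Also contrast each VerClaim against all Claims in the batch."""
    return 0.5 * (mnr_loss(claim_emb, verclaim_emb, temperature)
                  + mnr_loss(verclaim_emb, claim_emb, temperature))
```

When the matching pairs are the most similar items in the batch, the diagonal dominates each softmax row and the loss approaches zero; any in-batch VerClaim that is close to the wrong Claim contributes a large penalty, which is the implicit hard-negative effect the text refers to.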
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Reranking</head><p>At the reranking stage, we apply the LambdaMART model, which is based on Gradient Boosted Decision Trees. This is a learning-to-rank approach that has achieved the best results in various tasks, e.g., in the Yahoo! Learning to Rank Challenge (2011) <ref type="bibr" target="#b19">[20]</ref>. To train the LambdaMART model, we use a 12-dimensional feature vector: 2 model types × 3 input types × 2 features (estimated relevance score and position in the ranked list of VerClaims).</p><p>To implement such a stacking approach, and in order to prevent LambdaMART from "peeping" into the labels encoded in the features, we use only the part of the training data that was not available when training sentence-BERT. In this part, for each claim, we select the top-50 candidates using the single model that achieved the best results on the validation set (this turned out to be the sentence-BERT model trained on the input &lt;Claim + Title, VerClaim&gt;; see <ref type="bibr">Section 7)</ref>. Then, we supplement each of the resulting sets with the relevant VerClaim, if it was missing. We then train the model using all possible triplets that can be constructed in each set using the Claim as the anchor. At the inference stage, we only take the top-20 sentence-BERT results to minimize the final error. Note that we used LambdaMART because it can adjust the training procedure to optimize a specific evaluation measure (unlike RankSVM). To this end, the optimizer takes into account how much gain in the measure can be obtained by swapping two candidates from a triplet in the ranked list, while leaving the others untouched. In our case, we tuned the model for the main competition evaluation measure, MAP@5.</p></div>
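The assembly of the 12-dimensional feature vector for one &lt;Claim, candidate VerClaim&gt; pair could look like the sketch below; the model names, dict layout, and the zero fallback for candidates missing from a ranked list are illustrative assumptions, not the system's actual code.

```python
def candidate_features(candidate_id, model_outputs):
    """Build [score, reciprocal rank] per model, in a fixed model order.

    model_outputs: {model_name: ranked list of (verclaim_id, score)},
    one entry per model (6 models in the full system -> 12 features).
    """
    features = []
    for model_name in sorted(model_outputs):       # fixed order keeps features aligned
        ranked = model_outputs[model_name]
        for rank, (vc_id, score) in enumerate(ranked, start=1):
            if vc_id == candidate_id:
                features += [score, 1.0 / rank]    # similarity score + reciprocal rank
                break
        else:
            features += [0.0, 0.0]                 # candidate absent from this list
    return features
```

Each top-20 candidate from the core sentence-BERT model would get one such vector, and LambdaMART is trained and applied over those vectors.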
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Data Split</head><p>To train sentence-BERT, we took the first 800 claims from the training dataset, and we used the remaining 200 claims for validation. Then, out of those 200, we took 170 to train LambdaMART, and we validated its quality against the remaining 30 claims.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Parameter Settings</head><p>We used the Sentence-transformers framework<ref type="foot" target="#foot_0">1</ref> to train the sentence-BERT models. We used the pre-trained stsb-bert-base for the input &lt;Claim, VerClaim&gt;, and stsb-bert-large for the two other variants. We used the following hyperparameter values: learning rate of 1e-5, batch size of 6, training for 20 epochs, and the default optimizer with the number of warm-up steps equal to 10% of the total number of training steps. For the MNR loss, we set the temperature to 0.05 and 𝑘 to 7 to form the input sequence. We validated the model after each epoch, and we chose the best checkpoint. We used the LambdaMART implementation from the Python learning-to-rank toolkit,<ref type="foot" target="#foot_1">2</ref> with the following hyperparameter values: 1,500 boosting stages, maximum tree depth of 3, learning rate of 0.02, maximum of 12 leaf nodes, fraction of queries used for fitting the base learners of 0.3, and fraction of features used for selecting the best split of 0.3. We kept the best checkpoint as evaluated on the validation set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiments and Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Lexical Similarity</head><p>A comparison of approaches for estimating the lexical similarity for each of the three input types is presented in Table <ref type="table" target="#tab_0">1</ref>. Here, we applied the original BM25 Okapi algorithm <ref type="bibr" target="#b20">[21]</ref> in addition to Elasticsearch, where it is used to build the index. We found that our best TF.IDF approach, which used the Title and the Subtitle to calculate the scores, outperformed BM25 and Elasticsearch on MAP@5. We also evaluated TF.IDF with the standard tf term calculation, but the results were worse. The results also show the importance of using the title as an additional input.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Semantic Similarity</head><p>The results on the official development set for sentence-BERT are presented in Table <ref type="table" target="#tab_1">2</ref>. Note that we used the base model for the input &lt;Claim, VerClaim&gt;, and the large variant in the other cases.</p><p>The base model achieved a MAP@5 of 0.855 on the input &lt;Claim + Title, VerClaim&gt;. Thus, the gain from the use of the Title is not as large as in the case of the lexical component. Although the best quality on the development set was achieved by the model trained on the input &lt;Claim + Title + Subtitle, VerClaim&gt;, we chose the one trained on &lt;Claim + Title, VerClaim&gt; as the core model, as it achieved a MAP@5 of 0.772 vs. 0.739 on our validation sample. Moreover, the part of the training data that we set aside for validation turned out to be much harder than the development set. Finally, the results for our best semantic model are better than those for our best lexical model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Results on the development set. Here, shaar is a baseline submission (Elasticsearch) by the organizers.</p><p>Rank Team MAP@5 MAP@1 RR P@3 P@5 </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Reranking</head><p>Reranking with LambdaMART improved the MAP@5 to 0.941 on the development set. The results for the other participants are shown in Table <ref type="table">3</ref>. We further estimated the importance of each of the 12 features using the trained LambdaMART model (Table <ref type="table" target="#tab_2">4</ref>). These results confirm that the most important features come from sentence-BERT (the semantic component) using the claim with the title as an input. However, the TF.IDF approaches (the lexical component) also have relatively high importance. In particular, the importance of the similarity score predicted by the TF.IDF approach on the input &lt;Claim + Title + Subtitle, VerClaim&gt; is higher than that of the base sentence-BERT model on the same input. If we completely exclude the results of the lexical component from the features, the MAP@5 on the development set drops to 0.899.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Official Results on the Test Set</head><p>The official evaluation results on the test set are presented in Table <ref type="table">5</ref>. We can see that our system outperforms the systems of the other participants, as well as the organizers' baseline, by a large margin. The table also demonstrates the stability of our solution: the test performance is consistent with what we observed on the validation set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Official results on the test set. shaar is a baseline submission (Elasticsearch) of the competition organizers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rank Team</head><p>MAP@5 MAP@1 RR P@3 P@5 </p><formula xml:id="formula_0">1</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion and Future Work</head><p>We have described our system for the CLEF 2021 CheckThat! Lab Task 2 Subtask A (English) on detecting previously fact-checked claims. We developed a pipeline using TF.IDF, fine-tuned sentence-BERT, and reranking using LambdaMART, which used similarity scores and ranks as features. We examined the performance of each model on the validation set and analyzed its contribution to the final reranker. The official evaluation ranked our system 1st by a wide margin ahead of the other participants and the organizers' baseline.</p><p>In future work, we plan to experiment with other Transformer-based sentence encoders, such as RoBERTa <ref type="bibr" target="#b21">[22]</ref> and MPNet <ref type="bibr" target="#b22">[23]</ref>. Another direction we want to explore is the use of other potentially relevant data besides STSb for model pre-training.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: For the input Claim, TF.IDF and sentence-BERT independently evaluate the relevance of each VerClaim from the database, returning a similarity score and a position in a fully ranked list. The LambdaMART model then reranks the top-20 results from the sentence-BERT model using all predicted scores and positions as features.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: For the batch of positive pairs &lt;Claim, VerClaim&gt;, the Multiple Negatives Ranking loss contrasts the similarities between the input claim 𝑐 𝑖 and the relevant verified claim 𝑣𝑐 𝑖 vs. between 𝑐 𝑖 and all other 𝑣𝑐 𝑗 in the batch, using a softmax. • denotes the dot-product.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Lexical model comparison on the development set.</figDesc><table><row><cell>Method</cell><cell>Input type</cell><cell cols="3">MAP@5 MAP@1 P@3 P@5</cell></row><row><cell></cell><cell>Claim</cell><cell>0.728</cell><cell>0.683</cell><cell>0.260 0.161</cell></row><row><cell>Elasticsearch</cell><cell>Claim+Title</cell><cell>0.834</cell><cell>0.781</cell><cell>0.295 0.182</cell></row><row><cell></cell><cell>Claim+Title+Subtitle</cell><cell>0.859</cell><cell>0.822</cell><cell>0.300 0.184</cell></row><row><cell></cell><cell>Claim</cell><cell>0.414</cell><cell>0.352</cell><cell>0.159 0.105</cell></row><row><cell>BM25 Okapi</cell><cell>Claim+Title</cell><cell>0.586</cell><cell>0.528</cell><cell>0.214 0.137</cell></row><row><cell></cell><cell>Claim+Title+Subtitle</cell><cell>0.646</cell><cell>0.608</cell><cell>0.230 0.140</cell></row><row><cell></cell><cell>Claim</cell><cell>0.662</cell><cell>0.577</cell><cell>0.250 0.155</cell></row><row><cell>TF.IDF</cell><cell>Claim+Title</cell><cell>0.832</cell><cell>0.779</cell><cell>0.298 0.183</cell></row><row><cell></cell><cell>Claim+Title+Subtitle</cell><cell>0.861</cell><cell>0.819</cell><cell>0.305 0.184</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Semantic model comparison on the development set.</figDesc><table><row><cell>Method</cell><cell>Input type</cell><cell cols="3">MAP@5 MAP@1 P@3 P@5</cell></row><row><cell></cell><cell>Claim</cell><cell>0.826</cell><cell>0.784</cell><cell>0.290 0.177</cell></row><row><cell>sentence-BERT</cell><cell>Claim+Title</cell><cell>0.872</cell><cell>0.839</cell><cell>0.302 0.185</cell></row><row><cell></cell><cell>Claim+Title+Subtitle</cell><cell>0.882</cell><cell>0.849</cell><cell>0.307 0.185</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Evaluation of the importance of 12 features produced by the pipeline components. It is estimated by the LambdaMART model. Each model provides two features: RR (Reciprocal Rank, that is the position in the ranked list) and Sim. score (the predicted similarity score).</figDesc><table><row><cell>1</cell><cell>aschern</cell><cell>0.941</cell><cell>0.932</cell><cell cols="2">0.940 0.318 0.191</cell></row><row><cell>2</cell><cell>simihaylova</cell><cell>0.936</cell><cell>0.927</cell><cell cols="2">0.935 0.315 0.190</cell></row><row><cell>3</cell><cell>gs_chm</cell><cell>0.902</cell><cell>0.857</cell><cell cols="2">0.901 0.318 0.192</cell></row><row><cell>4</cell><cell>shaar</cell><cell>0.818</cell><cell>0.776</cell><cell cols="2">0.820 0.286 0.177</cell></row><row><cell></cell><cell>Method</cell><cell>Input type</cell><cell></cell><cell>RR</cell><cell>Sim. score</cell></row><row><cell></cell><cell></cell><cell>Claim</cell><cell></cell><cell>0.070</cell><cell>0.054</cell></row><row><cell></cell><cell>TF.IDF</cell><cell>Claim+Title</cell><cell></cell><cell>0.075</cell><cell>0.084</cell></row><row><cell></cell><cell></cell><cell cols="3">Claim+Title+Subtitle 0.057</cell><cell>0.088</cell></row><row><cell></cell><cell></cell><cell>Claim</cell><cell></cell><cell>0.078</cell><cell>0.066</cell></row><row><cell></cell><cell>sentence-BERT</cell><cell>Claim+Title</cell><cell></cell><cell>0.081</cell><cell>0.188</cell></row><row><cell></cell><cell></cell><cell cols="3">Claim+Title+Subtitle 0.077</cell><cell>0.081</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://github.com/UKPLab/sentence-transformers</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://github.com/jma127/pyltr</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Anton Chernyavskiy and Dmitry Ilvovsky performed this research in the framework of the HSE University Basic Research Program, funded by the Russian Academic Excellence Project 5-100.</p><p>Preslav Nakov contributed as part of the Tanbih mega-project (tanbih.qcri.org), developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news", propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Míguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Babulkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd European Conference on Information Retrieval, ECIR &apos;21</title>
				<meeting>the 43rd European Conference on Information Retrieval, ECIR &apos;21<address><addrLine>Lucca, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="639" to="649" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERTnetworks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP &apos;19</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP &apos;19<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3982" to="3992" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Adapting boosting for information retrieval measures</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burges</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Svore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="254" to="270" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Detection and resolution of rumours in social media: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zubiaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liakata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Procter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A survey on truth discovery</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explor. Newsl</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The spread of true and false news online</title>
		<author>
			<persName><forename type="first">S</forename><surname>Vosoughi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Aral</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">359</biblScope>
			<biblScope unit="page" from="1146" to="1151" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Stance detection: A survey</title>
		<author>
			<persName><forename type="first">D</forename><surname>Küçük</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Can</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A survey on computational propaganda detection</title>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cresci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI &apos;20</title>
				<meeting>the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI &apos;20</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4826" to="4832" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">CredEye: A credibility lens for analyzing and explaining misinformation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Popat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Strötgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Web Conference, WWW &apos;18</title>
				<meeting>the Web Conference, WWW &apos;18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="155" to="158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Hardalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</author>
		<title level="m">A survey on stance detection for mis-and disinformation identification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">ClaimBuster: The first-ever end-to-end fact-checking system</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Arslan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Caraballo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jimenez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gawsane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joseph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sable</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tremayne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1945" to="1948" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">ClaimsKG: A knowledge graph of fact-checked claims</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tchechmedjiev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fafalios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Boland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gasquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zapilko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dietze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Todorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th International Semantic Web Conference, ISWC &apos;19</title>
				<meeting>the 18th International Semantic Web Conference, ISWC &apos;19<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="309" to="324" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">That is a known lie: Detecting previously fact-checked claims</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Babulkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL &apos;20</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL &apos;20</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3607" to="3618" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Babulkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hamdan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Ali</surname></persName>
		</author>
		<title level="m">Overview of CheckThat! 2020: Automatic identification and verification of claims in social media</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>CLEF</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Bouziane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Perrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cluzeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sadeq</surname></persName>
		</author>
		<title level="m">Team Buster.ai at CheckThat! 2020: Insights and recommendations to improve fact-checking</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>CLEF</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">UNIPI-NLE at CheckThat! 2020: Approaching fact checking from a sentence similarity perspective through the lens of transformers</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Marcelloni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Thuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Motlogelwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Leburu-Dingalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mudongo</surname></persName>
		</author>
		<title level="m">UB_ET at CheckThat! 2020: Exploring ad hoc retrieval approaches in verified claims retrieval</title>
				<imprint>
			<publisher>CLEF</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Efficient natural language response suggestion for smart reply</title>
		<author>
			<persName><forename type="first">M</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Strope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lukács</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Miklos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kurzweil</surname></persName>
		</author>
		<idno>arXiv 1705.00652</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Teterwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sarna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Isola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maschinot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krishnan</surname></persName>
		</author>
		<idno>arXiv 2004.11362</idno>
		<title level="m">Supervised contrastive learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Yahoo! learning to rank challenge overview</title>
		<author>
			<persName><forename type="first">O</forename><surname>Chapelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research - Proceedings Track</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="1" to="24" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The probabilistic relevance framework: BM25 and beyond</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Found. Trends Inf. Retr</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="333" to="389" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>arXiv 1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">MPNet: Masked and permuted pre-training for language understanding</title>
		<author>
			<persName><forename type="first">K</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno>arXiv 2004.09297</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
