=Paper=
{{Paper
|id=Vol-2936/paper-38
|storemode=property
|title=Aschern at CLEF CheckThat! 2021: Lambda-Calculus of Fact-Checked Claims
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-38.pdf
|volume=Vol-2936
|authors=Anton Chernyavskiy,Dmitry Ilvovsky,Preslav Nakov
|dblpUrl=https://dblp.org/rec/conf/clef/ChernyavskiyIN21
}}
==Aschern at CLEF CheckThat! 2021: Lambda-Calculus of Fact-Checked Claims==
Anton Chernyavskiy¹, Dmitry Ilvovsky¹ and Preslav Nakov²
¹ HSE University, Moscow, Russia
² Qatar Computing Research Institute, HBKU, Doha, Qatar
Abstract
We describe our system for the CLEF 2021 CheckThat! Lab Task 2 Subtask A on detecting previously fact-
checked claims. We developed a pipeline using TF.IDF, sentence-BERT fine-tuned on the training data,
and reranking using LambdaMART and the predicted similarity scores and positions in the ranked list
as features. We examined the quality of each model on the validation set and analyzed its contribution
to the final result using the trained LambdaMART. The official evaluation ranked our system 1st by a
wide margin over other participants and the organizers’ baseline.
Keywords
fact-checking, lexical similarity, semantic similarity, sentence-BERT, TF.IDF, LambdaMART
1. Introduction
Social media provide an easy way to share information online. However, this also causes
problems since some users may share false claims. Such claims are often sensational, which
further contributes to their fast spread. One possible solution is to fact-check suspicious claims,
but this is a difficult and time-consuming task when done manually. Even if the process is
automated, it is impossible to fact-check every claim on the web. One could also ask: is it really
necessary to fact-check everything? For example, if we aim to limit the spread of some false
claim, then it is enough to fact-check only one post where it is present. Then, we can try to find
posts that repeat that claim.
The CLEF 2021 CheckThat! Lab Task 2 [1] aims at solving that problem: given a tweet it
asks to match it against a database of previously fact-checked claims. The participating systems
are asked to rank the list of previously fact-checked claims according to their relevance, so
that more useful ones are ranked higher. The task features two datasets for claims collected
from tweets and from political debates, and it is offered in English and in Arabic. Below, we
describe the system that we built for the English version of the dataset collected from tweets
(Subtask 2A). At the core of our system is the sentence-BERT model [2], which was originally
pre-trained on the Semantic Textual Similarity benchmark (STSb) data. We further fine-tuned it
on the task data and then applied LambdaMART [3] to rerank the top-20 results. As features,
LambdaMART uses the relevance scores and ranks predicted by sentence-BERT and TF.IDF.
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
aschernyavskiy 1@edu.hse.ru (A. Chernyavskiy); dilvovsky@hse.ru (D. Ilvovsky); pnakov@hbku.edu.qa (P. Nakov)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. Related Work
Many studies have addressed disinformation and misinformation [4, 5, 6, 7, 8, 9, 10].
However, only a few are directly related to our task. In the ClaimBuster system [11],
the problem was mentioned as part of the general fact-checking pipeline, but no evaluation
of its solution was provided. The ClaimKG dataset was presented in [12], where claims from
different fact-checking websites can be retrieved by some keywords using a knowledge graph.
The original task formulation, together with a dataset for the problem of detecting
previously fact-checked claims, was presented in [13], where the authors used data from Snopes
and PolitiFact. They proposed a solution that combined Elasticsearch, sentence-BERT, and
reranking using RankSVM. Their dataset was then used within the framework of the CLEF 2020
CheckThat! Lab Task 2 [14]. Then, an expanded and cleaned-up dataset consisting of tweets
was reused at the CLEF CheckThat! Lab 2021 Task 2A [1].
The winning team of the CLEF 2020 CheckThat! Lab Task 2 was Buster.AI [15], who proposed
a solution based on RoBERTa, adversarial hard negative examples, and additional training on
external data from FEVER, SciFact, and the Liar datasets. Team UNIPI-NLE [16] performed a
cascade training of sentence-BERT models with the preliminary use of Elasticsearch to prune
the list of possible candidates. Team UB_ET [17] applied DPH and LambdaMART over query-
dependent features. Other participants also used Elasticsearch and sentence-BERT as well as
Terrier, KD search, Universal Sentence Encoder (USE), TF.IDF, and BM25 to perform retrieval,
and to compute similarity scores [14].
3. Dataset
We use the data presented within the CLEF CheckThat! Lab 2020 Task 2 Subtask A for English.
The verified claims (VerClaims) database contains 13,825 claims. There are 1,000 positively
labeled pairs in the training set, and 200 input claims in the validation
and the test sets. For each VerClaim, there is some additional information coming from the
article that fact-checkers wrote about the claim: title, subtitle, author, and date of verification.
4. Evaluation Measures
The official evaluation measure is MAP@k with k = 5 (Mean Average Precision for the top-k
VerClaims in the ranked list). Additional evaluation measures computed by the scoring script
include MAP, MRR (Mean Reciprocal Rank), and P@k (precision for the top-k results), for
k ∈ {1, 3, 10, all}.
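These measures can be sketched in plain Python; the function and variable names below are our illustration, not the official scoring script. `ranked` is a system's ranked list of VerClaim ids for one input Claim, and `relevant` is the set of gold VerClaim ids.

```python
def average_precision_at_k(ranked, relevant, k):
    """AP@k: mean of the precision values at the ranks of relevant items,
    normalized by min(|relevant|, k)."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_ranked, all_relevant, k=5):
    """MAP@k: AP@k averaged over all input claims."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps)

def mrr(all_ranked, all_relevant):
    """MRR: average reciprocal rank of the first relevant item."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / i
                break
    return total / len(all_ranked)
```

With a single relevant VerClaim per Claim, as in this dataset, AP@k reduces to the reciprocal rank of that VerClaim when it appears in the top k, and 0 otherwise.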
5. Method
Our pipeline is similar to that in [13], but we changed and improved its components. It is
presented schematically in Figure 1. First, we independently calculate lexical and semantic
similarity scores between the input Claim and each VerClaim using TF.IDF and sentence-BERT
[2], respectively.
Figure 1: For the input Claim, TF.IDF and sentence-BERT independently evaluate the relevance of
each VerClaim from the database, returning a similarity score and a position in a fully ranked list.
The LambdaMART model then reranks the top-20 results from the sentence-BERT model using all
predicted scores and positions as features.
We calculate each score for three possible input options: (i) Claim, (ii) Claim+Title, and
(iii) Claim+Title+Subtitle. Here “+” denotes concatenation using [SEP] as a separator. Thus,
we obtain six independent models. After that, we use LambdaMART [3] to re-rank the top-20
results selected by the sentence-BERT model trained on the Claim+Title input. Here, the
features are the predicted relevance scores and reciprocal ranks for each of the six models.
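As a sketch, the 12-dimensional feature vector for one candidate could be assembled as follows; the dictionary layout and all names are our illustration, not the authors' code:

```python
def reciprocal_rank(candidate_id, ranked_ids):
    """1 / position of the candidate in a model's full ranked list."""
    return 1.0 / (ranked_ids.index(candidate_id) + 1)

def feature_vector(candidate_id, model_outputs):
    """Build the 12 features for one (Claim, VerClaim) candidate:
    2 models x 3 input types x (similarity score, reciprocal rank).
    model_outputs maps (model, input_type) to (scores_by_id, ranked_ids)."""
    features = []
    for model in ("tfidf", "sbert"):
        for inp in ("claim", "claim+title", "claim+title+subtitle"):
            scores, ranked = model_outputs[(model, inp)]
            features.append(scores[candidate_id])
            features.append(reciprocal_rank(candidate_id, ranked))
    return features
```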
5.1. Lexical Similarity
To estimate the lexical similarity, we use TF.IDF as a base model. Since TF.IDF depends on the
number of words in the document/corpus, we tried to apply some data-specific pre-processing,
e.g., cleaning up the input text by removing URLs, but this did not improve the results. Thus, our
final lexical similarity approach converts the input to lowercase and computes embeddings
accounting for the frequency of terms on a logarithmic scale: tf′ = 1 + log(tf). Then, we
calculate the similarity between the input Claim and each VerClaim as the cosine similarity
between the corresponding embeddings. Finally, we use these scores in the re-ranker.
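A minimal pure-Python sketch of this lexical component follows; the whitespace tokenizer and this particular idf variant are our assumptions, as the paper does not specify them:

```python
import math
from collections import Counter

def tokenize(text):
    # lowercase the input, as in the final approach described above
    return text.lower().split()

def idf_table(corpus_tokens):
    """Inverse document frequency for every term in the corpus."""
    n = len(corpus_tokens)
    df = Counter(t for doc in corpus_tokens for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def embed(tokens, idf):
    """Sparse TF.IDF embedding with log-scaled tf: tf' = 1 + log(tf)."""
    tf = Counter(tokens)
    return {t: (1 + math.log(tf[t])) * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse embeddings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In practice this corresponds to a standard vectorizer with sublinear tf scaling, scored against every VerClaim in the database.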
5.2. Semantic Similarity
Our TF.IDF approach relies on word matching. However, there are positive examples in the
dataset where such word matching score would be very low, e.g., when comparing the Claim
“More Fake News. This was photoshopped, obviously, but the wind was strong and the hair looks
good? Anything to demean!” to the VerClaim “The White House posted and then deleted an
unflattering photograph of President Trump that displayed marked facial coloration.” Thus, we
also use sentence-BERT as an additional semantic similarity. This model is based on Siamese
networks, where each component (BERT) independently computes embeddings for the Claim
and for the VerClaim, and then the similarity between them is calculated using a cosine.
Figure 2: For a batch of positive pairs (𝑐𝑖 , 𝑣𝑐𝑖 ), the Multiple Negatives Ranking loss contrasts
the similarity between the input claim 𝑐𝑖 and the relevant verified claim 𝑣𝑐𝑖 vs. the similarities between
𝑐𝑖 and all other 𝑣𝑐𝑗 in the batch using softmax; the similarities are computed as dot products.
Since our task is an instance of the general task of determining the semantic similarity of
two pieces of text, we fine-tune the model from the checkpoint that was trained on the STSb
(Semantic Textual Similarity benchmark).
Note that using the sentence-BERT model to obtain sentence embeddings without any task-specific
fine-tuning leads to bad results for this task [13]. However, training with the MSE loss
function is difficult due to the large class imbalance. Here, MSE = Σᵢ (𝑦𝑖 − cos(𝑓(𝑐𝑖), 𝑓(𝑣𝑐𝑖)))²,
where 𝑓 is the sentence-BERT encoder and 𝑦𝑖 is the relevance score, which equals 1 for positive
pairs and 0 for negative ones; there are many more negative pairs than positive ones. At the
same time, if triplets are composed of these pairs, then the problem of hard negative mining
arises (the search for complex negative examples).
Therefore, we apply Multiple Negatives Ranking (MNR) loss [18], which uses only positively
marked pairs during training. To this end, it contrasts the similarities between the input Claim
and the relevant VerClaim vs. Claim and all other VerClaims in the batch using softmax
(Figure 2). This allows us to simultaneously maximize the relevance score for the input positive
pair and to minimize the scores for all other possible pairs in the batch.
It has been shown that the MNR loss function selects hard negatives by itself through the
temperature parameter in the softmax [19]. However, the model requires large batch sizes, since
in order to find such an example in a batch, it must be present there. To overcome this limitation, we
manually form the input training sequence at each epoch using the current model as follows. We
choose an arbitrary anchor pair (Claim, VerClaim) from the training set (which contains
only positive pairs). Then we select the top-𝑘 (𝑘 is a hyperparameter) of the closest Claims
from the unused ones and we add them paired with the relevant VerClaims to the result
sequence along with the anchor pair. The process ends when there are no unused Claims left.
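The epoch-wise sequence construction can be sketched as follows; `similarity` stands in for the current sentence-BERT model's Claim-to-Claim score, and all names are illustrative:

```python
def build_training_sequence(pairs, similarity, k=7):
    """pairs: positive (claim, verclaim) tuples. Starting from an arbitrary
    anchor pair, repeatedly pull the k Claims most similar to the anchor
    Claim (under the current model) from the unused pool, so that hard
    negatives for the anchor land in the same region of the sequence."""
    unused = list(pairs)
    sequence = []
    while unused:
        anchor = unused.pop(0)  # arbitrary anchor pair
        sequence.append(anchor)
        # k unused Claims closest to the anchor Claim, per current model
        unused.sort(key=lambda p: similarity(anchor[0], p[0]), reverse=True)
        closest, unused = unused[:k], unused[k:]
        sequence.extend(closest)  # each paired with its relevant VerClaim
    return sequence
```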
We additionally make the MNR loss symmetric, so that the positive pair is contrasted with all
possible negative pairs: (Claim, VerClaim𝑖 ) and (Claim𝑖 , VerClaim).
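A pure-Python sketch of the (symmetric) MNR loss over a batch similarity matrix, where `sim[i][j]` is the temperature-scaled similarity between claim i and verified claim j; this is our reconstruction for illustration, not the authors' code:

```python
import math

def mnr_loss(sim, symmetric=True):
    """For each row i, the softmax over the row should concentrate on the
    diagonal entry (the positive pair); the symmetric variant adds the
    same cross-entropy term over the columns."""
    n = len(sim)
    def row_term(row, pos):
        denom = sum(math.exp(s) for s in row)
        return -math.log(math.exp(row[pos]) / denom)
    loss = sum(row_term(sim[i], i) for i in range(n))
    if symmetric:  # also contrast (Claim_j, VerClaim) negative pairs
        cols = [[sim[i][j] for i in range(n)] for j in range(n)]
        loss += sum(row_term(cols[j], j) for j in range(n))
    return loss / (2 * n if symmetric else n)
```

A batch whose diagonal similarities dominate yields a near-zero loss, while a flat similarity matrix yields the maximal log(batch size).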
5.3. Reranking
At the reranking stage, we apply the LambdaMART model, which is based on Gradient Boosted
Decision Trees. This is a learning-to-rank approach, which achieved the best results in different
tasks, e.g., in the Yahoo! Learning to Rank Challenge (2011) [20]. To train the LambdaMART
model, we use a 12-dimensional feature vector: 2 model types × 3 input types × 2 features
(the estimated relevance score and the position in the ranked list of VerClaims).
To implement such a stacking approach, and to prevent LambdaMART from “peeping”
into the labels encoded in the features, we use only the part of the training data that was
not available when training sentence-BERT. In this part, for each claim, we select the top-50
candidates using the single model that achieved the best results on the validation set (it turned
out to be the sentence-BERT model trained on the Claim+Title input; see
Section 7). Then, we supplement each of the resulting sets with the relevant VerClaim, if it was
missing. Then, we train the model using all possible triplets that can be constructed in each set
using the Claim as the anchor. At the inference stage, we only take the top-20 sentence-BERT
results to minimize the final error. Note that we used LambdaMART, which can adjust the
training procedure to optimize a specific evaluation measure (unlike RankSVM). To this end,
the optimizer takes into account how much gain in the measure can be obtained by swapping
two candidates from the triplet in the ranked list, while leaving the others untouched. In this
case, we tuned the model to optimize the main competition metric, MAP@5.
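The swap-gain computation can be illustrated for MAP@5. With a single relevant VerClaim per Claim, as in this dataset, AP@5 reduces to the reciprocal rank of that VerClaim within the top 5, so the gain from a swap is the difference of two such terms. A sketch (our illustration):

```python
def ap_at_k(ranked, relevant_id, k=5):
    """AP@k with one relevant item: 1/rank if it is in the top k, else 0."""
    for i, doc in enumerate(ranked[:k], start=1):
        if doc == relevant_id:
            return 1.0 / i
    return 0.0

def swap_gain(ranked, relevant_id, i, j, k=5):
    """Change in AP@k from swapping positions i and j, all else fixed.
    LambdaMART weights each pairwise preference by such a gain."""
    swapped = list(ranked)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return ap_at_k(swapped, relevant_id, k) - ap_at_k(ranked, relevant_id, k)
```

Swapping the relevant VerClaim from rank 3 to rank 1, for example, yields a gain of 1 − 1/3 = 2/3, while swapping two irrelevant candidates yields zero, which is how the optimizer focuses on moves that matter for the target measure.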
6. Experimental Setup
6.1. Data Split
To train sentence-BERT, we took the first 800 claims from the training dataset, and we used the
remaining 200 claims for validation. Then, out of those 200, we took 170 to train LambdaMART,
and we validated its quality against the remaining 30 claims.
6.2. Parameter Settings
We used the Sentence-transformers framework1 to train the sentence-BERT models. We used the
pre-trained stsb-bert-base checkpoint for the Claim input, and stsb-bert-large
for the two other variants. We used the following hyperparameter values: learning rate of 1e-5,
batch size of 6, training for 20 epochs, and the default optimizer with the number of warm
up steps equal to 10% of the total number of training steps. For the MNR loss, we set the
temperature to 0.05 and 𝑘 to 7 to form the input sequence. We validated the model for each
epoch, and we chose the best checkpoint. We used the LambdaMART implementation from the
Python learning-to-rank toolkit,2 and the following values for the hyperparameters: number of
boosting stages of 1,500, maximum tree depth of 3, learning rate of 0.02, maximum leaf nodes of
12, fraction of queries to use for fitting the base learners of 0.3, fraction of features to use for
selecting the best split of 0.3. We kept the best checkpoint as evaluated on the validation set.
1 http://github.com/UKPLab/sentence-transformers
2 http://github.com/jma127/pyltr
Table 1
Lexical model comparison on the development set.
Method         Input type            MAP@5  MAP@1  P@3    P@5
Elasticsearch  Claim                 0.728  0.683  0.260  0.161
Elasticsearch  Claim+Title           0.834  0.781  0.295  0.182
Elasticsearch  Claim+Title+Subtitle  0.859  0.822  0.300  0.184
BM25 Okapi     Claim                 0.414  0.352  0.159  0.105
BM25 Okapi     Claim+Title           0.586  0.528  0.214  0.137
BM25 Okapi     Claim+Title+Subtitle  0.646  0.608  0.230  0.140
TF.IDF         Claim                 0.662  0.577  0.250  0.155
TF.IDF         Claim+Title           0.832  0.779  0.298  0.183
TF.IDF         Claim+Title+Subtitle  0.861  0.819  0.305  0.184
Table 2
Semantic model comparison on the development set.
Method         Input type            MAP@5  MAP@1  P@3    P@5
sentence-BERT  Claim                 0.826  0.784  0.290  0.177
sentence-BERT  Claim+Title           0.872  0.839  0.302  0.185
sentence-BERT  Claim+Title+Subtitle  0.882  0.849  0.307  0.185
7. Experiments and Results
7.1. Lexical Similarity
A comparison of approaches to estimate the lexical similarity for each of the three input types
is presented in Table 1. Here, we applied the original BM25 Okapi algorithm [21] in addition
to Elasticsearch, where it is used to build the index. We found that our best TF.IDF approach,
which used Title and Subtitle to calculate scores, outperformed BM25 and Elasticsearch on
MAP@5. We also evaluated TF.IDF with the standard tf term calculation, but the results were
worse. The results also show the importance of using the title as an additional input.
7.2. Semantic Similarity
The results on the official development set for sentence-BERT are presented in Table 2. Note
that we used the base model for the Claim input, and the large variant in the other cases.
The base model achieved a MAP@5 of 0.855 on the Claim+Title input.
Therefore, the gain from using the Title is not as large as in the case of the lexical
component. Although the best quality on the development set was achieved by the model
trained on the Claim+Title+Subtitle input, we chose the one trained
on Claim+Title as the core model, as it achieved a MAP@5 of 0.772 vs. 0.739
on our validation sample. Moreover, the training data from which we took part for validation
turned out to be much more complicated than the development set. Finally, the results for our
best semantic model are better than those for our best lexical model.
Table 3
Results on the development set. Here, shaar is a baseline submission (Elasticsearch) by the organizers.
Rank Team MAP@5 MAP@1 RR P@3 P@5
1 aschern 0.941 0.932 0.940 0.318 0.191
2 simihaylova 0.936 0.927 0.935 0.315 0.190
3 gs_chm 0.902 0.857 0.901 0.318 0.192
4 shaar 0.818 0.776 0.820 0.286 0.177
Table 4
Evaluation of the importance of 12 features produced by the pipeline components. It is estimated by
the LambdaMART model. Each model provides two features: RR (Reciprocal Rank, that is the position
in the ranked list) and Sim. score (the predicted similarity score).
Method         Input type            RR     Sim. score
TF.IDF         Claim                 0.070  0.054
TF.IDF         Claim+Title           0.075  0.084
TF.IDF         Claim+Title+Subtitle  0.057  0.088
sentence-BERT  Claim                 0.078  0.066
sentence-BERT  Claim+Title           0.081  0.188
sentence-BERT  Claim+Title+Subtitle  0.077  0.081
7.3. Reranking
Reranking with LambdaMART improved MAP@5 to 0.941 on the development set. The results
for other participants are shown in Table 3.
We further estimated the importance of each of the 12 features using the trained LambdaMART
model (Table 4). These results confirm that the most important features come from sentence-
BERT (the semantic component), which used the claim with the title as an input. However,
TF.IDF approaches (the lexical component) also have relatively high importance. In particular,
the importance of the similarity score predicted by the TF.IDF approach on the
Claim+Title+Subtitle input is higher than that of the sentence-BERT
base model on the same input. If we completely exclude the results of the lexical component
from the features, MAP@5 on the development set drops to 0.899.
7.4. Official Results on the Test Set
The official evaluation results on the test set are presented in Table 5. We can see that our
system outperforms the systems by the other participants and also the organizers’ baseline
by a large margin. The table also demonstrates the stability of our solution: the test
performance is consistent with what we observed on the validation set.
Table 5
Official results on the test set. shaar is a baseline submission (Elasticsearch) of the competition orga-
nizers.
Rank Team MAP@5 MAP@1 RR P@3 P@5
1 aschern 0.883 0.861 0.884 0.300 0.182
2 NLytics 0.799 0.738 0.807 0.289 0.179
3 simihaylova 0.787 0.728 0.795 0.282 0.177
4 shaar 0.749 0.703 0.761 0.262 0.164
8. Conclusion and Future Work
We have described our system for the CLEF 2021 CheckThat! Lab Task 2 Subtask A (English)
on detecting previously fact-checked claims. We developed a pipeline using TF.IDF, fine-tuned
sentence-BERT, and reranking using LambdaMART, which used similarity scores and ranks
as features. We examined the performance of each model on the validation set and analyzed
its contribution to the final reranker. The official evaluation ranked our system 1st by a wide
margin ahead of other participants and the organizers’ baseline.
In future work, we plan to experiment with other Transformer-based sentence encoders
such as RoBERTa [22] and MPNet [23]. Another direction we want to explore is to use other
potentially relevant data besides STSb for model pre-training.
Acknowledgments
Anton Chernyavskiy and Dmitry Ilvovsky performed this research in the framework of the HSE
University Basic Research Program, funded by the Russian Academic Excellence Project 5-100.
Preslav Nakov contributed as part of the Tanbih mega-project (tanbih.qcri.org), developed at
the Qatar Computing Research Institute, HBKU, which aims to limit the impact of “fake news”,
propaganda, and media bias by making users aware of what they are reading, thus promoting
media literacy and critical thinking.
References
[1] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The
CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked
claims, and fake news, in: Proceedings of the 43rd European Conference on Information
Retrieval, ECIR ’21, Lucca, Italy, 2021, pp. 639–649.
[2] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-
networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language
Processing, EMNLP-IJCNLP ’19, Hong Kong, China, 2019, pp. 3982–3992.
[3] Q. Wu, C. Burges, K. Svore, J. Gao, Adapting boosting for information retrieval measures,
Information Retrieval 13 (2009) 254–270.
[4] A. Zubiaga, A. Aker, K. Bontcheva, M. Liakata, R. Procter, Detection and resolution of
rumours in social media: A survey, ACM Comput. Surv. 51 (2018).
[5] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, J. Han, A survey on truth discovery,
SIGKDD Explor. Newsl. 17 (2016) 1–16.
[6] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018)
1146–1151.
[7] D. Küçük, F. Can, Stance detection: A survey, ACM Comput. Surv. 53 (2020).
[8] G. Da San Martino, S. Cresci, A. Barrón-Cedeño, S. Yu, R. D. Pietro, P. Nakov, A survey on
computational propaganda detection, in: Proceedings of the Twenty-Ninth International
Joint Conference on Artificial Intelligence, IJCAI-PRICAI ’20, 2020, pp. 4826–4832.
[9] K. Popat, S. Mukherjee, J. Strötgen, G. Weikum, CredEye: A credibility lens for analyzing
and explaining misinformation, in: Proceedings of the Web Conference, WWW ’18, 2018,
pp. 155–158.
[10] M. Hardalov, A. Arora, P. Nakov, I. Augenstein, A survey on stance detection for mis- and
disinformation identification, 2021.
[11] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph,
A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever
end-to-end fact-checking system, Proc. VLDB Endow. 10 (2017) 1945–1948.
[12] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S. Dietze,
K. Todorov, ClaimsKG: A knowledge graph of fact-checked claims, in: Proceedings of the
18th International Semantic Web Conference, ISWC ’19, Auckland, New Zealand, 2019, pp.
309–324.
[13] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a known lie: Detecting
previously fact-checked claims, in: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, ACL ’20, 2020, pp. 3607–3618.
[14] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. D. S. Martino, M. Hasanain, R. Suwaileh,
F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. S. Ali, Overview of CheckThat
2020: Automatic identification and verification of claims in social media, in: CLEF, 2020.
[15] M. Bouziane, H. Perrin, A. Cluzeau, J. Mardas, A. Sadeq, Team Buster.ai at CheckThat!
2020 insights and recommendations to improve fact-checking, in: CLEF, 2020.
[16] L. C. Passaro, A. Bondielli, A. Lenci, F. Marcelloni, UNIPI-NLE at CheckThat! 2020:
Approaching fact checking from a sentence similarity perspective through the lens of
transformers, in: CLEF, 2020.
[17] E. Thuma, N. Motlogelwa, T. Leburu-Dingalo, M. Mudongo, UB_ET at CheckThat! 2020:
Exploring ad hoc retrieval approaches in verified claims retrieval, in: CLEF, 2020.
[18] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos,
R. Kurzweil, Efficient natural language response suggestion for smart reply, ArXiv
1705.00652 (2017).
[19] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krish-
nan, Supervised contrastive learning, arXiv 2004.11362 (2020).
[20] O. Chapelle, Y. Chang, Yahoo! learning to rank challenge overview., Journal of Machine
Learning Research - Proceedings Track 14 (2011) 1–24.
[21] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond,
Found. Trends Inf. Retr. 3 (2009) 333–389.
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, ArXiv 1907.11692
(2019).
[23] K. Song, X. Tan, T. Qin, J. Lu, T. Liu, MPNet: Masked and permuted pre-training for
language understanding, arXiv 2004.09297 (2020).