<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SimBa at CheckThat! 2022: Lexical and Semantic Similarity Based Detection of Verified Claims in an Unsupervised and Supervised Way</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alica Hövelmeyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katarina Boland</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Dietze</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Unter Sachsenhausen 6-8, 50667 Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heinrich-Heine-Universität Düsseldorf (HHU)</institution>
          ,
          <addr-line>Universitätsstraße 1, 40225 Düsseldorf</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Starlink - here. Thanks, @elonmusk pic.twitter.com/dZbaYqWYCf - Mykhailo Fedorov</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>One step in many automated fact-checking pipelines is verified claim retrieval, i.e. checking whether a claim has been fact-checked before. We approach this task as a semantic textual similarity problem. For this, we examine the extent to which an input claim and a verified claim are similar at semantic, textual, lexical and referential levels using a variety of NLP tools. We rank similar pairs based on these features using a supervised and an unsupervised model. We participate in two subtasks and compare our results for subtask 2A: detecting previously fact-checked claims from tweets and subtask 2B: detecting previously fact-checked claims in political debates for English data. We find that the combination of semantic and lexical similarity features performs best in finding relevant claim pairs for both subtasks. Furthermore, our unsupervised method is on par with the supervised one and seems to generalize well over similar tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>fact-checking</kwd>
        <kwd>STS</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>lexical similarity</kwd>
        <kwd>sentence embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A key aspect of our approach is that we compare a supervised and an unsupervised method to rank the given
data by similarity and are able to propose an unsupervised method that is on par with supervised
approaches. Furthermore, there is evidence that our unsupervised method generalizes well over
similar tasks. The code for both subtasks is available on GitHub1.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        This submission is part of the 5th edition of the CheckThat! Lab. Previous editions, also held in
conjunction with the Conference and Labs of the Evaluation Forum (CLEF) and also featuring
the task of Detecting Previously Fact-Checked Claims / Claim Retrieval, took place in 2020 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and 2021 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The approaches proposed by the participants are similar to ours in various aspects.
      </p>
      <p>
        For the lab in 2020 the data to be processed exclusively consisted of tweets as input claims.
Many of the participants used pre-processing and cleaned the tweets, removing tweet-specific
characters like hashtags [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Some teams solely made use of lexical and string
similarity features[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], whereas other teams used pre-trained language models to evaluate
semantic similarity. These teams fine-tuned RoBERTa[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or used Sentence-BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Universal Sentence Encoder[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] in order to calculate the distances between sentence
embeddings. Different variations of blocking techniques were also used[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Similar to
our approach, some teams combined lexical and semantic similarity features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In 2021 all teams made use of the sentence embedding model Sentence-BERT. Team
NLytics[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] offered an unsupervised approach based on the distances of sentence
embeddings gained using Sentence-BERT. This approach performed well for only one of the proposed
subtasks.
      </p>
      <p>
        Team DIPS[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Team Aschern[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] made use of the combination of a semantic similarity
feature (also gained using the sentence embedding model Sentence-BERT) and a string (BM25
by Team DIPS) or lexical (TF-IDF by Team Aschern) similarity feature. Different from us, they
only presented supervised approaches to rank the data based on these features.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Definition</title>
      <sec id="sec-3-1">
        <title>3.1. Detection of previously fact-checked claims</title>
        <p>
          One of the tasks that arise in the broader context of automated fact-checking is to check whether
a claim has been fact-checked before. This can be considered the second step of a claim retrieval
and verification pipeline, after the detection of check-worthy claims in different kinds of textual
utterances and before the verification of those claims. This is addressed by task 2 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. More
precisely, the task is to rank the most relevant verified claims out of a collection of already
verified claims for a given input claim.
        </p>
        <p>
          1https://github.com/Alihoe/CLEFCheckThat2aSimBa, https://github.com/Alihoe/CLEFCheckThat2bSimBa
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data</title>
        <p>
          The subtasks cover two different types of media that are used to disseminate claims. Subtask A
deals with tweets, subtask B with political debates and speeches. Both types of text sequences
containing claim utterances will simply be referred to as input claims in the following. For both
tasks different kinds of already fact-checked claims are made available. These will be called
verified claims [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Both input claims and verified claims consist of one or a few coherent sentences.</p>
        <p>The input claims of subtask A are given as strings, divided into a training dataset of 1167
input claims, a development test dataset of 201 input claims and a final test dataset of 209 input
claims. A human-annotated mapping from every input claim to the most relevant verified claim
(query relevance or qrels-file) constitutes the gold standard. Verified claims are crawled from
the fact-checking website Snopes and are provided in JSON format containing title, subtitle,
author, date and a vclaim-entry with the content of the claim.</p>
        <p>The input claims of subtask B are also provided as strings, divided into a training dataset
of 702 input claims, a development test dataset of 79 input claims and a final test dataset of
65 input claims. Here, a human-annotated mapping from every input claim to one or more
relevant verified claims is given in addition to the training data and as a gold standard for the
test data. Furthermore, transcripts of the debates or speeches the input claims are obtained
from are given for the test data. 19250 verified claims are taken from the fact-checking website
PolitiFact and made available in JSON format containing the entries vclaim_id, vclaim, date,
truth_label, speaker, url, title and text.</p>
        <p>The mappings of input claims to verified claims will be referred to as input-ver-claim pairs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Similarity-Based Features</title>
      <sec id="sec-4-1">
        <title>4.1. Semantic Similarity</title>
        <p>
          The task is formulated as a ranking-problem, where input-ver-claim-pairs are ranked depending
on the relevance of the verified claim for fact-checking the input claim. Thus, the task can be
considered a semantic textual similarity problem (STS) where sentences are compared by their
semantic content to rank sentences containing similar claims highest (cf. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]).
        </p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Sentence Embeddings</title>
          <p>
            One promising way to deal with STS problems is the use of sentence embeddings. Sentence
embeddings are fixed-sized vector representations that capture the meaning of sentences insofar
as embeddings of semantically similar sentences are close in the corresponding vector space
(cf. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]). Sentence embedding models are usually trained on a huge amount of natural language
data or rely on models that are trained on such data. Thus they reflect the empirical distribution
of linguistic elements and can be viewed as an appropriate method to investigate semantic
similarity, because, according to the distributional hypothesis, "there is a correlation
between distributional similarity and meaning similarity"[
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
          <p>
            The usefulness of the application of sentence embeddings has already been demonstrated by
the participants of last year’s lab. The sentence embedding model Sentence-BERT [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] was used
by the top-ranked teams of both subtask A and subtask B [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Therefore, we use sentence embeddings as
starting points for different components of our approach.
          </p>
          <p>
            Sentence-BERT (SBERT) is a modification of the transformer-based pre-trained language
models BERT [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] or RoBERTa[
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] using a Siamese network structure. The language models
are trained on natural language inference (NLI) data and a pooling operation is added to their
outputs in order to derive fixed-sized vector representations of the input sentences.
          </p>
          <p>
            The idea of training on NLI data in a supervised way in order to get meaningful sentence
embeddings was introduced by the authors of the sentence embedding model InferSent[
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]
However, they did not build their model upon a transformer-based language model,
but on an encoder based on a bi-directional LSTM architecture fed with pre-trained word
embeddings (GloVe[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] or fastText[
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]).
          </p>
          <p>
            Similarly, the model Universal Sentence Encoder[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] (UniversalSE) averages together word
and bi-gram level embeddings, passes the representations through a feed-forward deep neural
network (DNN) and is trained on NLI data.
          </p>
          <p>
            The authors of SimCSE[
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] also train their model on NLI data, but within a
contrastive learning framework. Otherwise their model is similar to Sentence-BERT, relying on the
pre-trained language models BERT and RoBERTa and adding a pooling operation to one of their
output layers.
          </p>
          <p>All sentence embedding models are also able to encode small paragraphs instead of just
sentences.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Measuring Semantic Similarity Using Sentence Embeddings</title>
          <p>
            For all of these sentence embeddings methods, there are pre-trained models available that can
be used out of the box. For Sentence-BERT we used sentence-transformers/all-mpnet-base-v2,
because it performs best for STS tasks compared to the other pretrained models2. For InferSent
we experimented with both versions, but report here only on the results obtained using version
2, which works with fastText[
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], because we got better results than with the GloVe vocabulary
in preliminary experiments. For Universal Sentence Encoder we used the TF2.0 Saved Model (v4)3,
because this is the most widely used model available for Universal Sentence Encoder and for
SimCSE we used princeton-nlp/sup-simcse-roberta-large4, because this also performs best for
STS tasks compared to the other pretrained models 5.
          </p>
          <p>Since sentence embeddings are vector representations of sentences within the same vector
space, their similarity can be measured by applying cosine similarity (CosSim), resulting in
similarity scores which are rational numbers ∈ [-100, 100]. These similarity scores will be
referred to as SentEmb.</p>
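          <p>As an illustration, the following is a minimal sketch, assuming the pre-trained Sentence-BERT model named above, of how a single SentEmb score can be computed; it is not the exact code of our submission.</p>
          <preformat>
# Minimal sketch: one SentEmb score from Sentence-BERT, scaled to [-100, 100].
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def sent_emb_score(input_claim, verified_claim):
    """Cosine similarity of the two sentence embeddings, scaled by 100."""
    embeddings = model.encode([input_claim, verified_claim], convert_to_tensor=True)
    cos_sim = util.cos_sim(embeddings[0], embeddings[1]).item()  # in [-1, 1]
    return 100.0 * cos_sim
          </preformat>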
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Other Measures of Similarity</title>
        <p>In the following, other measures of similarity are presented. An overview of their corresponding
metrics can be found in Table 1.</p>
        <p>2https://www.sbert.net/docs/pretrained_models.html
3https://tfhub.dev/google/universal-sentence-encoder/4
4https://huggingface.co/princeton-nlp/sup-simcse-roberta-large
5https://github.com/princeton-nlp/SimCSE</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Overview of the similarity features, their score types and the corresponding metrics.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Feature</th><th>Score type</th><th>Metric</th></tr>
            </thead>
            <tbody>
              <tr><td>SBERT</td><td>SentEmb</td><td>∈ [-100, 100]</td></tr>
              <tr><td>InferSent</td><td>SentEmb</td><td>∈ [-100, 100]</td></tr>
              <tr><td>UniversalSE</td><td>SentEmb</td><td>∈ [-100, 100]</td></tr>
              <tr><td>SimCSE</td><td>SentEmb</td><td>∈ [-100, 100]</td></tr>
              <tr><td>LevDist</td><td>LevDist</td><td>∈ -Z</td></tr>
              <tr><td>SeqMat</td><td>StringSim</td><td>∈ [0, 1]</td></tr>
              <tr><td>JaccChar</td><td>StringSim</td><td>∈ [0, 1]</td></tr>
              <tr><td>JaccTok</td><td>StringSim</td><td>∈ [0, 1]</td></tr>
              <tr><td>WordCount</td><td>SimCount</td><td>∈ N</td></tr>
              <tr><td>WordRatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
              <tr><td>WordTokRatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
              <tr><td>SynCount</td><td>SimCount</td><td>∈ N</td></tr>
              <tr><td>SynRatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
              <tr><td>SynTokRatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
              <tr><td>NE</td><td>SimCount</td><td>∈ N</td></tr>
              <tr><td>NERatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
              <tr><td>NETokRatio</td><td>SimRatio</td><td>∈ [0, 100]</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-4-2-1">
          <title>4.2.1. String Similarity</title>
          <p>In addition to the study of semantic similarity using sentence embeddings, there are other ways
in which the similarity of sentences can be measured.</p>
          <p>The most naive approach to measure the similarity of two sentences is to compare them at
the string level, i.e. to see how far the characters and strings that make up a sentence differ
from those of other sentences. We used three different methods to measure the string similarity
of sentences: Levenshtein Distance, Jaccard Distance and Sequence Matching.</p>
          <p>Levenshtein Distance (LevDist) is a metric to measure the distance between two strings by
counting the number of operations (insertions, deletions or substitutions) needed to change
one string into the other. Sentences which are similar thus have a small Levenshtein Distance. In
order to adjust this distance score to the other similarity scores, such that a higher value signifies
a higher similarity, we multiplied the Levenshtein Distance by -1. In practice, we thereby get
negative three- or two-digit integers as similarity scores for almost all input-ver-claim pairs.</p>
          <p>In general, Jaccard Distance is used to measure the similarity of sets. It is computed by
dividing the size of the intersection by the size of the union of the sets. The closer this value is
to one, the more similar are the sets. In context of sentence-similarity it can be applied in two
ways: either regarding the characters (JaccChar) or the tokens (JaccTok) a sentence consists
of as elements of a set.</p>
          <p>The Sequence Matching algorithm (SeqMat) provided by the Python library difflib works by
comparing "the longest contiguous matching subsequence that contains no 'junk' elements"
and recursively repeating this on the remaining subsequences. Junk elements are determined
heuristically based on the frequency of their duplicates in the text sequence6.</p>
          <p>
            Both the application of Jaccard Distance and Sequence Matching generate rational numbers
∈ [0, 1]. These similarity scores will be referred to as StringSim.
          </p>
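          <p>A minimal sketch of the string similarity scores described above (LevDist, JaccChar, JaccTok and SeqMat) follows; the helper implementations are illustrative rather than the exact code used for our submission.</p>
          <preformat>
# Minimal sketch of the string similarity features.
from difflib import SequenceMatcher

def lev_dist(a, b):
    """Negated Levenshtein distance: higher (closer to 0) means more similar."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return -prev[-1]

def jaccard(x, y):
    """Size of the intersection divided by the size of the union of two sets."""
    union = set(x).union(y)
    return len(set(x).intersection(y)) / len(union) if union else 0.0

def jacc_char(a, b):          # JaccChar: characters as set elements
    return jaccard(a, b)

def jacc_tok(a, b):           # JaccTok: tokens as set elements
    return jaccard(a.split(), b.split())

def seq_mat(a, b):            # SeqMat: difflib's SequenceMatcher ratio
    return SequenceMatcher(None, a, b).ratio()
          </preformat>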
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Lexical Similarity</title>
          <p>Another type of similarity, which is not clearly distinguishable from semantic and string
similarity, is lexical similarity or similarity of words. We used one method to capture lexical
similarity between sentences and simply counted how often two claims contained the same
words.</p>
          <p>
            For this, we tokenized all claims using NLTK’s word tokenizer[
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], filtered out stop words
and counted how often two claims contained the same tokens (WordCount). In order to value
the number of equal tokens of shorter sentences higher than those of longer ones, we also
computed a normalized ratio. For this we divided 100 by the number of tokens of both claims
and multiplied the obtained value by two times the number of equal tokens.7 We did this both
including stop words (WordTokRatio) and not including them (WordRatio).
          </p>
          <p>Counting equal tokens, we obtained a positive integer similarity score, usually with fewer than
three digits. We call this kind of discrete score SimCount. Computing the ratios, we obtained
percentages ∈ [0, 100], similar to the SentEmb scores. This kind of score will be referred to as
SimRatio.</p>
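          <p>The following minimal sketch illustrates WordCount, WordRatio and WordTokRatio as described above; the exact tokenization and counting details of our submission may differ slightly.</p>
          <preformat>
# Minimal sketch of the lexical similarity features based on shared tokens.
# Requires the NLTK data packages "punkt" and "stopwords" (nltk.download).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def token_overlap(claim_a, claim_b, remove_stopwords):
    tokens_a = [t.lower() for t in word_tokenize(claim_a)]
    tokens_b = [t.lower() for t in word_tokenize(claim_b)]
    if remove_stopwords:
        tokens_a = [t for t in tokens_a if t not in STOP_WORDS]
        tokens_b = [t for t in tokens_b if t not in STOP_WORDS]
    shared = len(set(tokens_a).intersection(tokens_b))
    total = len(tokens_a) + len(tokens_b)
    # ratio: 100 divided by the number of tokens of both claims,
    # multiplied by two times the number of shared tokens
    ratio = (100.0 / total) * shared * 2 if total else 0.0
    return shared, ratio

def word_count(a, b):       # WordCount: shared tokens, stop words removed
    return token_overlap(a, b, remove_stopwords=True)[0]

def word_ratio(a, b):       # WordRatio: normalized, stop words removed
    return token_overlap(a, b, remove_stopwords=True)[1]

def word_tok_ratio(a, b):   # WordTokRatio: normalized, stop words included
    return token_overlap(a, b, remove_stopwords=False)[1]
          </preformat>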
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Referential Similarity</title>
          <p>Another way to think of similarity between sentences is to examine whether they refer to the
same objects. To represent this kind of similarity we used two methods. Similar to the lexical
similarity approach, we counted how often two claims contained words which are synonyms of
each other. Additionally, we counted how often two claims contain the same named entities
(NEs).</p>
          <p>
            To compare the synonyms, we used WordNet[
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] and looked for all available synsets the
tokens mentioned in a claim are part of. We tokenized the sentences the same way as above.
Then we counted how often two claims contained the same synsets (SynCount). Here we also
computed the ratio of the count of synonyms regarding all synonyms (SynRatio) and all tokens
(SynTokRatio) in the two sentences.
          </p>
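          <p>A minimal sketch of the synonym-based scores using WordNet via NLTK is given below; the exact counting and normalization used for our submission is simplified here.</p>
          <preformat>
# Minimal sketch of SynCount, SynRatio and SynTokRatio based on shared WordNet synsets.
# Requires the NLTK data packages "punkt" and "wordnet" (nltk.download).
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

def synsets_of(claim):
    """All WordNet synsets that any token of the claim belongs to."""
    synsets = set()
    for token in word_tokenize(claim):
        for synset in wordnet.synsets(token):
            synsets.add(synset.name())
    return synsets

def syn_count(a, b):          # SynCount: number of shared synsets
    return len(synsets_of(a).intersection(synsets_of(b)))

def syn_ratio(a, b):          # SynRatio: shared synsets relative to all synsets
    union = synsets_of(a).union(synsets_of(b))
    return 100.0 * syn_count(a, b) / len(union) if union else 0.0

def syn_tok_ratio(a, b):      # SynTokRatio: shared synsets relative to all tokens
    n_tokens = len(word_tokenize(a)) + len(word_tokenize(b))
    return 100.0 * syn_count(a, b) / n_tokens if n_tokens else 0.0
          </preformat>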
          <p>
            In order to compare NEs we used the entity-fishing system[
            <xref ref-type="bibr" rid="ref26">26</xref>
            ], which recognizes named
entities mentioned in a text and disambiguates them using Wikidata. The system is able to
return the Wikipedia and Wikidata identifiers of those mentions. We counted how often
two claims contained named entities related to the same Wikipedia or Wikidata entry (NE). We
additionally computed the ratio of the count of NEs regarding all NEs (NERatio) and all
tokens (NETokRatio) in the two sentences.
          </p>
          <p>Similarly to the lexical similarity scores, we obtained two different kinds of metrics for these
similarities: SimCount and SimRatio (see Table 1).</p>
          <p>6https://docs.python.org/3/library/difflib.html
7e.g.: If two claims consisted of ten tokens each and had ten tokens in common, we would obtain a
WordTokRatio of (100/20)*10*2 = 100. If they only had one token in common, the obtained ratio would be (100/20)*1*2 =
10. If both claims consisted of 50 tokens each, the obtained ratios would be (100/100)*10*2 = 20 and (100/100)*1*2 = 2.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Pre-Processing</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Cleaning tweets</title>
          <p>For both subtasks we experimented with different ways of pre-processing the input claims. We
cleaned the tweets given in subtask 2A to get rid of redundant information. We removed URLs,
@-symbols and user information (see Table 2).</p>
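          <p>A minimal, regex-based sketch of this cleaning step is shown below; the exact rules of our submission may differ.</p>
          <preformat>
# Minimal sketch of the tweet cleaning (URLs, @-symbols, user information).
import re

def clean_tweet(tweet):
    tweet = re.sub(r"https?://\S+|pic\.twitter\.com/\S+", "", tweet)  # remove URLs
    tweet = tweet.replace("@", "")                                    # remove @-symbols
    # user information (e.g. a trailing "- author" signature) would be
    # stripped here as well; the exact rule is not spelled out in the text
    return re.sub(r"\s+", " ", tweet).strip()
          </preformat>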
          <table-wrap id="tab2">
            <label>Table 2</label>
            <caption>
              <p>Example of a cleaned tweet.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Original Tweet</th><th>Cleaned Tweet</th></tr>
              </thead>
              <tbody>
                <tr>
                  <td>Starlink - here. Thanks, @elonmusk pic.twitter.com/dZbaYqWYCf - Mykhailo Fedorov</td>
                  <td>Starlink - here. Thanks, elonmusk</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Including context</title>
          <p>For subtask 2B, we tried incorporating the input claims’ contexts within the speech or debate
they were obtained from. We included the lines that were spoken before and after the relevant
claim and integrated information about the current speaker by prepending "speaker X said" to
the line of speech, where X is substituted by the name of the respective speaker (see Table 3).</p>
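          <p>A minimal sketch of this contextualization, assuming the transcript is available as a list of (speaker, line) pairs:</p>
          <preformat>
# Minimal sketch: prepend "speaker X said" and include the neighbouring lines.
def contextualize(transcript, idx):
    """Return the claim at position idx together with the line before and after it."""
    parts = []
    for i in range(max(0, idx - 1), min(len(transcript), idx + 2)):
        speaker, line = transcript[i]
        parts.append(f'{speaker} said "{line}"')
    return " ".join(parts)
          </preformat>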
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Model</title>
      <sec id="sec-5-1">
        <title>5.1. Unsupervised Approach</title>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Example of a contextualized input claim for subtask 2B.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Contextualized Input Claim</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>donald trump said "And Obama would send pillows and sheets." donald trump said "He wouldn’t send anything else." donald trump said "It’s the whole thing."</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
We tried out an unsupervised and a supervised method to utilize the information we gained
on the different kinds of similarity. The main idea of the unsupervised approach is to rank the
input-ver-claim pairs by the different similarity scores described above. For this, a general
similarity score is computed, combining the varying metrics (see Table 1). This general score
can roughly be compared to the percentage to which two sentences are similar, where two
exactly equal sentences would have a score of roughly 100. However, our way of combining the
different similarity scores does not ensure that the resulting score is smaller than 100. It can
sometimes be slightly higher.</p>
        <p>The general similarity score is computed in the following way (see the sketch below):
• taking the mean of all SentEmb-, SimRatio- and StringSim-scores normalized to [0, 100]
• incorporating the LevDist: first, the LevDist is divided by -100, which generates a positive
factor that is smaller the more similar two sentences are; then the similarity score
obtained by computing the mean is divided by this factor.8
• adding the SimCount-scores to the obtained score</p>
        <p>For the output, the five most similar verified claims for each input claim are determined based
on the general similarity score.</p>
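        <p>The following minimal sketch summarizes this combination and the final ranking; the normalization of the individual scores to [0, 100] is simplified (StringSim scores are multiplied by 100, the other scores are used as they are).</p>
        <preformat>
# Minimal sketch of the general similarity score and the top-5 ranking.
import numpy as np

def general_similarity(sent_emb, sim_ratio, string_sim, lev_dist, sim_count):
    """sent_emb, sim_ratio, string_sim, sim_count: lists of scores; lev_dist: negative integer."""
    normalized = list(sent_emb) + list(sim_ratio) + [100.0 * s for s in string_sim]
    score = float(np.mean(normalized))
    score /= lev_dist / -100.0   # factor below 1 for very similar strings boosts the score
    return score + sum(sim_count)

def top_five(scores_for_input_claim):
    """scores_for_input_claim: dict mapping verified claim ids to general similarity scores."""
    ranked = sorted(scores_for_input_claim.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:5]
        </preformat>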
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Supervised Approach</title>
        <p>For the supervised approach we built a feature set out of the different similarity scores in order
to classify whether a verified claim is relevant for an input claim. We experimented with different
methods to optimize our classification results. We used Blocking and Balancing in order to
optimize our training results. Additionally, we tried out different Classifiers and applied Feature
Selection to further improve our output. Lastly, we also made use of a heuristic based on our
unsupervised approach to find relevant verified claims for all input claims.</p>
        <p>To optimize the training, we used a Blocking approach. Instead of generating negative training
instances by pairing each input claim with all but the true matching verified claims in the dataset,
we computed the 50 most similar verified claims according to each of the four SentEmb scores
and generated negative training instances using only those. More specifically, we extracted four
sets of input-ver-claim pairs, one set for each SentEmb method, with each set containing the
50 most similar verified claims identified by this method. Then we used the union of these
sets as our final training set. We observed that all true input-ver-claim pairs were covered. Besides
the computational advantage of a smaller training set, this way the model may better learn to
distinguish cases that are similar on the surface, as all very dissimilar pairs have been filtered
out before training.</p>
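        <p>A minimal sketch of this blocking step, assuming one similarity matrix per SentEmb method:</p>
        <preformat>
# Minimal sketch: keep the 50 most similar verified claims per input claim and
# per sentence embedding method, and train on the union of these candidate pairs.
import numpy as np

def block_candidates(similarity_matrices, k=50):
    """similarity_matrices: dict of {method: (n_input x n_verified) score matrix}."""
    candidates = set()   # pairs of (input claim index, verified claim index)
    for scores in similarity_matrices.values():
        top_k = np.argsort(-scores, axis=1)[:, :k]
        for input_idx, verified_ids in enumerate(top_k):
            candidates.update((input_idx, int(j)) for j in verified_ids)
    return candidates
        </preformat>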
        <p>Then all similarity scores (including the SentEmb scores) were added as features. As targets we
obtained the relevance scores from the qrels-file of the training data. An unlabeled feature set
was built for the test data.</p>
        <p>After blocking, the percentage of true positives in our training data was still below 1%
for both subtasks. That is why we applied Random Undersampling as a Balancing method and
experimented with different parameters (see Tables 4 and 5).</p>
        <p>Then a Classifier was trained on the training data to predict relevance scores for the test data.
We experimented with different classifiers suited for binary classification, such as KNN,
Logistic Regression, Linear SVC and a Decision Tree (see Tables 6 and 7).</p>
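        <p>A minimal sketch of the balancing and classification step with scikit-learn and imbalanced-learn follows; the hyperparameters shown are illustrative, not the tuned values of our submission.</p>
        <preformat>
# Minimal sketch: Random Undersampling followed by a Logistic Regression classifier.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

def train_classifier(X_train, y_train, positive_share=0.08):
    # sampling_strategy is the ratio of positives to negatives after resampling
    sampler = RandomUnderSampler(sampling_strategy=positive_share / (1 - positive_share))
    X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_resampled, y_resampled)
    return classifier
        </preformat>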
        <p>We experimented with different selections of features out of the similarity features presented
above. The influence of the ensemble of features is shown in Tables 13 and 14. Additionally, we
included the feature TokenCount, which represents the sum of tokens of both input claim and
verified claim.</p>
        <p>8e.g.: Given is a SentEmb mean of 50.0. If two sentences consist of quite similar strings, one could imagine
them having a LevDist of -50. If two sentences are not that similar, they could have a LevDist of -200. Applying the
technique described, incorporating LevDist would result in the similarity score 100 for the similar sentences and 25 for the
varying sentences. This way it is not ensured that the obtained similarity score is ∈ [0, 100]. In practice, however,
the calculated values are in this range.</p>
        <p>If no relevant verified claim was predicted for an input claim, we relied on our unsupervised
approach heuristically and chose the five most similar verified claims based on the mean of
sentence embedding similarity scores. For 2A we chose SBERT, InferSent and SimCSE as SentEmb
scores, for 2B all four models, including UniversalSE.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Evaluation Metric</title>
        <p>
          The task is considered a ranking task and is evaluated as such. The official ranking evaluation
measure is Mean Average Precision at 5 (MAP@5). Additionally, the provided scorer computes
the measures MAP@k for k = 1, 3, 5, 10, MRR and Precision@k for k = 3, 5, 10 (cf. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]). The
MAP@k metric measures the mean average precision of the correctly retrieved pairs in the top k of the
returned output. MRR, or Mean Reciprocal Rank, measures how far the assigned rank of a correct pair
differs from its ideal rank (i.e. the first rank for subtask A) on average, by averaging the reciprocal of the assigned rank.
        </p>
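        <p>For reference, a minimal sketch of the two main metrics for the simple case of exactly one relevant verified claim per input claim (as in subtask A):</p>
        <preformat>
# Minimal sketch of MAP@k and MRR for a single relevant verified claim per input claim.
def average_precision_at_k(ranked_ids, relevant_id, k=5):
    for rank, verified_id in enumerate(ranked_ids[:k], start=1):
        if verified_id == relevant_id:
            return 1.0 / rank
    return 0.0

def map_at_k(rankings, gold, k=5):
    """rankings: {input id: ranked verified ids}; gold: {input id: relevant verified id}."""
    return sum(average_precision_at_k(rankings[i], gold[i], k) for i in gold) / len(gold)

def mrr(rankings, gold):
    total = 0.0
    for i, relevant_id in gold.items():
        ranks = [r for r, v in enumerate(rankings[i], start=1) if v == relevant_id]
        total += 1.0 / ranks[0] if ranks else 0.0
    return total / len(gold)
        </preformat>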
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Subtask 2A</title>
        <p>For Subtask 2A we got the best result with our unsupervised approach, combining the similarity
scores of SBERT, SimCSE, WordCount and WordTokRatio with a MAP@5 of 0.9175 (see Table
13).</p>
        <p>However, the output we submitted made use of SBERT, SimCSE and WordCount and scored
slightly worse (0.9075) (see Table 8). We still achieved a score above the baselines utilizing a
simple and fast unsupervised ranking method.
</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Subtask 2B</title>
        <p>For Subtask 2B we got the best results using a supervised approach. All similarity features were
included except for JaccChar (see Table 14). We made use of Random Undersampling to
increase the percentage of positives in the training data (relevant input-ver-claim pairs) to 8%.
Then a Logistic Regression Classifier was trained and predicted 111 input-ver-claim pairs.
The unsupervised heuristic described above was used to find relevant verified claims for the
remaining input claims. This way the output achieved a MAP@5 of 0.4882.</p>
        <p>The output we submitted scored slightly worse than our best result, with a MAP@5 of
0.459. To generate this output we used Linear Support Vector Classification and sampled to
14% positives. The considered features were SimCSE, JaccTok, WordCount, WordRatio, SynCount
and SynRatio. This is nevertheless the top-ranked result for subtask 2B (see Table 9).</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Result of Pre-Processing</title>
        <p>It turned out that our pre-processing approach did not improve our results on the test data
for Subtask 2A (see Table 10), although it did for the development test data. This is an issue
worth investigating in future work. Tweet-specific units of text such as user information were
removed, and it turned out that it would have been useful to incorporate this kind of information
for solving task 2A. Nevertheless, the pre-processing ensured that the data of both tasks was
more similar and thereby helped in assessing the similarity of claims in general contexts.</p>
        <p>The incorporation of context for subtask 2B did not improve the results on either the
development test data or the final test data. That is why we used the original data for subtask
2B.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Observations</title>
      <sec id="sec-7-1">
        <title>7.1. Evaluation of Features</title>
        <sec id="sec-7-1-1">
          <title>7.1.1. Powerful Features for Subtask A and Subtask B</title>
          <p>The results of using the supervised approach on single features (see Table
11) give a good overview of their independent performance. As expected, the most successful
features for both subtasks are the cosine similarities of the sentence embeddings. Especially
SBERT, UniversalSE and SimCSE performed best on both tasks. This is because, as explained
above, sentence embeddings are well suited to capture semantic textual similarity.</p>
          <p>Interestingly, SBERT is the most powerful feature for Subtask 2A and SimCSE the most
powerful one for Subtask 2B. It would be worth further investigation to identify the reason for
this difference. Both models are pre-trained on a large share of the same data, so maybe the
contrastive training objective of SimCSE is partly responsible for it.</p>
          <p>Another important observation is the fact that the lexical similarity features WordCount,
WordRatio and WordTokRatio also perform really well for both tasks. This is somewhat surprising,
because these features are generated in such a simple way.</p>
          <p>In contrast, the Jaccard similarity of characters, JaccChar, is the weakest similarity feature.
This can be explained by the fact that the consideration of equal characters, regardless of their
order, does not carry much information about the meaning of a sentence as a whole.</p>
          <p>One interesting finding regarding the differences between the subtasks is the varying
performance of the string similarity features. The string similarity features LevDist and SeqMat are the
only features that produce a higher MAP@5 for Subtask 2B than for Subtask 2A. Looking at the
data, it is noticeable that the input claims and the verified claims provided for Subtask 2B often
share long, continuous strings (see Table 12).</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>7.1.2. Feature Set</title>
          <p>One of the most intriguing observations is the fact that both the unsupervised and the supervised
approach perform best if lexical similarity is considered besides semantic similarity (see Tables
13 and 14). The SentEmb features do not seem to cover lexical similarity, and their performance
benefits from the additional information contained in the lexical similarity features. This is also
supported by the observation that these two types of features do not have a strong correlation
(see Tables 15 and 16).</p>
          <p>It can also be observed that, especially for subtask B, it is helpful to consider the combination
of almost all similarity features in the supervised approach (see Table 14).</p>
          <p>Overall, a higher number of features mostly increases the performance of the supervised
approach and decreases it for the unsupervised approach, as relatively uninformative features
have too high an impact on the latter.</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Supervised vs Unsupervised Approach</title>
        <p>One important observation with respect to our results is the fact that the unsupervised approach
performs nearly as well as the supervised approach for subtask B and even better than the
supervised approach for subtask A.</p>
        <p>Since the task is a ranking problem, the unsupervised approach seems to perform sufficiently
well for the given task. For similar tasks with the constraint to only find pairs that are relevant
with a high certainty, the supervised approach might be more helpful.</p>
        <p>Also, it is reasonable to assume that the unsupervised approach generalizes well over similar
tasks, because it is independent of the training data. This assumption is supported by the fact
that the features that produce the best outputs are almost the same for both subtask A and
subtask B for the unsupervised approach (see Tables 13 and 14), while the supervised approach
relies on different features for the subtasks to produce good outputs.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>It would be interesting to investigate the generalizability of our approach and to check whether the
assumption that the unsupervised approach generalizes better than the supervised approach
holds. Also, a detailed assessment of the impact of pre-processing would be beneficial for related
work.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>We treated the task of detecting previously fact-checked claims as an STS task. To solve it, we
investigated different kinds of similarity measures between sentences, covering semantic, lexical
and referential similarity. We found that it is beneficial to combine semantic similarity measures,
gained by calculating the distance of sentence embeddings, with lexical similarity measures,
gained by counting shared words. Furthermore, we found that an unsupervised approach can be
even more successful than a supervised approach for this task. Overall, our proposed approaches
provide very good results for both subtasks, with a MAP@5 of 0.907 for subtask A and a MAP@5
of 0.459 for subtask B, both scoring above the baselines and even being the top-ranked output
for subtask B.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix</title>
      <p>Tables 15 and 16 report the correlations between all similarity features (SBERT, InferSent,
UniversalSE, SimCSE, LevDist, JaccChar, JaccTok, SeqMat, WordCount, WordRatio, WordTokRatio,
SynCount, SynRatio, SynTokRatio, NE, NERatio, NETokRatio).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] P. Nakov, G. Da San Martino, F. Alam, S. Shaar, H. Mubarak, N. Babulkov, Overview of the CLEF-2022 CheckThat! lab task 2 on detecting previously fact-checked claims, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, G. D. S. Martino, P. Nakov, Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF (Working Notes), volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_265.pdf.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. Da San Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws.org/Vol-2936/paper-29.pdf.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Bouziane, H. Perrin, A. Cluzeau, J. Mardas, A. Sadeq, Team Buster.ai at CheckThat! 2020: Insights and recommendations to improve fact-checking, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_134.pdf.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] E. Thuma, N. P. Motlogelwa, T. Leburu-Dingalo, M. Mudongo, UB_ET at CheckThat! 2020: Exploring ad hoc retrieval approaches in verified claims retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_204.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. McDonald, Z. Dong, Y. Zhang, R. Hampson, J. Young, Q. Cao, J. L. Leidner, M. Stevenson, The University of Sheffield at CheckThat! 2020: Claim identification and verification on Twitter, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_162.pdf.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. S. Cheema, S. Hakimov, R. Ewerth, Check_square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_216.pdf.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. doi:10.48550/ARXIV.1907.11692.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019. arXiv:1908.10084.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] L. C. Passaro, A. Bondielli, A. Lenci, F. Marcelloni, UNIPI-NLE at CheckThat! 2020: Approaching fact checking from a sentence similarity perspective through the lens of transformers, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_169.pdf.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] U. Shukla, A. Sharma, TIET at CLEF CheckThat! 2020: Verified claim retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_197.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>St. John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          , M. Guajardo-Cespedes, S. Yuan,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          , Universal sentence encoder, CoRR abs/1803.11175 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinez-Rico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinez-Romo</surname>
          </string-name>
          ,
          <article-title>NLP&amp;IR@UNED at CheckThat! 2020: A preliminary approach for check-worthiness and claim retrieval tasks using neural networks and graphs</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzkau</surname>
          </string-name>
          , NLytics at CheckThat! 2021:
          <article-title>Detecting previously fact-checked claims by measuring semantic similarity</article-title>
          , in: Working Notes of CLEF 2021-
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CLEF '
          <year>2021</year>
          , Bucharest, Romania (online),
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2936/paper-47.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mihaylova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Borisova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chemishanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hadzhitsanev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , DIPS at CheckThat! 2021:
          <article-title>Verified claim retrieval</article-title>
          , in: Working Notes of CLEF 2021-
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CLEF '
          <year>2021</year>
          , Bucharest, Romania (online),
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2936/paper-45.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chernyavskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ilvovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Aschern at CheckThat! 2021:
          <article-title>Lambda-calculus of fact-checked claims</article-title>
          , in: Working Notes of CLEF 2021-
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CLEF '
          <year>2021</year>
          , Bucharest, Romania (online),
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2936/paper-38.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <article-title>SemEval-2012 task 6: A pilot on semantic textual similarity</article-title>
          ,
          <source>in: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval</source>
          <year>2012</year>
          ),
          <source>Association for Computational Linguistics</source>
          , Montréal, Canada,
          <year>2012</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>393</lpage>
          . URL: https://aclanthology.org/S12-1051.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1705.02364. doi:10.48550/ARXIV.1705.02364.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <article-title>The distributional hypothesis</article-title>
          ,
          <source>The Italian Journal of Linguistics</source>
          <volume>20</volume>
          (
          <year>2008</year>
          )
          <fpage>33</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , GloVe:
          <article-title>Global vectors for word representation</article-title>
          ,
          <source>in: Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . URL: http://www.aclweb.org/anthology/D14-1162.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          , D. Chen,
          <article-title>SimCSE: Simple contrastive learning of sentence embeddings</article-title>
          ,
          <source>in: Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          ,
          <article-title>Natural language processing with Python: analyzing text with the natural language toolkit</article-title>
          ,
          <source>"O'Reilly Media, Inc."</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An Electronic Lexical Database, Bradford Books</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , entity-fishing, https://github.com/kermitt2/entity-fishing, 2016-2022. swh:1:dir:cb0ba3379413db12b0018b7c3af8d0d2d864139c.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>