=Paper=
{{Paper
|id=Vol-3180/paper-40
|storemode=property
|title=SimBa at CheckThat! 2022: Lexical and Semantic Similarity Based Detection of Verified
Claims in an Unsupervised and Supervised Way
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-40.pdf
|volume=Vol-3180
|authors=Alica Hövelmeyer,Katarina Boland,Stefan Dietze
|dblpUrl=https://dblp.org/rec/conf/clef/HovelmeyerBD22
}}
==SimBa at CheckThat! 2022: Lexical and Semantic Similarity Based Detection of Verified
Claims in an Unsupervised and Supervised Way==
SimBa at CheckThat! 2022: Lexical and Semantic Similarity Based Detection of Verified Claims in an Unsupervised and Supervised Way

Alica Hövelmeyer1,2, Katarina Boland1,2 and Stefan Dietze1,2
1 Heinrich-Heine-Universität Düsseldorf (HHU), Universitätsstraße 1, 40225 Düsseldorf, Germany
2 GESIS - Leibniz Institute for the Social Sciences, Unter Sachsenhausen 6-8, 50667 Cologne, Germany

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
alica.hoevelmeyer@hhu.de (A. Hövelmeyer); katarina.boland@hhu.de (K. Boland); stefan.dietze@hhu.de (S. Dietze)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
One step in many automated fact-checking pipelines is verified claim retrieval, i.e. checking whether a claim has been fact-checked before. We approach this task as a semantic textual similarity problem. For this, we examine the extent to which an input claim and a verified claim are similar at the semantic, textual, lexical and referential levels using a variety of NLP tools. We rank similar pairs based on these features using a supervised and an unsupervised model. We participate in two subtasks and compare our results for subtask 2A (detecting previously fact-checked claims from tweets) and subtask 2B (detecting previously fact-checked claims in political debates) for English data. We find that the combination of semantic and lexical similarity features performs best in finding relevant claim pairs for both subtasks. Furthermore, our unsupervised method is on par with the supervised one and seems to generalize well to similar tasks.

Keywords
fact-checking, STS, semantic similarity, lexical similarity, sentence embeddings

1. Introduction
The dissemination of true or false information through traditional channels, such as political speeches, or channels that have emerged in recent years, such as social media, is a powerful tool for shaping public opinion. The analysis of claims made online or by public speakers is therefore a popular field of research, and the CLEF CheckThat! Lab [1] contributes to it by offering related shared tasks. This paper reports on our submission for Task 2: Detecting previously fact-checked claims for the English language. We approach this task as a semantic textual similarity problem that we solve by combining different kinds of similarity features. We want to build on the success of sentence embedding models by trying out different models, their combinations and different ways of weighting them. In addition, we also want to contribute to a better understanding of sentence embeddings by investigating which kinds of similarities between sentences they capture and how their performance can be improved by adding complementary information. The combination of lexical and semantic similarity features proves to be particularly helpful. The great strength of our approach is that we compare a supervised and an unsupervised method to rank the given data by similarity and are able to propose an unsupervised method that is on par with supervised approaches. Furthermore, there is evidence that our unsupervised method generalizes well to similar tasks. The code for both subtasks is available on GitHub1.

2. Related Work
This submission is part of the 5th edition of the CheckThat! Lab.
Previous editions, also held in conjunction with the Conference and Labs of the Evaluation Forum (CLEF) that also featured the task of Detecting Previously Fact-Checked Claims / Claim Retrieval, took place in 2020[2] and 2021[3]. The approaches proposed by the participants are similar to ours in various aspects. For the lab in 2020 the data to be processed exclusively consisted of tweets as input claims. Many of the participants used pre-processing and cleaned the tweets, removing tweet-specific characters like hashtags [4] [5] [6] [7]. Some teams solely made use of lexical and string similarity features[5] [6], whereas other teams used pre-trained language models to evaluate semantic similarity. These teams fine-tuned RoBERTa[8][4] or used Sentence-BERT [9] [10] [11] [7] or Universal Sentence Encoder[12][13] in order to calculate the distances between sentence embeddings. Different variations of Blocking-techniques were also used[10][5] [7]. Similar to our approach, some teams combined lexical and semantic similarity features [4] [13] [11]. In 2021 all teams made use of the sentence embedding model Sentence-BERT. Team NLytics[14] offered an unsupervised approach based on the distances of sentence embed- dings gained using Sentence-BERT. This approach performed well for only one of the proposed subtasks. Team DIPS[15] and Team Aschern[16] made use of the combination of a semantic similarity feature (also gained using the sentence embedding model Sentence-BERT) and a string (BM25 by Team DIPS) or lexical (TF.IDF by Team Aschern) similarity feature. Different from us, they only presented supervised approaches to rank the data based on these features. 3. Task Definition 3.1. Detection of previously fact-checked claims One of the tasks that arise in the broader context of automated fact-checking is to check whether a claim has been fact-checked before. This can be considered the second step of a claim retrieval and verification pipeline, after the detection of check-worthy claims in different kinds of textual utterances and before the verification of those claims. This is addressed by task 2 [3]. More precisely, the task is to rank the most relevant verified claims out of a collection of already verified claims for a given input claim. 1 https://github.com/Alihoe/CLEFCheckThat2aSimBa, https://github.com/Alihoe/CLEFCheckThat2bSimBa 3.2. Data The subtasks cover two different types of media that are used to disseminate claims. Subtask A deals with tweets, subtask B with political debates and speeches. Both types of text sequences containing claim utterances will simply be referred to as input claims in the following. For both tasks different kinds of already fact-checked claims are made available. These will be called verified claims.[1] Both input claims and verified claims consist of one or a few coherent sentences. The input claims of subtask A are given as strings, divided into a training dataset of 1167 input claims, a development test dataset of 201 input claims and a final test dataset of 209 input claims. A human-annotated mapping from every input claim to the most relevant verified claim (query relevance or qrels-file) constitutes the gold standard. Verified claims are crawled from the fact-checking website Snopes and are provided in JSON format containing title, subtitle, author, date and a vclaim-entry with the content of the claim. 
The input claims of subtask B are also provided as strings, divided into a training dataset of 702 input claims, a development test dataset of 79 input claims and a final test dataset of 65 input claims. Here, a human-annotated mapping from every input claim to one or more relevant verified claims is given in addition to the training data and as a gold standard for the test data. Furthermore, transcripts of the debates or speeches the input claims are obtained from are given for the test data. 19250 verified claims are taken from the fact-checking website PolitiFact and made available in JSON format containing the entries vclaim_id, vclaim, date, truth_label, speaker, url, title and text. The mappings of input claims to verified claims will be referred to as input-ver-claim pairs. 4. Similarity-Based Features 4.1. Semantic Similarity The task is formulated as a ranking-problem, where input-ver-claim-pairs are ranked depending on the relevance of the verified claim for fact-checking the input claim. Thus, the task can be considered a semantic textual similarity problem (STS) where sentences are compared by their semantic content to rank sentences containing similar claims highest (cf. [17]). 4.1.1. Sentence Embeddings One promising way to deal with STS-problems is the usage of sentence embeddings. Sentence embeddings are fixed-sized vector representations that capture the meaning of sentences in so far that embeddings of semantically similar sentences are close in the corresponding vector space (cf. [18]). Sentence embedding models are usually trained on a huge amount of natural language data or rely on models that are trained on such. Thus they reflect the empirical distribution of linguistic elements and can be viewed as an appropriate method to investigate semantic similarity. That’s because relying on the distributional hypothesis, "there is a correlation between distributional similarity and meaning similarity"[19]. The usefulness of the application of sentence embeddings has already been demonstrated by the participants of last year’s lab. The sentence embedding model Sentence-BERT [9] was used by the top-ranked teams of both subtask A and subtask B [16] [15]. Therefore, we use them as starting points for different components of our application. Sentence-BERT (SBERT) is a modification of the transformer-based pre-trained language models BERT [20] or RoBERTa[8] using a Siamese network structure. The language models are trained on natural language inference (NLI) data and a pooling operation is added to their outputs in order to derive fixed-sized vector representations of the input sentences. The idea of training on NLI data in a supervised way in order to get meaningful sentence embeddings was introduced by the authors of the sentence embedding model InferSent[18] (InferSent). However they did not build their model upon a tranformer-based language model, but on an encoder based on a bi-directional LSTM architecture fed with pre-trained word embeddings (GloVe[21] or fastText[22]). Similarly, the model Universal Sentence Encoder[12] (UniversalSE) averages together word and bi-gram level embeddings, passes the representations through a feed-forward deep neural network (DNN) and is trained on NLI data. The authors of SimCSE[23](SimCSE) also train their model on NLI data, but within a con- trastive learning framework. Otherwise their model is similar to Sentence-BERT, relying on the pre-trained language models BERT and RoBERTa and adding a pooling operation to one of their output layers. 
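To illustrate how such models are used for this task, the following minimal sketch encodes an input claim and two candidate verified claims with the sentence-transformers library and ranks the candidates by cosine similarity, scaled to [-100, 100] as in the SentEmb scores described in the next subsection (4.1.2). The model name is the SBERT model named there; the example claims are invented for illustration, and this is a sketch rather than the authors' actual code.

```python
# Minimal sketch: ranking verified claims by sentence embedding similarity.
# Example claims are hypothetical; only the model name and the cosine scoring
# follow the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

input_claim = "The president said the vaccine will be available to everyone by May."
verified_claims = [
    "Every adult will be able to get vaccinated by the end of May.",
    "The Eiffel Tower was sold for scrap metal in 1925.",
]

# Encode the input claim and the candidate verified claims as fixed-size vectors.
input_emb = model.encode(input_claim, convert_to_tensor=True)
verified_embs = model.encode(verified_claims, convert_to_tensor=True)

# Cosine similarity lies in [-1, 1]; scaling by 100 gives SentEmb-style scores.
scores = util.cos_sim(input_emb, verified_embs)[0] * 100

# Rank the candidates by their SentEmb score (higher means more similar).
for claim, score in sorted(zip(verified_claims, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:7.2f}  {claim}")
```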
All sentence embedding models are also able to encode small paragraphs instead of just sentences.

4.1.2. Measuring Semantic Similarity Using Sentence Embeddings

For all of these sentence embedding methods, pre-trained models are available that can be used out of the box. For Sentence-BERT we used sentence-transformers/all-mpnet-base-v2, because it performs best for STS tasks compared to the other pretrained models2. For InferSent we experimented with both versions, but report here only on the results obtained using version 2, which works with fastText[22], because it gave better results than the GloVe vocabulary in preliminary experiments. For Universal Sentence Encoder we used the TF2.0 Saved Model (v4)3, because this is the most widely used model available for Universal Sentence Encoder. For SimCSE we used princeton-nlp/sup-simcse-roberta-large4, because it also performs best for STS tasks compared to the other pretrained models5.

Since sentence embeddings are vector representations of sentences within the same vector space, their similarity can be measured by applying cosine similarity (CosSim), resulting in similarity scores which are rational numbers ∈ [-100, 100]. These similarity scores will be referred to as SentEmb.

4.2. Other Measures of Similarity

In the following, other measures of similarity are presented. An overview of their corresponding metrics can be found in Table 1.

2 https://www.sbert.net/docs/pretrained_models.html
3 https://tfhub.dev/google/universal-sentence-encoder/4
4 https://huggingface.co/princeton-nlp/sup-simcse-roberta-large
5 https://github.com/princeton-nlp/SimCSE

Table 1
Metrics of similarity features.

Kind of Similarity       Group      Feature                                Metric
Semantic Similarity      SentEmb    SBERT, InferSent, UniversalSE, SimCSE  ∈ [-100, 100]
String Similarity        LevDist    LevDist                                ∈ −Z
                         StringSim  SeqMat, JaccChar, JaccTok              ∈ [0, 1]
Lexical Similarity       SimCount   WordCount                              ∈ N
                         SimRatio   WordRatio, WordTokRatio                ∈ [0, 100]
Referential Similarity   SimCount   SynCount                               ∈ N
                         SimRatio   SynRatio, SynTokRatio                  ∈ [0, 100]
                         SimCount   NE                                     ∈ N
                         SimRatio   NERatio, NETokRatio                    ∈ [0, 100]

4.2.1. String Similarity

In addition to the study of semantic similarity using sentence embeddings, there are other ways in which the similarity of sentences can be measured. The most naive approach is to compare two sentences at the string level, i.e. to see how far the characters and strings that make up one sentence differ from those of the other. We used three different methods to measure the string similarity of sentences: Levenshtein Distance, Jaccard Distance and Sequence Matching.

Levenshtein Distance (LevDist) is a metric that measures the distance between two strings by counting the number of operations (insertions, deletions or substitutions) needed to change one string into the other. Sentences which are similar thus have a small Levenshtein Distance. In order to align this distance score with the other similarity scores, such that a higher value signifies a higher similarity, we multiplied the Levenshtein Distance by -1. In practice, we thereby get negative three- or two-digit integers as similarity scores for almost all input-ver-claim pairs.

In general, the Jaccard Distance is used to measure the similarity of sets. It is computed by dividing the size of the intersection by the size of the union of the sets. The closer this value is to one, the more similar the sets are.
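To make these string measures concrete, here is a minimal sketch of how LevDist and a Jaccard-style set score could be computed. The use of NLTK's edit_distance and the helper names are assumptions made for illustration; the paper does not state which implementations were actually used.

```python
# Minimal sketch of the string similarity scores (assumed implementations,
# not the authors' actual code).
from nltk.metrics.distance import edit_distance  # Levenshtein distance

def lev_dist(a: str, b: str) -> int:
    # LevDist: Levenshtein distance multiplied by -1, so that a higher
    # (less negative) value means more similar strings.
    return -edit_distance(a, b)

def jaccard(set_a: set, set_b: set) -> float:
    # Jaccard score of two sets: |intersection| / |union|;
    # values close to 1 indicate very similar sets.
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

print(lev_dist("fact-checked claim", "fact checked claims"))  # -2
print(jaccard(set("claim"), set("claims")))                   # characters as set elements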
In the context of sentence similarity, the Jaccard Distance can be applied in two ways: either regarding the characters (JaccChar) or the tokens (JaccTok) a sentence consists of as the elements of a set.

The Sequence Matching algorithm (SeqMat) provided by the Python library difflib works by finding "the longest contiguous matching subsequence that contains no 'junk' elements" and recursively repeating this on the remaining subsequences. Junk elements are determined heuristically based on the frequency of their duplicates in the text sequence6. Both Jaccard Distance and Sequence Matching yield rational numbers ∈ [0, 1]. These similarity scores will be referred to as StringSim.

4.2.2. Lexical Similarity

Another type of similarity, which is not clearly distinguishable from semantic and string similarity, is lexical similarity, or similarity of words. We used one method to capture lexical similarity between sentences and simply counted how often two claims contained the same words. For this, we tokenized all claims using NLTK's word tokenizer[24], filtered out stop words and counted how often two claims contained the same tokens (WordCount). In order to weight the number of equal tokens of shorter sentences higher than that of longer ones, we also computed a normalized ratio. For this we divided 100 by the number of tokens of both claims and multiplied the obtained value by two times the number of equal tokens.7 We did this both including stop words (WordTokRatio) and excluding them (WordRatio). Counting equal tokens, we obtained a positive integer similarity score, usually with less than three digits. We call this kind of discrete score SimCount. Computing the ratios, we obtained percentages ∈ [0, 100], similar to the SentEmb scores. These scores will be referred to as SimRatio.

4.2.3. Referential Similarity

Another way to think of similarity between sentences is to examine whether they refer to the same objects. To represent this kind of similarity we used two methods. Similar to the lexical similarity approach, we counted how often two claims contained words which are synonyms of each other. Additionally, we counted how often two claims contained the same named entities (NEs).

To compare synonyms, we used WordNet[25] and looked up all available synsets the tokens mentioned in a claim are part of. We tokenized the sentences the same way as above. Then we counted how often two claims contained the same synsets (SynCount). Here we also computed the ratio of the count of synonyms with respect to all synonyms (SynRatio) and all tokens (SynTokRatio) in the two sentences.

In order to compare NEs we used the entity-fishing system[26], which recognizes named entities mentioned in a text and disambiguates them using Wikidata. The system is able to return the Wikipedia and Wikidata identifiers of those mentions. We counted how often two claims contained named entities related to the same Wikipedia or Wikidata entry (NE). We additionally computed the ratio of the count of NEs with respect to all NEs (NERatio) and all tokens (NETokRatio) in the two sentences. Similarly to the lexical similarity scores, we obtained two different kinds of metrics for these similarities: SimCount and SimRatio (see Table 1).

6 https://docs.python.org/3/library/difflib.html
7 e.g.: If two claims consisted of ten tokens each and had ten tokens in common, we would obtain a WordTokRatio of (100/20)*10*2 = 100. If they only had one token in common, the obtained ratio would be (100/20)*1*2 = 10. If both claims consisted of 50 tokens each, the obtained ratios would be (100/100)*10*2 = 20 and (100/100)*1*2 = 2.
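To make the lexical scores from Section 4.2.2 concrete, the following sketch computes WordCount and the WordRatio/WordTokRatio normalization from footnote 7. Treating "equal tokens" as distinct shared tokens, lower-casing, the alphanumeric filter and NLTK's English stop word list are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of the lexical similarity scores (WordCount, WordRatio,
# WordTokRatio); implementation details are assumptions, not the paper's code.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer data; depending on the NLTK
nltk.download("punkt_tab", quiet=True)  # version, one of these two suffices
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def tokenize(text, keep_stop_words):
    toks = [t.lower() for t in word_tokenize(text) if t.isalnum()]
    return toks if keep_stop_words else [t for t in toks if t not in STOP_WORDS]

def word_count(claim_a, claim_b):
    # SimCount-style score: number of shared tokens after stop word removal.
    return len(set(tokenize(claim_a, False)) & set(tokenize(claim_b, False)))

def word_ratio(claim_a, claim_b, keep_stop_words=False):
    # SimRatio-style score, following footnote 7:
    # 100 / (#tokens of both claims) * 2 * (#shared tokens).
    tok_a = tokenize(claim_a, keep_stop_words)
    tok_b = tokenize(claim_b, keep_stop_words)
    shared = len(set(tok_a) & set(tok_b))
    total = len(tok_a) + len(tok_b)
    return 100 / total * 2 * shared if total else 0.0

# WordRatio excludes stop words; WordTokRatio keeps them (keep_stop_words=True).
```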
4.3. Pre-Processing

4.3.1. Cleaning tweets

For both subtasks we experimented with different ways of pre-processing the input claims. We cleaned the tweets given in subtask 2A to get rid of redundant information. We removed URLs, @-symbols and user information (see Table 2).

Table 2
Example of a cleaned tweet from the test data.

Original Tweet: Starlink — here. Thanks, @elonmusk pic.twitter.com/dZbaYqWYCf — Mykhailo Fedorov (@FedorovMykhailo) February 28, 2022
Cleaned Tweet: Starlink — here. Thanks, elonmusk

4.3.2. Including context

For subtask 2B, we tried incorporating the input claims' contexts within the speech or debate they were obtained from. We included the lines that were spoken before and after the relevant claim and integrated information about the current speaker by prepending "speaker X said" to the line of speech, where X is substituted by the name of the respective speaker (see Table 3).

Table 3
Example of a contextualized claim from the development test data.

Original Input Claim: He wouldn't send anything else.
Contextualized Input Claim: donald trump said "And Obama would send pillows and sheets." donald trump said "He wouldn't send anything else." donald trump said "It's the whole thing."

5. Model

5.1. Unsupervised Approach

We tried out an unsupervised and a supervised method to utilize the information we gained on the different kinds of similarity. The main idea of the unsupervised approach is to rank the input-ver-claim pairs by the different similarity scores described above. For this, a general similarity score is computed, combining the varying metrics (see Table 1). This general score can roughly be interpreted as the percentage to which two sentences are similar, where two exactly equal sentences would have a score of roughly 100. However, our way of combining the different similarity scores does not ensure that the resulting score is smaller than 100; it can sometimes be a little higher. The general similarity score is computed in the following way (a small worked sketch of this combination is given as code below):

• taking the mean of all SentEmb-, SimRatio- and StringSim-scores normalized to [0, 100]
• incorporating the LevDist: first the LevDist is divided by -100, which generates a positive factor that is smaller the more similar two sentences are. Then the similarity score obtained by computing the mean is divided by this factor.8
• adding the SimCount-scores to the obtained score

For the output, the five most similar verified claims for each input claim are selected based on the general similarity score.

5.2. Supervised Approach

For the supervised approach we built a feature set out of the different similarity scores in order to classify whether a verified claim is relevant for an input claim. We experimented with different methods to optimize our classification results. We used Blocking and Balancing in order to optimize our training results. Additionally, we tried out different Classifiers and applied Feature Selection to further improve our output. Lastly, we also made use of a heuristic based on our unsupervised approach to find relevant verified claims for all input claims.

To optimize the training, we used a Blocking approach. Instead of generating negative training instances by pairing each input claim with all but the true matching verified claims in the dataset, we computed the 50 most similar verified claims according to each of the four SentEmb scores and generated negative training instances using only those.
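As referenced in Section 5.1, the following minimal sketch illustrates how the general similarity score combines the individual feature scores. The function and variable names, the handling of negative SentEmb values, the guard for identical strings and the example numbers are assumptions made for illustration; only the three combination steps themselves are taken from the paper.

```python
# Minimal sketch of the unsupervised score combination from Section 5.1
# (illustrative only; the feature values below are made up).
from statistics import mean

def general_similarity(sent_emb, sim_ratio, string_sim, lev_dist, sim_count):
    # Step 1: mean of all SentEmb-, SimRatio- and StringSim-scores on a [0, 100]
    # scale (StringSim scores in [0, 1] are scaled by 100; SentEmb scores are
    # assumed non-negative here, as they usually are for related claims).
    score = mean(sent_emb + sim_ratio + [s * 100 for s in string_sim])
    # Step 2: LevDist / -100 gives a positive factor that is smaller for more
    # similar strings; dividing by it boosts near-identical pairs.
    factor = lev_dist / -100
    score = score / factor if factor > 0 else score  # identical strings: keep the mean
    # Step 3: the SimCount scores (shared words, synsets, named entities) are added.
    return score + sum(sim_count)

# One hypothetical input-ver-claim pair:
score = general_similarity(
    sent_emb=[72.0, 65.0],    # e.g. SBERT and SimCSE cosine similarities * 100
    sim_ratio=[40.0, 35.0],   # e.g. WordRatio, WordTokRatio
    string_sim=[0.50, 0.45],  # e.g. JaccTok, SeqMat
    lev_dist=-80,             # Levenshtein distance * -1
    sim_count=[4, 3],         # e.g. WordCount, SynCount
)
print(round(score, 2))        # ranking key: higher means more similar
```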
More specifically, we extracted 4 sets of input-ver-claim-pairs, one set for each SentEmb method, with each set containing the 50 most similar verified claims identified by this method. Then we used the union of these sets as our final training set. We observed that all true input-ver-claims were covered. Besides the computational advantage of a smaller training set, this way the model may better learn to distinguish cases that are similar on the surface as all very dissimilar pairs have been filtered out before training. Then all similarity scores, (also the SentEmb-scores) were added as features. As targets we obtained the relevance scores from the qrels-file of the training data. An unlabeled feature set was built for the test data. After blocking, the percentage of true positives in our training data was still beneath 1% for both subtasks. That’s why we applied Random Undersampling as a Balancing method and experimented with different parameters (see Tables 4 and 5). Then a Classifier was trained on the training data to predict relevance scores for the test data. We also experimented with different classifiers suited for binary classification, such as KNN, Logistic Regression, Linear SVC and a Decision Tree (see Tables 6, 7). We experimented with different selections of features out of the similarity features presented above. The influence of the ensemble of features is shown in Tables 13 and 14. Additionally we 8 e.g.: Given is a SentEmb mean of 50.0. If two sentences consist of quite similar strings, one could imagine them having a LevDist of -50. If two sentences are not that similar, they could have a LevDist of -200. Applying the technique described, incorporating LevDist would result in the sim score 100 for the similar sentences and 25 for the varying sentences. This way it is not ensured that the obtained similarity score is ∈ [0, 100]. In practice, however, the calculated values are in this range. Table 4 Subtask 2A: Impact of balancing using KNN. Positives Selected Features Classified Positives MAP@5 Semantic Similarity: SBERT, InferSent, SimCSE 0.62 % 138 0.8865 Lexical Similarity: WordCount, WordTokRatio 1% " 196 0.8865 2% " 247 0.8760 Semantic Similarity: SBERT, InferSent, SimCSE 3% Lexical Similarity: WordCount, WordRatio, Word- 287 0.8664 TokRatio 4% " 340 0.8712 5% " 373 0.8784 Semantic Similarity: SBERT, InferSent, SimCSE 6% 402 0.8656 Lexical Similarity: WordRatio, WordTokRatio 7% " 433 0.8652 8% " 468 0.8628 9% " 486 0.8628 10 % " 520 0.8700 Semantic Similarity: SBERT, InferSent, SimCSE 20 % 714 0.8776 Lexical Similarity: WordRatio 30 % Semantic Similarity: SBERT, InferSent, SimCSE 973 0.8836 40 % " 1158 0.8884 50 % " 1368 0.8896 Semantic Similarity: SBERT, InferSent, SimCSE 60 % 1532 0.8805 String Similarity: LevDist 70 % " 1660 0.8829 80 % " 1852 0.8805 Semantic Similarity: SBERT, InferSent, SimCSE 90 % String Similarity: LevDist 2111 0.8896 Referential Similarity: SynCount, SynTokRatio Semantic Similarity: SBERT, InferSent, SimCSE 100 % String Similarity: LevDist 2102 0.8713 Referantial Similarity: SynCount included the feature TokenCount which represents the sum of tokens of both input claim and verified claim. If no relevant verified claim was predicted for an input claim, we relied on our unsupervised approach heuristically and chose the five most similar verified claims based on the mean of sentence embedding similarity scores. For 2A we chose SBERT, InferSent and SimCSE as SentEmb scores, for 2B all four models, including UniversalSE. 6. Results 6.1. 
Evaluation Metric The task is considered a ranking task and is evaluated as such. The official ranking evaluation measure is Mean Average Precision at 5 (MAP@5). Additionally the provided scorer computes the measures MAP@k for k=1, 3, 5, 10, MRR and Precision@k for k= 3, 5, 10 (cf. [3]). The MAP@k metric measures the mean of correctly classified pairs in the top k of the returned output. MRR or Mean Reciprocal Rank measures how far the assigned rank of a correct pair Table 5 Subtask 2B: Impact of balancing using Logistic Regression. Positives Selected Features Classifed Positives MAP@5 Semantic Similarity: SimCSE String Similarity: LevDist 0.65 % Lexical Similarity: WordCount, WordRatio, Word- 11 0.4721 TokRatio TokenCount 1% " 15 0.4669 Semantic Similarity: SimCSE String Similarity: LevDist Lexical Similarity: WordCount, WordRatio, Word- 2% 22 0.4669 TokRatio Referential Similarity: SynTokRatio TokenCount Semantic Similarity: SimCSE String Similarity: LevDist 3% Lexical Similarity: WordCount, WordRatio, Word- 26 0.4503 TokRatio TokenCount 4% " 34 0.4579 5% " 43 0.4464 Semantic Similarity: SimCSE String Similarity: LevDist 6% 61 0.4531 Lexical Similarity: WordRatio, WordTokRatio TokenCount 7% " 68 0.4531 8% " 82 0.4608 9% " 92 0.4608 10 % " 106 0.4454 Semantic Similarity: SimCSE String Similarity: LevDist 20 % Lexical Similarity: WordRatio, WordTokRatio 258 0.4569 Referential Similarity: SynRatio TokenCount Semantic Similarity: SimCSE 30 % Lexcial Similarity: WordRatio, WordTokRatio 453 0.4436 Referential Similarity: SynTokRatio 40 % " 637 0.4359 50 % " 809 0.4332 60 % " 981 0.4324 70 % " 1171 0.4436 Semantic Similarity: SimCSE 80 % Lexical Similarity: WordCount, WordTokRatio 1265 0.4436 Referential Similarity: SynCount, SynTokRatio 90 % " 1393 0.4551 Semantic Similarity: SimCSE 100 % Lexcial Similarity: WordRatio, WordTokRatio 1547 0.4551 Referential Similarity: SynCount, SynTokRatio differs from its correct rank (i.e. the first rank for subtask A) on average. Table 6 Subtask 2A: Impact of Classifier without balancing using features SBERT, InferSent, SimCSE, WordCount, WordTokRatio and with balancing to 50% positives using features SBERT, InferSent, SimCSE. Classifier MAP@5 No Balancing MAP@5 with Balancing KNN 0.8865 0.8896 Logistic Regression 0.8832 0.8844 Linear SVC 0.8792 0.8820 Decision Tree 0.8502 0.8478 Table 7 Subtask 2B: Impact of classifier with balancing to 50% Positives using features SimCSE, SynTokRatio, WordRatio, WordTokRatio and to 8% Positives using features SimCSE, LevDist, WordRatio, WordTokRatio, TokenCount. Classifier MAP@5 with 50% Positives MAP@5 with 8% Positives KNN 0.4328 0.4179 Logistic Regression 0.4332 0.4608 Linear SVC 0.4340 0.4485 Decision Tree 0.4136 0.3538 6.2. Subtask 2A For Subtask 2A we got the best result with our unsupervised approach, combining the similarity scores of SBERT, SimCSE, WordCount and WordTokRatio with a MAP@5 of 0.9175 (see Table 13). However the output we submitted made use of SBERT, SimCSE and WordCount and scored slightly worse (0.9075) (see Table 8). We still achieved a score above the baselines utilizing a simple and fast unsupervised ranking method. Table 8 Subtask 2A: Results. User/Team MAP@5 P@5 RR mshlis 0.956 0.322 0.957 watheq9 0.921 0.189 0.923 Viktor 0.922 0.190 0.922 Team_SimBa 0.907 0.190 0.907 motlogelwan 0.873 0.187 0.878 fraunhofersit_checkthat22 0.610 0.141 0.624 Team_Vax_Misinfo 0.020 0.011 0.096 Random Baseline 0 0 0 BM25 Baseline 0.8179 6.3. 
Subtask 2B For Subtask 2B we got the best results using a supervised approach. All similarity features were included, except from JaccChar (see Table 14). We made use of Random Undersampling to increase the percentage of positives in the training data (relevant input-ver-claim pairs) to 8%. Then a Logistic Regression Classifier was trained and predicted 111 input-ver-claim pairs. The unsupervised heuristic described above was used to find relevant verified claims for the remaining input claims. This way the output achieved a MAP@5 of 0.4882. The output we submitted also scored slightly worse than our best result with a MAP@5 of 0.459. To generate this output we used Linear Support Vector Classification and sampled to 14% positives. The considered features were SimCSE, JaccTok, WordCount, WordRatio, SynCount and SynRatio. This is still the best result for subtask 2B (see Table 9). Table 9 Subtask 2B: Results. User/Team MAP@5 P@5 RR Team_SimBa 0.459 0.126 0.475 Team_Vax_Misinfo 0.091 0.040 0.131 Random Baseline 0 0 0 BM25 Baseline 0.3207 6.4. Result of Pre-Processing It turned out that our pre-processing approach did not improve our results on the test data for Subtask 2A (see Table 10), although it did for the development test data. This is an issue worth investigating in future work. Tweet-specific units of text such as user-information were removed and it showed that it would have been useful to incorporate this kind of information for solving the task 2A. Nevertheless the pre-processing ensured that the data of both tasks was more similar and thereby helped assessing similarity of claims in general contexts. The incorporation of context for subtask 2B also did not improve the results on the devel- opment test data and on the final test data. That is why we used the original data for subtask 2B. Table 10 Subtask 2A: Impact of pre-processing. MAP@5 with Pre-Processing 0.9143 without Pre-Processing 0.9270 7. Observations 7.1. Evaluation of Features 7.1.1. Powerful Features for Subtask A and Subtask B Table 11 Supervised Approach: Comparison of performance of similarity scores independently. Similarity Scores MAP@5 Subtask 2A MAP@5 Subtask 2B CosSim SBERT 0.8711 0.3664 CosSim InferSent 0.4208 0.1846 CosSim UniversalSE 0.7153 0.3872 CosSim SimCSE 0.7973 0.3946 LevDist 0.1271 0.1833 JaccChar 0.0522 0.0569 JaccTok 0.4014 0.2763 SeqMat 0.2698 0.2790 WordCount 0.5667 0.2731 WordRatio 0.6454 0.2967 WordTokRatio 0.6630 0.2954 SynCount 0.3228 0.2024 SynRatio 0.3196 0.2508 SynTokRatio 0.3071 0.2359 NE 0.4549 0.1600 NERatio 0.4357 0.1556 NETokRatio 0.4620 0.1654 The observation of the results of using the supervised approach on single features (see Table 11) gives a good overview of their independent performance. As expected, the most successful features for both subtasks are the cosine similarities of the sentence embeddings. Especially SBERT, UniversalSE and SimCSE performed best on both task. That’s because, as explained above, sentence embeddings are really useful to capture STS. Interestingly SBERT is the most powerful feature for Subtask 2A and SimCSE the most powerful one for Subtask 2B. It would be worth further investigations to identify the reason for this difference. Both models are pre-trained on a large share of the same data, so maybe the contrastive training objective of SimCSE is partly responsible for it. Another important observation is the fact that the lexical similarity features WordCount, WordRatio and WordTokRatio perform also really well for both tasks. 
This is somewhat surprising, because these features are generated in such a simple way. In contrast, the Jaccard similarity of characters (JaccChar) is the weakest similarity feature. This can be explained by the fact that the consideration of equal characters, regardless of their order, does not have much informational value for the meaning of a sentence as a whole.

One interesting finding regarding the differences between the subtasks is the varying performance of the string similarity features. The string similarity features LevDist and SeqMat are the only features that produce a higher MAP@5 for Subtask 2B than for Subtask 2A. Looking at the data, it is noticeable that the input claims and the verified claims provided for Subtask 2B often share long, continuous strings (see Table 12).

Table 12
Comparison of input-ver-claim pairs of subtask A and subtask B. The claims of subtask B share a long, continuous string.

Subtask 2A
  Input Claim: Time magazine compared Russian President Vladimir Putin to Adolf Hitler on the cover of the March 14 / March 21, 2022, issue.
  Verified Claim: TIME's new cover: How Putin shattered Europe's dreams
Subtask 2B
  Input Claim: 160 million people like their private insurance, and if they don't like it, they can buy into a Medicare-like proposal.
  Verified Claim: 160 million people like their private insurance.

7.1.2. Feature Set

One of the most intriguing observations is the fact that both the unsupervised and the supervised approach perform best if lexical similarity is considered besides semantic similarity (see Tables 13 and 14). The SentEmb features do not seem to cover lexical similarity, and their performance benefits from the additional information contained in the lexical similarity features. This is also supported by the observation that these two types of features do not correlate strongly (see Tables 15 and 16). It can also be observed that, especially for subtask B, it is helpful to consider the combination of almost all similarity features in the supervised approach (see Table 14). Overall, a higher number of features mostly increases the performance of the supervised approach and decreases it for the unsupervised approach, as relatively uninformative features have too high an impact on the latter.

7.2. Supervised vs Unsupervised Approach

One important observation with respect to our results is the fact that the unsupervised approach performs nearly as well as the supervised approach for subtask B and even better than the supervised approach for subtask A. Since the task is a ranking problem, the unsupervised approach seems to perform sufficiently well for the given task. For similar tasks with the constraint to only find pairs that are relevant with a high certainty, the supervised approach might be more helpful. It is also reasonable to assume that the unsupervised approach generalizes well to similar tasks, because it is independent of the training data. This assumption is supported by the fact that the features that produce the best outputs are almost the same for both subtask A and

Table 13
Subtask 2A: Comparison of the combination of different similarity scores in a supervised and unsupervised way.
MAP@5 MAP@5 MAP@5 Kinds of Similarity Scores Similarity Scores UnSup KNN 50% positives KNN no balancing SBERT, InferSent, Univer- Semantic Similarity 0.8883 0.8734 0.8672 salSE, SimCSE SBERT, UniversalSE, Sim- 0.8972 0.8748 0.8829 CSE SBERT, InferSent, SimCSE 0.8793 0.8896 0.8664 SBERT, SimCSE 0.8896 0.8781 0.8748 Semantic Similarity and SBERT, SimCSE, Word- 0.9075 0.8839 0.8792 Lexical Similarity Count SBERT, SimCSE, Word- 0.9151 0.8877 0.8792 TokRatio SBERT, SimCSE, Word- 0.9175 0.8955 0.8801 Count, WordTokRatio SBERT, InferSent, SimCSE, WordCount, WordTokRa- 0.8911 0.8832 0.8865 tio SBERT, InferSent, SimCSE, 0.8941 0.8817 0.8780 WordCount Semantic Similarity, SBERT, SIMCSE, Word- Lexical Similarity and 0.9172 0.8863 0.8825 TokRatio, NETokRatio Referential Similarity Semantic Similarity, String Similarity and SimCSE, LevDist, all Word- 0.8027 0.8521 0.8742 Lexical Similarity Sims SimCSE, LevDist, WordRa- 0.7986 0.8305 0.8670 tio, WordTokRatio Semantic Similarity, String Similarity, Lex- SBERT, WordCount, Jacc- 0.8929 0.8748 0.8744 ical Similarity and Tok, NETokRatio Referential Similarity SBERT, WordTokRatio, Jac- 0.9001 0.8720 0.8844 cTok, NETokRatio SimCSE, JaccTok, all Word- 0.5509 0.8473 0.8650 Sims, all SynSims SimCSE, SeqMat, JaccTok, 0.5490 0.8417 0.8518 all WordSims, all SynSims SimCSE, SeqMat, JaccTok, all WordSims, SynRatio, 0.6540 0.8323 0.8602 SynTokRatio SimCSE, LevDist, all Word- 0.7448 0.8628 0.8550 Sims, SynTokRatio SimCSE, LevDist, WordRa- tio, WordTokRatio, SynRa- 0.7425 0.8501 0.8778 tio SimCSE, SynTokRatio, 0.7030 0.8573 0.8610 WordRatio, WordTokRatio SimCSE, WordCount, WordTokRatio, SynCount, 0.5850 0.8537 0.8650 SynTokRatio SimCSE, SynCount, Syn- TokRatio, WordRatio, 0.5842 0.8585 0.8586 WordTokRatio ALL ALL Except JaccChar, NERatio 0.6323 0.8521 0.8754 ALL Except JaccChar 0.6540 0.8620 0.8754 ALL 0.6376 0.8642 0.8793 Table 14 Subtask 2B: Comparison of the combination of different similarity scores in a supervised and unsuper- vised way. 
MAP@5 MAP@5 MAP@5 Kinds of Similarity Scores Similarity Scores UnSup LogReg 8% positives LinearSVC SBERT, InferSent, Univer- Semantic Similarity 0.4721 0.4190 0.4454 salSE, SimCSE SBERT, UniversalSE, Sim- 0.4672 0.4190 0.4454 CSE SBERT, InferSent, SimCSE 0.4310 0.4190 0.4454 SBERT, SimCSE 0.4190 0.4344 0.4454 Semantic Similarity and SBERT, SimCSE, Word- 0.4395 0.4554 0.4531 Lexical Similarity Count SBERT, SimCSE, Word- 0.4654 0.4479 0.4537 TokRatio SBERT, SimCSE, Word- 0.4718 0.4479 0.4332 Count, WordTokRatio SBERT, InferSent, SimCSE, WordCount, WordTokRa- 0.4654 0.4479 0.4332 tio SBERT, InferSent, SimCSE, 0.4583 0.4554 0.4562 WordCount Semantic Similarity, SBERT, SIMCSE, Word- Lexical Similarity and 0.4190 0.4479 0.4691 TokRatio, NETokRatio Referential Similarity Semantic Similarity, String Similarity, Lex- SBERT, WordCount, Jacc- 0.4595 0.4056 0.4590 ical Similarity and Tok, NETokRatio Referential Similarity SBERT, WordTokRatio, Jac- 0.4415 0.4338 0.4295 cTok, NETokRatio SimCSE, JaccTok, all Word- 0.3205 0.4646 0.4308 Sims, all SynSims SimCSE, SeqMat, JaccTok, 0.3195 0.4646 0.4308 all WordSims, all SynSims SimCSE, SeqMat, JaccTok, all WordSims, SynRatio, 0.3367 0.4646 0.4308 SynTokRatio Semantic Similarity, String Similarity and SimCSE, LevDist, all Word- 0.3731 0.4608 0.4428 Lexical Similarity Sims SimCSE, LevDist, all Word- 0.3477 0.4646 0.4308 Sims, SynTokRatio SimCSE, LevDist, WordRa- 0.3641 0.4608 0.4569 tio, WordTokRatio SimCSE, LevDist, WordRa- tio, WordTokRatio, SynRa- 0.3542 0.4646 0.4340 tio SimCSE, SynTokRatio, 0.3355 0.4646 0.4340 WordRatio, WordTokRatio SimCSE, WordCount, WordTokRatio, SynCount, 0.3118 0.4531 0.4269 SynTokRatio SimCSE, SynCount, Syn- TokRatio, WordRatio, 0.3301 0.4646 0.4385 WordTokRatio ALL ALL Except JaccChar, NER- 0.3301 0.4869 0.4436 atio ALL Except JaccChar 0.3147 0.4882 0.4436 ALL 0.3147 0.4749 0.4513 subtask B for the unsupervised approach (see Tables 13 and 14), while the supervised approach relies on different features for the subtasks to produce good outputs. 8. Future Work It would be interesting to investigate the generalizability of our approach and to check if the assumption that the unsupervised approach generalizes better than the supervised approach is true. Also a detailed assessment of the impact of pre-processing would be beneficial for related works. 9. Conclusion We treated the task to detect previously fact-checked claims as a STS-task. To solve it, we investigated different kinds of similarity measures between sentences, covering semantic, lexical and referential similarity. We found that it is beneficial to combine semantic similarity measures gained by calculating the distance of sentence embeddings with lexical similarity measures gained by counting shared words. Furthermore, we found that an unsupervised approach can be even more successful than a supervised approach for this task. Overall, our proposed approaches provide very good results for both subtasks with a MAP@5 of 0.907 for subtask A and a MAP@5 of 0.459 for subtask B, both scoring above the baselines and even being the top-ranked output for subtask B. References [1] P. Nakov, G. Da San Martino, F. Alam, S. Shaar, H. Mubarak, N. Babulkov, Overview of the CLEF-2022 CheckThat! lab task 2 on detecting previously fact-checked claims, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022. [2] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, G. D. S. Martino, P. 
Nakov, Overview of checkthat! 2020 english: Automatic identification and verification of claims in social media., in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF (Working Notes), volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_265. pdf. [3] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. Da San Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws.org/Vol-2936/paper-29.pdf. [4] M. Bouziane, H. Perrin, A. Cluzeau, J. Mardas, A. Sadeq, Team buster.ai at checkthat! 2020 insights and recommendations to improve fact-checking, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_134.pdf. [5] E. Thuma, N. P. Motlogelwa, T. Leburu-Dingalo, M. Mudongo, Ub_et at checkthat! 2020: Exploring ad hoc retrieval approaches in verified claims retrieval, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/ paper_204.pdf. [6] T. McDonald, Z. Dong, Y. Zhang, R. Hampson, J. Young, Q. Cao, J. L. Leidner, M. Stevenson, The university of sheffield at checkthat! 2020: Claim identification and verification on twitter, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http: //ceur-ws.org/Vol-2696/paper_162.pdf. [7] G. S. Cheema, S. Hakimov, R. Ewerth, Check square at checkthat! 2020: Claim detection in social media via fusion of transformer and syntactic features, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_216. pdf. [8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. URL: https: //arxiv.org/abs/1907.11692. doi:10.48550/ARXIV.1907.11692. [9] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. arXiv:1908.10084. [10] L. C. Passaro, A. Bondielli, A. Lenci, F. Marcelloni, Unipi-nle at checkthat! 2020: Ap- proaching fact checking from a sentence similarity perspective through the lens of trans- formers, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, Septem- ber 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_169.pdf. [11] U. Shukla, A. Sharma, Tiet at clef checkthat! 2020: Verified claim retrieval, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. 
Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/ paper_197.pdf. [12] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo- Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, CoRR abs/1803.11175 (2018). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175. [13] J. Martinez-Rico, L. Araujo, J. Martinez-Romo, Nlpir@uned at checkthat! 2020: A prelimi- nary approach for check-worthiness and claim retrieval tasks using neural networks and graphs, 2020. [14] A. Pritzkau, Nlytics at checkthat! 2021: Detecting previously fact-checked claims by measuring semantic similarity, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws. org/Vol-2936/paper-47.pdf. [15] S. Mihaylova, I. Borisova, D. Chemishanov, P. Hadzhitsanev, M. Hardalov, P. Nakov, Dips at checkthat! 2021: Verified claim retrieval, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws.org/Vol-2936/paper-45.pdf. [16] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Aschern at checkthat! 2021: Lambda-calculus of fact-checked claims, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws. org/Vol-2936/paper-38.pdf. [17] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, SemEval-2012 task 6: A pilot on semantic textual similarity, in: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 2012, pp. 385–393. URL: https://aclanthology.org/S12-1051. [18] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, 2017. URL: https://arxiv. org/abs/1705.02364. doi:10.48550/ARXIV.1705.02364. [19] M. Sahlgren, The distributional hypothesis, The Italian Journal of Linguistics 20 (2008) 33–54. [20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/ N19-1423. doi:10.18653/v1/N19-1423. [21] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162. [22] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016). [23] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Empirical Methods in Natural Language Processing (EMNLP), 2021. [24] S. Bird, E. Klein, E. 
Loper, Natural language processing with Python: analyzing text with the natural language toolkit, " O’Reilly Media, Inc.", 2009. [25] C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998. [26] P. Lopez, entity-fishing, https://github.com/kermitt2/entity-fishing, 2016–2022. arXiv:1:dir:cb0ba3379413db12b0018b7c3af8d0d2d864139c. A. Appendix Table 15 Subtask 2A: Spearman correlation between different similiarity scores for the test data. SimScore SBERT InferSent UniversalSE SimCSE LevDist JaccChar JaccTok SeqMat SBERT 1.0 -0.1 0.43 0.51 0.06 0.0 0.02 0.05 InferSent -0.1 1.0 0.09 -0.09 -0.54 0.38 0.44 -0.36 UniversalSE 0.43 0.09 1.0 0.36 0.08 0.05 0.15 0.08 SimCSE 0.51 -0.09 0.36 1.0 0.05 0.01 0.02 0.03 LevDist 0.06 -0.54 0.08 0.05 1.0 -0.28 -0.23 0.74 JaccChar 0.0 0.38 0.05 0.01 -0.28 1.0 0.37 -0.14 JaccTok 0.02 0.44 0.15 0.02 -0.23 0.37 1.0 -0.06 SeqMat 0.05 -0.36 0.08 0.03 0.74 -0.14 -0.06 1.0 WordCount 0.1 0.51 0.19 0.07 -0.43 0.37 0.67 -0.28 WordRatio 0.17 0.24 0.24 0.13 -0.12 0.29 0.63 -0.04 WordTokRatio 0.19 0.2 0.26 0.15 -0.05 0.25 0.59 0.02 SynCount 0.07 0.5 0.16 0.05 -0.42 0.31 0.48 -0.27 SynRatio 0.14 0.28 0.23 0.12 -0.14 0.22 0.45 -0.05 SynTokRatio 0.14 0.26 0.23 0.12 -0.11 0.21 0.42 -0.04 NE 0.28 0.11 0.38 0.24 -0.0 0.09 0.23 0.02 NERatio 0.28 0.1 0.38 0.24 0.0 0.09 0.23 0.02 NeTokRatio 0.28 0.1 0.38 0.24 0.01 0.08 0.23 0.03 SimScore WordCount WordRatio WordTokRatio SynCount SynRatio SynTokRatio SBERT 0.1 0.17 0.19 0.07 0.14 0.14 InferSent 0.51 0.24 0.2 0.5 0.28 0.26 UniversalSE 0.19 0.24 0.26 0.16 0.23 0.23 SimCSE 0.07 0.13 0.15 0.05 0.12 0.12 LevDist -0.43 -0.12 -0.05 -0.42 -0.14 -0.11 JaccChar 0.37 0.29 0.25 0.31 0.22 0.21 JaccTok 0.67 0.63 0.59 0.48 0.45 0.42 SeqMat -0.28 -0.04 0.02 -0.27 -0.05 -0.04 WordCount 1.0 0.9 0.86 0.72 0.64 0.62 WordRatio 0.9 1.0 0.98 0.6 0.67 0.64 WordTokRatio 0.86 0.98 1.0 0.57 0.66 0.64 SynCount 0.72 0.6 0.57 1.0 0.91 0.92 SynRatio 0.64 0.67 0.66 0.91 1.0 0.97 SynTokRatio 0.62 0.64 0.64 0.92 0.97 1.0 NE 0.32 0.36 0.36 0.2 0.24 0.23 NERatio 0.31 0.36 0.36 0.19 0.24 0.23 NeTokRatio 0.31 0.36 0.36 0.19 0.24 0.23 SimScore NE NERatio NETokRatio SBERT 0.28 0.28 0.28 InferSent 0.11 0.1 0.1 UniversalSE 0.38 0.38 0.38 SimCSE 0.24 0.24 0.24 LevDist -0.0 0.0 0.01 JaccChar 0.09 0.09 0.08 JaccTok 0.23 0.23 0.23 SeqMat 0.02 0.02 0.03 WordCount 0.32 0.31 0.31 WordRatio 0.36 0.36 0.36 WordTokRatio 0.36 0.36 0.36 SynCount 0.2 0.19 0.19 SynRatio 0.24 0.24 0.24 SynTokRatio 0.23 0.23 0.23 NE 1.0 1.0 1.0 NERatio 1.0 1.0 1.0 NeTokRatio 1.0 1.0 1.0 Table 16 Subtask 2B: Spearman correlation between different similiarity scores for the test data. 
SimScore SBERT InferSent UniversalSE SimCSE LevDist JaccChar JaccTok SeqMat SBERT 1.0 0.26 0.59 0.61 -0.06 0.21 0.04 0.04 InferSent 0.26 1.0 0.39 0.15 -0.59 0.47 0.38 -0.14 UniversalSE 0.59 0.39 1.0 0.48 -0.09 0.28 0.25 0.09 SimCSE 0.61 0.15 0.48 1.0 -0.01 0.19 0.1 0.05 LevDist -0.06 -0.59 -0.09 -0.01 1.0 -0.23 -0.03 0.57 JaccChar 0.21 0.47 0.28 0.19 -0.23 1.0 0.33 0.06 JaccTok 0.04 0.38 0.25 0.1 -0.03 0.33 1.0 0.21 SeqMat 0.04 -0.14 0.09 0.05 0.57 0.06 0.21 1.0 WordCount 0.25 0.6 0.43 0.2 -0.33 0.36 0.59 -0.01 WordRatio 0.11 0.11 0.28 0.14 0.23 0.16 0.59 0.28 WordTokRatio 0.16 0.04 0.29 0.16 0.33 0.08 0.5 0.33 SynCount 0.31 0.57 0.46 0.22 -0.37 0.32 0.41 -0.03 SynRatio 0.26 0.3 0.4 0.21 -0.01 0.18 0.43 0.16 SynTokRatio 0.27 0.29 0.42 0.22 0.0 0.18 0.4 0.18 NE 0.34 0.17 0.43 0.3 -0.07 0.18 0.21 0.05 NERatio 0.34 0.16 0.43 0.3 -0.05 0.17 0.21 0.06 NeTokRatio 0.34 0.16 0.43 0.3 -0.05 0.17 0.21 0.07 SimScore WordCount WordRatio WordTokRatio SynCount SynRatio SynTokRatio SBERT 0.25 0.11 0.16 0.31 0.26 0.27 InferSent 0.6 0.11 0.04 0.57 0.3 0.29 UniversalSE 0.43 0.28 0.29 0.46 0.4 0.42 SimCSE 0.2 0.14 0.16 0.22 0.21 0.22 LevDist -0.33 0.23 0.33 -0.37 -0.01 0.0 JaccChar 0.36 0.16 0.08 0.32 0.18 0.18 JaccTok 0.59 0.59 0.5 0.41 0.43 0.4 SeqMat -0.01 0.28 0.33 -0.03 0.16 0.18 WordCount 1.0 0.76 0.7 0.74 0.64 0.63 WordRatio 0.76 1.0 0.96 0.45 0.63 0.61 WordTokRatio 0.7 0.96 1.0 0.4 0.6 0.62 SynCount 0.74 0.45 0.4 1.0 0.87 0.9 SynRatio 0.64 0.63 0.6 0.87 1.0 0.95 SynTokRatio 0.63 0.61 0.62 0.9 0.95 1.0 NE 0.37 0.32 0.32 0.27 0.26 0.25 NERatio 0.37 0.33 0.32 0.26 0.27 0.25 NeTokRatio 0.37 0.32 0.32 0.26 0.2760.25 SimScore NE NERatio NETokRatio SBERT 0.34 0.34 0.34 InferSent 0.17 0.16 0.16 UniversalSE 0.43 0.43 0.43 SimCSE 0.3 0.3 0.3 LevDist -0.07 -0.05 -0.05 JaccChar 0.18 0.17 0.17 JaccTok 0.21 0.21 0.21 SeqMat 0.05 0.06 0.07 WordCount 0.37 0.37 0.37 WordRatio 0.32 0.33 0.32 WordTokRatio 0.32 0.32 0.32 SynCount 0.27 0.26 0.26 SynRatio 0.26 0.27 0.27 SynTokRatio 0.25 0.25 0.25 NE 1.0 1.0 1.0 NERatio 1.0 1.0 1.0 NeTokRatio 1.0 1.0 1.0