RUB-DFL at CheckThat! 2022: Transformer Models and Linguistic Features for Identifying Relevant Claims

Zehra Melce Hüsünbeyi¹, Oliver Deck¹ and Tatjana Scheffler¹
¹ Digital Forensic Linguistics, Ruhr-Universität Bochum, Universitätsstraße 150, 44801 Bochum, Germany

Abstract
We describe our system for the CLEF 2022 CheckThat! Lab Task 1 Subtasks A, B, and C on check-worthiness estimation, verifiable factual claims detection, and harmful tweet detection in both English and Turkish. We used transformer-based models as well as an ELMo-based attention network. We experimented with data pre-processing, data augmentation, and the addition of linguistic features. The official evaluation ranked our system 1st and 2nd for the Turkish data, while we achieved average results for the English data.

Keywords
claim identification, check-worthiness, English, Turkish, linguistic features, Twitter

1. Introduction

The CheckThat! lab at CLEF [1, 2] aims at providing automated solutions that facilitate or support fake news detection and related subtasks. Automated systems can provide the basis for human fact checkers and may take over some of the more tedious tasks in dealing with an ever-increasing amount of online disinformation. This paper gives an overview of team RUB-DFL's system for Task 1: Identifying Relevant Claims in Tweets [3]. Fact checking should only be applied to claims (and not, e.g., opinions or predictions about the future), so identifying claims and assessing their relevance can be used to prioritize which claims to check. Our team participated in three of the four subtasks, namely check-worthiness estimation, claim detection, and harmful tweet detection, for both the English and Turkish data sets. We conducted experiments with transformer-based models, data augmentation and linguistic features, as well as ELMo embeddings and attention networks. Our system reached 1st place for claim identification and check-worthiness estimation in Turkish and average results on English data. For harmful tweet detection, we placed 9th on the English data and 2nd on the Turkish data.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
melce.husunbeyi@rub.de (Z. M. Hüsünbeyi); oliver.deck@rub.de (O. Deck); tatjana.scheffler@rub.de (T. Scheffler)
https://tscheffler.github.io/ (T. Scheffler)

2. Related Work

Disinformation detection has received significant attention in NLP in recent years. Many systems, data sets and challenges take a holistic approach to so-called fake news detection [4, 5, 6]. The CheckThat! lab, however, has a different aim: many of the challenges in previous years, as well as in 2022, have focused on smaller, more manageable subtasks of disinformation detection such as check-worthiness identification and the detection of previously fact-checked claims [7, 8]. In this manner, automated systems can play to their strengths in pattern detection while receiving oversight from human fact checkers. Real-world NLP systems can thus provide, e.g., a list of check-worthy claims which can be used as a starting point for journalistic investigation. Datasets similar to the one used in this challenge can be found in ClaimBuster [9] and ClaimsKG [10], as well as in previous years' CheckThat! labs [11, 12, 13].
While there was no task on claim identification in the 2021 CheckThat! challenge, the winning systems for check-worthiness estimation and for the detection of previously fact-checked claims in English tweets were team NLP&IR@UNED [14] and team Aschern [15], respectively. Both used BERT models; team Aschern additionally used TF-IDF features and the LambdaMART re-ranking model. More information on all the participating systems and the approaches they employed can be found in the official overview papers published by the task organizers [7, 8].

3. Data and Pre-processing

The data for all three subtasks tackled by our team consisted of between 2891 and 4542 tweets. Tweets were provided with binary labels indicating whether a tweet is check-worthy (subtask 1a), contains a verifiable claim (1b), or is harmful (1c). Since there is considerable overlap between the datasets – e.g., every check-worthy claim in subtask 1a is automatically a positive example for a claim in subtask 1b – it is not particularly helpful to combine the datasets to gain a larger basis for training models. However, for subtask 1a, we could utilize last year's CheckThat! data, which we describe in section 4.3.

Simple data pre-processing steps were taken into consideration for the experiments in section 4.2. These include: changing all text to lower case; removing all URLs, Twitter mentions, and punctuation that is not part of an emoticon; and removing all remaining characters that are not letters, numbers, white space, or #.

We also considered two very simple approaches to data augmentation: adding additional data from last year's challenge, as mentioned above, and adding Linguistic Inquiry and Word Count (LIWC) categories [16]. For the latter, we tokenized the tweet text and looked up each token in the LIWC dictionary, a word list categorized by psycholinguistic and cognitive dimensions such as NegativeEmotion, Pronoun, or Health. For each token found in the LIWC dictionary, the corresponding LIWC category was simply appended to the tweet text. The hope was to push the classifier to pay greater attention to these psycholinguistic features instead of relying solely on the given text.
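To make the pre-processing and LIWC augmentation more concrete, the following minimal Python sketch illustrates the steps described above. The function names, regular expressions, and the toy LIWC dictionary are our own illustrative assumptions, not the exact implementation used in our system.

```python
import re

def preprocess(tweet: str) -> str:
    """Lowercase the tweet, strip URLs and mentions, and keep only word
    characters, whitespace, and '#' (illustrative approximation)."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove Twitter mentions
    # A full implementation would first protect emoticons such as ":-)";
    # this sketch simply drops all remaining non-word characters except '#'.
    text = re.sub(r"[^\w#\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def add_liwc_categories(tweet: str, liwc: dict) -> str:
    """Append the LIWC category of every token found in the dictionary."""
    categories = [cat for tok in tweet.split() for cat in liwc.get(tok, [])]
    return tweet + " " + " ".join(categories)

# toy example with a made-up two-entry LIWC dictionary
liwc_demo = {"hate": ["NegativeEmotion"], "doctor": ["Health"]}
print(add_liwc_categories(preprocess("I hate my doctor!!! https://t.co/xyz"), liwc_demo))
```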
4. Experiments

4.1. Transformer-based models

Transformer-based models like BERT [17] have significantly improved the performance on a wide range of NLP problems, including claim detection and related tasks, which are typically framed as text classification problems. The BERT architecture is trained with masked language modeling (MLM) and next sentence prediction (NSP) objectives. This structure allows the model to learn the relationship between masked words and their bidirectional context, to predict whether a second sentence follows a first, and thus to capture sentence relationships in an advanced way. After the release of BERT, other transformer-based pretrained language models have employed similar approaches while refining aspects like model size, training speed and efficiency, multilingual embeddings, and more. After analysing recent studies on benchmark datasets for text classification tasks [18], we decided to experiment with autoencoding pretrained language models (PLMs), which mostly outperform autoregressive PLMs (e.g., GPT) and earlier contextualized language models (e.g., CNN- and RNN-based models).

The following PLMs were chosen by considering criteria such as domain compatibility, latency, and capacity constraints: BERTweet [19] because it fits the target domain of the task; XLM-R [20] to experiment with multilingual embedding spaces; and ConvBERT [21] and ELECTRA [22] as more computationally efficient models. For the Turkish data, we also used the multilingual XLM-R model, but switched to the Turkish variants of the other models: BERTurk¹, ConvBERTurk² and the Turkish ELECTRA³ model.

Despite the significance of hyperparameter tuning, the large parameter space and memory constraints limited the tuning process to a small set of hyperparameters. We tuned these in controlled experiments and used a fixed seed value to ensure consistency. For all experiments, we report weighted-average F1 scores, which take the size of each class and its contribution to the overall score into account. We obtained the PLMs from the publicly available Hugging Face repository⁴.

¹ https://huggingface.co/dbmdz/bert-base-turkish-cased
² https://huggingface.co/dbmdz/convbert-base-turkish-cased
³ https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator
⁴ https://huggingface.co/

English. As seen in Table 1, all four models led to comparable results: ConvBERT was slightly ahead for subtask 1a (check-worthiness of tweets) with an f-score of 0.839, while BERTweet achieved the highest f-scores on subtasks 1b (verifiable factual claims detection) at 0.814 and 1c (harmful tweet detection) at 0.895. However, all f-scores, with the exception of XLM-R in subtask 1a, were within 0.03 points of each other. Such close results, combined with different systems winning different, though related, tasks on very similar data, prevent us from identifying a clearly superior approach. Further experimentation is needed to explore relevant factors for the success of a particular model.

Table 1
Transformer-based models without data pre-processing (English).

                                           accuracy  precision  recall  f-score
Check-worthiness of tweets (EN)
  BERTweet                                  0.824     0.823     0.824    0.823
  XLM-R                                     0.807     0.788     0.807    0.790
  ConvBERT                                  0.838     0.841     0.838    0.839
  ELECTRA                                   0.826     0.815     0.826    0.818
Verifiable factual claims detection (EN)
  BERTweet                                  0.816     0.814     0.816    0.814
  XLM-R                                     0.809     0.807     0.809    0.806
  ConvBERT                                  0.810     0.810     0.810    0.804
  ELECTRA                                   0.828     0.826     0.828    0.826
Harmful tweet detection (EN)
  BERTweet                                  0.907     0.889     0.907    0.895
  XLM-R                                     0.910     0.828     0.910    0.867
  ConvBERT                                  0.903     0.885     0.903    0.892
  ELECTRA                                   0.910     0.828     0.910    0.867

Turkish. For the Turkish data, BERTurk provided the highest f-scores for subtask 1a at 0.813 and for subtask 1b at 0.772, while ConvBERTurk took the lead in subtask 1c at an f-score of 0.781, see Table 2. Again, the close field – only XLM-R in subtask 1c deviated by more than 0.03 f-score from any of the other systems – provided little insight into which system would perform best in general.

Table 2
Transformer-based models without data pre-processing (Turkish).

                                           accuracy  precision  recall  f-score
Check-worthiness of tweets (TR)
  BERTurk                                   0.820     0.808     0.820    0.813
  XLM-R                                     0.833     0.803     0.833    0.805
  ConvBERTurk                               0.830     0.800     0.830    0.806
  ELECTRA                                   0.827     0.795     0.827    0.801
Verifiable factual claims detection (TR)
  BERTurk                                   0.782     0.777     0.782    0.772
  XLM-R                                     0.777     0.771     0.777    0.768
  ConvBERTurk                               0.762     0.755     0.762    0.756
  ELECTRA                                   0.770     0.764     0.770    0.764
Harmful tweet detection (TR)
  BERTurk                                   0.781     0.773     0.782    0.776
  XLM-R                                     0.736     0.675     0.736    0.630
  ConvBERTurk                               0.788     0.777     0.788    0.781
  ELECTRA                                   0.760     0.743     0.761    0.748
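For reference, a minimal sketch of the kind of fine-tuning setup used throughout this section is shown below, based on the Hugging Face transformers and datasets libraries and scikit-learn's weighted F1. The toy data, model checkpoint, and hyperparameters are illustrative assumptions rather than our exact training configuration.

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "vinai/bertweet-base"  # e.g. "dbmdz/bert-base-turkish-cased" for Turkish

# toy stand-in for the task's training/development files
train_df = pd.DataFrame({"tweet_text": ["The vaccine alters your DNA", "Good morning!"],
                         "label": [1, 0]})
dev_df = train_df.copy()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["tweet_text"], truncation=True, max_length=128,
                     padding="max_length")

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
dev_ds = Dataset.from_pandas(dev_df).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # weighted-average F1, as reported in the tables of this section
    return {"f1_weighted": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(output_dir="out", num_train_epochs=3, seed=42,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=dev_ds, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```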
4.2. Transformer-based models with pre-processed data

To investigate the merits of the data pre-processing described in section 3, we ran the same systems again on the simpler, cleaner data. With fewer confusing factors such as Twitter mentions and punctuation, the transformer-based models presumably encountered fewer situations they had not seen in training. As can be seen in Tables 3 and 4, we therefore achieved slightly higher f-scores.

English. With pre-processing, ConvBERT improved in subtask 1a from 0.839 to 0.843, overtook BERTweet (previously best at 0.814) in subtask 1b with an f-score of 0.817, and drew level with BERTweet (previously 0.895) in subtask 1c, where both models increased to an f-score of 0.906. XLM-R and BERTweet were not strongly affected by the pre-processing (BERTweet shows a slight decrease for subtask 1a and XLM-R for subtask 1b). ELECTRA, however, exhibited lower scores for all subtasks, leading to the assumption that it had managed to pick up on signals that were removed by the pre-processing.

Turkish. Simple pre-processing led to small increases in all three subtasks for the Turkish data as well: in subtask 1a, the top system BERTurk increased from 0.813 to 0.822 f-score; in subtask 1b, BERTurk improved from 0.772 to 0.788, while XLM-R decreased from 0.768 to 0.740; and in subtask 1c, ConvBERTurk increased from 0.781 to 0.782. All increases are fairly small and some systems even decreased in performance. However, since the best models for each task showed improvements, it seems that pre-processing also helps with the agglutinative structure of the Turkish language.

Table 3
Transformer-based models with data pre-processing (English).

                                           accuracy  precision  recall  f-score
Check-worthiness of tweets (EN)
  BERTweet                                  0.819     0.821     0.819    0.820
  XLM-R                                     0.820     0.784     0.820    0.793
  ConvBERT                                  0.845     0.841     0.845    0.843
  ELECTRA                                   0.775     0.601     0.775    0.677
Verifiable factual claims detection (EN)
  BERTweet                                  0.819     0.817     0.819    0.816
  XLM-R                                     0.799     0.797     0.799    0.795
  ConvBERT                                  0.820     0.818     0.820    0.817
  ELECTRA                                   0.787     0.784     0.787    0.784
Harmful tweet detection (EN)
  BERTweet                                  0.914     0.902     0.914    0.906
  XLM-R                                     0.910     0.828     0.910    0.867
  ConvBERT                                  0.909     0.903     0.909    0.906
  ELECTRA                                   0.910     0.828     0.910    0.867

Table 4
Transformer-based models with data pre-processing (Turkish).

                                           accuracy  precision  recall  f-score
Check-worthiness of tweets (TR)
  BERTurk                                   0.828     0.818     0.827    0.822
  XLM-R                                     0.833     0.798     0.833    0.792
  ConvBERTurk                               0.805     0.776     0.805    0.787
  ELECTRA                                   0.817     0.780     0.817    0.789
Verifiable factual claims detection (TR)
  BERTurk                                   0.794     0.789     0.794    0.788
  XLM-R                                     0.755     0.747     0.755    0.740
  ConvBERTurk                               0.767     0.760     0.767    0.760
  ELECTRA                                   0.721     0.710     0.721    0.711
Harmful tweet detection (TR)
  BERTurk                                   0.774     0.759     0.774    0.762
  XLM-R                                     0.736     0.542     0.736    0.625
  ConvBERTurk                               0.783     0.781     0.783    0.782
  ELECTRA                                   0.773     0.769     0.773    0.771

4.3. Transformer-based models with data augmentation

In a first simple step, we focused on augmenting the data of subtask 1a (check-worthiness estimation) with additional data from last year's CheckThat! challenge; subtasks 1b and 1c were different from last year. We first collected the tweets from the 2021 challenge and removed all duplicates and negative examples (to balance the dataset more towards the positive, i.e., check-worthy class). This left us with an additional 875 English and 237 Turkish tweets, which we added to the training data.
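A rough sketch of this augmentation step is shown below; the file names and column names are hypothetical, and the exact field names of the CheckThat! releases may differ.

```python
import pandas as pd

# hypothetical file names; both releases are tab-separated files with a
# tweet text column and a binary check-worthiness label
df_2022 = pd.read_csv("CT22_english_1A_checkworthy_train.tsv", sep="\t")
df_2021 = pd.read_csv("CT21_english_1A_checkworthy_train.tsv", sep="\t")

# keep only the positive (check-worthy) 2021 tweets, drop duplicate texts,
# and append them to this year's training set
extra = (df_2021[df_2021["check_worthiness"] == 1]
         .drop_duplicates(subset="tweet_text"))
augmented = pd.concat([df_2022, extra], ignore_index=True)
print(f"added {len(extra)} positive tweets from the 2021 data")
```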
As can be seen in Table 5, the results varied based on the language of the data. For English, we saw f-score increases from 0.839 to 0.854 (without pre-processing) and from 0.843 to 0.853 (with pre-processing) for the ConvBERT system, indicating that the additional data contained textual markers that could be picked up by the transformer model. For Turkish, on the other hand, the performance of the best system BERTurk decreased from 0.813 to 0.805 without pre-processing. For the pre-processed data, BERTurk achieved an f-score of 0.822 on the non-augmented data, but its performance dropped to 0.797 when trained on the augmented data, placing it second behind ConvBERTurk at 0.805 f-score. It therefore seems that the Turkish systems may have picked up markers specific to the 2022 data that fit the development set well but were not actual markers of check-worthiness, leading to reduced performance when trained on the additional data from 2021. Further discussion of the challenges of the Turkish dataset can be found in the Error Analysis section below.

Table 5
Transformer-based models, data augmentation with additional positive samples and pre-processing.

                                           accuracy  precision  recall  f-score
Check-worthiness of tweets (EN), without data pre-processing
  BERTweet                                  0.824     0.823     0.824    0.823
  XLM-R                                     0.787     0.787     0.787    0.787
  ConvBERT                                  0.855     0.853     0.855    0.854
  ELECTRA                                   0.815     0.807     0.815    0.810
Check-worthiness of tweets (EN), with data pre-processing
  BERTweet                                  0.819     0.821     0.819    0.820
  XLM-R                                     0.786     0.776     0.786    0.780
  ConvBERT                                  0.854     0.852     0.854    0.853
  ELECTRA                                   0.845     0.835     0.845    0.835
Check-worthiness of tweets (TR), without data pre-processing
  BERTurk                                   0.803     0.807     0.803    0.805
  XLM-R                                     0.785     0.749     0.785    0.763
  ConvBERTurk                               0.802     0.788     0.802    0.794
  ELECTRA                                   0.814     0.795     0.814    0.802
Check-worthiness of tweets (TR), with data pre-processing
  BERTurk                                   0.797     0.797     0.797    0.797
  XLM-R                                     0.789     0.754     0.789    0.768
  ConvBERTurk                               0.804     0.805     0.805    0.805
  ELECTRA                                   0.806     0.784     0.806    0.793

Our second approach to data augmentation was adding LIWC categories to the tweets. This was only possible for the English data, since we had no access to the Turkish version of LIWC. Table 6 shows the best-performing systems for each subtask on this augmented data. As can be seen, there was no increase in performance compared to training on either the raw or the pre-processed data: the highest f-scores were 0.821 as opposed to 0.843 for subtask 1a, 0.771 as opposed to 0.817 for subtask 1b, and 0.895 as opposed to 0.906 for subtask 1c. One explanation is that transformer-based models are trained on natural text, and artificially appended LIWC categories are not something the models have seen in training. Such features may be more helpful when integrated into an ensemble model where one part picks up on the LIWC features and is then combined with the transformers' output. Due to time constraints, we must leave this experiment for future work.

Table 6
Best-performing transformer-based models, data augmentation with LIWC categories.

                                                        accuracy  precision  recall  f-score
Check-worthiness of tweets (EN)            ConvBERT      0.826     0.818     0.826    0.821
Verifiable factual claims detection (EN)   ELECTRA       0.774     0.770     0.774    0.771
Harmful tweet detection (EN)               BERTweet      0.905     0.889     0.905    0.895

4.4. Transformer-based models with additional linguistic features

For this approach, we first calculated nine basic linguistic features as a baseline: word count, character count, punctuation count, emoji count, contains emoji, contains non-Twitter URL, number of LIWC categories, text complexity, and sentiment. For Turkish, only the first six features were calculated.
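The following sketch illustrates how the six surface features shared by both languages might be computed; the helper patterns are our own illustrative assumptions, and the three English-only features (LIWC category count, text complexity, sentiment) would additionally require the LIWC dictionary, a readability measure, and a sentiment model.

```python
import re
import string

EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges

def basic_features(tweet: str) -> list:
    """Surface features used for both languages (illustrative sketch)."""
    emojis = EMOJI_PATTERN.findall(tweet)
    urls = re.findall(r"https?://\S+", tweet)
    non_twitter_urls = [u for u in urls
                        if "twitter.com" not in u and "//t.co/" not in u]
    return [
        len(tweet.split()),                               # word count
        len(tweet),                                       # character count
        sum(ch in string.punctuation for ch in tweet),    # punctuation count
        len(emojis),                                      # emoji count
        float(bool(emojis)),                              # contains emoji
        float(bool(non_twitter_urls)),                    # contains non-Twitter URL
    ]

print(basic_features("Check this out 😀 https://example.com #claim"))
```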
The features were concatenated with our transformer-based models to see if adding simple linguistic markers would lead to improvements. For the English data, we also calculated 239 additional linguistic features with the help of the lingfeat library⁵, which was originally developed for readability assessment [23]. Due to time constraints, we were not able to implement our own feature set specifically adapted to claim detection and relied on this out-of-the-box solution for English. The 239 features include semantic (e.g., Wikipedia knowledge), discourse (e.g., entity density), syntactic (e.g., part-of-speech), and lexico-semantic (e.g., type-token ratio) features, as well as shallow traditional features (e.g., average number of tokens). An overview of all features can be found in [23, p. 10672].

⁵ https://github.com/brucewlee/lingfeat

The transformer-based models capture different levels of semantic and syntactic knowledge through their multi-head attention layers. By concatenating the last four layers of our best-performing transformer-based model for each task, we aimed to obtain better representations. These 3072-dimensional document embeddings were processed through a fully connected layer with 1024 hidden units and the ReLU activation function. A dropout regularization with a rate of 0.2 was then applied. The resulting hidden layer was concatenated with the 9-dimensional or 239-dimensional external linguistic features (separately) for the English datasets and with the 6-dimensional numerical features for the Turkish datasets. The concatenated vectors were passed to a fully connected layer with 128 hidden units and the ReLU activation function. Another dropout regularization with a rate of 0.1 was applied to this hidden layer, and predictions were generated with a sigmoid activation function.

The results for both the baseline features and the whole range of linguistic features provided by the lingfeat library can be found in Table 7. As before, only the best-performing models are shown. As can be seen, the performance was lower than that of our pure transformer-based models trained on pre-processed data in Tables 3 and 4. What is more, the 239 linguistic features for English led to lower performance than the 9 simple features. Other experiments with a logistic regression classifier on the linguistic features alone yielded very low scores that barely beat a random baseline. From this we gather that simply adding a large list of linguistic features which are not necessarily adapted to the task at hand is not helpful. Instead, the low performance of the linguistic features led to a deterioration of the ensemble when compared to the transformer models alone. However, with more fine-tuning and by identifying domain-specific linguistic features, different fusion techniques could be explored in the future.

Table 7
Best-performing transformer-based models merged with additional linguistic features.

                                                          accuracy  precision  recall  f-score
9 basic ling. features
  Check-worthiness of tweets (EN)            ConvBERT      0.777     0.753     0.777    0.684
  Verifiable factual claims detection (EN)   ELECTRA       0.816     0.816     0.816    0.816
  Harmful tweet detection (EN)               BERTweet      0.910     0.828     0.910    0.867
6 basic ling. features
  Check-worthiness of tweets (TR)            BERTurk       0.835     0.806     0.835    0.809
  Verifiable factual claims detection (TR)   BERTurk       0.765     0.778     0.765    0.769
  Harmful tweet detection (TR)               ConvBERTurk   0.783     0.774     0.783    0.777
239 advanced ling. features
  Check-worthiness of tweets (EN)            ConvBERT      0.775     0.714     0.775    0.680
  Verifiable factual claims detection (EN)   ELECTRA       0.653     0.760     0.653    0.539
  Harmful tweet detection (EN)               BERTweet      0.903     0.827     0.903    0.864
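For concreteness, the fusion architecture described above could look roughly like the following PyTorch sketch. The use of the [CLS] token for the document representation and the base model's hidden size of 768 are assumptions on our part; the layer sizes and dropout rates follow the description in the text.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FusionClassifier(nn.Module):
    """Transformer document embedding (last four layers concatenated)
    fused with external linguistic features, as described above."""

    def __init__(self, model_name: str, n_ling_features: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        self.fc_doc = nn.Sequential(nn.Linear(4 * hidden_size, 1024),
                                    nn.ReLU(), nn.Dropout(0.2))
        self.fc_mix = nn.Sequential(nn.Linear(1024 + n_ling_features, 128),
                                    nn.ReLU(), nn.Dropout(0.1))
        self.out = nn.Linear(128, 1)

    def forward(self, input_ids, attention_mask, ling_features):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).hidden_states
        # [CLS] vectors of the last four layers -> 3072-dimensional document embedding
        doc = torch.cat([h[:, 0, :] for h in hidden[-4:]], dim=-1)
        fused = self.fc_mix(torch.cat([self.fc_doc(doc), ling_features], dim=-1))
        return torch.sigmoid(self.out(fused))

model = FusionClassifier("bert-base-uncased", n_ling_features=9)
```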
4.5. ELMo embeddings, attention network and linguistic features

In a final round of experiments, we moved away from the transformer architecture and evaluated the basic and advanced linguistic features in an ensemble combining ELMo embeddings [see 24] with an attention network. The pre-trained ELMo embeddings were fed to the encoder, a bidirectional RNN-based model. We used a GRU rather than an LSTM to reduce the number of parameters and prevent overfitting, given the small size of our corpus. The models were trained with a 500-dimensional bidirectional GRU token encoder. An attention layer, which produces a sequence vector highlighting indicative tokens, then received the hidden states of the encoder layer. After dropout regularization with a rate of 0.2, the attention layer's output vector was merged with either the 9-dimensional or the 239-dimensional external linguistic features. These concatenated vectors were then fed to a fully connected layer with a ReLU activation function. We also applied dropout regularization with a rate of 0.1 to this hidden layer. Predictions were generated using the sigmoid activation function.

The results are shown in Table 8. Compared with the transformer models trained on pre-processed text (section 4.2), the ELMo embedding ensemble performed worse. For the runs with 9 linguistic features, the subtask 1a f-score was 0.803, which would take 3rd place in a direct comparison with the transformer models. In subtask 1b, all transformer models beat the 0.7612 f-score of the ELMo ensemble, but in subtask 1c, its f-score of 0.881 would put it in 3rd place behind the 0.906 of BERTweet and ConvBERT. Compared to the combination of transformer models with linguistic features (section 4.4), the attention network with ELMo embeddings performed much better. This may be because the transformers already pick up the more relevant linguistic features inherently during pre-training, while the architecture used in this section lends itself more easily to adding additional signals. Again, the 239-dimensional linguistic features led to lower performance. Since the features are not task-specific for any of the three subtasks, they may simply add too much noise, lowering the systems' performance.

Table 8
ELMo embeddings and attention network model merged with additional linguistic features.

                                                        accuracy  precision  recall  f-score
9 basic ling. features
  Check-worthiness of tweets (EN)                        0.814     0.800     0.814    0.803
  Verifiable factual claims detection (EN)               0.767     0.763     0.767    0.7612
  Harmful tweet detection (EN)                           0.897     0.872     0.897    0.881
239 advanced ling. features
  Check-worthiness of tweets (EN)                        0.774     0.599     0.774    0.675
  Verifiable factual claims detection (EN)               0.370     0.137     0.370    0.200
  Harmful tweet detection (EN)                           0.910     0.828     0.910    0.867
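A schematic PyTorch sketch of this ensemble is given below. To keep it library-agnostic, it takes pre-computed ELMo token embeddings as input; the attention formulation, the fused layer size of 128, and the interpretation of the 500-dimensional GRU as per-direction hidden size are our own assumptions.

```python
import torch
import torch.nn as nn

class ElmoAttentionClassifier(nn.Module):
    """BiGRU encoder over pre-computed ELMo embeddings, token-level attention,
    and fusion with external linguistic features (illustrative sketch)."""

    def __init__(self, n_ling_features: int, elmo_dim: int = 1024, hidden: int = 500):
        super().__init__()
        self.encoder = nn.GRU(elmo_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)        # scores one weight per token
        self.drop_attn = nn.Dropout(0.2)
        self.fc = nn.Sequential(nn.Linear(2 * hidden + n_ling_features, 128),
                                nn.ReLU(), nn.Dropout(0.1))
        self.out = nn.Linear(128, 1)

    def forward(self, elmo_embeddings, ling_features):
        states, _ = self.encoder(elmo_embeddings)           # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)   # attention over tokens
        sequence_vec = self.drop_attn((weights * states).sum(dim=1))
        fused = self.fc(torch.cat([sequence_vec, ling_features], dim=-1))
        return torch.sigmoid(self.out(fused))

model = ElmoAttentionClassifier(n_ling_features=9)
scores = model(torch.randn(2, 30, 1024), torch.randn(2, 9))  # toy forward pass
```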
4.6. Official results on the test set

We submitted the models with the best f-scores for subtasks 1a, 1b, and 1c in both English and Turkish: for subtask 1a, ConvBERT with additional training data (English) and BERTurk with data pre-processing (Turkish); for subtask 1b, ELECTRA (English) and BERTurk with data pre-processing (Turkish); and for subtask 1c, BERTweet with data pre-processing (English) and ConvBERTurk with data pre-processing (Turkish).

Our systems reached average scores on the English data, placing 6th out of 13 teams in subtask 1a with an F1 score for the positive class of 0.525 (winning system: 0.698). In subtask 1b, we placed 6th out of 9 systems with an accuracy of 0.709 (winning system: 0.761), and for subtask 1c we placed 9th out of 11 teams with an F1 for the positive class of 0.273 (winning system: 0.397). On the Turkish data, we placed 1st in subtask 1a (F1 positive class: 0.212) and 1b (accuracy: 0.801) and 2nd in subtask 1c (F1 positive class: 0.353, winning system: 0.366). While the scores for subtask 1c were low across both languages, as well as in the Arabic, Bulgarian, and Dutch data sets, the extremely low numbers for subtask 1a (check-worthiness) in Turkish are an outlier. Here, we were the only team that managed to surpass an F1 score of 0.2. It seems that all systems overfit the training and development data and were not capable of identifying actual check-worthiness markers that would translate into good performance on the test set.

5. Error Analysis

Due to the overall low evaluation scores on the test set of the check-worthiness subtask in Turkish, we analyzed some of the incorrectly predicted results of our best model. Out of a total of 67 misclassified tweets, there were 5 false negative and 62 false positive instances.

We checked the false negatives for clues to improve recall. In one example tweet, a well-known Turkish person is referred to with a mention tag. The tweet also contains quotation marks, and its last sentence ends with a question mark. It can be interpreted as containing a claim, with the author exhibiting a skeptical distance from that claim. There were also examples in which an exclamation mark was placed in parentheses, signaling sarcasm, and in which suffixes were used to contrast two opposite situations. We also found cases in which the author tried to mobilize a mass audience around a claim by means of direct address.

Among the false positive samples, on the other hand, there were a large number of tweets that are difficult to classify. In our manual re-evaluation, we found sentences that could be reclassified as check-worthy claims. Quotations that specify the source are frequently used to strengthen statements that can be very dangerous, as in the following example:

(1) Prof. Serhat Fındık: Hindistan Covid i aşılamayı bırakıp, İvermectin'e geçerek yendi. Afrika da aynı şekilde. İvermectin çok ucuz bir ilaçtır. Küresel ilaç şirketleri ucuz ilaçları sevmezler.
'Prof. Serhat Fındık: India defeated Covid by stopping the vaccine and switching to Ivermectin, likewise in Africa. Ivermectin is a very inexpensive drug. Global pharmaceutical companies do not like cheap drugs.'

The tweet in (1) is labeled "non-check-worthy", even though in our view it does contain several check-worthy claims (mixed with opinions). The high rate of such gray-area cases in the Turkish test data could partially explain the extremely low scores across all systems submitted for this task.

6. Conclusion and Future Work

We have described our system for the CLEF 2022 CheckThat! Lab Task 1. We tackled subtasks 1a, 1b and 1c on check-worthiness, claim detection, and harmful tweet detection, in both English and Turkish. We experimented with four different transformer-based architectures as well as an ELMo-based attention network ensemble. We also tried different methods of pre-processing and data augmentation and included a number of linguistic features.
We placed 6th, 6th, and 9th for the English data and 1st, 1st, and 2nd for Turkish on the three subtasks. During this trial-and-error process, we realized that transformer-based models already capture more comprehensive linguistic features than those we included in the system. In the future, we plan to investigate more adapted and task-specific linguistic features, especially since transformer models rely on large amounts of training text which are not available for the majority of the world's languages. Additionally, we will examine which features are most relevant for our problem in order to design a more interpretable model.

Acknowledgments

This work was partially funded by the German Federal Ministry of Education and Research (BMBF) in the project "NoFake: AI assisted system for the crowdsourcing based detection of disinformation spread via digital platforms" (16KIS1518K).

References

[1] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! Lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.

[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.

[3] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.

[4] X. Zhou, R. Zafarani, A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities, ACM Computing Surveys 53 (2020) 1–40. doi:10.1145/3395046. arXiv:1812.00315.

[5] I. Augenstein, Towards Explainable Fact Checking, arXiv:2108.10274 [cs, stat] (2021). arXiv:2108.10274.

[6] Z. Guo, M. Schlichtkrull, A. Vlachos, A Survey on Automated Fact-Checking, Transactions of the Association for Computational Linguistics 10 (2022) 178–206. doi:10.1162/tacl_a_00454.

[7] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, P. Nakov, Overview of the CLEF-2021 CheckThat! Lab Task 2 on Detecting Previously Fact-Checked Claims in Tweets and Political Debates, in: CEUR Workshop Proceedings, Bucharest, Romania, 2021, p. 13.
[8] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! Lab Task 1 on Check-Worthiness Estimation in Tweets and Political Debates, in: CEUR Workshop Proceedings, Bucharest, Romania, 2021, p. 24.

[9] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever end-to-end fact-checking system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948. doi:10.14778/3137765.3137815.

[10] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S. Dietze, K. Todorov, ClaimsKG: A Knowledge Graph of Fact-Checked Claims, in: C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. Cruz, A. Hogan, J. Song, M. Lefrançois, F. Gandon (Eds.), The Semantic Web – ISWC 2019, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 309–324. doi:10.1007/978-3-030-30796-7_20.

[11] M. Hasanain, R. Suwaileh, T. Elsayed, A. Barrón-Cedeño, P. Nakov, Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality, in: CEUR Workshop Proceedings, Lugano, Switzerland, 2019, p. 15.

[12] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 1: Check-Worthiness, in: CEUR Workshop Proceedings, Lugano, Switzerland, 2019, p. 15.

[13] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, Overview of CheckThat! 2020 English: Automatic Identification and Verification of Claims in Social Media, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020, p. 24.

[14] J. R. Martinez-Rico, J. Martinez-Romo, L. Araujo, NLP&IR@UNED at CheckThat! 2021: Check-worthiness estimation and fake news detection using transformer models, in: CEUR Workshop Proceedings, Bucharest, Romania, 2021, p. 13.

[15] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Aschern at CheckThat! 2021: Lambda-Calculus of Fact-Checked Claims, in: CEUR Workshop Proceedings, Bucharest, Romania, 2021, p. 10.

[16] J. W. Pennebaker, M. E. Francis, R. J. Booth, Linguistic inquiry and word count: LIWC 2015, Pennebaker Conglomerates (2015).

[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805 [cs] (2019). arXiv:1810.04805.

[18] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning-based text classification: a comprehensive review, ACM Computing Surveys (CSUR) 54 (2021) 1–40.

[19] D. Q. Nguyen, T. Vu, A. Tuan Nguyen, BERTweet: A pre-trained language model for English Tweets, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 9–14. doi:10.18653/v1/2020.emnlp-demos.2.

[20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.

[21] Z.-H. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, S. Yan, ConvBERT: Improving BERT with Span-based Dynamic Convolution, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 12837–12848.
[22] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020. arXiv:2003.10555.

[23] B. W. Lee, Y. S. Jang, J. Lee, Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 10669–10686. doi:10.18653/v1/2021.emnlp-main.834.

[24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep Contextualized Word Representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. doi:10.18653/v1/N18-1202.