=Paper=
{{Paper
|id=Vol-3180/paper-37
|storemode=property
|title=TOBB ETU at CheckThat! 2022: Detecting Attention-Worthy and Harmful Tweets and Check-Worthy Claims
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-37.pdf
|volume=Vol-3180
|authors=Ahmet Bahadir Eyuboglu,Mustafa Bora Arslan,Ekrem Sonmezer,Mucahid Kutlu
|dblpUrl=https://dblp.org/rec/conf/clef/EyubogluASK22
}}
==TOBB ETU at CheckThat! 2022: Detecting Attention-Worthy and Harmful Tweets and Check-Worthy Claims==
Ahmet Bahadir Eyuboglu, Mustafa Bora Arslan, Ekrem Sonmezer and Mucahid Kutlu
TOBB University of Economics and Technology, Ankara, Turkey
Abstract
In this paper, we present our participation in CLEF 2022 CheckThat! Lab’s Task 1 on detecting check-
worthy and verifiable claims and attention-worthy and harmful tweets. We participated in all subtasks
of Task 1 for the Arabic, Bulgarian, Dutch, English, and Turkish datasets. We investigate the impact of
fine-tuning various transformer models and how to increase training data size using machine translation.
We also use feed-forward networks with the Manifold Mixup regularization for the respective tasks. We
are ranked first in detecting factual claims in Arabic and harmful tweets in Dutch. In addition, we are
ranked second in detecting check-worthy claims in Arabic and Bulgarian.
Keywords
Fact-Checking, Check-worthiness, Attention-worthy tweets, Harmful tweets, Factual Claims
1. Introduction
Social media platforms have become one of the main information resources for people by enabling
their users to easily share messages and follow others. While these platforms are extremely
important in helping people share their thoughts and make their voices heard, they can also be
used in very negative ways, spreading misinformation and/or hateful messages that
negatively impact individuals and societies. We have especially observed this dark side of social
media platforms during the COVID-19 pandemic. For instance, misinformation and conspiracy
theories about vaccines increased hesitation towards being vaccinated [1]. Furthermore, the
messages spread on social media platforms might impact public opinion on a particular issue
and mobilize people, forcing government entities to take action. For instance, government
entities in several countries had to regularly share information about vaccines to reduce
vaccine hesitancy (e.g., [2]).
In this paper, we explain our participation in Task 1 [3] of the CLEF Check That! 2022
Lab [4, 5]. Task 1 covers four subtasks: 1) check-worthy claim detection (Subtask
1A), 2) verifiable factual claim detection (Subtask 1B), 3) harmful tweet detection (Subtask 1C), and
4) attention-worthy tweet detection (Subtask 1D). Subtask 1A covers six languages, namely
Arabic, Bulgarian, Dutch, English, Spanish, and Turkish, while the other subtasks cover all the
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
ahmetbahadireyuboglu@gmail.com (A. B. Eyuboglu); mustafaboraarslan@outlook.com (M. B. Arslan);
sonmezerekrem@outlook.com (E. Sonmezer); m.kutlu@etu.edu.tr (M. Kutlu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
mentioned languages except Spanish. We participated in all subtasks for Arabic, Bulgarian,
Dutch, English, and Turkish languages¹, yielding 20 submissions in total.
In the development phase of the shared task, we explored three different research directions
including i) fine-tuning various pre-trained transformer models, ii) increasing the training
data for fine-tuning transformer models, and iii) applying the Manifold Mixup regularization
technique [6] for the subtasks in which we participated. In particular, we investigated 9, 3, 5, 13, and 3
different pre-trained transformer models for subtask 1A in Arabic, Bulgarian, Dutch, English, and
Turkish, respectively. In addition, we explored increasing training data by back-translation and
machine-translating datasets in other languages for subtask 1C. Next, we compared the Manifold
Mixup approach, fine-tuning transformer models, and data augmentation by back-translation
in all four subtasks to select models for our official submissions.
In our experiments with the development dataset, we find that the type of the transformer
model causes dramatic changes in the performance, suggesting that researchers should select
the models carefully. In addition, our findings about the impact of artificially increasing the
data are mixed. In particular, we observe that increasing training data usually has a negative
impact in Bulgarian and Turkish datasets in subtask 1C while using additional data for English
and Dutch datasets improves the performance.
In the official ranking, we achieved mixed results. Considering tasks with at least three
participants, we are ranked first in 1B-Arabic and second in 1A-Arabic and 1A-Bulgarian. We
share our implementation of the Manifold Mixup method² for reproducibility of our results.
2. Approaches
We explore three different approaches for all subtasks including fine-tuning various transformer
models, increasing dataset size via machine translation, and the Manifold Mixup regularization.
In this section we explain each of them in detail.
2.1. Fine Tuning Various Transformer Models
Prior works show remarkable success of transformer models in various text classification tasks
[7]. Furthermore, the best-performing systems in previous check-worthy claim detection tasks of
Check That! Lab [8] usually exploited various transformer models [9, 10]. However, Kartal and
Kutlu [11] show that the performance of models varies dramatically across different transformer
models. Therefore, in this approach, we explore several language-specific transformer models
pre-trained with different datasets.
2.2. Increasing Training Data via Machine Translation
Training data has an enormous impact on the performance of the resulting models. Prior work on
check-worthy claim detection investigated several ways to increase the training data
size such as back-translation [9], weak supervision [12], and utilizing datasets in other languages
with multi-lingual models [11]. In this approach, we explore increasing training data size by
¹ We could not participate for Spanish due to a technical problem we encountered during development.
² https://github.com/Carnagie/manifold-mixup-text-classification
two different methods including 1) utilizing datasets in other languages by machine-translating
them into the respective language, and 2) paraphrasing the training data via back-translation
and using them as additional labeled data.
In the first method, we exploit datasets in several languages provided by the Check That! Lab
organizers this year. In particular, in order to develop a model for a specific language, 𝐿𝑂 , we
first select a training dataset provided for another language and machine-translate its tweets
to the language 𝐿𝑂 using Google Translate. Subsequently, we fine-tune a language-specific
transformer model using the original data and the machine-translated data together. In subtask 1C,
we machine-translate only tweets labeled as harmful, to reduce the imbalance in the label distribution
while increasing the training data size.
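As an illustration, the first method can be sketched as follows. The `translate` function is a hypothetical placeholder for the Google Translate call, and the toy (text, label) pairs are ours; only the data flow matches the description above.

```python
def translate(text, src, tgt):
    """Placeholder for a machine-translation call (e.g., Google Translate).
    Here it merely tags the text so the data flow can be demonstrated."""
    return f"[{src}->{tgt}] {text}"

def augment_with_translated_positives(target_data, source_data, src_lang, tgt_lang):
    """Add machine-translated *harmful* (label == 1) tweets from another
    language's dataset, reducing label imbalance while growing the data."""
    translated = [(translate(text, src_lang, tgt_lang), label)
                  for text, label in source_data if label == 1]
    return target_data + translated

# Toy example: augment a Turkish dataset with harmful tweets translated from English.
turkish = [("ornek tweet", 0), ("zararli tweet", 1)]
english = [("harmless tweet", 0), ("harmful tweet", 1)]
augmented = augment_with_translated_positives(turkish, english, "en", "tr")
```

Only the positively labeled (harmful) examples from the source dataset are added, so the class imbalance shrinks as the dataset grows.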
In our back-translation method, we first translate the original text to another language using
Google Translate. Subsequently, we translate the resultant text back to the original language.
This method is likely to create texts slightly different from the originals, with the same or a
similar meaning. Assuming that these changes do not affect the labels, we combine
the original data with the back-translated data and fine-tune a language-specific transformer
model.
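The back-translation step can be sketched as follows. Again, `translate` is a stand-in for the Google Translate call used in the paper; a real pivot translation would return a paraphrase rather than the tagged string produced here.

```python
def translate(text, src, tgt):
    # Replace with an actual machine-translation API call.
    return f"[{src}->{tgt}] {text}"

def back_translate(text, orig_lang, pivot_lang):
    """Translate to a pivot language and back to obtain a paraphrase."""
    pivot = translate(text, orig_lang, pivot_lang)
    return translate(pivot, pivot_lang, orig_lang)

def augment_by_back_translation(data, orig_lang, pivot_lang):
    """Assuming paraphrasing does not change labels, append a back-translated
    copy of every (text, label) pair to the original training data."""
    return data + [(back_translate(t, orig_lang, pivot_lang), y) for t, y in data]

# Toy example: double an English training set via a Turkish pivot.
train = [("this claim needs checking", 1), ("good morning", 0)]
augmented = augment_by_back_translation(train, "en", "tr")
```

Unlike the cross-lingual method, this doubles the whole dataset and preserves the original label distribution.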
2.3. Language Specific BERT with Manifold Mixup
Many of the annotations in the shared task are subjective. For instance, whether a tweet requires
attention of government entities might depend on how much the annotators want governments
to intervene in their lives. Similarly, prior work on check-worthiness points out the subjective
nature of the task (e.g., [11, 13]). To address this problem, we apply the Manifold
Mixup regularization proposed by Verma et al. [6]. In particular, the Manifold Mixup trains
neural networks on linear combinations of hidden representations of training examples, yielding
flattened class-representations and smoother decision boundaries. Verma et al. [6] demonstrate
that their approach yields more robust solutions in image classification. In our work, we use
BERT embeddings to represent tweets and then train a four-layer feed-forward network with
the Manifold Mixup method.
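A minimal PyTorch sketch of this setup follows. The layer sizes, the mixup parameter alpha, and the random tensors standing in for BERT tweet embeddings are our illustrative assumptions, not the exact configuration used in the submissions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixupFFN(nn.Module):
    """Four-layer feed-forward classifier over tweet embeddings."""
    def __init__(self, in_dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden), nn.Linear(hidden, n_classes)])

    def forward(self, x, mix_layer=None, lam=1.0, perm=None):
        for k, layer in enumerate(self.layers):
            if k == mix_layer:  # mix hidden states of paired examples
                x = lam * x + (1 - lam) * x[perm]
            x = layer(x) if k == len(self.layers) - 1 else F.relu(layer(x))
        return x

def mixup_step(model, x, y, alpha=0.2):
    """One Manifold Mixup training step: mix at a random depth and apply the
    same convex combination to the losses of the two label sets."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(x.size(0))
    k = int(torch.randint(0, len(model.layers), (1,)))  # random mixing depth
    logits = model(x, mix_layer=k, lam=lam, perm=perm)
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])

model = MixupFFN()
x = torch.randn(8, 768)          # stand-in for BERT tweet embeddings
y = torch.randint(0, 2, (8,))
loss = mixup_step(model, x, y)
loss.backward()
```

Mixing at depth 0 reduces to standard input-level mixup; deeper mixing interpolates hidden representations, which is what yields the flattened class representations described above.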
In subtask 1D, we apply a different approach than in the other subtasks due to its severely imbal-
anced label distribution. In particular, there are nine labels in subtask 1D, but eight of them
describe why a particular tweet is attention-worthy. In addition, the majority of the tweets
have the “not attention-worthy” label. Therefore, we first binarize the labels by merging the variants of
attention-worthy labels into a single one, yielding only two labels: 1) attention-worthy and
2) not attention-worthy. Subsequently, we under-sample the negative class at a 1/5 ratio and
train our Manifold Mixup model. Next, we build another model using eight labels for attention-
worthy tweets. If a tweet is classified as attention-worthy, we use the second model to predict
why it is attention-worthy. Otherwise, we do not use the second model and label it as “not
attention-worthy”. Note that we do not apply this two-step approach for other subtasks because
they are already binary classification tasks.
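The two-step pipeline for subtask 1D can be sketched as follows. The label strings and the stand-in classifiers (`binary_model`, `reason_model`) are illustrative assumptions; in the paper, both stages are Manifold Mixup models.

```python
import random

NEGATIVE = "not attention-worthy"

def binarize(data):
    """Merge the eight attention-worthy label variants into one positive class."""
    return [(text, NEGATIVE if label == NEGATIVE else "attention-worthy")
            for text, label in data]

def undersample_negatives(data, keep_ratio=0.2, seed=42):
    """Keep every positive example and a 1/5 sample of the negative class."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] != NEGATIVE]
    neg = [d for d in data if d[1] == NEGATIVE]
    return pos + rng.sample(neg, int(len(neg) * keep_ratio))

def predict(text, binary_model, reason_model):
    """Stage 1 decides attention-worthiness; stage 2 predicts the reason."""
    if binary_model(text) == "attention-worthy":
        return reason_model(text)  # one of the eight fine-grained labels
    return NEGATIVE
```

The second model is only ever invoked on tweets the first model judges attention-worthy, so it trains and predicts over the eight fine-grained labels alone.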
3. Experiments
We first present statistics about the datasets and explain implementation details and our experi-
mental setup in Section 3.1. Next, we explain how we selected our submissions in Section 3.2.
Finally, we present the results of our submissions in Section 3.3.
3.1. Experimental Setup
3.1.1. Implementation
In order to fine-tune and configure transformer models, we use the PyTorch v1.9.0³ and TensorFlow⁴
libraries. We import the transformer models used in our experiments from Huggingface⁵. In addition,
we use Google’s SentencePiece library⁶ for machine translation. We set the batch size to 32 in
all our experiments with fine-tuned transformer models. In the experiments on increasing dataset
size using machine translation, we train the models for 5 epochs.
We implemented the Manifold Mixup [6] method from scratch using PyTorch v1.9.0, and set
the number of epochs and the batch size to 5 and 2, respectively. We use the following transformer models for
each language: AraBERT.v02 [14] for Arabic, RoBERTa-base-bulgarian⁷ for Bulgarian, RobBERT
[15] for Dutch, the uncased version of BERT-base⁸ for English, and DistilBERTurk⁹ for Turkish.
3.1.2. Evaluation Metrics
We use the official metric for each subtask to evaluate and compare our methods. In particular,
we use 𝐹1 score of positive class in subtasks 1A and 1C, accuracy in subtask 1B, and weighted
𝐹1 in subtask 1D.
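These metrics can be computed without external libraries; the implementation below is our own sketch, not the official scorer.

```python
from collections import Counter

def f1_positive(gold, pred, positive=1):
    """F1 score of the positive class (official metric for subtasks 1A and 1C)."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def accuracy(gold, pred):
    """Official metric for subtask 1B."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Support-weighted average of per-class F1 (official metric for subtask 1D)."""
    support = Counter(gold)
    return sum(n / len(gold) * f1_positive(gold, pred, positive=c)
               for c, n in support.items())
```

For example, with gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, the positive-class F1 is 2/3 while the accuracy is 0.75, which illustrates how the metrics can disagree on imbalanced data.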
3.1.3. Datasets
The shared task organizers provide train, development, test development, and test datasets for
each language and subtask. The number of tweets for each label in train, development, test
development, and test datasets in subtasks 1A, 1B, 1C, and 1D are presented in Tables 1, 2, 3,
and 4, respectively.
In our experiments during the development phase, we use the train and development datasets
for training and validation of the Manifold Mixup model, respectively. In our experiments for
fine-tuning various transformer models and increasing dataset size via machine translation,
we combine train and development sets for each case and fine-tune models accordingly. In all
experiments during the development phase, we use the test development dataset for testing.
³ https://pytorch.org/
⁴ https://www.tensorflow.org/
⁵ https://huggingface.co/docs/transformers/index
⁶ https://github.com/google/sentencepiece
⁷ https://huggingface.co/iarfmoose/roberta-base-bulgarian
⁸ https://huggingface.co/bert-base-uncased
⁹ https://huggingface.co/dbmdz/distilbert-base-turkish-cased
Table 1
Data & Label Distribution for Each Language in Subtask 1A.
Language | Label | Train | Dev. | Dev. Test | Test
English | not check-worthy | 1675 | 151 | 445 | 110
English | check-worthy | 447 | 44 | 129 | 39
Bulgarian | not check-worthy | 1493 | 141 | 413 | 73
Bulgarian | check-worthy | 378 | 36 | 106 | 57
Dutch | not check-worthy | 546 | 44 | 150 | 350
Dutch | check-worthy | 377 | 28 | 102 | 316
Turkish | not check-worthy | 1995 | 177 | 427 | 289
Turkish | check-worthy | 422 | 45 | 84 | 14
Arabic | not check-worthy | 1551 | 135 | 425 | 435
Arabic | check-worthy | 962 | 100 | 266 | 247
Table 2
Data & Label Distribution for Each Language in Task 1B.
Language | Label | Train | Dev. | Dev. Test | Test
English | not claim | 3031 | 276 | 828 | 102
English | claim | 292 | 31 | 82 | 149
Bulgarian | not claim | 839 | 74 | 217 | 130
Bulgarian | claim | 1871 | 177 | 519 | 199
Dutch | not claim | 1021 | 109 | 282 | 750
Dutch | claim | 929 | 72 | 252 | 608
Turkish | not claim | 828 | 72 | 222 | 303
Turkish | claim | 1589 | 150 | 438 | 209
Arabic | not claim | 1118 | 104 | 305 | 682
Arabic | claim | 2513 | 235 | 691 | 566
Table 3
Data & Label Distribution for Each Language in Task 1C.
Language | Label | Train | Dev. | Dev. Test | Test
English | not harmful | 3031 | 276 | 828 | 211
English | harmful | 292 | 31 | 82 | 40
Bulgarian | not harmful | 2341 | 209 | 636 | 314
Bulgarian | harmful | 248 | 18 | 67 | 11
Dutch | not harmful | 1775 | 165 | 476 | 1145
Dutch | harmful | 171 | 14 | 55 | 215
Turkish | not harmful | 1790 | 157 | 476 | 466
Turkish | harmful | 627 | 65 | 174 | 46
Arabic | not harmful | 2946 | 276 | 805 | 1011
Arabic | harmful | 678 | 60 | 189 | 190
Table 4
Data & Label Distribution in Training (Tr), Development (D), Test Development (TD), and Test (T) Sets
for Each Language in Subtask 1D.
Label | English (Tr/D/TD/T) | Bulgarian (Tr/D/TD/T) | Dutch (Tr/D/TD/T) | Turkish (Tr/D/TD/T) | Arabic (Tr/D/TD/T)
not interesting | 2851/267/774/202 | 2341/209/636/308 | 1545/142/405/1078 | 1698/151/466/429 | 1185/115/298/354
harmful | 173/21/55/26 | 248/18/67/3 | 94/11/31/86 | 24/8/10/2 | 511/50/164/98
blame authorities | 138/7/36/7 | 35/7/9/3 | 128/10/39/54 | 82/8/21/5 | 71/5/17/61
calls for action | 48/3/12/4 | 4/1/3/1 | 27/5/11/22 | 15/1/5/4 | 36/6/19/53
discusses cure | 42/3/15/5 | 56/12/11/8 | 5/1/2/13 | 38/5/14/6 | 1132/101/303/248
discusses action | 27/1/7/4 | 17/2/6/3 | 23/1/8/42 | 21/1/6/11 | 501/42/152/250
contains advice | 12/2/4/1 | 6/1/3/1 | 38/2/10/12 | 4/1/5/0 | 79/3/20/48
asks question | 5/1/1/1 | 1/0/0/1 | 84/6/26/29 | 16/2/5/7 | 98/14/17/47
other | 25/1/5/1 | 2/1/1/1 | 5/1/1/20 | 6/1/1/1 | 8/2/5/27
3.2. Experimental Results in the Development Phase
We participate in all subtasks of Task 1 for five languages, yielding 20 different submissions. In
addition, we explore three different approaches to determine our final submissions. Therefore, in
order to reduce the complexity of the experiments and meet the deadlines of the shared task, we first
evaluate the various transformer models on subtask 1A and the data-augmentation methods on
subtask 1C, using the respective test development datasets. Next, based on our experiments
in subtask 1A and 1C, we compare three different approaches in all subtasks to determine our
submissions for the official evaluation on the test data. We note that this is not an ideal way to
select systems for submission, but we take this step to meet the deadlines.
3.2.1. Impact of Transformer Model on Detecting Check-Worthy Claims
In order to observe the impact of transformer models, we identify several transformer models
available on the Huggingface platform based on their monthly download scores and evaluate
their performance in subtask 1A. The number of transformer models we compare is 9, 3, 5, 13,
and 3 for Arabic, Bulgarian, Dutch, English, and Turkish, respectively.
We present the results in Table 5. Our observations based on our extensive experiments are
as follows. Firstly, the results for English show the importance of the evaluation metric used to report
the performance of systems. For instance, DistilRoBERTa-base-climate-f has the worst recall and
𝐹1 scores, but achieves the best accuracy. Secondly, our results suggest that the text used in
pre-training has a major impact on the models’ performance. For instance, COVID-Twitter-BERT
v1 achieves the best 𝐹1 score among all English models. This is likely because it is pre-trained
with tweets about COVID-19, while the tweets used in the shared task are also about COVID-19.
Similarly, PubMedBERT, which is pre-trained with research articles on PubMed, yields the second-
best results for English. However, we also observe some unexpected results in our experiments.
For instance, AraBERT.v1, which is pre-trained on a smaller dataset compared to the other variants
of AraBERT (i.e., AraBERTv0.2-Twitter, AraBERTv0.2, and AraBERTv2), outperforms all Arabic-
specific models. In addition, while DarijaBERT is pre-trained only with texts in Moroccan
Arabic, it outperforms all other Arabic-specific models except AraBERT.v1. Furthermore, the best-
performing model on the Turkish dataset is the one with the smallest vocabulary size. Therefore,
our results show that it is not easy to choose a pre-trained model simply by comparing models’
configurations and the texts used in pre-training. We think that one of the reasons for these
unexpected results is the subjective nature of the task [11].
Table 5
Results of Various Transformer Models in Detecting Check-Worthy Claims. Models are sorted by 𝐹1
score within each language, so the first row for each language is the best-performing model.
Language | Model | Accuracy | Precision | Recall | 𝐹1
Arabic | AraBERT.v1 [14] | 0.413 | 0.390 | 0.932 | 0.550
Arabic | DarijaBERT¹⁰ | 0.499 | 0.420 | 0.789 | 0.548
Arabic | Ara_DialectBERT¹¹ | 0.431 | 0.393 | 0.887 | 0.545
Arabic | arabert_c19 [16] | 0.548 | 0.439 | 0.627 | 0.517
Arabic | AraBERTv0.2-Twitter [14] | 0.600 | 0.482 | 0.526 | 0.503
Arabic | bert-base-arabic [17] | 0.481 | 0.397 | 0.672 | 0.500
Arabic | CAMeLBERT [18] | 0.451 | 0.372 | 0.620 | 0.465
Arabic | bert-base-arabertv2¹² | 0.534 | 0.399 | 0.417 | 0.408
Arabic | bert-base-arabertv02¹³ | 0.599 | 0.454 | 0.206 | 0.284
Bulgarian | RoBERTa-base-bulgarian⁷ | 0.776 | 0.451 | 0.443 | 0.447
Bulgarian | RoBERTa-small-bulgarian-POS¹⁴ | 0.485 | 0.259 | 0.820 | 0.394
Bulgarian | bert-base-bg-cased [19] | 0.784 | 0.448 | 0.245 | 0.317
Dutch | BERTje [20] | 0.619 | 0.516 | 0.941 | 0.666
Dutch | RobBERT [15] | 0.650 | 0.549 | 0.764 | 0.639
Dutch | bert-base-nl-cased¹⁵ | 0.559 | 0.469 | 0.676 | 0.554
Dutch | bert-base-dutch-cased-finetuned-gem¹⁶ | 0.638 | 0.582 | 0.382 | 0.461
English | COVID-Twitter-BERT v1 [21] | 0.721 | 0.434 | 0.798 | 0.562
English | PubMedBERT [22] | 0.745 | 0.447 | 0.558 | 0.496
English | BERT base model (uncased) [7] | 0.634 | 0.343 | 0.689 | 0.458
English | LEGAL-BERT [23] | 0.630 | 0.326 | 0.604 | 0.423
English | ALBERT Base v2 [24] | 0.689 | 0.353 | 0.457 | 0.398
English | Bio_ClinicalBERT [25] | 0.682 | 0.337 | 0.426 | 0.376
English | BERT base model (cased) [7] | 0.224 | 0.224 | 1.000 | 0.366
English | bert-base-uncased-contracts¹⁷ | 0.740 | 0.405 | 0.333 | 0.365
English | ALBERT Base v1¹⁸ | 0.707 | 0.338 | 0.317 | 0.328
English | hateBERT [26] | 0.770 | 0.476 | 0.232 | 0.312
English | COVID-Twitter-BERT v2 MNLI¹⁹ | 0.667 | 0.265 | 0.271 | 0.268
English | RoBERTa base [27] | 0.731 | 0.295 | 0.139 | 0.189
English | DistilRoBERTa-base-climate-f [28] | 0.783 | 0.631 | 0.093 | 0.162
Turkish | BERTurk uncased 32K Vocabulary²⁰ | 0.760 | 0.333 | 0.385 | 0.357
Turkish | BERTurk uncased 128K Vocabulary²¹ | 0.337 | 0.188 | 0.859 | 0.309
Turkish | BERTurk cased 128K Vocabulary²² | 0.562 | 0.203 | 0.526 | 0.293
3.2.2. Impact of Training Data in Detecting Harmful Tweets
We use roberta-small-bulgarian²³ for Bulgarian, BERTje [20] for Dutch, BERT-base-cased for
English, and bert-base-turkish-sentiment-cased²⁴ for Turkish as language-specific transformer
models. Table 6 shows the performance of each model when a different dataset is machine-
translated to the corresponding language and the respective language-specific model is fine-tuned
with the original data and the machine-translated data together. In this experiment, we are not able to
report results for Arabic because we ran into technical challenges (e.g., insufficient memory)
that prevented us from obtaining results. We observe that increasing the training data does not always
improve the performance. In particular, using only the original dataset yields the highest results
for Turkish and Bulgarian, while utilizing more labeled samples usually improves the performance
on the English and Dutch datasets.
Table 6
Impact of increasing training data by machine-translating another dataset in a different language in
detecting harmful tweets. We report 𝐹1 score for each case. The best result for each language is written
in bold.
Machine-Translated Data Bulgarian Dutch English Turkish
None 0.26 0.26 0.11 0.55
Bulgarian - 0.39 0.23 0.13
Dutch 0.23 - 0.23 0.53
English 0.21 0.39 - 0.48
Turkish 0.19 0.25 0.25 -
Arabic 0.16 0.27 0.21 0.47
The subjective nature of this task might be one of the reasons for the lower performance
when using additional data from other languages. In particular, as each country is dealing with
different social issues, it is likely that people living in different countries might disagree on
what makes a message harmful for a society. For instance, Turkish annotators might be more
sensitive to tweets about refugees compared to annotators for other languages because Turkey
hosts nearly 3.8 million refugees, i.e., the largest refugee population worldwide²⁵, and thereby,
misinformation about refugees might have unpleasant consequences.
¹⁰ https://huggingface.co/Kamel/DarijaBERT
¹¹ https://huggingface.co/MutazYoune/Ara_DialectBERT
¹² https://huggingface.co/aubmindlab/bert-base-arabertv2
¹³ https://huggingface.co/aubmindlab/bert-base-arabertv02
¹⁴ https://huggingface.co/iarfmoose/roberta-small-bulgarian-pos
¹⁵ https://huggingface.co/Geotrend/distilbert-base-nl-cased
¹⁶ https://huggingface.co/GeniusVoice/bert-base-dutch-cased-finetuned-gem
¹⁷ https://huggingface.co/nlpaueb/bert-base-uncased-contracts
¹⁸ https://huggingface.co/albert-base-v1
¹⁹ https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2-mnli
²⁰ https://huggingface.co/dbmdz/bert-base-turkish-uncased
²¹ https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased
²² https://huggingface.co/dbmdz/bert-base-turkish-128k-cased
²³ https://huggingface.co/iarfmoose/roberta-small-bulgarian
²⁴ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
²⁵ https://www.unhcr.org/figures-at-a-glance.html
Another method to increase the training data size is back-translation, which does not suffer from
social differences across countries. Therefore, in our next experiment, we increase the training
data using various languages for back-translation. Again, we are not able to report results for
Arabic due to technical challenges we encountered. In this experiment, we also use Spanish for
back-translation of the Bulgarian dataset, but not for the other datasets, in order to meet the
deadlines of the lab. The results are shown in Table 7.
Table 7
The impact of increasing train data using various languages for back-translation (BT). The best result
for each language is written in bold.
Lang. used for BT Bulgarian Dutch English Turkish
None 0.26 0.26 0.11 0.55
Bulgarian - 0.39 0.30 0.51
Dutch 0.26 - 0.23 0.51
English 0.26 0.35 - 0.54
Turkish 0.25 0.41 0.25 -
Arabic - 0.36 0.27 0.49
Spanish 0.27 - - -
We again observe that we achieve the best result for Turkish when we use only the original
dataset for training. However, back-translation improves the performance in the Dutch and
English datasets. For Bulgarian, back-translation has a minimal impact. We do not observe
a particular language which yields consistently higher results than others when used as the
language for back-translation.
3.2.3. Selecting Models for Submission
In order to select the models to submit for official ranking, we compare three different approaches
for each subtask and language:
• Fine-tuning the best-performing pre-trained transformer model with the original
dataset (FT-BP-TM). We use the best-performing pre-trained transformer model in
our experiments in Section 3.2.1 for all subtasks except 1D. In particular, we fine-tune
AraBERT.v1, RoBERTa-base-bulgarian, BERTje, COVID-Twitter-BERT v1, and BERTurk, for
Arabic, Bulgarian, Dutch, English, and Turkish, respectively, using the corresponding
datasets.
• Fine-tuning a transformer model with back translation (FT-TM-BT). We use the
best-performing model in our experiments in Section 3.2.2. In particular, we use Spanish,
Turkish, Bulgarian, and English for back-translation to increase the size of Bulgarian,
Dutch, English, and Turkish datasets, respectively. Note that the back-translation does
not improve the performance in the Turkish dataset. However, the FT-BP-TM approach
also uses the original dataset for fine-tuning. Therefore, in this approach, we increase
the size of Turkish dataset using back-translation. In particular, we use English as the
back-translation language because it yields the best results among others (See Table 7).
• Manifold Mixup. We use the Manifold Mixup model explained in Section 2.3.
Tables 8, 9, 10, and 11 present results comparing the three approaches for subtasks 1A, 1B,
1C, and 1D, respectively. Results for some cases are missing due to technical challenges we
encountered and the limited time frame for submissions. In our submissions, we chose the
best-performing method for each case and submitted our results accordingly.
Table 8
Development Test Results in Subtask 1A for 𝐹1 Score for the Positive Class
Model Arabic Bulgarian Dutch English Turkish
Manifold Mixup 0.14 0 0.58 0.48 0.22
FT-TM-BT - 0.42 0.64 0.48 0.40
FT-BP-TM 0.47 0.47 0.57 0.55 0.40
Table 9
Development Test Results in Subtask 1B for 𝐹1 Score for the Positive Class
Model Arabic Bulgarian Dutch English Turkish
Manifold Mixup 0.76 0.75 0.49 0.67 0.63
FT-TM-BT - 0.86 0.73 - 0.78
FT-BP-TM - 0.87 0.72 0.76 0.78
Table 10
Development Test Results in Subtask 1C for 𝐹1 Score for the Positive Class
Model Arabic Bulgarian Dutch English Turkish
Manifold Mixup 0.64 0 0.12 0.18 0.30
FT-TM-BT 0.12 0.27 0.41 0.30 0.54
FT-BP-TM - 0.24 0.33 0.35 0.52
Table 11
Development Test Results in Subtask 1D for Average Weighted 𝐹1 . We do not have results for FT-BP-TM
case in this experiment.
Model Arabic Bulgarian Dutch English Turkish
Manifold Mixup 0.65 0.80 0.65 0.78 0.79
FT-TM-BT - 0.33 0.31 - 0.28
3.3. Results of Our Submissions
Table 12 shows our results and rankings for each case in which we participated. We are ranked first in
1B Arabic and 1C Dutch. Focusing on subtasks with at least four participants, we are ranked
second in 1A Arabic and 1A Bulgarian. We also observe that our rankings are generally higher
in 1A than in the other subtasks.
Table 12
Results for our official submissions. Results show 𝐹1 , accuracy, 𝐹1 , and weighted 𝐹1 scores for tasks 1A,
1B, 1C, and 1D, respectively (i.e., the official evaluation metrics).
Task | Language | Submitted Model | Rank | Score
1A | Arabic | FT-BP-TM | 2 (out of 5) | 0.495
1A | Bulgarian | FT-BP-TM | 2 (out of 6) | 0.542
1A | Dutch | FT-TM-BT | 3 (out of 6) | 0.534
1A | English | FT-BP-TM | 4 (out of 14) | 0.561
1A | Turkish | FT-TM-BT | 3 (out of 5) | 0.118
1B | Arabic | Manifold Mixup | 1 (out of 4) | 0.570
1B | Bulgarian | FT-BP-TM | 2 (out of 3) | 0.742
1B | Dutch | FT-TM-BT | 2 (out of 3) | 0.658
1B | English | FT-BP-TM | 9 (out of 10) | 0.641
1B | Turkish | FT-TM-BT | 4 (out of 4) | 0.729
1C | Arabic | Manifold Mixup | 2 (out of 3) | 0.268
1C | Bulgarian | FT-TM-BT | 2 (out of 3) | 0.054
1C | Dutch | FT-TM-BT | 1 (out of 3) | 0.147
1C | English | FT-BP-TM | 5 (out of 12) | 0.329
1C | Turkish | FT-TM-BT | 3 (out of 5) | 0.262
1D | Arabic | Manifold Mixup | 2 (out of 2) | 0.184
1D | Bulgarian | Manifold Mixup | 2 (out of 3) | 0.887
1D | Dutch | Manifold Mixup | 2 (out of 3) | 0.694
1D | English | Manifold Mixup | 4 (out of 7) | 0.670
1D | Turkish | Manifold Mixup | 3 (out of 3) | 0.806
4. Conclusion
In this paper, we present our participation in CLEF 2022 CheckThat! Lab’s Task 1. We partici-
pated in all four subtasks of Task 1 for Arabic, Bulgarian, Dutch, English, and Turkish, yielding
20 submissions in total. We explore which transformer model yields the highest performance,
the impact of increasing training data size by machine translating datasets in other languages
and back-translation, and the Manifold Mixup method proposed by Verma et al. [6]. We are
ranked first in subtask 1B for Arabic and in subtask 1C for Dutch. In addition, we are ranked
second in subtask 1A for Arabic and Bulgarian.
Our observations based on our comprehensive experiments are as follows. Firstly, the
performance of transformer models varies dramatically based on the text used for pre-training.
Secondly, increasing training data does not always improve the performance. Therefore, it is
important to consider biases existing in each dataset. Thirdly, we do not observe that a particular
language used for back-translation yields consistently higher performance than others.
In the future, we plan to focus on the subjective nature of the tasks in this lab. In particular,
we will first qualitatively analyze the datasets to better understand annotations. Subsequently,
we plan to develop a model focusing on dealing with subjective annotations.
References
[1] J. Roozenbeek, C. R. Schneider, S. Dryhurst, J. Kerr, A. L. Freeman, G. Recchia, A. M. Van
Der Bles, S. Van Der Linden, Susceptibility to misinformation about covid-19 around the
world, Royal Society open science 7 (2020) 201199.
[2] Republic of turkey ministry of health covid-19 vaccination information platform, https:
//covid19asi.saglik.gov.tr/?_Dil=2, 2022. Accessed: 2022-06-22.
[3] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu,
W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview
of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in:
G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF
2022—Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[4] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez,
T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov,
N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! Lab on fighting the
covid-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald,
C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer
International Publishing, Cham, 2022, pp. 416–428.
[5] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez,
T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov,
N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the
CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection,
in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald,
G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, F. Nicola (Eds.), Proceedings of the 13th
International Conference of the CLEF Association: Information Access Evaluation meets
Multilinguality, Multimodality, and Visualization, CLEF ’2022, Bologna, Italy, 2022.
[6] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, Y. Bengio, Manifold
mixup: Better representations by interpolating hidden states, in: International Conference
on Machine Learning, PMLR, 2019, pp. 6438–6447.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[8] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal,
F. Alam, G. Da San Martino, et al., Overview of the clef-2021 checkthat! lab task 1 on
check-worthiness estimation in tweets and political debates, in: CLEF (Working Notes),
2021.
[9] E. Williams, P. Rodrigues, S. Tran, Accenture at checkthat! 2021: Interesting claim
identification and ranking with contextually sensitive lexical training data augmentation,
arXiv preprint arXiv:2107.05684 (2021).
[10] M. Zengin, Y. Kartal, M. Kutlu, Tobb etu at checkthat! 2021: Data engineering for detecting
check-worthy claims, in: CEUR Workshop Proceedings, CEUR-WS, 2021.
[11] Y. S. Kartal, M. Kutlu, Re-think before you share: A comprehensive study on prioritizing
check-worthy claims, IEEE Transactions on Computational Social Systems (2022).
[12] C. Hansen, C. Hansen, J. G. Simonsen, C. Lioma, Neural weakly supervised fact check-
worthiness detection with contrastive sampling-based ranking loss., in: CLEF (Working
Notes), 2019.
[13] Y. S. Kartal, M. Kutlu, Trclaim-19: The first collection for turkish check-worthy claim detec-
tion with annotator rationales, in: Proceedings of the 24th Conference on Computational
Natural Language Learning, 2020, pp. 386–395.
[14] W. Antoun, F. Baly, H. Hajj, Arabert: Transformer-based model for arabic language
understanding, in: LREC 2020 Workshop Language Resources and Evaluation Conference
11–16 May 2020, p. 9.
[15] P. Delobelle, T. Winters, B. Berendt, RobBERT: a Dutch RoBERTa-based Language Model,
in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association
for Computational Linguistics, Online, 2020, pp. 3255–3265. URL: https://www.aclweb.org/
anthology/2020.findings-emnlp.292. doi:10.18653/v1/2020.findings-emnlp.292.
[16] M. S. H. Ameur, H. Aliane, Aracovid19-mfh: Arabic covid-19 multi-label fake news and
hate speech detection dataset, 2021. arXiv:2105.03143.
[17] A. Safaya, M. Abdullatif, D. Yuret, KUISAIL at SemEval-2020 task 12: BERT-CNN for
offensive speech identification in social media, in: Proceedings of the Fourteenth Workshop
on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona
(online), 2020, pp. 2054–2059. URL: https://www.aclweb.org/anthology/2020.semeval-1.271.
[18] G. Inoue, B. Alhafni, N. Baimukan, H. Bouamor, N. Habash, The interplay of variant, size,
and task type in Arabic pre-trained language models, in: Proceedings of the Sixth Arabic
Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv,
Ukraine (Online), 2021.
A. Abdaoui, C. Pradel, G. Sigel, Load what you need: Smaller versions of multilingual bert,
in: SustaiNLP / EMNLP, 2020.
[20] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. v. Noord, M. Nissim, BERTje: A
Dutch BERT Model, arXiv:1912.09582, 2019. URL: http://arxiv.org/abs/1912.09582.
[21] M. Müller, M. Salathé, P. E. Kummervold, Covid-twitter-bert: A natural language processing
model to analyse covid-19 content on twitter, arXiv preprint arXiv:2005.07503 (2020).
[22] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon,
Domain-specific language model pretraining for biomedical natural language processing,
2020. arXiv:arXiv:2007.15779.
[23] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT:
The muppets straight out of law school, in: Findings of the Association for Computational
Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp.
2898–2904. doi:10.18653/v1/2020.findings-emnlp.261.
[24] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for
self-supervised learning of language representations, CoRR abs/1909.11942 (2019). URL:
http://arxiv.org/abs/1909.11942. arXiv:1909.11942.
[25] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott,
Publicly available clinical bert embeddings, arXiv preprint arXiv:1904.03323 (2019).
[26] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive
language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and
Harms (WOAH 2021), Association for Computational Linguistics, Online, 2021, pp. 17–25.
URL: https://aclanthology.org/2021.woah-1.3. doi:10.18653/v1/2021.woah-1.3.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[28] N. Webersinke, M. Kraus, J. Bingler, M. Leippold, Climatebert: A pretrained language
model for climate-related text, arXiv preprint arXiv:2110.12010 (2021).