TOBB ETU at CheckThat! 2022: Detecting Attention-Worthy and Harmful Tweets and Check-Worthy Claims

Ahmet Bahadir Eyuboglu (ahmetbahadireyuboglu@gmail.com), Mustafa Bora Arslan (mustafaboraarslan@outlook.com), Ekrem Sonmezer (sonmezerekrem@outlook.com), Mucahid Kutlu (m.kutlu@etu.edu.tr)

TOBB University of Economics and Technology, Ankara, Turkey

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy

ISSN 1613-0073

Keywords: Fact-Checking, Check-worthiness, Attention-worthy tweets, Harmful tweets, Factual Claims

In this paper, we present our participation in CLEF 2022 CheckThat! Lab's Task 1 on detecting check-worthy and verifiable claims and attention-worthy and harmful tweets. We participated in all subtasks of Task 1 for the Arabic, Bulgarian, Dutch, English, and Turkish datasets. We investigate the impact of fine-tuning various transformer models and how to increase training data size using machine translation. We also use feed-forward networks with the Manifold Mixup regularization for the respective tasks. We are ranked first in detecting factual claims in Arabic and harmful tweets in Dutch. In addition, we are ranked second in detecting check-worthy claims in Arabic and Bulgarian.

Introduction

Social media platforms have become one of the main information resources for people by enabling their users to easily share messages and follow others. While these platforms are extremely important in helping people share their thoughts and make their voices heard, they can also be used in a very negative way to spread misinformation and/or hateful messages, which negatively impacts individuals and societies. We have especially observed this dark side of social media platforms during the COVID-19 pandemic. For instance, misinformation and conspiracy theories about vaccines increased hesitation towards being vaccinated [1]. Furthermore, the messages spread on social media platforms might impact public opinion on a particular issue and mobilize people, forcing government entities to take action. For instance, government entities of several countries had to regularly share information about vaccines to reduce vaccine hesitancy (e.g., [2]).

In this paper, we explain our participation in Task 1 [3] of the CLEF CheckThat! 2022 Lab [4,5]. Task 1 covers four subtasks: 1) check-worthy claim detection (Subtask 1A), 2) verifiable factual claim detection (Subtask 1B), 3) harmful tweet detection (Subtask 1C), and 4) attention-worthy tweet detection (Subtask 1D). Subtask 1A covers six languages, namely Arabic, Bulgarian, Dutch, English, Spanish, and Turkish, while the other subtasks cover all the mentioned languages except Spanish. We participated in all subtasks for Arabic, Bulgarian, Dutch, English, and Turkish,¹ yielding 20 submissions in total.

In the development phase of the shared task, we explored three different research directions: i) fine-tuning various pre-trained transformer models, ii) increasing the training data for fine-tuning transformer models, and iii) applying the Manifold Mixup regularization technique [6] for the subtasks we participated in. In particular, we investigated 9, 3, 5, 13, and 3 different pre-trained transformer models for subtask 1A in Arabic, Bulgarian, Dutch, English, and Turkish, respectively. In addition, for subtask 1C, we explored increasing training data by back-translation and by machine-translating datasets in other languages. Next, we compared the Manifold Mixup approach, fine-tuning transformer models, and data augmentation by back-translation in all four subtasks to select models for our official submissions.

In our experiments with the development dataset, we find that the choice of transformer model causes dramatic changes in performance, suggesting that researchers should select their models carefully. In addition, our findings about the impact of artificially increasing the data are mixed. In particular, we observe that increasing training data usually has a negative impact on the Bulgarian and Turkish datasets in subtask 1C, while using additional data for the English and Dutch datasets improves the performance.

In the official ranking, we achieved mixed results. Considering tasks with at least four participants, we are ranked first in 1B-Arabic and second in 1A-Arabic and 1A-Bulgarian. We share our implementation of the Manifold Mixup method² for reproducibility of our results.

Approaches

We explore three different approaches for all subtasks including fine-tuning various transformer models, increasing dataset size via machine translation, and the Manifold Mixup regularization. In this section we explain each of them in detail.

Fine-Tuning Various Transformer Models

Prior works show the remarkable success of transformer models in various text classification tasks [7]. Furthermore, the best-performing systems in previous check-worthy claim detection tasks of the CheckThat! Lab [8] usually exploited various transformer models [9,10]. However, Kartal and Kutlu [11] show that performance varies dramatically across different transformer models. Therefore, in this approach, we explore several language-specific transformer models pre-trained with different datasets.

Increasing Training Data via Machine Translation

Training data has an enormous impact on the performance of the resultant models. Prior work on check-worthy claim detection investigated several ways to increase the training data size, such as back-translation [9], weak supervision [12], and utilizing datasets in other languages with multi-lingual models [11]. In this approach, we explore increasing training data size by two different methods: 1) utilizing datasets in other languages by machine-translating them into the respective language, and 2) paraphrasing the training data via back-translation and using the paraphrases as additional labeled data.

In the first method, we exploit datasets in several languages provided by the CheckThat! Lab organizers this year. In particular, in order to develop a model for a specific language, $L_O$, we first select a training dataset provided for another language and machine-translate its tweets into $L_O$ using Google Translate. Subsequently, we fine-tune a language-specific transformer model using the original data and the machine-translated data together. In subtask 1C, we machine-translate only tweets labeled as harmful, in order to reduce the imbalance in label distribution while increasing the training data size.

In our back-translation method, we first translate the original text into another language using Google Translate. Subsequently, we translate the resultant text back into the original language. This method is likely to create texts that differ slightly from the original ones but have the same or a similar meaning. Assuming that the change in the texts will not affect their labels, we combine the original data with the back-translated data and fine-tune a language-specific transformer model.
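The two augmentation methods described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the `translate(text, source, target)` callable, the `toy_translate` stand-in, and the example tweets are hypothetical, and a real system would plug in an actual machine translation service.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (tweet text, label), with 1 = positive class

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]


def back_translate(data: List[Example], translate: Translator,
                   lang: str, pivot: str) -> List[Example]:
    """Paraphrase each text via lang -> pivot -> lang, keeping its label."""
    out = []
    for text, label in data:
        pivot_text = translate(text, lang, pivot)
        out.append((translate(pivot_text, pivot, lang), label))
    return out


def translate_foreign(data: List[Example], translate: Translator,
                      source: str, target: str,
                      only_positive: bool = False) -> List[Example]:
    """Machine-translate a dataset from another language into the target one.

    With only_positive=True, only positive examples are translated,
    mirroring the harmful-only choice described for subtask 1C.
    """
    return [(translate(text, source, target), label)
            for text, label in data
            if label == 1 or not only_positive]


def toy_translate(text: str, source: str, target: str) -> str:
    """Stand-in for a real MT system: just tags text with the target language."""
    return f"[{target}] {text.split('] ', 1)[-1]}"


original = [("masks work", 0), ("vaccines are poison", 1)]
foreign = [("foreign harmless", 0), ("foreign harmful", 1)]

# Original data plus back-translated paraphrases (Turkish via English).
augmented_bt = original + back_translate(original, toy_translate, "tr", "en")
# Original data plus translated positive examples from another language.
augmented_mt = original + translate_foreign(
    foreign, toy_translate, "nl", "tr", only_positive=True)
```

In both cases the augmented list is what would be passed to fine-tuning in place of the original training set.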

Language-Specific BERT with Manifold Mixup

Many of the annotations in the shared task are subjective. For instance, whether a tweet requires the attention of government entities might depend on how much the annotators want governments to intervene in their lives. Similarly, prior work on check-worthiness points out the subjective nature of the task (e.g., [11,13]). In order to address this problem, we apply the Manifold Mixup regularization proposed by Verma et al. [6]. In particular, Manifold Mixup trains neural networks on linear combinations of hidden representations of training examples, yielding flattened class representations and smoother decision boundaries. Verma et al. [6] demonstrate that their approach yields more robust solutions in image classification. In our work, we use BERT embeddings to represent tweets and then train a four-layer feed-forward network with the Manifold Mixup method.
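The interpolation step at the core of Manifold Mixup can be sketched in a few lines. This is only an illustration of the mixing of two hidden representations and their one-hot labels with a coefficient drawn from a Beta distribution, as in Verma et al. [6]; it is not our training code. The function names and the value of `alpha` are assumptions made for the example.

```python
import random
from typing import List, Tuple


def mix(a: List[float], b: List[float], lam: float) -> List[float]:
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def manifold_mixup_pair(h_i: List[float], y_i: List[float],
                        h_j: List[float], y_j: List[float],
                        alpha: float = 2.0,
                        rng=random) -> Tuple[List[float], List[float], float]:
    """Mix two hidden representations and their one-hot labels.

    lam is drawn from Beta(alpha, alpha); alpha is an assumed hyperparameter.
    In a full model, h_i and h_j would be activations taken at a randomly
    chosen hidden layer, and the loss would use the mixed label.
    """
    lam = rng.betavariate(alpha, alpha)
    return mix(h_i, h_j, lam), mix(y_i, y_j, lam), lam


rng = random.Random(0)  # fixed seed so the example is repeatable
h_mix, y_mix, lam = manifold_mixup_pair(
    [1.0, 0.0], [1.0, 0.0],   # hidden vector and one-hot label of example i
    [0.0, 1.0], [0.0, 1.0],   # hidden vector and one-hot label of example j
    rng=rng)
```

Because the labels are interpolated with the same coefficient as the hidden states, the mixed label remains a valid probability distribution.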

In subtask 1D, we apply a different approach than in the other subtasks due to its severely imbalanced label distribution. In particular, there are nine labels in subtask 1D, but eight of them describe why a particular tweet is attention-worthy. In addition, the majority of the tweets have the "not attention-worthy" label. Therefore, we first binarize the labels by merging the variants of attention-worthy labels into a single one, yielding only two labels: 1) attention-worthy and 2) not attention-worthy. Subsequently, we under-sample the negative class with a 1/5 ratio and train our Manifold Mixup model. Next, we build another model using the eight labels for attention-worthy tweets. If a tweet is classified as attention-worthy, we use the second model to predict why it is attention-worthy. Otherwise, we do not use the second model and label the tweet as "not attention-worthy". Note that we do not apply this two-step approach to the other subtasks because they are already binary classification tasks.
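The two-step procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the "1/5 ratio" is interpreted here as keeping one fifth of the negative examples, and the two classifiers are hypothetical callables standing in for trained models.

```python
import random
from typing import Callable, List, Tuple

NEGATIVE = "not attention-worthy"
Example = Tuple[str, str]  # (tweet text, label)


def binarize(label: str) -> int:
    """Merge all attention-worthy variants into a single positive class."""
    return 0 if label == NEGATIVE else 1


def undersample_negatives(data: List[Example], keep_ratio: float = 0.2,
                          seed: int = 0) -> List[Example]:
    """Keep all positives and a random keep_ratio fraction of negatives.

    Assumption: a 1/5 ratio means one fifth of the negatives are kept.
    """
    rng = random.Random(seed)
    positives = [ex for ex in data if binarize(ex[1]) == 1]
    negatives = [ex for ex in data if binarize(ex[1]) == 0]
    kept = rng.sample(negatives, int(len(negatives) * keep_ratio))
    return positives + kept


def two_step_predict(text: str,
                     is_attention_worthy: Callable[[str], bool],
                     reason_classifier: Callable[[str], str]) -> str:
    """Step 1 decides attention-worthiness; step 2 predicts the reason."""
    if not is_attention_worthy(text):
        return NEGATIVE
    return reason_classifier(text)


# Toy data and toy classifiers, for illustration only.
data = [(f"t{i}", NEGATIVE) for i in range(10)]
data += [("a", "harmful"), ("b", "asks question")]
sampled = undersample_negatives(data)

toy_binary = lambda t: "?" in t or "!" in t
toy_reason = lambda t: "asks question" if "?" in t else "calls for action"
pred_question = two_step_predict("Is it safe?", toy_binary, toy_reason)
pred_plain = two_step_predict("hello", toy_binary, toy_reason)
```

The second classifier is never consulted for tweets rejected in the first step, so it only needs to be trained on the eight attention-worthy labels.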

Experiments

We first present statistics about the datasets and explain implementation details and our experimental setup in Section 3.1. Next, we explain how we selected our submissions in Section 3.2. Finally, we present the results of our submissions in Section 3.3.

Experimental Setup

Implementation

In order to fine-tune and configure transformer models, we use the PyTorch v1.9.0³ and TensorFlow⁴ libraries. We import the transformer models used in our experiments from Huggingface.⁵ In addition, we use Google's SentencePiece library for machine translation.⁶ We set the batch size to 32 in all our experiments with fine-tuned transformer models. In the experiments on increasing dataset size using machine translation, we train the models for 5 epochs.

We implemented the Manifold Mixup [6] method from scratch using PyTorch v1.9.0, and set the number of epochs and the batch size to 5 and 2, respectively. We use the following transformer models for each language: AraBERT.v02 [14] for Arabic, RoBERTa-base-bulgarian⁷ for Bulgarian, RobBERT [15] for Dutch, the uncased version of BERT-base⁸ for English, and DistilBERTurk⁹ for Turkish.

Evaluation Metrics

We use the official metric for each subtask to evaluate and compare our methods. In particular, we use the $F_1$ score of the positive class in subtasks 1A and 1C, accuracy in subtask 1B, and weighted $F_1$ in subtask 1D.
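For concreteness, the three metrics can be computed as in the following sketch. It assumes that weighted $F_1$ is the support-weighted average of per-class $F_1$ scores, which is the usual definition in common libraries; the official scorer may differ in details, and the function names and toy labels are ours.

```python
from collections import Counter
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                 cls: Hashable) -> float:
    """F1 of a single class, computed from TP, FP, and FN counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[Hashable],
                y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = sum(n * f1_for_class(y_true, y_pred, c)
                for c, n in support.items())
    return total / len(y_true)


# Toy example: 1 = positive class, 0 = negative class.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
positive_f1 = f1_for_class(y_true, y_pred, 1)   # metric for 1A and 1C
acc = accuracy(y_true, y_pred)                  # metric for 1B
w_f1 = weighted_f1(y_true, y_pred)              # metric for 1D
```

Positive-class $F_1$ ignores true negatives entirely, which is why it can diverge sharply from accuracy on imbalanced data.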

Datasets

The shared task organizers provide train, development, development test, and test datasets for each language and subtask. The number of tweets for each label in the train, development, development test, and test datasets in subtasks 1A, 1B, 1C, and 1D are presented in Tables 1, 2, 3, and 4, respectively.

In our experiments during the development phase, we use the train and development datasets for training and validation of the Manifold Mixup model, respectively. In our experiments for fine-tuning various transformer models and increasing dataset size via machine translation, we combine train and development sets for each case and fine-tune models accordingly. In all experiments during the development phase, we use the development test dataset for testing.

Experimental Results in the Development Phase

We participate in all subtasks of Task 1 for five languages, yielding 20 different submissions. In addition, we explore three different approaches to determine our final submissions. Therefore, in order to reduce the complexity of the experiments and meet the deadlines of the shared task, we first evaluate using various transformer models in subtask 1A and increasing training data size in subtask 1C, on the respective development test datasets. Next, based on our experiments in subtasks 1A and 1C, we compare three different approaches in all subtasks to determine our submissions for the official evaluation on the test data. We note that this is not an ideal way to select systems for submission, but we take this step to meet the deadlines.

Impact of Transformer Model on Detecting Check-Worthy Claims

In order to observe the impact of transformer models, we identify several transformer models available on the Huggingface platform based on their monthly download counts and evaluate their performance in subtask 1A. The number of transformer models we compare is 9, 3, 5, 13, and 3 for Arabic, Bulgarian, Dutch, English, and Turkish, respectively. We present the results in Table 5. Our observations based on these experiments are as follows.

Firstly, the results for English show the importance of the evaluation metric used to report the performance of systems. For instance, distilroberta-base-climate-f has the worst recall and $F_1$ scores, but achieves the best accuracy. Secondly, our results suggest that the text used in pre-training has a major impact on the models' performance. For instance, COVID-Twitter-BERT v1 achieves the best $F_1$ score among all English models. This is likely because it is pre-trained with tweets about COVID-19, and the tweets used in the shared task are also about COVID-19. Similarly, PubMedBERT, which is pre-trained with research articles on PubMed, yields the second-best results for English.

However, we also observe some unexpected results in our experiments. For instance, AraBERT.v1, which is pre-trained on a smaller dataset than other variants of AraBERT (i.e., AraBERTv0.2-Twitter, AraBERTv0.2, and AraBERTv2), outperforms all other Arabic-specific models. In addition, while DarijaBERT is pre-trained only with texts in Moroccan Arabic, it outperforms all other Arabic-specific models except AraBERT.v1. Furthermore, the best-performing model on the Turkish dataset is the one with the smallest vocabulary size. Therefore, our results show that it is not easy to choose a pre-trained model by just comparing models' configurations and the texts used in pre-training. We think that one of the reasons for these unexpected results is the subjective nature of the task [11].
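The gap between accuracy and $F_1$ noted above follows directly from the definition of $F_1$ as the harmonic mean of precision $P$ and recall $R$. As a worked check using the precision and recall values reported in Table 5, a model with very low recall receives a low $F_1$ regardless of its accuracy:

```latex
F_1 = \frac{2PR}{P + R}

% DistilRoBERTa-base-climate-f (accuracy 0.783, P = 0.631, R = 0.093)
F_1 = \frac{2 \cdot 0.631 \cdot 0.093}{0.631 + 0.093} \approx 0.162

% COVID-Twitter-BERT v1 (accuracy 0.721, P = 0.434, R = 0.798)
F_1 = \frac{2 \cdot 0.434 \cdot 0.798}{0.434 + 0.798} \approx 0.562
```

The first model has the higher accuracy because most tweets are not check-worthy, yet it finds fewer than one in ten of the check-worthy ones.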

Impact of Training Data in Detecting Harmful Tweets

We use roberta-small-bulgarian²³ for Bulgarian, BERTje [20] for Dutch, BERT-base-cased for English, and bert-base-turkish-sentiment-cased²⁴ for Turkish as language-specific transformer models. Table 6 shows the performance of each model when a dataset in a different language is machine-translated into the corresponding language and the respective language-specific model is fine-tuned with the original data and the machine-translated data. In this experiment, we are not able to report results for Arabic because we ran into technical challenges (e.g., insufficient memory) that prevented us from obtaining results. We observe that increasing training data does not always improve the performance. In particular, using only the original dataset yields the highest results for Turkish and Bulgarian, while the performance of models usually increases on the English and Dutch datasets when more labeled samples are used. The subjective nature of this task might be one of the reasons for the lower performance when using additional data from other languages. In particular, as each country is dealing with different social issues, it is likely that people living in different countries disagree on what makes a message harmful for a society. For instance, Turkish annotators might be more sensitive to tweets about refugees than annotators for other languages because Turkey hosts nearly 3.8 million refugees, i.e., the largest refugee population worldwide,²⁵ and thereby misinformation about refugees might have unpleasant consequences.

Another method to increase the training data size is back-translation, which is not affected by social differences across countries. Therefore, in our next experiment, we increase training data using various languages for back-translation. Again, we are not able to report results for Arabic due to technical challenges we encountered. In this experiment, we also use Spanish for back-translation of the Bulgarian dataset, but not of the others, in order to meet the deadlines of the lab. The results are shown in Table 7. We again observe that we achieve the best result for Turkish when we use only the original dataset for training. However, back-translation improves the performance on the Dutch and English datasets. For Bulgarian, back-translation has a minimal impact. We do not observe a particular language that yields consistently higher results than others when used as the language for back-translation.

Table 7: The impact of increasing training data using various languages for back-translation (BT). The best result for each language is written in bold.

Selecting Models for Submission

In order to select the models to submit for official ranking, we compare three different approaches for each subtask and language:

• Fine-tuning the best-performing pre-trained transformer model with the original dataset (FT-BP-TM). We use the best-performing pre-trained transformer model in our experiments in Section 3.2.1 for all subtasks except 1D. In particular, we fine-tune AraBERT.v1, RoBERTa-base-bulgarian, BERTje, COVID-Twitter-BERT v1, and BERTurk for Arabic, Bulgarian, Dutch, English, and Turkish, respectively, using the corresponding datasets.
• Fine-tuning a transformer model with back-translation (FT-TM-BT). We use the best-performing model in our experiments in Section 3.2.2. In particular, we use Spanish, Turkish, Bulgarian, and English for back-translation to increase the size of the Bulgarian, Dutch, English, and Turkish datasets, respectively. Note that back-translation does not improve the performance on the Turkish dataset. However, the FT-BP-TM approach already uses the original dataset for fine-tuning. Therefore, in this approach, we increase the size of the Turkish dataset using back-translation. In particular, we use English as the back-translation language because it yields the best results among the alternatives (see Table 7).
• Manifold Mixup. We use the Manifold Mixup model explained in Section 2.3.

Tables 8, 9, 10, and 11 present results comparing the three approaches for subtasks 1A, 1B, 1C, and 1D, respectively. Results for some cases are missing due to technical challenges we encountered and the limited time frame for submissions. For our submissions, we chose the best-performing method for each case and submitted our results accordingly.
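The selection rule can be illustrated with the development test scores for subtask 1A reported in Table 8. This is a sketch of the rule rather than the script we used; the `select_best` helper is hypothetical, missing results are encoded as `None`, and ties are broken in favor of the approach listed first, which is an assumption of this example.

```python
from typing import Dict, Optional

# Development test F1 scores for subtask 1A, copied from Table 8.
# None marks a missing result.
scores: Dict[str, Dict[str, Optional[float]]] = {
    "Manifold Mixup": {"Arabic": 0.14, "Bulgarian": 0.0, "Dutch": 0.58,
                       "English": 0.48, "Turkish": 0.22},
    "FT-TM-BT": {"Arabic": None, "Bulgarian": 0.42, "Dutch": 0.64,
                 "English": 0.48, "Turkish": 0.40},
    "FT-BP-TM": {"Arabic": 0.47, "Bulgarian": 0.47, "Dutch": 0.57,
                 "English": 0.55, "Turkish": 0.40},
}


def select_best(table: Dict[str, Dict[str, Optional[float]]]) -> Dict[str, str]:
    """Pick, for each language, the approach with the highest score.

    max() returns the first maximal element, so ties go to the approach
    that appears first in the table (an assumption of this sketch).
    """
    languages = next(iter(table.values())).keys()
    best = {}
    for lang in languages:
        candidates = [(model, row[lang]) for model, row in table.items()
                      if row[lang] is not None]
        best[lang] = max(candidates, key=lambda pair: pair[1])[0]
    return best


selected = select_best(scores)
```

With these inputs, the selected approaches coincide with the models listed for subtask 1A in Table 12.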

Results of Our Submissions

Table 12 shows our results and rankings for each case we participated in. We are ranked first in 1B Arabic and 1C Dutch. Focusing on subtasks with at least four participants, we are ranked second in 1A Arabic and 1A Bulgarian. We also observe that our rankings are generally higher in 1A than in the other subtasks.

Conclusion

In this paper, we present our participation in CLEF 2022 CheckThat! Lab's Task 1. We participated in all four subtasks of Task 1 for Arabic, Bulgarian, Dutch, English, and Turkish, yielding 20 submissions in total. We explore which transformer model yields the highest performance, the impact of increasing training data size by back-translation and by machine-translating datasets in other languages, and the Manifold Mixup method proposed by Verma et al. [6]. We are ranked first in subtask 1B for Arabic and in subtask 1C for Dutch. In addition, we are ranked second in subtask 1A for Arabic and Bulgarian.

Our observations based on our comprehensive experiments are as follows. Firstly, the performance of transformer models varies dramatically based on the text used for pre-training. Secondly, increasing training data does not always improve the performance. Therefore, it is important to consider biases existing in each dataset. Thirdly, we do not observe that a particular language used for back-translation yields consistently higher performance than others.

In the future, we plan to focus on the subjective nature of the tasks in this lab. In particular, we will first qualitatively analyze the datasets to better understand annotations. Subsequently, we plan to develop a model focusing on dealing with subjective annotations.

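Table 1: Data & Label Distribution for Each Language in Subtask 1A.

| Language | Label | Train | Dev. | Dev. Test | Test |
|---|---|---|---|---|---|
| English | not check-worthy | 1675 | 151 | 445 | 110 |
| English | check-worthy | 447 | 44 | 129 | 39 |
| Bulgarian | not check-worthy | 1493 | 141 | 413 | 73 |
| Bulgarian | check-worthy | 378 | 36 | 106 | 57 |
| Dutch | not check-worthy | 546 | 44 | 150 | 350 |
| Dutch | check-worthy | 377 | 28 | 102 | 316 |
| Turkish | not check-worthy | 1995 | 177 | 427 | 289 |
| Turkish | check-worthy | 422 | 45 | 84 | 14 |
| Arabic | not check-worthy | 1551 | 135 | 425 | 435 |
| Arabic | check-worthy | 962 | 100 | 266 | 247 |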

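Table 2: Data & Label Distribution for Each Language in Task 1B.

| Language | Label | Train | Dev. | Dev. Test | Test |
|---|---|---|---|---|---|
| English | not claim | 3031 | 276 | 828 | |
| English | claim | 292 | 31 | 82 | |
| Bulgarian | not claim | 839 | 74 | 217 | |
| Bulgarian | claim | 1871 | 177 | 519 | |
| Dutch | not claim | 1021 | 109 | 282 | |
| Dutch | claim | 929 | 72 | 252 | |
| Turkish | not claim | 828 | 72 | 222 | |
| Turkish | claim | 1589 | 150 | 438 | |
| Arabic | not claim | 1118 | 104 | 305 | |
| Arabic | claim | 2513 | 235 | 691 | |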

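Table 3: Data & Label Distribution for Each Language in Task 1C.

| Language | Label | Train | Dev. | Dev. Test | Test |
|---|---|---|---|---|---|
| English | not harmful | 3031 | 276 | 828 | 211 |
| English | harmful | 292 | 31 | 82 | 40 |
| Bulgarian | not harmful | 2341 | 209 | 636 | 314 |
| Bulgarian | harmful | 248 | 18 | 67 | 11 |
| Dutch | not harmful | 1775 | 165 | 476 | 1145 |
| Dutch | harmful | 171 | 14 | 55 | 215 |
| Turkish | not harmful | 1790 | 157 | 476 | 466 |
| Turkish | harmful | 627 | 65 | 174 | 46 |
| Arabic | not harmful | 2946 | 276 | 805 | 1011 |
| Arabic | harmful | 678 | 60 | 189 | |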

Table 4: Data & Label Distribution in Training (Tr), Development (D), Test Development (TD), and Test (T) Sets for Each Language in Subtask 1D.

Columns: English (Tr D TD T), Bulgarian (Tr D TD T), Dutch (Tr D TD T), Turkish (Tr D TD T), Arabic (Tr D TD T)
not interesting: 2851 267 774 202 2341209636308 15451424051078 1698151466429
harmful: 173 21 55 26 248 18 67 3 94 11 31 86 24 8 10 2 511 50 164 98
blame authorities: 138 7 36 735 7 9 3 128 10 39 54 82 8 21 5 71 5 17 61
calls for action: 483 12 44 1 3 1 27 5 11 22 15 1 5 4 36 6 19 53
discusses cure: 423 15 556 12 11 8 5 1 2 13 38 5 14 6
discusses action: 2717417 2 6 3 23 1 8 42 21 1 6 11 501 42
contains advice: 122416 1 3 1 38 2 10 12 4 1 5 0 79 3 20 48
asks question: 51111 0 0 1 84 6 26 29 16 2 5 7 98 14 17 47
other: 251512 1 1 1 5 1 1 20 6 1 1 1 8 2 5 27

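Table 5: Results of Various Transformer Models in Detecting Check-Worthy Claims. For each language the best-performing case is shown in bold.

| Language | Model | Accuracy | Precision | Recall | $F_1$ |
|---|---|---|---|---|---|
| Arabic | AraBERT.v1 [14] | 0.413 | 0.390 | 0.932 | **0.550** |
| Arabic | DarijaBERT¹⁰ | 0.499 | 0.420 | 0.789 | 0.548 |
| Arabic | Ara_DialectBERT¹¹ | 0.431 | 0.393 | 0.887 | 0.545 |
| Arabic | arabert_c19 [16] | 0.548 | 0.439 | 0.627 | 0.517 |
| Arabic | AraBERTv0.2-Twitter [14] | 0.600 | 0.482 | 0.526 | 0.503 |
| Arabic | bert-base-arabic [17] | 0.481 | 0.397 | 0.672 | 0.5 |
| Arabic | CAMeLBERT [18] | 0.451 | 0.372 | 0.620 | 0.465 |
| Arabic | bert-base-arabertv2¹² | 0.534 | 0.399 | 0.417 | 0.408 |
| Arabic | bert-base-arabertv02¹³ | 0.599 | 0.454 | 0.206 | 0.284 |
| Bulgarian | RoBERTa-base-bulgarian⁷ | 0.776 | 0.451 | 0.443 | **0.447** |
| Bulgarian | RoBERTa-small-bulgarian-POS¹⁴ | 0.485 | 0.259 | 0.820 | 0.394 |
| Bulgarian | bert-base-bg-cased [19] | 0.784 | 0.448 | 0.245 | 0.317 |
| Dutch | BERTje [20] | 0.619 | 0.516 | 0.941 | **0.666** |
| Dutch | RobBERT [15] | 0.650 | 0.549 | 0.764 | 0.639 |
| Dutch | bert-base-nl-cased¹⁵ | 0.559 | 0.469 | 0.676 | 0.554 |
| Dutch | bert-base-dutch-cased-finetuned-gem¹⁶ | 0.638 | 0.582 | 0.382 | 0.461 |
| English | COVID-Twitter-BERT v1 [21] | 0.721 | 0.434 | 0.798 | **0.562** |
| English | PubMedBERT [22] | 0.745 | 0.447 | 0.558 | 0.496 |
| English | BERT base model (uncased) [7] | 0.634 | 0.343 | 0.689 | 0.458 |
| English | LEGAL-BERT [23] | 0.630 | 0.326 | 0.604 | 0.423 |
| English | ALBERT Base v2 [24] | 0.689 | 0.353 | 0.457 | 0.398 |
| English | Bio_ClinicalBERT [25] | 0.682 | 0.337 | 0.426 | 0.376 |
| English | BERT base model (cased) [7] | 0.224 | 0.224 | 1.0 | 0.366 |
| English | bert-base-uncased-contracts¹⁷ | 0.740 | 0.405 | 0.333 | 0.365 |
| English | ALBERT Base v1¹⁸ | 0.707 | 0.338 | 0.317 | 0.328 |
| English | hateBERT [26] | 0.770 | 0.476 | 0.232 | 0.312 |
| English | COVID-Twitter-BERT v2 MNLI¹⁹ | 0.667 | 0.265 | 0.271 | 0.268 |
| English | RoBERTa base [27] | 0.731 | 0.295 | 0.139 | 0.189 |
| English | DistilRoBERTa-base-climate-f [28] | 0.783 | 0.631 | 0.093 | 0.162 |
| Turkish | BERTurk uncased 32K Vocabulary²⁰ | 0.760 | 0.333 | 0.385 | **0.357** |
| Turkish | BERTurk uncased 128K Vocabulary²¹ | 0.337 | 0.188 | 0.859 | 0.309 |
| Turkish | BERTurk cased 128K Vocabulary²² | 0.562 | 0.203 | 0.526 | 0.293 |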

Table 6: Impact of increasing training data by machine-translating another dataset in a different language in detecting harmful tweets. We report the $F_1$ score for each case. The best result for each language is written in bold.

| Machine-Translated Data | Bulgarian | Dutch | English | Turkish |
|---|---|---|---|---|
| None | **0.26** | 0.26 | 0.11 | **0.55** |
| Bulgarian | - | **0.39** | 0.23 | 0.13 |
| Dutch | 0.23 | - | 0.23 | 0.53 |
| English | 0.21 | **0.39** | - | 0.48 |
| Turkish | 0.19 | 0.25 | **0.25** | - |
| Arabic | 0.16 | 0.27 | 0.21 | 0.47 |

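Table 8: Development Test Results in Subtask 1A for $F_1$ Score for the Positive Class

| Model | Arabic | Bulgarian | Dutch | English | Turkish |
|---|---|---|---|---|---|
| Manifold Mixup | 0.14 | 0 | 0.58 | 0.48 | 0.22 |
| FT-TM-BT | - | 0.42 | 0.64 | 0.48 | 0.40 |
| FT-BP-TM | 0.47 | 0.47 | 0.57 | 0.55 | 0.40 |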

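Table 9: Development Test Results in Subtask 1B for $F_1$ Score for the Positive Class

| Model | Arabic | Bulgarian | Dutch | English | Turkish |
|---|---|---|---|---|---|
| Manifold Mixup | 0.76 | 0.75 | 0.49 | 0.67 | 0.63 |
| FT-TM-BT | - | 0.86 | 0.73 | - | 0.78 |
| FT-BP-TM | - | 0.87 | 0.72 | 0.76 | 0.78 |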

Table 10: Development Test Results in Subtask 1C for $F_1$ Score for the Positive Class

| Model | Arabic | Bulgarian | Dutch | English | Turkish |
|---|---|---|---|---|---|
| Manifold Mixup | 0.64 | 0 | 0.12 | 0.18 | 0.30 |
| FT-TM-BT | 0.12 | 0.27 | 0.41 | 0.30 | 0.54 |
| FT-BP-TM | - | 0.24 | 0.33 | 0.35 | 0.52 |

Table 11: Development Test Results in Subtask 1D for Average Weighted $F_1$. We do not have results for the FT-BP-TM case in this experiment.

| Model | Arabic | Bulgarian | Dutch | English | Turkish |
|---|---|---|---|---|---|
| Manifold Mixup | 0.65 | 0.80 | 0.65 | 0.78 | 0.79 |
| FT-TM-BT | - | 0.33 | 0.31 | - | 0.28 |

Table 12: Results for our official submissions. Results show $F_1$, accuracy, $F_1$, and weighted $F_1$ scores for tasks 1A, 1B, 1C, and 1D, respectively (i.e., the official evaluation metrics).

| Task | Language | Submitted Model | Rank | Score |
|---|---|---|---|---|
| 1A | Arabic | FT-BP-TM | 2 (out of 5) | 0.495 |
| 1A | Bulgarian | FT-BP-TM | 2 (out of 6) | 0.542 |
| 1A | Dutch | FT-TM-BT | 3 (out of 6) | 0.534 |
| 1A | English | FT-BP-TM | 4 (out of 14) | 0.561 |
| 1A | Turkish | FT-TM-BT | 3 (out of 5) | 0.118 |
| 1B | Arabic | Manifold Mixup | 1 (out of 4) | 0.570 |
| 1B | Bulgarian | FT-BP-TM | 2 (out of 3) | 0.742 |
| 1B | Dutch | FT-TM-BT | 2 (out of 3) | 0.658 |
| 1B | English | FT-BP-TM | 9 (out of 10) | 0.641 |
| 1B | Turkish | FT-TM-BT | 4 (out of 4) | 0.729 |
| 1C | Arabic | Manifold Mixup | 2 (out of 3) | 0.268 |
| 1C | Bulgarian | FT-TM-BT | 2 (out of 3) | 0.054 |
| 1C | Dutch | FT-TM-BT | 1 (out of 3) | 0.147 |
| 1C | English | FT-BP-TM | 5 (out of 12) | 0.329 |
| 1C | Turkish | FT-TM-BT | 3 (out of 5) | 0.262 |
| 1D | Arabic | Manifold Mixup | 2 (out of 2) | 0.184 |
| 1D | Bulgarian | Manifold Mixup | 2 (out of 3) | 0.887 |
| 1D | Dutch | Manifold Mixup | 2 (out of 3) | 0.694 |
| 1D | English | Manifold Mixup | 4 (out of 7) | 0.670 |
| 1D | Turkish | Manifold Mixup | 3 (out of 3) | 0.806 |

¹ We could not participate for Spanish due to a technical problem we encountered during development.
² https://github.com/Carnagie/manifold-mixup-text-classification
³ https://pytorch.org/
⁴ https://www.tensorflow.org/
⁵ https://huggingface.co/docs/transformers/index
⁶ https://github.com/google/sentencepiece
⁷ https://huggingface.co/iarfmoose/roberta-base-bulgarian
⁸ https://huggingface.co/bert-base-uncased
⁹ https://huggingface.co/dbmdz/distilbert-base-turkish-cased
¹⁰ https://huggingface.co/Kamel/DarijaBERT
¹¹ https://huggingface.co/MutazYoune/Ara_DialectBERT
¹² https://huggingface.co/aubmindlab/bert-base-arabertv2
¹³ https://huggingface.co/aubmindlab/bert-base-arabertv02

References

[1] J. Roozenbeek, C. R. Schneider, S. Dryhurst, J. Kerr, A. L. Freeman, G. Recchia, A. M. Van Der Bles, S. Van Der Linden, Susceptibility to misinformation about COVID-19 around the world, Royal Society Open Science 7 (2020) 201199.
[2] Republic of Turkey Ministry of Health, COVID-19 vaccination information platform, 2022. Accessed: 2022-06-22.
[3] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[4] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! Lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022.
[5] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[6] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, Y. Bengio, Manifold mixup: Better representations by interpolating hidden states, in: International Conference on Machine Learning, PMLR, 2019.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[8] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: CLEF (Working Notes), 2021.
[9] E. Williams, P. Rodrigues, S. Tran, Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation, arXiv preprint arXiv:2107.05684 (2021).
[10] M. Zengin, Y. Kartal, M. Kutlu, TOBB ETU at CheckThat! 2021: Data engineering for detecting check-worthy claims, in: CEUR Workshop Proceedings, CEUR-WS, 2021.
[11] Y. S. Kartal, M. Kutlu, Re-think before you share: A comprehensive study on prioritizing check-worthy claims, IEEE Transactions on Computational Social Systems (2022).
[12] C. Hansen, C. Hansen, J. G. Simonsen, C. Lioma, Neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss, in: CLEF (Working Notes), 2019.
[13] Y. S. Kartal, M. Kutlu, TrClaim-19: The first collection for Turkish check-worthy claim detection with annotator rationales, in: Proceedings of the 24th Conference on Computational Natural Language Learning, 2020.
[14] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, in: LREC 2020 Workshop Language Resources and Evaluation Conference 11-16 May 2020, p. 9.
[15] P. Delobelle, T. Winters, B. Berendt, RobBERT: a Dutch RoBERTa-based Language Model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.findings-emnlp.292.
[16] M. S. H. Ameur, H. Aliane, AraCOVID19-MFH: Arabic COVID-19 multi-label fake news and hate speech detection dataset, arXiv:2105.03143 (2021).
[17] A. Safaya, M. Abdullatif, D. Yuret, KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), 2020.
[18] G. Inoue, B. Alhafni, N. Baimukan, H. Bouamor, N. Habash, The interplay of variant, size, and task type in Arabic pre-trained language models, in: Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Online), 2021.
[19] A. Abdaoui, C. Pradel, G. Sigel, Load what you need: Smaller versions of mutlilingual BERT, in: SustaiNLP / EMNLP, 2020.
[20] W. Vries, A. Van Cranenburgh, A. Bisazza, T. Caselli, G. V. Noord, M. Nissim, BERTje: A Dutch BERT Model, arXiv:1912.09582 (2019).
[21] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, arXiv preprint arXiv:2005.07503 (2020).
[22] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, arXiv:2007.15779 (2020).
[23] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.findings-emnlp.261.
[24] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, CoRR abs/1909.11942 (2019).
[25] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
[26] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.woah-1.3.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).
[28] N. Webersinke, M. Kraus, J. Bingler, M. Leippold, ClimateBERT: A pretrained language model for climate-related text, arXiv preprint arXiv:2110.12010 (2021).