UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims

Ipek Baris Schlicht, Angel Felipe Magnossão de Paula and Paolo Rosso
Universitat Politècnica de València, Spain
ibarsch@doctor.upv.es (I. B. Schlicht); adepau@doctor.upv.es (A. F. M. d. Paula); prosso@dsic.upv.es (P. Rosso)

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within communities over what is considered check-worthy. In this paper, we propose language identification as an auxiliary task to mitigate unintended bias. For this purpose, we experiment with joint training using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of the language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.

Keywords
Check-worthy Claim Detection, Language Identification, Sentence Transformers, Multilingual, Joint Training, Bias

1. Introduction

The number of fact-checking initiatives worldwide has increased to fight misinformation. Manual fact-checking is a labor-intensive and time-consuming task that cannot cope with the pace at which misinformation is disseminated [1]. Therefore, the automation of fact-checking steps is required to speed up the process. Check-worthy claim detection is a crucial step of an automated fact-checking pipeline [2, 1, 3], prioritizing what needs to be fact-checked by fact-checkers or journalists.

There has been an ongoing effort by different research communities to address the claim detection task. Prior studies rely on machine learning methods that use statistical features with bag-of-words representations [4, 5, 6]. Additionally, the CLEF CheckThat! Lab (CTL) has organized shared tasks to tackle this problem in political debates [7, 8] and social media [9]. This year, CTL 2021 [10] organized the shared task in English, Turkish, Bulgarian, Spanish and Arabic, with the task datasets collected from social media [11]. The task's input is a tweet and the output is a score indicating the check-worthiness of the tweet.

Multilingual language models have been widely used in natural language understanding tasks with low-resourced languages (e.g. comment moderation [12], fake news detection [13]). However, cultural differences inevitably surface in tasks that require cultural context [14]. This issue could harm the transfer of knowledge across languages. Fact-checking is one such task, where disagreements over credibility assessments can exist even among domain experts [15, 16]. Furthermore, exposure to global claims and their credibility (e.g. about Covid-19) can vary by country [17].
With this motivation, in this paper we present a unified framework that processes the input in different languages and uses a multilingual sentence transformer, trained on the mixed-language training set, to learn representations for the low-resourced languages. To mitigate bias in the sentence representations, we introduce a language identification task and train the model jointly for the check-worthiness detection (CWD) and language identification (LI) tasks. Our contributions can be summarized as follows:

1. We introduce a framework designed to be aware of cultural bias, and we conduct an extensive analysis of its performance.
2. We employ joint learning to reduce unintended bias. To the best of our knowledge, a similar method has not been applied to reduce bias in multilingual fact-checking tasks.
3. Our framework can be extended with various multilingual transformer models available in Huggingface [18]. The source code and the trained models are publicly available1.

1 https://github.com/isspek/Cross_Lingual_Checkworthy_Detection

2. Related Work

ClaimBuster was the first study to address the check-worthy claim detection task. The component of ClaimBuster [4, 5] that detects check-worthy claims is trained with a Support Vector Machine (SVM) classifier using a tf-idf bag of words, named entity types, POS tags, sentiment, and sentence length as the feature set. [6] proposed a fully connected neural network model trained on claims and their related political debate content. Last year, CTL 2020 [9] organized a shared CWD task in English and Arabic for claims in social media. In that shared task, multilingual transformer models performed well on the Arabic dataset [19]. However, for the English dataset, the participants did not utilize multilingual transformer models. In our approach, we fine-tune a multilingual sentence transformer [20], which is computationally less expensive than the BERT models, on the mixed-language training dataset. We train one model and employ it for all languages.

Multi-task learning has proven to be an effective method for mitigating unintended bias. Das et al. [21] applied multi-task learning to gender, age and ethnicity classification from face images using a Convolutional Neural Network. As a related example in the Natural Language Processing (NLP) domain, Vaidya et al. [22] mitigated identity bias in toxic comment detection. Their model encodes the inputs with a Bidirectional Long Short-Term Memory network (BiLSTM). However, our approach and the tasks we deal with are different from those studies.

3. Methodology

In this section, we introduce our framework, which is depicted in Figure 1. The input of the framework is a Twitter post. The input is tokenized with the sentence transformer's tokenizer so that it can be fed into the transformer layer. After obtaining the shared text representation from the sentence transformer, the framework fine-tunes the shared representation and the classification layers for the CWD and LI tasks by minimizing a joint loss. In the following subsections, we give more details about the sentence transformer and the joint training.

Figure 1: Our proposed framework (QDMSBERT𝑗𝑜𝑖𝑛𝑡) for mitigating unintended bias.

3.1. Sentence Transformer

The framework uses a Sentence-BERT (SBERT) transformer [23], a modified BERT that uses siamese and triplet network structures. SBERT provides semantically more meaningful sentence embeddings than plain BERT models. To support multilingualism in our framework and to enable fine-tuning on a small GPU, we use a pre-trained SBERT that was obtained by applying knowledge distillation [20] and that was trained on a multilingual corpus from a community-driven Q&A website2. We refer to it as QDMSBERT. We apply mean pooling on the output of QDMSBERT to obtain sentence embeddings. We set the maximum token length to 128 by padding shorter texts and truncating longer texts.

2 https://www.quora.com/
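For illustration, the encoding step described above can be sketched as follows. This is a minimal sketch rather than the released implementation; in particular, the quora-distilbert-multilingual checkpoint is our assumption for QDMSBERT, and the encode() helper is hypothetical.

```python
# Minimal sketch (not the authors' released code): QDMSBERT sentence embeddings
# via mean pooling, assuming the "quora-distilbert-multilingual" checkpoint as
# the knowledge-distilled multilingual SBERT described in Section 3.1.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/quora-distilbert-multilingual"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode(tweets):
    # Pad shorter texts and truncate longer ones to a maximum of 128 tokens.
    batch = tokenizer(tweets, padding="max_length", truncation=True,
                      max_length=128, return_tensors="pt")
    output = encoder(**batch)
    token_embeddings = output.last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    # Mean pooling over non-padding tokens yields one embedding per tweet.
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

embeddings = encode(["Is this claim worth fact-checking?"])
```

Masking out padding tokens in the mean pooling matters here because every tweet is padded or truncated to 128 tokens.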
3.2. Joint Learning

The framework contains two task layers: one for the CWD task and the other for the LI task. The input to both task layers is the shared QDMSBERT embedding. Both task layers use the same neural network structure, consisting of two fully-connected layers followed by a softmax layer that outputs the probabilities of the task classes. During training, the weighted losses of the CWD and LI tasks are summed to compute the joint loss, as shown in Equation 1, where α is a value between 0 and 1 indicating the relative importance of the tasks. Lastly, the joint loss is minimized by optimizing the weights of the transformer network and the tasks' classification layers.

J_{joint} = \alpha J_{CWD} + (1 - \alpha) J_{LI}    (1)
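The joint objective in Equation 1 can be sketched in PyTorch as below. The hidden size of the heads and the use of a cross-entropy loss over logits (softmax applied inside the loss) are illustrative assumptions, not the authors' released configuration.

```python
# Sketch of the joint CWD + LI objective (Equation 1); hidden sizes and the
# number of language classes are illustrative assumptions.
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Two fully-connected layers; class probabilities come from the softmax in the loss."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, num_classes))

    def forward(self, x):
        return self.net(x)  # logits

cwd_head = TaskHead(in_dim=768, hidden_dim=256, num_classes=2)  # check-worthy vs. not
li_head = TaskHead(in_dim=768, hidden_dim=256, num_classes=5)   # En, Tr, Bg, Ar, Es
cross_entropy = nn.CrossEntropyLoss()

def joint_loss(shared_embeddings, cwd_labels, li_labels, alpha=0.6):
    # J_joint = alpha * J_CWD + (1 - alpha) * J_LI
    j_cwd = cross_entropy(cwd_head(shared_embeddings), cwd_labels)
    j_li = cross_entropy(li_head(shared_embeddings), li_labels)
    return alpha * j_cwd + (1 - alpha) * j_li
```

During training, this loss would be backpropagated through both heads and the shared QDMSBERT encoder.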
Table 1
Topic, class distribution and average number of tokens in the CheckThat! dataset. Pos-Class means check-worthy, Neg-Class means not check-worthy.

Properties           English   Turkish        Bulgarian  Arabic         Spanish
Topic                Covid-19  Miscellaneous  Covid-19   Miscellaneous  Politics
Pos-Class (Train)    290       729            392        921            200
Neg-Class (Train)    532       1170           2608       2798           2295
Pos-Class (Dev)      60        146            62         107            109
Neg-Class (Dev)      80        242            288        279            1138
Pos-Class (Test)     19        183            76         242            120
Neg-Class (Test)     331       830            281        358            1128
Avg. Tokens (Train)  31.69     19.11          20.27      27.85          36.73
Avg. Tokens (Dev)    34.71     18.22          16.66      36.68          36.19
Avg. Tokens (Test)   35.33     23.72          17.02      23.47          36.21

4. Experiments

In this section, we give the details of the CLEF 2021 CheckThat! dataset, explain the baselines and systems that we compare against, and present the experimental settings.

4.1. Dataset

The CLEF 2021 CheckThat! lab offers datasets in English, Spanish, Arabic, Turkish, and Bulgarian for the CWD task. The statistics of the datasets are given in Table 1. The class distribution of the datasets for each language is highly imbalanced, which reflects real-world conditions: check-worthy samples (Pos-Class) are the minority. The English and Bulgarian datasets contain only the COVID-19 topic. The Turkish dataset covers miscellaneous topics, and the Spanish dataset has only samples about politics. The topics of the Arabic dataset are mainly COVID-19 related.

4.2. Baselines

We compare the proposed model (QDMSBERT𝑗𝑜𝑖𝑛𝑡) against the following models and systems:

• SVM: It encodes the texts with unigrams.
• Monolingual Models and Mk-Bg-BERT: We use a distilled SBERT [23] model3 for the English samples. We could not find any monolingual SBERTs for Arabic, Turkish and Spanish; therefore, we use popular BERT [24] variants trained on monolingual corpora. TrBERT4 is the model for the Turkish samples, BETO [25] for Spanish, and AraBERT [26] for the tweets in Arabic. For the Bulgarian tweets, we leverage a BERT model (Mk-Bg-BERT) trained on Macedonian and Bulgarian corpora5.
• CLEF-2021: Submissions to the CLEF-2021 CWD task [11] that support all languages, namely Accenture, BigIR and TOBB ETU6.
• QDMSBERT: QDMSBERT𝑗𝑜𝑖𝑛𝑡 where the weights are optimized only for the CWD task.

3 https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens
4 https://huggingface.co/dbmdz/bert-base-turkish-cased
5 https://huggingface.co/anon-submission-mk/bert-base-macedonian-bulgarian-cased
6 At the time of writing, the system descriptions of these models were not available to us.

4.3. Experimental Settings and Environment

We randomly split the training dataset into five chunks and train five different QDMSBERT models, each for 3 epochs with the Adam optimizer with decoupled weight decay [27] and a batch size of 16. The mean of the models' predictions is the final score. We use the GPU of Google Colab7 for training the models.

7 https://colab.research.google.com/
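The following is a rough sketch of this training and scoring procedure, under the assumption that one model is fine-tuned per chunk; train_on_chunk and predict_proba are hypothetical stand-ins for the actual fine-tuning and inference code.

```python
# Sketch of the five-model training and score averaging described in Section 4.3.
# train_on_chunk() stands in for one fine-tuning run (3 epochs, AdamW, batch size 16)
# and is assumed to return a model exposing predict_proba(); neither is part of the
# released code.
import numpy as np
from sklearn.model_selection import KFold

def train_ensemble(train_texts, train_labels, test_texts, train_on_chunk):
    chunk_scores = []
    # Split the mixed-language training data into five random, disjoint chunks.
    for _, chunk_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(train_texts):
        chunk_x = [train_texts[i] for i in chunk_idx]
        chunk_y = [train_labels[i] for i in chunk_idx]
        model = train_on_chunk(chunk_x, chunk_y, epochs=3, batch_size=16)
        # Check-worthiness probability of every test tweet under this model.
        chunk_scores.append(model.predict_proba(test_texts)[:, 1])
    # The final check-worthiness score is the mean of the five models' predictions.
    return np.mean(chunk_scores, axis=0)
```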
5. Results

Table 2 presents the results of each model. We report the test results using the official metrics of the shared task: Mean Average Precision (MAP), precision at ranks 1 to 50 (P@1–P@50), R-Precision (R-Pr), and Reciprocal Rank (R-Rank).

We first compare QDMSBERT𝑗𝑜𝑖𝑛𝑡 with the SVM and QDMSBERT. QDMSBERT𝑗𝑜𝑖𝑛𝑡 outperforms QDMSBERT on most metrics across the languages except for Arabic, and it underperforms the SVM in Spanish. We see performance gains on the English, Bulgarian and Turkish samples. The results indicate that QDMSBERT𝑗𝑜𝑖𝑛𝑡 performs better on the COVID-19 examples but generalizes less well to the other topics in Spanish and Arabic. Among the teams that submitted runs in all languages (the CLEF-2021 group), QDMSBERT𝑗𝑜𝑖𝑛𝑡 performs best in English and second best in Bulgarian, which is promising for a low-resource language. Monolingual BERT models outperformed our model and the other teams' submissions in English and Spanish. TrBERT and AraBERT also show better results than our approach. Although we improve over QDMSBERT by mitigating differences across the languages, the performance of the monolingual embeddings remains unsurpassed in this task.

The presented results of QDMSBERT𝑗𝑜𝑖𝑛𝑡 were obtained with a task-loss contribution (α) of 0.6; this initial value was chosen heuristically. As an ablation study, we vary the α value of the tasks' loss and train QDMSBERT𝑗𝑜𝑖𝑛𝑡 for each α value to understand its influence on CWD learning (Table 3). The optimal α value is 0.8. For the Bulgarian samples, lower α values can also yield good performance on the CWD task.

Lastly, we analyze the feature representations of QDMSBERT and QDMSBERT𝑗𝑜𝑖𝑛𝑡. We visualize them by applying t-distributed stochastic neighbor embedding (t-SNE) [28], a nonlinear dimensionality reduction method. As depicted in Figure 2, the features that QDMSBERT𝑗𝑜𝑖𝑛𝑡 produces are more clearly separated. For instance, the cluster of English samples (lower right region of Figure 2a) in the t-SNE plot for QDMSBERT overlaps with both the Turkish and the Bulgarian clusters. In contrast, the t-SNE plot for QDMSBERT𝑗𝑜𝑖𝑛𝑡 shows that only very few non-English samples fall close to the English cluster (upper region of Figure 2b).
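A minimal sketch of how such a visualization can be produced from the sentence embeddings is given below; it assumes the embeddings are already available as a NumPy array and only groups the points by language.

```python
# Sketch of the feature-space inspection in Figure 2: project sentence embeddings
# to 2-D with t-SNE [28] and colour the points by language. The embeddings array
# (n_samples, hidden_dim) and the language labels are assumed to be precomputed.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, languages):
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    for lang in sorted(set(languages)):
        idx = [i for i, l in enumerate(languages) if l == lang]
        plt.scatter(points[idx, 0], points[idx, 1], s=8, label=lang)
    plt.legend()
    plt.title("t-SNE of sentence embeddings")
    plt.show()
```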
Table 2
The results of the models on the test set. Our submission is QDMSBERT𝑗𝑜𝑖𝑛𝑡.

Models          MAP    R-Rank  R-Pr   P@1    P@3    P@5    P@10   P@20   P@50
English:
SVM             0.052  0.020   0.000  0.000  0.000  0.000  0.000  0.000  0.020
SBERT           0.198  1.000   0.211  1.000  0.333  0.200  0.300  0.200  0.160
Accenture       0.101  0.143   0.158  0.000  0.000  0.000  0.200  0.200  0.100
BigIR           0.136  0.500   0.105  0.000  0.333  0.200  0.100  0.100  0.120
TOBB ETU        0.081  0.077   0.053  0.000  0.000  0.000  0.000  0.050  0.080
QDMSBERT        0.114  0.500   0.105  0.000  0.333  0.200  0.100  0.100  0.100
QDMSBERT𝑗𝑜𝑖𝑛𝑡    0.149  1.000   0.105  1.000  0.333  0.200  0.200  0.100  0.120
Turkish:
SVM             0.354  1.000   0.311  1.000  0.667  0.600  0.700  0.600  0.460
TrBERT          0.563  1.000   0.530  1.000  1.000  1.000  0.800  0.850  0.780
Accenture       0.402  0.250   0.415  0.000  0.000  0.400  0.400  0.650  0.660
BigIR           0.525  1.000   0.503  1.000  1.000  1.000  0.800  0.700  0.720
TOBB ETU        0.581  1.000   0.585  1.000  1.000  0.800  0.700  0.750  0.660
QDMSBERT        0.549  1.000   0.579  1.000  0.333  0.600  0.700  0.650  0.680
QDMSBERT𝑗𝑜𝑖𝑛𝑡    0.517  1.000   0.508  1.000  1.000  1.000  1.000  0.850  0.700
Bulgarian:
SVM             0.588  1.000   0.474  1.000  1.000  1.000  0.900  0.750  0.640
Mk-Bg-BERT      0.661  1.000   0.645  1.000  1.000  1.000  0.900  0.700  0.700
Accenture       0.497  1.000   0.474  1.000  1.000  0.800  0.700  0.600  0.440
BigIR           0.737  1.000   0.632  1.000  1.000  1.000  1.000  1.000  0.800
TOBB ETU        0.149  0.143   0.039  0.000  0.000  0.000  0.200  0.100  0.060
QDMSBERT        0.667  1.000   0.566  1.000  1.000  1.000  1.000  0.900  0.720
QDMSBERT𝑗𝑜𝑖𝑛𝑡    0.673  1.000   0.605  1.000  1.000  1.000  1.000  0.800  0.700
Arabic:
SVM             0.428  0.500   0.409  0.000  0.667  0.600  0.500  0.450  0.440
AraBERT         0.640  1.000   0.591  1.000  1.000  0.600  0.800  0.750  0.760
Accenture       0.658  1.000   0.599  1.000  1.000  1.000  1.000  0.950  0.840
BigIR           0.615  0.500   0.579  0.000  0.667  0.600  0.600  0.800  0.740
TOBB ETU        0.575  0.333   0.574  0.000  0.333  0.400  0.400  0.500  0.680
QDMSBERT        0.571  1.000   0.579  0.000  0.667  0.600  0.600  0.550  0.580
QDMSBERT𝑗𝑜𝑖𝑛𝑡    0.548  1.000   0.550  1.000  0.667  0.600  0.500  0.400  0.580
Spanish:
SVM             0.450  1.000   0.450  1.000  0.667  0.800  0.700  0.700  0.660
BETO            0.569  1.000   0.533  1.000  0.667  0.800  0.800  0.750  0.720
Accenture       0.491  1.000   0.508  1.000  0.667  0.800  0.900  0.700  0.620
BigIR           0.496  1.000   0.483  1.000  1.000  0.800  0.800  0.600  0.620
TOBB ETU        0.537  1.000   0.525  1.000  1.000  0.800  0.900  0.700  0.680
QDMSBERT        0.398  0.500   0.425  0.000  0.333  0.600  0.600  0.500  0.580
QDMSBERT𝑗𝑜𝑖𝑛𝑡    0.446  0.333   0.475  0.000  0.333  0.600  0.800  0.650  0.580

Table 3
The performance of QDMSBERT𝑗𝑜𝑖𝑛𝑡 under different α values.

α      MAP    R-Rank  R-Pr   P@1    P@3    P@5    P@10   P@20   P@50
English:
0.3    0.143  1.000   0.105  1.000  0.333  0.200  0.200  0.100  0.080
0.4    0.145  1.000   0.105  1.000  0.333  0.200  0.200  0.100  0.080
0.5    0.151  1.000   0.105  1.000  0.333  0.400  0.200  0.100  0.080
0.6    0.149  1.000   0.105  1.000  0.333  0.200  0.200  0.100  0.120
0.7    0.123  0.500   0.105  0.000  0.333  0.200  0.200  0.100  0.120
0.8    0.155  1.000   0.158  1.000  0.333  0.200  0.200  0.150  0.120
0.9    0.144  1.000   0.105  1.000  0.333  0.200  0.100  0.150  0.120
Turkish:
0.3    0.520  1.000   0.492  1.000  1.000  1.000  1.000  0.900  0.660
0.4    0.531  1.000   0.481  1.000  1.000  1.000  1.000  0.950  0.740
0.5    0.534  1.000   0.492  1.000  1.000  1.000  1.000  0.950  0.740
0.6    0.517  1.000   0.508  1.000  1.000  1.000  1.000  0.850  0.700
0.7    0.528  1.000   0.497  1.000  1.000  0.200  0.200  0.850  0.680
0.8    0.588  1.000   0.563  1.000  1.000  1.000  1.000  0.950  0.780
0.9    0.582  1.000   0.568  1.000  1.000  1.000  1.000  0.850  0.740
Bulgarian:
0.3    0.657  1.000   0.618  1.000  1.000  1.000  0.900  0.800  0.720
0.4    0.666  1.000   0.618  1.000  1.000  1.000  0.900  0.800  0.700
0.5    0.670  1.000   0.618  1.000  1.000  1.000  0.900  0.850  0.720
0.6    0.673  1.000   0.605  1.000  1.000  1.000  1.000  0.800  0.700
0.7    0.677  1.000   0.618  1.000  1.000  1.000  1.000  0.850  0.700
0.8    0.670  1.000   0.592  1.000  1.000  1.000  1.000  0.800  0.720
0.9    0.677  1.000   0.579  1.000  1.000  1.000  0.900  0.850  0.700
Arabic:
0.3    0.562  1.000   0.558  1.000  0.333  0.400  0.500  0.400  0.680
0.4    0.561  1.000   0.562  1.000  0.667  0.400  0.400  0.350  0.640
0.5    0.567  1.000   0.562  1.000  0.667  0.400  0.400  0.400  0.660
0.6    0.548  1.000   0.550  1.000  0.667  0.600  0.500  0.400  0.580
0.7    0.561  1.000   0.566  1.000  0.667  0.400  0.500  0.400  0.620
0.8    0.566  1.000   0.566  1.000  0.667  0.400  0.400  0.450  0.580
0.9    0.573  1.000   0.574  1.000  0.667  0.400  0.500  0.500  0.580
Spanish:
0.3    0.450  0.333   0.458  0.000  0.333  0.600  0.700  0.750  0.580
0.4    0.453  0.333   0.475  0.000  0.333  0.600  0.700  0.750  0.580
0.5    0.456  0.333   0.472  0.000  0.333  0.600  0.700  0.750  0.640
0.6    0.446  0.333   0.475  0.000  0.333  0.600  0.800  0.650  0.580
0.7    0.443  0.333   0.483  0.000  0.333  0.400  0.600  0.600  0.580
0.8    0.443  0.333   0.475  0.000  0.333  0.400  0.500  0.650  0.580
0.9    0.431  0.250   0.467  0.000  0.000  0.400  0.500  0.700  0.580

Figure 2: T-SNE visualization of QDMSBERT (a) and QDMSBERT𝑗𝑜𝑖𝑛𝑡 (b). 0: not check-worthy, 1: check-worthy.

6. Conclusion

In this paper, we proposed a method to tackle multilingual check-worthy claim detection. To mitigate bias due to cultural differences, we leveraged multilingual sentence BERTs as feature representations and trained them jointly with a language identification task. Our approach outperformed the SVM and QDMSBERT for almost all of the languages on the CLEF-2021 dataset. It was also among the top-performing approaches on the Bulgarian and English datasets among the submissions made for all of these languages. In the future, we will investigate how considering the images embedded in the tweets [29] influences the results.

Acknowledgement

The work of P. Rosso was partially funded by the Spanish Ministry of Science and Innovation under the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).

References

[1] L. Graves, Understanding the promise and limits of automated fact-checking, Factsheet 2 (2018) 2018–02.
[2] S. Cazalens, P. Lamarre, J. Leblay, I. Manolescu, X. Tannier, A content management perspective on fact-checking, in: WWW (Companion Volume), ACM, 2018, pp. 565–574.
[3] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: COLING, Association for Computational Linguistics, 2018, pp. 3346–3359.
[4] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 1835–1838.
[5] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever end-to-end fact-checking system, Proc. VLDB Endow. 10 (2017) 1945–1948.
[6] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, A context-aware approach for detecting worth-checking claims in political debates, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017, pp. 267–276.
[7] P. Atanasova, L. Màrquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. D. S. Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness, in: CLEF (Working Notes), volume 2125 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.
[8] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. D. S. Martino, Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 1: Check-worthiness, in: CLEF (Working Notes), volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
[9] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. D. S. Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. S. Ali, Overview of CheckThat! 2020: Automatic identification and verification of claims in social media, in: CLEF, volume 12260 of Lecture Notes in Computer Science, Springer, 2020, pp. 215–236.
[10] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 12th International Conference of the CLEF Association: Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization, CLEF '2021, Bucharest, Romania (online), 2021.
[11] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021.
[12] D. Korenčić, I. Baris, E. Fernandez, K. Leuschel, E. Salido, To block or not to block: Experiments with machine learning for news comment moderation, in: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, 2021, pp. 127–133.
[13] M. Z. Hossain, M. A. Rahman, M. S. Islam, S. Kar, BanFakeNews: A dataset for detecting fake news in Bangla, in: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), European Language Resources Association (ELRA), 2020.
[14] B. Y. Lin, F. F. Xu, K. Q. Zhu, S. Hwang, Mining cross-cultural differences and similarities in social media, in: ACL (1), Association for Computational Linguistics, 2018, pp. 709–719.
[15] M. Mensio, H. Alani, News source credibility in the eyes of different assessors, in: TTO, 2019.
[16] D. Bountouridis, M. Makhortykh, E. Sullivan, J. Harambam, N. Tintarev, C. Hauff, Annotating credibility: Identifying and mitigating bias in credibility datasets, 2019.
[17] K. Singh, G. Lima, M. Cha, C. Cha, J. Kulshrestha, Y.-Y. Ahn, O. Varol, Misinformation, believability, and vaccine acceptance over 40 countries: Takeaways from the initial phase of the COVID-19 infodemic, arXiv preprint arXiv:2104.10864 (2021).
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[19] M. Hasanain, T. Elsayed, bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic tweets by check-worthiness, in: CLEF 2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, 2020.
[20] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020.
[21] A. Das, A. Dantcheva, F. Bremond, Mitigating bias in gender, age and ethnicity classification: A multi-task convolution neural network approach, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[22] A. Vaidya, F. Mai, Y. Ning, Empirical analysis of multi-task learning for reducing identity bias in toxic comment detection, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020, pp. 683–693.
[23] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019.
[24] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), Association for Computational Linguistics, 2019, pp. 4171–4186.
[25] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[26] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, in: LREC 2020 Workshop Language Resources and Evaluation Conference, 11–16 May 2020, p. 9.
[27] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: ICLR (Poster), OpenReview.net, 2019.
[28] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[29] G. S. Cheema, S. Hakimov, E. Müller-Budack, R. Ewerth, On the role of images for analyzing claims in social media, in: CLEOPATRA@WWW, volume 2829 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 32–46.