<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conspiracy Theory Detection using Transformers with Multi-task and Multilingual Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leon Zrnić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zagreb, Faculty of Electrical Engineering and Computing</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>The COVID-19 pandemic sparked a new age of conspiracy theories in society. This has become an issue, especially since these theories are mixed in with reasonable arguments that criticize the measures taken by governments and their effects. One way to help differentiate these two narratives is to use natural language processing (NLP) models such as Transformers. These working notes detail several approaches to classifying conspiracy and critical narratives and to identifying the key narrative elements present in these texts. We apply these models to two datasets of English and Spanish Telegram messages about the COVID-19 pandemic. Our approaches include using pre-trained BERT and RoBERTa models on monolingual datasets, a multilingual approach in which we either translate the Spanish texts into English or use a multilingual model on non-translated texts, and a multi-task model architecture for the identification of narrative elements. Our results show that BERT pre-trained on COVID-19 tweets performed similarly to RoBERTa in the binary classification task, while RoBERTa worked better in the token classification task. The monolingual English approach yielded better results than the multilingual one, which in turn outperformed the Spanish models. We conclude that transformer models can achieve good results in these classification tasks, making them an easy-to-deploy way to differentiate critical narratives from conspiracy theories.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>NLP</kwd>
        <kwd>conspiracy theories</kwd>
        <kwd>transformers</kwd>
        <kwd>multilingual</kwd>
        <kwd>multi-task model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The COVID-19 pandemic has flooded digital platforms with both essential updates and conspiracy
theories. This surge of information creates the challenge of distinguishing between legitimate critical
narratives and harmful conspiracy theories. Critical narratives question established systems using
evidence and reason, while conspiracy theories claim secret plots without substantial proof.
Differentiating between the two is vital for effective public health communication and social stability. It ensures informed
decision-making, as critical narratives drive constructive scrutiny based on evidence, while conspiracy
theories spread misinformation and cause societal divisions.</p>
      <p>One way to differentiate these narratives is through the use of natural language processing (NLP)
models. Such automatic classifiers can speed up the identification of conspiracy narratives, removing
the need for human annotation in the process.</p>
      <p>In these working notes, we describe our approach to creating and training such automatic classifiers
for two separate classification tasks. The first is a binary classification task in which a model
differentiates between conspiracy and critical narratives. The second task is the identification of
the key narrative elements present in conspiracy and critical narrative texts regarding the COVID-19
pandemic.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task descriptions</title>
      <p>
        As part of the PAN at CLEF 2024 Oppositional thinking analysis: Conspiracy theories vs critical thinking
narratives shared task [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], we participated in two different tasks. The first task was a binary classification
task in which participants needed to build models that differentiated between conspiracy and
critical narratives in Telegram messages about the COVID-19 pandemic. The second task was a token
classification task in which models needed to find six different key narrative elements present in the
Telegram messages.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>As mentioned, there are two training datasets: one with English Telegram messages and another with
Spanish Telegram messages, each containing 4000 annotated messages about the COVID-19 pandemic.</p>
      <p>Messages are labeled as either Critical or Conspiracy. Critical messages discuss the pandemic with
reasoned arguments, questioning government measures. Conspiracy messages claim hidden plots aim
to undermine freedom and establish a new world order.</p>
      <p>Each message also has token-level annotations for six narrative elements, as defined by the dataset
authors:
• Agents,
• Facilitators,
• Victims,
• Campaigners,
• Objectives, and
• Negative effects.</p>
      <p>To further analyze the dataset, we examined the differences between texts labeled as conspiracy and
critical in two ways. First, we looked at the lengths of the messages, as shown in Table 1. On average,
conspiratorial messages are almost twice as long as critical messages, and Spanish texts are longer than
English texts.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section describes the methodology used in our work. Subsection 4.1 briefly explains the transformer
architecture and the models we used. Subsection 4.2 details the two multilingual approaches we used,
namely text translation and multilingual models. In Subsection 4.3 we explain the multi-task model
architecture that was used for the token classification task, and in Subsection 4.4 we detail Stratified
K-fold cross-validation with which we evaluated the performance of our models during training.</p>
      <sec id="sec-4-1">
        <title>4.1. Transformer models</title>
        <p>
          Transformers [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are deep learning neural network architectures primarily used in NLP. They leverage
the attention mechanism proposed by [4], allowing the model to focus on crucial parts of a text for
a given task. Transformers can be applied to various tasks, including text summarization, question
answering, and binary and token classification, as explored in our work.
        </p>
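        <p>For reference, the core operation of the architecture in [3] is scaled dot-product attention, in which the queries Q, keys K, and values V are learned projections of the token representations and d<sub>k</sub> is the key dimension: Attention(Q, K, V) = softmax(QK<sup>T</sup> / √d<sub>k</sub>) V.</p>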
        <p>Since their introduction, many transformer architectures have emerged, with BERT (Bidirectional
Encoder Representations from Transformers) [5] being one of the most popular. BERT improves on the
original transformer by using a bidirectional approach, analyzing the entire sentence to determine the
importance of each word, unlike the original architecture, which only considered preceding words.</p>
        <p>The availability of pre-trained models has significantly contributed to the widespread use of
transformers in the NLP community. All pre-trained models were sourced from the HuggingFace transformers
library [6]. The following models were used for the binary classification task (a minimal loading sketch follows this list):
• English:
– bert-base-cased [5],
– bert-large-cased [5],
– roberta-base [7],
– roberta-large [7], and
– digitalepidemiologylab/covid-twitter-bert-v2 [8] (referred to as ct-bert);
• Spanish:
– dccuchile/bert-base-spanish-wwm-cased [9] (referred to as bert-spanish) and
– PlanTL-GOB-ES/roberta-large-bne [10] (referred to as roberta-spanish).</p>
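        <p>As an illustration, the minimal sketch below shows how one of these pre-trained checkpoints can be loaded for the binary classification task; the checkpoint can be swapped for any of the models listed above, and the example message is a placeholder.</p>
        <preformat>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any of the checkpoints listed above can be substituted here.
checkpoint = "digitalepidemiologylab/covid-twitter-bert-v2"   # ct-bert

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,   # Critical vs. Conspiracy
)

# Tokenize a placeholder Telegram message and obtain class logits.
inputs = tokenizer(
    "Example Telegram message about the pandemic.",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
logits = model(**inputs).logits   # shape: (1, 2)
        </preformat>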
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Multilingual approach</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Text translation</title>
          <p>One way to utilize all the available data is to translate the data from one language into the other. This
gives us twice the amount of data compared to the monolingual approach, in which we use only half
of all the data. To this end, we used the translate Python package (https://pypi.org/project/translate/).
From this package, we used the MyMemory [11] translation provider, which offers several different machine translation
models together with a linguistic database.</p>
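          <p>A minimal sketch of how the Spanish messages could be translated with this package is shown below; it assumes the package’s default MyMemory provider and a placeholder example sentence, and exact option names may vary between package versions.</p>
          <preformat>
from translate import Translator

# MyMemory is the package's default translation provider.
translator = Translator(from_lang="es", to_lang="en")

spanish_texts = ["Las medidas del gobierno deben ser cuestionadas."]  # placeholder
english_texts = [translator.translate(text) for text in spanish_texts]
          </preformat>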
          <p>Since English is a high-resource language, we decided that for this approach we would use monolingual
English models that worked on a dataset that combined the original English texts and the Spanish texts
translated into English.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Multilingual models</title>
          <p>Using multilingual models can address the issue of utilizing only half the available data. An example is
the xlm-roberta-base [12] model, which was “pre-trained on 2.5TB of CommonCrawl data in 100
languages”.</p>
          <p>Multilingual models leverage shared learning across languages, allowing for the use of double the
data compared to monolingual models. English, a high-resource language, can enhance the classification
of Spanish texts, improving performance through shared representations and transfer learning, which
also aids generalization.</p>
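          <p>A brief sketch of this setup is shown below: a single xlm-roberta-base classifier is loaded and fed a mixed English–Spanish batch through its shared subword vocabulary; the example sentences are placeholders.</p>
          <preformat>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# English and Spanish messages share one training set; the multilingual
# subword vocabulary covers both languages (placeholder sentences).
mixed_batch = tokenizer(
    ["The measures only serve hidden interests.",
     "Las medidas solo sirven a intereses ocultos."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
logits = model(**mixed_batch).logits   # shape: (2, 2)
          </preformat>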
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Multi-task model</title>
        <p>Multi-task Learning (MTL) models [13] are trained on related tasks to create representations that can
handle multiple objectives. They use two main architectures: hard parameter sharing and soft parameter
sharing.</p>
        <p>In hard parameter sharing, the model has one main pipeline of shared layers while keeping task-specific
layers separate for each task. Figure 1 showcases this MTL architecture. This approach
reduces overfitting and enables knowledge transfer between tasks. For example, representations learned
from a binary classification task can aid in token classification.</p>
        <p>The second way of building an MTL model is soft parameter sharing. Figure 2 shows the structure
of one such model. Soft parameter sharing involves separate models for each task, with regularized
layers to keep parameters similar. [14] states that there are different ways of regularizing these models,
such as the L2 distance [15] or the trace norm [16].</p>
        <p>We employ an MTL model for token classification using a hard parameter-sharing transformer model.
It shares a common hidden-layer backbone with six separate classification heads for the different narrative
elements. Different pre-trained transformers serve as backbones for the two datasets. Figure 3 visualizes
the model architecture used (see also the multi-task transformer tutorial at https://towardsdatascience.com/how-to-create-and-train-a-multi-task-transformer-model-18c54a146240).</p>
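        <p>A minimal sketch of this hard parameter-sharing architecture is given below, assuming a PyTorch implementation on top of a HuggingFace backbone; the class name, element keys, and label counts are ours and are simplified for illustration.</p>
        <preformat>
from torch import nn
from transformers import AutoModel

# The six narrative elements annotated in the dataset.
NARRATIVE_ELEMENTS = ["agent", "facilitator", "victim",
                      "campaigner", "objective", "negative_effect"]

class MultiTaskTokenClassifier(nn.Module):
    """Shared transformer backbone with one token-classification head per element."""

    def __init__(self, checkpoint="roberta-large", num_labels_per_head=3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(checkpoint)   # shared layers
        hidden = self.backbone.config.hidden_size
        # One independent linear head per narrative element (e.g. three BIO labels each).
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, num_labels_per_head)
            for name in NARRATIVE_ELEMENTS
        })

    def forward(self, input_ids, attention_mask):
        token_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                      # (batch, seq_len, hidden)
        # Each head produces per-token logits for its narrative element.
        return {name: head(token_states) for name, head in self.heads.items()}
        </preformat>
        <p>During training, the six per-head token-classification losses can simply be summed so that gradients from every task flow back through the shared backbone.</p>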
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Stratified K-fold cross-validation</title>
        <p>Since only the training dataset was available for most of our work, we used an artificial test dataset for
evaluation during training. We created this dataset using Stratified K-fold cross-validation, which splits
the training set into K equal-sized subsets while preserving the class label ratio in each fold. In each
epoch, the model is trained K times, using K − 1 folds for training and the remaining fold for validation.
The model’s performance is then averaged across all folds and epochs. This method, implemented with
Scikit-learn [17], allowed us to obtain performance scores without the official test dataset. The best
models were ultimately evaluated on the official test dataset at the competition’s end.</p>
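        <p>A sketch of this evaluation procedure with Scikit-learn’s StratifiedKFold is shown below (K = 5, as noted in Section 5); train_and_evaluate is a hypothetical placeholder for the model-specific fine-tuning and scoring code.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, n_splits=5, seed=42):
    """Average validation score over stratified folds (class ratios preserved per fold)."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(texts, labels):
        # train_and_evaluate is a hypothetical placeholder for fine-tuning a
        # transformer on the training folds and scoring it on the held-out fold.
        scores.append(train_and_evaluate(texts[train_idx], labels[train_idx],
                                         texts[val_idx], labels[val_idx]))
    return float(np.mean(scores))
        </preformat>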
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental setup</title>
      <p>In this section, we present the technical details regarding our setup. For both tasks, we used Stratified
5-fold cross-validation. The hyperparameters for the transformers were 10 epochs, a learning rate of
2 × 10<sup>−5</sup>, a batch size of 32, a weight decay of 0.01, and a warmup ratio of 0.1. We
also increased the maximum sequence length from the base length of 256 to 512. The models were
trained on an Nvidia A100 graphics card with 40 GB of memory. All of the models and their tokenizers
were from the HuggingFace [6] library.</p>
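      <p>The sketch below shows how these hyperparameters could map onto the HuggingFace Trainer API under our interpretation; the warmup value is treated as a warmup ratio, and the output path, model, and dataset objects are placeholders.</p>
      <preformat>
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # placeholder output path
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    warmup_ratio=0.1,                  # interpreted as 0.1 of the total training steps
)

trainer = Trainer(
    model=model,                       # any of the pre-trained models from Section 4.1
    args=training_args,
    train_dataset=train_fold,          # placeholder: tokenized training folds
    eval_dataset=validation_fold,      # placeholder: tokenized validation fold
)
trainer.train()
      </preformat>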
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>Here we present the results from our different approaches to the two classification tasks. In Subsection 6.1
we detail the results obtained while self-validating on the training dataset. Subsection 6.2 shows the results
of the models we submitted for evaluation on the official test dataset.</p>
      <sec id="sec-6-1">
        <title>6.1. Experimental results</title>
        <sec id="sec-6-1-1">
          <title>6.1.1. Task 1: Binary classification</title>
          <p>During our own evaluation of the binary classification models, ct-bert performed on par with the RoBERTa models on the English dataset. The monolingual English models achieved the best results, followed by the multilingual xlm-roberta-base model and then the Spanish models, while the translation-based approach did not achieve good results.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Task 2: Token classification</title>
          <p>Here we detail the results that the token classification models achieved during our own evaluation. Table 5
and Table 6 show the results for the English and Spanish models, respectively. The roberta-large model
achieved the best results on the English dataset, while roberta-spanish was the best model on the
Spanish dataset.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Official results</title>
        <p>This subsection contains the tables with the results we achieved on the official test dataset. Table 7
contains the binary classification results and Table 8 the results for the token classification task.</p>
        <p>When comparing these results to the other teams competing at PAN, we placed sixth in the English
variant of the first task and ninth in the Spanish variant. In the second task, we placed second in both
the English and Spanish variants.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In our work, we explored the application of transformer models to the tasks of binary classification of
conspiracy and critical narratives and the token classification of key narrative elements in the dataset.
Our results show that transformer models such as BERT and RoBERTa are highly effective in both binary
and token classification tasks in the domain of COVID-19 messages. In the binary classification task, the
English transformers performed better than the Spanish ones. There are many possible reasons for this,
such as the quality and size of the pre-training data, the pre-training approaches, and the differences
between English and Spanish. The translation approach did not succeed in achieving good results.
We attribute this to the poor translation capabilities of MyMemory. Further work could use different
translation methods, such as transformer-based machine translation [18, 19]. On the other hand, the
multilingual transformer model had good results when compared to the monolingual approaches. In
the token classification task, the best-performing English and Spanish models had the same F1 score.
However, there were differences in the performance of the models when looking at the F1 scores for
each annotation. For example, the Spanish models were better at detecting the Negative effect and Victim
annotations. Further work should explore the differences between English and Spanish conspiracy
theories on a semantic level. Multilingual models could perhaps leverage these differences to achieve
better results than the monolingual models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2024:
          <article-title>Multiauthor writing style analysis, multilingual text detoxification, oppositional thinking analysis, and generative AI authorship verification - condensed lab overview, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Fifteenth International Conference of the CLEF Association CLEF-2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bonet-Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <article-title>Overview of the oppositional thinking analysis PAN task at CLEF 2024</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2016. arXiv:1409.0473.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805. doi:10.48550/ARXIV.1810.04805.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, Frontiers in Artificial Intelligence 6 (2023). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1023281. doi:10.3389/frai.2023.1023281.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156. doi:10.26342/2022-68-3.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Trombetti, MyMemory: creating the world’s largest translation memory, in: Proceedings of Translating and the Computer 31, Aslib, London, UK, 2009. URL: https://aclanthology.org/2009.tc-1.12.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] R. Caruana, Multitask learning: A knowledge-based source of inductive bias, in: International Conference on Machine Learning, 1993. URL: https://api.semanticscholar.org/CorpusID:18522085.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Ruder, An overview of multi-task learning in deep neural networks, 2017. arXiv:1706.05098.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. Duong, T. Cohn, S. Bird, P. Cook, Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser, in: C. Zong, M. Strube (Eds.), Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 845–850. URL: https://aclanthology.org/P15-2139. doi:10.3115/v1/P15-2139.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Yang, T. M. Hospedales, Trace norm regularised deep multi-task learning, 2017. arXiv:1606.04038.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] T. Tian, C. Song, J. Ting, H. Huang, A French-to-English machine translation model using transformer network, Procedia Computer Science 199 (2022) 1438–1443. URL: https://www.sciencedirect.com/science/article/pii/S1877050922001831. doi:10.1016/j.procs.2022.01.182. The 8th International Conference on Information Technology and Quantitative Management (ITQM 2020 and 2021): Developing Global Digital Economy after COVID-19.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. J. Sefara, S. G. Zwane, N. Gama, H. Sibisi, P. N. Senoamadi, V. Marivate, Transformer-based machine translation for low-resourced languages embedded with language identification, in: 2021 Conference on Information Communications Technology and Society (ICTAS), 2021, pp. 127–132. doi:10.1109/ICTAS50802.2021.9394996.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>