<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Proceedings of the Eighth Evaluation Campaign of Natural
Language Processing and Speech Tools for Italian (EVALITA 2023)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Vitali at ACTI - Transformer-based Conspiracy Theory Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Vitali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Scotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark James Carman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DEIB, Politecnico di Milano</institution>
          ,
          <addr-line>Via Ponzio 34/5, 20133, Milano (MI)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>0</volume>
      <fpage>7</fpage>
      <lpage>08</lpage>
      <abstract>
        <p>In this work, we participated in the Automatic Conspiracy Theory Identification (ACTI) competition, which involved two sub-tasks: identifying whether an input text is a conspiracy theory and recognising the specific conspiracy theory it discusses. Our approach involved fine-tuning two BERT models, one Italian and one multilingual, and combining them in an ensemble. The results were promising, and we achieved a position among the top participants in the challenge. This work contributes to the advancement of automatic conspiracy theory identification and highlights the effectiveness of fine-tuned BERT models in this domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformer Network</kwd>
        <kwd>BERT</kwd>
        <kwd>Ensemble</kwd>
        <kwd>Conspiracy Theory Identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. Data</title>
      <p>
        In this section, we present the data sets that constitute
the two different sub-tasks of the ACTI task. The ACTI
task comprises two sub-tasks: sub-task A, a binary
classification task to determine if a given text piece is about a
conspiracy theory or not, and sub-task B, a multi-class
classification task to recognise specific conspiracy
theories. Both sub-tasks use separate data sets consisting
of Italian text samples. Table 1 provides an overview of
the main statistics for the text samples in each corpus;
we used the NLTK [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] tokeniser to compute the number of
tokens. Figure 1 illustrates the label distributions.
      </p>
      <p>Sub-task A involves binary classification to identify
whether a text sample relates to a conspiracy theory or not.
Samples contain noise like emoticons and spelling errors;
hence, we assumed they had not been pre-processed.
The label distribution is well balanced between the two
classes, as can be seen in the top part of Figure 1.</p>
      <p>Sub-task B extends sub-task A by introducing a
multi-class classification aspect. The goal is to identify whether
a text pertains to one of the following conspiracy theories:
COVID, QAnon, Flat-Earth, and Russia. As for sub-task
A, samples have not been pre-processed. Differently from
sub-task A, the label distribution is unbalanced: COVID
and QAnon are more frequent than Flat-Earth and Russia.</p>
      <sec id="sec-1-1">
        <title>2.2. Pre-processing</title>
        <p>Text samples in the data sets contain a lot of noise, like
emoticons, slang, or spelling errors; thus, we applied
some cleaning and pre-processing steps. Moreover, the
two data sets do not contain a large number of samples,
and sub-task B presents an imbalanced label distribution,
introducing the risks of overfitting and learning biased
models. To cope with these issues, we considered applying
data augmentation to the data sets. Initial results on the
training set showed that augmentation was relevant to
obtain good results on sub-task A, while we did not need
to apply it to sub-task B, despite the class imbalance.</p>
        <p>To clean the data sets, we employed basic text
transformations and regular expressions to:
• Convert all text to lowercase, to ensure consistency
and reduce the vocabulary size.
• Remove noise such as emoticons, slang, and
special characters, improving the quality of the
text samples.
• Remove specific patterns from the text, including
dates, text between brackets, links, emails, and
multiple spaces.</p>
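        <p>For illustration, the snippet below sketches such a cleaning pipeline in Python. The paper does not list the exact patterns, so the regular expressions (and the retained accented characters) are assumptions.</p>
        <preformat>
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning pipeline: lowercase, then strip noisy patterns."""
    text = text.lower()                                  # consistency, smaller vocabulary
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"\S+@\S+", " ", text)                 # e-mail addresses
    text = re.sub(r"\[[^\]]*\]", " ", text)              # text between brackets
    text = re.sub(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}", " ", text)  # dates
    text = re.sub(r"[^\w\sàèéìòù.,;!?']", " ", text)     # emoticons and special characters
    text = re.sub(r"\s+", " ", text).strip()             # multiple spaces
    return text

print(clean_text("GUARDA QUI :) http://example.com [link] 01/02/2023"))
</preformat>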
        <p>Additionally, we applied data augmentation to sub-task
A, to increase the number of samples. The method of
choice was back-translation, which involves translating a
text sample from the source language to another language
and then translating it back. This process preserves the
original text’s semantics while potentially altering the
syntax, generating synthetic samples.</p>
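        <p>A minimal sketch of back-translation follows. The paper does not name the translation system, so the OPUS-MT checkpoints used here (Helsinki-NLP/opus-mt-it-en and Helsinki-NLP/opus-mt-en-it) are an assumption.</p>
        <preformat>
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

tok_fwd, mt_fwd = load("Helsinki-NLP/opus-mt-it-en")  # Italian to English
tok_bwd, mt_bwd = load("Helsinki-NLP/opus-mt-en-it")  # English back to Italian

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

def back_translate(texts):
    # The round trip preserves the semantics while perturbing the syntax,
    # yielding synthetic training samples.
    return translate(translate(texts, tok_fwd, mt_fwd), tok_bwd, mt_bwd)

print(back_translate(["Il governo nasconde la verità."]))
</preformat>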
        <p>[Figure 1: Label distributions (counts, log scale) of the original and augmented data sets for each class in the two sub-tasks.]</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Model</title>
      <p>
        In this section, we describe the architecture of the
classifiers we built using Transformer neural networks
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the training process we followed to prepare our
models for evaluation.
      </p>
      <sec id="sec-2-1">
        <title>3.1. Architecture</title>
        <p>
          To solve both ACTI sub-tasks, we used the same
Transformer-based classification architecture, changing
only the target classes from one task to the other. We
explored different Transformer Encoder neural networks,
namely BERT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], pre-trained on different data sets.
We also explored individual and combined applications
of the classifiers.
        </p>
        <p>We visualise the single-model and
ensemble-model pipelines in Figure 2.</p>
        <p>Each BERT-based classifier takes as input a sequence
of tokens, extracted from the pre-processed text. The
sequence starts with a classification token [CLS] and is
concluded by an end-of-sequence token [SEP], introduced
during the tokenisation process. To classify the input piece
of text, we retrieve the contextual embedding computed
by the Transformer hidden layers at the position of the
[CLS] input token and feed it to a linear classifier. The
final classification layer outputs the probability
distribution over the possible classes for the input piece
of text. The entire process is reported in Figure 2a.</p>
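        <p>The sketch below illustrates this classification head; the wiring (AutoModel plus a linear layer over the [CLS] embedding) follows the description above, while the model name and class count are placeholders.</p>
        <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

class BertClassifier(torch.nn.Module):
    def __init__(self, model_name="bert-base-multilingual-uncased", num_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = hidden.last_hidden_state[:, 0]  # embedding at the [CLS] position
        return torch.softmax(self.classifier(cls_embedding), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
batch = tokenizer(["la terra è piatta"], return_tensors="pt")
print(BertClassifier()(batch["input_ids"], batch["attention_mask"]))  # class probabilities
</preformat>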
        <p>To improve the classification results and take the best
from the trained models, we also considered creating an
ensemble [12, Chapter 16]. For each task, we aggregated
the predictions of the individual models. To aggregate
the predictions, we froze the fine-tuned classifiers and
learned a separate Logistic Regression classifier on top
of the Transformer models. The Logistic Regression
classifier takes as input the probability distributions
predicted by the individual models and computes a new
output probability combining them. The entire ensemble
pipeline is represented in Figure 2b.</p>
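        <p>A hedged sketch of the aggregation step: the frozen classifiers' probability distributions are concatenated into a feature vector on which a Logistic Regression is trained (the random probabilities below are stand-ins for real model outputs).</p>
        <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p_italian = rng.dirichlet(np.ones(2), size=100)       # P(class) from Italian BERT
p_multilingual = rng.dirichlet(np.ones(2), size=100)  # P(class) from Multilingual BERT
labels = rng.integers(0, 2, size=100)

features = np.hstack([p_italian, p_multilingual])     # one feature per model/class pair
ensemble = LogisticRegression(class_weight="balanced").fit(features, labels)
print(ensemble.predict_proba(features[:3]))           # combined output probabilities
</preformat>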
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Training</title>
        <p>To effectively train our models, we adopted 5-fold
cross-validation to find the best hyperparameters for each
of the considered models and each task. We preferred
this approach to the usual train-validation split to
make the best of the available data. Given the best
hyperparameter combination, we retrained the model
on the entire training data set.</p>
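        <p>The following sketch outlines this model-selection loop; train_and_score is a hypothetical callback that fine-tunes on the training folds and returns the validation score.</p>
        <preformat>
from itertools import product
from sklearn.model_selection import KFold

learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
epoch_choices = [2, 3, 4]

def select_hyperparameters(texts, train_and_score):
    best, best_score = None, -1.0
    for lr, n_epochs in product(learning_rates, epoch_choices):
        folds = KFold(n_splits=5, shuffle=True, random_state=0).split(texts)
        scores = [train_and_score(tr, va, lr, n_epochs) for tr, va in folds]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = (lr, n_epochs), mean_score
    return best  # then retrain on the entire training set with these values
</preformat>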
        <p>We fine-tuned two variants of BERT base (110M
parameters):
• Italian BERT (uncased), pre-trained on Italian text.
• Multilingual BERT (uncased)3, pre-trained on text
in multiple languages.</p>
        <p>3 Model card: https://huggingface.co/bert-base-multilingual-uncased</p>
        <p>[Figure 2: (a) Single-model classifier; (b) ensemble classifier. Example input text: "La Terra è piatta!"]</p>
          <p>
            We used the implementations available in the
Transformers library from Hugging Face [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Each configuration
was trained using the Adam optimiser [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], a linear
learning rate schedule and a batch size of 8.
          </p>
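        <p>A sketch of this optimisation setup is shown below; the zero warm-up is an assumption, as the paper only states that a linear schedule was used.</p>
        <preformat>
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, learning_rate, num_training_steps):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )
    return optimizer, scheduler

# e.g., 4 epochs over 2000 samples with batch size 8:
# optimizer, scheduler = make_optimizer(model, 2e-5, num_training_steps=4 * 2000 // 8)
</preformat>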
        <p>For each of these models, during cross-validation,
we varied the learning rate and the number of epochs.
Additionally, we explored regularisation: we evaluated
models with and without dynamic masking. We varied the
learning rate in {1 × 10<sup>−5</sup>, 2 × 10<sup>−5</sup>, 3 × 10<sup>−5</sup>, 5 × 10<sup>−5</sup>}
and the number of epochs in {2, 3, 4}. Dynamic masking
applies the same kind of masking BERT uses during
pre-training, randomly corrupting the input sequence
by masking tokens.</p>
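        <p>A hedged sketch of dynamic masking as a regulariser follows; the 15% masking rate mirrors BERT pre-training, since the paper does not state the rate used.</p>
        <preformat>
import torch

def dynamic_mask(input_ids, tokenizer, mask_prob=0.15):
    ids = input_ids.clone()
    special = torch.isin(ids, torch.tensor(tokenizer.all_special_ids))
    rand = torch.rand_like(ids, dtype=torch.float)
    corrupt = rand.lt(mask_prob).logical_and(special.logical_not())
    ids[corrupt] = tokenizer.mask_token_id  # randomly corrupt the input sequence
    return ids
</preformat>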
        <p>
          We adopted 5-fold cross-validation with the ensemble
as well. Referring to the implementation of Logistic
Regression available in the Scikit-Learn library [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], we
explored values for the following hyperparameters:
regularisation strength (ℓ2 regularisation), number of
iterations, and solver. We varied the inverse of the
regularisation strength in {10<sup>−3</sup>, 10<sup>−2</sup>, 10<sup>−1</sup>, 1, 10, 10<sup>2</sup>, 10<sup>3</sup>}, the
maximum number of iterations in {20, 50, 100, 200, 500, 1000},
and we tried all solvers apart from the Newton-Cholesky
one. Additionally, we weighted the classes with weights
inversely proportional to the class frequencies, to obtain a
more balanced classifier.
        </p>
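        <p>With scikit-learn, this search can be sketched as follows; the grids follow the values listed above, and the macro-F1 scoring is our assumption for the selection criterion.</p>
        <preformat>
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],   # inverse regularisation strength
    "max_iter": [20, 50, 100, 200, 500, 1000],
    "solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
}
search = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced"),
    param_grid, cv=5, scoring="f1_macro",
)
# search.fit(features, labels)  # features: stacked model probabilities
</preformat>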
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>In this section, we explain how we evaluated the models
proposed for each sub-task, present the results obtained
on each task, and comment on these results. In
both cases, we focus on the results of the ensembles, since
they perform better than the individual models.</p>
      <sec id="sec-3-1">
        <title>4.1. Evaluation</title>
        <p>We evaluated the classification models using the F1
score. For the multi-class setting, we computed the
macro average of the per-class scores to account for
potential class distribution imbalances.</p>
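        <p>As a small worked example of the metric, the macro average weighs each class equally, so rare classes count as much as frequent ones:</p>
        <preformat>
from sklearn.metrics import f1_score

y_true = ["covid", "covid", "qanon", "flat-earth", "russia", "covid"]
y_pred = ["covid", "qanon", "qanon", "flat-earth", "covid", "covid"]
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
</preformat>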
        <p>We report the F1 scores on the test set in
Table 2. In addition to the results of the submitted models,
we included some additional scores for comparison and
to provide further insight into the results. The F1 scores
are computed on 70% of the test data for sub-task A and 50%
of the test data for sub-task B via the Kaggle platform4
(which hosted the competition), as determined by the
authors of the ACTI task for the private leaderboard.</p>
        <sec id="sec-3-1-1">
          <title>3Model card:</title>
          <p>https://huggingface.co/bert-base-multilingual-uncased
4Website: https://www.kaggle.com
ure Italian BERT - P(Conspiracy)
atMultilingual BERT - P(Conspiracy)
e
F-l Multilingual BERT N-onPe(N-orBmiaals)
eod Italian BERT - P(Normal)
M</p>
          <p>(a) Sub-task A.</p>
          <p>Multilingual BERT - P(COVID)
erutea IIttaalliiaann BBEERRTTN-o-nPeP((C-QOAVBnIioDan)s)
F Italian BERT - P(Russia)
-lMultilingual BERT - P(Flat-Earth)
ed Italian BERT - P(Flat-Earth)
oM Multilingual BERT - P(Russia)</p>
          <p>Multilingual BERT - P(QAnon)</p>
          <p>Italian BERT - P(Flat-Earth)
eMultilingual BERT - P(Flat-Earth)
teuar Multilingual BERTN-onPe(Q-AnBoina)s
F-edl IIItttaaallliiiaaannnBBBEEERRRTTT---PPP(((RCQuOAsVnsIoiDna)))
Mo Multilingual BERT - P(COVID)</p>
          <p>Multilingual BERT - P(Russia)</p>
          <p>Multilingual BERT - P(QAnon)
ruteae IIttaalliiaann BBEERRTTN-o-nPeP((Q-CAOnBVoiInaD)s)
FMultilingual BERT - P(Flat-Earth)
le-d MultiIltianlgiuaanlBBEERRTT--PP((RCuOsVsIiDa))
oM Italian BERT - P(Flat-Earth)</p>
          <p>Multilingual BERT - P(Russia)</p>
          <p>Multilingual BERT - P(Russia)
urteae ItIatlailainanBEBRETRTN-o-nPe(PR(-uCsOBsViiIaaDs))
FMultilingual BERT - P(Flat-Earth)
edl- ItaliaIntaBlEiRaTn -BEPR(TFl-atP-(EQaArntohn))
oM Multilingual BERT - P(COVID)</p>
          <p>Multilingual BERT - P(QAnon)
Label: Conspiracy
1</p>
          <p>Concerning feature relevance analysis in the ensemble,
from Figure 3a, we can see that the ensemble gives a
higher weight to the prediction of the Italian BERT model,
rather than the multilingual one. This hints that for this
specific task of detecting whether the text concerns a
conspiracy theory or not, having a language-specific
model may be the better solution. However, the ensemble
improves over both single models, thus the multilingual
model is contributing to the correct classification as well.</p>
        </sec>
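        <p>The reading behind Figure 3 can be sketched as follows: the magnitude of each Logistic Regression coefficient measures how strongly the corresponding model/class probability contributes to the ensemble decision (the toy data below stands in for the real stacked probabilities).</p>
        <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = [
    "Italian BERT - P(Conspiracy)", "Italian BERT - P(Normal)",
    "Multilingual BERT - P(Conspiracy)", "Multilingual BERT - P(Normal)",
]
rng = np.random.default_rng(0)
features = rng.random((200, 4))  # stand-in for stacked predicted probabilities
labels = (features[:, 0] + 0.5 * features[:, 2]).round().astype(int).clip(0, 1)

ensemble = LogisticRegression().fit(features, labels)
for weight, name in sorted(zip(ensemble.coef_[0], feature_names), reverse=True):
    print(f"{weight:+.3f}  {name}")
</preformat>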
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Sub-task B</title>
        <p>For Sub-task B, the ensemble model achieved a test
accuracy score of 89.83% (see Table 2). This result
highlights again the efectiveness of the ensemble
approach in capturing task-specific patterns and making
accurate predictions.</p>
        <p>The best model configuration for sub-task B, involved
ifne-tuning with the following hyperparameters. Italian
BERT: learning rate of 3 × 10− 5 , number of epochs of
2, and dynamic masking enabled. Multilingual BERT:
learning rate of 3 × 10− 5 , number of epochs of 2,
and dynamic masking enabled. Logistic Regression
(ensemble): inverse of the regularisation strength of 10− 3,
maximum number of iterations of 20, Newton-CG solver.</p>
        <p>Comparing it with other configurations, we observe
that the ensemble model outperformed the
configuration without dynamic masking, which achieved an
accuracy score of 87.67%. This indicates that dynamic
masking played a crucial role in improving the model’s
performance. When comparing the ensemble model’s
accuracy score with the provided baseline accuracy score
of 68.37%, we observe a significant performance boost,
underscoring the efectiveness of our approach.</p>
        <p>Concerning feature relevance analysis in the ensemble,
from Figure 3b), we can see that, diferently from sub-task
A, here both models contribute equally to the prediction.
In fact, the values of the weights associated with the same
input probability and the same output class models are
close for the diferent models.</p>
        <p>Additionally, to get better insights on the behaviour
of the two Transformer-based classifiers on the sub-tasks,
we analysed the weights learned by the Logistic
Regression during the training of the ensemble. The higher the
weight, the stronger the contribution of the probability
predicted by a classifier to the prediction and, thus, the
stronger the relevance of that classifier in the ensemble.
To this end, we visualised all the weights of the two
Logistic Regression models in Figure 3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.2. Sub-task A</title>
        <p>For Sub-task A, the ensemble model achieved a test 1
score of 82.30% (see Table 2). This result highlights
the efectiveness of combining the predictions from
individual models to improve overall performance.</p>
        <p>The best model configuration for sub-task A, involved
ifne-tuning with the following hyperparameters. Italian
BERT: learning rate of 2 × 10− 5 , number of epochs of
4, and dynamic masking enabled. Multilingual BERT: 5. Discussion
learning rate of 2 × 10− 5 , number of epochs of 4,
and dynamic masking enabled. Logistic Regression In this report, we described our approach to training
(ensemble): inverse of the regularisation strength of 10− 2, Transformer-based classification models for conspiracy
maximum number of iterations of 20, Newton-CG solver. theory identification. We trained and evaluated our
mod</p>
        <p>Comparing it with other configurations, we observe els on the two sub-tasks of the ACTI data set, an Italian
that the ensemble model outperformed other approaches, benchmark for conspiracy theory identification. The
such as using augmentation alone (77.04%) or not ifrst task involved binary text classification to determine
applying any augmentation nor masking (81.67%). whether a piece of text is about a conspiracy theory or not,
Furthermore, compared to the provided baseline 1 while the second task focused on multi-class classification
score of 51.07%, the proposed ensemble model shows to identify the specific conspiracy theory referenced in
a substantial improvement, highlighting the efectiveness a piece of text.
of our approach.</p>
      <p>Given the limited resources available, including the
data set size and computational power, we were unable to
explore all possible avenues. Additionally, the availability
of pre-trained Italian Language Models is limited.</p>
      <p>Most Italian models are part of multilingual models
rather than dedicated Italian models, and the available
Italian-only models are smaller compared to, for example,
English ones. However, this provided us with the
opportunity to train potentially multilingual models for
conspiracy theory identification, although we did not test
this approach on other languages in our current study.</p>
      <p>The results obtained from our models are promising.
For Sub-task A, our ensemble model achieved a test
F1 score of 82.30%, outperforming both the other
configurations we explored and the provided baseline
F1 score of 51.07%. Regarding Sub-task B, our ensemble
model achieved a test F1 score of 89.83%, surpassing the
other configurations we tested and the provided baseline
F1 score of 68.37%. This highlights the effectiveness
of combining the predictions from individual models to
improve overall performance on this task.</p>
      <p>Moving forward, we aim to explore the application
of end-to-end text generation models for conspiracy
theory identification. Current research suggests that
LLMs can be effectively employed for text classification
tasks by concatenating the text to classify with a question
asking for the class and triggering text generation. We
plan to leverage one of these multilingual LLMs with
a combination of prompting and in-context learning,
enabling zero-shot to few-shot learning.</p>
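      <p>A hedged sketch of this prompting idea: the text to classify is concatenated with a question asking for the class, and the generated answer is mapped back to a label (the model and prompt wording below are illustrative stand-ins for a multilingual LLM).</p>
      <preformat>
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for a multilingual LLM

def classify_by_prompting(text):
    prompt = (
        f"Text: {text}\n"
        "Question: is this text about a conspiracy theory? Answer yes or no.\n"
        "Answer:"
    )
    answer = generator(prompt, max_new_tokens=3)[0]["generated_text"]
    completion = answer[len(prompt):].strip().lower()
    return "conspiracy" if completion.startswith("yes") else "not conspiracy"

print(classify_by_prompting("La Terra è piatta!"))
</preformat>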
      <p>Overall, our study contributes to the understanding
of conspiracy theory identification using
Transformer-based models. The achieved results show the potential of
these models in accurately classifying conspiracy-related
texts, and future investigations can explore additional
approaches to further enhance performance.
The source code developed for the challenge
is available on GitHub at the following link:
https://github.com/MichaelVitali/Evalita2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stoehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , ACTI at EVALITA 2023:
          <article-title>Overview of the conspiracy theory identification task</article-title>
          ,
          <source>arXiv preprint arXiv:2307.06954</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi, EVALITA 2023:
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian</article-title>
          , in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <source>CoRR abs/2303.18223</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/arXiv.2303.18223. arXiv:2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Weidinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mellor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rauh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Griffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uesato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Huang</surname>
          </string-name>
          , M. Cheng, M. Glaese,
          <string-name>
            <given-names>B.</given-names>
            <surname>Balle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kasirzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kenton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brown</surname>
          </string-name>
          , W. Hawkins,
          <string-name>
            <given-names>T.</given-names>
            <surname>Stepleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rimell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Isaac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legassick</surname>
          </string-name>
          , G. Irving, I. Gabriel,
          <article-title>Ethical and social risks of harm from language models</article-title>
          ,
          <source>CoRR abs/2112.04359</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.04359. arXiv:2112.04359.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Horta</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Casiraghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Verginer</surname>
          </string-name>
          ,
          <article-title>Understanding online migration decisions following the banning of radical communities</article-title>
          ,
          <source>in: Proceedings of the 15th ACM Web Science Conference</source>
          <year>2023</year>
          , WebSci '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>251</fpage>
          -
          <lpage>259</lpage>
          . URL: https://doi.org/10.1145/3578503.3583608. doi:10.1145/3578503.3583608.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Verginer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , G. Casiraghi,
          <article-title>Spillover of antisocial behavior from fringe platforms: The unintended consequences of community banning</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>17</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>753</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gote</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brandenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schlosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schweitzer</surname>
          </string-name>
          ,
          <article-title>Helping a friend or supporting a cause? Disentangling active and passive cosponsorship in the U.S. Congress</article-title>
          , in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2952</fpage>
          -
          <lpage>2969</lpage>
          . URL: https://aclanthology.org/2023.acl-long.166.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          ,
          <source>Natural Language Processing with Python</source>
          , O'Reilly,
          <year>2009</year>
          . URL: http://www.oreilly.de/catalog/9780596516499/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9</source>
          ,
          <year>2017</year>
          , Long Beach, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the</source>
          <year>2019</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <source>The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition</source>
          , Springer Series in Statistics, Springer,
          <year>2009</year>
          . URL: https://doi.org/10.1007/978-0-387-84858-7. doi:10.1007/978-0-387-84858-7.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          , in: Q. Liu, D. Schlangen (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20</source>
          ,
          <year>2020</year>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://doi.org/10.18653/v1/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          , in: Y. Bengio, Y. LeCun (Eds.),
          <source>3rd International Conference on Learning Representations, ICLR 2015</source>
          , San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
          <year>2015</year>
          . URL: http://arxiv.org/abs/1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. VanderPlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          . URL: https://dl.acm.org/doi/10.5555/1953048.2078195. doi:10.5555/1953048.2078195.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>