<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cicognini at ACTI: Analysis of techniques for conspiracies individuation in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giacomo Cignoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pisa</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This report illustrates methods and results for solving Subtask A (conspiracy detection) and Subtask B (conspiracy topic classification) of the EVALITA 2023 ACTI challenge. We employed different transformer-based models and an original method based on tf-idf. Results show top performance scores over 80% for both subtasks.</p>
      </abstract>
      <kwd-group>
<kwd>Conspiracy Theory</kwd>
        <kwd>Content Moderation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Computational Social Science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>We decided to participate in the EVALITA 2023 challenge "Automatic Conspiracy Theory Identification", ACTI for short [2]. The challenge is about classifying whether an Italian message is conspiratorial or not and, if it is, which type of conspiracy it is about. The challenge is therefore subdivided into 2 subtasks:
• Conspiratorial Content Classification: the model must recognize if a Telegram post is conspiratorial or not.
• Conspiracy Category Classification: the model must discriminate to which conspiracy theory a post belongs, from a list of 4 possible conspiracy topics:
1. Covid-Conspiracy
2. Qanon-Conspiracy
3. Flat Earth-Conspiracy
4. Pro-Russia Conspiracy</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
Conspiratorial content has been rising on the internet
over the past years, such that some have defined this period as a
"Golden Age of Conspiracy" [
        <xref ref-type="bibr" rid="ref6">3</xref>
]. Indeed, mainstream
platforms have tried to moderate the diffusion of such online
communities with the implementation of a content moderation
technique known as deplatforming. However, there has been a
lot of discussion regarding the efficacy of such
interventions [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">4, 5, 6</xref>
]. Indeed, some identified the presence of
spillover of toxic behaviour [7] and the presence of
a radicalization process after the application of content moderation [8].
Therefore, the need for automatic models that can detect the
diffusion of troublesome (or, more specifically, conspiratorial)
content has become crucial.
      </p>
      <p>
        Transformer-based models have revolutionized modern natural
language processing [9, 10, 11, 12]. Indeed, they are the current
state-of-the-art models in most NLP tasks, spanning different
fields from politics [13, 14] and conflict prediction [15] to,
of course, hate speech detection [16, 17, 18]. In particular, the
finetuning of BERT [20] based models for classification tasks such
as sentiment analysis or topic detection has been widely studied,
and its effectiveness proved with multiple benchmarks [21]. The
usage of machine learning techniques for detecting conspiracy
theories has been studied mainly in regard to social media texts
in the English language, although classification of the different
topics of the conspiracies has also been considered [22, 23].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
<p>The 2 provided datasets are collections of labeled Italian
Telegram messages. Both datasets were relatively clean
with regard to the text, so heavy preprocessing was not
needed.</p>
      <sec id="sec-3-1">
        <title>3.1. Subtask A dataset</title>
<p>More specifically, for Subtask A the training dataset is a
.csv file containing:
• id: unique post identifier.
• comment_text: the text of the Telegram message.
• conspiratorial: a binary label that indicates if
the message is conspiratorial or not.</p>
<p>The training dataset is composed of 1842 samples, of
which 925 have a positive conspiratorial label and 917
a negative one. The hidden test set is composed of 460 samples.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask B dataset</title>
<p>For Subtask B, the training dataset is a .csv file
containing:
• id: unique post identifier.
• comment_text: the text of the Telegram message.
• conspiracy: a label going from 0 to 3 indicating
which conspiracy topic the message is about.</p>
        <p>The training dataset is composed of 810 samples,
with the following conspiracy label distribution: 435
Covid-Conspiracy, 242 Qanon-Conspiracy, 76 Flat
Earth-Conspiracy, 57 Pro-Russia Conspiracy. The hidden test
set is composed of 300 samples.</p>
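        <p>For illustration, the two training sets can be loaded and checked as follows (a minimal sketch; the file names are assumptions, as they are not fixed here):</p>
        <preformat>
import pandas as pd

# Assumed file names for the two training sets.
train_a = pd.read_csv("subtask_a_train.csv")  # id, comment_text, conspiratorial
train_b = pd.read_csv("subtask_b_train.csv")  # id, comment_text, conspiracy

# Label distributions described in Sections 3.1 and 3.2.
print(train_a["conspiratorial"].value_counts())  # expected: 925 vs 917
print(train_b["conspiracy"].value_counts())      # expected: 435 / 242 / 76 / 57
        </preformat>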
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Models</title>
<p>Due to the nature of the tasks, we mainly decided to try
different types of transformer-based models for both
subtasks, in order to capture the semantics and the general
subject matter of the message itself. The transformer is combined with
a densely connected neural network that produces the classification
required by each specific task.</p>
      <p>Figure 1: The transformer architecture.</p>
<p>
        More specifically, a Transformer, as described in
"Attention is all you need" [
        <xref ref-type="bibr" rid="ref3">10</xref>
        ], is an encoder-decoder structure composed of multiple modules
stacked N times on top of each other, as in Figure 1, where each
module mainly consists of Multi-Head Attention and Feed Forward
layers. In this architecture, the inputs and the outputs (target
sentences) are embedded into an n-dimensional space (the outputs
need a right shift before usage), because the strings cannot be
used directly.
      </p>
      <p>Here we present the transformer-based models selected for the
tasks. They were chosen after a preliminary exploratory phase,
based on their performance on the validation set.</p>
      <sec id="sec-4-1">
        <title>4.1. BERT-xxl</title>
        <p>We used the bert-base-italian-xxl-cased model [
        <xref ref-type="bibr" rid="ref18">24</xref>
        ], an Italian pretrained BERT (an encoder-only transformer)
variant developed by the MDZ Digital Library team. It was pretrained
using as source data a Wikipedia dump and various texts from the
OPUS corpora collection, with a size of 13 GB and more than 2 billion
tokens. In the XXL variant, the corpus was extended with the Italian
part of the OSCAR corpus, reaching a size of 81 GB and more than
13 billion tokens. This BERT-xxl model has 12 hidden layers, 12
attention heads and a hidden size of 768. We fine-tuned the whole
transformer; classification is executed on the first special output
token [CLS] of the transformer.</p>
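        <p>A minimal fine-tuning sketch with the Transformers library (the hyperparameters here are illustrative placeholders, not the values selected in Section 6):</p>
        <preformat>
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
# The sequence-classification head classifies on the [CLS] token representation.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["testo del messaggio Telegram"], padding=True,
                  truncation=True, return_tensors="pt")
labels = torch.tensor([1])  # 1 = conspiratorial

# One fine-tuning step: the whole transformer is updated, not only the head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
        </preformat>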
      </sec>
      <sec id="sec-4-2">
        <title>4.2. XLM-RoBERTa</title>
        <p>XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref19">25</xref>
        ] is a multilingual version of RoBERTa, a transformer model
pre-trained in a self-supervised fashion, similarly to BERT, but
with a larger corpus and no next sentence prediction. XLM-RoBERTa
was trained on 2.5 TB of filtered CommonCrawl data containing 100
languages. Specifically, we used the xlm-roberta-large variant,
which has 24 hidden layers, 16 attention heads and a hidden size
of 1024. We fine-tuned the whole transformer; classification is
executed on the first special output token [CLS] of the
transformer.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Llama</title>
        <p>LLaMA is an autoregressive language model developed by Meta AI [
        <xref ref-type="bibr" rid="ref20">26</xref>
        ], based on a decoder-only transformer architecture. We used
the 7B variant, the smallest one, which has 7 billion parameters.
It was pretrained on 1 trillion tokens from CCNet [67%], C4 [15%],
GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%] and
Stack Exchange [2%] sources; the Wikipedia and Books sources are
multilingual. Classification is executed on the last output token
of the transformer.</p>
        <p>We do not fine-tune this model, due to its size, but only use
it to generate sentence embeddings; training was executed only on
the classification head.</p>
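        <p>Since Llama stays frozen and is used only as a sentence encoder, the setup is roughly the following (the checkpoint name is an assumption, as the report does not specify one):</p>
        <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

name = "huggyllama/llama-7b"  # assumed checkpoint with the original 7B weights
tokenizer = AutoTokenizer.from_pretrained(name)
backbone = AutoModel.from_pretrained(name, torch_dtype=torch.float16)
backbone.requires_grad_(False)  # the 7B backbone is never fine-tuned

# Only this classification head is trained.
head = torch.nn.Linear(backbone.config.hidden_size, 2)

batch = tokenizer("testo del messaggio Telegram", return_tensors="pt")
with torch.no_grad():
    hidden = backbone(**batch).last_hidden_state  # (1, seq_len, hidden_size)
logits = head(hidden[:, -1].float())  # embedding of the last output token
        </preformat>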
      <sec id="sec-4-1">
        <title>4.4. Topic-specific tf-idf baseline</title>
<p>For Subtask B, considering its nature of topic classification
and observing the presence of specific and unique words
in each topic, we also developed an original heuristic
baseline based on this assumption. In short, it tries
to retrieve the keywords most specific to each topic and
to extract their distribution in input texts. We recall that the
definition of tf-idf for each word $w$ in a set of documents $D$,
with $d \in D$ (in our case each document corresponds to a
Telegram message in the dataset), is:</p>
<p>$$\mathrm{tf\text{-}idf}_{w,d} = \mathrm{tf}_{w,d} \times \mathrm{idf}_{w}, \qquad \mathrm{tf}_{w,d} = \frac{n_{w,d}}{|d|}, \qquad \mathrm{idf}_{w} = \log_{10} \frac{|D|}{|\{d \in D : w \in d\}|},$$
where $n_{w,d}$ is the number of occurrences of word $w$ in document $d$.</p>
<p>This method makes use of a topic-specific tf-idf, which is
basically the normalized average tf-idf of each word with respect
to the documents of each topic, divided by the sum of the normalized
average tf-idf of the same word over the other topics.</p>
        <p>In mathematical terms, defining $T$ as the set of topics,
$\mathrm{avg\_tfidf}_{w,t}$ as the average tf-idf for word $w$ and
topic $t \in T$, and $\mathrm{norm\_avg\_tfidf}_{w,t}$ as
$\mathrm{avg\_tfidf}_{w,t}$ normalized to the $[0, 100]$ range, we have:
$$\mathrm{topic\_specific\_tfidf}_{w,t} = \frac{\mathrm{norm\_avg\_tfidf}_{w,t}}{\sum_{t' \in T \setminus \{t\}} \mathrm{norm\_avg\_tfidf}_{w,t'}}$$</p>
<p>This score is calculated only on the training set; for each
topic $t$ we then extract the top $K$ words by
$\mathrm{topic\_specific\_tfidf}_{w,t}$ and store them ($K$ is a
hyperparameter). Figure 2 shows the top 10 keywords for each topic
with their respective scores.</p>
        <p>Finally, for each input text we extract the distribution of
the previously stored words, thus obtaining an $n_{topics} \times K$
distribution vector. This vector is then fed into a Random Forest (RF)
model for the final classification. This model is trained with 6-fold
Cross-Validation (CV) on the training set.</p>
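        <p>As a minimal sketch of this pipeline under the definitions above (variable and function names are ours, the preprocessing of Section 4.5 is omitted, and the [0, 100] rescaling is one reading of the normalization), assuming train_texts and train_topics hold the Subtask B messages and labels:</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def build_keyword_featurizer(train_texts, train_topics, K=50):
    # tf-idf as defined above: tf = n_{w,d} / |d|, idf = log10(|D| / df_w).
    vec = CountVectorizer()
    counts = vec.fit_transform(train_texts).toarray().astype(float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)
    tfidf = tf * np.log10(len(train_texts) / np.maximum(df, 1.0))

    # Average tf-idf per topic, rescaled to [0, 100] per topic, then
    # divided by the sum of the same quantity over the other topics.
    labels = np.asarray(train_topics)
    avg = np.stack([tfidf[labels == t].mean(axis=0)
                    for t in sorted(set(train_topics))])
    norm = 100.0 * avg / np.maximum(avg.max(axis=1, keepdims=True), 1e-12)
    score = norm / np.maximum(norm.sum(axis=0, keepdims=True) - norm, 1e-12)

    # Top-K keywords per topic; input features are their raw counts,
    # giving an (n_topics * K)-dimensional vector per message.
    words = np.asarray(vec.get_feature_names_out())
    keywords = np.concatenate([words[np.argsort(score[i])[-K:]]
                               for i in range(score.shape[0])])
    kw_vec = CountVectorizer(vocabulary=list(dict.fromkeys(keywords)))
    return lambda texts: kw_vec.transform(texts).toarray()

featurize = build_keyword_featurizer(train_texts, train_topics, K=50)
rf = RandomForestClassifier(n_estimators=128)
rf.fit(featurize(train_texts), train_topics)
        </preformat>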
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Preprocessing</title>
        <p>For the transformer-based models, only light preprocessing was
applied: we substituted line-break characters with spaces and relied
on each transformer's own tokenizer.</p>
        <p>For the Topic-specific tf-idf model, instead, as the focus is
on topic-specific relevant words, we applied stop-word and short-word
(less than 3 characters) removal, number and punctuation elimination,
and stemming.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Implementation</title>
      <p>We used the Python environment for developing the models, relying
mainly on the PyTorch, Scikit-Learn and Transformers [19] libraries.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and results</title>
      <p>We used a hold-out approach for both subtasks, reserving 20% of
the training set as a validation set for hyperparameter tuning (the
split preserves the label ratios). We experimented with retraining on
the hyperparameters found on the validation set, but obtained worse
results, so we kept the model tested on the validation set as the
final model for each configuration.</p>
      <p>For the Topic-specific tf-idf baseline, the validation set was
used for finding the best K. After this, we used a retrain strategy
in order to obtain a more general topic_specific_tfidf for the words
in each topic (the RF classifier was also retrained with the same CV
hyperparameters).</p>
      <p>The performance score of choice is the macro-averaged F1 score,
as it is the one used to evaluate the challenge.</p>
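      <p>A sketch of the split, reusing the dataframe from the snippet in Section 3 (the random seed is an arbitrary placeholder):</p>
      <preformat>
from sklearn.model_selection import train_test_split

# 80/20 hold-out; stratify preserves the label ratios in both splits.
train_df, val_df = train_test_split(
    train_a, test_size=0.20, stratify=train_a["conspiratorial"],
    random_state=0)
      </preformat>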
      <sec id="sec-6-1">
        <title>6.1. Hyperparameters grid search</title>
        <p>Tables 1, 2 and 3 display the explored hyperparameters,
respectively for the transformer-based models in Subtask A, the
transformer-based models in Subtask B, and the Topic-specific tf-idf
baseline model. The final chosen hyperparameters are those which
yield the best score on the validation set and are highlighted in
bold.</p>
        <p>Tables 1 and 2: learning rate and warmup settings per model;
in the order BERT-xxl, RoBERTa-XLM, Llama 7B, the explored learning
rates are [1e-6, 2e-6, 3e-6], [6e-6, 8e-6] and [1e-5, 5e-5, 1e-4, 5e-4].</p>
        <p>Table 3: grids for the Topic-specific tf-idf baseline; K in
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100], Random Forest max_depth in
[5, 15, None], max_features in [log2, None], min_samples_leaf in
[1, 2, 4], n_estimators in [64, 128, 256].</p>
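        <p>The selection itself is a plain grid search on the validation split; in sketch form (train_and_evaluate stands in for one full training run returning validation predictions and is not shown, and the warmup values are placeholders):</p>
        <preformat>
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

def validation_macro_f1(params):
    y_pred = train_and_evaluate(**params)  # hypothetical helper, not shown
    return f1_score(y_val, y_pred, average="macro")  # the challenge metric

grid = ParameterGrid({"lr": [1e-6, 2e-6, 3e-6], "warmup_steps": [0, 100]})
best_params = max(grid, key=validation_macro_f1)
        </preformat>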
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results</title>
        <p>Tables 4 and 5 display the scores on both the internal
validation set (the score used to choose the model with the best
hyperparameters) and the hidden test set, respectively for Subtask A
and Subtask B. Only the macro-averaged F1 score is reported in the
tables.</p>
        <p>Table 4: validation and test macro-F1 per model for Subtask A
(BERT-xxl: 0.8651 and 0.8265). Table 5: validation and test macro-F1
per model for Subtask B (Topic-specific tf-idf: 0.7400 and 0.7520).</p>
        <p>The whole hidden test set is split into public and private test
sets by the competition rules; the final test score is obtained by a
weighted average (proportional to the sizes of the 2 test sets) of the
public and private scores.</p>
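        <p>Concretely, if the public and private test sets have sizes $n_{pub}$ and $n_{priv}$ with scores $s_{pub}$ and $s_{priv}$, the final score is $s = (n_{pub} \cdot s_{pub} + n_{priv} \cdot s_{priv}) / (n_{pub} + n_{priv})$.</p>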
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Discussions</title>
      <p>For both tasks, the best performing models are the BERT-based
ones, both the Italian BERT-xxl and XLM-RoBERTa: their performances
are close in F1 terms and they are the top-2 performers in both
subtasks. These results are probably a consequence of the benefits of
finetuning, or of the encoder-only transformer architecture, versus
the decoder-only and not finetuned Llama.</p>
      <p>Among the relevant findings, we also include that the transformer
dimension does not influence the performance score; for example,
although XLM-RoBERTa employs a larger architecture than BERT, they are
comparable. The same reasoning applies when comparing with Llama 7B,
which has at least an order of magnitude more parameters than the
other transformers. This indicates that the pre-training dataset (we
recall that BERT-xxl is not multilingual and was trained only on
Italian) and the choice of finetuning have the greatest impact on
performance.</p>
      <p>In regard to the Topic-specific tf-idf model, it provides solid
results in exchange for a lower computational cost, thanks to its
strong assumptions on the importance of topic-specific keywords in
Subtask B.</p>
      <p>It is also important to note that the samples correctly
identified by the Topic-specific tf-idf are not a strict subset of
those correctly identified by the BERT model: the predictions on the
test set have a divergence ratio of almost 25%, while there is a
performance difference of less than 7%, meaning that a substantial
set of "hard" (wrongly classified) samples for the transformer model
are instead "easy" (correctly classified) for the Topic-specific
tf-idf, and vice versa. This implies that combining the 2 models in a
meaningful way could result in a more robust model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
          [7] G. Russo, L. Verginer, M. H. Ribeiro, G. Casiraghi, Spillover of antisocial behavior from fringe platforms: The unintended consequences of community banning, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 17, 2023, pp. 742-753.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
[8] G. Russo, M. H. Ribeiro, G. Casiraghi, L. Verginer, Understanding online migration decisions following the banning of radical communities, arXiv preprint arXiv:2212.04765 (2022).
          [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . arXiv:
          <volume>1706</volume>
          .
          <fpage>03762</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[11] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y. Wang, Mitigating gender bias in natural language processing: Literature review, arXiv preprint arXiv:1906.08976 (2019).
          [12] G. Russo, N. Hollenstein, C. C. Musat, C. Zhang, Control, generate, augment: A scalable framework for multi-attribute text generation, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 351-366. URL: https://aclanthology.org/2020.findings-emnlp.33. doi:10.18653/v1/2020.findings-emnlp.33.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[2] G. Russo, N. Stoehr, M. H. Ribeiro, Acti at evalita 2023: Overview of the conspiracy theory identification task, arXiv preprint arXiv:2307.06954 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[3] H. W. Hanley, D. Kumar, Z. Durumeric, A golden age: Conspiracy theories' relationship with misinformation outlets, news media, and the wider internet (preprint) (2023).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[4] E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein, E. Gilbert, You can't stay here: The efficacy of reddit's 2015 ban examined through hate speech, Proc. ACM Hum.-Comput. Interact. 1 (2017). URL: https://doi.org/10.1145/3134666. doi:10.1145/3134666.
          [13] G. Russo, C. Gote, L. Brandenberger, S. Schlosser, F. Schweitzer, Disentangling active and passive cosponsorship in the U.S. congress, ArXiv abs/2205.09674 (2022).
          [14] J. Valvoda, T. Pimentel, N. Stoehr, R. Cotterell, S. Teufel, What about the precedent: An information-theoretic analysis of common law, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2275-2288. URL: https://aclanthology.org/2021.naacl-main.181. doi:10.18653/v1/2021.naacl-main.181.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[5] E. Chandrasekharan, S. Jhaver, A. Bruckman, E. Gilbert, Quarantined! Examining the effects of a community-wide moderation intervention on reddit, ACM Transactions on Computer-Human Interaction (TOCHI) 29 (2022) 1-26.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[6] A. Trujillo, S. Cresci, Make reddit great again: Assessing community effects of moderation interventions on r/the_donald, Proceedings of the ACM on Human-Computer Interaction 6 (2022) 1-28.
          [15] M. Zhong, S. Dhuliawala, N. Stoehr, Extracting victim counts from text, arXiv preprint arXiv:2302.12367 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saini</surname>
          </string-name>
          , G. Kovács,
          <article-title>Hate speech detection using transformer ensembles on the hasoc dataset</article-title>
          , in: Speech and Computer: 22nd International Conference, SPECOM 2020,
          <article-title>St</article-title>
          . Petersburg, Russia, October 7-
          <issue>9</issue>
          ,
          <year>2020</year>
          , Proceedings, Springer,
          <year>2020</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Mutanga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Naicker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Olugbara</surname>
          </string-name>
          ,
          <article-title>Hate speech detection in twitter using transformer methods</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications</source>
          <volume>11</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Stappen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
<article-title>Cross-lingual zero- and few-shot hate speech detection utilising frozen transformer language models and AXEL</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>13850</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Huggingface's transformers: State-of-the-art natural language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>1910</year>
          .03771.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
<article-title>How to fine-tune bert for text classification</article-title>
          ?,
          <year>2020</year>
          . arXiv:
          <year>1905</year>
          .05583.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chung</surname>
          </string-name>
          , G. Friedland,
          <article-title>Detecting conspiracy theories from tweets: Textual and structural approaches</article-title>
          ., in: MediaEval,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Marcellino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Helmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kerrigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Reininger</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. I. Karimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <article-title>Detecting Conspiracy Theories on Social Media: Improving Machine Learning to Detect and Understand Online Conspiracy Theories</article-title>
          , RAND Corporation, Santa Monica, CA,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .7249/RR-A676
          <article-title>-1.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [24]
DBMDZ, bert-base-italian-xxl-cased,
          <year>2020</year>
          . URL: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised crosslingual representation learning at scale</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>1911</year>
          .02116.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rozière</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hambro</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Azhar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
<article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2302</volume>
          .
          <fpage>13971</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>