<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>On the Performance of Different Text Classification Strategies on Conspiracy Classification in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manfred Moosleitner</string-name>
          <email>manfred.moosleitner@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Benjamin Murauer</string-name>
          <email>b.murauer@posteo.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>8</lpage>
      <abstract>
<p>This paper summarizes the contribution of our team UIBK-DBISFAKENEWS to the shared task “FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task” as part of MediaEval 2021, the goal of which is to classify tweets based on their textual content. The task features three sub-tasks: (i) Text-Based Misinformation Detection, (ii) Text-Based Conspiracy Theories Recognition, and (iii) Text-Based Combined Misinformation and Conspiracies Detection. We achieved our best results for all three sub-tasks using the pre-trained language model BERT Base [1], with extremely randomized trees and support vector machines as runners-up. We further show that syntactic features based on dependency grammar are ineffective, resulting in prediction scores close to a random baseline.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The task consists of three sub-tasks in which properties of
short social media posts must be predicted. The sub-tasks’ goals
are similar but distinctly different: In sub-task 1 (Text-Based
Misinformation Detection), each document belongs to one of three
classes ("Promotes/Supports Conspiracy", "Discusses Conspiracy",
"Non-Conspiracy"), making it a multi-class classification problem.
In sub-task 2 (Text-Based Conspiracy Theories Recognition), a list of
conspiracies is provided. For each conspiracy, the goal is to predict
whether that conspiracy is mentioned in a document, and more
than one conspiracy can be mentioned in the same document. This makes
it a multi-label classification problem. Finally, in sub-task 3
(Text-Based Combined Misinformation and Conspiracies Detection), the
above sub-tasks are combined: for each evaluation document,
the model must predict for each of the provided conspiracies the
way that conspiracy is mentioned, using the classes of sub-task 1. The
development data provided for this task consists of about 1,500
tweets collected by Pogorelov et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and a detailed description of
the individual sub-tasks can be found in the task overview paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The code of our solution is publicly available at https://git.uibk.ac.at/c7031305/mediaeval2021-fakenews.
      </p>
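      <p>To make the difference between these sub-task formulations concrete, the following minimal sketch (our own illustration with scikit-learn; the example conspiracy names beyond those quoted above are hypothetical) shows how the targets could be encoded:</p>
      <preformat>
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Sub-task 1 (multi-class): every tweet has exactly one of the three classes.
classes = ["Promotes/Supports Conspiracy", "Discusses Conspiracy", "Non-Conspiracy"]
y_task1 = LabelEncoder().fit(classes).transform(["Non-Conspiracy", "Discusses Conspiracy"])

# Sub-task 2 (multi-label): a tweet may mention several conspiracies at once,
# so the target has one binary indicator column per conspiracy.
mlb = MultiLabelBinarizer()
y_task2 = mlb.fit_transform([{"5G", "Mind Control"}, set(), {"Satanism"}])
      </preformat>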
      <p>For each sub-task, each team is allowed to submit five runs with
the following restrictions: For run 1, only features extracted from
the provided texts were allowed, without additional external data or
pre-trained models (concretely, this also disallows using any
BERT-related model). For run 2, the use of pre-trained models
was additionally allowed, and for runs 3 and 4, the use of
external data for training any model was also permitted.</p>
      <sec id="sec-1-1">
        <title>1https://git.uibk.ac.at/c7031305/mediaeval2021-fakenews</title>
      </sec>
      <sec id="sec-1-2">
        <title>Parameter</title>
        <p>-gram size
-grams max. features
lowercase text
DT-gram word repr.</p>
      </sec>
      <sec id="sec-1-3">
        <title>BERT model num. trees</title>
      </sec>
      <sec id="sec-1-4">
        <title>Tested Values</title>
        <p>1, 2, ..., 10
unlimited, 1000</p>
        <p>true, false
universal POS tag, English POS tag</p>
      </sec>
      <sec id="sec-1-5">
        <title>RoBERTa, DistilBERT, BERT base</title>
        <p>100, 250, 500, 750, 1000, 2000, ..., 5000</p>
        <p>In our approach to this task, we perform a large-scale grid search
experiment testing a variety of diferent feature extraction
methods and classification models and hyper-parameter configurations
thereof. We show that the pre-trained language model BERT
outperforms the other presented approaches for all the sub-tasks and
that the syntax-based features are not able to detect the
conspiracyrelated classes in the documents.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>TEXT FEATURES</title>
      <p>We include multiple types of features in our grid experiments. As
a widely used and simple-to-calculate baseline, we include word
and character n-grams. We thereby test different configurations of
the extraction, including different sizes of n and pre-processing the
texts to lowercase. The full list of parameters is shown in Table 1.
We normalize the frequency of the resulting n-grams using tf-idf.</p>
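      <p>As an illustration of this feature extraction step, the following sketch (our own simplification; the concrete parameter values are examples taken from Table 1, not the full grid) builds tf-idf normalized word and character n-gram features with scikit-learn:</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["example tweet one", "another example tweet"]  # placeholder documents

# Word uni-grams, lowercased, limited to 1000 features (cf. Table 1).
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1),
                           lowercase=True, max_features=1000)
X_word = word_vec.fit_transform(tweets)

# Character 6-grams; tf-idf weighting normalizes the raw n-gram frequencies.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(6, 6), lowercase=False)
X_char = char_vec.fit_transform(tweets)
      </preformat>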
      <p>
        As the second type of text features, we calculate
Dependency Tree-grams (DT-grams) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to determine whether texts within one class
or label share similar grammatical structures. DT-grams are
sub-structures of the dependency graph of sentences and can be
interpreted as a way to enhance n-grams of part-of-speech tags by
redefining which tokens are “close” to one another and therefore
form an n-gram. Thereby, each word is represented either by its
universal or its English-specific part-of-speech tag. This
choice is a hyper-parameter and is tuned by the grid search (cf.
Table 1). Like their character- and word-based counterparts, the
resulting DT-grams are counted and tf-idf normalized to form features.
We include this feature to check whether some
of the conspiracy classes exhibit stylistic markers that are typical
for that category.
      </p>
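      <p>The exact DT-gram construction is defined in [3]; the sketch below is only a rough approximation of the underlying idea (the use of spaCy and the helper name are our own assumptions): it forms part-of-speech bigrams along dependency edges instead of along the linear word order.</p>
      <preformat>
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: an English spaCy model is installed

def pos_bigrams_along_dependencies(text, universal=True):
    """Return head-child POS-tag bigrams along dependency edges.

    universal=True uses universal POS tags (token.pos_), otherwise the
    English-specific tags (token.tag_) -- the two options listed in Table 1.
    """
    doc = nlp(text)
    tag = (lambda t: t.pos_) if universal else (lambda t: t.tag_)
    return [f"{tag(tok.head)}_{tag(tok)}" for tok in doc if tok.head is not tok]

# The resulting "n-grams" (e.g. 'VERB_NOUN') are then counted and
# tf-idf normalized exactly like the character and word n-grams.
features = pos_bigrams_along_dependencies("masks protect other people")
      </preformat>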
      <p>Lastly, we use the sequence of tokens in the unmodified text
as input for fine-tuning different pre-trained language models to
directly perform classification on the evaluation sets. Thereby, no
splitting of the training documents was required, as the documents
are short enough to be processed by any of the language models.</p>
      <fig id="fig1">
        <caption>
          <p>Top four positive and negative coefficients for the classes "Satanism" and "Mind Control" from sub-task 2.</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-3">
      <title>CLASSIFICATION MODELS</title>
      <p>We employ three different machine learning models with the tf-idf
normalized frequencies of character-, word-, and DT-gram-based
features: support vector machines (SVM), multinomial Naive Bayes
(MNB), and extremely randomized trees (ET). The hyper-parameters
for these models are also listed in Table 1.</p>
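      <p>A minimal sketch of this part of the grid search (our own illustration with scikit-learn; the grid below only mirrors a small subset of Table 1, and LinearSVC is used as a stand-in for the SVM):</p>
      <preformat>
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

# Each dict swaps one classifier into the "clf" step; the number of trees
# for ET follows values listed in Table 1.
param_grid = [
    {"clf": [LinearSVC()], "tfidf__ngram_range": [(1, 1)]},
    {"clf": [MultinomialNB()], "tfidf__ngram_range": [(1, 1), (2, 2)]},
    {"clf": [ExtraTreesClassifier()], "clf__n_estimators": [100, 250, 500, 1000],
     "tfidf__analyzer": ["char"], "tfidf__ngram_range": [(6, 6)]},
]
search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(train_texts, train_labels)  # placeholder variable names
      </preformat>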
      <p>
        We use three BERT-like models: BERT-base [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], RoBERTa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
and DistilBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Thereby, we use a maximum sequence length
of 256, three epochs of fine-tuning, and a batch size of 8. All other
parameters were left at the defaults of the implementation
provided by the huggingface library (https://huggingface.co/).
      </p>
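      <p>A hedged sketch of this fine-tuning setup with the huggingface transformers library (our own illustration; the checkpoint name and the omitted dataset preparation are assumptions, and only the sequence length, number of epochs, and batch size are taken from the text above):</p>
      <preformat>
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # assumption: the exact BERT base checkpoint is not specified
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    # maximum sequence length of 256, as stated above
    return tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")

args = TrainingArguments(output_dir="out",
                         num_train_epochs=3,              # three epochs of fine-tuning
                         per_device_train_batch_size=8)   # batch size of 8
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)  # dataset prep omitted
# trainer.train()
      </preformat>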
    </sec>
    <sec id="sec-4">
      <title>RESULTS AND DISCUSSION</title>
      <p>Generally, our results show a strong connection between certain
keywords and their corresponding conspiracy theories. Figure 1
shows the top four positive and negative coefficients for two of the
classes ("Satanism" and "Mind Control") from sub-task 2, which give
an intuitive view of the relationship between the words and the
corresponding classes.</p>
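      <p>Coefficients like those in Figure 1 can be read directly from a fitted linear model; the sketch below (our own illustration, assuming a binary linear classifier and a fitted vectorizer such as the tf-idf examples above, not necessarily the exact model behind the figure) extracts the most positive and most negative words for one label:</p>
      <preformat>
import numpy as np

def top_coefficients(vectorizer, linear_clf, k=4):
    """Return the k most positive and k most negative words of a binary linear classifier."""
    words = np.array(vectorizer.get_feature_names_out())
    coefs = linear_clf.coef_.ravel()   # one weight per tf-idf feature
    order = np.argsort(coefs)
    return words[order[-k:][::-1]], words[order[:k]]

# top_pos, top_neg = top_coefficients(word_vec, clf_for_label)  # hypothetical fitted objects
      </preformat>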
      <sec id="sec-4-1">
        <title>2https://huggingface.co/</title>
      <p>From the grid search experiments, we select the best-performing
configuration for each of the allowed runs; these are shown in
Table 2. Since the BERT-based method was not allowed for the first run,
we submitted the ET-based solution, which performed second best.</p>
      <p>The task organizers released the development dataset in two
stages: the first part consisted of 500 documents, and the second
part included an additional 1,000 documents. With only the first
part available, the ET model outperformed all others; it was only
outperformed by BERT once the full dataset was available. This
indicates that BERT-like models require a minimum amount of
data for fine-tuning to perform effectively. Concretely, the addition
of the second part of the development dataset increased the number
of documents for the smallest class "Discusses Conspiracy" from
76 to 262, giving a rough impression of how many samples are
required for BERT to perform better than the ET model.</p>
      <p>When comparing the results of extremely randomized trees
and SVMs, we can see that ET performed better than
SVM for sub-task 1, and vice versa for
sub-task 2, where SVM performed better than ET. We think one of the
reasons for this is the use of word uni-grams with SVM,
compared to character 6-grams with ET, indicating that whole
words reflect the connection between keywords and labels better
than partial words. A further indication of a connection between
certain keywords and their labels is that the performance of BERT
in sub-task 2 (multi-label) and sub-task 3 (multi-class, multi-label)
is slightly better than in sub-task 1 (multi-class).</p>
      <p>On the other hand, our grammar-based
approach performs poorly in all sub-tasks, indicating that
the grammatical structure of the texts is not a suitable feature to
differentiate between the given classes and labels. We attribute this
behavior to the short texts, which are unlikely to contain
complex grammatical structures, as well as to difficulties in parsing
due to the unstructured nature of the text.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>We would like to thank our research group for their feedback and
Prof. Günther Specht, head of our research group, for providing the
necessary infrastructure to perform the research reported in this
paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Benjamin Murauer and Günther Specht. 2021. DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution. arXiv preprint arXiv:2106.05677 (2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth. 2021. FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. In Proceedings of the 2021 Workshop on Open Challenges in Online Social Networks. 21-25.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>