On the Performance of Different Text Classification Strategies on Conspiracy Classification in Social Media

Manfred Moosleitner, Universität Innsbruck, Austria, manfred.moosleitner@uibk.ac.at
Benjamin Murauer, Universität Innsbruck, Austria, b.murauer@posteo.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 6-8 2021, Online.

ABSTRACT
This paper summarizes the contribution of our team UIBK-DBIS-FAKENEWS to the shared task "FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task" at MediaEval 2021, the goal of which is to classify tweets based on their textual content. The task features three sub-tasks: (i) Text-Based Misinformation Detection, (ii) Text-Based Conspiracy Theories Recognition, and (iii) Text-Based Combined Misinformation and Conspiracies Detection. We achieved our best results for all three sub-tasks using the pre-trained language model BERT base [1], with extremely randomized trees and support vector machines as runners-up. We further show that syntactic features based on dependency grammar are ineffective, resulting in prediction scores close to a random baseline.

1 INTRODUCTION
This task consists of three sub-tasks in which properties of short social media posts must be predicted. The sub-tasks' goals are similar but distinctly different. In sub-task 1 (Text-Based Misinformation Detection), each document belongs to one of three classes ("Promotes/Supports Conspiracy", "Discusses Conspiracy", "Non-Conspiracy"), making it a multi-class classification problem. In sub-task 2 (Text-Based Conspiracy Theories Recognition), a list of conspiracies is provided, and the goal is to predict for each conspiracy whether it is mentioned in a document; more than one conspiracy can be mentioned in a single document, which makes this a multi-label classification problem. Finally, in sub-task 3 (Text-Based Combined Misinformation and Conspiracies Detection), the above sub-tasks are combined: for each evaluation document, the model must predict for each of the provided conspiracies the way that conspiracy is mentioned according to sub-task 1. The development data provided for this task consists of about 1,500 tweets collected by Schroeder et al. [5], and a detailed description of the individual sub-tasks can be found in the task overview paper [4]. The code of our solution is available online¹.

For each task, each team is allowed to submit five runs with the following restrictions: for run 1, only features extracted from the provided texts were allowed, without additional external data or pre-trained models (concretely, this also disallows using any BERT-related model). For run 2, the use of pre-trained models was additionally allowed, and for runs 3 and 4, the use of external data for training any model was also permitted.

In our approach to this task, we perform a large-scale grid search experiment testing a variety of feature extraction methods, classification models, and hyper-parameter configurations thereof. We show that the pre-trained language model BERT outperforms the other presented approaches on all sub-tasks, and that the syntax-based features are not able to detect the conspiracy-related classes in the documents.

¹ https://git.uibk.ac.at/c7031305/mediaeval2021-fakenews

2 TEXT FEATURES
We include multiple types of features in our grid experiments. As a widely used and simple-to-calculate baseline, we include word and character n-grams. We thereby test different configurations of the extraction, including different sizes of n and pre-processing the texts to lowercase; the full list of parameters is shown in Table 1. We normalize the frequency of the resulting n-grams using tf-idf.

Table 1: Hyper-parameters used in the grid search.

Parameter            | Tested values
---------------------|-------------------------------------------
n-gram size          | 1, 2, ..., 10
n-gram max. features | unlimited, 1000
lowercase text       | true, false
DT-gram word repr.   | universal POS tag, English POS tag
BERT model           | RoBERTa, DistilBERT, BERT base
num. trees           | 100, 250, 500, 750, 1000, 2000, ..., 5000

As the second type of text features, we calculate Dependency-Tree-grams (DT-grams) [3] to determine whether texts within one class or label share similar grammatical structures. DT-grams are sub-structures of the dependency graph of sentences and can be interpreted as a way to enhance n-grams of part-of-speech tags by redefining which tokens are "close" to one another and therefore form an n-gram. Thereby, each word is represented either by its universal part-of-speech tag or by its English-specific part-of-speech tag; this choice is a hyper-parameter and is tuned by the grid search (cf. Table 1). Like their character- and word-based counterparts, we calculate the frequency of the resulting n-grams as features and apply tf-idf normalization. We include this feature to check whether some of the conspiracy classes exhibit stylistic markers that are typical for that category.

Lastly, we use the sequence of tokens in the unmodified text as input for fine-tuning different pre-trained language models, which directly perform classification on the evaluation sets. Thereby, no splitting of the training documents was required, as the documents are short enough to fit into any of the language models.
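To illustrate the n-gram baseline described above, the following is a minimal sketch using scikit-learn; the choice of vectorizer class and the variable names are our assumptions, and only the parameter values are taken from Table 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One cell of the grid in Table 1: word 1-grams, lowercased, unlimited vocabulary.
word_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1),
                                  lowercase=True, max_features=None)

# Another cell: character 6-grams, original casing, capped at 1000 features.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(6, 6),
                                  lowercase=False, max_features=1000)

docs = ["example tweet text", "another example tweet"]
X_word = word_vectorizer.fit_transform(docs)  # sparse tf-idf-normalized matrix
X_char = char_vectorizer.fit_transform(docs)
```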
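The full DT-gram construction is described in [3]; the sketch below only illustrates the underlying idea and assumes spaCy as the dependency parser: tokens are replaced by their universal (pos_) or English-specific (tag_) part-of-speech tags, and "adjacency" is defined by dependency edges rather than by word order.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def dependency_pos_bigrams(text, universal=True):
    """Simplified stand-in for DT-grams [3]: POS-tag bigrams formed
    along dependency edges instead of the linear word order."""
    doc = nlp(text)
    # universal=True uses coarse universal tags (token.pos_),
    # universal=False the English-specific tag set (token.tag_).
    tag = (lambda t: t.pos_) if universal else (lambda t: t.tag_)
    # Pair each token's tag with its syntactic head's tag; the root
    # token is its own head and is skipped.
    return [f"{tag(tok.head)}->{tag(tok)}" for tok in doc if tok.head is not tok]

print(dependency_pos_bigrams("Vaccines contain hidden microchips."))
# e.g. ['VERB->NOUN', 'NOUN->ADJ', 'VERB->NOUN', 'VERB->PUNCT']
```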
3 CLASSIFICATION MODELS
We employ three different machine learning models with the tf-idf-normalized frequencies of the character-, word-, and DT-gram-based features: support vector machines (SVM), multinomial naive Bayes (MNB), and extremely randomized trees (ET). The hyper-parameters for these models are also listed in Table 1.

We use three BERT-like models: BERT base [1], RoBERTa [2], and DistilBERT [6]. Thereby, we use a maximum sequence length of 256, three epochs of fine-tuning, and a batch size of 8. All other parameters were left at the defaults of their implementations in the huggingface² library.

² https://huggingface.co/
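As a rough illustration of how this grid search can be wired up for the classical models, a sketch with scikit-learn, shown for the ET classifier; the pipeline layout, the cross-validation setup, and our expansion of the "2000, ..., 5000" ellipsis from Table 1 are assumptions.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ExtraTreesClassifier()),
])

param_grid = {
    "tfidf__analyzer": ["word", "char"],
    "tfidf__ngram_range": [(n, n) for n in range(1, 11)],  # n-gram sizes 1..10
    "tfidf__lowercase": [True, False],
    "tfidf__max_features": [None, 1000],  # unlimited vs. 1000
    # 100, 250, 500, 750, 1000, 2000, ..., 5000; we read the ellipsis
    # as steps of 1000.
    "clf__n_estimators": [100, 250, 500, 750] + list(range(1000, 5001, 1000)),
}

search = GridSearchCV(pipeline, param_grid, scoring="matthews_corrcoef", cv=5)
# search.fit(train_texts, train_labels)  # placeholders for the development data
```

The SVM and MNB variants are obtained by swapping the classifier step of the pipeline (e.g., for sklearn.svm.LinearSVC or sklearn.naive_bayes.MultinomialNB).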
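For the BERT-like models, a sketch of the fine-tuning setup with the huggingface transformers library; the checkpoint name and the dataset handling are assumptions on our part, while sequence length, number of epochs, and batch size are the values stated above.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# "bert-base-uncased" stands in for BERT base; the paper does not state
# which exact checkpoint was used.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # three classes in sub-task 1

def tokenize(batch):
    # Maximum sequence length of 256, as stated in Section 3.
    return tokenizer(batch["text"], truncation=True,
                     max_length=256, padding="max_length")

args = TrainingArguments(
    output_dir="bert-task1",
    num_train_epochs=3,             # three epochs of fine-tuning
    per_device_train_batch_size=8,  # batch size of 8
)

# `train_dataset` is a placeholder for the tokenized development data:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```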
4 RESULTS AND DISCUSSION
Generally, our results show a strong connection between certain keywords and their corresponding conspiracy theories. Figure 1 shows the top four positive and negative coefficients for two of the classes ("Suppressed Cures" and "Satanism") from sub-task 2, which are an intuitive way to show the relationship of the words with the corresponding classes.

Figure 1: Top 4 positive and negative coefficients of the classes (a) "Suppressed Cures" and (b) "Satanism".

From the grid search experiments, we select the best-performing configuration for each of the allowed runs; these are shown in Table 2. Since for the first run we were not allowed to submit the BERT-based method, we submitted the ET-based solution, which was second in line.

Table 2: Official evaluation results measured with the Matthews correlation coefficient.

         | Run 1                | Run 2               | Run 3                | Run 4
---------|----------------------|---------------------|----------------------|-----------------
Features | char 6-grams, tf-idf | plain text sequence | word 1-grams, tf-idf | DT-grams, tf-idf
Model    | ET                   | BERT                | SVM                  | MNB
Task 1   | 0.2852               | 0.3184              | 0.2228               | 0.1201
Task 2   | 0.2086               | 0.3624              | 0.2879               | 0.0000
Task 3   | 0.1993               | 0.3347              | 0.2316               | -0.0028

The task organizers released the development dataset in two stages: the first part consisted of 500 documents, and the second part added another 1,000 documents. With only the first part available, the ET model outperformed all others; it was only outperformed by BERT once the full dataset was available. This indicates that BERT-like models require a minimum amount of text data for fine-tuning to perform well. Concretely, the addition of the second part of the development dataset increased the number of documents in the smallest class, "Discusses Conspiracy", from 76 to 262, giving a rough impression of how many samples are required for BERT to perform better than the ET model.

When comparing the results of the extremely randomized trees and the SVM, we can see that ET performed better than the SVM on sub-task 1, and vice versa on sub-task 2, where the SVM performed better than ET. We think one reason for this is the use of word uni-grams with the SVM compared to character 6-grams with ET, indicating that whole words reflect the connection between keywords and labels better than partial words. Also indicating a connection between certain keywords and their labels is that the performance of BERT on sub-task 2 (multi-label) and sub-task 3 (multi-class, multi-label) is slightly better than on sub-task 1 (multi-class).

The results produced by our grammar-based approach, on the other hand, show poor performance on all sub-tasks, indicating that the grammatical structure of the texts is not a suitable feature to differentiate between the given classes and labels. We attribute this behavior to the short texts, which are unlikely to contain complex grammatical structures, as well as to difficulties in parsing caused by the unstructured nature of the text.
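Coefficient plots like Figure 1 can be produced by reading the weights of a fitted linear model directly; the following is a minimal sketch, assuming a fitted TfidfVectorizer and a linear SVM (e.g., sklearn.svm.LinearSVC) trained on a single binary label.

```python
import numpy as np

def top_coefficients(vectorizer, linear_model, k=4):
    """Return the k most positive and k most negative feature names
    of a fitted linear classifier for one binary label."""
    names = np.asarray(vectorizer.get_feature_names_out())
    coefs = linear_model.coef_.ravel()          # one weight per feature
    order = np.argsort(coefs)                   # ascending by coefficient
    return names[order[-k:]][::-1], names[order[:k]]

# positive, negative = top_coefficients(fitted_vectorizer, fitted_svm)
```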
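The evaluation metric itself is available in scikit-learn; a minimal example with made-up labels:

```python
from sklearn.metrics import matthews_corrcoef

# Matthews correlation coefficient: 1.0 is a perfect prediction, 0.0 the
# level of random guessing, and negative values are systematically wrong.
score = matthews_corrcoef([0, 1, 2, 1], [0, 2, 2, 1])
```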
ACKNOWLEDGMENTS
We would like to thank our research group for their feedback and Prof. Günther Specht, head of our research group, for providing the necessary infrastructure to perform the research reported in this paper.

REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
[3] Benjamin Murauer and Günther Specht. 2021. DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution. arXiv preprint arXiv:2106.05677 (2021).
[4] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth. 2021. FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[5] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. In Proceedings of the 2021 Workshop on Open Challenges in Online Social Networks. 21-25.
[6] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108 (2019).