NLytics at CheckThat! 2022: Hierarchical multi-class fake news detection of news articles exploiting the topic structure

Albert Pritzkau¹, Olivier Blanc², Michaela Geierhos² and Ulrich Schade¹
¹ Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Fraunhoferstraße 20, 53343 Wachtberg, Germany
² Research Institute Cyber Defence (CODE), University of the Bundeswehr Munich, Carl-Wery-Straße 18, 81739 München, Germany

Abstract
The following system description presents our approach to the detection of fake news in texts. The given task has been framed as a multi-class classification problem, in which each input document is assigned one of several class labels. To dissect content patterns in the training data, we made use of topic modeling. Topic modeling techniques such as Latent Dirichlet Allocation (LDA) are unsupervised algorithms that pick up on patterns and provide an estimate of what the messages convey. To assign class labels to the given documents, we opted for RoBERTa (A Robustly Optimized BERT Pretraining Approach) and Longformer as neural network architectures for sequence classification. Starting from a pre-trained model for language representation, we fine-tuned this model on the given classification task with the provided annotated data in supervised training steps. In a hierarchical approach, a separate classifier was trained for each topic.

Keywords: Sequence Classification, Deep Learning, Transformers, RoBERTa, Longformer, Topic modeling

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
albert.pritzkau@fkie.fraunhofer.de (A. Pritzkau); olivier.blanc@unibw.de (O. Blanc); michaela.geierhos@unibw.de (M. Geierhos); ulrich.schade@fkie.fraunhofer.de (U. Schade)
https://www.fkie.fraunhofer.de (A. Pritzkau, U. Schade); https://www.unibw.de/code (O. Blanc, M. Geierhos)
ORCID: 0000-0001-7985-0822 (A. Pritzkau); 0000-0002-2546-857X (U. Schade)

1. Introduction

The proliferation of disinformation online has given rise to a lot of research on automatic fake news detection. The CLEF CheckThat! Lab [1, 2] considers disinformation a communication phenomenon. By detecting the use of various linguistic features in communication, the given task takes into account not only the content but also how a subject matter is communicated. Shared Task 3 of the CLEF 2022 CheckThat! Lab [3] defines the following subtasks:

Subtask 3A: Given the textual content of an article, specify a credibility level for the content, ranging over "true", "false", "partially false", and "other".

Subtask 3B: A transfer learning task to build a classification model for the German language, analogous to the previous multi-class task.

This paper covers our approach to the multi-class classification task of detecting fake news. To build our models, only textual content is given as input. Below, we describe the system built for Subtask 3A. At the core of our system, pre-trained models based on the Transformer architecture [4], namely RoBERTa [5] and Longformer [6], were used.
2. Related Work

The goal of the shared task is to investigate automatic techniques for identifying various rhetorical and psychological features of disinformation campaigns. A comprehensive survey on fake news and on automatic fake news detection has been presented by Zhou and Zafarani [7]. Based on the structure of data reflecting different aspects of communication, they identified four different perspectives on fake news: (1) the false knowledge it carries, (2) its writing style, (3) its propagation patterns, and (4) the credibility of its creators and spreaders. CLEF 2022 CheckThat! Lab Task 3 emphasizes communicative styles that systematically co-occur with persuasive intentions of (political) media actors. Similar to de Vreese et al. [8], propaganda and persuasion are considered an expression of political communication content and style. Hence, beyond the actual subject of communication, the way it is communicated is gaining importance [9].

We build our work on top of this foundation by first investigating content-based approaches to information discovery. Traditional information discovery methods are based on content: documents, terms, and the relationships between them [10]. These methods can be considered general Information Extraction (IE) methods, automatically deriving structured information from unstructured and/or semi-structured machine-readable documents. Communities of researchers have contributed various techniques from machine learning, information retrieval, and computational linguistics to the different aspects of the information extraction problem. From a computer science perspective, existing approaches can be roughly divided into the following categories: rule-based, supervised, and semi-supervised. In our case, we followed the supervised approach by reframing the complex language understanding task as a simple classification problem. Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. Using Natural Language Processing (NLP), text classifiers can automatically analyze human language texts and then assign a set of predefined tags or categories. Historically, the evolution of text classifiers can be divided into three stages: (1) simple lexicon- or keyword-based classifiers, (2) classifiers using distributed semantics, and (3) deep learning classifiers with advanced linguistic features.

2.1. Deep Learning and Pre-trained Deep Language Representation

Recent work on text classification uses neural networks, particularly Deep Learning (DL). Badjatiya et al. [11] demonstrated that these architectures, including variants of Recurrent Neural Networks (RNN) [12, 13, 14], Convolutional Neural Networks (CNN) [15], or their combination (CharCNN, WordCNN, and HybridCNN), produce state-of-the-art results and outperform baseline methods (character n-grams, TF-IDF, or bag-of-words representations). Until recently, the dominant paradigm in approaching NLP tasks focused on the design of neural architectures, using only task-specific data and word embeddings such as those mentioned above. This led to the development of models such as Long Short-Term Memory (LSTM) networks or Convolutional Neural Networks, which achieve significantly better results on a range of NLP tasks than less complex classifiers such as Support Vector Machines, Logistic Regression, or Decision Tree models.
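For reference, a baseline of the simpler kind mentioned above – TF-IDF features fed into a linear classifier – could be sketched as follows. This is an illustrative sketch only: the file name and the column names ("title", "text", "our_rating") are assumptions, not part of the shared-task tooling.

```python
# Hedged sketch of a classical TF-IDF + logistic regression baseline
# for the four-way credibility classification task.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("task3a_train.csv")                      # hypothetical file name
texts = df["title"].fillna("") + " " + df["text"].fillna("")
labels = df["our_rating"]                                  # true / false / partially false / other

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.18, stratify=labels, random_state=42)

baseline = make_pipeline(
    TfidfVectorizer(max_features=50000, ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(X_train, y_train)
print(classification_report(y_val, baseline.predict(X_val), digits=4))
```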
Badjatiya et al. [11] demonstrated that these approaches outperform models based on character and word n-gram representations. In the same paradigm of pre-trained models, methods like BERT [16] and XLNet [17] have been shown to achieve state-of-the-art performance on a variety of tasks. Indeed, the use of a pre-trained word embedding layer to convert text into vectorized input for a neural network marked a significant step forward in text classification. The potential of pre-trained language models, e.g. Word2Vec [18], GloVe [19], fastText [20], or ELMo [21], to capture local feature patterns that benefit text classification has been described by Castelle [22]. Modern pre-trained language models use unsupervised learning techniques, such as training embeddings on large text corpora, to gain some primal "knowledge" of language structure before more specific supervised training steps in. Transformer-based models are unable to process long sequences due to their self-attention mechanism, which scales quadratically with the sequence length. BERT-based models enforce a hard limit of 512 tokens, which is usually enough to process the majority of sequences in most benchmark datasets.

2.2. BERT, RoBERTa and Longformer

BERT stands for Bidirectional Encoder Representations from Transformers. It is based on the Transformer model architecture introduced by Vaswani et al. [4]. The general approach consists of two stages: first, BERT is pre-trained on vast amounts of text with the unsupervised objectives of masked language modeling and next-sentence prediction; second, this pre-trained network is fine-tuned on task-specific, labeled data. The Transformer architecture is composed of two parts, an Encoder and a Decoder. The Encoder used in BERT is an attention-based architecture for NLP. It works by performing a small, constant number of steps; in each step, it applies an attention mechanism to capture relationships between all words in a sentence, regardless of their respective positions. By pre-training language representations, the Encoder yields models that can either be used to extract high-quality language features from text data or be fine-tuned on specific NLP tasks (classification, entity recognition, question answering, etc.).

We rely on RoBERTa [5], a pre-trained Encoder model that builds on BERT's language masking strategy. However, it modifies key hyperparameters of BERT, for example by removing BERT's next-sentence pre-training objective and by training with much larger mini-batches and learning rates. Furthermore, in comparison to BERT, the training dataset for RoBERTa was an order of magnitude larger (160 GB of text), with a maximum sequence length of 512 used for all iterations. This allows RoBERTa representations to generalize even better to downstream tasks. To address the limitation of traditional Transformer-based models to 512 tokens, Longformer [6] uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. To this end, the standard self-attention is replaced by an attention mechanism that combines a local windowed attention with a task-motivated global attention, allowing up to 4,096 position embeddings. Longformer is pre-trained from RoBERTa [5].
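As an illustration of how such a pre-trained encoder is fine-tuned for this task, a minimal sketch using the Hugging Face Transformers library is given below. The learning rate and number of epochs mirror the values reported in Section 5.1; the placeholder data and the batch size are assumptions made only to keep the sketch self-contained.

```python
# Hedged sketch: fine-tuning a pre-trained Longformer encoder for
# four-way sequence classification (labels encoded as in Table 3).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/longformer-base-4096"   # RoBERTa-based, up to 4096 positions
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

# Placeholder data; in practice, the concatenated title and body of each article.
train_texts, train_labels = ["example article body ..."], [1]
val_texts, val_labels = ["another example article ..."], [0]

class NewsDataset(torch.utils.data.Dataset):
    """Tokenized articles plus integer labels (0-3)."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, max_length=4096, padding="max_length")
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="checkpoints", learning_rate=2e-5,
                         num_train_epochs=10, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args,
                  train_dataset=NewsDataset(train_texts, train_labels),
                  eval_dataset=NewsDataset(val_texts, val_labels))
trainer.train()
```

The same sketch applies to roberta-base by swapping the model name and reducing max_length to 512; the Longformer classification head handles global attention on the leading token internally.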
3. Dataset

The training data for this task was developed during the CLEF-2021 CheckThat! campaign [23, 24, 25] and provided by Shahi et al. [26]. The AMUSED framework presented by Shahi [27] was used for data collection. The test data was gathered during the CLEF 2022 CheckThat! Lab [2]. The adopted task was framed as a multi-class classification problem. Class labels were provided as credibility levels {false, partially false, true, other}, as proposed by Shahi et al. [28]. The provided training set consists of 1,264 documents. As suggested by the organizers, a much larger training set was collected by combining datasets from comparable tasks, such as the Fake News Detection Challenge KDD 2020 [29] as well as the Fake News Classification Datasets [30]. The resulting large training corpus, also mentioned in [31], consists of 51,148 documents.

Table 1
Composition of the corpora used for training.

original training data                      1,264
Fake News Detection Challenge KDD 2020      4,986
Fake News Classification Datasets           44,898
large training corpus (total)               51,148

The content of each document is distributed between the title and the body of the message. Both fields were concatenated to serve as the input for training.

4. Exploratory data analysis

Our approach is based on a comprehensive exploratory analysis of the training data.

Cleaning. The initial training dataset consisted of 1,264 documents. The exploratory analysis started with the investigation of inconsistencies in the dataset. Unexpectedly, ambiguities in the annotation of the documents could be detected: for example, identical documents were found with contradictory annotations ("true" vs. "false"). In this case, we decided to remove all affected documents from the training data, regardless of the provided annotation, since removing just one of the duplicates would have led to an inadvertent weighting of the remaining class. After the elimination of these ambiguities, the remaining exact duplicates could easily be removed. The final cleansed dataset contained 1,096 documents. Applying the same procedure to the large training corpus left 44,910 documents. The remainder of this study focuses on these adjusted versions of the datasets. Generally, duplicate data does not add any value, since looking at the same data multiple times does not make the algorithm any better. However, if the distribution of duplicates is skewed towards one class, a bias is to be expected in the resulting classification, throwing off the generalization performance, as the model is given information that overrepresents that class.

Figure 1: Document length (token-based) distribution in the training sets; (a) original training data, (b) large training corpus.

Table 2
Statistical summaries of token (word) counts on all utilized datasets.

            original training data (cleansed)   large training corpus (cleansed)   test data
doc count   1096                                44910                              612
mean        887.82                              521.97                             1184.60
std         926.16                              638.90                             2005.33
min         10                                  2                                  60
25%         360.75                              261.00                             432.50
50%         639.00                              442.00                             723.00
75%         1065.25                             623.00                             1179.25
max         8751                                20304                              22168

Token count. The statistical summary of token counts in Table 2, as well as Figure 1, suggests that most of the sequences in the training set exceed the 512-token limit of traditional Transformer-based models described previously; anything beyond this limit is truncated. For this reason, after an initial training run with RoBERTa at its core, we switched to Longformer [6] as the basic architecture to gradually improve the overall score.
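The cleaning and token-count analysis described above can be sketched with pandas as follows; the file and column names ("title", "text", "our_rating") are assumptions for illustration, not the actual implementation.

```python
# Hedged sketch of the cleaning step: drop documents that occur with
# contradictory labels, drop remaining exact duplicates, and summarize
# token (word) counts as in Table 2.
import pandas as pd

df = pd.read_csv("task3a_train.csv")                         # hypothetical file name
df["content"] = df["title"].fillna("") + " " + df["text"].fillna("")

# 1) Remove every copy of a document that carries more than one label.
labels_per_doc = df.groupby("content")["our_rating"].nunique()
ambiguous = labels_per_doc[labels_per_doc > 1].index
df = df[~df["content"].isin(ambiguous)]

# 2) Remove remaining exact duplicates, keeping one copy each.
df = df.drop_duplicates(subset="content")

# 3) Whitespace-token counts and the share of documents beyond 512 tokens.
num_tokens = df["content"].str.split().str.len()
print(num_tokens.describe())
print("documents longer than 512 tokens:", (num_tokens > 512).mean())
```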
Figure 2: Label distribution in the training sets; (a) original training data, (b) large training corpus.

Figure 3: Label distribution in the gold standard.

Unbalanced class distribution. Imbalance in the data can exert a major impact on the value and meaning of accuracy and of certain other well-known performance metrics of an analytical model. Figure 2 depicts a clear skew towards false information. Furthermore, the "true" class is significantly underrepresented compared to the "partially false" class.

Topic structure. To dissect content patterns in the training data, we made use of topic modeling. As unsupervised algorithms, topic modeling techniques such as Latent Dirichlet Allocation (LDA) [32] pick up on patterns and provide an overview of the information that the data contains. To help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference, topic coherence measures [33] are utilized, which quantify the degree of semantic similarity between high-scoring words within a topic. In particular, a series of sensitivity tests was performed (see Figure 4) to help determine the optimal number of topics as an essential model hyperparameter. Throughout the sensitivity tests, C_V was applied as the coherence measure. C_V creates content vectors of words using their co-occurrences; it is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity. To further improve the interpretability of the resulting topics, other coherence measures such as the UMass coherence score may also be explored. Based on these tests, 15 was chosen as the optimal number of topics, since the coherence score does not change significantly even for a higher number of topics. A minimal sketch of this coherence-driven model selection is given at the end of this section.

Figure 4: Coherence scores used to choose the optimal number of topics (coherence score vs. number of topics, 10–60).

The resulting topic distributions as well as their high-scoring words are depicted in Figure 5. In fact, the distribution of labels differs significantly depending on the topic, as shown in Figure 6.

Figure 5: Topics in the training data (top words per topic).

Figure 6: Topic label distribution (document counts per label for each dominant topic).
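The coherence-based choice of the number of topics can be sketched with gensim roughly as follows. The tiny placeholder corpus only keeps the sketch self-contained; in practice, docs would be the tokenized, stop-word-filtered training documents, and neither the preprocessing nor the parameter values are taken from the actual implementation.

```python
# Hedged sketch: train LDA models for a range of topic counts and score
# them with the C_V coherence measure.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

docs = [                                           # placeholder corpus
    ["climate", "change", "temperature", "global", "warming", "sea"],
    ["virus", "vaccine", "mask", "study", "case", "disease"],
    ["election", "vote", "ballot", "campaign", "fraud", "poll"],
    ["police", "officer", "crime", "court", "charge", "arrest"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

coherence_by_k = {}
for k in range(10, 61, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=42, passes=10)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    coherence_by_k[k] = cm.get_coherence()

print(coherence_by_k)          # in the paper, 15 topics were selected from such a sweep
```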
5. Our approach

Our approach is based on the assumption that the differentiation of various viewpoints usually takes place in a topic-related manner. A topic results from a specific distribution over the words used; via this distribution, different topics can be distinguished from each other. With our approach, we propose a hierarchical method in which automatic text classification takes place at the topic level.

5.1. Experimental setup

Model Architecture. Subtask 3A is given as a multi-class classification problem. The models for the experimental setup were based on RoBERTa and Longformer. For the classification task, fine-tuning is initially performed using RobertaForSequenceClassification [34] – roberta-base – as the pre-trained model, optimizing a cross-entropy classification loss with the AdamW optimizer and an initial learning rate of 2e-5. Fine-tuning is done on an NVIDIA Tesla V100 GPU using the PyTorch [35] framework with a vocabulary size of 50,265 and an input size of 512. The model is trained to optimize this objective for 10 epochs. To estimate the performance of the resulting models, we chose a ratio of 82/18 to split the data into training and validation sets. We utilized both accuracy and the macro-averaged F1 score to assess the quality of the resulting models. As expected, the RoBERTa model architecture reaches its limit due to the token counts shown in Table 2. Therefore, the overall score was significantly improved by replacing the basic architecture with a Longformer configuration, which, eventually, was also the architecture utilized for the official submission.

The hierarchical arrangement of text classification is the essential part of our contribution. In this configuration, training and prediction are preceded by topic modeling to first dissect content patterns in the data being mediated. Topics are modelled as distributions over content words derived from documents. To this end, LDA is applied: based on the vocabulary of a document, topics can be assigned to it with a certain probability, and the assignment of a particular document to a topic is determined by the highest association probability. The set of documents assigned to a particular topic forms the training set for a topic-specific text classifier. Using the model architecture described above, a specific classifier was trained for each derived topic. Of course, the described hierarchy must also be followed for model prediction: based on the previously trained topic model, documents from the test data are first assigned to a topic, and the prediction is then conducted by the dedicated classification model. Both topic modeling and text classification are implemented in the form of a comprehensive pipeline.
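A minimal sketch of the hierarchical prediction step is given below. It assumes a fitted gensim LDA model (lda, dictionary, as in the earlier sketch) and a dict topic_classifiers mapping each topic id to a (tokenizer, model) pair fine-tuned on that topic's training documents; these names and the storage layout are illustrative, not the exact implementation.

```python
# Hedged sketch: route each document to its dominant LDA topic, then
# classify it with the model fine-tuned for that topic.
import torch

LABELS = ["true", "false", "partially false", "other"]      # encoding of Table 3

def dominant_topic(text):
    """Topic id with the highest association probability for this document."""
    bow = dictionary.doc2bow(text.lower().split())
    topic_probs = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(topic_probs, key=lambda tp: tp[1])[0]

def predict(text):
    topic_id = dominant_topic(text)
    tokenizer, model = topic_classifiers[topic_id]           # topic-specific classifier
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```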
Input Embeddings. The input embedding layer converts the inputs into sequences of features: word-level sentence embeddings. These embedding features are further processed by the subsequent encoding layers.

Word-Level Sentence Embeddings. A sentence is split into words $w_1, \ldots, w_n$ of length $n$ by the WordPiece tokenizer [36]. The word $w_i$ and its index $i$ ($w_i$'s absolute position in the sentence) are projected to vectors by embedding sub-layers and then combined into index-aware word embeddings:

$\hat{w}_i = \mathrm{WordEmbed}(w_i)$
$\hat{u}_i = \mathrm{IdxEmbed}(i)$
$h_i = \mathrm{LayerNorm}(\hat{w}_i + \hat{u}_i)$

Target Encoding. We encode the target labels using label encoding, although we assume the target variable to be categorical and non-ordinal. Since we do not assume a natural order, the substitution of each category by a natural number is done arbitrarily (cf. Table 3). This might pose a challenge and could be replaced by a multi-label binarizer as an analogue of the one-hot (or one-of-K) scheme for multiple labels. It might also be useful to investigate the impact of an alternative order of the target encodings on the result.

Table 3
Label encoding map

label             encoding
true              0
false             1
partially false   2
other             3
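As an illustration of the embedding equations above, a minimal PyTorch sketch is shown below; the vocabulary size, maximum length, and hidden size are placeholders rather than the actual model configuration.

```python
# Hedged sketch of the index-aware word embeddings:
#   h_i = LayerNorm(WordEmbed(w_i) + IdxEmbed(i))
import torch
import torch.nn as nn

class IndexAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=50265, max_len=512, hidden=768):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden)     # WordEmbed
        self.idx_embed = nn.Embedding(max_len, hidden)          # IdxEmbed
        self.layer_norm = nn.LayerNorm(hidden)

    def forward(self, token_ids):                               # shape: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        w_hat = self.word_embed(token_ids)                      # \hat{w}_i
        u_hat = self.idx_embed(positions)                       # \hat{u}_i, broadcast over batch
        return self.layer_norm(w_hat + u_hat)                   # h_i

h = IndexAwareEmbedding()(torch.randint(0, 50265, (1, 16)))
print(h.shape)                                                  # torch.Size([1, 16, 768])
```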
5.2. Results and Discussion

We participated in Subtask 3A. Official evaluation results of the final submission on the test set are presented in Table 7. The entire classification report for this submission is shown in Table 4. Furthermore, the gold standard also allows the derivation of a corresponding confusion matrix (see Figure 7). We focused on suitable combinations of deep learning methods as well as their hyperparameter settings. Fine-tuning pre-trained language models like RoBERTa or Longformer on downstream tasks has become ubiquitous in NLP research and applied NLP. Even without extensive pre-processing of the training data, we already achieve competitive results, and our models can serve as strong baselines which, when fine-tuned, significantly outperform models trained from scratch.

Table 4
Classification report for the final submission against the gold standard.

                  precision   recall   f1-score   support
false             0.6432      0.7841   0.7067     315
other             0.0000      0.0000   0.0000     31
partially false   0.1267      0.3393   0.1845     56
true              0.6575      0.2286   0.3392     210
accuracy                               0.5131
macro avg         0.3569      0.3380   0.3076     612
weighted avg      0.5683      0.5131   0.4970     612

Figure 7: Confusion matrix for Task 3A with the large training corpus on the gold standard.

The submission is based on the best-performing model checkpoint on the validation set; in our case, of course, this evaluation had to take place at the topic level. To identify potential improvements, our approach was applied to both the original training dataset and the large training corpus.

Table 5
Classification report for the predictions on the original training data against the gold standard.

                  precision   recall   f1-score   support
false             0.5970      0.8889   0.7143     315
other             0.0000      0.0000   0.0000     31
partially false   0.1654      0.3929   0.2328     56
true              0.8889      0.0381   0.0731     210
accuracy                               0.5065
macro avg         0.4128      0.3300   0.2550     612
weighted avg      0.6274      0.5065   0.4140     612

Figure 8: Confusion matrix for Task 3A with the original training dataset on the gold standard.

Table 6
Classification report for the predictions on the original training data with oversampling against the gold standard.

                  precision   recall   f1-score   support
false             0.5933      0.7873   0.6767     315
other             0.1667      0.0323   0.0541     31
partially false   0.1566      0.4643   0.2342     56
true              0.6818      0.0714   0.1293     210
accuracy                               0.4739
macro avg         0.3996      0.3388   0.2736     612
weighted avg      0.5621      0.4739   0.4168     612

Figure 9: Confusion matrix for Task 3A with the original training dataset with oversampling on the gold standard.

When improving on the pre-trained baseline models, class imbalance appears to be a primary challenge. This is clearly reflected in Figure 7: the poor performance, especially for the categories "partially false" and "other", correlates with the distribution of training data across these categories (see Figure 2b). A commonly used tactic for dealing with imbalanced datasets is to assign weights to each label. Alternative solutions for coping with unbalanced datasets in supervised machine learning are undersampling and oversampling: undersampling considers only a subset of an overpopulated class to end up with a balanced dataset, while, with the same goal, oversampling creates copies of the underrepresented classes.
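Both tactics can be sketched as follows, assuming the training examples are available as a pandas DataFrame train_df with a label column; the variable names and the random-oversampling strategy are illustrative, not the exact implementation.

```python
# Hedged sketch of two imbalance-mitigation tactics: (1) per-class loss
# weights, (2) random oversampling of underrepresented classes.
import numpy as np
import pandas as pd
import torch
from sklearn.utils.class_weight import compute_class_weight

train_df = pd.DataFrame({"text": ["a", "b", "c", "d", "e"],
                         "label": [1, 1, 1, 2, 0]})            # placeholder data

# (1) Class weights, e.g. passed to torch.nn.CrossEntropyLoss(weight=...).
classes = np.unique(train_df["label"])
weights = compute_class_weight("balanced", classes=classes, y=train_df["label"])
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

# (2) Random oversampling: copy minority-class rows up to the majority size.
max_count = train_df["label"].value_counts().max()
oversampled = (pd.concat([grp.sample(max_count, replace=True, random_state=42)
                          for _, grp in train_df.groupby("label")])
               .sample(frac=1.0, random_state=42))             # shuffle
```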
The influence of oversampling is evident from a comparison of the two experiments on the original training dataset (cf. Tables 5 and 6): the macro-averaged F1 score improved from 0.2550 to 0.2736.

Table 7
Results on Task 3A

Rank   Team                 Accuracy   F1-macro
1      iCompass             0.5474     0.3391
2      nlpiruned            0.5408     0.3325
3      awakened             0.5310     0.3231
4      UNED                 0.5441     0.3154
5      NLytics              0.5131     0.3076
6      SCUoL                0.5261     0.3047
7      hariharanrl          0.5359     0.2980
8      CIC                  0.4755     0.2859
9      ur-iw-hnt            0.5327     0.2833
10     BUM                  0.4722     0.2760
11     boby232              0.4755     0.2754
12     HBDCI                0.5082     0.2734
13     DIU_SpeedOut         0.5212     0.2707
14     DIU_Carbine          0.4722     0.2579
15     CODE                 0.4444     0.2550
16     MNB                  0.5065     0.2507
17     subMNB               0.5065     0.2507
18     fosil                0.4624     0.2505
19     Text_Minor           0.3775     0.2347
20     DLRG                 0.5131     0.1987
21     DIU_Phoenix          0.2778     0.1593
22     AIT_FHSTP            0.1993     0.1549
23     DIU_SilentKillers    0.2598     0.1529
24     DIU_Fire71           0.2745     0.1328
25     AI Rational          0.0980     0.1165

Overfitting poses the most difficult challenge in these experiments and reduces generalizability. In all three experiments, we observe the same pattern of misclassification, which is due to the difficulty the system has in finding discriminative features (cf. Figures 7, 8, and 9). The problem is most evident in the poor performance when assigning the class label "true" on the test set: most of these assignments were lost either to "false" or to "partially false". This issue is potentially caused by flaws in the selection of the training data. Indeed, we can attribute part of this problem to content features. At the most basic level, there is a significant difference in the average document length of the documents used for training and prediction, respectively. Following Table 2, significantly shorter documents were used for training; the phenomenon is particularly evident for the category "true" (cf. Table 8). To support this hypothesis, however, the high standard deviation in both statistics suggests further investigation into outliers, as the medians and quantiles suggest a smaller deviation between test and training data. Further investigations examining lexical properties at the class level do not reveal significant differences between the training and test data (cf. Figure 10). Even the use of the much larger dataset does not affect the overall pattern (see Figure 7).

Table 8
Statistical summary of token (word) counts per class on the original training data (train) and the test data (test).

             false               partially false      true                 other
             train      test     train       test     train      test     train     test
doc count    493        315      313         56       196        210      94        31
mean         760.81     1063.53  984.60      1037.88  1084.78    1481.96  821.00    665.55
std          739.57     1898.29  1159.67     1542.03  894.23     2349.92  902.36    512.82
min          17         68       10          120      123        60       15        114
25%          341.00     400.00   382.00      361.75   493.25     533.00   389.75    443.50
50%          512.00     671.00   701.00      556.50   890.00     968.00   608.50    554.00
75%          969.00     1001.00  1094.00     954.00   1269.75    1552.00  933.00    747.00
max          6367       22168    8751        10108    6064       19575    6341      3005

In fact, the problem may be due to a questionable choice of categories reflected in the class labels. In the case of the given task, the classification results suggest some kind of fact check. The system, however, is supposed to determine a truth value for an unseen document based solely on the available training data.
We assume that in most cases external features contribute to the determination of the truth value of a certain statement. In particular, an individual's worldview, contextual knowledge, and thematic context – this holds for the sender as well as for the receiver – are crucial to their own decision. For this reason, linguistic means alone do not have enough discriminative power to robustly determine the truth value. Our approach is an attempt to narrow the problem down to distinguishing different views on a specific topic. Depending on the topic under investigation, we noticed significant differences in the performance of the trained systems, with F1 scores ranging from 0.07 to 0.72.

Figure 10: Class-based lexical feature comparison; (a) log10 text length on the training set, (b) log10 text length on the test set, (c) lexical density on the training set, (d) lexical density on the test set.

With the above findings, we achieve state-of-the-art performance on the text classification datasets. Transformer-based models such as RoBERTa or Longformer have proven to be powerful language representation models for various natural language processing tasks. As this study shows, they are also an effective tool for multi-class text classification. In the future, we will further investigate the inner workings of Transformer-based models and how to counteract their tendency to overfitting.

6. Conclusion and Future work

In future work, we plan to investigate more recent neural architectures for language representation such as T5 [37], GPT-3 [38], or its open competitor OPT-175B [39]. Furthermore, we expect great opportunities for transfer learning from areas such as argumentation mining [40] and offensive language detection [41]. In order to deal with data scarcity as a general challenge in natural language processing, we will examine the application of concepts such as active learning, semi-supervised learning [42], and weak supervision [43]. With the evaluation of feature importance [44], we will further address the issue of the robustness of our system by explaining the individual features of the training data as well as their relevance to the model's prediction.

References

[1] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.

[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[3] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.

[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, volume 2017-December, 2017, pp. 5999–6009. arXiv:1706.03762.

[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.

[6] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The Long-Document Transformer (2020). arXiv:2004.05150.

[7] X. Zhou, R. Zafarani, Fake News: A Survey of Research, Detection Methods, and Opportunities, ACM Comput. Surv. 1 (2018). arXiv:1812.00315.

[8] C. H. de Vreese, F. Esser, T. Aalberg, C. Reinemann, J. Stanyer, Populism as an Expression of Political Communication Content and Style: A New Perspective, International Journal of Press/Politics 23 (2018) 423–438. URL: http://journals.sagepub.com/doi/10.1177/1940161218790035. doi:10.1177/1940161218790035.

[9] U. Schade, F. Meißner, A. Pritzkau, S. Verschitz, Prebunking als Möglichkeit zur Resilienzsteigerung gegenüber Falschinformationen in Online-Medien, in: N. Zowislo-Grünewald, N. Wörmer (Eds.), Kommunikation, Resilienz und Sicherheit, Konrad-Adenauer-Stiftung, Berlin, 2021, pp. 134–155.

[10] J. Leskovec, K. Lang, Statistical properties of community structure in large social and information networks, Proceedings of the 17th International Conference on World Wide Web, ACM (2008) 695–704. URL: http://dl.acm.org/citation.cfm?id=1367591.

[11] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: 26th International World Wide Web Conference 2017, WWW 2017 Companion, International World Wide Web Conferences Steering Committee, 2017, pp. 759–760. doi:10.1145/3041021.3054223. arXiv:1706.00188.

[12] L. Gao, R. Huang, Detecting online hate speech using context aware models, in: International Conference Recent Advances in Natural Language Processing, RANLP, volume 2017-September, Association for Computational Linguistics (ACL), 2017, pp. 260–266. doi:10.26615/978-954-452-049-6-036. arXiv:1710.07395.

[13] J. Pavlopoulos, P. Malakasiotis, I. Androutsopoulos, Deeper attention to abusive user content moderation, in: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, Association for Computational Linguistics, Stroudsburg, PA, USA, 2017, pp. 1125–1135. URL: http://aclweb.org/anthology/D17-1117. doi:10.18653/v1/d17-1117.

[14] G. K. Pitsilis, H. Ramampiaro, H. Langseth, Effective hate-speech detection in Twitter data using recurrent neural networks, Applied Intelligence 48 (2018) 4730–4742. doi:10.1007/s10489-018-1242-y. arXiv:1801.04433.

[15] Z. Zhang, D. Robinson, J. Tepper, Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 10843 LNCS, Springer Verlag, 2018, pp. 745–760. doi:10.1007/978-3-319-93417-4_48.

[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). arXiv:1810.04805.

[17] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized Autoregressive Pretraining for Language Understanding, Technical Report, 2019. arXiv:1906.08237.
[18] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting Similarities among Languages for Machine Translation (2013). arXiv:1309.4168.

[19] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2014, pp. 1532–1543. doi:10.3115/v1/d14-1162.

[20] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference, volume 2, 2017, pp. 427–431. URL: https://github.com/facebookresearch/fastText. doi:10.18653/v1/e17-2068. arXiv:1607.01759.

[21] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep Contextualized Word Representations, Association for Computational Linguistics (ACL), 2018, pp. 2227–2237. doi:10.18653/v1/n18-1202. arXiv:1802.05365.

[22] M. Castelle, The Linguistic Ideologies of Deep Abusive Language Classification, 2019, pp. 160–170. doi:10.18653/v1/w18-5120.

[23] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News, in: Proceedings of the 43rd European Conference on Information Retrieval, ECIR '21, Lucca, Italy, 2021, pp. 639–649. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_75.

[24] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, S. Modha, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News, in: Proceedings of the 12th International Conference of the CLEF Association: Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization, CLEF '2021, Bucharest, Romania (online), 2021.

[25] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! Lab Task 3 on Fake News Detection, in: Working Notes of CLEF 2021—Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021.

[26] G. K. Shahi, J. M. Struß, T. Mandl, Task 3: Fake News Detection at CLEF-2021 CheckThat!, CLEF '2021, Zenodo, Bucharest, Romania (online), 2021. doi:10.5281/zenodo.4714517.

[27] G. K. Shahi, AMUSED: An Annotation Framework of Multi-modal Social Media Data (2020). arXiv:2010.00502.

[28] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of COVID-19 misinformation on Twitter, Online Social Networks and Media 22 (2021) 100104.

[29] K. Shu, Fake News Detection Challenge KDD 2020, 2020. URL: https://www.kaggle.com/competitions/fakenewskdd2020/data.

[30] J. Ribeiro, Fakenews Classification Datasets, 2020. URL: https://www.kaggle.com/datasets/liberoliber/onion-notonion-datasets.

[31] O. Blanc, A. Pritzkau, U. Schade, M. Geierhos, CODE at CheckThat! 2022: Multi-class fake news detection of news articles with BERT, in: CEUR Workshop Proceedings, 2022. URL: http://ceur-ws.org.

[32] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993–1022. URL: http://www.crossref.org/jmlr_DOI.html. doi:10.1162/jmlr.2003.3.4-5.993.
[33] M. Röder, A. Both, A. Hinneburg, Exploring the space of topic coherence measures, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 399–408. URL: https://doi.org/10.1145/2684822.2685324. doi:10.1145/2684822.2685324.

[34] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45. URL: https://github.com/huggingface/. doi:10.18653/v1/2020.emnlp-demos.6. arXiv:1910.03771v5.

[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, volume 32, Neural information processing systems foundation, 2019. URL: http://arxiv.org/abs/1912.01703. arXiv:1912.01703.

[36] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016). arXiv:1609.08144.

[37] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv 21 (2019) 1–67. arXiv:1910.10683.

[38] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.

[39] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022).

[40] M. Stede, Automatic argumentation mining and the role of stance and sentiment, Journal of Argumentation in Context 9 (2020) 19–41. URL: https://www.jbe-platform.com/content/journals/10.1075/jaic.00006.ste. doi:10.1075/jaic.00006.ste.

[41] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, in: NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, 2019, pp. 1415–1420. URL: http://aclweb.org/anthology/N19-1144. doi:10.18653/v1/n19-1144. arXiv:1902.09666.
[42] S. Ruder, B. Plank, Strong Baselines for Neural Semi-supervised Learning under Domain Shift, ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) 1 (2018) 1044–1054. arXiv:1804.09530.

[43] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: rapid training data creation with weak supervision, in: VLDB Journal, volume 29, Springer, 2020, pp. 709–730. doi:10.1007/s00778-019-00552-1.

[44] S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.