Offensive text detection across languages and datasets using
rule-based and hybrid methods
Kinga Gémes1,2 , Ádám Kovács1,2 and Gábor Recski1
1 TU Wien, Favoritenstraße 9-11., Vienna, 1040, Austria
2 Budapest University of Technology and Economics, Műegyetem rkp. 3., Budapest, H-1111, Hungary


Abstract
We investigate the potential of rule-based systems for the task of offensive text detection in English and German, and demonstrate their effectiveness in low-resource settings, as an alternative or addition to transfer learning across tasks and languages. Task definitions and annotation guidelines used by existing datasets show great variety, hence state-of-the-art machine learning models do not transfer well across datasets or languages. Furthermore, such systems lack explainability and pose a critical risk of unintended bias. We present simple rule systems based on semantic graphs for classifying offensive text in two languages and provide both quantitative and qualitative comparison of their performance with deep learning models on 5 datasets across multiple languages and shared tasks.

Keywords
offensive text, rule-based methods, human-in-the-loop learning



1. Introduction

The task of offensive text detection, especially as applied to social media, has seen a rise of interest in recent years, with many overlapping definitions of categories such as toxicity, hate speech, and profanity. Datasets are constructed using different sets of class definitions corresponding to different annotation instructions, and machine learning models that learn patterns of one dataset may perform poorly on another. Modern deep learning models also offer little or no explainability of their decisions, and their potential for unintended bias reduces their applicability in real-world scenarios such as automatic content moderation. In this paper we present a rule-based approach, a semi-automatic method for constructing patterns over Abstract Meaning Representations (AMR graphs) built from input text, and evaluate its potential as an alternative to machine learning for offensive text detection using five datasets of English and German social media text. Our quantitative analysis compares the rule-based method to both monolingual and multilingual deep learning models trained on data from each language and shared task, demonstrating its potential in low-resource settings as an alternative or addition to transfer learning. Our qualitative analysis examines the decisions made by each system on samples of 100 texts from each language and provides a subjective categorization of their errors to demonstrate the sensitivity of quantitative evaluation to the characteristics of individual datasets and their potentially controversial annotations. The main contributions of the paper are the following:

• A rule-based method for offensive text detection using semantic parsing and graph patterns
• 5 high-precision rule systems for English and German offensive text detection based on datasets from two shared tasks
• Quantitative evaluation of our rule systems, deep learning baselines, and their ensembles across 5 datasets, demonstrating that rule-based and hybrid systems can outperform deep learning models in cross-dataset and cross-language settings
• Detailed error analysis of each system on samples of 100 posts each from one English and one German dataset

The rest of this paper is organized as follows. An overview of related work and the most important shared tasks and datasets is given in Section 2. The datasets used in our experiments are described in Section 3. Our method for constructing AMR-based rule systems is presented in Section 4 and our experiments are described in Section 5. Quantitative evaluation is presented and discussed in Section 6, and the qualitative analysis on samples from two datasets is provided in Section 7. All software for experiments as well as the rule-based systems presented is available as open-source software under an MIT license from https://github.com/GKingA/offensive_text.

CIKM'22: Advances in Interpretable Machine Learning and Artificial Intelligence Workshop, Oct 17–21, 2022, Atlanta, GA
kinga.gemes@tuwien.ac.at (K. Gémes); adam.kovacs@tuwien.ac.at (Á. Kovács); gabor.recski@tuwien.ac.at (G. Recski)
ORCID: 0000-0003-0626-9644 (K. Gémes); 0000-0001-6132-7144 (Á. Kovács); 0000-0001-5551-3100 (G. Recski)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
2. Related Work

Datasets   As pointed out already in a 2017 survey [1], the definition of offensive text varies greatly across datasets, which makes the portability of deep learning models for offensive text detection a hard problem. Annual shared tasks on hate speech detection and related tasks may use similar definitions year after year, but there is great variation when moving from one shared task to another, and models that achieve high quantitative results on their targeted test set do not generalize well (see [2] for a recent survey). In this paper we experiment on yearly datasets from two tasks that both use the same labeling scheme for offensive text, HASOC [3] and GermEval [4]. Both challenges define a binary classification of social media texts (Tweets or Facebook comments) into the offensive and non-offensive classes, and a fine-grained classification of the offensive category into the subclasses abusive, insulting, and profane. A detailed description of these tasks and datasets is given in Section 3. The OLID and SOLID datasets of SemEval 2019 [5] and 2020 [6] use task definitions similar to GermEval. Other widely used datasets with a narrower scope include the data provided by the TRAC [7, 8] and HatEval [9] shared tasks. TRAC contains English, Hindi, and Bangla data from Twitter and Facebook, with annotation focusing on the categories aggression and misogyny; the HatEval task is concerned with hate speech directed at immigrants or women in English and Spanish Twitter data.

Approaches   Most systems for offensive text detection rely on distributional text representations, including both static [10] and contextual embeddings [8, 11]. As in many popular text classification tasks, the most widely used neural language models are based on the Transformer architecture [12], and in particular BERT-based models [13] are the basis of the state-of-the-art machine learning systems for most datasets, including the best-performing systems on GermEval 2021 [14], GermEval 2019 [15], HASOC 2020 English [16], and HASOC 2020 German [8, 17]. Top systems enhance quantitative performance by optimizing metaparameters such as maximum sentence length or number of training epochs [18, 19], by training on joint subtask labels [20], by utilizing multiple Transformer-based models to counteract the small dataset sizes [14], by pretraining on additional hate speech corpora [21], by training jointly on different corpora [8], or by using adversarial learning [22]. Further deep learning methods used in offensive text detection include LSTMs [23, 24], CNNs [25, 26], or both [27], sentence embeddings [28], and ensembles of multiple machine learning models [27, 29].

Explainability and rule learning   The interpretability of NLP models and the explainability of their decisions is the subject of growing interest, also as part of the broader research area of explainable artificial intelligence (xAI). Deep learning models are considered black boxes in most applications, and efforts to interpret them are generally limited to feature weight visualizations with limited validity (see e.g. [30], [31], and [32] for the controversy about using attention weights as explanation). Yet even the more mature methods for interpreting neural networks (e.g. LIME [33]) do not offer the kind of transparency of ML models that would allow developers to customize their functionality the way a domain expert can update a traditional rule system. In this work we experiment with a rudimentary method for semi-automatic, human-in-the-loop (HITL) learning of simple rule systems over semantic graphs. Recent approaches to automatic learning of rule systems for NLP tasks range from the learning of first-order logic formulae over semantic representations using neural networks [34] and integer programming [35] to the training of probabilistic grammars over semantic graphs [36]. HITL approaches involve generating rule candidates to be reviewed by experts, e.g. by extracting textual patterns [37] or semantic structures [38]. Rule-based approaches are also often combined with ML methods, e.g. by incorporating lexical features into DL architectures [39, 40] or voting between rule-based and ML systems [41, 42, 43].

3. Data

In this section we introduce datasets from the GermEval and HASOC shared tasks, which are the basis of all our quantitative experiments in Section 5 and our qualitative analysis in Section 7. We choose two recent tasks that use identical labeling schemes and also have one language in common (German), allowing us to perform various cross-dataset experiments. Our experiments involve datasets in German and English only, as these are the two languages for which we are able to build rule systems and also perform qualitative analysis (see Section 7) in addition to quantitative results, allowing us to investigate the ability of both ML and rule-based models to transfer between tasks as well as languages.

The GermEval shared task was organized in 2018 [44], 2019 [45], and 2021 [4]. German Twitter posts were annotated for the 2018 and 2019 challenges; the 2021 task used comments from a news-related Facebook group. The 2018 and 2019 Twitter datasets consist of posts from 100 user timelines and are limited to tweets in German that are not retweets, do not contain URLs, and contain at least 5 alphabetic tokens. The dataset is not a random sample of posts meeting these criteria: users were heuristically selected to ensure a high ratio of offensive tweets (further details on this selection were not given), and the dataset was then debiased using additional tweets containing non-offensive words that were observed to be overrepresented in offensive posts, such as Merkel or Flüchtlinge 'refugees'.
The 2021 edition of GermEval featured a collection of comments from the Facebook page of a German political talk show. The 2021 training data was collected between January and June of 2019, while the test set is from between September and December of 2020. The dataset has been anonymized to comply with Facebook's guidelines for publishing data. The datasets from 2018 and 2019 categorize the offensive texts further into three categories, profanity, insult, and abuse, and define offensive text as the union of these categories; this is identical to the definition used at HASOC. The 2021 dataset does not contain such fine-grained labels and defines offensive texts as the union of screaming, vulgar language, insults, sarcasm, discrimination, discrediting, and accusation of lying.

The Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) shared task was inspired by GermEval and OffensEval and was organized in 2019 [46], 2020 [47], and 2021 [3]. The dataset from 2019 contained tweets and Facebook comments in English, Hindi, and German. Offensive posts were selected based on keywords and hashtags, and debiased similarly to the process described by the GermEval organizers. From 2020, datasets were selected by training a Support Vector Machine (SVM) classifier on a collection of hate speech datasets and using this classifier to select the tweets to be annotated. Following the definition of the 2019 and 2020 GermEval challenges, each HASOC task distinguishes between three types of offensive text, those displaying profanity (PRFN), offense (OFFN), or hate (HATE). The binary classification of offensive texts considers the union of these three categories, and both our quantitative experiments in Section 5 and our qualitative analysis in Section 7 are concerned with this task only.

4. Method

In our quantitative experiments as well as in our error analysis we compare the performance of standard deep learning models with rule-based systems that define sets of patterns over AMR graphs built from the texts of posts to be classified. For the DL models we use standard architectures without modification; technical details will be described along with the experimental setup in Section 5.

Our rule-based solutions are built using POTATO [48] (https://github.com/adaamko/POTATO), a framework that enables the rapid construction of graph-based rule systems and has recently been used for text classification in multiple domains and languages. Input text is parsed into Abstract Meaning Representations (AMR, [49]), directed graphs of concepts representing the semantics of each sentence. For English texts we use a pretrained Transformer-based AMR parser [50] and the amrlib library (https://amrlib.readthedocs.io/en/latest/); for German we construct AMRs from text using a multilingual, transition-based system [51] via the amr-eager-multilingual library (https://github.com/mdtux89/amr-eager-multilingual). A rule system for a task consists of lists of patterns over graph representations of text for each possible class, and a text is predicted to belong to a given class iff at least one pattern in the class's list of patterns matches the corresponding graph. Graphs must be directed and can be edge- and/or node-labeled. Individual patterns are directed graphs whose edge and node labels may be strings or regular expressions (regexes) defining sets of possible labels; a graph pattern with regexes for labels defines the set of all graphs whose corresponding node and edge labels are matched by those regexes. Patterns can also be negated, and a conjunction of patterns can be used as a single rule; a complete rule system can therefore be considered a single boolean statement in disjunctive normal form (DNF) of boolean predicates corresponding to graph patterns. In this regard the method is similar to the approach of [35] and [34] (see Section 2).
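To make the pattern formalism concrete, the following is a minimal sketch of the matching semantics, assuming AMR graphs encoded as networkx digraphs with a label attribute on nodes and edges. The helper functions are our own simplification, not the POTATO API: the actual matcher additionally enforces consistent node assignments across pattern edges and supports negated and conjoined patterns.

```python
import re
import networkx as nx

def edge_matches(graph, u, v, src_re, edge_re, tgt_re):
    """Check one graph edge against a pattern edge whose labels are regexes."""
    return bool(re.fullmatch(src_re, graph.nodes[u]["label"])
                and re.fullmatch(edge_re, graph.edges[u, v]["label"])
                and re.fullmatch(tgt_re, graph.nodes[v]["label"]))

def match_pattern(graph, pattern):
    """True iff every (src_re, edge_re, tgt_re) edge of the pattern is matched
    by some edge of the graph (node assignment consistency omitted here)."""
    return all(any(edge_matches(graph, u, v, *p_edge) for u, v in graph.edges)
               for p_edge in pattern)

# AMR-like graph for a clause such as "The government should be ashamed."
g = nx.DiGraph()
g.add_node("s", label="shame-01")
g.add_node("a", label="government")
g.add_edge("s", "a", label="ARG1")

# a one-edge pattern: 'shame' with one of the listed ARG1 objects (cf. Section 5)
pattern = [("shame.*", "ARG1",
            "media|person|publication|they|you|party|have|government")]
print(match_pattern(g, pattern))  # True -> the offensive-class rule fires
```

A text is then assigned the offensive label iff at least one pattern on the offensive list matches its graph.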
To construct rule systems efficiently, POTATO implements a form of human-in-the-loop (HITL) learning. For each training dataset we consider all AMR graphs and generate a list of frequently occurring subgraphs with at most 2 edges, then rank them based on their importance for the classification task. For this we use subgraphs as features to train a decision tree on the dataset using the sklearn library and then rank these features based on their Gini coefficient. The maximum size of subgraphs is a free parameter of the system but must be kept low to limit the search space. We thus obtain a ranked list of relevant graph patterns that we can use to construct our rule systems manually. We shall describe the individual rule systems built for our experiments in Section 5.
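A sketch of this ranking step, with a toy feature matrix standing in for the extracted subgraph occurrences (the matrix, labels, and pattern names below are made-up placeholders): sklearn's Gini-based feature importances provide the ranking.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# X[i, j] == 1 iff subgraph pattern j occurs in the AMR graph of post i
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array([1, 0, 1, 0])  # 1 = offensive, 0 = non-offensive
patterns = ["(stupid)", "(thank-01)", "(shame-01 :ARG1 government)"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# feature_importances_ is the (Gini) impurity decrease attributed to each feature
for importance, pattern in sorted(zip(tree.feature_importances_, patterns),
                                  reverse=True):
    print(f"{importance:.3f}  {pattern}")
```

The top-ranked patterns are then accepted, edited, or rejected by the author of the rule system.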
5. Experiments

Quantitative evaluation is performed using 5 datasets. For English we train models using the three datasets from the 2019-2021 editions of the HASOC shared task; for German we use the 2021 GermEval dataset (the training portion of which is from earlier editions of GermEval) and the 2020 HASOC corpus (see Section 3 for details on each dataset). We train standard BERT-based classifiers on each dataset and compare them with rule systems we built manually. We investigate the ability of models to transfer between tasks by evaluating each of them on the test sets of all other datasets as well. We also attempt transfer learning between English and German data, by training models using multilingual BERT on datasets from one language and evaluating them on the other language.
Finally, we also measure the contribution of our rule-based system to DL models by evaluating the union of their predicted positive labels, i.e. by considering the strategy of classifying a text as offensive iff at least one of multiple models would classify it as such. In this section we provide details of our deep learning experiments, followed by an overview of our rule systems built from each dataset using the method of Section 4. Results and discussion follow in Section 6.
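The union strategy itself is a one-liner; a sketch over hypothetical prediction arrays:

```python
import numpy as np

def union(*preds):
    """Offensive (1) iff at least one binary classifier predicts offensive."""
    return np.logical_or.reduce(preds).astype(int)

bert = np.array([0, 1, 0, 0, 1])   # hypothetical predictions on five posts
rules = np.array([1, 0, 0, 0, 1])
print(union(bert, rules))          # [1 1 0 0 1]
```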
Deep learning models   For training BERT-based models we preprocess text data by replacing emoticons with their textual representation using the emoji Python library, then removing hashtag symbols and substituting currencies and URLs with special tags using the regex-based library clean-text (the dependencies and BERT models are noted in our repository). Finally, we use our own regular expressions for masking usernames, media tags, and moderators, replacing each with the [USER] tag. For both languages we fine-tune a language-specific pretrained BERT model (bert-base-german-cased for German and bert-base-uncased for English) as well as the multilingual model (bert-base-multilingual-cased). On each dataset we then train one model with the language-specific BERT and one with multilingual BERT. Each of the 5 datasets consists of a train and test portion. For selecting training metaparameters we further divide the train portion of each dataset into train and validation sets, using a 3:1 ratio; for the final experiments we then train our models using the full training datasets and evaluate them on the test sets. For each dataset we train a neural network with a single linear classification head on top of BERT. Hyperparameters are set based on performance on the validation set. We use the Adam optimizer with a weight decay value of 10^-5 and an initial learning rate of 10^-5. We use the balanced weighted loss function of sklearn (https://scikit-learn.org/) to compensate for unbalanced labels, as suggested by [52]. We set the batch size to 8 and train each model for 10 epochs, then determine the optimal number of iterations based on the F-score on the validation set.
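A condensed sketch of this preprocessing and of the balanced loss weighting; the example text and the user-masking regex are simplified stand-ins, and the exact expressions and model setup are in our repository:

```python
import re
import emoji
import numpy as np
import torch
from cleantext import clean
from sklearn.utils.class_weight import compute_class_weight

def preprocess(text: str) -> str:
    text = emoji.demojize(text)                     # emoticons -> textual representation
    text = clean(text, lower=False, no_urls=True, replace_with_url="<URL>",
                 no_currency_symbols=True)          # regex-based cleanup
    text = text.replace("#", "")                    # drop hashtag symbols
    return re.sub(r"@\w+", "[USER]", text)          # simplified username masking

print(preprocess("@user1 what a #disgrace 😡 https://t.co/xyz"))

# balanced class weights to compensate for the skewed label distribution
train_labels = np.array([0, 0, 0, 1])               # toy distribution
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train_labels)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```

The weighted loss is then used when fine-tuning the single-linear-head classifier described above.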
Rule based system   For building and applying our AMR-based rule systems we parse all text with language-specific text-to-AMR parsers (see Section 4 for details). The only preprocessing step we apply is the replacement of emoticons, as described in the previous paragraph. We build rule systems based on each of the 5 training datasets (HASOC 2019-2021 for English, GermEval 2021 and HASOC 2020 for German). Rule systems were built semi-automatically by the authors, based only on the training portions of each dataset; test sets were excluded from the process entirely, and even validation sets were only used for quantitative evaluation, not for HITL learning or manual analysis.

In each of the 5 rule systems the rules with the highest yield are those that consist of a single node, i.e. that refer to the presence of a single word in the text. The majority of these words are in themselves profane and/or insulting. In the English rule systems, top keywords include asshole, stupid, bitch, shit, and fuck, as well as useless and disgrace. In the German rule sets, the top words that trigger the offensive label in themselves include ficken 'fuck', porno, hurensohn 'son of a bitch', arsch 'ass', and scheiße 'shit'. Rules with multiple nodes typically serve to separate offensive and non-offensive occurrences of a word. For example, the word shame is present in over 200 offensive posts of the English HASOC 2021 dataset, but as a keyword rule it would also yield 43 false positives. Using a pattern over AMR graphs we can filter occurrences of the word by the object (ARG1) of shame and construct the rule shame -ARG1-> (media|person|publication|they|you|party|have|government), which yields only 8 false positives for 103 true positives. Another example of patterns over multiple nodes are rules covering negation. For example, in the rule system based on the GermEval 2021 training set, the rule normal -polarity-> - matches all posts where the word normal is negated, such as in the sentence Das ist doch nicht mehr normal! 'That's just not normal anymore!'. The complete rule lists built from each of the 5 datasets are available from our repository.
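In the toy notation of the sketch in Section 4, the rules discussed above could be written roughly as follows; this is runnable together with that sketch (match_pattern), and the on-disk format of our actual rule files differs:

```python
import re

def has_node(graph, label_re):
    """Single-node rule: some node label matches the regex."""
    return any(re.fullmatch(label_re, data["label"])
               for _, data in graph.nodes(data=True))

KEYWORD_RULES = ["asshole", "stupid", "bitch", "shit", "fuck.*"]  # single-node rules
GRAPH_RULES = [
    [("shame.*", "ARG1",
      "media|person|publication|they|you|party|have|government")],
    [("normal", "polarity", "-")],   # negated 'normal', cf. the GermEval 2021 example
]

def classify(graph) -> str:
    """Offensive iff at least one rule matches (the DNF of Section 4)."""
    if any(has_node(graph, kw) for kw in KEYWORD_RULES) or \
       any(match_pattern(graph, rule) for rule in GRAPH_RULES):
        return "OFF"
    return "NOT"
```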
6. Results

The shared tasks we focus on each evaluate classifiers by measuring precision, recall, and F1-score on both the offensive and non-offensive class, and systems are ranked based on the macro-average F-score. HASOC organizers argue that using the macro-average F1-score counteracts class imbalance [46]. We follow this practice in our evaluation, especially since many of the top participating systems do not publish scores for individual classes. Our main results on the test portions of each of the 5 corpora are presented in Table 1. On each dataset we evaluate DL models trained on data from the same task, on data from the other task of the same language, on all data in the language, or on all data from the other language (using multilingual BERT). Additionally we evaluate our dataset-specific rule systems and the pairwise unions of various systems. We also present the scores of the top-performing system for each dataset.

The DL models trained on data from the same task achieve the best results. These models are typically within a few percentage points of the best published models, and are not improved significantly with the addition of the rule system.
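For reference, the per-class and macro-averaged scores reported in Table 1 can be computed with sklearn's standard implementation; the predictions below are made-up placeholders:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0]   # toy gold labels, 1 = offensive
y_pred = [1, 0, 0, 1, 0, 1]   # toy system output

# per-class scores for the offensive (1) and non-offensive (0) class
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 0],
                                             average=None, zero_division=0)
for name, i in [("Offensive", 0), ("Other", 1)]:
    print(f"{name}: P={p[i]:.2f} R={r[i]:.2f} F={f[i]:.2f}")

# macro average: unweighted mean over the two classes, the ranking metric of both tasks
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro",
                                             zero_division=0)
print(f"Macro avg: P={p:.2f} R={r:.2f} F={f:.2f}")
```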
Test              System                           Offensive            Other                Macro avg
                                                   P     R     F       P     R     F       P     R     F
DE GermEval2021   Rules                           65.4   9.7  16.9    64.6  97.0  77.5    65.0  53.3  58.6
                  DE-all                          72.9  35.4  47.7    70.8  92.3  80.1    71.9  63.8  67.6
                  DE-GermEval                     56.7  48.6  52.3    72.0  78.1  75.0    64.4  63.3  63.8
                  DE-HASOC                        69.6  11.1  19.2    65.0  97.1  77.9    67.3  54.1  60.0
                  DE-GermEval2021                 67.3  19.4  30.2    66.5  94.4  78.1    66.9  56.9  61.5
                  EN-all-multi                    53.4  20.0  29.1    65.6  89.7  75.8    59.5  54.9  57.1
                  DE-all ∪ Rules                  69.8  40.3  51.1    71.8  89.7  79.8    70.8  65.0  67.8
                  EN-all-multi ∪ Rules            54.9  27.4  36.6    67.0  86.7  75.6    60.9  57.1  58.9
                  DE-all ∪ EN-all-multi           62.3  44.9  52.2    72.1  84.0  77.6    67.2  64.4  65.8
                  DE-all ∪ EN-all-multi ∪ Rules   60.9  48.9  54.2    73.0  81.5  77.0    66.9  65.2  66.0
                  FHAC                               -     -     -       -     -     -    73.1  70.4  71.8

DE HASOC2020      Rules                           92.4  28.3  43.4    77.0  99.0  86.6    84.7  63.7  72.7
                  DE-all                          55.4  93.0  69.4    96.0  69.1  80.3    75.7  81.0  78.3
                  DE-GermEval                     47.7  90.7  62.5    93.9  59.0  72.5    70.8  74.8  72.8
                  DE-HASOC                        66.6  81.7  73.4    91.7  83.1  87.2    79.1  82.4  80.7
                  DE-HASOC2020                    69.6  74.7  72.0    89.2  86.5  87.8    79.4  80.6  80.0
                  EN-all-multi                    57.4  49.0  52.9    80.2  85.0  82.5    68.8  67.0  67.9
                  DE-all ∪ Rules                  55.4  93.3  69.6    96.2  69.1  80.4    75.8  81.2  78.4
                  EN-all-multi ∪ Rules            62.1  61.7  61.9    84.2  84.5  84.3    73.2  73.1  73.1
                  DE-all ∪ EN-all-multi           51.1  94.7  66.4    96.6  62.6  76.0    73.8  78.6  76.2
                  DE-all ∪ EN-all-multi ∪ Rules   51.2  95.0  66.5    96.8  62.6  76.0    74.0  78.8  76.3
                  HASOCOne                           -     -     -       -     -     -       -     -  77.9

EN HASOC2021      Rules                           87.2  45.1  59.5    49.5  89.0  63.7    68.4  67.1  67.7
                  EN-all                          80.3  95.2  87.2    88.7  61.5  72.6    84.5  78.4  81.3
                  EN-HASOC2021                    84.8  83.3  84.1    73.2  75.4  74.3    79.0  79.3  79.2
                  DE-all-multi                    82.7  23.9  37.1    42.2  91.7  57.8    62.4  57.8  60.0
                  DE-GermEval-multi               77.8  18.9  30.4    40.5  91.1  56.1    59.2  55.0  57.0
                  DE-HASOC-multi                  70.6  22.6  34.2    39.8  84.5  54.1    55.2  53.5  54.3
                  EN-all ∪ Rules                  79.8  95.6  87.0    89.2  60.0  71.8    84.5  77.8  81.0
                  DE-all-multi ∪ Rules            84.1  53.9  65.7    52.2  83.2  64.2    68.2  68.6  68.4
                  EN-all ∪ DE-all-multi           79.3  95.5  86.6    88.8  58.8  70.7    84.0  77.1  80.4
                  EN-all ∪ DE-all-multi ∪ Rules   78.8  95.7  86.4    89.1  57.3  69.8    83.9  76.5  80.1
                  NLP-CIC                            -     -     -       -     -     -       -     -  83.1

EN HASOC2020      Rules                           95.3  74.6  83.7    78.6  96.2  86.5    86.9  85.4  86.2
                  EN-all                          90.2  90.5  90.3    90.2  89.9  90.1    90.2  90.2  90.2
                  EN-HASOC2020                    91.5  91.6  91.5    91.3  91.2  91.3    91.4  91.4  91.4
                  DE-all-multi                    79.3  20.9  33.1    53.7  94.4  68.5    66.5  57.7  61.8
                  DE-GermEval-multi               66.9  12.3  20.7    51.0  93.8  66.0    58.9  53.0  55.8
                  DE-HASOC-multi                  75.5  19.5  30.9    53.0  93.5  67.7    64.3  56.5  60.1
                  EN-all ∪ Rules                  89.6  91.0  90.3    90.6  89.2  89.9    90.1  90.1  90.1
                  DE-all-multi ∪ Rules            89.8  78.7  83.9    80.6  90.8  85.4    85.2  84.8  85.0
                  EN-all ∪ DE-all-multi           86.6  91.9  89.2    91.2  85.4  88.2    88.9  88.6  88.8
                  EN-all ∪ DE-all-multi ∪ Rules   86.0  92.3  89.1    91.5  84.6  87.9    88.7  88.5  88.6
                  IIITK                              -     -     -       -     -     -       -     -    93

EN HASOC2019      Rules                           73.2  35.1  47.4    81.6  95.7  88.1    77.4  65.4  70.9
                  EN-all                          59.6  76.7  67.1    91.4  82.7  86.8    75.5  79.7  77.5
                  EN-HASOC2019                    59.0  75.3  66.2    91.0  82.5  86.5    75.0  78.9  76.9
                  DE-all-multi                    53.1  47.9  50.4    83.2  85.9  84.5    68.1  66.9  67.5
                  DE-GermEval-multi               51.0  34.4  41.1    80.3  89.0  84.4    65.7  61.7  63.6
                  DE-HASOC-multi                  43.0  33.3  37.6    79.4  85.3  82.2    61.2  59.3  60.2
                  EN-all ∪ Rules                  57.5  77.4  66.0    91.5  80.9  85.9    74.5  79.2  76.8
                  DE-all-multi ∪ Rules            55.0  63.5  58.9    87.2  82.7  84.9    71.1  73.1  72.1
                  EN-all ∪ DE-all-multi           51.5  82.6  63.5    92.8  74.1  82.4    72.1  78.4  75.1
                  EN-all ∪ DE-all-multi ∪ Rules   50.2  83.0  62.6    92.8  72.6  81.5    71.5  77.8  74.5
                  YNU_wb                             -     -     -       -     -     -       -     -  78.8
Table 1
Quantitative performance of models on 5 datasets. The language codes are DE for German and EN for English. 'all' denotes the language-specific BERT model trained on all datasets of that language, 'all-multi' is multilingual BERT trained on all language-specific data, and 'Rules' is the rule-based system built from the training set corresponding to the test set. The union of two or more models means classifying a text as offensive iff at least one of the models classifies it as offensive. Previously published top systems included for comparison are FHAC [14], ComMA [8], HASOCOne [17], IIIT_DWD [24], IIITK [16], and YNU_wb [23]. The NLP-CIC team, whose system was reported by the shared task organizers to have achieved the highest F1 score on the shared task [3], did not publish a description of their methods and is only included for the sake of completeness.



Rule systems achieve the highest precision values on each dataset, which is by design and comes at the expense of recall. The effect of rules as an enhancement is considerable in the transfer learning scenarios, both between tasks and languages. Since rules are generally high-precision, most models' performance is improved by considering their union with the task-specific rule system (taking the union of two or more binary classifiers means classifying a text as offensive iff at least one of the models classifies it as such). This effect can be observed on both German and English datasets. On the German HASOC dataset, the EN-multi model is in itself more than 20 points below the F-score on the offensive class achieved by the model trained on the training data corresponding to the test set (DE-HASOC), but adding labels predicted by the rule-based system closes almost half of this gap, raising the F-score from 52.9 to 61.9. On the 2019 English HASOC dataset the effect is similar: rules close about half of the performance gap between German and English models. This shows the potential of simple rule systems in low-resource scenarios where training data is only available for other languages and/or other tasks/genres. On some datasets, our rule systems work well as standalone solutions as well. In case of the 2020 English dataset our rules achieve an F-score of 83.7 on the offensive class, compared to 90.3 for the best DL system. We believe that in real-world applications, e.g. automatic content moderation, such a system may be preferred despite its lower performance, due to its transparency and the fact that its precision is above 95%.
7. Error Analysis

In this section we perform manual error analysis on samples of 100 posts each from the 2021 datasets for each language (GermEval for German and HASOC for English). Samples were selected randomly and classified by each of the models described and evaluated in previous sections. Here we provide an overview of errors made by each model and cite selected examples. The quantitative results on this sample are noted in the README of our repository. Errors made by our models are grouped into what we consider to be typical error classes, but we note that such a categorization is subjective and is made solely for the purpose of discussion and presentation of the results of our manual analysis. The examples we refer to in our discussion below are presented in Table 2; a full list of errors made by each of the systems, as well as quantitative evaluation of each classifier on the two samples, is available in our repository.

The largest error class consists of false negative predictions: posts that are clearly offensive but that some models failed to detect as such. These include e.g. the profanity in FNen14†‡ or the insult in FNde1*†‡.

Another major group consists of posts on controversial or sensitive topics whose status as offensive or non-offensive is influenced by both form and content and is itself probably controversial. False positive predictions in this group include texts that express strong negative opinions in a relatively civil way (FPde2*, FPde4*), while false negatives are those that may have been annotated as offensive because of their tone (FNen1*†‡, FNen2*†‡).

Ground truth annotations are inconsistent about whether the presence of profanity alone warrants the offensive label. The posts FPen1*‡ and FPen2*†, which have been predicted as offensive by several of our models and contain words such as fuck and bitch, are annotated as non-offensive. One might attribute these annotations to the lack of hostile intent in these posts, but this would be in sharp contrast with FNen22† and FNen23†, which contain the same words, also lack any offensive content, but are nevertheless annotated as offensive (and profane in particular).

The German sample, taken from the GermEval dataset containing longer Facebook comments, also contained several instances of sarcasm, which typically resulted in false negative predictions such as FNde4*†‡ and FNde5*†‡. Finally, the English sample contained several examples of data error, such as the inclusion of non-English text (FNen3†‡) or encoding issues (FNen13†‡).
ID         Text
FNen14†‡   How many people you planning to shag in September? — one person. the rest are a bonus https://t.co/FcS1FpxSvE
FNde1*†‡   @USER solch sinnfreie Beiträge… 'such pointless posts…'
FPde2*     Schauspielen kann er nicht. Und inzwischen meint er, Ahnung von Allem zu haben. Schlimm dieser Typ 'He cannot act. And by now he thinks he knows about everything. This guy is terrible'
FPde4*     @USER…äh, Verzeihung! Fangen Sie doch einfach mal bei sich selbst, mit Ihren unnützen Motorrädern, an! '@USER …er, excuse me! Why don't you just start with yourself, with your useless motorcycles!'
FNen1*†‡   @timesofindia How dare they call it Indian variant when they dint call it a #wuhanvirus or #chinesevirus?? India should file a legal case against WHO and China in international court.
FNen2*†‡   Sad reality of Indian news channels. A minute by minute coverage of elections while a common man struggles to find #covid treatment essentials. Useless News channels. #COVIDSecondWaveInIndia #CoronaPandemic #IndiaCovidCrisis #COVID19India #IndiaChoked #aajtak #zeenews #ABPnews
FPen1*‡    miya four creeps into every thought i have what the fuck
FPen2*†    @imtillyherron Happy MF birthday to my fave bitch out there!! thank you for always being YOU and for showing me that I shouldn't have to worry about what others might say thank you for being my motivation, my idol who radiates nothing but positive energ
FNen22†    Bitch I done did so much today I'm tired
FNen23†    would you fuck me? - ash — Idk who ash is? So you gotta tell me lol https://t.co/I0Jj7LNEho
FNde4*†‡   @USER Sie sind Hellseher? 'You are a clairvoyant?'
FNde5*†‡   Oh…die Frau hat eine Glaskugel ? Ist ja interessant. 'Oh… the woman has a crystal ball? How interesting.'
FNen3†‡    @ANI Naa desh ko corona se bachaya Naa WB elections jeeta itna campaigning ke baad Seriously Modi is big failure for India than what I thought. #ResignPMmodi
FNen13†‡   Windy says oh ya hoor sir… No long in. Shattered. Got myself a wee part time job. 3 days a month. First day. 12 hour shift. Bollocks 😮🤦��♂�🤣 Think I'll give ma sel a 9/10 the day though. What an absolute fuking stonker eh 😎🔥🙌

Table 2
Sample texts misclassified by any of our systems, grouped by error type. Text IDs indicate false positive (FP) or false negative (FN) and the models that made the false prediction: * denotes the language-specific BERT model, † the multilingual BERT model, and ‡ the rule-based system. English glosses of the German examples are given in single quotes.

References
    Proceedings of the Fifth International Workshop on         training across different classification tasks, in:
    Natural Language Processing for Social Media, As-          FIRE, CEUR, Hyderabad, India, 2020, pp. 823–828.
    sociation for Computational Linguistics, Valencia,         URL: http://ceur-ws.org/Vol-2826/T10-3.pdf.
    Spain, 2017, pp. 1–10. URL: https://aclanthology.      [9] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M.
    org/W17-1101. doi:10.18653/v1/W17- 1101 .                  Rangel Pardo, P. Rosso, M. Sanguinetti, SemEval-
[2] W. Yin, A. Zubiaga,          Towards generalisable         2019 task 5: Multilingual detection of hate speech
    hate speech detection: a review on obstacles               against immigrants and women in Twitter, in: Pro-
    and solutions, PeerJ Computer Science 7 (2021).            ceedings of the 13th International Workshop on Se-
    URL: https://peerj.com/articles/cs-598/. doi:https:        mantic Evaluation, Association for Computational
    //doi.org/10.7717/peerj- cs.598 .                          Linguistics, Minneapolis, Minnesota, USA, 2019,
[3] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Sa-          pp. 54–63. URL: https://aclanthology.org/S19-2007.
    tapara, P. Majumder, J. Schäfer, T. Ranasinghe,            doi:10.18653/v1/S19- 2007 .
    M. Zampieri, D. Nandini, A. K. Jaiswal, Overview      [10] P. Chiril, F. Benamara Zitoune, V. Moriceau,
    of the HASOC subtrack at FIRE 2021: Hate speech            M. Coulomb-Gully, A. Kumar, Multilingual and
    and offensive content identification in English and        multitarget hate speech detection in tweets, in:
    Indo-Aryan languages, in: Working Notes of FIRE            Actes de la Conférence sur le Traitement Automa-
    2021 - Forum for Information Retrieval Evalua-             tique des Langues Naturelles (TALN) PFIA 2019. Vol-
    tion, FIRE 2021, Association for Computing Ma-             ume II : Articles courts, ATALA, Toulouse, France,
    chinery, New York, NY, USA, 2021, pp. 1–3. URL:            2019, pp. 351–360. URL: https://aclanthology.org/
    https://doi.org/10.1145/3503162.3503176.                   2019.jeptalnrecital-court.21.
[4] J. Risch, A. Stoll, L. Wilms, M. Wiegand, Overview    [11] T. Ranasinghe, M. Zampieri, Multilingual offensive
    of the GermEval 2021 shared task on the identifi-          language identification with cross-lingual embed-
    cation of toxic, engaging, and fact-claiming com-          dings, in: Proceedings of the 2020 Conference on
    ments, in: Proceedings of the GermEval 2021                Empirical Methods in Natural Language Process-
    Shared Task on the Identification of Toxic, En-            ing (EMNLP), Association for Computational Lin-
    gaging, and Fact-Claiming Comments co-located              guistics, Online, 2020, pp. 5838–5844. URL: https:
    with KONVENS, Association for Computational                //aclanthology.org/2020.emnlp-main.470. doi:10.
    Linguistics, Düsseldorf, Germany, 2021, pp. 1–12.          18653/v1/2020.emnlp- main.470 .
    URL: https://aclanthology.org/2021.germeval-1.1.      [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
    doi:10.48415/2021/fhw5- x128 .                             L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin,
[5] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal,           Attention is all you need, in: I. Guyon, U. V.
    N. Farra, R. Kumar, SemEval-2019 task 6: Identify-         Luxburg, S. Bengio, H. Wallach, R. Fergus,
    ing and categorizing offensive language in social          S. Vishwanathan, R. Garnett (Eds.), Advances
    media (OffensEval), in: Proceedings of the 13th            in Neural Information Processing Systems 30,
    International Workshop on Semantic Evaluation,             Curran Associates, Inc., Long Beach, CA, USA,
    Association for Computational Linguistics, Min-            2017, pp. 5998–6008. URL: http://papers.nips.
    neapolis, Minnesota, USA, 2019, pp. 75–86. URL:            cc/paper/7181-attention-is-all-you-need.pdf.
    https://aclanthology.org/S19-2010. doi:10.18653/           arXiv:1706.03762 .
    v1/S19- 2010 .                                        [13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
[6] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova,         Pre-training of deep bidirectional transformers for
    G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pite-          language understanding, in: Proc. of NAACL,
    nis, Çağrı Çöltekin, Semeval-2020 task 12: Multi-          Association for Computational Linguistics, Min-
    lingual offensive language identification in social        neapolis, Minnesota, 2019, pp. 4171–4186. URL:
    media (OffensEval 2020), in: Proceedings of the            https://aclanthology.org/N19-1423. doi:10.18653/
    Fourteenth Workshop on Semantic Evaluation, In-            v1/N19- 1423 .
    ternational Committee for Computational Linguis-      [14] T. Bornheim, N. Grieger, S. Bialonski, FHAC at
    tics, Barcelona (online), 2020, pp. 1425–1447. URL:        GermEval 2021: Identifying German toxic, engag-
    https://aclanthology.org/2020.semeval-1.188.               ing, and fact-claiming comments with ensemble
[7] ws-2018-trolling, Proceedings of the First Work-           learning, in: Proceedings of the GermEval 2021
    shop on Trolling, Aggression and Cyberbullying             Workshop on the Identification of Toxic, Engag-
    (TRAC-2018), Association for Computational Lin-            ing, and Fact-Claiming Comments, Association for
    guistics, Santa Fe, New Mexico, USA, 2018. URL:            Computational Linguistics, Heinrich Heine Univer-
    https://aclanthology.org/W18-4400.                         sity Düsseldorf, Germany, 2021, pp. 105–111. URL:
[8] R. Kumar, B. Lahiri, A. K. Ojha, A. Bansal,                https://aclanthology.org/2021.germeval-1.16.
    ComMA@FIRE 2020: Exploring multilingual joint         [15] A. Paraschiv, D.-C. Cercel, UPB at GermEval-2019
     task 2: BERT-based offensive language classifica-              org/2020.emnlp-main.606. doi:10.18653/v1/2020.
     tion of german Tweets, in: Proceedings of the 15th             emnlp- main.606 .
     Conference on Natural Language Processing (KON-           [23] B. Wang, Y. Ding, S. Liu, , X. Zhou, YNU_wb at
     VENS 2019), German Society for Computational                   HASOC 2019: Ordered neurons LSTM with atten-
     Linguistics & Language Technology, Erlangen, Ger-              tion for identifying hate speech and offensive lan-
     many, 2019, pp. 398–404.                                       guage, in: FIRE (Working Notes), CEUR, Kolkata,
[16] N. Ghanghor, R. Ponnusamy, P. K. Kumare-                       India, 2019, pp. 191–198. URL: http://ceur-ws.org/
     san, R. Priyadharshini, S. Thavareesan, B. R.                  Vol-2517/T3-2.pdf.
     Chakravarthi, IIITK@LT-EDI-EACL2021: Hope                 [24] A.     Mishra,     S.    Saumya,       A.    Kumar,
     speech detection for equality, diversity, and inclu-           IIIT_DWD@HASOC 2020: Identifying offen-
     sion in Tamil , Malayalam and English, in: Pro-                sive content in Indo-European languages, in: FIRE,
     ceedings of the First Workshop on Language Tech-               CEUR, Hyderabad, India, 2020, pp. 139–144. URL:
     nology for Equality, Diversity and Inclusion, Asso-            http://ceur-ws.org/Vol-2826/T2-5.pdf.
ciation for Computational Linguistics, Kyiv, 2021, pp. 197–203. URL: https://aclanthology.org/2021.ltedi-1.30.
[17] S. Dowlagar, R. Mamidi, HASOCOne@FIRE-HASOC2020: Using BERT and multilingual BERT models for hate speech detection, 2021. URL: https://arxiv.org/pdf/2101.09007.pdf. arXiv:2101.09007.
[18] K. Kumari, J. Singh, AI_ML_NIT_Patna @HASOC 2020: BERT models for hate speech identification in Indo-European languages, in: FIRE, CEUR, Hyderabad, India, 2020, pp. 319–324. URL: http://ceur-ws.org/Vol-2826/T2-29.pdf.
[19] P. Liu, W. Li, L. Zou, NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 87–91. URL: https://aclanthology.org/S19-2011. doi:10.18653/v1/S19-2011.
[20] S. Mishra, S. Mishra, 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages, in: FIRE (Working Notes), CEUR, Kolkata, India, 2019, pp. 208–213. URL: http://ceur-ws.org/Vol-2517/T3-4.pdf.
[21] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, Bangkok, Thailand, 2021, pp. 17–25. URL: https://aclanthology.org/2021.woah-1.3. doi:10.18653/v1/2021.woah-1.3.
[22] T. Tran, Y. Hu, C. Hu, K. Yen, F. Tan, K. Lee, S. Park, HABERTOR: An efficient and effective deep hatespeech detector, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7486–7502. URL: https://aclanthology.org/2020.emnlp-main.606.
[25] B. Gambäck, U. K. Sikdar, Using convolutional neural networks to classify hate-speech, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 85–90. URL: https://aclanthology.org/W17-3013. doi:10.18653/v1/W17-3013.
[26] J. H. Park, P. Fung, One-step and two-step classification for abusive language detection on Twitter, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 41–45. URL: https://aclanthology.org/W17-3006. doi:10.18653/v1/W17-3006.
[27] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2017, pp. 759–760. URL: https://doi.org/10.1145/3041021.3054223. doi:10.1145/3041021.3054223.
[28] V. Indurthi, B. Syed, M. Shrivastava, N. Chakravartula, M. Gupta, V. Varma, FERMI at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 70–74. URL: https://aclanthology.org/S19-2009. doi:10.18653/v1/S19-2009.
[29] A. Nikolov, V. Radivchev, Nikolov-Radivchev at SemEval-2019 task 6: Offensive Tweet classification with BERT and ensembles, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 691–695. URL: https://aclanthology.org/S19-2123. doi:10.18653/v1/S19-2123.
[30] S. Serrano, N. A. Smith, Is Attention Interpretable?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2931–2951. URL: https://aclanthology.org/P19-1282. doi:10.18653/v1/P19-1282.
[31] S. Wiegreffe, Y. Pinter, Attention is not not Explanation, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 11–20. URL: https://aclanthology.org/D19-1002. doi:10.18653/v1/D19-1002.
[32] S. Jain, B. C. Wallace, Attention is not Explanation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3543–3556. URL: https://aclanthology.org/N19-1357. doi:10.18653/v1/N19-1357.
[33] M. T. Ribeiro, S. Singh, C. Guestrin, “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144. URL: https://doi.org/10.1145/2939672.2939778. doi:10.1145/2939672.2939778.
[34] P. Sen, M. Danilevsky, Y. Li, S. Brahma, M. Boehm, L. Chiticariu, R. Krishnamurthy, Learning explainable linguistic expressions with neural inductive logic programming for sentence classification, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4211–4221. URL: https://www.aclweb.org/anthology/2020.emnlp-main.345. doi:10.18653/v1/2020.emnlp-main.345.
[35] S. Dash, O. Gunluk, D. Wei, Boolean decision rules via column generation, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., Montréal, Canada, 2018. URL: https://proceedings.neurips.cc/paper/2018/file/743394beff4b1282ba735e5e3723ed74-Paper.pdf.
[36] L. Donatelli, M. Fowlie, J. Groschwitz, A. Koller, M. Lindemann, M. Mina, P. Weißenhorn, Saarland at MRP 2019: Compositional parsing across all graphbanks, in: Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, Association for Computational Linguistics, Hong Kong, 2019, pp. 66–75. URL: https://aclanthology.org/K19-2006. doi:10.18653/v1/K19-2006.
[37] P. Lertvittayakumjorn, L. Choshen, E. Shnarch, F. Toni, GrASP: A library for extracting and exploring human-interpretable textual patterns, https://arxiv.org/abs/2104.03958, 2021. arXiv:2104.03958.
[38] P. Sen, Y. Li, E. Kandogan, Y. Yang, W. Lasecki, HEIDL: Learning linguistic expressions with deep learning and human-in-the-loop, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 135–140. URL: https://www.aclweb.org/anthology/P19-3023. doi:10.18653/v1/P19-3023.
[39] A. Koufakou, E. W. Pamungkas, V. Basile, V. Patti, HurtBERT: Incorporating lexical features with BERT for the detection of abusive language, in: Proceedings of the Fourth Workshop on Online Abuse and Harms, Association for Computational Linguistics, Online, 2020, pp. 34–43. URL: https://aclanthology.org/2020.alw-1.5. doi:10.18653/v1/2020.alw-1.5.
[40] E. W. Pamungkas, V. Patti, Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Florence, Italy, 2019, pp. 363–370. URL: https://aclanthology.org/P19-2051. doi:10.18653/v1/P19-2051.
[41] A. Razavi, D. Inkpen, S. Uritsky, S. Matwin, Offensive language detection using multi-level classification, in: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 16–27. URL: http://www.eiti.uottawa.ca/~diana/publications/Flame_Final.pdf. doi:10.1007/978-3-642-13059-5_5.
[42] K. Gémes, G. Recski, TUW-Inf at GermEval2021: Rule-based and hybrid methods for detecting toxic, engaging, and fact-claiming comments, in: Proceedings of the GermEval 2021 Workshop on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Heinrich Heine University Düsseldorf, Germany, 2021, pp. 69–75. URL: https://aclanthology.org/2021.germeval-1.10.
[43] K. Gémes, A. Kovács, M. Reichel, G. Recski, Offensive text detection on English Twitter with deep learning models and rule-based systems, in: FIRE 2021 Working Notes, CEUR, Gandhinagar, India, 2021, pp. 283–296. URL: http://ceur-ws.org/Vol-3159/T1-29.pdf.
[44] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the identification of offensive language, in: Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria – September 21, 2018, Austrian Academy of Sciences, Vienna, Austria, 2018, pp. 1–10. URL: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-84935.
[45] J. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of GermEval task 2, 2019 shared task on the identification of offensive language, in: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), October 9–11, 2019 at Friedrich-Alexander-Universität Erlangen-Nürnberg, German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg, München, Germany, 2019, pp. 352–363. URL: https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/germeval/GermEvalSharedTask2019Iggsa.pdf.
[46] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, M. Chintak, A. Patel, Overview of the HASOC Track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 14–17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
[47] T. Mandl, S. Modha, A. K. M, B. R. Chakravarthi, Overview of the HASOC Track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, FIRE 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 29–32. URL: https://doi.org/10.1145/3441501.3441517. doi:10.1145/3441501.3441517.
[48] Á. Kovács, K. Gémes, E. Iklódi, G. Recski, Potato: explainable information extraction framework, in: Proceedings of the 31st ACM International Conference on Information and Knowledge Management, CIKM ’22, Association for Computing Machinery, 2022, pp. 4897–4901. URL: https://doi.org/10.1145/3511808.3557196. doi:10.1145/3511808.3557196.
[49] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider, Abstract Meaning Representation for sembanking, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 178–186. URL: https://www.aclweb.org/anthology/W13-2322.
[50] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[51] M. Damonte, S. B. Cohen, Cross-lingual Abstract Meaning Representation parsing, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1146–1155. URL: https://aclanthology.org/N18-1104. doi:10.18653/v1/N18-1104.
[52] G. King, L. Zeng, Logistic regression in rare events data, Political Analysis 9 (2001) 137–163. URL: https://www.cambridge.org/core/journals/political-analysis/article/logistic-regression-in-rare-events-data/1E09F0F36F89DF12A823130FDF0DA462. doi:10.1093/oxfordjournals.pan.a004868.