=Paper=
{{Paper
|id=Vol-3318/short22
|storemode=property
|title=Offensive text detection across languages and datasets using rule-based and hybrid methods
|pdfUrl=https://ceur-ws.org/Vol-3318/short22.pdf
|volume=Vol-3318
|authors=Kinga Gémes,Ádám Kovács,Gábor Recski
|dblpUrl=https://dblp.org/rec/conf/cikm/GemesKR22
}}
==Offensive text detection across languages and datasets using rule-based and hybrid methods==
Offensive text detection across languages and datasets using rule-based and hybrid methods

Kinga Gémes(1,2), Ádám Kovács(1,2) and Gábor Recski(1)

(1) TU Wien, Favoritenstraße 9-11, Vienna, 1040, Austria
(2) Budapest University of Technology and Economics, Műegyetem rkp. 3., Budapest, H-1111, Hungary

Abstract: We investigate the potential of rule-based systems for the task of offensive text detection in English and German, and demonstrate their effectiveness in low-resource settings as an alternative or addition to transfer learning across tasks and languages. Task definitions and annotation guidelines used by existing datasets show great variety, hence state-of-the-art machine learning models do not transfer well across datasets or languages. Furthermore, such systems lack explainability and pose a critical risk of unintended bias. We present simple rule systems based on semantic graphs for classifying offensive text in two languages and provide both quantitative and qualitative comparison of their performance with deep learning models on 5 datasets across multiple languages and shared tasks.

Keywords: offensive text, rule-based methods, human in the loop learning

1. Introduction

The task of offensive text detection, especially as applied to social media, has seen a rise of interest in recent years, with many overlapping definitions of categories such as toxicity, hate speech, and profanity. Datasets are constructed using different sets of class definitions corresponding to different annotation instructions, and machine learning models that learn the patterns of one dataset may perform poorly on another. Modern deep learning models also offer little or no explainability of their decisions, and their potential for unintended bias reduces their applicability in real-world scenarios such as automatic content moderation. In this paper we present a rule-based approach, a semi-automatic method for constructing patterns over Abstract Meaning Representations (AMR graphs) built from input text, and evaluate its potential as an alternative to machine learning for offensive text detection using five datasets of English and German social media text. Our quantitative analysis compares the rule-based method to both monolingual and multilingual deep learning models trained on data from each language and shared task, demonstrating its potential in low-resource settings as an alternative or addition to transfer learning. Our qualitative analysis examines the decisions made by each system on samples of 100 texts from each of the two languages and provides a subjective categorization of their errors to demonstrate the sensitivity of quantitative evaluation to the characteristics of individual datasets and their potentially controversial annotations. The main contributions of the paper are the following:

• A rule-based method for offensive text detection using semantic parsing and graph patterns
• 5 high-precision rule systems for English and German offensive text detection based on datasets from two shared tasks
• Quantitative evaluation of our rule systems, deep learning baselines, and their ensembles across 5 datasets, demonstrating that rule-based and hybrid systems can outperform deep learning models in cross-dataset and cross-language settings
• Detailed error analysis of each system on samples of 100 posts each from one English and one German dataset

The rest of this paper is organized as follows. An overview of related work and the most important shared tasks and datasets is given in Section 2. The datasets used in our experiments are described in Section 3. Our method for constructing AMR-based rule systems is presented in Section 4 and our experiments are described in Section 5. Quantitative evaluation is presented and discussed in Section 6; the qualitative analysis on samples from two datasets is provided in Section 7. All software for the experiments, as well as the rule-based systems presented, is available as open source under an MIT license from https://github.com/GKingA/offensive_text.
CIKM'22: Advances in Interpretable Machine Learning and Artificial Intelligence Workshop, Oct 17–21, 2022, Atlanta, GA
Contact: kinga.gemes@tuwien.ac.at (K. Gémes); adam.kovacs@tuwien.ac.at (Á. Kovács); gabor.recski@tuwien.ac.at (G. Recski)
ORCID: 0000-0003-0626-9644 (K. Gémes); 0000-0001-6132-7144 (Á. Kovács); 0000-0001-5551-3100 (G. Recski)

2. Related Work

Datasets. As pointed out already in a 2017 survey [1], the definition of offensive text varies greatly across datasets, which makes the portability of deep learning models for offensive text detection a hard problem. Annual shared tasks on hate speech detection and related tasks may use similar definitions year after year, but there is great variation when moving from one shared task to another, and models that achieve high quantitative results on their targeted test set don't generalize well (see [2] for a recent survey). In this paper we experiment on yearly datasets from two tasks that both use the same labeling scheme for offensive text, HASOC [3] and GermEval [4]. Both challenges define a binary classification of social media texts (Tweets or Facebook comments) into the offensive and non-offensive classes, and a fine-grained classification of the offensive category into the subclasses abusive, insulting, and profane. A detailed description of these tasks and datasets is given in Section 3. The OLID and SOLID datasets of SemEval 2019 [5] and 2020 [6] use task definitions similar to GermEval. Other widely used datasets with a narrower scope include the data provided by the TRAC [7, 8] and HatEval [9] shared tasks. TRAC contains English, Hindi and Bangla data from Twitter and Facebook and its annotation focuses on the categories aggression and misogyny; the HatEval task is concerned with hate speech directed at immigrants or women in English and Spanish Twitter data.

Approaches. Most systems for offensive text detection rely on distributional text representations, including both static [10] and contextual embeddings [8, 11]. As in many popular text classification tasks, the most widely used neural language models are based on the Transformer architecture [12], and in particular BERT-based models [13] are the basis of the state-of-the-art machine learning systems for most datasets, including the best-performing systems on GermEval 2021 [14], GermEval 2019 [15], HASOC 2020 English [16] and HASOC 2020 German [8, 17]. Top systems enhance quantitative performance by optimizing metaparameters such as maximum sentence length or number of training epochs [18, 19], by training on joint subtask labels [20], by utilizing multiple Transformer-based models to counteract the small dataset sizes [14], by pretraining on additional hate speech corpora [21], by training jointly on different corpora [8], or by using adversarial learning [22]. Further deep learning methods used in offensive text detection include LSTMs [23, 24], CNNs [25, 26], or both [27], sentence embeddings [28], and ensembles of multiple machine learning models [27, 29].

Explainability and rule learning. The interpretability of NLP models and the explainability of their decisions is the subject of growing interest, also as part of the broader research area of explainable artificial intelligence (xAI). Deep learning models are considered black boxes in most applications, and efforts to interpret them are generally limited to feature weight visualizations with limited validity (see e.g. [30], [31], and [32] for the controversy about using attention weights as explanation). Yet even the more mature methods for interpreting neural networks (e.g. LIME [33]) do not offer the kind of transparency of ML models that would allow developers to customize their functionality the way a domain expert can update a traditional rule system. In this work we experiment with a rudimentary method for semi-automatic, human-in-the-loop (HITL) learning of simple rule systems over semantic graphs. Recent approaches to automatic learning of rule systems for NLP tasks range from the learning of first-order logic formulae over semantic representations using neural networks [34] and integer programming [35] to the training of probabilistic grammars over semantic graphs [36]. HITL approaches involve generating rule candidates to be reviewed by experts, e.g. by extracting textual patterns [37] or semantic structures [38]. Rule-based approaches are also often combined with ML methods, e.g. by incorporating lexical features into DL architectures [39, 40] or by voting between rule-based and ML systems [41, 42, 43].
3. Data

In this section we introduce datasets from the GermEval and HASOC shared tasks, which are the basis of all our quantitative experiments in Section 5 and our qualitative analysis in Section 7. We choose two recent tasks that use identical labeling schemes and also have one language in common (German), allowing us to perform various cross-dataset experiments. Our experiments involve datasets in German and English only; these are the two languages for which we are able to build rule systems and also perform qualitative analysis (see Section 7) in addition to producing quantitative results, allowing us to investigate the ability of both ML and rule-based models to transfer between tasks as well as languages.

The GermEval shared task was organized in 2018 [44], 2019 [45], and 2021 [4]. German Twitter posts were annotated for the 2018 and 2019 challenges; the 2021 task used comments from a news-related Facebook group. The 2018 and 2019 Twitter datasets consist of posts from 100 user timelines and are limited to tweets in German that are not retweets, do not contain URLs, and contain at least 5 alphabetic tokens. The dataset is not a random sample of posts meeting these criteria: users were heuristically selected to ensure a high ratio of offensive tweets (further details on this selection were not given), then the dataset was debiased using additional tweets containing non-offensive words that were observed to be overrepresented in offensive posts, such as Merkel or Flüchtlinge 'refugees'. The 2021 edition of GermEval featured a collection of comments from the Facebook page of a German political talk show. The 2021 training data was collected between January and June of 2019, while the test set is from between September and December of 2020. The dataset has been anonymized to comply with Facebook's guidelines for publishing data. The datasets from 2018 and 2019 categorize the offensive texts further into three categories, profanity, insult, and abuse, and define offensive text as the union of these categories; this is identical to the definition used at HASOC. The 2021 dataset does not contain such fine-grained labels and defines offensive texts as the union of screaming, vulgar language, insults, sarcasm, discrimination, discrediting, and accusation of lying.

The Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) shared task was inspired by GermEval and OffensEval and was organized in 2019 [46], 2020 [47], and 2021 [3]. The dataset from 2019 contained tweets and Facebook comments in English, Hindi and German. Offensive posts were selected based on keywords and hashtags, and debiased similarly to the process described by the GermEval organizers. From 2020, datasets were selected by training a Support Vector Machine (SVM) classifier on a collection of hate speech datasets and using this classifier to select the tweets to be annotated. Following the definition of the 2019 and 2020 GermEval challenges, each HASOC task distinguishes between three types of offensive text, those displaying profanity (PRFN), offense (OFFN), or hate (HATE). The binary classification of offensive texts considers the union of these three categories; both our quantitative experiments in Section 5 and our qualitative analysis in Section 7 are concerned with this task only.
4. Method

In our quantitative experiments as well as in our error analysis we compare the performance of standard deep learning models with rule-based systems that define sets of patterns over AMR graphs built from the texts of the posts to be classified. For the DL models we use standard architectures without modification; technical details are described along with the experimental setup in Section 5. Our rule-based solutions are built using POTATO [48] (https://github.com/adaamko/POTATO), a framework that enables the rapid construction of graph-based rule systems and has recently been used for text classification in multiple domains and languages. Input text is parsed into Abstract Meaning Representations (AMR, [49]), directed graphs of concepts representing the semantics of each sentence. For English texts we use a pretrained Transformer-based AMR parser [50] and the amrlib library (https://amrlib.readthedocs.io/en/latest/); for German we construct AMRs from text using a multilingual, transition-based system [51] via the amr-eager-multilingual library (https://github.com/mdtux89/amr-eager-multilingual). A rule system for a task consists of lists of patterns over graph representations of text for each possible class, and a text is predicted to belong to a given class iff at least one pattern in the class's list of patterns matches the corresponding graph. Graphs must be directed and can be edge- and/or node-labeled. Individual patterns are directed graphs whose edge and node labels may be strings or regular expressions (regexes) defining sets of possible labels; a graph pattern with regexes for labels defines the set of all graphs whose corresponding node and edge labels are matched by those regexes. Patterns can also be negated, and a conjunction of patterns can be used as a single rule; a complete rule system can therefore be considered a single boolean statement in disjunctive normal form (DNF) of boolean predicates corresponding to graph patterns. In this regard the method is similar to the approach of [35] and [34] (see Section 2).
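To make the pattern semantics concrete, the following is a minimal sketch of matching a single-edge pattern with regex labels against a node- and edge-labeled digraph, assuming graphs are stored as networkx DiGraphs with a 'label' attribute on edges. It illustrates the matching logic described above but is not POTATO's actual API; the toy graph is our own construction.

```python
import re
import networkx as nx

def edge_pattern_matches(graph: nx.DiGraph, src: str, label: str, tgt: str) -> bool:
    """True iff some edge (u -label-> v) of the graph matches all three regexes."""
    for u, v, data in graph.edges(data=True):
        if (re.fullmatch(src, str(u))
                and re.fullmatch(label, data.get("label", ""))
                and re.fullmatch(tgt, str(v))):
            return True
    return False

# Toy AMR-like graph for a post whose root concept is "shame":
g = nx.DiGraph()
g.add_edge("shame", "government", label="ARG1")

# The 'shame' rule discussed in Section 5, as a single-edge pattern:
pattern = ("shame", "ARG1", "media|person|publication|they|you|party|have|government")
print(edge_pattern_matches(g, *pattern))  # True
```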
To construct rule systems efficiently, POTATO implements a form of human-in-the-loop (HITL) learning. For each training dataset we consider all AMR graphs and generate a list of frequently occurring subgraphs with at most 2 edges, then rank them based on their importance for the classification task. For this we use the subgraphs as features to train a decision tree on the dataset using the sklearn library (https://scikit-learn.org/) and then rank these features based on their Gini coefficient. The maximum size of subgraphs is a free parameter of the system but must be kept low to limit the search space. We thus obtain a ranked list of relevant graph patterns that we can use to construct our rule systems manually. We describe the individual rule systems built for our experiments in Section 5.
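The ranking step can be sketched with sklearn as follows: each candidate subgraph becomes a binary feature marking its presence in a post's AMR graph, a decision tree is fit on these features, and the tree's Gini-based feature importances yield the ranked list shown to the rule author. The feature matrix and subgraph names below are toy placeholders, not data from our experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One row per training post, one binary column per candidate subgraph
# (1 iff the subgraph occurs in the post's AMR graph); toy values only.
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array([1, 0, 1, 0])  # 1 = offensive, 0 = non-offensive
subgraphs = ["fuck", "shame -ARG1-> government", "normal -polarity-> -"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is Gini-based: a higher value means the subgraph
# separates the classes better, so it is shown to the rule author earlier.
for imp, name in sorted(zip(tree.feature_importances_, subgraphs), reverse=True):
    print(f"{imp:.3f}  {name}")
```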
5. Experiments

Quantitative evaluation is performed using 5 datasets. For English we train models using the three datasets from the 2019–2021 editions of the HASOC shared task; for German we use the 2021 GermEval dataset (the training portion of which is from earlier editions of GermEval) and the 2020 HASOC corpus (see Section 3 for details on each dataset). We train standard BERT-based classifiers on each dataset and compare them with the rule systems we built manually. We investigate the ability of models to transfer between tasks by evaluating each of them on the test sets of all other datasets as well. We also attempt transfer learning between English and German data by training models using multilingual BERT on datasets from one language and evaluating them on the other language. Finally, we also measure the contribution of our rule-based systems to DL models by evaluating the union of their predicted positive labels, i.e. by considering the strategy of classifying a text as offensive iff at least one of multiple models would classify it as such. In this section we provide details of our deep learning experiments, followed by an overview of the rule systems built from each dataset using the method in Section 4. Results and discussion follow in Section 6.

Deep learning models. For training BERT-based models we preprocess text data by replacing emoticons with their textual representation using the emoji Python library, then removing hashtag symbols and substituting currencies and URLs with special tags using the regex-based library clean-text (the dependencies and BERT models are noted in our repository). Finally, we use our own regular expressions for masking usernames, media tags, and moderators, replacing each with the [USER] tag. For both languages we fine-tune a language-specific pretrained BERT model (bert-base-german-cased for German and bert-base-uncased for English) as well as the multilingual model (bert-base-multilingual-cased). On each dataset we then train one model with the language-specific BERT and one with multilingual BERT. Each of the 5 datasets consists of a train and a test portion. For selecting training metaparameters we further divide the train portion of each dataset into train and validation sets, using a 3:1 ratio; for the final experiments we then train our models on the full training datasets and evaluate them on the test sets. For each dataset we train a neural network with a single linear classification head on top of BERT. Hyper-parameters are set based on performance on the validation set. We use the Adam optimizer with a weight decay value of 10⁻⁵ and an initial learning rate of 10⁻⁵. We weight the loss function using the 'balanced' class weights of sklearn (https://scikit-learn.org/) to compensate for unbalanced labels, as suggested by [52]. We set the batch size to 8 and train each model for 10 epochs, then determine the optimal number of iterations based on the F-score on the validation set.
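A minimal sketch of the preprocessing and class-weighted loss described above, assuming PyTorch and the emoji and scikit-learn libraries; the username-masking regex and the toy labels are illustrative, and the exact preprocessing (including the clean-text calls) may differ from what is in our repository.

```python
import re
import emoji
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

def preprocess(text: str) -> str:
    text = emoji.demojize(text)             # emoticons -> :textual_name:
    text = text.replace("#", "")            # drop hashtag symbols
    text = re.sub(r"@\w+", "[USER]", text)  # illustrative username mask
    return text

y_train = np.array([0, 0, 0, 1])  # toy labels; offensive posts are the minority
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_train)
loss_fn = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float))
```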
Rule-based systems. For building and applying our AMR-based rule systems we parse all text with the language-specific text-to-AMR parsers (see Section 4 for details). The only preprocessing step we apply is the replacement of emoticons, as described in the previous paragraph. We build rule systems based on each of the 5 training datasets (HASOC 2019–2021 for English, GermEval 2021 and HASOC 2020 for German). Rule systems were built semi-automatically by the authors, based only on the training portions of each dataset; test sets were excluded from the process entirely, and even validation sets were only used for quantitative evaluation, not for HITL learning or manual analysis.

In each of the 5 rule systems the rules with the highest yield are those that consist of a single node, i.e. that refer to the presence of a single word in the text. The majority of these words are in themselves profane and/or insulting. In the English rule systems top keywords include asshole, stupid, bitch, shit, fuck as well as useless and disgrace. In the German rule sets the top words that trigger the offensive label in themselves include ficken 'fuck', porno, hurensohn 'son of a bitch', arsch 'ass' and scheiße 'shit'. Rules with multiple nodes typically serve to separate offensive and non-offensive occurrences of a word. For example, the word shame is present in over 200 offensive posts of the English HASOC 2021 dataset, but as a keyword rule it would also yield 43 false positives. Using a pattern over AMR graphs we can filter occurrences of the word by the object (ARG1) of shame and construct the rule shame -ARG1-> (media|person|publication|they|you|party|have|government), which yields only 8 false positives for 103 true positives. Another example of patterns over multiple nodes are rules covering negation: in the rule system based on the GermEval 2021 training set, the rule normal -polarity-> - matches all posts where the word normal is negated, such as in the sentence Das ist doch nicht mehr normal! 'That's just not normal anymore!'. The complete rule lists built from each of the 5 datasets are available from our repository.
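Rule application itself reduces to evaluating the DNF described in Section 4. Reusing the edge_pattern_matches helper from the sketch there, a simplified evaluator looks as follows; the rule encoding (and the 'oneself' refinement in the example) is our own illustration, not POTATO's rule format.

```python
# A rule is a conjunction of (pattern, negated) pairs; a rule system is a
# disjunction of such rules, i.e. the DNF of Section 4.
def rule_matches(graph, rule) -> bool:
    return all(edge_pattern_matches(graph, *pat) != negated
               for pat, negated in rule)

def classify(graph, rules) -> bool:
    """Predict 'offensive' iff at least one rule matches the AMR graph."""
    return any(rule_matches(graph, rule) for rule in rules)

# Hypothetical refinement: "shame" with any ARG1, except when directed at
# oneself (the negated second conjunct); purely for illustration.
rule = [(("shame", "ARG1", ".*"), False),
        (("shame", "ARG1", "oneself"), True)]
print(classify(g, [rule]))  # True for the toy graph built in Section 4
```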
6. Results

The shared tasks we focus on each evaluate classifiers by measuring precision, recall, and F1-score on both the offensive and the non-offensive class, and systems are ranked based on the macro-average F-score. HASOC organizers argue that using macro-average F1-score counteracts class imbalance [46]. We follow this practice in our evaluation, especially since many of the top participating systems do not publish scores for individual classes. Our main results on the test portions of each of the 5 corpora are presented in Table 1. On each dataset we evaluate DL models trained on data from the same task, on data from the other task of the same language, on all data in the language, and on all data from the other language (using multilingual BERT). Additionally we evaluate our dataset-specific rule systems and the pairwise unions of various systems. We also present the scores of the top-performing published system for each dataset.

Table 1: Quantitative performance of models on 5 datasets, as precision/recall/F1-score (P/R/F) on the offensive class, the other (non-offensive) class, and as macro-averages. The language codes are DE for German and EN for English. 'all' denotes the language-specific BERT model trained on all datasets of that language, 'all-multi' is multilingual BERT trained on all language-specific data, and 'Rules' is the rule-based system built on the training set corresponding to the test set. The union of two or more models means classifying a text as offensive iff at least one of the models classifies it as offensive. Previously published top systems included for comparison are FHAC [14], ComMA [8], HASOCOne [17], IIIT_DWD [24], IIITK [16], and YNU_wb [23]. The NLP-CIC team, whose system was reported by the shared task organizers to have achieved the highest F1 score on the shared task [3], did not publish a description of their methods and is only included for the sake of completeness.

Test: DE GermEval2021
System | Offensive P/R/F | Other P/R/F | Macro avg P/R/F
Rules | 65.4/9.7/16.9 | 64.6/97.0/77.5 | 65.0/53.3/58.6
DE-all | 72.9/35.4/47.7 | 70.8/92.3/80.1 | 71.9/63.8/67.6
DE-GermEval | 56.7/48.6/52.3 | 72.0/78.1/75.0 | 64.4/63.3/63.8
DE-HASOC | 69.6/11.1/19.2 | 65.0/97.1/77.9 | 67.3/54.1/60.0
DE-GermEval2021 | 67.3/19.4/30.2 | 66.5/94.4/78.1 | 66.9/56.9/61.5
EN-all-multi | 53.4/20.0/29.1 | 65.6/89.7/75.8 | 59.5/54.9/57.1
DE-all ∪ Rules | 69.8/40.3/51.1 | 71.8/89.7/79.8 | 70.8/65.0/67.8
EN-all-multi ∪ Rules | 54.9/27.4/36.6 | 67.0/86.7/75.6 | 60.9/57.1/58.9
DE-all ∪ EN-all-multi | 62.3/44.9/52.2 | 72.1/84.0/77.6 | 67.2/64.4/65.8
DE-all ∪ EN-all-multi ∪ Rules | 60.9/48.9/54.2 | 73.0/81.5/77.0 | 66.9/65.2/66.0
FHAC | - | - | 73.1/70.4/71.8

Test: DE HASOC2020
System | Offensive P/R/F | Other P/R/F | Macro avg P/R/F
Rules | 92.4/28.3/43.4 | 77.0/99.0/86.6 | 84.7/63.7/72.7
DE-all | 55.4/93.0/69.4 | 96.0/69.1/80.3 | 75.7/81.0/78.3
DE-GermEval | 47.7/90.7/62.5 | 93.9/59.0/72.5 | 70.8/74.8/72.8
DE-HASOC | 66.6/81.7/73.4 | 91.7/83.1/87.2 | 79.1/82.4/80.7
DE-HASOC2020 | 69.6/74.7/72.0 | 89.2/86.5/87.8 | 79.4/80.6/80.0
EN-all-multi | 57.4/49.0/52.9 | 80.2/85.0/82.5 | 68.8/67.0/67.9
DE-all ∪ Rules | 55.4/93.3/69.6 | 96.2/69.1/80.4 | 75.8/81.2/78.4
EN-all-multi ∪ Rules | 62.1/61.7/61.9 | 84.2/84.5/84.3 | 73.2/73.1/73.1
DE-all ∪ EN-all-multi | 51.1/94.7/66.4 | 96.6/62.6/76.0 | 73.8/78.6/76.2
DE-all ∪ EN-all-multi ∪ Rules | 51.2/95.0/66.5 | 96.8/62.6/76.0 | 74.0/78.8/76.3
HASOCOne | - | - | -/-/77.9

Test: EN HASOC2021
System | Offensive P/R/F | Other P/R/F | Macro avg P/R/F
Rules | 87.2/45.1/59.5 | 49.5/89.0/63.7 | 68.4/67.1/67.7
EN-all | 80.3/95.2/87.2 | 88.7/61.5/72.6 | 84.5/78.4/81.3
EN-HASOC2021 | 84.8/83.3/84.1 | 73.2/75.4/74.3 | 79.0/79.3/79.2
DE-all-multi | 82.7/23.9/37.1 | 42.2/91.7/57.8 | 62.4/57.8/60.0
DE-GermEval-multi | 77.8/18.9/30.4 | 40.5/91.1/56.1 | 59.2/55.0/57.0
DE-HASOC-multi | 70.6/22.6/34.2 | 39.8/84.5/54.1 | 55.2/53.5/54.3
EN-all ∪ Rules | 79.8/95.6/87.0 | 89.2/60.0/71.8 | 84.5/77.8/81.0
DE-all-multi ∪ Rules | 84.1/53.9/65.7 | 52.2/83.2/64.2 | 68.2/68.6/68.4
EN-all ∪ DE-all-multi | 79.3/95.5/86.6 | 88.8/58.8/70.7 | 84.0/77.1/80.4
EN-all ∪ DE-all-multi ∪ Rules | 78.8/95.7/86.4 | 89.1/57.3/69.8 | 83.9/76.5/80.1
NLP-CIC | - | - | -/-/83.1

Test: EN HASOC2020
System | Offensive P/R/F | Other P/R/F | Macro avg P/R/F
Rules | 95.3/74.6/83.7 | 78.6/96.2/86.5 | 86.9/85.4/86.2
EN-all | 90.2/90.5/90.3 | 90.2/89.9/90.1 | 90.2/90.2/90.2
EN-HASOC2020 | 91.5/91.6/91.5 | 91.3/91.2/91.3 | 91.4/91.4/91.4
DE-all-multi | 79.3/20.9/33.1 | 53.7/94.4/68.5 | 66.5/57.7/61.8
DE-GermEval-multi | 66.9/12.3/20.7 | 51.0/93.8/66.0 | 58.9/53.0/55.8
DE-HASOC-multi | 75.5/19.5/30.9 | 53.0/93.5/67.7 | 64.3/56.5/60.1
EN-all ∪ Rules | 89.6/91.0/90.3 | 90.6/89.2/89.9 | 90.1/90.1/90.1
DE-all-multi ∪ Rules | 89.8/78.7/83.9 | 80.6/90.8/85.4 | 85.2/84.8/85.0
EN-all ∪ DE-all-multi | 86.6/91.9/89.2 | 91.2/85.4/88.2 | 88.9/88.6/88.8
EN-all ∪ DE-all-multi ∪ Rules | 86.0/92.3/89.1 | 91.5/84.6/87.9 | 88.7/88.5/88.6
IIITK | - | - | -/-/93

Test: EN HASOC2019
System | Offensive P/R/F | Other P/R/F | Macro avg P/R/F
Rules | 73.2/35.1/47.4 | 81.6/95.7/88.1 | 77.4/65.4/70.9
EN-all | 59.6/76.7/67.1 | 91.4/82.7/86.8 | 75.5/79.7/77.5
EN-HASOC2019 | 59.0/75.3/66.2 | 91.0/82.5/86.5 | 75.0/78.9/76.9
DE-all-multi | 53.1/47.9/50.4 | 83.2/85.9/84.5 | 68.1/66.9/67.5
DE-GermEval-multi | 51.0/34.4/41.1 | 80.3/89.0/84.4 | 65.7/61.7/63.6
DE-HASOC-multi | 43.0/33.3/37.6 | 79.4/85.3/82.2 | 61.2/59.3/60.2
EN-all ∪ Rules | 57.5/77.4/66.0 | 91.5/80.9/85.9 | 74.5/79.2/76.8
DE-all-multi ∪ Rules | 55.0/63.5/58.9 | 87.2/82.7/84.9 | 71.1/73.1/72.1
EN-all ∪ DE-all-multi | 51.5/82.6/63.5 | 92.8/74.1/82.4 | 72.1/78.4/75.1
EN-all ∪ DE-all-multi ∪ Rules | 50.2/83.0/62.6 | 92.8/72.6/81.5 | 71.5/77.8/74.5
YNU_wb | - | - | -/-/78.8

The DL models trained on data from the same task achieve the best results. These models are typically within a few percentage points of the best published systems, and are not improved significantly by the addition of the rule system. Rule systems achieve the highest precision values on each dataset, which is by design and comes at the expense of recall. The effect of rules as an enhancement is considerable in the transfer learning scenarios, both between tasks and between languages. Since rules are generally high-precision, most models' performance is improved by considering their union with the task-specific rule system (taking the union of two or more binary classifiers means classifying a text as offensive iff at least one of the models classifies it as such). This effect can be observed on both German and English datasets. On the German HASOC dataset the EN-all-multi model is in itself more than 20 points below the F-score on the offensive class achieved by the model trained on the training data corresponding to the test set (DE-HASOC), but adding the labels predicted by the rule-based system closes almost half of this gap, raising the F-score from 52.9 to 61.9. On the 2019 English HASOC dataset the effect is similar: rules close about half of the performance gap between German and English models. This shows the potential of simple rule systems in low-resource scenarios where training data is only available for other languages and/or other tasks/genres. On some datasets our rule systems work well as standalone solutions as well. In the case of the 2020 English dataset our rules achieve an F-score of 83.7 on the offensive class, compared to 90.3 for the best DL system. We believe that in real-world applications, e.g. automatic content moderation, such a system may be preferred despite its lower overall performance, due to its transparency and the fact that its precision is above 95%.
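The union of binary classifiers used throughout Table 1 is a simple logical OR over per-post predictions; a minimal sketch:

```python
def union(*model_predictions):
    """Offensive (1) iff at least one model predicts offensive; each argument
    is a list of 0/1 labels over the same test set."""
    return [int(any(labels)) for labels in zip(*model_predictions)]

# e.g. union(bert_preds, rule_preds): the high-precision rules add recall
# without costing much precision, which helps most in transfer settings.
print(union([0, 1, 0], [1, 0, 0]))  # -> [1, 1, 0]
```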
7. Error Analysis

In this section we perform manual error analysis on samples of 100 posts each from the 2021 datasets of each language (GermEval for German and HASOC for English). Samples were selected randomly and classified by each of the models described and evaluated in the previous sections. Here we provide an overview of the errors made by each model and cite selected examples. The quantitative results on this sample are noted in the README of our repository. Errors made by our models are grouped into what we consider to be typical error classes, but we note that such a categorization is subjective and is made solely for the purpose of discussion and presentation of the results of our manual analysis. The examples we refer to in our discussion below are presented in Table 2; a full list of the errors made by each of the systems as well as a quantitative evaluation of each classifier on the two samples is available in our repository.

The largest error class consists of false negative predictions on posts that are clearly offensive but that some models failed to detect as such. These include e.g. the profanity in FNen14†‡ or the insult in FNde1*†‡.

Another major group consists of posts on controversial or sensitive topics whose status as offensive/non-offensive is influenced by both form and content and is itself probably controversial. False positive predictions in this group include texts that express strong negative opinions in a relatively civil way (FPde2*, FPde4*), while false negatives are those that may have been annotated as offensive because of their tone (FNen1*†‡, FNen2*†‡).

Ground truth annotations are inconsistent about whether the presence of profanity alone warrants the offensive label. The posts FPen1*‡ and FPen2*†, which have been predicted as offensive by several of our models and contain words such as fuck and bitch, are annotated as non-offensive. One might attribute these annotations to the lack of hostile intent in these posts, but this would be in sharp contrast with FNen22† and FNen23†, which contain the same words, also lack any offensive content, but are nevertheless annotated as offensive (and profane in particular).

The German sample, taken from the GermEval dataset containing longer Facebook comments, also contained several instances of sarcasm, which typically resulted in false negative predictions such as FNde4*†‡ and FNde5*†‡. Finally, the English sample contained several examples of data error, such as the inclusion of non-English text (FNen3†‡) or encoding issues (FNen13†‡).

Table 2: Sample texts misclassified by any of our systems, grouped by error type. Text IDs indicate false positive (FP) or false negative (FN) and the models that made the false prediction: * denotes the language-specific BERT model, † refers to the multilingual BERT model, ‡ marks the rule-based system. English glosses of the German examples are given in parentheses.

ID | Text
FNen14†‡ | How many people you planning to shag in September? — one person. the rest are a bonus https://t.co/FcS1FpxSvE
FNde1*†‡ | @USER solch sinnfreie Beiträge… ('@USER such pointless posts…')
FPde2* | Schauspielen kann er nicht. Und inzwischen meint er, Ahnung von Allem zu haben. Schlimm dieser Typ ('He can't act. And by now he thinks he knows about everything. This guy is awful')
FPde4* | @USER …äh, Verzeihung! Fangen Sie doch einfach mal bei sich selbst, mit Ihren unnützen Motorrädern, an! ('@USER …er, pardon me! Why don't you just start with yourself, with your useless motorcycles!')
FNen1*†‡ | @timesofindia How dare they call it Indian variant when they dint call it a #wuhanvirus or #chinesevirus?? India should file a legal case against WHO and China in international court.
FNen2*†‡ | Sad reality of Indian news channels. A minute by minute coverage of elections while a common man struggles to find #covid treatment essentials. Useless News channels. #COVIDSecondWaveInIndia #CoronaPandemic #IndiaCovidCrisis #COVID19India #IndiaChoked #aajtak #zeenews #ABPnews
FPen1*‡ | miya four creeps into every thought i have what the fuck
FPen2*† | @imtillyherron Happy MF birthday to my fave bitch out there!! thank you for always being YOU and for showing me that I shouldn't have to worry about what others might say thank you for being my motivation, my idol who radiates nothing but positive energ
FNen22† | Bitch I done did so much today I'm tired
FNen23† | would you fuck me? - ash — Idk who ash is? So you gotta tell me lol https://t.co/I0Jj7LNEho
FNde4*†‡ | @USER Sie sind Hellseher? ('@USER Are you a clairvoyant?')
FNde5*†‡ | Oh…die Frau hat eine Glaskugel ? Ist ja interessant. ('Oh…the woman has a crystal ball? How interesting.')
FNen3†‡ | @ANI Naa desh ko corona se bachaya Naa WB elections jeeta itna campaigning ke baad Seriously Modi is big failure for India than what I thought. #ResignPMmodi
FNen13†‡ | Windy says oh ya hoor sir… No long in. Shattered. Got myself a wee part time job. 3 days a month. First day. 12 hour shift. Bollocks 😮🤦ðŸ�»â€�♂ï¸�🤣 Think I'll give ma sel a 9/10 the day though. What an absolute fuking stonker eh 😎🔥🙌
References

[1] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 1–10. URL: https://aclanthology.org/W17-1101. doi:10.18653/v1/W17-1101.
[2] W. Yin, A. Zubiaga, Towards generalisable hate speech detection: a review on obstacles and solutions, PeerJ Computer Science 7 (2021). URL: https://peerj.com/articles/cs-598/. doi:10.7717/peerj-cs.598.
[3] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranasinghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1–3. URL: https://doi.org/10.1145/3503162.3503176.
[4] J. Risch, A. Stoll, L. Wilms, M. Wiegand, Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments, in: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments co-located with KONVENS, Association for Computational Linguistics, Düsseldorf, Germany, 2021, pp. 1–12. URL: https://aclanthology.org/2021.germeval-1.1. doi:10.48415/2021/fhw5-x128.
[5] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval), in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL: https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
[6] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Çağrı Çöltekin, SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020), in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), 2020, pp. 1425–1447. URL: https://aclanthology.org/2020.semeval-1.188.
[7] Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018. URL: https://aclanthology.org/W18-4400.
[8] R. Kumar, B. Lahiri, A. K. Ojha, A. Bansal, ComMA@FIRE 2020: Exploring multilingual joint training across different classification tasks, in: FIRE, CEUR, Hyderabad, India, 2020, pp. 823–828. URL: http://ceur-ws.org/Vol-2826/T10-3.pdf.
[9] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, M. Sanguinetti, SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 54–63. URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.
[10] P. Chiril, F. Benamara Zitoune, V. Moriceau, M. Coulomb-Gully, A. Kumar, Multilingual and multitarget hate speech detection in tweets, in: Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019, Volume II: Articles courts, ATALA, Toulouse, France, 2019, pp. 351–360. URL: https://aclanthology.org/2019.jeptalnrecital-court.21.
[11] T. Ranasinghe, M. Zampieri, Multilingual offensive language identification with cross-lingual embeddings, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5838–5844. URL: https://aclanthology.org/2020.emnlp-main.470. doi:10.18653/v1/2020.emnlp-main.470.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., Long Beach, CA, USA, 2017, pp. 5998–6008. URL: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. arXiv:1706.03762.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[14] T. Bornheim, N. Grieger, S. Bialonski, FHAC at GermEval 2021: Identifying German toxic, engaging, and fact-claiming comments with ensemble learning, in: Proceedings of the GermEval 2021 Workshop on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Heinrich Heine University Düsseldorf, Germany, 2021, pp. 105–111. URL: https://aclanthology.org/2021.germeval-1.16.
[15] A. Paraschiv, D.-C. Cercel, UPB at GermEval-2019 task 2: BERT-based offensive language classification of German tweets, in: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), German Society for Computational Linguistics & Language Technology, Erlangen, Germany, 2019, pp. 398–404.
[16] N. Ghanghor, R. Ponnusamy, P. K. Kumaresan, R. Priyadharshini, S. Thavareesan, B. R. Chakravarthi, IIITK@LT-EDI-EACL2021: Hope speech detection for equality, diversity, and inclusion in Tamil, Malayalam and English, in: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, Association for Computational Linguistics, Kyiv, 2021, pp. 197–203. URL: https://aclanthology.org/2021.ltedi-1.30.
[17] S. Dowlagar, R. Mamidi, HASOCOne@FIRE-HASOC2020: Using BERT and multilingual BERT models for hate speech detection, 2021. URL: https://arxiv.org/pdf/2101.09007.pdf. arXiv:2101.09007.
[18] K. Kumari, J. Singh, AI_ML_NIT_Patna @HASOC 2020: BERT models for hate speech identification in Indo-European languages, in: FIRE, CEUR, Hyderabad, India, 2020, pp. 319–324. URL: http://ceur-ws.org/Vol-2826/T2-29.pdf.
[19] P. Liu, W. Li, L. Zou, NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 87–91. URL: https://aclanthology.org/S19-2011. doi:10.18653/v1/S19-2011.
[20] S. Mishra, S. Mishra, 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages, in: FIRE (Working Notes), CEUR, Kolkata, India, 2019, pp. 208–213. URL: http://ceur-ws.org/Vol-2517/T3-4.pdf.
[21] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, Bangkok, Thailand, 2021, pp. 17–25. URL: https://aclanthology.org/2021.woah-1.3. doi:10.18653/v1/2021.woah-1.3.
[22] T. Tran, Y. Hu, C. Hu, K. Yen, F. Tan, K. Lee, S. Park, HABERTOR: An efficient and effective deep hatespeech detector, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7486–7502. URL: https://aclanthology.org/2020.emnlp-main.606. doi:10.18653/v1/2020.emnlp-main.606.
[23] B. Wang, Y. Ding, S. Liu, X. Zhou, YNU_wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language, in: FIRE (Working Notes), CEUR, Kolkata, India, 2019, pp. 191–198. URL: http://ceur-ws.org/Vol-2517/T3-2.pdf.
[24] A. Mishra, S. Saumya, A. Kumar, IIIT_DWD@HASOC 2020: Identifying offensive content in Indo-European languages, in: FIRE, CEUR, Hyderabad, India, 2020, pp. 139–144. URL: http://ceur-ws.org/Vol-2826/T2-5.pdf.
[25] B. Gambäck, U. K. Sikdar, Using convolutional neural networks to classify hate-speech, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 85–90. URL: https://aclanthology.org/W17-3013. doi:10.18653/v1/W17-3013.
[26] J. H. Park, P. Fung, One-step and two-step classification for abusive language detection on Twitter, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 41–45. URL: https://aclanthology.org/W17-3006. doi:10.18653/v1/W17-3006.
[27] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2017, pp. 759–760. URL: https://doi.org/10.1145/3041021.3054223. doi:10.1145/3041021.3054223.
[28] V. Indurthi, B. Syed, M. Shrivastava, N. Chakravartula, M. Gupta, V. Varma, FERMI at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 70–74. URL: https://aclanthology.org/S19-2009. doi:10.18653/v1/S19-2009.
[29] A. Nikolov, V. Radivchev, Nikolov-Radivchev at SemEval-2019 task 6: Offensive tweet classification with BERT and ensembles, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 691–695. URL: https://aclanthology.org/S19-2123. doi:10.18653/v1/S19-2123.
[30] S. Serrano, N. A. Smith, Is attention interpretable?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2931–2951. URL: https://aclanthology.org/P19-1282. doi:10.18653/v1/P19-1282.
[31] S. Wiegreffe, Y. Pinter, Attention is not not explanation, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 11–20. URL: https://aclanthology.org/D19-1002. doi:10.18653/v1/D19-1002.
[32] S. Jain, B. C. Wallace, Attention is not explanation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3543–3556. URL: https://aclanthology.org/N19-1357. doi:10.18653/v1/N19-1357.
[33] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144. URL: https://doi.org/10.1145/2939672.2939778. doi:10.1145/2939672.2939778.
[34] P. Sen, M. Danilevsky, Y. Li, S. Brahma, M. Boehm, L. Chiticariu, R. Krishnamurthy, Learning explainable linguistic expressions with neural inductive logic programming for sentence classification, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4211–4221. URL: https://www.aclweb.org/anthology/2020.emnlp-main.345. doi:10.18653/v1/2020.emnlp-main.345.
[35] S. Dash, O. Gunluk, D. Wei, Boolean decision rules via column generation, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., Montréal, Canada, 2018. URL: https://proceedings.neurips.cc/paper/2018/file/743394beff4b1282ba735e5e3723ed74-Paper.pdf.
[36] L. Donatelli, M. Fowlie, J. Groschwitz, A. Koller, M. Lindemann, M. Mina, P. Weißenhorn, Saarland at MRP 2019: Compositional parsing across all graphbanks, in: Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, Association for Computational Linguistics, Hong Kong, 2019, pp. 66–75. URL: https://aclanthology.org/K19-2006. doi:10.18653/v1/K19-2006.
[37] P. Lertvittayakumjorn, L. Choshen, E. Shnarch, F. Toni, GrASP: A library for extracting and exploring human-interpretable textual patterns, 2021. URL: https://arxiv.org/abs/2104.03958. arXiv:2104.03958.
[38] P. Sen, Y. Li, E. Kandogan, Y. Yang, W. Lasecki, HEIDL: Learning linguistic expressions with deep learning and human-in-the-loop, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 135–140. URL: https://www.aclweb.org/anthology/P19-3023. doi:10.18653/v1/P19-3023.
[39] A. Koufakou, E. W. Pamungkas, V. Basile, V. Patti, HurtBERT: Incorporating lexical features with BERT for the detection of abusive language, in: Proceedings of the Fourth Workshop on Online Abuse and Harms, Association for Computational Linguistics, Online, 2020, pp. 34–43. URL: https://aclanthology.org/2020.alw-1.5. doi:10.18653/v1/2020.alw-1.5.
[40] E. W. Pamungkas, V. Patti, Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Florence, Italy, 2019, pp. 363–370. URL: https://aclanthology.org/P19-2051. doi:10.18653/v1/P19-2051.
[41] A. Razavi, D. Inkpen, S. Uritsky, S. Matwin, Offensive language detection using multi-level classification, in: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 16–27. URL: http://www.eiti.uottawa.ca/~diana/publications/Flame_Final.pdf. doi:10.1007/978-3-642-13059-5_5.
[42] K. Gémes, G. Recski, TUW-Inf at GermEval2021: Rule-based and hybrid methods for detecting toxic, engaging, and fact-claiming comments, in: Proceedings of the GermEval 2021 Workshop on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Heinrich Heine University Düsseldorf, Germany, 2021, pp. 69–75. URL: https://aclanthology.org/2021.germeval-1.10.
[43] K. Gémes, A. Kovács, M. Reichel, G. Recski, Offensive text detection on English Twitter with deep learning models and rule-based systems, in: FIRE 2021 Working Notes, CEUR, Gandhinagar, India, 2021, pp. 283–296. URL: http://ceur-ws.org/Vol-3159/T1-29.pdf.
[44] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the identification of offensive language, in: Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria – September 21, 2018, Austrian Academy of Sciences, Vienna, Austria, 2018, pp. 1–10. URL: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-84935.
[45] J. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of GermEval task 2, 2019 shared task on the identification of offensive language, in: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), October 9–11, 2019 at Friedrich-Alexander-Universität Erlangen-Nürnberg, German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg, München, Germany, 2019, pp. 352–363. URL: https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/germeval/GermEvalSharedTask2019Iggsa.pdf.
[46] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, M. Chintak, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 14–17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
[47] T. Mandl, S. Modha, A. K. M, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, FIRE 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 29–32. URL: https://doi.org/10.1145/3441501.3441517. doi:10.1145/3441501.3441517.
[48] Á. Kovács, K. Gémes, E. Iklódi, G. Recski, POTATO: explainable information extraction framework, in: Proceedings of the 31st ACM International Conference on Information and Knowledge Management, CIKM '22, Association for Computing Machinery, 2022, pp. 4897–4901. URL: https://doi.org/10.1145/3511808.3557196. doi:10.1145/3511808.3557196.
[49] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider, Abstract Meaning Representation for sembanking, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 178–186. URL: https://www.aclweb.org/anthology/W13-2322.
[50] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[51] M. Damonte, S. B. Cohen, Cross-lingual Abstract Meaning Representation parsing, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1146–1155. URL: https://aclanthology.org/N18-1104. doi:10.18653/v1/N18-1104.
[52] G. King, L. Zeng, Logistic regression in rare events data, Political Analysis 9 (2001) 137–163. URL: https://www.cambridge.org/core/journals/political-analysis/article/logistic-regression-in-rare-events-data/1E09F0F36F89DF12A823130FDF0DA462. doi:10.1093/oxfordjournals.pan.a004868.