YNU OXZ @ HaSpeeDe 2 and AMI: XLM-RoBERTa with Ordered Neurons LSTM for the classification tasks at EVALITA 2020

Xiaozhi Ou, Hongling Li
Yunnan University, China
xiaozhiou88@gmail.com, honglingli66@126.com

Abstract

English. This paper describes the system that team YNU OXZ submitted to EVALITA 2020. We participate in the shared tasks on Automatic Misogyny Identification (AMI) and Hate Speech Detection (HaSpeeDe 2) at the 7th evaluation campaign EVALITA 2020. For HaSpeeDe 2, we participate in Task A - Hate Speech Detection and submitted two runs, one for the news headlines test set and one for the tweets test set. Our submitted runs are based on the pre-trained multilingual model XLM-RoBERTa, whose hidden states are fed into a Convolutional Neural Network with K-max Pooling (CNN + K-max Pooling); an Ordered Neurons LSTM (ON-LSTM) is then applied to this representation, and the result is passed to a linear decision function. Regarding the AMI shared task on the automatic identification of misogynous content in the Italian language, we participate in subtask A on Misogyny & Aggressive Behaviour Identification. Our system is similar to the one defined for HaSpeeDe 2 and is based on the pre-trained multilingual model XLM-RoBERTa, an Ordered Neurons LSTM (ON-LSTM), a Capsule Network, and a final classifier.

1 Introduction and Background

People use offensive content in their social media posts to degrade individuals, religions, or other organizations in many respects, so the identification of such social media posts is a necessity. A substantial amount of work has been done in languages like English; however, hate speech and offensive language identification in other languages is still an area worth exploring. The previous edition of EVALITA (Caselli et al., 2018) hosted the first Hate Speech (HS) detection in Social Media task for Italian (HaSpeeDe (Bosco et al., 2018)); the HaSpeeDe 2 (Hate Speech Detection) shared task (Sanguinetti et al., 2020) has been organized within EVALITA 2020 [1]. The ultimate goal of HaSpeeDe 2 is to take a step further in the state of the art of HS detection for Italian, while also exploring related side phenomena, the extent to which they can be distinguished from HS, and finally whether and how much automatic systems are able to draw such conclusions. AMI (Elisabetta Fersini, 2020) is the second shared task on misogyny identification at the 7th evaluation campaign EVALITA 2020 (Basile et al., 2020). Given the huge amount of user-generated content on the Web, and in particular on social media, the problem of detecting, and thereby possibly limiting, the diffusion of hate speech against women is rapidly becoming fundamental, especially because of the societal impact of the phenomenon; it is therefore very important to identify misogyny in social media.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] http://www.evalita.it/2020/tasks

1.1 Hate Speech (HaSpeeDe 2)

In recent years, with the acceleration of information dissemination, the identification of hate speech and offensive language has become a crucial mission in multilingual sentiment analysis and has attracted the attention of a large number of industrial and academic researchers. From an NLP perspective, much attention has been paid to the topic of HS, together with all its possible facets and related phenomena such as offensive/abusive language, and its identification. This is shown by the proliferation, especially in the last few years, of contributions on this topic (e.g. Caselli et al. (2020), Jurgens et al. (2019), Fortuna et al. (2019)), corpora and lexica (e.g.
de Pelle and Moreira (2017), Sanguinetti et al. (2018), Bassignana et al. (2018)), dedicated workshops, and shared tasks within national (GermEval [2], HASOC [3], IberLEF [4]) and international (SemEval [5]) evaluation campaigns. Among them, GermEval 2018 is about offensive language recognition and aims to promote research on the recognition of offensive content in German-language microblogs. The best team's system trains three basic classifiers (a maximum entropy classifier and two random forest ensembles) on five disjoint feature sets and then uses the maximum entropy classifier for the final classification (Montani and Schüller, 2018). Among the SemEval-2019 shared tasks, HatEval concerns the multilingual detection of hate speech against immigrants and women on Twitter. The Fermi team, the best-performing team of HatEval, proposes an SVM model with an RBF kernel and uses sentence embeddings from Google's Universal Sentence Encoder as features (Indurthi et al., 2019). OffensEval is about the identification and classification of offensive language in social media; the NULI team is the best-performing team there, using BERT-base without default parameters (Liu et al., 2019). HASOC 2019 was proposed to identify hate speech and offensive content in Indo-European languages; its purpose is to develop powerful technologies capable of processing multilingual data and transfer learning methods that can exploit cross-lingual data. The best system is based on an Ordered Neurons LSTM (ON-LSTM) with an attention model and adopts a K-fold ensemble approach (Wang et al., 2019).

[2] https://projects.fzai.h-da.de/iggsa/germeval/
[3] https://hasocfire.github.io/hasoc/2020
[4] http://hitz.eus/sepln2019/
[5] http://alt.qcri.org/semeval2020/

1.2 Misogyny (AMI)

Unfortunately, nowadays more and more incidents of harassment against women occur, and misogynistic comments are found in social media, where misogynists hide behind the security of anonymity. Therefore, it is very important to identify misogyny in social media. Pamungkas et al. (2020) conducted extensive and in-depth research on online misogyny, developed a state-of-the-art model for detecting misogyny in social media, and explored the feasibility of detecting misogyny in a multilingual environment. Aiming at the TRAC-2 shared tasks of Aggression Identification and Misogynistic Aggression Identification, Samghabadi et al. (2020) propose an end-to-end neural model using attention on top of BERT that incorporates a multi-task learning paradigm to address both sub-tasks simultaneously. Arango et al. (2019) discussed the implications for current research and re-ran experiments, taking a closer look at model validation to give a more accurate picture of the current state-of-the-art methods. Recent investigations studied how the misogyny phenomenon takes place: for example, Farrell et al. (2019) investigate the flow of extreme language across seven online communities on Reddit, and Goenaga et al. (2018) address automatic misogyny identification using neural networks. Automatic misogyny identification in Twitter was first investigated by Anzovino et al. (2018).

2 Task and Data description

2.1 Task description

In this part, we describe the subtasks of HaSpeeDe 2 and AMI at EVALITA 2020 in which we participated. HaSpeeDe 2 introduces novelty in three main aspects (language variety and test of time, stereotypical communication, and syntactic realization of HS). We participated in Task A - Hate Speech Detection (Main Task), a binary classification task aimed at determining the presence or absence of hateful content in a text towards a given target (among immigrants, Muslims, or Roma people).

The AMI shared task concerns the automatic identification of misogynous content in Italian tweets. It is organized in two main subtasks, namely subtask A - Misogyny & Aggressive Behaviour Identification and subtask B - Unbiased Misogyny Identification. We participate in subtask A: the system must recognize whether a text is misogynous, and if it is, whether it also expresses an aggressive attitude.

2.2 Data description

The HaSpeeDe 2 task organizers provide a new HS training dataset (binary task) based on Twitter data, accompanied by a test set including both in-domain and out-of-domain data (tweets + news headlines), as well as data from different time periods. The new HaSpeeDe 2020 training set already contains the Twitter dataset of HaSpeeDe 2018. It contains a total of 6,839 tweets (label 0 means NOT HS, label 1 means HS), of which 2,766 are HS and 4,073 are NOT HS; the tweets test set contains 1,263 tweets, and the news headlines test set contains 500 items. In our experimental runs, the data we use for this task is the result of combining the Facebook dataset (training set + test set) of HaSpeeDe 2018 with the new training set of HaSpeeDe 2020, in order to analyze the influence of out-of-domain texts in the training set; the two together contain a total of 10,839 comments/tweets.

The AMI organizers provided a raw dataset
(5,000 tweets) as the training set for participants in subtask A. The raw dataset is a balanced dataset of tweets manually labeled at two levels:

• Misogynous: defines whether a tweet is misogynous or not. Label 0 means a not misogynous tweet, label 1 a misogynous tweet.

• Aggressiveness: denotes whether a misogynous tweet (misogynous = 1) is aggressive. Label 0 means a non-aggressive tweet, label 1 an aggressive tweet. Not misogynous tweets (misogynous = 0) are labeled 0 by default.

For the test set (1,000 tweets) for subtask A provided by the AMI organizers, only the annotations of the "misogynous" and "aggressiveness" fields in the raw dataset are considered.

Figure 1: 5-fold stratified sampling of the training set

As shown in Figure 1, we use stratified sampling (StratifiedKFold), i.e. StratifiedKFold cross-validation instead of ordinary k-fold cross-validation, to evaluate a classifier. The reason is that StratifiedKFold splits by stratified sampling, which ensures that the proportion of each category in the generated training and validation sets is consistent with the original training set, so that no distribution shift is introduced by the split. In the experiments, we used 5-fold stratified sampling. For the HaSpeeDe 2 training set (merged dataset), each fold consists of a randomly sampled training set (8,671 instances) and validation set (2,168 instances). For the AMI training set (raw dataset), each fold consists of a randomly sampled training set (4,000 instances) and validation set (1,000 instances).
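The stratified splitting described above can be sketched in plain Python. This is a minimal re-implementation of the idea behind scikit-learn's StratifiedKFold (in the experiments the library routine itself would be used); it deals each class's indices round-robin into k folds so every fold preserves the class proportions:

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield (train_idx, val_idx) pairs whose label proportions
    mirror the full dataset, as StratifiedKFold does."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_label.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal each class round-robin
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val

# toy labels: 6 NOT HS (0) and 4 HS (1)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train, val = next(stratified_kfold(labels, k=2))
# each half keeps the 6:4 class ratio (3 zeros, 2 ones per fold)
```

With an ordinary (unshuffled) k-fold on the same labels, one fold could contain only the majority class; the stratified split rules that out by construction.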
3 Description of the system

In this part, we introduce our final submission systems. Figure 2 shows the overall framework of the system we submitted to HaSpeeDe 2 Task A. We use the pre-trained multilingual model XLM-RoBERTa. We observe the limitations of the BERT-style pooler output (PO) and obtain rich semantic information by extracting the hidden states (the last four hidden layers) of XLM-RoBERTa, which are used as input to a Convolutional Neural Network with K-max Pooling (CNN + K-max Pooling). Then, we feed the output of (CNN + K-max Pooling) into an Ordered Neurons LSTM (ON-LSTM). Finally, we concatenate PO and the output of the ON-LSTM and pass the result through a Linear layer and Softmax for the final classification.

Figure 2: System architecture diagram for Task A (HaSpeeDe 2)

Figure 3 shows the overall framework of the system we submitted to AMI subtask A. We again use the pre-trained multilingual model XLM-RoBERTa. We first get the pooler output (PO) and obtain rich semantic information by extracting the hidden states (the last four hidden layers) of XLM-RoBERTa, which are fed into an Ordered Neurons LSTM (ON-LSTM). Then, we feed the output of the ON-LSTM into a Capsule Network. Finally, we concatenate PO and the output of the Capsule Network and pass the result through a Linear layer and Softmax for the final classification.

Figure 3: System architecture diagram for subtask A (AMI)

3.1 XLM-RoBERTa and hidden layer states

Early work in the field of cross-lingual understanding has proved the effectiveness of multilingual masked language models (MLM), but models such as XLM (Lample and Conneau, 2019) and Multilingual BERT (Devlin et al., 2018) (pre-trained on Wikipedia) are still limited in learning useful representations of low-resource languages. XLM-RoBERTa (Conneau et al., 2020) shows that the performance of cross-lingual transfer tasks can be significantly improved by using a large-scale multilingual pre-trained model. It can be understood as a combination of XLM and RoBERTa, and it is trained on 2.5 TB of newly created, cleaned CommonCrawl data in 100 languages. Because the model in this task must make full use of the whole sentence content to extract useful semantic features, which may deepen the understanding of the sentence and reduce the impact of noise, we use XLM-RoBERTa in this work.

In the classification task, the original output of XLM-RoBERTa is obtained from the last hidden state of the model. However, this output usually does not summarize the semantic content of the input well. Recent studies have shown that abundant semantic information is learned by the top hidden layers of BERT (Jawahar et al., 2019), which we call the semantic layers; in our opinion, the same holds for XLM-RoBERTa. Therefore, in order to make the model obtain more abundant semantic features, we propose the system shown in Figure 2 for HaSpeeDe 2 Task A: firstly, we get PO; secondly, we extract the hidden states of the last four layers of XLM-RoBERTa and feed them into CNN and K-max Pooling; then the result is fed into the ON-LSTM. For AMI subtask A, we propose the system shown in Figure 3: firstly, we get PO; secondly, we extract the hidden states of the last four layers of XLM-RoBERTa and feed them into the ON-LSTM; then the result is fed into the Capsule Network.
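At the tensor level, selecting the last four hidden layers is simple index bookkeeping. The sketch below uses nested Python lists as stand-ins for tensors (with the transformers library one would instead enable output_hidden_states and slice the returned tuple of 13 states: the embedding output plus one state per layer of the base model):

```python
# Stand-in for the tuple of hidden states returned by a 12-layer
# encoder: embedding output + 12 layers = 13 entries, each of
# shape [seq_len][hidden]; values here just encode the layer index.
seq_len, hidden = 4, 8
hidden_states = [[[float(layer)] * hidden for _ in range(seq_len)]
                 for layer in range(13)]

# Keep the last four layers and stack them as channels for the
# CNN, giving a [4][seq_len][hidden] input.
last_four = hidden_states[-4:]

assert len(last_four) == 4
assert len(last_four[0]) == seq_len and len(last_four[0][0]) == hidden
```

In the real system the four [seq_len, hidden] tensors become the input channels of the 2D convolution described in Section 3.2.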
3.2 CNN and K-max Pooling

As shown in Figure 2, we feed the extracted hidden states of the last four layers of XLM-RoBERTa into the CNN and K-max Pooling, where convolution operations produce multiple feature maps. The specific operation is as follows: a sentence contains L words, each of which has dimension d after the embedding layer, and the representation of the sentence is formed by stacking the L word vectors into an L x d matrix. The convolutional layer contains several convolution kernels of size N x d, where N is the filter window size. The convolution operation applies a kernel to the word matrix to create a new feature:

C_l = f(w \cdot x_{l:l+N-1} + b)    (1)

where l indexes the l-th word, C_l is the resulting feature, w is the convolution kernel, b is a bias term, and f is a nonlinear function. After convolving the whole sentence, a feature map is obtained, which is a vector of size L + N - 1.

Another important idea of CNNs is pooling. The pooling layer usually follows the convolution layer; its purpose is to simplify the output of the convolutional layer and reduce the dimensionality of each filter's features to form the final feature. Here we use K-max Pooling, which takes the top-K values among all feature values while retaining their original order, thereby preserving some feature information for subsequent use. K-max Pooling can thus express the same type of feature multiple times, i.e. the intensity of a certain type of feature; in addition, because the relative order of the top-K values is preserved, it retains part of the position information. However, this position information is only the relative order between features, not absolute positions.
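K-max Pooling as described above (keep the top-K values, preserve their original order) can be sketched in a few lines of plain Python; in the actual system this operates on tensors, but the logic is the same:

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest values of a 1-D feature map,
    preserving their original left-to-right order."""
    # indices of the top-k values, then restored to original order
    top_idx = sorted(range(len(feature_map)),
                     key=lambda i: feature_map[i], reverse=True)[:k]
    return [feature_map[i] for i in sorted(top_idx)]

# a feature map produced by one convolution kernel
fmap = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
result = k_max_pooling(fmap, 3)  # -> [0.9, 0.7, 0.8] (order kept, not sorted)
```

Note that the output is not [0.9, 0.8, 0.7]: the values are the three largest, but they appear in their original positions' order, which is exactly the relative position information the text refers to.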
3.3 Ordered Neurons LSTM

For HaSpeeDe 2, as shown in Figure 2, we feed the output of CNN and K-max Pooling into the ON-LSTM. For AMI, as shown in Figure 3, we feed the extracted hidden states of the last four layers of XLM-RoBERTa into the ON-LSTM. ON-LSTM is a variant of the LSTM that sorts the neurons in a specific order, allowing a hierarchical (tree) structure to be integrated into the LSTM so as to express richer information. The gate and output structure of the ON-LSTM is still similar to the original LSTM; the difference is the update mechanism from \hat{c}_t to c_t. The formulas are as follows (Shen et al., 2018):

\tilde{f}_t = \overrightarrow{cs}(softmax(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}))    (2)

\tilde{i}_t = \overleftarrow{cs}(softmax(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}))    (3)

w_t = \tilde{f}_t \circ \tilde{i}_t    (4)

c_t = w_t \circ (f_t \circ c_{t-1} + i_t \circ \hat{c}_t) + (\tilde{f}_t - w_t) \circ c_{t-1} + (\tilde{i}_t - w_t) \circ \hat{c}_t    (5)

where \overrightarrow{cs} and \overleftarrow{cs} are cumsum() operations in the right and left directions, respectively, and the newly introduced \tilde{f}_t and \tilde{i}_t represent the master forget gate and master input gate. w_t represents their overlap, ideally a vector that is 1 on the intersection of the two gates and 0 elsewhere. In this way, high-level information persists over a considerable distance, while low-level information may be updated at each input step, thereby embedding the hierarchical structure through information grading.
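The master gates of Eqs. (2)-(3) can be illustrated in plain Python; this minimal sketch shows only the cumsum-of-softmax construction, which yields monotone gate vectors in [0, 1] (rising toward 1 for the master forget gate, falling from 1 for the master input gate), and their overlap from Eq. (4):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cumsum_right(xs):
    # \overrightarrow{cs}: running sum left-to-right
    out, acc = [], 0.0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def cumsum_left(xs):
    # \overleftarrow{cs}: running sum right-to-left
    return list(reversed(cumsum_right(list(reversed(xs)))))

logits = [2.0, 0.5, -1.0, -1.0]
f_master = cumsum_right(softmax(logits))  # monotone, ends at 1.0
i_master = cumsum_left(softmax(logits))   # monotone, starts at 1.0
w = [f * i for f, i in zip(f_master, i_master)]  # overlap (Eq. 4)
```

Because each gate is a cumulative sum of a distribution, the neurons are split into a "low" region (gate near 0) and a "high" region (gate near 1), which is what lets the cell state update low-level neurons frequently while protecting high-level ones.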
3.4 Capsule Network

As shown in Figure 3, we feed the output of the ON-LSTM into the Capsule Network. In deep learning models, spatial patterns aggregated at a lower level help to represent higher-level concepts. We use the Capsule Network (Sabour et al., 2017) to enhance the model's feature-extraction capability: spatially insensitive methods are inevitably limited by the rich structure of text (such as word positions, semantic information, and grammatical structure), which is difficult to encode effectively, and they lack expressive power for text. The Capsule Network mitigates this disadvantage by using neuron vectors instead of the individual scalar neurons of traditional neural networks, and by training the network with dynamic routing. The Capsule's parameter update algorithm is routing-by-agreement: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule. The Capsule computation is as follows:

V_j = \frac{\|S_j\|^2}{1 + \|S_j\|^2} \frac{S_j}{\|S_j\|}    (6)

S_j = \sum_i C_{ij} \hat{u}_{j|i}, \quad \hat{u}_{j|i} = W_{ij} u_i    (7)

where V_j is the vector output of capsule j and S_j is its total input; the prediction vector \hat{u}_{j|i} is obtained by multiplying the output u_i of a capsule in the layer below by a weight matrix W_{ij}; and the C_{ij} are coupling coefficients determined by the iterative dynamic routing process.

The most fundamental difference between the Capsule Network and a traditional artificial neural network lies in the unit structure. For a traditional neural network, the computation of a neuron can be divided into three steps: 1. scalar weighting of the inputs; 2. summing the weighted inputs; 3. a scalar-to-scalar nonlinearity. For a Capsule, the computation is divided into four steps: 1. matrix multiplication of the input vectors; 2. scalar weighting of the input vectors; 3. summing the weighted vectors; 4. a vector-to-vector nonlinearity. The biggest resulting difference is the unit output: a traditional neural network outputs a scalar value, while a Capsule Network outputs a vector, which can contain more abundant features and is more interpretable.

3.5 Experiment setting

For XLM-RoBERTa, we use the XLM-RoBERTa-base [6] pre-trained model, which contains 12 layers. We use binary cross-entropy loss and the Adam optimizer with a learning rate of 5e-5. The batch size is set to 32 and the maximum sequence length to 80. We extract the hidden layer states of XLM-RoBERTa by setting output_hidden_states to true. The model is trained for 8 epochs with a dropout rate of 0.1.

For the Convolutional Neural Network, we use 2D convolution (nn.Conv2d [7]). The convolution kernel sizes are set to (3, 4, 5) and the number of convolution kernels is set to 256.

For the ON-LSTM, we set the hidden units to 128 and num_levels to 16.

For the Capsule Network, we set num_capsule to 10, dim_capsule to 16, and routings to 4.

[6] https://huggingface.co/xlm-roberta-base
[7] https://pytorch.org/docs/stable/generated/torch.nn.Conv2d
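The squash nonlinearity of Eq. (6), applied to each capsule vector (of dimension dim_capsule = 16 in the setting above), can be sketched in plain Python:

```python
import math

def squash(s):
    """Eq. (6): shrink a capsule's total input S_j to length < 1
    while preserving its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    if norm == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]

v = squash([3.0, 4.0])  # ||s|| = 5, so the output length is 25/26
length = math.sqrt(sum(x * x for x in v))
# long vectors are squashed toward (but below) 1; short ones toward 0
```

This is why the length of a capsule's output vector can be read as the probability that the entity it represents is present, while the vector's direction encodes the entity's properties.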
4 Results and Discussion

Task (metric) | Our Score | Best Score | Rank
HaSpeeDe 2, Tweets (Macro F1) | 0.7717 | 0.8088 | 8
HaSpeeDe 2, News (Macro F1) | 0.6922 | 0.7744 | 7
AMI subtask A (Average F1) | 0.7313 | 0.7406 | 3

Table 1: Classification results of our best runs on HaSpeeDe 2 Task A and AMI subtask A.

Table 1 reports the official results of our best runs on the two tasks in which we participated. For both tasks we submitted the results of two runs, and the two runs performed comparably. In the following subsections, the results obtained in each task are discussed.

4.1 HaSpeeDe 2 Task A

In our experiments, we find the limitations of PO for classifying hateful Italian text. In the classification task, the original output of BERT is PO; in the same way, we first take only PO as the output of XLM-RoBERTa. The results are shown in Table 2. We can see that the results are not good when only PO is used as the output of XLM-RoBERTa: we think that using only PO loses some effective semantic information, so deep and abundant semantic features are what this work needs. We therefore extract the hidden states of XLM-RoBERTa, and we also observe that the performance of the model improves as more semantic layers are used. Table 3 shows the performance of our model at different semantic layers, and Table 4 shows our results on the test sets.

News (XLM-RoBERTa with only PO, validation set of fold 1)
Category | P | R | F1 | Instances
Not Hate | 0.70 | 0.981 | 0.817 | 1355
Hate | 0.886 | 0.259 | 0.401 | 813
Macro F1 | 0.793 | 0.62 | 0.609 | 2168

Tweets (XLM-RoBERTa with only PO, validation set of fold 1)
Category | P | R | F1 | Instances
Not Hate | 0.805 | 0.569 | 0.667 | 1355
Hate | 0.659 | 0.858 | 0.745 | 813
Macro F1 | 0.723 | 0.713 | 0.706 | 2168

Table 2: Precision, Recall, F1 score, and number of instances for XLM-RoBERTa with only PO on HaSpeeDe 2 Task A (the validation set is the first fold of the 5-fold stratified cross-validation).

Hidden layers | HS-News Macro F1 | HS-Tweets Macro F1
The last layer | 0.623 | 0.725
The last two layers | 0.646 | 0.734
The last three layers | 0.66 | 0.749
The last four layers | 0.703 | 0.798

Table 3: Performance of our model with different numbers of hidden layers of XLM-RoBERTa (the validation set is the first fold of the 5-fold stratified cross-validation).

News | P | R | F1 | Macro F1
Not Hate | 0.7486 | 0.8965 | 0.8159 | 0.6922
Hate | 0.7203 | 0.4696 | 0.5685 |

Tweets | P | R | F1 | Macro F1
Not Hate | 0.8037 | 0.7285 | 0.7643 | 0.7717
Hate | 0.7448 | 0.8167 | 0.7791 |

Table 4: Macro F1 results on the test sets.

4.2 AMI subtask A

In this work, the task is similar to the one discussed in Section 4.1, and we again consider the influence of PO for identifying misogynous content. We conduct experiments on AMI subtask A based on the model defined for HaSpeeDe 2, and in order to improve performance we propose a new method based on that model. Table 5 shows the comparison between the CNN + K-max Pooling + ON-LSTM method and the ON-LSTM + Capsule method on the validation set, and Table 6 shows the results of our new model for AMI subtask A on the test set. Run 1 only extracts the last four hidden layer states of XLM-RoBERTa, feeds them into the ON-LSTM and then through the Capsule Network, and finally performs classification (without using PO). Run 2 concatenates the output of the Capsule Network with the obtained PO and feeds the result to the classifier for the final classification (using PO). We think that concatenating PO with the hidden-layer representation retains richer semantic information, and it indeed shows better results.

Method | Macro F1
CNN + K-max Pooling + ON-LSTM (HaSpeeDe 2 model) | 0.786
ON-LSTM + Capsule (AMI model) | 0.857

Table 5: Comparison between the CNN + K-max Pooling + ON-LSTM method and the ON-LSTM + Capsule method, both based on the XLM-RoBERTa model (the validation set is the first fold of the 5-fold stratified cross-validation).

System | Average F1
Run 1 (without using PO) | 0.7014
Run 2 (using PO) | 0.7313

Table 6: Results on the test set for AMI subtask A.

5 Conclusion

In our experiments, we find the limitation of only using the pooler output as XLM-RoBERTa's output. To obtain deeper and more abundant semantic features, we extract the hidden layer states of XLM-RoBERTa; the results show that obtaining more abundant semantic information by extracting the hidden states helps to improve the performance of XLM-RoBERTa. We also test the effect of using the external dataset (merged dataset) versus not using it (raw dataset). Our conclusion is that using data from the same social network for training and test is a necessary condition for good performance; in addition, adding data from different social networks can improve results.

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57-64. Springer.

Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: A closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 45-54.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, pages 1-6. CEUR-WS.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1-9. CEUR.
Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. Sixth evaluation campaign of natural language processing and speech tools for Italian: Final workshop (EVALITA 2018). In EVALITA 2018. CEUR Workshop Proceedings (CEUR-WS.org).

Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya, and Michael Granitzer. 2020. I feel offended, don't be abusive! Implicit/explicit messages in offensive and abusive language. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 6193-6202.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Rogers Prates de Pelle and Viviane P. Moreira. 2017. Offensive comments in the Brazilian web: A dataset and baseline results. In Anais do VI Brazilian Workshop on Social Network Analysis and Mining. SBC.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. Overview of the EVALITA 2020 automatic misogyny identification (AMI) task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Tracie Farrell, Miriam Fernandez, Jakub Novotny, and Harith Alani. 2019. Exploring misogyny across the manosphere in Reddit. In Proceedings of the 10th ACM Conference on Web Science, pages 87-96.

Paula Fortuna, João Rocha da Silva, Leo Wanner, Sérgio Nunes, et al. 2019. A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the Third Workshop on Abusive Language Online, pages 94-104.

Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Arantza Casillas, Arantza Díaz de Ilarraza, Nerea Ezeiza, Maite Oronoz, Alicia Pérez, and Olatz Perez-de-Viñaspre. 2018. Automatic misogyny identification using neural networks. In IberEval@SEPLN, pages 249-254.

Vijayasaradhi Indurthi, Bakhtiyar Syed, Manish Shrivastava, Nikhil Chakravartula, Manish Gupta, and Vasudeva Varma. 2019. Fermi at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 70-74.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

David Jurgens, Eshwar Chandrasekharan, and Libby Hemphill. 2019. A just and comprehensive strategy for using NLP to address online abuse. arXiv preprint arXiv:1906.01738.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.

Ping Liu, Wen Li, and Liang Zou. 2019. NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 87-91.

Joaquín Padilla Montani and Peter Schüller. 2018. TUWienKBS at GermEval 2018: German abusive tweet detection. In 14th Conference on Natural Language Processing KONVENS, volume 2018, page 45.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny detection in Twitter: A multilingual and cross-domain study. Information Processing & Management, 57(6):102360.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules.

Niloofar Safi Samghabadi, Parth Patwa, PYKL Srinivas, Prerana Mukherjee, Amitava Das, and Thamar Solorio. 2020. Aggression and misogyny detection using BERT: A multi-task approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 126-131.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks.

Bin Wang, Yunxia Ding, Shengyan Liu, and Xiaobing Zhou. 2019. YNU_wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language. In FIRE (Working Notes), pages 191-198.