=Paper=
{{Paper
|id=Vol-2826/T2-31
|storemode=property
|title=NLP-CIC at HASOC 2020: Multilingual Offensive Language Detection using All-in-one Model
|pdfUrl=https://ceur-ws.org/Vol-2826/T2-31.pdf
|volume=Vol-2826
|authors=Segun Taofeek Aroyehun,Alexander Gelbukh
|dblpUrl=https://dblp.org/rec/conf/fire/AroyehunG20
}}
==NLP-CIC at HASOC 2020: Multilingual Offensive Language Detection using All-in-one Model==
NLP-CIC at HASOC 2020: Multilingual Offensive Language Detection using All-in-one Model

Segun Taofeek Aroyehun (aroyehun.segun@gmail.com), Alexander Gelbukh (www.gelbukh.com)
CIC, Instituto Politécnico Nacional, Mexico City, Mexico
FIRE ’20, Forum for Information Retrieval Evaluation, December 16–20, 2020, Hyderabad, India

Abstract: We describe our deep learning model submitted to the HASOC 2020 shared task on detection of offensive language in social media in three Indo-European languages: English, German, and Hindi. We fine-tune a pre-trained multilingual encoder on the combination of data provided for the competition. Our submission received a competitive macro-average F1 score of 0.4980 on the English Subtask A, as well as comparatively strong performance on the German data.

Keywords: offensive content identification, deep learning, text classification, multilingual

1. Introduction

The impact of offensive content on web users ranges from subtle uneasiness to graver psychological and emotional distress, which, if left unchecked, can result in violent actions by or against affected individuals. To make the web a safe place for all, platforms such as Twitter and Facebook pay close attention to content moderation. To aid in the arduous task of removing objectionable content, it becomes necessary to build efficient and effective systems capable of identifying and classifying such content for automatic or human-assisted moderation. A standard approach is to automatically flag such content for removal or for review by human moderators.

English has received the most attention in this area, owing to the availability of datasets and distributional representations with which models can be developed. While there is sizeable progress for English, the same cannot be said of other languages. Shared task series such as HASOC, which provide data in other languages, open an avenue for further research on those languages.

With the availability of datasets in several languages, it becomes expensive to design a robust system for each language separately. An alternative strategy is to train a single model on all languages for which annotated data is available. We base our approach on the recent progress in the development of multilingual language models, in particular the observation by Conneau et al. [1] that a multilingual model can reach the performance of several language-specific models, at least after pre-training. Can we say the same for fine-tuning on a downstream task? We examine whether jointly fine-tuning a multilingual model on a multilingual dataset is feasible for the task of offensive content identification and classification. We reckon that this approach is more energy efficient and less computationally expensive than training one model per language. Specifically, we examine the possibility of using a multilingual pre-trained language model (BERT) to train a single model for the three languages with the datasets provided for the HASOC 2020 shared task [2] via transfer learning.

2. Related Work

The automatic detection of offensive content has been studied with several approaches.
Traditionally, feature engineering in conjunction with classical machine learning models such as Support Vector Machines, Logistic Regression, and Naive Bayes has shown competitive performance [3]. In recent times, neural networks have outperformed the traditional approaches, using architectures such as GRU, LSTM, and CNN in combination with word embeddings [4]. The introduction of contextual word embeddings based on pre-trained language models [5] and the transformer architecture [6] has led to state-of-the-art results on several NLP tasks, including offensive content identification [7]. Typically, existing approaches rely on pre-trained language models which are adapted to the task at hand [8].

There has been significant progress on the detection of offensive content in English, but the same cannot be said of other languages. Recently, shared tasks such as TRAC 2018 [9], HASOC 2019 [10], TRAC 2020 [11], and OffensEval [12] have introduced datasets in languages other than English. However, the evaluation at those venues still proceeds on a monolingual level. It would be interesting to see evaluation settings that assess models on their multilingual and/or cross-lingual capabilities, as exemplified in the work of Pamungkas and Patti [13] and Ranasinghe and Zampieri [14].

3. Methodology

Task. Given a text (tweets in this case), predict (1) for subtask A, whether it is offensive or not, and (2) for subtask B, which of the following classes it belongs to: none, offensive, hate, or profane.

Data. The HASOC 2020 dataset includes annotated text data in English, German, and Hindi. The data has hierarchical labels at two levels: level one has binary labels (Offensive vs. Not Offensive), and level two has four mutually exclusive labels. Table 1 shows the details of the training set.

Table 1
Details of the dataset for subtasks A and B for each language. Total is the number of labeled examples per language. OFF = offensive, NOT = not offensive, HATE = hate speech, PRFN = profane.

          Subtask A         Subtask B
       OFF     NOT     OFF    HATE   PRFN   NONE    Total
EN     1856    1852    321    158    1377   1852    3708
DE     673     1700    140    146    387    1700    2373
HI     847     2116    465    234    148    2116    2963

Approach. We train a single model on the combination of the labeled datasets provided by the organizers for each language. So, we have a single model per subtask which covers the three languages in the competition. As validation set, we use the test set of the 2019 edition of HASOC (with the gold labels).

We observe that applying language-agnostic pre-processing (URL removal, normalization of repeated characters, emoji-to-text conversion, and removal of punctuation marks) resulted in a performance drop on the validation set. Hence, we did not apply pre-processing for our submissions. It appears that a contextual model such as BERT is able to utilize the information that would otherwise have been removed. A sketch of the pre-processing pipeline we tested appears below.
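For concreteness, the following is a minimal sketch of the language-agnostic pre-processing steps listed above. The specific regular expressions and the use of the third-party emoji package are our illustrative assumptions, not necessarily the exact implementation used in the experiments.

```python
import re
import emoji  # third-party package providing emoji-to-text conversion

def preprocess(text: str) -> str:
    """Language-agnostic pre-processing: URL removal, repeated-character
    normalization, emoji-to-text conversion, and punctuation removal."""
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Normalize characters repeated three or more times down to two
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Convert emojis to textual descriptions, e.g. 😂 -> :face_with_tears_of_joy:
    text = emoji.demojize(text)
    # Remove punctuation marks (keep word characters and whitespace)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse extra whitespace
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Soooo funny 😂 check https://example.com !!!"))
# -> "Soo funny face_with_tears_of_joy check"
```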
We experiment with both multilingual BERT [5] and XLM-R [1]. We find that the performance of the XLM-R model was unstable and inferior across runs, likely due to the size of the model, which requires more careful fine-tuning. Based on this observation, we select multilingual BERT for our submissions.

As the representation of a text we use the embedding of its [CLS] token, which has dimension 768, and feed it to a single-layer perceptron with softmax activation. This gives a probability distribution over the classes to be predicted (2 for subtask A and 4 for subtask B). Our training setup uses the following hyperparameter settings: learning rate of 3e-5, batch size of 128, Adam as optimizer, and a maximum of 5 epochs. We select the model with the best performance on the validation set for prediction on the unseen test set. For subtask B, we continue fine-tuning the best model from subtask A using the same hyperparameter settings. Our implementation uses the Flair library [15].
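The sketch below shows how such a setup can be expressed with Flair, assuming a Flair version circa 0.6 and a combined multilingual corpus prepared in Flair's FastText-style classification format. The directory layout, file names, and model path are illustrative assumptions, and exact API signatures vary between Flair releases.

```python
import torch
from flair.datasets import ClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Combined EN+DE+HI data in FastText format (__label__OFF <text>), one file per split;
# the HASOC 2019 test set with gold labels serves as the validation (dev) split.
corpus = ClassificationCorpus(
    "data/hasoc2020_all_languages",
    train_file="train.txt",
    dev_file="dev.txt",
)

# Multilingual BERT; the 768-dimensional [CLS] embedding represents each document
embeddings = TransformerDocumentEmbeddings("bert-base-multilingual-cased", fine_tune=True)

# Single linear layer with softmax on top of the [CLS] embedding
classifier = TextClassifier(embeddings, label_dictionary=corpus.make_label_dictionary())

# Fine-tune with the hyperparameters reported above
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.Adam)
trainer.train(
    "models/subtask_a",
    learning_rate=3e-5,
    mini_batch_size=128,
    max_epochs=5,
)
```

For subtask B, the same loop would be rerun on the four-class labels, continuing from the best subtask A checkpoint as the paper describes.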
4. Results

Table 2 shows the scores received by our submissions per subtask and language on the private test set maintained by the organizers.

Table 2
F1 score on the test set. Numbers in parentheses represent the performance difference between our submission and the best model on the leaderboard.

Task    EN                 DE                 HI
A       0.4980 (−0.0172)   0.5177 (−0.0058)   0.5005 (−0.0332)
B       0.2537 (−0.0115)   0.2687 (−0.0256)   0.2374 (−0.0971)

On the English subtask A, we recorded a macro-average F1 score of 0.4980, and 0.2537 on subtask B. These scores are within 2 F1 points of the highest-ranked submission for English. On the German data, the performance gap on subtask A is the lowest, 0.0058, less than one F1 point. On the Hindi dataset, we observe the largest gap in performance on subtask B, about 10 F1 points. The second-largest gap is also recorded on Hindi, on subtask A, around 3 F1 points below the best submission on the leaderboard. We suspect that there is negative transfer from either English or German to Hindi; this observation deserves a thorough investigation in the future. Overall, the scores on subtask B are consistently lower than on subtask A across languages, which indicates the difficulty of the finer-grained task.

5. Conclusion

We examined the feasibility of using a single multilingual model to detect and classify offensive language in three Indo-European languages. We ran fine-tuning experiments using multilingual BERT and recorded a competitive macro-average F1 score of 0.4980 on the English subtask A. We observe that the performance gaps between our submissions and the best models on the leaderboard are largest for the tasks on Hindi. In the future, we would like to experiment further with a mixture of more language-specific datasets and to identify the limits of using a mixed-language dataset for fine-tuning multilingual encoders on the task of offensive content identification.

Acknowledgments

We thank the competition organizers for their support. The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

References

[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://www.aclweb.org/anthology/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
[2] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.
[3] S. Malmasi, M. Zampieri, Detecting hate speech in social media, in: Proceedings of Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria, 2017, pp. 467–472.
[4] S. T. Aroyehun, A. Gelbukh, Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), 2018, pp. 90–97.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 5998–6008. URL: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
[7] J. Risch, R. Krestel, Bagging BERT models for robust aggression identification, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 2020, pp. 55–61.
[8] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: China National Conference on Chinese Computational Linguistics, Springer, 2019, pp. 194–206.
[9] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Benchmarking aggression identification in social media, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), 2018, pp. 1–11.
[10] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14–17.
[11] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Evaluating aggression identification in social media, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 1–5. URL: https://www.aclweb.org/anthology/2020.trac-1.1.
[12] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).
[13] E. W. Pamungkas, V. Patti, Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Florence, Italy, 2019, pp. 363–370. URL: https://www.aclweb.org/anthology/P19-2051. doi:10.18653/v1/P19-2051.
[14] T. Ranasinghe, M. Zampieri, Multilingual offensive language identification with cross-lingual embeddings, arXiv preprint arXiv:2010.05324 (2020).
[15] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.