
3Idiots at HASOC 2019: Fine-tuning Transformer Neural Networks for Hate Speech Identification in Indo-European Languages

IIT Kanpur, UP 208016, India; iSchool, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA

2019

We describe our team 3Idiots's approach for participating in the 2019 shared task on hate speech and offensive content (HASOC) identification in Indo-European languages. Our approach relies on fine-tuning pre-trained monolingual and multilingual transformer (BERT) based neural network models. Furthermore, we also investigate an approach based on labels joined from all sub-tasks. This resulted in good performance on the test set. Among the eight shared tasks, our solution won first place for English sub-tasks A and B, and Hindi sub-task B. Additionally, it was within the top 5 for 7 of the 8 tasks, being within 1% of the best solution for 5 out of the 8 sub-tasks. We open source our approach at https://github.com/socialmediaie/HASOC2019.

Hate Speech Identification, Offensive Content Identification, Neural Networks, BERT, Transformers, Deep Learning

Information extraction from social media data is an important topic. In the past we have used it for identifying sentiment in tweets [5] [7], enthusiastic and passive tweets and users [3] [6], and extracting named entities [2] [4]. The hate speech and offensive content (HASOC) shared task of 2019, which focused on Indo-European languages, gave us an opportunity to try out BERT [1]. Pre-trained BERT-based transformer neural network models are publicly available in multiple languages, and the model supports fine-tuning for specific tasks. We also tried a joint-label based approach called sub-task D, which alleviates data sparsity issues for some sub-tasks while achieving competitive performance in the final leader board evaluation.

The data supplied by the organizing team consisted of posts taken from Twitter and Facebook. The posts were in three languages: English (EN), German (DE), and Hindi (HI). The competition had three sub-tasks for English data, two sub-tasks for German data, and three sub-tasks for Hindi data.

Sub-Task A consisted of labeling a post as Hate and Offensive (HOF) if the post contained any hate speech, profane content, or offensive content; otherwise the label should be Non Hate-Offensive (NOT). Sub-Task B was more fine-grained and required identifying Hate Speech (HATE), Offensive content (OFFN), and Profane content (PRFN). Finally, Sub-Task C focused on whether the hate speech, offensive, or profane content (collectively referred to as insult) was targeted. The label Targeted Insult (TIN) applies if the insult was targeted towards an individual, group, or other. If the content was non-targeted, the label should be Untargeted (UNT). For details about the task, we refer the reader to the shared task publication [8].

The organizers released teaser data, which we utilized as a dev dataset for selecting the hyperparameters of our models. The distribution of the number of samples for each sub-task in each language is tabulated in table 1. It can be observed that the dataset size for each task is quite small.

Table 1. Distribution of data in each language under each sub-task.

[Table 2: evaluation on the train and dev splits. Columns: task, lang, model, run id, macro F1 (dev, train), weighted F1 (dev, train). Rows cover tasks A, B, and C for DE, EN, and HI with the models listed below, each with and without the (D) variant.]

Each sub-task can be modelled as a text classification problem. Our submission models are derived from fine-tuning the pre-trained language model on the shared task data. We used BERT [1] as our pre-trained language model because of its recent success as well as its public availability in multiple languages. We utilize the BERT implementation present in the pytorch-transformers library (https://github.com/huggingface/pytorch-transformers). In order to predict on the HI and DE language datasets, we used the bert-multilingual as well as the bert-german pre-trained models. Our fine-tuned model is illustrated in figure 1.

[Figure 1: the fine-tuned model. Diagram labels: Label; bert; tok1 through tok6.]

1. English Language Task (EN) - For the English language task we experimented with the bert-base-cased and bert-base-uncased models. We experimented on all three sub-tasks using the above models.
2. German Language Task (DE) - For the German language task we experimented with the bert-base-german-cased and bert-base-multilingual-cased models. We experimented on sub-tasks A and B using the above models.
3. Hindi Language Task (HI) - For the Hindi language task we experimented with the bert-base-multilingual-cased and bert-base-multilingual-uncased models. We experimented on all three sub-tasks using the above models.
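The language, sub-task, and model combinations described above can be enumerated with a short sketch. The checkpoint names and labels are taken from the text; the dictionary names, the `experiment_grid` helper, and the `num_labels` field are hypothetical and only illustrate how such a grid might be organized. The sketch does not load or fine-tune any model.

```python
from itertools import product

# Candidate pre-trained checkpoints per language, as listed in the text.
MODELS_BY_LANG = {
    "EN": ["bert-base-cased", "bert-base-uncased"],
    "DE": ["bert-base-german-cased", "bert-base-multilingual-cased"],
    "HI": ["bert-base-multilingual-cased", "bert-base-multilingual-uncased"],
}

# Sub-tasks offered per language (German has only sub-tasks A and B).
TASKS_BY_LANG = {
    "EN": ["A", "B", "C"],
    "DE": ["A", "B"],
    "HI": ["A", "B", "C"],
}

# Labels per sub-task, taken from the task description.
LABELS_BY_TASK = {
    "A": ["HOF", "NOT"],
    "B": ["HATE", "OFFN", "PRFN"],
    "C": ["TIN", "UNT"],
}


def experiment_grid():
    """Return one dict per (language, sub-task, model) fine-tuning run."""
    runs = []
    for lang, tasks in TASKS_BY_LANG.items():
        for task, model in product(tasks, MODELS_BY_LANG[lang]):
            runs.append({
                "lang": lang,
                "task": task,
                "model": model,
                # Hypothetical field: size of the classification head.
                "num_labels": len(LABELS_BY_TASK[task]),
            })
    return runs


GRID = experiment_grid()
for run in GRID:
    print(run["lang"], run["task"], run["model"], run["num_labels"])
```

Each entry corresponds to one independently fine-tuned classifier; the joint-label (D) variants are not included in this grid.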

Training

Our models were trained using the Adam optimizer (with ε = 1e-8) for five epochs, with a training/eval batch size of 32. We use a learning rate of 5e-5, a weight decay of 0.0, and a max gradient norm of 1.0. Finally, each sequence is truncated to a max allowed sequence length of 28 characters.
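As a rough illustration, the reported hyperparameters can be collected in a single configuration object. The values below are copied from the text; the `TrainingConfig` class, its field names, and the `clip_by_global_norm` helper are assumptions introduced here, and the clipping function is a simplified pure-Python stand-in for what a max gradient norm does, not the library routine used for training.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the text (field names are hypothetical)."""
    adam_epsilon: float = 1e-8
    num_epochs: int = 5
    batch_size: int = 32        # used for both training and evaluation
    learning_rate: float = 5e-5
    weight_decay: float = 0.0
    max_grad_norm: float = 1.0
    max_seq_length: int = 28    # value copied as stated in the text


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total == 0.0 or total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


CONFIG = TrainingConfig()
# A gradient with norm 5 is scaled down to norm CONFIG.max_grad_norm.
clipped = clip_by_global_norm([3.0, 4.0], CONFIG.max_grad_norm)
print(clipped)
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only affects unusually large updates.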

Training via joint labels - Sub-Task D

In order to alleviate the data sparsity issue, we utilize an approach which we call sub-task D. Here, the labels of each sub-task are combined to form a unified multi-label task. All possible class combinations across all sub-tasks are utilized to create new classes. The motivation behind this approach is to share information between tasks via their label combinations: we train a single model for this joint task and then post-process its predictions to identify labels for sub-tasks A, B, and C. The final set of classes is NOT-NONE-NONE, HOF-HATE-TIN, HOF-HATE-UNT, HOF-OFFN-TIN, HOF-OFFN-UNT, HOF-PRFN-TIN, and HOF-PRFN-UNT. Furthermore, the approach also addresses the data sparsity issue, as we use the full training data to solve all tasks.
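The joint-label construction and the post-processing step can be sketched as follows. This is a minimal illustration assuming the hyphen-joined class names listed in the text; the function names are hypothetical, and the handling of languages without sub-task C is not covered.

```python
from itertools import product

SUB_TASK_B = ["HATE", "OFFN", "PRFN"]
SUB_TASK_C = ["TIN", "UNT"]


def joint_classes():
    """Enumerate the joint classes: one NOT class plus every HOF combination."""
    classes = ["NOT-NONE-NONE"]
    classes += [f"HOF-{b}-{c}" for b, c in product(SUB_TASK_B, SUB_TASK_C)]
    return classes


def join_labels(a, b="NONE", c="NONE"):
    """Combine per-sub-task labels into a single joint class name."""
    return f"{a}-{b}-{c}"


def split_labels(joint):
    """Post-process a predicted joint class back into sub-task labels."""
    a, b, c = joint.split("-")
    return {"A": a, "B": b, "C": c}


print(joint_classes())
print(join_labels("HOF", "OFFN", "TIN"))
print(split_labels("HOF-PRFN-UNT"))
```

A single classifier trained over these seven classes yields a prediction for every sub-task at once, which is how each training example contributes to all three sub-tasks.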

Results

Internal evaluation of model training

Since we did not have test labels, we evaluated our models on both the training and the dev set (as described above). Similar to the shared task evaluation protocol, our evaluation also utilized macro-F1 and weighted-F1 scores. Our evaluation is presented in table 2. We selected the best models from each evaluation as our submission for the respective sub-task.
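For reference, the two metrics can be computed as in the sketch below. It is a plain-Python illustration using the standard definitions of macro-averaged and support-weighted F1, not the evaluation script used by the organizers or by our system; the function names and the toy labels are assumptions.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true examples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy example with sub-task A labels (hypothetical predictions).
y_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Macro-F1 treats rare classes the same as frequent ones, so it is the stricter of the two on imbalanced data, whereas weighted-F1 tracks performance on the majority class more closely.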

Evaluation on test data

To identify our model performance on the test data, we utilized the leader board rankings released by the organizers based on all the shared task submissions (see table 3). Among the eight shared tasks, our solutions won first place for English sub-tasks A and B, and Hindi sub-task B. Furthermore, our solutions were within the top 5 for 7 of the 8 tasks, being within 1% of the best solution for 5 out of the 8 sub-tasks. For the English sub-task B, our submissions took all of the top three ranks. Our submissions also came a close second for sub-task C for both English and Hindi.

[Table 3: leader board results on the test data. Columns: task, lang, run id, model, macro F1, weighted F1, rank. Rows cover tasks A, B, and C for DE, EN, and HI.]

Conclusion

We have presented our team 3Idiots's approach based on fine-tuning monolingual and multilingual transformer networks to classify social media posts in three different languages for hate speech and offensive content. We open source our approach at: https://github.com/socialmediaie/HASOC2019

1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423

2. Mishra, S.: Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In: Proceedings of the 30th ACM Conference on Hypertext and Social Media - HT '19. pp. 283–284. ACM Press, New York, New York, USA (2019). https://doi.org/10.1145/3342220.3344929, http://dl.acm.org/citation.cfm?doid=3342220.3344929

3. Mishra, S., Agarwal, S., Guo, J., Phelps, K., Picco, J., Diesner, J.: Enthusiasm and support: alternative sentiment classification for social movements on social media. In: Proceedings of the 2014 ACM conference on Web science - WebSci '14. pp. 261–262. ACM Press, Bloomington, Indiana, USA (Jun 2014). https://doi.org/10.1145/2615569.2615667, http://dl.acm.org/citation.cfm?doid=2615569.2615667

4. Mishra, S., Diesner, J.: Semi-supervised Named Entity Recognition in noisy-text. In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). pp. 203–212. The COLING 2016 Organizing Committee, Osaka, Japan (2016), https://aclweb.org/anthology/papers/W/W16/W16-3927/

5. Mishra, S., Diesner, J.: Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora. In: Proceedings of the 29th on Hypertext and Social Media - HT '18. pp. 2–10. ACM Press, New York, New York, USA (2018). https://doi.org/10.1145/3209542.3209562, http://dl.acm.org/citation.cfm?doid=3209542.3209562

6. Mishra, S., Diesner, J.: Capturing Signals of Enthusiasm and Support Towards Social Issues from Twitter. In: Proceedings of the 5th International Workshop on Social Media World Sensors - SIdEWayS'19. pp. 19–24. ACM Press, New York, New York, USA (2019). https://doi.org/10.1145/3345645.3351104, http://dl.acm.org/citation.cfm?doid=3345645.3351104

7. Mishra, S., Diesner, J., Byrne, J., Surbeck, E.: Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT '15. pp. 323–325. ACM Press, New York, New York, USA (2015). https://doi.org/10.1145/2700171.2791022, http://dl.acm.org/citation.cfm?doid=2700171.2791022

8. Modha, S., Mandl, T., Majumder, P., Patel, D.: Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages. In: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation (December 2019)