RNNs vs Transformers: Training language models on deficit datasets

Abhishek Kumar Gautam 1, B Bharathi 2
1 Department of Computer Science, Indian Institute of Information Technology Una, Himachal Pradesh, India
2 Department of CSE, Sri Siva Subramaniya Nadar College of Engineering, Tamil Nadu, India

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India
19105@iiitu.ac.in (A. K. Gautam); bharathib@ssn.edu.in (B. Bharathi)

Abstract
Content moderation is as old as online social media itself: its goal is to prevent hate speech, offensive comments, and similar content from appearing on a platform so that the online social environment stays friendly and sane. With an exponentially increasing number of people on social media, content moderation has become too large a task to handle manually, so the modern approach relies on specialised tools from AI and NLP. In countries where English is not the native language, social media text is mostly code-mixed. This paper describes the work submitted by SSNCSE_NLP to the HASOC offensive language identification tasks on multilingual code-mixed text at FIRE 2021. We present a detailed comparison of the performance of several RNN-based models against the transformer-based BERT architecture, varying the essential hyperparameters while training on a small dataset for classification tasks such as sentiment analysis. Our best evaluated models achieved an F1 score of 72.47% in task 1, and 69.2% and 61.5% in task 2 for Malayalam and Tamil respectively, on the test set.

Keywords
Offensive content, Dravidian languages, RNN, LSTM

1. Introduction
Every brand, small or big, wants to put its product into as many hands as possible and make it easily accessible in the users' own native languages. This, combined with the reach of the internet, has produced a massive and diverse set of user groups, so online content moderation is needed in those native languages as well as in the mixed languages those users actually write. Because code-mixed text combines two or more languages together with symbols and emojis, it is difficult to train efficient models on it. Recent developments in sequence processing models and Transformer-based architectures [1] have made it far easier to train models on such mixed-language data. A review of code-mixed research and the challenges involved in speech and language processing is given in [2]. Ensemble approaches for offensive language identification were discussed in [3]. Multilingual BERT-based transformer models have been used for the offensive language identification task [4]. Machine learning based approaches for offensive language identification are described in [7]. In this paper we compare training LSTM-based architectures [5] with the transformer-based BERT [6] on small code-mixed datasets of the Dravidian languages Malayalam and Tamil mixed with English. Tamil and Malayalam belong to the Dravidian language family and are spoken mainly in south India, Sri Lanka, and Singapore.

Figure 1: Example sentences of task 1 and task 2.
The paper is organized as follows: the dataset descriptions are given in Section 2.1, Section 2.2 describes the preprocessing, and Section 2.3 details the experimental setup and the various features used for this task. Section 3 provides an analysis and comparison of the performance of the various models on the development and test data. Finally, Section 4 concludes the paper.

2. Proposed work

2.1. Dataset analysis and task description
The primary goal of this shared task is to detect offensive language in a code-mixed dataset of comments/posts in Dravidian languages (Malayalam-English and Tamil-English) collected from social media [8][9]. A comment/post may contain more than one sentence, but on average each entry in the corpora is a single sentence. Each comment/post is annotated with an offensive-language label at the comment/post level [10]. The dataset also has a class imbalance problem, reflecting real-world scenarios. The HASOC Dravidian dataset had two tasks. The first task was a message-level label classification task: given a YouTube comment in Tamil, the model had to classify it as offensive or not-offensive. For the second task, given a tweet in code-mixed Tamil or Malayalam, systems had to classify it as offensive or not-offensive. Example sentences for task 1 and task 2 are given in Fig. 1.

2.2. Preprocessing
The datasets consist of the Dravidian languages Tamil and Malayalam code-mixed with English words, symbols, and emojis. Each dataset was parsed to generate word-level tokens, and the tokens were further split into characters to separate out the non-UTF-8 charset; emojis identified from this charset were removed from the dataset. The word-level tokens were then fed directly into the LSTM-based networks, while the cleaned text was parsed separately to generate BERT tokens for training the transformer architecture. This was done to create proper tokens for the Tamil-in-English (Tanglish) and Malayalam-in-English (Manglish) datasets provided in task 2, while pre-trained embeddings from IndicBERT [11] were used for task 1.

2.3. Experiments
For natural language modelling, several LSTM architectures, including ELMo [12], were tested along with the state-of-the-art transformer model BERT. Because the available embeddings lack vocabulary (i.e. contain only unknown tokens) for the Tanglish and Manglish datasets of task 2, the models for task 2 were trained from scratch. The LSTM-based RNN architectures were created and trained in TensorFlow, while the transformer architecture BERT was trained in PyTorch using the Hugging Face Transformers library. The Jupyter notebooks for both training tasks are available at https://github.com/Abhishek-krg/Multilingual-codemixed-language-classification.

2.3.1. ELMo model
ELMo is an LSTM-based architecture that leverages the bi-directionality of natural language [12] by using two separate LSTM layers, one running left-to-right and one right-to-left, inside a bidirectional wrapper and shallowly concatenating their outputs.

Figure 2: Use of separate left-to-right (green LSTM blocks) and right-to-left (blue LSTM blocks) models in ELMo [12].
Figure 3: ELMo for sequence classification [12].

Since there were no multilingual ELMo models for Indian languages, we trained it from scratch and achieved an accuracy of 82.7% on the validation set. Because ELMo is built from LSTMs, it shares the same hyperparameters, namely:
1. Number of units: the number of LSTM units, i.e. the output dimension.
2. Dropout: dropout rate (0-1) applied to the outputs.
3. Recurrent dropout: dropout rate (0-1) applied to the recurrent state.
Despite the fairly small dataset, the ELMo architecture could be trained from scratch and still reach good accuracy.
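As a rough illustration, the following is a minimal TensorFlow/Keras sketch, not the exact notebook code, of the two-directional LSTM setup described above (referred to as BiLSTMd in Section 3), preceded by a simple stand-in for the emoji/charset cleaning of Section 2.2. Only the LSTM hyperparameters (32 units, 25% dropout, 10% recurrent dropout) come from Table 2; the vocabulary size, sequence length, embedding dimension, dense-layer width, and the clean_text helper are illustrative assumptions.

import re
import tensorflow as tf

VOCAB_SIZE = 20000   # assumption: size of the word-level vocabulary
MAX_LEN = 100        # assumption: padded length of a tokenized comment
EMBED_DIM = 128      # assumption: word-embedding dimension

def clean_text(text):
    # Rough stand-in for the preprocessing of Section 2.2: drop emojis and
    # other supplementary-plane symbols, then collapse whitespace.
    text = re.sub(r"[\U00010000-\U0010FFFF]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Word-level integer tokens (padded to MAX_LEN) are assumed to be prepared
# from the cleaned text, as described in Section 2.2.
tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(tokens)

# Two separate LSTMs, one reading left-to-right and one right-to-left
# (go_backwards=True); their final states are shallow-concatenated, which is
# the "BiLSTMd" variant. Units, dropout and recurrent dropout follow the best
# row of Table 2 (32 units, 25% dropout, 10% recurrent dropout).
fwd = tf.keras.layers.LSTM(32, dropout=0.25, recurrent_dropout=0.10)(emb)
bwd = tf.keras.layers.LSTM(32, dropout=0.25, recurrent_dropout=0.10,
                           go_backwards=True)(emb)
features = tf.keras.layers.Concatenate()([fwd, bwd])

# Fully connected layers for the binary offensive / not-offensive decision.
hidden = tf.keras.layers.Dense(64, activation="relu")(features)
output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(tokens, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Replacing the right-to-left LSTM with a second left-to-right LSTM stacked alongside the first gives the BiLSTMu variant compared in Section 3.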
2.3.2. BERT model

Figure 4: Fine-tuning BERT encoder transformer [6].

BERT is a transformer-based architecture [1] that has proven to fit a wide variety of tasks [6]; it uses self-attention to build contextual understanding within the network. Task 1 consisted of plain Tamil text code-mixed with English words, so fine-tuning the multilingual IndicBERT model gave a score of 81% on the validation set. For task 2 we trained the models from scratch: since the maximum sentence length was found to be 91, an embedding size of 128 was used with a mini-BERT configuration to pre-train the model on the text. Pre-training was done on the entire set with masked language modelling, after which the model was fine-tuned for classification on the downstream task. The BERT models are implemented in PyTorch and use the Hugging Face Transformers API to create and train the models, while the ELMo architecture was implemented in TensorFlow and trained from scratch; all the notebooks associated with training the models are available in the repository linked in Section 2.3.
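As a concrete illustration of this pipeline, the following is a minimal sketch, not the exact notebook code, of pre-training a small BERT from scratch with masked language modelling and then fine-tuning it for the binary offensive/not-offensive decision using the Hugging Face Transformers API in PyTorch. The corpus file name, the stand-in tokenizer, and the training arguments are illustrative assumptions; only the 12-layer, 2-head mini configuration and a maximum length covering the observed sentence length of 91 echo values reported above. Note that the standard BertConfig ties the embedding size to the hidden size, so the decoupled hidden/embedding sizes of Table 1 are not reproduced exactly here.

from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM,
                          BertForSequenceClassification, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Stand-in tokenizer; the original work built its own tokens for Tanglish/Manglish.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

# "tanglish.txt" is a hypothetical one-comment-per-line corpus file.
raw = load_dataset("text", data_files={"train": "tanglish.txt"})

def tokenize(batch):
    # Pad/truncate to 96 tokens, which covers the observed maximum length of 91.
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=96)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Mini-BERT configuration: 12 layers, 2 attention heads, small hidden size.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=128,
                    num_hidden_layers=12, num_attention_heads=2,
                    intermediate_size=512, max_position_embeddings=96)

# Stage 1: masked language modelling on the unlabelled text.
mlm_model = BertForMaskedLM(config)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)
Trainer(model=mlm_model,
        args=TrainingArguments("mlm-pretrain", num_train_epochs=3,
                               per_device_train_batch_size=32),
        data_collator=collator,
        train_dataset=tokenized["train"]).train()
mlm_model.save_pretrained("mini-bert-pretrained")
tokenizer.save_pretrained("mini-bert-pretrained")

# Stage 2: fine-tune the pre-trained encoder for 2-way classification.
clf_model = BertForSequenceClassification.from_pretrained("mini-bert-pretrained",
                                                          num_labels=2)
# clf_model is then trained with another Trainer on the labelled
# offensive / not-offensive comments (omitted here).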
3. Performance analysis
The performance of the proposed approach using the BERT model on the validation data is given in Table 1. The best from-scratch configuration reaches an accuracy of 81.2%, and fine-tuning the multilingual IndicBERT model for task 1 likewise gave a score of 81% on the validation set. The performance of the proposed system using the ELMo model is given in Table 2. In Table 2, BiLSTMu refers to a BiLSTM with side-by-side stacked uni-directional (left-to-right) LSTMs; BiLSTMd refers to a BiLSTM with separate left-to-right and right-to-left LSTMs whose outputs are shallow-concatenated and fed to fully connected layers for classification.

Table 1
Performance of the proposed system using the BERT model with validation data
Attention heads   Hidden layers   Hidden size   Embeddings   Accuracy (in %)
2                 12              128           128          80.66
2                 12              256           128          81.2
4                 12              128           128          80.38
2                 12              256           64           70.51
4                 12              256           128          81.06
2                 12              512           128          79.37

Table 2
Performance of the proposed system using LSTM models with validation data
Architecture          Units   Dropout (in %)   Recurrent dropout (in %)   Accuracy (in %)
Unidirectional-LSTM   64      20               20                         80
BiLSTMu               32      20               20                         81
BiLSTMu               64      20               20                         81.6
BiLSTMd               32      20               20                         82.5
BiLSTMd               64      20               20                         82.5
BiLSTMd               32      25               10                         82.7

Table 3
Performance of the proposed system using XLM-RoBERTa models with validation data
Model                Accuracy (in %)
XLM-RoBERTa-base     80.5
XLM-RoBERTa-large    81.2

Table 4
Performance of the proposed system using test data
Task                        Precision   Recall   F1 score
Task-1 Tamil                0.747       0.725    0.735
Task-2 Tamil-English        0.615       0.607    0.61
Task-2 Malayalam-English    0.692       0.678    0.683

From Table 2, it can be seen that the BiLSTMd model with 32 units achieves the highest accuracy of 82.7%. Considering the performance of multilingual language models, we also experimented with XLM-RoBERTa; the results are tabulated in Table 3. The performance of the proposed system on the test data is given in Table 4.

4. Conclusion
In this paper, we proposed offensive language identification for Dravidian code-mixed text using ELMo and BERT models. From the performance metrics above it is clear that BERT, despite being a far more powerful architecture, could not achieve the expected results, while the BiLSTMd architecture gave better results on the HASOC dataset. The likely reason is that BERT is a very dense model that requires huge datasets to train, whereas LSTM-based RNN architectures can achieve better results on simpler classification tasks with limited data.

References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[2] S. Sitaram, K. R. Chandu, S. K. Rallabandi, A. W. Black, A survey of code-switched speech and language processing, 2020. arXiv:1904.00784.
[3] D. Saha, N. Paharia, D. Chakraborty, P. Saha, A. Mukherjee, Hate-alert@DravidianLangTech-EACL2021: Ensembling strategies for transformer-based offensive language detection, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 270-276. URL: https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38.
[4] S. M. Jayanthi, A. Gupta, SJ_AJ@DravidianLangTech-EACL2021: Task-adaptive pre-training of multilingual BERT models for offensive language identification, 2021. arXiv:2102.01051.
[5] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, LSTM: A search space odyssey, CoRR abs/1503.04069 (2015). URL: http://arxiv.org/abs/1503.04069. arXiv:1503.04069.
[6] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[7] A. B. Nitin Nikamanth, B. Bharathi, SSNCSE_NLP@HASOC-Dravidian-CodeMix-FIRE2020: Offensive language identification on multilingual code mixing text, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020, pp. 370-376.
[8] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan, R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 133-145. URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
[9] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly, Overview of the DravidianCodeMix 2021 shared task on sentiment detection in Tamil, Malayalam, and Kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, 2021.
[10] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the HASOC-DravidianCodeMix shared task on offensive language detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[11] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, in: Findings of EMNLP, 2020.
[12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, CoRR abs/1802.05365 (2018). URL: http://arxiv.org/abs/1802.05365. arXiv:1802.05365.