Conversational Hate-Offensive Detection in Code-Mixed Hindi-English Tweets

Ratnavel Rajalakshmi1, S Srivarshan1, Faerie Mattins1, E Kaarthik1, Prithvi Seshadri1 and Anand Kumar M.2

1 School of Computer Science and Engineering, Vellore Institute of Technology, Chennai
2 Department of Information Technology, National Institute of Technology Karnataka (NITK), Surathkal

Abstract

Hate speech on social media has increased with the growing use of online forums for sharing opinions. People often prefer to express their views in their native language when posting such objectionable content on social media platforms. Building an automated system that identifies hate and offensive tweets in regional languages is challenging because of the linguistic richness of these languages. Recently, the problem has become more complicated due to the use of multilingual and code-mixed tweets. Code-mixed data mixes two languages at a granular level, and a word that is not part of either language may appear in the data. To address these challenges in Hindi-English tweets, we propose a method that combines IndicBERT with an ensemble-based classifier. We applied different methodologies to classify whether a given tweet in a code-mixed Hinglish dataset is hate speech or not. Three different models, namely IndicBERT, XLM-RoBERTa and Masked LM, were used to embed the tweet data. Various classification methods, such as Logistic Regression, Support Vector Machine, Ensembling and a Neural Network based method, were then applied to perform classification. In extensive experiments on the dataset, embedding the code-mixed data with IndicBERT and classifying with Ensembling was found to be the best method, resulting in a macro F1-score of 62.53%.
This work was submitted to the shared task of the HASOC 2021 [1] [2] Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages Competition by team TNLP.

Keywords

Code-mixed tweets, Natural Language Processing, Sentiment Analysis, Machine Learning

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Email: rajalakshmi.r@vit.ac.in (R. Rajalakshmi)
ORCID: 0000-0002-6570-483X (R. Rajalakshmi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

A well-known proverb aptly describes India's linguistic diversity: "Chaar kos par baani, kos-kos par badle paani" (The language spoken in India changes every few kilometres, just like the taste of the water). Our country has 30 languages, each of which is spoken by over a million people. These 30 languages are merely a linguistic window through which we can observe the 122 languages spoken by at least 10,000 people each. Then there are 1,599 other languages, the majority of which are dialects limited to certain geographical areas. Although India has a diverse collection of languages, only Hindi has received adequate research attention in NLP. Presently, much effort is directed towards creating multilingual models instead of many models for individual languages. However, choosing a language to convert to poses a problem, as it should be understood and easily interpreted by the end-user. Although English is not the language most people speak globally, it is globally acknowledged and used in essential parts of government and corporations. Given that the majority of NLP research is conducted in English, it has proven to be the ideal language for conversion.
Indian languages are morphologically rich, which results in greater complexity in literature and sentence formation. This creates a data deficit, because meaning cannot be analysed reliably without a large amount of structurally similar data. For example, "Mein phal khaatha hun" and "Mein khaatha hun phal" both express "I eat fruit" with a different word order. Unfortunately, the progress of NLP is stunted by the lack of evaluation benchmarks and large-scale datasets specific to Indian languages. Therefore, models have to work with a limited dataset and get the maximum accuracy possible.

In this paper, the related works are analysed and discussed in Section 2. Section 3 presents the proposed methodology, with an architecture diagram giving an overview of the models used. Detailed information about the dataset and preprocessing is given in subsections 3.1 and 3.2. The three embedding models are explained in subsection 3.3, and the classifier models in subsection 3.4. The results are discussed in Section 4, followed by the conclusion of the paper. This work has been submitted to the shared task of the HASOC 2021 [1] [2] Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages Competition by team TNLP.

2. Related Works

Gaurav Arora (Jio Haptik) [3] used iNLTK in place of Semantic Evaluation. It provided insights on Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic languages. Xiaozhi Ou et al. [4] used XLM-RoBERTa and observed a 4.86% increment in accuracy compared to XLM. Bertelt Braaksma et al. [5] used LSTM and HuggingFace Transformers and showed that variation in negative classes plays an important role. Alex Wang et al. [6] use the General Language Understanding Evaluation (GLUE) benchmark.
It confirms the utility of attention mechanisms and transfer learning methods such as ELMo in NLU systems, which combine to outperform the best sentence representation models on the GLUE benchmark. Alexis Conneau et al. [7] make use of XNLI, a benchmark derived from the Multi-Genre Natural Language Inference Corpus (MultiNLI), to evaluate cross-lingual sentence representations. Several of the approaches evaluated are based on cross-lingual sentence encoders and machine translation systems. Ratnavel Rajalakshmi and Yashwanth Reddy [8] performed hate and offensive content detection for multilingual tweets. Their model is built by translating the tweets, after which features are extracted and data imbalance is handled. Adina Williams et al. [9] make use of MultiNLI, a dataset used for the development and evaluation of machine learning models for sentence understanding. Thara S et al. [10] performed a comparative analysis of different code-mixed natural language processing (NLP) approaches based on their accuracy. Although the work covered various real-time applications such as Parsing, Machine Translation (MT), Automatic Speech Recognition (ASR), Information Retrieval (IR) and many more, the advantages were clearly described.

Figure 1: Architecture of the proposed methodology

Atharva Kulkarni et al. [11] used deep learning models to classify Marathi text. Out of all the models used, IndicBERT and BiLSTM had the highest accuracy, just above 94%. Divyanshu Kakwani et al. [12] introduced monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, including IndicCorp, IndicGLUE, IndicFT and IndicBERT. Ratnavel Rajalakshmi et al. [13] used LSTM, BLSTM and BERT based models to perform a comparative analysis of Transformer based approaches for offensive language identification in code-mixed Tamil. The paper also used vector representations such as TF-IDF and GloVe. The authors of [14] performed feature extraction with ensemble methods for hate speech identification.
Since German and Hindi were used, specific methods such as SMOTE analysis, Mutual Information and Chi-Square were implemented. In another work [15], Hinglish (a mix of Hindi and English) tweets were processed and a machine learning model was trained for sentiment analysis. The model used Boosting along with classifiers to improve the efficiency of the algorithm. Machine learning and deep learning algorithms are commonly applied across various applications, including web page categorization [16, 17, 18] and sign detection [19]. Sentiment analysis in English is a very common problem, and many approaches applying both machine learning and deep learning techniques have been suggested by various researchers [20, 21, 22]. To address the challenges in code-mixed Hindi-English tweets, a relevance factor has been proposed in [23] to find the borrowing likeliness in bilingual tweets.

3. Proposed Methodology

The proposed system takes raw tweets as input, consisting of offensive and non-offensive tweets. These tweets are preprocessed and cleaned for easier processing. Next, the cleaned data is sent to the tokenizer, which tokenizes the data and also performs padding and truncation. The tokenized data is then embedded using one of the embedding models: IndicBERT, XLM-RoBERTa or MaskedLM. The embedded outputs are saved and sent to the classifier model. The classifiers are trained with the embedded data along with the corresponding labels. Finally, the system predicts whether each tweet is a hate and offensive (HOF) tweet or a non-hate offensive (NONE) tweet. Figure 1 illustrates the architecture of the proposed model.

3.1. Dataset

We have considered the dataset released as part of the HASOC-2021 competition's Subtask 2: Identification of Conversational Hate-Speech in Code-Mixed Languages (ICHCL).
The dataset consists of Hindi, English and code-mixed Hindi tweets. The task is to classify these tweets as non-hate offensive (NONE) or hate and offensive (HOF) tweets (HASOC 2021). The hate and offensive tweets are those that contain offensive, hate speech or profane content. The non-hate offensive tweets are regular tweets. A conversational thread can also contain hate and offensive content that is not apparent from a single comment or the reply to that comment alone, but that can be recognized when the context of the parent content is given. Hence, the dataset has a tree structure consisting of the source tweet, reply tweets and comments. There are a total of 5740 tweets in the training dataset, which are classified into 2841 HOF tweets and 2899 NONE tweets. There are 1348 tweets in the testing dataset, which contains a combination of HOF and NONE tweets.

3.2. Preprocessing

Each entry in the dataset is made up of a tweet and its corresponding comments in a tree-like structure. These tweets and comments were flattened into a single line by appending all the child nodes to the original tweet. Each tweet has a tweet ID and a label. The flattened texts were then compiled into a new DataFrame, which serves as the data used to train the model. The tweets contain phrases and words that may complicate the model, such as mentions (e.g., @StevieG), URLs (e.g., www.youtube.com), emojis, unwanted newlines, retweet tags (e.g., RT) and hashtags (e.g., #sunnyday). The tweets are processed one by one, and any such phrases are removed. Stop words are the most commonly used words in a language and are generally removed because they do not hold any information concerning the task at hand. The tweets also contain multiple stop words (e.g., above, below, is, that), which must be removed in the preprocessing step.
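The flattening and cleaning steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the thread layout (a dict with "text" and "replies" keys), the regular expressions, the choice to drop a hashtag as a whole token, and the tiny stop-word set are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the exact rules used in the paper are not listed.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"\bRT\b")
# Rough emoji coverage (pictographs and common symbols), an assumption.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Placeholder stop words; a real run would use a full English and Hindi list.
STOP_WORDS = {"above", "below", "is", "that"}


def flatten_thread(node):
    """Append the text of all child nodes (depth-first) to the parent text."""
    parts = [node["text"]]
    for child in node.get("replies", []):
        parts.append(flatten_thread(child))
    return " ".join(parts)


def clean_tweet(text):
    """Remove URLs, mentions, hashtags, retweet tags, emojis and stop words."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RETWEET_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # split() also discards unwanted newlines and repeated spaces.
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


thread = {"text": "RT @StevieG check www.youtube.com",
          "replies": [{"text": "#sunnyday that is\nnice"}]}
print(clean_tweet(flatten_thread(thread)))
```

Applying the patterns before lowercasing keeps the retweet tag match case-sensitive, so an ordinary lowercase word such as "rt" inside a sentence is not affected.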
For English stop-word removal, the NLTK library is used. For Hindi, a list of predefined Hindi stop words is saved in a text file, and these words are removed from the tweets in the preprocessing step. Stemming is the process by which derived words are converted to their root forms (e.g., argued and arguing are converted to argue). Both the Hindi and English texts contain words that can be stemmed. For English words, SnowballStemmer is used to perform stemming. For Hindi words, the stemming rules produced by students of Banasthali University [24] are used. The dataset is thus subjected to an algorithm that goes through each word and stems it. Figure 2 illustrates the preprocessing carried out on the raw tweets. Table 1 and Table 2 give some examples from the dataset before and after preprocessing, respectively.

Table 1
Condensed dataset before preprocessing

tweet_id             text                                                         label
1392450559693651968  yaar nahi hoti padhai leaven knows what’s wrong...           HOF
1393625939741794306  @IshanKa75853640 @deashingmilan Abe kapoor khandan...        NONE
1394176816055660000  And Modi #bhakts are busy drinking Gaumutra and...           NONE
1392421887578312711  Bhakton mooh chupao sharm se Baklol sab #IndiaWithPalestine  HOF
1396169865979863041  @kunalkamra88 Doctors aur Scientists se manga hai...         HOF

Figure 2: Preprocessing the data

Table 2
Condensed dataset after preprocessing

tweet_id             text                                                        label
1392450559693651968  yaar nahi hoti padhai leaven know what wrong...             HOF
1393625939741794306  abe kapoor khandan...                                       HOF
1394164896904802306  and modi #bhakt busi drink gaumutra...                      NONE
1392421887578312711  bhakton mooh chupao sharm se baklol sab #indiawithpalestin  HOF
1396169865979863041  doctor aur scientist se manga hai...                        HOF

3.3. Embedding

IndicBERT [12] is a multilingual model derived from the ALBERT model for Natural Language Processing tasks. The authors report training it on a novel corpus of more than 9 billion tokens. It has also been tested on diverse NLP tasks. It uses around a tenth of the parameters of other models, enabling it to process data much faster. This does not compromise its performance, as it achieves results almost as good as, or sometimes even better than, comparable models. The ALBERT model from which IndicBERT is derived is itself a derivative of the BERT model. IndicBERT covers 12 main languages: English, Bengali, Assamese, Hindi, Gujarati, Malayalam, Kannada, Oriya, Marathi, Telugu, Tamil and Punjabi. The monolingual data on which the model is trained is available on the authors' website. The model is made available for public use through the Transformers [25] library, courtesy of the HuggingFace [26] website. The Transformers Python module provides a helpful pipeline to download and load the IndicBERT tokenizer and model.

The preprocessed dataset is then passed to the tokenizer. The entire dataset is given as input to the tokenizer in bulk. However, due to memory constraints on the host machine, the entire tokenized dataset cannot be fed to the model at once, so we had to split it. The tokenized input is a PyTorch tensor, which we could not split directly. The output of the tokenizer is therefore converted into a NumPy array, the tokens corresponding to each entry are extracted from the array, and the extracted tokens are converted back to PyTorch tensors. These individual token tensors can then be passed to the model to get the embedded output. Because of the memory constraints, the tokenized input is split into equal parts of 200 entries each. These parts are fed to the model one by one, and the embedded outputs are saved to disk for further processing. The Pickle Python module is used to save and load the embeddings.

XLM-RoBERTa is a multilingual version of RoBERTa.
It has been pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. RoBERTa is a self-supervised transformer model trained on a vast corpus of data. Self-supervised means that the model was pre-trained exclusively on raw texts that were not labelled by humans, using an automatic process to produce inputs and their respective labels from those texts. This makes the model more versatile, since it can be used in many scenarios and on many datasets. XLM-RoBERTa was pre-trained using the Masked Language Modelling (MLM) objective. Like the IndicBERT model, the XLM-RoBERTa model is made available for public use through the Transformers [25] library, courtesy of the HuggingFace [26] website. Transformers is used to download and load the XLM-RoBERTa tokenizer and model. Again, due to the memory constraints, the tokenized data is split into smaller chunks and fed to the model in the same way as for IndicBERT. These tokens are passed to the model to get the embedded output, which is saved to disk for further processing.

In masked language modelling, before the word sequences are fed into BERT, 15% of the words in each training sample are replaced with a [MASK] token. The model is then trained to identify the original word that was masked, based on the context of the rest of the training sample. Most MaskedLM models accomplish this by (i) adding a classification layer on top of the encoder, (ii) multiplying the output vectors by the embedding matrix to convert them into the vocabulary dimension, and (iii) computing the probability of each word in the vocabulary. However, the major disadvantage of the BERT loss function is that only the prediction of masked values is considered, ignoring the non-masked words. In our model, the tokenizer is BertTokenizer loaded with the pre-trained model 'bert-base-uncased', and the model is BertForMaskedLM, which also makes use of the same pre-trained model.
The tokenized input is padded and truncated, with return_tensors set so that PyTorch tensors are returned.

3.4. Classifiers

3.4.1. Logistic Regression

Logistic Regression is a straightforward machine learning model derived from the Linear Regression model. In binary logistic regression, the output of a linear model is passed through the logistic (sigmoid) function, which constrains it to a value between 0 and 1, and a threshold is then applied. If the value is above the given threshold, the output is 1. If the value is lower than the threshold, the output is 0. In this way, a simple linear model is used for binary classification.

In this work, we load the embedded outputs from disk, together with the corresponding labels. This is the training data for the Logistic Regression model. The data is split into training data and validation data for validating the model. The Logistic Regression model is loaded using the Scikit-Learn [27] library for Python. Two parameters were modified in the Logistic Regression model: the maximum number of iterations was set to 1000 and the random state was set to 11. The rest of the parameters keep their default values. The model is then trained on the training data. The trained model is validated, and the results are recorded.

3.4.2. Support Vector Machines (SVM)

Support vector machines (SVMs) are supervised learning algorithms that can be used for outlier detection, classification and regression. SVM is successful in high-dimensional domains and uses only a portion of the training points, commonly known as support vectors, in the decision function, which makes it memory-efficient. The goal of this algorithm is to discover a hyperplane that distinguishes between data points in an N-dimensional space. Here, N indicates the number of features.
There are many hyperplanes that could divide the two types of data points. The primary objective is to choose the plane with the largest margin, that is, the greatest distance to the data points of both classes. Increasing the margin distance provides some reinforcement, making subsequent data points easier to classify.

In this work, we load the embedded outputs from disk, together with the corresponding labels. This is the training data for the SVM model. The data is split into training data and validation data for validating the model. The SVM model is loaded using the Scikit-Learn [27] library for Python. Three parameters were modified in the SVM model: the maximum number of iterations was set to 100, the random state was set to 11, and RBF was chosen as the kernel. The model is then trained on the training data. The trained model is validated, and the results are recorded.

3.4.3. Ensemble - Majority Voting

Ensembling is a machine learning method in which multiple weak learners/models are trained on the given data and combined to give better accuracy than each individual model. In this work, we have made use of one such ensembling technique, Majority Voting. In essence, Majority Voting is a technique in which the ensemble outputs the prediction made by the majority of the weak models. In a classification task, the predictions of the weak models are considered to be votes. These votes are tallied, and the prediction with the majority of the votes is the final prediction for that particular set of input variables. Various models make up the weak learners that contribute to the voting. The learners used in this task are Logistic Regression, Stochastic Gradient Descent, Naive Bayes, Random Forest and Decision Tree.

First, the embedded outputs are loaded from disk. The corresponding labels are also loaded. The data is split into a training set and a validation set.
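The vote tally described in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name majority_vote and the example predictions are hypothetical, and each learner is assumed to have already produced one label per tweet.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by hard (majority) voting.

    predictions: one list of labels per model, all of the same length.
    Returns one label per sample: the label predicted by most models.
    """
    final = []
    # zip(*predictions) groups the votes of all models for each sample.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        final.append(label)
    return final


# Hypothetical outputs of five learners on three tweets.
model_predictions = [
    ["HOF", "NONE", "HOF"],
    ["HOF", "HOF", "NONE"],
    ["NONE", "NONE", "NONE"],
    ["HOF", "NONE", "NONE"],
    ["NONE", "HOF", "HOF"],
]
print(majority_vote(model_predictions))
```

With five learners and two classes a tie cannot occur, so no tie-breaking rule is needed here. Scikit-Learn also provides a VotingClassifier with voting="hard" that performs the same tally; whether it was used in this work is not stated.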
The classifiers are then trained on the training set one by one. After training is complete, the testing data is passed to the models. The predictions of all the models are considered as votes, and majority voting takes place. The final prediction is obtained at this step.

3.4.4. Neural Network

Neural Networks are algorithms that try to emulate the human brain's behaviour and allow the computer to solve complex problems that it usually cannot. They generally comprise multiple layers of nodes, which can be broadly classified as input nodes, hidden nodes and output nodes. Any Artificial Neural Network model with more than one hidden layer is generally classified as a Deep Learning model. Each node is connected to other nodes and has some properties associated with it, such as a corresponding weight and a threshold value. Neural Networks are trained on data, and they generally become more accurate as they are trained on more data and for a longer time.

The Artificial Neural Network used in this work consists of a single input layer of size 32 with the ReLU activation function. The hidden part consists of four layers: two Dense layers sandwiched between two dropout layers. The dropout layers were added because the model tended to over-fit the data. Finally, the output layer provides a single output, and the activation function associated with it is the Sigmoid function. The model was compiled with its loss function set to Binary Cross Entropy and the optimizer set to Stochastic Gradient Descent. The model was run for 1000 epochs, and early stopping was employed.

In this work, we load the embedded outputs from disk, together with the corresponding labels. This dataset is divided into a training dataset and a validation dataset for validating the model. The network is then trained on the training dataset.
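For reference, the Sigmoid output and the Binary Cross Entropy loss named above have the standard forms below. The notation is introduced here and is not taken from the paper: z_i is assumed to be the pre-activation value of the output node for tweet i, y_i in {0, 1} the true label (for example 1 for HOF), and N the number of training tweets.

```latex
\hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},
\qquad
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```

The loss is small when the predicted probability is close to the true label and grows without bound as a confident prediction turns out to be wrong, which is why it is the usual choice for a single sigmoid output.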
The trained network is then validated, and the results are recorded.

4. Results and Discussion

The smaller number of parameters in the IndicBERT model was one of the main reasons why it was chosen as the base model. This feature makes it easier to train the model and get predictions. Further, this model has already been pre-trained on 12 major Indian languages. It was chosen to exploit the relatedness between the pre-training data and the code-mixed data used for training in this work. Although the same is possible with MaskedLM, it has not been pre-trained on Indian languages, and the resources needed to get the same performance are significantly higher. This is why MaskedLM has a lower macro F1-score, 50.34%, than IndicBERT and XLM-RoBERTa. For this reason, a submission was not made for MaskedLM. XLM-RoBERTa also shows a low macro F1-score of 51.43% compared to IndicBERT. One reason why IndicBERT performed better than XLM-RoBERTa is that the embeddings produced by IndicBERT were smaller than those produced by XLM-RoBERTa. This led to better performance when training on the IndicBERT embeddings, whereas for XLM-RoBERTa it was necessary to use PCA to reduce the number of features. Hence, IndicBERT gave a better macro F1-score than XLM-RoBERTa. In this research, two classifiers were applied to the IndicBERT embeddings: Ensembling using majority voting, and ANN. Ensembling gave slightly better results than the ANN classifier. The evaluation metrics for each embedding type and its classifier are given in Table 3.
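The macro-averaged precision, recall and F1-score reported in this section can be computed as sketched below. This is a minimal standard-library illustration, assuming that each macro value is the unweighted mean of the per-class values (the convention used by Scikit-Learn); the function name macro_scores and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred, labels=("HOF", "NONE")):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Each class counts equally, regardless of how many tweets it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = ["HOF", "HOF", "NONE", "NONE"]
y_pred = ["HOF", "NONE", "NONE", "NONE"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

Because the macro F1-score is the mean of the per-class F1 values rather than the harmonic mean of macro precision and macro recall, it can be slightly lower than both of them.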
Table 3
Quantitative performance validation results of the different embedding models used with their classifiers (* denotes macro-averaged values)

Submission Name  Type of Embedding  Classifier           Precision*  Recall*  F1-Score*
-                Masked LM          Logistic Regression  50.35%      50.35%   50.34%
TNLP_CMH_S2_X    XLM-RoBERTa        SVM                  53.19%      52.77%   51.43%
TNL_CMH_S3       IndicBERT          ANN                  61.89%      61.83%   61.82%
TNLP_CMH_S1      IndicBERT          Ensembling           62.63%      62.62%   62.53%

5. Conclusion and Future Work

This work was submitted to the FIRE-2021 shared task on Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC 2021). The submission obtained the 13th rank among the submissions. In this research, the problem of identifying conversational hate speech in code-mixed languages has been experimentally studied for English and Hindi tweets. The importance of tokenizing and embedding the data using different approaches has been analyzed using IndicBERT, XLM-RoBERTa and MaskedLM. Multiple classifiers were also applied to these embeddings, and the classifiers with the best performance were discussed. IndicBERT gave the best experimental results, with the highest macro F1-score of 62.53% compared to the other embedding methods used. The models for embedding the data were restricted to the three models mentioned above in this work. However, none of these models was trained on code-mixed data; each was trained on text containing single languages. Models trained on code-mixed data can be explored in the future.

References

[1] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021, ACM, 2021.
[2] S. Satapara, S. Modha, T. Mandl, H. Madhu, P. Majumder, Overview of the HASOC Subtrack at FIRE 2021: Conversational Hate Speech Detection in Code-mixed language, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[3] G. Arora, inltk: Natural language toolkit for indic languages, CoRR abs/2009.12534 (2020). URL: https://arxiv.org/abs/2009.12534. arXiv:2009.12534.
[4] X. Ou, H. Li, Ynu@dravidian-codemix-fire2020: Xlm-roberta for multi-language sentiment analysis, in: FIRE, 2020.
[5] B. Braaksma, R. Scholtens, S. van Suijlekom, R. Wang, A. Üstün, Fissa at semeval-2020 task 9: Fine-tuned for feelings, CoRR abs/2007.12544 (2020). URL: https://arxiv.org/abs/2007.12544. arXiv:2007.12544.
[6] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, CoRR abs/1804.07461 (2018). URL: http://arxiv.org/abs/1804.07461. arXiv:1804.07461.
[7] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov, XNLI: evaluating cross-lingual sentence representations, CoRR abs/1809.05053 (2018). URL: http://arxiv.org/abs/1809.05053. arXiv:1809.05053.
[8] B. YashwanthReddy, R. Rajalakshmi, Dlrg@hasoc 2020: A hybrid approach for hate and offensive content identification in multilingual tweets, in: FIRE, 2020.
[9] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, CoRR abs/1704.05426 (2017). URL: http://arxiv.org/abs/1704.05426. arXiv:1704.05426.
[10] S. Thara, E. S., B. V., M. VidhyaSaiBhagavan, M. PhanindraReddy, Code mixed question answering challenge using deep learning methods, 2020 5th International Conference on Communication and Electronics Systems (ICCES) (2020) 1331–1337.
[11] A. Kulkarni, M. Mandhane, M. Likhitkar, G. Kshirsagar, J. Jagdale, R. Joshi, Experimental evaluation of deep learning models for marathi text classification, CoRR abs/2101.04899 (2021). URL: https://arxiv.org/abs/2101.04899. arXiv:2101.04899.
[12] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 4948–4961. URL: https://aclanthology.org/2020.findings-emnlp.445. doi:10.18653/v1/2020.findings-emnlp.445.
[13] R. Rajalakshmi, Y. Reddy, L. Kumar, DLRG@DravidianLangTech-EACL2021: Transformer based approach for offensive language identification on code-mixed Tamil, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 357–362. URL: https://aclanthology.org/2021.dravidianlangtech-1.53.
[14] R. Rajalakshmi, B. Y. Reddy, Dlrg@hasoc 2019: An enhanced ensemble classifier for hate and offensive content identification, in: FIRE, 2019.
[15] R. Rajalakshmi, P. Reddy, S. Khare, V. Ganganwar, Dlrg@cis 2021: Sentimental analysis of code-mixed hindi language tweets, in: CIS, 2021.
[16] R. Rajalakshmi, Supervised term weighting methods for url classification, Journal of Computer Science 10 (2014) 1969–1976. doi:10.3844/jcssp.2014.1969.1976.
[17] R. R, A. C., An effective and discriminative feature learning for url based web page classification, 2018. doi:10.1109/SMC.2018.00240.
[18] R. Rajalakshmi, H. Tiwari, J. Patel, A. Kumar, R. Karthik, Design of kids-specific url classifier using recurrent convolutional neural network, Procedia Computer Science 167 (2020) 2124–2131. doi:10.1016/j.procs.2020.03.260.
[19] S. V, A. S, B. R, R. R, Autonomous driving system with road sign recognition using convolutional neural networks, 2019. doi:10.1109/ICCIDS.2019.8862152.
[20] V. Ganganwar, R. Rajalakshmi, Implicit aspect extraction for sentiment analysis: A survey of recent approaches, Procedia Computer Science 165 (2019) 485–491. URL: https://www.sciencedirect.com/science/article/pii/S1877050920300181. doi:10.1016/j.procs.2020.01.010.
[21] S. Sivakumar, R. R, Analysis of sentiment on movie reviews using word embedding self-attentive lstm, 2021. doi:10.4018/IJACI.2021040103.
[22] S. Sivakumar, R. Rajalakshmi, Hybrid convolutional bidirectional recurrent neural network based sentiment analysis on movie reviews, Computational Intelligence 37 (2021) 735–757.
[23] R. Rajalakshmi, R. Agrawal, Borrowing likeliness ranking based on relevance factor, in: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS ’17, Association for Computing Machinery, New York, NY, USA, 2017. URL: https://doi.org/10.1145/3041823.3067694. doi:10.1145/3041823.3067694.
[24] S. Paul, N. Joshi, I. Mathur, Development of a hindi lemmatizer, CoRR abs/1305.6211 (2013). URL: http://arxiv.org/abs/1305.6211. arXiv:1305.6211.
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, Huggingface’s transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019). URL: http://arxiv.org/abs/1910.03771. arXiv:1910.03771.
[26] Hugging face – the ai community building the future, https://huggingface.co/, 2020.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in python, CoRR abs/1201.0490 (2012). URL: http://arxiv.org/abs/1201.0490. arXiv:1201.0490.