=Paper=
{{Paper
|id=Vol-3681/T4-1
|storemode=property
|title=Overview of CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Text at FIRE 2023
|pdfUrl=https://ceur-ws.org/Vol-3681/T4-1.pdf
|volume=Vol-3681
|authors=Asha Hegde,F. Balouchzahi,Sharal Coelho,H.L. Shashirekha,Hamada A. Nayel,Sabur Butt
|dblpUrl=https://dblp.org/rec/conf/fire/HegdeBCSNB23
}}
==Overview of CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Text at FIRE 2023==
Asha Hegde¹, F. Balouchzahi², Sharal Coelho¹, H. L. Shashirekha¹, Hamada A. Nayel³ and Sabur Butt²

¹ Department of Computer Science, Mangalore University, India
² Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC), Mexico
³ Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Egypt

Abstract

Word-level Language Identification (LI) aims to identify the language of individual words within a given sentence. It is a preliminary step in processing code-mixed text, in which words or sub-words from more than one language occur within a sentence, for various downstream applications. Though there are several tools/models for word-level LI in high-resource languages, under-resourced languages like Tulu, Kannada, etc., are less explored in this direction due to the lack of annotated data. To address these challenges, in the CoLI-Tunglish shared task we have open-sourced a Tulu code-mixed dataset (a combination of Tulu, Kannada, and/or English words/sub-words/affixes), written in Roman script, for word-level LI of Tulu, Kannada, English, and mixed-language words. The objective of the shared task is to assign one of seven predefined categories to each word in a given sentence: Tulu, Kannada, English, Mixed (a combination of Tulu, Kannada, and/or English), Name, Location, and Other. A total of 14 teams registered for the shared task, and 10 different runs were submitted by 5 teams. Most of the teams explored Machine Learning (ML) classifiers trained with Term Frequency - Inverse Document Frequency (TF-IDF) features of character n-grams. Among all the submitted models, the top-performing model obtained a weighted F1 score of 0.89 and a macro F1 score of 0.81.

Keywords: Language Identification, Tulu, Sequence Labeling, Word-level

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
hegdekasha@gmail.com (A. Hegde); fbalouchzahi2021@cic.ipn.mx (F. Balouchzahi); sharalmucs@gmail.com (S. Coelho); hlsrekha@mangaloreuniversity.ac.in (H. L. Shashirekha); hamada.ali@fci.bu.edu.eg (H. A. Nayel); sbutt2021@cic.ipn.mx (S. Butt)
https://sites.google.com/view/asha-hegde/home (A. Hegde); https://sites.google.com/view/fazlfrs/home (F. Balouchzahi); https://sites.google.com/view/sharalcoelho/home (S. Coelho); https://bu.edu.eg/staff/hamadaali14 (H. A. Nayel)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Globally, South Asia stands out as the most linguistically diverse region, boasting an astonishing array of over 650 distinct languages (https://www.deccanherald.com/content/652273/intl-meet-south-asian-languages.html). India, a prominent South Asian country, encapsulates this linguistic richness within its borders, with a tapestry of languages reflecting its cultural heritage and diversity. Tulu is one of the Dravidian languages, with a rich cultural and literary heritage, spoken by a community of over 4 million native speakers [1] in the coastal regions of southern India, predominantly in Karnataka state [2].
Despite its significant speaker base, Tulu faces challenges of recognition and preservation, and considerable efforts are ongoing to promote and sustain this unique linguistic tradition. As Tigalari, the Tulu script, is not widely used, Tulu text is often written in Kannada script. Further, Tulu was traditionally a spoken language, and since Kannada is taught from an early age, transcribing Tulu in Kannada script became widespread [3]. Tulu is the regional language of Dakshina Kannada, and Kannada is the official language of Karnataka. Tuluvas (people whose mother tongue is Tulu) are usually fluent in reading, writing, and speaking both Tulu and Kannada. In addition, many Kannada words are used in the Tulu language. Moreover, English is widely used among Tulu-speaking individuals, particularly those who are active on social media platforms. Tulu content such as songs, videos, movies, comedy programs, and skits is immensely popular on social media, and comments posted by Tulu users often comprise a mix of Tulu, Kannada, and/or English. Due to the limitations of computer keyboards and smartphone keypads and the intricacies of composing words with consonant conjuncts in Kannada script, many Tulu users opt for Roman script or a combination of Kannada and Roman script when interacting on social media, resulting in code-mixed text [4]. This code-mixing can occur at various linguistic levels, including the paragraph, sentence, word, or sub-word, where users blend their native and/or local languages like Tulu and/or Kannada with English [5, 6]. Due to the prevalence of the Roman alphabet on computer keyboard layouts and smartphone keypads, people often prefer to write code-mixed content in Roman script rather than their native script.

Social media platforms have granted users the liberty to compose text informally, often disregarding the grammar conventions of the languages used. This has led to a substantial influx of user-generated content characterized by incomplete words or sentences, catchy phrases, user-defined abbreviations ("gm" for "good morning"), slang terms ("meme", "Gmeet", "WhatsApp"), common abbreviations ("OMG" for "Oh my God"), and the repetition of characters ("soooooo sad" for "so sad"), among others [7, 8]. These informal language elements can make the content challenging to comprehend. Additionally, the prevalence of code-mixing, where words of one language are interwoven with words of another as prefixes or suffixes, complicates text analysis, particularly due to conflicting phonetics. The expanding user base of social media platforms results in a continuous surge of user-generated content, making manual management and understanding of this text increasingly impractical. This underscores the need for automated tools and techniques capable of processing user-generated code-mixed text.

The preliminary step in handling code-mixed text for many Natural Language Processing (NLP) tasks, such as Machine Translation [9], Parts-Of-Speech tagging [10], Sentiment Analysis [11, 12], Emotion Analysis [13, 14], Detecting Signs of Depression [15], Hate Speech and Offensive Language Identification [6, 16], and Hope Speech Detection [17, 18], is identifying the language of each word/phrase/sentence [7]; this task is known as Language Identification (LI). Traditionally, LI has been predominantly studied at the document level, with a focus on high-resource languages, often overlooking low-resource languages.
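For contrast, document-level LI for high-resource languages is essentially available off the shelf. A minimal example with the langid library (one such tool among several; the Romanized sample text is a hypothetical illustration) shows why a single document-level label falls short on code-mixed input:

```python
# Document-level LI with an off-the-shelf tool (pip install langid).
# One label per text works for monolingual, high-resource input, but is
# uninformative for code-mixed text, motivating word-level LI.
import langid

print(langid.classify("This is an English sentence."))  # ('en', score)

# Hypothetical Romanized code-mixed input: a document-level tool must
# still return exactly one language label for the whole string.
print(langid.classify("super movie, tumba chennagide"))
```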
However, in recent times, due to technological advancements and the multilingual nature of countries like India, there has been a growing trend of users posting comments in code-mixed text [7, 19]. Some of the prominent code-mixed Indian language pairs are Hindi-English [19], Bengali-English [20], Kannada-English [7], Telugu-English [21], and Malayalam-English [22]. These code-mixed texts demand LI at the word level, as each word in the text belongs to any one language or a combination of languages. Identifying the language of the words in code-mixed social media text gives insight into the linguistic interplay and can also be helpful in multilingual text processing.

Word-level LI can be modeled as a sequence labeling problem, where each word in the sequence is tagged with one of the predefined languages, including mixed language. Inspired by Shashirekha et al. [7], to address the challenges of word-level LI in code-mixed text, the CoLI-Tunglish shared task introduces a gold standard corpus for word-level LI in Tulu code-mixed text. The objective of this task is to determine the language of each word in the given Tulu code-mixed data sourced from social media [4]. The CoLI-Tunglish dataset serves as a valuable resource for researchers and practitioners working on word-level LI in multilingual contexts, allowing them to develop and evaluate models that can effectively handle code-mixed data.

The rest of the paper is organized as follows: Section 2 describes the related work and Section 3 gives the task description. Section 4 details the evaluation metrics, followed by a brief description of the baselines in Section 5. An overview of the submitted systems is given in Section 6, and the results are discussed in Section 7. Section 8 concludes the paper along with some future avenues.

2. Related Work

In recent years, there has been a growing interest among researchers in code-mixed text, particularly in low-resource and under-resourced languages, for various applications [4, 5, 8, 23]. To address the challenges of LI in code-mixed text, several studies have been conducted employing various ML and Deep Learning (DL) algorithms, and descriptions of some relevant works are given below:

Chaitanya et al. [19] explored LI of Hindi-English code-mixed data, employing feature vectors generated by the Continuous Bag of Words (CBOW) and Skipgram models to train ML models (Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), Gaussian Naive Bayes (GNB), k-Nearest Neighbor (kNN), and Adaptive Boosting (AdaBoost)). Among these models, the SVM classifier achieved the highest accuracies of 67.33% and 67.34% using the CBOW and Skipgram models, respectively. Gundapu and Mamidi [24] performed LI on Telugu-English code-mixed text using Conditional Random Fields (CRF) classifiers and obtained an accuracy of 91.28% by considering the previous, current, and next words, their POS tags, word length, and character n-grams in the range (1, 3) as features. Mandal and Singh [25] proposed a multichannel Neural Network (NN) model combining Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) models with a Bidirectional LSTM (BiLSTM) and CRF, for LI in code-mixed Hindi-English and Bengali-English text. This multichannel NN model achieved accuracies of 93.32% and 93.28% for Hindi-English and Bengali-English data, respectively.
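The sequence labeling formulation can be made concrete with a small sketch. The snippet below uses the sklearn-crfsuite library with a feature template loosely modeled on the CRF features reported by Gundapu and Mamidi [24] (neighboring words, word length, and character prefixes/suffixes standing in for character n-grams); the toy sentence, tags, and feature names are illustrative assumptions, not any system submitted to the task.

```python
# Minimal sketch: word-level LI as sequence labeling with a CRF.
# Requires: pip install sklearn-crfsuite. Toy data is hypothetical.
import sklearn_crfsuite

def word_features(sent, i):
    """Feature dict for the i-th word: the word itself, its length,
    short character prefixes/suffixes, and its immediate neighbors."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.length": len(word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def sent_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Hypothetical training data: Romanized words with the task's tag set.
train_sents = [["super", "movie", "undu"]]
train_tags = [["English", "English", "Tulu"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit([sent_features(s) for s in train_sents], train_tags)
print(crf.predict([sent_features(["super", "movie", "undu"])]))
```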
Thara and Poornachandran [22] introduced a dataset for LI in code-mixed English-Malayalam text, utilized a transformer-based approach with a fine-tuned ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model, and obtained the best performance with a macro F1 score of 0.9933. Veena et al. [26] explored SVM models trained with word and character 5-gram embeddings for LI in code-mixed Hindi-English text and achieved improved accuracy.

To address the specific challenge of word-level LI in Kannada-English code-mixed texts, our previous work, the CoLI-Kanglish shared task [23], aimed to provide a solution by open-sourcing a dataset comprising Kannada-English code-mixed text written in the Roman script [7]. The task's objective was to classify each word within the text into one of six predefined categories: Kannada, English, Kannada-English, Name, Location, or Other. The CoLI-Kenglish dataset used in the CoLI-Kanglish shared task [23] is described in [7], and its statistics are given in Table 1.

Table 1: Statistics of the CoLI-Kenglish dataset

  Tag               Train set   Test set
  Kannada               6,526      2,194
  English               4,469      1,812
  Kannada-English       1,379         93
  Name                    708        354
  Location                102         31
  Other                 1,663        100
  Total                14,847      7,241

The study reported the performance of various models submitted by participants in the CoLI-Kanglish shared task. Table 2, borrowed from [23], shows the final leaderboard of the CoLI-Kanglish shared task, and a summary of some top-performing models is given below:

Team Tiya1012 [27] achieved the top position in the competition by fine-tuning DistilBERT, a transformer-based model, on the CoLI-Kenglish dataset and obtained a macro F1 score of 0.62, indicating promising progress in the field of word-level LI for code-mixed texts. Team Abyssinia [28] conducted experiments using various Language Models (Bidirectional Encoder Representations from Transformers (BERT), Multilingual BERT (mBERT), XLM-R, and RoBERTa from HuggingFace) in combination with an LSTM architecture. Notably, mBERT and XLM-R outperformed the other models, achieving a macro F1 score of 0.61 and securing the second rank in the competition. Team PDNJK [29] explored multiple transformer-based models for the LI task in code-mixed Kannada-English words. Their top-performing model, based on BERT, achieved a macro F1 score of 0.57, earning them the fourth position in the shared task. Team Habesha [30] took a different approach by training character-level LSTM and BiLSTM models with attention mechanisms. Their BiLSTM model outperformed the LSTM model, achieving a macro F1 score of 0.61 and securing second place in the competition. Team Lidoma [31] investigated the use of character n-grams to generate character TF-IDF representations for training traditional ML classifiers. Among their experiments, a simple kNN classifier performed best, achieving a macro F1 score of 0.58. Team NLP_BFCAI [32] converted a Bag-of-Characters into character vectors and introduced a character representation model known as Bag-of-n-Characters. They experimented with several traditional ML algorithms and found that the RF model, utilizing the proposed features, achieved a macro F1 score of 0.43 in the competition.

To summarize, a considerable amount of research has been reported on word-level LI in code-mixed Indo-Aryan texts like Hindi-English and Bengali-English. However, word-level LI in code-mixed Dravidian language (Tamil, Malayalam, Kannada, and Telugu) texts has received very limited attention.
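Since character n-grams in a range such as (1, 3) recur throughout these systems, a concrete illustration may help; the following is a sketch of the idea, not any team's code:

```python
def char_ngrams(word: str, n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all character n-grams of `word` for n in [n_min, n_max],
    the unit over which TF-IDF is computed in the systems above."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

# char_ngrams("super") ->
# ['s','u','p','e','r','su','up','pe','er','sup','upe','per']
```

Because such features depend only on character sequences, not on any language-specific resource, they are a natural fit for under-resourced, Romanized, code-mixed text.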
Further, it is also clear that word-level LI in code-mixed Tulu text has not yet been explored by researchers, and this is the first research attempt focusing on word-level LI in code-mixed Tulu text.

Table 2: Results of the CoLI-Kanglish shared task [23]

                               Weighted                       Macro
  Rank  Team name        Precision  Recall  F1-score   Precision  Recall  F1-score
  1     Tiya1012 [27]        0.87     0.85     0.86        0.67     0.61     0.62
  2     Abyssinia [28]       0.85     0.84     0.84        0.62     0.62     0.61
  2     Habesha [30]         0.85     0.83     0.84        0.66     0.60     0.61
  -     LSVM-Baseline        0.84     0.84     0.83        0.67     0.57     0.59
  3     Lidoma [31]          0.83     0.83     0.83        0.64     0.56     0.58
  4     PDNJK [29]           0.86     0.85     0.86        0.58     0.58     0.57
  -     MLP-Baseline         0.84     0.81     0.82        0.60     0.60     0.57
  -     LR-Baseline          0.84     0.84     0.83        0.69     0.53     0.56
  5     NLP_BFCAI [32]       0.73     0.73     0.72        0.52     0.41     0.43
  6     iREL                 0.68     0.62     0.64        0.38     0.45     0.39
  7     JUNLP                0.69     0.67     0.67        0.33     0.34     0.30
  8     PresiUniv            0.57     0.59     0.53        0.22     0.22     0.20

3. Task Description

To address word-level LI in code-mixed Tulu texts, the CoLI-Tunglish dataset is constructed using the YouTube comments collected by Hegde et al. [4]. The comments are preprocessed by removing digits, punctuation, and control characters, and the remaining content is tokenized into individual words. These words are then manually annotated by native Tulu speakers who are fluent in both Kannada and English. Inspired by Balouchzahi et al. [23], the aim of the CoLI-Tunglish task is to promote research in word-level LI in Tulu, a low-resource Indian language. Participants are invited to use the dataset, comprising Tulu, Kannada, and English language content, and develop models to categorize each word into one of seven categories: English, Tulu, Kannada, a mixture of two or three of the above languages (Mixed), a Named Entity denoting a person's name (Name) or a place (Location), or any other word (Other).

The CoLI-Tunglish dataset consists of words categorized into three distinct language classes, "Tulu", "Kannada", and "English", denoting words from these respective languages. The code-mixed nature is captured by the "Mixed" class, which is designated for words that blend words/prefixes/suffixes from Tulu, Kannada, and/or English in any order. Further, while the "Name" class is assigned to the name of a person, the "Location" class is used for geographical or place names, and any other words fall into the "Other" class. The "Mixed" category presents a significant challenge for the LI task because these words are formed by combining Tulu, Kannada, and/or English words, often mixed with corresponding affixes (prefixes and suffixes) from these languages. The beauty and complexity of these mixed-language words emerge from the unique word patterns created by social media users, highlighting the diversity and adaptability of language in digital communication. The categories, their descriptions, and sample tokens of the CoLI-Tunglish dataset are shown in Table 3, and the class-wise distribution of the dataset is shown in Table 4.

Table 3: Description and sample tokens of the classes in the CoLI-Tunglish dataset

Table 4: Class-wise distribution of the Train, Development, and Test sets

  Category    Train set   Development set   Test set
  Tulu            8,647             1,461      4,118
  English         5,499               889      2,617
  Kannada         2,068               344      1,173
  Name            1,104               162        513
  Other             506               102        200
  Mixed             403                69        194
  Location          369                54        190
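Assuming a simple two-column release format (one word and its tag per line, tab-separated, with blank lines between sentences; this layout and the file name are assumptions for illustration, not a documented specification), the class-wise counts in Table 4 can be reproduced in a few lines:

```python
# Count tag frequencies in an assumed word<TAB>tag file.
from collections import Counter

def tag_counts(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                # blank line = sentence boundary
                continue
            word, tag = line.split("\t")
            counts[tag] += 1
    return counts

print(tag_counts("coli_tunglish_train.tsv"))
# Expected to mirror Table 4, e.g. Counter({'Tulu': 8647, 'English': 5499, ...})
```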
4. Evaluation Metrics

In an imbalanced dataset, categories with a larger number of samples may dominate the weighted F1 score. A model may achieve high accuracy by simply predicting the majority class, and hence accuracy may be a misleading evaluation measure. Further, the weighted F1 score, which weights each class's F1 score by the number of samples in that class, also fails to address data imbalance. On the other hand, the macro F1 score is often used to evaluate models trained on imbalanced data, as it provides a balanced assessment of model performance across all classes regardless of class distribution: it gives equal importance to each class, making it a suitable metric in scenarios where class imbalances exist. As the CoLI-Tunglish dataset is imbalanced, the macro F1 score is used to evaluate the performance of the submitted models. It is computed with the metrics tools available in the Scikit-learn library (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

5. Baselines

To benchmark the CoLI-Tunglish dataset, several ML classifiers (Multinomial Naive Bayes (MNB), SVM, Multilayer Perceptron (MLP), Decision Tree (DT), LR, RF, and AdaBoost) are trained with TF-IDF of character n-grams in the range (1, 3), considering the top 5,000 features. Among these classifiers, the RF, DT, and SVM models gave the best performance and are therefore used as baselines for the CoLI-Tunglish shared task.

6. Overview of the Submitted Systems

A total of ten different runs were submitted by five different teams for the CoLI-Tunglish 2023 shared task, and all five teams submitted their working notes. While most of the participants experimented with different ML models, one team implemented a Transfer Learning (TL) approach. A summary of the models submitted by all five teams is given below:

Team SATLAB developed two different systems: i) a Basic System, a LIBLINEAR L2-regularized LR model trained with character n-grams in the range (1, 5), and ii) a Context-Sensitive System, a LIBLINEAR L2-regularized LR model additionally trained with the output obtained by the Basic System. Their Context-Sensitive System achieved a macro F1 score of 0.813 and secured the 1st rank in the competition. Team BFCAI explored ML models (SVM, Stochastic Gradient Descent, kNN, and MLP) trained with TF-IDF of character n-grams in the range (1, 4) and word length. Among their experiments, the SVM model performed best and secured the 2nd rank, achieving a macro F1 score of 0.812. Team Poorvi used TF-IDF of character n-grams with various ranges to train MNB, RF, LR, LinearSVC, DT, kNN, AdaBoost, One-vs-Rest, and Gradient Boosting models. Their LinearSVC model trained with TF-IDF of character n-grams in the range (1, 4) was the most effective configuration observed for the word-level LI task, and this model obtained a macro F1 score of 0.799, securing the 3rd rank in the shared task. Team MUCS proposed three different models: i) a CRF model trained with text-based features (word, word length, beginning of sentence, end of sentence, etc.), ii) an ensemble of ML classifiers (SVM, LR, and RF) with hard voting trained with fastText embeddings for words and character embeddings for Roman letters, and iii) an ensemble of ML classifiers (SVM, LR, and RF) with hard voting trained with TF-IDF of character n-grams. Among these models, the highest macro F1 score of 0.77 was reported for the CRF model, securing the 4th rank.
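Most of the ML systems above, like the task baselines in Section 5, share one recipe: TF-IDF over character n-grams feeding a linear classifier, evaluated with macro F1. The following is a minimal scikit-learn sketch of that recipe; only the feature scheme (character n-grams in the range (1, 3), top 5,000 features) is taken from the baseline description, while the toy words and tags are hypothetical.

```python
# Sketch of the dominant recipe: character n-gram TF-IDF + linear classifier,
# scored with macro F1 as in the shared task. All data here is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_words = ["movie", "super", "tumba", "undu"]            # toy words
train_tags = ["English", "English", "Kannada", "Tulu"]       # toy tags

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=5000),
    LinearSVC(),
)
clf.fit(train_words, train_tags)

test_words, test_tags = ["cinema"], ["English"]
pred = clf.predict(test_words)
print(pred, "macro F1:", f1_score(test_tags, pred, average="macro"))
```

Note that this recipe classifies every word independently; SATLAB's Context-Sensitive System goes one step further by feeding the Basic System's predictions back in as additional input for a second pass.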
Team IRLab@IITBHU used a two-step process for LI. They leveraged the mBERT model to obtain word embeddings and then applied a softmax activation function to obtain language predictions for each word in the code-mixed Tulu text. By fine-tuning mBERT on the shared task dataset and tuning hyperparameters for the Bi-LSTM layer, their model achieved a macro F1 score of 0.602, securing the 5th rank in the shared task.
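As a hedged illustration of this TL formulation (a generic sketch, not IRLab@IITBHU's exact system: their Bi-LSTM layer and hyperparameters are omitted), the snippet below fine-tunes mBERT as a token classifier with HuggingFace Transformers, aligning one word-level tag to each sub-word piece via word_ids():

```python
# Generic sketch: mBERT fine-tuned for word-level LI as token classification.
# The sentence, tags, and single gradient step are illustrative only.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["Tulu", "Kannada", "English", "Mixed", "Name", "Location", "Other"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

words = ["super", "movie", "undu"]        # hypothetical code-mixed sentence
tags = ["English", "English", "Tulu"]     # hypothetical word-level tags

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# One label per word, propagated to its sub-word pieces; special tokens
# get -100 so the loss ignores them.
labels = [(-100 if wid is None else label2id[tags[wid]])
          for wid in enc.word_ids()]
enc["labels"] = torch.tensor([labels])

out = model(**enc)        # out.loss for training, out.logits for predictions
out.loss.backward()       # one illustrative gradient step
```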
7. Results and Discussion

The best scores achieved by each team, along with those of the three baselines, are shown in Table 5, providing a comprehensive comparison of the submitted models against the baselines. This comparison reveals that four of the teams achieved better macro F1 scores than the baseline models. The highest macro F1 score of 0.813 highlights the challenging nature of the shared task. Further, among the three baselines (RF, DT, and SVM) trained with character n-grams in the range (1, 3), the RF classifier achieved the best macro F1 score of 0.744 for LI in code-mixed Tulu text.

Table 5: Results of the CoLI-Tunglish shared task

                               Weighted                       Macro
  Rank  Team Name        Precision  Recall  F1 score   Precision  Recall  F1 score
  1     SATLAB               0.898    0.901     0.898      0.851    0.783     0.813
  2     BFCAI                0.899    0.902     0.899      0.859    0.777     0.812
  3     Poorvi               0.891    0.893     0.891      0.821    0.781     0.799
  4     MUCS                 0.874    0.876     0.873      0.807    0.743     0.770
  -     RF-Baseline          0.859    0.861     0.854      0.841    0.693     0.744
  -     DT-Baseline          0.828    0.832     0.830      0.701    0.691     0.696
  -     SVM-Baseline         0.816    0.821     0.807      0.793    0.593     0.639
  5     IRLab@IITBHU         0.843    0.857     0.838      0.740    0.571     0.602

Most of the teams employed a variety of ML models (SVM, LR, RF, kNN, MLP, MNB, DT, and One-vs-Rest) for LI in code-mixed Tulu text. In addition, participants explored Stochastic Gradient Descent and boosting classifiers (AdaBoost and Gradient Boosting) to enhance performance. Among all the participants, only one team implemented the mBERT model based on the TL approach. Further, the ML models proposed by the participants are commonly trained with TF-IDF of character n-grams, and two submissions used pre-trained models for feature extraction. The TL-based model utilizes features from the fine-tuned mBERT model to train a Bi-LSTM classifier. The models and features used by the participating teams reveal the lack of computational tools for processing code-mixed Tulu text. The team that utilized an ML classifier trained on TF-IDF of character sequences, coupled with a feature selection method, outperformed the other models, including the mBERT model. This result underscores the significance of tailored feature engineering and selection strategies. It is noteworthy that most participating teams opted for language-independent features (TF-IDF of character n-grams) rather than exploring the potential of the few available pre-trained models. Surprisingly, no specific methods such as sub-word level representation, normalization, or character-level representation were explored by the participants to directly address the challenges posed by code-mixed texts. This indicates a gap in leveraging specialized techniques for handling linguistic variations in multilingual data.

8. Conclusion

LI serves as a crucial initial step for numerous NLP tasks but is often neglected for low-resource languages. Recent technological advancements have led to a significant surge in the volume of text data in low-resource languages, particularly on social media platforms, where code-mixed content, a blend of local/regional languages and English, is quite common. The combination of more than one language at the word level necessitates word-level LI in code-mixed texts. The primary objective of the CoLI-Tunglish shared task was to promote word-level LI in code-mixed Tulu texts. The task attracted considerable interest initially, with 14 teams expressing their intent to participate, ultimately resulting in the submission of ten distinct runs from five different teams. Most of the teams explored ML models trained with TF-IDF of character n-grams over different ranges of n, underscoring the limited availability of resources for the Tulu language. A stacking of ML classifiers trained with character n-grams emerged as the top performer, achieving a notable macro F1 score of 0.813. This outcome reveals the significance of effective feature engineering and highlights the substantial difficulty of the task, given the complexities introduced by code-mixing in Tulu texts. The results obtained by the participating teams suggest a promising avenue for addressing LI challenges in low-resource and code-mixed language scenarios. Word-level LI for other Dravidian languages, including Tamil, Telugu, and Malayalam, will be addressed in future work.

References

[1] A. Hegde, H. L. Shashirekha, A. K. Madasamy, B. R. Chakravarthi, A Study of Machine Translation Models for Kannada-Tulu, in: Congress on Intelligent Systems, Springer Nature Singapore, 2022, pp. 145–161.
[2] S. B. Steever, The Dravidian Languages, Routledge, 2019.
[3] K. Padmanabha, A Comparative Study of Tulu Dialects, Mangalore, 1990.
[4] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus Creation for Sentiment Analysis in Code-Mixed Tulu Text, in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, 2022, pp. 33–40.
[5] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text, in: Language Resources and Evaluation, Springer, 2022, pp. 765–806.
[6] F. Balouchzahi, H. L. Shashirekha, G. Sidorov, A. Gelbukh, A Comparative Study of Syllables and Character Level N-grams for Dravidian Multi-script and Code-Mixed Offensive Language Identification, in: Journal of Intelligent & Fuzzy Systems, IOS Press, 2022, pp. 1–11.
[7] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts, in: Acta Polytechnica Hungarica, 2022, pp. 123–141.
[8] A. Hegde, H. L. Shashirekha, Leveraging Dynamic Meta Embedding for Sentiment Analysis and Detection of Homophobic/Transphobic Content in Code-mixed Dravidian Languages, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2022.
[9] I. Jadhav, A. Kanade, V. Waghmare, S. S. Chandok, A. Jarali, Code-Mixed Hinglish to English Language Translation Framework, in: 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), 2022, pp. 684–688. doi:10.1109/ICSCDS53736.2022.9760834.
[10] K. Akhil, R. Rajimol, V. Anoop, Parts-of-Speech Tagging for Malayalam using Deep Learning Techniques, in: International Journal of Information Technology, Springer, 2020, pp. 741–748.
[11] S. Thara, P. Poornachandran, Social Media Text Analytics of Malayalam–English Code-Mixed using Deep Learning, in: Journal of Big Data, Springer, 2022, p. 45.
[12] F. Balouchzahi, H. Shashirekha, LA-SACo: A Study of Learning Approaches for Sentiments Analysis in Code-Mixing Texts, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 109–118.
[13] S. Ghosh, A. Priyankar, A. Ekbal, P. Bhattacharyya, Multitasking of Sentiment Detection and Emotion Recognition in Code-Mixed Hinglish Data, 2023, pp. 110–182.
[14] A. Hegde, H. L. Shashirekha, Learning Models for Emotion Analysis and Threatening Language Detection in Urdu Tweets, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2022.
[15] A. Hegde, S. Coelho, A. E. Dashti, H. Shashirekha, MUCS@Text-LT-EDI@ACL 2022: Detecting Sign of Depression from Social Media Text using Supervised Learning Approach, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 312–316.
[16] A. Hegde, M. D. Anusha, H. L. Shashirekha, Ensemble Based Machine Learning Models for Hate Speech and Offensive Content Identification, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2021.
[17] F. Balouchzahi, G. Sidorov, A. Gelbukh, PolyHope: Two-Level Hope Speech Detection from Tweets, in: Expert Systems with Applications, 2023, p. 120078.
[18] A. Hande, R. Priyadharshini, A. Sampath, K. P. Thamburaj, P. Chandran, B. R. Chakravarthi, Hope Speech Detection in Under-Resourced Kannada Language, in: arXiv preprint arXiv:2108.04616, 2021.
[19] I. Chaitanya, I. Madapakula, S. K. Gupta, S. Thara, Word Level Language Identification in Code-mixed Data using Word Embedding Methods for Indian Languages, in: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2018, pp. 1137–1141.
[20] S. Mandal, A. K. Singh, Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture, in: W-NUT 2018, 2018, p. 116.
[21] S. Gundapu, R. Mamidi, Word Level Language Identification in English Telugu Code Mixed Data, in: Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, 2018.
[22] S. Thara, P. Poornachandran, Transformer Based Language Identification for Malayalam-English Code-Mixed Text, in: IEEE Access, IEEE, 2021, pp. 118837–118850.
[23] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. Shashirekha, G. Sidorov, A. Gelbukh, Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022, in: Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, 2022, p. 38.
[24] S. Gundapu, R. Mamidi, Word Level Language Identification in English Telugu Code Mixed Data, in: arXiv preprint arXiv:2010.04482, 2020.
[25] S. Mandal, A. K. Singh, Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture, in: Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Association for Computational Linguistics, 2018, pp. 116–120.
[26] P. Veena, M. Anand Kumar, K. Soman, Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text, in: Computación y Sistemas, Instituto Politécnico Nacional, Centro de Investigación en Computación, 2018, pp. 65–74.
[27] V. Vajrobol, CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka Model, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 7–11. URL: https://aclanthology.org/2022.icon-wlli.2.
[28] A. Lambebo Tonja, M. Gemeda Yigezu, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 18–24. URL: https://aclanthology.org/2022.icon-wlli.4.
[29] P. Deka, N. Jyoti Kalita, S. Kumar Sarma, BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ICON 2022, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 12–17. URL: https://aclanthology.org/2022.icon-wlli.3.
[30] M. Gemeda Yigezu, A. Lambebo Tonja, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 29–33. URL: https://aclanthology.org/2022.icon-wlli.6.
[31] M. Shahiki Tash, Z. Ahani, A. Tonja, M. Gemeda, N. Hussain, O. Kolesnikova, Word Level Language Identification in Code-Mixed Kannada-English Texts using Traditional Machine Learning Algorithms, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 25–28. URL: https://aclanthology.org/2022.icon-wlli.5.
[32] S. Ismail, M. K. Gallab, H. Nayel, BoNC: Bag of N-Characters Model for Word Level Language Identification, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 34–37. URL: https://aclanthology.org/2022.icon-wlli.7.