<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asha Hegde</string-name>
          <email>hegdekasha@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fazlourrahman Balouchzahi</string-name>
          <email>fbalouchzahi2021@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabur Butt</string-name>
          <email>saburb@tec.mx</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sharal Coelho</string-name>
          <email>sharalmucs@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kavya G</string-name>
          <email>kavyamujk@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshitha S Kumar</string-name>
          <email>harshiskumar94@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonith D</string-name>
          <email>sonithksd@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shashirekha Hosahalli Lakshmaiah</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ameeta Agrawal</string-name>
          <email>ameeta@pdx.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIC</institution>
          ,
          <addr-line>IPN</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, Portland State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>IFE</institution>
          ,
          <addr-line>Tecnologico de Monterrey</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language Identification (LI) traditionally focuses on detecting languages in documents/sentences, primarily for high-resource languages like English, Spanish, German, and French. However, with growing technological advancements, LI challenges in multilingual countries like India, where users often create code-mixed content by blending local languages with English, have gained prominence. One such example is the combination of the Dravidian languages Tamil, Kannada, Malayalam, and Tulu with English, resulting in code-mixed texts. These code-mixed texts demand word-level LI to analyze and process them in multilingual settings, and such LI acts as a preliminary step for many applications. Code-mixed Dravidian languages are rarely explored in the context of word-level LI. To address this lacuna, the CoLI-Dravidian shared task focuses on word-level LI in code-mixed datasets of four Dravidian languages: Tamil, Kannada, Malayalam, and Tulu, written in Roman script. Participants of the CoLI-Dravidian shared task are assigned the task of categorizing each word in a given sequence into one of the predefined categories. Out of ten teams who submitted the predictions of their models, the top-performing models achieved macro F1 scores of 0.7656, 0.9293, 0.8939, and 0.8678 for code-mixed Tamil, Kannada, Malayalam, and Tulu texts respectively, highlighting both the difficulty of the task and the progress made on it.</p>
      </abstract>
      <kwd-group>
        <kwd>Word-level Language Identification</kwd>
        <kwd>Code-mixed</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Data Collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dravidian languages, a family of approximately 80 languages spoken by more than 220 million people
in South Asia, have a rich and ancient history. A recent study suggests that the Dravidian language
family, which includes major languages such as Tamil, Telugu, Kannada, and Malayalam, is around 4,500
years old [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. People speaking these local, native, or regional languages are also at ease using English
for everyday communication. These multilingual individuals often prefer to use multiple scripts and
languages when sharing their thoughts and opinions on social media platforms. As a result, code-mixing
has become the standard linguistic practice on social media these days [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Code-mixing can occur at
various levels, including the paragraph, sentence, or word level, and can even extend to the subword
level [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the primary tasks in computational linguistics in multilingual scenarios is to identify
the language of each word in code-mixed sentences. LI is crucial as it enables the development of more
accurate Natural Language Processing (NLP) tools, which can be applied in various applications such
as machine translation, sentiment analysis, and social media monitoring [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To tackle the challenges of word-level LI in Dravidian languages, we organized a shared task titled
"CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages" as part of
the Forum for Information Retrieval Evaluation (FIRE) 2024. The CoLI-Dravidian 2024 shared task provides
code-mixed datasets in four languages - Kannada, Tamil, Malayalam, and Tulu - aiming to foster the
development of advanced models for LI in these languages. The task was organized into two main
phases: training and validation, followed by testing. In the first phase, participants were given labeled
training and validation sets in the four languages to build and tune their models respectively. During
the testing phase, unlabeled test sets were provided in these languages, and participants were required
to run their models on the test sets and submit their predictions via the CodaLab platform
(https://codalab.lisn.upsaclay.fr/competitions/19357) for evaluation.
The participating teams were given the opportunity to make up to five submissions per language, and the
best result for each language was used for the final ranking. The predictions were evaluated based on
macro-averaged precision, recall, F1 score, and accuracy, and the final ranking was based on the
macro-averaged F1 score. Out of the 37 teams registered for this shared task, 10 teams submitted their
predictions, making it to the final rankings, and 8 teams submitted working notes.</p>
      <p>The rest of the paper is organized as follows: an overview of previous shared tasks on word-level LI in
Dravidian languages and the various approaches used by their participants is given in Section 2 (Related Works).
The datasets used in the current version of the task, together with their description and statistics,
are detailed in Section 3 (Datasets). A discussion of the different models submitted by the participants is
presented in Section 4 (System Description), followed by the final rankings and results in Section 5 (Ranking).
Finally, the findings are discussed in Section 6 (Findings), and Section 7 (Conclusion and Future Works)
outlines the overall conclusions and potential directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Code-mixing, which allows the blending of words/sub-words from multiple languages, has emerged as the
default mode of communication on social media and has gathered significant research attention, especially
in the area of word-level LI, with several notable studies contributing to the understanding of this
complex linguistic behavior. Recently, several studies have focused on LI tasks in code-mixed Dravidian
languages. The descriptions of CoLI-Kanglish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and CoLI-Tunglish [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] - our earlier shared tasks on
word-level LI - and summaries of the models submitted to them are given below:
      </p>
      <sec id="sec-2-1">
        <title>2.1. CoLI-Kanglish 2022</title>
        <p>
          In CoLI-Kanglish - a shared task [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on word-level LI in Kannada-English code-mixed texts, participants
were tasked with assigning each word to one of six categories: Kannada, English,
Kannada-English, Name, Location, and Other. The dataset was built by processing around 100,000 comments
from Kannada YouTube videos, and the words in the dataset were annotated with these six categories. The thirty
submissions received from eight teams used several Machine Learning (ML) and Deep Learning (DL)
models, including transformers like Distilled Bidirectional Encoder Representations from Transformers
(DistilBERT) and multilingual BERT (mBERT), and the best-performing model achieved an averaged macro F1 score
of 0.62. Models utilizing neural networks and transformers generally outperformed traditional ML
classifiers. Table 1 presents statistics of the dataset used in this shared task, and descriptions of the best
performing models are presented below:
        </p>
        <p>
          Vajrobol [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] fine-tuned a cased DistilBERT model - a pre-trained transformer model - for the
CoLI-Kanglish task. Their model performed exceptionally well, achieving the highest averaged macro
F1 score of 0.62 in the competition. The team's approach of leveraging a pre-trained transformer model
proved effective in tackling the complex nature of code-mixed texts. Tonja et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] explored a variety
of transformer models (BERT, mBERT, Robustly Optimized BERT Pretraining Approach (RoBERTa), and
Cross-lingual Language Modeling-RoBERTa (XLM-R)) in combination with a Long Short-Term Memory
(LSTM) architecture to capture word-level dependencies in the code-mixed Kannada-English dataset. Among
these models, their proposed BERT model demonstrated the best performance, achieving an averaged
macro F1 score of 0.61. Their extensive experimentation with multiple transformer models positioned
them second in the overall ranking, highlighting the effectiveness of multilingual transformers for this
task. Yigezu et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] focused on character-level models by implementing LSTM and Bidirectional LSTM
(BiLSTM) architectures with attention mechanisms, designed to read text as a sequence of characters.
The BiLSTM model outperformed the LSTM, likely due to its ability to capture more complex patterns
in code-mixed text, and the attention mechanism further enhanced the model's ability to focus on
important parts of the text. Their model achieved an averaged macro F1 score of 0.61, placing them in a
tie for second place with Tonja et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Deka et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] experimented with multiple transformer models
for LI and, among the models they experimented with, their BERT-based model demonstrated solid performance,
securing an averaged macro F1 score of 0.57. This placed them fourth in the overall rankings. Their
approach showcased the strength of transformer models in handling code-mixed text, particularly in
identifying Kannada and English at the word level.
          </p>
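        <p>As an illustration of this family of approaches, the following is a minimal sketch (not the authors'
code) of a character-level BiLSTM word classifier in PyTorch; the dimensions, vocabulary size, and
mean-pooling readout are assumptions, and the attention mechanism is omitted:</p>
        <preformat>
# Hedged sketch of a character-level BiLSTM word classifier.
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    def __init__(self, n_chars, n_tags, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):
        h, _ = self.lstm(self.emb(char_ids))   # (batch, chars, 2*hidden)
        return self.out(h.mean(dim=1))         # pool over characters

model = CharBiLSTM(n_chars=100, n_tags=6)       # six CoLI-Kanglish tags
scores = model(torch.randint(1, 100, (8, 12)))  # batch of 8 words, 12 chars each
        </preformat>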
      </sec>
      <sec id="sec-2-2">
        <title>2.2. CoLI-Tunglish 2023</title>
        <p>
          Hegde et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] presented the CoLI-Tunglish shared task, which focuses on word-level LI in code-mixed
Tulu texts [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This task aims to assign one of seven predefined categories to each word in code-mixed
Tulu-Kannada-English texts written in Roman script. The dataset used in this shared task consists of
user-generated comments from YouTube, which were tokenized and annotated by native speakers. The
final dataset includes words categorized into Tulu, Kannada, English, mixed-language words, names,
locations, and other categories, and the mixed category posed challenges due to its complexity. The
shared task attracted 14 teams, with 10 different submissions from 5 teams. Most teams used traditional
ML methods, exploring Support Vector Machine (SVM), k-Nearest Neighbors (kNN), and Random Forest
(RF) classifiers trained on character n-grams, and one team used a Transfer Learning (TL) approach with mBERT.
The highest-performing team achieved a macro F1 score of 0.813 with a context-sensitive Logistic
Regression (LR) model trained on character n-grams. Table 2 presents statistics of the dataset used in
this shared task, and descriptions of the best performing models are presented below:
        </p>
        <p>
          Bestgen [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] developed two systems for the CoLI-Tunglish task: a basic system and a context-sensitive
one. The basic system used a LIBLINEAR L2-regularized LR model trained on character n-grams ranging
from 1 to 5. The context-sensitive system built upon the basic system by training the LR model with
additional context-based information. Their approach was highly effective, achieving the highest
macro F1 score of 0.813 and securing first rank in the shared task. The team's use of both basic and
context-sensitive models demonstrated the importance of incorporating contextual information for
word-level LI in code-mixed text. Fetouh and Nayel [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] explored a variety of ML models, including
SVM, Stochastic Gradient Descent (SGD), kNN, and Multilayer Perceptron (MLP). These models were
trained on Term Frequency-Inverse Document Frequency (TF-IDF) of character n-grams in the range
of 1 to 4, along with word length as an additional feature. Among their experiments, the SVM model
performed the best, achieving a macro F1 score of 0.812, placing them second in the competition. Shetty
[<xref ref-type="bibr" rid="ref13">13</xref>] used TF-IDF of character n-grams in the range of 1 to 4 to train a range of models (Multinomial
Naive Bayes (MNB), RF, LR, LinearSVC, Decision Tree (DT), kNN, AdaBoost, One-vs-Rest, and Gradient
Boosting). Among the models proposed, the LinearSVC model achieved a macro F1 score of 0.799, placing them
in third place. The author's experimentation with multiple classifiers and n-gram ranges showcased
the value of using robust ML models to handle the challenges of word-level LI in code-mixed data.
Chanda et al. [<xref ref-type="bibr" rid="ref14">14</xref>] adopted a TL approach by fine-tuning an mBERT model to generate word embeddings
for Tulu code-mixed text and applied a softmax activation function to obtain language predictions for
each word. By tuning the hyperparameters of a BiLSTM layer added to the mBERT model, the team
achieved a macro F1 score of 0.602, placing fifth in the competition. While their approach using TL with
mBERT was novel, it did not outperform the traditional ML models used by other teams, indicating the
complexity of code-mixed text handling in low-resource languages.
        </p>
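        <p>For concreteness, the 'basic' system described above - an L2-regularized LR over character 1-5-grams of
each word - might be sketched as follows in scikit-learn (the liblinear solver approximates the LIBLINEAR
setup; the context-sensitive variant, which additionally encodes neighboring words, is not shown):</p>
        <preformat>
# Minimal sketch of a character n-gram LR word classifier; the toy
# tokens and exact vectorizer settings are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

basic = Pipeline([
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("lr", LogisticRegression(penalty="l2", solver="liblinear")),
])
words = ["yaan", "kandapini", "movie"]   # hypothetical Tulu/English tokens
tags = ["Tulu", "Tulu", "English"]
basic.fit(words, tags)
print(basic.predict(["cinema"]))
        </preformat>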
      <p>In summary, the word-level LI shared tasks in the Kannada and Tulu languages
have given researchers ample opportunities to process code-mixed texts and to explore various learning
models for word-level LI in these languages.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. CoLI-Dravidian 2024 Dataset</title>
      <p>
        In continuation of our earlier shared tasks - CoLI-Kanglish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and CoLI-Tunglish [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the CoLI-Dravidian
2024 shared task aims to advance research in word-level LI in four code-mixed Dravidian languages:
Tamil, Kannada, Malayalam, and Tulu. The goal of this shared task is to invite researchers to develop
models that categorize each word in a given text into one of the predefined labels:
Tamil/Kannada/Malayalam/Tulu/English, mixed-language content (Mixed), named entities such as names (Name) and
locations (Location), numbers (Number), and words that do not fit into any category (Other). While the Tamil,
Kannada, and Malayalam datasets have two distinct language classes - Tamil/Kannada/Malayalam and
English - the Tulu dataset has three distinct languages: Tulu, Kannada, and English. Digits are denoted as
'Number', the 'Name' class is assigned to person names, the 'Location' class is used for geographical locations,
the 'Mixed' class is designated for words that blend words/suffixes from Dravidian languages and/or the English
language in any order, and the remaining words fall into the 'Other' class for unclassified terms. The 'Mixed'
category presents a significant challenge for the LI task because these words are formed by the combination
of Dravidian-language and/or English words, often mixed with the corresponding affixes (prefixes and
suffixes) from these languages. The beauty and complexity of these mixed-language words emerge
from the unique word patterns created by social media users, highlighting the diversity and adaptability
of language in digital communication.
      </p>
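      <p>For illustration, a hypothetical (not drawn from the dataset) annotated Kannada-English sequence under
this tag scheme could look as follows, including a 'Mixed' word formed from an English stem with a
Kannada suffix:</p>
      <preformat>
# Purely illustrative word-label pairs; not taken from the shared-task data.
example = [
    ("naanu", "Kannada"),      # 'I'
    ("yesterday", "English"),
    ("Bangalore", "Location"),
    ("alli", "Kannada"),       # 'at/in'
    ("2", "Number"),
    ("movies", "English"),
    ("nodide", "Kannada"),     # 'watched'
    ("superagittu", "Mixed"),  # English 'super' + Kannada suffix
]
      </preformat>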
      <p>
        To address word-level LI in code-mixed Kannada, Tamil, and Malayalam texts, YouTube comments
were collected using a custom-built scraper. The comments underwent pre-processing to remove
punctuation and control characters, followed by tokenization into individual words. Each word was
then manually annotated by a native speaker fluent in the regional language (Kannada, Tamil, or
Malayalam) and English. Further, the dataset used in the CoLI-Tunglish 2023 shared task
(https://sites.google.com/view/coli-tunglish/home) is reused for word-level LI in code-mixed Tulu text in this shared task [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. This task challenges researchers
to create models that effectively handle the linguistic complexity and diversity of code-mixed Dravidian
texts. The statistics of the class-wise distribution of the CoLI-Dravidian datasets are shown in Figure 1.
      </p>
      <sec id="sec-3-1">
        <title>4https://sites.google.com/view/coli-tunglish/home</title>
        <p>(a) Tamil
(b) Kannada
(c) Malayalam
(d) Tulu</p>
      </sec>
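      <p>The pre-processing described above can be approximated with a short script; the exact regular
expressions and tokenizer used by the organizers are not specified, so the following is only a sketch
under those assumptions:</p>
      <preformat>
# Sketch of the described pipeline: strip control characters and
# punctuation, then split comments into words (regex is an assumption).
import re
import unicodedata

def preprocess(comment):
    # drop control characters
    comment = "".join(ch for ch in comment
                      if unicodedata.category(ch) != "Cc")
    # replace punctuation with spaces, keep letters/digits/whitespace
    comment = re.sub(r"[^\w\s]", " ", comment)
    return comment.split()

print(preprocess("Super movie!! naanu nodide :)"))
# ['Super', 'movie', 'naanu', 'nodide']
      </preformat>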
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <p>To benchmark the datasets used in the CoLI-Dravidian shared task, experiments were conducted with different
ML classifiers (SVM, MLP, DT, LR, RF, and AdaBoost) trained with TF-IDF of character n-grams in the
range (1, 5). Among these classifiers, SVM, LR, and DT performed better and were therefore used as
baselines for the shared task; a minimal sketch of this baseline setup is given below. More than 100
distinct predictions per language were submitted by 10 different teams. The descriptions of the models
submitted by the participants and their performances are as follows:</p>
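      <p>The following is a hedged sketch of the baseline setup, assuming each word is treated as an
independent training sample (the toy tokens are hypothetical):</p>
      <preformat>
# TF-IDF of character n-grams in the range (1, 5) feeding a linear SVM,
# mirroring the described baseline; data and settings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_words = ["naanu", "movie", "nodide"]     # hypothetical tokens
train_tags = ["Kannada", "English", "Kannada"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("svm", LinearSVC()),
])
baseline.fit(train_words, train_tags)
print(baseline.predict(["cinema"]))
      </preformat>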
      <p>Team PonsubashRaj explored MNB, LR, DT, SVM, and voting classifiers, trained with count
vectors and TF-IDF vectors of character sequences. Their proposed voting classifiers trained with count
vectors of character sequences secured 1st, 5th, 2nd, and 4th ranks for Tamil, Kannada, Malayalam,
and Tulu texts respectively.</p>
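      <p>In the spirit of this system, a hard-voting ensemble over count vectors of character sequences could be
sketched as follows (the team's exact classifier mix and settings are assumptions):</p>
      <preformat>
# Hedged sketch of a voting classifier over character count vectors.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

voting = Pipeline([
    ("counts", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("vote", VotingClassifier([
        ("mnb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
    ], voting="hard")),
])
# fit/predict on (word, tag) pairs exactly as in the baseline sketch above
voting.fit(["naanu", "movie", "nodide"], ["Kannada", "English", "Kannada"])
      </preformat>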
      <p>Team Kaivalya fine-tuned Multilingual Representations for Indian Languages (MuRIL) and mBERT
pre-trained models for the word-level LI task for all four languages and found that the MuRIL models
outperformed the mBERT models, achieving 3rd, 1st, 2nd, and 2nd ranks for Tamil, Kannada, Malayalam, and
Tulu texts respectively.</p>
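      <p>Word-level LI with MuRIL is naturally framed as token classification. A sketch using the Hugging Face
transformers library is given below; the checkpoint name is the public one, while the label set and the
word-alignment step are assumptions about the team's setup:</p>
      <preformat>
# Sketch of MuRIL-based token classification for word-level LI.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["Kannada", "English", "Mixed", "Name",
          "Location", "Number", "Other"]
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased", num_labels=len(labels))

words = ["naanu", "movie", "nodide"]       # one hypothetical comment
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits           # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)
# enc.word_ids() maps subwords back to words; after fine-tuning, a
# word's label is typically read off its first subword.
      </preformat>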
      <p>Team NLPnorth used the MACHAMP toolkit (https://github.com/machamp-nlp/machamp) to fine-tune a wide range of transformer models and
picked the best five language models based on their performances on the development sets. Further,
they added a Conditional Random Field (CRF) layer to the MACHAMP model to capture the label dependencies
between consecutive words and obtained 4th, 2nd, 1st, and 1st ranks for Tamil, Kannada, Malayalam,
and Tulu texts respectively.</p>
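      <p>Independently of MACHAMP, the idea of a CRF output layer over per-token scores can be sketched with
the pytorch-crf package (an assumed stand-in, not MACHAMP's internal implementation):</p>
      <preformat>
# Minimal CRF-over-emissions sketch using the pytorch-crf package.
import torch
from torchcrf import CRF

num_tags = 7
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 5, num_tags)   # e.g. transformer logits: (batch, seq, tags)
tags = torch.randint(0, num_tags, (2, 5))
mask = torch.ones(2, 5, dtype=torch.bool)

loss = -crf(emissions, tags, mask=mask)        # negative log-likelihood to minimize
best_paths = crf.decode(emissions, mask=mask)  # most likely tag sequences
      </preformat>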
      <p>Team Awsathama conducted a wide range of experiments using ML classifiers (MNB, LR, Support
Vector Classifier (SVC), kNN, DT, RF, Light Gradient Boosting Machine (LightGBM), Extreme Gradient
Boosting (XGBoost), and Categorical Boosting (CatBoost)) trained with count vectors and TF-IDF
vectors of character sequences. Their proposed XGBoost and SVC models trained with count vectors
obtained 2nd and 3rd ranks for Tamil and Malayalam texts respectively. Further, their SVC models
trained with TF-IDF vectors obtained 3rd and 4th ranks for Tulu and Kannada texts respectively.</p>
        <p>Team MUCS employed deep neural network models, implementing two sequence labeling models: i)
CoLi_CNN - a Convolutional Neural Network (CNN) model trained with MuRIL embeddings, and ii)
CoLi_TNN - a transformer neural network trained from scratch - as well as a sequence-to-sequence learning
model with a BiLSTM encoder and an LSTM decoder. Their proposed CoLi_CNN model obtained 6th rank
for all four languages.</p>
        <p>Team MUCSNLPLAB trained CRF models with text features (word length, previous word, and next
word); by tuning the hyperparameters, their proposed models obtained 3rd, 7th, 9th, and 7th ranks
for Tamil, Kannada, Malayalam, and Tulu texts respectively.</p>
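        <p>A feature-based CRF of this kind can be sketched with sklearn-crfsuite; the exact feature set and
hyperparameters used by the team are assumptions here:</p>
        <preformat>
# Hedged sketch of a CRF over simple word features (length, neighbors).
import sklearn_crfsuite

def word_features(sent, i):
    return {
        "word": sent[i],
        "length": len(sent[i]),
        "prev_word": sent[i - 1] if i != 0 else "BOS",
        "next_word": sent[i + 1] if i + 1 != len(sent) else "EOS",
    }

sents = [["naanu", "movie", "nodide"]]             # hypothetical comment
tags = [["Kannada", "English", "Kannada"]]
X = [[word_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))
        </preformat>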
        <p>Team TextTitans proposed a prompt-based method using GPT-3.5 Turbo, a large language model,
to perform word-level LI in Tamil and Kannada texts and obtained 10th rank for both languages.</p>
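        <p>The team's exact prompt is not reproduced here; an assumed shape of such a prompt-based call, using
the OpenAI Python SDK, is sketched below:</p>
        <preformat>
# Assumed prompt shape for word-level LI with GPT-3.5 Turbo; the
# actual prompt, in-context examples, and output parsing may differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
prompt = (
    "Label each word of this romanized Tamil-English comment with one of: "
    "Tamil, English, Mixed, Name, Location, Number, Other.\n"
    "Comment: padam semma mass"        # hypothetical comment
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
        </preformat>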
        <p>Team abadian trained ML classifiers (SVM, SGD, kNN, and MLP) with TF-IDF of character sequences
for word-level LI in Kannada and Malayalam texts, and their proposed SVM model obtained 8th and 10th
ranks for Kannada and Tulu texts respectively.</p>
        <p>The findings reveal that a significant number of participants experimented with different transformer
models, while a few others opted for traditional ML techniques, and a smaller group focused on DL
models. This diversity in approaches highlights the evolving landscape of the techniques used in the
shared task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Ranking</title>
      <p>Conventionally, word-level LI datasets are imbalanced, and this can skew model evaluation. Hence,
using both macro and weighted F1 scores provides a more comprehensive assessment, as the macro average treats all
classes equally while the weighted average accounts for class imbalance based on class frequency. Together,
these metrics offer a better evaluation of model performance across all the classes. The predictions
submitted by the participants of the shared task were evaluated based on macro F1 scores to rank the
teams, and ranking ties were resolved considering the weighted F1 score. Tables 3 and 4 present the
performances of the participating teams in the shared task along with the baselines.</p>
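      <p>The difference between the two averages is easy to see on a toy prediction (the labels below are
illustrative, not task data):</p>
      <preformat>
# Macro F1 weights every class equally; weighted F1 scales each
# class's F1 by its support, so frequent classes dominate.
from sklearn.metrics import f1_score

y_true = ["Kannada", "Kannada", "Kannada", "English", "Mixed"]
y_pred = ["Kannada", "Kannada", "Kannada", "English", "English"]

print(f1_score(y_true, y_pred, average="macro"))     # punished by the missed 'Mixed'
print(f1_score(y_true, y_pred, average="weighted"))  # dominated by frequent 'Kannada'
      </preformat>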
      <p>The top four teams surpassed the baseline models, achieving macro F1 scores of 0.7656, 0.9293,
0.8939, and 0.8678 for code-mixed Tamil, Kannada, Malayalam, and Tulu texts, respectively, reflecting
the difficulty and competitiveness of the shared task. This result underscores the advancement made by
the top teams in addressing the task's challenges.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Findings</title>
      <p>37 teams registered for this shared task and 10 teams submitted their results for all four languages.
Figure 2 gives a glimpse of the number of teams and the learning approaches used by them to address
word-level LI. Most of the teams incorporated ML models using language-independent feature
extraction techniques, like TF-IDF and CountVectorizer, while a few teams leveraged TL to improve
their models' performance for low-resource languages like Tulu. This approach demonstrates the
flexibility of such models in handling languages that are not part of the original training data. Only one team
employed DL models, incorporating MuRIL embeddings - a language-dependent representation -
and Keras embeddings - a language-independent representation. Their proposed methodology found the
DL classifier trained with MuRIL embeddings to be more beneficial for performing the word-level LI
task. This suggests that language-specific embeddings like MuRIL can provide a significant advantage
in handling tasks for specific languages.</p>
      <p>Participants also encountered challenges while working with code-mixed text in Roman script.
To overcome this, they either fine-tuned suitable pre-trained models for the datasets or employed
language-independent feature extraction methods. However, language-dependent resources for Tulu
remain limited compared to the other languages. Further, the issue of extreme class imbalance in the given
datasets was not addressed by any of the participants.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Works</title>
      <p>This paper describes CoLI-Dravidian 2024 - a word-level LI shared task - and presents the findings of the task.
The task focused on four low-resource Dravidian languages - Tamil, Kannada, Malayalam, and Tulu -
intertwined with English, reflecting the real-world linguistic dynamics of multilingual communities in
the digital age. Further, it underscores the importance of recognizing the unique characteristics of these
low-resource languages and highlights the efforts to preserve linguistic diversity in an increasingly
interconnected world.</p>
      <p>The fine-tuned MuRIL model excelled for Kannada, achieving the highest macro F1 score of 0.9293,
and also performed well for Tulu with a macro F1 score of 0.8585, underscoring its versatility in handling
less commonly studied languages in the Dravidian family. For Malayalam, the MACHAMP model, with
an added CRF layer, achieved the best result with a macro F1 score of 0.8939, showcasing its effectiveness
in capturing language sequences. In the case of Tamil, a voting classifier trained on character sequences
produced the highest score of 0.7656, which highlights the need for further refinement of models for this
language, potentially through more sophisticated contextual understanding. The effectiveness of these
methods depends heavily on the linguistic and code-mixing properties of each Dravidian language.</p>
      <p>By using the datasets of this shared task, researchers can focus on adding more context and improving
transformer models to better understand the unique details of Dravidian languages in real-world tasks
like sentiment analysis, translation, and monitoring social media. The shared task’s outcomes emphasize
the importance of continued research into code-mixed LI, which is crucial for preserving linguistic
diversity in the digital age.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kolipakam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Greenhill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bouckaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Verkerk</surname>
          </string-name>
          ,
          <article-title>A Bayesian Phylogenetic Study of the Dravidian Language Family</article-title>
          ,
          <source>Royal Society open science 5</source>
          (
          <year>2018</year>
          )
          <fpage>171504</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022</article-title>
          , in:
          <source>Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          (
          <year>2022</year>
          )
          <fpage>38</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <article-title>LAs for HASOC - Learning Approaches for Hate Speech and Offensive Content Identification</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Corpus Creation for Sentiment Analysis in Code-mixed Tulu Text</article-title>
          , in:
          <source>Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Text at FIRE 2023</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vajrobol</surname>
          </string-name>
          , CoLI-Kanglish:
          <article-title>Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka Model</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <article-title>BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task @ ICON 2022</article-title>
          , in:
          <source>Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>CoLI@FIRE2023: Findings of Word-level Language Identification in Code-mixed Tulu Text</article-title>
          , in:
          <source>Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '23</source>
          , Association for Computing Machinery,
          <year>2024</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bestgen</surname>
          </string-name>
          ,
          <article-title>Using Character Ngrams for Word-Level Language Identification in Trilingual Code-Mixed Data (and Even More)</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Fetouh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <article-title>BFCAI at CoLI-Tunglish@FIRE 2023: Machine Learning Based Model for Word-level Language Identification in Code-mixed Tulu Texts</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <article-title>Word-Level Language Identification of Code-Mixed Tulu-English Data</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Advancing Language Identification in Code-Mixed Tulu Texts: Harnessing Deep Learning Techniques</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>