<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asha Hegde</string-name>
          <email>hegdekasha@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fazlourrahman Balouchzahi</string-name>
          <email>fbalouchzahi2021@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabur Butt</string-name>
          <email>saburb@tec.mx</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sharal Coelho</string-name>
          <email>sharalmucs@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kavya G</string-name>
          <email>kavyamujk@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshitha S Kumar</string-name>
          <email>harshiskumar94@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonith D</string-name>
          <email>sonithksd@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shashirekha Hosahalli Lakshmaiah</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ameeta Agrawal</string-name>
          <email>ameeta@pdx.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIC</institution>
          ,
          <addr-line>IPN</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, Portland State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>IFE</institution>
          ,
          <addr-line>Tecnologico de Monterrey</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language Identification (LI) traditionally focuses on detecting languages in documents/sentences, primarily for high-resource languages like English, Spanish, German, and French. However, with growing technological advancements, LI challenges in multilingual countries like India, where users often create code-mixed content by blending local languages with English, have gained prominence. One such example is the combination of the Dravidian languages Tamil, Kannada, Malayalam, and Tulu with English, resulting in code-mixed texts. These code-mixed texts demand word-level LI to analyze and process them in multilingual settings, and such LI acts as a preliminary step for many applications. Code-mixed Dravidian languages are rarely explored in the context of word-level LI. To address this lacuna, the CoLI-Dravidian shared task focuses on word-level LI in code-mixed datasets of four Dravidian languages: Tamil, Kannada, Malayalam, and Tulu, written in Roman script. Participants of the CoLI-Dravidian shared task are assigned the task of categorizing each word in a given sequence into one of the predefined categories. Out of ten teams who submitted the predictions of their models, the top-performing models achieved macro F1 scores of 0.7656, 0.9293, 0.8939, and 0.8678 for code-mixed Tamil, Kannada, Malayalam, and Tulu texts respectively, highlighting both the difficulty of the task and the progress made on it.</p>
      </abstract>
      <kwd-group>
        <kwd>Word-level Language Identification</kwd>
        <kwd>Code-mixed</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Data Collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dravidian languages, a family of approximately 80 languages spoken by more than 220 million people
in South Asia, have a rich and ancient history. A recent study suggests that the Dravidian language
family, which includes major languages such as Tamil, Telugu, Kannada, and Malayalam, is around 4,500
years old [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. People speaking these local, native, or regional languages are also at ease using English
for everyday communication. These multilingual individuals often prefer to use multiple scripts and
languages when sharing their thoughts and opinions on social media platforms. As a result, code-mixing
has become the standard linguistic practice on social media these days [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Code-mixing can occur at
various levels, including the paragraph, sentence, or word level, and can even extend to the subword
level [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the primary tasks in computational linguistics in multilingual scenarios is to identify
the language of each word in code-mixed sentences. LI is crucial as it enables the development of more
accurate Natural Language Processing (NLP) tools, which can be applied in various applications such
as machine translation, sentiment analysis, and social media monitoring [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To tackle the challenges of word-level LI in Dravidian languages, we organized a shared task titled
"CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages" as part of
the Forum for Information Retrieval Evaluation (FIRE) 2024. The CoLI-Dravidian 2024 shared task provides
code-mixed datasets in four languages - Kannada, Tamil, Malayalam, and Tulu - aiming to foster the
development of advanced models for LI in these languages. The task was organized into two main
phases: training and validation, followed by testing. In the first phase, participants were given labeled
training and validation sets in the four languages to build and tune their models respectively. During
the testing phase, unlabeled test sets were provided in these languages, and participants were required
to run their models on the test sets and submit their predictions via the CodaLab platform
(https://codalab.lisn.upsaclay.fr/competitions/19357) for evaluation.
The participating teams were given the opportunity to make up to five submissions per language, and the
best result for each language was used for the final ranking. The predictions were evaluated based on
macro-averaged precision, recall, F1 score, and accuracy, and the final ranking was based on the
macro-averaged F1 score. Out of the 37 teams registered for this shared task, 10 teams submitted their
predictions, making it to the final rankings, and 8 teams submitted working notes.</p>
      <p>The rest of the paper is organized as follows: an overview of previous shared tasks on word-level LI in
Dravidian languages and the various approaches used by their participants is given in Section 2 (Related Works).
The datasets used in the current version of the task, together with their description and statistics,
are detailed in Section 3 (Datasets). A discussion of the different models submitted by the participants is
presented in Section 4 (System Description), followed by the final rankings and results in Section 5 (Ranking).
Finally, the findings are discussed in Section 6 (Findings), and Section 7 (Conclusion and Future Works)
outlines the overall conclusions and potential directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Code-mixing, which allows the blending of words/sub-words from multiple languages, has emerged as the
default mode of communication on social media and has gathered significant research attention, especially
in the area of word-level LI, with several notable studies contributing to the understanding of this
complex linguistic behavior. Recently, several studies have focused on LI tasks in code-mixed Dravidian
languages. The descriptions of CoLI-Kanglish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and CoLI-Tunglish [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] - our earlier shared tasks on
word-level LI - and summaries of the models submitted to them are given below:
      </p>
      <sec id="sec-2-1">
        <title>2.1. CoLI-Kanglish 2022</title>
        <p>
          In CoLI-Kanglish - a shared task [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on word-level LI in Kannada-English code-mixed texts, participants
were tasked with assigning each word to one of six categories: Kannada, English,
Kannada-English, Name, Location, and Other. The dataset was built by processing around 100,000 comments
from Kannada YouTube videos, and the words in the dataset were annotated with these six categories. The thirty
submissions received from eight teams used several Machine Learning (ML) and Deep Learning (DL)
models, including transformers like Distilled Bidirectional Encoder Representations from Transformers
(DistilBERT) and multilingual BERT (mBERT), and the best-performing model achieved an averaged macro F1 score
of 0.62. Models utilizing neural networks and transformers generally outperformed traditional ML
classifiers. Table 1 presents statistics of the dataset used in this shared task, and descriptions of the best
performing models are presented below:
        </p>
        <p>
          Vajrobol [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] fine-tuned a cased DistilBERT model - a pre-trained transformer model - for the
CoLI-Kanglish task. Their model performed exceptionally well, achieving the highest averaged macro
F1 score of 0.62 in the competition. The team's approach of leveraging a pre-trained transformer model
proved effective in tackling the complex nature of code-mixed texts. Tonja et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] explored a variety
of transformer models (BERT, mBERT, Robustly Optimized BERT Pretraining Approach (RoBERTa), and
Cross-lingual Language Modeling-RoBERTa (XLM-R)) in combination with a Long Short-Term Memory
(LSTM) architecture to capture word-level dependencies in the code-mixed Kannada-English dataset. Among
these models, their proposed BERT model demonstrated the best performance, achieving an averaged
macro F1 score of 0.61. Their extensive experimentation with multiple transformer models positioned
them second in the overall ranking, highlighting the effectiveness of multilingual transformers for this
task. Yigezu et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] focused on character-level models by implementing LSTM and Bidirectional LSTM
(BiLSTM) architectures with attention mechanisms, designed to read text as a sequence of characters.
The BiLSTM model outperformed the LSTM, likely due to its ability to capture more complex patterns
in code-mixed text, and the attention mechanism further enhanced the model's ability to focus on
important parts of the text. Their model achieved an averaged macro F1 score of 0.61, placing them in a
tie for second place with Tonja et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Deka et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] experimented with multiple transformer models
for LI and, among the models they experimented with, their BERT-based model demonstrated solid performance,
securing an averaged macro F1 score of 0.57. This placed them fourth in the overall rankings. Their
approach showcased the strength of transformer models in handling code-mixed text, particularly in
identifying Kannada and English at the word level.
          </p>
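        <p>As an illustration of this family of approaches, the following is a minimal sketch (not the authors'
code) of a character-level BiLSTM word classifier in PyTorch; the dimensions, vocabulary size, and
mean-pooling readout are assumptions, and the attention mechanism is omitted:</p>
        <preformat>
# Hedged sketch of a character-level BiLSTM word classifier.
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    def __init__(self, n_chars, n_tags, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):
        h, _ = self.lstm(self.emb(char_ids))   # (batch, chars, 2*hidden)
        return self.out(h.mean(dim=1))         # pool over characters

model = CharBiLSTM(n_chars=100, n_tags=6)       # six CoLI-Kanglish tags
scores = model(torch.randint(1, 100, (8, 12)))  # batch of 8 words, 12 chars each
        </preformat>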
      </sec>
      <sec id="sec-2-2">
        <title>2.2. CoLI-Tunglish 2023</title>
        <p>
          Hegde et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] presented the CoLI-Tunglish shared task, which focuses on word-level LI in code-mixed
Tulu texts [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This task aims to assign one of seven predefined categories to each word in code-mixed
Tulu-Kannada-English texts written in Roman script. The dataset used in this shared task consists of
user-generated comments from YouTube, which were tokenized and annotated by native speakers. The
final dataset includes words categorized into Tulu, Kannada, English, mixed-language words, names,
locations, and other categories, and the mixed category posed challenges due to its complexity. The
shared task attracted 14 teams, with 10 different submissions from 5 teams. Most teams used traditional
ML methods, exploring Support Vector Machine (SVM), k-Nearest Neighbors (kNN), and Random Forest
(RF) classifiers trained on character n-grams, and one team used a Transfer Learning (TL) approach with mBERT.
The highest-performing team achieved a macro F1 score of 0.813 with a context-sensitive Logistic
Regression (LR) model trained on character n-grams. Table 2 presents statistics of the dataset used in
this shared task, and descriptions of the best performing models are presented below:
        </p>
        <p>
          Bestgen [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] developed two systems for the CoLI-Tunglish task: a basic system and a context-sensitive
one. The basic system used a LIBLINEAR L2-regularized LR model trained on character n-grams ranging
from 1 to 5. The context-sensitive system built upon the basic system by training the LR model with
additional context-based information. Their approach was highly effective, achieving the highest
macro F1 score of 0.813 and securing first rank in the shared task. The team's use of both basic and
context-sensitive models demonstrated the importance of incorporating contextual information for
word-level LI in code-mixed text. Fetouh and Nayel [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] explored a variety of ML models, including
SVM, Stochastic Gradient Descent (SGD), kNN, and Multilayer Perceptron (MLP). These models were
trained on Term Frequency-Inverse Document Frequency (TF-IDF) of character n-grams in the range
of 1 to 4, along with word length as an additional feature. Among their experiments, the SVM model
performed the best, achieving a macro F1 score of 0.812, placing them second in the competition. Shetty
[<xref ref-type="bibr" rid="ref13">13</xref>] used TF-IDF of character n-grams in the range of 1 to 4 to train a range of models (Multinomial
Naive Bayes (MNB), RF, LR, LinearSVC, Decision Tree (DT), kNN, AdaBoost, One-vs-Rest, and Gradient
Boosting). Among the models proposed, the LinearSVC model achieved a macro F1 score of 0.799, placing them
in third place. The author's experimentation with multiple classifiers and n-gram ranges showcased
the value of using robust ML models to handle the challenges of word-level LI in code-mixed data.
Chanda et al. [<xref ref-type="bibr" rid="ref14">14</xref>] adopted a TL approach by fine-tuning an mBERT model to generate word embeddings
for Tulu code-mixed text and applied a softmax activation function to obtain language predictions for
each word. By tuning the hyperparameters of a BiLSTM layer added to the mBERT model, the team
achieved a macro F1 score of 0.602, placing fifth in the competition. While their approach using TL with
mBERT was novel, it did not outperform the traditional ML models used by other teams, indicating the
complexity of code-mixed text handling in low-resource languages.
        </p>
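        <p>For concreteness, the 'basic' system described above - an L2-regularized LR over character 1-5-grams of
each word - might be sketched as follows in scikit-learn (the liblinear solver approximates the LIBLINEAR
setup; the context-sensitive variant, which additionally encodes neighboring words, is not shown):</p>
        <preformat>
# Minimal sketch of a character n-gram LR word classifier; the toy
# tokens and exact vectorizer settings are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

basic = Pipeline([
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("lr", LogisticRegression(penalty="l2", solver="liblinear")),
])
words = ["yaan", "kandapini", "movie"]   # hypothetical Tulu/English tokens
tags = ["Tulu", "Tulu", "English"]
basic.fit(words, tags)
print(basic.predict(["cinema"]))
        </preformat>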
      <p>In summary, the word-level LI shared tasks in the Kannada and Tulu languages
have given researchers ample opportunities to process code-mixed texts and to explore various learning
models for word-level LI in these languages.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. CoLI-Dravidian 2024 Dataset</title>
      <p>
        In continuation of our earlier shared tasks - CoLI-Kanglish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and CoLI-Tunglish [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the CoLI-Dravidian
2024 shared task aims to advance research in word-level LI in four code-mixed Dravidian languages:
Tamil, Kannada, Malayalam, and Tulu. The goal of this shared task is to invite researchers to develop
models that categorize each word in a given text into one of the predefined labels:
Tamil/Kannada/Malayalam/Tulu/English, mixed-language content (Mixed), named entities such as names (Name) and
locations (Location), numbers (Number), and words that do not fit into any category (Other). While the Tamil,
Kannada, and Malayalam datasets have two distinct language classes - Tamil/Kannada/Malayalam and
English - the Tulu dataset has three distinct languages: Tulu, Kannada, and English. Digits are denoted as
'Number', the 'Name' class is assigned to person names, the 'Location' class is used for geographical locations,
the 'Mixed' class is designated for words that blend words/suffixes from Dravidian languages and/or the English
language in any order, and the remaining words fall into the 'Other' class for unclassified terms. The 'Mixed'
category presents a significant challenge for the LI task because these words are formed by the combination
of Dravidian-language and/or English words, often mixed with the corresponding affixes (prefixes and
suffixes) from these languages. The beauty and complexity of these mixed-language words emerge
from the unique word patterns created by social media users, highlighting the diversity and adaptability
of language in digital communication.
      </p>
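      <p>For illustration, a hypothetical (not drawn from the dataset) annotated Kannada-English sequence under
this tag scheme could look as follows, including a 'Mixed' word formed from an English stem with a
Kannada suffix:</p>
      <preformat>
# Purely illustrative word-label pairs; not taken from the shared-task data.
example = [
    ("naanu", "Kannada"),      # 'I'
    ("yesterday", "English"),
    ("Bangalore", "Location"),
    ("alli", "Kannada"),       # 'at/in'
    ("2", "Number"),
    ("movies", "English"),
    ("nodide", "Kannada"),     # 'watched'
    ("superagittu", "Mixed"),  # English 'super' + Kannada suffix
]
      </preformat>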
      <p>
        To address word-level LI in code-mixed Kannada, Tamil, and Malayalam texts, YouTube comments
were collected using a custom-built scraper. The comments underwent pre-processing to remove
punctuation and control characters, followed by tokenization into individual words. Each word was
then manually annotated by a native speaker fluent in the regional language (Kannada, Tamil, or
Malayalam) and English. Further, the dataset used in the CoLI-Tunglish 2023 shared task
(https://sites.google.com/view/coli-tunglish/home) is reused for word-level LI in code-mixed Tulu text in this shared task [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. This task challenges researchers
to create models that effectively handle the linguistic complexity and diversity of code-mixed Dravidian
texts. The statistics of the class-wise distribution of the CoLI-Dravidian datasets are shown in Figure 1.
      </p>
      <sec id="sec-3-1">
        <title>4https://sites.google.com/view/coli-tunglish/home</title>
        <p>(a) Tamil
(b) Kannada
(c) Malayalam
(d) Tulu</p>
      </sec>
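      <p>The pre-processing described above can be approximated with a short script; the exact regular
expressions and tokenizer used by the organizers are not specified, so the following is only a sketch
under those assumptions:</p>
      <preformat>
# Sketch of the described pipeline: strip control characters and
# punctuation, then split comments into words (regex is an assumption).
import re
import unicodedata

def preprocess(comment):
    # drop control characters
    comment = "".join(ch for ch in comment
                      if unicodedata.category(ch) != "Cc")
    # replace punctuation with spaces, keep letters/digits/whitespace
    comment = re.sub(r"[^\w\s]", " ", comment)
    return comment.split()

print(preprocess("Super movie!! naanu nodide :)"))
# ['Super', 'movie', 'naanu', 'nodide']
      </preformat>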
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <p>To benchmark the datasets used in the CoLI-Dravidian shared task, experiments were conducted with different
ML classifiers (SVM, MLP, DT, LR, RF, and AdaBoost) trained with TF-IDF of character n-grams in the
range (1, 5). Among these classifiers, SVM, LR, and DT performed better and were therefore used as
baselines for the shared task; a minimal sketch of this baseline setup is given below. More than 100
distinct predictions per language were submitted by 10 different teams. The descriptions of the models
submitted by the participants and their performances are as follows:</p>
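      <p>The following is a hedged sketch of the baseline setup, assuming each word is treated as an
independent training sample (the toy tokens are hypothetical):</p>
      <preformat>
# TF-IDF of character n-grams in the range (1, 5) feeding a linear SVM,
# mirroring the described baseline; data and settings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_words = ["naanu", "movie", "nodide"]     # hypothetical tokens
train_tags = ["Kannada", "English", "Kannada"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("svm", LinearSVC()),
])
baseline.fit(train_words, train_tags)
print(baseline.predict(["cinema"]))
      </preformat>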
      <p>Team PonsubashRaj explored MNB, LR, DT, SVM, and voting classifiers, trained with count
vectors and TF-IDF vectors of character sequences. Their proposed voting classifiers trained with count
vectors of character sequences secured 1st, 5th, 2nd, and 4th ranks for Tamil, Kannada, Malayalam,
and Tulu texts respectively.</p>
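      <p>In the spirit of this system, a hard-voting ensemble over count vectors of character sequences could be
sketched as follows (the team's exact classifier mix and settings are assumptions):</p>
      <preformat>
# Hedged sketch of a voting classifier over character count vectors.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

voting = Pipeline([
    ("counts", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("vote", VotingClassifier([
        ("mnb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
    ], voting="hard")),
])
# fit/predict on (word, tag) pairs exactly as in the baseline sketch above
voting.fit(["naanu", "movie", "nodide"], ["Kannada", "English", "Kannada"])
      </preformat>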
      <p>Team Kaivalya fine-tuned Multilingual Representations for Indian Languages (MuRIL) and mBERT
pre-trained models for the word-level LI task for all four languages and found that the MuRIL models
outperformed the mBERT models, achieving 3rd, 1st, 2nd, and 2nd ranks for Tamil, Kannada, Malayalam, and
Tulu texts respectively.</p>
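      <p>Word-level LI with MuRIL is naturally framed as token classification. A sketch using the Hugging Face
transformers library is given below; the checkpoint name is the public one, while the label set and the
word-alignment step are assumptions about the team's setup:</p>
      <preformat>
# Sketch of MuRIL-based token classification for word-level LI.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["Kannada", "English", "Mixed", "Name",
          "Location", "Number", "Other"]
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased", num_labels=len(labels))

words = ["naanu", "movie", "nodide"]       # one hypothetical comment
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits           # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)
# enc.word_ids() maps subwords back to words; after fine-tuning, a
# word's label is typically read off its first subword.
      </preformat>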
      <p>Team NLPnorth used the MACHAMP toolkit (https://github.com/machamp-nlp/machamp) to fine-tune a wide range of transformer models and
picked the best five language models based on their performances on the development sets. Further,
they added a Conditional Random Field (CRF) layer to the MACHAMP model to capture the label dependencies
between consecutive words and obtained 4th, 2nd, 1st, and 1st ranks for Tamil, Kannada, Malayalam,
and Tulu texts respectively.</p>
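      <p>Independently of MACHAMP, the idea of a CRF output layer over per-token scores can be sketched with
the pytorch-crf package (an assumed stand-in, not MACHAMP's internal implementation):</p>
      <preformat>
# Minimal CRF-over-emissions sketch using the pytorch-crf package.
import torch
from torchcrf import CRF

num_tags = 7
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 5, num_tags)   # e.g. transformer logits: (batch, seq, tags)
tags = torch.randint(0, num_tags, (2, 5))
mask = torch.ones(2, 5, dtype=torch.bool)

loss = -crf(emissions, tags, mask=mask)        # negative log-likelihood to minimize
best_paths = crf.decode(emissions, mask=mask)  # most likely tag sequences
      </preformat>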
      <p>Team Awsathama conducted a wide range of experiments using ML classifiers (MNB, LR, Support
Vector Classifier (SVC), kNN, DT, RF, Light Gradient Boosting Machine (LightGBM), Extreme Gradient
Boosting (XGBoost), and Categorical Boosting (CatBoost)) trained with count vectors and TF-IDF
vectors of character sequences. Their proposed XGBoost and SVC models trained with count vectors
obtained 2nd and 3rd ranks for Tamil and Malayalam texts respectively. Further, their SVC models
trained with TF-IDF vectors obtained 3rd and 4th ranks for Tulu and Kannada texts respectively.</p>
        <p>Team MUCS employed deep neural network models, implementing two sequence labeling models: i)
CoLi_CNN - a Convolutional Neural Network (CNN) model trained with MuRIL embeddings, and ii)
CoLi_TNN - a transformer neural network trained from scratch - as well as a sequence-to-sequence learning
model with a BiLSTM encoder and an LSTM decoder. Their proposed CoLi_CNN model obtained 6th rank
for all four languages.</p>
        <p>Team MUCSNLPLAB trained CRF models with text features (word length, previous word, and next
word); by tuning the hyperparameters, their proposed models obtained 3rd, 7th, 9th, and 7th ranks
for Tamil, Kannada, Malayalam, and Tulu texts respectively.</p>
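        <p>A feature-based CRF of this kind can be sketched with sklearn-crfsuite; the exact feature set and
hyperparameters used by the team are assumptions here:</p>
        <preformat>
# Hedged sketch of a CRF over simple word features (length, neighbors).
import sklearn_crfsuite

def word_features(sent, i):
    return {
        "word": sent[i],
        "length": len(sent[i]),
        "prev_word": sent[i - 1] if i != 0 else "BOS",
        "next_word": sent[i + 1] if i + 1 != len(sent) else "EOS",
    }

sents = [["naanu", "movie", "nodide"]]             # hypothetical comment
tags = [["Kannada", "English", "Kannada"]]
X = [[word_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))
        </preformat>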
        <p>Team TextTitans proposed a prompt-based method using GPT-3.5 Turbo, a large language model,
to perform word-level LI in Tamil and Kannada texts and obtained 10th rank for both languages.</p>
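        <p>The team's exact prompt is not reproduced here; an assumed shape of such a prompt-based call, using
the OpenAI Python SDK, is sketched below:</p>
        <preformat>
# Assumed prompt shape for word-level LI with GPT-3.5 Turbo; the
# actual prompt, in-context examples, and output parsing may differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
prompt = (
    "Label each word of this romanized Tamil-English comment with one of: "
    "Tamil, English, Mixed, Name, Location, Number, Other.\n"
    "Comment: padam semma mass"        # hypothetical comment
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
        </preformat>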
        <p>Team abadian trained ML classifiers (SVM, SGD, kNN, and MLP) with TF-IDF of character sequences
for word-level LI in Kannada and Malayalam texts, and their proposed SVM model obtained 8th and 10th
ranks for Kannada and Tulu texts respectively.</p>
        <p>The findings reveal that a significant number of participants experimented with different transformer
models, while a few others opted for traditional ML techniques, and a smaller group focused on DL
models. This diversity in approaches highlights the evolving landscape of the techniques used in the
shared task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Ranking</title>
      <p>Conventionally, word-level LI datasets are imbalanced, and this can skew model evaluation. Hence,
using both macro and weighted F1 scores provides a more comprehensive assessment, as the macro average treats all
classes equally while the weighted average accounts for class imbalance based on class frequency. Together,
these metrics offer a better evaluation of model performance across all the classes. The predictions
submitted by the participants of the shared task were evaluated based on macro F1 scores to rank the
teams, and ranking ties were resolved considering the weighted F1 score. Tables 3 and 4 present the
performances of the participating teams in the shared task along with the baselines.</p>
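      <p>The difference between the two averages is easy to see on a toy prediction (the labels below are
illustrative, not task data):</p>
      <preformat>
# Macro F1 weights every class equally; weighted F1 scales each
# class's F1 by its support, so frequent classes dominate.
from sklearn.metrics import f1_score

y_true = ["Kannada", "Kannada", "Kannada", "English", "Mixed"]
y_pred = ["Kannada", "Kannada", "Kannada", "English", "English"]

print(f1_score(y_true, y_pred, average="macro"))     # punished by the missed 'Mixed'
print(f1_score(y_true, y_pred, average="weighted"))  # dominated by frequent 'Kannada'
      </preformat>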
      <p>The top four teams surpassed the baseline models, achieving macro F1 scores of 0.7656, 0.9293,
0.8939, and 0.8678 for code-mixed Tamil, Kannada, Malayalam, and Tulu texts, respectively, reflecting
the difficulty and competitiveness of the shared task. This result underscores the advancement made by
the top teams in addressing the task's challenges.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Findings</title>
      <p>37 teams registered for this shared task and 10 teams submitted their results for all four languages.
Figure 2 gives a glimpse of the number of teams and the learning approaches used by them to address
word-level LI. Most of the teams incorporated ML models using language-independent feature
extraction techniques, like TF-IDF and CountVectorizer, while a few teams leveraged TL to improve
their models' performance for low-resource languages like Tulu. This approach demonstrates the
flexibility of such models in handling languages that are not part of the original training data. Only one team
employed DL models, incorporating MuRIL embeddings - a language-dependent representation -
and Keras embeddings - a language-independent representation. Their proposed methodology found the
DL classifier trained with MuRIL embeddings to be more beneficial for performing the word-level LI
task. This suggests that language-specific embeddings like MuRIL can provide a significant advantage
in handling tasks for specific languages.</p>
      <p>Participants also encountered challenges while working with code-mixed text in Roman script.
To overcome this, they either fine-tuned suitable pre-trained models for the datasets or employed
language-independent feature extraction methods. However, language-dependent resources for Tulu
remain limited compared to the other languages. Further, the issue of extreme class imbalance in the given
datasets was not addressed by any of the participants.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Works</title>
      <p>This paper describes CoLI-Dravidian 2024 - a word-level LI shared task - and presents the findings of the task.
The task focused on four low-resource Dravidian languages - Tamil, Kannada, Malayalam, and Tulu -
intertwined with English, reflecting the real-world linguistic dynamics of multilingual communities in
the digital age. Further, it underscores the importance of recognizing the unique characteristics of these
low-resource languages and highlights the efforts to preserve linguistic diversity in an increasingly
interconnected world.</p>
      <p>The fine-tuned MuRIL model excelled for Kannada, achieving the highest macro F1 score of 0.9293,
and also performed well for Tulu with a macro F1 score of 0.8585, underscoring its versatility in handling
less commonly studied languages in the Dravidian family. For Malayalam, the MACHAMP model, with
an added CRF layer, achieved the best result with a macro F1 score of 0.8939, showcasing its effectiveness
in capturing language sequences. In the case of Tamil, a voting classifier trained on character sequences
produced the highest score of 0.7656, which highlights the need for further refinement of models for this
language, potentially through more sophisticated contextual understanding. The effectiveness of these
methods depends heavily on the linguistic and code-mixing properties of each Dravidian language.</p>
      <p>By using the datasets of this shared task, researchers can focus on adding more context and improving
transformer models to better understand the unique details of Dravidian languages in real-world tasks
like sentiment analysis, translation, and monitoring social media. The shared task’s outcomes emphasize
the importance of continued research into code-mixed LI, which is crucial for preserving linguistic
diversity in the digital age.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kolipakam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Greenhill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bouckaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Verkerk</surname>
          </string-name>
          ,
          <article-title>A Bayesian Phylogenetic Study of the Dravidian Language Family</article-title>
          ,
          <source>Royal Society open science 5</source>
          (
          <year>2018</year>
          )
          <fpage>171504</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022</article-title>
          , in:
          <source>Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          (
          <year>2022</year>
          )
          <fpage>38</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <article-title>LAs for HASOC - Learning Approaches for Hate Speech and Offensive Content Identification</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Corpus Creation for Sentiment Analysis in Code-mixed Tulu Text</article-title>
          , in:
          <source>Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Text at FIRE 2023</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vajrobol</surname>
          </string-name>
          , CoLI-Kanglish:
          <article-title>Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka Model</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <article-title>BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task @ ICON 2022</article-title>
          , in:
          <source>Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>CoLI@FIRE2023: Findings of Word-level Language Identification in Code-mixed Tulu Text</article-title>
          , in:
          <source>Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '23</source>
          , Association for Computing Machinery,
          <year>2024</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bestgen</surname>
          </string-name>
          ,
          <article-title>Using Character Ngrams for Word-Level Language Identification in Trilingual Code-Mixed Data (and Even More)</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Fetouh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <article-title>BFCAI at CoLI-Tunglish@FIRE 2023: Machine Learning Based Model for Word-level Language Identification in Code-mixed Tulu Texts</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <article-title>Word-Level Language Identification of Code-Mixed Tulu-English Data</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Advancing Language Identification in Code-Mixed Tulu Texts: Harnessing Deep Learning Techniques</article-title>
          , in:
          <source>FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>