<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Language-Specific Characteristics for Word-level Language Identification in Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachana Nagaraju</string-name>
          <email>rachananagaraju20@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H L Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore, Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Language Identification (LI) is essential for many Natural Language Processing (NLP) applications, including sentiment analysis, machine translation, and information retrieval. Reliable LI is crucial in multilingual and informal communication settings, where language boundaries can blur. This is particularly true in India, where social media users often create code-mixed text combining English with regional and/or local languages. The Dravidian languages - Kannada, Tamil, Malayalam, Telugu, and Tulu - have complex structures and are frequently Romanized and/or mixed with English in digital conversations. Processing code-mixed text in these languages is challenging as they are low-resourced. To tackle these issues, the shared task on 'Word-level Language Identification for Code-Mixed Dravidian Languages' by CoLI-Dravidian @ Forum for Information Retrieval Evaluation (FIRE) 2025 provided word-level annotated datasets in Kannada, Tamil, Malayalam, Telugu, and Tulu, with the aim of fostering the development of strong LI systems. In this paper, we, team MUCS, model the word-level LI task as a classical sequence labeling problem and describe a Conditional Random Field (CRF)-based pipeline, blending various lexical and contextual features, to address the challenges of the shared task. Our system performed well across all languages, achieving Macro F1 scores of 0.9040 in Kannada (6th rank), 0.5955 in Tamil (8th rank), 0.7620 in Malayalam (6th rank), 0.7289 in Telugu (4th rank), and 0.7963 in Tulu (5th rank). These results show that a carefully designed classical sequence labeling approach can remain competitive with other methods, even in noisy and code-mixed multilingual settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Identification</kwd>
        <kwd>Code-Mixing</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Conditional Random Fields</kwd>
        <kwd>Multilingual NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        LI involves identifying the language of a text fragment, ranging from a full document to just one
word. This task is a crucial first step in many NLP applications like machine translation, sentiment
analysis, and named entity recognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Although LI at the sentence or document level is quite
advanced, identifying the language at the word-level is still complicated, particularly in informal,
multilingual contexts. This complexity is especially evident in Indian languages, where multilingualism
is common. Digital conversations often include significant code-mixing, which is the blending of
words and structures from different languages, including English, in the same sentence or utterance
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Code-mixed texts on social media are typically informal, inconsistent in spelling, and heavily
transliterated. They are often written in Roman and/or native script, regardless of the user’s native
language. These traits greatly limit the effectiveness of traditional NLP tools, which expect monolingual
and well-formed input.
      </p>
      <p>Dravidian languages are a major language family in South India and parts of Sri Lanka. Key members
of this family include Kannada, Tamil, Telugu, Malayalam, and Tulu. These languages are rich in
morphology and highly agglutinative, but they lack sufficient computational resources. Despite having
millions of native speakers, they are labeled as low-resource languages in NLP due to lack of large
annotated datasets, pre-trained models, and linguistic tools. Much of the online content in these
languages, particularly on social media, is informal and often mixed with English, which creates
additional challenges for automated processing.</p>
      <p>
        To tackle the challenges of word-level LI posed by code-mixed text in Dravidian languages on social
media, the CoLI-Dravidian @ FIRE 2025 shared task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced a Word-level LI challenge. This
initiative aims to enable more accurate linguistic analysis in multilingual contexts. Participants are
given word-level annotated datasets in five languages — Kannada, Tamil, Telugu, Malayalam, and Tulu
— with each word tagged as belonging to a specific language (e.g., TAMIL, MALAYALAM) or a functional
class (e.g., ENGLISH, NAME, SYM, MIXED, OTHER). The goal of this task is to evaluate and improve LI
systems that perform well in noisy, low-resource, and morphologically complex situations. Table 1
presents the tags and tag-wise distribution of words in five languages of the Train set. It can be observed
that the Train sets of all the languages are imbalanced.
      </p>
      <p>
        We, team MUCS, participated in this shared task (https://www.codabench.org/competitions/7902/)
using a traditional yet effective approach. We modeled the word-level LI task as a classical sequence
labeling problem and developed a pipeline of thoughtfully designed lexical, orthographic, and contextual
features to train a CRF model for each language. Our code is available on GitHub
(https://github.com/rachanabn20/CoLI-Dravidian-FIRE-2025) to reproduce the results and explore further.
This setup allows for better adaptation to the morphology and word patterns of the individual languages.
While neural models have recently gained popularity in NLP, we believe that language-specific
feature-based CRF models remain effective, particularly in low-resource and noisy conditions, and
well-designed features can surpass end-to-end neural systems [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. In this paper, we describe our complete approach
for word-level LI in Dravidian code-mixed data, covering feature extraction, model training, and
evaluation. We also provide detailed analyses of our system’s behavior across languages and explore its
potential for real-world multilingual applications.
      </p>
      <p>The subsequent sections of this paper detail the related work (Section 2), methodology (Section 3),
experiments, results, and implications of our approach (Section 4), and the declaration on generative AI
(Section 5), followed by the conclusion and future work (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>LI in code-mixed and multilingual settings has received growing attention in recent years, particularly
in low-resourced languages such as those in the Dravidian family. Several shared tasks have previously
focused on word-level LI in code-mixed Indic texts. Notable among them are CoLI-Kanglish: Word-Level
Language Identification in Code-mixed Kannada-English Texts at ICON 2022 [7, 8], CoLI-Tunglish:
Word-level Language Identification in Code-mixed Tulu Texts at FIRE 2023 [9, 10], and CoLI-Dravidian:
Word-level Code-Mixed Language Identification in Dravidian Languages at FIRE 2024 [11]. These tasks
have laid the groundwork for advancing research on LI in multilingual and morphologically rich
environments. Several researchers have explored traditional Machine Learning (ML), Deep Learning
(DL), and transformer-based models for word-level and sentence-level LI tasks. A few notable works are
described below:</p>
      <p>Chakravarthi et al. [12] introduced the DravidianCodeMix dataset, which contains social media comments
in Tamil-English, Kannada-English, and Malayalam-English, for sentiment analysis and offensive
language detection. They reported baseline experiments using classical ML models and DL architectures,
allowing for comparative benchmarks in later studies. Shimi et al. [13] conducted a comparative study
of different ML algorithms (Naive Bayes, Support Vector Machines, Logistic Regression, and Random
Forest), alongside transformer models such as BERT and mBERT. Their study focused on Tamil and
Malayalam and found that while classical models achieved accuracy between 85% and 89%, transformer
models reached up to 98% accuracy. This highlighted the strength of pre-trained language models on
both monolingual and code-mixed data.</p>
      <p>The VarDial Dravidian Language Identification shared task [14] evaluated various methods for
identifying languages in code-mixed data. The organizers of the shared task compared character n-gram
models with contextual transformers like RoBERTa. While transformers are popular, character-based
models showed strong Macro F1 scores, particularly in low-resource settings involving Kannada, Tamil,
and Malayalam. Deroy and Maity [15] explored prompt-based learning with GPT-3.5 Turbo for word-level
LI in code-mixed Kannada and Tamil. The model showed high precision for English and Kannada
words but faced challenges with mixed-language segments. This indicated the limitations of prompt-only
approaches in complex linguistic environments like intra-word code-mixing. Hande et al. [16]
used transfer learning models such as ULMFiT and BERT for detecting offensive language in code-mixed
Tamil, Malayalam, and Kannada. While the main focus was not on LI, their pipeline involved
language-aware preprocessing and word-level modeling strategies.</p>
      <p>Mandalam and Sharma [17] trained Logistic Regression and LSTM networks with Term Frequency-Inverse
Document Frequency (TF-IDF) features for sentiment analysis on code-mixed Tamil and Malayalam
texts in FIRE 2020. Their results showed that neural models performed better when they used
domain-specific pre-processing. Their setup included modules for identifying intermediate languages to help
with classification. Saumya et al. [18] experimented with lightweight models like Naive Bayes and
shallow neural networks, using n-gram features for detecting offensive content in Tamil-English and
Malayalam-English datasets. Their study shows that simple lexical models worked well in noisy social
media environments. The IndicNLP@KGP team [19] participated in the DravidianLangTech-EACL 2021
shared task using a combination of AWD-LSTM and transformer models for word-level classification.
Their efforts led to F1 scores of 0.97 for Malayalam and 0.77 for Tamil. This performance demonstrates
their strength in handling morphologically rich, code-mixed inputs.</p>
      <p>Informed by these prior works, our system adopts a feature-rich CRF framework tailored to handle
intra-word code-mixing and symbol/other word categorization. Unlike many previous approaches
focused on sentence-level labels or sentiment detection, our system is optimized for fine-grained
word-level classification as demanded by the CoLI-Dravidian @ FIRE 2025 task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes our proposed system pipeline for the CoLI-Dravidian @ FIRE 2025 shared task.
We employed a CRF model to perform word-level LI in code-mixed Dravidian social media text. A CRF
model is well-suited for sequential data with strong inter-word dependencies and provides interpretable
feature-based modeling. Figure 1 visualizes the architecture of our CRF-based pipeline for word-level LI.
We transform the given (word, tag) pairs into sentences based on sentence boundaries. Each
sentence is then processed as a sequence of words, and each word is transformed into a feature vector
using handcrafted features (e.g., character n-grams, word shape, contextual clues). These vectors are
fed to the CRF layer to predict the most likely sequence of language tags by modeling dependencies
between neighboring tags. The steps involved in building the CRF model are described below:</p>
      <sec id="sec-3-1">
        <title>3.1. Language-Specific Characteristics</title>
        <p>Tamil, Kannada, Malayalam, Telugu, and Tulu are morphologically rich and exhibit agglutinative
word formation, where words are formed by stringing together two or more morphemes without altering
them, making their morphology highly regular. Code-mixed data involving these languages often blends
native morphology with English words or named entities. For instance, suffix patterns in Tamil (e.g.,
-kura, -vanga, -ungal) or Telugu (e.g., -chusthe, -andi, -nunchi) provide cues for LI. Suffixes such as -andi
in Telugu or -vanga in Tamil often indicate politeness, intent, or action and are highly language-specific.
Kannada exhibits characteristic endings like -alli (locative), -ige (dative), and -iddu (copula), while
Malayalam frequently uses markers such as -ille (negation), -kal (plural), and -um (conjunctive/also).
Tulu, though less resourced, shows identifiable suffixes like -d (past tense), -er (plural), and -ndu
(emphatic). In contrast, named entities and borrowed English terms may appear uniformly across languages,
adding ambiguity for LI. These orthographic and morphological cues thus play a vital role in LI within
code-mixed settings.</p>
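As an illustration (not our shared-task system itself), the suffix cues above can be encoded as a simple lookup. The suffix inventories below are small hypothetical samples taken from the examples in this section; a real system would need curated, much larger lists:

```python
# Hypothetical suffix inventories, drawn from the examples in this section.
SUFFIX_CUES = {
    "TAMIL": ("kura", "vanga", "ungal"),
    "TELUGU": ("chusthe", "andi", "nunchi"),
    "KANNADA": ("alli", "ige", "iddu"),
    "MALAYALAM": ("ille", "kal", "um"),
    "TULU": ("ndu", "er", "d"),
}

def suffix_cue(word: str):
    """Return the first language whose characteristic suffix matches, else None."""
    for lang, suffixes in SUFFIX_CUES.items():
        # str.endswith accepts a tuple of candidate suffixes.
        if word.endswith(suffixes):
            return lang
    return None
```

Such a cue is ambiguous on its own (e.g., many English words also end in -um or -er), which is why suffix information is used only as one feature among many rather than as a standalone classifier.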
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Extraction</title>
        <p>
          The core of our model pipeline is the word2features() function, which generates handcrafted features
for each word in a sentence. Inspired by prior CRF-based NER and LID systems [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the function
incorporates both word-level and context-level properties. (Lample et al. (2016) introduced neural
sequence labeling architectures for NER, combining word- and character-level features.) For a given
word w_i in a sentence s = {w_1, w_2, ..., w_n}, we extract:
• Current word features:
– Lowercased form of the word if the text is in Roman script.
– Character prefixes/suffixes (1 to 3 characters): short prefixes/suffixes help capture morphological or
inflectional endings relevant to agglutinative languages.
– Word shape (all-capitals, all-lowercase, title case): casing features help distinguish acronyms,
named entities, and sentence boundaries.
– Digit/emoji/URL/symbol flags: flags for emojis or symbols are useful in social media or
informal text settings.
– Length of the word.
– Language-specific morphological clues (e.g., frequent suffixes like -vanga in Tamil, -andi in
Telugu, -an or -amma in Malayalam, -nu or -ra in Kannada, and -da or -du in Tulu).
• Contextual features:
– Previous and next word features: contextual windows (typically of size 1 or 2) help to capture
the surrounding context.
– Position of the word in the sentence: sentence-initial or sentence-final positions can hint at
part-of-speech or discourse-level roles.</p>
        <p>The word2features() function constructs a rich feature vector for each word, which is then passed
to the CRF model, enabling sparse yet interpretable modeling.</p>
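A minimal sketch of such a feature function, assuming Roman-script input and a context window of size 1 (the feature set in our released code is richer):

```python
def word2features(sentence, i):
    """Build a handcrafted feature dict for word i of a tokenized sentence.

    Illustrative sketch of the features described above, not the exact
    implementation used in our submission.
    """
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        # Character prefixes/suffixes (1 to 3 characters).
        "prefix1": word[:1], "prefix2": word[:2], "prefix3": word[:3],
        "suffix1": word[-1:], "suffix2": word[-2:], "suffix3": word[-3:],
        # Word shape and orthographic flags.
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "has_digit": any(c.isdigit() for c in word),
        "length": len(word),
    }
    # Contextual window of size 1: previous and next word forms.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # sentence-initial position
    if i + 1 < len(sentence):
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # sentence-final position
    return features
```

Each word thus maps to a sparse dictionary of named features, which sklearn-crfsuite consumes directly.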
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Training</title>
        <p>We used sklearn-crfsuite (https://sklearn-crfsuite.readthedocs.io/en/latest/), a fast CRF implementation
with a Python wrapper for labeling sequential data, to train a CRF model for each language using the
language-specific features of that language and the corresponding tags. Words are classified into one of
the tags based on their form and context. To ensure reproducibility and robustness across the
linguistically diverse Dravidian languages, hyperparameters are selected through a combination of insights
from prior CRF-based sequence labeling work [11] and systematic empirical tuning on held-out validation
sets for each language. Specifically, we performed grid search over the L1 (c1) and L2 (c2) regularization
coefficients, testing values in the ranges 0.01–0.2 for c1 and 0.001–0.02 for c2, while monitoring F1 scores
to balance generalization and overfitting. This per-language tuning revealed that the shared configuration
(c1 = 0.1, c2 = 0.01) yielded the optimal trade-off across all datasets, with minimal variance (e.g., &lt;2%
F1 score fluctuation between languages). Other parameters, such as max iterations (200) and context
window size (2), are similarly validated on validation splits to avoid underfitting on agglutinative patterns
while preventing noise from larger windows. The selected hyperparameters are summarized in Table 2.</p>
        <p>One of the practical advantages of using CRF models, particularly in our pipeline, is their training
efficiency and low computational overhead. Using sklearn-crfsuite on Google Colab, the CRF model
for each language was trained within 5–7 minutes, even with feature-rich configurations. In
contrast to large transformer-based models that often require GPU acceleration and hours of training
time, our CRF-based approach is CPU-friendly, memory-efficient, and well-suited for low-resource
environments. This makes our system accessible to researchers with limited computing infrastructure
while still achieving competitive accuracy in code-mixed settings.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>In this section, we describe the dataset, experimental setup, evaluation metrics, and the results of our
word-level LI models. We performed experiments using the datasets provided by the organizers of
CoLI-Dravidian @ FIRE 2025 shared task for five languages: Kannada, Malayalam, Telugu, Tamil, and
Tulu. The task is framed as a classical sequence labeling problem, with sequence tags that include native
language, English, named entities, symbols, mixed words, and other semantic categories. The dataset is
pre-tokenized into words in Roman script and annotated with word-level tags. Table 3 summarizes the
statistics of the datasets in terms of the number of words across the Train, Validation, and Test sets for
each language.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>The proposed CRF model is evaluated on the Test set of each language using the Macro F1 score, which
is well-suited for imbalanced class distributions. Table 4 presents the Precision, Recall, and Macro
F1 scores obtained by our models for the five Dravidian languages on both Validation and Test sets. Our
system performed consistently well, securing competitive ranks in each language track. Notably, we
achieved 4th rank in Telugu, 5th rank in Tulu, 8th rank in Tamil, and 6th rank in both the Kannada and
Malayalam tracks. Figure 2 shows the performance of the models submitted by all participants of the
shared task for each language. It is evident that classical sequence labeling approaches, when engineered
with domain-relevant features, can compete effectively against neural and multilingual transformer-based
systems. Overall, the results of our models reaffirm that even in the presence of multilinguality, informal
orthography, and limited data, structured models like CRFs, when paired with linguistic insights, can
deliver reliable and interpretable performance.</p>
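For clarity, the Macro F1 score averages per-class F1 with equal weight per class, so rare tags count as much as frequent ones. A small self-contained computation (not the official evaluation script) is:

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Macro F1: compute F1 per class, then average with equal class weight."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Under this metric, a model that ignores a rare tag such as MIXED is penalized as heavily as one that ignores a dominant tag, which is why it suits the imbalanced Train sets of this task.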
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Error Analysis</title>
        <p>Despite overall promising performance, our models exhibited several notable misclassifications in all
five languages. To understand the behavior of our system across the languages, we present the confusion
matrices in Figure 3. These matrices highlight common error patterns and dominant misclassification
trends. Frequent confusions are observed between script-similar pairs (e.g., Kannada–Tulu), class-sparse
categories (e.g., MIXED), and semantically overlapping tags (e.g., ENGLISH and NAME). Table 5
lists manually inspected word-level prediction errors to highlight specific model confusions for each
language, as given below:
• Kannada: The model struggles to disambiguate mixed-language words that contain Kannada
morphemes but exhibit lexical borrowing (e.g., MIXED WORD “babydu” misclassified as ENGLISH).
CRF dependency on local context also causes OTHER or MIXED tokens to be absorbed into dominant
language tags like KANNADA. Improved modeling of bilingual spans or hybrid morphemes could
reduce these inconsistencies.
• Malayalam: Overlapping of native sufixes (-an, -amma) and common name endings leads
to confusion between MALAYALAM and NAME tags. Using an external name lexicon could help
disambiguate such tokens. Stylized English forms also confuse the system when they are
morphologically similar to Malayalam (e.g., “kazhinuu”).
• Telugu: Errors are dominated by transliterated or borrowed forms in Roman script. Models
fail to separate phonetically similar tokens such as “thalli” (native) and “tally” (borrowed from
English). Subword normalization and phonetic-aware features can mitigate this.
• Tamil: Confusion between TAMIL and MIXED tags occurs frequently in constructs like “call
pannunga”, where an English verb stem merges with a Tamil sufix. Misclassifications between
NAME and ENGLISH tags occur in named entities using Latin script — e.g., “Sumara” or “Thomas”.
• Tulu: Due to script similarity with Kannada, Tulu words are frequently misclassified. Class
imbalance also skews predictions towards the more frequent TULU tag. A curated list of orthographic
or morphological patterns unique to each language could help improve separability.
(Figure 2 panels: (a) Kannada – 6th Rank, (b) Malayalam – 6th Rank, (c) Telugu – 4th Rank, (d) Tamil – 8th Rank, (e) Tulu – 5th Rank.)
While each language presents its own set of challenges, several recurring error patterns are consistent
across all five languages. These issues arise primarily from multilingual interference, limited
representation of rare tags, and orthographic ambiguities in transliterated or borrowed words. For example, i) the
MIXED class emerges as the most error-prone due to its heterogeneity, and ii) orthographically ambiguous
or low-frequency tags (e.g., LOCATION, NUMBER) are also challenging, although strong orthographic
cues enable better separation in tags like SYM. Such trends indicate that certain error types stem from
structural and distributional properties of the data rather than language-specific phenomena.</p>
        <p>A significant challenge to the performance of the proposed models is data imbalance. Training
sets of all the languages are substantially imbalanced. For example, tags like ENGLISH and TAMIL
dominate, while low-resource tags such as MIXED, KANNADA, and LOCATION are underrepresented.
(Figure 3: Confusion Matrices Showing Class-wise Predictions for the Five Languages; panels (a) Kannada, (b) Malayalam, (c) Telugu, (d) Tamil, (e) Tulu.)
Early experiments with class reweighting and minority upsampling increased sensitivity on
rare tags at the cost of significant overfitting and reduced Macro F1 performance. Taking this into
consideration, and based on the CRF-based model developed by Asha et al. [11], we opted to use the
imbalanced Train sets provided by the organizers of the shared task. Nonetheless, improving error
rates on low-frequency tags remains a key direction. Advanced imbalance-aware training (e.g., focal
loss, synthetic data augmentation, or curriculum sampling) could further reduce confusion in future
iterations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>We explored a Naive Baseline that serves as a simple benchmark for sequence labeling tasks. It
completely ignores the input text and assigns the most frequent label from the training set to every
token in the test set, regardless of its actual content or context. For example, if KANNADA is the
most common tag in the training data, then every word in the test set will be labeled as KANNADA,
irrespective of its true language or identity. This serves as a sanity check and establishes a lower bound
for performance comparison.</p>
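A sketch of this baseline, which needs only the training tag distribution:

```python
from collections import Counter

def naive_baseline(train_tags, test_words):
    """Assign the single most frequent training tag to every test token,
    ignoring the input entirely; a lower bound for sequence labeling."""
    most_common_tag = Counter(train_tags).most_common(1)[0][0]
    return [most_common_tag for _ in test_words]
```

Because the Train sets are imbalanced, this baseline gets the dominant class right and everything else wrong, which is exactly why its Macro F1 is low.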
        <p>The features used to train the proposed CRF-based word-level LI model are categorized into three major
groups: (i) contextual word window features, (ii) character-level prefix and suffix n-grams, and (iii) word
shape features (e.g., capitalization, presence of digits, special characters). To understand the contribution
of the different feature sets to our proposed model, we performed an ablation study by systematically
removing one feature group at a time. The results show that removing prefix and suffix n-gram
features causes the largest drop in performance, highlighting the importance of morphological cues in
handling agglutinative Dravidian languages. Word shape features contribute modestly, offering small
but consistent gains. In contrast, removing contextual word features slightly improves the performance for
all languages, suggesting that the CRF’s sequential modeling is effective in capturing local dependencies
from the word shape and prefix and suffix n-gram features. Overall, this analysis confirms that
the success of our CRF model stems not only from the algorithm itself but also from the inclusion of
well-designed, language-specific, and morphology-aware feature engineering.</p>
        <p>The results of the ablation study and the Naive Baseline in terms of F1 score are presented in
Table 6. The ablation study reveals the following consistent patterns across the five Dravidian
languages:
• Prefix and Suffix N-grams: This feature group proves to be the most influential. Removing it
leads to a substantial drop in F1 score, up to 13% in Tamil and over 6% in Kannada, indicating
that subword morphological patterns play a crucial role, particularly in agglutinative languages.
• Contextual Word Features: Surprisingly, excluding these features results in little to no performance
degradation; in fact, slight improvements are observed for most languages. This suggests
that the CRF’s sequential modeling and lexical features are sufficient to capture local dependencies,
making this group less critical in our setup.
• Word Shape Features: These contribute modestly, with minimal variation (less than 1%
difference) when omitted. Their utility appears somewhat language-dependent, slightly more
beneficial for Malayalam and Tamil, likely due to inconsistent capitalization and informal
orthography common in code-mixed digital text.
• Naive Baseline: The naive model performs poorly across all languages (average F1 ≈ 31%),
reaffirming the difficulty of the task and the advantages of engineered linguistic features and
structured CRF modeling.</p>
        <p>Overall, these findings confirm that the effectiveness of our CRF-based system is heavily dependent
on well-designed linguistic feature engineering, particularly character n-gram morphology, which is
especially crucial in code-mixed, low-resource scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Declaration on Generative AI</title>
      <p>While drafting this paper, the authors utilized AI-based tools such as grammar correction and
formatting support to assist in improving clarity and presentation. All core ideas, experimental design,
implementation, interpretation of results, and written content were solely developed and curated by
the authors. The final submission reflects original human-authored work grounded in independent
research and critical analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, we presented our approach for the CoLI-Dravidian shared task at FIRE 2025, focusing
on fine-grained word-level LI in code-mixed multilingual social media texts involving five Dravidian
languages - Kannada, Malayalam, Tamil, Telugu, and Tulu. Through a CRF-based pipeline and carefully
engineered features, our system demonstrated strong performance for Kannada and Tulu, while
highlighting the inherent challenges in handling Tamil due to script ambiguity and overlapping vocabulary
with English. Despite encouraging results, several limitations persist. The system occasionally struggles
with ambiguous words, named entities, and transliterated words particularly in noisy informal text.
These challenges point to the need for more context-aware and deep semantic models. This study lays
a foundation for deeper exploration into multilingual and code-mixed language processing within the
Dravidian language family, with potential applications in conversational AI, social media moderation,
and regional language technologies. We would like to explore transformer-based multilingual models
and character-level embeddings to better capture contextual dependencies. Incorporating external
linguistic resources or pretraining on larger domain-specific corpora could also help to improve the
performance, especially for low-resource languages like Tulu. Moreover, a focus on semi-supervised or
zero-shot methods may further extend the scalability of our system to unseen dialects and languages.
[7] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. L. Shashirekha, G. Sidorov, A. Gelbukh, Overview
of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at
ICON 2022, in: Proceedings of the 19th International Conference on Natural Language Processing
(ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English
Texts, 2022, pp. 38–45.
[8] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, Coli-Machine Learning Approaches
for Code-mixed Language Identification at the Word Level in Kannada-English texts, Acta
Polytechnica Hungarica 19 (2022) 123–141.
[9] A. Hegde, F. Balouchzahi, S. Coelho, S. HL, H. A. Nayel, S. Butt, CoLI@ FIRE2023: Findings of
Word-level Language Identification in Code-Mixed Tulu Text, in: Proceedings of the 15th Annual
Meeting of the Forum for Information Retrieval Evaluation, 2023, pp. 25–26.
[10] A. Hegde, F. Balouchzahi, S. Coelho, H. L. Shashirekha, H. A. Nayel, S. Butt, Overview of
CoLITunglish: Word-level Language Identification in Code-Mixed Tulu Text at FIRE 2023, in: FIRE
(Working Notes), 2023, pp. 179–190.
[11] A. Hegde, F. Balouchzahi, S. Butt, S. Coelho, K. G, H. S. Kumar, S. D, S. H. L., A. Agrawal,
Coli@fire2024: Findings of word-level code-mixed language identification in dravidian languages,
in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation
(FIRE ’24), ACM, Gandhinagar, India, 2024, pp. 38–41. URL: https://doi.org/10.1145/3734947.3735663.
doi:10.1145/3734947.3735663.
[12] B. R. Chakravarthi, R. Priyadharshini, M. A. Kumar, P. Krishnamurthy, E. Sherly,
DravidianCodeMix: Sentiment Analysis and Ofensive Language Identification Dataset for Dravidian
Languages in Code-Mixed Text, in: Proceedings of the First Workshop on Speech and Language
Technologies for Dravidian Languages, Association for Computational Linguistics, 2021, pp. 1–11.
[13] R. Shimi, R. Thomas, S. Rajeev, An Empirical Evaluation of Machine Learning and
TransformerBased Models for Code-Mixed Text Classification, Journal of Computational Linguistics and
Applications (2024). To appear.
[14] M. Gaman, T. Jauhiainen, M. Lui, M. Zampieri, Findings of the Vardial Evaluation Campaign 2021,
in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
(VarDial 2021), Association for Computational Linguistics, 2021, pp. 1–15.
[15] N. Deroy, S. Maity, Prompting GPT-3.5 for Word-Level Language Identification in South Indian</p>
      <p>Code-Mixed Texts, arXiv preprint arXiv:2403.10258 (2024).
[16] A. Hande, P. Yogi, B. R. Chakravarthi, J. P. McCrae, Offensive Language Identification in Dravidian
Code-Mixed Social Media Texts, in: Proceedings of the First Workshop on Speech and Language
Technologies for Dravidian Languages, Association for Computational Linguistics, 2021, pp. 18–26.
[17] V. Mandalam, N. Sharma, Sentiment Analysis on Code-Mixed Dravidian Languages using TF-IDF
and Deep Learning Approaches, in: Working Notes of FIRE 2020 - Forum for Information Retrieval
Evaluation, CEUR-WS.org, India, 2021, pp. 123–130.
[18] G. Saumya, R. Rajeev, R. Thomas, Tanglish and Manglish Offensive Content Detection: A
Comparative Study of Classical and Neural Approaches, in: Proceedings of the DravidianLangTech-EACL
2021 Workshop, Association for Computational Linguistics, 2021, pp. 132–138.
[19] S. Bose, G. Kharola, S. K. Naskar, IndicNLP@KGP at DravidianLangTech-EACL2021: Ensembles of
Transformer and LSTM Models for Offensive Language Identification in Code-Mixed Texts, in:
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages,
Association for Computational Linguistics, 2021, pp. 27–35.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lindén</surname>
          </string-name>
          , Automatic Language Identification in Texts: A Survey,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>65</volume>
          (
          <year>2019</year>
          )
          <fpage>675</fpage>
          -
          <lpage>782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Automatic Language Identification for Transcribed Speech</article-title>
          ,
          <source>Interspeech</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <article-title>A Survey of Code-Mixed Data and Approaches in Natural Language Processing</article-title>
          , arXiv preprint arXiv:2004.00245 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Hosahalli</given-names>
            <surname>Lakshmaiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Dravidian 2025: Word-level Code-Mixed Language Identification in Dravidian Languages</article-title>
          , in: Forum for Information Retrieval Evaluation (FIRE 2025),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <article-title>Neural Architectures for Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of NAACL-HLT</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondratyuk</surname>
          </string-name>
          ,
          <article-title>UFAL Submission to the IWPT 2019 Shared Task: Parsers for 50 Languages using 50 Treebanks</article-title>
          , in:
          <source>Proceedings of the Shared Task at the 15th International Conference on Parsing Technologies (IWPT)</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>