<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word-level Language Identification using Character-level Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amaan Ahmad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asha Hegde</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sharal Coelho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Manipal Institute of Technology (MIT)</institution>
          ,
          <addr-line>Bengaluru</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Social media text poses significant challenges for natural language processing (NLP) and machine learning tasks, and existing language detection tools struggle to identify languages at the word level. Word-level Language Identification (LID) is a critical task in NLP, particularly for handling the code-mixed and multilingual text prevalent in social media and digital communication. In this work, we address the challenge of identifying languages at the word level across five Dravidian languages: Kannada, Malayalam, Telugu, Tamil, and Tulu. We employ a feature extraction approach using Term Frequency-Inverse Document Frequency (TF-IDF) over character n-grams of lengths 1 to 4, which are then fed into a classical machine learning model (ExtraTrees classifier). Our method achieves strong performance, securing good ranks in the benchmark evaluations for all five languages, and experimental results demonstrate high accuracy for ExtraTrees.</p>
      </abstract>
      <kwd-group>
        <kwd>Word Level Language Identification</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>ExtraTrees</kwd>
        <kwd>Dravidian Languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Language identification (LID) is a foundational task in natural language processing (NLP) that involves
determining the language of a given text segment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While document-level or sentence-level LID has
been well-studied, word-level LID presents unique challenges, especially in multilingual and code-mixed
scenarios where multiple languages co-exist within a single utterance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Word-level LID aims to
tag each word or token with its corresponding language, enabling downstream applications such as
machine translation, sentiment analysis, and hate speech detection in diverse linguistic contexts. The
importance of word-level LID has grown with the proliferation of social media platforms, where users
frequently mix languages such as Hindi-English, Telugu-English [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Kannada-English,
Malayalam-English, and so on [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Early approaches to LID relied on rule-based methods or simple statistical
models, but recent advancements have incorporated machine learning and deep learning techniques.
For instance, character-level features have proven effective due to their ability to capture orthographic
patterns unique to languages. Dravidian languages are a well-known language category spoken by
more than 250 million people, mainly in South India, Sri Lanka, and other parts of South Asia. Kannada,
Telugu, Tamil, Malayalam, and Tulu are the most widely spoken Dravidian languages [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These
Dravidian languages are low-resource languages due to the limited availability of digital tools and
resources.
      </p>
      <p>To address the aforementioned limitations, we propose a lightweight pipeline for word-level LID
that leverages classical machine learning models with robust feature engineering. Our approach uses
TF-IDF over character n-grams (1 to 4 grams) to extract discriminative features from words.
The features are then input to an ExtraTrees classifier, chosen for its efficiency, interpretability, and
strong performance on textual data.</p>
      <p>The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes
the task, Section 4 presents the methodology, and Section 5 reports the experiments, results, and
observations, followed by the conclusion in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recently, researchers in the field of code-mixed text have shown growing interest in under-resourced
and low-resource languages for various applications, such as sentiment analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], machine translation,
and so on [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The task of word-level LI has become increasingly important as multilingual
and code-mixed content continues to grow, especially on digital platforms like Facebook, YouTube,
etc. Researchers have explored various approaches to address word-level LI for languages where
extensive corpora and linguistic resources are readily available [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, the challenge of processing
languages with limited resources, often referred to as low-resource languages, for word-level LI has
gained significant attention. Some of the related works for word-level LI are described below:
      </p>
      <p>
        Thara and Poornachandran [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] scraped YouTube comments to identify bilingual
Malayalam-English code-mixed text. To filter the comments, they removed English alphabets, numbers,
special characters, and emoticons. They used transformer models (CamemBERT, XLM-RoBERTa,
ELECTRA, and DistilBERT) to predict language tags at the word level. The results of this study showed
that ELECTRA outperformed the other models, obtaining an F1-score of 0.993. Deka et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
proposed a Bidirectional Encoder Representations from Transformers (BERT) based approach for LI
using a Kannada-English code-mixed corpus. Their approach achieved an 86% weighted average F1-score
and a macro average F1-score of 57%. To identify the language of words in code-mixed Kannada
texts, Yigezu et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a Bi-LSTM with an attention model that integrates BERT features to
enhance word-level LI accuracy. While the existing work on word-level LI in low-resource settings has
made significant advances, there are still some limitations that create opportunities for further research.
For instance, code-mixed data often depends on the availability of high-resource languages, which are not
always accessible. Processing user-generated text is another challenge due to its variability, including
its code-mixed nature, spelling and grammatical errors, etc. Additionally, such text often lacks context,
making it harder to interpret meaning and intent accurately. Finally, the effectiveness of incorporating
linguistic features can vary greatly depending on the specific languages and features used, and finding
the optimal combination remains to be explored.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>Language Identification (LI) is the process of determining the language present in a given text and
serves as a foundational step for many applications, including sentiment analysis, machine translation,
information retrieval, and natural language understanding. In multilingual India, particularly among
young people, social media posts often contain code-mixed text that blends local languages with English
across different levels. This creates substantial challenges for LI, especially when code-mixing occurs
within a single word.</p>
      <p>Dravidian languages, spoken widely in southern India, are under-resourced despite their rich
morphological complexity. They face additional technological hurdles, especially regarding script representation
in digital spaces, which often leads users to adopt Roman or hybrid scripts for online communication.
While this widespread code-mixing ofers a wealth of linguistic data for research, it remains relatively
unexplored. To tackle the challenge of word-level LI for Dravidian languages, we are organizing a
shared task that provides datasets for five languages (Kannada, Tamil, Malayalam, Telugu, and Tulu)
encouraging the development of advanced LI models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In the proposed methodology, the word-level LI task is modeled as a sequence labeling problem where
the goal is to assign a language label to each word in a sequence. This is achieved by training ML models.
The framework of the proposed model is shown in Figure 1, and the steps involved in the framework are
described in the following subsections.</p>
      <p>The approach involves supervised machine learning using character-level features and multiple classifiers for
comparison.</p>
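      <p>As a sketch of the described pipeline (the toy words, tags, and hyperparameters below are illustrative assumptions, not the shared-task data or the exact tuned settings), the approach can be expressed with scikit-learn as follows:</p>

```python
# Minimal sketch: TF-IDF over character 1-4 grams feeding an
# ExtraTrees classifier. Words and tags are toy examples only.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

train_words = ["naanu", "banni", "beku", "school", "friends", "happy"]
train_tags = ["kn", "kn", "kn", "en", "en", "en"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))
X_train = vectorizer.fit_transform(train_words)

clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, train_tags)

# Each unseen word is tagged independently with a language label
X_test = vectorizer.transform(["schooling", "bekku"])
print(clf.predict(X_test))
```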
      <sec id="sec-4-1">
        <title>4.1. Feature Extraction Using TF-IDF Vectorization</title>
        <p>We employed a TF-IDF vectorizer configured for character-level analysis with an n-gram range of (1, 4) to
effectively capture subword patterns, such as character sequences distinctive to specific languages. The
vectorizer was fitted on the training data to acquire the vocabulary and convert the words into a sparse matrix
representation. Subsequently, the test data was transformed using the fitted vectorizer to maintain
consistency and prevent data leakage. For generating final predictions on unseen data, the 'Word'
column is transformed using the same fitted vectorizer.</p>
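        <p>The fit-on-train, transform-on-test discipline described above can be sketched as follows (the word lists are hypothetical placeholders, not the shared-task vocabulary):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character-level TF-IDF with n-grams of length 1 to 4, as described above.
train_words = ["bere", "beray", "matter", "yenchina"]  # illustrative only
test_words = ["yencha", "better"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))

# Fit ONLY on the training words: the vocabulary of character n-grams
# is learned here, so no information from the test set can leak in.
X_train = vectorizer.fit_transform(train_words)

# Transform the test words with the already-fitted vectorizer;
# n-grams unseen during fitting are simply ignored.
X_test = vectorizer.transform(test_words)

# Both sparse matrices share the same column space (one column per n-gram)
print(X_train.shape, X_test.shape)
```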
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Label Encoding</title>
        <p>The LabelEncoder is used to convert categorical language tags into numerical labels. target_names
holds the list of unique language classes, and all_labels holds their corresponding indices for
multi-class handling.</p>
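        <p>A minimal illustration of this encoding step, using hypothetical language tags:</p>

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical language tags for a handful of words
tags = ["kn", "en", "mixed", "kn", "en"]

encoder = LabelEncoder()
all_labels = encoder.fit_transform(tags)  # one numeric label per word
target_names = list(encoder.classes_)     # sorted unique class names

print(target_names)      # ['en', 'kn', 'mixed']
print(list(all_labels))  # [1, 0, 2, 1, 0]
```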
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset</title>
        <p>The datasets comprising code-mixed and monolingual texts for the five languages (Kannada, Malayalam,
Telugu, Tamil, and Tulu) are utilized. Each dataset includes word-level annotations. Table 1 contains
sample words, their English translations, and the corresponding labels/tags from the given dataset.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>The performance of the classifier is evaluated using the Macro F1-Score (M_F1). Macro scores are
preferred for evaluating performance across all classes without bias toward majority classes. The performances of the
proposed models on the test sets are shown in Table 2.</p>
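        <p>For reference, the macro-averaged F1 metric can be computed with scikit-learn; the gold and predicted tags below are toy values, not shared-task outputs:</p>

```python
from sklearn.metrics import f1_score

# Macro F1 averages per-class F1 scores with equal weight,
# so minority language classes count as much as majority ones.
y_true = ["kn", "kn", "en", "en", "mixed", "mixed"]
y_pred = ["kn", "kn", "en", "kn", "mixed", "en"]

m_f1 = f1_score(y_true, y_pred, average="macro")
print(round(m_f1, 4))  # 0.6556
```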
        <p>We evaluate our method on datasets for five languages: Kannada, Malayalam, Telugu, Tamil, and
Tulu, achieving better ranks compared to baselines in benchmark tasks. This work emphasizes simplicity
and effectiveness, making it suitable for resource-limited environments while outperforming more
complex models in speed and accuracy on the given datasets.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we show that TF-IDF with character n-grams and a classical ML model can achieve
good results in word-level LID for multiple languages, addressing key limitations of prior methods.
This work was carried out as part of the "CoLI-Dravidian@2025: Word-level Code-Mixed Language
Identification in Dravidian Languages" shared task at FIRE 2025. By training ML models with character
n-gram features, the proposed model obtained Macro F1 scores of 0.7084 for Tamil and 0.7572 for
Telugu. For both languages, we secured 3rd rank. Future work could integrate this approach with deep
learning techniques.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Chat GPT-4 in order to:Grammar and spelling
check. Using this tool, the author(s) reviewed and edited the content as needed and take full responsibility
for the publication’s control.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Doğruöz</surname>
          </string-name>
          ,
          <article-title>Word level language identification in online multilingual communication</article-title>
          ,
          <source>in: Proceedings of the 2013 conference on empirical methods in natural language processing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>857</fpage>
          -
          <lpage>862</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shanmugalingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sumathipala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Premachandra</surname>
          </string-name>
          ,
          <article-title>Word level language identification of code mixing text in social media using nlp</article-title>
          ,
          <source>in: 2018 3rd international conference on information technology research (ICITR)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gundapu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mamidi</surname>
          </string-name>
          ,
          <article-title>Word level language identification in english telugu code mixed data</article-title>
          ,
          <source>in: Proceedings of the 32nd Pacific Asia conference on language, information and computation</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jhamtani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Bhogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raychoudhury</surname>
          </string-name>
          ,
          <article-title>Word-level language identification in bi-lingual code-switched texts</article-title>
          ,
          <source>in: Proceedings of the 28th Pacific Asia Conference on language, information and computing</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          , CoLI@FIRE2023:
          <article-title>Findings of word-level language identification in code-mixed tulu text</article-title>
          ,
          <source>in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>Overview of coli-tunglish: Word-level language identification in code-mixed tulu text at fire 2023</article-title>
          , in: FIRE (Working Notes),
          <year>2023</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed tulu text</article-title>
          ,
          <source>in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          , et al., MUCSD@DravidianLangTech2023:
          <article-title>Predicting sentiment in social media text using machine learning techniques</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          , K. G,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , CoLI@FIRE2024:
          <article-title>Findings of word-level code-mixed language identification in dravidian languages</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poornachandran</surname>
          </string-name>
          ,
          <article-title>Transformer Based Language Identification for Malayalam-English Code-Mixed Text</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>118837</fpage>
          -
          <lpage>118850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <article-title>BERT-Based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ ICON 2022</article-title>
          , in:
          <source>Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Word Level Language Identification in Code-Mixed Kannada-English Texts using Deep Learning Approach</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>