<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transformer Driven Word Level Classification of Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>C. Jerin Mahibha</string-name>
          <email>jerinmahibha@msec.edu.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wordson Robert</string-name>
          <email>wordsonrobert@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gersome Shimi</string-name>
          <email>gshimi2022@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Durairaj Thenmozhi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Science Education and Research</institution>
          ,
<addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Madras Christian College</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Meenakshi Sundararajan Engineering College</institution>
          ,
          <addr-line>Chennai</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Language detection is the process of automatically identifying the language used in a text, even when that text is not always coherent or grammatically correct. Language detection has become an essential tool in today's consumer-focused digital landscape, where businesses rely on user-generated content to tailor advertisements, products, and services more effectively. The challenge becomes much tougher when dealing with code-mixed or multilingual text, which is common in linguistically diverse regions like South India. These texts often include multiple languages, sometimes written in non-native scripts, and frequently involve code-switching at various levels, making it difficult for models trained only on monolingual data to perform well. To identify the language associated with a word, a high-performance model has been proposed for the CoLI-Dravidian@FIRE 2025 shared task on word-level identification, focused on Dravidian languages such as Tamil, Telugu, Malayalam, Kannada, and Tulu. The proposed system uses a language-agnostic model to identify the language associated with each word, using the datasets provided by the organizers of CoLI-Dravidian@FIRE 2025. The results of the proposed system are encouraging, with a macro F1 score of 0.8995 for Kannada, 0.7434 for Tamil, 0.8271 for Malayalam, 0.9515 for Telugu, and 0.8224 for Tulu. We ranked 1st on the leaderboard for Tamil, Malayalam, and Telugu, and 2nd for Tulu. These results show the strength of the proposed model in word-level language identification of Dravidian languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Language detection is the process of determining the language of an unknown text. When dealing
with languages that share a common origin, known information about those languages can be used
to guess, with a reasonable margin of error, which language a text belongs to. The relevant information
here is that related languages, and more specifically languages of Dravidian origin, share a large
amount of similar vocabulary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most Dravidian languages, like Tamil, Tulu, and Malayalam, exhibit this
extensively. CoLI-Dravidian@FIRE 2025 has organized a shared task on word-level language
identification for the most prominent Dravidian languages.
      </p>
      <p>
        Language detection is usually a tedious and rigorous process. It relies heavily on the vectorization of
the most commonly used phrases, and the model has to be statistically trained on that data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
datasets also need to be standardized to reduce any ambiguity for models like BERT (Bidirectional
Encoder Representations from Transformers). Transformer models can make high-precision,
reasoning-backed decisions from the data provided; they are high-performing models that can
understand nuance and context in the datasets [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A transformer also uses its pretrained knowledge to
connect the dataset to the grammar patterns of languages it already understands, giving a much
more precise prediction, and it exploits smaller details like punctuation, word length, and word order
to produce a more informed and effective result.
      </p>
      <p>
        For most languages, models like XLM-BERT are highly effective because they are efficient to use and
come pretrained on over 100 languages. However, for Dravidian languages, models like RoBERTa and
BERT often perform better. In the case of low-resource languages like Tamil and Tulu [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], when working
with unlabeled data, semi-supervised learning techniques like pseudo-labeling are particularly effective
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For similar languages, generalization tends to work well, which boosts the performance of these
models. BERT, in particular, excels in both generalization and data expansion [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Retraining the model
on the pseudo-labeled dataset can lead to even better results for transformer-based models like BERT.
      </p>
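      <p>As an illustration of this idea, the sketch below shows a minimal pseudo-labeling loop of the kind
described above. It is not the shared-task code: the classifier interface (fit, predict_proba, classes_) is
an assumed scikit-learn-style interface chosen for the example.</p>
      <preformat>
# Minimal pseudo-labeling sketch (illustrative only).
# `model` is any classifier exposing fit()/predict_proba()/classes_,
# e.g. a scikit-learn pipeline; this interface is an assumption.
def pseudo_label(model, labeled, unlabeled, threshold=0.9, rounds=2):
    for _ in range(rounds):
        words, tags = zip(*labeled)
        model.fit(words, tags)                   # train on current labels
        confident, remaining = [], []
        for w in unlabeled:
            probs = model.predict_proba([w])[0]  # per-language probabilities
            best = probs.argmax()
            if probs[best] >= threshold:         # keep only confident guesses
                confident.append((w, model.classes_[best]))
            else:
                remaining.append(w)
        labeled = list(labeled) + confident      # grow the training set
        unlabeled = remaining                    # retrain on the expanded set
    return model
</preformat>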
      <p>The rest of the paper is organized as follows: Sections 2 and 3 discuss related works and the dataset,
Section 4 presents the system description, Section 5 discusses the results, and Sections 6 and 7 present
the error analysis and conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        A survey of language identification of code-mixed text, covering techniques, data availability, and
challenges, has been carried out by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The findings revealed that excellent performance had been shown by
a multichannel CNN incorporating BiLSTM and CRF layers. Among non-neural-network techniques,
SVM and CRF are recommended. Transformer-based techniques can also be considered among the most
robust for code-mixed language identification [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] due to their remarkable
performance in equivalent tasks. BERT, a transformer model, along with its variants CamemBERT and
DistilBERT, has been used to implement word-level language identification of Malayalam-English
code-mixed data on a dataset collected from social media platforms [9]. A word-level language
identification model for code-mixed Indonesian, Javanese, and English tweets has been implemented
using various approaches, such as fine-tuning BERT, BiLSTM-based models, and CRF [10]. BERT’s
ability to understand each word’s context from a given text sequence is evident from the results
obtained by the fine-tuned IndoBERTweet model [11]. A framework for language identification has
been proposed that makes use of a dynamic switching mechanism for effective language classification
of both words that are borrowed or embedded from other languages and words that are valid in
multiple languages [12]. Identification of languages on Twitter has been implemented by exploiting a
transfer learning approach and fine-tuning BERT models [13]. It involves Hindi-English-Urdu
code-mixed text for language pretraining and Hindi-English code-mixed text for subsequent word-level
language classification [14]. It is evident from the results that representations pretrained over
code-mixed data produce better results than their monolingual counterparts. The use of a
transformer-based model for word-level language identification in code-mixed Kannada-English texts
has been proposed by [15]. An empirical analysis of Dravidian language identification in social media
text using machine learning and deep learning approaches with k-fold cross-validation has been
implemented [16]. The empirical analysis evaluated various machine learning and deep learning models
based on performance measures like accuracy, precision, recall, and F1-score. It was found that the
language-agnostic model outperformed all other models on the task of language detection in Dravidian
languages. Language identification from code-mixed text combining English with one of the three
South Dravidian languages, Kannada, Malayalam, and Tamil, was a part of the Dravidian Language
Identification (DLI) shared task organized at the VarDial 2021 workshop. [17] used a Naive Bayes based
classifier with adaptive language models to obtain competitive performance in the shared task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data set</title>
      <p>The dataset used by the proposed system was provided by the shared task organizers and was available
for five languages, namely, Tamil, Tulu, Telugu, Kannada, and Malayalam. Separate datasets were
provided for training, validation, and testing the model. The proposed task was to classify the language
associated with the words from the test dataset.</p>
      <p>The distribution of data in all the datasets is tabulated in Table 1. The number of instances in the
training dataset was 30910, 13514, 25995, 6280, and 29524 for Kannada, Tamil, Malayalam, Telugu, and
Tulu, respectively. The evaluation dataset had 2016 instances for Kannada, 1984 instances for Tamil,
2008 instances for Malayalam, 515 instances for Telugu, and 3006 instances for Tulu. There were 2075,
2006, 1997, 494, and 3283 instances in the test dataset for the languages Kannada, Tamil, Malayalam,
Telugu, and Tulu, respectively.</p>
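      <p>For readers reproducing these counts, a minimal sketch of loading and summarizing one of the
word-tag files is given below; the file name and the two-column tab-separated layout are assumptions
made for illustration, not a specification from the organizers.</p>
      <preformat>
import pandas as pd

# Hypothetical file name and layout: one word and its language tag per row.
train = pd.read_csv("coli_tamil_train.tsv", sep="\t", names=["word", "tag"])

print(len(train))                   # number of training instances
print(train["tag"].value_counts())  # per-label distribution (shows imbalance)
</preformat>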
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <sec id="sec-4-1">
        <title>The Figure 1 gives a visual distribution of all the given datasets.</title>
        <p>The architecture of the proposed system uses a language-agnostic model to identify the language
associated with the words in the test dataset. The languages that are considered for identification
include five Dravidian languages: Kannada, Tamil, Malayalam, Telugu, and Tulu. The figure 2 illustrates
the components of the proposed system. The training dataset is used to train the model, optimizing
it for maximum accuracy. The performance of the proposed model is evaluated using the evaluation
dataset. The trained model is then used for predicting the language associated with instances of the
test dataset, which uses diferent metrics, such as accuracy, precision, recall, and F1 scores, to assess its
performance.
4.1. Methodology
To accomplish the language detection, from the available transformer models, a language-agnostic
model was chosen. The model was trained for 10 epochs by setting the parameter representing the
number of labels in the respective dataset.</p>
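        <p>A minimal sketch of this setup, assuming the publicly available LaBSE checkpoint on the
HuggingFace hub with a sequence-classification head, is shown below. The number of labels, optimizer,
epochs, and batch size follow the description in this section, while the learning rate is an illustrative
assumption.</p>
        <preformat>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint name for the language-agnostic model (LaBSE).
MODEL = "sentence-transformers/LaBSE"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels is set to 4 or 5 depending on the tag set of the dataset.
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=5)

# Adam optimizer and batch size 32, trained for 10 epochs (as described);
# the learning rate is an assumption for illustration.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
EPOCHS, BATCH_SIZE = 10, 32
</preformat>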
        <p>Language-agnostic BERT Sentence Embedding (LaBSE) [18] is a multilingual model for cross-lingual
sentence embedding in 109 languages. Its pre-training combines masked language modelling (MLM)
with translation language modelling (TLM). The model performs well in multilingual sentence
embedding and multilingual text retrieval. To make training more efficient, a dual-encoder architecture
is used, which is considered an effective approach for learning cross-lingual embeddings. The BERT
transformer model forms the base of the encoder architecture, with 12 transformer blocks, 12 attention
heads, and 768 per-position hidden units. All languages share the encoder parameters.</p>
        <p>Tokenization plays a crucial role in preparing text for input into the language-agnostic model,
allowing it to process and understand the semantic meaning of sentences across multiple languages.
The input text is tokenized into smaller units using the WordPiece tokenizer, which breaks words down
into subword units. Each token is assigned a unique token ID, which corresponds to its index in the
tokenizer’s vocabulary. Special tokens are added to the tokenized input to mark the beginning and end
of sentences, as well as to denote padding or unknown tokens. Along with token IDs, attention masks
are generated to indicate which tokens are actual words and which are padding; this helps the model
focus only on the relevant parts of the input during processing. From the last transformer block,
normalized [CLS] token representations are extracted as the sentence embeddings. A shared
transformer network is used to encode the source and target text, and a translation ranking task drives
the model towards similar representations for the source and target text. Mapping similar words from
different languages to a common representation is part of the parameter-sharing capacity of the
encoders and is controlled by the hyperparameters set when the model is trained.</p>
        <p>The model is trained with the objective of feature prediction. While tuning the model for language
detection, the number of labels is set to 4 or 5 based on the number of labels provided in the dataset,
and the model is trained for 10 epochs. The model has been implemented with the Adam optimizer and
a batch size of 32. The process behind this method is represented by Figure 3. Language embeddings
represent entire languages as fixed-size vectors in an embedding space. These embeddings can capture
various linguistic properties of languages, such as vocabulary, syntax, and semantics. In the context of
language detection tasks, language embeddings contribute by capturing the unique linguistic
characteristics of different languages in a continuous vector space.</p>
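        <p>To make the tokenization step concrete, the sketch below shows the token IDs and attention mask
produced for a short input, assuming the same LaBSE tokenizer as above; the romanized code-mixed
example string is invented for illustration.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")

# WordPiece tokenization with special tokens and padding, as described above.
enc = tokenizer("naanu school ge hogthini", padding="max_length",
                max_length=12, truncation=True, return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
# -> [CLS] ... subword units ... [SEP] [PAD] ...
print(enc["input_ids"])       # unique token IDs from the vocabulary
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding
</preformat>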
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>To assess the performance of the proposed model, the macro-F1 score has been used by the
organizers. Macro-F1 is a balanced metric because it gives equal importance to how well each label is
predicted, even when the sample sizes are unequal. It is the unweighted average of the per-label F1
scores, and the F1 score itself is the harmonic mean of precision (the fraction of predictions for a class
that are correct) and recall (the fraction of actual instances of that class that are found). This works
especially well for the proposed system because the dataset is clearly
unbalanced across the five languages.</p>
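      <p>A small sketch of the metric, using scikit-learn's implementation on invented toy labels, shows why
macro averaging treats minority languages on an equal footing:</p>
      <preformat>
from sklearn.metrics import f1_score

# Toy gold and predicted word-level tags (invented for illustration).
y_true = ["ta", "ta", "ta", "en", "en", "mixed"]
y_pred = ["ta", "ta", "en", "en", "en", "mixed"]

# Macro-F1: unweighted mean of per-label F1 scores, so the single
# "mixed" instance counts as much as the majority "ta" label.
print(f1_score(y_true, y_pred, average="macro"))
</preformat>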
      <p>The performance scores of the proposed model are presented in Table 2. The proposed system
achieved a macro F1 score of 0.8995 for Kannada, 0.7434 for Tamil, 0.8271 for Malayalam, 0.9515 for
Telugu, and 0.8224 for Tulu. These scores show that the model was consistently accurate across
languages. In the CoLI-Dravidian@FIRE 2025 shared task, this placed us 1st for Tamil, Malayalam,
and Telugu, 2nd for Tulu, and 7th for Kannada, which shows that the proposed system for language
detection performed well compared to the other submissions.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Error Analysis</title>
      <p>The macro F1 scores obtained for the task using the proposed language-agnostic model show that
some false positive and false negative classifications have occurred. One reason for this could be the
imbalanced nature of the dataset. The confusion matrices of the proposed system for the different
languages are represented in Figures 4, 5, 6, and 7. The consistent performance of the model suggests
that it generalizes beyond memorization, effectively classifying across languages. The proposed model
utilizes a language-agnostic approach, which is effective for low-resource languages. This consistency
across languages shows that the proposed model is equally effective for every class label, not just the
ones with more data.</p>
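      <p>Confusion matrices like those in Figures 4, 5, 6, and 7 can be produced as sketched below; the label
set and predictions here are invented placeholders, not the actual system outputs.</p>
      <preformat>
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

labels = ["ta", "en", "mixed", "other"]  # placeholder tag set
y_true = ["ta", "en", "ta", "mixed", "other", "en"]
y_pred = ["ta", "en", "en", "mixed", "other", "en"]

# Rows are gold labels, columns are predictions; off-diagonal cells
# are the false positives/negatives discussed above.
cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
</preformat>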
      <p>Occasional mismatches and etymologically ambiguous words that led to errors remain areas for
improvement. A few of the misclassifications are shown in Table 3.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Especially online, for digital content and personalization, language detection has become an
increasingly important process. It enables platforms to deliver more customized content to users,
moderate information better, regulate misinformation, and block hate speech. This process becomes even
more crucial when dealing with similar, niche languages, like those of Dravidian origin, because it allows
for better tagging, search engine results, sarcasm detection, and even sentiment analysis. Recognizing
its criticality, FIRE 2025 introduced a shared task focused specifically on Dravidian language
detection. For this task, given under the banner CoLI-Dravidian@FIRE 2025, the proposed system uses
a language-agnostic model to effectively classify languages. The achieved performance metrics are
consistently high across languages, indicating strong and reliable performance. However, by adopting
more customizable approaches, such as time-based anomaly detection and deeper model refinement,
the boundaries can be pushed further to develop even more accurate and adaptable deep learning
models.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
        <p>[9] S. Thara, P. Poornachandran, Transformer based language identification for malayalam-english
code-mixed text, IEEE Access 9 (2021) 118837–118850.
[10] A. Hegde, F. Balouchzahi, S. Coelho, H. Shashirekha, H. A. Nayel, S. Butt, Overview of coli-tunglish:
Word-level language identification in code-mixed tulu text at fire 2023., in: FIRE (Working Notes),
2023, pp. 179–190.
[11] A. F. Hidayatullah, R. A. Apong, D. T. Lai, A. Qazi, Corpus creation and language identification for
code-mixed indonesian-javanese-english tweets, PeerJ Computer Science 9 (2023) e1312.
[12] N. Sarma, R. S. Singh, D. Goswami, Switchnet: Learning to switch for word-level language
identification in code-mixed social media text, Natural Language Engineering 28 (2022) 337–359.
[13] M. Z. Ansari, M. S. Beg, T. Ahmad, M. J. Khan, G. Wasim, Language identification of
hindienglish tweets using code-mixed bert, in: 2021 IEEE 20th International Conference on Cognitive
Informatics &amp; Cognitive Computing (ICCI* CC), IEEE, 2021, pp. 248–252.
[14] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, Coli-machine learning approaches
for code-mixed language identification at the word level in kannada-english texts, arXiv preprint
arXiv:2211.09847 (2022).
[15] A. L. Tonja, M. G. Yigezu, O. Kolesnikova, M. S. Tash, G. Sidorov, A. Gelbuk, Transformer-based
model for word level language identification in code-mixed kannada-english texts, arXiv preprint
arXiv:2211.14459 (2022).
[16] G. Shimi, C. Mahibha, D. Thenmozhi, An empirical analysis of language detection in dravidian
languages, Indian Journal of Science and Technology 17 (2024) 1515–1526.
[17] T. Jauhiainen, T. Ranasinghe, M. Zampieri, Comparing approaches to dravidian language
identification, arXiv preprint arXiv:2103.05552 (2021).
[18] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding,
arXiv preprint arXiv:2007.01852 (2020).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Steever</surname>
          </string-name>
          ,
          <article-title>Introduction to the dravidian languages</article-title>
          ,
          <source>in: The Dravidian languages, Routledge</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <article-title>Text representations and word embeddings: Vectorizing textual data</article-title>
          ,
          <source>in: Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies, and Applications, Springer</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sulaiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hamzah</surname>
          </string-name>
          ,
          <article-title>Evaluation of transfer learning and adaptability in large language models with the glue benchmark</article-title>
          ,
          <source>Authorea Preprints</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          , Coli@fire2023:
          <article-title>Findings of word-level language identification in code-mixed tulu text</article-title>
          ,
          <source>in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Pseudo labeling methods for semi-supervised semantic segmentation: A review and future perspectives</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Koroteev</surname>
          </string-name>
          ,
          <article-title>Bert: a review of applications in natural language processing and understanding</article-title>
          ,
          <source>arXiv preprint arXiv:2103.11943</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Hidayatullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Qazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Apong</surname>
          </string-name>
          ,
          <article-title>A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development</article-title>
          ,
          <source>IEEE access 10</source>
          (
          <year>2022</year>
          )
          <fpage>122812</fpage>
          -
          <lpage>122831</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Thara, P. Poornachandran, Transformer based language identification for malayalam-english code-mixed text, IEEE Access 9 (2021) 118837-118850.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Hegde, F. Balouchzahi, S. Coelho, H. Shashirekha, H. A. Nayel, S. Butt, Overview of coli-tunglish: Word-level language identification in code-mixed tulu text at fire 2023, in: FIRE (Working Notes), 2023, pp. 179-190.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. F. Hidayatullah, R. A. Apong, D. T. Lai, A. Qazi, Corpus creation and language identification for code-mixed indonesian-javanese-english tweets, PeerJ Computer Science 9 (2023) e1312.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] N. Sarma, R. S. Singh, D. Goswami, Switchnet: Learning to switch for word-level language identification in code-mixed social media text, Natural Language Engineering 28 (2022) 337-359.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Z. Ansari, M. S. Beg, T. Ahmad, M. J. Khan, G. Wasim, Language identification of hindi-english tweets using code-mixed bert, in: 2021 IEEE 20th International Conference on Cognitive Informatics &amp; Cognitive Computing (ICCI*CC), IEEE, 2021, pp. 248-252.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, Coli-machine learning approaches for code-mixed language identification at the word level in kannada-english texts, arXiv preprint arXiv:2211.09847 (2022).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. L. Tonja, M. G. Yigezu, O. Kolesnikova, M. S. Tash, G. Sidorov, A. Gelbukh, Transformer-based model for word level language identification in code-mixed kannada-english texts, arXiv preprint arXiv:2211.14459 (2022).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Shimi, C. Mahibha, D. Thenmozhi, An empirical analysis of language detection in dravidian languages, Indian Journal of Science and Technology 17 (2024) 1515-1526.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] T. Jauhiainen, T. Ranasinghe, M. Zampieri, Comparing approaches to dravidian language identification, arXiv preprint arXiv:2103.05552 (2021).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding, arXiv preprint arXiv:2007.01852 (2020).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>