<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GraMLID: GRU-Assisted Multilingual BERT for Word-Level Language Identification in Low-Resource Dravidian Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <email>supriya.chanda@bennett.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suhani Verma</string-name>
          <email>btbte23017_suhani@banasthali.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Banasthali Vidyapith</institution>
          ,
          <addr-line>Rajasthan</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bennett University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Word-level Language Identification (LID) in code-mixed social media text is a challenging task due to transliteration, script similarity, class imbalance, and noisy user-generated content. To address these challenges, we participated in the FIRE 2025 shared task on LID for five low-resource Dravidian languages (Kannada, Malayalam, Tamil, Telugu, and Tulu), alongside English. We propose a hybrid mBERT+GRU model that combines multilingual transformer representations with recurrent sequence modeling. The model was trained with a learning rate of 2e-5, weight decay of 0.01, batch size of 16, and up to 150 epochs, with an early stopping criterion to prevent overfitting. To handle class imbalance, we employed Focal Loss and oversampling strategies, while prediction cleaning was applied to remove irrelevant tags and ensure more accurate sequence labeling. Evaluation on the official shared task dataset, released by the organizers, demonstrates competitive performance across all languages. Our approach achieved a peak accuracy of 0.94 for Kannada, with results of 0.89 for Tamil, 0.86 for Telugu, 0.85 for Tulu, and 0.83 for Malayalam. These findings highlight the effectiveness of combining transformer embeddings with lightweight recurrent layers, complemented by loss reweighting, prediction refinement, and early stopping, for robust LID in low-resource and code-mixed settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Identification</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Code-Mixing</kwd>
        <kwd>GRU</kwd>
        <kwd>mBERT</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid proliferation of multilingual and code-mixed content on social media has amplified the need for robust word-level language identification (LID). This need is addressed by the FIRE 2025 shared task, which provides a benchmark dataset covering five Dravidian languages alongside English, annotated for word-level LID.</p>
      <p>In this work, we present our participation in the FIRE 2025 shared task [9]. We propose a hybrid architecture combining mBERT (bert-base-multilingual-cased) and a GRU (Gated Recurrent Unit), pairing transformer-based contextual embeddings with lightweight recurrent sequence modeling. To further enhance robustness, we incorporate strategies for handling class imbalance and apply prediction refinement to ensure consistent labeling across code-mixed sequences.</p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work; Section 3 describes
the dataset; Section 4 presents the proposed methodology; Section 5 reports results and analysis; and
Section 6 concludes with key findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Over the decades, LID has progressed from rule-based statistical systems to modern neural and
transformer models, particularly to handle code-mixed and low-resource language scenarios.</p>
      <p>Initial LID systems predominantly used rule-based techniques, such as character-level n-gram models, which were quite effective for monolingual environments [10]. However, these methods struggled with code-mixed or transliterated text common in social media. Statistical models like Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) offered improvements for high-resource languages [11], but their performance degraded significantly in noisy, short, social media style code-mixed scenarios.</p>
      <p>Neural methods marked a step change. LSTM-based sub-word LID models for Indian languages achieved robust performance on short sequences [12]. Bidirectional LSTMs (BiLSTMs) for Hindi-English code-mixed texts yielded notable improvements in handling noise and brevity in social media content [13].</p>
      <p>The transformer era brought significant momentum to multilingual LID. Multilingual BERT (mBERT)
[14] supports 104 languages, while XLM [15] improved cross-lingual learning for over 100 languages.
India-focused variants like IndicBERT [16] and MuRIL [17] are tailored for Indian linguistic phenomena
such as code-mixing and transliteration, improving performance in low-resource settings.</p>
      <p>For structured sequence prediction, the BiLSTM-CRF architecture [18] has been widely deployed
across sequence labeling tasks, establishing a precedent for combining contextual encodings with
structure-aware decoding models. Despite these advancements, research focused specifically on
low-resource Dravidian languages remains limited. The CoLI-Kanglish shared task (ICON 2022) provided a benchmark for Kannada-English word-level LID. BERT-based models achieved an 86% weighted F1-score [19], while an overview report noted the highest macro F1 of around 0.62 [20]. Earlier work applied traditional classifiers such as KNN and SVM, reaching F1-scores of around 0.58 [21]. Efforts expanding to multiple Dravidian languages include a Kannada-English dataset with benchmarked ML, DL, and transfer learning models, in which CoLI-ngrams achieved a macro F1 of 0.64 [22].</p>
      <p>Recent work explored prompt engineering using GPT-3.5 Turbo for word-level LID in Dravidian languages, noting higher accuracy for Kannada than for Tamil and demonstrating the potential of large language model-based prompting for low-resource code-mixed LID [23]. Research in very low-resource and
code-mixed LID often hinges on clever use of minimal data. Mandal and Sanand [24] proposed three
strategies for code-mixed LID using minimal resources, achieving ensemble accuracy of approximately
92.6%.</p>
      <p>In summary, while rule-based, neural, and transformer approaches have advanced LID significantly,
their adaptation to code-mixed and low-resource Dravidian scenarios remains incomplete. Datasets
for Tulu in particular are sparse, and class imbalance continues to degrade system performance. Few
studies have combined hybrid architectures, imbalance-aware learning, and sequence refinement.</p>
      <p>Our work addresses these gaps by introducing a hybrid mBERT+GRU model, incorporating Focal Loss, oversampling, and prediction cleaning for robust word-level LID across five low-resource Dravidian languages. In doing so, we build on these prior strengths while advancing resilience in challenging multilingual contexts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this study is released by the organizers of the FIRE 2025 Word-Level LID shared task.
It consists of token-level annotated social media text across five low-resource Dravidian languages:
Kannada, Malayalam, Tamil, Telugu, and Tulu, in addition to English.</p>
      <p>Each language dataset varies in size and number of tag types, capturing a wide range of linguistic
phenomena. For example, the Kannada dataset includes tags such as kn, en, name, and loc, while
the Malayalam dataset also introduces additional tags like num and plc. Similarly, the Tamil dataset
contains unique composite tags like tmen to capture mixed Tamil-English tokens, while the Tulu dataset
contains cross-lingual overlaps with Kannada tokens.</p>
      <p>Table 1 summarizes the number of training and validation sentences along with the tag types defined
for each language in the FIRE 2025 LID dataset. Table 2 provides a detailed breakdown of tag frequency
distributions across these splits, highlighting strong class imbalances across languages; for example,
English tokens dominate in Kannada and Tulu, whereas native tokens are more prevalent in Tamil and
Malayalam. Such disparities emphasize the necessity of strategies like loss reweighting and oversampling
in our modeling pipeline. Finally, Table 3 presents representative example sentences from different
Dravidian languages in the dataset, showcasing the complexity of multilingual, code-mixed text and
further motivating the development of robust and adaptable models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this section, we describe in detail the methodology followed in our work on word-level LID for
Dravidian code-mixed texts. The pipeline is designed to handle the complex linguistic nature of
code-switching, transliteration, and multilingual social media data. It consists of three main components: (i)
preprocessing of raw data, (ii) model architecture combining mBERT and GRU, and (iii) training setup
and optimization strategies. A stepwise overview of the architecture is summarized in Algorithm 1.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>Our preprocessing pipeline begins with the removal of unwanted characters, such as punctuation marks,
special symbols, hashtags, and user mentions. While these features often serve as pragmatic markers
in social media conversations, they do not directly contribute to LID at the token level. URLs are also
stripped, as they are language-agnostic and introduce unnecessary noise into the embeddings.</p>
        <p>Emojis, which are pervasive in online communication, are also removed. Unlike many NLP tasks where numbers can be discarded, numeric tokens are retained in our case because the dataset explicitly contains tags such as num, marking them as meaningful entities. This decision is essential to ensure
consistency between the preprocessing pipeline and the annotation scheme.</p>
        <p>Finally, redundant whitespace is normalized, ensuring uniform tokenization across sentences. The
preprocessed data therefore represents a corpus that preserves meaningful linguistic and semantic
markers while filtering noise irrelevant to the identification of language tags.</p>
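        <p>To make the pipeline concrete, the sketch below shows one possible realization of these cleaning rules in Python; the regular expressions and the helper name clean_sentence are illustrative assumptions rather than the exact implementation used.</p>
        <preformat>
# Minimal preprocessing sketch (illustrative; regex patterns are assumptions).
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_sentence(tokens, tags):
    """Drop URLs, mentions, hashtags, emojis and bare punctuation; keep numeric tokens."""
    cleaned_tokens, cleaned_tags = [], []
    for tok, tag in zip(tokens, tags):
        if URL_RE.fullmatch(tok) or tok.startswith(("@", "#")):
            continue
        tok = EMOJI_RE.sub("", tok)
        tok = re.sub(r"\s+", " ", tok).strip()
        # Retain numerals (tagged 'num' in the data); drop tokens that are only punctuation.
        if not tok or (not tok.isdigit() and re.fullmatch(r"\W+", tok)):
            continue
        cleaned_tokens.append(tok)
        cleaned_tags.append(tag)
    return cleaned_tokens, cleaned_tags
        </preformat>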
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>The cornerstone of our approach is a hybrid architecture that combines the strengths of transformer-based encoders with recurrent sequence learners. Specifically, we employ the mBERT model as the base encoder and a GRU layer for sequential modeling.</p>
        <sec id="sec-4-2-0">
          <title>4.2.1. Multilingual BERT Encoder</title>
          <p>mBERT (bert-base-multilingual-cased) is a transformer-based model pre-trained on 104 languages using masked language modeling and next sentence prediction objectives. Its contextualized embeddings capture both inter-lingual and intra-lingual nuances, making it particularly suitable for multilingual and code-mixed scenarios. Each tokenized input sentence X = (x_1, x_2, ..., x_n) is passed through the mBERT encoder, producing contextual embeddings E = (e_1, e_2, ..., e_n), where each e_i captures bidirectional context around token x_i.</p>
        </sec>
        <sec id="sec-4-2-1">
          <title>4.2.2. GRU Sequence Learner</title>
          <p>Although transformers excel at capturing global context, they often underperform in modeling fine-grained sequential dependencies over long sequences, especially in noisy and code-mixed settings. To complement this, we integrate a GRU layer on top of the mBERT embeddings. The GRU is a lightweight recurrent neural network variant that efficiently models temporal dependencies through its gating mechanisms. The GRU processes the embedding sequence E, producing hidden states H = (h_1, h_2, ..., h_n) that capture sequential context in a manner complementary to the transformer’s global attention.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.3. Classification Layer</title>
          <p>The final hidden states H are passed through a fully connected layer, followed by a softmax function to produce probability distributions over the language tags for each token. Formally, ŷ_i = softmax(W · h_i + b), where W and b are trainable parameters of the classification layer. This ensures token-level predictions that align with the shared task’s requirements.</p>
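          <p>A minimal PyTorch sketch of this encoder-GRU-classifier stack is given below; the class name MBertGruTagger and the GRU hidden size are assumptions made for illustration, not the exact configuration used in our experiments.</p>
          <preformat>
# Sketch of the mBERT + GRU token classifier described above (PyTorch; sizes are assumptions).
import torch.nn as nn
from transformers import AutoModel

class MBertGruTagger(nn.Module):
    def __init__(self, num_tags, gru_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.gru = nn.GRU(self.encoder.config.hidden_size, gru_hidden, batch_first=True)
        self.classifier = nn.Linear(gru_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings E = mBERT(X), one vector per wordpiece.
        embeddings = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        # Sequential hidden states H = GRU(E).
        hidden, _ = self.gru(embeddings)
        # Token-level logits; softmax over tags yields the predicted label distribution.
        return self.classifier(hidden)
          </preformat>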
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.4. Handling Data Imbalance and Cleaning Predictions</title>
          <p>Code-mixed datasets suffer from high class imbalance, with English and dominant native languages
heavily outnumbering minority tags such as named entities, numerals, or rare transliterated words. To
mitigate this, we use Focal Loss, which dynamically down-weights easy-to-classify samples and places
greater emphasis on harder, minority-class tokens. Additionally, oversampling of minority classes is
performed during training to artificially balance the dataset.</p>
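          <p>The sketch below shows a standard formulation of Focal Loss for token classification; the focusing parameter value (gamma = 2) is an assumption made for illustration.</p>
          <preformat>
# Focal loss sketch for token-level classification (gamma value is an assumption).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    """logits: (batch, seq_len, num_tags); targets: (batch, seq_len), -100 marks ignored positions."""
    num_tags = logits.size(-1)
    ce = F.cross_entropy(logits.view(-1, num_tags), targets.view(-1),
                         reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                      # probability assigned to the true class
    loss = ((1.0 - pt) ** gamma) * ce        # down-weight easy, well-classified tokens
    mask = targets.view(-1) != ignore_index
    return loss[mask].mean()
          </preformat>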
          <p>Finally, a post-processing step called prediction cleaning is applied. This involves filtering out
irrelevant labels such as O (outside any language span) or sym (symbols), which occasionally appear in
predictions despite not being semantically meaningful for the downstream evaluation.</p>
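          <p>One way to realize this cleaning step is sketched below: tags outside the evaluation label set are replaced by the next most probable valid tag for that token. The exact fallback strategy is an assumption of this sketch.</p>
          <preformat>
# Prediction-cleaning sketch (fallback strategy is an assumption).
import torch

IRRELEVANT = {"O", "sym"}

def clean_predictions(logits, id2tag):
    """logits: (seq_len, num_tags) for one sentence; returns cleaned tag strings per token."""
    ranked = torch.argsort(logits, dim=-1, descending=True)   # tags ranked by score per token
    cleaned = []
    for token_ranking in ranked:
        for tag_id in token_ranking.tolist():
            if id2tag[tag_id] not in IRRELEVANT:
                cleaned.append(id2tag[tag_id])
                break
    return cleaned
          </preformat>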
          <p>The complete stepwise procedure of the model pipeline is presented in Algorithm 1.</p>
          <preformat>
Algorithm 1 Proposed mBERT+GRU Framework for Word-Level LID
1: Input: Tokenized code-mixed sentence X = (x_1, x_2, ..., x_n)
2: Preprocessing: Clean text (remove URLs, hashtags, punctuation, mentions, emojis; retain numbers)
3: Obtain contextualized embeddings E = mBERT(X)
4: Pass embeddings through GRU layer: H = GRU(E)
5: Apply fully connected + softmax: ŷ_i = softmax(W · h_i + b)
6: Compute loss using Focal Loss with dynamic class weighting
7: Oversample minority classes during training
8: Perform prediction cleaning to remove irrelevant tags (O, sym)
9: Output: Predicted sequence labels Ŷ = (ŷ_1, ŷ_2, ..., ŷ_n)
          </preformat>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training Setup</title>
        <p>The model is trained end-to-end with token-level supervision from the FIRE 2025 LID shared task
dataset. To optimize performance, we employ several training strategies, which we describe below.</p>
        <p>We use the Adam optimizer with decoupled weight decay (AdamW), which has become the de-facto
standard for transformer-based fine-tuning. The learning rate is initialized at 2 ×10−5, a value empirically
tuned for stability, and weight decay is set at 0.01 to prevent overfitting. A batch size of 16 is adopted,
balancing computational efficiency with gradient stability.</p>
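        <p>A minimal configuration sketch with these hyper-parameters is shown below; model and train_dataset are placeholders for the mBERT+GRU tagger and the token-level training data, respectively.</p>
        <preformat>
# Optimizer and data-loader configuration matching the reported hyper-parameters.
from torch.optim import AdamW
from torch.utils.data import DataLoader

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)       # model: mBERT+GRU tagger (placeholder)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)   # train_dataset: placeholder dataset
        </preformat>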
        <p>The model is trained for a maximum of 150 epochs. However, to mitigate overfitting and reduce
unnecessary computation, we employ an early stopping criterion. Training is terminated once the
validation loss plateaus for 3 consecutive epochs, ensuring that the model retains generalizable
performance without memorizing training data.</p>
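        <p>The early stopping criterion can be implemented as in the skeleton below; train_one_epoch and evaluate are assumed helper functions, and saving the best checkpoint is an illustrative choice.</p>
        <preformat>
# Early-stopping skeleton (patience of 3 epochs on validation loss; helpers are assumed).
import torch

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(150):                                    # maximum of 150 epochs
    train_one_epoch(model, train_loader, optimizer)         # assumed helper
    val_loss = evaluate(model, val_loader)                  # assumed helper returning mean validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")     # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                           # validation loss plateaued for 3 epochs
        </preformat>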
        <p>Given the nature of social media text, which can range from short phrases to longer posts, we set
the maximum sequence length to 512 tokens. This value ensures coverage for most sentences without
truncation. The WordPiece tokenizer associated with mBERT is used to handle out-of-vocabulary
tokens, ensuring robust subword segmentation across languages.</p>
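        <p>The sketch below illustrates WordPiece tokenization with a maximum length of 512 and alignment of word-level tags to sub-word positions; labelling only the first sub-word of each word (and ignoring the rest) is an assumption of this sketch.</p>
        <preformat>
# Tokenization and word-to-subword label alignment sketch (first-subword labelling is an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode(words, tags, tag2id, max_len=512):
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    max_length=max_len, padding="max_length", return_tensors="pt")
    labels, prev = [], None
    for word_id in enc.word_ids(batch_index=0):
        if word_id is None or word_id == prev:
            labels.append(-100)                  # ignore special tokens and sub-word continuations
        else:
            labels.append(tag2id[tags[word_id]])
        prev = word_id
    return enc, labels
        </preformat>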
        <p>The choice of Focal Loss is crucial in addressing dataset imbalance. Unlike traditional cross-entropy,
which treats all tokens equally, Focal Loss modulates the contribution of easy versus hard samples,
with a focusing parameter γ that down-weights well-classified examples. This ensures that rare labels such as numerals or location names are not overshadowed by dominant classes. Oversampling further complements this by artificially replicating underrepresented class instances during training, balancing
the gradient contributions across labels.</p>
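        <p>A simple way to realize this oversampling is sketched below; the rarity threshold and replication factor are illustrative assumptions.</p>
        <preformat>
# Oversampling sketch: replicate sentences containing rare tags (threshold and factor are assumptions).
from collections import Counter

def oversample(sentences, tag_sequences, rare_threshold=0.01, factor=3):
    """Duplicate training sentences that contain at least one under-represented tag."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    rare = {tag for tag, c in counts.items() if c / total < rare_threshold}
    out_sents, out_tags = [], []
    for sent, tags in zip(sentences, tag_sequences):
        copies = factor if any(t in rare for t in tags) else 1
        out_sents.extend([sent] * copies)
        out_tags.extend([tags] * copies)
    return out_sents, out_tags
        </preformat>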
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluate the performance of our proposed mBERT+GRU framework on the FIRE 2025 LID shared
task datasets. The experiments are carried out on the validation datasets provided by the organizers,
where we compute detailed classification reports (per-class Precision, Recall, F1-Score, Accuracy, and Support). These are reported for each of the five Dravidian languages separately. The final test set results are obtained from the official leaderboard and are summarized at the end of this section.</p>
      <p>We observe that performance is consistently strong for high-frequency classes such as ENGLISH
and the major Dravidian language tag in each dataset, whereas minority categories (e.g., Location,
Number, Other, Place) tend to have lower F1-scores due to data imbalance.</p>
      <p>On the Kannada validation set (Table 4), the model achieves an overall accuracy of 0.9079. It demonstrates strong recognition of English (F1 = 0.97) and Kannada (F1 = 0.89), although categories such as “other” and “name” are comparatively weaker. For Tulu (Table 5), the accuracy was 0.8177, with high scores for English (F1 = 0.90) and Tulu (F1 = 0.87), while mixed-language tokens remain particularly challenging (F1 = 0.48).</p>
      <p>In the case of Telugu (Table 6), the model obtains an overall accuracy of 0.7948. Performance is excellent for English (F1 = 0.90) and mixed tokens (F1 = 0.98), but categories with sparse representation such as “number” (F1 = 0.33) and “other” (F1 = 0.42) prove difficult to classify reliably. Similarly, the Tamil dataset (Table 7) yields an accuracy of 0.8989, where Tamil (F1 = 0.93) and English (F1 = 0.93) are predicted with high consistency, whereas less frequent categories like “location” (F1 = 0.65) show reduced performance.</p>
      <p>Finally, the Malayalam dataset (Table 8) reached an overall accuracy of 0.8705. The model performs particularly well on Malayalam (F1 = 0.94) and English (F1 = 0.90), but struggles with underrepresented categories such as “place” (F1 = 0.00) and “mixed” tokens (F1 = 0.35). Taken together, these results
highlight the robustness of the approach in handling high-resource categories, while underscoring
persistent challenges in dealing with rare or highly imbalanced classes.</p>
      <sec id="sec-5-1">
        <title>5.1. Leaderboard Results</title>
        <p>The final system submissions were evaluated on the official test sets, and the scores were reported on
the shared task leaderboard. The results across the five languages are summarized in Table 9. Among
the languages, Kannada achieved the highest score of 0.94, followed by Tamil with 0.89, and Telugu with
0.86. Tulu and Malayalam obtained scores of 0.85 and 0.83, respectively. These leaderboard outcomes
are consistent with the validation results, reflecting strong performance in high-resource languages
such as Kannada and Tamil, while relatively lower but competitive results were observed in Malayalam
and Tulu.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Error Analysis</title>
        <p>Despite strong overall performance, the model shows weaknesses in handling minority categories such
as place, number, and other, where limited training instances and class imbalance reduce reliability.
Malayalam and Tulu exhibit comparatively lower scores, largely due to sparse data and script overlap
leading to higher confusion among closely related tokens. The GRU layer, while effective for short
dependencies, struggles with long or abrupt language switches typical of social media text. Moreover,
mBERT’s general-domain pretraining limits its ability to fully capture domain-specific
transliterations and informal expressions, suggesting that domain-adaptive fine-tuning and richer cross-lingual
representations could further enhance robustness.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented a hybrid mBERT+GRU model for word-level LID in code-mixed Dravidian
social media text, addressing challenges of transliteration, noisy input, and class imbalance through
focal loss, oversampling, and prediction refinement. Our system achieved strong leaderboard results,
peaking at 0.94 accuracy for Kannada, alongside competitive scores for Tamil (0.89), Telugu (0.86), Tulu
(0.85), and Malayalam (0.83), demonstrating the effectiveness of combining multilingual transformer
embeddings with lightweight sequential modeling. While the approach proved robust across languages,
relatively lower performance in Malayalam and Tulu highlights the limitations posed by data scarcity
and script overlap. Future work should explore cross-lingual pretraining with domain-specific corpora,
advanced sequence encoders such as graph or attention-based architectures, and transfer learning
across related Dravidian languages to enhance generalization. Further emphasis should also be placed
on model-agnostic post-processing and deployment-oriented strategies for reliable, real-time LID in
multilingual user-generated content.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and spelling checking and for paraphrasing and rewording. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Automatic identification of multilingual documents</article-title>
          ,
          <source>in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>658</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Chanda, K. Tewari, A. Mukherjee, S. Pal, Leveraging chatgpt and xlm-roberta for sarcasm detection in dravidian code-mixed languages, in: Proceedings of FIRE (Working Notes), Forum for Information Retrieval Evaluation, India, 2024. URL: https://ceur-ws.org/Vol-4054/T4-14.pdf.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. L. Shashirekha, G. Sidorov, A. Gelbukh, Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, 2022, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Prathiba, R. Kannan, Language identification in code-mixed data: Challenges and approaches, Journal of Intelligent Systems (2020).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus creation for sentiment analysis in code-mixed tulu text, in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (SIGUL), European Language Resources Association (ELRA), Marseille, France, 2022, pp. 33-40.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Hegde, F. Balouchzahi, S. Coelho, S. H L, H. A. Nayel, S. Butt, Coli@fire2023: Findings of word-level language identification in code-mixed tulu text, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '23, Association for Computing Machinery, New York, NY, USA, 2024, pp. 25-26. URL: https://doi.org/10.1145/3632754.3633075. doi:10.1145/3632754.3633075.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Hegde, F. Balouchzahi, S. Coelho, H. L. Shashirekha, H. A. Nayel, S. Butt, Overview of coli-tunglish: Word-level language identification in code-mixed tulu text at fire 2023, in: Forum for Information Retrieval Evaluation (FIRE 2023) Working Notes, 2023, pp. 179-190.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Hegde, F. Balouchzahi, S. Butt, S. Coelho, K. G, H. S. Kumar, S. D, S. H. L., A. Agrawal, Coli@fire2024: Findings of word-level code-mixed language identification in dravidian languages, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '24, Association for Computing Machinery, New York, NY, USA, 2025, pp. 7-10. URL: https://doi.org/10.1145/3734947.3735663. doi:10.1145/3734947.3735663.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Hegde, F. Balouchzahi, S. Butt, S. Coelho, S. Hosahalli Lakshmaiah, A. Agrawal, Overview of CoLI-Dravidian 2025: Word-level Code-Mixed Language Identification in Dravidian Languages, in: Forum for Information Retrieval Evaluation (FIRE 2025), 2025.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, R. L. Mercer, A statistical approach to language identification, Computational Linguistics 18 (1992) 611-620.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] B. Hughes, T. Baldwin, M. Lui, Re-examining language identification, Journal of Computational Linguistics 32 (2006) 45-60.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Joshi, S. Negi, N. Goel, L. Singh, M. Shrivastava, Towards sub-word level language identification for indian languages, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1-10.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Zhang, Z. Yang, J. Qi, Deep learning for code-mixed language identification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 2246-2255.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Conneau, U. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, A. Joulin, M. Koepke, Cross-lingual language model pretraining, in: Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 7057-7067.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] D. Kakwani, A. Kunchukuttan, S. Gella, P. Bhattacharyya, M. Gokhale, A. Agarwal, R. Bhat, N. Kedia, A. Sharma, M. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages, in: Proceedings of the 12th Language Resources and Evaluation Conference (LREC), 2020, pp. 1490-1499.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Khanuja, A. Kunchukuttan, S. Kumar, M. Singh, S. Prasad, S. Gella, P. Bhattacharyya, A. Kumar, MuRIL: Multilingual representations for indian languages, arXiv preprint arXiv:2103.10730 (2021).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] P. Deka, N. J. Kalita, S. K. Sarma, Bert-based language identification in code-mix kannada-english text at the coli-kanglish shared task, in: ICON 2022 Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, ACL, 2022, pp. 12-17.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] F. Balouchzahi, S. Butt, A. Hegde, et al., Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022, in: ICON 2022 Shared Task on Word Level Language Identification, ACL, 2022, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Shahiki Tash, Z. Ahani, A. Tonja, et al., Word level language identification in code-mixed kannada-english texts using traditional machine learning algorithms, in: ICON 2022 Shared Task on Word Level Language Identification, ACL, 2022, pp. 25-28.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] H. Shashirekha, F. Balouchzahi, M. Anusha, et al., Coli-machine learning approaches for code-mixed language identification at the word level in kannada-english texts, in: CoLI shared task workshop, 2022.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Deroy, S. Maity, Prompt engineering using gpt for word-level code-mixed language identification in low-resource dravidian languages, arXiv preprint arXiv:2411.04025 (2024).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. Mandal, S. Sanand, Strategies for language identification in code-mixed low resource languages, arXiv preprint arXiv:1810.07156 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>