<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GraMLID: GRU-Assisted Multilingual BERT for Word-Level Language Identification in Low-Resource Dravidian Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <email>supriya.chanda@bennett.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suhani Verma</string-name>
          <email>btbte23017_suhani@banasthali.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Banasthali Vidyapith</institution>
          ,
          <addr-line>Rajasthan</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bennett University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Word-level Language Identification (LID) in code-mixed social media text is a challenging task due to transliteration, script similarity, class imbalance, and noisy user-generated content. To address these challenges, we participated in the FIRE 2025 shared task on LID for five low-resource Dravidian languages (Kannada, Malayalam, Tamil, Telugu, and Tulu), alongside English. We propose a hybrid mBERT+GRU model that combines multilingual transformer representations with recurrent sequence modeling. The model was trained with a learning rate of 2e-5, weight decay of 0.01, batch size of 16, and up to 150 epochs, with an early stopping criterion to prevent overfitting. To handle class imbalance, we employed Focal Loss and oversampling strategies, while prediction cleaning was applied to remove irrelevant tags and ensure more accurate sequence labeling. Evaluation on the official shared task dataset, released by the organizers, demonstrates competitive performance across all languages. Our approach achieved a peak accuracy of 0.94 for Kannada, with results of 0.89 for Tamil, 0.86 for Telugu, 0.85 for Tulu, and 0.83 for Malayalam. These findings highlight the effectiveness of combining transformer embeddings with lightweight recurrent layers, complemented by loss reweighting, prediction refinement, and early stopping, for robust LID in low-resource and code-mixed settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Identification</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Code-Mixing</kwd>
        <kwd>GRU</kwd>
        <kwd>mBERT</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid proliferation of multilingual and code-mixed content on social media has amplified the need for robust word-level language identification (LID). This need is addressed by the FIRE 2025 shared task, which provides a benchmark dataset covering five Dravidian languages alongside English, annotated for word-level LID.</p>
      <p>In this work, we present our participation in the FIRE 2025 shared task [9]. We propose a hybrid architecture combining mBERT (bert-base-multilingual-cased) and a GRU (Gated Recurrent Unit), pairing transformer-based contextual embeddings with lightweight recurrent sequence modeling. To further enhance robustness, we incorporate strategies for handling class imbalance and apply prediction refinement to ensure consistent labeling across code-mixed sequences.</p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work; Section 3 describes
the dataset; Section 4 presents the proposed methodology; Section 5 reports results and analysis; and
Section 6 concludes with key findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Over the decades, LID has progressed from rule-based statistical systems to modern neural and
transformer models, particularly to handle code-mixed and low-resource language scenarios.</p>
      <p>Initial LID systems predominantly used rule-based techniques, such as character-level n-gram models, which were quite effective for monolingual environments [10]. However, these methods struggled with code-mixed or transliterated text common in social media. Statistical models like Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) offered improvements for high-resource languages [11], but their performance degraded significantly in noisy, short, social media style code-mixed scenarios.</p>
      <p>Neural methods marked a step change. LSTM-based sub-word LID models for Indian languages achieved robust performance on short sequences [12]. Bidirectional LSTMs (BiLSTMs) for Hindi-English code-mixed texts yielded notable improvements in handling noise and brevity in social media content [13].</p>
      <p>The transformer era brought significant momentum to multilingual LID. Multilingual BERT (mBERT)
[14] supports 104 languages, while XLM [15] improved cross-lingual learning for over 100 languages.
India-focused variants like IndicBERT [16] and MuRIL [17] are tailored for Indian linguistic phenomena
such as code-mixing and transliteration, improving performance in low-resource settings.</p>
      <p>For structured sequence prediction, the BiLSTM-CRF architecture [18] has been widely deployed
across sequence labeling tasks, establishing a precedent for combining contextual encodings with
structure-aware decoding models. Despite these advancements, research focused specifically on
low-resource Dravidian languages remains limited. The CoLI-Kanglish shared task (ICON 2022) provided a benchmark for Kannada-English word-level LID. BERT-based models achieved an 86% weighted F1-score [19], while an overview report noted the highest macro F1 of around 0.62 [20]. Earlier work applied traditional classifiers such as KNN and SVM, reaching F1-scores of around 0.58 [21]. Efforts expanding to multiple Dravidian languages include a Kannada-English dataset with benchmarked ML, DL, and transfer learning models, in which CoLI-ngrams achieved a macro F1 of 0.64 [22].</p>
      <p>Recent work explored prompt engineering using GPT-3.5 Turbo for word-level LID in Dravidian languages, noting higher accuracy for Kannada than for Tamil and demonstrating the potential of large language model-based prompting for low-resource code-mixed LID [23]. Research in very low-resource and
code-mixed LID often hinges on clever use of minimal data. Mandal and Sanand [24] proposed three
strategies for code-mixed LID using minimal resources, achieving ensemble accuracy of approximately
92.6%.</p>
      <p>In summary, while rule-based, neural, and transformer approaches have advanced LID significantly,
their adaptation to code-mixed and low-resource Dravidian scenarios remains incomplete. Datasets
for Tulu in particular are sparse, and class imbalance continues to degrade system performance. Few
studies have combined hybrid architectures, imbalance-aware learning, and sequence refinement.</p>
      <p>Our work addresses these gaps by introducing a hybrid mBERT+GRU model, incorporating Focal Loss, oversampling, and prediction cleaning for robust word-level LID across five low-resource Dravidian languages. In doing so, we build on these prior strengths while advancing resilience in challenging multilingual contexts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this study is released by the organizers of the FIRE 2025 Word-Level LID shared task.
It consists of token-level annotated social media text across five low-resource Dravidian languages:
Kannada, Malayalam, Tamil, Telugu, and Tulu, in addition to English.</p>
      <p>Each language dataset varies in size and number of tag types, capturing a wide range of linguistic
phenomena. For example, the Kannada dataset includes tags such as kn, en, name, and loc, while
the Malayalam dataset also introduces additional tags like num and plc. Similarly, the Tamil dataset
contains unique composite tags like tmen to capture mixed Tamil-English tokens, while the Tulu dataset
contains cross-lingual overlaps with Kannada tokens.</p>
      <p>Table 1 summarizes the number of training and validation sentences along with the tag types defined
for each language in the FIRE 2025 LID dataset. Table 2 provides a detailed breakdown of tag frequency
distributions across these splits, highlighting strong class imbalances across languages; for example,
English tokens dominate in Kannada and Tulu, whereas native tokens are more prevalent in Tamil and
Malayalam. Such disparities emphasize the necessity of strategies like loss reweighting and oversampling
in our modeling pipeline. Finally, Table 3 presents representative example sentences from different
Dravidian languages in the dataset, showcasing the complexity of multilingual, code-mixed text and
further motivating the development of robust and adaptable models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this section, we describe in detail the methodology followed in our work on word-level LID for
Dravidian code-mixed texts. The pipeline is designed to handle the complex linguistic nature of
code-switching, transliteration, and multilingual social media data. It consists of three main components: (i)
preprocessing of raw data, (ii) model architecture combining mBERT and GRU, and (iii) training setup
and optimization strategies. A stepwise overview of the architecture is summarized in Algorithm 1.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>Our preprocessing pipeline begins with the removal of unwanted characters, such as punctuation marks,
special symbols, hashtags, and user mentions. While these features often serve as pragmatic markers
in social media conversations, they do not directly contribute to LID at the token level. URLs are also
stripped, as they are language-agnostic and introduce unnecessary noise into the embeddings.</p>
        <p>Emojis, which are pervasive in online communication, are also removed. Unlike many NLP tasks where numbers can be discarded, numeric tokens are retained in our case because the dataset explicitly contains tags such as num, marking them as meaningful entities. This decision is essential to ensure
consistency between the preprocessing pipeline and the annotation scheme.</p>
        <p>Finally, redundant whitespace is normalized, ensuring uniform tokenization across sentences. The
preprocessed data therefore represents a corpus that preserves meaningful linguistic and semantic
markers while filtering noise irrelevant to the identification of language tags.</p>
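        <p>To make the pipeline concrete, the sketch below shows one possible realization of these cleaning rules in Python; the regular expressions and the helper name clean_sentence are illustrative assumptions rather than the exact implementation used.</p>
        <preformat>
# Minimal preprocessing sketch (illustrative; regex patterns are assumptions).
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_sentence(tokens, tags):
    """Drop URLs, mentions, hashtags, emojis and bare punctuation; keep numeric tokens."""
    cleaned_tokens, cleaned_tags = [], []
    for tok, tag in zip(tokens, tags):
        if URL_RE.fullmatch(tok) or tok.startswith(("@", "#")):
            continue
        tok = EMOJI_RE.sub("", tok)
        tok = re.sub(r"\s+", " ", tok).strip()
        # Retain numerals (tagged 'num' in the data); drop tokens that are only punctuation.
        if not tok or (not tok.isdigit() and re.fullmatch(r"\W+", tok)):
            continue
        cleaned_tokens.append(tok)
        cleaned_tags.append(tag)
    return cleaned_tokens, cleaned_tags
        </preformat>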
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>The cornerstone of our approach is a hybrid architecture that combines the strengths of transformer-based encoders with recurrent sequence learners. Specifically, we employ the mBERT model as the base encoder and a GRU layer for sequential modeling.</p>
        <sec id="sec-4-2-0">
          <title>4.2.1. Multilingual BERT Encoder</title>
          <p>mBERT (bert-base-multilingual-cased) is a transformer-based model pre-trained on 104 languages using masked language modeling and next sentence prediction objectives. Its contextualized embeddings capture both inter-lingual and intra-lingual nuances, making it particularly suitable for multilingual and code-mixed scenarios. Each tokenized input sentence X = (x_1, x_2, ..., x_n) is passed through the mBERT encoder, producing contextual embeddings E = (e_1, e_2, ..., e_n), where each e_i captures bidirectional context around token x_i.</p>
        </sec>
        <sec id="sec-4-2-1">
          <title>4.2.2. GRU Sequence Learner</title>
          <p>Although transformers excel at capturing global context, they often underperform in modeling fine-grained sequential dependencies over long sequences, especially in noisy and code-mixed settings. To complement this, we integrate a GRU layer on top of the mBERT embeddings. The GRU is a lightweight recurrent neural network variant that efficiently models temporal dependencies through its gating mechanisms. The GRU processes the embedding sequence E, producing hidden states H = (h_1, h_2, ..., h_n) that capture sequential context in a manner complementary to the transformer’s global attention.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.3. Classification Layer</title>
          <p>The final hidden states H are passed through a fully connected layer, followed by a softmax function to produce probability distributions over the language tags for each token. Formally, ŷ_i = softmax(W · h_i + b), where W and b are trainable parameters of the classification layer. This ensures token-level predictions that align with the shared task’s requirements.</p>
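          <p>A minimal PyTorch sketch of this encoder-GRU-classifier stack is given below; the class name MBertGruTagger and the GRU hidden size are assumptions made for illustration, not the exact configuration used in our experiments.</p>
          <preformat>
# Sketch of the mBERT + GRU token classifier described above (PyTorch; sizes are assumptions).
import torch.nn as nn
from transformers import AutoModel

class MBertGruTagger(nn.Module):
    def __init__(self, num_tags, gru_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.gru = nn.GRU(self.encoder.config.hidden_size, gru_hidden, batch_first=True)
        self.classifier = nn.Linear(gru_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings E = mBERT(X), one vector per wordpiece.
        embeddings = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        # Sequential hidden states H = GRU(E).
        hidden, _ = self.gru(embeddings)
        # Token-level logits; softmax over tags yields the predicted label distribution.
        return self.classifier(hidden)
          </preformat>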
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.4. Handling Data Imbalance and Cleaning Predictions</title>
          <p>Code-mixed datasets suffer from high class imbalance, with English and dominant native languages
heavily outnumbering minority tags such as named entities, numerals, or rare transliterated words. To
mitigate this, we use Focal Loss, which dynamically down-weights easy-to-classify samples and places
greater emphasis on harder, minority-class tokens. Additionally, oversampling of minority classes is
performed during training to artificially balance the dataset.</p>
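          <p>The sketch below shows a standard formulation of Focal Loss for token classification; the focusing parameter value (gamma = 2) is an assumption made for illustration.</p>
          <preformat>
# Focal loss sketch for token-level classification (gamma value is an assumption).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    """logits: (batch, seq_len, num_tags); targets: (batch, seq_len), -100 marks ignored positions."""
    num_tags = logits.size(-1)
    ce = F.cross_entropy(logits.view(-1, num_tags), targets.view(-1),
                         reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                      # probability assigned to the true class
    loss = ((1.0 - pt) ** gamma) * ce        # down-weight easy, well-classified tokens
    mask = targets.view(-1) != ignore_index
    return loss[mask].mean()
          </preformat>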
          <p>Finally, a post-processing step called prediction cleaning is applied. This involves filtering out
irrelevant labels such as O (outside any language span) or sym (symbols), which occasionally appear in
predictions despite not being semantically meaningful for the downstream evaluation.</p>
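          <p>One way to realize this cleaning step is sketched below: tags outside the evaluation label set are replaced by the next most probable valid tag for that token. The exact fallback strategy is an assumption of this sketch.</p>
          <preformat>
# Prediction-cleaning sketch (fallback strategy is an assumption).
import torch

IRRELEVANT = {"O", "sym"}

def clean_predictions(logits, id2tag):
    """logits: (seq_len, num_tags) for one sentence; returns cleaned tag strings per token."""
    ranked = torch.argsort(logits, dim=-1, descending=True)   # tags ranked by score per token
    cleaned = []
    for token_ranking in ranked:
        for tag_id in token_ranking.tolist():
            if id2tag[tag_id] not in IRRELEVANT:
                cleaned.append(id2tag[tag_id])
                break
    return cleaned
          </preformat>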
          <p>The complete stepwise procedure of the model pipeline is presented in Algorithm 1.</p>
          <preformat>
Algorithm 1 Proposed mBERT+GRU Framework for Word-Level LID
1: Input: Tokenized code-mixed sentence X = (x_1, x_2, ..., x_n)
2: Preprocessing: Clean text (remove URLs, hashtags, punctuation, mentions, emojis; retain numbers)
3: Obtain contextualized embeddings E = mBERT(X)
4: Pass embeddings through GRU layer: H = GRU(E)
5: Apply fully connected + softmax: ŷ_i = softmax(W · h_i + b)
6: Compute loss using Focal Loss with dynamic class weighting
7: Oversample minority classes during training
8: Perform prediction cleaning to remove irrelevant tags (O, sym)
9: Output: Predicted sequence labels Ŷ = (ŷ_1, ŷ_2, ..., ŷ_n)
          </preformat>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training Setup</title>
        <p>The model is trained end-to-end with token-level supervision from the FIRE 2025 LID shared task
dataset. To optimize performance, we employ several training strategies, which we describe below.</p>
        <p>We use the Adam optimizer with decoupled weight decay (AdamW), which has become the de-facto
standard for transformer-based fine-tuning. The learning rate is initialized at 2 ×10−5, a value empirically
tuned for stability, and weight decay is set at 0.01 to prevent overfitting. A batch size of 16 is adopted,
balancing computational efficiency with gradient stability.</p>
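        <p>A minimal configuration sketch with these hyper-parameters is shown below; model and train_dataset are placeholders for the mBERT+GRU tagger and the token-level training data, respectively.</p>
        <preformat>
# Optimizer and data-loader configuration matching the reported hyper-parameters.
from torch.optim import AdamW
from torch.utils.data import DataLoader

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)       # model: mBERT+GRU tagger (placeholder)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)   # train_dataset: placeholder dataset
        </preformat>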
        <p>The model is trained for a maximum of 150 epochs. However, to mitigate overfitting and reduce
unnecessary computation, we employ an early stopping criterion. Training is terminated once the
validation loss plateaus for 3 consecutive epochs, ensuring that the model retains generalizable
performance without memorizing training data.</p>
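        <p>The early stopping criterion can be implemented as in the skeleton below; train_one_epoch and evaluate are assumed helper functions, and saving the best checkpoint is an illustrative choice.</p>
        <preformat>
# Early-stopping skeleton (patience of 3 epochs on validation loss; helpers are assumed).
import torch

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(150):                                    # maximum of 150 epochs
    train_one_epoch(model, train_loader, optimizer)         # assumed helper
    val_loss = evaluate(model, val_loader)                  # assumed helper returning mean validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")     # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                           # validation loss plateaued for 3 epochs
        </preformat>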
        <p>Given the nature of social media text, which can range from short phrases to longer posts, we set
the maximum sequence length to 512 tokens. This value ensures coverage for most sentences without
truncation. The WordPiece tokenizer associated with mBERT is used to handle out-of-vocabulary
tokens, ensuring robust subword segmentation across languages.</p>
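        <p>The sketch below illustrates WordPiece tokenization with a maximum length of 512 and alignment of word-level tags to sub-word positions; labelling only the first sub-word of each word (and ignoring the rest) is an assumption of this sketch.</p>
        <preformat>
# Tokenization and word-to-subword label alignment sketch (first-subword labelling is an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode(words, tags, tag2id, max_len=512):
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    max_length=max_len, padding="max_length", return_tensors="pt")
    labels, prev = [], None
    for word_id in enc.word_ids(batch_index=0):
        if word_id is None or word_id == prev:
            labels.append(-100)                  # ignore special tokens and sub-word continuations
        else:
            labels.append(tag2id[tags[word_id]])
        prev = word_id
    return enc, labels
        </preformat>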
        <p>The choice of Focal Loss is crucial in addressing dataset imbalance. Unlike traditional cross-entropy,
which treats all tokens equally, Focal Loss modulates the contribution of easy versus hard samples,
with a focusing parameter γ that down-weights well-classified examples. This ensures that rare labels such as numerals or location names are not overshadowed by dominant classes. Oversampling further complements this by artificially replicating underrepresented class instances during training, balancing
the gradient contributions across labels.</p>
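        <p>A simple way to realize this oversampling is sketched below; the rarity threshold and replication factor are illustrative assumptions.</p>
        <preformat>
# Oversampling sketch: replicate sentences containing rare tags (threshold and factor are assumptions).
from collections import Counter

def oversample(sentences, tag_sequences, rare_threshold=0.01, factor=3):
    """Duplicate training sentences that contain at least one under-represented tag."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    rare = {tag for tag, c in counts.items() if c / total < rare_threshold}
    out_sents, out_tags = [], []
    for sent, tags in zip(sentences, tag_sequences):
        copies = factor if any(t in rare for t in tags) else 1
        out_sents.extend([sent] * copies)
        out_tags.extend([tags] * copies)
    return out_sents, out_tags
        </preformat>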
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluate the performance of our proposed mBERT+GRU framework on the FIRE 2025 LID shared
task datasets. The experiments are carried out on the validation datasets provided by the organizers,
where we compute detailed classification reports (per-class Precision, Recall, F1-Score, Accuracy, and Support). These are reported for each of the five Dravidian languages separately. The final test set results are obtained from the official leaderboard and are summarized at the end of this section.</p>
      <p>We observe that performance is consistently strong for high-frequency classes such as ENGLISH
and the major Dravidian language tag in each dataset, whereas minority categories (e.g., Location,
Number, Other, Place) tend to have lower F1-scores due to data imbalance.</p>
      <p>On the Kannada validation set (Table 4), the model achieves an overall accuracy of 0.9079. It demonstrates strong recognition of English (F1 = 0.97) and Kannada (F1 = 0.89), although categories such as “other” and “name” are comparatively weaker. For Tulu (Table 5), the accuracy was 0.8177, with high scores for English (F1 = 0.90) and Tulu (F1 = 0.87), while mixed-language tokens remain particularly challenging (F1 = 0.48).</p>
      <p>In the case of Telugu (Table 6), the model obtains an overall accuracy of 0.7948. Performance is excellent for English (F1 = 0.90) and mixed tokens (F1 = 0.98), but categories with sparse representation such as “number” (F1 = 0.33) and “other” (F1 = 0.42) prove difficult to classify reliably. Similarly, the Tamil dataset (Table 7) yields an accuracy of 0.8989, where Tamil (F1 = 0.93) and English (F1 = 0.93) are predicted with high consistency, whereas less frequent categories like “location” (F1 = 0.65) show reduced performance.</p>
      <p>Finally, the Malayalam dataset (Table 8) reached an overall accuracy of 0.8705. The model performs particularly well on Malayalam (F1 = 0.94) and English (F1 = 0.90), but struggles with underrepresented categories such as “place” (F1 = 0.00) and “mixed” tokens (F1 = 0.35). Taken together, these results
highlight the robustness of the approach in handling high-resource categories, while underscoring
persistent challenges in dealing with rare or highly imbalanced classes.</p>
      <sec id="sec-5-1">
        <title>5.1. Leaderboard Results</title>
        <p>The final system submissions were evaluated on the official test sets, and the scores were reported on
the shared task leaderboard. The results across the five languages are summarized in Table 9. Among
the languages, Kannada achieved the highest score of 0.94, followed by Tamil with 0.89, and Telugu with
0.86. Tulu and Malayalam obtained scores of 0.85 and 0.83, respectively. These leaderboard outcomes
are consistent with the validation results, reflecting strong performance in high-resource languages
such as Kannada and Tamil, while relatively lower but competitive results were observed in Malayalam
and Tulu.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Error Analysis</title>
        <p>Despite strong overall performance, the model shows weaknesses in handling minority categories such
as place, number, and other, where limited training instances and class imbalance reduce reliability.
Malayalam and Tulu exhibit comparatively lower scores, largely due to sparse data and script overlap
leading to higher confusion among closely related tokens. The GRU layer, while effective for short
dependencies, struggles with long or abrupt language switches typical of social media text. Moreover,
mBERT’s general-domain pretraining limits its ability to fully capture domain-specific
transliterations and informal expressions, suggesting that domain-adaptive fine-tuning and richer cross-lingual
representations could further enhance robustness.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented a hybrid mBERT+GRU model for word-level LID in code-mixed Dravidian
social media text, addressing challenges of transliteration, noisy input, and class imbalance through
focal loss, oversampling, and prediction refinement. Our system achieved strong leaderboard results,
peaking at 0.94 accuracy for Kannada, alongside competitive scores for Tamil (0.89), Telugu (0.86), Tulu
(0.85), and Malayalam (0.83), demonstrating the effectiveness of combining multilingual transformer
embeddings with lightweight sequential modeling. While the approach proved robust across languages,
relatively lower performance in Malayalam and Tulu highlights the limitations posed by data scarcity
and script overlap. Future work should explore cross-lingual pretraining with domain-specific corpora,
advanced sequence encoders such as graph or attention-based architectures, and transfer learning
across related Dravidian languages to enhance generalization. Further emphasis should also be placed
on model-agnostic post-processing and deployment-oriented strategies for reliable, real-time LID in
multilingual user-generated content.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and spelling checking and for paraphrasing and rewording. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Automatic identification of multilingual documents</article-title>
          ,
          <source>in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>658</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Chanda, K. Tewari, A. Mukherjee, S. Pal, Leveraging chatgpt and xlm-roberta for sarcasm detection in dravidian code-mixed languages, in: Proceedings of FIRE (Working Notes), Forum for Information Retrieval Evaluation, India, 2024. URL: https://ceur-ws.org/Vol-4054/T4-14.pdf.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. L. Shashirekha, G. Sidorov, A. Gelbukh, Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, 2022, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Prathiba, R. Kannan, Language identification in code-mixed data: Challenges and approaches, Journal of Intelligent Systems (2020).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus creation for sentiment analysis in code-mixed tulu text, in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (SIGUL), European Language Resources Association (ELRA), Marseille, France, 2022, pp. 33-40.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Hegde, F. Balouchzahi, S. Coelho, S. H L, H. A. Nayel, S. Butt, Coli@fire2023: Findings of word-level language identification in code-mixed tulu text, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '23, Association for Computing Machinery, New York, NY, USA, 2024, pp. 25-26. URL: https://doi.org/10.1145/3632754.3633075. doi:10.1145/3632754.3633075.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Hegde, F. Balouchzahi, S. Coelho, H. L. Shashirekha, H. A. Nayel, S. Butt, Overview of coli-tunglish: Word-level language identification in code-mixed tulu text at fire 2023, in: Forum for Information Retrieval Evaluation (FIRE 2023) Working Notes, 2023, pp. 179-190.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Hegde, F. Balouchzahi, S. Butt, S. Coelho, K. G, H. S. Kumar, S. D, S. H. L., A. Agrawal, Coli@fire2024: Findings of word-level code-mixed language identification in dravidian languages, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '24, Association for Computing Machinery, New York, NY, USA, 2025, pp. 7-10. URL: https://doi.org/10.1145/3734947.3735663. doi:10.1145/3734947.3735663.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Hegde, F. Balouchzahi, S. Butt, S. Coelho, S. Hosahalli Lakshmaiah, A. Agrawal, Overview of CoLI-Dravidian 2025: Word-level Code-Mixed Language Identification in Dravidian Languages, in: Forum for Information Retrieval Evaluation (FIRE 2025), 2025.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, R. L. Mercer, A statistical approach to language identification, Computational Linguistics 18 (1992) 611-620.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] B. Hughes, T. Baldwin, M. Lui, Re-examining language identification, Journal of Computational Linguistics 32 (2006) 45-60.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Joshi, S. Negi, N. Goel, L. Singh, M. Shrivastava, Towards sub-word level language identification for indian languages, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1-10.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Zhang, Z. Yang, J. Qi, Deep learning for code-mixed language identification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 2246-2255.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Conneau, U. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, A. Joulin, M. Koepke, Cross-lingual language model pretraining, in: Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 7057-7067.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] D. Kakwani, A. Kunchukuttan, S. Gella, P. Bhattacharyya, M. Gokhale, A. Agarwal, R. Bhat, N. Kedia, A. Sharma, M. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages, in: Proceedings of the 12th Language Resources and Evaluation Conference (LREC), 2020, pp. 1490-1499.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Khanuja, A. Kunchukuttan, S. Kumar, M. Singh, S. Prasad, S. Gella, P. Bhattacharyya, A. Kumar, MuRIL: Multilingual representations for indian languages, arXiv preprint arXiv:2103.10730 (2021).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] P. Deka, N. J. Kalita, S. K. Sarma, Bert-based language identification in code-mix kannada-english text at the coli-kanglish shared task, in: ICON 2022 Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, ACL, 2022, pp. 12-17.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] F. Balouchzahi, S. Butt, A. Hegde, et al., Overview of coli-kanglish: Word level language identification in code-mixed kannada-english texts at icon 2022, in: ICON 2022 Shared Task on Word Level Language Identification, ACL, 2022, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Shahiki Tash, Z. Ahani, A. Tonja, et al., Word level language identification in code-mixed kannada-english texts using traditional machine learning algorithms, in: ICON 2022 Shared Task on Word Level Language Identification, ACL, 2022, pp. 25-28.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] H. Shashirekha, F. Balouchzahi, M. Anusha, et al., Coli-machine learning approaches for code-mixed language identification at the word level in kannada-english texts, in: CoLI shared task workshop, 2022.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Deroy, S. Maity, Prompt engineering using gpt for word-level code-mixed language identification in low-resource dravidian languages, arXiv preprint arXiv:2411.04025 (2024).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. Mandal, S. Sanand, Strategies for language identification in code-mixed low resource languages, arXiv preprint arXiv:1810.07156 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>