=Paper=
{{Paper
|id=Vol-3681/T4-3
|storemode=property
|title=Word-Level Language Identification of Code-Mixed Tulu-English Data
|pdfUrl=https://ceur-ws.org/Vol-3681/T4-3.pdf
|volume=Vol-3681
|authors=Poorvi Shetty
|dblpUrl=https://dblp.org/rec/conf/fire/Shetty23
}}
==Word-Level Language Identification of Code-Mixed Tulu-English Data==
Poorvi Shetty (poorvishetty1202@gmail.com, ORCID 0009-0004-2243-5176), JSS Science and Technology University, Mysuru, India

Forum for Information Retrieval Evaluation, December 15-18, 2023, India

'''Abstract.''' Code-mixing, the amalgamation of languages in speech, is particularly common in India and generates informal, multilingual content on social media. Analyzing this content for linguistic tasks, notably Language Identification, is crucial. This study focuses on word-level Language Identification of Tulu-English code-mixed words, using diverse embeddings and classifiers. Results show promising accuracy, affirming the viability of the proposed approach, with the best system achieving a weighted average F1 score of 0.799. The study advances multilingual processing by providing insights into effective language identification in complex linguistic scenarios, with broader implications for understanding communication in multilingual societies. The proposed system ranked 3rd in the shared task.

'''Keywords.''' language identification, code-mixing, multilingual communication, word embeddings, classifiers, Tulu-English, code-mixed words, multilingual processing

===1. Introduction===

Language Identification (LID) in Natural Language Processing (NLP) refers to the process of determining the natural language in which a given piece of text is written. It involves analyzing linguistic features and patterns within the text to accurately determine the language it belongs to.

Tulu, along with the state language Kannada, is part of the cultural and linguistic landscape of Karnataka, India. Those proficient in Tulu, known as Tuluvas, commonly exhibit fluency in both Tulu and Kannada, encompassing reading, writing, and verbal communication. Moreover, the Tulu language incorporates numerous lexical elements from Kannada. Additionally, the use of English (Roman) characters is prominent among many Tulu speakers, particularly those active on social media platforms. Notably, the commentary contributed by Tulu users in response to Tulu-focused content on social media often amalgamates Tulu, Kannada, and English. This intricate linguistic phenomenon has given rise to a valuable collection of trilingual code-mixed data, an area that has remained relatively unexplored in research [1, 2].

This paper addresses word-level LID in code-mixed Tulu-English (Tu-En) text. These textual instances were sourced from commentary sections of Tulu YouTube videos, facilitating the construction of the Code-mixed Tulu-English Language Identification (CoLI-Tunglish) dataset. This task was part of the Word-level Language Identification in Code-mixed Tulu Texts (CoLI-Tunglish) shared task [3]. A similar shared task, CoLI-Kanglish (Kannada-English), was conducted the previous year [4].

===2. Related Work===

In addressing the challenge of code-mixed language identification, several researchers have contributed innovative approaches.
Gundapu and Mamidi [5] introduced Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) for English-Telugu code-mixed data, ultimately finding success with CRFs. Sabty et al. [6] focused on Arabic-English (AR-EN) text and found Segmental Recurrent Neural Networks (SegRNN) to excel in intra-word language identification. Mandal et al. [7] presented supervised learning methods for Bengali-English code-mixed data, utilizing character-based and phonetic-based encodings in deep Long Short-Term Memory (LSTM) models.

Ojo et al. [8] delved into code-mixed Kannada and English (Kn-En) texts, achieving high accuracy with their CK-Keras model, which incorporates pre-trained Word2Vec embeddings. Tonja et al. [9] introduced a Transformer-based model for word-level language identification in code-mixed Kannada-English texts. Uchoi and Kaur [10] combined language-specific morphological dictionary-based approaches with character n-gram language models to achieve precise word classification in English-Punjabi code-mixed sentences.

Researchers have also developed versatile approaches that span multiple languages and contexts. Chittaranjan et al. [11] presented a CRF-based system that incorporates lexical, contextual, character n-gram, and special-character features, applicable to multiple languages. Gella et al. [12] tackled language identification in concise code-mixed documents across 28 languages. Sarma et al. [13] addressed word-level language identification in a multilingual context, proposing and evaluating strategies for low-resource languages such as Assamese, Bengali, Hindi, and English.

Studies have also explored the effectiveness of BERT and Transformer models in code-mixed language identification. Hidayatullah et al. [14] demonstrated the superiority of fine-tuned IndoBERTweet models, utilizing sub-word language representations for accurate language identification. Shashirekha et al. [15] created the CoLI-Kenglish dataset and employed various models, with the CoLI-ngrams model standing out as superior. Vajrobol [16] fine-tuned the DistilBERT model (the Distilka model) to discern the language of individual words within code-mixed Kannada-English texts.

===3. Existing Dataset===

The Code-mixed Tulu-English Language Identification (CoLI-Tunglish) dataset [2] consists of Tulu, Kannada, and English words in Roman script, grouped into seven major categories: "Tulu", "Kannada", "English", "Mixed-language", "Name", "Location", and "Other". These texts are extracted from Tulu YouTube video comments, a rich source of trilingual code-mixed data. Table 1 shows the class-wise distribution within the training set.

{| class="wikitable"
|+ Table 1: Class-wise distribution within the training set of the dataset provided by Hegde et al. [2]
! Category !! Count
|-
| Name || 8647
|-
| Location || 5499
|-
| English || 2068
|-
| Tulu || 1104
|-
| Kannada || 506
|-
| Mixed || 403
|-
| Other || 369
|}

===4. Data Preprocessing===

Preprocessing consisted of converting the provided text to lowercase and representing it as strings for further analysis. The following embedding techniques were employed and tested to capture the inherent linguistic characteristics of the text data.

Bag-of-Words (BoW) is a basic and widely used text representation technique in NLP. It treats each document (or piece of text) as a "bag" of individual words, disregarding their order and structure. The basic idea is to create a vocabulary of all unique words in the entire corpus (collection of documents); for each document, a vector is created in which each dimension corresponds to a word from the vocabulary and its value is the frequency of that word in the document. BoW is simple and efficient but does not capture word order or context.
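As a minimal illustration of this representation (a sketch with invented toy tokens, not the code or data used in this study), scikit-learn's CountVectorizer builds exactly such count vectors:

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer

# Toy Romanized tokens standing in for dataset entries (invented for
# illustration, not drawn from the actual CoLI-Tunglish data).
words = ["super", "mast", "video", "super"]

# Build a vocabulary of unique words over the corpus and encode each
# document as a vector of word counts.  In the word-level LID setting
# each "document" is a single word, so its BoW vector reduces to a
# one-hot indicator of that word.
bow = CountVectorizer(lowercase=True)
X = bow.fit_transform(words)

print(bow.get_feature_names_out())  # learned vocabulary
print(X.toarray())                  # one count vector per input word
</syntaxhighlight>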
Character n-grams are a more fine-grained technique that represents text by breaking it into contiguous sequences of characters rather than words; an n-gram is a contiguous sequence of n characters in a string. The technique was trialled across varying n-gram ranges, specifically (1,2), (1,3), and (1,4), i.e., considering all character sequences with lengths from 1 to 2, 1 to 3, and 1 to 4 characters, respectively. Character n-grams capture subword information.

===5. Classifiers===

A comprehensive array of models was applied to the task. The Scikit-Learn library was employed for model implementation, with default parameters. This diverse set of models aimed to explore a wide spectrum of possibilities and capture nuanced patterns within the code-mixed data. The models used are described below; an illustrative sketch of the ensemble configurations follows the list.

* '''Random Forest''': an ensemble learning method that builds a forest of decision trees and combines their predictions to improve accuracy and reduce overfitting in classification and regression tasks.
* '''Multinomial Naive Bayes''': a classification algorithm commonly used for text and document classification. It is based on Bayes' theorem and assumes that features are conditionally independent.
* '''Logistic Regression''': a simple linear classification algorithm for binary classification that models the probability of a binary outcome.
* '''Linear Support Vector Classifier (LinearSVC)''': a linear model for binary classification that aims to find the hyperplane that best separates the data into two classes.
* '''Decision Tree''': a tree-like model that makes decisions by recursively splitting the dataset on the most significant feature at each node.
* '''K-Nearest Neighbors (KNN)''': a non-parametric, instance-based algorithm for classification and regression that labels data points by the majority class of their k nearest neighbors.
* '''AdaBoost''': an ensemble technique that combines multiple weak learners (usually decision trees) into a strong classifier.
* '''One-vs-Rest''': a multi-class strategy in which a separate binary Logistic Regression classifier is trained for each class.
* '''Gradient Boosting''': an ensemble method that builds an additive model by training weak learners sequentially, each new learner correcting the errors of its predecessor.
* '''Stacking''': combines multiple base models (LinearSVC, Random Forest, KNN) with a meta-learner (Logistic Regression) to improve overall performance.
* '''Voting''': combines the predictions of multiple classifiers (Logistic Regression, Random Forest, and LinearSVC) by majority or weighted voting.
* '''Bagging (Bootstrap Aggregating)''': trains multiple instances of the same base model (KNN) on bootstrapped samples of the data and combines their predictions.
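The ensemble configurations named above can be assembled directly from Scikit-Learn estimators. The sketch below mirrors the base learners and meta-learners stated in the list, but the constructor details (e.g., hard vs. weighted voting) are assumptions, since the paper does not specify them beyond "default parameters":

<syntaxhighlight lang="python">
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Stacking: LinearSVC, Random Forest and KNN as base learners,
# Logistic Regression as the meta-learner.
stacking = StackingClassifier(
    estimators=[("svc", LinearSVC()),
                ("rf", RandomForestClassifier()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)

# Voting over LR, RF and LinearSVC.  Hard (majority) voting is assumed
# here because LinearSVC does not expose class probabilities.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier()),
                ("svc", LinearSVC())],
    voting="hard",
)

# Bagging: a bootstrapped ensemble of KNN classifiers.
bagging = BaggingClassifier(KNeighborsClassifier())
</syntaxhighlight>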
===6. Methodology===

After data preprocessing, each of the models described in the previous section was trained separately with each of the word embeddings discussed: Bag of Words (BoW) and character n-grams with varying n-gram ranges. Table 2 displays the weighted average F1 scores of the resulting embedding-classifier combinations on the development set. The weighted average F1 score is well suited to evaluating multi-class classification models because it accounts for class imbalance, provides a comprehensive measure of overall performance across all classes, and reflects the relative importance of individual classes. By combining precision and recall into a single balanced figure, it is a valuable tool for model selection and for evaluating performance in real-world applications.

===7. Results===

Out of all the configurations, a CountVectorizer with character n-grams in the range (1,4), coupled with a LinearSVC classifier, was the most effective for the language identification task. This combination adeptly captures linguistic nuances and establishes clear decision boundaries, showing superior accuracy and precision in distinguishing languages.

{| class="wikitable"
|+ Table 2: Weighted average F1 score of models on the development set
! Model !! BoW !! (1,2) n-grams !! (1,3) n-grams !! (1,4) n-grams
|-
| Multinomial NB || 0.60 || 0.66 || 0.74 || 0.76
|-
| Random Forest || 0.73 || 0.85 || 0.86 || 0.86
|-
| Logistic Regression || 0.61 || 0.78 || 0.84 || 0.85
|-
| Linear SVC || 0.73 || 0.77 || 0.84 || 0.87
|-
| Decision Tree || 0.73 || 0.83 || 0.82 || 0.82
|-
| KNN || 0.63 || 0.81 || 0.81 || 0.80
|-
| AdaBoost || 0.40 || 0.46 || 0.51 || 0.51
|-
| One vs Rest || 0.59 || 0.77 || 0.84 || 0.85
|-
| Gradient Boost || 0.53 || 0.75 || 0.76 || 0.75
|-
| Stacking || 0.73 || 0.86 || 0.83 || 0.86
|-
| Voting || 0.72 || 0.85 || 0.85 || 0.85
|-
| Bagging || 0.63 || 0.82 || 0.81 || 0.81
|}

On the development set, this model achieves a weighted average F1 score of 0.87. On the test set, the same setup achieved a weighted average F1 score of 0.799, the third-best score in the CoLI-Tunglish shared task.
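For concreteness, the best-performing configuration above (character 1-4 grams with a LinearSVC) can be expressed as a Scikit-Learn pipeline. This is a minimal sketch over invented toy data; the variable contents stand in for the CoLI-Tunglish splits and are not the system's actual code or data:

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny invented stand-in for the CoLI-Tunglish training/development
# splits (illustrative words and labels only).
train_words  = ["enchina", "super", "mast", "video", "mangaluru", "poorvi"]
train_labels = ["Tulu", "English", "Tulu", "English", "Location", "Name"]
dev_words    = ["encha", "nice"]
dev_labels   = ["Tulu", "English"]

# Best configuration from Table 2: character 1-4 grams + LinearSVC.
pipeline = Pipeline([
    ("vec", CountVectorizer(analyzer="char", ngram_range=(1, 4),
                            lowercase=True)),
    ("clf", LinearSVC()),
])

pipeline.fit(train_words, train_labels)
predictions = pipeline.predict(dev_words)

# Weighted average F1, the evaluation metric used in this study.
print(f1_score(dev_labels, predictions, average="weighted"))
</syntaxhighlight>

LinearSVC fits one-vs-rest linear decision boundaries over the sparse n-gram counts, which scales well to the large, sparse feature space that character 1-4 grams produce.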
===8. Conclusion===

This study addressed language identification within code-mixed Tulu-English words, a phenomenon prevalent in multilingual communication. Through the use of diverse word embeddings and classifiers, significant progress was made in meeting this challenge. Notably, character n-grams in the range 1 to 4 with a LinearSVC classifier demonstrated exceptional performance, yielding the highest weighted average F1 score among the embedding-model combinations evaluated and highlighting the critical role of appropriate feature and model selection in accurate language identification. Further work could refine the embeddings and explore ensemble strategies to improve the accuracy and resilience of code-mixed language identification systems.

===References===

[1] N. H. Hebbar, Tulu Language - Its Script and Dialects, https://www.mangaloretoday.com/opinion/Tulu-Language-Its-Script-and-Dialects.html. [Accessed 07-10-2023].

[2] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus Creation for Sentiment Analysis in Code-Mixed Tulu Text, in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, 2022, pp. 33-40.

[3] A. Hegde, F. Balouchzahi, S. Coelho, S. Hosahalli Lakshmaiah, H. A. Nayel, S. Butt, Overview of CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Texts at FIRE 2023, in: Forum for Information Retrieval Evaluation (FIRE 2023), 2023.

[4] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. Shashirekha, G. Sidorov, A. Gelbukh, Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, 2022, pp. 38-45.

[5] S. Gundapu, R. Mamidi, Word Level Language Identification in English Telugu Code Mixed Data, in: Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Association for Computational Linguistics, Hong Kong, 2018. URL: https://aclanthology.org/Y18-1021.

[6] C. Sabty, I. Mesabah, Özlem Çetinoğlu, S. Abdennadher, Language Identification of Intra-word Code-switching for Arabic-English, Array 12 (2021) 100104. URL: https://www.sciencedirect.com/science/article/pii/S2590005621000473. doi:10.1016/j.array.2021.100104.

[7] S. Mandal, S. D. Das, D. Das, Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models, 2018. URL: http://arxiv.org/abs/1803.03859. doi:10.48550/arXiv.1803.03859. arXiv:1803.03859 [cs], version 1.

[8] O. E. Ojo, A. Gelbukh, H. Calvo, A. Feldman, O. O. Adebanji, J. Armenta-Segura, Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 1-6. URL: https://aclanthology.org/2022.icon-wlli.1.

[9] A. L. Tonja, M. G. Yigezu, O. Kolesnikova, M. S. Tash, G. Sidorov, A. Gelbukh, Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts, 2022. URL: http://arxiv.org/abs/2211.14459. doi:10.48550/arXiv.2211.14459. arXiv:2211.14459 [cs].

[10] E. Uchoi, M. Kaur, Language Identification of English and Punjabi, Eur. Chem. Bull. (2023) 4119-4123. doi:10.48047/ecb/2023.12.si6.367.

[11] G. Chittaranjan, Y. Vyas, K. Bali, M. Choudhury, Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System, in: Proceedings of the First Workshop on Computational Approaches to Code Switching, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 73-79. URL: https://aclanthology.org/W14-3908. doi:10.3115/v1/W14-3908.

[12] S. Gella, K. Bali, M. Choudhury, "ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification, in: Proceedings of the 11th International Conference on Natural Language Processing, NLP Association of India, Goa, India, 2014, pp. 368-377. URL: https://aclanthology.org/W14-5151.

[13] N. Sarma, S. R. Singh, D. Goswami, Word Level Language Identification in Assamese-Bengali-Hindi-English Code-mixed Social Media Text, in: 2018 International Conference on Asian Language Processing (IALP), 2018, pp. 261-266. doi:10.1109/IALP.2018.8629104.

[14] A. F. Hidayatullah, R. A. Apong, D. T. C. Lai, A. Qazi, Corpus Creation and Language Identification for Code-mixed Indonesian-Javanese-English Tweets, PeerJ Computer Science 9 (2023). URL: https://www.readcube.com/articles/10.7717%2Fpeerj-cs.1312. doi:10.7717/peerj-cs.1312.
[15] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts, 2022. URL: http://arxiv.org/abs/2211.09847. doi:10.48550/arXiv.2211.09847. arXiv:2211.09847 [cs].

[16] V. Vajrobol, CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka model, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, Association for Computational Linguistics, IIIT Delhi, New Delhi, India, 2022, pp. 7-11. URL: https://aclanthology.org/2022.icon-wlli.2.