<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Language Identification in Code-mixed Kannada Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdollah Abadian</string-name>
          <email>abdullah.abadian@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sistan and Baluchestan</institution>
          ,
          <addr-line>Zahedan</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <abstract>
<p>As digital communication continues to expand in multilingual contexts, code-mixing has become a common phenomenon, presenting significant challenges for Language Identification (LI) at the word level. This paper explores these challenges with a focus on the interplay between Kannada and English. We utilize the CoLI-Kenglish dataset, which was meticulously constructed from comments on Kannada YouTube videos. Within the framework of the CoLI-Kenglish shared task at CoLI-Dravidian 2024, our study implements a model developed by the ABADIAN team that employs a character n-gram TF-IDF vectorization approach, enhanced by the inclusion of word length for improved representation. We evaluated various traditional Machine Learning algorithms, such as Support Vector Machines (SVM), Naïve Bayes, and Decision Trees. The SVM classifier emerged as the most effective method, attaining an F1 score of 81.2% on the test set and ranking eighth among all submissions. While the findings may not introduce novel methodologies, they contribute valuable insights into the efficacy of established techniques in the domain of code-mixed language processing.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As social media platforms become the new agora for expression, users often navigate the complexities
of their multilingual environments by employing Roman script, a choice driven by the limitations
of traditional keyboards and the desire for ease of communication [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This results in a dynamic
form of code-mixed text, where the boundaries between languages blur, creating a rich linguistic
landscape that challenges conventional language processing methodologies. The informal nature of
these interactions—filled with abbreviations, slang, and playful creativity—adds layers of complexity to
the task of language identification (LI), a critical component for various Natural Language Processing
(NLP) applications such as sentiment analysis and machine translation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Despite the rapid advancements in NLP, the study of LI in code-mixed contexts remains an
underexplored frontier, particularly for low-resource languages like Tulu and Kannada [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Traditional
approaches have predominantly focused on high-resource languages, often overlooking the unique
challenges posed by the rich linguistic diversity of India [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The absence of comprehensive
annotated datasets further complicates efforts to develop robust models capable of navigating this intricate
linguistic terrain [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To bridge this gap, our research uses the CoLI-Kenglish and CoLI-Malayalam datasets, specifically
curated for word-level LI tasks involving Kannada-English and Malayalam-English code-mixed texts,
respectively [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. By leveraging these datasets, we aim to explore innovative methodologies, including
Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL), to enhance the accuracy of
language identification in these complex environments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Our findings not only seek to advance the state of the art in NLP but also aspire to empower
the linguistic communities that thrive in this code-mixed reality. By improving the identification
of languages within these texts, we hope to contribute to a deeper understanding of multilingual
communication and its implications for technology and society.</p>
      <p>In this article, we will explore several key components that contribute to our understanding of the
subject matter. The second section will delve into related works, providing a comprehensive overview
of existing literature and studies that inform our research. Following this, the third section will detail
the datasets utilized in our analysis, highlighting their relevance and significance.</p>
<p>The fourth section describes our methodology, and the fifth section presents the results of
our research, showcasing the findings and their interpretations. Finally, the sixth section concludes
the article with a summary of our findings and suggestions for future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Chaitanya and Kumar (2020) explored LI in Hindi-English code-mixed data by generating feature vectors
using the Continuous Bag of Words (CBOW) and Skip-gram models. They trained various ML models,
including Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), Gaussian Naive
Bayes (GNB), k-Nearest Neighbor (kNN), and Adaptive Boosting (AdaBoost). Among these, the SVM
classifiers achieved the highest accuracies of 67.33% and 67.34% with the CBOW and Skip-gram models,
respectively [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Gundapu and Mamidi (2021) addressed LI on Telugu-English code-mixed text using Conditional
Random Fields (CRF) classifiers. Their approach, which considered previous, current, and next words
along with their part-of-speech (POS) tags, word length, and character n-grams (1-3), resulted in an
accuracy of 91.28% [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Mandal and Singh (2021) proposed a multichannel neural network model combining Convolutional
Neural Networks (CNN), Long Short-Term Memory (LSTM), and Bidirectional LSTM (BiLSTM)
integrated with CRF for LI in Hindi-English and Bengali-English code-mixed text. This model achieved
impressive accuracies of 93.32% for Hindi-English and 93.28% for Bengali-English data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
In their study, Thara and Poornachandran (2022) introduced a dataset for LI in code-mixed
English-Malayalam text and utilized a transformer-based model, specifically ELECTRA (Efficiently
Learning an Encoder that Classifies Token Replacements Accurately). Their fine-tuned model achieved a remarkable
macro F1 score of 0.9933, demonstrating the effectiveness of advanced transformer architectures for LI
tasks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Veena et al. (2020) investigated SVM models trained with word and character 5-gram embeddings
for LI in Hindi-English code-mixed text, achieving notable accuracy improvements over traditional
methods [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
In a significant study, Gupta et al. (2020) examined the effectiveness of traditional machine learning
techniques for LI in Hindi-English code-mixed data. Their work demonstrated the utility of classifiers
such as Support Vector Machines (SVM) and Random Forests, achieving moderate accuracy levels. They
emphasized the need for tailored algorithms that can accommodate the peculiarities of code-mixing,
including the interspersing of languages and the informal nature of social media language [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Further advancing the field, Sharma and Ghosh (2021) proposed a hybrid model combining
Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks for LI in Hindi-English
and Bengali-English code-mixed datasets. Their approach highlighted the advantages of deep
learning architectures, achieving accuracies of 92.5% and 90.8% for the respective languages. This study
underscored the potential of using neural networks to capture syntactic and contextual nuances in
code-mixed texts [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In the context of Kannada-English code-mixing, Ramesh et al. (2022) explored various deep learning
methods, including Bidirectional LSTM and attention mechanisms, to enhance LI performance. Their
findings revealed that integrating attention layers significantly improved model accuracy by enabling
the network to focus on relevant parts of the input sequence. They achieved a commendable macro F1
score, illustrating the effectiveness of deep learning in this challenging domain [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        One of the foundational works in this area is the overview of CoLI-Dravidian, which provides a
comprehensive analysis of word-level code-mixed language identification, emphasizing the challenges
present in Dravidian languages [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Another significant contribution is the CoLI-Kanglish study, which explored language identification in
Kannada-English code-mixed texts during the ICON 2022 conference. This work presents methodologies
tailored to address the intricacies of language mixing in urban multilingual contexts [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
In the context of machine learning approaches, Hosahalli Lakshmaiah et al. introduced effective
methodologies for word-level language identification in Kannada-English texts, showing the efficacy
of various algorithms in distinguishing between mixed-language usages [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Similarly, the findings
from the CoLI@FIRE2023 challenge shed light on the application of sequence labeling techniques
for identifying Tulu language components in code-mixed text, illustrating the evolving landscape of
linguistic research in 2023 [19].
      </p>
      <p>Additionally, work on sentiment analysis in code-mixed Tulu text demonstrates the necessity for
corpus creation and the challenges involved when dealing with under-resourced languages, reinforcing
the importance of robust datasets in training and testing language identification models [20].</p>
      <p>
        Despite the advancements in language identification (LI) systems, several challenges persist,
particularly for low-resource languages like Kannada. The primary hurdles include the limited availability
of annotated datasets and the complex nature of code-mixed language. Most previous research has
concentrated on high-resource languages, resulting in a significant gap in methodologies that are
specifically designed for languages such as Kannada [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ][19].
      </p>
      <p>Our research aims to bridge this gap by utilizing the CoLI-Kanglish dataset, facilitating a deeper
understanding of code-mixed language phenomena. We focus on developing and refining LI techniques
that enhance accuracy for Kannada texts. By establishing a comprehensive framework for LI in
code-mixed Kannada, we aspire to contribute to the ongoing advancements in natural language processing
and foster innovations in multilingual communication technologies. This work not only aims to improve
language identification metrics but also seeks to empower future research and applications in the realm
of under-resourced languages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>The CoLI-Kanglish dataset serves as a foundational resource for language identification tasks involving
code-mixed Kannada-English text. This dataset comprises English and Kannada words transcribed
in Roman script, categorized into six distinct labels: ”Kannada,” ”English,” ”Mixed-language,” ”Name,”
”Location,” and ”Other.” The training portion of the dataset includes a total of 30,016 tokens segmented
into these six tags, with their distribution presented in Table 1. Additionally, the test dataset consists of
2,485 unlabeled tokens, facilitating real-world evaluation of model performance.</p>
      <p>
        To enhance the dataset further, a small portion (10%) of the preprocessed code-mixed texts was
randomly selected and tokenized into words. These words were manually tagged by two native
Kannada speakers trained in the concepts of code-mixed texts and the language identification (LI) task,
leading to the creation of the CoLI-Kanglish dataset [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This process yielded 25,302 unique words
extracted from nearly 8,000 sentences.
      </p>
      <p>The unique words are categorized into six classes: ’Kannada,’ ’English,’ ’Mixed-language,’ ’Name,’
’Location,’ and ’Other.’ The first two classes represent Kannada and English words, respectively, while
the ’Mixed-language’ class encompasses words formed by a combination of Kannada and English in
any order. The ’Name’ class specifically identifies names of individuals, and the ’Location’ class denotes
names of places. Any other words that do not fit these classifications are categorized under the ’Other’
class.</p>
      <p>
        A significant challenge in the language identification task arises from the ’Mixed-language’ class,
which includes words created from various combinations of Kannada and English, along with
Kannada/English affixes (prefixes and suffixes). The beauty and complexity of these mixed-language words
lie in their construction, which often varies by individual. As social media usage continues to grow, the
prevalence of such code-mixed expressions increases, underscoring the need for effective models to
analyze them [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>Detailed descriptions and samples of the categorized tokens are provided in Table 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
<p>This section outlines the methodologies employed in our study to effectively identify languages in
Kannada-English code-mixed texts. The process involves several key steps: feature extraction, model
selection, and evaluation.</p>
      <sec id="sec-4-1">
        <title>4.1. Feature Extraction</title>
        <sec id="sec-4-1-1">
          <p>For our analysis, we employed two primary techniques for feature extraction:</p>
          <p>Term Frequency-Inverse Document Frequency (TF-IDF): This method was utilized to convert the
text data into a numerical format. By calculating the importance of each word in relation to the corpus,
we generated a feature matrix that highlights the most significant terms in the dataset.</p>
          <p>Character N-grams: In addition to TF-IDF, we implemented character n-grams as features. This
approach allows for the capture of linguistic patterns that may be indicative of specific languages,
particularly in code-mixed texts where words from different languages are closely interwoven. For
example, extracting bi-grams and tri-grams enabled the model to learn contextual information about
how languages are mixed.</p>
        </sec>
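<p>A minimal sketch of these two feature types, combined with the word-length feature mentioned in the abstract (the scikit-learn parameter choices below are illustrative assumptions, not the exact configuration used in our experiments):</p>

```python
# Character n-gram TF-IDF features plus a word-length column.
# Parameter choices (analyzer, n-gram range) are illustrative assumptions.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

words = ["video", "chennagide", "super", "maadi"]  # toy tokens

# "char_wb" builds n-grams from characters inside word boundaries,
# capturing sub-word cues (e.g. "ide", "aad") that can signal language.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X_ngrams = vectorizer.fit_transform(words)

# Append each token's length as one extra feature column.
lengths = csr_matrix(np.array([[len(w)] for w in words], dtype=float))
X = hstack([X_ngrams, lengths]).tocsr()
print(X.shape)  # one row per token, n-gram columns + 1 length column
```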
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Selection</title>
        <p>We evaluated several traditional machine learning classifiers to identify the most efective approach for
our dataset. The classifiers selected for this study included:
• Naïve Bayes: A probabilistic classifier based on Bayes’ theorem that works well with text
classification tasks.
• Support Vector Machine (SVM): A powerful classifier that operates by finding the hyperplane
that best separates different classes in high-dimensional space.
• Decision Trees: A model that uses a tree-like graph of decisions to classify data points based on
feature values.</p>
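<p>These three classifier families map directly onto scikit-learn estimators; a brief sketch (the specific classes chosen here, such as LinearSVC, are assumptions, since the study does not name exact implementations):</p>

```python
# The three classifier families evaluated, as scikit-learn estimators.
# Class choices (e.g. LinearSVC rather than SVC) are illustrative assumptions.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Naive Bayes": MultinomialNB(),             # probabilistic, Bayes' theorem
    "SVM": LinearSVC(),                         # max-margin separating hyperplane
    "Decision Tree": DecisionTreeClassifier(),  # tree of feature-value splits
}
for name, clf in classifiers.items():
    print(f"{name}: {type(clf).__name__}")
```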
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <sec id="sec-4-3-1">
          <p>To evaluate the performance of each model, we used several metrics:</p>
          <p>• Precision: The ratio of correctly predicted positive observations to the total predicted positives.
• Recall: The ratio of correctly predicted positive observations to all actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a single metric for model
performance, especially in cases of class imbalance.</p>
        </sec>
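<p>These definitions can be checked on a toy word-level labeling with scikit-learn's metric functions (the example tags below are invented for illustration):</p>

```python
# Macro-averaged precision, recall, and F1 on a toy labeling, matching the
# definitions above. Macro averaging weights every class equally, which
# matters when classes such as "Location" are rare (class imbalance).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["kn", "en", "en", "mixed", "kn", "en"]  # gold tags (toy)
y_pred = ["kn", "en", "kn", "mixed", "kn", "en"]  # model output (toy)

p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f1, 3))
```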
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experimentation and Analysis</title>
        <p>Each model was trained on 80% of the dataset while retaining 20% for testing. The performance of the
models was compared based on the evaluation metrics, enabling us to identify the model that exhibited
the best performance in accurately identifying languages in the code-mixed texts.</p>
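<p>Putting the pieces together, the 80/20 protocol can be sketched end to end (the tokens, tags, and model parameters below are invented for illustration and are not the CoLI-Kenglish data):</p>

```python
# End-to-end sketch of the experiment: 80/20 split, character n-gram
# TF-IDF features, an SVM classifier, and macro F1 on the held-out 20%.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy word/tag pairs, repeated to give the split something to work with.
data = [("video", "en"), ("super", "en"), ("love", "en"), ("content", "en"),
        ("chennagide", "kn"), ("maadi", "kn"), ("guru", "kn"), ("tumba", "kn"),
        ("songu", "mixed"), ("tricku", "mixed")] * 10
words = [w for w, _ in data]
labels = [t for _, t in data]

X_train, X_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, random_state=42, stratify=labels)

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LinearSVC())
model.fit(X_train, y_train)

macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"held-out macro F1: {macro_f1:.3f}")
```

<p>Because the toy test words also occur in the training portion, this sketch will score far higher than the 81.2% reported on the real shared-task data; it illustrates only the protocol, not the expected performance.</p>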
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>This section presents the results of our experiments with various machine learning models for language
identification in Kannada-English code-mixed texts, as well as a discussion of the implications of these
findings.</p>
      <sec id="sec-5-1">
        <title>5.1. Performance Analysis</title>
        <p>The performance of the classifiers was evaluated using the metrics outlined in the methodology section.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Discussion of Results</title>
        <p>The findings indicate that the Support Vector Machine (SVM) model outperformed the other classifiers,
achieving an F1 score of 81.2%. This demonstrates the SVM’s efficacy in handling the complexities
of code-mixed language identification, particularly given its ability to find optimal hyperplanes in
high-dimensional spaces.</p>
        <p>In contrast, the Naïve Bayes classifier yielded the lowest performance among the models tested.
While Naïve Bayes can be effective for simpler textual classifications, its assumptions regarding feature
independence may have hindered its ability to capture the intricate relationships between Kannada and
English in code-mixed contexts. Additionally, the decision tree model displayed moderate performance,
indicating that while it offers interpretability, it may be less robust for this specific task compared to
SVM.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Insights on Code-Mixing Patterns</title>
        <p>A qualitative analysis of the misclassified instances revealed insightful patterns in the code-mixing
behavior prevalent in the dataset. Comments exhibiting heavy English vocabulary intertwined with
Kannada were more challenging for the models. For example, phrases such as ”I love this video” or
”Give me more content” were often misclassified, likely due to the common use of English loanwords
and phrases in Kannada discourse, which can blur the lines between the two languages.</p>
        <p>Moreover, we observed that the length and structure of comments influenced classification accuracy.
Shorter comments tended to result in higher misclassification rates; this could be attributed to their
lack of context. In contrast, longer comments often provided richer linguistic cues, allowing the models
to make more accurate predictions.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Implications for Future Research</title>
        <p>The results of this study highlight the importance of utilizing robust machine learning techniques like
SVM for tasks involving language identification in code-mixed contexts. These findings underscore the
need for further research to explore hybrid models that combine the strengths of various classifiers, as
well as to investigate deeper learning algorithms such as neural networks, which have shown promise
in other multilingual NLP tasks.</p>
        <p>In conclusion, our study contributes valuable insights into the identification of code-mixed languages,
specifically focusing on Kannada-English interactions. The positive performance of the SVM model
offers a promising pathway forward for future investigations, while the challenges identified provide a
basis for continued exploration into the unique characteristics of code-mixing.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study explored the challenges and methodologies associated with language identification in
Kannada-English code-mixed texts. By utilizing a dataset of YouTube comments and various machine
learning models, we aimed to shed light on the dynamics of code-mixing and its implications for natural
language processing (NLP) in multilingual contexts.</p>
      <p>Our findings indicate that the Support Vector Machine (SVM) model significantly outperformed
other classifiers, achieving an F1 Score of 81.2%. This result highlights the effectiveness of SVM in
managing the complex interplay of languages found in code-mixed communication. While Naïve Bayes
and Decision Tree models performed adequately, they struggled to capture the nuances of code-mixing,
emphasizing the importance of selecting the right algorithm for such intricate linguistic tasks.</p>
      <p>The qualitative analysis of misclassified instances provided deeper insights into the characteristics
of code-mixing in our dataset. We found that the blending of languages, particularly the prevalence
of English phrases in Kannada comments, posed challenges for identification. These observations
underline the necessity for models to account for context, structure, and usage patterns inherent in
informal language online.</p>
<p>Overall, this research contributes to the evolving field of multilingual NLP by demonstrating effective
approaches for language identification in mixed-language settings. It opens avenues for future research
to delve into hybrid models and more advanced learning techniques to further enhance F1 Score in
language processing tasks.</p>
<p>In light of the increasing globalization and the rise of digital communication, understanding
code-mixing is indispensable for the development of accurate language processing tools. By bridging linguistic
boundaries, this work aims to support applications such as social media analytics, language translation,
and real-time communication tools for bilingual speakers.</p>
      <p>Moving forward, we encourage researchers to explore larger and more diverse datasets, investigate
other multilingual settings, and apply deep learning techniques to further advance the understanding
of code-mixing phenomena in natural language processing.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration of Generative AI and AI-assisted technologies in the writing process</title>
<p>During the preparation of this work, the author used DeepSeek AI to generate initial drafts of specific
sections (Introduction, Literature Review, and Methodology sections). After using this tool, the
author reviewed and edited the content as needed and takes full responsibility for the content of the
publication.</p>
      <p>[19] Hegde, A., Balouchzahi, F., Coelho, S., Lakshmaiah, H. S., Nayel, H. A., Butt, S. (2024).
CoLI@FIRE2023: Findings of Word-level Language Identification in Code-mixed Tulu Text. In
Proceedings of FIRE ’23. Association for Computing Machinery.</p>
      <p>[20] Hegde, A., Mudoor Devadas, A., Coelho, S., Lakshmaiah, H. S., Chakravarthi, B. R. (2022). Corpus
creation for sentiment analysis in code-mixed Tulu text. In Proceedings of the 1st Annual Meeting
of the ELRA/ISCA Special Interest Group on Under-Resourced Languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>The Role of Roman Script in Multilingual Communication in India</article-title>
          .
          <source>Journal of Language and Linguistic Studies</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ),
          <fpage>123</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Language Identification in Code-Mixed Text: A Survey</article-title>
          .
          <source>International Journal of Computational Linguistics</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ),
          <fpage>45</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Challenges of Language Identification in Social Media Texts</article-title>
          .
          <source>Proceedings of the International Conference on Natural Language Processing</source>
          ,
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Addressing the Under-Resourced Languages in NLP: A Case Study of Tulu and Kannada</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>56</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1023</fpage>
          -
          <lpage>1045</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
<article-title>High-Resource vs. Low-Resource Languages: A Comparative Study in Language Processing</article-title>
          .
          <source>Journal of Linguistic Studies</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>67</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>CoLI-Kenglish and CoLI-Tunglish: Datasets for Code-Mixed Language Identification</article-title>
          .
          <source>Data in Brief</source>
          ,
          <volume>45</volume>
          ,
          <fpage>108</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Machine Learning and Deep Learning Approaches for Language Identification in Code-Mixed Texts</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chaitanya</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Language identification in code-mixed Hindi-English text using machine learning techniques</article-title>
          .
          <source>Journal of Language and Linguistic Studies</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <fpage>123</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Gundapu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mamidi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
<article-title>Conditional Random Fields for language identification in Telugu-English code-mixed text</article-title>
          .
          <source>Proceedings of the International Conference on Natural Language Processing</source>
          ,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Mandal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>A multichannel neural network approach for language identification in code-mixed text</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>70</volume>
          ,
          <fpage>345</fpage>
          -
          <lpage>367</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Thara</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poornachandran</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Dataset and transformer-based model for language identification in English-Malayalam code-mixed text</article-title>
          .
          <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>789</fpage>
          -
          <lpage>798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Veena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>SVM models for language identification in Hindi-English code-mixed text</article-title>
          .
          <source>International Journal of Computational Linguistics</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ),
          <fpage>201</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Language Identification in Code-Mixed Texts: A Machine Learning Approach</article-title>
          .
          <source>Proceedings of the International Conference on Computational Linguistics</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Hybrid Deep Learning Model for Language Identification in Code-Mixed Texts</article-title>
          .
          <source>Journal of Natural Language Engineering</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al. (
          <year>2022</year>
          ).
          <article-title>Enhancing Language Identification in Kannada-English Code-Mixed</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Hegde</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balouchzahi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coelho</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hosahalli Lakshmaiah</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Overview of CoLI-Dravidian: Word-level Code-mixed Language Identification in Dravidian Languages</article-title>
          .
          <source>Forum for Information Retrieval Evaluation (FIRE) 2024</source>
          , Gandhinagar.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Balouchzahi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hegde</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashraf</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hosahalli Lakshmaiah</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022</article-title>
          .
          <source>Proceedings of the 19th International Conference on Natural Language Processing</source>
          , IIIT Delhi, India.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Lakshmaiah</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balouchzahi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mudoor Devadas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts</article-title>
          .
          <source>Acta Polytechnica Hungarica</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>