<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cracking the Code: Machine Learning Approaches for Dravidian Code-Mixed Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikhil Narayan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sachin Mohanty</string-name>
        </contrib>
        <aff>Z-AGI Labs, India</aff>
      </contrib-group>
      <abstract>
        <p>Word-level Language Identification (LI) is a crucial task in handling multilingual and code-mixed texts, where words from different languages appear in a single sentence, making it challenging to accurately identify the language of individual words. This task becomes even more complex when dealing with under-resourced languages such as Tulu, Kannada, Tamil, and Malayalam, where the availability of annotated datasets is limited. Recognizing this gap, the CoLI-Dravidian Shared Task@FIRE2024 was introduced by the organizers to address the need for comprehensive datasets and methods for word-level LI in code-mixed texts involving these under-resourced languages. To tackle this challenge, our team developed a robust methodology combining classical machine learning models, such as Logistic Regression, Support Vector Machines (SVM), Multinomial Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree, and Random Forest, with advanced models like LightGBM (LGBM), CatBoost, and XGBoost. For each model, we performed extensive hyperparameter tuning to optimize performance and obtain the best possible scores. These models were trained using various embeddings, including Count Vectorizer and TF-IDF with n-gram ranges (1,3) and (1,4), as well as FastText embeddings, to effectively capture linguistic variations in the data. As a result, we achieved impressive rankings, securing Rank 4 in Kannada, Rank 3 in Malayalam, Rank 3 in Tulu, and Rank 2 in Tamil, demonstrating the effectiveness of our approach in this shared task.</p>
      </abstract>
      <kwd-group>
        <kwd>Low-Resource Languages</kwd>
        <kwd>Code-Mixed Text</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>Hyperparameter Optimization</kwd>
        <kwd>Text Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        India’s linguistic diversity is vast, with over 650 languages spoken across the country. The Dravidian
language family, predominant in southern India, includes languages such as Tulu, Kannada, Tamil,
and Malayalam. These languages are not only primary modes of communication for millions but
also repositories of rich cultural and literary traditions. For example, Tamil boasts a literary history
spanning over two millennia, while Malayalam has made significant contributions to poetry, drama,
and philosophical discourse. Despite their cultural significance, Dravidian languages are
underrepresented in digital technologies, particularly in Natural Language Processing (NLP) applications[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
This underrepresentation poses challenges for integrating these languages into modern computational
frameworks, limiting their availability in applications such as machine translation, sentiment analysis,
and other AI-driven tools.
      </p>
      <p>
        A significant challenge in processing these languages is word-level language identification (LI),
especially in code-mixed texts where multiple languages appear within a single sentence. Code-mixing is
prevalent on social media platforms, where speakers blend their native languages with English, creating
a complex linguistic landscape that complicates text analysis[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This complexity is further exacerbated
by the use of Roman script to phonetically transcribe these languages, resulting in mixed-script text
that is difficult to process using conventional NLP tools[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Consequently, accurately identifying the
language of individual words in such contexts becomes crucial for effective downstream NLP tasks.
      </p>
      <p>
        The necessity of word-level LI arises from the growing need for accurate linguistic processing
in multilingual environments. Without effective word-level LI, code-mixed texts become difficult to
interpret, leading to reduced accuracy in applications like machine translation and sentiment analysis[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Moreover, the scarcity of annotated datasets for under-resourced languages further exacerbates the
difficulty in developing robust models capable of handling the nuances of these linguistic combinations.</p>
      <p>
        Given these challenges, the CoLI-Dravidian Shared Task@FIRE2024[
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] was established to advance
research in word-level LI for code-mixed texts involving Dravidian languages. This shared task
introduces annotated datasets comprising Tulu, Kannada, Tamil, and Malayalam texts, which are intricately
blended with English and other local languages. These datasets, collected from user-generated content
on social media, include categories such as ‘Mixed-language,’ ‘Name,’ ‘Location,’ and other specific
linguistic markers, reflecting the nuanced and dynamic nature of language use in digital spaces. By
providing a robust corpus for word-level LI, the task aims to enhance the development of models that
can accurately identify and classify words in these diverse linguistic environments.
      </p>
      <p>The development of such models is crucial for the broader field of NLP, as they offer the potential to
improve various applications, including machine translation, sentiment analysis, and speech recognition,
particularly for under-resourced languages. Addressing the complexities of code-mixed text will enable
more accurate and inclusive NLP tools, fostering better understanding and processing of multilingual
content in digital media.</p>
      <p>From here, the report continues in the following manner: In section 2, we give an overview of the
dataset for each language and describe the challenge at hand. In section 3, we present our approach
in detail, covering our experimental set-up, cross-validation strategy, the models used,
and the intuition behind them. In section 4, we summarize the results of our experiments. Then, we
conclude in section 5 with the final takeaways, our standings, and the scope of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Description</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset Overview</title>
        <p>The dataset used for word-level Language Identification (LI) in code-mixed text spans four distinct
subtasks based on different Indian languages: Tulu, Kannada, Tamil, and Malayalam. Each subtask
focuses on LI in sentences containing a mix of English and a regional Indian language, presenting
unique challenges and insights into the linguistic diversity of code-mixed text. Below is a detailed
description of each subtask dataset and its composition, drawn from the competition website.</p>
      </sec>
      <sec id="sec-2-1a">
        <title>2.2. Tulu</title>
        <p>
          The Tulu dataset[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] comprises code-mixed sentences sourced from YouTube videos, which have been
preprocessed to remove non-textual characters and standardized by transliteration into Roman script.
The dataset categorizes the extracted words into classes including ‘Tulu,’ ‘Kannada,’ ‘English,’
‘Mixed-language,’ ‘Name,’ and ‘Location.’ The complexity of the ‘Mixed-language’ class, which blends Tulu with
Kannada and/or English, highlights the intricate linguistic patterns present in digital communication.
This dataset is essential for developing robust models for language identification (LI) in
Tulu-English-Kannada code-mixed text.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Kannada</title>
        <p>
          The Kannada dataset[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] contains tokens that have been transliterated into Roman script to ensure
uniform processing. These tokens are classified into ‘Kannada,’ ‘English,’ ‘Mixed-language,’ ‘Name,’
‘Location,’ and ‘Other.’ This dataset is designed to enhance language identification in Kannada-English
code-mixed texts, commonly found in informal digital communication. The ‘Other’ category provides
additional flexibility by including tokens that do not align with the main language categories.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Malayalam</title>
        <p>The Malayalam dataset is the most extensive among the four, containing tokens categorized into
‘Malayalam,’ ‘English,’ ‘Mixed,’ ‘Name,’ ‘Number,’ ‘Location,’ and ‘Sym’ for sentence boundaries. Similar
to the other datasets, the tokens are presented in Roman script for standardized processing. The
inclusion of categories like ‘Number’ and ‘Sym’ broadens the dataset’s applicability, supporting a wide
range of natural language processing tasks, particularly in language identification.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Tamil</title>
        <p>The Tamil dataset includes tokens categorized into classes consistent with those used in the Tulu and
Kannada datasets. These tokens are also presented in Roman script to facilitate uniform processing.
This dataset is intended to support language identification in Tamil-English code-mixed text, providing
a structured resource for exploring language mixing patterns specific to Tamil.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Set-up</title>
      <p>In this section, we discuss our approach and explain the experimental set-up details. We start with
creating a validation strategy for each language. As the dataset for each language is fairly imbalanced,
we opt for Stratified K-Fold cross-validation with 5 folds. While creating the splits, we set the
random seed to 42 for reproducibility.</p>
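      <p>As an illustration, a minimal sketch of this split with scikit-learn follows; the tokens and labels below are toy, hypothetical stand-ins, not the shared-task data:</p>

```python
# Stratified 5-fold cross-validation with seed 42, as described above.
# The token/label toy data is hypothetical; real splits come from the
# shared-task files.
from sklearn.model_selection import StratifiedKFold

tokens = ["ninna", "hesaru", "enu", "chennagide", "illa",
          "movie", "scene", "song", "nice", "best"]
labels = ["Kannada"] * 5 + ["English"] * 5

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(tokens, labels))

for fold_id, (train_idx, val_idx) in enumerate(folds):
    # Each validation fold keeps the Kannada/English class ratio intact.
    print(fold_id, sorted(val_idx))
```

      <p>Stratification matters here because some label classes (e.g. ‘Location’) are rare, and a plain random split could leave a fold with no examples of them.</p>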
      <sec id="sec-3-1">
        <title>3.1. Preprocessing</title>
        <p>
          The preprocessing phase involved several cleaning and preparation tasks designed to optimize the
datasets for feature extraction and modeling. Initially, the datasets were reviewed for missing values,
irregularities, and non-textual noise. Any entries with missing values were removed, and non-standard
symbols or excessive punctuation were stripped away to enhance data quality. Given the code-mixed
nature of the text, which included English interspersed with Tulu, Kannada, Tamil, or Malayalam, we
employed language detection tools to tag and separate these instances, ensuring consistency across the
datasets. To address the significant class imbalance, particularly for underrepresented languages like
Tulu and Kannada, class weights were computed and applied during model training to mitigate bias
towards the majority classes. Text data was then transformed into numerical form using text
embedding techniques such as Count Vectorizer and TF-IDF with n-grams ranging from 1 to 4; we
also used FastText[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] embeddings to capture both semantic and syntactic features. Target labels were
encoded from categorical to numerical format using LabelEncoder, facilitating the training of machine
learning models.
        </p>
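        <p>A minimal sketch of these preprocessing steps is shown below. Note one assumption on our part: since the unit of classification is a single word, we read the (1,4) n-gram range as character n-grams; the word lists are toy examples.</p>

```python
# Character-level TF-IDF features, balanced class weights, and label
# encoding, sketching the preprocessing described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight

words = ["ninna", "hesaru", "movie", "super", "nice", "illa"]
tags = ["Kannada", "Kannada", "English", "English", "English", "Kannada"]

# Character n-grams up to length 4 capture sub-word spelling patterns,
# the main signal available for single-token language identification.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4))
X = vectorizer.fit_transform(words)  # sparse word-by-ngram matrix

encoder = LabelEncoder()
y = encoder.fit_transform(tags)      # categorical -> integer labels

# Weights that counteract the class imbalance noted above.
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y)
```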
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Modeling</title>
        <p>
          The modeling phase involved a diverse set of machine learning algorithms to classify the text data for all
four subtasks. We began with a suite of baseline models, including MultinomialNB, Logistic Regression,
LinearSVC, KNeighborsClassifier, Decision Tree Classifier, Random Forest Classifier, and SVC. These models
were selected for their simplicity and effectiveness in handling text classification tasks, providing a solid
foundation for initial experimentation and serving as benchmarks for further comparisons. As we aimed
to improve performance, we incorporated more advanced ensemble models such as LGBMClassifier[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
XGBoostClassifier[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and CatBoostClassifier[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which are renowned for their high performance in
structured data and text classification tasks. To fine-tune the models, we employed two hyperparameter
optimization techniques: Grid Search and Optuna. Grid Search was used to systematically explore
predefined hyperparameter grids for baseline models, including MultinomialNB, Logistic Regression,
LinearSVC, KNeighborsClassifier, Decision Tree, and Random Forest. For more sophisticated models
like Random Forest, LightGBM, XGBoost, and CatBoost, we utilized Optuna, a robust hyperparameter
optimization framework that efficiently searches for optimal parameters using Bayesian optimization.
Throughout the modeling process, all experiments, including data preprocessing, model training, and
hyperparameter tuning, were consistently applied across the four subtasks (Tulu, Kannada, Tamil,
and Malayalam) to ensure fair and meaningful performance comparisons among models and datasets.
Each experiment explored different text embeddings and model configurations to determine the most
effective strategies for each language classification task. During inference, models trained on each fold
were ensembled by averaging the logits, which provided a comprehensive and robust classification
output.
        </p>
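        <p>The fold-ensembled inference can be sketched as follows. One caveat: we describe averaging logits above, but with scikit-learn classifiers the sketch averages predicted class probabilities, the closest available analogue; all data below is a toy, hypothetical substitute for the real pipeline.</p>

```python
# Train one model per cross-validation fold, then average the fold
# models' predicted probabilities at inference time.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

words = ["ninna", "hesaru", "enu", "illa", "beda", "guru",
         "movie", "scene", "song", "nice", "best", "super"]
y = np.array([0] * 6 + [1] * 6)  # 0 = Kannada, 1 = English

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(words)
X_test = vectorizer.transform(["banni", "awesome"])

fold_probs = []
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, _ in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_probs.append(model.predict_proba(X_test))

# Average fold predictions, then take the argmax as the final label.
avg_probs = np.mean(fold_probs, axis=0)
predictions = avg_probs.argmax(axis=1)
```

        <p>Averaging across folds smooths out the variance of any single fold's model, which is why the ensembled output is more robust than any individual fold model.</p>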
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The results from the CoLI Dravidian tasks, presented in tables 1, 2, 3, 4, demonstrate the effectiveness of
various machine learning algorithms in language identification across Kannada, Tamil, Malayalam, and
Tulu. XGBoost emerged as the leading model for Kannada, achieving the highest macro-F1 score of 0.860
with the Count Vectorizer ngram(1,4) embedding set, highlighting its robustness in identifying language
patterns. In Tamil, the Multinomial Naive Bayes model, using the TF-IDF ngram(1,4) configuration,
slightly outperformed other models with a macro-F1 score of 0.732, showcasing its efficiency in handling
language-specific nuances. For Malayalam, LightGBM excelled, attaining a macro-F1 score of 0.864
with the Count Vectorizer ngram(1,4) representation. Logistic Regression was the top performer for
Tulu, with a macro-F1 score of 0.847 using TF-IDF ngram(1,4).</p>
      <p>These findings are further corroborated by the leaderboard scores (refer to table 5), where we find that
the SVC model generally performs well across languages. The obtained results place us 2nd/10 for
Tamil, 4th/10 for Kannada, 3rd/10 for Malayalam, and 3rd/9 for Tulu. The overall rankings highlight the
strengths of different machine learning models and feature-engineered representations in multilingual
language identification tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study explored various machine learning models and text embeddings for word-level language
identification in code-mixed texts involving Dravidian languages as part of the CoLI-Dravidian Shared
Task@FIRE2024. We experimented with a wide range of models, from classical machine learning
algorithms such as Logistic Regression, Support Vector Machines (SVM), Multinomial Naive Bayes,
K-Nearest Neighbors (KNN), Decision Trees, and Random Forests, to advanced ensemble methods like
XGBoost, CatBoost, and LightGBM. These models were trained using different embeddings, including
Count Vectorizer and TF-IDF with n-gram ranges (1,3) and (1,4), as well as FastText embeddings. We
also conducted extensive hyperparameter tuning to optimize each model’s performance. Our results
demonstrated that advanced models like XGBoost and LightGBM generally outperformed classical
methods across most tasks, yet simpler models such as Logistic Regression also exhibited competitive
performance when paired with optimized text embeddings and hyperparameter tuning. The overall
effectiveness of our methodology was reflected in our rankings, where we secured Rank 4 in Kannada,
Rank 3 in Malayalam, Rank 3 in Tulu, and Rank 2 in Tamil. These results underscore the complexity and
variability of code-mixed language processing, highlighting the necessity for more tailored approaches
depending on the linguistic context.</p>
      <p>Moving forward, we aim to extend our work to include deep learning models and transformer-based
architectures, explore zero-shot and few-shot learning techniques, and develop a unified model capable
of handling multiple languages more effectively. These steps will further enhance the robustness and
inclusivity of NLP applications for under-resourced languages.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4 in order to perform grammar and spelling
checks. Using this tool, the author(s) reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. U.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          <article-title>, Multi-task learning in under-resourced dravidian languages</article-title>
          ,
          <source>Journal of Data, Information and Management</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>137</fpage>
          -
          <lpage>165</lpage>
          . URL: https://doi.org/10.1007/s42488-022-00070-w. doi:10.1007/s42488-022-00070-w.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>765</fpage>
          -
          <lpage>806</lpage>
          . URL: https://doi.org/10.1007/s10579-022-09583-7. doi:10.1007/s10579-022-09583-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>Usefulness of graphemes in word-level language identification in code-mixed text</article-title>
          , in: J. P. Sahoo, A. K. Tripathy, M. Mohanty, K.-C. Li, A. K. Nayak (Eds.),
          <source>Advances in Distributed Computing and Machine Learning</source>
          , Springer Singapore, Singapore,
          <year>2022</year>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Sanasam</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <article-title>Switchnet: Learning to switch for word-level language identification in code-mixed social media text</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>337</fpage>
          -
          <lpage>359</lpage>
          . URL: https://www.cambridge.org/core/product/5E8F2A0046A559107F025E7B9DEE155B. doi:10.1017/S1351324921000115.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          , K. G,
          <string-name>
            <surname>H. S Kumar</surname>
            , S. D, S. Hosahalli Lakshmaiah,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          , Overview of CoLI-Dravidian:
          <article-title>Word-level Code-mixed Language Identification in Dravidian Languages</article-title>
          , in:
          <source>Forum for Information Retrieval Evaluation (FIRE) 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Hosahalli</given-names>
            <surname>Lakshmaiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          , Overview of CoLI-Kanglish:
          <article-title>Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022</article-title>
          , in: 19th
          <source>International Conference on Natural Language Processing Proceedings</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tulu text</article-title>
          , in: M.
          <string-name>
            <surname>Melero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Soria (Eds.),
          <source>Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, European Language Resources Association</source>
          , Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          . URL: https://aclanthology.org/2022.sigul-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          , G. Sidorov,
          <article-title>Coli-machine learning approaches for code-mixed language identification at the word level in kannada-english texts</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2211.09847. arXiv:2211.09847.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1607.04606. arXiv:1607.04606.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>Lightgbm: A highly efficient gradient boosting decision tree</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/ 6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM</source>
          ,
          <year>2016</year>
          . URL: http://dx.doi.org/10.1145/2939672.2939785. doi:10.1145/2939672.2939785.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Prokhorenkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gusev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vorobev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Dorogush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gulin</surname>
          </string-name>
          ,
          <article-title>Catboost: unbiased boosting with categorical features</article-title>
          , in: S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          , N. Cesa-Bianchi, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>