<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Accurate Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rob van der Goot</string-name>
          <email>robv@itu.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Identification, Named Entity Recognition</institution>
          ,
          <addr-line>Language models</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLPnorth, IT University of Copenhagen</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>2</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the participation of the NLPnorth team at the CoLI-Dravidian shared task hosted at FIRE2024 [1]. Detecting language on the word level of noisy social media data is still an open challenge. Specifically, for Dravidian languages it is common to code-switch with English in online communication, posing challenges for automatic processing of texts. Starting from standard language model finetuning, we propose a wide variety of approaches to increase performance on word-level language identification. Our results show that the choice of language model has a large effect on performance, and that other methods can lead to further performance improvements. We experiment with a CRF layer, training on multiple datasets, and language modeling, where each of the methods shows different trends across languages/datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://robvanderg.github.io/ (R. v. d. Goot)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>1</sup> YouTube was reported as the source platform for the Kannada and Tulu data; upon manual inspection, the other datasets seem to come from similar platforms.</p>
      <p>• Multi-lingual models outperform mono-lingual models in our setups, but this is likely an effect of
scale (multi-lingual models are larger and are trained on more data).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>We first compare all included languages (the four Dravidian languages and English) from a statistical
perspective; we collected their number of speakers [12], number of Wikipedia articles,<sup>2</sup> commonly
used scripts, AES endangered status (1-5, where 5 is not endangered) from Glottolog [13], and their resource
status according to Joshi et al. [14]. Results in Table 1 show that all languages have many speakers
and that the included languages are mostly not endangered, but are also in the lowest resource
level (1: The Scraping-Bys) as defined by Joshi et al. [14].</p>
      <p>Since the original data had different labels across the different languages, I first designed a mapping
to standardize the labels across languages,<sup>3</sup> which eases the training of multi-dataset models and
simplifies evaluation. Furthermore, the data was originally tokenized on the word level, but sentence
boundaries were not annotated. I separated the data on occurrences of ‘*’ and ‘.’ to obtain shorter chunks
of input that can more easily be used in length-constrained language models.</p>
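      <p>The chunking step described above can be sketched as follows. This is a minimal illustration and not the original pre-processing script: the function name, the (word, label) input format, and the choice to drop the boundary tokens themselves are assumptions.</p>
      <preformat>
```python
from typing import List, Tuple

# Assumption: the two boundary tokens named in the text.
BOUNDARY_TOKENS = {"*", "."}


def split_into_chunks(tokens: List[Tuple[str, str]]) -> List[List[Tuple[str, str]]]:
    """Split a flat list of (word, label) pairs into shorter chunks.

    A boundary token closes the current chunk and is itself dropped
    (assumed behaviour; it would have to be re-inserted before scoring).
    """
    chunks, current = [], []
    for word, label in tokens:
        if word in BOUNDARY_TOKENS:
            if current:  # skip empty chunks caused by repeated boundaries
                chunks.append(current)
                current = []
        else:
            current.append((word, label))
    if current:
        chunks.append(current)
    return chunks
```
      </preformat>
      <p>Dropping the boundary tokens keeps the chunks short, at the cost of having to restore them before predictions are submitted.</p>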
      <p>After the pre-processing, the resulting label distribution (Table 2) shows that the English label is
relatively frequent across all datasets, and that the named entity labels and the mixed labels (a
combination of languages within a single word, mostly due to compounds and inflections upon inspection) are
more scarce. It should also be noted that the SYM label was much more common in the original data,
but it was pre-processed away during the “sentence splitting” (and re-inserted before uploading the test
predictions). The only dataset with mixing across Dravidian languages is the Kannada dataset, which
includes words in Tulu.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We use the MaChAmp toolkit [16] with default hyperparameters for all our experiments (except the
statistical baseline). This means we train for 20 epochs and use the Adam optimizer with a learning rate of
0.0001, a slanted triangular learning rate [17], and a batch size of 32. MaChAmp uses a language model
as an encoder, adds a feedforward layer on top for classification, and finetunes all weights
during training.</p>
      <p><sup>2</sup> https://en.wikipedia.org/wiki/List_of_Wikipedias</p>
      <p><sup>3</sup> Since detailed annotation guidelines were not available, the mapping is based on manual inspection of occurrences of labels
in the data.</p>
      <p>[Table 2: label distribution per dataset. Columns: Kannada, Malayalam, Tamil, Tulu. Rows: lang-eng, lang-kan, lang-mal, lang-tam, lang-tul, mixed-kan-eng, mixed-mal-eng, mixed-tam-eng, mixed-tcy-eng, ne-LOC, ne-MISC, ne-NAME, ne-NUMBER, SYM.]</p>
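      <p>For readers unfamiliar with the slanted triangular learning rate [17], the schedule can be sketched as below. The formula follows Howard and Ruder; the values of cut_frac and ratio are the defaults reported in that paper and are an assumption here, as the MaChAmp defaults may differ. Only the peak learning rate of 0.0001 is taken from the text.</p>
      <preformat>
```python
from math import floor


def slanted_triangular_lr(step: int, total_steps: int, lr_max: float = 1e-4,
                          cut_frac: float = 0.1, ratio: float = 32.0) -> float:
    """Learning rate at a given step: short linear warm-up, long linear decay.

    cut_frac is the fraction of steps spent increasing the rate and
    ratio is how much smaller the lowest rate is than lr_max
    (both assumed values, see the lead-in).
    """
    cut = max(1, floor(total_steps * cut_frac))
    if cut > step:
        p = step / cut  # warm-up phase
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```
      </preformat>
      <p>The rate starts at lr_max / ratio, peaks at lr_max after the warm-up, and decays linearly back to lr_max / ratio at the last step.</p>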
      <sec id="sec-3-1">
        <title>3.1. Statistical Baseline</title>
        <p>
          We use character-based profiles as used in textcat [18]. textcat builds character n-gram profiles of texts
(which are frequency-ranked lists), which it then uses to compare a new input text to all profiles of the
training classes. Since textcat is usually used for sentences and we are classifying on the word level,
we re-tuned the hyperparameters, where the range of the minimum n-gram size is [1, 2, 3], the range of the maximum
n-gram size is [3, 4, 5, 6], and the range of the top-n most frequent n-grams to take into account is [500, 1,000, 10,000, 20,000]. We
found that a character n-gram range of 1-6 and a top-n of 20,000 led to the best performance.
        </p>
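        <p>A minimal sketch of such a profile-based classifier is given below. It follows the out-of-place measure of Cavnar and Trenkle [18] and is not the textcat implementation itself; the function names, the underscore padding, and the fixed penalty for unseen n-grams are assumptions. The defaults (n-grams of size 1-6, top-n of 20,000) are the best settings reported above.</p>
        <preformat>
```python
from collections import Counter
from typing import Dict, Iterable, List


def char_ngrams(word: str, n_min: int = 1, n_max: int = 6) -> List[str]:
    """All character n-grams of a padded word for n in [n_min, n_max]."""
    padded = "_" + word + "_"  # assumed padding symbol
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def build_profile(words: Iterable[str], top_n: int = 20000) -> Dict[str, int]:
    """Map each of the top_n most frequent n-grams to its frequency rank."""
    counts = Counter(g for w in words for g in char_ngrams(w))
    ranked = [g for g, _ in counts.most_common(top_n)]
    return {g: rank for rank, g in enumerate(ranked)}


def out_of_place(word_profile: Dict[str, int], class_profile: Dict[str, int],
                 max_penalty: int = 20000) -> int:
    """Sum of rank differences; unseen n-grams get a fixed maximum penalty."""
    return sum(abs(rank - class_profile[g]) if g in class_profile else max_penalty
               for g, rank in word_profile.items())


def classify(word: str, profiles: Dict[str, Dict[str, int]],
             max_penalty: int = 20000) -> str:
    """Return the class whose profile is closest to the profile of the word."""
    word_profile = build_profile([word])
    return min(profiles,
               key=lambda label: out_of_place(word_profile, profiles[label], max_penalty))
```
        </preformat>
        <p>In this sketch one profile is built per training label from all words carrying that label, and a new word is assigned the label with the smallest out-of-place distance.</p>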
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Language models</title>
        <p>As a first step, we evaluate a variety of transformer-based language models. We use only discriminative
language models, and they should be trained on at least one of the included languages. We use the
Hugging Face portal with the language filters and the “fill-mask” task. We excluded language models
for which training did not fit on our 40 GB GPUs. We pick the best 5 language models based on the
average scores, and also the single best language model for each language, for further investigations.
The following methods are only evaluated on this sub-selection of language models.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. CRF-layer</title>
        <p>Upon inspection of the outputs of the initial models, we noticed that in many of the cases where the model
made an error, a single label was surrounded by other labels. Hence, we add a CRF-layer [19] that
incorporates surrounding predictions and models the likelihood of transitioning from a certain label to
another label. We also adopt BIO-labels for this setup (and disallow illegal transitions like B-mal ↦
I-eng), as the MaChAmp toolkit enforces this when adding a CRF layer.</p>
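        <p>The BIO constraint mentioned above can be illustrated with a small sketch. It is not the MaChAmp implementation: the conversion rule (a B- tag opens every run of identical labels) and both function names are assumptions made for illustration.</p>
        <preformat>
```python
from typing import List


def to_bio(labels: List[str]) -> List[str]:
    """Convert per-word labels to BIO: B- opens a run, I- continues it."""
    bio, previous = [], None
    for label in labels:
        bio.append(("I-" if label == previous else "B-") + label)
        previous = label
    return bio


def is_legal_transition(prev_label: str, next_label: str) -> bool:
    """An I- tag may only follow a B- or I- tag of the same class."""
    if not next_label.startswith("I-"):
        return True
    return prev_label in ("B-" + next_label[2:], "I-" + next_label[2:])
```
        </preformat>
        <p>A CRF layer with such a constraint never scores a path that contains, for example, B-mal followed by I-eng.</p>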
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-dataset training</title>
        <p>Because the languages are related and annotations are similar, we also attempt to use multi-dataset
learning. We first train a single model for all datasets, where we experiment with a separate decoder
for each dataset as well as a combined decoder.</p>
        <p>Based on this joint model, we also do re-training on each target language. The intuition here is to
benefit from all the data while avoiding parameter sharing. For this setup, we also experiment with a
lower learning rate (i.e. multiplied by 0.1), because the models should have already learned the tasks and can now
focus on learning the more detailed peculiarities of the target language/dataset.</p>
        <p>We looked into adding other datasets (for other tasks), but all annotated datasets for the target
languages that we could find were in the native (non-Latin) scripts.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Language modeling</title>
        <p>As the larger-sized datasets we could find were all in other scripts than the one used in the shared task,
we opted for task-adaptive pretraining [20]. This means that we do language modeling on the training
data that is also annotated for the downstream evaluation task. We evaluate the difference between
doing language modeling in a sequential setup (first language modeling, then language identification)
and in a joint setup (learn both tasks simultaneously). We also evaluate whether it is beneficial to see the
data only once or to use multiple iterations (up to 20). Note that we keep the number of epochs and the
learning rate stable in the last experiment (i.e. if we see the data only once, the epochs are 20 times
smaller), and we use model selection based on the perplexity on the dev set to avoid overfitting.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The official metrics for the shared task are macro F1 and weighted F1. Since the task is language
identification, and many of the small labels do not refer to languages (i.e. named entities, numbers, and
symbols, see Table 1), we use weighted F1 for our evaluations (macro F1 gives equal weight to all labels,
so mistakes on smaller labels have a relatively large impact). All reported results (except on the test
data) are the average over three seeds.</p>
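      <p>The difference between the two metrics can be made concrete with a small sketch. It is a plain re-implementation for illustration, not the official scorer of the shared task; averaging only over labels that occur in the gold data is an assumption.</p>
      <preformat>
```python
from collections import Counter
from typing import Dict, List


def per_label_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1 per gold label, computed from true/false positives and negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in set(gold):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean: every label counts equally."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Mean weighted by how often each label occurs in the gold data."""
    scores = per_label_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)
```
      </preformat>
      <p>With nine lang-eng words and one SYM word, a model that predicts lang-eng everywhere loses half of its macro F1 on the single SYM word, whereas its weighted F1 stays high.</p>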
      <sec id="sec-4-1">
        <title>4.1. Language Model</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Improvement strategies</title>
        <p>We have the exact numbers for all strategies summarized in Appendix A. In this subsection, we will
summarize findings for each category of improvements.</p>
        <p>CRF-layer: The results with an added CRF layer in Figure 3 show that its effect differs per
language. For Kannada (kan) effects are positive, for Malayalam (mal) negative, and for the other two
languages mixed (depending on the language model). Overall, especially when taking into account the
standard deviations, differences in performance are relatively small.</p>
        <p>Multi-dataset training: When training on all datasets simultaneously, the drawback of weight
sharing seems to outweigh the benefits of increased training data size, as performance is usually lower
with higher standard deviations (Figure 4). After re-training on the target dataset/language, we see
again that the results differ per language: for Kannada, this is beneficial for most language models,
for Malayalam and Tamil it is negative, whereas for Tulu results are mixed. The lower learning rate has
no clear positive effect over the normal re-training. The results of our experiments with a combined
decoder classification head showed lower performance for all language models; the scores can be found
in Appendix A.</p>
        <p>Language modeling: For the language modeling experiments we only plot the sequential strategy,
as the joint results are consistently substantially lower (Table 5). The remaining results (Figure 5)
show that results are mainly positive for Kannada and Malayalam. Also, training on the data 20 times
(mlm-20) is not beneficial (in fact, for most models, performance on the dev set was highest in epoch
5-10, so that model was used). Results for twhin-bert-large are worse compared to the other models,
probably because its pretraining strategy differs the most from standard masked language
modeling.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on test data</title>
        <p>On the test data, we selected the best 4 models on the average scores over all languages (based on
individual seeds), and also submitted the single best models for each language. One interesting observation
is that there is a wide variety in which models are the best five, depending on the dataset/language (i.e.
the bold numbers in Table 5 do not show clear trends). This leads us to conclude that we should be
careful when claiming generalized findings across different language models in these types of setups.</p>
        <p>
          Results (Table 3) show that our best models performed highly competitively on most languages,
except Tamil, which surprisingly was the 2nd highest ranking dataset in our own experiments. It should
be noted that the official ranking is based on macro F1, which I do not report in this paper. Performances
are much higher compared to previous shared tasks on Kannada [
          <xref ref-type="bibr" rid="ref15">45</xref>
          ], where the winning team achieved a
weighted F1 of 86, and Tulu [
          <xref ref-type="bibr" rid="ref16">46</xref>
          ], where the winning team achieved a macro F1 of 81.3 (our best model
has 86.7). However, it is unclear how much of this change can be ascribed to differences in the data.
It can also be seen that performance is slightly lower on the test data compared to the dev data for most
languages. This can be due to overfitting or to the test set being more challenging; (dev) results from
other teams participating in the shared task might shed more light on this.
        </p>
        <p>[Figures 3 to 5: panels per language model (twhin-bert-large, infoxlm-large, mluke-large, mluke-large-lite, xlm-roberta-large, multilingual-e5-large) with kan, mal, tam, and tcy on the x-axis. Series: base and crf (Figure 3); base, all, retrain, and retrain-low (Figure 4); base, mlm-1, and mlm-20 (Figure 5).]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. How to pick the right language model?</title>
        <p>Because the choice of language model is an important factor for final performance, we perform a
correlation study of different properties of the language models against the final performance. From
each model we extract: the number of weights, the size of the vocabulary, the percentage of the
vocabulary that is used in a dataset, and the average length of a word (in subwords). We initially also
extracted the coverage of the vocabulary for each dataset, but that was almost always 100%, so no usable
correlation could be calculated. Results (Table 4) show that none of the properties has a very strong
correlation with performance, and none of the p-values were &lt; .05. Perhaps surprisingly, the percentage of the vocabulary
used and the average word length have a negative correlation, although they intuitively could be an
indicator of having a better vocabulary. However, this can be explained by confounding with the
number of weights; these two variables have a significant (p &lt; 0.05) Pearson correlation between 35 and 36
for all languages.</p>
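        <p>As a reminder of what is computed in such a study, the Pearson correlation between one model property and the final scores can be sketched as follows. This is a plain illustration and not the analysis script used for Table 4; the function name is an assumption and the p-values are not computed here.</p>
        <preformat>
```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```
        </preformat>
        <p>Here xs would hold one property per language model (for example the number of weights) and ys the corresponding weighted F1 on one dataset.</p>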
      </sec>
      <sec id="sec-5-2">
        <title>5.2. What are remaining errors?</title>
        <p>To investigate what the remaining errors are, we took the best performing model for each development
set, plotted a confusion matrix of all combined errors (Figure 6), and manually inspected the errors. It should
be noted that this was all done by the first author, who is not a speaker of any of the target languages.</p>
        <p>Kannada and Tulu are commonly confused, as they occur in the same dataset (the Tulu dataset). Some
of the cases of confusion are for words that occur in both languages; in other cases it is mostly the
context or the individual subwords that occur in the other language that mislead the model. The other
main confusion is underprediction of the misc category. As the name suggests, this is likely because it is
a less clearly defined category. Upon inspection, we found that for English this is commonly because
standard words are used as part of a name. The eng label also has quite some errors, in both directions
(over-prediction and under-prediction). Errors are made here because of interjections (like ah, hahha,
which are annotated as eng), typos and slang (Tha is labeled as Tulu by our models), and some
annotations for the English class that seem to be incorrect (e.g. padike, Bakrid). Finally, the mixed
language labels are commonly confused with the dataset languages. In almost all cases only the
inflection is in English, which leads to only 1-2 characters that differ from
the word in the Dravidian language; hence it is easy for the model to make mistakes.</p>
        <p>[Figure 6: confusion matrix of all combined errors; axes: Gold and Prediction.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The choice of language model is the most important factor compared to the other strategies we tested, including
a CRF layer, multi-dataset training, and various strategies for including masked language modeling in
training. The remaining strategies lead to improved results in certain setups; however, the trends are
different across language models and across datasets/languages. Hence, we conclude that future work
should be careful with generalizing claims when reporting gains with a limited number of datasets,
languages and/or language models. In our setup, multi-lingual models outperform the mono-lingual
models, probably because they are also larger in scale. We evaluated the effect of model size, vocabulary
size, vocabulary utility, and average word length with respect to final model performance. Our results
show the strongest correlation for model size, and negative correlations for vocabulary utility, but this
is probably because of the model size confounder (with an even stronger correlation). An analysis of
the errors showed that the remaining cases are often ambiguous words (i.e. their surface form can be
used in the annotated and predicted class) or subwords, and interpretation of context is thus still an
open challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>I would like to thank Lottie for maintaining the HPC cluster at the ITU, and the organizers of the shared
task for creating the data and sharing it.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any generative AI tools.</p>
      <p>(ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English
Texts, 2022, pp. 38–45.
[10] S. H. Lakshmaiah, F. Balouchzahi, M. D. Anusha, G. Sidorov, Coli-machine learning approaches for
code-mixed language identification at the word level in kannada-english texts, Acta Polytechnica
Hungarica 19 (2022).
[11] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus creation for
sentiment analysis in code-mixed tulu text, in: Proceedings of the 1st Annual Meeting of the
ELRA/ISCA Special Interest Group on Under-Resourced Languages, 2022, pp. 33–40.
[12] S. Wichmann, E. W. Holman, C. H. Brown, The ASJP database (version 20), 2022.
[13] H. Hammarström, R. Forkel, M. Haspelmath, S. Bank, Glottolog 5.0, 2024. URL: https://doi.org/10.5281/zenodo.10804357 (available online at http://glottolog.org, accessed on 2024-04-24).
[14] P. Joshi, S. Santy, A. Budhiraja, K. Bali, M. Choudhury, The state and fate of linguistic diversity and
inclusion in the NLP world, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, Online, 2020, pp. 6282–6293. URL: https://aclanthology.org/2020.acl-main.560.
doi:10.18653/v1/2020.acl-main.560.
[15] H. Hammarström, R. Forkel, M. Haspelmath, S. Bank, GlottoScope, 2024. URL: https://glottolog.org/langdoc/status.
[16] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, B. Plank, Massive choice, ample tasks (MaChAmp):
A toolkit for multi-task learning in NLP, in: D. Gkatzia, D. Seddah (Eds.), Proceedings of the 16th
Conference of the European Chapter of the Association for Computational Linguistics: System
Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176–197. URL:
https://aclanthology.org/2021.eacl-demos.22. doi:10.18653/v1/2021.eacl-demos.22.
[17] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint
arXiv:1801.06146 (2018).
[18] W. B. Cavnar, J. M. Trenkle, et al., N-gram-based text categorization, in: Proceedings of SDAIR-94,
3rd annual symposium on document analysis and information retrieval, Las Vegas, NV, 1994, p. 14.
[19] J. Lafferty, A. McCallum, F. Pereira, et al., Conditional random fields: Probabilistic models for
segmenting and labeling sequence data, in: ICML, volume 1, Williamstown, MA, 2001, p. 3.
[20] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t
stop pretraining: Adapt language models to domains and tasks, in: D. Jurafsky, J. Chai, N. Schluter,
J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8342–8360. URL:
https://aclanthology.org/2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740.
[21] A. DeLucia, S. Wu, A. Mueller, C. Aguirre, P. Resnik, M. Dredze, Bernice: A multilingual
pretrained encoder for Twitter, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 6191–6205. URL:
https://aclanthology.org/2022.emnlp-main.415. doi:10.18653/v1/2022.emnlp-main.415.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.
doi:10.18653/v1/N19-1423.
[23] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, C. Raffel, ByT5:
Towards a token-free future with pre-trained byte-to-byte models, Transactions of the Association
for Computational Linguistics 10 (2022) 291–306. URL: https://aclanthology.org/2022.tacl-1.17.
doi:10.1162/tacl_a_00461.
[24] J. H. Clark, D. Garrette, I. Turc, J. Wieting, Canine: Pre-training an efficient tokenization-free
encoder for language representation, Transactions of the Association for Computational Linguistics
10 (2022) 73–91. URL: https://aclanthology.org/2022.tacl-1.5. doi:10.1162/tacl_a_00448.
[25] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[26] A. ImaniGooghari, P. Lin, A. H. Kargaran, S. Severini, M. Jalili Sabet, N. Kassner, C. Ma, H. Schmid,
A. Martins, F. Yvon, H. Schütze, Glot500: Scaling multilingual corpora and language models
to 500 languages, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1082–1117. URL:
https://aclanthology.org/2023.acl-long.61. doi:10.18653/v1/2023.acl-long.61.
[27] Z. Chi, L. Dong, F. Wei, N. Yang, S. Singhal, W. Wang, X. Song, X.-L. Mao, H. Huang, M. Zhou,
InfoXLM: An information-theoretic framework for cross-lingual language model pre-training, in:
K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell,
T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Association
for Computational Linguistics, Online, 2021, pp. 3576–3588. URL:
https://aclanthology.org/2021.naacl-main.280. doi:10.18653/v1/2021.naacl-main.280.
[28] R. Joshi, L3cube-hindbert and devbert: Pre-trained bert transformer models for devanagari based
hindi and marathi languages, arXiv preprint arXiv:2211.11418 (2022).
[29] N. Maltesh, Kanberto, 2020. URL: https://huggingface.co/Naveen-k/KanBERTo.
[30] S. Rashinkar, S. Doddapaneni, M. Khapra, KooBERT, https://huggingface.co/KooAI/KooBERT,
2023.
[31] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding,
in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Dublin, Ireland, 2022, pp. 878–891. URL: https://aclanthology.org/2022.acl-long.62.
doi:10.18653/v1/2022.acl-long.62.
[32] P. He, J. Gao, W. Chen, Debertav3: Improving deberta using electra-style pre-training with
gradient-disentangled embedding sharing, The Eleventh International Conference on Learning
Representations (2021).
[33] R. Ri, I. Yamada, Y. Tsuruoka, mLUKE: The power of entity representations in multilingual
pretrained language models, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of
the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 7316–7330. URL:
https://aclanthology.org/2022.acl-long.505. doi:10.18653/v1/2022.acl-long.505.
[34] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual e5 text embeddings: A
technical report, arXiv preprint arXiv:2402.05672 (2024).
[35] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T.
Nagipogu, S. Dave, et al., Muril: Multilingual representations for indian languages, arXiv preprint
arXiv:2103.10730 (2021).
[36] H. W. Chung, T. Fevry, H. Tsai, M. Johnson, S. Ruder, Rethinking embedding coupling in pre-trained
language models, International Conference on Learning Representations (2020).
[37] A. Singapore, Sea-lion (southeast asian languages in one network): A family of large language
models for southeast asia, https://github.com/aisingapore/sealion, 2024.
[38] X. Zhang, Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, A. El-Kishky, Twhin-bert: A
socially-enriched pre-trained language model for multilingual tweet representations at twitter, in:
Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2023,
pp. 5597–5607.
[39] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in
Twitter for sentiment analysis and beyond, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri,
C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis
(Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European
Language Resources Association, Marseille, France, 2022, pp. 258–266. URL:
https://aclanthology.org/2022.lrec-1.27.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Full results</title>
      <p>The full results for all our experiments and our six selected language models are reported in Table 5.
97.19
97.20
95.88
97.27
97.08
97.31
96.44
96.23
94.66
96.72
94.20
94.13
92.61
94.22
94.13
94.05
93.86
93.78
92.32
93.67</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          , K. G, H. S Kumar, S. D, S. Hosahalli Lakshmaiah,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Dravidian: Word-level code-mixed language identification in Dravidian languages</article-title>
          , in:
          <source>Forum for Information Retrieval Evaluation FIRE - 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Bag of tricks for efficient text classification</article-title>
          , in: M. Lapata, P. Blunsom, A. Koller (Eds.),
          <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          , Association for Computational Linguistics, Valencia, Spain,
          <year>2017</year>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          . URL: https://aclanthology.org/E17-2068.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Kargaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Imani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yvon</surname>
          </string-name>
          , H. Schuetze,
          <article-title>GlotLID: Language identification for low-resource languages</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>6155</fpage>
          -
          <lpage>6218</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.410. doi:10.18653/v1/2023.findings-emnlp.410.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Burchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bogoychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <article-title>An open dataset and model for language identification</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , Association for Computational Linguistics
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>865</fpage>
          -
          <lpage>879</lpage>
          . URL: https://aclanthology.org/2023.acl-short.75. doi:10.18653/v1/2023.acl-short.75.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandapat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sitaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>GLUECoS: An evaluation benchmark for code-switched NLP</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>3575</fpage>
          -
          <lpage>3585</lpage>
          . URL: https://aclanthology.org/2020.acl-main.329. doi:10.18653/v1/2020.acl-main.329.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aguilar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          , T. Solorio,
          <article-title>LinCE: A centralized benchmark for linguistic code-switching evaluation</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Béchet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>1803</fpage>
          -
          <lpage>1813</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.223.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          S. H L,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          , S. Butt,
          <article-title>CoLI@FIRE2023: Findings of word-level language identification in code-mixed Tulu text</article-title>
          ,
          <source>FIRE '23</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>26</lpage>
          . URL: https://doi.org/10.1145/3632754.3633075. doi:10.1145/3632754.3633075.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Burchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <article-title>Code-switched language identification is harder than you think</article-title>
          , in: Y. Graham, M. Purver (Eds.),
          <source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, St. Julian's, Malta
          ,
          <year>2024</year>
          , pp.
          <fpage>646</fpage>
          -
          <lpage>658</lpage>
          . URL: https://aclanthology.org/2024.eacl-long.38.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-kanglish: Word level language identification in code-mixed Kannada-English texts at ICON 2022</article-title>
          , in:
          <source>Proceedings of the 19th International Conference on Natural Language Processing</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aralikatte</surname>
          </string-name>
          , Z. Cheng, S. Doddapaneni,
          <string-name>
            <given-names>J. C. K.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <article-title>Varta: A large-scale headline-generation dataset for Indic languages</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>3468</fpage>
          -
          <lpage>3492</lpage>
          . URL: https://aclanthology.org/2023.findings-acl.215. doi:10.18653/v1/2023.findings-acl.215.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          , G. Lample,
          <article-title>Cross-lingual language model pretraining</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sagen</surname>
          </string-name>
          ,
          <article-title>Large-Context Question Answering with Cross-Lingual Transfer</article-title>
          ,
          <source>Master's thesis</source>
          , Uppsala University, Department of Information Technology,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <article-title>XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>13142</fpage>
          -
          <lpage>13152</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.813. doi:10.18653/v1/2023.emnlp-main.813.
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-kanglish: Word level language identification in code-mixed Kannada-English texts at ICON 2022</article-title>
          , in:
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murugappan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumeresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          , Association for Computational Linguistics
          , IIIT Delhi, New Delhi, India,
          <year>2022</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/2022.icon-wlli.8
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <article-title>Overview of CoLI-Tunglish: Word-level language identification in code-mixed Tulu text at FIRE 2023</article-title>
          , in:
          <source>Forum for Information Retrieval Evaluation (FIRE 2023)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>