=Paper=
{{Paper
|id=Vol-3681/T4-2
|storemode=property
|title=Using Character Ngrams for Word-Level Language Identification in Trilingual Code-Mixed Data (and Even More)
|pdfUrl=https://ceur-ws.org/Vol-3681/T4-2.pdf
|volume=Vol-3681
|authors=Yves Bestgen
|dblpUrl=https://dblp.org/rec/conf/fire/Bestgen23
}}
==Using Character Ngrams for Word-Level Language Identification in Trilingual Code-Mixed Data (and Even More)==
Yves Bestgen¹

¹ Laboratoire d'analyse statistique des textes - Statistical Analysis of Text Laboratory (LAST - SATLab), Université catholique de Louvain, 10 place Cardinal Mercier, Louvain-la-Neuve, 1348, Belgium

Abstract

This paper presents the solution proposed by the SATLab to classify all the words in short utterances into one of seven categories, which include three languages, two of which are closely related Dravidian languages sparsely endowed with linguistic resources (Tulu and Kannada), and a category for tokens that mix several languages. This language-agnostic system uses only character ngrams as features and a classical supervised learning procedure. After the optimization of a series of parameters, it ranked first in the CoLI-Tunglish challenge with a Macro-F1 of 0.813, virtually on a par with the second-place system. Part of its effectiveness comes from taking into account the context in which each word to be categorized is used.

Keywords

Word-level language identification, character ngrams, logistic regression, low-resource languages

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
yves.bestgen@uclouvain.be (Y. Bestgen) · https://perso.uclouvain.be/yves.bestgen (Y. Bestgen) · ORCID 0000-0001-7407-7797
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Identifying the language in which a document is written is a classic and important task in natural language processing, because it conditions many other tasks such as translation, summarization or sentiment analysis. Over the last ten years, researchers have turned their attention to more complex problems, such as distinguishing dialectal varieties, like the Swiss German dialects in VarDial 2017 [1], or telling apart several closely related Dravidian languages mixed with English in short YouTube comments written in Roman script in VarDial 2021 [2]. In our multilingual world, code-mixing has become a very frequent phenomenon, particularly in the multitude of posts of all kinds on social networks [3, 4, 5, 6].

The code-mixed task just described (VarDial 2021) already seems particularly complex, yet it covers only half of the job: only one of the two languages used in an utterance is the target of discrimination. Thus, an utterance mixing English and Kannada must simply be identified as "Kannada". A far more complex situation exists: determining the language in which each word of an utterance is written when the utterance in question combines words from several languages, and even "words" that mix two of them, all written in the same script. This situation can be illustrated by the following example (see Table 1) from the CoLI-Tunglish dataset [7].

Table 1
Tokens of a comment to categorize in the CoLI-Tunglish dataset

Token      Category
anna       Kannada
every      English
weekg      Mixed
new        English
contentn   Mixed
padva      Tulu
pandu      Tulu
getondar   Tulu

This type of utterance is relatively frequent in multilingual situations where speakers of a regional language frequently know several other languages, as in the case of the Tuluvas in the southern part of Karnataka in India [8].
When this task has to be carried out for languages well endowed with linguistic resources, especially electronic dictionaries and very large corpora, such as English, German or Italian, it seems fairly straightforward. This is no longer the case when some of the languages involved are low-resource languages, such as Kannada and especially Tulu, respectively the official language and a regional language of Karnataka [8]. For this reason, Hegde et al. [7] proposed a shared task within the framework of the Forum for Information Retrieval Evaluation (FIRE 2023), asking participants to classify all the words in short utterances into one of seven categories. These categories include three languages, two of which are sparsely endowed with linguistic resources, but also a category for tokens that mix several languages, hence the "and even more" in the title of this paper.

To tackle this type of challenge, the SATLab has developed a language-agnostic system that uses only character ngrams as features, and therefore no other linguistic resources. The character ngrams are fed to a classical supervised learning procedure such as an SVM or a gradient-boosted decision tree. The aim of the present study is to determine whether this approach is also effective for the CoLI-Tunglish task, which is even more complex than those tackled in the past. The remainder of this paper presents the task and the two systems proposed by the SATLab, and then describes the results obtained.

2. Materials and Challenge Rules

The data for this challenge comes from the CoLI-Tunglish dataset [8], which is made up of short YouTube comments in Roman script. These comments contain Tulu, Kannada and English words, but also personal names and place names, as well as "Mixed-language" and "Other" cases. In all, there are seven categories to distinguish. Table 2 shows the distribution of these categories in the whole material provided by the organizers to develop a system, which included a training part and a validation part.

Table 2
Frequency of the seven categories in the training and development set

Category   Frequency   %
Tulu       10268       47.39
English    6400        29.54
Kannada    2238        10.33
Name       1224        5.65
Other      639         2.95
Mixed      478         2.21
Location   421         1.94
Total      21668       100.00

The imbalance between the frequencies of the categories is immediately apparent: the Tulu category is 24 times more frequent than the Location category. This imbalance led the organizers to choose the Macro-F1, an index that gives equal weight to all categories, to evaluate the systems [9].

As explained above, words are embedded in comments, so comments constitute a higher hierarchical level structuring the data. This second level is used in one of the two systems proposed in the next section. The complete training set contains 3,629 comments ranging from 2 to 21 tokens, and almost 95% of them are between 3 and 12 tokens long. While 15% of these comments contain words from only one category, 29% include words from three different categories, 9% from four and 1.6% from five.

The test material consisted of 10,505 words, or approximately one third of the complete CoLI-Tunglish dataset. Participating teams were allowed to use any additional linguistic resource, a possibility the SATLab did not exploit. They could submit a maximum of three solutions and did not know how these performed until the end of the challenge.
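To make the evaluation measure concrete, the following minimal sketch contrasts Macro-F1 with the frequency-weighted alternative. It uses scikit-learn and invented toy labels, not the challenge data or the organizers' evaluation script.

```python
# Minimal illustration of Macro-F1 on invented toy labels (not challenge data).
from sklearn.metrics import f1_score

y_true = ["Tulu", "Tulu", "Tulu", "English", "Kannada", "Location"]
y_pred = ["Tulu", "Tulu", "English", "English", "Kannada", "Tulu"]

# Macro-F1 is the unweighted mean of the per-category F1 scores, so a rare
# category such as "Location" counts exactly as much as the dominant "Tulu".
print(f1_score(y_true, y_pred, average="macro"))

# The weighted-averaged F1 (WA-F1 in Table 3 below) weights each category by
# its frequency instead, and is thus far less sensitive to rare categories.
print(f1_score(y_true, y_pred, average="weighted"))
```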
3. The Two Developed Systems

3.1. Basic System

The basic system, used for Run 1, is derived from the one that won first place at VarDial 2021 [2]. Its features are composed exclusively of character ngrams. Their major advantage is that they make it possible to build a system that can be applied without any modification to any writing script, or even combination of scripts, and to any language, including those that do not explicitly mark the boundaries between words.

When extracting the features of each word, several parameters were set on the basis of a four-fold cross-validation procedure in which all the words of a given comment were assigned to the same fold:

• The length of the character ngrams, which ranged from 1 to 5.
• The frequency threshold a feature must reach to be used, which was set at 2.
• The weighting scheme applied to the frequency of each feature of an instance (e.g., binary, logarithmic): a sublinear TF-IDF weighting was used.
• The normalization applied to the feature vector of an instance: L2 normalization.

Inferences were performed by a LIBLINEAR L2-regularized logistic regression model (dual, -s 7) for classification [10]. Three parameters of this procedure were set by means of the same cross-validation procedure:

• The regularization parameter C, which was set at 12.
• The -wi options, which adjust the parameter C for individual categories; they were set at 4 for "Mixed" and 3 for "Other", and left at 1 for the five other categories. Table 2 shows that the two overweighted categories are not the rarest, contrary to what might have been expected. They were chosen because they were the ones the system predicted too infrequently, and because these weights significantly improved the system's Macro-F1.
• The bias parameter (-B), which shifts the separating hyperplane away from the origin, and which was set to 1.

3.2. Context-Sensitive System

The basic system extracts the character ngrams of each word independently of all the other words making up the comment in question. This approach, which treats each word in isolation, seems justified, since the word following or preceding the target may a priori belong to any category. The consequence, however, is that information from a higher hierarchical level, the comment, is not taken into account. To compensate at least partially for this limitation, a second system was designed. It is also based on a LIBLINEAR L2-regularized logistic regression model (dual, -s 7), but it takes as input not the character ngrams but the output of the basic system. More precisely, for each word, the features are the probabilities of belonging to each of the seven categories computed by the base system, together with those of the two neighbors to the left and to the right (if any) in the corresponding comment. The LIBLINEAR parameters were identical to those used for the base system, except that C was set to 1.
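The sketch below gives a rough idea of how such a two-stage pipeline can be assembled. It is not the SATLab code: it uses scikit-learn (whose LogisticRegression wraps the same LIBLINEAR library) rather than LIBLINEAR directly, the variables `train_words`, `train_labels`, `comment_spans` and the helper `context_features` are the sketch's own inventions, and it reads "the two neighbors to the left and to the right" as two neighbors on each side of the target word. Only the parameter values (ngram lengths 1 to 5, frequency threshold 2, sublinear TF-IDF, L2 normalization, C = 12 then 1, and the "Mixed"/"Other" weights) come from the description above.

```python
# Sketch of the two-stage approach, under the assumptions stated in the text.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data: the Table 1 comment; the real training set has ~21,700 words.
train_words = ["anna", "every", "weekg", "new", "contentn",
               "padva", "pandu", "getondar"]
train_labels = ["Kannada", "English", "Mixed", "English", "Mixed",
                "Tulu", "Tulu", "Tulu"]
comment_spans = [(0, 8)]  # (start, end) row range of each comment

# Stage 1: character ngrams of each word taken in isolation.
vectorizer = TfidfVectorizer(
    analyzer="char",     # character ngrams, hence script-agnostic
    ngram_range=(1, 5),  # ngram lengths 1 to 5
    min_df=2,            # frequency threshold of 2
    sublinear_tf=True,   # sublinear (logarithmic) TF weighting
    norm="l2",           # L2 normalization of each instance
)
X_train = vectorizer.fit_transform(train_words)

# L2-regularized logistic regression; the per-category weights mirror the
# -wi options ("Mixed" x4, "Other" x3), restricted to categories present
# in the toy labels because scikit-learn rejects weights for absent classes.
weights = {c: w for c, w in {"Mixed": 4, "Other": 3}.items()
           if c in set(train_labels)}
base = LogisticRegression(C=12, class_weight=weights, solver="liblinear")
base.fit(X_train, train_labels)

# Stage 2: for each word, concatenate the base-model probabilities (seven
# per word in the real task) of the word itself and of its neighbors in
# the same comment, using zero vectors where a neighbor is missing.
def context_features(prob_rows, width=2):
    """prob_rows: the per-word probability vectors of one comment."""
    n_cat = prob_rows.shape[1]
    pad = [np.zeros(n_cat)] * width
    padded = pad + list(prob_rows) + pad
    return [np.concatenate(padded[i:i + 2 * width + 1])
            for i in range(len(prob_rows))]

probs = base.predict_proba(vectorizer.transform(train_words))
X_ctx = np.vstack([row for s, e in comment_spans
                   for row in context_features(probs[s:e])])
contextual = LogisticRegression(C=1, solver="liblinear")
contextual.fit(X_ctx, train_labels)
print(contextual.predict(X_ctx))
```

For brevity, the sketch feeds the base model's probabilities on its own training data to the second stage; out-of-fold probabilities would be less optimistic, but the paper does not detail how this step was handled.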
4. Results

4.1. Cross-Validation Performance of Both Systems

Table 3 shows the performance of the two systems in four-fold cross-validation using five indices: accuracy, weighted-averaged F1 (WA-F1), Macro-F1, mean recall and mean precision. As a reminder, the official challenge measure is Macro-F1. The table shows that the basic system is already quite effective, achieving a Macro-F1 of 0.767 on a seven-category problem. The context-sensitive system is slightly more effective, but the gain is only 0.016 in Macro-F1. Precision is also noticeably higher than recall.

Table 3
System performance in 4-fold cross-validation

System      Accuracy (%)  WA-F1  Macro-F1  Recall  Precision
Base        86.80         0.864  0.767     0.730   0.817
Contextual  87.60         0.873  0.783     0.746   0.830

4.2. Challenge Results

Table 4 shows the results of the best run of the five teams that took part in the challenge, as provided by the organizers. The SATLab's context-sensitive system came first, but only by a small margin, since its lead over the BFCAI team was a mere 0.001 in Macro-F1. The third team is also quite close, only 0.014 behind. The best SATLab performance was obtained with the context-sensitive system; the basic model obtained a Macro-F1 of 0.80 and was thus outperformed by the BFCAI system. It is noteworthy that the Macro-F1 on the test data is appreciably higher than that obtained in cross-validation (Table 3). If the split between training and test material was carried out completely at random, this suggests that having more training data is quite useful, since all the training material is available when predicting the test set, whereas only 75% of it is used in each cross-validation fold.

Table 4
Official results of the CoLI-Tunglish shared task

Rank  Team Name     Precision  Recall  Macro-F1
1     SATLAB        0.851      0.783   0.813
2     BFCAI         0.859      0.777   0.812
3     Poorvi        0.821      0.781   0.799
4     MUCS          0.807      0.743   0.770
5     IRLab@IITBHU  0.740      0.571   0.602

5. Conclusion

The aim of the CoLI-Tunglish shared task was to develop automatic systems for word-level language identification in code-mixed Tulu, Kannada and English short YouTube comments written in Roman script. The systems proposed by the SATLab are slightly modified versions of those that achieved excellent results in several VarDial and HASOC challenges, whose common feature was dealing with low-resource languages. The first system is based solely on character ngrams, while the second also takes into account contextual information from the comment. The contextual system came first in the challenge, but the difference with the second-place system (BFCAI) was very small. Clearly, this difference is insufficient to favor one system over the other; the criterion must therefore be the complexity of the systems [11]. The SATLab system is clearly a very simple one, requiring few computational resources and no linguistic resources apart from the learning material. When the BFCAI team's report becomes available, it will be possible to determine how complex their system is. If the SATLab system is much simpler, it would be natural to give it preference; if not, it would be preferable to consider the two systems as tied.

The major advantage of the systems proposed by the SATLab is that they are based solely on character ngrams. They can therefore be deployed for any combination of languages for which learning material is available. The main development that can be envisaged is better context awareness, but how to achieve it is not easy to determine, unless it can be shown that code-switching at least partially obeys some kind of syntactic rules, which is far from obvious. It seems far more likely that code-switching results from differences in the accessibility of concepts in the different languages known to the author and in their frequency of use on social networks. Building a system capable of taking such factors into account seems very challenging.

Acknowledgments

The author is a Research Associate of the Fonds de la Recherche Scientifique - FNRS (Fédération Wallonie-Bruxelles de Belgique).
References

[1] Y. Bestgen, Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets, in: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, 2017, pp. 115–123.
[2] Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2021, pp. 96–101.
[3] K. Bali, J. Sharma, M. Choudhury, Y. Vyas, "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook, in: Proceedings of the First Workshop on Computational Approaches to Code Switching, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 116–126. URL: https://aclanthology.org/W14-3914. doi:10.3115/v1/W14-3914.
[4] A. Das, B. Gambäck, Code-mixing in social media text, Traitement Automatique des Langues 54 (2013) 41–64. URL: https://aclanthology.org/2013.tal-3.3.
[5] S. Thara, P. Poornachandran, Code-mixing: A brief survey, in: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp. 2382–2388. doi:10.1109/ICACCI.2018.8554413.
[6] F. Balouchzahi, S. Butt, A. Hegde, N. Ashraf, H. Shashirekha, G. Sidorov, A. Gelbukh, Overview of CoLI-Kanglish: Word level language identification in code-mixed Kannada-English texts at ICON 2022, in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, 2022, pp. 38–45.
[7] A. Hegde, F. Balouchzahi, S. Coelho, S. Hosahalli Lakshmaiah, H. A. Nayel, S. Butt, Overview of CoLI-Tunglish: Word-level language identification in code-mixed Tulu texts at FIRE 2023, in: Forum for Information Retrieval Evaluation (FIRE 2023), 2023.
[8] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus creation for sentiment analysis in code-mixed Tulu text, in: Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, 2022, pp. 33–40.
[9] J. Opitz, S. Burst, Macro F1 and macro F1, 2021. arXiv:1911.03347.
[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
[11] J. Dodge, S. Gururangan, D. Card, R. Schwartz, N. A. Smith, Show your work: Improved reporting of experimental results, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2185–2194. URL: https://www.aclweb.org/anthology/D19-1224. doi:10.18653/v1/D19-1224.