Short text language identification for under-resourced languages

Bernardt Duvenhage
Feersum Engine, Praekelt Consulting, Johannesburg, South Africa
bernardt@praekelt.com

Abstract. The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short text language identification (LID) that is useful for under-resourced languages. The algorithm is evaluated on short pieces of text in the 11 official South African languages, some of which are similar languages.

Keywords: Language identification · Similar languages

Note: The full paper was presented at the NeurIPS 2019 Workshop on Machine Learning for the Developing World. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Background

Accurate language identification (LID) is the first step in many natural language processing and machine comprehension pipelines. LID is also an important step in harvesting scarce language resources. Availability of data is still one of the big roadblocks for applying data-driven approaches like supervised machine learning in developing countries.

An in-depth survey of algorithms, features, datasets, shared tasks and evaluation methods may be found in [5]. The datasets for the DSL 2015 and DSL 2017 shared tasks [8] are often used in LID benchmarks. The NCHLT text corpora [1] may be used for a shared LID task for the South African languages. The DSL 2017 paper [8] gives an overview of the solutions of all of the teams that competed in the shared task; the winning approach [2] used an SVM with character n-gram features, part-of-speech tag features and some other engineered features. The winning approach for DSL 2015 [7] used an ensemble naive Bayes classifier. The fasttext classifier [6] is perhaps one of the best known efficient 'shallow' text classifiers that have been used for LID (see https://fasttext.cc/blog/2017/10/02/blog-post.html). Hierarchical stacked classifiers (including lexicons) have also been proposed that would, for example, first classify a piece of text by language group and then by exact language [4][3].

2 Methodology and results

The proposed LID algorithm, available at https://github.com/praekelt/feersum-lid-shared-task, builds on the work in [3] and [7]. We apply a naive Bayesian classifier with character (2, 4 and 6)-gram, word unigram and word bigram features, combined with a hierarchical lexicon-based classifier. The algorithm is evaluated against recent approaches using existing test sets from previous works on the South African languages as well as the Discriminating between Similar Languages (DSL) 2015 and 2017 shared tasks.

The naive Bayesian classifier is trained to predict the specific language label of a piece of text, but is used to first classify text as belonging to either the Nguni family, the Sotho family, English, Afrikaans, Xitsonga or Tshivenda. The lexicon-based classifier is then used to predict the specific language within a language group. If the lexicon's prediction of the specific language has high confidence then its result is used as the final label; otherwise the naive Bayesian classifier's specific language prediction is used as the final result. The lexicon is built over all the data and includes the vocabulary from both the training and testing sets.
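The feature set described above maps directly onto a standard bag-of-n-grams pipeline. The following is a minimal sketch of the flat naive Bayesian component, assuming scikit-learn; the vectoriser settings and the name nb_model are illustrative rather than the paper's exact configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Character (2, 4 and 6)-grams plus word unigrams and bigrams, as described above.
nb_model = Pipeline([
    ("features", FeatureUnion([
        ("char_2", CountVectorizer(analyzer="char", ngram_range=(2, 2))),
        ("char_4", CountVectorizer(analyzer="char", ngram_range=(4, 4))),
        ("char_6", CountVectorizer(analyzer="char", ngram_range=(6, 6))),
        ("word_1_2", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("nb", MultinomialNB()),
])

# texts: list[str]; labels: list[str] of specific language codes, e.g. "zul", "xho".
# nb_model.fit(texts, labels)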
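The family-then-language decision rule can likewise be sketched in a few lines. The language grouping table, the margin-based lexicon confidence and the 0.5 threshold below are assumptions for illustration (the paper does not specify its lexicon scoring or threshold); the sketch also assumes the fitted nb_model from the previous listing.

from collections import defaultdict

# Hypothetical grouping of ISO 639 codes into the language groups named above.
LANG_TO_GROUP = {
    "zul": "nguni", "xho": "nguni", "nbl": "nguni", "ssw": "nguni",
    "nso": "sotho", "sot": "sotho", "tsn": "sotho",
    "eng": "english", "afr": "afrikaans", "tso": "xitsonga", "ven": "tshivenda",
}

def build_lexicons(texts, labels):
    """Collect the word types seen per language; the paper builds its lexicon
    over the vocabulary of both the training and testing sets."""
    lexicons = defaultdict(set)
    for text, lang in zip(texts, labels):
        lexicons[lang].update(text.lower().split())
    return lexicons

def lexicon_predict(text, candidates, lexicons):
    """Score each candidate language by the fraction of tokens found in its
    lexicon; use the margin over the runner-up as a rough confidence."""
    tokens = text.lower().split()
    scores = sorted(
        ((sum(tok in lexicons[lang] for tok in tokens) / max(len(tokens), 1), lang)
         for lang in candidates),
        reverse=True,
    )
    (best_score, best_lang), (second_score, _) = scores[0], scores[1]
    return best_lang, best_score - second_score

def predict_language(text, lexicons, threshold=0.5):
    # The naive Bayesian model predicts a specific language, which is first
    # collapsed to its language group (family).
    specific_nb = nb_model.predict([text])[0]
    group = LANG_TO_GROUP[specific_nb]
    candidates = [lang for lang, grp in LANG_TO_GROUP.items() if grp == group]
    if len(candidates) == 1:  # English, Afrikaans, Xitsonga, Tshivenda
        return specific_nb
    # Within the group, a confident lexicon prediction overrides the NB label;
    # otherwise fall back to the NB specific-language prediction.
    specific_lex, confidence = lexicon_predict(text, candidates, lexicons)
    return specific_lex if confidence >= threshold else specific_nb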
Table 1. LID accuracy (%). The models we executed ourselves are marked with *. Results that are not available from our own tests or from the literature are indicated with '—'.

Model                                 | Algorithm   | NCHLT | DSL '15 | DSL '17
Joulin et al. 2017 [6] *              | fasttext    | 93.30 | 93.20   | 88.60
Bestgen 2017 (DSL winner) [2]         | SVM         | —     | —       | 92.74
Malmasi & Dras 2015 (DSL winner) [7]  | NB ensemble | —     | 95.54   | —
Duvenhage et al. 2017 [3] *           | NB+Lex      | 94.59 | —       | —
Naive Bayes only *                    | NB          | 94.36 | 94.98   | 91.89
Stacked model *                       | NB+Lex      | 96.12 | 99.34   | 98.70
Stacked model (50% lex dropout) *     | NB+Lex      | 94.90 | 98.06   | 96.21

The average classification accuracy results are summarised in Table 1. The accuracies reported are for classifying a piece of text by its specific language label. The accuracy of the proposed algorithm seems to be dependent on the support of the lexicon; without a good lexicon, a non-stacked naive Bayesian classifier might even perform better.

3 Conclusion

LID of short texts, informal styles and similar languages remains a difficult problem that is actively being researched. We would like to investigate the value of a lexicon in a production system and how it might be maintained using self-supervised learning. We are investigating the application of deeper language models, some of which have been used in more recent DSL shared tasks. We would also like to investigate data augmentation strategies to reduce the amount of training data that is required. Further research opportunities include data harvesting, and building standardised datasets and shared tasks for South Africa as well as the rest of Africa. In general, the support for language codes that include more languages seems to be growing, discoverability of research is improving, and paywalls seem to no longer be a big problem in getting access to published research.

References

1. NCHLT text corpora (2014), available from http://www.nwu.ac.za/ctext
2. Bestgen, Y.: Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 115–123. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1214, https://www.aclweb.org/anthology/W17-1214
3. Duvenhage, B., Ntini, M., Ramonyai, P.: Improved text language identification for the South African languages. In: 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech), pp. 214–218 (2017)
4. Goutte, C., Léger, S., Carpuat, M.: The NRC system for discriminating similar languages. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pp. 139–145. Association for Computational Linguistics and Dublin City University, Dublin, Ireland (Aug 2014). https://doi.org/10.3115/v1/W14-5316, https://www.aclweb.org/anthology/W14-5316
5. Jauhiainen, T.S., Lui, M., Zampieri, M., Baldwin, T., Lindén, K.: Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research 65, 675–782 (2019)
6. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics, Valencia, Spain (Apr 2017), https://www.aclweb.org/anthology/E17-2068
7. Malmasi, S., Dras, M.: Language identification using classifier ensembles. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 35–43. Association for Computational Linguistics, Hissar, Bulgaria (Sep 2015), https://www.aclweb.org/anthology/W15-5407
8. Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., Aepli, N.: Findings of the VarDial evaluation campaign 2017. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 1–15. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1201, https://www.aclweb.org/anthology/W17-1201