Short text language identification for under-resourced languages

B. Duvenhage

Feersum Engine, Praekelt Consulting, Johannesburg, South Africa

Full paper presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World.

The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short text language identification (LID) that is useful for under-resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages, some of which are similar languages.

Keywords: Language identification · Similar languages

Background

Accurate language identification (LID) is the first step in many natural language processing and machine comprehension pipelines. LID is also an important step in harvesting scarce language resources. Availability of data is still one of the big roadblocks for applying data-driven approaches like supervised machine learning in developing countries.

An in-depth survey of algorithms, features, datasets, shared tasks and evaluation methods may be found in [5]. The datasets for the DSL 2015 & DSL 2017 shared tasks [8] are often used in LID benchmarks. The NCHLT text corpora [1] may be used for a shared LID task for the South African languages.

The DSL 2017 paper [8] gives an overview of the solutions of all the teams that competed in the shared task. The winning approach [2] used an SVM with character n-gram features, part-of-speech tag features and some other engineered features. The winning approach for DSL 2015 [7] used an ensemble naive Bayes classifier. The fastText classifier [6] is perhaps one of the best-known efficient 'shallow' text classifiers that have been used for LID (see https://fasttext.cc/blog/2017/10/02/blog-post.html). Hierarchical stacked classifiers (including lexicons) have also been proposed that would, for example, first classify a piece of text by language group and then by exact language [4][3].

Methodology and results

The proposed LID algorithm (available at https://github.com/praekelt/feersum-lid-shared-task) builds on the work in [3] and [7]. We combine a naive Bayesian classifier that uses character (2, 4 & 6)-gram, word unigram and word bigram features with a hierarchical lexicon-based classifier. The algorithm is evaluated against recent approaches using existing test sets from previous works on South African languages as well as the Discriminating between Similar Languages (DSL) 2015 and 2017 shared tasks.
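The feature set described above can be illustrated with a minimal sketch of a multinomial naive Bayes classifier. This is not the code from the authors' repository: the class name `NaiveBayesLID`, the function `extract_features`, the smoothing value `alpha` and the four toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, char_orders=(2, 4, 6)):
    """Return character n-grams plus word unigrams and bigrams for one text."""
    text = text.lower()
    feats = []
    for n in char_orders:
        feats += ["c%d:%s" % (n, text[i:i + n]) for i in range(len(text) - n + 1)]
    words = text.split()
    feats += ["w1:" + w for w in words]
    feats += ["w2:" + a + " " + b for a, b in zip(words, words[1:])]
    return feats


class NaiveBayesLID:
    """Multinomial naive Bayes with add-alpha smoothing (alpha is an assumed value)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.label_counts = Counter()               # label -> number of texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = extract_features(text)
            self.feature_counts[label].update(feats)
            self.label_counts[label] += 1
            self.vocab.update(feats)
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of every label for the text."""
        # Features never seen in training carry no information and are skipped.
        feats = [f for f in extract_features(text) if f in self.vocab]
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.feature_counts.items():
            total = sum(counts.values())
            score = math.log(self.label_counts[label] / total_docs)
            for f in feats:
                score += math.log((counts[f] + self.alpha) / (total + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data for illustration only; real training would use a corpus such as [1].
train_texts = ["good morning how are you", "thank you very much",
               "goeie more hoe gaan dit", "baie dankie vir die hulp"]
train_labels = ["eng", "eng", "afr", "afr"]
model = NaiveBayesLID().fit(train_texts, train_labels)
```

Calling `model.predict(text)` returns the label with the highest log score. Character n-grams let the classifier score words it has never seen whole, which matters for short texts where few complete words match the training vocabulary.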

The naive Bayesian classifier is trained to predict the specific language label of a piece of text, but is used to first classify text as belonging to either the Nguni family, the Sotho family, English, Afrikaans, Xitsonga or Tshivenda. The lexicon-based classifier is then used to predict the specific language within a language group. If the lexicon prediction of the specific language has high confidence, its result is used as the final label; otherwise the naive Bayesian classifier's specific language prediction is used as the final result. The lexicon is built over all the data and includes the vocabulary from both the training and testing sets.
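The stacking logic in this paragraph can be sketched as follows. The paper does not state how lexicon confidence is computed, so the margin-based measure and the `threshold` value below are assumptions, as are the function names, the ISO 639-3 style labels and the toy lexicon. `nb_predict` stands in for any function that returns a specific language label.

```python
from collections import defaultdict

# Language groups named in the text; the label codes are an assumed convention.
LANGUAGE_GROUPS = {
    "nguni": ["zul", "xho", "nbl", "ssw"],
    "sotho": ["nso", "sot", "tsn"],
    "eng": ["eng"],
    "afr": ["afr"],
    "tso": ["tso"],
    "ven": ["ven"],
}
GROUP_OF = {lang: group for group, langs in LANGUAGE_GROUPS.items() for lang in langs}


def build_lexicon(texts, labels):
    """Collect the set of words observed for each language."""
    lexicon = defaultdict(set)
    for text, label in zip(texts, labels):
        lexicon[label].update(text.lower().split())
    return lexicon


def lexicon_predict(text, candidates, lexicon):
    """Return (language, confidence) among the candidate languages.

    Confidence is an assumed measure: the lead of the best language over the
    runner-up in matched words, as a fraction of the words in the text.
    """
    words = text.lower().split()
    if not words:
        return None, 0.0
    hits = {lang: sum(w in lexicon.get(lang, ()) for w in words) for lang in candidates}
    ranked = sorted(candidates, key=lambda lang: hits[lang], reverse=True)
    best = ranked[0]
    runner_up = hits[ranked[1]] if len(ranked) > 1 else 0
    return best, (hits[best] - runner_up) / len(words)


def identify(text, nb_predict, lexicon, threshold=0.5):
    """Use the naive Bayes label to pick a group, then let the lexicon decide."""
    nb_label = nb_predict(text)
    group = GROUP_OF[nb_label]
    lex_label, confidence = lexicon_predict(text, LANGUAGE_GROUPS[group], lexicon)
    if lex_label is not None and confidence >= threshold:
        return lex_label   # confident lexicon result becomes the final label
    return nb_label        # otherwise fall back to the naive Bayes label


# Toy lexicon for illustration only.
lexicon = build_lexicon(
    ["sawubona unjani", "ngiyabonga kakhulu", "molo unjani", "enkosi kakhulu"],
    ["zul", "zul", "xho", "xho"],
)


def always_zulu(text):
    """Hypothetical stand-in for a naive Bayes model that picks the right group."""
    return "zul"
```

With this toy lexicon, `identify("molo unjani", always_zulu, lexicon)` lets the lexicon override the naive Bayes label, while a text whose words are shared equally between two languages falls back to the naive Bayes prediction. The fallback is what keeps the stacked classifier usable when lexicon coverage is weak.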

The average classification accuracy results are summarised in Table 1. The accuracies reported are for classifying a piece of text by its specific language label. The accuracy of the proposed algorithm seems to depend on the support of the lexicon. Without a good lexicon, a non-stacked naive Bayesian classifier might even perform better.

Conclusion

LID of short texts, informal styles and similar languages remains a difficult problem that is actively being researched. We would like to investigate the value of a lexicon in a production system and how to maintain it, possibly using self-supervised learning. We are investigating the application of deeper language models, some of which have been used in more recent DSL shared tasks. We would also like to investigate data augmentation strategies to reduce the amount of training data that is required.

Further research opportunities include data harvesting, building standardised datasets and shared tasks for South Africa as well as the rest of Africa. In general, the support for language codes that include more languages seems to be growing, discoverability of research is improving and paywalls seem to no longer be a big problem in getting access to published research.

References

1. NCHLT text corpora (2014), available from http://www.nwu.ac.za/ctext

2. Bestgen, Y.: Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 115-123. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1214, https://www.aclweb.org/anthology/W17-1214

3. Duvenhage, B., Ntini, M., Ramonyai, P.: Improved text language identification for the South African languages. 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech) pp. 214-218 (2017)

4. Goutte, C., Léger, S., Carpuat, M.: The NRC system for discriminating similar languages. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects. pp. 139-145. Association for Computational Linguistics and Dublin City University, Dublin, Ireland (Aug 2014). https://doi.org/10.3115/v1/W14-5316, https://www.aclweb.org/anthology/W14-5316

5. Jauhiainen, T.S., Lui, M., Zampieri, M., Baldwin, T., Lindén, K.: Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research 65, 675-782 (2019)

6. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. pp. 427-431. Association for Computational Linguistics, Valencia, Spain (Apr 2017), https://www.aclweb.org/anthology/E17-2068

7. Malmasi, S., Dras, M.: Language identification using classifier ensembles. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects. pp. 35-43. Association for Computational Linguistics, Hissar, Bulgaria (Sep 2015), https://www.aclweb.org/anthology/W15-5407

8. Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., Aepli, N.: Findings of the VarDial evaluation campaign 2017. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 1-15. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1201, https://www.aclweb.org/anthology/W17-1201