A Low-Resourced Peruvian Language Identification Model

Alexandra Espichán Linares1 and Arturo Oncevay-Marcos2
1 Facultad de Ciencias e Ingeniería, 2 Departamento de Ingeniería
Grupo de Reconocimiento de Patrones e Inteligencia Artificial Aplicada
Pontificia Universidad Católica del Perú, Lima, Perú
a.espichan@pucp.pe, arturo.oncevay@pucp.edu.pe

Abstract

Due to the linguistic revitalization in Perú over the last years, there is a growing interest in reinforcing bilingual education in the country and in increasing the research focused on its native languages. From the computer science perspective, one of the first steps to support the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) the building of a digital and annotated corpus for 16 Peruvian native languages, extracted from documents in web repositories, and (2) the fitting of a supervised learning model for the language identification task, using features identified in related studies in the state of the art, such as n-grams. The obtained results were promising (97% average precision), and it is expected that the corpus and the model can be exploited in more complex tasks in the future.

1 Introduction

In Perú, there are 4 million people who are speakers of a native language. They are part of the rich linguistic diversity of the country, which has 47 original languages grouped into 19 linguistic families. These Peruvian languages are distributed across the highlands and jungle (Amazon) regions, and most of them are very distinct from one another, in spite of their geographical or linguistic closeness (Ministerio de Educación, Perú, 2013).

This linguistic diversity calls for equal opportunity across the different native communities, which could be supported by high-level bilingual education and a deep knowledge of these languages. For that reason, there is a need to support linguistic research from an informatics point of view, and one of the first required tools is an automatic language detector for written text at different levels, such as a complete document, a paragraph or even a sentence (Malmasi et al., 2015).

To develop an automatic language identifier, a basic natural language processing (NLP) task, an annotated textual corpus for the languages is required first. However, not all languages have a digital corpus large enough for computational tasks; these are known as low-resourced languages from a computer science point of view (Forcada, 2006).

Thus, building a digital repository of textual corpora for these languages is a must, and a necessary step prior to the development of an automatic language identification model.

In the next section, the Peruvian native languages used in this work are presented. Then, Section 3 describes some related work. After that, Section 4 presents the corpus building process and the details of the dataset obtained for the study. Section 5 contains the implementation of the language identification model. Finally, the results and discussion are included in Section 6, while the conclusions and future work are presented in Section 7.

2 Peruvian native languages

Among the 47 languages spoken by Peruvian people, 43 are Amazonian (from the jungle) and 4 are Andean (from the highlands). These languages are considered prevailing languages because they have living speakers. They are grouped into 19 linguistic families (a family being a set of languages related to each other and sharing a common origin): 2 Andean (Aru and Quechua) and 17 Amazonian (Ministerio de Educación, Perú, 2013).

Table 1: Basic information of the languages within the scope of the study.
Linguistic Family   Language                          ISO-639-3   Speakers
Arawak              Ashaninka                         cni           88 703
Arawak              Asheninka                         cjo            8 774
Arawak              Matsigenka                        mcb           11 275
Arawak              Yine                              pib            3 261
Aru                 Aymara                            aym          443 248
Jíbaro              Awajún                            agr           55 366
Pano                Cashinahua                        cbs            2 419
Pano                Kakataibo                         cbr            1 879
Pano                Matses                            mcf            1 724
Pano                Shipibo-konibo                    shp           22 517
Quechua             Quechua Wanca                     qxw           37 559
Quechua             Quechua de Lambayeque             quf           21 496
Quechua             Quechua de Yauyos                 qux          456 225
Quechua             Quechua del Callejon de Huaylas   qwh          451 789
Quechua             Quechua del Cusco                 quz          566 581
Quechua             Quechua del Este de Apurimac      qve          266 336

The 47 original native languages are highly agglutinative, unlike Spanish (Castilian), the main official language of the country. Moreover, most of them present more than 100 morphemes for the word formation process. For instance, Quechua del Cusco contains 130 suffixes (Rios, 2016), while Shipibo-konibo uses 114 suffixes plus 31 prefixes (Valenzuela, 2003).

In this work, the language identification task was performed on 16 languages (from 5 families), including 6 varieties of Quechua. The ISO-639-3 codes and the approximate number of speakers of each language are presented in Table 1.

3 Related Work

Given that Peruvian languages can be considered low-resourced, a systematic search for studies on automatic language identification for low-resourced languages was carried out. The results are described as follows.

Malmasi et al. (2015) present the first study to distinguish texts between the Persian and Dari languages at the sentence level. As Dari is a low-resourced language, a corpus of 28 thousand sentences was developed for this task (14 thousand per language). Character and sentence n-grams were considered as language features. Finally, using a Support Vector Machine (SVM) implementation within a classification ensemble scheme, they discriminated both languages with 96% accuracy.

Botha and Barnard (2012) research the factors that may determine the performance of text-based language identification, with a special focus on the 11 official languages of South Africa, using n-grams as language features. In the study, three classification methods were tested on different training and test text sizes: SVM, Naive Bayes and n-gram rank ordering. It was found that the 6-gram Naive Bayes model had the best performance in general, obtaining 99.4% accuracy for large training and test sets and 83% for shorter ones.

Selamat and Akosu (2016) propose a language identification algorithm based on lexical features that works with a minimum amount of training data. For this study, a dataset of 15 languages, mostly low-resourced, extracted from the Universal Declaration of Human Rights was used. The technique is based on a spelling-checker method (Pienaar and Snyman, 2011), and the improvement proposed in this research was the indexation of the vocabulary words according to their length. The average precision of the method was 93%, and an improvement of 73% in execution time was obtained.

Grothe et al. (2008) compare the performance of three feature extraction approaches for language identification, using the Leipzig Corpora Collection (Quasthoff et al., 2006) and randomly selected Wikipedia articles. The considered feature approaches were short words (SW), frequent words (FW) and n-grams (NG), while the employed classification method was Ad-Hoc Ranking. The best results obtained for each approach were: FW at 25% (99.2%), SW of length 4 (94.1%) and NG with 3-grams (79.2%).

4 Corpus Development

Table 2: Retrieved corpus information: |D| = document collection size; |S| = sentence/phrase collection size; |V| = word vocabulary size, without considering punctuation; |C| = character vocabulary size; T = number of tokens.
Lang.   |D|     |S|      |V|      |C|        T
cni       4   7 516   10 125      35   25 119
cjo       1     555    1 308      35    2 691
mcb       1   2 502    4 276      33   10 092
pib       1     106      299      21      465
aym       5  16 431   16 216      39   53 115
agr       5  14 258   18 631      36   47 127
cbs       1      33      129      26      161
cbr     195   6 970    6 584      38   37 117
mcf       2  16 356   14 722      36   64 779
shp       4  15 866   24 597      35  203 988
qxw       6   1 259    2 782      38    6 640
quf       8     442    1 289      36    2 027
qux       2  20 496   25 105      35   85 124
qwh       3     665    2 029      36    3 448
quz       2   3 496    5 866      37   13 592
qve       9     635    1 744      31    2 957

To build the corpus used in this study, digital documents containing Peruvian native language texts were retrieved from the web, while others were obtained directly from private repositories or books. In this way, it was possible to collect and annotate documents from 16 different native languages. Note that these documents must be annotated, i.e., the language in which they are written must be known.

Then, as almost all the documents were in PDF format, the text content was extracted and some manual corrections were made where necessary. Next, a preprocessing program was developed to clean the punctuation, lowercase the text and split the sentences. After that, Spanish and English sentences were discarded using the resources of a generic spell-checking library[1], leaving only sentences in the Peruvian native languages.

Table 2 contains the total number of files, plus the number of sentences/phrases and tokens, for each Peruvian language used in this study. This preprocessed collection is partially available on the project site, including details of the sources of each language text[2].

Moreover, Figures 1 and 2 present statistics on the distribution of the number of characters per word and per sentence, respectively, for each processed language.

Figure 1: Boxplots representing the distribution of the word length in number of characters for each language.

The first boxplot (Figure 1) supports the rich morphology of the Peruvian native languages, as high word-length values are observed for most of them. It can also be noticed that most words are formed by 5 to 10 characters. Nevertheless, there are very long words in Matses (mcf), such as cuishonquededcuishonquededtsëcquiec and tantiabentantiabentsëccondaidquio, with 35 and 33 characters, respectively; even so, most Matses words have a length between 5 and 10 characters.

On the other hand, on average, the language with the longest words is Matsigenka (mcb), while the language with the shortest words is Kakataibo (cbr). Moreover, the word-length distributions among the languages of the Quechua family are quite similar.

In Figure 2, it can be noticed that the longest collected sentences are from Shipibo-konibo (shp), while the shortest are from Aymara (aym). The reason for the first case is the origin of the Shipibo-konibo corpus: a parallel corpus built for an SMT experiment, whose legal and educational text domain sources contain longer sentences than those found in dictionary or lexicon samples (Galarreta et al., 2017).

[1] libenchant: https://github.com/AbiWord/enchant
[2] chana.inf.pucp.edu.pe/resources/multi-lang-corpus

5 Language Identification Model

As it is proposed to perform language identification at the sentence level, the aim was to learn a classifier or classification function (f) that maps the sentences from the corpus (S) to a target language class (L):

    f : S → L    (1)

In order to identify which classifier is best suited for the task, each sentence s ∈ S will be represented in a feature vector space model: s_i = {w_1,i, w_2,i, ..., w_t,i}, where t indicates the number of dimensions or terms to be extracted.
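As an illustration of this representation, a sentence can be mapped to counts of character-level terms such as the bigrams and trigrams used in this experiment. This is only a plain-Python sketch of the idea, not the implementation used in the study:

```python
from collections import Counter

def char_ngrams(sentence, n_values=(2, 3)):
    """Count character-level n-grams (bigrams and trigrams by default).
    The distinct n-grams observed across the corpus define the t
    dimensions of the vector space model."""
    counts = Counter()
    for n in n_values:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

def to_vector(sentence, vocabulary):
    """Project one sentence s_i onto a fixed, ordered n-gram vocabulary,
    yielding the term weights w_1,i ... w_t,i (raw counts here; the
    study re-weights them with TF-IDF)."""
    counts = char_ngrams(sentence)
    return [counts.get(term, 0) for term in vocabulary]

# Example: a tiny vocabulary extracted from a one-sentence "corpus".
vocabulary = sorted(char_ngrams("jakon jato"))
vector = to_vector("jakon", vocabulary)
```

In the experiment, such count vectors are re-weighted with TF-IDF before being fed to the classifiers.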
Figure 2: Boxplots representing the distribution of the sentence length in number of characters for each language. The vertical axis (length) is in log10 scale.

Character-level n-grams were one of the most used language features for this task in the reviewed works (Grothe et al., 2008; Botha and Barnard, 2012; Malmasi et al., 2015). Hence, the dimensionality of each vector in the space model is equal to the number of distinct subsequences of n characters in a given sequence from the corpus S (Cavnar and Trenkle, 1994).

In this experiment, bigrams and trigrams were used to build the vector space model, and a term frequency - inverse document frequency (TF-IDF) matrix was calculated from the aforementioned n-gram scheme (Prager, 1999).

After that, the matrix was split into train and test sub-datasets (70%-30%), and several classification methods identified in the related works (Grothe et al., 2008) were fit using a 5-fold cross-validation scheme on the training sub-dataset. The obtained results are shown in Table 3.

Table 3: Results of the 5-fold cross-validation classification on the train sub-dataset.

Method                          Accuracy (%)
SVM (linear kernel)                    96.22
Multinomial Naive Bayes                92.76
SGD Classifier                         94.52
Perceptron                             95.05
Passive Aggressive Classifier          95.89

As the SVM classifier with a linear kernel obtained the best accuracy, this method was used to fit the main model on the entire train sub-dataset. Next, this model was validated on the test sub-dataset. A report of the performance of this model at classifying each language is shown in Table 4 (where Support indicates the number of samples classified per language). Furthermore, the confusion matrix of this model is presented in Figure 3.

Table 4: Main classification results for each language (SVM with a linear kernel).

Lang.       Precision   Recall   Support
cni              0.94     0.97     2 225
cjo              0.83     0.58       158
mcb              0.96     0.93       753
pib              1.00     0.85        39
aym              0.97     0.97     4 894
agr              0.99     0.99     4 340
cbs              1.00     0.58        12
cbr              0.99     0.99     2 157
mcf              0.97     0.98     4 984
shp              0.99     0.99     4 795
qxw              0.99     0.92       391
quf              0.93     0.46       142
qux              0.94     0.97     5 991
qwh              0.92     0.78       198
quz              0.91     0.89     1 024
qve              0.86     0.55       173
avg/total        0.97     0.97    32 276

6 Results and Discussions

In this study, a straightforward experiment was performed for the automatic identification of several Peruvian languages, showing that they can be distinguished with 96% accuracy. This is a new result for languages that had not previously been worked with.

Some languages were left with very little data. For instance, for Yine (pib) only 106 sentences were collected, of which at most 39 went to the test part; even so, a precision of 100% and a recall of 85% were obtained for that language. This may indicate an acceptable low-resourced language identification model, but to rule out overfitting, additional tests must be run once more textual documents can be retrieved.

On the other hand, as seen in Figure 3, for closely related languages like Ashaninka (cni) and Asheninka (cjo) there was considerable confusion in the model, since 22% of the Asheninka test sentences were misclassified as Ashaninka and only 58% of them were correctly identified.

Likewise, although the Quechua family obtained an acceptable overall precision, a not-so-good recall is shown for the varieties with less data. As seen in Figure 3, for Quechua de Lambayeque (quf), the variety of Quechua with the least amount of extracted sentences, only 46% of the test sentences were properly classified, and the model misclassified 42% of them as Quechua de Yauyos (qux).
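The confusion percentages discussed here can be derived from pairs of (true, predicted) labels over the test set. A minimal plain-Python sketch of that computation (the study does not specify its exact tooling):

```python
from collections import defaultdict

def confusion_matrix(true_labels, predicted_labels):
    """Map each true language to a count of predicted languages."""
    matrix = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def recall(matrix, label):
    """Fraction of sentences of `label` that were correctly identified."""
    row = matrix[label]
    total = sum(row.values())
    return row[label] / total if total else 0.0

def confusion_rate(matrix, true_label, predicted_label):
    """Fraction of `true_label` sentences misidentified as `predicted_label`."""
    row = matrix[true_label]
    total = sum(row.values())
    return row[predicted_label] / total if total else 0.0
```

For example, with hypothetical test labels where two out of three quf sentences are predicted as qux, `recall` for quf is 1/3 and `confusion_rate(m, "quf", "qux")` is 2/3; the per-language figures in Table 4 and Figure 3 summarize exactly this kind of ratio.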
Also, there is confusion when discriminating Quechua del Este de Apurímac (qve), since 21% of the sentences of this variety were misidentified as Quechua de Yauyos (qux) and 17% as Quechua del Cusco (quz).

Both scenarios may indicate the need to go deeper into the representation features used for languages within the same linguistic family, and to consider a hierarchical classification scheme.

Additionally, Cashinahua (cbs) was confused with Awajún (agr) 25% of the time. This is an interesting result, since both languages belong to different families: Pano and Jíbaro, respectively. However, as Cashinahua was the language with the least amount of collected sentences (only 33), it was expected that its results would not be as precise as those obtained for the other languages.

Overall, the acceptable result was obtained even though there was a great disadvantage to face: the corpus is unbalanced, because it was not possible to extract many more sentences for some languages than for others.

Figure 3: Confusion matrix obtained by the main language identification model. The dashed lines separate the different linguistic families.

7 Conclusions and Future Work

For this study, a corpus for 16 Peruvian native languages was built from web and private repositories. A straightforward classification experiment was then performed with it, using n-grams as features in a TF-IDF vector space model. The obtained results (97% overall precision) were in the expected range with regard to the state of the art of language identification in a low-resource scenario.

The fitted model may be exploited for other tasks, such as automatically growing the corpus through web and document search (Martins and Silva, 2005). As there are 47 Peruvian native languages to be preserved, it is essential to expand the corpus to cover most of them. The Bible will be targeted first, as it is translated into some of the languages not yet covered, and it is a very important NLP resource for minority-language cases (Christodouloupoulos and Steedman, 2015).

Also, as the corpus grows, other recent methods could be tested on it, such as the bidirectional recurrent neural network proposed by Kocmi and Bojar (2017) or other similar deep architectures (Bjerva, 2016; Mathur et al., 2017). In our scenario, however, these kinds of algorithms may struggle with the low-resourced and unbalanced corpus, so adaptation and tuning steps will be required. On the other hand, those methods could help reduce the classification window to the phrase or word level.

Moreover, regarding the confusion observed between languages of the same family, the following experiments must take specific account of the hierarchical nature of the Peruvian linguistic context (Koller and Sahami, 1997; McCallum et al., 1998; Jaech et al., 2016).

Finally, it is desirable to develop and integrate a way to discriminate languages that are not part of the scheme, in order not to misclassify out-of-model languages as a Peruvian one.

Acknowledgements

The authors are thankful to J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in native languages (Universidad Católica Sedes Sapientiae, 2015; Díaz, 2012). Likewise, the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, is appreciated for allowing the use of his own corpus for the Panoan family (Zariquiey Biondi, 2011). Furthermore, the support of the "Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica" (CONCYTEC Perú) under contract 225-2015-FONDECYT is acknowledged.

References

Johannes Bjerva. 2016. Byte-based language identification with deep convolutional networks. arXiv preprint arXiv:1609.09004.

Gerrit Reinier Botha and Etienne Barnard. 2012. Factors that affect the accuracy of text-based language identification. Computer Speech & Language 26(5):307-320.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval. pages 161-169.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2):375-395.

Darinka Pacaya Díaz, editor. 2012. Relatos de Nopoki. Universidad Católica Sedes Sapientiae.

Mikel Forcada. 2006. Open source machine translation: an opportunity for minor languages. In Proceedings of the Workshop "Strategies for developing machine translation for minority languages", LREC. volume 6, pages 1-6.

Ana-Paula Galarreta, Andrés Melgar, and Arturo Oncevay-Marcos. 2017. Corpus creation and initial SMT experiments between Spanish and Shipibo-konibo. In RANLP. ACL Anthology. In press.

Lena Grothe, Ernesto William De Luca, and Andreas Nürnberger. 2008. A comparative study on language identification methods. In LREC.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical character-word models for language identification. arXiv preprint arXiv:1608.03030.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. arXiv preprint arXiv:1701.03338.

Daphne Koller and Mehran Sahami. 1997. Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab.

Shervin Malmasi, Mark Dras, et al. 2015. Automatic language identification for Persian and Dari texts. In Proceedings of PACLING. pages 59-64.

Bruno Martins and Mário J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, pages 764-768.

Priyank Mathur, Arkajyoti Misra, and Emrah Budur. 2017. LIDE: Language identification from text documents. arXiv preprint arXiv:1701.03682.

Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, and Andrew Y. Ng. 1998. Improving text classification by shrinkage in a hierarchy of classes. In ICML. volume 98, pages 359-367.

Ministerio de Educación, Perú. 2013. Documento nacional de lenguas originarias del Perú. http://repositorio.minedu.gob.pe/handle/123456789/3549.

Wikus Pienaar and D.P. Snyman. 2011. Spelling checker-based language identification for the eleven official South African languages. In Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa. pages 213-216.

John M. Prager. 1999. Linguini: Language identification for multilingual documents. Journal of Management Information Systems 16(3):71-101.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation. pages 1799-1802.

Annette Rios. 2016. A basic language technology toolkit for Quechua.

Ali Selamat and Nicholas Akosu. 2016. Word-length algorithm for language identification of under-resourced languages. Journal of King Saud University - Computer and Information Sciences 28(4):457-469.

Universidad Católica Sedes Sapientiae. 2015. Relatos Matsigenkas. Universidad Católica Sedes Sapientiae.

Pilar Valenzuela. 2003. Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon.

Roberto Zariquiey Biondi. 2011. A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University.