A Low-Resourced Peruvian Language Identification Model

Alexandra Espichán Linares1 and Arturo Oncevay-Marcos2
1 Facultad de Ciencias e Ingeniería, 2 Departamento de Ingeniería
Grupo de Reconocimiento de Patrones e Inteligencia Artificial Aplicada
Pontificia Universidad Católica del Perú, Lima, Perú
a.espichan@pucp.pe, arturo.oncevay@pucp.edu.pe

Abstract

Due to the linguistic revitalization in Perú over the last years, there is a growing interest in reinforcing bilingual education in the country and in increasing the research focused on its native languages. From the computer science perspective, one of the first steps to support the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) the building of a digital and annotated corpus for 16 Peruvian native languages, extracted from documents in web repositories, and (2) the fitting of a supervised learning model for the language identification task, using features identified in related studies in the state of the art, such as n-grams. The obtained results were promising (97% average precision), and it is expected that the corpus and the model can be exploited in more complex tasks in the future.

1 Introduction

In Perú, there are 4 million people who are speakers of a native language. They are part of the rich linguistic diversity of the country, which has 47 original languages grouped into 19 linguistic families. These Peruvian languages are distributed across the highlands and jungle (Amazon) regions, and most of them are very distinct from one another, in spite of their geographical or linguistic closeness (Ministerio de Educación, Perú, 2013).

This linguistic diversity calls for equal opportunity across the different native communities, which could be supported by high-level bilingual education and a deep knowledge of these languages. For that reason, there is a need to support linguistic research from an informatics point of view, and one of the first required tools is an automatic language detector for written text at different levels, such as a complete document, a paragraph or even a sentence (Malmasi et al., 2015).

To develop an automatic language identifier, a basic natural language processing (NLP) task, an annotated textual corpus for the languages is required first. However, not all languages have a digital corpus large enough for computational tasks; these are known as low-resourced languages from a computer science point of view (Forcada, 2006).

Thus, building a digital repository of textual corpora for these languages is a must, and a necessary step prior to the development of an automatic language identification model.

In the next section, the Peruvian native languages used in this work are presented. Then, Section 3 describes some related work. After that, Section 4 presents the corpus building process and the details of the dataset obtained for the study. Section 5 contains the implementation of the language identification model. Finally, the results and discussion are included in Section 6, while the conclusions and future work are presented in Section 7.

2 Peruvian native languages

Among the 47 languages spoken by Peruvian people, 43 are Amazonian (from the jungle) and 4 are Andean (from the highlands). These languages are considered prevailing languages because they have living speakers. They are grouped into 19 linguistic families (a family being a set of languages related to each other and sharing a common origin): 2 Andean (Aru and Quechua) and 17 Amazonian (Ministerio de Educación, Perú, 2013).

Table 1: Basic information of the languages within the scope of the study.
Linguistic Family   Language                          ISO-639-3   Speakers
Arawak              Ashaninka                         cni           88 703
Arawak              Asheninka                         cjo            8 774
Arawak              Matsigenka                        mcb           11 275
Arawak              Yine                              pib            3 261
Aru                 Aymara                            aym          443 248
Jíbaro              Awajún                            agr           55 366
Pano                Cashinahua                        cbs            2 419
Pano                Kakataibo                         cbr            1 879
Pano                Matses                            mcf            1 724
Pano                Shipibo-konibo                    shp           22 517
Quechua             Quechua Wanca                     qxw           37 559
Quechua             Quechua de Lambayeque             quf           21 496
Quechua             Quechua de Yauyos                 qux          456 225
Quechua             Quechua del Callejon de Huaylas   qwh          451 789
Quechua             Quechua del Cusco                 quz          566 581
Quechua             Quechua del Este de Apurimac      qve          266 336

The 47 original native languages are highly agglutinative, unlike Spanish (Castilian), the main official language of the country. Moreover, most of them present more than 100 morphemes for the word formation process. For instance, Quechua del Cusco contains 130 suffixes (Rios, 2016), while Shipibo-konibo uses 114 suffixes plus 31 prefixes (Valenzuela, 2003).

In this work, the language identification task was performed on 16 languages (from 5 families), including 6 varieties of Quechua. The ISO-639-3 codes and the approximate number of speakers of each language are presented in Table 1.

3 Related Work

Given that Peruvian languages can be considered low-resourced, a systematic search for studies on automatic language identification for low-resourced languages was carried out. The results are described as follows.

Malmasi et al. (2015) present the first study to distinguish texts between the Persian and Dari languages at the sentence level. As Dari is a low-resourced language, a corpus of 28 thousand sentences was developed for this task (14 thousand per language). Character and sentence n-grams were considered as language features. Finally, using a Support Vector Machine (SVM) implementation within a classification ensemble scheme, they discriminated both languages with 96% accuracy.

Botha and Barnard (2012) research the factors that may determine the performance of text-based language identification, with a special focus on the 11 official languages of South Africa, using n-grams as language features. In the study, three classification methods were tested on different training and test text sizes: SVM, Naive Bayes and n-gram rank ordering. It was found that the 6-gram Naive Bayes model had the best performance in general, obtaining 99.4% accuracy for large training and test sets and 83% for shorter ones.

Selamat and Akosu (2016) propose a language identification algorithm based on lexical features that works with a minimum amount of training data. For this study, a dataset of 15 languages, mostly low-resourced, extracted from the Universal Declaration of Human Rights was used. The technique is based on a spelling-checker method (Pienaar and Snyman, 2011), and the improvement proposed in this research was the indexation of the vocabulary words according to their length. The average precision of the method was 93%, and an improvement of 73% in execution time was obtained.

Grothe et al. (2008) compare the performance of three feature extraction approaches for language identification, using the Leipzig Corpora Collection (Quasthoff et al., 2006) and randomly selected Wikipedia articles. The considered feature approaches were short words (SW), frequent words (FW) and n-grams (NG), while the employed classification method was Ad-Hoc Ranking. The best results obtained for each approach were: FW at 25% (99.2%), SW of length 4 (94.1%) and NG with 3-grams (79.2%).

4 Corpus Development

Table 2: Retrieved corpus information: |D| = document collection size; |S| = sentence/phrase collection size; |V| = word vocabulary size, without considering punctuation; |C| = character vocabulary size; T = number of tokens.
Lang.   |D|     |S|      |V|      |C|        T
cni       4   7 516   10 125      35   25 119
cjo       1     555    1 308      35    2 691
mcb       1   2 502    4 276      33   10 092
pib       1     106      299      21      465
aym       5  16 431   16 216      39   53 115
agr       5  14 258   18 631      36   47 127
cbs       1      33      129      26      161
cbr     195   6 970    6 584      38   37 117
mcf       2  16 356   14 722      36   64 779
shp       4  15 866   24 597      35  203 988
qxw       6   1 259    2 782      38    6 640
quf       8     442    1 289      36    2 027
qux       2  20 496   25 105      35   85 124
qwh       3     665    2 029      36    3 448
quz       2   3 496    5 866      37   13 592
qve       9     635    1 744      31    2 957

To build the corpus used in this study, digital documents containing Peruvian native language texts were retrieved from the web, while others were obtained directly from private repositories or books. In this way, it was possible to collect and annotate documents from 16 different native languages. Note that these documents must be annotated, i.e., the language in which they are written must be known.

Then, as almost all the documents were in PDF format, the text content was extracted and some manual corrections were made where necessary. Next, a preprocessing program was developed to clean the punctuation, lowercase the text and split the sentences. After that, Spanish and English sentences were discarded using the resources of a generic spell-checking library[1], leaving only sentences in the Peruvian native languages.

Table 2 contains the total number of files, plus the number of sentences/phrases and tokens, for each Peruvian language used in this study. This preprocessed collection is partially available on the project site, including details of the sources of each language text[2].

Moreover, Figures 1 and 2 present statistics on the distribution of the number of characters per word and per sentence, respectively, for each processed language.

Figure 1: Boxplots representing the distribution of the word length in number of characters for each language.

The first boxplot (Figure 1) supports the rich morphology of the Peruvian native languages, as high word-length values are observed for most of them. It can also be noticed that most words are formed by 5 to 10 characters. Nevertheless, there are very long words in Matses (mcf), such as cuishonquededcuishonquededtsëcquiec and tantiabentantiabentsëccondaidquio, with 35 and 33 characters, respectively; even so, most Matses words have a length between 5 and 10 characters.

On the other hand, on average, the language with the longest words is Matsigenka (mcb), while the language with the shortest words is Kakataibo (cbr). Moreover, the word-length distributions among the languages of the Quechua family are quite similar.

In Figure 2, it can be noticed that the longest collected sentences are from Shipibo-konibo (shp), while the shortest are from Aymara (aym). The reason for the first case is the origin of the Shipibo-konibo corpus: a parallel corpus built for an SMT experiment, whose legal and educational text domain sources contain longer sentences than those found in dictionary or lexicon samples (Galarreta et al., 2017).

[1] libenchant: https://github.com/AbiWord/enchant
[2] chana.inf.pucp.edu.pe/resources/multi-lang-corpus

5 Language Identification Model

As it is proposed to perform language identification at the sentence level, the aim was to learn a classifier or classification function (f) that maps the sentences from the corpus (S) to a target language class (L):

    f : S → L    (1)

In order to identify which classifier is best suited for the task, each sentence s ∈ S will be represented in a feature vector space model: s_i = {w_1,i, w_2,i, ..., w_t,i}, where t indicates the number of dimensions or terms to be extracted.
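As an illustration of this representation, a sentence can be mapped to counts of character-level terms such as the bigrams and trigrams used in this experiment. This is only a plain-Python sketch of the idea, not the implementation used in the study:

```python
from collections import Counter

def char_ngrams(sentence, n_values=(2, 3)):
    """Count character-level n-grams (bigrams and trigrams by default).
    The distinct n-grams observed across the corpus define the t
    dimensions of the vector space model."""
    counts = Counter()
    for n in n_values:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

def to_vector(sentence, vocabulary):
    """Project one sentence s_i onto a fixed, ordered n-gram vocabulary,
    yielding the term weights w_1,i ... w_t,i (raw counts here; the
    study re-weights them with TF-IDF)."""
    counts = char_ngrams(sentence)
    return [counts.get(term, 0) for term in vocabulary]

# Example: a tiny vocabulary extracted from a one-sentence "corpus".
vocabulary = sorted(char_ngrams("jakon jato"))
vector = to_vector("jakon", vocabulary)
```

In the experiment, such count vectors are re-weighted with TF-IDF before being fed to the classifiers.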
Figure 2: Boxplots representing the distribution of the sentence length in number of characters for each language. The vertical axis (length) is in log10 scale.

Character-level n-grams were one of the most used language features for this task in the reviewed works (Grothe et al., 2008; Botha and Barnard, 2012; Malmasi et al., 2015). Hence, the dimensionality of each vector in the space model is equal to the number of distinct subsequences of n characters in a given sequence from the corpus S (Cavnar and Trenkle, 1994).

In this experiment, bigrams and trigrams were used to build the vector space model, and a term frequency - inverse document frequency (TF-IDF) matrix was calculated from the aforementioned n-gram scheme (Prager, 1999).

After that, the matrix was split into train and test sub-datasets (70%-30%), and several classification methods identified in the related works (Grothe et al., 2008) were fit using a 5-fold cross-validation scheme on the training sub-dataset. The obtained results are shown in Table 3.

Table 3: Results of the 5-fold cross-validation classification on the train sub-dataset.

Method                          Accuracy (%)
SVM (linear kernel)                    96.22
Multinomial Naive Bayes                92.76
SGD Classifier                         94.52
Perceptron                             95.05
Passive Aggressive Classifier          95.89

As the SVM classifier with a linear kernel obtained the best accuracy, this method was used to fit the main model on the entire train sub-dataset. Next, this model was validated on the test sub-dataset. A report of the performance of this model at classifying each language is shown in Table 4 (where Support indicates the number of samples classified per language). Furthermore, the confusion matrix of this model is presented in Figure 3.

Table 4: Main classification results for each language (SVM with a linear kernel).

Lang.       Precision   Recall   Support
cni              0.94     0.97     2 225
cjo              0.83     0.58       158
mcb              0.96     0.93       753
pib              1.00     0.85        39
aym              0.97     0.97     4 894
agr              0.99     0.99     4 340
cbs              1.00     0.58        12
cbr              0.99     0.99     2 157
mcf              0.97     0.98     4 984
shp              0.99     0.99     4 795
qxw              0.99     0.92       391
quf              0.93     0.46       142
qux              0.94     0.97     5 991
qwh              0.92     0.78       198
quz              0.91     0.89     1 024
qve              0.86     0.55       173
avg/total        0.97     0.97    32 276

6 Results and Discussions

In this study, a straightforward experiment was performed for the automatic identification of several Peruvian languages, showing that they can be distinguished with 96% accuracy. This is a new result for languages that had not previously been worked with.

Some languages were left with very little data. For instance, for Yine (pib) only 106 sentences were collected, of which at most 39 went to the test part; even so, a precision of 100% and a recall of 85% were obtained for that language. This may indicate an acceptable low-resourced language identification model, but to rule out overfitting, additional tests must be run once more textual documents can be retrieved.

On the other hand, as seen in Figure 3, for closely related languages like Ashaninka (cni) and Asheninka (cjo) there was considerable confusion in the model, since 22% of the Asheninka test sentences were misclassified as Ashaninka and only 58% of them were correctly identified.

Likewise, although the Quechua family obtained an acceptable overall precision, a not-so-good recall is shown for the varieties with less data. As seen in Figure 3, for Quechua de Lambayeque (quf), the variety of Quechua with the least amount of extracted sentences, only 46% of the test sentences were properly classified, and the model misclassified 42% of them as Quechua de Yauyos (qux).
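The confusion percentages discussed here can be derived from pairs of (true, predicted) labels over the test set. A minimal plain-Python sketch of that computation (the study does not specify its exact tooling):

```python
from collections import defaultdict

def confusion_matrix(true_labels, predicted_labels):
    """Map each true language to a count of predicted languages."""
    matrix = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def recall(matrix, label):
    """Fraction of sentences of `label` that were correctly identified."""
    row = matrix[label]
    total = sum(row.values())
    return row[label] / total if total else 0.0

def confusion_rate(matrix, true_label, predicted_label):
    """Fraction of `true_label` sentences misidentified as `predicted_label`."""
    row = matrix[true_label]
    total = sum(row.values())
    return row[predicted_label] / total if total else 0.0
```

For example, with hypothetical test labels where two out of three quf sentences are predicted as qux, `recall` for quf is 1/3 and `confusion_rate(m, "quf", "qux")` is 2/3; the per-language figures in Table 4 and Figure 3 summarize exactly this kind of ratio.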
Also, there is confusion when discriminating Quechua del Este de Apurímac (qve), since 21% of the sentences of this variety were misidentified as Quechua de Yauyos (qux) and 17% as Quechua del Cusco (quz).

Both scenarios may indicate the need to go deeper into the representation features used for languages within the same linguistic family, and to consider a hierarchical classification scheme.

Additionally, Cashinahua (cbs) was confused with Awajún (agr) 25% of the time. This is an interesting result, since both languages belong to different families: Pano and Jíbaro, respectively. However, as Cashinahua was the language with the least amount of collected sentences (only 33), it was expected that its results would not be as precise as those obtained for the other languages.

Overall, the acceptable result was obtained even though there was a great disadvantage to face: the corpus is unbalanced, because it was not possible to extract many more sentences for some languages than for others.

Figure 3: Confusion matrix obtained by the main language identification model. The dashed lines separate the different linguistic families.

7 Conclusions and Future Work

For this study, a corpus for 16 Peruvian native languages was built from web and private repositories. A straightforward classification experiment was then performed with it, using n-grams as features in a TF-IDF vector space model. The obtained results (97% overall precision) were in the expected range with regard to the state of the art of language identification in a low-resource scenario.

The fitted model may be exploited for other tasks, such as automatically growing the corpus through web and document search (Martins and Silva, 2005). As there are 47 Peruvian native languages to be preserved, it is essential to expand the corpus to cover most of them. The Bible will be targeted first, as it is translated into some of the languages not yet covered, and it is a very important NLP resource for minority-language cases (Christodouloupoulos and Steedman, 2015).

Also, as the corpus grows, other recent methods could be tested on it, such as the bidirectional recurrent neural network proposed by Kocmi and Bojar (2017) or other similar deep architectures (Bjerva, 2016; Mathur et al., 2017). In our scenario, however, these kinds of algorithms may struggle with the low-resourced and unbalanced corpus, so adaptation and tuning steps will be required. On the other hand, those methods could help reduce the classification window to the phrase or word level.

Moreover, regarding the confusion observed between languages of the same family, the following experiments must take specific account of the hierarchical nature of the Peruvian linguistic context (Koller and Sahami, 1997; McCallum et al., 1998; Jaech et al., 2016).

Finally, it is desirable to develop and integrate a way to discriminate languages that are not part of the scheme, in order not to misclassify out-of-model languages as a Peruvian one.

Acknowledgements

The authors are thankful to J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in native languages (Universidad Católica Sedes Sapientiae, 2015; Díaz, 2012). Likewise, the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, is appreciated for allowing the use of his own corpus for the Panoan family (Zariquiey Biondi, 2011). Furthermore, the support of the "Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica" (CONCYTEC Perú) under contract 225-2015-FONDECYT is acknowledged.

References

Johannes Bjerva. 2016. Byte-based language identification with deep convolutional networks. arXiv preprint arXiv:1609.09004.

Gerrit Reinier Botha and Etienne Barnard. 2012. Factors that affect the accuracy of text-based language identification. Computer Speech & Language 26(5):307-320.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval. pages 161-169.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2):375-395.

Darinka Pacaya Díaz, editor. 2012. Relatos de Nopoki. Universidad Católica Sedes Sapientiae.

Mikel Forcada. 2006. Open source machine translation: an opportunity for minor languages. In Proceedings of the Workshop "Strategies for developing machine translation for minority languages", LREC. volume 6, pages 1-6.

Ana-Paula Galarreta, Andrés Melgar, and Arturo Oncevay-Marcos. 2017. Corpus creation and initial SMT experiments between Spanish and Shipibo-konibo. In RANLP. ACL Anthology. In press.

Lena Grothe, Ernesto William De Luca, and Andreas Nürnberger. 2008. A comparative study on language identification methods. In LREC.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical character-word models for language identification. arXiv preprint arXiv:1608.03030.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. arXiv preprint arXiv:1701.03338.

Daphne Koller and Mehran Sahami. 1997. Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab.

Shervin Malmasi, Mark Dras, et al. 2015. Automatic language identification for Persian and Dari texts. In Proceedings of PACLING. pages 59-64.

Bruno Martins and Mário J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, pages 764-768.

Priyank Mathur, Arkajyoti Misra, and Emrah Budur. 2017. LIDE: Language identification from text documents. arXiv preprint arXiv:1701.03682.

Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, and Andrew Y. Ng. 1998. Improving text classification by shrinkage in a hierarchy of classes. In ICML. volume 98, pages 359-367.

Ministerio de Educación, Perú. 2013. Documento nacional de lenguas originarias del Perú. http://repositorio.minedu.gob.pe/handle/123456789/3549.

Wikus Pienaar and D.P. Snyman. 2011. Spelling checker-based language identification for the eleven official South African languages. In Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa. pages 213-216.

John M. Prager. 1999. Linguini: Language identification for multilingual documents. Journal of Management Information Systems 16(3):71-101.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation. pages 1799-1802.

Annette Rios. 2016. A basic language technology toolkit for Quechua.

Ali Selamat and Nicholas Akosu. 2016. Word-length algorithm for language identification of under-resourced languages. Journal of King Saud University - Computer and Information Sciences 28(4):457-469.

Universidad Católica Sedes Sapientiae. 2015. Relatos Matsigenkas. Universidad Católica Sedes Sapientiae.

Pilar Valenzuela. 2003. Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon.

Roberto Zariquiey Biondi. 2011. A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University.