<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Low-Resourced Peruvian Language Identification Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandra Espichán Linares</string-name>
          <email>a.espichan@pucp.pe</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Oncevay-Marcos</string-name>
          <email>arturo.oncevay@pucp.edu.pe</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Ingeniería, Grupo de Reconocimiento de Patrones e Inteligencia Artificial Aplicada, Pontificia Universidad Católica del Perú</institution>
          ,
          <addr-line>Lima, Perú</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Facultad de Ciencias e Ingeniería</institution>
        </aff>
      </contrib-group>
      <fpage>57</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>Due to the linguistic revitalization in Perú in recent years, there is a growing interest in reinforcing bilingual education in the country and in increasing the research focused on its native languages. From the computer science perspective, one of the first steps to support the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) building a digital and annotated corpus for 16 Peruvian native languages extracted from documents in web repositories, and (2) fitting a supervised learning model for the language identification task using features identified from related studies in the state of the art, such as n-grams. The obtained results were promising (97% average precision), and the corpus and the model are expected to be used for more complex tasks in the future.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In Perú, 4 million people speak a native
language. They are part of the rich
linguistic diversity of the country, which comprises
47 original languages grouped into 19 linguistic
families. These Peruvian languages are distributed
across the highlands and jungle (Amazon) regions,
and most of them are quite distinct from one another, in spite of their
geographical or linguistic closeness
        <xref ref-type="bibr" rid="ref16">(Ministerio de
Educación, Perú, 2013)</xref>
        .
      </p>
      <p>
        This linguistic diversity calls for equal
opportunity across the different native communities, and
this could be supported by high-level bilingual
education and a deep knowledge of these
languages. For that reason, there is a need to support
linguistic research from a computational point of
view, and one of the first required tools is an
automatic language detector for written text (at
different levels, such as a complete document, a
paragraph or even a sentence)
        <xref ref-type="bibr" rid="ref12">(Malmasi et al., 2015)</xref>
        .
      </p>
      <p>
        To develop an automatic language identifier,
a basic natural language processing (NLP) task,
an annotated textual corpus for the languages is
required first. However, not all languages
have a large enough digital corpus for
computational tasks, so they are known as low-resourced
languages from a computer science point of
view
        <xref ref-type="bibr" rid="ref6">(Forcada, 2006)</xref>
        .
      </p>
      <p>It is therefore necessary to build a digital
repository of textual corpora for these languages, as
a step prior to the development of an
automatic language identification model.</p>
      <p>In the next section, the Peruvian native
languages used in this work are presented. Then, in
Section 3 some related works are described. After
that, Section 4 presents the corpus building and the
details of the dataset obtained for the study. Then,
Section 5 contains the implementation of the
language identification model. Finally, the results and
discussions are included in Section 6, while the
conclusions and future work for the study are
presented in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Peruvian native languages</title>
      <p>
        Among the 47 languages spoken by Peruvian
people, 43 are Amazonian (from the jungle) and 4 are
Andean (from the highlands). These languages
are considered living languages because they
still have speakers. They are grouped into 19
linguistic families (sets of languages related to each
other and with a common origin): 2 Andean (Aru
and Quechua) and 17 Amazonian
        <xref ref-type="bibr" rid="ref16">(Ministerio de
Educación, Perú, 2013)</xref>
        .
The 47 original native languages are highly
agglutinative, unlike Spanish (Castilian), the main
official language in the country. Most of them
have more than 100 morphemes
for the word formation process. For instance,
Quechua del Cusco contains 130 suffixes
        <xref ref-type="bibr" rid="ref20">(Rios,
2016)</xref>
        , while Shipibo-konibo uses 114
suffixes plus 31 prefixes
        <xref ref-type="bibr" rid="ref23">(Valenzuela, 2003)</xref>
        .
      </p>
      <p>In this work, the language identification task
was performed on 16 languages (from 5 families)
including 6 dialects of Quechua. The ISO-639-3
codes and the approximate number of speakers of
each language are presented in Table 1.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Related Work</title>
      <p>Given that Peruvian languages can be considered
as low-resourced ones, a systematic search for
studies focused on automatic language
identification for low-resourced languages was carried out.
The results are described as follows.</p>
      <p>Malmasi et al. (2015) present the first study to
distinguish texts between the Persian and Dari
languages at the sentence level. As Dari is a
low-resourced language, a corpus of 28 thousand
sentences was developed for this task (14
thousand for each language). Character and
sentence n-grams were considered as language
features. Finally, using an SVM (Support Vector
Machine) implementation within a classification
ensemble scheme, they discriminate both languages
with 96% accuracy.</p>
      <p>Botha and Barnard (2012) study the factors
that may determine the performance of text-based
language identification, with a special focus on the
11 official languages of South Africa, using
n-grams as language features. Three
classification methods were tested (SVM, Naive Bayes
and n-gram rank ordering) on different training and
test text sizes. The
6-gram Naive Bayes model was found to have the best
performance in general, obtaining 99.4% accuracy for
large training-test sets and 83% for shorter sets.</p>
      <p>
        Selamat and Akosu (2016) propose a language
identification algorithm based on lexical features
that works with a minimum amount of training
data. For this study, a dataset of 15 languages,
mostly low-resourced, extracted from the
Universal Declaration of Human Rights was used. The
technique builds on a spelling
checker-based method
        <xref ref-type="bibr" rid="ref17">(Pienaar and Snyman, 2011)</xref>
        , and the
improvement proposed in this research consists of
indexing the vocabulary words
according to their length. The average precision
of the method was 93%, and an improvement of
73% in execution time was obtained.
      </p>
      <p>
        Grothe et al. (2008) compare the performance
of three feature extraction approaches for language
identification using the Leipzig Corpora
Collection
        <xref ref-type="bibr" rid="ref19">(Quasthoff et al., 2006)</xref>
        and randomly selected
Wikipedia articles. The feature approaches
considered were short words (SW), frequent words
(FW) and n-grams (NG), and the
classification method employed was Ad-Hoc
Ranking. The best results obtained for each
approach were: FW 25% (99.2%), SW 4 (94.1%)
and NG with 3-grams (79.2%).
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Corpus Development</title>
      <p>To build the corpus used in this study, digital
documents containing texts in Peruvian native languages
were retrieved from the web, while others
were obtained directly from private repositories or
books. In this way, it was possible to collect and
annotate documents from 16 different native
languages. Note that these documents
must be annotated, i.e., the language in which they
are written must be known.</p>
      <p>Then, as almost all the documents were in PDF
format, the text content was extracted and some
manual corrections were made where necessary.
Next, a preprocessing program was developed to
clean the punctuation, lowercase the text and
split the sentences. After that, Spanish and English
sentences were discarded using the resources of a
generic spell-checking library<sup>1</sup>, so that
only sentences in Peruvian native languages remained.</p>
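      <p>As an illustration only, the cleaning steps described above might be sketched in Python as follows. This is not the preprocessing program used in the study: the function names, the sentence-boundary rule, the punctuation set, the 0.5 threshold and the known_words set (a hypothetical stand-in for the spell-checking library) are all assumptions. Sentences are split before punctuation is removed, because the sentence boundaries would otherwise be lost.</p>
      <preformat>
```python
import re
import string

# Assumption: sentence boundaries are marked by ., !, ? or line breaks.
_SENTENCE_END = re.compile(r"[.!?\n]+")
# Assumption: ASCII punctuation plus a few typographic marks are removed.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + "¿¡“”«»")


def preprocess(text):
    """Split raw text into lowercased, punctuation-free sentences."""
    sentences = []
    for chunk in _SENTENCE_END.split(text):
        cleaned = chunk.translate(_PUNCT_TABLE).lower()
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            sentences.append(cleaned)
    return sentences


def drop_known_language(sentences, known_words, max_ratio=0.5):
    """Discard sentences whose share of dictionary words exceeds max_ratio.

    known_words is a hypothetical stand-in for a Spanish/English
    spell checker; any real library would replace the membership test.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue  # nothing to judge in an empty sentence
        hits = sum(1 for token in tokens if token in known_words)
        if hits / len(tokens) > max_ratio:
            continue  # mostly dictionary words: treat as Spanish/English
        kept.append(sentence)
    return kept
```
      </preformat>
      <p>For example, preprocess("Hola, mundo! Noqa rini.") yields the two sentences "hola mundo" and "noqa rini", and drop_known_language with known_words = {"hola", "mundo"} keeps only the second one. Removing every apostrophe may be too aggressive for orthographies in which it marks a sound, so the punctuation set would need to be revised per language.</p>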
      <p>Table 2 contains the total number of files, plus
the number of sentences/phrases and tokens
for each Peruvian language used in this study. This
preprocessed collection is partially available on a
project site, including details of the sources of
each language text<sup>2</sup>.</p>
      <p>Moreover, Figures 1 and 2 present some
statistics regarding the distribution of the number of
characters per word and per sentence, respectively, in
each processed language.</p>
      <p>The first boxplot in Figure 1 supports the rich
morphology of the Peruvian native
languages, as a high number of characters per word is observed
in most of them. Also, it
can be noticed that most of the words are formed
by 5 to 10 characters. Nevertheless, there are
very long words in Matses (cbf), such as
cuishonquededcuishonquededtsëcquiec or
tantiabentantiabentsëccondaidquio, with 35 and 33
characters, respectively. However, most Matses
words are between 5 and
10 characters long.</p>
      <p>On the other hand, on average, the language
with the longest words is Matsigenka (mcb), while the
language with the shortest words is Kakataibo (cbr).
Moreover, the distribution among languages of the
Quechua family is quite similar.</p>
      <p><sup>1</sup> libenchant: https://github.com/AbiWord/enchant</p>
      <p><sup>2</sup> chana.inf.pucp.edu.pe/resources/multi-lang-corpus</p>
      <p>
        In Figure 2, it can be noticed that the longest
collected sentences are from Shipibo-konibo (shp),
while the shortest are from Aymara (aym). The
reason for the first case is the origin of the
Shipibo-konibo corpus: a parallel corpus built for an SMT
experiment, whose sources from the legal and educational
domains contain longer sentences than the ones
found in dictionary or lexicon samples
        <xref ref-type="bibr" rid="ref7">(Galarreta
et al., 2017)</xref>
        .
      </p>
    </sec>
    <sec id="sec-5">
      <title>5 Language Identification Model</title>
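      <p>As language identification is to be performed
at the sentence level, the aim was to learn a
classifier or classification function
<inline-formula><tex-math>\gamma</tex-math></inline-formula> that maps
the sentences from the corpus
(<inline-formula><tex-math>S</tex-math></inline-formula>) to a target
language class (<inline-formula><tex-math>L</tex-math></inline-formula>):
<disp-formula><label>(1)</label><tex-math>\gamma : S \rightarrow L</tex-math></disp-formula></p>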
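      <p>In order to identify which classifier is best
suited to the task, each sentence
<inline-formula><tex-math>s \in S</tex-math></inline-formula> is
represented in a feature vector space model:
<inline-formula><tex-math>s_i = \{w_{1,i}, w_{2,i}, \ldots, w_{t,i}\}</tex-math></inline-formula>,
where <inline-formula><tex-math>t</tex-math></inline-formula> indicates the number
of dimensions or terms to be extracted.</p>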
      <p>
        Character-level n-grams were among the most
used language features in the reviewed works for
this task
        <xref ref-type="bibr" rid="ref12 ref2 ref8">(Grothe et al., 2008; Botha and Barnard,
2012; Malmasi et al., 2015)</xref>
        . Hence, the
dimensionality of each vector in the space model is
equal to the number of distinct subsequences of
n characters found in the sentences of the corpus
S
        <xref ref-type="bibr" rid="ref3">(Cavnar and Trenkle, 1994)</xref>
        .
      </p>
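      <p>To make the feature space concrete, the following minimal Python sketch extracts character n-grams and assigns one vector dimension to each distinct n-gram. The function names are illustrative assumptions, and the default sizes (2 and 3) simply mirror the bigrams and trigrams used in this experiment.</p>
      <preformat>
```python
from collections import Counter


def char_ngrams(sentence, n):
    """Return every contiguous substring of length n, spaces included."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def ngram_counts(sentence, sizes=(2, 3)):
    """Count the n-grams of each requested size in one sentence."""
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(sentence, n))
    return counts


def vocabulary(sentences, sizes=(2, 3)):
    """Map each distinct n-gram in the corpus to a vector dimension."""
    distinct = set()
    for sentence in sentences:
        distinct.update(ngram_counts(sentence, sizes))
    return {gram: index for index, gram in enumerate(sorted(distinct))}
```
      </preformat>
      <p>For the single word "wasi", the bigrams are "wa", "as" and "si" and the trigrams are "was" and "asi", so a corpus containing only that word would give a five-dimensional space.</p>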
      <p>
        In this experiment, bigrams and trigrams were
used to build the vector space model, and a term
frequency - inverse document frequency (TF-IDF)
matrix was calculated from the aforementioned
n-gram scheme
        <xref ref-type="bibr" rid="ref18">(Prager, 1999)</xref>
        .
      </p>
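      <p>A TF-IDF matrix over character bigrams and trigrams could be computed along the following lines. This is a sketch under stated assumptions rather than the implementation used here: each sentence is treated as one document, and the weight is taken to be the raw n-gram count multiplied by log(N / df), where N is the number of sentences and df the number of sentences containing the n-gram. The exact weighting variant is not specified in this work, and the function names are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def char_ngrams(sentence, sizes=(2, 3)):
    """Character bigrams and trigrams of one sentence."""
    return [sentence[i:i + n]
            for n in sizes
            for i in range(len(sentence) - n + 1)]


def tfidf_matrix(sentences, sizes=(2, 3)):
    """Return (vocab, rows); rows[i][j] weights n-gram j in sentence i.

    Assumed weighting: raw count times log(N / document frequency).
    """
    counts = [Counter(char_ngrams(s, sizes)) for s in sentences]
    # Document frequency: number of sentences in which each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    vocab = sorted(doc_freq)
    total = len(sentences)
    idf = {gram: math.log(total / doc_freq[gram]) for gram in vocab}
    rows = [[c[gram] * idf[gram] for gram in vocab] for c in counts]
    return vocab, rows
```
      </preformat>
      <p>Under this weighting, an n-gram that occurs in every sentence receives a weight of zero, so only n-grams that help to separate sentences contribute to the vectors.</p>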
      <p>
        After that, the matrix was split into training and test
sub-datasets (70%-30%), and some classification
methods identified in the related works
        <xref ref-type="bibr" rid="ref8">(Grothe
et al., 2008)</xref>
        were fit using a 5-fold cross-validation
scheme on the training sub-dataset. The obtained
results are shown in Table 3.
As the SVM classifier with a linear kernel achieved
the best accuracy, this method was used
to fit the main model on the entire training
sub-dataset. Next, this model was validated on the test
sub-dataset. The performance of this
model at classifying each language is reported
in Table 4 (where Support indicates the
number of samples that were classified).
Furthermore, the confusion matrix of this model is
presented in Figure 3.
      </p>
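      <p>The evaluation protocol (a 70%-30% split followed by 5-fold cross-validation on the training part) can be sketched with the Python standard library alone. The sketch only produces index sets, and the classifier itself is omitted. Splitting per language (stratification), the fixed random seeds and the function names are assumptions, since this work does not state how the split was drawn.</p>
      <preformat>
```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx), keeping each language's share."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * test_ratio)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def k_fold(indices, k=5, seed=0):
    """Yield (fit_idx, validation_idx) pairs over the training indices."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        fit = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(fit), sorted(folds[held_out])
```
      </preformat>
      <p>Each candidate classifier would be fit on fit_idx and scored on validation_idx for every fold, and the method with the best mean score would then be refit on all of train_idx and evaluated once on test_idx.</p>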
    </sec>
    <sec id="sec-6">
      <title>6 Results and Discussions</title>
      <p>In this study, a straightforward experiment was
performed for the automatic identification of some
Peruvian languages, showing that they can be
distinguished with 96% accuracy. This is a new
result for languages that have not been
worked with previously.</p>
      <p>This acceptable overall result was obtained
despite a major disadvantage: the
unbalanced corpus. Many more sentences could be
extracted for some languages
than for others, and some languages were
left with very little data. For instance, only 106 sentences
were collected for Yine (pib), of which at
most 39 went to the test set. For that
language, a precision and recall of 100% and 85%,
respectively, were obtained. This may indicate an
acceptable low-resourced language identification
model, but to rule out the possibility of overfitting,
additional tests are needed once more textual
documents can be retrieved.</p>
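      <p>For reference, per-language precision, recall and support can be derived from the true and predicted labels as in the minimal Python sketch below. The function name and the toy labels in the example are assumptions for illustration and are not data from this study.</p>
      <preformat>
```python
from collections import Counter


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall, support)} for each true label."""
    true_pos = Counter()
    predicted = Counter(predicted_labels)
    support = Counter(true_labels)
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            true_pos[truth] += 1
    report = {}
    for label in sorted(support):
        hits = true_pos[label]
        # Precision is undefined when a label is never predicted; use 0.0.
        precision = hits / predicted[label] if predicted[label] else 0.0
        recall = hits / support[label]
        report[label] = (precision, recall, support[label])
    return report


# Toy example: one of four "pib" sentences is predicted as "quz".
truth = ["pib", "pib", "pib", "pib", "quz", "quz"]
guess = ["pib", "pib", "pib", "quz", "quz", "quz"]
report = precision_recall(truth, guess)
```
      </preformat>
      <p>In this toy input, "pib" obtains a precision of 1.0 and a recall of 0.75. A precision of 100% combined with a lower recall, as reported above for Yine, means that no sentence of another language was labelled as Yine, while some Yine sentences were assigned to other languages.</p>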
      <p>On the other hand, as seen in Figure 3, the model
showed considerable confusion between
closely related languages such as Ashaninka (cni) and
Asheninka (cjo): 22% of the Asheninka
test sentences were misclassified as Ashaninka, and
only 58% of them were correctly identified.</p>
      <p>Likewise, although the Quechua family
obtained an acceptable overall precision, recall is
noticeably lower for the varieties with less data. As
seen in Figure 3, for Quechua de Lambayeque
(quf), which is the variety of Quechua with the
fewest extracted sentences, only 46% of
the test sentences were properly
classified, and the model misclassified 42% of them as
Quechua de Yauyos (qux). There is also confusion
in discriminating Quechua del Este de Apurímac
(qve), since 21% of the sentences of this variety
were misidentified as Quechua de Yauyos (qux) and
17% as Quechua del Cusco (quz).</p>
      <p>Both scenarios may indicate the need to go
deeper into the representation features used for
languages within the same linguistic family, and to
consider a hierarchical classification scheme.</p>
      <p>Additionally, Cashinahua (cbs) was confused
with Awajun (agr) 25% of the time. This is an
interesting result since the two languages belong to
different families: Pano and Jibaru, respectively.
However, as Cashinahua was the language with the
fewest collected sentences (only 33), its results
were expected to be less precise than
those obtained for the other languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions and Future Work</title>
      <p>For this study, a corpus for 16 Peruvian native
languages was built from web and private
repositories. A straightforward
classification experiment was then performed with it, using n-grams as
features in a TF-IDF vector space model. The
obtained results (97% overall precision) were in
the expected range regarding the state of the art of
language identification in a low-resource scenario.</p>
      <p>
        The fitted model may be exploited for other tasks,
such as the automatic expansion of the corpus
through web and document search
        <xref ref-type="bibr" rid="ref13">(Martins and
Silva, 2005)</xref>
        . As there are 68 Peruvian
native languages preserved, it is essential to
expand the corpus to cover most of them. The
Bible will be targeted first, as it has been translated
into some of the languages not yet covered, and is
a very important resource in NLP for minority
languages
        <xref ref-type="bibr" rid="ref4">(Christodouloupoulos and Steedman, 2015)</xref>
        .
      </p>
      <p>
        Also, as the corpus grows, other
recent methods could be tested on it, such as the
bidirectional recurrent neural network proposed
by Kocmi and Bojar (2017) or other similar deep
architectures
        <xref ref-type="bibr" rid="ref1 ref14">(Bjerva, 2016; Mathur et al., 2017)</xref>
        .
In our scenario, however, this kind of algorithm
would have to cope with a low-resourced and unbalanced
corpus, so adaptation and tuning steps would be needed.
Nevertheless, those methods could help to narrow
the classification window down to the
phrase or word level.
      </p>
      <p>
        Moreover, regarding the confusion observed among
languages within the same family, the following
experiments must take into account
the hierarchical nature of the Peruvian
linguistic context
        <xref ref-type="bibr" rid="ref11 ref15 ref9">(Koller and Sahami, 1997;
McCallum et al., 1998; Jaech et al., 2016)</xref>
        .
      </p>
      <p>Finally, it is desired to develop and integrate
a way to discriminate languages that are not part
of the scheme, so that out-of-model
languages are not misclassified as Peruvian ones.</p>
      <p>
        Acknowledgements
The authors are thankful to J. Rubén Ruiz,
bilingual education professor at NOPOKI, for
providing access to some private books written in
native languages
        <xref ref-type="bibr" rid="ref22 ref5">(Universidad Católica Sedes
Sapientiae, 2015; Díaz, 2012)</xref>
        . Likewise, the collaboration of
Dr. Roberto Zariquiey,
linguistics professor at PUCP, is appreciated for allowing the
use of his own corpus for the Panoan
family
        <xref ref-type="bibr" rid="ref24">(Zariquiey Biondi, 2011)</xref>
        .
      </p>
      <p>Furthermore, the support of
the “Consejo Nacional de Ciencia, Tecnología e
Innovación Tecnológica” (CONCYTEC Perú)
under the contract 225-2015-FONDECYT is acknowledged.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Bjerva</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Byte-based language identification with deep convolutional networks</article-title>
          .
          <source>arXiv preprint arXiv:1609.09004</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Gerrit Reinier</given-names>
            <surname>Botha</surname>
          </string-name>
          and
          <string-name>
            <given-names>Etienne</given-names>
            <surname>Barnard</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Factors that affect the accuracy of text-based language identification</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <fpage>307</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>William B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          and
          <string-name>
            <given-names>John M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>N-gram-based text categorization</article-title>
          .
          <source>In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval</source>
          . pages
          <fpage>161</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Christos</given-names>
            <surname>Christodouloupoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Steedman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A massively parallel corpus: the Bible in 100 languages</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>49</volume>
          (
          <issue>2</issue>
          ):
          <fpage>375</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Darinka</given-names>
            <surname>Pacaya Díaz</surname>
          </string-name>
          , editor.
          <year>2012</year>
          .
          <source>Relatos de Nopoki</source>
          . Universidad Católica Sedes Sapientiae.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Mikel</given-names>
            <surname>Forcada</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Open source machine translation: an opportunity for minor languages</article-title>
          .
          <source>In Proceedings of the Workshop “Strategies for developing machine translation for minority languages”, LREC</source>
          . Citeseer
          , volume
          <volume>6</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Ana-Paula</given-names>
            <surname>Galarreta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andres</given-names>
            <surname>Melgar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arturo</given-names>
            <surname>Oncevay-Marcos</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Corpus creation and initial SMT experiments between Spanish and Shipibo-konibo</article-title>
          .
          <source>In RANLP. ACL Anthology</source>
          . In-press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Lena</given-names>
            <surname>Grothe</surname>
          </string-name>
          , Ernesto William De Luca, and Andreas Nürnberger.
          <year>2008</year>
          .
          <article-title>A comparative study on language identification methods</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Jaech</surname>
          </string-name>
          , George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith
          .
          <year>2016</year>
          .
          <article-title>Hierarchical character-word models for language identification</article-title>
          .
          <source>arXiv preprint arXiv:1608.03030</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Tom</given-names>
            <surname>Kocmi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ondřej</given-names>
            <surname>Bojar</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LanideNN: Multilingual language identification on character window</article-title>
          .
          <source>arXiv preprint arXiv:1701.03338</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Koller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mehran</given-names>
            <surname>Sahami</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Hierarchically classifying documents using very few words</article-title>
          .
          <source>Technical report</source>
          , Stanford InfoLab.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>Automatic language identification for Persian and Dari texts</article-title>
          .
          <source>In Proceedings of PACLING</source>
          . pages
          <fpage>59</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Bruno</given-names>
            <surname>Martins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mário J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Language identification in web pages</article-title>
          .
          <source>In Proceedings of the 2005 ACM symposium on Applied computing. ACM</source>
          , pages
          <fpage>764</fpage>
          -
          <lpage>768</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Priyank</given-names>
            <surname>Mathur</surname>
          </string-name>
          , Arkajyoti Misra, and
          <string-name>
            <given-names>Emrah</given-names>
            <surname>Budur</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LIDE: Language identification from text documents</article-title>
          .
          <source>arXiv preprint arXiv:1701.03682</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ronald</given-names>
            <surname>Rosenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Improving text classification by shrinkage in a hierarchy of classes</article-title>
          . In ICML. volume
          <volume>98</volume>
          , pages
          <fpage>359</fpage>
          -
          <lpage>367</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <collab>Ministerio de Educación, Perú</collab>
          .
          <year>2013</year>
          .
          <article-title>Documento nacional de lenguas originarias del Perú</article-title>
          . URI: http://repositorio.minedu.gob.pe/handle/123456789/3549.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Wikus</given-names>
            <surname>Pienaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>DP</given-names>
            <surname>Snyman</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Spelling checker-based language identification for the eleven official South African languages</article-title>
          .
          <source>In Proceedings of the 21st Annual Symposium of Pattern Recognition of SA</source>
          , Stellenbosch, South Africa. pages
          <fpage>213</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>John M.</given-names>
            <surname>Prager</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Linguini: Language identification for multilingual documents</article-title>
          .
          <source>Journal of Management Information Systems</source>
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>71</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Uwe</given-names>
            <surname>Quasthoff</surname>
          </string-name>
          , Matthias Richter, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Corpus portal for search in monolingual corpora</article-title>
          .
          <source>In Proceedings of the fifth international conference on language resources and evaluation</source>
          . pages
          <fpage>1799</fpage>
          -
          <lpage>1802</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Annette</given-names>
            <surname>Rios</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>toolkit for Quechua</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Ali</given-names>
            <surname>Selamat</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Akosu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Wordlength algorithm for language identification of under-resourced languages</article-title>
          .
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          <volume>28</volume>
          (
          <issue>4</issue>
          ):
          <fpage>457</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <collab>Universidad Católica Sedes Sapientiae</collab>
          .
          <year>2015</year>
          .
          <source>Relatos Matsigenkas</source>
          . Universidad Católica Sedes Sapientiae.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Pilar</given-names>
            <surname>Valenzuela</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Transitivity in Shipibo-Konibo grammar</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Oregon.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Zariquiey Biondi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A grammar of Kashibo-Kakataibo</article-title>
          .
          <source>Ph.D. thesis</source>
          , La Trobe University.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>