<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Short text language identification for under-resourced languages</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Feersum Engine</institution>
          ,
          <addr-line>Praekelt Consulting, Johannesburg</addr-line>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short text language identification (LID) that is useful for under-resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages, some of which are similar languages.<sup>1</sup></p>
      </abstract>
      <kwd-group>
        <kwd>Language identification</kwd>
        <kwd>Similar languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Background</title>
      <p>Accurate language identification (LID) is the first step in many natural language processing and machine comprehension pipelines. LID is also an important step in harvesting scarce language resources. Availability of data is still one of the big roadblocks to applying data-driven approaches like supervised machine learning in developing countries.</p>
      <p>An in-depth survey of algorithms, features, datasets, shared tasks and evaluation methods may be found in [<xref ref-type="bibr" rid="ref5">5</xref>]. The datasets for the DSL 2015 &amp; DSL 2017 shared tasks [<xref ref-type="bibr" rid="ref8">8</xref>] are often used in LID benchmarks. The NCHLT text corpora [<xref ref-type="bibr" rid="ref1">1</xref>] may be used for a shared LID task for the South African languages. The DSL 2017 paper [<xref ref-type="bibr" rid="ref8">8</xref>] gives an overview of the solutions of all the teams that competed in the shared task; the winning approach [<xref ref-type="bibr" rid="ref2">2</xref>] used an SVM with character n-gram features, part-of-speech tag features and some other engineered features. The winning approach for DSL 2015 [<xref ref-type="bibr" rid="ref7">7</xref>] used an ensemble naive Bayes classifier. The fastText classifier [<xref ref-type="bibr" rid="ref6">6</xref>] is perhaps one of the best-known efficient 'shallow' text classifiers that have been used for LID.<sup>2</sup> Hierarchical stacked classifiers (including lexicons) have also been proposed that would, for example, first classify a piece of text by language group and then by exact language [<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref3">3</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology and results</title>
      <p>
        The proposed LID algorithm<sup>3</sup> builds on the work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We stack a naive Bayesian classifier, which uses character (2, 4 &amp; 6)-grams,
word unigram and word bigram features, with a hierarchical lexicon-based classifier.
The algorithm is evaluated against recent approaches using existing test sets from
previous work on South African languages as well as the Discriminating between
Similar Languages (DSL) 2015 and 2017 shared tasks.
      </p>
      <p><sup>1</sup> Full paper by B. Duvenhage presented at the NeurIPS 2019 Workshop on Machine Learning for the Developing World.</p>
      <p><sup>2</sup> https://fasttext.cc/blog/2017/10/02/blog-post.html</p>
      <p><sup>3</sup> Available at https://github.com/praekelt/feersum-lid-shared-task.</p>
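      <p>As a rough illustration of the feature set named above, the following Python sketch extracts character 2-, 4- and 6-grams together with word unigrams and bigrams from a short text. It is not taken from the released implementation: the function names, the lower-casing, the whitespace tokenisation and the space padding around the text are all assumptions made for the example.</p>
      <preformat>
```python
from collections import Counter


def char_ngrams(text, sizes=(2, 4, 6)):
    """Yield character n-grams for each size, prefixed with the size."""
    # Assumption: pad with one space on each side so that word
    # boundaries at the start and end of the text are captured.
    padded = f" {text.lower().strip()} "
    for n in sizes:
        for i in range(len(padded) - n + 1):
            yield f"c{n}:{padded[i:i + n]}"


def word_ngrams(text, sizes=(1, 2)):
    """Yield word unigrams and bigrams, prefixed with the size."""
    # Assumption: simple whitespace tokenisation.
    words = text.lower().split()
    for n in sizes:
        for i in range(len(words) - n + 1):
            yield f"w{n}:{' '.join(words[i:i + n])}"


def extract_features(text):
    """Return a bag (Counter) of character and word n-gram features."""
    return Counter(char_ngrams(text)) + Counter(word_ngrams(text))


if __name__ == "__main__":
    for feature, count in sorted(extract_features("ke a leboga").items()):
        print(repr(feature), count)
```
      </preformat>
      <p>Counts of this kind could then be passed to any multinomial naive Bayes implementation. Prefixing each feature with its type keeps the character and word feature spaces separate.</p>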
      <p>The naive Bayesian classifier is trained to predict the specific language label
of a piece of text, but is used to first classify text as belonging to either the Nguni
family, the Sotho family, English, Afrikaans, Xitsonga or Tshivenda. The lexicon-based
classifier is then used to predict the specific language within a language
group. If the lexicon prediction of the specific language has high confidence, its
result is used as the final label; otherwise the naive Bayesian classifier's specific
language prediction is used as the final result. The lexicon is built over all the
data and includes the vocabulary from both the training and testing sets.</p>
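      <p>The two-stage decision described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the ISO 639-3 codes, the group mapping, the word-overlap confidence score and the 0.6 threshold are hypothetical choices, and the naive Bayesian prediction is passed in as an already computed label.</p>
      <preformat>
```python
# Assumed mapping from language code (ISO 639-3) to the group used in
# the first stage; single-language "groups" map to themselves.
GROUP_OF = {
    "zul": "nguni", "xho": "nguni", "nbl": "nguni", "ssw": "nguni",
    "nso": "sotho", "sot": "sotho", "tsn": "sotho",
    "eng": "eng", "afr": "afr", "tso": "tso", "ven": "ven",
}


def lexicon_predict(words, lexicons, candidates):
    """Return (language, confidence) based on lexicon word hits.

    Hypothetical confidence: the share of all lexicon hits that
    belongs to the best-scoring candidate language.
    """
    hits = {
        lang: sum(w in lexicons.get(lang, ()) for w in words)
        for lang in candidates
    }
    best = max(hits, key=hits.get)
    total = sum(hits.values())
    confidence = hits[best] / total if total else 0.0
    return best, confidence


def stacked_predict(text, nb_label, lexicons, threshold=0.6):
    """Combine a naive Bayes label with a lexicon lookup in its group."""
    # Stage 1: reduce the naive Bayes label to its language group.
    group = GROUP_OF[nb_label]
    candidates = [lang for lang, g in GROUP_OF.items() if g == group]
    if len(candidates) == 1:
        return nb_label
    # Stage 2: let the lexicon choose within the group, but only trust
    # it when its confidence reaches the (assumed) threshold.
    lex_label, confidence = lexicon_predict(
        text.lower().split(), lexicons, candidates
    )
    return lex_label if confidence >= threshold else nb_label
```
      </preformat>
      <p>With small toy lexicons for isiZulu and isiXhosa, a text whose words mostly occur in the isiZulu lexicon would be relabelled as isiZulu even if the naive Bayesian classifier had predicted isiXhosa, while a text with no decisive lexicon support keeps the naive Bayesian label.</p>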
      <p>The average classification accuracy results are summarised in Table 1. The
accuracies reported are for classifying a piece of text by its specific language
label. The accuracy of the proposed algorithm seems to depend on the
support of the lexicon. Without a good lexicon, a non-stacked naive Bayesian
classifier might even perform better.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>LID of short texts, informal styles and similar languages remains a difficult
problem which is actively being researched. We would like to investigate the
value of a lexicon in a production system and how to possibly maintain it using
self-supervised learning. We are investigating the application of deeper language
models, some of which have been used in more recent DSL shared tasks. We
would also like to investigate data augmentation strategies to reduce the amount
of training data that is required.</p>
      <p>Further research opportunities include data harvesting, building standardised
datasets and shared tasks for South Africa as well as the rest of Africa. In general,
the support for language codes that include more languages seems to be growing,
discoverability of research is improving and paywalls seem to no longer be a big
problem in getting access to published research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. NCHLT text corpora (
          <year>2014</year>
          ), available from http://www.nwu.ac.za/ctext
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets</article-title>
          .
          <source>In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</source>
          . pp.
          <fpage>115</fpage>
          –
          <lpage>123</lpage>
          . Association for Computational Linguistics, Valencia, Spain (Apr
          <year>2017</year>
          ). https://doi.org/10.18653/v1/W17-1214, https://www.aclweb.org/anthology/W17-1214
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Duvenhage</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ntini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramonyai</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Improved text language identification for the South African languages</article-title>
          .
          <source>2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech)</source>
          pp.
          <fpage>214</fpage>
          –
          <lpage>218</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goutte</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Léger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpuat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The NRC system for discriminating similar languages</article-title>
          .
          <source>In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects</source>
          . pp.
          <fpage>139</fpage>
          –
          <lpage>145</lpage>
          . Association for Computational Linguistics and Dublin City University, Dublin, Ireland (Aug
          <year>2014</year>
          ). https://doi.org/10.3115/v1/W14-5316, https://www.aclweb.org/anthology/W14-5316
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jauhiainen</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindén</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automatic language identification in texts: A survey</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>65</volume>
          ,
          <fpage>675</fpage>
          –
          <lpage>782</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          . pp.
          <fpage>427</fpage>
          –
          <lpage>431</lpage>
          . Association for Computational Linguistics, Valencia, Spain (Apr
          <year>2017</year>
          ), https://www.aclweb.org/anthology/E17-2068
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Language identification using classifier ensembles</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects</source>
          . pp.
          <fpage>35</fpage>
          –
          <lpage>43</lpage>
          . Association for Computational Linguistics, Hissar, Bulgaria (Sep
          <year>2015</year>
          ), https://www.aclweb.org/anthology/W15-5407
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherrer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aepli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Findings of the VarDial evaluation campaign 2017</article-title>
          .
          <source>In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>15</lpage>
          . Association for Computational Linguistics, Valencia, Spain (Apr
          <year>2017</year>
          ). https://doi.org/10.18653/v1/W17-1201, https://www.aclweb.org/anthology/W17-1201
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>