<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Machine-Learning Based Text Classification and Machine Translator for Selected Under-Resourced Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aniedi Bernard Oboho-Etuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patience U. Usip</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olufemi S. Adeoye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikechukwu Ollawa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Uyo</institution>
          ,
          <addr-line>Uyo</addr-line>
          ,
          <country country="NG">Nigeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine learning can help remove language barriers by using the power of computers to translate text efficiently from any language. For non-native Efik/Ibibio speakers, the cost of learning the language may be high, and reaching the level of free communication is very difficult. In this paper, we use a machine learning technique to translate Efik/Ibibio to English. Stochastic Gradient Descent is the machine learning algorithm adopted for this work. The resulting machine translator is able to translate text in selected under-resourced languages to English and vice versa.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Machine Translator</kwd>
        <kwd>Under-Resourced Languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Text is a common form of communication on social media, and we often read through
texts and wonder what language they are in. It is difficult to obtain insights from text data because
it does not always come structured. Misclassification of these texts is another major
challenge: “is Annang a subset (dialect) of Ibibio, or is it another language in its own right?”</p>
      <p>The dialects/languages Ibibio, Efik, Annang, Oro and Ekid have their variants in each
community. Spoken Ibibio, Efik and Annang are well understood amongst all groups of speakers;
however, they differ when presented in text format. The variants “esiere” (Efik),
“asiere” (Ibibio) and “achiere” (Annang) should be properly identified and should return the same
translation in the English language. The aim of this work is to develop a text classifier and
machine translator for selected under-resourced languages.</p>
      <sec id="sec-1-1">
        <title>Related works</title>
        <p>
          Text classification assigns predefined categories to free-text documents.
According to Devopedia [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], machine learning is the main tool used to extract keywords from
text and classify them into categories. Text classification can be implemented using several
supervised algorithms, with Naïve Bayes, SVM and deep learning being common choices.
        </p>
        <p>Text classification is widely used in Natural Language Processing (NLP) for spam
detection, sentiment analysis, subject labeling and intent analysis. Automating such mundane tasks makes
search, analysis and decision making faster and easier. An effective text classification process
needs vast amounts of historical data, although real-time data can be used to improve the model.</p>
        <p>
          The text classification process consists of the following steps: Documents ↦ Preprocessing ↦
Indexing ↦ Feature Selection ↦ Classification Algorithm ↦ Performance Measure [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
There are about four approaches to machine translation:
a. Statistical Machine Translation: Ordóñez [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] describes statistical machine translation,
known as SMT or StatMT, as an approach to machine translation that yields the most
probable output (translation) of each element that makes up a sentence. StatMT is based
on statistical models that analyze and search for relationships between two
texts with the same content: one in the source language and the other in the target
language.
        </p>
        <p>
          According to Koehn [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], Statistical Machine Translation (SMT) can learn how to
translate by analyzing existing human translations (which he calls bilingual text corpora).
This, he said, differs from the Rules-Based Machine Translation (RBMT) approach,
which is usually word-based; most modern SMT systems are phrase-based and assemble
translations from overlapping phrases. The idea of phrase-based translation is to
reduce the limitations inherent in word-based translation by translating whole sequences
of words, where the lengths may differ. Each sequence of words is called a phrase. These
need not be linguistic phrases; they are phrases found using statistical methods from
bilingual text corpora.
        </p>
        <p>We can analyze bilingual text corpora (source and target languages) and monolingual
corpora (target language) to generate statistical models that transform text from
one language to another by assigning statistical weights that decide the most likely output
text.</p>
        <p>b. Rules-Based Machine Translation: Koehn also described Rules-Based Machine
Translation (RBMT) systems as the first commercial machine translation systems. They are
based on linguistic rules that allow words to be put in different places and to have
different meanings depending on the context. RBMT technology applies large
collections of linguistic rules in three phases: analysis, transfer, and generation.
The rules, he said, are developed by human language experts and programmers who have
invested extensive effort to understand and map the rules between two languages.
RBMT relies on manually built translation lexicons, some of which can be edited and
refined by users to improve their translations.</p>
        <p>c. Hybrid Machine Translation: the hybrid approach combines both Statistical and
Rules-Based Machine Translation.</p>
        <p>d. Neural Machine Translation: this applies neural networks that learn from existing
translations and from their own previous translations to improve results without human input.</p>
        <p>Stochastic Gradient Descent: Stochastic gradient descent (SGD) is an optimization algorithm; in
classification it trains a model that uses a binary comparison approach to predict an output. It is often used in machine learning
applications to find the model parameters that correspond to the best fit between predicted and
actual outputs. Because each step works with an approximation of the gradient, the results may not be exact; however, it is a powerful technique.
Consider the minimization of an average of functions:
          <disp-formula>
            <tex-math>\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} f_i(w)</tex-math>
          </disp-formula>
where w is a d-dimensional vector (that is, the feature dimension is d).</p>
        <p>When each f_i is the negative log-likelihood of one sample, minimizing this average by full gradient descent uses the step given in
equation (1), where η_t denotes the step size at iteration t.
          <disp-formula>
            <label>(1)</label>
            <tex-math>w_{t+1} = w_t - \eta_t \, \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t)</tex-math>
          </disp-formula>
        </p>
        <p>This step has a computational cost of O(dn). To reduce the cost, a subset of the samples (here a single sample, with index I_t) is used
to approximate the full gradient. The revised gradient descent step is given in equation (2).
          <disp-formula>
            <label>(2)</label>
            <tex-math>w_{t+1} = w_t - \eta_t \, \nabla f_{I_t}(w_t)</tex-math>
          </disp-formula>
        </p>
        <p>Here I_t is chosen at random from {1, 2, …, n} with equal probabilities. This gives
stochastic gradient descent (SGD), with the computational cost of a single step reduced
to O(d).</p>
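        <p>The difference between the full gradient step and its stochastic counterpart described above can be illustrated with a short Python sketch. It is not the implementation used in this work: the squared-error loss f_i(w) = ½(w·x_i − y_i)², the four toy samples, the constant step size lr and the function names are all assumptions chosen for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def full_gradient_step(w, xs, ys, lr):
    """Full gradient descent: average the gradients of all n samples, cost O(dn)."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y  # residual of sample i
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def stochastic_gradient_step(w, xs, ys, lr, rng):
    """SGD: use the gradient of one uniformly chosen sample, cost O(d)."""
    i = rng.randrange(len(xs))  # plays the role of the random index I_t
    x, y = xs[i], ys[i]
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [wj - lr * err * xj for wj, xj in zip(w, x)]


# Toy data (assumed): targets generated exactly by the weights [2, -1].
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -1.0, 1.0, 3.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
w_gd = [0.0, 0.0]
w_sgd = [0.0, 0.0]
for _ in range(2000):
    w_gd = full_gradient_step(w_gd, xs, ys, lr=0.1)
    w_sgd = stochastic_gradient_step(w_sgd, xs, ys, lr=0.1, rng=rng)

print("full gradient descent:", [round(v, 3) for v in w_gd])
print("stochastic gradient descent:", [round(v, 3) for v in w_sgd])
```
]]></preformat>
        <p>Both runs approach the same weights here because the toy targets are noise-free. On noisy data, SGD with a constant step size typically fluctuates around the minimizer rather than settling on it, which is one reason a decreasing step size is often used.</p>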
        <p>To identify which language a text belongs to and also provide its translation, we adopt
a machine learning-based approach to language classification and machine translation. This
project provides a simple tool that further work can use to document more texts in selected Akwa
Ibom languages/dialects. The alternative, a rule-based approach, tells the system to classify text
into a particular category based on its content, using semantically relevant textual
elements.</p>
        <p>In our case, the machine learning-based system learns the mapping from the input data
(raw text) to the labels (also known as target variables). This is similar to non-text
classification problems, where a supervised classification algorithm is trained on a tabular dataset
to predict a class, except that in text classification the input data is raw text rather
than numeric features. Because we are classifying more than two languages, we need multi-class
classification. The Stochastic Gradient Descent classifier is a good choice because, despite being
a binary classifier, it can be extended to several classes, for example by training one binary
classifier per class (one-versus-rest).</p>
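        <p>The idea of using binary classifiers for more than two languages can be sketched in a few lines of Python. This is an illustration, not the classify.py script: the hinge loss, the character-bigram features, the one-versus-rest scheme, the hyperparameters and all function names are assumptions, and the only training data are the three variants quoted in the Introduction.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def char_bigrams(text):
    """Bag of character bigrams with boundary markers, e.g. '^a', 'as', ..., 'e$'."""
    padded = "^" + text.lower() + "$"
    counts = {}
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2]
        counts[gram] = counts.get(gram, 0) + 1
    return counts


def linear_score(model, feats):
    weights, bias = model
    return bias + sum(weights.get(g, 0.0) * v for g, v in feats.items())


def train_binary_sgd(samples, lr=0.5, epochs=100, seed=0):
    """Linear hinge-loss classifier trained by SGD; labels y are -1 or +1."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in random order each epoch
        for i in order:
            feats, y = samples[i]
            if y * linear_score((weights, bias), feats) < 1:  # margin violated
                for g, v in feats.items():
                    weights[g] = weights.get(g, 0.0) + lr * y * v
                bias += lr * y
    return weights, bias


def train_one_vs_rest(texts, labels):
    """Train one binary model per language: that language versus all others."""
    models = {}
    for label in sorted(set(labels)):
        samples = [(char_bigrams(t), 1 if lab == label else -1)
                   for t, lab in zip(texts, labels)]
        models[label] = train_binary_sgd(samples)
    return models


def predict(models, text):
    """Return the language whose binary model gives the highest score."""
    feats = char_bigrams(text)
    return max(models, key=lambda label: linear_score(models[label], feats))


texts = ["esiere", "asiere", "achiere"]
labels = ["Efik", "Ibibio", "Annang"]
models = train_one_vs_rest(texts, labels)
for t in texts:
    print(t, "->", predict(models, t))
```
]]></preformat>
        <p>With only three words the models simply memorize their training data, so the sketch shows the mechanics of the one-versus-rest scheme rather than any realistic accuracy.</p>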
        <p>Why do we care? There is a need to have a record of all the world’s languages, and Akwa Ibom
State, being my homeland, should be represented too. Language is one of the driving forces of
civilization; everything begins with communication. In 1997, Philip Parker provided a detailed
statistical analysis of more than 460 language groups in 234 countries. He illustrated issues
connecting linguistic cultures to nine areas of concern (including economics, cultural
resources and demography), with key variables for each area (such as railways, water and telecommunications).</p>
        <p>The significance of this project centers on the preservation of selected Akwa Ibom
languages/dialects through deliberate inclusion in contemporary technology, contributing to
linguistic research and improved translation to the English language. As stated above, this project
will be valuable for the development of Akwa Ibom State by providing a tool which can be built
upon for future indigenous language classification projects.</p>
        <p>This project explores a machine learning-based approach to classifying selected Akwa Ibom
languages and provides a translation tool for English language users. The under-resourced
languages used in this study are the Ibibio, Efik, Annang and Oro dialects. The project does not cover
the accuracy of the model used, and the data will not be fine-tuned for better performance.
The objectives are to:
- Collect sample data of selected texts and their translations written in the languages of
interest
- Develop a Stochastic Gradient Descent (SGD) algorithm to translate texts to English
- Use the same SGD algorithm to classify the language to which each text belongs
- Launch the algorithms as a web application</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Machine Translator</title>
      <p>The machine translator is built by following the steps described in the sub-sections below.</p>
    </sec>
    <sec id="sec-3">
      <title>Data Collection</title>
      <p>We create a register of words and sentences in the selected under-resourced Akwa Ibom
languages (Ibibio, Efik, Oro) and their English translations for the study.</p>
      <p>The data are stored in three columns and saved in .csv (comma-separated values) file
format, as represented in Table 1.</p>
      <p>
        The dataset is available online
(https://docs.google.com/spreadsheets/d/1PgaKArdAFo4fRTdbgCepZs4rudxp55wEw-nzluovl0/edit?usp=sharing). It is a table with three
columns, namely dialect, translation and (language/dialect) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Dialect: this column contains texts written in Akwa Ibom dialects.</p>
      <p>Translation: this column contains the English language translation of each text.</p>
      <p>(language/dialect): this column contains the dialect in which the text was written.</p>
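      <p>A minimal sketch of reading such a three-column file with Python's standard csv module is given below. The exact header spelling in ibibio.csv is an assumption, the rows reuse only the three variants quoted in the Introduction, and ENGLISH_TRANSLATION_PLACEHOLDER is a hypothetical stand-in for their shared English translation rather than real data.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical stand-in for the contents of ibibio.csv (header names assumed).
SAMPLE_CSV = """dialect,translation,(language/dialect)
esiere,ENGLISH_TRANSLATION_PLACEHOLDER,Efik
asiere,ENGLISH_TRANSLATION_PLACEHOLDER,Ibibio
achiere,ENGLISH_TRANSLATION_PLACEHOLDER,Annang
"""


def load_corpus(handle):
    """Read every row into a dict keyed by the three column names."""
    return list(csv.DictReader(handle))


rows = load_corpus(io.StringIO(SAMPLE_CSV))

# Group the dialect texts that share one English translation.
by_translation = {}
for row in rows:
    by_translation.setdefault(row["translation"], []).append(row["dialect"])

for row in rows:
    print(row["dialect"], "|", row["translation"], "|", row["(language/dialect)"])
```
]]></preformat>
      <p>To read a real file instead of the in-memory sample, the same function can be given an open file handle, for example one obtained with open("ibibio.csv", newline="", encoding="utf-8").</p>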
    </sec>
    <sec id="sec-4">
      <title>System Architecture</title>
      <p>There are six key items in our system: User Input, User Interface, Classifier, Translator, Text
Corpus and System Output. The user interface handles input and output, the processing unit
handles the algorithms (classifier and translator), and the text corpus is stored in the system
database.</p>
      <p>The system takes input through the User Interface, classifies and translates it by comparing
with the Text Corpus in the database, then outputs the result through the user interface too.</p>
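      <p>The flow from user input to system output can be sketched as below. This is a rough illustration under stated assumptions, not the system itself: a string-similarity lookup from Python's difflib stands in for the trained classifier and translator, the class and function names are hypothetical, and ENGLISH_TRANSLATION_PLACEHOLDER replaces real translations.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class CorpusEntry:
    text: str
    translation: str
    dialect: str


@dataclass(frozen=True)
class SystemOutput:
    translation: str
    dialect: str


# Tiny in-memory text corpus (assumed); the real corpus lives in the database.
CORPUS = [
    CorpusEntry("esiere", "ENGLISH_TRANSLATION_PLACEHOLDER", "Efik"),
    CorpusEntry("asiere", "ENGLISH_TRANSLATION_PLACEHOLDER", "Ibibio"),
    CorpusEntry("achiere", "ENGLISH_TRANSLATION_PLACEHOLDER", "Annang"),
]


def closest_entry(user_input, corpus):
    """Compare the input with every corpus text and return the best match."""
    def similarity(entry):
        return SequenceMatcher(None, user_input.lower(), entry.text.lower()).ratio()
    return max(corpus, key=similarity)


def process(user_input, corpus=CORPUS):
    """User input -> comparison with the text corpus -> system output."""
    entry = closest_entry(user_input.strip(), corpus)
    return SystemOutput(translation=entry.translation, dialect=entry.dialect)


print(process("asiere"))
```
]]></preformat>
      <p>Keeping the processing step separate from the interface means the stand-in lookup could later be swapped for trained models without changing how the user interface calls it.</p>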
      <sec id="sec-4-1">
        <title>User Input</title>
        <p>The user enters a text in any of the selected under-resourced languages, and the system
provides a translation for it.</p>
      </sec>
      <sec id="sec-4-2">
        <title>System Output</title>
        <p>The system takes a user input and provides the most probable output from the text corpus
(database).</p>
      </sec>
      <sec id="sec-4-3">
        <title>User Interface</title>
        <p>The user interface interacts with the user of the web application: it accepts inputs and
displays the result to the user.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Classifier</title>
        <p>The classifier assigns a language label to the input text and returns it as output on the user
interface.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Translator</title>
        <p>The translator assigns the most probable translation to the input text and returns it as output on the
user interface.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Text Corpus</title>
        <p>The text corpus is the database of texts written in the selected under-resourced languages with
their English translations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Python Scripts</title>
      <p>Three Python scripts were written: classify.py, ibom.py and ibibio_app.py.</p>
      <sec id="sec-5-1">
        <title>classify.py</title>
        <p>The classify.py file identifies the dialect in which the text is written.</p>
      </sec>
      <sec id="sec-5-2">
        <title>ibom.py</title>
        <p>The next Python script, ibom.py, follows an algorithm similar to that of classify.py. It
handles text translation, translating the text into the English language.</p>
      </sec>
      <sec id="sec-5-3">
        <title>ibibio_app.py</title>
        <p>The ibibio_app.py is designed to run as an interactive Streamlit application.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>System Implementation</title>
      <p>There are two ways to launch the web app: locally and remotely. To
launch locally, one needs a computer with at least a 2 GHz processor and 6 GB of
RAM. To launch remotely, log in to github.com, fork the repository
https://github.com/imanibom/Akwa-Ibom-Language-Classification-and-Machine-Translation,
open the GitHub workspace and run the command streamlit run ibibio_app.py (remember
to press Enter).</p>
      <sec id="sec-6-1">
        <title>Launching the Streamlit Application Locally</title>
        <p>The dataset (ibibio.csv) and the three Python scripts (classify.py, ibom.py and ibibio_app.py) must
be in the same folder (you can create an empty folder for this). Then follow these steps:
1. Go to the Start Menu on your computer and launch the Command Prompt (or Git Bash,
Anaconda Prompt, etc.)
2. Navigate to the file location
3. Type the command streamlit run ibibio_app.py and press Enter</p>
        <p>Running the script in Figure 4 will cause the interface to load in your browser, ready for
use.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Web App Interface</title>
        <p>The web app interface has the following features:
● Welcome message: “You are Welcome. Enter Akwa Ibom Word or Sentence of Your Choice”
● User Input: Textbox
● Translation: Output
● Dialect: Classifies the word into a language group (dialect)</p>
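        <p>The features listed above could be laid out in Streamlit roughly as follows. This is a sketch and not the actual ibibio_app.py: the function names build_result and render are hypothetical, and classify and translate are assumed to be callables that wrap the classifier and translator. Only the welcome message and the field labels are taken from the list above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def build_result(text, classify, translate):
    """Compute what the interface shows for one user input (None if empty)."""
    cleaned = text.strip()
    if not cleaned:
        return None
    return {"Translation": translate(cleaned), "Dialect": classify(cleaned)}


def render(classify, translate):
    """Draw the interface; a script started with `streamlit run` would call this."""
    # Third-party package, imported here so build_result works without it.
    import streamlit as st

    st.title("You are Welcome. Enter Akwa Ibom Word or Sentence of Your Choice")
    text = st.text_input("User Input")  # textbox for the user's word or sentence
    result = build_result(text, classify, translate)
    if result is not None:
        st.write("Translation:", result["Translation"])
        st.write("Dialect:", result["Dialect"])
```
]]></preformat>
        <p>Streamlit reruns the whole script on every interaction, so a script using this sketch would end with a single call such as render(classify, translate), with both callables defined or imported earlier in the file.</p>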
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Language classification and machine translation are large fields in machine learning. Several large
language models (LLMs) have been developed to handle specific areas ranging from
virtual assistance to machine translation. Haystack, LangChain, BERT, ChatGPT and others provide
ready-made platforms for launching LLM apps; in this project, however, we developed a simple
machine translation and language classification app for selected under-resourced Akwa Ibom
languages. In this research we have created a web application that can both classify and
translate texts written in selected Akwa Ibom dialects. This also opens up further AI
research, such as:
● Real-time machine translation
● Audio machine translation</p>
      <p>With 10th-generation laptop computers, it is possible to work on large datasets and use
text annotation techniques to scrape text documents. For further work, we recommend:
a. Implementing text scraping and annotation for the dataset
b. Linking to a databank of Akwa Ibom language texts
c. Implementing web scraping tools on online and social media posts written in Akwa
Ibom dialects</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Orza</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Text Classifiers in Machine Learning: A Guide. levity</article-title>
          .ai
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Firsanova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>A Quick Guide to Low-Resource NLP</article-title>
          . mlops.community
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Riehl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Why are Languages Worth Preserving</article-title>
          . sapiens.org
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Usip</surname>
            ,
            <given-names>P. U.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ekpenyong</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Towards Ontology-Driven Application for Multilingual Speech Language Therapy</article-title>
          .
          <source>Human Language Technologies for Under-Resourced African Languages: Design, Challenges, and Prospects</source>
          ,
          <fpage>85</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          Efik Language Encyclopedia
          ,
          <source>Science News &amp; Research Reviews: Classification of Language</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Rani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Text Classification Using Machine Learning Techniques: A Comparative Study</article-title>
          , pp.
          <fpage>551</fpage>
          -
          <lpage>555</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ordóñez</surname>
            ,
            <given-names>M. B.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>What Is Statistical Machine Translation?</article-title>
          . Blog.pangeanic.com
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>What is Statistical Machine Translation (SMT)?</article-title>
          . Omniscien.com
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>What is Rules-Based Machine Translation (RBMT)?</article-title>
          . Omniscien.com
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Stochastic Gradient Descent</article-title>
          . www.stat.cmu.edu
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Oboho-Etuk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>South African Language Identification</article-title>
          . Kaggle.com https://github.com/imanibom/Akwa-Ibom-Language-Classification-and-Machine-Translation
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>