=Paper=
{{Paper
|id=Vol-3335/MSW_paper4
|storemode=property
|title=Semantic Representation of Igbo Text Using Knowledge Graph
|pdfUrl=https://ceur-ws.org/Vol-3335/MSW_Paper4.pdf
|volume=Vol-3335
|authors=Nkechi J.Ifeanyi-Reuben,Patience Usoro Usip
|dblpUrl=https://dblp.org/rec/conf/kgswc/Ifeanyi-ReubenU22
}}
==Semantic Representation of Igbo Text Using Knowledge Graph==
Semantic Representation of Igbo Text Using
Knowledge Graph⋆
Nkechi J.Ifeanyi-Reuben1,† , Patience Usoro Usip2,∗,†
1
Computer Science Department, Nnamdi Azikiwe University Awka, Nigeria)
2
Computer Science Department, University of Uyo, Uyo, Nigeria
Abstract
With the fast growth of Artificial Intelligence and its application in different areas of Natural Language
Processing, semantic representation contributes immensely to smoothing the progress of different
automated language processing applications. Semantic representation returns the meaning of the text as
it may be understood by humans. Although semantic representation is very useful for several applications,
no semantic model is proposed for the Igbo language. The usage of Igbo language in the text-based
applications such as text mining, information retrieval, natural language processing is at the increase.
Igbo language uses compounding in its word formation and word ordering play high role in the language.
The uncertainty in dealing with these compound words has made the representation of Igbo text very
difficult. There is need to for smart data representation model in the said language to enhance efficiency
and effectiveness in its text-based application. This paper presents the analysis of a language classification,
considering Igbo language, considering its compounding nature and describes a smart model for text
representation using a Knowledge Graph. The model will create a smart data repository the real-world
usage of text and tangled its context relationship. The proposed Igbo Knowledge Graph (IKG) text
representation model was used in Igbo text classification system. The performance of the Igbo text
classification system is measured by computing the precision, recall and F1-measure of the result obtained
on bigram, semantic-based and unigram represented textual documents. The Igbo text classification
on semantic-based represented text has highest degree of exactness (precision). This shows that the
classification on semantic-based Igbo represented text outperforms bigram and unigram represented
texts. Semantic-based text representation model using knowledge graph is highly recommended for any
Igbo text-based system. It enables automated reasoning as well addresses the challenges incurred as a
result of Igbo compounding, word ordering and collocations language peculiarities.
Keywords
Igbo Language, Text Representation, Text Classification, Ontology, Knowledge Graph, Artificial Intelli-
gence, Compound Word, Semantics
IWMSW-2022: International Workshop on Multilingual Semantic Web, Co-located with the KGSWC-2022, November
21–23, 2022, Madrid, Spain
⋆
You can use this document as the template for preparing your publication. We recommend using the latest version
of the ceurart style.
∗∗
Corresponding author.
†
These authors contributed equally.
Envelope-Open nj.ifeanyi-reuben@unizik.edu.ng (N. J.Ifeanyi-Reuben); patienceusip@uniuyo.edu.ng (P. U. Usip)
Orcid 0000-0002-6516-5194 (P. U. Usip)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)
1. Introduction
Text representation is the selection of appropriate features to represent document [1]. The
approach in which text is represented has a big effect in the performance of any text-based ap-
plications [2]. It is powerfully controlled by the language of the text. The spread of Information
Technology (IT) in real life activities has assisted in inculcating Igbo language in text-based
application such as text creation, web creation, text mining, information retrieval and natural
language processing. This research improved the existing research of [3]. on the analysis and
representation of Igbo text for a text-based system by incorporating the semantic represen-
tation of the text in order to create detailed notations of the text that accurately conveys its
meaning. Semantic representation of the textual document is very rich and is adopted in many
applications of Natural Language Processing (NLP) such as machine translation, information
retrieval, question answering, text classification, sentiment analysis, text summarisation and
text extraction. It reflects the meaning of the text as it may be understood by humans. Thus, it
contributes to facilitating various automated language processing applications. The research
on [3, 4, 5, 6] emphasized that the semantic representation of Arabic text can facilitate several
natural language processing applications such as text summarization and textual entailment.
Semantic representation can be achieved using Knowledge Graph (KG). Semantic representation
reflects the meaning of the text as it may be understood by humans. Thus, it contributes to
facilitating various automated language processing applications. Semantic representation can
be achieved using Knowledge Graph (KG).
Knowledge Graph (KG) is a way to represent and organize the data in a more efficient and
easy way to modify, use, and understand [7] It is also referred to as a collection of interlinked
description of concepts, entities, relationships and events via linking and semantic metadata,
providing a framework for data integration, unification, analytics and sharing.
With the widespread growth of Igbo data on the Web, the need for efficient methods to get and
arrange valuable information from these big noisy data is increased. This research presents
an Igbo Knowledge Graph (IKG) for representing data created with Igbo language for better
performance for any Igbo text-based applications.
This Igbo smart representation will be useful for many purposes such as question answering,
summarization and information retrieval.
The model chosen by the researchers will also help to discover unidentified facts and concealed
knowledge that may exist in the lexical, semantic or relations in Igbo text corpus.
1.1. Language classification
A language is a method of communication between individuals who share common code, in
form of symbols [8]. In linguistics, there are two kinds of language classification: genetic (or
genealogical) and typological.
Genetic, also known as genealogical language is a type that group languages into families based
to their degree of diachronic relatedness. Examples of genealogic language group are German,
English, Dutch, Swedish, Norwegian, Danish, Irish, Welsh, Breton, etc.
Typological classification groups languages into types according to their structural characteris-
tics. These structural characteristics can be phonological typology, morphological typology or
syntactic typology. Typological languages form words by agglutination. Examples are Igbo,
Turkish, Finnish, Japanese, etc. [9]
Igbo Language The Igbo language is one of the agglutinative languages, a language that form
words through the combination of smaller morphemes to get compound words. It is one of
the three major languages in Nigeria. It is largely spoken by the people in the eastern part of
Nigeria. Igbo language has many dialects. The standard Igbo is used formally and is adopted for
this research. The current Igbo orthography [8] is based on the Standard Igbo. Orthography is
a way of writing sentence or constructing grammar in a language. Standard Igbo has thirty-six
(36) alphabets (a, b, ch, d, e, f, g, gb, gh, gw, h, i, ị, j, k, kw, kp, l, m, n, nw, ny, ṅ, o, ọ, p, r, s, sh, t,
u, ụ, v, w, y, z). Igbo language has a large number of compound words. A compound word
is a word that has more than one root, and can be made from combination of either nouns,
pronouns or adjectives.
Ifeanyi-Reuben et al. [8] studied the Igbo compound words and categorized them as follows:
i. Nominal (NN) Compound Word: A nominal compound word is formed by the combination of
two or more nouns. The nominal compound words are written separately not minding the
semantic status of the nouns in Igbo. Example of Igbo nominal compound words are: nwa
akwụkwọ - student; onye nkuzi – teacher; ama egwuregwu – stadium; ụlọ ọgwụ - hospital; ụlọ
akwụkwọ - school.
ii. Agentive Compound Words: In agentive compound word, one or more nouns express the
meaning of the agent, doer of the action. The Igbo agentive compound words are written
separately irrespective of the translations in English. They can also be referred to as VN (Verb
Noun) compound words. Example: oje ozi – messenger; oti ịgba - drummer.
iii. Igbo Duplicated Compound Word: Igbo duplicated compound words are formed by the
repetition of the exact word two or more times to show a variety of meaning. For example: ọsọ
ọsọ - quickly; mmiri mmiri – watery; ọbara ọbara – reddish.
iv. Igbo Coordinate Compound Words: This compound word is formed by the combination
of two or words joined by the Igbo conjunction “na” meaning “and” in English. All the Igbo
compound words of this category is written separately. Example: Ezi na ụlọ - family; okwu na
ụka – quarrel.
v. Igbo Proper Compound Words: This category of Igbo compound words includes personal
names, place names, and club names. All words in this category are wriiten together not
minding how long they may be. Example: Uchechukwu; Ngozichukwuka; Ifeanyichukwu.
vi. Igbo Derived Compound Words: The derived Igbo compound words are words derived from
verbs or phrases. The roots of the derived Igbo compound words are written together. Example:
Dinweụlọ - landlord.
Igbo, being an agglutinative language, has a huge number of compounds words and can be
referred to as a language of compound words. The proposed research of Igbo Knowledge Graph
representation will consider this peculiarity to get a good result.
2. Related Works
Ifeanyi-Reuben et al. [chidiebere2020analysis] presents the analysis of Igbo language text
document and describes its representation with the Word-based N-gram model. The result
shows that Bigram and Trigram n-gram text representation models perform better than
unigram model.
Wael and Arafat [3] proposed a graph-based semantic representation model for Arabic text.
The proposed model aims to extract the semantic relations between Arabic words. The results
proved that the proposed graph-based model is able to enhance the performance of the textual
entailment recognition task in comparison to other baseline models.
Zhang, Yoshida and Tang [10] studied and compared the performance of adopting TF*IDF,
LSI (Latent Semantic Indexing) together with multiple words for text representation. They
used Chinese and English corpora to assess the three techniques in information retrieval and
text categorization. Their result showed that LSI produced greatest performance in retrieving
English documents and also produced best performance for Chinese text categorization.
Chih-Fong [11] improved and applied Bag of Word (BOW) for image annotation. An image
annotation is used to allocate keywords to images automatically and the images are represented
using characteristics such as color, texture and shape. This is applied in Content-Based Image
Retrieval System (CBIRS) and the retrieval of the image is based on indexed image features.
Usip and Ntekop [12] posited that ontology is a necessary technology tool for easy and
intelligent reasoning with knowledge. Being the underlying schema for every knowledge graph,
this study will improve the existing work of Ifeanyi-Reuben et al. [8] by adding intelligence
to the work using Knowledge Graph. Ontology-driven applications for multilinguality was
described by Usip and Ekpenyong [13].
Etaiwi and Awajan [14] proposed SemG-TS, a novel semantic graph embedding-based
abstractive text summarization model for the Arabic language which employed a deep
neural network to generate abstractive summary. The result obtained shows SemG-TS model
outperforms the popular baseline word embedding technique, word2vec.
3. Methodology
The bulk of concerns for any text-based system are attributed to text representation considering
the peculiarities of the natural language involved. In this section, we propose an efficient
and effective model to represent Igbo text to be adopted by any text-based system. This is a
process of transforming unstructured Igbo textual document into a form proper for automatic
processing. This is a vital step in text processing because it affects the general performance of
the system. The proposed approach for the Igbo text representation process is shown in Figure
1.
3.1. Text Preprocessing
Text preprocessing involves tasks that are performed on text to convert the original natural
language text to a structure ready for processing. It performs very important functions in
Figure 1: Igbo Text Representation Process.
different text-based system. The tasks are Igbo text normalization, Igbo text tokenization and
Igbo text Stop words Removal.
Igbo Text Normalization: In Normalization process, we transformed the Igbo textual document
to a format to make its contents consistent, convenient and full words for an efficient processing.
We transformed all text cases to lower case and also removed diacritics and noisy data. The
noisy data is assumed to be data that are not in Igbo dataset. Text Tokenization: Tokenization
is the task of analyzing or separating text into a sequence of discrete tokens (words).
Igbo Stop-words Removal: Stop-words are language-specific functional words; the most fre-
quently used words in a language that usually carry no information. There are no specific
number of stop-words which all Natural Language Processing (NLP) tools should have. Most
of the language stop-words are generally pronouns, prepositions, and conjunctions. This task
removes the stopwords in Igbo text. Some of Igbo stopwords is shown in Figure 2.
In the proposed system, a stop-word list will be created and saved in a file named “stop-words”
and is loaded to the system whenever the task is asked to perform.
Figure 2: Sample of Igbo Stop-words list.
3.2. Knowledge Graph Text Representation
Knowledge graphs combine characteristics of several data management paradigms:
• Database: The data can be explored via structured queries.
• Graph: Data can be analyzed as any other network data structure.
• Knowledge base: The model will bear formal semantics, which can be used to interpret the
data and infer new facts.
Igbo Knowledge graphs will provide good framework for Igbo data integration, unification,
linking and reuse.
4. Sample Igbo Text and the Corresponding Proposed
Knowledge Graph
Given the examples of Igbo compound words in table 1, it is observed that the actual meaning
of the semantic correctness of Igbo compound words is not the same when compared with
their roots and meaning after decomposition to Igbo single words. Hence, the need for the
compound word categorization.
Following the categorization of the Igbo compound words, a knowledge graph representation
of Igbo words and the various categories is given in Figure 3.
The underlying ontology used as the schema for the knowledge graph has the domain
knowledge which includes the bilingual corpora of Igbo single words and their English meaning,
the n-gram modeling feature and resulting Igbo compound words classified based on the
compound word categorization.
From the knowledge graph, the relationship among the various Igbo compound word, single
Igbo word and their English word meaning can be determined and used in an effort towards the
construction of a semantically correct bilingual Igbo - English Language dictionary consisting
of both single and compound Igbo words. With the knowledge graph, missing links between
Igbo compound and single words can be detemined at a glance for proper restructuring and
fixture to produce a semantically correct Igbo word
Table 1
Igbo Compound Words [8]
Igbo Compound Words Meaning Roots and meaning Compound Word Category
Onye - Person
Onye nkuzi Teacher Nkuzi – Teach Nominal
Ezi – surrounding
Na – and
Ezi na ụlọ Family ụlọ - family Coordinate
Ojiiego – use money
Ojiiegoachọego businessman achọego – find money Derived
ụgbọ - vessel
ụgbọ ala Car, motor ala - land (road) Nominal
Egbe – gun
Egbe igwe Thunder Igwe – sky Nominal
Iri – ten
Iri abụọ Twenty Abụọ - two Nominal
Ode – Write
Ode akwụkwọ Secretary Akwụkwọ - book Agentive
Ebere – mercy
Eberechukwu God’s mercy Chukwu - God Proper
Mmiri mmiri Watery Mmiri -water Duplicate
ọcha ọcha Whitish ọcha – white Duplicate
Onye – person
Onye nchekwa Administrator Nchekwa – protect Nominal
Kọmputa – Computer
Kọmputa Nkunaka Laptop Nkunaka – Handcarry Nominal
ọkpụ - mold
Ọkpụ ụzụ Blacksmith ụzụ - clay Agentive
Nche – protect
Nche anwụ Umbrella Anwụ - sun Agentive
Onyonyo – screen
Onyonyo kọmputa Monitor Kọmputa- computer Nominal
Okwu – speech
Okwu ntughe Password Ntughe - opening Nominal
Figure 4 is a designed Igbo knowledge graph model showing all the due processes employed to
represent Igbo textual document based on its semantic (reasoning) using knowledge graph.
5. System Performance Evaluation
The system performance is evaluated by computing the precision, F1-measure and Recall.
Precision is defined as the quotient of total TPs and sum of total TPs and FPs. Precision point is
known to as a point of correctness.
𝑇𝑃
Precision = (1)
𝑇𝑃 + 𝐹𝑃
Recall of the classification system is described as the quotient of total TPs and sum of total
Figure 3: Knowledge Graph Representation of Sample Igbo Compound.
TPs and total FNs. Recall level measures completeness.
𝑇𝑃
Recall = (2)
𝑇𝑃 + 𝐹𝑁
F1-Measure is single function that joins recall and precision points. When the F1-measure is
high, it means that the overall text classification system is high.
(2 ∗ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙)
F1-Measure = (3)
(𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙)
2𝑇 𝑃
= (4)
(2𝑇 𝑃 + 𝐹 𝑃 + 𝐹 𝑁 )
In summary, computation of precision, recall and f1-measure required four input parameters:
TP, FP, TN and FN.
i. TP - total of text documents accurately allotted to document class.
ii. FP - total of text documents wrongly allotted to document class.
iii. FN - total of text documents wrongly rejected from document class.
iv. TN - total of text documents correctly rejected from document class.
These parameters are input to the evaluator. They are obtained from the classification result.
Figure 4: Igbo Knowledge Graph Model
6. Experiments
This involves the practical method of putting into work all the theoretical design of the proposed
model. The semantic representation of Igbo text using knowledge graph is implemented on
Igbo text classification system with Python and tools from Natural Language Toolkit (NLTK).
Table 2
Summary of Bigram, Semantic-Based and Unigram Text Classification result
Text Document Bigram Semantic-Based Unigram
IgboText1 Administration Administration Administration
IgboText10 Religion Religion Religion
IgboText11 Computer Computer Administration
IgboText2 Religion Computer Religion
IgboText4 Administration Administration Administration
IgboText5 Religion Administration Administration
IgboText6 Computer Computer Administration
IgboText7 Administration Administration Administration
IgboText8 Religion Religion Religion
IgboText9 Religion Religion Religion
Table 3
Performance Measure Result
Category TP NP FP FN Precision Recall F1 Measure
Bigram 8 0 2 0 0.80 1.00 0.89
Semantic- 9 0 1 0 0.90 1.00 0.95
Based
Unigram 7 0 3 0 0.70 1.00 0.82
7. Result Analysis
Figure 5 displays the Text classification module of the system used to test the effectiveness of the
proposed model. The result obtained in text classification on Igbo text represented semantically
using knowledge graph is compared with the results obtained in unigram and bigram text
representation models.
Table 3 and Figure 6 show the classification performance measure result and chart respectively.
The result shows that the recall, precision and F1 for bigram Igbo represented text are 1.00,
.80 and .89 respectively. The recall, precision and F1 for semantic-based Igbo text are 1.00, .90
and .95 respectively. The recall, precision and F1 for unigram Igbo represented text are 1.00,
.62 and .82 respectively. Recall evaluates the degree of completeness. The result shows Igbo
text classification on the text represented with the three models (bigram, semantic-based and
unigram) has the equal level of recall (completeness). This means all the text documents that
were given to the classifier, were given a label name. Precision measures the degree of exactness.
The classification with semantic-based has highest degree (0.90) of exactness (precision).
Table 2 gives the summary of classification result obtained on Bigram, Semantic-Based and
Unigram text representation. A total of 10 testing documents are used for the experiment. In
bigram, eight documents are correctly assigned a class label while two are incorrectly assigned
a class label. In semantic-based text representation using knowledge graph, 9 documents are
correctly assigned a class label while one is incorrectly assigned. In unigram, 7 documents are
correctly assigned a class label while 3 are incorrectly assigned a class label.
8. Conclusion
An improved intelligent approach for representing Igbo text document using Knowledge Graph
model considering the agglutinative nature of Igbo language is proposed. This is to solve the
issues of collocations, compounding, and word ordering that plays major roles in the language,
thereby making the representation semantic-enriched. The model is implemented and evaluated
using Igbo text classification system.
The performance was measured by computing the classification accuracy of Bigram, Semantic-
Based and Unigram represented text. The result showed that the classification performed on
Semantic-based represented text has higher performance than Bigram and unigram represented
texts. It has shown that a high quality text representation model certainly boost performance
of NLP tasks.
The model will be of high commercial potential value and will be useful in any text based
intelligent system on the language. It will also motivate other researchers to develop interest in
doing more research on Igbo language processing to the benefit of people and society.
9. Acknowledgments
The authors wish to express gratitude the unknown reviewers of this work for their useful
comments and contributions that assisted in enhancing the worth of this paper.
References
[1] D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Text classification improved through multigram
models, in: Proceedings of the 15th ACM international conference on Information and
knowledge management, 2006, pp. 672–681.
[2] D. D. Lewis, Representation quality in text classification: An introduction and experiment,
in: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley,
Pennsylvania, June 24-27, 1990, 1990.
[3] W. Etaiwi, A. Awajan, Graph-based arabic text semantic representation, Information
Processing & Management 57 (2020) 102183.
[4] S. Tiwari, P. Siarry, S. Mehta, M. Jabbar, Tools, Languages, Methodologies for Representing
Semantics on the Web of Things, John Wiley & Sons, 2022.
[5] R. Panchal, P. Swaminarayan, S. Tiwari, F. Ortiz-Rodriguez, Aishe-onto: a semantic model
for public higher education universities, in: DG. O2021: The 22nd Annual International
Conference on Digital Government Research, 2021, pp. 545–547.
[6] F. Ortiz-Rodriguez, S. Tiwari, R. Panchal, J. M. Medina-Quintero, R. Barrera, Mexin:
Multidialectal ontology supporting nlp approach to improve government electronic com-
munication with the mexican ethnic groups, in: DG. O 2022: The 23rd Annual International
Conference on Digital Government Research, 2022, pp. 461–463.
[7] I. A. Ahmed, F. N. AL-Aswadi, K. M. Noaman, et al., Arabic knowledge graph construction:
a close look in the present and into the future, Journal of King Saud University-Computer
and Information Sciences (2022).
[8] U. Chidiebere, A. Tunde, et al., Analysis and representation of igbo text document for a
text-based system, arXiv preprint arXiv:2009.06376 (2020).
[9] M. Onukawa, The writing of standard igbo in okereke oo (ed.) readings in citizenship
education, Okigwe: Wythem Publishers (2001).
[10] W. Zhang, T. Yoshida, X. Tang, Text classification based on multi-word with support vector
machine, Knowledge-Based Systems 21 (2008) 879–886.
[11] C.-F. Tsai, Bag-of-words representation in image annotation: A review, International
Scholarly Research Notices 2012 (2012).
[12] P. U. Usip, M. Ntekop, The use of ontologies as efficient and intelligent knowledge
management tool, in: 2016 Future Technologies Conference (FTC), IEEE, 2016, pp. 626–631.
[13] P. U. Usip, M. E. Ekpenyong, Towards ontology-driven application for multilingual speech
language therapy, in: Human Language Technologies for Under-Resourced African
Languages, Springer, 2018, pp. 85–101.
[14] W. Etaiwi, A. Awajan, Semg-ts: Abstractive arabic text summarization using semantic
graph embedding, Mathematics 10 (2022) 3225.
Figure 5: Igbo Text Classification System Result
Figure 6: Igbo Text Classification System Performance Measure Result Chart