<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Semantic Representation of Igbo Text Using Knowledge Graph</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nkechi</forename><forename type="middle">J</forename><surname>Ifeanyi-Reuben</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Nnamdi Azikiwe University Awka</orgName>
								<address>
									<country key="NG">Nigeria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Patience</forename><forename type="middle">Usoro</forename><surname>Usip</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Uyo</orgName>
								<address>
									<settlement>Uyo</settlement>
									<country key="NG">Nigeria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Semantic Representation of Igbo Text Using Knowledge Graph</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">2163B6A1F3249AD464555EEAF1B53EEA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Igbo Language</term>
					<term>Text Representation</term>
					<term>Text Classification</term>
					<term>Ontology</term>
					<term>Knowledge Graph</term>
					<term>Artificial Intelligence</term>
					<term>Compound Word</term>
					<term>Semantics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the fast growth of Artificial Intelligence and its application in different areas of Natural Language Processing, semantic representation contributes immensely to smoothing the progress of many automated language processing applications. Semantic representation returns the meaning of a text as it may be understood by humans. Although semantic representation is very useful for several applications, no semantic model has been proposed for the Igbo language, even though the use of Igbo in text-based applications such as text mining, information retrieval and natural language processing is on the increase. Igbo uses compounding in its word formation, and word ordering plays a major role in the language. The uncertainty in dealing with these compound words has made the representation of Igbo text very difficult, so a smart data representation model is needed to enhance the efficiency and effectiveness of Igbo text-based applications. This paper presents an analysis of language classification with respect to Igbo and its compounding nature, and describes a smart model for text representation using a Knowledge Graph. The model creates a smart data repository that captures the real-world usage of text and its entangled context relationships. The proposed Igbo Knowledge Graph (IKG) text representation model was used in an Igbo text classification system. The performance of the classification system is measured by computing the precision, recall and F1-measure of the results obtained on bigram, semantic-based and unigram represented textual documents. The classification on semantic-based represented text has the highest degree of exactness (precision), showing that classification on semantic-based represented Igbo text outperforms classification on bigram and unigram represented texts.
Semantic-based text representation using a knowledge graph is highly recommended for any Igbo text-based system. It enables automated reasoning and addresses the challenges that arise from the Igbo language peculiarities of compounding, word ordering and collocations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IWMSW-2022: International Workshop on Multilingual Semantic Web, Co-located with the KGSWC-2022, November 21-23, 2022, Madrid, Spain</head><p>* Corresponding author. † These authors contributed equally. Email: nj.ifeanyi-reuben@unizik.edu.ng (N. J. Ifeanyi-Reuben); patienceusip@uniuyo.edu.ng (P. U. Usip). ORCID: 0000-0002-6516-5194 (P. U. Usip).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Text representation is the selection of appropriate features to represent a document <ref type="bibr" target="#b0">[1]</ref>. The way in which text is represented has a strong effect on the performance of any text-based application <ref type="bibr" target="#b1">[2]</ref>, and it is strongly influenced by the language of the text. The spread of Information Technology (IT) in real-life activities has helped incorporate the Igbo language into text-based applications such as text creation, web creation, text mining, information retrieval and natural language processing. This research improves the existing work of <ref type="bibr" target="#b2">[3]</ref> on the analysis and representation of Igbo text for text-based systems by incorporating the semantic representation of the text, in order to create detailed notations that accurately convey its meaning. Semantic representation of textual documents is very rich and is adopted in many applications of Natural Language Processing (NLP), such as machine translation, information retrieval, question answering, text classification, sentiment analysis, text summarisation and text extraction. It reflects the meaning of the text as it may be understood by humans, and thus contributes to facilitating various automated language processing applications. The research in <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref> emphasized that the semantic representation of Arabic text can facilitate several natural language processing applications such as text summarization and textual entailment. 
Semantic representation can be achieved using a Knowledge Graph (KG). A Knowledge Graph is a way to represent and organize data that is more efficient and easier to modify, use and understand <ref type="bibr" target="#b6">[7]</ref>. It is also referred to as a collection of interlinked descriptions of concepts, entities, relationships and events via linking and semantic metadata, providing a framework for data integration, unification, analytics and sharing. With the widespread growth of Igbo data on the Web, the need for efficient methods to retrieve and organize valuable information from these big, noisy data has increased. This research presents an Igbo Knowledge Graph (IKG) for representing data created in the Igbo language, for better performance of any Igbo text-based application. This smart representation will be useful for many purposes such as question answering, summarization and information retrieval. The model chosen by the researchers will also help to discover unidentified facts and concealed knowledge that may exist in the lexical and semantic relations of an Igbo text corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Language classification</head><p>A language is a method of communication between individuals who share a common code, in the form of symbols <ref type="bibr" target="#b7">[8]</ref>. In linguistics, there are two kinds of language classification: genetic (or genealogical) and typological. Genetic, also known as genealogical, classification groups languages into families according to their degree of diachronic relatedness; examples of languages grouped this way are German, English, Dutch, Swedish, Norwegian, Danish, Irish, Welsh, Breton, etc. Typological classification groups languages into types according to their structural characteristics, which can be phonological, morphological or syntactic. Agglutinative languages, one such morphological type, form words by agglutination; examples are Igbo, Turkish, Finnish, Japanese, etc. <ref type="bibr" target="#b8">[9]</ref> Igbo Language: The Igbo language is an agglutinative language, one that forms words by combining smaller morphemes into compound words. It is one of the three major languages in Nigeria and is largely spoken by the people in the eastern part of the country. Igbo has many dialects; the standard Igbo is used formally and is adopted for this research. The current Igbo orthography <ref type="bibr" target="#b7">[8]</ref> is based on Standard Igbo. Orthography is the set of conventions for writing a language. Standard Igbo has thirty-six (36) letters (a, b, ch, d, e, f, g, gb, gh, gw, h, i, ị, j, k, kw, kp, l, m, n, nw, ny, ṅ, o, ọ, p, r, s, sh, t, u, ụ, v, w, y, z). Igbo has a large number of compound words. A compound word is a word that has more than one root and can be made from a combination of nouns, pronouns or adjectives. Ifeanyi-Reuben et al. <ref type="bibr" target="#b7">[8]</ref> studied Igbo compound words and categorized them as follows: i. 
Nominal (NN) Compound Words: A nominal compound word is formed by the combination of two or more nouns. Nominal compound words are written separately regardless of the semantic status of the nouns in Igbo. Examples of Igbo nominal compound words are: nwa akwụkwọ -student; onye nkuzi -teacher; ama egwuregwu -stadium; ụlọ ọgwụ -hospital; ụlọ akwụkwọ -school. ii. Agentive Compound Words: In an agentive compound word, one or more nouns express the meaning of the agent, the doer of the action. Igbo agentive compound words are written separately irrespective of their translations in English. They can also be referred to as VN (Verb-Noun) compound words. Examples: oje ozi -messenger; oti ịgba -drummer. iii. Igbo Duplicated Compound Words: Igbo duplicated compound words are formed by the repetition of the exact word two or more times to express a variety of meanings. For example: ọsọ ọsọ -quickly; mmiri mmiri -watery; ọbara ọbara -reddish. iv. Igbo Coordinate Compound Words: This type of compound word is formed by the combination of two or more words joined by the Igbo conjunction "na", meaning "and" in English. All Igbo compound words of this category are written separately. Examples: Ezi na ụlọ -family; okwu na ụka -quarrel. v. Igbo Proper Compound Words: This category of Igbo compound words includes personal names, place names, and club names. All words in this category are written together regardless of how long they may be. Examples: Uchechukwu; Ngozichukwuka; Ifeanyichukwu. vi. Igbo Derived Compound Words: Derived Igbo compound words are words derived from verbs or phrases; their roots are written together. Example: Dinweụlọ -landlord. Igbo, being an agglutinative language, has a huge number of compound words and can be referred to as a language of compound words. The proposed Igbo Knowledge Graph representation will consider this peculiarity to get a good result. 
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Zhang, Yoshida and Tang <ref type="bibr" target="#b9">[10]</ref> studied and compared the performance of TF*IDF and LSI (Latent Semantic Indexing), together with multiple words, for text representation. They used Chinese and English corpora to assess the techniques in information retrieval and text categorization. Their results showed that LSI produced the greatest performance in retrieving English documents and the best performance for Chinese text categorization. Chih-Fong <ref type="bibr" target="#b10">[11]</ref> improved and applied Bag of Words (BoW) to image annotation. Image annotation is used to assign keywords to images automatically, with the images represented by characteristics such as colour, texture and shape. This is applied in Content-Based Image Retrieval Systems (CBIRS), where image retrieval is based on indexed image features. Usip and Ntekop <ref type="bibr" target="#b11">[12]</ref> posited that ontology is a necessary technology tool for easy and intelligent reasoning with knowledge. Being the underlying schema for every knowledge graph, ontology allows this study to improve the existing work of Ifeanyi-Reuben et al. <ref type="bibr" target="#b7">[8]</ref> by adding intelligence using a Knowledge Graph. Ontology-driven applications for multilinguality were described by Usip and Ekpenyong <ref type="bibr" target="#b12">[13]</ref>. Etaiwi and Awajan <ref type="bibr" target="#b13">[14]</ref> proposed SemG-TS, a novel semantic graph embedding-based abstractive text summarization model for the Arabic language that employs a deep neural network to generate abstractive summaries. Their results show that the SemG-TS model outperforms the popular baseline word embedding technique, word2vec.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Most of the concerns in any text-based system are attributable to text representation, given the peculiarities of the natural language involved. In this section, we propose an efficient and effective model for representing Igbo text that can be adopted by any text-based system. Text representation is the process of transforming an unstructured Igbo textual document into a form suitable for automatic processing. It is a vital step in text processing because it affects the overall performance of the system. The proposed approach for the Igbo text representation process is shown in Figure <ref type="figure" target="#fig_1">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Text Preprocessing</head><p>Text preprocessing involves the tasks performed on text to convert the original natural language text into a structure ready for processing; it performs very important functions in Igbo text processing. Text Normalization: In the normalization process, we transform the Igbo textual document into a format that makes its contents consistent and convenient for efficient processing. We convert all text to lower case and remove diacritics and noisy data, where noisy data is assumed to be data that is not in the Igbo dataset. Text Tokenization: Tokenization is the task of analyzing or separating text into a sequence of discrete tokens (words). Igbo Stop-words Removal: Stop-words are language-specific functional words; the most frequently used words in a language, which usually carry no information. There is no specific number of stop-words that every Natural Language Processing (NLP) tool should have; most language stop-words are pronouns, prepositions and conjunctions. This task removes the stop-words in Igbo text. Some Igbo stop-words are shown in Figure <ref type="figure" target="#fig_2">2</ref>.</p><p>In the proposed system, a stop-word list is created, saved in a file named "stop-words" and loaded into the system whenever the task is to be performed. </p></div>
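The three preprocessing steps above can be sketched as a small pipeline. This is an illustrative sketch, not the paper's implementation: the stop-word entries are assumptions (the proposed system loads its full list from the "stop-words" file), and diacritic stripping is done via Unicode decomposition.

```python
import re
import unicodedata

# Hypothetical stop-word list; the paper loads its full list from a
# "stop-words" file, so these entries are illustrative assumptions.
IGBO_STOPWORDS = {"na", "ka", "ma", "nke"}

def normalize(text: str) -> str:
    """Lower-case the text and strip diacritics (combining marks)."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text: str) -> list[str]:
    """Separate text into a sequence of discrete word tokens."""
    return re.findall(r"[^\W\d_]+", text, flags=re.UNICODE)

def preprocess(text: str) -> list[str]:
    """Normalization -> tokenization -> stop-word removal."""
    return [t for t in tokenize(normalize(text)) if t not in IGBO_STOPWORDS]
```

For example, `preprocess("Ezi na ụlọ")` normalizes to "ezi na ulo", tokenizes it, and drops the conjunction "na".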
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Knowledge Graph Text Representation</head><p>Knowledge graphs combine characteristics of several data management paradigms:</p><p>• Database: The data can be explored via structured queries.</p><p>• Graph: The data can be analyzed as any other network data structure.</p><p>• Knowledge base: The model bears formal semantics, which can be used to interpret the data and infer new facts. The Igbo Knowledge Graph will provide a good framework for Igbo data integration, unification, linking and reuse.</p></div>
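The database-style paradigm above can be illustrated with a minimal triple store supporting structured queries. The entities and relation names below are illustrative assumptions drawn from the paper's own compound-word examples, not the actual IKG schema.

```python
# Minimal sketch of a knowledge graph as (subject, predicate, object) triples.
# Entities and relations are illustrative assumptions from the paper's examples.
TRIPLES = {
    ("onye nkuzi", "hasCategory", "Nominal"),
    ("onye nkuzi", "hasMeaning", "teacher"),
    ("onye nkuzi", "hasRoot", "onye"),
    ("onye nkuzi", "hasRoot", "nkuzi"),
    ("onye", "hasMeaning", "person"),
    ("nkuzi", "hasMeaning", "teach"),
}

def query(subject=None, predicate=None, obj=None):
    """Structured query over the triple store; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]
```

A query such as `query("onye nkuzi", "hasRoot")` retrieves the roots of the compound, showing how the same data serves both as a graph and as a queryable database.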
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Sample Igbo Text and the Corresponding Proposed Knowledge Graph</head><p>Given the examples of Igbo compound words in Table 1, it is observed that the actual meaning of a semantically correct Igbo compound word is not the same as the combined meanings of its roots after decomposition into single Igbo words. Hence the need for the compound word categorization.</p><p>Following the categorization of the Igbo compound words, a knowledge graph representation of Igbo words and the various categories is given in Figure <ref type="figure" target="#fig_4">3</ref>.</p><p>The underlying ontology used as the schema for the knowledge graph holds the domain knowledge, which includes the bilingual corpora of Igbo single words and their English meanings, the n-gram modeling feature and the resulting Igbo compound words classified according to the compound word categorization.</p><p>From the knowledge graph, the relationships among the various Igbo compound words, single Igbo words and their English meanings can be determined and used towards the construction of a semantically correct bilingual Igbo-English dictionary consisting of both single and compound Igbo words. With the knowledge graph, missing links between Igbo compound and single words can be determined at a glance, for proper restructuring and fixing, to produce a semantically correct Igbo word. </p></div>
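The missing-link check described above can be sketched as a graph traversal. The triple layout is our illustrative assumption (mirroring Table 1's "Egbe igwe" example), not the paper's implementation.

```python
# Sketch: find compound words whose roots lack an English-meaning edge in the
# graph, i.e. "missing links" to be restructured. Data layout is an assumption.
TRIPLES = {
    ("egbe igwe", "hasRoot", "egbe"),
    ("egbe igwe", "hasRoot", "igwe"),
    ("egbe", "hasMeaning", "gun"),
    # note: no ("igwe", "hasMeaning", ...) triple -> a missing link
}

def missing_links(triples):
    """Return (compound, root) pairs where the root has no hasMeaning edge."""
    with_meaning = {s for (s, p, o) in triples if p == "hasMeaning"}
    return sorted(
        (s, o)
        for (s, p, o) in triples
        if p == "hasRoot" and o not in with_meaning
    )
```

Here the check flags the root "igwe" of "egbe igwe" as lacking an English gloss, the kind of gap a bilingual Igbo-English dictionary builder would then fill.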
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">System Performance Evaluation</head><p>The system performance is evaluated by computing the precision, recall and F1-measure. Precision is defined as the quotient of the total TPs and the sum of the total TPs and FPs; it is known as a measure of correctness.</p><formula xml:id="formula_0">Precision = TP / (TP + FP)<label>(1)</label></formula><p>Recall of the classification system is defined as the quotient of the total TPs and the sum of the total TPs and FNs; it measures completeness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recall = TP / (TP + FN)</head><p>(2) The F1-measure is a single metric that combines precision and recall. A high F1-measure indicates that the overall performance of the text classification system is high.</p><formula xml:id="formula_1">F1-Measure = (2 * Precision * Recall) / (Precision + Recall)<label>(3)</label></formula><formula xml:id="formula_2">= 2TP / (2TP + FP + FN)<label>(4)</label></formula><p>In summary, the computation of precision, recall and F1-measure requires four input parameters: TP, FP, TN and FN. i. TP -the number of text documents correctly assigned to a document class. ii. FP -the number of text documents wrongly assigned to a document class. iii. FN -the number of text documents wrongly rejected from a document class. iv. TN -the number of text documents correctly rejected from a document class. These parameters are input to the evaluator and are obtained from the classification result.   </p></div>
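Equations (1)-(4) translate directly into code. As a sanity check, a classifier that assigns 8 documents correctly and 2 incorrectly while rejecting none (precision 0.80 at recall 1.00, as in the bigram result reported later) gives an F1-measure of about 0.89:

```python
def precision(tp: int, fp: int) -> float:
    """Exactness: TP / (TP + FP), per equation (1)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Completeness: TP / (TP + FN), per equation (2)."""
    return tp / (tp + fn)

def f1_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)
```

Note that TN does not enter any of the three formulas; it is recorded by the evaluator but only affects accuracy-style metrics.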
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Result Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>An improved intelligent approach for representing Igbo text documents using a Knowledge Graph model, considering the agglutinative nature of the Igbo language, is proposed. This is to solve the issues of collocations, compounding and word ordering that play major roles in the language, thereby making the representation semantically enriched. The model is implemented and evaluated using an Igbo text classification system. The model will be of high commercial potential value and will be useful in any text-based intelligent system for the language. It will also motivate other researchers to develop interest in doing more research on Igbo language processing, to the benefit of people and society.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Ifeanyi-Reuben et al. [chidiebere2020analysis] present the analysis of an Igbo language text document and describe its representation with the word-based n-gram model. The results show that the bigram and trigram text representation models perform better than the unigram model. Wael and Arafat [3] proposed a graph-based semantic representation model for Arabic text that aims to extract the semantic relations between Arabic words. The results proved that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to other baseline models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Igbo Text Representation Process.</figDesc><graphic coords="5,89.29,84.19,507.75,353.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Sample of Igbo Stop-words list.</figDesc><graphic coords="6,166.39,84.19,262.50,90.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4</head><label>4</label><figDesc>Figure 4 is the designed Igbo knowledge graph model showing all the processes employed to represent an Igbo textual document based on its semantics (reasoning) using a knowledge graph.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Knowledge Graph Representation of Sample Igbo Compound.</figDesc><graphic coords="8,89.29,84.19,430.50,280.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Igbo Knowledge Graph Model</figDesc><graphic coords="9,89.29,84.19,726.00,459.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5</head><label>5</label><figDesc>Figure 5 displays the text classification module of the system used to test the effectiveness of the proposed model. The result obtained in text classification on Igbo text represented semantically using the knowledge graph is compared with the results obtained with the unigram and bigram text representation models. Table 3 and Figure 6 show the classification performance measure result and chart respectively. The recall, precision and F1 for bigram-represented Igbo text are 1.00, 0.80 and 0.89 respectively; for semantic-based Igbo text, 1.00, 0.90 and 0.95; and for unigram-represented Igbo text, 1.00, 0.62 and 0.82. Recall evaluates the degree of completeness. Igbo text classification on text represented with the three models (bigram, semantic-based and unigram) has the same level of recall (completeness), meaning that all the text documents given to the classifier were assigned a label. Precision measures the degree of exactness; the classification with the semantic-based representation has the highest degree of exactness (0.90). Table 2 gives the summary of the classification results obtained with the bigram, semantic-based and unigram text representations. A total of 10 test documents were used for the experiment. With the bigram representation, eight documents were correctly assigned a class label while two were incorrectly assigned. With the semantic-based representation using the knowledge graph, nine documents were correctly assigned a class label while one was incorrectly assigned. With the unigram representation, seven documents were correctly assigned a class label while three were incorrectly assigned.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>The performance was measured by computing the classification accuracy of the bigram, semantic-based and unigram represented texts. The results showed that classification performed on semantic-based represented text has higher performance than on bigram and unigram represented texts, demonstrating that a high-quality text representation model certainly boosts the performance of NLP tasks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Igbo Compound Words<ref type="bibr" target="#b7">[8]</ref> </figDesc><table><row><cell>Igbo Compound Words</cell><cell>Meaning</cell><cell>Roots and meaning</cell><cell>Compound Word Category</cell></row><row><cell></cell><cell></cell><cell>Onye -Person</cell><cell></cell></row><row><cell>Onye nkuzi</cell><cell>Teacher</cell><cell>Nkuzi -Teach</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Ezi -surrounding</cell><cell></cell></row><row><cell></cell><cell></cell><cell>Na -and</cell><cell></cell></row><row><cell>Ezi na ụlọ</cell><cell>Family</cell><cell>ụlọ -house</cell><cell>Coordinate</cell></row><row><cell></cell><cell></cell><cell>Ojiiego -use money</cell><cell></cell></row><row><cell>Ojiiegoachọego</cell><cell>businessman</cell><cell>achọego -find money</cell><cell>Derived</cell></row><row><cell></cell><cell></cell><cell>ụgbọ -vessel</cell><cell></cell></row><row><cell>ụgbọ ala</cell><cell>Car, motor</cell><cell>ala -land (road)</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Egbe -gun</cell><cell></cell></row><row><cell>Egbe igwe</cell><cell>Thunder</cell><cell>Igwe -sky</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Iri -ten</cell><cell></cell></row><row><cell>Iri abụọ</cell><cell>Twenty</cell><cell>Abụọ -two</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Ode -Write</cell><cell></cell></row><row><cell>Ode akwụkwọ</cell><cell>Secretary</cell><cell>Akwụkwọ -book</cell><cell>Agentive</cell></row><row><cell></cell><cell></cell><cell>Ebere -mercy</cell><cell></cell></row><row><cell>Eberechukwu</cell><cell>God's mercy</cell><cell>Chukwu -God</cell><cell>Proper</cell></row><row><cell>Mmiri mmiri</cell><cell>Watery</cell><cell>Mmiri -water</cell><cell>Duplicate</cell></row><row><cell>ọcha ọcha</cell><cell>Whitish</cell><cell>ọcha -white</cell><cell>Duplicate</cell></row><row><cell></cell><cell></cell><cell>Onye -person</cell><cell></cell></row><row><cell>Onye nchekwa</cell><cell>Administrator</cell><cell>Nchekwa -protect</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Kọmputa -Computer</cell><cell></cell></row><row><cell>Kọmputa Nkunaka</cell><cell>Laptop</cell><cell>Nkunaka -Handcarry</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>ọkpụ -mold</cell><cell></cell></row><row><cell>Ọkpụ ụzụ</cell><cell>Blacksmith</cell><cell>ụzụ -clay</cell><cell>Agentive</cell></row><row><cell></cell><cell></cell><cell>Nche -protect</cell><cell></cell></row><row><cell>Nche anwụ</cell><cell>Umbrella</cell><cell>Anwụ -sun</cell><cell>Agentive</cell></row><row><cell></cell><cell></cell><cell>Onyonyo -screen</cell><cell></cell></row><row><cell>Onyonyo kọmputa</cell><cell>Monitor</cell><cell>Kọmputa -computer</cell><cell>Nominal</cell></row><row><cell></cell><cell></cell><cell>Okwu -speech</cell><cell></cell></row><row><cell>Okwu ntughe</cell><cell>Password</cell><cell>Ntughe -opening</cell><cell>Nominal</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Classification performance measure results for the bigram, semantic-based and unigram representations.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Acknowledgments</head><p>The authors wish to express gratitude to the anonymous reviewers of this work for their useful comments and contributions, which helped enhance the quality of this paper.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Text classification improved through multigram models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-T</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM international conference on Information and knowledge management</title>
				<meeting>the 15th ACM international conference on Information and knowledge management</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="672" to="681" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Representation quality in text classification: An introduction and experiment</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Speech and Natural Language: Proceedings of a Workshop Held at Hidden</title>
				<meeting><address><addrLine>Valley, Pennsylvania</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1990">June 24-27, 1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Graph-based arabic text semantic representation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Etaiwi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Awajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page">102183</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Tools, Languages, Methodologies for Representing Semantics on the Web of Things</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Siarry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jabbar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>John Wiley &amp; Sons</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">AISHE-Onto: A semantic model for public higher education universities</title>
		<author>
			<persName><forename type="first">R</forename><surname>Panchal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Swaminarayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ortiz-Rodriguez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 22nd Annual International Conference on Digital Government Research</title>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="545" to="547" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">MEXIN: Multidialectal ontology supporting NLP approach to improve government electronic communication with the Mexican ethnic groups</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ortiz-Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Panchal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Medina-Quintero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Barrera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 23rd Annual International Conference on Digital Government Research</title>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="461" to="463" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">A</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">N</forename><surname>Al-Aswadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Noaman</surname></persName>
		</author>
		<title level="m">Arabic knowledge graph construction: a close look in the present and into the future</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
		<respStmt>
			<orgName>Journal of King Saud University-Computer and Information Sciences</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">U</forename><surname>Chidiebere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tunde</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.06376</idno>
	<title level="m">Analysis and representation of Igbo text document for a text-based system</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The writing of standard Igbo</title>
		<author>
			<persName><forename type="first">M</forename><surname>Onukawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Readings in Citizenship Education</title>
				<editor>
			<persName><forename type="first">O</forename><forename type="middle">O</forename><surname>Okereke</surname></persName>
		</editor>
				<meeting><address><addrLine>Okigwe</addrLine></address></meeting>
		<imprint>
			<publisher>Wythem Publishers</publisher>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Text classification based on multi-word with support vector machine</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yoshida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="879" to="886" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Bag-of-words representation in image annotation: A review</title>
		<author>
			<persName><forename type="first">C.-F</forename><surname>Tsai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Scholarly Research Notices</title>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The use of ontologies as efficient and intelligent knowledge management tool</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">U</forename><surname>Usip</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ntekop</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Future Technologies Conference (FTC), IEEE</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="626" to="631" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Towards ontology-driven application for multilingual speech language therapy</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">U</forename><surname>Usip</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Ekpenyong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Language Technologies for Under-Resourced African Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="85" to="101" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SemG-TS: Abstractive Arabic text summarization using semantic graph embedding</title>
		<author>
			<persName><forename type="first">W</forename><surname>Etaiwi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Awajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">3225</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
