Semantic Representation of Igbo Text Using Knowledge Graph

Nkechi J. Ifeanyi-Reuben1,†, Patience Usoro Usip2,∗,†
1 Computer Science Department, Nnamdi Azikiwe University, Awka, Nigeria
2 Computer Science Department, University of Uyo, Uyo, Nigeria

Abstract
With the fast growth of Artificial Intelligence and its application in different areas of Natural Language Processing, semantic representation contributes immensely to the progress of automated language processing applications. Semantic representation captures the meaning of a text as it would be understood by humans. Although semantic representation is useful for several applications, no semantic model has been proposed for the Igbo language. The use of the Igbo language in text-based applications such as text mining, information retrieval and natural language processing is on the increase. Igbo uses compounding in its word formation, and word ordering plays an important role in the language. The uncertainty in dealing with these compound words has made the representation of Igbo text very difficult. There is a need for a smart data representation model for the language to enhance the efficiency and effectiveness of its text-based applications. This paper presents an analysis of language classification with respect to the Igbo language and its compounding nature, and describes a smart model for text representation using a Knowledge Graph. The model creates a smart data repository that captures the real-world usage of text together with its contextual relationships. The proposed Igbo Knowledge Graph (IKG) text representation model was used in an Igbo text classification system. The performance of the system is measured by computing the precision, recall and F1-measure of the results obtained on bigram, semantic-based and unigram represented textual documents. Igbo text classification on semantic-based represented text has the highest degree of exactness (precision).
This shows that classification on semantic-based represented Igbo text outperforms classification on bigram and unigram represented texts. A semantic-based text representation model using a knowledge graph is highly recommended for any Igbo text-based system. It enables automated reasoning and addresses the challenges that arise from the compounding, word ordering and collocation peculiarities of the Igbo language.

Keywords
Igbo Language, Text Representation, Text Classification, Ontology, Knowledge Graph, Artificial Intelligence, Compound Word, Semantics

IWMSW-2022: International Workshop on Multilingual Semantic Web, co-located with KGSWC-2022, November 21–23, 2022, Madrid, Spain
∗ Corresponding author.
† These authors contributed equally.
Email: nj.ifeanyi-reuben@unizik.edu.ng (N. J. Ifeanyi-Reuben); patienceusip@uniuyo.edu.ng (P. U. Usip)
ORCID: 0000-0002-6516-5194 (P. U. Usip)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
Text representation is the selection of appropriate features to represent a document [1]. The way in which text is represented has a large effect on the performance of any text-based application [2], and it is strongly influenced by the language of the text. The spread of Information Technology (IT) in everyday activities has brought the Igbo language into text-based applications such as text creation, web creation, text mining, information retrieval and natural language processing. This research improves on the existing work of [8]
on the analysis and representation of Igbo text for a text-based system by incorporating the semantic representation of the text, in order to create detailed notations of the text that accurately convey its meaning. Semantic representation of textual documents is very rich and is adopted in many applications of Natural Language Processing (NLP) such as machine translation, information retrieval, question answering, text classification, sentiment analysis, text summarisation and text extraction. It reflects the meaning of the text as it would be understood by humans, and thus helps to facilitate various automated language processing applications. The research in [3, 4, 5, 6] emphasized that the semantic representation of Arabic text can facilitate several natural language processing applications such as text summarization and textual entailment.
Semantic representation can be achieved using a Knowledge Graph (KG). A Knowledge Graph is a way to represent and organize data so that it is easier to modify, use and understand [7]. It is also described as a collection of interlinked descriptions of concepts, entities, relationships and events, connected through linking and semantic metadata, which provides a framework for data integration, unification, analytics and sharing. With the widespread growth of Igbo data on the Web, the need for efficient methods to extract and organize valuable information from this large, noisy data has increased. This research presents an Igbo Knowledge Graph (IKG) for representing data written in the Igbo language, to improve the performance of Igbo text-based applications.
This smart representation of Igbo text will be useful for many purposes such as question answering, summarization and information retrieval. The model will also help to discover unidentified facts and hidden knowledge that may exist in the lexical or semantic relations of an Igbo text corpus.

1.1. Language classification
A language is a method of communication between individuals who share a common code in the form of symbols [8]. In linguistics, there are two kinds of language classification: genetic (or genealogical) and typological. Genetic classification groups languages into families according to their degree of diachronic relatedness. Examples of languages grouped genealogically are German, English, Dutch, Swedish, Norwegian, Danish, Irish, Welsh and Breton. Typological classification groups languages into types according to their structural characteristics, which can be phonological, morphological or syntactic. One morphological type is the agglutinative type, in which words are formed by agglutination; examples are Igbo, Turkish, Finnish and Japanese [9].

Igbo Language
The Igbo language is an agglutinative language, that is, a language that forms words by combining smaller morphemes into compound words. It is one of the three major languages in Nigeria and is largely spoken by the people in the eastern part of the country. Igbo has many dialects. Standard Igbo is used formally and is adopted for this research. The current Igbo orthography [8] is based on Standard Igbo. Orthography is the set of conventions for writing a language. Standard Igbo has thirty-six (36) letters (a, b, ch, d, e, f, g, gb, gh, gw, h, i, ị, j, k, kw, kp, l, m, n, nw, ny, ṅ, o, ọ, p, r, s, sh, t, u, ụ, v, w, y, z). The Igbo language has a large number of compound words.
A compound word is a word that has more than one root, and can be formed from a combination of nouns, pronouns or adjectives. Ifeanyi-Reuben et al. [8] studied Igbo compound words and categorized them as follows:
i. Nominal (NN) Compound Words: A nominal compound word is formed by the combination of two or more nouns. Nominal compound words are written separately, regardless of the semantic status of the nouns in Igbo. Examples: nwa akwụkwọ – student; onye nkuzi – teacher; ama egwuregwu – stadium; ụlọ ọgwụ – hospital; ụlọ akwụkwọ – school.
ii. Agentive Compound Words: In an agentive compound word, one or more nouns express the meaning of the agent, the doer of the action. Igbo agentive compound words are written separately irrespective of their translations in English. They can also be referred to as VN (Verb Noun) compound words. Examples: oje ozi – messenger; oti ịgba – drummer.
iii. Igbo Duplicated Compound Words: Igbo duplicated compound words are formed by repeating the exact word two or more times to show a variety of meaning. Examples: ọsọ ọsọ – quickly; mmiri mmiri – watery; ọbara ọbara – reddish.
iv. Igbo Coordinate Compound Words: This kind of compound word is formed by the combination of two or more words joined by the Igbo conjunction “na”, meaning “and” in English. All Igbo compound words in this category are written separately. Examples: ezi na ụlọ – family; okwu na ụka – quarrel.
v. Igbo Proper Compound Words: This category of Igbo compound words includes personal names, place names and club names. All words in this category are written together, no matter how long they may be. Examples: Uchechukwu; Ngozichukwuka; Ifeanyichukwu.
vi. Igbo Derived Compound Words: Derived Igbo compound words are words derived from verbs or phrases. The roots of derived Igbo compound words are written together. Example: Dinweụlọ – landlord.
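The categorization above can be pictured as a small lookup structure. The following Python sketch is illustrative only: the names `CompoundWord`, `COMPOUNDS`, `written_together` and `by_category` are assumptions introduced here and are not part of the system described in this paper. The entries are examples taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundWord:
    form: str      # surface form as written in Standard Igbo
    gloss: str     # English meaning
    category: str  # one of the six categories listed above


# Hypothetical container holding a few of the examples listed above.
COMPOUNDS = [
    CompoundWord("nwa akwụkwọ", "student", "nominal"),
    CompoundWord("onye nkuzi", "teacher", "nominal"),
    CompoundWord("oje ozi", "messenger", "agentive"),
    CompoundWord("ọsọ ọsọ", "quickly", "duplicated"),
    CompoundWord("ezi na ụlọ", "family", "coordinate"),
    CompoundWord("Uchechukwu", "(personal name)", "proper"),
    CompoundWord("Dinweụlọ", "landlord", "derived"),
]


def written_together(word):
    """True when the compound is written as a single orthographic word."""
    return " " not in word.form


def by_category(words):
    """Group compound words by their category name."""
    groups = {}
    for word in words:
        groups.setdefault(word.category, []).append(word)
    return groups


for category, members in by_category(COMPOUNDS).items():
    print(category, [w.form for w in members])
```

In this sketch only the proper and derived examples are single orthographic words, which mirrors the writing rules stated for categories v and vi.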
Igbo, being an agglutinative language, has a huge number of compound words and can be referred to as a language of compound words. The proposed Igbo Knowledge Graph representation takes this peculiarity into account in order to obtain good results.

2. Related Works
Ifeanyi-Reuben et al. [8] present an analysis of Igbo language text documents and describe their representation with the word-based n-gram model. Their results show that the bigram and trigram text representation models perform better than the unigram model.
Etaiwi and Awajan [3] proposed a graph-based semantic representation model for Arabic text. The model aims to extract the semantic relations between Arabic words. The results showed that the graph-based model enhances the performance of the textual entailment recognition task in comparison to other baseline models.
Zhang, Yoshida and Tang [10] studied and compared the performance of TF*IDF, LSI (Latent Semantic Indexing) and multi-words for text representation. They used Chinese and English corpora to assess the three techniques in information retrieval and text categorization. Their results showed that LSI produced the best performance in retrieving English documents and also the best performance in Chinese text categorization.
Tsai [11] improved and applied Bag of Words (BOW) to image annotation. Image annotation is used to allocate keywords to images automatically, with the images represented using characteristics such as color, texture and shape. This is applied in Content-Based Image Retrieval Systems (CBIRS), where the retrieval of an image is based on indexed image features.
Usip and Ntekop [12] posited that ontology is a necessary technology tool for easy and intelligent reasoning with knowledge. Since an ontology is the underlying schema for every knowledge graph, this study will improve the existing work of Ifeanyi-Reuben et al.
[8] by adding intelligence to the work using a Knowledge Graph. Ontology-driven applications for multilinguality were described by Usip and Ekpenyong [13].
Etaiwi and Awajan [14] proposed SemG-TS, a novel semantic graph embedding-based abstractive text summarization model for the Arabic language, which employs a deep neural network to generate abstractive summaries. The results obtained show that the SemG-TS model outperforms the popular baseline word embedding technique, word2vec.

3. Methodology
Most of the concerns in any text-based system can be attributed to text representation, given the peculiarities of the natural language involved. In this section, we propose an efficient and effective model for representing Igbo text that can be adopted by any text-based system. Text representation is the process of transforming an unstructured Igbo textual document into a form suitable for automatic processing. This is a vital step in text processing because it affects the overall performance of the system. The proposed approach for the Igbo text representation process is shown in Figure 1.

Figure 1: Igbo Text Representation Process.

3.1. Text Preprocessing
Text preprocessing involves tasks that are performed on text to convert the original natural language text into a structure ready for processing. It performs very important functions in different text-based systems. The tasks are Igbo text normalization, Igbo text tokenization and Igbo stop-words removal.
Igbo Text Normalization: In the normalization process, we transformed the Igbo textual document into a format in which its contents are consistent, convenient and made up of full words, for efficient processing. We converted all text to lower case and also removed diacritics and noisy data. Noisy data is taken to be any data that does not belong to the Igbo dataset.
Text Tokenization: Tokenization is the task of analyzing or separating text into a sequence of discrete tokens (words).
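The normalization and tokenization steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `normalize` and `tokenize` are hypothetical, "diacritics" is interpreted here as tone marks only (the dot-below and dot-above marks are kept because ị, ọ, ụ and ṅ are letters of the Standard Igbo alphabet), and "noisy data" is approximated as any character that is neither a letter nor whitespace.

```python
import re
import unicodedata

# Combining marks treated as tone marks and removed (an assumption of this
# sketch): grave, acute, circumflex and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0302", "\u0304"}


def normalize(text):
    """Lower-case the text, strip tone marks and drop non-letter characters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    without_tones = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    recomposed = unicodedata.normalize("NFC", without_tones)
    cleaned = "".join(
        ch
        if ch.isalpha() or ch.isspace() or unicodedata.category(ch).startswith("M")
        else " "
        for ch in recomposed
    )
    # Collapse the runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


def tokenize(text):
    """Split normalized text into word tokens on whitespace."""
    return normalize(text).split()


print(tokenize("Ụlọ Akwụkwọ, 2022!"))
```

Whether digits and punctuation count as noise depends on the application; the sketch simply discards them.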
Igbo Stop-words Removal: Stop-words are language-specific functional words; they are the most frequently used words in a language and usually carry no information. There is no fixed number of stop-words that every Natural Language Processing (NLP) tool should have. Most stop-words in a language are pronouns, prepositions and conjunctions. This task removes the stop-words in Igbo text. Some Igbo stop-words are shown in Figure 2. In the proposed system, a stop-word list is created and saved in a file named “stop-words”, which is loaded into the system whenever the task is performed.

Figure 2: Sample of Igbo Stop-words list.

3.2. Knowledge Graph Text Representation
Knowledge graphs combine characteristics of several data management paradigms:
• Database: The data can be explored via structured queries.
• Graph: The data can be analyzed as any other network data structure.
• Knowledge base: The model bears formal semantics, which can be used to interpret the data and infer new facts.
An Igbo knowledge graph will provide a good framework for Igbo data integration, unification, linking and reuse.

4. Sample Igbo Text and the Corresponding Proposed Knowledge Graph
Given the examples of Igbo compound words in Table 1, it is observed that the actual meaning of an Igbo compound word is not the same as the meanings of its roots after decomposition into single Igbo words. Hence the need for the compound word categorization. Following the categorization of the Igbo compound words, a knowledge graph representation of Igbo words and the various categories is given in Figure 3. The underlying ontology used as the schema for the knowledge graph holds the domain knowledge, which includes the bilingual corpora of single Igbo words and their English meanings, the n-gram modeling feature, and the resulting Igbo compound words classified according to the compound word categorization.
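The kind of relationships the schema describes (compound word to category, compound word to roots, word to English meaning) can be sketched as subject, predicate, object triples. The Python below is a hedged illustration, not the ontology or tooling used in this work: the `TripleStore` class and the predicate names `hasCategory`, `hasRoot` and `hasMeaning` are hypothetical, and the data are two entries from Table 1.

```python
class TripleStore:
    """A tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._edges = {}

    def add(self, subject, predicate, obj):
        self._edges.setdefault(subject, set()).add((predicate, obj))

    def objects(self, subject, predicate):
        """All objects linked to `subject` by `predicate`, sorted."""
        return sorted(o for p, o in self._edges.get(subject, ()) if p == predicate)

    def subjects(self, predicate, obj):
        """All subjects that have the edge (`predicate`, `obj`), sorted."""
        return sorted(s for s, edges in self._edges.items() if (predicate, obj) in edges)


kg = TripleStore()
# Hypothetical predicate names; word data taken from Table 1.
for compound, meaning, category, roots in [
    ("onye nkuzi", "teacher", "Nominal", {"onye": "person", "nkuzi": "teach"}),
    ("egbe igwe", "thunder", "Nominal", {"egbe": "gun", "igwe": "sky"}),
]:
    kg.add(compound, "hasMeaning", meaning)
    kg.add(compound, "hasCategory", category)
    for root, root_meaning in roots.items():
        kg.add(compound, "hasRoot", root)
        kg.add(root, "hasMeaning", root_meaning)


def root_meanings(store, compound):
    """Map each root of a compound to its own English meanings."""
    return {r: store.objects(r, "hasMeaning") for r in store.objects(compound, "hasRoot")}


print(kg.objects("egbe igwe", "hasMeaning"), root_meanings(kg, "egbe igwe"))
```

Querying a compound next to its roots makes the point of this section concrete: the meaning stored for "egbe igwe" cannot be read off the meanings stored for "egbe" and "igwe".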
From the knowledge graph, the relationships among the various Igbo compound words, single Igbo words and their English meanings can be determined and used towards the construction of a semantically correct bilingual Igbo-English dictionary consisting of both single and compound Igbo words. With the knowledge graph, missing links between Igbo compound and single words can be determined at a glance for proper restructuring and fixing to produce a semantically correct Igbo word.

Table 1
Igbo Compound Words [8]
Igbo Compound Word | Meaning | Roots and meaning | Compound Word Category
Onye nkuzi | Teacher | Onye – person; Nkuzi – teach | Nominal
Ezi na ụlọ | Family | Ezi – surrounding; Na – and; ụlọ – family | Coordinate
Ojiiegoachọego | Businessman | Ojiiego – use money; achọego – find money | Derived
ụgbọ ala | Car, motor | ụgbọ – vessel; ala – land (road) | Nominal
Egbe igwe | Thunder | Egbe – gun; Igwe – sky | Nominal
Iri abụọ | Twenty | Iri – ten; Abụọ – two | Nominal
Ode akwụkwọ | Secretary | Ode – write; Akwụkwọ – book | Agentive
Eberechukwu | God’s mercy | Ebere – mercy; Chukwu – God | Proper
Mmiri mmiri | Watery | Mmiri – water | Duplicate
ọcha ọcha | Whitish | ọcha – white | Duplicate
Onye nchekwa | Administrator | Onye – person; Nchekwa – protect | Nominal
Kọmputa Nkunaka | Laptop | Kọmputa – computer; Nkunaka – handcarry | Nominal
Ọkpụ ụzụ | Blacksmith | ọkpụ – mold; ụzụ – clay | Agentive
Nche anwụ | Umbrella | Nche – protect; Anwụ – sun | Agentive
Onyonyo kọmputa | Monitor | Onyonyo – screen; Kọmputa – computer | Nominal
Okwu ntughe | Password | Okwu – speech; Ntughe – opening | Nominal

Figure 4 shows the designed Igbo knowledge graph model, with all the processes used to represent an Igbo textual document based on its semantics (reasoning) using a knowledge graph.

5. System Performance Evaluation
The system performance is evaluated by computing the precision, recall and F1-measure. Precision is defined as the quotient of the total number of true positives (TP) and the sum of the total numbers of true positives and false positives (FP). Precision is regarded as a measure of correctness.
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

Figure 3: Knowledge Graph Representation of Sample Igbo Compound.

Recall of the classification system is defined as the quotient of the total number of TPs and the sum of the total numbers of TPs and false negatives (FN). Recall measures completeness.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

F1-Measure is a single function that combines recall and precision. A high F1-measure indicates that the overall performance of the text classification system is high.

\[ \text{F1-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

\[ = \frac{2TP}{2TP + FP + FN} \tag{4} \]

In summary, the evaluator takes four input parameters: TP, FP, TN and FN, of which TN does not appear in Equations (1) to (4).
i. TP - total number of text documents correctly assigned to a document class.
ii. FP - total number of text documents wrongly assigned to a document class.
iii. FN - total number of text documents wrongly rejected from a document class.
iv. TN - total number of text documents correctly rejected from a document class.
These parameters are obtained from the classification result and are the input to the evaluator.

Figure 4: Igbo Knowledge Graph Model

6. Experiments
This section describes the practical implementation of the theoretical design of the proposed model. The semantic representation of Igbo text using a knowledge graph is implemented in an Igbo text classification system with Python and tools from the Natural Language Toolkit (NLTK).
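For orientation, the unigram and bigram baselines against which the semantic representation is compared can be illustrated without any library. The sketch below is an assumption-laden illustration, not the NLTK-based code used in the experiments: `STOP_WORDS` is a tiny made-up set (it is not the list in Figure 2), the helper names are hypothetical, and the token sequence is simply built from words that appear in Section 1.1.

```python
# Illustrative stop-word set only; the actual list is the one in Figure 2.
STOP_WORDS = {"na", "nke"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the word n-grams of a token sequence as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up token sequence using compounds from the categorization in Section 1.1.
tokens = ["onye", "nkuzi", "na", "ụlọ", "akwụkwọ"]
filtered = remove_stop_words(tokens)

unigrams = ngrams(filtered, 1)
bigrams = ngrams(filtered, 2)
print(unigrams)
print(bigrams)
```

In this toy example the unigrams split both compounds into their roots, while the bigrams recover "onye nkuzi" and "ụlọ akwụkwọ" but also produce the unrelated pair ("nkuzi", "ụlọ"). This is the kind of ambiguity a semantic representation is meant to resolve.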
Table 2
Summary of Bigram, Semantic-Based and Unigram Text Classification result
Text Document | Bigram | Semantic-Based | Unigram
IgboText1 | Administration | Administration | Administration
IgboText10 | Religion | Religion | Religion
IgboText11 | Computer | Computer | Administration
IgboText2 | Religion | Computer | Religion
IgboText4 | Administration | Administration | Administration
IgboText5 | Religion | Administration | Administration
IgboText6 | Computer | Computer | Administration
IgboText7 | Administration | Administration | Administration
IgboText8 | Religion | Religion | Religion
IgboText9 | Religion | Religion | Religion

Table 3
Performance Measure Result
Category | TP | TN | FP | FN | Precision | Recall | F1-Measure
Bigram | 8 | 0 | 2 | 0 | 0.80 | 1.00 | 0.89
Semantic-Based | 9 | 0 | 1 | 0 | 0.90 | 1.00 | 0.95
Unigram | 7 | 0 | 3 | 0 | 0.70 | 1.00 | 0.82

7. Result Analysis
Figure 5 displays the text classification module of the system used to test the effectiveness of the proposed model. The result obtained in text classification on Igbo text represented semantically using a knowledge graph is compared with the results obtained with the unigram and bigram text representation models. Table 3 and Figure 6 show the classification performance measure result and chart respectively. The recall, precision and F1-measure for bigram represented Igbo text are 1.00, 0.80 and 0.89 respectively. The recall, precision and F1-measure for semantic-based represented Igbo text are 1.00, 0.90 and 0.95 respectively. The recall, precision and F1-measure for unigram represented Igbo text are 1.00, 0.70 and 0.82 respectively.
Recall evaluates the degree of completeness. The result shows that Igbo text classification on text represented with the three models (bigram, semantic-based and unigram) has the same level of recall (completeness). This means that every text document given to the classifier was assigned a class label. Precision measures the degree of exactness. The classification on semantic-based represented text has the highest degree of exactness (precision), 0.90.
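The figures in Table 3 can be re-derived from the TP, FP and FN counts using Equations (1) to (4). The short Python check below is a sketch for the reader; the function name `prf1` and the `counts` dictionary are introduced here for illustration and are not part of the evaluated system.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1-measure as in Equations (1), (2) and (4)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1


# (TP, FP, FN) per representation, copied from Table 3.
counts = {
    "Bigram": (8, 2, 0),
    "Semantic-Based": (9, 1, 0),
    "Unigram": (7, 3, 0),
}

for name, (tp, fp, fn) in counts.items():
    precision, recall, f1 = prf1(tp, fp, fn)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Rounded to two decimals, the computed values agree with the Precision, Recall and F1-Measure columns of Table 3.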
Table 2 gives the summary of the classification results obtained with the bigram, semantic-based and unigram text representations. A total of 10 test documents were used for the experiment. With the bigram representation, 8 documents are correctly assigned a class label while 2 are incorrectly assigned. With the semantic-based text representation using a knowledge graph, 9 documents are correctly assigned a class label while 1 is incorrectly assigned. With the unigram representation, 7 documents are correctly assigned a class label while 3 are incorrectly assigned.

8. Conclusion
An improved intelligent approach for representing Igbo text documents using a Knowledge Graph model, which takes the agglutinative nature of the Igbo language into account, is proposed. It addresses the issues of collocation, compounding and word ordering that play major roles in the language, thereby making the representation semantically enriched. The model is implemented and evaluated using an Igbo text classification system. The performance was measured by computing the precision, recall and F1-measure of classification on bigram, semantic-based and unigram represented text. The result showed that classification performed on semantic-based represented text performs better than classification on bigram and unigram represented texts. This shows that a high-quality text representation model improves the performance of NLP tasks. The model has high commercial potential and will be useful in any text-based intelligent system for the language. It will also motivate other researchers to take an interest in doing more research on Igbo language processing, to the benefit of people and society.

9. Acknowledgments
The authors wish to express their gratitude to the anonymous reviewers of this work for their useful comments and contributions, which helped to improve the quality of this paper.

References
[1] D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Text classification improved through multigram models, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 672–681.
[2] D. D. Lewis, Representation quality in text classification: An introduction and experiment, in: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.
[3] W. Etaiwi, A. Awajan, Graph-based Arabic text semantic representation, Information Processing & Management 57 (2020) 102183.
[4] S. Tiwari, P. Siarry, S. Mehta, M. Jabbar, Tools, Languages, Methodologies for Representing Semantics on the Web of Things, John Wiley & Sons, 2022.
[5] R. Panchal, P. Swaminarayan, S. Tiwari, F. Ortiz-Rodriguez, AISHE-Onto: a semantic model for public higher education universities, in: DG.O 2021: The 22nd Annual International Conference on Digital Government Research, 2021, pp. 545–547.
[6] F. Ortiz-Rodriguez, S. Tiwari, R. Panchal, J. M. Medina-Quintero, R. Barrera, MEXIN: Multidialectal ontology supporting NLP approach to improve government electronic communication with the Mexican ethnic groups, in: DG.O 2022: The 23rd Annual International Conference on Digital Government Research, 2022, pp. 461–463.
[7] I. A. Ahmed, F. N. AL-Aswadi, K. M. Noaman, et al., Arabic knowledge graph construction: a close look in the present and into the future, Journal of King Saud University-Computer and Information Sciences (2022).
[8] U. Chidiebere, A. Tunde, et al., Analysis and representation of Igbo text document for a text-based system, arXiv preprint arXiv:2009.06376 (2020).
[9] M. Onukawa, The writing of Standard Igbo, in: O. O. Okereke (Ed.), Readings in Citizenship Education, Okigwe: Wythem Publishers (2001).
[10] W. Zhang, T. Yoshida, X. Tang, Text classification based on multi-word with support vector machine, Knowledge-Based Systems 21 (2008) 879–886.
[11] C.-F. Tsai, Bag-of-words representation in image annotation: A review, International Scholarly Research Notices 2012 (2012).
[12] P. U. Usip, M. Ntekop, The use of ontologies as efficient and intelligent knowledge management tool, in: 2016 Future Technologies Conference (FTC), IEEE, 2016, pp. 626–631.
[13] P. U. Usip, M. E. Ekpenyong, Towards ontology-driven application for multilingual speech language therapy, in: Human Language Technologies for Under-Resourced African Languages, Springer, 2018, pp. 85–101.
[14] W. Etaiwi, A. Awajan, SemG-TS: Abstractive Arabic text summarization using semantic graph embedding, Mathematics 10 (2022) 3225.

Figure 5: Igbo Text Classification System Result
Figure 6: Igbo Text Classification System Performance Measure Result Chart