UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language Maksud Sharipov 1, Ollabergan Yuldashov 1 1 Urgench State University, 14. Kh.Alimdjan str, Urgench City, 220100, Uzbekistan Abstract In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs. Keywords 1 Uzbek stemmer, stemming, Uzbek language, finite state machines, natural language processing 1. Introduction Stemming is defined as the conflation of all variations of specific words to a single form called the root or stem. Stemming algorithms for some languages have been published and applied in the building of information retrieval systems, among which for English is the well-known Porter’s algorithm [1,2]. According to the stemming principle, stemming algorithms can be divided into four categories, that are rule-based (truncating, affix removal) method, dictionary look-up (table lookup) method, statistical method, and mixed-method [3]. This article presents [4] the application of the stemming task of verbs in the low-resource highly- agglutinative Uzbek language, one of the most important aspects of Natural Language Processing (NLP). The methodology is proposed for doing the stemming of the Uzbek verb words with an affix stripping approach whereas not including any lexicon. Verb affixes are classified into three classes and designed as finite state machines (FSMs) for each class according to morphological rules. Previous work on stemming algorithms for the Uzbek language [5] were used for normalizing texts. When developing the algorithm, a hybrid approach was used based on the joint application of an algorithmic method, a lexicon of linguistic rules, and a database of normal forms of words in the Uzbek language. The article [2] offers an intelligent web application developed for the morphological analysis of words in the Uzbek language. The web application is based on the concept of generation and stem analysis of the Uzbek language word forms. A well-known Porter algorithm was chosen as the basis for our stemming approach. Stemming is an important stage of natural language processing. The research in this paper is a contribution to the development of the stemming tool for the Uzbek language. Due to the fact that the Uzbek language belongs to the family of agglutinative languages, words are formed as a result of a series of suffixes added to the stem word, which makes it difficult to find the 1 The International Conference on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 6, 2022, Koper, Slovenia EMAIL: maqsbek72@gmail.com (M.Sharipov); ollaberganyuldashov@gmail.com (O.Yuldashov); ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) stem of the word [6]. For instance, a single word “O‘qi/ma/gan/lar/dan/mi/siz?” (Are you from those who did not read?), which in fact is a question, can be parsed as following: o‘qi to read o‘qi/ma don’t read o‘qi/ma/gan he/she di not read o‘qi/ma/gan/lar they did not read o‘qi/ma/gan/lar/dan from those who did not read o‘qi/ma/gan/lar/dan/mi/siz are you from those who did not read 2. Finite State Machines(FSM) Generation Finite-state technology is very simple and is used a lot for many kinds of tasks in computational linguistics. This technology may be used in phonology, morphology, syntax, spell-checking, tokenization, text-to-speech, data mining, and other fields. The two terms finite-state machines and finite-state automata are synonymous. A finite-state machine is a machine consisting of a set of states with a set of allowable inputs which change the state as well as a set of outputs [7,8]. We use FSMs to create the stemming algorithm, which allows us to analyze words from right to left. There are four stages to create these FSMs [9]: ● Creating a left to right FSM; ● Labeling the suffixes ● Inverting the left to right FSM and obtaining a non deterministic finite state automaton (NFA) ● Converting NFA to a deterministic finite automaton (DFA) and constructing the right to left FSM Affixes were divided into 15 classes according to FSM and were stored in XML format. Table 1 Affix classes Class Class name Type Affixes Allomorphs 1 Particle suffixes Inflectional 10 12 2 Declension suffixes Inflectional 21 29 3 Conjugation suffixes Inflectional 34 15 4 Participle & Gerund suffixes Inflectional 12 19 5 Verbal Adverb suffixes Inflectional 10 20 6 Relative verb suffixes Inflectional 15 25 7 Derivational [from verb to noun] Derivational 8 13 8 Noun & Adjective suffixes Inflectional 11 12 9 Number suffixes Inflectional 11 12 10 Pronouns suffixes Inflectional 2 2 11 Prefixes Derivational 7 7 12 Derivational [Verb] suffixes Derivational 67 73 13 Derivational [Adjective] suffixes Derivational 113 137 14 Derivational [Noun] suffixes Derivational 106 124 15 Derivational [Adverb] suffixes Derivational 27 33 Total 454 533 An affix in Uzbek can have multiple allomorphs in order to provide sound harmony (as the phonological rules) in the word to which it is affixed. For example, the Declension suffix with generic representation –Ga has three allomorphs: –ga, –ka, –qa. The abbreviations used to show suffixes in a generic way are shown below: G: g, k, q Y: a, y K: k, g Q: k, g, g’, q T: t, d A: a,o (): the letter between parentheses can be omitted All affixes are stored in the Suffixes.xml file. The structure of the XML file is as follows: ... ... ... ... ... ... Here, the ... tag is the root tag, ... teg for affixes, there are fsm_id(for FSM id in Table 1), suff_id(for affixes id in each FSM Suffixes table), group="" category="" class="" type="" ( for to express the internal division of affixes in Table 2 within word classes) attributes in .... The and tags are used for exceptions. The and tags indicate the state before and after the affixes under consideration in the context of open word classes. If the allomorph attribute of the ... tag is true, all allomorphs of the affixes are entered as: ... ... ... ... ... ... 2.1. Declension Suffixes Declension suffixes are added in Uzbek after open word classes such as noun, adjective, pronoun, numeral and past participle. First, we learn the sequence of affixes, then we build the FSM from left to right [6,10,11,12,13]. In Figure 1, a numerical value is indicated a state. In the following stages, the states will be expressed with these numbers: 0 – ending state, 1 – initial state. The character “ε” means the empty transitions between the states. The final state is represented by 2 circles. In the next step, the affixes in Figure 1 are numbered. In the next steps, the sequence number in Table 2 is used instead of the affixes. In Figure 2, the right to left NFA is shown, in which the FSM position from left to right in reverse order is constructed from right to left. Figure 1: Declension Suffixes left to right FSM Figure 2: Declension Suffixes right to left NFA Let us look at some examples of how the following Declension suffixes are made from left to right finite automata (Figure 1): qishlog‘ (1) → im (2) → dan (5) → siz (6) → lar (0) (You are from my village), where the numbers in parentheses indicate the transitions in Figure 1. The large number of gaps in the FSM from left to right and the depth of the FSM complicate the construction of the FSM from right to left. Table 3 illustrates the transition from NFA to DFA. In DFA, there is only one way out of each entry and exit. There will be no empty transaction at DFA. Table 2 Declension Suffixes 1. –(i)m 8. –Ga 15. –siz 2. –(i)ng 9. –da 16. –lar 3. –s(i) 10. –dan 17. –(n)iki 4. –(i)miz 11. –man 18. –dagi 5. –(i)ngiz 12. –san 19. –day 6. –ning 13. –dir 20. –dek 7. –ni 14. –miz 21. –gina Figure 3: Declension Suffixes right to left FSM If we define the input state of the DFA as A, it includes the input state of the NFA 0 and the states 1,2,4,5,6,7,8 that can pass through the 0-statement empty transition. Entry in DFA A state includes all {0,1,2,4,5,6,7,8} initial cases in NFA. We analyze the new states formed as a result of trimming the suffixes of each state, after the initial case. For example, if the initial state of a word in state A is defined by the suffixes "1-5"(1 - (i)m, 2 - (i)ng, 3 - (s)i, 4 - (i)miz, 5 - (i)ngiz) in Table 2, it will cut the suffix defined on the right side of the word and move the word to the next state B. When depicting the FSM from right to left, the state contains the final state of the NFA - 1, while from right to left the state of the FSM is the final state and is represented by 2 circular circles. In this case, the FSM ends in the state if no other suffix is found after the word that cut the suffix and the incoming state is the final state. Table 3 Declension Suffixes NFA to DFA Operation A={0,1,2,4,5,6,7,8} “13,19,20” : T={3} →{1,2,3} →K “1-5” : T={1,7} →B “6,17,18” : T={2} →{1,2} →J “6-10,17,18” : T={2,8} →{1,2,7,8} →C F={1,2,5,6} “11,12,14,15” : T={5} →{1,2,5} →D “1-5” : T={1} →I “13” : T={3,5,8} →{1,2,3,5,7,8} →E “9,10,17” : T={2} →{1,2} →J “16” : T={6} →{1,2,5,6} →F “12,13,15” : T={5} →{1,2,5} →D “19,20” : T={3,8} →{1,2,3,7,8} →G G={1,2,3,7,8} “21” : T={1,4} →{1,2,4} →H “1-5” : T={1,7} →B B={1,7} “21” : T={1} →I “21” : T={1} →I “6,17,18” : T={2} →{1,2} →J C={1,2,7,8} H={1,2,4} “1-5” : T={1,7} →B “1-5” : T={1} →I “21” : T={1} →I “6-10,17,18” : T={2} →{1,2} →J D={1,2,5} J={1,2} “1-5” : T={1} →I “1-5” : T={1} →I “9,10,17” : T={1,2} →J K={1,2,3} E={1,2,3,5,7,8} “1-5” : T={1} →I “1-5” : T={1,7} →B “6,17,18” : T={2} →{1,2} →J “21” : T={1} →I 2.2. Noun & Adjective Suffixes Figure 4 shows the FSM of the Noun & Adjective suffixes from left to right. This is an integral continuation of the Declension additions. This FSM has a relatively simple structure. Table 4 Noun & Adjective Suffixes 1. –cha 7. –(a)loq 2. –xon 8. –roq 3. –oy 9. –imtir 4. –choq 10. –ish 5. –chak 11. –lar 6. –jon Figure 5: Noun & Adjective Figure 4: Noun & Adjective Suffixes right to left FSM Suffixes left to right FSM If we analyze the word “yaxshiroqlaridan” (from things/people which are better). The word initially passes in Figure 3, through Declension Suffixes FSM from right to left, and in this FSM, 10(dan) suffix of the word are cut and go from A to C state, then FSM cut the 3((s)i) suffix and it move from C to B state: “yaxshiroqlaridan”( A– 10 → C ) → “yaxshiroqlari”( C – 3 →B). It stops here because no suffixes after B state are found and B state is the final state. The resulting “yaxshiroqlar”(things/people which are better) word is passed to the next Noun&Adjective Suffixes FSM in Figure 5. In this FSM, the suffix 11(lar) is cut and moves from state A to state B, then cut the suffix 8 (roq) and move from B to C state: “yaxshiroqlar”( A – 11 → B) → “yaxshiroq” ( B – 8 → C) → “yaxshi” (good). The result is the stem of the word. 2.3. Derivational [Noun] Suffixes Figure 6: Derivational [Noun] Suffixes left to right FSM Most suffixes in Uzbek belong to the category of nouns[14]. Figure 6 depicts the left to right Noun Derivational Suffixes. In Uzbek, Derivational suffixes belonging to several different word groups can be added to the same base. In order to reduce errors in Figure 6, other word group suffixes that are added to the base before Noun Suffixes are also given without spaces. Table 5 Derivational [Noun] Suffixes 1. –shunos 23. –i 45. –manchi 67. –(i)ndi 89. –(r)uvchan 2. –(ma)kor 24. –bon 46. –gir 68. –ich 90. –chang 3. –paz 25. –k 47. –don 69. –machoq 91. –gin 4. –soz 26. –q 48. –zor 70. –movchilik 92. –mas 5. –gar 27. –v 49. –navis 71. –mish 93. –ta 6. –do‘z 28. –moq 50. –xo‘r 72. –inch 94. –g‘uch(i) 7. –go‘y 29. –a 51. –ch 73. –loq 95. –imli 8. –furush 30. –um 52. –(i)m 74. –chiq 96. –imsiz 9. –At 31. –im 53. –imdon 75. –goh 97. –tak 10. –on 32. –ma 54. –uv 76. –archilik 98. –an 11. –qin 33. –dor 55. –cha 77. –ildoq 99. –vur 12. –boz 34. –qoq 56. –iyat 78. –bin 100. –inki 13. –oq 35. –choq 57. –dosh 79. –noma 101. –imtik 14. –in 36. –dak 58. –la 80. –g‘u 102. –(a)loq 15. –mand 37. –iq 59. –o 81. –viy 103. –yi 16. –ak(i) 38. –ik 60. –vor 82. –chil 104. –liK 17. –(ma)kash 39. –Gi 61. –n 83. –Qir 105. –chiliK 18. –iy 40. –doq 62. –xona 84. –inchoq 106. –garchiliK 19. –parast 41. –xon 63. –gun 85. –ag‘on 20. –li 42. –uq 64. –qun 86. –sh 21. –siz 43. –kov 65. –g‘in 87. –chak 22. –chi 44. –iston 66. –diq 88. –(ov)chan Figure 7: Derivational [Noun] Suffixes right to left FSM 2.4. Main FSM Figure 8: Final FSM Figure 9: Final FSM Tree The main FSM has entry from 2 FSMs: Particle and Derivational [Adv]. If adverb affixes are added to the stem, inflectional ones are not added. A sequence of connections of FSMs is given in the main FSM. Each word should be in the entry part of the main FSM. Words are passed from one FSM to another. There are 3 outputs in the main FSM: Prefixes, Pronouns, and Numbers. There are 9 paths in the main FSM. The word is passed through all the ways whichever way many affixes are stripped, we take the word in this way as a stem. Depending on the way in which the stem is found, it is possible to tell which word classes it belongs to. 3. Conclusion & Future Work Currently, the issue of creating an Uzbek stemmer has not been resolved due to insufficient NLP resources for the Uzbek language. Therefore, it is developed for Uzbek by using the rule-based structure of the language. In Uzbek sentences, words are usually concatenated with lots of suffixes, that is the reason an information retrieval system needs a quick performing stemming algorithm. In systems that work with root detection, it is recommended not to use a dictionary to increase application speed. Therefore, this study does not use any word databases. The considered method in this study can be the basis for the development of an Uzbek stemming algorithm and according to the approaches (algorithm) the python library developed that is easy to install and use in NLP applications. The software and instructions can be reached from the address [15]. This will be applied to develop a software in information retrieval system for Uzbek documents. 4. References [1] Porter MF. An algorithm for suffix stripping. 1980; 130–137. [2] Mengliev D, Barakhnin V, Abdurakhmonova N. Development of intellectual web system for morph analyzing of uzbek words. Applied Sciences (Switzerland) 2021; 11. [3] Tengku Mohd T. Sembok, Belal Mustafa Abu Ata, Zainab Abu Bakar. A Rule and Template Based Stemming Algorithm for Arabic Language. 2011 974–981 (May 28, 2022, date last accessed). [4] Maksud Sharipov, Ulugbek Salaev, Gayrat Matlatipov. IMPLEMENTED STEMMING ALGORITHMS BASED ON FINITE STATE MACHINE FOR UZBEK VERBS | COMPUTER LINGUISTICS: PROBLEMS, SOLUTIONS, PROSPECTS. 2021 (May 28, 2022, date last accessed). [5] Bakayev I. View of DEVELOPMENT OF A STEMMING ALGORITHM BASED ON A LINGUISTIC APPROACH FOR WORDS OF THE UZBEK LANGUAGE. 2021 195–202 (May 28, 2022, date last accessed). [6] Sapayev Qalandar. Hozirgi o’zbek tili(morfemika, so’z yasalishi va morfologiya). 2009. [7] Tayebeh Mosavi Miangah. Finite-State Technology in Natural Language Processing. 2014 (May 28, 2022, date last accessed). [8] Bakaev I. Creation of a morphological analyzer based on finite-state techniques for the Uzbek language. 2020. [9] Eryiğit G, Adalı E. AN AFFIX STRIPPING MORPHOLOGICAL ANALYZER FOR TURKISH. 2004; 299–304. [10] Hamroyeva Orzigul. Ona tili ma’ruzalar to’plami. 2021; [11] Yormat Tojiyev. O‘zbek tili morfemikasi. 1992. [12] M.Hamroyev, D.Muhamedova, va b. Ona tili. Toshkent: IQTISOD-MOLIYA, 2007. [13] Tojiyev Yormat, Nazarova Nodira, Tojiyeva Gulchexra. O‘zbek tilidagi ergash morfemalarning semantik-stilistik xususiyatlari. Toshkent, 2012. [14] Azim Hojiev. O‘zbek tili morfologiyasi, morfemikasi va so‘z yasalishining nazariy masalalari. Toshkent, 2010. [15] Maksud Sharipov, Salaev Ulugbek, Ollabergan Yuldashov, Sobirov Jasurbek. Stemming Algorithm for Uzbek language: UzbekStemmer. 2021 https://www.higithub.com/MaksudSharipov/repo/UzbekStemmer (May 29, 2022, date last accessed).