=Paper= {{Paper |id=Vol-3315/paper10 |storemode=property |title=Creating a Morphological and Syntactic Tagged Corpus for the Uzbek Language |pdfUrl=https://ceur-ws.org/Vol-3315/paper10.pdf |volume=Vol-3315 |authors=Maksud Sharipov,Jamolbek Mattiev,Jasur Sobirov,Rustam Baltayev }} ==Creating a Morphological and Syntactic Tagged Corpus for the Uzbek Language== https://ceur-ws.org/Vol-3315/paper10.pdf
Creating a Morphological and Syntactic Tagged Corpus for the
Uzbek Language
Maksud Sharipov 1, Jamolbek Mattiev 1, Jasur Sobirov 1, Rustam Baltayev 2
1
    Urgench State University, Khamid Alimdjan 14, Urgench, 220100, Uzbekistan
2
    Urgench Branch of Tashkent university of Information Technologies Named After Muhammad al-Khwarizmi,
    110, Al-Khwarizmi str, Urgench, 220100, Uzbekistan


                 Abstract
                 Nowadays, the creation of tagged corpora is becoming one of the most important tasks in
                 Natural Language Processing (NLP). There are not enough tagged corpora to build machine
                 learning models for the low-resource Uzbek language. In this paper, we try to fill that gap
                 by developing a novel Part Of Speech (POS) and syntactic tagset for creating a syntactically
                 and morphologically tagged corpus of the Uzbek language. This work also includes a detailed
                 description and presentation of a web-based application for performing the tagging. Based
                 on the developed tagset and annotation tool, we share our experience and the results of the
                 first stage of the tagged corpus creation.

                 Keywords 1
                 Syntactic tags, morphological tags, language corpus, Uzbek language, natural language
                 processing

1. Introduction
    Nowadays, the Natural Language Processing (NLP) field is developing rapidly and plays an
important role in solving problems in the scientific, economic, and cultural fields. NLP is also
applied in areas such as business data analysis, web application development, corpus linguistics,
computer science, and artificial intelligence. The majority of the information available on the
Internet is textual; therefore, obtaining the necessary information from textual data through
techniques such as morphological and syntactic analysis is becoming a main field of interest in
NLP.
    To date, corpora exist for most widely spoken languages; some of the earliest and most popular
ones are the Brown corpus [1], the International Corpus of English, and the British National
Corpus [2]. At present, practical research is underway in the field of corpus linguistics to create
language corpora for various purposes. The usefulness of corpora for linguistic research is ensured
by the creation of tagged sub-corpora within these corpora [3].
    Some research has been done on creating tagged corpora for the Uzbek language. For example,
[4,5] provide information on the basic requirements and principles of linguistic annotation for
text processing in the creation of an electronic corpus of the Uzbek language, as well as the
results of theoretical and practical research on morphological tagging and on building a
morphological analyzer using finite-state transducer (FST) technology.
    Due to the lack of language resources for the Uzbek language, there are difficulties in solving
NLP problems. To solve them, we need a morphologically and syntactically tagged corpus. To date,
the lack of an open-source morphologically and syntactically tagged corpus for the Uzbek language
makes it difficult to conduct research in the field of linguistics. There are 12 word classes in
the Uzbek language. A word can be polyfunctional depending on how it is realized in the sentence
and on the semantic valence of the surrounding n-gram words [6]. The typical approach for most NLP
applications using tagged corpora consists of creating a corpus through manual annotation and then
training a machine learning model [7]. To address the above-mentioned issues, we aim to create an
open-source tagged corpus in this research. The goal is to build a supervised tagger using the
tagged corpus that is being created. Typically, a corpus pre-tagged with a tool is required to
create supervised taggers [8].
    The importance of the proposed work lies in the rich structure of the developed tagset and in
the tool for annotating texts to create a tagged corpus, which in turn will be used in upcoming
work on a tagger for Uzbek to train a sequence labeling language model.

1
 The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing
(ALTNLP), June 7-8, 2022, Koper, Slovenia
EMAIL: maqsbek72@gmail.com (M.Sharipov. 1); jamolbek_1992@mail.ru (J. Mattiev. 2);
ORCID: 0000-0002-2363-6533 (A. 1); 0000-0002-7614-118X3 (A. 2);
2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Related work
    Since a morphological and syntactic tagset and a tagged corpus are among the fundamental
must-have resources and one of the first steps in creating resources for a language, all
well-resourced languages can be said to have had their tagsets and tagged corpora developed at some
point. Languages differ from each other in their syntax, morphology, and phonetics, but at the same
time the majority of them share a similar structure, which allows linguists to create multilingual
resources and tools. In an attempt to create a multilingual tagset usable by as many languages as
possible, Google Research created a universal POS tagset [9], obtained by mapping together similar
features of 22 languages. This universal POS tagset is now used by many languages as the base of
their tagset, which is then extended with tags that encode language-specific features. It is also
used by the Universal Dependencies (UD) project [10], one of the fastest growing multilingual
tagged NLP data platforms, which has data for over 130 languages.
    Regarding similar work on the Uzbek language, the first work to present a list of morphological
tags and a morphological tagger [11] introduced a tool written in Prolog. However, it covered only
the main parts of speech in Uzbek text and lacked many of the tags needed to deal with complex
words.
    In [12], the issue of tagging the Uzbek language corpus was considered. The authors proposed 14
POS tags, that is, almost one tag per word class, whereas in the Uzbek language each word class is
divided into several types in terms of meaning and structure. In our approach, we took those issues
into consideration and created an expanded tagset through deeper analysis. The novel tagset allows
us to analyze the text in more depth from a semantic point of view. In [13], the importance of
rule-based and stochastic tagging methods for the Uzbek language is discussed. The need for a
tagged corpus for the Uzbek language is indicated, and the occurrence of words with different
functions in sentences is described; however, the authors did not provide any morphological or
syntactic tagset that can be used for tagging.
    There is a very limited amount of NLP work on Uzbek. Some of the important contributions
include sentiment analysis datasets [14,15], cross-lingual word embeddings over closely related
Turkic languages [16], a stopwords dataset [17], a stemmer for Uzbek verbs [18], and a recent
neural transformer-based (BERT) language model [19] trained on a large amount of raw Uzbek text.
Although a large number of published scientific works claim to have contributed to Uzbek NLP, the
number of usable, high-quality language resources and tools is nowhere near that amount. Such
works, which claim a contribution but provide neither open-source code nor the data itself, are
referred to as “zigglebottom” papers in a recent work on Uzbek [20].
    Regarding related work on similar languages, syntactic and POS tags have been developed for the
Kazakh language [21] to create a tagged corpus. The authors produced 36 morphological tags and 9
syntactic tags and, based on their tagset, developed an annotated corpus consisting of 613 511
words.

3. Proposed methods
    A corpus can serve as a basis for many fields and scientific processes. Considering that corpus
texts come in different genres and categories, a corpus is easier to use if each word is
accompanied by its morphological and syntactic classification (which word class it belongs to and
which syntactic function it performs in the sentence). This is the process of text interpretation
(annotation).
    This section explains how it is done for the Uzbek language specifically.

    3.1. Tag list development

   First of all, annotating Uzbek words requires tags, that is, abbreviations expressing
morphological and syntactic meanings. Table 1 shows some of the morphological tags used in the
examples below (detailed information about the complete morphological and syntactic tag lists can
be found in [22]).

Table 1
Selected morphological tags, with a description and example words for each part of speech.

   Name         Tag     Description                    Example
   NOUN         SOT     Personal noun                  Teacher (oʻqituvchi)
   NOUN         NOT     Object noun                    Bag (sumka)
   NOUN         JOT     Place noun                     Village (qishloq)
   NOUN         MOT     Abstract noun                  Love (muhabbat)
   ADJECTIVE    XSF     Peculiarity-state adjective    Hard (qattiq)
   ADJECTIVE    RSF     Color adjective                White, black (oq, qora)
   PRONOUN      KOL     Personal pronouns              I (men)
   PRONOUN      KROL    Demonstrative pronouns         This (shu)
   ADVERB       HRV     Adverbs of manner              Rapidly (tez)
   ADVERB       MIRV    Adverbs of quantity            A lot (koʻp)
   ADVERB       PRV     Adverbs of time                Before (avval)
   VERB         SIFL    Past participle verb form      Gone, seen (borgan, koʻrgan)
   VERB         HFL     Infinitive form of the verb    Going, saying (borish, aytish)
   VERB         SFL     Original verb form             Saw, read (koʻrdi, oʻqidi)
   VERB         KFSQ    Auxiliary verb combination     Fell in love (sevib qoldi)
   VERB         1B      First person singular          I am flying (uchyapman)
   VERB         3B      Third person singular          He said (aytdi)
   VERB         2K      Second person plural           You worked (ishladingiz)
   VERB         OTZ     Past simple tense              I said (aytdim)
   VERB         KEZ     Future simple                  I will tell (aytaman)


    Morphology studies words, their categories, and their morphological features. Morphological
analysis indicates the category to which a word belongs, its nominal form, and its suffixes.
Table 1 shows the word class, its abbreviation (POS tag), the description of the category, and
examples. For example, for the verb word class, 9 POS tags are given together with an example of
each. The tags in Table 1 are only a selection of common morphological tags belonging to the noun,
adjective, pronoun, adverb, and verb classes. Information about the whole tagset is shown in
Table 2.
    It can be seen from Table 2 that we created 102 morphological tags in total for the Uzbek
language: for example, 22 POS tags for the noun word class, 10 for the adjective word class, 11
for pronouns, and so on. These tags are used in the morphological parsing of words.

Table 2
Detailed information about the whole POS tagset. Number of POS tags created for each word class
are presented.

                              Word class                  Part of speech tags
                               NOUN                                22
                             ADJECTIVE                             10
                              NUMBER                               11
                             PRONOUN                               11
                              ADVERB                               10
                               VERB                                18
                           CONJUNCTION                              8
                              HELPERS                               1
                             PARTICLE                               6
                           INTERJECTION                             2
                          IMITATIVE WORD                            2
                           MODAL WORD                               1
                               Total:                             102
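
The per-class counts in Table 2 can be kept as a small lookup structure and checked against the reported total. The sketch below only restates the numbers from Table 2; the names POS_TAGS_PER_CLASS, total_tags and share_by_class are illustrative and are not part of the authors' software.

```python
# Number of POS tags per word class, copied from Table 2.
POS_TAGS_PER_CLASS = {
    "NOUN": 22,
    "ADJECTIVE": 10,
    "NUMBER": 11,
    "PRONOUN": 11,
    "ADVERB": 10,
    "VERB": 18,
    "CONJUNCTION": 8,
    "HELPERS": 1,
    "PARTICLE": 6,
    "INTERJECTION": 2,
    "IMITATIVE WORD": 2,
    "MODAL WORD": 1,
}


def total_tags(counts):
    """Return the total number of tags over all word classes."""
    return sum(counts.values())


def share_by_class(counts):
    """Return each word class's share of the tagset, in percent."""
    total = total_tags(counts)
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    print(total_tags(POS_TAGS_PER_CLASS))
    print(share_by_class(POS_TAGS_PER_CLASS)["NOUN"])
```

The 12 entries correspond to the 12 word classes of Uzbek, and their sum reproduces the total of 102 tags stated in the text.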


    We developed a comprehensive morphological tagset for in-depth morphological tagging of Uzbek
words. Similarly, a syntactic tagset was created in this research for syntactic tagging of Uzbek
words. Table 3 lists the 14 syntactic tags and examples of their usage:

Table 3
The complete syntactic tagset. For each syntactic tag, a short description and an example are
given.

   Name              Tag    Description                               Example
   SUBJECT           EG     Subject                                   Brother came home
   PREDICATE         OK     Noun Predicate                            Urgench is a beautiful city
   PREDICATE         FK     Verb Predicate                            Salim came
   ATTRIBUTE         QA     Genitive Attribute                        My brother's face
   ATTRIBUTE         SA     Adjectival Attribute                      Good students were rewarded
   OBJECT            VL     Indirect Object                           We talked about home
   OBJECT            VS     Direct Object                             The knife cut my hand
   MODIFIERS         VH     Condition Modifier                        He agreed out of desperation
   MODIFIERS         PH     Time Modifier                             He finished work in the evening
   MODIFIERS         OH     Place Modifier                            He wants to live in Tashkent
   MODIFIERS         SH     Reason Modifier                           He did not come because he was ill (Kasallangani uchun kelmadi)
   MODIFIERS         MH     Aim Modifier                              He deliberately does not enter the building
   EXCLAMATION       UN     A person or object addressed in speech    Anwar, look at me
   THE ENTRY WORD    KR     The entry word                            Unfortunately, he returned home


     Syntax studies phrases and sentences. Syntactic analysis of a sentence examines the parts that
make it up and the relationships between them. There are 5 parts of the sentence in the Uzbek
language, namely subject, predicate, attribute, object, and modifiers. There are also elements
that do not interact syntactically with those parts, which are called “EXCLAMATION” and “THE ENTRY
WORD”. Syntactic analysis is not only a scientific aspect of linguistics, but also plays an
important role in the attractiveness of discourse and the fluent formation of text. Table 3
provides the names of the parts of the sentence, their abbreviations, explanations, and related
examples. For example, the predicate has 2 different types (OK and FK), listed with their names
(Noun Predicate and Verb Predicate) and examples of their usage.
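
The grouping of the 14 syntactic tags under the five parts of the sentence and the two independent elements can be represented as a simple lookup table. The following is a minimal sketch that assumes the grouping shown in Table 3; the names SYNTACTIC_TAGS, sentence_part and tags_for are hypothetical and do not come from the authors' tool.

```python
# Syntactic tag -> (group name, description), following Table 3.
SYNTACTIC_TAGS = {
    "EG": ("SUBJECT", "Subject"),
    "OK": ("PREDICATE", "Noun Predicate"),
    "FK": ("PREDICATE", "Verb Predicate"),
    "QA": ("ATTRIBUTE", "Genitive Attribute"),
    "SA": ("ATTRIBUTE", "Adjectival Attribute"),
    "VL": ("OBJECT", "Indirect Object"),
    "VS": ("OBJECT", "Direct Object"),
    "VH": ("MODIFIERS", "Condition Modifier"),
    "PH": ("MODIFIERS", "Time Modifier"),
    "OH": ("MODIFIERS", "Place Modifier"),
    "SH": ("MODIFIERS", "Reason Modifier"),
    "MH": ("MODIFIERS", "Aim Modifier"),
    "UN": ("EXCLAMATION", "A person or object addressed in speech"),
    "KR": ("THE ENTRY WORD", "The entry word"),
}


def sentence_part(tag):
    """Return the group (part of the sentence) a syntactic tag belongs to."""
    try:
        return SYNTACTIC_TAGS[tag][0]
    except KeyError:
        raise ValueError(f"unknown syntactic tag: {tag!r}") from None


def tags_for(part):
    """Return all syntactic tags listed under the given group, in table order."""
    return [tag for tag, (group, _) in SYNTACTIC_TAGS.items() if group == part]


if __name__ == "__main__":
    print(sentence_part("FK"))
    print(tags_for("MODIFIERS"))
```

A table of this kind lets an annotation tool reject tags that are not in the tagset before they reach the corpus.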

       3.2. Developed algorithm for tagging

   Part-of-speech tagging is much more complicated than simply mapping words to word classes,
because a single word can belong to different word classes in different sentences depending on the
context [3]. So far, there are not enough tagged corpora for the Uzbek language to train machine
learning algorithms, so the main goal of our research is to develop an algorithm for tagging texts
and a web-based tagging system. All the tags and the tagger proposed in this work use the official
Latin alphabet as the default script. However, the old Cyrillic script is still equally popular in
Uzbek official written documents, literature, and internet websites. Texts that appear in Cyrillic
are therefore pre-processed using available tools, such as the web-based transliterator Savodxon
(see footnote 2) or a machine transliteration Application Programming Interface (API) [23], before
being fed as input to the tagger. The steps for syntactic and morphological tagging of texts are
shown in Figure 1.
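
To illustrate the Cyrillic pre-processing step, the sketch below converts Uzbek Cyrillic text to the Latin script with a character map. It is a deliberately simplified assumption and is neither the Savodxon algorithm nor the API of [23]: the names CYR_TO_LAT and transliterate are hypothetical, "е" becomes "ye" only at the start of a word, "ц" is always rendered as "ts", and all-capital words are not handled specially.

```python
TURNED_COMMA = "\u02bb"   # the mark used in "oʻ" and "gʻ"
APOSTROPHE = "\u02bc"     # the mark used for the Cyrillic hard sign

# Simplified Uzbek Cyrillic -> Latin mapping for lower-case letters.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "ts",
    "ч": "ch", "ш": "sh", "ъ": APOSTROPHE, "ь": "", "э": "e",
    "ю": "yu", "я": "ya", "ў": "o" + TURNED_COMMA, "қ": "q",
    "ғ": "g" + TURNED_COMMA, "ҳ": "h",
}


def transliterate(text):
    """Convert Uzbek Cyrillic text to Latin script (simplified rules)."""
    out = []
    prev_is_letter = False
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            # Latin letters, digits, spaces and punctuation pass through.
            out.append(ch)
            prev_is_letter = ch.isalpha()
            continue
        latin = CYR_TO_LAT[lower]
        if lower == "е" and not prev_is_letter:
            latin = "ye"  # word-initial "е" is written "ye"
        if ch != lower:
            latin = latin.capitalize()  # keep the capitalisation of the input
        out.append(latin)
        prev_is_letter = True
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Мен деразани очдим"))
```

A production transliterator needs context-dependent rules and exception lists, which is why the paper relies on existing tools for this step.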




                               Figure 1: Syntactic and Morphological tagging of texts

   Following the steps in Figure 1, we developed a web-based tagging application, which can be
found at [24]. The tagging workflow in the application consists of the following steps:
             ● Registration of the experts who perform the syntactic and morphological tagging of texts;
             ● Extracting the texts from the corpus and splitting them into sentences and words;
             ● Sending the selected sentence to the user interface;
             ● Writing the result to a file together with the ID number of the user who tagged the sentence;
             ● Producing the final result after each sentence is tagged;
             ● Writing the result to the corpus in XML or TXT format.
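
The workflow steps above can be sketched in a few functions: splitting a text into sentences and words, attaching the annotator's ID to a tagged sentence, and serialising the result as XML. This is only an illustration under stated assumptions: the paper does not specify the XML schema, so the element and attribute names (corpus, sentence, w, annotator, tags) and all function names are hypothetical, and the sentence splitter is intentionally naive.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Extract words, keeping the apostrophe-like marks of Uzbek Latin script."""
    return re.findall(r"[\w\u02bb\u02bc'’]+", sentence)


def annotation_to_xml(sentence_id, user_id, tagged_words):
    """Build a <sentence> element from (word, [tags]) pairs.

    The element and attribute names are assumptions, not the authors' schema.
    """
    sent = ET.Element("sentence", id=str(sentence_id), annotator=str(user_id))
    for word, tags in tagged_words:
        w = ET.SubElement(sent, "w", tags="/".join(tags))
        w.text = word
    return sent


def build_corpus(sentence_elements):
    """Wrap sentence elements in a <corpus> root and return the XML string."""
    root = ET.Element("corpus")
    root.extend(sentence_elements)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    text = "Salim keldi. Anvar eshik yoniga keldi!"
    for sentence in split_sentences(text):
        print(split_words(sentence))
    tagged = [("Salim", ["SOT"]), ("keldi", ["SFL", "3B", "OTZ"])]
    print(build_corpus([annotation_to_xml(1, 7, tagged)]))
```

Storing the annotator ID with every sentence makes it possible to trace each annotation back to the expert who produced it.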

       3.3. Developed tagged corpus

   The syntactically and morphologically tagged corpus is created with the developed web
application. The following sentences illustrate how the tags are used (a: morphological tags;
b: syntactic tags):

 1. I opened the window to get some fresh air (Men biroz toza havo olish uchun derazani ochdim)
    a) Men/KOL biroz/MIRV toza/XSF havo/MOT olish/HFL uchun/KM derazani/NOT ochdim/SFL/1B/OTZ
    b) Men/EG biroz/PH toza+havo+olish+uchun/MH derazani/VS ochdim/FK
 2. Anvar suddenly came to the door (Anvar toʻsatdan eshik yoniga keldi)
    a) Anvar/SOT toʻsatdan/HRV eshik/NOT yoniga/JOT keldi/SFL/3B/OTZ
    b) Anvar/EG toʻsatdan/VH eshik+yoniga/OH keldi/FK
 3. Only today you can buy a car at this price, tomorrow new price will be set (Faqat bugun
    ushbu narxda mashina xarid qila olasiz, ertaga yangi narx qoʻyiladi)
    a) Bugun/PRV ushbu/KOL narxda/MOT mashina/NOT xarid+qila+olasiz/SFL/2K/KEZ, ertaga/PRV yangi/XSF narx/MOT qoʻyiladi/SFL/3B/KEZ
    b) Bugun/PH ushbu/SA narxda/VL mashina/VS xarid+qila+olasiz/FK, ertaga/PH yangi/SA narx/EG qoʻyiladi/FK
 4. A snake cannot move on a flat surface (Ilon yassi yuzada harakatlana olmaydi)
    a) Ilon/NOT yassi/XSF yuzada/JOT harakatlana+olmaydi/KFSQ
    b) Ilon/EG yassi/SA yuzada/OH harakatlana+olmaydi/FK
 5. You have to take into consideration the performance characteristics rather than its
    price when you are buying a mobile phone (Siz mobil telefon sotib olayotganingizda uning
    narxiga emas, ishlash xususiyatlariga e’tibor qaratishingiz kerak)
    a) Siz/KOL mobil/XSF telefon/NOT sotib+olayotganingizda/SIFL uning/KROL narxiga/MOT emas, ishlash/HFL xususiyatlariga/MOT e’tibor+qaratishingiz+kerak/HFL/2K
    b) Siz/EG mobil/SA telefon/VS sotib+olayotganingizda/PH uning/QA narxiga/VL emas, ishlash/SA xususiyatlariga/VL e’tibor+qaratishingiz+kerak/OK

2
 Savodxon machine transliterator: https://savodxon.uz/

   We now briefly explain how this notation is used. Each annotated word is followed by a forward
slash (/) and one or more abbreviated tags indicating the morphological or syntactic nature of the
word (for example, /NOT marks an object noun). More than one tag can be placed after a word
(keldi/SFL/3B/OTZ). Words joined by a plus sign (+) are tagged together as a single unit
(eshik+yoniga/OH). Figure 2 shows the tagging process of words and sentences in the newly developed
web-based application.
                     Figure 2: Example of tagging process in web-based application
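
The slash notation used in the tagged sentences can be read back mechanically. The sketch below parses one tagged line into (words, tags) pairs; it assumes only what is visible in the examples, namely that tags follow the word after "/", that "+" joins the words of a multi-word unit, and that some words (such as "emas" in example 5) carry no tag. The function name parse_tagged_sentence is hypothetical.

```python
def parse_tagged_sentence(line):
    """Parse 'word/TAG1/TAG2 word+word/TAG ...' into (words, tags) pairs.

    Each item is a tuple (list_of_words, list_of_tags); untagged words
    are returned with an empty tag list.
    """
    tokens = []
    for chunk in line.split():
        chunk = chunk.rstrip(",.;")  # punctuation glued to the last tag or word
        if not chunk:
            continue
        word, *tags = chunk.split("/")
        tokens.append((word.split("+"), tags))
    return tokens


if __name__ == "__main__":
    morph = "Ilon/NOT yassi/XSF yuzada/JOT harakatlana+olmaydi/KFSQ"
    synt = "Ilon/EG yassi/SA yuzada/OH harakatlana+olmaydi/FK"
    for (words, m_tags), (_, s_tags) in zip(parse_tagged_sentence(morph),
                                            parse_tagged_sentence(synt)):
        print(" ".join(words), m_tags, s_tags)
```

Pairing the two layers with zip works here because both lines of example 4 segment the sentence identically; in examples where the syntactic layer merges more words than the morphological layer, the two layers have to be aligned by word position instead.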

   The application is easy to use: (1) the text is inserted into the database; (2) the user
selects the tagging type, syntactic or morphological (grey buttons in the top-right corner);
(3) tagging is performed word by word, so the user chooses the tag name from the developed tagset
shown below the text, and a tagged word is added to the tagged text by clicking the +/- buttons;
(4) once all the words in the text are tagged, the tagged text is added to the tagged corpus by
clicking the “Tasdiqlash” (“confirm”, green) button.

4. Results and evaluation
    A morphologically and syntactically tagged corpus for the Uzbek language was created using the
developed tagset and program. So far, more than 1,200 sentences and more than 10,000 words from
texts in different fields (more than 15 categories in total), such as literature, technology,
sports, psychology, politics, society, medicine, religion, and philosophy, have been tagged, and
the tagging process is still ongoing. Our main goal is to develop the largest tagged corpus for the
Uzbek language. The current tagged corpus can be used by researchers as a dataset for solving NLP
tasks in the Uzbek language. Since, to our knowledge, there is no other open-source tagged corpus
for the Uzbek language, this work can be considered a novel and important contribution to the NLP
field for the Uzbek language.

5. Conclusion and future work
    In this paper, a part-of-speech and syntactic tagset required to create a tagged corpus was
developed for the Uzbek language. A web-based annotation tool for tagging texts based on the
developed tagset was created. Using this tool, texts are tagged by experts. The resulting
syntactically and morphologically tagged corpus of the Uzbek language can be used to address
problems such as named entity recognition, statistical language modeling, text generation pattern
identification, machine translation, and syntactic analysis. The tagged dataset is currently being
expanded. In future work, we plan to build automatic tagging algorithms (named Uzbek tagger) for
the Uzbek language with machine learning using the tagged dataset.
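
As an indication of how a tagged dataset can feed a supervised tagger, the sketch below trains a most-frequent-tag baseline on (word, tag) pairs. This is a generic baseline and not the planned Uzbek tagger; the function names, the "UNK" fallback label, and the toy training data (taken from the tagged examples in Section 3.3) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag of each word from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_sentence(model, words, default="UNK"):
    """Tag each word with its most frequent tag, or `default` if unseen."""
    return [(word, model.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    # Toy training data: first morphological tag of words from Section 3.3.
    train = [
        [("Men", "KOL"), ("biroz", "MIRV"), ("toza", "XSF"), ("havo", "MOT")],
        [("Ilon", "NOT"), ("yassi", "XSF"), ("yuzada", "JOT")],
    ]
    model = train_baseline(train)
    print(tag_sentence(model, ["Toza", "ilon", "kitob"]))
```

Such a baseline ignores context, so it cannot resolve polyfunctional words; it mainly provides a lower bound against which a sequence labeling model trained on the corpus can be compared.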
    The annotation tool can be used for other Turkic languages as well, provided that the tagset of
the target language is loaded into the application. The approach can be applied to other Turkic
languages by reusing most of the available tags and adding some language-specific tags for the
target language.

6. Acknowledgements
   The author (Jamolbek Mattiev) gratefully acknowledges the European Commission for funding the
InnoRenewCoE project (Grant Agreement #739574) under the Horizon2020 Widespread-Teaming
program and the Republic of Slovenia (Investment funding of the Republic of Slovenia and the
European Union of the European Regional Development Fund). Jamolbek Mattiev is also funded for
his Ph.D. by the “El-Yurt-Umidi” foundation under the Cabinet of Ministers of the Republic of
Uzbekistan.

7. References
1.    Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. Building a Large
      Annotated Corpus of English: The Penn Treebank. 1993; 313–330.
2.    Kholkovskaia O. Role of the Brown Corpus in the History of Corpus Linguistics. 2017.
3.    Atwell, Es. Development of tag sets for part-of-speech tagging. 2008.
4.    Nilufar Abduraxmonova. СЎЗ САНЪАТИ ХАЛҚАРО ЖУРНАЛИ 4 ЖИЛД, 1 СОН
      МЕЖДУНАРОДНЫЙ ЖУРНАЛ ИСКУССТВО СЛОВА INTERNATIONAL
      JOURNAL OF WORD ART VOLUME 4, ISSUE 1. 2021.
5.    Abdurakhmonova N. O’zbek tili korpusini morfologik teglashda FST texnologiyasi
      tatbiqi. International Journal of Art & Design Education 2021; 4: 319–326.
6.    M. Abjalova. Linguistic modules of editing and analysis programs. 2020.
7.    Almashraee M, Monett Diaz D, Unland R. Sentiment Classification of on-line Products
      based on Machine Learning Techniques and Multi-agent Systems Technologies.
8.    Altunyurt L, Orhan Z. PART OF SPEECH TAGGER FOR TURKISH. 2006.
9.    Petrov S, Das D, McDonald R. A Universal Part-of-Speech Tagset. 2011.
10.   Nivre J, de Marneffe M-C, Ginter F et al. Universal Dependencies v1: A Multilingual
      Treebank Collection. Proceedings of the Tenth International Conference on Language
      Resources and Evaluation (LREC’16), European Language Resources Association
      (ELRA) 2016, 1659–1666.
11.   Matlatipov G, Vetulani Z. Representation of Uzbek Morphology in Prolog. In:
      Marciniak M, Mykowiecka A, editors. Aspects of Natural Language Processing:
      Essays Dedicated to Leonard Bolc on the Occasion of His 75th Birthday. Berlin,
      Heidelberg: Springer Berlin Heidelberg, 2009: 83–110.
12.   Ilyos Rabbimov, Svetlana Umirova, Baxtiyor Xolmuxamedov. Alisher Navoiy
      nomidagi Toshkent davlat o’zbek tili va adabiyoti universiteti “O’ZBEK MILLIY VA
      TA’LIMIY KORPUSLARINI YARATISHNING NAZARIY HAMDA AMALIY
      MASALALARI” THE PROBLEM OF TAGGING WORDS IN UZBEK LANGUAGE
      CORPUS Rabbimov Ilyos Mehriddinovich 38. 2021.
13.   M. Abjalova, O. Iskandarov. Methods of Tagging Part of Speech of Uzbek Language.
      2021.
14.   Kuriyozov E, Matlatipov S. Building a New Sentiment Analysis Dataset for Uzbek
      Language and Creating Baseline Models. MDPI AG 2019, 37.
15.   Rabbimov I, Mporas I, Simaki V, Kobilov S. Investigating the Effect of Emoji in
      Opinion Classification of Uzbek Movie Review Comments. 2020.
16.   Kuriyozov E, Doval Y, Gómez-Rodríguez C. Cross-Lingual Word Embeddings for
      Turkic Languages. 2020.
17.   Madatov K, Bekchanov S, Vičič J. Automatic detection of stop words for texts in the
      Uzbek language. 2022.
18.   Maksud Sharipov, Ulugbek Salaev, Gayrat Matlatipov. IMPLEMENTED STEMMING
      ALGORITHMS BASED ON FINITE STATE MACHINE FOR UZBEK VERBS |
      COMPUTER LINGUISTICS: PROBLEMS, SOLUTIONS, PROSPECTS. 2022.
      http://compling.navoiy-uni.uz/index.php/conferences/article/view/6
19.   Mansurov, B, A.Mansurov. UzBERT: pretraining a BERT model for Uzbek. 2021.
20.   Salaev U, Kuriyozov E, Gómez-Rodríguez C. SimRelUz: Similarity and Relatedness
      scores as a Semantic Evaluation dataset for Uzbek language. 2022.
21.   Makhambetov O, Makazhanov A, Yessenbayev Z, Matkarimov B, Sabyrgaliyev I,
      Sharafudinov A. Assembling the Kazakh Language Corpus. Association for
      Computational Linguistics.
22.   Maqsud Sharipov. Uzbek_POS_tag_list/Uzbek POS tag list.pdf at main ·
      MaksudSharipov/Uzbek_POS_tag_list · GitHub. 2020.
      https://github.com/MaksudSharipov/Uzbek_POS_tag_list/blob/main/Uzbek%20POS%20tag%20list.pdf
23.   Salaev U, Kuriyozov E, Gómez-Rodríguez C. A machine transliteration tool between
      Uzbek alphabets. 2022.
24.   Maqsud Sharipov, Jasur Sobirov, Rustam Baltaev. Morfologiya | Authorincate. 2021.
      https://morphology-base.herokuapp.com/login