=Paper= {{Paper |id=Vol-1975/paper3 |storemode=property |title=A Uniform Morphological Analyzer for the Kazakh and Turkish Languages |pdfUrl=https://ceur-ws.org/Vol-1975/paper3.pdf |volume=Vol-1975 |authors=Gulmira Bekmanova,Altynbek Sharipbay,Gulila Altnbek,Eşref Adah,Lena Zhetkenbay,Unzila Kamanur,Altanbek Zulkhazhav |dblpUrl=https://dblp.org/rec/conf/aist/BekmanovaSAAZKZ17 }} ==A Uniform Morphological Analyzer for the Kazakh and Turkish Languages== https://ceur-ws.org/Vol-1975/paper3.pdf
     A Uniform Morphological Analyzer for the
         Kazakh and Turkish Languages

 Gulmira Bekmanova1, Altynbek Sharipbay1, Gulila Altenbek2, Eşref Adalı3,
        Lena Zhetkenbay1, Unzila Kamanur1, Altanbek Zulkhazhav1
   1 L.N. Gumilyov Eurasian National University, Astana, Kazakhstan
   2 College of Information Science and Engineering, Xinjiang University, P.R. China
   3 Istanbul Technical University, Istanbul, Turkey
gulmira-r@yandex.ru, sharalt@mail.ru, jetlen 7@mail.ru unzila.88@mail.ru,
     altinbekpin@gmail.com, glaxd2014@163.com, esrefadali@gmail.com



      Abstract. Kazakh and Turkish both belong to the group of Turkic
      languages and have much in common. A detailed comparison of the
      ontologies of Kazakh and Turkish nouns made it possible to analyze
      the morphological rules of the two languages and to introduce a
      unified system of notation. On this basis a uniform morphological
      analyzer was created that relies on a general algorithm of
      morphological analysis.

      Keywords: morphological analysis of the Kazakh and Turkish languages,
      ontology, analysis of morphological rules


1    Introduction
One way to reduce the semantic barrier between humans and computers is to
search for new methods of natural language processing. It is now clear that
implementing human-computer interaction in natural language, and providing
linguistic support for information processes, requires studying the language
itself. In addition, resource consumption can be reduced by formalizing language
rules so that information is stored in procedural rather than declarative form.
Kazakh and Turkish, whose morphological regularities lend themselves well to
formalization, are good candidates for this approach.

All language levels are characterized by the existence of basic elements. A
language can be studied from two positions, analysis and synthesis, because the
rules revealed for synthesis can help carry out analysis and vice versa. Here
the Kazakh and Turkish languages are studied from both positions. This
integrated approach makes it possible to study all regularities in detail and to
reveal nuances that would go unnoticed if only one of the approaches were used.
To study and formalize each language subsystem as fully as possible, software
tools are needed that support the study by identifying and verifying the
analysis and synthesis rules. Such tools largely automate the research process:
the researcher no longer needs to accumulate and collect information manually,
and the labor intensity is very low.

Morphology modeling is relevant to many natural language processing applications
and tasks, including information retrieval, sentiment analysis, spelling
correction, detection of generated texts, part-of-speech tagging and entity
extraction. In linguistics, morphology refers to the study of the structure and
formation of words. Agglutinative languages (from Latin agglutinare, to stick
together) are languages whose morphological system is characterized by
agglutination ("pasting") of various formants. The formants are prefixes or
suffixes, and each of them carries its own meaning.

Kazakh and Turkish belong to the group of Turkic languages, and the languages of
this group are classified as agglutinative. These languages are rich in word
forms (inflections), which are formed by adding suffixes. The suffixes are
attached in a strict sequence, and the resulting words can belong to a different
part of speech. The possessive form in Kazakh is similar to the possessive form
in English [1, 2]. Many studies cover the formalization of morphological rules
and the morphological analysis of the Turkic languages [3-6]. The first
morphological analyzer of Kazakh was developed in 2009 and was based on the
procedural method. The procedural method implies a preliminary systematization
of morphological knowledge about a natural language and the development of
algorithms that assign morphological information to an individual word form
[7, 8]. The procedural morphological analyzer of Kazakh consisted of the
following stages: marking the stem in the current word form, identifying the
stem, and assigning the corresponding list of morphological information to the
word form. The disadvantage of this method is the high labor intensity of
compiling the compatibility dictionaries. This task is difficult and cannot be
fully automated for languages with a large number of counterexamples. An
implementation of this method requires considerably less memory, but the time
needed for morphological analysis increases because each word form is split into
components and the compatibility procedures are applied [8]. The second version
of the morphological analyzer was developed in 2012 and was based on formal
morphological rules [9]. Later versions were based on ontological models and
hypergraphs [10-13]. Other research groups have developed their own
morphological analyzers [15-16].

Work on morphological analysis of Turkish has been carried out for a long time
and is presented in [17-21]. This paper uses the results obtained in [17]. The
distinctive feature of that morphological analyzer is its methodology: affixed
Turkish words are analyzed without any lexicon. The analyzer is entirely
rule-based and uses only a dictionary of counterexamples. It is built on a
finite automaton model.
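
The lexicon-free, right-to-left stripping idea can be illustrated with a minimal
sketch. It is not the analyzer of [17]: the state names, the transition table
and the function name are assumptions made for illustration, the suffix
inventory is limited to the Turkish forms that appear in Table 3, the possessive
state is omitted, and no phonological conditions are checked. Because no lexicon
is consulted, the function returns every candidate reading rather than a single
answer.

```python
# Hypothetical transition table, visited right to left: CASE -> PLURAL -> STEM.
# Suffix forms are taken as they appear in Table 3 (Turkish column) only.
TRANSITIONS = {
    "CASE": ("PLURAL", {"nin": "Gen", "in": "Gen", "den": "Abl", "de": "Loc",
                        "ye": "Dat", "e": "Dat", "yi": "Acc", "i": "Acc",
                        "le": "Ins"}),
    "PLURAL": ("STEM", {"ler": "A3pl"}),
}


def strip_candidates(word, state="CASE", tags=()):
    """Return all (candidate_stem, tags) pairs reachable by suffix stripping."""
    if state == "STEM":
        return [(word, tags)]
    next_state, suffixes = TRANSITIONS[state]
    # Epsilon move: no suffix of this class is present on the word.
    readings = strip_candidates(word, next_state, tags)
    for suffix, tag in suffixes.items():
        # Strip the suffix only if something is left over to act as a stem.
        if word.endswith(suffix) and len(word) > len(suffix):
            readings += strip_candidates(word[: -len(suffix)], next_state,
                                         (tag,) + tags)
    return readings


for stem, tags in strip_candidates("pencerelerde"):
    print(stem, tags)
```

Among the candidates for "pencerelerde" is the stem "pencere" with the tags
A3pl and Loc, next to spurious readings such as a dative on "pencerelerd". A
rule-based analyzer needs further constraints, or a dictionary of
counterexamples as described above, to discard the latter.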


2   The generalized ontological models of parts of speech of
    the Kazakh and Turkish languages
An ontology is a powerful and widely used tool for modelling relationships
between objects that belong to different subject areas. Ontologies can be
classified by their degree of dependence on a task or application area, by the
ontological model used for knowledge representation and expression, and by other
criteria [22].

We used the ontology editor Protégé (http://protege.stanford.edu) to build the
ontology. It is a free, open-source ontology editor and a framework for building
knowledge bases. It was developed at Stanford University in collaboration with
the University of Manchester.

The morphological features of the initial forms of nouns (N) are as follows. A
noun can be either animate (anim) or inanimate (inanim); this feature determines
the inflection pattern of the noun. Nouns in the Kazakh language take personal
endings (pers end), vary for case (cases) and number (number), and have a
possessive form (poss end) [10].




              Fig.1: Ontological model of the Kazakh noun [10]
Figure 1 shows the ontological model of the Kazakh noun with its morphological
features. Concepts and relationships used in this ontological model are explained
in Table 1.




                      Table 1. Concepts and relationships
   Notation            Description              Notation    Description
   N                   Noun                     2 pr        2 personal
   Part of speech                               3 pr        3 personal
   Item                Item                     Poss end    Possessive endings
   Anim                Animate                  1 ps        1 personal
   Sign of animacy                              2 ps        2 personal
   Inanim              Inanimate                3 ps        3 personal
   Sign of inanimacy                            Number      Number
   Cases               Cases                    Pl          Plural
   Nom                 Nominative case          Sg          Singular
   Gen                 Genitive case                        is a
   Dat                 Direction-dative case                Denotes
   Acc                 Accusative case          e3, e4      has feature
   Loc                 Locative case                        Has
   Abl                 Ablative case                        Divided
   Ins                 Instrumental case                    Change
   Pers end            Personal endings                     Add
   1 pr                1 personal



    The ontological model of the Kazakh parts of speech allows us to describe
the morphological rules and their relationships completely. On the basis of this
ontological model we developed generalized ontological models of the Kazakh and
Turkish parts of speech. The ontological model of Kazakh nouns developed in the
Protégé environment is shown in Figure 2, and the ontological model of Turkish
nouns is shown in Figure 3.
       Fig.2: The ontological models of nouns for the Kazakh language




       Fig.3: The ontological models of nouns for the Turkish language

The comparative ontological models of the noun for a machine translation system
thus include all categories of morphological features. By structure, Kazakh
nouns are divided into stem and complex nouns, whereas Turkish has no such
division. By meaning, a Kazakh noun can be common, proper, concrete, abstract,
animate or inanimate, while a Turkish noun can be common, proper, animate or
inanimate. In both languages the divisions of affixation are similar, e.g., the
forms of cases, number, possessives and conjugations. There are seven cases in
Kazakh, whereas in Turkish there are five.
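
The comparison above can be written down as plain data and checked mechanically.
The following is a minimal sketch, not the Protégé models themselves: the
dictionary layout, the key names and the `compare` helper are assumptions made
for illustration, and only the categories named in the preceding paragraph are
included.

```python
# Hypothetical encoding of the noun categories listed in the text.
NOUN_MODEL = {
    "kk": {  # Kazakh
        "structure": {"stem", "complex"},
        "meaning": {"common", "proper", "concrete", "abstract",
                    "animate", "inanimate"},
        "affixation": {"case", "number", "possessive", "conjugation"},
    },
    "tr": {  # Turkish
        "structure": set(),  # no stem/complex division in the Turkish model
        "meaning": {"common", "proper", "animate", "inanimate"},
        "affixation": {"case", "number", "possessive", "conjugation"},
    },
}


def compare(feature):
    """Split one feature group into shared and language-specific categories."""
    kk = NOUN_MODEL["kk"][feature]
    tr = NOUN_MODEL["tr"][feature]
    return {"shared": kk & tr, "kazakh_only": kk - tr, "turkish_only": tr - kk}


for feature in ("structure", "meaning", "affixation"):
    print(feature, compare(feature))
```

The shared categories are the ones a common symbol system has to cover for both
languages, while the language-specific ones need markers that may stay empty for
the other language.
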
3    The uniform morphological analyzer for the Kazakh
     and Turkish languages
The comparison of the ontological models made it possible to create a common
symbol system of morphological markers, which is used in the morphological
analyzer.

    Table 2. Comparison of morphological markers of Kazakh and Turkish nouns

    N   Abbreviation  Name in English                 Name in Kazakh                             Name in Turkish               Unified Tag
    1   +Noun         Noun                            Zat esim                                   İsim                          Noun
    2   +A1sg         Personal 1 singular             Zhiktik 1 zhaq, zhekeshe                   1. Tekil Şahıs Uyum Özelliği  PERS.1SG
    3   +A2sg         Personal 2 singular             Zhiktik 2 zhaq, zhekeshe                   2. Tekil Şahıs Uyum Özelliği  PERS.2SG
    4                                                                                                                          PERS.2SG.POL
    5   +A3sg         Personal 3 singular             Zhiktik 3 zhaq, zhekeshe                   3. Tekil Şahıs Uyum Özelliği  PERS.3SG
    6   +A1pl         Personal 1 plural               Zhiktik 1 zhaq, koepshe                    1. Çoğul Şahıs Uyum Özelliği  PERS.1PL
    7   +A2pl         Personal 2 plural               Zhiktik 2 zhaq, koepshe                    2. Çoğul Şahıs Uyum Özelliği  PERS.2PL
    8                                                                                                                          PERS.2PL.POL
    9   +A3pl         Personal 3 plural               Zhiktik 3 zhaq, koepshe                    3. Çoğul Şahıs Uyum Özelliği  PERS.3PL
    10  +P1sg         Possessive 1 singular           Zhiktik 1 zhaq, Taweldik 1 zhaq, zhekeshe  1. Tekil Şahıs İyelik Eki     POSS.1SG
    11  +P2sg         Possessive 2 singular           Taweldik 2 zhaq, zhekeshe                  2. Tekil Şahıs İyelik Eki     POSS.2SG
    12  +P2sgpol      Possessive 2 singular (formal)  Taweldik 2 zhaq, zhekeshe, resmi tueri                                   POSS.2SG.POL
    13  +P3sg         Possessive 3 singular           Taweldik 3 zhaq, zhekeshe                  3. Tekil Şahıs İyelik Eki     POSS.3SG
    14  +P1pl         Possessive 1 plural             Taweldik 1 zhaq, koepshe                   1. Çoğul Şahıs İyelik Eki     POSS.1PL
    15  +P2pl         Possessive 2 plural             Taweldik 2 zhaq, koepshe                   2. Çoğul Şahıs İyelik Eki     POSS.2PL
    16  +P2plpol      Possessive 2 plural (formal)    Taweldik 2 zhaq, koepshe, resmi tueri                                    POSS.2PL.POL
    17  +P3pl         Possessive 3 plural             Taweldik 3 zhaq, koepshe                   3. Çoğul İyelik Eki           POSS.3PL
    18  +Pnon         Non Possessive                  Taweldenbegen                              Belirsiz İyelik               NON.POSS
    19  +Nom          Nominative                      Atau                                       Yalın Durum                   NOM
    20  +Acc          Accusative (whom?)              Tabys                                      Belirtme Durumu               ACC
    21  +Dat          Dative                          Barys                                      Yönelme Durumu                DAT
    22  +Abl          Ablative                        Shyghys                                    Çıkma Durumu                  ABL
    23  +Loc          Locative (where?)               Zhatys                                     Kalma Durumu                  LOC
    24  +Gen          Genitive (whose?)               Ilik                                       Tamlayan Durumu               GEN
    25  +Ins          Instrumental                    Koemektes                                  Aracılık Durumu               INS
    26  +Pos          +Positive                       Bolymdy                                    Olumlu                        POSIT
    27  +Neg          +Negative                       Bolymsyz                                   Olumsuz                       NEGAT


For example, row 4 of the table above has no entries in the Kazakh and Turkish
columns and no analogue in English, but the unified tag is preserved because
such a morphological marker for nouns exists in other Turkic languages. In rows
12 and 16 the blank Turkish entry means that the morphological marker exists
only in Kazakh.
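
The correspondence in Table 2 is essentially a lookup table, so it can be
sketched in a few lines. The following is an illustration only: the dictionary
holds the abbreviation and unified tag pairs from Table 2, while the function
name `to_unified`, the language codes "kk" and "tr" and the error handling are
assumptions that do not come from the analyzer itself.

```python
# Abbreviation -> unified tag, copied from Table 2 (rows with an abbreviation).
UNIFIED_TAG = {
    "Noun": "Noun",
    "A1sg": "PERS.1SG", "A2sg": "PERS.2SG", "A3sg": "PERS.3SG",
    "A1pl": "PERS.1PL", "A2pl": "PERS.2PL", "A3pl": "PERS.3PL",
    "P1sg": "POSS.1SG", "P2sg": "POSS.2SG", "P2sgpol": "POSS.2SG.POL",
    "P3sg": "POSS.3SG", "P1pl": "POSS.1PL", "P2pl": "POSS.2PL",
    "P2plpol": "POSS.2PL.POL", "P3pl": "POSS.3PL", "Pnon": "NON.POSS",
    "Nom": "NOM", "Acc": "ACC", "Dat": "DAT", "Abl": "ABL",
    "Loc": "LOC", "Gen": "GEN", "Ins": "INS",
    "Pos": "POSIT", "Neg": "NEGAT",
}

# Rows 12 and 16: markers with a blank Turkish column, i.e. Kazakh only.
KAZAKH_ONLY = {"P2sgpol", "P2plpol"}


def to_unified(analysis, language):
    """Convert 'stem+Marker+Marker...' into [stem, unified tag, ...]."""
    stem, *markers = [part.strip() for part in analysis.split("+")]
    unified = [stem]
    for marker in markers:
        if marker not in UNIFIED_TAG:
            raise KeyError(f"unknown marker: +{marker}")
        if language == "tr" and marker in KAZAKH_ONLY:
            raise ValueError(f"+{marker} is not defined for Turkish")
        unified.append(UNIFIED_TAG[marker])
    return unified


print(to_unified("tereze+Noun+A3sg+Pnon+Gen", "kk"))
print(to_unified("pencere+Noun+A3pl+Pnon+Dat", "tr"))
```

Keeping the language-specific restrictions in a separate set means that a marker
for another Turkic language could be added without touching the conversion
function.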

Metalanguage is one of the key concepts in the system for describing an object
of science. It is defined as an artificial "second-order" language in relation
to which natural human language acts as the "object language", that is, as the
subject of linguistic research. In our case the natural languages are Kazakh and
Turkish, which belong to the Turkic group of languages.

The unified symbol system (UNIFIED TAG) was developed based on the idea of
creating a unified metalanguage for the Turkic languages.

The idea of creating a metalanguage was first proposed at the 1st International
Conference on Computer Processing of the Turkic Languages (TurkLang-2013), which
was held in Astana on 3-4 October 2013. A group of well-known professors of
technical sciences, A.A. Sharipbay (Astana, Kazakhstan), D.Sh. Suleimenov
(Kazan, Tatarstan, Russia) and Eşref Adalı (Istanbul, Turkey), is working on the
creation of the metalanguage.

At the UniTurk scientific-practical seminar of the conference, the discussion of
problems related to the unification of grammatical categories for corpora of the
Turkic languages aroused great interest and was held successfully.

For the computerization of Kazakh, an important step is to study the
computational linguistics of other Turkic-speaking countries. From this point of
view, studying the structures of agglutinative languages similar to Kazakh and
comparing them supports successful computer processing of all languages of the
Turkic group. We are confident that this will greatly advance the
computerization of the Kazakh language.

Our goal is to make correct use of these similarities and differences in
language automation. When languages are processed by computer, the similarities
between them help to solve open problems in one language by drawing on what has
been achieved for another. Moreover, studying the differences between languages
jointly, according to their features, gives an opportunity to apply to one
language a method that gave no results for another. The analysis revealed that
the Kazakh and Turkish languages have much in common. Table 3 compares the
inflection of the noun "window" in the Kazakh and Turkish languages.

          Table 3. Example of inflection of the noun "window"

 English       Kazakh                                      Turkish

 Case endings of the noun (singular form)
 window        tereze: tereze+Noun+A3sg+Pnon+Nom           pencere: pencere+Noun+A3sg+Pnon+Nom
 window's      terezening: tereze+Noun+A3sg+Pnon+Gen       pencerenin: pencere+Noun+A3sg+Pnon+Gen
 to window     terezege: tereze+Noun+A3sg+Pnon+Dat         pencereye: pencere+Noun+A3sg+Pnon+Dat
 window        terezeni: tereze+Noun+A3sg+Pnon+Acc         pencereyi: pencere+Noun+A3sg+Pnon+Acc
 window        terezede: tereze+Noun+A3sg+Pnon+Loc         pencerede: pencere+Noun+A3sg+Pnon+Loc
 from window   terezeden: tereze+Noun+A3sg+Pnon+Abl        pencereden: pencere+Noun+A3sg+Pnon+Abl
 with window   terezemen: tereze+Noun+A3sg+Pnon+Ins        pencerele: pencere+Noun+A3sg+Pnon+Ins
               terezemen: tereze+Noun+A3sg+P1sg+Ins        pencerele: pencere+Noun+A3sg+P1sg+Ins

 Case endings of the noun (plural form)
 windows       terezeler: tereze+Noun+A3pl+Pnon+Nom        pencereler: pencere+Noun+A3pl+Pnon+Nom
 windows'      terezelerding: tereze+Noun+A3pl+Pnon+Gen    pencerelerin: pencere+Noun+A3pl+Pnon+Gen
 to windows    terezelerge: tereze+Noun+A3pl+Pnon+Dat      pencerelere: pencere+Noun+A3pl+Pnon+Dat
 windows       terezelerdi: tereze+Noun+A3pl+Pnon+Acc      pencereleri: pencere+Noun+A3pl+Pnon+Acc
 windows       terezelerde: tereze+Noun+A3pl+Pnon+Loc      pencerelerde: pencere+Noun+A3pl+Pnon+Loc
 from windows  terezelerden: tereze+Noun+A3pl+Pnon+Abl     pencerelerden: pencere+Noun+A3pl+Pnon+Abl
 with windows  terezelermen: tereze+Noun+A3pl+Pnon+Ins     pencerelerle: pencere+Noun+A3pl+Pnon+Ins

Recording the morphological rules in the unified form made it possible to create
a uniform rule-based algorithm of morphological analysis for the Kazakh and
Turkish languages, based on the algorithms described in [9, 10, 17].
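
To make the idea of one algorithm driven by two rule sets concrete, here is a
minimal sketch. It is not the algorithm of [9, 10, 17]: the `RULES` layout, the
`analyze` function and the one-word stem lists are assumptions made for
illustration, the suffix forms are limited to those that occur in Table 3, and
possessive readings (such as the +P1sg analysis of "terezemen") are left out.

```python
# Hypothetical rule tables; suffix forms are copied from Table 3 only.
RULES = {
    "kk": {
        "stems": {"tereze"},
        "plural": ["ler"],
        "case": {"ning": "Gen", "ding": "Gen", "ge": "Dat", "ni": "Acc",
                 "di": "Acc", "de": "Loc", "den": "Abl", "men": "Ins"},
    },
    "tr": {
        "stems": {"pencere"},
        "plural": ["ler"],
        "case": {"nin": "Gen", "in": "Gen", "ye": "Dat", "e": "Dat",
                 "yi": "Acc", "i": "Acc", "de": "Loc", "den": "Abl",
                 "le": "Ins"},
    },
}


def analyze(word, language):
    """Return every analysis 'stem+Noun+A3xx+Pnon+Case' the rules license."""
    rules = RULES[language]
    results = []
    # Step 1: mark a candidate case suffix (the empty suffix is nominative).
    case_options = [("", "Nom")] + list(rules["case"].items())
    number_options = [("", "A3sg")] + [(p, "A3pl") for p in rules["plural"]]
    for case_suffix, case_tag in case_options:
        if not word.endswith(case_suffix):
            continue
        rest = word[: len(word) - len(case_suffix)]
        # Step 2: mark an optional plural suffix on what remains.
        for plural_suffix, number_tag in number_options:
            if not rest.endswith(plural_suffix):
                continue
            stem = rest[: len(rest) - len(plural_suffix)]
            # Step 3: identify the stem, then assign the markers.
            if stem in rules["stems"]:
                results.append(f"{stem}+Noun+{number_tag}+Pnon+{case_tag}")
    return results


print(analyze("terezelerding", "kk"))
print(analyze("pencereye", "tr"))
```

The function body is the same for both languages; only the rule table selected
by the language code differs, which is the property the unified notation is
meant to provide.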


4   Conclusion
In this paper the morphological features of the Kazakh and Turkish languages are
analyzed. The ontologies are compared, a uniform symbol system of morphological
features is developed, and the morphological rules of the Kazakh and Turkish
languages are rewritten in the new symbol system. The unified morphological
analyzer is developed on the basis of a general morphological analysis
algorithm.

In the future we plan to create a unified metalanguage of the Turkic languages,
which will allow the processing of the Turkic languages to reach a new level.


References
1. Batayeva, Z. (2012). Colloquial Kazakh, Routledge
2. Kazakh grammar. (2002). Phonetics, word formation, morphology, syntax (in
   Kazakh). Astana
3. Sharipbayev A. A., Bekmanova G. T.: The synthesis of word forms of Turkic lan-
   guage using semantic neural networks. Modern problems of applied mathematics
   and information technologies: abstracts Al Khorezmy, pp.145 (2009)
4. Sharipbayev, A.A., Bekmanova, G. T.: The building of logical semantics of the
   Kazakh words. The materials of the all-Russian conference with International partic-
   ipations Knowledge-Ontology-Theory (ZONT-09), Novosibirsk, pp. 246-249 (2009)
5. Tantuğ, A. C., Adalı, E., Oflazer, K.: Computer Analysis of the Turkmen Language
   Morphology. In: Salakoski T., Ginter F., Pyysalo S., Pahikkala T. (eds) Advances
   in Natural Language Processing. Lecture Notes in Computer Science, vol 4139, pp.
   186-193, Springer, Berlin, Heidelberg (2006)
6. Orhun, M., Tantuğ, A. C., Adalı, E.: Rule Based Analysis of the Uyghur Nouns.
   Proceedings of the International Conference on Asian Language Processing (IALP).
   Chiang Mai, Thailand, 19 (1): pp. 33-43 (2008)
7. Bekmanova, G. T.: Some approaches to the problems of automatic inflection and
   morphological analysis in the Kazakh language. The newsletter of D. Serikbayev
   East Kazakhstan state technical university, Ust-Kamenogorsk, pp. 192-197 (2009)
8. Dobrushina, E. P., Savina, G. B., Gelbukh, A. G.: The system of an accurate
   morphological analysis and synthesis. The software of new information technology,
   Kalinin (1989)
9. Sharipbayev, A., Bekmanova, G., Mukanova, A., Buribayeva, A., Yergesh, B., Kaliyev,
   A.: Semantic neural network model of morphological rules of the agglutinative lan-
   guages. The 6th International Conference on Soft Computing and Intelligent Sys-
   tems The 13th International Symposium on Advanced Intelligent Systems, Kobe,
   Japan, 20-24 November 2012, pp. 1094-1099
10. Yergesh, B., Mukanova, A., Bekmanova, G., Sharipbay, A., Razakhova, B.: Semantic
   hyper-graph based representation of nouns in the Kazakh language. Computacion
   y Sistemas; Volume 18, Issue 3, 1 July 2014, pp. 627-635
11. Mukanova, A., Yergesh, B., Bekmanova G., Razakhova, B., Sharipbay, A.: Formal
   models of nouns in the Kazakh language. Leonardo Electronic Journal of Practices
   and Technologies; Issue 25 (July-December), 2014 (13), pp. 264-273
12. Zetkenbay, L., Sharipbay, A., Bekmanova, G., Kamanur, U.: Ontological modeling
   of morphological rules for the adjectives in Kazakh and Turkish languages. Journal of
   Theoretical and Applied Information Technology, Vol. 91. No.2, pp. 257-263 (2016)
13. Kamanur U., Sharipbay A., Altenbek G., Bekmanova G., Zhetkenbay L.: Investiga-
   tion and Use of Methods for Defining the Extends of Similarity of Kazakh Language
   Sentences. In: Sun M., Huang X., Lin H., Liu Z., Liu Y. (eds) Chinese Computa-
   tional Linguistics and Natural Language Processing Based on Naturally Annotated
   Big Data. CCL 2016, NLP-NABD 2016. LNCS, vol 10035. Springer, Cham (2016)
14. Tukeyev, U., Zhumanov, Zh., Rakhimova, D., Kartbayev, A.: Combinational Cir-
   cuits Model of Kazakh and Russian Languages Morphology. Abstracts of Interna-
   tional Conference Computational and Informational Technologies in Science, Engi-
   neering and Education, pp. 241-242. Al-Farabi KazNU Press, Almaty (2015)
15. Kessikbayeva, G., Cicekli, I.: Rule-Based Morphological Analyzer of Kazakh Lan-
   guage. Linguistics and Literature Studies 4(1): pp. 96-104 (2016)
16. Makhambetov, O., Makazhanov, A., Sabyrgaliyev, I., Yessenbayev, Z. Data-Driven
   Morphological Analysis and Disambiguation for Kazakh. In: Gelbukh A. (eds)
   Computational Linguistics and Intelligent Text Processing. CICLing 2015. LNCS, vol
   9041, pp. 151-163. Springer, Cham (2015)
17. Eryğit, G., Adalı, E.: An affix stripping morphological analyzer for Turkish. In
   Proceedings of the IASTED International Conference on Artificial Intelligence and
   Applications, Innsbruck, Austria, pp. 299-304 (2004)
18. Eryğit, G., Adalı, E.: Synthetic Turkish Word Root Generation. Proceedings of
   the Turkish Artificial Intelligence and Neural Networks, TAINN, Canakkale, Turkey
   (2003)
19. Akın, A.A., Akın, M.D.: Zemberek, an open source nlp framework for Turkic lan-
   guages. Available at http://zemberek.googlecode.com
20. Hakkani-Tür, D. Z., Oflazer, K., Tür, G.: Statistical morphological disambiguation
   for agglutinative languages. In Proceedings of COLING. ICCL (2000)
21. Hankamer, J. Finite State Morphology and Left to Right Phonology. Proceedings
   of the West Coast Conference on Formal Linguistics 5. Stanford University (1986)
22. Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge
   Sharing. International Journal Human-Computer Studies Vol. 43, Issues 5-6, 907-928
   (1995)