=Paper=
{{Paper
|id=Vol-2718/paper14
|storemode=property
|title=Tree Structure of Slovak Sentences
|pdfUrl=https://ceur-ws.org/Vol-2718/paper14.pdf
|volume=Vol-2718
|authors=Michaela Linková,Stanislav Krajči
|dblpUrl=https://dblp.org/rec/conf/itat/LinkovaK20
}}
==Tree Structure of Slovak Sentences==
Michaela Linková and Stanislav Krajči

Institute of Computer Science, Pavol Jozef Šafárik University in Košice, Slovakia, michaela.linkova@student.upjs.sk

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: In this work-in-progress paper, we propose an algorithm for creating a tree structure of a Slovak sentence. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. Finding the right sentence structure helps to understand the meaning of the sentence better. In Slovak, words have different forms, and there are various ways to compose sentences. We found that the algorithm works properly for simple sentences with one predicate and subject and for sentences connected with one conjunction.

===1 Introduction and related works===

Natural language processing of the Slovak language is a topical and fruitful area of research, and a challenging one. Slovak is an inflected language with complex rules for word inflection, as there are many possible word forms, classifications and contexts. One part of natural language processing is understanding the structure of sentences. This work-in-progress paper proposes an algorithm for creating a tree structure of a Slovak sentence. The algorithm is not based on statistical data from a corpus but takes raw data from Tvaroslovník, a database of all forms of Slovak words. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence, and the root of the tree is the predicate. The tree structure of the Slovak sentence "Lucia číta veľmi peknú knihu." ("Lucy is reading a very beautiful book.") is shown in Figure 1.

Figure 1: Example of the tree structure of the sentence "Lucia číta veľmi peknú knihu."

Several tools, dictionaries and conferences already exist in this area of research. For example, the paper "Online Natural Language Processing of the Slovak Language" presents a website with NLP tools for lemmatization, correction, finding the part of speech of the words in a sentence, and others [1]. The paper "Morphological analysis of the Slovak language" introduces a statistical algorithm that segments words by identifying a suffix. This ability to identify the suffix helps to classify even words that were not seen in the training corpus [2]. Working with words requires more information about them. The Ľudovít Štúr Institute of Linguistics offers a wide selection of dictionaries, including a Dictionary of the Slovak language, a Slovak spelling rules dictionary, a Dictionary of foreign words, a Synonym dictionary and many more [3]. It also provides the Slovak National Corpus, an electronic database that mainly contains Slovak texts from 1955 onwards, covering different styles, genres, thematic areas, regions and so on. Every two years, the institute organizes the SLOVKO conference on natural language processing [4]. In 2017, the Slovak Dependency Treebank in Universal Dependencies was created from the Slovak National Corpus [5]. Tvaroslovník, a database of Slovak words and their forms, was created at Pavol Jozef Šafárik University in Košice [6], [7]. The master's thesis "Syntaktická analýza slovenskej vety pomocou Tvaroslovníka" deals with the creation of an algorithm for finding the structure of a sentence [8]. Statistical and machine learning approaches have also been developed in recent years, for example a two-stage multilingual dependency parser, which was evaluated on 13 diverse languages including Czech [9], and a neural network classifier for use in a greedy, transition-based dependency parser [10].

Slovak and Czech have similar grammatical rules and sentence structure. Therefore, tools and algorithms for natural language processing of Czech can also inspire algorithms for Slovak. Three main institutions in the Czech Republic deal with natural language processing: one at Masaryk University in Brno and two at Charles University in Prague. Masaryk University created tools for adding diacritics to texts, topic detection and named entity recognition, the morphological analyzer "Majka", and others [11]. The Institute of Formal and Applied Linguistics at Charles University developed tools for annotation, tagging and correction, as well as a morphological analyzer and others [12]. This institute also develops the trainable pipeline "UDPipe", which performs sentence segmentation, tokenization, lemmatization and dependency parsing [13]. The third is the Institute of Theoretical and Computational Linguistics at Charles University. This institute develops computational tools for automatic language processing, for example syntactic annotation of Czech corpora or a grammar-based treebank of Czech [14].

===2 Slovak syntax===

Grammatical rules define the structure of a sentence. A sentence member is the basic unit of a sentence: a part of the sentence that stands in some relation to the other parts. The main sentence members are the subject and the predicate.

The subject is the part of the sentence that describes who or what is doing something. It can be a noun, adjective, pronoun or numeral, and it should be in the nominative (first) case. In Slovak sentences, the subject can be expressed or unexpressed. An unexpressed subject does not appear in the sentence. For example, the sentence "Nakupujeme" ("We are shopping") contains only a verb and no subject. Because the verb "nakupujeme" is in the first person plural, it is clear in Slovak that the subject is "we". The predicate expresses an action or state of the subject. It can be a verb, or an auxiliary verb plus a noun, adjective or numeral. In Figure 1, the subject is "Lucia" and the predicate is "číta".

The other sentence members are the object, the attribute and the adverbial. The object specifies the predicate and shows the object of the action. It can be a noun or pronoun, and it should not be in the nominative case. The adverbial gives more information about time, place, manner or cause. It can be an adverb or a noun. The attribute modifies the subject and is an adjective, pronoun or numeral. There are two types of attribute: one must have the same grammatical categories as the subject, and the other has at least one grammatical category different from the subject. In Figure 1, the object is "knihu" and the attribute is "peknú". In the sentence "Deti prišli večer." ("The children came in the evening."), the adverbial is "večer". Every sentence member except the predicate can occur in a sentence more than once. Conjunctions or commas connect identical sentence members, and these sentence members form a relation with one another. In a relation, the word that depends on the other word is called the dependent, and the other word is the superior.

The Slovak language distinguishes three types of sentence: simple, compound and fundament sentences. A simple sentence has only one predicate. A compound sentence has two or more predicates, and its parts are connected by conjunctions or commas. The third type has no subject, only a verb or a noun. The verb in these sentences usually describes a general action, for example "Prší" ("It is raining") [15].

===3 Finding structure of sentence===

Tvaroslovník is a database of all forms of Slovak words. It was created at Pavol Jozef Šafárik University. The database contains a large number of Slovak words, their forms and other information about them. Every row contains the form of a word, its part of speech and the grammatical categories of the word. The data in Tvaroslovník were collected from the dictionary of the Slovak language. All data are stored in one table with the following columns:

* idWord: unique identification number of the word,
* idForm: unique identification number of the word's form,
* form: a form of the word,
* part-of-speech,
* categories: grammatical categories, which differ for every part of speech.

Table 1 shows an example of the records for the word "kniha" ("book").

{| class="wikitable"
|+ Table 1: Tvaroslovník
! idWord !! idForm !! form !! part-of-speech !! categories
|-
| 1 || 0 || kniha || noun || gender: feminine; number: singular; case: nominative
|-
| 1 || 1 || knihy || noun || gender: feminine; number: singular; case: genitive
|-
| 1 || 2 || knihe || noun || gender: feminine; number: singular; case: dative
|-
| 1 || 3 || knihu || noun || gender: feminine; number: singular; case: accusative
|-
| 1 || 4 || knihe || noun || gender: feminine; number: singular; case: locative
|-
| 1 || 5 || knihou || noun || gender: feminine; number: singular; case: instrumental
|-
| 1 || 6 || knihy || noun || gender: feminine; number: plural; case: nominative
|-
| 1 || 7 || kníh || noun || gender: feminine; number: plural; case: genitive
|-
| 1 || 8 || knihám || noun || gender: feminine; number: plural; case: dative
|-
| 1 || 9 || knihy || noun || gender: feminine; number: plural; case: accusative
|-
| 1 || 10 || knihách || noun || gender: feminine; number: plural; case: locative
|-
| 1 || 11 || knihami || noun || gender: feminine; number: plural; case: instrumental
|}

The algorithm for finding a tree structure of a sentence takes Slovak text as input. In the first step, the input is split into sentences, using a full stop as the separator, and the sentences are put into a list. The algorithm then iterates over the list of sentences. Using Tvaroslovník, it finds all forms and characteristics of each word in a sentence. In the next step, the algorithm finds out how many predicates the sentence has. If there is more than one predicate, the algorithm separates the sentence into smaller parts at a conjunction or comma. The following steps are the same as for a sentence with one predicate, except that the algorithm works with these smaller parts instead of the whole sentence.

After identifying the type of sentence, the algorithm checks how many identical sentence members the sentence contains. If there is more than one, the algorithm takes this part of the sentence and creates a relation between each member and the comma or conjunction that connects these members. These relations are added to the list of possible relations of the sentence. After that, the algorithm takes pairs of words that stand next to each other and tries to choose a possible relation between them. The algorithm defines 29 types of relations between words. Besides the required parts of speech and grammatical categories, every relation has a priority, which was set by empirical experience. All types of relations and their priorities are presented in Table 2.

{| class="wikitable"
|+ Table 2: Relations and their priorities
! Dependent !! Superior !! Priority !! Required grammatical categories !! Type of relation
|-
| Noun || Conjunction || 10 || None || Multiple sentence member with noun
|-
| Adjective || Conjunction || 10 || None || Multiple sentence member with adjective
|-
| Pronoun || Conjunction || 10 || None || Multiple sentence member with pronoun
|-
| Numeral || Conjunction || 10 || None || Multiple sentence member with numeral
|-
| Adverb || Adverb || 9 || None || Complex adverbial
|-
| Adverb || Adjective || 9 || None || Complex attribute
|-
| Pronoun || Pronoun || 9 || None || Two pronouns
|-
| Pronoun sa, si || Verb || 8 || None || Reflexive verb
|-
| Preposition || Adjective || 7 || Same case || Preposition with adjective
|-
| Preposition || Pronoun || 7 || Same case || Preposition with pronoun
|-
| Preposition || Noun || 6 || Same case || Preposition with noun
|-
| Auxiliary verb || Verb || 6 || None || Complex predicate
|-
| Auxiliary verb || Noun || 6 || None || Complex predicate
|-
| Auxiliary verb || Adjective || 6 || None || Complex predicate
|-
| Auxiliary verb || Pronoun || 6 || None || Complex predicate
|-
| Pronoun || Adjective || 5 || Same gender, number and case || Complex attribute
|-
| Pronoun || Noun || 4 || Same gender, number and case || Pronoun attribute
|-
| Adjective || Noun || 4 || Same gender, number and case || Adjective attribute
|-
| Noun || Noun || 3 || Nouns should not have the same gender, number or case || Attribute
|-
| Preposition || Verb || 3 || None || Preposition with verb
|-
| Pronoun || Verb || 2 || Pronoun should not be in the nominative case || Pronoun object
|-
| Noun || Verb || 2 || Noun should not be in the nominative case || Noun object
|-
| Adjective || Verb || 2 || Noun should not be in the nominative case || Adjective object
|-
| Numeral || Verb || 2 || Noun should not be in the nominative case || Numeral object
|-
| Adverb || Verb || 2 || None || Adverbial
|-
| Noun || Verb || 1 || Noun should be in the nominative case || Noun subject
|-
| Adjective || Verb || 1 || Adjective should be in the nominative case || Adjective subject
|-
| Pronoun || Verb || 1 || Pronoun should be in the nominative case || Pronoun subject
|-
| Numeral || Noun || 1 || Numeral should be in the nominative case || Numeral subject
|}

For each pair, the algorithm iterates over all forms of both words and tries to find suitable word forms according to the possible types of relations. After the type of relation has been chosen for each pair, the relations are put into the list of possible relations, which is sorted by priority from the highest to the lowest. If a pair has more than one possible relation, the algorithm puts all of them into the list. Next, the algorithm selects the relation with the highest priority in the list. The selected relation is added to the list of final relations, from which the tree is later built. After that, the selected relation is removed from the list of possible relations, and the dependent word is removed from the sentence. Every relation that has the same dependent word as the selected relation but a different superior word is also removed from the list of possible relations. The words next to the removed word then become candidates for a new relation, so the algorithm checks whether they can form one. If they can, the new relation is added to the list of possible relations, and the list is sorted by priority again. If two relations have the same dependent and superior word forms as the selected relation but a different type of relation, the algorithm creates a copy of the list of possible relations and uses the copy to build another possible tree structure with different types of relations. In this way, if a relation is ambiguous, the algorithm finds all possible tree structures of the sentence. The process is repeated until only one word remains in the sentence. The last step is to build a tree structure from the list of final relations. The algorithm is implemented in the Java programming language, and the data are stored in the Java Collections classes List and Map.

The example below illustrates the work of the algorithm. The input is the sentence "Lucia číta veľmi peknú knihu." ("Lucy is reading a very beautiful book."). First, the algorithm creates a list of the words in the sentence, finds all their possible forms in Tvaroslovník and puts them into a map. In our example, the algorithm generates the following map:

* Lucia: [idWord: 128848, idForm: 1, form: Lucia, part-of-speech: noun, categories: gender: feminine; number: singular; case: nominative]
* číta: [idWord: 8679, idForm: 6, form: číta, part-of-speech: verb, categories: person: third; number: singular; tense: present]
* veľmi: [idWord: 102690, idForm: 0, form: veľmi, part-of-speech: adverb, categories: None]
* peknú: [idWord: 56578, idForm: 17, form: peknú, part-of-speech: adjective, categories: gender: feminine; number: singular; case: accusative]
* knihu: [idWord: 27834, idForm: 4, form: knihu, part-of-speech: noun, categories: gender: feminine; number: singular; case: accusative]

After that, the list of possible relations with their priorities is created and sorted by priority:

* dependent: veľmi and superior: peknú, priority: 9
* dependent: peknú and superior: knihu, priority: 4
* dependent: veľmi and superior: číta, priority: 2
* dependent: Lucia and superior: číta, priority: 1

First iteration. The relation with priority 9 is chosen. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word veľmi. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. After the removal, there is a new possible relation between the words číta and peknú with priority 2. This new relation is added to the list of possible relations, and the list is sorted again. After the first iteration, the sentence is "Lucia číta peknú knihu." ("Lucy is reading a beautiful book."). The list of final relations is:

* dependent: veľmi and superior: peknú, priority: 9

And the list of possible relations is:

* dependent: peknú and superior: knihu, priority: 4
* dependent: peknú and superior: číta, priority: 2
* dependent: Lucia and superior: číta, priority: 1

Second iteration. The relation with priority 4 is selected. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word peknú. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. There is a new possible relation between the words číta and knihu with priority 2. This new relation is added to the list of possible relations, and the list is sorted again. After the second iteration, the sentence is "Lucia číta knihu." ("Lucy is reading a book."). The list of final relations is:

* dependent: veľmi and superior: peknú, priority: 9
* dependent: peknú and superior: knihu, priority: 4

And the list of possible relations is:

* dependent: knihu and superior: číta, priority: 2
* dependent: Lucia and superior: číta, priority: 1

Third iteration. The relation with priority 2 is selected. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word knihu. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. This time, no new relation can be created. After the third iteration, the sentence is "Lucia číta." ("Lucy is reading."). The list of final relations is:

* dependent: veľmi and superior: peknú, priority: 9
* dependent: peknú and superior: knihu, priority: 4
* dependent: knihu and superior: číta, priority: 2

And the list of possible relations is:

* dependent: Lucia and superior: číta, priority: 1

Fourth iteration. The relation with priority 1 is chosen. This relation is added to the list of final relations and removed from the list of possible relations. The dependent word is removed from the sentence. Only the last word remains in the sentence and the list of possible relations is empty, so the fourth iteration is the last one and all needed relations are in the list of final relations. The list of final relations is:

* dependent: veľmi and superior: peknú, priority: 9
* dependent: peknú and superior: knihu, priority: 4
* dependent: knihu and superior: číta, priority: 2
* dependent: Lucia and superior: číta, priority: 1

In the last step, a tree structure is built from the list of final relations. Figure 2 illustrates the gradual construction of the tree structure: (a), (b) and (c) show a partial tree of the sentence after the first three iterations, and (d) gives the output for the whole sentence after the fourth iteration.

Figure 2: Tree structure after each iteration. (a) First iteration. (b) Second iteration. (c) Third iteration. (d) Fourth iteration.

===4 Conclusions and further research===

In this paper, we have proposed an algorithm for creating a tree structure of a Slovak sentence. We have presented the running of our algorithm on the example Slovak sentence "Lucia číta veľmi peknú knihu". However, the Slovak language allows various word orders and combinations of sentence members. The algorithm also finds a tree structure for the following types of sentences:

* simple sentence, for example "Lenka nakupuje oblečenie." ("Lenka is buying clothes.")
* simple sentence without a subject, for example "Kráčame do školy." ("We are going to school.")
* simple sentence with identical sentence members, for example "Milý a pekný Martin kupuje kvety Lucke." ("Nice and handsome Martin buys flowers for Lucy.")
* compound sentence with two predicates, for example "Deti sa hrajú na ihrisku a rodičia sa rozprávajú." ("Children are playing on a playground and parents talk.")
* compound sentence with two predicates and identical sentence members, for example "Pekný a upravený dom stojí na kraji ulice a bývajú v ňom dvaja ľudia." ("A beautiful and tidy house stands on the side of the street and two people are living in it.")

We aim to improve the presented algorithm in our future research. In particular, our objective is to analyze the following special types of sentences:

* fundament sentence, for example "Prší." ("It is raining.")
* simple sentence with a complex predicate, for example "Mal by som už ísť." ("I should go now.")
* compound sentence with more predicates, for example "Na záhrade máme červené tulipány, ktoré nám darovala stará mama a vedľa tulipánov je záhon ruží." ("We have red tulips in the garden that our grandmother gave us, and next to the tulips is a bed of roses.")

Testing the algorithm on the various types of Slovak sentences and visualizing the outputs in a user-friendly environment are the main objectives of our future research.

===References===

[1] D. Hladek, S. Ondáš, and J. Staš, "Online natural language processing of the Slovak language," Nov. 2014.

[2] D. Hladek, J. Stas, and J. Juhar, "Morphological analysis of the Slovak language," Advances in Electrical and Electronic Engineering, vol. 13, no. 4, pp. 289–294, 2015.

[3] "Slovenské slovníky." https://slovnik.juls.savba.sk/?d=pskcs&d=psken&d=locutio&d=ma. (Accessed on 06/09/2020).

[4] R. Garabík, "Slovenský národný korpus," 2020.

[5] D. Zeman, "Slovak dependency treebank in universal dependencies," Journal of Linguistics/Jazykovedný časopis, vol. 68, no. 2, pp. 385–395, 2017.

[6] S. Krajči and R. Novotný, "Tvaroslovník – databáza tvarov slov slovenského jazyka," in Zborník príspevkov z pracovného seminára ITAT, pp. 57–61, 2012.

[7] S. Krajči and R. Novotný, "Projekt Tvaroslovník – slovník všetkých tvarov všetkých slovenských slov," in Znalosti 2012, pp. 109–112, Vydavatelství MFF UK, 2012.

[8] J. Hiľovská, Syntaktická analýza slovenskej vety pomocou Tvaroslovníka. PhD thesis, 2017.

[9] R. McDonald, K. Lerman, and F. Pereira, "Multilingual dependency analysis with a two-stage discriminative parser," in Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pp. 216–220, 2006.

[10] D. Chen and C. D. Manning, "A fast and accurate dependency parser using neural networks," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 740–750, 2014.

[11] "Natural language processing centre." https://nlp.fi.muni.cz/en/NLPCentre. (Accessed on 06/09/2020).

[12] "Tools | ÚFAL." https://ufal.mff.cuni.cz/tools. (Accessed on 06/09/2020).

[13] M. Straka, "UDPipe 2.0 prototype at CoNLL 2018 UD shared task," in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207, 2018.

[14] "UTKL." http://utkl.ff.cuni.cz/en/utkl.html. (Accessed on 06/09/2020).

[15] J. Pavlovič, Syntax slovenského jazyka I. 2012.
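The greedy, priority-driven selection of relations described in Section 3 can be illustrated with a short sketch. The Python code below is only an illustration under stated assumptions and is not the authors' Java implementation: the `Word` and `Relation` records, the functions `match_rule` and `build_relations`, and the hand-written word list are hypothetical names introduced here. It encodes only five of the 29 relation types from Table 2, recomputes the candidate relations from neighbouring words in every iteration instead of maintaining one sorted list, and does not generate alternative trees when relations are ambiguous.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    """One word form with the categories used by the rules (assumed layout)."""
    form: str
    pos: str                      # part of speech
    gender: Optional[str] = None
    number: Optional[str] = None
    case: Optional[str] = None


@dataclass(frozen=True)
class Relation:
    dependent: str
    superior: str
    priority: int
    kind: str


def match_rule(dep: Word, sup: Word) -> Optional[Tuple[int, str]]:
    """Return (priority, type) for a dependent/superior pair, or None.

    Only a small subset of Table 2 is encoded here.
    """
    pair = (dep.pos, sup.pos)
    if pair == ("adverb", "adjective"):
        return 9, "complex attribute"
    if pair == ("adjective", "noun"):
        same = (dep.gender, dep.number, dep.case) == (sup.gender, sup.number, sup.case)
        return (4, "adjective attribute") if same else None
    if pair == ("adverb", "verb"):
        return 2, "adverbial"
    if pair in (("noun", "verb"), ("adjective", "verb")):
        if dep.case == "nominative":
            return 1, f"{dep.pos} subject"
        return 2, f"{dep.pos} object"
    return None


def build_relations(words: List[Word]) -> Tuple[List[str], List[Relation]]:
    """Greedily attach the highest-priority neighbouring pair until one word is left."""
    sentence = list(words)        # work on a copy of the input
    final: List[Relation] = []
    while len(sentence) > 1:
        candidates = []
        for i in range(len(sentence) - 1):
            # Try both directions for every pair of neighbouring words.
            for d, s in ((i, i + 1), (i + 1, i)):
                rule = match_rule(sentence[d], sentence[s])
                if rule is not None:
                    candidates.append((rule[0], d, s, rule[1]))
        if not candidates:
            break                 # no rule applies; stop with a partial result
        priority, d, s, kind = max(candidates, key=lambda c: c[0])
        final.append(Relation(sentence[d].form, sentence[s].form, priority, kind))
        del sentence[d]           # the dependent word leaves the sentence
    return [w.form for w in sentence], final


# Hand-written word forms for the example sentence (categories as in the map in Section 3).
SENTENCE = [
    Word("Lucia", "noun", "feminine", "singular", "nominative"),
    Word("číta", "verb"),
    Word("veľmi", "adverb"),
    Word("peknú", "adjective", "feminine", "singular", "accusative"),
    Word("knihu", "noun", "feminine", "singular", "accusative"),
]

if __name__ == "__main__":
    remaining, relations = build_relations(SENTENCE)
    for r in relations:
        print(f"{r.dependent} -> {r.superior} (priority {r.priority}, {r.kind})")
    print("root:", remaining[0])
```

For the example sentence, the sketch selects the relations with priorities 9, 4, 2 and 1 in the same order as the four iterations in Section 3 and leaves "číta" as the only remaining word, which becomes the root of the tree.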