Syntactic analysis of the Slovak sentence

Michaela Vočková and Stanislav Krajči

Institute of Computer Science, Pavol Jozef Šafárik University in Košice, Slovakia, michaela.vockova@student.upjs.sk

Abstract: Natural language processing has recently become a widely discussed topic in computer science. Its main idea is the understanding of human languages by computers. In this work-in-progress paper, we propose an algorithm for creating a tree structure of a Slovak sentence. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. Understanding the structure of a sentence is important for other natural language processing tasks, such as semantic analysis. There are many different types of sentences in the Slovak language, which we took into account when creating the algorithm, for example sentences with a multiple sentence member, compound sentences, sentences with a compound predicate, and others. Our algorithm correctly analysed 85 out of 100 different sentences.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Natural language processing is a part of artificial intelligence and linguistics that focuses on the understanding of human language by computers. There are different tasks in natural language processing:

• Automatic summarization provides a summary or detailed information of a text of a known type.
• Co-reference resolution determines, for a sentence or a larger set of text, which words refer to the same object.
• Discourse analysis refers to the task of identifying the discourse structure of a text.
• Machine translation refers to the automatic translation of text from one human language to another.
• Morphological segmentation separates words into individual morphemes and identifies the class of the morphemes.
• Named entity recognition determines, for a stream of text, which items in the text relate to proper names.
• Optical character recognition determines, for an image representing printed text, the corresponding text.
• Part-of-speech tagging determines, for a sentence, the part of speech of each word.

Some of these tasks can be used as subtasks of more complex tasks [1].

Semantic and syntactic parsing is also a part of natural language processing, aiming to uncover the internal relations between words. There are two approaches to finding the structure of a sentence: constituent parsing and dependency parsing. Constituent parsing produces a constituent tree whose nodes are phrases; the goal is to find these phrases and their relations. Approaches to constituent parsing include chart-based and transition-based models, both of which have statistical and neural variants. Dependency parsing uses a bilexicalized dependency grammar, which contains all semantic and syntactic dependencies. Dependency parsing models are divided into two groups, graph-based models and transition-based models, both of which have their own statistical or neural network approaches [2].

This work-in-progress paper proposes an improvement of the algorithm for creating a tree structure of a Slovak sentence [19]. The algorithm is not based on statistical data from a corpus but takes raw data from Tvaroslovník, a database of all forms of Slovak words. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. The tree structure of the Slovak sentence Hodina dnes začala malým kvízom.¹ is shown in Figure 1.

začala
    Hodina
    dnes
    kvízom
        malým

Figure 1: Example of sentence tree structure for Hodina dnes začala malým kvízom.¹

¹ The class starts with a small quiz today.

2 State of the Art

The Institute of Formal and Applied Linguistics at Charles University in Prague has created the Prague Dependency Treebank, which is an excellent contribution to natural language processing. Several tools for finding the structure of a sentence or for working on other natural language processing tasks have been developed on the basis of this corpus or of the Universal Dependencies treebanks. For example [3]:

• Netgraph – a graphically oriented client-server application for searching in an annotated corpus.
• TrEd – an editor used for searching syntactically annotated sentence structures.
• Morfo – a system for morphological analysis of the Czech language.
• MorphoDiTa – a free tool for morphological analysis of natural language texts.
• Moses – a statistical machine translation system that allows translation models to be trained automatically for any language pair.
• UDPipe – a trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing. The institute has developed two versions of UDPipe [4], [5].

The Natural Language Processing Centre at Masaryk University in Brno is mainly engaged in research into the processing of the Czech, English, and Slovak languages. It deals with morphological, syntactic, and semantic analysis and with the creation of corpora and dictionaries. The centre has created several tools for morphological, syntactic, and semantic analysis. Examples include [6]:

• Majka – a morphological analyzer for the Slovak, Czech, Polish, Swedish, and German languages.
• The Sketch Engine – a tool used to search for information in text corpora.
• CZ accent – a tool for adding accents to text.
• Synt and SET – parsers used to determine the structure of a sentence.
• Visual Browser – Java software that visualizes data in RDF format.

The Institute of Theoretical and Computational Linguistics at Charles University develops computational tools for automatic language processing, for example the syntactic annotation of Czech corpora or a grammar-based treebank of the Czech language [7].

As for Czech, there are several tools, dictionaries, and conferences devoted to natural language processing research on the Slovak language. The Ľudovít Štúr Institute of Linguistics offers a wide selection of dictionaries. These include [8], [9], [10], and many more [11]. It also provides the Slovak National Corpus, an electronic database containing mainly Slovak texts from 1955 onwards, covering different styles, genres, thematic areas, regions, and more. The institute has developed tools for searching for words in the Slovak National Corpus and working with them. For example, DEVELOPER visualizes the occurrence of one or two words in the corpus, DIAKRITIK corrects diacritics, and KOLOKAT visualizes distances between two terms in the corpus [12]. Every two years, the institute organizes the SLOVKO conference on natural language processing [13]. In 2017, D. Zeman presented the article Slovak Dependency Treebank in Universal Dependencies about converting the syntactically annotated part of the Slovak National Corpus into the annotation scheme known as Universal Dependencies. Universal Dependencies is an international standard and also the largest collection of freely available dependency treebanks [14]. Tvaroslovník, a database of Slovak words and their forms, was created at Pavol Jozef Šafárik University in Košice [15], [16]. The master's thesis [17] deals with the creation of an algorithm for finding the structure of a sentence.

3 Dictionaries

To create the structure of a sentence, more information about its words is needed. Our algorithm for syntactic analysis therefore uses the dictionary Tvaroslovník and a valency dictionary.

3.1 Tvaroslovník

Tvaroslovník is a database of all forms of all Slovak words from [8] and [9]. Every row contains a form of a word, its part of speech, and its grammatical categories. The data in Tvaroslovník were collected from the dictionary of the Slovak language. The database contains approximately 220,000 words and 24,000,000 records of words and all their forms. All data are stored in one table with the following columns:

• idWord – unique identification number of the word,
• idForm – unique identification number of the word's form,
• form – a form of the word,
• part-of-speech,
• categories – grammatical categories, which differ for every part of speech.

Table 1 shows an example of the records for the word hodina².

idWord | idForm | form | part-of-speech | categories
20009 | 0 | hodina | noun | gender: feminine; number: singular; case: nominative
20009 | 1 | hodiny | noun | gender: feminine; number: singular; case: genitive
20009 | 2 | hodine | noun | gender: feminine; number: singular; case: dative
20009 | 3 | hodinu | noun | gender: feminine; number: singular; case: accusative
20009 | 4 | hodina | noun | gender: feminine; number: singular; case: vocative
20009 | 5 | hodine | noun | gender: feminine; number: singular; case: locative
20009 | 6 | hodinou | noun | gender: feminine; number: singular; case: instrumental
20009 | 7 | hodiny | noun | gender: feminine; number: plural; case: nominative
20009 | 8 | hodín | noun | gender: feminine; number: plural; case: genitive
20009 | 9 | hodinám | noun | gender: feminine; number: plural; case: dative
20009 | 10 | hodiny | noun | gender: feminine; number: plural; case: accusative
20009 | 11 | hodiny | noun | gender: feminine; number: plural; case: vocative
20009 | 12 | hodinách | noun | gender: feminine; number: plural; case: locative
20009 | 13 | hodinami | noun | gender: feminine; number: plural; case: instrumental

Table 1: Tvaroslovník

² hour

3.2 Valency dictionary

The valency dictionary contains the two most common types of covalence between words. The first is the covalence between a verb and a preposition, or between a verb and the most common case of the following term. The second is the covalence between a noun and a preposition. To build the valency dictionary, we took the nouns and verbs from Tvaroslovník; the covalences with prepositions and cases were created automatically from the examples in Krátky slovník slovenského jazyka [18]. The dictionary contains the columns:

• idWord – unique identification number of the word from Tvaroslovník,
• preposition – the preposition that follows the noun or verb,
• case – the case of the word after the noun or verb.

Table 2 illustrates examples from the valency dictionary.

idWord | preposition | case
6016 | null | accusative
6016 | proti | dative
31494 | v | locative
31494 | null | accusative
31494 | null | instrumental
62420 | null | accusative

Table 2: Examples of covalences for nouns and verbs

4 Tree structure of sentence

We presented the main idea of the algorithm for finding the tree structure in [19]. For the current version of the algorithm, we expanded the table of relations and added special cases of Slovak sentences, which we describe in the subsection Special cases of sentences. Table 3 shows the new table of relations, and Algorithm 1 gives the pseudocode of the main idea of the tree-finding algorithm.

input: sentence
output: tree structure of sentence
find all forms for words in sentence from Tvaroslovník;
create list of possible relations;
while list of possible relations is not empty and sentence has more than one word do
    choose relation with greatest priority;
    add chosen relation to list of final relations;
    remove chosen relation from list of possible relations;
    foreach relation in list of possible relations do
        if relation has same dependent and different superior word as chosen relation then
            remove relation from list of possible relations;
        end
    end
    remove dependent word of chosen relation from sentence;
    if new possible relation is created then
        add new relation to list of possible relations;
    end
end
build tree structure from list of final relations;

Algorithm 1: Pseudocode of the algorithm for finding the tree structure

Dependent | Superior | Priority | Required grammatical categories
verb | auxiliary verb | 13 | none
noun, adjective, pronoun, numeral | auxiliary verb | 13 | none
verb | conjunction | 12 | none
noun | conjunction | 12 | none
adjective | conjunction | 12 | none
pronoun | conjunction | 12 | none
numeral | conjunction | 12 | none
adverb | conjunction | 12 | none
adverb | adverb | 11 | none
adverb | adjective | 11 | none
pronoun sa, si | verb | 10 | none
pronoun | adjective | 9 | none
adjective | noun | 8 | same gender, case and number
numeral | noun | 8 | same gender, case and number
pronoun | noun | 8 | same gender, case and number
noun | noun | 7 | case of dependent noun is accusative
noun | noun | 6 | case of dependent noun is genitive
adjective | preposition | 5 | same case
pronoun | preposition | 5 | same case
noun | preposition | 4 | same case
preposition | noun | 4 | noun and preposition are together in valency dictionary
pronoun | verb | 3 | case of pronoun is not in valency dictionary and pronoun is not in the nominative case
noun | verb | 3 | case of noun is not in valency dictionary and noun is not in the nominative case
adjective | verb | 3 | case of adjective is not in valency dictionary and adjective is not in the nominative case
numeral | verb | 3 | case of numeral is not in valency dictionary and numeral is not in the nominative case
pronoun | verb | 2 | case of pronoun is in valency dictionary and pronoun is not in the nominative case
noun | verb | 2 | case of noun is in valency dictionary and noun is not in the nominative case
adjective | verb | 2 | case of adjective is in valency dictionary and adjective is not in the nominative case
numeral | verb | 2 | case of numeral is in valency dictionary and numeral is not in the nominative case
adverb | verb | 2 | none
noun | verb | 1 | noun is in the nominative (first) case
adjective | verb | 1 | adjective is in the nominative (first) case
pronoun | verb | 1 | pronoun is in the nominative (first) case
numeral | verb | 1 | numeral is in the nominative (first) case

Table 3: Relations and their priorities

4.1 Special cases of sentences

Slovak is a flexible language with many peculiarities, which we took into account when creating the method.

• Multiple sentence member: While searching for the initial possible relations, we find out whether there is a conjunction or a comma in the sentence. If so, we check whether the words before and after the conjunction are the same part of speech and have the same grammatical categories. If this condition is fulfilled, we add relations between the conjunction and these words to the possible relations. The conjunction then takes over the grammatical categories of the words it connects. For example, in the sentence Noviny a časopisy píšu o celebritách.³ the words noviny and časopisy are the same sentence member; therefore, the list of possible relations contains the relation between noviny and a with priority 12 and the relation between časopisy and a with priority 12. The word a then participates as a noun in the nominative case.

• Multiple verbs in sentence: The occurrence of several verbs in a sentence is another special case. Before we start looking for possible relations in a sentence, we determine whether this is the case. After identifying the verbs, we check whether there is a conjunction or a comma between them. Finding a comma or conjunction classifies the sentence as a compound sentence. We therefore divide the sentence at the conjunction or comma into subsentences, which we process as separate sentences. In the resulting output, we connect these subsentences by relations between the conjunction or comma and the roots of the subsentences. Figure 2 shows an example of the sentence structure for the sentence Mama číta noviny a otec píše správu.⁴ In a sentence containing several verbs without a conjunction or comma between them, we assume that there is a compound verb relation. We therefore connect the verbs found with this relation and add it to the list of possible relations. Figure 3 shows an example of such a sentence structure for the sentence Ráno začalo pršať.⁵

• Same form of word: Some words have the same form in several cases, so it is sometimes difficult to determine which relation they can form. We find all possible relations for the word. In the method, where we gradually iterate over the list of possible relations and remove the relations with the same dependent word as the currently selected relation, we may locate a relation with the same dependent and superior word but a different priority. In that case, we create another list of final and possible relations, in which the relation with the different priority is used. The method then outputs two trees. Figure 4 illustrates the two possible outputs for the sentence Dievča upieklo mame perníkové srdce.⁶

• Different part of speech for same form: Besides having the same form in multiple cases, a word may also have the same form for multiple parts of speech. For example, the word to is both a pronoun and a particle. We created a list that contains the most commonly used part of speech for these words. If we set the method to find only the most relevant sentence structures, we use only the most commonly used part of speech for a form.

a
    číta
        Mama
        noviny
    píše
        otec
        správu

Figure 2: Example of sentence tree structure for Mama číta noviny a otec píše správu.⁴

začalo
    Ráno
    pršať

Figure 3: Example of sentence tree structure for Ráno začalo pršať.⁵

[Figure 4 shows two trees, A and B. Nodes by level in tree A: upieklo / Dievča, mame, srdce / perníkové. Nodes by level in tree B: upieklo / srdce, mame, dievča / perníkové.]

Figure 4: Example of two possible outputs for sentence Dievča upieklo mame perníkové srdce.⁶

³ Newspapers and magazines write about celebrities.
⁴ Mother is reading newspapers and father is writing a message.
⁵ It started to rain in the morning.
⁶ The girl baked a gingerbread heart for mum.

5 Conclusion and future research

To evaluate the algorithm for creating a tree structure, we built a dataset of 100 different Slovak sentences. The sentences were taken from fairy tales and articles on the Internet. The dataset contains:

• simple sentences: Martin zavrtel hlavou.⁷,
• simple sentences with different sentence members: Chlapec vykročil z tieňa tmavých jedlí na čistinku uprostred lesa.⁸,
• compound sentences: Teší sa z jeho krásy a užíva si pokojný relax.⁹,
• sentences with a multiple sentence member: Uprostred hlučného a ubehaného mestečka leží krásny zelený park.¹⁰,
• sentences with a compound predicate: V mestskej časti si môžu návštevníci užiť kúpalisko.¹¹.

We created this dataset manually. To each sentence, we added the required tree structure. For 85 sentences, the tree structure produced by the algorithm was identical to the required one. The main causes of incorrect structures were:

• A number written in digits in the sentence, for example, Hrad vznikol pravdepodobne v druhej polovici 13. storočia.¹²
• A changed position of words in a nominal predicate, for example, Vhodná je paralela z čias môjho starého otca.¹³

In our future work we want to focus on:

• eliminating the above problems,
• testing the method on other sentences,
• creating a web interface for this algorithm.

⁷ Martin shook his head.
⁸ The boy walked out of the shadows of dark firs to a clearing in the middle of the forest.
⁹ She enjoys its beauty and enjoys peaceful relaxation.
¹⁰ In the middle of a noisy and deserted town lies a beautiful green park.
¹¹ Visitors can enjoy the swimming pool in the city.
¹² The castle was probably built in the second half of the 13th century.
¹³ A parallel from my grandfather's time is appropriate.

References

[1] Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: State of the art, current trends and challenges. 2017. arXiv preprint arXiv:1708.05148.
[2] Zhang, M.: A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences (2020): 1–23.
[3] https://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/cz/html/index.html. (Accessed on 06/10/2021)
[4] Straka, M., Straková, J., Hajič, J.: Prague at EPE 2017: The UDPipe system. 2017. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation at the Fourth International Conference on Dependency Linguistics and the 15th International Conference on Parsing Technologies. Pisa, Italy (pp. 65–74).
[5] Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. 2018. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 197–207).
[6] https://nlp.fi.muni.cz/en/NLPCentre. (Accessed on 06/10/2021)
[7] http://utkl.ff.cuni.cz/en/utkl.html. (Accessed on 06/10/2021)
[8] Peciar, Š.: (Ed.) Slovník slovenského jazyka (Vol. 4). Vydavateľstvo SAV. 1964.
[9] Kraus, J.: Slovník cudzích slov: akademický. Slovenské pedagogické nakladateľstvo. 2005.
[10] Považaj, M. et al.: Pravidlá slovenského pravopisu. 4th unchanged ed. Bratislava: Veda. 2013. 592 pp. ISBN 978-80-224-1331-2.
[11] https://slovnik.juls.savba.sk/. (Accessed on 06/10/2021)
[12] Garabík, R.: Slovenský národný korpus. 2020. Accessed on https://korpus.sk/.
[13] https://korpus.sk/slovko.html. (Accessed on 06/10/2021)
[14] Zeman, D.: Slovak dependency treebank in Universal Dependencies. 2017.
Journal of Linguistics/Jazykovedný časopis, 68(2), 385–395.
[15] Krajči, S., Novotný, R.: Tvaroslovník – databáza tvarov slov slovenského jazyka. In Zborník príspevkov z pracovného seminára ITAT. 2012. (pp. 57–61).
[16] Krajči, S., Novotný, R.: Projekt Tvaroslovník – slovník všetkých tvarov všetkých slovenských slov. Znalosti 2012. 2012. pp. 109–112. Vydavatelství MFF UK.
[17] Hiľovská, J.: Syntaktická analýza slovenskej vety pomocou Tvaroslovníka. UPJŠ. 2017.
[18] Kačala, J.: (Ed.) Krátky slovník slovenského jazyka. Veda. 1987.
[19] Linková, M., Krajči, S.: Tree structure of Slovak sentences. 2020. In Proceedings of the 20th Conference Information Technologies – Applications and Theory. (pp. 67–74).
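
The greedy selection loop of Algorithm 1 can be illustrated with a short Python sketch for the sentence from Figure 1. It is not the authors' implementation and rests on assumptions that the paper does not state: candidate relations are generated only between neighbouring words, each word has a single hard-coded reading instead of a Tvaroslovník lookup, only four rules from Table 3 are included, and the valency dictionary check is omitted, so every non-nominative noun attached to a verb receives priority 3. The names Word, RULES, relations_between and build_tree are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    pos: str
    gender: str = ""
    number: str = ""
    case: str = ""


def agree(dep, sup):
    """Same gender, case and number (the priority-8 condition in Table 3)."""
    return (dep.gender, dep.number, dep.case) == (sup.gender, sup.number, sup.case)


# Excerpt of Table 3: (dependent POS, superior POS, priority, condition).
# Assumption: the valency dictionary is ignored in this sketch.
RULES = [
    ("adjective", "noun", 8, agree),
    ("noun", "verb", 3, lambda dep, sup: dep.case != "nominative"),
    ("adverb", "verb", 2, lambda dep, sup: True),
    ("noun", "verb", 1, lambda dep, sup: dep.case == "nominative"),
]


def relations_between(words, i, j):
    """Relations (dependent, superior, priority) allowed between words i and j."""
    found = []
    for dep, sup in ((i, j), (j, i)):
        for dep_pos, sup_pos, priority, condition in RULES:
            if (words[dep].pos == dep_pos and words[sup].pos == sup_pos
                    and condition(words[dep], words[sup])):
                found.append((dep, sup, priority))
    return found


def build_tree(words):
    """Return (root index, list of (dependent form, superior form) edges)."""
    remaining = list(range(len(words)))  # words still in the sentence
    possible = []
    # Assumption: initial candidates link neighbouring words only.
    for a, b in zip(remaining, remaining[1:]):
        possible.extend(relations_between(words, a, b))
    final = []
    while possible and len(remaining) > 1:
        chosen = max(possible, key=lambda rel: rel[2])  # greatest priority
        final.append(chosen)
        # Drop the chosen relation and all others with the same dependent.
        possible = [rel for rel in possible if rel[0] != chosen[0]]
        # Remove the dependent; its former neighbours become adjacent.
        position = remaining.index(chosen[0])
        remaining.pop(position)
        if 0 < position < len(remaining):
            possible.extend(
                relations_between(words, remaining[position - 1], remaining[position])
            )
    edges = [(words[dep].form, words[sup].form) for dep, sup, _ in final]
    return remaining[0], edges


SENTENCE = [
    Word("Hodina", "noun", "feminine", "singular", "nominative"),
    Word("dnes", "adverb"),
    Word("začala", "verb"),
    Word("malým", "adjective", "masculine", "singular", "instrumental"),
    Word("kvízom", "noun", "masculine", "singular", "instrumental"),
]

root, edges = build_tree(SENTENCE)
print(SENTENCE[root].form, edges)
```

For this sentence the sketch first attaches malým to kvízom (priority 8), then kvízom, dnes and Hodina to začala, which remains as the root, in agreement with Figure 1. The special cases from Section 4.1, such as producing two trees when the same relation is possible with different priorities, are deliberately left out.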